Friday, November 22, 2013

A case of Occam's razor

I want to write about a seemingly bizarre issue with a web page fetch that ultimately proved to be yet another validation of Occam's razor: the simplest explanation for a problem is generally the right one.

To give some background, I'm involved in running statistical calculations over a large number of web pages, and this has the side effect of highlighting pages that deviate from the norm. So I end up going through many web pages that stand out from the pack at first glance.

The fetcher I use talks HTTP directly and deals leniently with the web servers out there that don't always implement HTTP according to spec. On this particular occasion, one web site, http://hairtype.naturallycurly.com, responded to the fetcher with content that was nowhere close to what the browser retrieved.

Let me post here what the HTML looked like:


<html lang="en">
<head>
    
    
    <title>PHP Application - AWS Elastic Beanstalk</title>
    
    <link href="http://fonts.googleapis.com/css?family=Lobster+Two" rel="stylesheet" type="text/css"></link>
    <link href="https://awsmedia.s3.amazonaws.com/favicon.ico" rel="icon" type="image/ico"></link>
    <link href="https://awsmedia.s3.amazonaws.com/favicon.ico" rel="shortcut icon" type="image/ico"></link>
    <!--[if IE]><script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
    <link href="/styles.css" rel="stylesheet" type="text/css"></link>
</head>
<body>
    <section class="congratulations">
        <h1>
Congratulations!</h1>
Your AWS Elastic Beanstalk <em>PHP</em> application is now running on your own dedicated environment in the AWS&nbsp;Cloud<br />

        You are running PHP version 5.4.20<br />

    </section>

    <section class="instructions">
        <h2>
What's Next?</h2>
<ul>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/">AWS Elastic Beanstalk overview</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/create_deploy_PHP_eb.html">Deploying AWS Elastic Beanstalk Applications in PHP Using Eb and Git</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/create_deploy_PHP.rds.html">Using Amazon RDS with PHP</a>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html">Customizing the Software on EC2 Instances</a></li>
<li><a href="http://docs.amazonwebservices.com/elasticbeanstalk/latest/dg/customize-containers-resources.html">Customizing Environment Resources</a></li>
</li>
</ul>
<h2>
AWS SDK for PHP</h2>
<ul>
<li><a href="http://aws.amazon.com/sdkforphp">AWS SDK for PHP home</a></li>
<li><a href="http://aws.amazon.com/php">PHP developer center</a></li>
<li><a href="https://github.com/aws/aws-sdk-php">AWS SDK for PHP on GitHub</a></li>
</ul>
</section>

    <!--[if lt IE 9]><script src="http://css3-mediaqueries-js.googlecode.com/svn/trunk/css3-mediaqueries.js"></script><![endif]-->
</body>
</html>

This is nothing like the HTML the browser retrieves - you can try it yourself. The actual web page is about hair products.

In my experience, some web servers return different content depending on the HTTP headers and the originating IP. Sometimes the server has identified an IP as a bot and decided to return an error response or an outright wrong page.

So I tested the IP theory by running the fetcher from a different network, with a different outgoing IP. This time, the correct page was retrieved. Then I used curl from the same network that had given me the incorrect page, and to my surprise, curl also retrieved the correct page. In fact, curl got the correct page from both networks.

This was quite puzzling. I thought that perhaps the web server was doing some sophisticated fingerprinting: having identified the User-Agent and maybe other headers the fetcher was using, it had decided to send the fetcher a wrong page.

So using wireshark, I captured all the HTTP headers sent by the fetcher. Another team member then used curl, specifying these same headers.


curl -H 'User-Agent: rtw' -H 'Host: hairtype.naturallycurly.com' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-us,en;q=0.5'  -H 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7' -H 'Keep-Alive: 115' -H 'Connection: keep-alive' -H 'Accept-Encoding: gzip,deflate' http://hairtype.naturallycurly.com


I was positive that curl would then fail. But of course it still returned the correct page. So my theory of sophisticated fingerprinting was wrong - or maybe it was even more sophisticated than I thought. I was stumped.

And then I realized that I had missed looking at a very crucial piece of data in this whole operation: the IP the fetcher used to get the page. The first thing the fetcher does is resolve the IP, and since DNS queries can be expensive and we do lots of them, the IP is retrieved from a memcached instance when available. An IP may stay cached for a number of hours. From the fetcher logs, I could see the IP it was using:

DNS resolved from cache hairtype.naturallycurly.com -> /54.243.101.48
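
The "from cache" part is the key detail. The lookup goes roughly like the sketch below; this is an assumption about the fetcher's internals, and the real cache is memcached with a TTL rather than the in-process map used here:

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CachingResolver {
    // Stand-in for the memcached instance: hostname -> IP string.
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<String, String>();

    public String resolve(String host) throws UnknownHostException {
        String ip = cache.get(host);
        if (ip != null) {
            // A stale entry keeps being served until it expires or is deleted,
            // which is exactly what happened here.
            return ip;
        }
        ip = InetAddress.getByName(host).getHostAddress();
        cache.put(host, ip); // the real cache also sets a TTL of a few hours
        return ip;
    }
}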

But as dig showed, that was the incorrect IP :

>>$ dig hairtype.naturallycurly.com
; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5 <<>> hairtype.naturallycurly.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28108
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 4, ADDITIONAL: 4

;; QUESTION SECTION:
;hairtype.naturallycurly.com.    IN    A

;; ANSWER SECTION:
hairtype.naturallycurly.com. 300 IN    CNAME    secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com.
secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com. 60 IN    A 23.23.197.30
secure-nc-04-2015-1845606936.us-east-1.elb.amazonaws.com. 60 IN    A 54.225.215.76

;; AUTHORITY SECTION:
us-east-1.elb.amazonaws.com. 1703 IN    NS    ns-1119.awsdns-11.org.
us-east-1.elb.amazonaws.com. 1703 IN    NS    ns-1793.awsdns-32.co.uk.
us-east-1.elb.amazonaws.com. 1703 IN    NS    ns-235.awsdns-29.com.
us-east-1.elb.amazonaws.com. 1703 IN    NS    ns-934.awsdns-52.net.

;; ADDITIONAL SECTION:
ns-235.awsdns-29.com.    92612    IN    A    205.251.192.235
ns-934.awsdns-52.net.    92612    IN    A    205.251.195.166
ns-1119.awsdns-11.org.    92612    IN    A    205.251.196.95
ns-1793.awsdns-32.co.uk. 92510    IN    A    205.251.199.1

;; Query time: 11 msec
;; SERVER: 10.101.51.60#53(10.101.51.60)
;; WHEN: Fri Nov 22 12:40:20 2013
;; MSG SIZE  rcvd: 345


All that remained now was to validate this - far simpler - hypothesis. It was trivial to do: all I had to do was remove the domain->IP mapping from memcached.


>>$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
get hairtype.naturallycurly.com
VALUE hairtype.naturallycurly.com 4096 4
6?e0
END
delete hairtype.naturallycurly.com
DELETED
get hairtype.naturallycurly.com
END
quit
Connection closed by foreign host.

This time, the fetcher logs showed that indeed, it was picking the correct IP. And of course it fetched the correct page with all the hair product details.


DNS resolved hairtype.naturallycurly.com -> /23.23.197.30


So once again, I was reminded of Occam's razor and how important it is to

1. Remember all the assumptions we make about how a software system works.
2. Validate those assumptions, starting with the simplest first.

 Happy debugging the Net!

Wednesday, November 06, 2013

pretty print compressed JSON file

The highly useful json.tool stops short of parsing compressed JSON files. In particular, if you use json-smart*.jar to produce your JSON files, you are out of luck with json.tool. But you can use jsonlint like this to get a readable view of your file:


[~] echo '{json:"obj"}' | python -mjson.tool
Expecting property name: line 1 column 2 (char 1)
[~] echo '{json:"obj"}' | jsonlint -p 2> /dev/null
{
  json: "obj"
}
[~] 

Can't find SymPy with iPython / iPython Notebook on Mac 10.8.2 (Mountain Lion)

SymPy allows you to see mathematical formulas in their handwritten representation. It is just not easy to install on the Mac, at least on Mountain Lion. The problem has to do with the package location for SymPy: Python finds the package, but iPython (and of course the notebook) does not. So after installing SymPy, trying to import the module in iPython will fail like this:

ImportError: No module named sympy

However, inside the python prompt, I could import SymPy. Checking the package path showed where python was finding it:
>>> import sympy
>>> print sympy.__file__
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sympy/__init__.pyc

When I checked the module search path inside iPython, the list was shorter and the /opt/local path was not there :
In [7]: IPython.utils.module_paths.find_module("sympy")

In [8]:

In [10]: sys.path
Out[10]: 
['',
 '/usr/local/bin',
 '/Library/Python/2.7/site-packages/ipython-1.1.0-py2.7.egg',
 '/Library/Python/2.7/site-packages/readline-6.2.4.1-py2.7-macosx-10.7-intel.egg',
 '/Library/Python/2.7/site-packages/pyzmq-14.0.0-py2.7-macosx-10.6-intel.egg',
 '/Library/Python/2.7/site-packages/Jinja2-2.7.1-py2.7.egg',
 '/Library/Python/2.7/site-packages/MarkupSafe-0.18-py2.7-macosx-10.8-intel.egg',
 '/Library/Python/2.7/site-packages/tornado-3.1.1-py2.7.egg',
 '/Library/Python/2.7/site-packages/matplotlib-1.3.1-py2.7-macosx-10.8-intel.egg',
 '/Library/Python/2.7/site-packages/nose-1.3.0-py2.7.egg',
 '/Library/Python/2.7/site-packages/pyparsing-2.0.1-py2.7.egg',
 '/Library/Python/2.7/site-packages/scikit_learn-0.14.1-py2.7-macosx-10.8-intel.egg',
 '/Library/Python/2.7/site-packages/scipy-0.13.0-py2.7-macosx-10.8-intel.egg',
 '/Library/Python/2.7/site-packages/demjson-1.6-py2.7.egg',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload',
 '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC',
 '/Library/Python/2.7/site-packages',
 '/Library/Python/2.7/site-packages/ipython-1.1.0-py2.7.egg/IPython/extensions']

In [11]:

So I copied the package over to /Library/Python/2.7/site-packages/sympy so that iPython could find it:

sudo cp -R /opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sympy /Library/Python/2.7/site-packages/

In [13]: IPython.utils.module_paths.find_module("sympy")
Out[13]: '/Library/Python/2.7/site-packages/sympy'

In [14]: from IPython.display import display

In [15]: from sympy.interactive import printing

In [16]: printing.init_printing()

In [18]: from __future__ import division

In [19]: import sympy as sym

In [20]: from sympy import *

In [22]: x, y, z = symbols("x y z")

In [23]: Rational(3,2)*pi + exp(I*x) / (x**2 + y)
Out[23]: 
        ⅈ⋅x 
3⋅π    ℯ    
─── + ──────
 2     2    
      x  + y


Thursday, October 17, 2013

HTTP Request : Host header must be lower cased

In theory, the case of the values in most HTTP request headers is insignificant. But there are servers that look specifically for a lower-case Host header, and they will return incorrect results if the case is different.

If you're building your own HTTP requests and want to get back the data a typical browser would get, it is a good idea to lower-case the Host header before sending it to the server.

A case in point is http://www.BestBuys.com

Try to fetch the page with curl, like this : curl http://www.BestBuys.com

You do get a page, but look at it closely: it is an error page.


    <h1>PPI Exception (PDOException)</h1>
    <div><strong>File:</strong> /data/www_bestbuys_com/releases/20131017195710/PPI/Vendor/Doctrine/Doctrine/DBAL/Driver/PDOConnection.php</div>
    <div><strong>Line:</strong> 36</div>
    <div><strong>Message:</strong> SQLSTATE[HY000] [2002] No such file or directory</div>

Do this with wireshark running and observe the Host header:

Host: www.BestBuys.com\r\n

Now use curl, but specify a lowercased Host header :

curl -H "Host: www.bestbuys.com" http://www.BestBuys.com

Then you get the correct page. Browsers, understanding the imperfect server implementations out there, always lower-case the Host. You can check this yourself: fetch the page with any modern browser and look at the Host header in wireshark.
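
If you are building the request yourself, the fix is a one-liner when you compose the Host header. Here is a minimal sketch (the class and method names are just for illustration):

import java.net.URL;
import java.util.Locale;

public class HostHeader {
    // Build the Host header value from a URL, lower-cased to match browser behaviour.
    static String hostHeader(String urlString) throws Exception {
        URL url = new URL(urlString);
        String host = url.getHost().toLowerCase(Locale.ROOT);
        int port = url.getPort();
        // Only include the port when it is non-default.
        return (port == -1 || port == url.getDefaultPort()) ? host : host + ":" + port;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Host: " + hostHeader("http://www.BestBuys.com"));
        // prints: Host: www.bestbuys.com
    }
}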

Here is another example of where the practical approach is not used in a very common library used to fetch web pages.

Wednesday, June 19, 2013

Java : remove specified characters from a string (and quickly!)

Recently, I had to remove extraneous carriage returns and line feeds from URLs on a code path with real-time performance concerns. There was already some code that did this job, albeit inefficiently:

    public static String remove(final String string,
                                final char remove)
    {
        if (string == null) throw new IllegalArgumentException("string is null");

        int index = 0;
        final StringBuilder stringBuilder = new StringBuilder(string);
        // indexOf(CharSequence, char, int) is a local helper (not shown) that returns
        // the next index of 'remove' at or after 'index', or -1 if there is none.
        while ((index = indexOf(stringBuilder, remove, index)) != -1) {
            stringBuilder.deleteCharAt(index);
        }
        return stringBuilder.toString();
    }

This code is inefficient because it forces StringBuilder.deleteCharAt to shift all the characters in the buffer left each time a character that needs to be removed is found. Here is how StringBuilder.deleteCharAt (via AbstractStringBuilder) is implemented:
    public AbstractStringBuilder deleteCharAt(int index) {
        if ((index < 0) || (index >= count))
            throw new StringIndexOutOfBoundsException(index);
        System.arraycopy(value, index+1, value, index, count-index-1);
        count--;
        return this;
    }

So I searched around for a more efficient implementation of this basic function. There was a StringUtils (apache commons lang) function that looked hopeful. However, it was not really meant to remove characters from a string, but to replace them: StringUtils.replaceChars(String str, String searchChars, String replaceChars). Here is how it looks:
    public static String replaceChars(String str, String searchChars, String replaceChars) {
        if (isEmpty(str) || isEmpty(searchChars)) {
            return str;
        }
        if (replaceChars == null) {
            replaceChars = EMPTY;
        }
        boolean modified = false;
        int replaceCharsLength = replaceChars.length();
        int strLength = str.length();
        StrBuilder buf = new StrBuilder(strLength);
        for (int i = 0; i < strLength; i++) {
            char ch = str.charAt(i);
            int index = searchChars.indexOf(ch);
            if (index >= 0) {
                modified = true;
                if (index < replaceCharsLength) {
                    buf.append(replaceChars.charAt(index));
                }
            } else {
                buf.append(ch);
            }
        }
        if (modified) {
            return buf.toString();
        }
        return str;
    }

The code does not optimize for the case where the resulting string would be shorter than the original, so it is likely to be slow. Next, I found a Guava function in com.google.common.base.CharMatcher: String removeFrom(CharSequence sequence). It uses quite an interesting algorithm:
  public String removeFrom(CharSequence sequence) {
    String string = sequence.toString();
    int pos = indexIn(string);
    if (pos == -1) {
      return string;
    }

    char[] chars = string.toCharArray();
    int spread = 1;

    // This unusual loop comes from extensive benchmarking                                                                                
    OUT: while (true) {
      pos++;
      while (true) {
        if (pos == chars.length) {
          break OUT;
        }
        if (matches(chars[pos])) {
          break;
        }
        chars[pos - spread] = chars[pos];
        pos++;
      }
      spread++;
    }
    return new String(chars, 0, pos - spread);
  }

It keeps track of the distance between the source and destination positions in the "spread" variable. However, this means that for each left shift it needs to do a subtraction.

Since these all seemed less than optimal, I decided to code my own. Here is my first attempt, which keeps track of the first destination index:
    public static String remove2(final String string, final char... chars) {
        char[] arr = string.toCharArray();
        int dst = -1;
        for (int src=0; src<arr.length; src++) {
            boolean rm = false;
            for (char c : chars) {
                if (c == arr[src]) {
                    rm = true;
                    if (dst == -1)
                        dst = src; //set first dst pos 
                }
            }
            if (!rm && dst != -1) {
                arr[dst++] = arr[src];
            }
        }
        return dst == -1 ? string : new String(arr, 0, dst);
    }


On a hunch that the JVM should be smart enough not to copy a value from an address to itself, I then decided to remove the check for the first destination index. Here is that version:
   public static String remove3(final String string, final char... chars) {
        char[] arr = string.toCharArray();
        int dst=0;
        for (int src=0; src<arr.length; src++) {
            boolean rm = false;
            for (char c : chars) {
                if (c == arr[src]) {
                    rm = true;
                }
            }
            if (!rm) {
                arr[dst++] = arr[src];
            }
        }
        return dst == string.length() ? string : new String(arr, 0, dst);
    }


Next, I benchmarked the different functions with a simple test, using just one input. Depending on the input, your results may vary, so it is best to work with inputs you are likely to see in your own scenario.
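
The harness was nothing more than a timed loop along the following lines. This is a sketch: the input string and iteration count here are made up and are not the ones behind the numbers below.

public class RemoveBench {
    public static void main(String[] args) {
        // Hypothetical input; use strings representative of your own workload.
        String input = "http://example.com/some\rpath\nwith\rstray\nline\rbreaks";
        char[] remove = {'\r', '\n'};

        long start = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            remove3(input, remove); // swap in remove(), remove2(), the Guava or commons-lang call
        }
        System.out.printf("remove3 : %.3fs%n", (System.nanoTime() - start) / 1e9);
    }

    // remove3 from above, repeated here so the sketch compiles on its own
    static String remove3(final String string, final char... chars) {
        char[] arr = string.toCharArray();
        int dst = 0;
        for (int src = 0; src < arr.length; src++) {
            boolean rm = false;
            for (char c : chars) {
                if (c == arr[src]) rm = true;
            }
            if (!rm) arr[dst++] = arr[src];
        }
        return dst == string.length() ? string : new String(arr, 0, dst);
    }
}

The numbers from my run: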

remove : 20.092s
Guava : 25.164s
apache commons langs : 1m33.573s
remove2 : 19.272s
remove3 : 12.109s

So, for this simple test, Guava was worse than our hand-optimized remove3(). The apache commons lang function took a big hit as it does not optimize for removal. I'm curious why the Googlers went with their approach in Guava. It may be that not all JVMs are smart enough to skip a dumb copy from an address to itself, and Guava makes sure such code never reaches the JVM: since the "spread" variable is always greater than 0, this assignment can never copy from an address to itself:

chars[pos - spread] = chars[pos];


Wednesday, June 05, 2013

Java: splitting on a single character

If you want to split a Java String on a single character on a compute-intensive path in your code, you might want to steer clear of String.split. The JDK function uses a regular expression for splitting, and before JDK 1.7, String.split had no optimization for single characters.

An optimization was introduced in JDK 1.7, but if your split character happens to have special meaning in a regular expression (e.g. ^ or |), the optimization will not apply.

I used org.apache.commons.lang.StringUtils.split to gain a roughly 3X advantage over the split call used in our servers.

Here is the performance test:

import org.apache.commons.lang.StringUtils;

public class TSplit {

    public static void main(String[] args) {
        if (args.length==0) {
            System.err.println("TSplit jdk|nojdk");
            System.exit(-1);
        }
        String var = "here|is|a|string|that|must|be|split";
        if (args[0].compareTo("jdk")==0) {
            for (int i=0;i<10000000;i++) {
                String[] splits = var.split("\\|");
            }
        } else {
            for (int i=0;i<10000000;i++) {
                String[] splits = StringUtils.split(var, '|');
            }
        }
    }
    
}
The results from the test :
[~/] time java -cp `echo /path/to/jars/*.jar|tr ' ' :` TSplit jdk

real 0m16.027s
user 0m16.245s
sys 0m0.412s
[~/] time java -cp `echo /path/to/jars/*.jar|tr ' ' :` TSplit nojdk

real 0m5.354s
user 0m5.395s
sys 0m0.304s
[~/] 
As this post shows, users who encountered these problems pre-1.7 sometimes hacked their code to pre-compile the single split character into a regular expression. Unfortunately, this means that if and when they upgrade to 1.7, the optimization that Sun added will have no effect.
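
The pre-compile hack looks roughly like this (a sketch; it avoids recompiling the regex on every call, but because it bypasses String.split entirely, the single-character fast path added in 1.7 never gets a chance to run):

import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Compiled once and reused, saving the per-call compilation cost of String.split.
    private static final Pattern PIPE = Pattern.compile("\\|");

    public static void main(String[] args) {
        String var = "here|is|a|string|that|must|be|split";
        String[] splits = PIPE.split(var);
        System.out.println(splits.length); // 8
    }
}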

Wednesday, April 03, 2013

Simulate a mysql connection error



In testing, we often have to simulate errors. How can we simulate a mysql connection error on a typical corporate network? Most of the time, the servers themselves are somewhere we have no physical access to and, what is more of a problem, are shared by others. So we can't shut down the server or unplug the cable.

There is a way to simulate a mysql connection error using the swiss army knife of the hacker: netcat. netcat can forward traffic arriving on one port to another port of our choosing - it listens on the first port and forwards everything it receives to the second.

Here is an example :

ncat -l 10.2.2.47 8080 --sh-exec "ncat 10.2.2.47 3306" 

This command makes netcat listen on port 8080 (you don't have to use 8080; any unused port on the machine will do) and forward all traffic arriving on it to port 3306, which is where mysql listens. On the machine where I ran this, the idea was to route all traffic hitting port 8080 to mysql on the same box. Now all you have to do is change the connection string of the application being tested to talk to port 8080 instead of the default mysql port, 3306. The port forwarding makes the mysql calls work seamlessly. When you want to simulate a mysql connection error, simply kill the ncat that is listening on port 8080. When you want to bring mysql back up, issue the ncat command again.
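
On the application side, the only change is the port in the connection string. Here is a minimal JDBC sketch (the schema name "testdb" is made up; the host, user, and password are the same placeholders as in the mysql command below):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ForwardedMysql {
    public static void main(String[] args) {
        // Point the connection at the ncat listener instead of mysql's own port 3306.
        String url = "jdbc:mysql://10.2.2.47:8080/testdb";
        try (Connection conn = DriverManager.getConnection(url, "user", "pass")) {
            System.out.println("connected through the forwarded port");
        } catch (SQLException e) {
            // With the ncat listener killed, this is where the simulated
            // connection error shows up.
            System.err.println("connection failed: " + e.getMessage());
        }
    }
}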

Here is how you'd use the mysql client to go via 8080:

mysql -uuser -ppass -h 10.2.2.47 --port 8080


(Make sure the host is specified as the IP. Specifying localhost will ignore the port and start using a socket file for communication.)

Voila - now you have a fairly painless way of simulating mysql errors. Of course, the same technique can be used to simulate connection errors for any server that listens on a known port, which covers essentially all socket-based applications.

Sunday, March 31, 2013

Hadoop : Output of Combiner can be processed again by the Combiner

A combiner can be used in a map/reduce program to aggregate values locally at the mapper before anything is sent to the reducer. When a combiner is available, the output of the map() function is fed to the combine() function first, and the general understanding is that the output of combine() is then sent over to the reduce() function on a reducer machine.

Except that this is not strictly correct. The output of combine() can be fed back into combine() for repeated aggregation. In general this does not cause a problem, but one can write incorrect code if one is not aware of this detail. A simple counting program is a good example.

This program counts the terms present in the given documents. In the reducer, the sum is incremented by 1 for each element in [c1, c2, ...] because we know that all counts in this list are "1"s - the mapper always emits "1"s.

class Mapper
   method Map(docid id, doc d)
      for all term t in doc d do
         Emit(term t, count 1)

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + 1
      Emit(term t, count sum)


Now let's add a combiner to aggregate the terms locally.

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + 1
      Emit(term t, count sum)


It is identical to the reducer: the sum is incremented by 1 for each "1" in the list. But now we must change the reducer to stop assuming "1"s, since the combiner will have done an aggregation first. So we change the reducer to this:

class Reducer
   method Reduce(term t, counts [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + c
      Emit(term t, count sum)

The thinking is that the values fed into the combiner could never be anything other than "1"s. But this is an incorrect assumption: as we said, the output of a combiner - which contains aggregated values larger than "1" - can itself be fed back into the combine() function. Thus the correct combine() is as follows:

class Combiner
   method Combine(term t, [c1, c2,...])
      sum = 0
      for all count c in [c1, c2,...] do
          sum = sum + c
      Emit(term t, count sum)


In fact, this is identical to the reduce() function.
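
In Hadoop's Java API, the same class can simply be registered as both the combiner and the reducer. Here is a minimal sketch using the org.apache.hadoop.mapreduce API (the class name is just illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums whatever counts it is given, so it is safe to run as a combiner,
// as a combiner over already-combined output, and as the final reducer.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get(); // add the value, never assume it is 1
        }
        context.write(term, new IntWritable(sum));
    }
}

// In the driver: job.setCombinerClass(SumReducer.class);
//                job.setReducerClass(SumReducer.class);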

We can peek under the hood in the Hadoop source code to see where this two-step combining happens. Inside org.apache.hadoop.mapred.MapTask, the flush() method has two function calls of relevance:
sortAndSpill()
mergeParts()

sortAndSpill() is actually called early on by a spill thread as well. The flush() makes sure that any remaining data is properly spilled. flush() then interrupts the spill thread, waits for it to end, and then calls mergeParts().


sortAndSpill() is the section of code that runs as the mapper is writing the intermediate values into spill files. 


Inside sortAndSpill() :

            if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }


This is where the combine() function is called first. But when the data volume is high, Hadoop cannot wait for all the output from a mapper before spilling data to disk: when a threshold is reached, the buffers are spilled into a spill file. So it is quite possible that one key ends up in two spill files. When this happens, Hadoop can do yet another aggregation by running the data in the spill files through the combiner, and that is partly what the mergeParts() function does:


         if (combinerRunner == null || numSpills < minSpillsForCombine) {
            Merger.writeFile(kvIter, writer, reporter, job);
          } else {
            combineCollector.setWriter(writer);
            combinerRunner.combine(kvIter, combineCollector);
          }


This is how combine() can be called twice, or more, for the same key. So even if the mapper always emits a "1", the combiner can receive values much larger than "1". It is good practice never to make assumptions about the values coming into the combiner based on what map() emits.


Thursday, March 21, 2013

Simulate Out Of Disk condition for a test (Linux)

We recently had a problem in production where a service was filling up the disk, happily swallowing the IOException that was thrown, and continuing on its merry and error-prone way. Needless to say, it emptied its input (a high-volume redis queue) and produced garbage. We fixed this, and then, before deployment, needed to create an out-of-disk scenario for testing. My first, clumsy attempt was to fill the disk with the dd command, but a few hours later I felt very sheepish, as it takes quite a while to fill up a large disk (more than half a terabyte). The easier way is to create a small loopback filesystem of a fixed size. Here are the steps I used:
mkdir /filesystems
dd if=/dev/zero of=/filesystems/tmp_fs seek=512 count=512 bs=1M
mkfs.ext4 /filesystems/tmp_fs
mkdir /mnt/small
mount -o loop /filesystems/tmp_fs /mnt/small
This creates a filesystem of about 1G that gets mounted at /mnt/small. In our scenario, we then created the Lucene index inside this directory for the test.

Tuesday, February 26, 2013

Escaping strings in bash

A good way to escape strings in bash.

Using it, I generated a script to push a number of URLs to a redis queue:

redis-cli LPUSH our_queue http\:\/\/zuvypyzulogu\.wordpress\.com\/2013\/02\/20\/bist\-du\-bei\-mir\-music\-download 
redis-cli LPUSH our_queue http\:\/\/zuvypyzulogu\.wordpress\.com\/2013\/02\/20\/top\-10\-free\-music\-download\-sites 
redis-cli LPUSH our_queue http\:\/\/zwingliusredivivus\.wordpress\.com\/2012\/02\/09\/where\-marc\-sees\-cause\-to\-lament\-i\-see\-reason\-to\-rejoice 
redis-cli LPUSH our_queue http\:\/\/zwingliusredivivus\.wordpress\.com\/2013\/02\/26\/a\-word\-to\-the\-emergents 
redis-cli LPUSH our_queue http\:\/\/zwischen\-uns\.forumactif\.com\/post 
redis-cli LPUSH our_queue http\:\/\/zx6r\.com\/zx6r\/19061\-normal\-tempature\-07\-zx6r\.html 
redis-cli LPUSH our_queue http\:\/\/zx6r\.com\/zx6r\/23572\-09\-zx6\-hid\-kit\.html
redis-cli LPUSH our_queue http\:\/\/zx6r\.com\/zx6r\/9351\-06\-636\-build\-start\.html
redis-cli LPUSH our_queue http\:\/\/zyngadeutschland\.wordpress\.com\/2013\/02\/26\/farmville\-2\-neue\-limitierte\-auflage\-kelten 
redis-cli LPUSH our_queue http\:\/\/zzmtokg\.wordpress\.com\/2011\/07\/09\/best\-price\-stok\-fyr\-torch\-for\-less  

Monday, February 18, 2013

Perl : Use Text::CSV instead of split for parsing CSV lines

When you have lines with quoted fields delimited by commas, it is tempting to use perl's native split to parse them:

 perl -ne '@x=split(",");' file.csv  
But this will not parse this type of line :


"2013-02-15","478944","http://cdn.springboard.gozammer.com/mediaplayer/springboard/mediaplayer.swf?config={""externalConfiguration"":""http://www.springeagle.com/superconfig/sgv014.js"",""playlist"":""http://cms.springeagle.com/xml_feeds_advanced/index/683/rss3/668219/0/0/5/""}","1","0",""

Everything starting from "http://" and including the closing brace is a single field.

But within this field, we find the "," separator. This means that when our perl split does its work, we get the following as the 2nd field in the line:


http://cdn.springboard.gozammer.com/mediaplayer/springboard/mediaplayer.swf?config={""externalConfiguration"":""http://www.springeagle.com/superconfig/sgv014.js"

However, the Text::CSV perl module understands CSV quoting. The whole field is enclosed in double quotes, and the embedded quotes are escaped by doubling them, so Text::CSV treats the commas inside the quoted field as data rather than as separators.

Here is an example:

cat /tmp/x | perl -ne 'BEGIN { use Text::CSV; $csv=Text::CSV->new();} chomp; $csv->parse($_); @x=$csv->fields(); for $x (@x) {print "$x ## "}; print "\n";'
2013-02-15 ## 478944 ##  http://cdn.springboard.gozammer.com/mediaplayer/springboard/mediaplayer.swf?config={"externalConfiguration":"http://www.springeagle.com/superconfig/sgv014.js","playlist":"http://cms.springeagle.com/xml_feeds_advanced/index/683/rss3/668219/0/0/5/"} ## 1 ## 0 ##  ## 


Wednesday, February 06, 2013

Tomcat : The Init loop

Tomcat can be configured (via load-on-startup) so that servlets initialize at server startup and are ready and primed for subsequent GET requests. But if the init() method fails for whatever reason, Tomcat will call init() again upon a subsequent GET request.

It is not clear why the Tomcat developers chose to call init() again on a GET request. It may be that they wanted to coax the servlet into initializing on a second attempt even if it failed at startup, reasoning that the first failure was due to a race condition. Another reason I've seen posted is that this way Tomcat can return the full error, with the stack trace from the init() call, to the browser; the idea is that this makes the init() error easier for a developer to fix because it is immediately visible in the browser (versus having to dig it out of catalina.out).

Neither of these explanations can be accepted without some misgivings. First, if the design goal was to keep a race condition from affecting the serving of requests, the init() call could simply be retried a configured number of times at startup. Second, Tomcat could store any error from init() and return it on the next GET request without making another call to init().

Our complaint about this design is not merely pedantic. If the servlet's init() routine really does have a race condition, this design, rather than resolving the race, can under certain conditions send Tomcat into a never-ending series of calls to init().

In fact, this very thing happened recently on our production servers.

First, an extremely rare race hit a portion of the servlet's init() code, causing init() to fail. Tomcat then dutifully called init() again, but this time a component that had already been created before the previous failure threw an exception, because it was a singleton and could not be created twice. Since init() now fails again, for a different reason, the next GET makes Tomcat call init() again, and again it fails. This throws the server into an endless init() loop.

The problem is that the part of the code where we first encountered the race now never gets reached. The server is up without being properly initialized, and init() can never succeed because the singleton will always throw. So none of the GET requests are served.
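
Stripped down, the pattern that produced the loop looks something like this (the class and method names are hypothetical):

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

public class LoopingServlet extends HttpServlet {

    // Hypothetical singleton that refuses to be created twice.
    static final class CacheManager {
        private static CacheManager instance;
        static synchronized CacheManager createInstance() {
            if (instance != null) {
                throw new IllegalStateException("CacheManager already created");
            }
            instance = new CacheManager();
            return instance;
        }
    }

    @Override
    public void init() throws ServletException {
        // Succeeds on the first call and throws on every later one -
        // which is what turned a single failure into an endless init() loop.
        CacheManager.createInstance();

        // The code with the original, rare race lived after this point;
        // once the first init() failed, it was never reached again.
        loadListFromDatabase();
    }

    private void loadListFromDatabase() throws ServletException {
        // ... the step that failed once due to the race ...
    }
}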

We can think of other unintended consequences of this design. What if the servlet accumulates some sort of in-memory list from the database at startup? Calling init() more than once may grow the list and lead to bugs.

And what if the first error corrupted some data structure, or even some data on permanent storage? Even if the second init() succeeded, the server might already be corrupt.



JDO : Beware when closing PreparedStatement Objects

Once you are done with a PreparedStatement, it is good practice to close it. But pain awaits you at unexpected moments if you close a statement one too many times.

In typical code using JDO, it is not uncommon to find ResultSet and PreparedStatement objects that never get closed. Things work for a while, until memory issues force developers to clean them up. This is what happened on our production systems recently. Unfortunately, we were somewhat overzealous and at one point closed a statement twice. It ran in Q/A without a problem, so, unnoticed by the gatekeepers, it slipped into production.

And there it crashed and burned, not in the close call as one would imagine, but on a later use of a statement, with the following stack trace:
Processor.processFile: problem 
processing data from filename: /path/to/something.csv on:
com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after statement closed.
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:409)
        at com.mysql.jdbc.Util.getInstance(Util.java:384)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1015)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:989)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:984)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:929)
        at com.mysql.jdbc.StatementImpl.checkClosed(StatementImpl.java:406)
        at com.mysql.jdbc.ServerPreparedStatement.checkClosed(ServerPreparedStatement.java:546)
        at com.mysql.jdbc.ServerPreparedStatement.setLong(ServerPreparedStatement.java:2037)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at com.solarmetric.jdbc.PoolConnection$PoolPreparedStatement.setLong(PoolConnection.java:448)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at com.solarmetric.jdbc.LoggingConnectionDecorator$LoggingConnection$LoggingPreparedStatement.setLong(LoggingCo
nnectionDecorator.java:1265)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at com.solarmetric.jdbc.DelegatingPreparedStatement.setLong(DelegatingPreparedStatement.java:397)
        at ourcode.getData(Manager.java:5435)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at ourcode.JDOProxy.invoke(JDOProxy.java:198)
        at $Proxy5.getData(Unknown Source)
        at ourcode.Processor.writeData(Processor.java:3185)
 
This is due to a bug in the mysql connector: if a statement is closed twice and we then create another PreparedStatement with the same sql string, the connector hands back the closed PreparedStatement object from an internal cache. Any attempt to use that statement then results in an exception. The details are in the mysql bug report.

Here is a bit of code showing the double close. If this function is called twice, it will throw the exception on the second call :

    private static void badsql(PersistenceManager pm) {
        Connection conn=null;
        ResultSet results =null;
        PreparedStatement stmt=null;

        try {
            conn = QUtil.getConn(pm);

            String[] urls = new String[] {"blingbling.com", "singsing.com", "soso.com"};

            stmt = conn.prepareStatement("select numhits from facttable where name = ?");

            for (String url : urls) {
                stmt.setString(1, url);
                results = stmt.executeQuery();
                if (results.next()) {
                    int numTerms = results.getInt(1);
                    System.out.println(url + "=>" + numTerms);
                }
                results.close();
            }
            //first close
            stmt.close();
            conn.close();

        } catch (Exception e) {

        } finally {
            CloseUtil.closeAndIgnore(results);
            //Here is the second close
            CloseUtil.closeAndIgnore(stmt);
            CloseUtil.closeAndIgnore(conn);
        }
    }


The problematic function was called once per large batch of records. Since there were not enough records in the Q/A environment, it was called just once, and so we never hit the bug in Q/A.
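
The straightforward fix is to close each resource exactly once, in the finally block only. Here is a sketch of the corrected shape, reusing the same (in-house) QUtil and CloseUtil helpers as the code above:

    private static void bettersql(PersistenceManager pm) {
        Connection conn = null;
        ResultSet results = null;
        PreparedStatement stmt = null;
        try {
            conn = QUtil.getConn(pm);
            stmt = conn.prepareStatement("select numhits from facttable where name = ?");
            stmt.setString(1, "blingbling.com");
            results = stmt.executeQuery();
            if (results.next()) {
                System.out.println("blingbling.com=>" + results.getInt(1));
            }
        } catch (Exception e) {
            // log instead of swallowing the error
            e.printStackTrace();
        } finally {
            // Each resource is closed here and only here, so there is no double close.
            CloseUtil.closeAndIgnore(results);
            CloseUtil.closeAndIgnore(stmt);
            CloseUtil.closeAndIgnore(conn);
        }
    }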