Category Archives: Programming

Slowing Down Java’s hashCode() for Strings

Java’s hashCode method for Strings has something very slightly concerning about it. Let’s look at the code for the hashCode() method for Strings:

public int hashCode() {
        int h = hash;
        if (h == 0) { // Let's call this Path A
            int off = offset; 
            char val[] = value;
            int len = count;
			
            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        // If the above code doesn't execute, we will call this Path B
        return h;
    }

There is something that bothers me about the way hashes are initially computed. Let’s say you have a large String of letters, and you want to hash it. You get a certain integer back after it is computed (Path A is taken), and all is fine and dandy. The next time you ask for the hash code, the value returned is the previously computed hash code, so we do not have to go through the computations a second time, as the hash is most likely going to not be zero (Hence path B is taken the second time, and the third, etc…). What happens if the number computed by the hash is zero? This is something that rarely happens, but it is possible for many strings to have a hashing function that evaluates to zero. In fact, it is even possible to take any arbitrary string, add seven characters (usually letters and numbers or symbols) to the end, and make the hashCode method evaluate to zero. Let’s take the string:

Bobby Likes Apples

The hashCode for the above is -1446827919. Let’s see if we can make it’s hashing function evaluate to zero. Consider the string:

Bobby Likes ApplesNAS3d1Q

This above string’s hashing function actually evaluates to zero. Try it out on your computer if you want, just print “Bobby Likes ApplesNAS3d1Q”.hashCode(). That means for any time the hashing function is called on the string “Bobby Likes ApplesNAS3d1Q”, the hash must be recomputed instead of returning the one already saved. This may not seem like a big deal, but the hashing function is supposed to be very quick. For large strings that evaluate to zero, this can actually make a gigantic difference in performance. As a test, I took the Wikipedia article on Hash Functions, and saved the page source to my computer. Then, I read it in as a string, and computed it’s hash function 10,000 times, and took the average time that it took to compute the hashcode. Here is my methodology below:

       File file = new File("HashFunction.htm");
        String input = "";
        BufferedReader reader = null;
        try
        {
            reader = new BufferedReader(new FileReader(file));
        }
        catch (FileNotFoundException e)
        {
            e.printStackTrace();
        }
                
        StringBuilder builder = new StringBuilder();
        try
        {
            while ((s = reader.readLine()) != null)
            {
                builder.append(s+"\n");
            }
        }
        catch (IOException e)
        {
            // You done messed up
            e.printStackTrace();
        }
        String article = builder.toString();

        // Find the average time this takes to hash
        long[] times = new long[10000];
        long beginTime = 0;
        
        for (int i = 0; i < times.length;i++)
        {
            beginTime = System.nanoTime();
            article.hashCode();
            times[i]=System.nanoTime()-beginTime;
        }
        
        long sumOfTimes = 0;
        for (int i = 0; i < times.length;i++)
        {
            sumOfTimes+=times[i];
        }
        
        float averageTime= (float)sumOfTimes/times.length;
        
        System.out.println(averageTime + "ns");
        

On average, it took 567.7788ns to retrieve the hash for the (rather large) file loaded in using the BufferedReader. This is very quick Now, consider the way hashcode’s are computed for strings. Ultimately, the hashcode is computed as a large sum of overflowing integers as seen above when Path A is taken. We can get very precise control over the string’s hash value by manipulating the last seven characters of the string. Let’s see how the String, “Bobby Likes ApplesNAS3d1Q” was computed exactly anyway.

    // There may be more efficient ways of doing this, but 
    // this is pretty quick anyway.
    private int precomputedHashCode = 0;
   
    public int hashCode(char[] s, int n)
    {
        int sum = precomputedHashCode;
        
        for (int i = s.length-n; i < s.length;i++)
        {
            int j = (int)s[i];
            for (int k = i; k < s.length-1;k++)
            {
                j*=31;
            }
            
            sum+=j;
        }
        
        
        return sum;
    }
        
    // Returns a character array of the string with added letters to make the 
    // hashing function evaluate to zero.
    public char[] getNextNValuesToMakeHashCodeZero(String s, int n)
    {
        char startValue='M';
        
        char[] evaluation = new char[n+s.length()];
        boolean didEvaluation = false;
        for (int i = 0; i < evaluation.length;i++)
        {
            if (i < evaluation.length-n)
            {
                evaluation[i] = s.charAt(i);
            }
            else
            {
                if (!didEvaluation)
                {
                    didEvaluation = true;
                    precomputedHashCode = new String(evaluation).hashCode();
                }
                evaluation[i] = startValue;                
            }
        }
        
        for (int i = evaluation.length-n; i < evaluation.length; i++)
        {
    
            boolean positive = (hashCode(evaluation,n)>0);
            
            if (positive)
            {
                makeHashCodeNegativeAt(evaluation, i, n);
            }
            else
            {
                makeHashCodePositiveAt(evaluation, i, n);
                
            }
            
        }
        
        return evaluation;
    }
    
    
    // Increases the character at i until the hashcode is negative or zero. 
    public void makeHashCodeNegativeAt(char[] s, int i, int n)
    {
        while (hashCode(s,n) > 0)
        {
            s[i]--;
        }
    }

    // Increases the character at i until the hashcode is positive or zero. 
    public void makeHashCodePositiveAt(char[] s, int i, int n)
    {

        while (hashCode(s,n) < 0)
        {
            s[i]++;
        }
    }

What we do is if the hash code is negative, we increase the character at i until the hash code becomes positive or zero. If the hash code is positive, we decrease the character at i until the hash code is negative. Consider the following equation:

31n-1c0 + 31n-2c1 + ... + 314cn-5 + 313cn-4 + 312cn-3 + 311cn-2 + 310cn-1

The above is the hashing function expressed mathematically. However, since it overflows, all we have to do is effectively make the number divisible by (the maximum integer value * 2 + 2), or 4294967296. Doing that is easy since the manipulating the last 7 c values gets us the precision we need to effectively make the value equal to zero without having to introduce any strange characters. Also, since the value is a sum, we can pre-compute the hash code for the first n-7 values, and simply manipulate the characters in the last 7 values. The algorithm kind of reminds me of opening a combination lock: we manipulate each character until the hash code changes its sign or becomes zero. Each character we manipulate gets us more precision and thus closer to the hash code being zero. Now, let’s take the saved Wikipedia article, run the process on it, and see how long it takes to hash. We can take the article variable we bench marked before, and simply re-assign it.

article = new String(h.getNextNValuesToMakeHashCodeZero(article, 7));

According to my computer, the time it now took to get the hash code for the string was 158755.78ns. This is over 250 times slower! By adding just seven ordinary characters to the end of the string (they were “ODP5`/]” by the way), we made the average compute time much slower.

How can this be fixed? We can get much better performance in these bizarre cases by simply adding a boolean that represents whether the value has been hashed yet or not, rather than checking if the internal hash is zero. This code is run so frequently, it surprises me that there are even two separate variables (h and hash) in the hashCode() method to begin with. By replacing one of the integers (h) with a boolean hasBeenHashed, we can save a lot of time in these bizarre circumstances.