Program To Implement Crc In Java

Today we are going to try to optimise a hash-function based on CRC32 calculation. Here is some background.

If you have an optimized program than listed on our site, then you can mail us with your name and a maximum of 2 links are allowed for a guest post. Mail Us at: [email protected] Program missing? If you find any topic or program missing according to your college, you can submit the topic or name of program using the below link. First data bit is: 1 Remainder: 1010 101 The CRC code generated is: Enter the data to be sent: Enter bit number 10: 1 Enter bit number 9: 0 Enter bit number 8: 0 Enter bit number 7: 1 Enter bit number 6: 1 Enter bit number 5: 0 Enter bit number 4: 1 Enter bit number 3: 1 Enter bit number 2: 0 Enter bit number 1: 1 1.). Program to implement Cyclic Redundancy Check,CRC-12. Program to implement RSA(Ron Rivest, Adi Shamir and Len Adleman, who invented it in 1977 RIVE78).

In “Game of Life, hash tables and hash codes” we implemented Conway’s game of Life using Java’s HashMap.We discovered that performance depends on a choice of a hash function and identified some of them that work well (spread hash values evenly). Those included a function basedon calculating a remainder and another one that used CRC32.Then we measured the speed of the hash functions and tried optimisingthe remainder-based one (in “Optimising division-based hash functions” and “More optimisations of division-based hash: Karatsuba’s method”).These attempts failed, because the optimisation performed by the HotSpot compiler was better than anything we could do by hand. However, I don’t regret that effort –we’ve learned something interesting in the process. Among other things, we discovered (see“Don’t trust a micro-benchmark”) that the results of micro-benchmark tests do not always correlate with the resultsof full program tests, so micro-benchmarks must be used with care.

Nevertheless, today we are going to run micro-benchmarks again, this time on a CRC32-based hash function. Having learned the lesson, we’ll later verify the resultson the full program test. Obviously, the same may happen that happened with the remainder-based function: the compiler may already be very good in compiling it.We can only hope that even in this case we’ll learn something interesting. Let’s go.

The method in question

The full code of the project is here. This is the function we are talking about – LongPoint7.hashCode():

Here are the execution times, in milliseconds for 100M iterations (for comparison, I added times for one other hash function and for the null function, which does nothing):

Program To Implement Crc In Java Compiler

Class nameCommentTime, Java 7Time, Java 8
LongPoint5Multiply by one big prime 547 486
LongPoint7java.util.zip.CRC3210109 1882
NullPointNull function (does nothing) 532 455

Our hash function is much slower than the others, especially on Java 7. Let’s see what can be done here.

The first ideas

The CRC32 class we are using comes from java.util.zip. Let’s look inside this class.

This is the code of update() in both Java 7 and Java 8:

A native call is made for every input byte, and we know that native calls are expensive.This suggests our first idea for improvement: to make just one native call for the entire long.The CRC32 class contains a method we can use for that (at a cost of an array allocation):

Here is the new code (class LongPoint71):

One immediate observation about this code is that we don’t have to allocate a new byte array each time and can re-use an existing one. This may seem obvious that thisway it will be faster but I won’t be convinced until I try. Here is the new class LongPoint72:

Another improvement that seems obvious is to re-use the CRC32 object as well. We’ll try this, too, but there is one important point to note: after this the hashCode() functionstops being thread-safe. Two threads using the same CRC32 object may corrupt each other’s calculations. Sharing the byte array does not cause this, because the functions writethe same values into the elements of this array. Our program is single-threaded, but it is preferable to write more generally applicable code. Still, we’ll try this option, too(see LongPoint73):

Now it’s time to start running the programs (the main class is HashTime):

Here are the results:

NameDescriptionTime, Java 7Time, Java 8
LongPoint7CRC32, 8 calls one byte each 101091882
LongPoint71CRC32, one call with an array 67302056
LongPoint72CRC32, one call with re-used array 64221901
LongPoint73CRC32, one call with re-used array and CRC 65662009
  • The optimisation helped on Java 7 (we’ve got 33% improvement), but not on Java 8, where the time got worse (it resembles the story with the remainder-based hash, doesn’t it?)

  • Re-using the array have helped a bit

  • Not only is re-using of CRC object thread-unsafe, it is also slower than allocating new one each time. Object allocation isn’t always slow!

From arrays to buffers

Our next step is replacing an array with a direct byte buffer. This offers two potential advantages:

  • Direct byte buffers support fast decomposition of primitive values into bytes. The entire long can be written into a buffer using just one CPU instruction

  • JNI has built-in support for direct buffers, and passing them to the native code is faster than passing arrays.

Unfortunately, only Java 8 contains a method that applies CRC32 to a byte buffer. Here it is:

This method doesn’t even use the JNI support for direct buffers.Instead, it takes the address of the underlying memory and passes it to the native code as a number. This trickpromises very high performance.

Let’s write a version that uses this method and run it on Java 8.Here is the code (LongPoint74):

Note that this code went one step further in re-using the same object than LongPoint73: it made the buffer a static object. There is a reason for it: direct buffers are allocatedin separate (native) memory, and the minimal quantum of allocation is 4096 bytes. They are freed when their Java objects get garbage-collected, over which we haveno control. If Java heap is big enough, too many Java objects may be allocated before the garbage collector is called, and we may run out of native memory.Obviously, the static buffer causes this method to be thread-unsafe.

The results, however, are not impressive:

NameDescriptionTime, Java 7Time, Java 8
LongPoint7CRC32, 8 calls one byte each 101091882
LongPoint72CRC32, one call with re-used array 64221901
LongPoint74CRC32, one call with a direct byte buffer 1903

We won’t even try re-using the CRC object.

Writing in Java

What else can we do? It seems like all our attempts have failed. No, there are other options to try. The first one is to write the entire CRC32 calculation in Java. It is unlikely tobe fast (if it was fast enough, the JDK designers would have used this approach), but completeness requires us to test this anyway. Fortunately, we don’t need to implementCRC32 according to its strict definition (division of polynoms). There is a fast table-based implementation(see TableCRC32.java). This is what the application of CRC to a byteand to a long look like:

And this is our new hash function:

The results look much better:

NameDescriptionTime, Java 7Time, Java 8
LongPoint7CRC32, 8 calls one byte each 101091882
LongPoint72CRC32, one call with re-used array 64221901
LongPoint75CRC32, written in Java15051420

Our prediction was wrong – this is definitely an improvement. The time on Java 7 has improved dramatically,while on Java 8 we’ve improved over the original time for the first time. It is unclear why JDK does not use Java implementation of CRC32.

Writing in C

The next thing we can try is rewrite the same code in C. Most probably, the original native code of CRC32 is very similar to what we are going to write,but one trick may give us an advantage: we are going to introduce a JNI call that operates on a long. No arrays or byte buffers necessary. For the purpose of this articlewe’ll create a simple class with a single native method:

Obviously, the proper implementation must have many other useful methods, but we’ll implement only this one, to keep things short.

We’ll need a C file, called NativeCRC32.c,with the table-based CRC implementation and a JNI wrapper:

To build it, we use a rather sophisticated command line:

The compiler produces library libNativeCRC32.so, which will be loaded when the class NativeCRC32 is initialised.

Here are the results:

NameDescriptionTime, Java 7Time, Java 8
LongPoint72CRC32, one call with re-used array 64221901
LongPoint75CRC32, written in Java15051420
LongPoint76CRC32, written in C, one JNI call 18901898

The results are not great, in the case of Java 8 they don’t differ much from those of our version using an array. In both cases they are worse than our resultsin Java. This disproves a popular belief that C is always faster than Java – at least, when JNI is involved.

Writing in assembly

It looks like we’ve reached a limit. What more can one do when we already tried to write everything in C? Only one thing: write in assembly. OK, it won’t be a true assembly; C withSSE intrinsics will be sufficient for our needs.

Intel processors (starting at SSE 4.2) have an instruction calculating CRC32. This, however, is a different version of CRC32, called CRC32C (Castagnolli). It differsfrom our CRC32 in the polynom it uses, but the general principles are the same, and it can be calculated by the same table-based code (except the table is different).We didn’t measure hash values distribution when using CRC32C, but let’s hope it is good. We’ll add another function to our NativeCRC32 class:

We’ll also need another class, LongPoint77, to use this function:

Finally, we need to add -msse4.2 to our build command, and we are ready to run:

NameDescriptionTime, Java 7Time, Java 8
LongPoint75CRC32, written in Java15051420
LongPoint76CRC32, written in C, one JNI call 18901898
LongPoint77CRC32C, written in C using SSE, one JNI call 12571214
Java

This is much better than anything we’ve seen before. On Java 7 this version runs faster by 87% (that is, eight times faster) than the original one.On Java 8 the speedup is smaller, only by 35%. Unfortunately, even such a speedup can’t be considered a success: other hash functions take 400-600 ms with similarquality of hashing, and we’ve exhausted our list of ideas for CRC32 optimisation.

The cost of JNI

It still looks strange that we’ve achieved such a low speed for a hash function consisting of just one CPU instruction (CRC32). Maybe, this is a very slow instruction?No, it has the latency of 3 cycles (according to Agner Fog). The only possible explanation can be the cost of the JNIcall. Let’s check this. We’ll add another function to our native code:

Program To Implement Crc In Java

Here are the results:

NameDescriptionTime, Java 7Time, Java 8
LongPoint75CRC32, written in Java15051420
LongPoint76CRC32, written in C, one JNI call 18901898
LongPoint77CRC32C, written in C using SSE, one JNI call 12571214
LongPoint78JNI call to an empty function12561214

This looks shocking at first: the results look as if the CRC calculation didn’t require any CPU cycles. This can be explained by the parallel structure of our processor:while CRC32 is busy executing and as long as its result isn’t used, the rest of the program can carry on. The JNI exit sequence may take longer than 3 cycles, so the CRC32instruction can totally disappear from our timing.

Our measurements also show that JNI isn’t cheap. Let’s look at Java 8. The test with empty JNI call (LongPoint78) executed for 1214 ms. The test with the empty function(NullPoint) took 455 ms (these are the expenses of the virtual call plus the test overheads). This means that the JNI call to the static method taking a long as a parameterand returning an int, executed 100 million times, took 759 ms, or about 7.6 ns per execution. On this processor it corresponds to 20 CPU cycles. This is relatively small overhead formethods that take thousands of cycles to execute, but it is too big for short methods. In the case of CRC32C the overheads of JNI were much bigger than the cost of the methoditself, while in the case of the CRC32 written in C they were roughly the same (the calculation took about 18 cycles). In short, nothing based on JNI is a good solution for ahash function – unless we need to hash objects of kilobyte or megabyte sizes.

We can now see why the original version took so long in Java 7. Eight JNI calls alone take 5800 ms. Why is then Java 8’s version faster than that?

The mystery of fast CRC32 on Java 8

To answer this question, we’ll have to look at the assembly listings of LongPoint7.hashCode. In Java 7 the code of this method contains eight calls to CRC32.update:

This fragment shows only two calls to update(). The entire code is here.

The code for CRC32.update contains calls to the runtime, which probably represent JNI call.

The code on Java 8 looks completely different. It is much longer and more complex.The entire code is here, I’ll show the Java reconstruction:

Here table is some statically defined array of type int[]. This code looks odd at first, but if we simplify it, it will look very familiar:

This is exactly the code of crc32 (long) in TableCRC32. It starts at crc = -1 and applies the function

to all eight bytes of v.

What we have just found out if that in Java 8 the method CRC32.update (int) is intrinsic. It is defined as native but is not compiled as a native call. Instead,its code is placed directly at the call site. All the usual optimisations are performed after such substitution. For instance, in our case the constant propagation has happened –the first crc >>> 8 was replaced with 0xFFFFFF. Instruction re-ordering has also been performed – extraction of bytes from v happens before the table access.

No wonder we haven’t achieved any speed improvements when calling versions of update with arrays or byte buffers – we eventually executed the same code, but carried theadditional costs of manipulating those arrays or buffers.

What is more surprising is that the pure Java version is performing faster than the one using intrinsic code (1420 ms instead of 1802). The detailed study of this phenomenonfalls outside the scope of this article, but a brief examination shows that the Java version is in fact compiled better.Here is the code. It is shorter: 71 instruction insteadof 87. For instance, a single table access takes four instructions:

while in the intrinsic version it took six:

Interestingly, the access to the array in both cases is done using the static address rather than indirectly using a reference. While in the intrinsic case this may beexplained by the code not being written in Java, in Java case the compiler clearly made use of the fact that thearray is static final, so its address is known at the time of compilation. The same applies to its size. The compiler knew the array size was 256 and it was accessedusing byte-sized indices, which allowed to optimise out the index checking. This resulted in a nearly optimal code.

The full test

Let’s now run the full test (a Life simulation). The entire code for the test is here.This table shows both micro-benchmark and full-test times for Java 7 and Java 8. Two other hash functions addedfor comparison:

Class nameCommentTime, micro-benchmarkTime, full test
Java 7Java 8Java 7Java 8
LongPoint5Multiply by one big prime 547 486 19791553
LongPoint6Modulo big prime 5781655 18901550
LongPoint7java.util.zip.CRC32101091882101153206
LongPoint71CRC32, one call with an array 67302056 74163515
LongPoint72CRC32, one call with re-used array 64221901 71963235
LongPoint73CRC32, one call with re-used array and CRC 65662009 77713603
LongPoint74CRC32, one call with a direct byte buffer 19033310
LongPoint75CRC32, written in Java 15051420 35222664
LongPoint76CRC32, written in C, one JNI call 18901898 35842915
LongPoint77CRC32C, written in C using SSE, one JNI call 12571214 24092045
LongPoint78JNI call to an empty function 12561214
NullPoint Null function (does nothing) 532 455

This time the results are predictable: what is faster in the micro-benchmark, is also faster in the full test.

Conclusions

Program To Implement Crc In Java Runtime

  • We managed to optimise CRC32 calculation quite a bit, especially on Java 7

  • java.util.zip.CRC32.update is an intrinsic on Java 8, so there is no point reducing the JNI call overheads

  • Surprisingly, the version in pure Java works faster than that intrinsic. It is also faster than the version in C called via JNI

  • The only way to perform calculations faster in C is by using the new CRC32 instruction available via SSE intrinsic set. One must remember that this is a different versionof CRC, although it is also well suitable as a hash function.

  • With all our optimisations, the hash function based on CRC32 still works slower than one based on a single multiplication or division. While CRC may be useful for hashinglonger byte sequences, or for data distributed in some special way, it isn’t the best hash function for our problem. We’ll use the division-based function from now on.

Coming soon

We’ve paid enough attention to the hash functions. Now it’s time to optimise the algorithm itself. That’s what we’ll be doing in the next article.

Comments are closed.