Hadoop supports various compression formats, and Snappy is one of them. I created a Snappy-compressed file using the Google Snappy library and used it in Hadoop, but Hadoop failed with an error saying the file is missing the Snappy identifier. After a little research I found a workaround, and the method I followed to find it was as follows.
I compressed the same file twice, once with the Google Snappy library and once with the Snappy codec present in Hadoop, then compared the sizes and checksums of the two outputs. They differ: the file produced by the Hadoop codec is a few bytes larger than the one produced by Google Snappy. The extra bytes are stream metadata (block headers) that Hadoop's codec writes around the compressed data.
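You can reproduce the comparison with a minimal sketch like the one below. It assumes the snappy-java binding (org.xerial.snappy) is on the classpath alongside the Hadoop jars, and that your libhadoop was built with Snappy support; the class name CompareSnappyOutputs is just illustrative.

package com.snappy.codec;

import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;
import org.xerial.snappy.Snappy;

public class CompareSnappyOutputs {
    public static void main(String[] args) throws Exception {
        byte[] input = Files.readAllBytes(Paths.get(args[0]));

        // Raw block compression with the Google/xerial snappy library
        byte[] raw = Snappy.compress(input);

        // Hadoop's SnappyCodec, which adds its own stream metadata
        CompressionCodec codec = ReflectionUtils.newInstance(
                SnappyCodec.class, new Configuration());
        ByteArrayOutputStream framed = new ByteArrayOutputStream();
        OutputStream out = codec.createOutputStream(framed);
        out.write(input);
        out.close();

        System.out.println("raw snappy bytes   : " + raw.length);
        System.out.println("hadoop codec bytes : " + framed.size());
    }
}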
The code shown below creates a Snappy-compressed file that works perfectly in Hadoop. It requires the following dependent jars, all of which are available in your Hadoop installation.
1) hadoop-common.jar
2) guava-xx.jar
3) log4j.jar
4) commons-collections.jar
5) commons-logging.x.x.x.jar
You can download the code directly from GitHub.
package com.snappy.codec;

/*
 * @author : Amal G Jose
 */
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

/*
 * This program compresses the given file in snappy format.
 */
public class CreateSnappy {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: CreateSnappy <input> <output>");
            System.exit(1);
        }
        try {
            // Hadoop's Snappy codec writes the stream metadata that
            // Hadoop expects around the compressed blocks
            CompressionCodec codec = (CompressionCodec) ReflectionUtils
                    .newInstance(SnappyCodec.class, new Configuration());
            OutputStream outStream = codec
                    .createOutputStream(new BufferedOutputStream(
                            new FileOutputStream(args[1])));
            InputStream inStream = new BufferedInputStream(new FileInputStream(
                    args[0]));
            // Copy the input to the compressed output in 64 KB chunks
            int readCount = 0;
            byte[] buffer = new byte[64 * 1024];
            while ((readCount = inStream.read(buffer)) > 0) {
                outStream.write(buffer, 0, readCount);
            }
            inStream.close();
            outStream.close();
            System.out.println("File Compressed");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Hi Amal
Is there any Hadoop command to convert input data into Snappy-compressed format?
You can use Hive or Pig with compression enabled to convert files to Snappy; a direct command is not available. Alternatively, you can build my program into a jar and use that to convert files to Snappy.
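For example, in Hive you can enable Snappy output before rewriting a table. This is just a sketch: the table names are placeholders, and the exact property names vary slightly across Hadoop/Hive versions.

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
INSERT OVERWRITE TABLE target_table SELECT * FROM source_table;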
Thanks Amal…
When I run the program I see this error. Any idea on this?
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.io.compress.CodecPool.getCompressor(Lorg/apache/hadoop/io/compress/CompressionCodec;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/io/compress/Compressor;
at org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:98)
at hive.HiveJdbcClient.main(HiveJdbcClient.java:34)
Seems like this is a library version issue. Which version/distribution of Hadoop are you using?
Amal, I am new to Hadoop. The above code helped me compress a file on the local file system. I want to do the same in HDFS. Could you please share a code snippet?
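Here is a minimal sketch of the same flow against HDFS, assuming the cluster configuration (core-site.xml etc.) is on the classpath; the class name CreateSnappyHdfs is just illustrative.

package com.snappy.codec;

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CreateSnappyHdfs {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Usage: CreateSnappyHdfs <hdfs input> <hdfs output>");
            System.exit(1);
        }
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);
        // Read the source file from HDFS and write a Snappy-compressed copy back
        InputStream in = fs.open(new Path(args[0]));
        OutputStream out = codec.createOutputStream(fs.create(new Path(args[1])));
        // copyBytes with close=true closes both streams when done
        IOUtils.copyBytes(in, out, conf, true);
        System.out.println("File Compressed");
    }
}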
Can you confirm the versions of the jar files you used? I'm getting the error below:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
com.snappy.codec.CreateSnappy.main(CreateSnappy.java:35)
Do you have any idea on this?
I took the jars from my CDH4 cluster; I don't remember the exact component versions. The error itself says your libhadoop native library was built without Snappy support, so you need a Hadoop build with the Snappy native library enabled. On newer Hadoop versions you can check what is loaded with hadoop checknative.
Hi Amal,
Could you please explain in detail what Snappy is exactly, how it improves performance, and whether it will also compress jars?