
WBZ

Check the article here: How to Build a Lossless Data Compression and Data Decompression Pipeline

A parallel implementation of the bzip2 data compressor in Python. This data compression pipeline uses algorithms such as the Burrows–Wheeler transform (BWT) and move-to-front (MTF) encoding to improve Huffman compression. For now, the tool focuses on compressing .csv files and other files in tabular format.
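For intuition, the BWT and MTF stages can be sketched in a few lines of Python. This is a minimal, naive sketch (the sentinel character `\x03` and a 256-symbol alphabet are assumptions), not the repository's actual implementation:

```python
def bwt(text: str, eos: str = "\x03") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of the
    input and keep the last column. eos marks the end of the text."""
    text += eos
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def mtf(text: str) -> list[int]:
    """Move-to-front: recently seen symbols get small indices, so the
    character runs produced by the BWT become runs of small integers."""
    alphabet = [chr(i) for i in range(256)]
    out = []
    for ch in text:
        idx = alphabet.index(ch)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))  # move symbol to the front
    return out

# bwt("banana") -> "annb\x03aa": equal characters are grouped together,
# and mtf() turns those runs into values Huffman coding compresses well.
```

The BWT does not compress anything by itself; it only reorders the input so that the MTF output is dominated by small values, which is exactly the skewed distribution Huffman coding exploits.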

Data pipeline compression


How to use the tool

The tool is called WBZ. The first version focuses on compressing .csv files, and more features will be added soon. The parameters are described as follows:

python wbz.py -a encode -f 'C:\Users\...\data.csv' -cs 20000 -ch ';'

python wbz.py -a decode -f 'C:\Users\...\data.wbz' -cs 20000 -ch ';'

The same chunk size and special character used to encode a file must be used to decode it. Keeping them as parameters makes it possible to tune the trade-off between encoding/decoding speed and compression rate.
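The parameters above could be wired up roughly as shown below. The flag names mirror the usage examples in this README, but the parsing and chunking code is an illustrative sketch, not the repository's implementation:

```python
import argparse

def parse_args(argv=None):
    # Flag names follow the README's usage examples.
    p = argparse.ArgumentParser(prog="wbz")
    p.add_argument("-a", "--action", choices=["encode", "decode"], required=True)
    p.add_argument("-f", "--file", required=True)
    p.add_argument("-cs", "--chunk-size", type=int, default=20000,
                   help="characters per chunk (must match on decode)")
    p.add_argument("-ch", "--char", default=";",
                   help="special/separator character (must match on decode)")
    return p.parse_args(argv)

def split_chunks(text: str, chunk_size: int) -> list[str]:
    # Each chunk is transformed independently (BWT -> MTF -> Huffman),
    # which is what allows the pipeline to process chunks in parallel.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Because each chunk is self-contained, decoding with a different chunk size would slice the stream at the wrong boundaries, which is why the parameters must match between encode and decode.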

Performance

The tests were run on three .csv files of different sizes, varying the chunk size:

The compression rate improves with larger chunk sizes.

[Figure 1: compression rate vs. chunk size]

Compression times grow with a logarithmic behavior as the chunk size increases.

[Figure 2: compression time vs. chunk size]

Regardless of file size, decompression times are roughly constant and tend to decrease as the chunk size increases.

[Figure 3: decompression time vs. chunk size]
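Measurements like these can be reproduced with a small timing harness. A minimal sketch (assumed helpers, where `func` stands for any encode or decode callable):

```python
import time

def best_time(func, data, repeats=3):
    """Best wall-clock time of func(data) over several runs,
    to reduce noise from other processes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best

def compression_rate(original: bytes, compressed: bytes) -> float:
    """Compressed size as a fraction of the original size
    (lower is better)."""
    return len(compressed) / len(original)
```

Running `best_time` over a grid of chunk sizes for each test file gives the speed/compression-rate trade-off curves shown above.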

To-Do List:

Contributing and Feedback

Any ideas or feedback about this repository? Help me improve it.

Authors

License

This project is licensed under the terms of the Apache License.