Project: Compression based on Huffman Coding

Development Environment

  • Windows 11
  • OpenJDK 20.0.2
  • IntelliJ IDEA

Compilation and Usage

This project is complete and fulfills all specified requirements.

Currently, it works on the Windows file system.

Usage: compile the whole project and cd to the directory containing cf.class, xf.class, and pv.class.

All paths must be absolute.

```
// create an archived file
java cf <InputFile>(file or directory) [OutputFile](optional)
// preview an archived file
java pv <InputFile> [OutputFile](optional)
// decompress an archived file
java xf <InputFile> [OutputPath](optional)
```

Code Structure

  1. cf for compression, pv for preview, xf for decompression: three classes that handle command-line input.
  2. Package BitwiseStream contains BitInputStream and BitOutputStream, which handle the bitwise input during decompression and the bitwise output during compression.
  3. Package Huffman contains the implementation of the Nodes; HuffmanTree implements the Huffman algorithm; TreeCanonization canonizes the Huffman tree.
  4. Package FileProcess
    • FileIO contains the magic numbers for file recognition and basic helper functions
    • StopWatch provides a timer
    • OverwriteProtector protects against unintended overwrites
    • FrequencyTable counts the frequency of each byte value
    • Decoder and Encoder hold the translation table or tree; they are helpers for Decompression and Compression
    • Compression and Decompression contain the main routines
    • PathTreePrint is a helper for the file-structure preview

Core Requirements

Compression and Decompression of Files

The files are augmented with compression headers and magic numbers to distinguish files from directories. This will be discussed in detail in the next section; this section focuses solely on the compression and decompression processes.

First, the cf and xf commands parse the command-line input parameters, then call the corresponding Compression and Decompression functions.

Compression

During compression, the input file is first read once to build the FrequencyTable. Then HuffmanTree applies the Huffman algorithm to determine the code length of each byte. The tree itself becomes useless after this step.
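A minimal sketch of this first pass, with illustrative names (the project's actual FrequencyTable, HuffmanTree, and Node classes may look different): count the byte frequencies, build the Huffman tree with a priority queue, and read off each symbol's depth as its code length.

```java
import java.io.*;
import java.util.*;

public class CodeLengths {
    // Illustrative node type; the project's Huffman Node may differ.
    static final class Node {
        final long freq; final int symbol; final Node left, right;
        Node(long freq, int symbol, Node l, Node r) {
            this.freq = freq; this.symbol = symbol; left = l; right = r;
        }
        boolean isLeaf() { return left == null; }
    }

    // First pass over the file: count how often each byte value occurs.
    static long[] frequencies(File f) throws IOException {
        long[] freq = new long[256];
        try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
            for (int b; (b = in.read()) != -1; ) freq[b]++;
        }
        return freq;
    }

    // Build the Huffman tree, then record each symbol's depth as its code length.
    static int[] codeLengths(long[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingLong((Node n) -> n.freq));
        for (int s = 0; s < 256; s++)
            if (freq[s] > 0) pq.add(new Node(freq[s], s, null, null));
        if (pq.size() == 1) pq.add(new Node(0, -1, null, null)); // guard: single-symbol file
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, -1, a, b));         // merge the two lightest
        }
        int[] lengths = new int[256];
        walk(pq.poll(), 0, lengths);   // the tree is discarded after this
        return lengths;
    }

    static void walk(Node n, int depth, int[] lengths) {
        if (n == null) return;
        if (n.isLeaf() && n.symbol >= 0) lengths[n.symbol] = depth;
        walk(n.left, depth + 1, lengths);
        walk(n.right, depth + 1, lengths);
    }
}
```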

After obtaining the code lengths, TreeCanonization produces the canonical code table. The advantage of this table is that only the code length of each byte needs to be recorded to reconstruct it. The canonical table differs from the originally generated one, but every byte keeps the same code length.
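A sketch of the canonization step under the same assumptions (the project's TreeCanonization may be organized differently): sort the symbols by (code length, symbol value) and hand out consecutive code words, left-shifting whenever the length grows.

```java
import java.util.Arrays;

public class Canonical {
    // Assign canonical codes from the code lengths alone. Symbols with
    // length 0 do not occur in the file and get no code.
    static int[] canonicalCodes(int[] lengths) {
        Integer[] symbols = new Integer[lengths.length];
        for (int i = 0; i < symbols.length; i++) symbols[i] = i;
        Arrays.sort(symbols, (a, b) -> lengths[a] != lengths[b]
                ? lengths[a] - lengths[b] : a - b);
        int[] codes = new int[lengths.length];
        int code = 0, prevLen = 0;
        for (int s : symbols) {
            if (lengths[s] == 0) continue;       // absent symbol
            code <<= (lengths[s] - prevLen);     // grow the code to the new length
            codes[s] = code++;
            prevLen = lengths[s];
        }
        return codes;
    }
}
```

Because both sides derive the same table from the same lengths, compressor and decompressor agree without the tree ever being stored.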

The file name and the code-length table are written to the output uncompressed; then the Encoder compresses the bytes of the file one by one. The output during this step is emitted bit by bit.
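The bit-level output could look like the following sketch, written in the spirit of BitOutputStream but not copied from the project: bits accumulate in a one-byte buffer and are flushed eight at a time.

```java
import java.io.*;

public class BitWriter implements Closeable {
    private final OutputStream out;
    private int buffer = 0; // bits collected so far, most significant first
    private int nBits = 0;  // number of bits currently in the buffer

    public BitWriter(OutputStream out) { this.out = out; }

    public void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++nBits == 8) { out.write(buffer); buffer = 0; nBits = 0; }
    }

    // Write the low `length` bits of `code`, most significant bit first.
    public void writeCode(int code, int length) throws IOException {
        for (int i = length - 1; i >= 0; i--) writeBit((code >>> i) & 1);
    }

    @Override public void close() throws IOException {
        while (nBits != 0) writeBit(0); // zero-pad the final byte
        out.close();
    }
}
```

The zero padding at the end is harmless because the encoded stream ends with the special EOF symbol described in the OutputFile structure below.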

Decompression

First, the original file name and the code-length table are read from the archive. The decoding tree is rebuilt from the code-length table and passed to the Decoder. Bits are then read one at a time; whenever the Decoder reaches a leaf, it outputs the corresponding byte and writes it to the target file.
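A sketch of that decode loop with hypothetical names (BitReader, Node, and the EOF sentinel are assumptions, not the project's actual API): follow one tree edge per input bit and emit a byte at each leaf.

```java
import java.io.*;

public class Decode {
    static final int EOF_SYMBOL = 256; // assumed sentinel beyond all byte values

    // Mirrors the bit writer: refill a one-byte buffer and serve bits
    // most significant first.
    static final class BitReader {
        private final InputStream in;
        private int buffer, nBits;
        BitReader(InputStream in) { this.in = in; }
        int readBit() throws IOException {
            if (nBits == 0) { buffer = in.read(); nBits = 8; }
            return (buffer >>> --nBits) & 1;
        }
    }

    static final class Node {
        final int symbol; final Node left, right;
        Node(int symbol, Node l, Node r) { this.symbol = symbol; left = l; right = r; }
        boolean isLeaf() { return left == null; }
    }

    // Walk the tree one bit at a time; each leaf yields a byte (or EOF,
    // which ends the stream before the final byte padding is reached).
    static void decode(Node root, BitReader in, OutputStream out) throws IOException {
        Node n = root;
        while (true) {
            n = (in.readBit() == 0) ? n.left : n.right;
            if (n.isLeaf()) {
                if (n.symbol == EOF_SYMBOL) return;
                out.write(n.symbol);
                n = root;
            }
        }
    }
}
```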

Compression and Decompression of Directories

To compress directories, the path of each file must be recorded. To prevent empty folders from being dropped, each path is recorded individually. Before each name, one byte indicates whether it is a file or a directory. Two magic-number bytes precede the compressed package so that the program can recognize its own archives independent of the file extension.

The constants mentioned above are defined in FileIO.

OutputFile Structure

  1. header: MagicNumber(2 bytes)--fileStructureHeader(strings of file paths, each ending with STRING_END_SIGN)--HEADER_END_SIGN
  2. continuous FileBlocks or DirectoryBlocks, one per path in the source
  • FileBlocks: SINGLE_FILE_MAGIC--StringOfPath--STRING_END_SIGN--CanonicalCodeTable--encodedBits(ending with a special EOF)
  • DirectoryBlocks: DIRECTORY_MAGIC--StringOfPath--STRING_END_SIGN

PS: Strings are stored in the file C-style rather than Java-style: String = chars + STRING_END_SIGN
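A sketch of writing a header in this layout. The constant values below are placeholders chosen for illustration; the real ones are defined in FileIO.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class HeaderWriter {
    // Placeholder values; the actual constants live in FileIO.
    static final byte[] MAGIC_NUMBER    = {(byte) 0xAB, (byte) 0xCD};
    static final byte   STRING_END_SIGN = 0x00; // C-style string terminator
    static final byte   HEADER_END_SIGN = 0x01;

    // header = MagicNumber -- (path + STRING_END_SIGN)* -- HEADER_END_SIGN
    static void writeHeader(OutputStream out, List<String> relativePaths) throws IOException {
        out.write(MAGIC_NUMBER);
        for (String path : relativePaths) {
            out.write(path.getBytes(StandardCharsets.UTF_8));
            out.write(STRING_END_SIGN);
        }
        out.write(HEADER_END_SIGN);
    }
}
```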

Other Requirements

User Interface

The command-line processing programs are set up as cf, pv, and xf.

For example, java cf <InputFile>(file or directory) [OutputFile](optional) invokes the compression feature.

The StopWatch class records time usage, and some functions' return values accumulate the original files' total size.
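A timer in this spirit can be as small as the following sketch (the project's StopWatch may differ):

```java
public class StopWatch {
    private final long start = System.nanoTime();

    /** Seconds elapsed since this StopWatch was created. */
    public double elapsedSeconds() {
        return (System.nanoTime() - start) / 1e9;
    }
}
```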

Verify Compression Package Source

Two magic-number bytes are prepended to every file generated by compression and verified before processing. If the magic numbers are wrong, the program prints a message and terminates.
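A sketch of that check, with a placeholder magic value (the real constant lives in FileIO):

```java
import java.io.*;

public class MagicCheck {
    static final byte[] MAGIC_NUMBER = {(byte) 0xAB, (byte) 0xCD}; // placeholder

    // Compare the file's first two bytes against the magic number.
    // A short read (read() returning -1) also fails the check.
    static boolean isOurArchive(File f) throws IOException {
        try (InputStream in = new FileInputStream(f)) {
            return in.read() == (MAGIC_NUMBER[0] & 0xFF)
                && in.read() == (MAGIC_NUMBER[1] & 0xFF);
        }
    }
}
```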

Handling File Overwrite

Before the output stream is created during compression or decompression, a file that already exists at the target path would be overwritten.

In that case the program invokes OverwriteProtector, and the user can choose to proceed or abort.

Since many files may need to be overwritten during decompression, the user is also given the option to 'overwrite all'; in that case OverwriteProtector is disabled for the rest of the operation.
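A sketch of this guard, with hypothetical names (the project's actual OverwriteProtector may differ): ask once per existing file, and let the user switch off further prompts for the rest of the operation.

```java
import java.io.File;
import java.util.Scanner;

public class OverwriteGuard {
    private boolean overwriteAll = false;
    private final Scanner stdin = new Scanner(System.in);

    /** Returns true if it is OK to (over)write the given file. */
    public boolean mayWrite(File f) {
        if (!f.exists() || overwriteAll) return true;
        System.out.print(f + " exists. Overwrite? [y]es / [n]o / [a]ll: ");
        String answer = stdin.nextLine().trim().toLowerCase();
        if (answer.startsWith("a")) { overwriteAll = true; return true; }
        return answer.startsWith("y");
    }
}
```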

Compression Package Preview

To implement the preview function, the relative paths of all files and directories are added at the beginning of the compressed package, exclusively for preview purposes.

pv reads only the file names at the beginning of the archive, avoiding the large compressed data that follows, which makes it very fast.

FileIO.PathTreePrint uses a trie to merge the input strings into a tree structure, allowing a tree-shaped file listing to be printed.
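A sketch of the trie idea, not the project's actual PathTreePrint: split each path on the separator, insert the components into nested maps, and print the maps recursively with indentation.

```java
import java.util.*;

public class PathTree {
    private final SortedMap<String, PathTree> children = new TreeMap<>();

    // Insert one relative path, creating a child node per path component.
    void insert(String path) {
        PathTree node = this;
        for (String part : path.split("[/\\\\]"))
            node = node.children.computeIfAbsent(part, k -> new PathTree());
    }

    // Print the tree depth-first, indenting one level per directory.
    void print(String indent) {
        for (Map.Entry<String, PathTree> e : children.entrySet()) {
            System.out.println(indent + e.getKey());
            e.getValue().print(indent + "    ");
        }
    }

    public static void main(String[] args) {
        PathTree root = new PathTree();
        root.insert("docs/readme.txt");
        root.insert("docs/img/logo.png");
        root.insert("src/Main.java");
        root.print("");
    }
}
```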

Performance Test

| testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
| --- | --- | --- | --- | --- | --- |
| "D:\testcases\testcase01EmptyFile\empty.txt" | 0 | 0.282 | / | 0.017 | 0.007 |
| "D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
| "D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
| "D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
| "D:\testcases\testcase03XLargeSingleFile\1.jpg" | 20748 | 20694 | 99.74% | 2.421 | 1.592 |
| "D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
| "D:\testcases\testcase5NomalFolder\3\000007.jpg" | 114 | 114 | 100.11% | 0.077 | 0.017 |
| "D:\testcases\testcase06SubFolders\2\3.txt" | 5034 | 3700 | 73.5% | 0.462 | 0.275 |
| "D:\testcases\testcase07XlargeSubFolders\2\7023341.htm" | 246 | 153 | 62.22% | 0.115 | 0.047 |
| "D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
| "D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |

Comparison to Other Tools

Results

| Tool | testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Huff | "D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
| 7z | | | 571 | 28.7% | 1 | / |
| WinRAR | | | 602 | 30.3% | / | / |
| Huff | "D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
| 7z | | | 6.8 | 19.7% | / | / |
| WinRAR | | | 9 | 26.1% | / | / |
| Huff | "D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
| 7z | | | 0.261 | 182.52% | / | / |
| WinRAR | | | 0.211 | 147.55% | / | / |
| Huff | "D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
| 7z | | | 55629 | 8.6% | 31 | 2.5 |
| WinRAR | | | 49695 | 7.7% | 10 | 3 |
| Huff | "D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
| 7z | | | 55629 | 8.6% | 31 | 2.5 |
| WinRAR | | | 49685 | 7.72% | 10 | 3 |
| Huff | "D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |
| 7z | | | 66736 | 15.47% | 30 | 3 |
| WinRAR | | | 73231 | 16.58% | 11 | 3 |

Summary

In terms of compression ratio, both 7z and WinRAR consistently outperform the Huffman Compression program. WinRAR, in particular, exhibits the best compression ratios across different file types.

The compression ratios for small files are relatively high (i.e., little space is saved) for all three tools, possibly because the compression benefit for small files is inherently limited and fixed header overhead dominates; the already-compressed 4.gz even grows under all three.

In terms of compression time on larger files, WinRAR performs best, followed by my program, while 7z is slowest. This is mainly attributable to the differences in compression algorithms.

In terms of decompression time, Huff is limited by its single-threaded nature: decompression proceeds strictly sequentially, which hurts speed on larger files. 7z and WinRAR leverage multi-threading during decompression, processing different parts of the compressed file simultaneously; this parallelism speeds up decompression, especially for larger files.

Generally, decompression is much faster than compression.

Compression Algorithms

  1. Static Huffman Compression (Huff):

    • Uses static Huffman coding, with compression ratios in these tests typically between 45% and 65%.
    • Limited by its single-threaded design, which hinders its performance on larger files.
  2. 7z (LZMA2 Algorithm):

    • Employs the LZMA2 algorithm, a sophisticated compression method known for its adaptability and high compression ratios.
    • LZMA2 exhibits compression ratios in the range of 8% to 20%, surpassing static Huffman coding.
    • Features multi-threading support, optimizing performance on multicore processors.
  3. WinRAR:

    • Implements a proprietary compression algorithm, potentially a combination of various methods, including but not limited to RAR and ZIP compression.
    • Demonstrates superior compression ratios, especially noticeable in larger files.
    • Benefits from multi-threading, enhancing efficiency in both compression and decompression processes.

Runtime Environment

  1. Huff:

    • Runs on the Java Virtual Machine (JVM), introducing some performance overhead.
    • The higher-level nature of Java might contribute to the observed performance differences.
  2. 7z and WinRAR:

    • Likely implemented in lower-level languages such as C or C++, optimizing execution efficiency.
    • The absence of a virtual machine allows these tools to interact more closely with system resources.

Multi-Threading and Parallel Processing

  1. Huff:

    • Lacks inherent multi-threading capabilities, limiting its performance on larger files.
  2. 7z and WinRAR:

    • Leverage multi-threading to maximize processor capabilities, especially advantageous for compressing and decompressing large files.
    • The ability to utilize multiple cores contributes to faster processing times.

Programming Proficiency

  1. Huff:

    • Implemented in Java, potentially resulting in a performance gap compared to tools coded in lower-level languages.
  2. 7z and WinRAR:

    • Developed by seasoned professionals, likely in C or C++, reflecting a higher level of programming expertise.
    • The efficiency of their algorithms and implementations contributes significantly to their superior performance.

Problems and Solutions

  1. How to store the Huffman tree efficiently

    I read an answer on Stack Overflow about canonical Huffman trees.

  2. How to manipulate bitwise data

    At first, I wanted to use the library from Algorithms, 4th Edition, but the TA did not allow it, so I wrote the BitwiseStream package myself.

  3. How to do a quick preview

    I didn't plan carefully for it, so I simply added the relative paths to the header.

  4. How to check that files are identical

    I searched online and used WinMerge.

  5. How to inspect binary files

    I used the Hex Editor extension in VS Code.
