Compression and decompression

Post by Pjot on Oct 27, 2015 21:15:15 GMT 1

It would be fun to have native compression in BaCon using an original algorithm. I tried to implement a simple compression technique for text files:

(1) find all words in the file and count their occurrence, put this information in a list
(2) sort the list, and identify the most occurring words with a unique number, store this in a table
(3) write this table, then the data of the original file, replacing the words from the table with their number.

For decompression it is the other way around of course.
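To make steps (1) to (3) concrete, here is a minimal sketch in Python rather than BaCon. It is not the proof of concept itself: the names `build_table`, `encode` and `decode` are made up for illustration, the table is kept in memory instead of being written to the file, and limiting the table to 127 codes (128..254) is an assumption.

```python
import re
from collections import Counter

def build_table(text, max_codes=127):
    """Rank words of 2+ ASCII letters by frequency * length; give the best ones a code >= 128."""
    counts = Counter(re.findall(r"[A-Za-z]{2,}", text))
    ranked = sorted(counts, key=lambda w: (-counts[w] * len(w), w))
    return {word: 128 + i for i, word in enumerate(ranked[:max_codes])}

def encode(text, table):
    """Replace table words by their one-byte code; copy everything else unchanged."""
    out = bytearray()
    pos = 0
    for match in re.finditer(r"[A-Za-z]+", text):
        out += text[pos:match.start()].encode("ascii")
        word = match.group()
        out += bytes([table[word]]) if word in table else word.encode("ascii")
        pos = match.end()
    out += text[pos:].encode("ascii")
    return bytes(out)

def decode(data, table):
    """Bytes below 128 are plain ASCII; bytes from 128 upward are looked up in the table."""
    reverse = {code: word for word, code in table.items()}
    return "".join(reverse[b] if b >= 128 else chr(b) for b in data)

sample = "the cat and the dog and the bird"
table = build_table(sample)
packed = encode(sample, table)
assert decode(packed, table) == sample
print(len(sample), "->", len(packed), "bytes (table not counted)")
```

The sketch leaves out the cost of storing the table, which is why very small inputs look better here than they would in a real file.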

I made a Proof of Concept (POC) to implement this idea, just to see how well it compresses. Typically, the compressed file for BaCon source code programs ends up between 65% and 75% of the original size. Not bad for a first attempt, but definitely not good enough.

Below are the compression and decompression programs. They don't deserve a prize for the nicest code, but it's just a POC like I mentioned.

BR
Peter

COMPRESSION

'
' Proof-of-Concept compression for plain text files ASCII 0-127
'
' (1) Find unique words a-zA-Z and count their frequency
' (2) Replace their occurrence with a number>127 (bit7 set)
'
' Created by PvE 2015 - GPL v3.
'---------------------------------------------------------------------

OPTION MEMSTREAM TRUE
OPTION MEMTYPE unsigned char

IF INSTR(ARGUMENT$, " ") THEN
	file$ = MID$(ARGUMENT$, INSTR(ARGUMENT$, " ")+1)
	IF NOT(FILEEXISTS(file$)) THEN
		PRINT "Cannot find file '", file$, "'! Exiting..."
		END 1
	FI
ELSE
	PRINT "Usage: ", ARGUMENT$, " <file.txt>"
	END 1
FI

CONST WordListSize = 10000

DECLARE word ASSOC int

tt = TIMER

FileSize = FILELEN(file$)

FileData = MEMORY(FileSize)

PRINT "Opening file..."

' Read file into memory area
OPEN file$ FOR READING AS myfile
GETBYTE FileData FROM myfile SIZE FileSize
CLOSE FILE myfile

' Dynamic array to store all words
amount_of_words = WordListSize
DECLARE word$ ARRAY amount_of_words

OPEN FileData FOR MEMORY AS dat$

PRINT "Analyzing data..."

length = 0
ctr = 0

FOR x = 0 TO FileSize-1
	IF (PEEK(FileData+x) & 128) THEN
		PRINT "Binary data found! Is this a UTF-8 file? Exiting..."
		END 1
	ENDIF	
	IF PEEK(FileData+x) > 64 AND PEEK(FileData+x) < 91 THEN
		INCR length
	ELIF PEEK(FileData+x) > 96 AND PEEK(FileData+x) < 123 THEN
		INCR length
	ELSE
		IF length > 1 THEN
			word$[ctr] = MID$(dat$, x+1-length, length)
			INCR ctr
			IF ctr >= amount_of_words THEN
				amount_of_words = 2*amount_of_words
				REDIM word$ TO amount_of_words
				PRINT "Resizing word list..."
			FI
		ENDIF
		length = 0
	FI
NEXT

' This assoc array contains the unique words
DECLARE uniq_word ASSOC long

FOR x = 0 TO ctr-1
	INCR uniq_word(word$[x])
NEXT

' We multiply the frequency with the length
LOOKUP uniq_word TO t$ SIZE amount

FOR x = 0 TO amount-1
	IF LEN(t$[x]) > 1 THEN uniq_word(t$[x]) = uniq_word(t$[x]) * LEN(t$[x])
NEXT

' Sort and create final array
SORT uniq_word DOWN

LOOKUP uniq_word TO s$ SIZE amount

'-------------------------------------------------------------

PRINT "Amount of words: ", ctr
PRINT "Amount of unique words: ", amount
'PRINT "-----------------------------------"
'PRINT "Words most used: "
'FOR x = 0 TO 127
'	PRINT x, " : ", s$[x], " : ", uniq_word(s$[x])
'NEXT
'PRINT "-----------------------------------"

'-------------------------------------------------------------

IdxCtr = MEMORY(2)

POKE IdxCtr, 128

' Correct index for array
DECR amount

' Not more than 127 indexes
IF amount > 127 THEN amount = 127

limit = uniq_word(s$[amount])

DECLARE idx_word ASSOC int

newFile$ = MID$(file$, 1, INSTRREV(file$, ".")) & "baz"
OPEN newFile$ FOR WRITING AS output

PRINT "Writing index table..."

' First store original filename
PUTBYTE file$ TO output SIZE LEN(file$)

' Write table with index number and word
FOR x = 0 TO amount
	PUTBYTE IdxCtr TO output SIZE 1
	PUTBYTE s$[x] TO output SIZE LEN(s$[x])
	idx_word(s$[x]) = PEEK(IdxCtr)
	POKE IdxCtr, PEEK(IdxCtr)+1
NEXT

' Marker to indicate data starts here
POKE IdxCtr, 255
PUTBYTE IdxCtr TO output SIZE 1
PUTBYTE IdxCtr TO output SIZE 1

PRINT "Writing data..."
length = 0
FOR x = 0 TO FileSize-1
	IF PEEK(FileData+x) > 64 AND PEEK(FileData+x) < 91 THEN
		INCR length
	ELIF PEEK(FileData+x) > 96 AND PEEK(FileData+x) < 123 THEN
		INCR length
	ELSE
		' If the word is longer than 1 character
		IF length > 1 THEN
			cur$ = MID$(dat$, x+1-length, length)
			IF uniq_word(cur$) > limit THEN
				POKE IdxCtr, idx_word(cur$)
				PUTBYTE IdxCtr TO output SIZE 1
			ELSE
				PUTBYTE cur$ TO output SIZE LEN(cur$)
			FI
			POKE IdxCtr, PEEK(FileData+x)
			PUTBYTE IdxCtr TO output SIZE 1
		' Original bytes are written unmodified
		ELSE
			IF length = 1 THEN
				POKE IdxCtr, PEEK(FileData+x-1)
				POKE IdxCtr+1, PEEK(FileData+x)
			ELSE
				POKE IdxCtr, PEEK(FileData+x)
			FI
			PUTBYTE IdxCtr TO output SIZE length+1
		ENDIF
		length = 0
	FI
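NEXT

' Flush a word that runs up to the very last byte of the file,
' otherwise it would be missing from the output
IF length > 1 THEN
	cur$ = MID$(dat$, FileSize+1-length, length)
	IF uniq_word(cur$) > limit THEN
		POKE IdxCtr, idx_word(cur$)
		PUTBYTE IdxCtr TO output SIZE 1
	ELSE
		PUTBYTE cur$ TO output SIZE LEN(cur$)
	FI
ELIF length = 1 THEN
	POKE IdxCtr, PEEK(FileData+FileSize-1)
	PUTBYTE IdxCtr TO output SIZE 1
FI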

CLOSE FILE output
CLOSE MEMORY dat$

FREE IdxCtr, FileData

PRINT "All done. Time taken (msecs): ", TIMER-tt
PRINT "Original file size: ", FileSize
PRINT "Compressed file size: ", FILELEN(newFile$)
PRINT "Compression ratio: ", FILELEN(newFile$)*100/FileSize, "%."

END

DECOMPRESS

'
' Proof-of-Concept decompression for plain text files ASCII 0-127
'
' (1) Recreate table from beginning
' (2) Replace bytes with a value > 127 by the corresponding term in the table
'
' Created by PvE 2015 - GPL v3.
'---------------------------------------------------------------------

OPTION MEMSTREAM TRUE
OPTION MEMTYPE unsigned char

IF INSTR(ARGUMENT$, " ") THEN
	file$ = MID$(ARGUMENT$, INSTR(ARGUMENT$, " ")+1)
	IF NOT(FILEEXISTS(file$)) THEN
		PRINT "Cannot find file '", file$, "'! Exiting..."
		END 1
	FI
ELSE
	PRINT "Usage: ", ARGUMENT$, " <file.baz>"
	END 1
FI

tt = TIMER

FileSize = FILELEN(file$)

FileData = MEMORY(FileSize)

PRINT "Opening file..."

OPEN file$ FOR READING AS myfile
GETBYTE FileData FROM myfile SIZE FileSize
CLOSE FILE myfile

DECLARE word$[256]

PRINT "Constructing index table..."
' Get filename
x = 0
WHILE TRUE
	IF PEEK(FileData+x) < 128 THEN
		out$ = out$ & CHR$(PEEK(FileData+x))
	ELSE
		BREAK
	ENDIF
	INCR x
WEND

IF FILEEXISTS(out$) THEN
	INPUT "Filename '", out$, "' exists! Overwrite (y/n)? ", answer$
	' Only stop when the user declines; a missing output file needs no confirmation
	IF answer$ <> "y" THEN END
FI

OPEN out$ FOR WRITING AS uncompressed

' Build table
REPEAT
	p = PEEK(FileData+x)
	' Reaching end marker
	IF p = 255 AND PEEK(FileData+x+1) = 255 THEN BREAK
	' Index number?
	IF p >= 128 THEN
		WHILE TRUE
			INCR x
			IF PEEK(FileData+x) < 128 THEN
				word$[p] = word$[p] & CHR$(PEEK(FileData+x))
			ELSE
				BREAK
			ENDIF
		WEND
	ELSE
		INCR x
	ENDIF
UNTIL x >= FileSize-1

INCR x, 2

IdxCtr = MEMORY(1)

PRINT "Writing uncompressed data..."
' Write file
REPEAT

	IF PEEK(FileData+x) < 128 THEN
		POKE IdxCtr, PEEK(FileData+x)
		PUTBYTE IdxCtr TO uncompressed SIZE 1
	ELSE
		txt$ = word$[PEEK(FileData+x)]
		PUTBYTE txt$ TO uncompressed SIZE LEN(txt$)
	FI
	INCR x

UNTIL x >= FileSize

FREE IdxCtr, FileData
CLOSE FILE uncompressed


PRINT "All done. Time taken (msecs): ", TIMER-tt
END

Post by bigbass on Oct 28, 2015 15:17:46 GMT 1

Hello Peter

This type of challenge is always a great way to improve coding speed and size, and to discover which built-in commands produce better results.

Here are some examples that could be ported
mattmahoney.net/dc/text.html

Zipf's law is interesting too
www.hermetic.ch/wfca/zipf.htm

Thanks for your demo!

Joe

Post by jcfuller on Oct 28, 2015 16:17:44 GMT 1

Personally I never found a reason to even look into a roll your own compression routine.

I never needed fast compression, just super fast decompression, so I have always used aPLib: ibsensoftware.com/products_aPLib.html

James

Post by Pjot on Oct 29, 2015 22:29:10 GMT 1

Hi James,

Well, it's like pushing the Sisyphus stone up the hill indeed. I just wondered how it would work and if it can be done. The compression ratio of the above approach is between 25-35%, leaving a filesize of 65-75% which is not so good.

Therefore, I investigated another approach, using binary compression, which allows all kinds of files (not only text). With this technique I reach 20-25% compression.

But if I use both of them in a two-tier approach, then text files reach a ratio of 45%-50%, leaving a filesize of 55%-50%. Pretty good!
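As a back-of-the-envelope check of the two-tier figures, the following Python sketch multiplies the remaining fractions of both stages. It assumes the two stages act independently, which real data need not obey, and the function name `combined_remaining` is made up for illustration.

```python
def combined_remaining(first, second):
    """Fraction of the original size left after two stages applied in sequence."""
    return first * second

# Figures quoted in this thread: the word stage leaves 65-75% of the file,
# the binary stage saves 20-25%, so it leaves 75-80%.
best = combined_remaining(0.65, 0.75)
worst = combined_remaining(0.75, 0.80)
print(f"two-tier estimate: {best:.0%} to {worst:.0%} of the original size")
```

Under that assumption the estimate spans roughly 49% to 60%, which brackets the measured 50% to 55%.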

Then, when cross checking with plain 'gzip', I can see that this utility can reach even up to 80% for text files, leaving 20% of filesize. Sigh

I guess it takes more time and genius to create an original compression algorithm.

Anyway, it was interesting nevertheless, and I saw some possibilities in BaCon which I did not realize were there. For example, creating an associative array, sorting it downwards and translating this sorted array into an array of strings. Very handy.

BR
Peter

Post by Pjot on Apr 29, 2018 9:54:50 GMT 1

So I couldn't really let go of this challenge. The above attempt only works with ASCII files, but how about compression of binary files? Furthermore, can compression not be done in a small, elegant way, with just a few lines of code?

To accomplish this, I used another algorithm, which, for lack of a better name, I am calling "binary compression". The idea is that the compression will use bit patterns based on occurrences of data:

xx11 -> 4 bytes
xxx10 -> 8 bytes
xxxx01 -> 16 bytes
xxxxxxxx00 -> remaining 228 bytes

The compressed data is identified by the 2 bits on the right. Based on the pattern, the decompression will determine the amount of bits which describes the byte. So, if a bit pattern is '11', then the next 2 bits describe an index to the actual byte (index being 00, 01, 10, 11). If a bit pattern is '10' then there are 3 bits describing the byte, and with bit pattern '01' there are 4 bits describing the byte. This way, it is possible to store 28 bytes in a compressed manner.

The remaining 228 bytes will be stored as-they-are plus 2 identifying bits '00' for each byte. This indeed enlarges the result somewhat, though effectively, the total compression ratio can still be pretty good depending on the type of data. If the data contains a lot of similar bytes, like BMP or ASCII files, the algorithm can reach a ratio up to 40%.
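The bit patterns can be tried out with a small round trip. The following is a Python sketch, not the BaCon program: `pack` and `unpack` are hypothetical names, the table is returned separately instead of being stored in a file header, and the number of symbols is passed to `unpack` explicitly as a simplifying assumption.

```python
from collections import Counter

def pack(data):
    """Return (table, packed): the 28 most frequent bytes and the bit-packed data."""
    table = [value for value, _ in Counter(data).most_common(28)]
    rank = {value: position for position, value in enumerate(table)}
    out = bytearray()
    buf = bits = 0
    for value in data:
        r = rank.get(value)
        if r is None:                      # not in the table: 8 literal bits + tag 00
            code, width = value << 2, 10
        elif r < 4:                        # 2 index bits + tag 11
            code, width = (r << 2) | 0b11, 4
        elif r < 12:                       # 3 index bits + tag 10
            code, width = ((r - 4) << 2) | 0b10, 5
        else:                              # 4 index bits + tag 01
            code, width = ((r - 12) << 2) | 0b01, 6
        buf |= code << bits
        bits += width
        while bits >= 8:                   # flush every completed byte
            out.append(buf & 0xFF)
            buf >>= 8
            bits -= 8
    if bits:
        out.append(buf & 0xFF)             # last, partly filled byte
    return table, bytes(out)

def unpack(table, packed, count):
    """Decode `count` symbols; codes are read from the low bits upward."""
    stream = int.from_bytes(packed, "little")
    out = bytearray()
    for _ in range(count):
        tag = stream & 3
        if tag == 3:
            value, width = table[(stream >> 2) & 3], 4
        elif tag == 2:
            value, width = table[((stream >> 2) & 7) + 4], 5
        elif tag == 1:
            value, width = table[((stream >> 2) & 15) + 12], 6
        else:
            value, width = (stream >> 2) & 0xFF, 10
        out.append(value)
        stream >>= width
    return bytes(out)

sample = b"hello hello hello world"
table, packed = pack(sample)
assert unpack(table, packed, len(sample)) == sample
print(len(sample), "->", len(packed), "bytes (table not counted)")
```

Treating the whole input as one big integer in `unpack` keeps the sketch short. For large files a sliding window of a few bytes is the more practical choice.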

Of course, compressed files need a conversion table to restore the actual bytes. The conversion table is put into the compressed file by the compression routine. This also means that a resulting file will always be larger than 28 bytes. So compressing very small files of 10 bytes will actually result in a larger file (negative compression).
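The effect of the fixed overhead can be estimated up front. This Python sketch assumes the layout used by the compression program in this post (3 identifier bytes, 28 table bytes, then the packed bits plus one final byte). The function name `estimated_size` is made up for illustration.

```python
from collections import Counter

HEADER = 3 + 28  # "BAZ" identifier plus the 28-entry conversion table

def estimated_size(data):
    """Estimate the output size in bytes from the byte frequencies alone."""
    ranked = [value for value, _ in Counter(data).most_common()]
    width = {}
    for position, value in enumerate(ranked):
        if position < 4:
            width[value] = 4       # tag 11
        elif position < 12:
            width[value] = 5       # tag 10
        elif position < 28:
            width[value] = 6       # tag 01
        else:
            width[value] = 10      # literal byte, tag 00
    total_bits = sum(width[value] for value in data)
    return HEADER + total_bits // 8 + 1

tiny = bytes(range(10))
print(len(tiny), "->", estimated_size(tiny))
```

For the 10-byte input the estimate is well above the original size, which is the negative compression mentioned above.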

The below programs are a demonstration of this principle. For sake of clarity, they have not been optimized for speed or size. To give an idea on how well the programs compress, I have tested it with Spinoza's "Ethics":

$ ./compress6 ethica.txt
Done in 3442 msecs. New size is 61% from original.
$ ./decompress6 ethica.txt.baz
Done in 1247 msecs.
$ diff ethica.txt ethica.txt.orig
$

Regards
Peter

COMPRESSION

' Optimize for speed
PRAGMA OPTIONS -O3

DECLARE idx ASSOC long

' Get filename
IF AMOUNT(ARGUMENT$) < 2 THEN
    PRINT "Usage: ", ME$, " <file>"
    END 1
ENDIF

file$ = TOKEN$(ARGUMENT$, 2)

' Using array references is faster than PEEK
DECLARE data TYPE uint8_t*

data = BLOAD(file$)

length = FILELEN(file$)

' Analyze file - count occurrences of bytes
FOR x = 0 TO length-1
    INCR idx(STR$(data[x]))
NEXT

' Sort array based on value
SORT idx DOWN

' Get the order of indexes
dlm$ = OBTAIN$(idx)

' Array with the values for the 28 special chars.
' We declare 256 and leave the other elements 0.
DECLARE new[256] = { 0 }

' Determine position for each byte
FOR x = 1 TO AMOUNT(dlm$)
    IF x BETWEEN 1;28 THEN new[VAL(TOKEN$(dlm$, x))] = x
NEXT

result = MEMORY(31+length*2)

' Identifier
POKE result, ASC("B")
POKE result+1, ASC("A")
POKE result+2, ASC("Z")

byte = 3

' Store table
FOR x = 1 TO 28
    POKE result+byte+x-1, VAL(TOKEN$(dlm$, x))
NEXT

DECLARE buf TYPE uint32_t

INCR byte, 28

bits = 0

' Determine value to be poked based on position
FOR x = 0 TO length-1

    pk = new[data[x]]

    IF pk BETWEEN 1;4 THEN
        buf = buf | ((((pk-1) << 2)| 3) << bits)
        INCR bits, 4
    ELIF pk BETWEEN 5;12 THEN
        buf = buf | ((((pk-5) << 2)| 2) << bits)
        INCR bits, 5
    ELIF pk BETWEEN 13;28 THEN
        buf = buf | ((((pk-13) << 2)| 1) << bits)
        INCR bits, 6
    ELSE
        buf = buf | (( data[x] << 2) << bits)
        INCR bits, 10
    ENDIF

    ' Lower byte full? Poke to memory
    WHILE bits > 7
        POKE result + byte, (buf & 255)
        buf = (buf >> 8)
        DECR bits, 8
        INCR byte
    WEND
NEXT

POKE result + byte, (buf & 255)

BSAVE result TO file$ & ".baz" SIZE byte+1

FREE result, data

PRINT "Done in ", TIMER, " msecs. New size is ", FILELEN(file$ & ".baz")*100/length, "% from original."

DECOMPRESSION

' Optimize for speed
PRAGMA OPTIONS -O3

' Get filename
IF AMOUNT(ARGUMENT$) < 2 THEN
    PRINT "Usage: ", ME$, " <file.baz>"
    END 1
ENDIF

file$ = TOKEN$(ARGUMENT$, 2)

IF LCASE$(RIGHT$(file$, 4)) <> ".baz" THEN
    PRINT "Usage: ", ME$, " <file.baz>"
    END 1
ENDIF

' Using array references is faster than PEEK
DECLARE data TYPE uint8_t*

' Load the data
data = BLOAD(file$)

length = FILELEN(file$)

' Verify file header
IF PEEK(data) <> ASC("B") OR PEEK(data+1) <> ASC("A") OR PEEK(data+2) <> ASC("Z") THEN
    PRINT "This is not a BaCon Zipped file! Exiting..."
    END
ENDIF

' Need int bytes more in the buffer because of parsing below
RESIZE data TO length+SIZEOF(int)

' Array with the values for the 28 special chars.
' We declare 256 and leave the other elements 0.
DECLARE new[256] = { 0 }

' Skip the 3 header bytes BAZ
idx = 3

' Fetch the table
FOR x = 0 TO 27
    new[x] = PEEK(data+idx+x)
NEXT

' Proceed to data
INCR idx, 28

' Memory for result
result = MEMORY(length*2)

bits = 0

' Temp buffer in which the data is parsed
DECLARE buf TYPE uint32_t

' Parse data
WHILE idx < length

    ' Read 3 bytes
    buf = ((data[idx] | (data[idx+1]<<8) | (data[idx+2]<<16)) >> bits)

    IF (buf & 3) = 3 THEN
        POKE result+pos, new[(buf >> 2) & 3]
        INCR bits, 4
    ELIF (buf & 2) = 2 THEN
        POKE result+pos, new[((buf >> 2) & 7)+4]
        INCR bits, 5
    ELIF (buf & 1) = 1 THEN
        POKE result+pos, new[((buf >> 2) & 15)+12]
        INCR bits, 6
    ELSE
        POKE result+pos, (buf >> 2) & 255
        INCR bits, 10
    ENDIF

    ' Wrap around to next byte when needed
    WHILE bits > 7
        DECR bits, 8
        INCR idx
    WEND

    INCR pos
WEND

BSAVE result TO BASENAME$(file$, 1) & ".orig" SIZE pos-1

FREE data, result

PRINT "Done in ", TIMER, " msecs."

EDIT: fixed a bug in the compression function.
EDIT 2: improved performance, fixed segfault in compression when resulting file gets bigger than original size

Last Edit: May 1, 2018 21:22:45 GMT 1 by Pjot: Performance improvements

Post by vovchik on Apr 29, 2018 11:01:29 GMT 1

Dear Peter,

Thanks - baz and unbaz seem to work nicely.

I will have to add a mime type to globs, so that the .baz extension gets recognized. What about:

application/baz:*.baz

I have, for my purposes, attached a BAZ mime icon.

With kind regards,
vovchik

Attachments:

Last Edit: Apr 29, 2018 12:08:01 GMT 1 by vovchik

Post by Pjot on Apr 29, 2018 13:30:18 GMT 1

Thanks vovchik!

Nice icon, we definitely can use it for our BaCon version of compression

In the meantime, I did find a bug in the binary compression function. It was unable to handle byte values above 127 because it did not set the memory type to unsigned char. My tests with UTF-8 text files from Project Gutenberg therefore did not work.

I have fixed it in the above code, here is an example with Wuthering Heights from Emily Brontë which uses UTF-8:

$ ./compress6 pg768.txt
Done in 2956 msecs. New size is 62% from original.
$ ./decompress6 pg768.txt.baz
Done in 1485 msecs.
$ diff pg768.txt pg768.txt.orig
$

So I am planning to improve things further, by combining the concept of the first algorithm in this thread (character compression) and the binary compression. This should deliver pretty good results.

Regards
Peter

Post by Pjot on May 1, 2018 21:36:53 GMT 1

Another update on the binary compressor / decompressor:

Fixed bug when resulting compressed file is of bigger size than original
Performance improvements

The performance on the same "Wuthering Heights" on the same machine:

$ ./compress6 pg768.txt
Done in 367 msecs. New size is 62% from original.
$ ./decompress6 pg768.txt.baz
Done in 411 msecs.
$ diff pg768.txt pg768.txt.orig
$

For compression this is a performance gain of 87% (2956 to 367 msecs) and for decompression it is about 72% (1485 to 411 msecs).

Note that the binary compression works for any type of file, though some files compress less well or even expand - but this is the case with every compression function. Please refer to this very interesting story about some basic concepts of compression. I did find this very insightful!

Regarding the 2-tier approach, my first attempt in this thread, using word occurrences, only works with ASCII files in which no byte has bit 7 set. This is somewhat limited, so I am investigating alternatives.
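The gains can be recomputed from the timings quoted in this thread. A small Python check, added for illustration only (the helper name `gain_percent` is made up):

```python
def gain_percent(before_ms, after_ms):
    """Share of the original run time that was saved, in percent."""
    return 100 * (before_ms - after_ms) / before_ms

# Timings for "Wuthering Heights" before and after the update
print(f"compression:   {gain_percent(2956, 367):.1f}%")
print(f"decompression: {gain_percent(1485, 411):.1f}%")
```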

BR
Peter

Post by vovchik on May 2, 2018 16:32:57 GMT 1

Dear Peter,

Thanks - all working as described. I took the liberty of transforming bas compress and decompress into subs, so they could be more readily used in applications. Everything seems to be working. I don't know whether functions or subs should be used. With functions, we could get back a compressed structure that may come in handy. For the moment, it is subs, and a bit more could be done in passing arguments.

With kind regards,
vovchik

Attachments:

baz-sub.tar.gz (1.74 KB)

Post by Pjot on May 2, 2018 20:44:55 GMT 1

Thanks vovchik,

I am happy to see that all works on your systems as well. The code still is intentionally explicit and it probably can be optimized further. Nevertheless, putting it all in SUB's already provides useful functionality.

In the meantime, I have ended up in the field of sequence mining, which should allow pattern recognition on byte level. This hopefully can provide a way of replacing byte patterns by some index value, for example, a byte value which is not used in the entire file. Or, if all byte values are being used (like in binary files), a byte value not used in a chunk of the file, where such chunk then also contains its own index table.
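To illustrate the idea of replacing a byte pattern by an unused byte value, here is a minimal Python sketch of one such substitution step. It is an assumption about how this could look, not the algorithm under investigation: it only handles patterns of two bytes (similar in spirit to byte pair encoding), works on the whole input rather than on chunks, and the names `replace_pair` and `restore` are made up.

```python
from collections import Counter

def replace_pair(data):
    """Replace the most frequent adjacent byte pair by one byte value unused in `data`.

    Returns (new_data, rule), where rule is (free_value, pair) or None if nothing was done.
    """
    used = set(data)
    free = next((value for value in range(256) if value not in used), None)
    if free is None or len(data) < 2:
        return data, None          # every byte value is in use, or nothing to pair
    (first, second), count = Counter(zip(data, data[1:])).most_common(1)[0]
    if count < 2:
        return data, None          # no pair repeats, so nothing to gain
    pair = bytes([first, second])
    return data.replace(pair, bytes([free])), (free, pair)

def restore(data, rule):
    """Undo replace_pair using the stored rule."""
    if rule is None:
        return data
    free, pair = rule
    return data.replace(bytes([free]), pair)

sample = b"abababab cd"
packed, rule = replace_pair(sample)
assert restore(packed, rule) == sample
print(len(sample), "->", len(packed), "bytes (rule not counted)")
```

The rule has to be stored next to the data, so a single step only pays off when the pair occurs often enough to cover that cost.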

BR
Peter

Post by Pjot on May 12, 2018 13:57:15 GMT 1

So the sequence recognition didn't bring what I was looking for. An algorithm to compress data using pattern recognition including the binary compression only added 4% to the compression ratio.

Instead, I looked into other methods, requirements being that the algorithm should be small and that it can be implemented with just a few lines of code. Eventually, I ended up with LZW compression. Its infamous history of patent disputes around the use of GIF came back to my mind. However, the related LZW patents had expired as of 2004, so there does not seem to be any hindrance to using it.

The LZW algorithm truly is very elegant with a high compression ratio. In fact, it compresses so well that the binary compression on top of it only adds more data to the resulting file. I have tested this with several file types, but in all cases a second tier with binary compression enlarged the resulting file.
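For readers who want to see the core of LZW in isolation, here is a compact Python sketch with fixed 16-bit codes and a dictionary capped at 65536 entries. It is not a port of the BaCon programs: the little-endian code layout and the function names `lzw_compress` and `lzw_decompress` are assumptions made for this example.

```python
import struct

def lzw_compress(data):
    """Emit one 16-bit code per longest already-known byte sequence."""
    table = {bytes([i]): i for i in range(256)}
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            if len(table) < 65536:             # stop growing when 16 bits are used up
                table[candidate] = len(table)
            current = bytes([value])
    if current:
        codes.append(table[current])
    return struct.pack("<%dH" % len(codes), *codes)

def lzw_decompress(packed):
    """Rebuild the same dictionary while reading the codes back."""
    codes = struct.unpack("<%dH" % (len(packed) // 2), packed)
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        else:
            # Code was created by the compressor in its very last step
            entry = previous + previous[:1]
        out += entry
        if len(table) < 65536:
            table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)

sample = b"TOBEORNOTTOBEORTOBEORNOT" * 20
packed = lzw_compress(sample)
assert lzw_decompress(packed) == sample
print(len(sample), "->", len(packed), "bytes")
```

Keeping the dictionary keys as `bytes` avoids any trouble with zero bytes in binary input.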

This experiment however revealed an important flaw in the associative array implementation of BaCon. The internal hash tables run into collisions which cannot be solved. Please use the latest 3.7.3 beta which fixes this issue, and which allows a proper run of the below code.

BR
Peter

LZW COMPRESSION

' Get filename
IF AMOUNT(ARGUMENT$) < 2 THEN
    PRINT "Usage: ", ME$, " <file>"
    END 1
ENDIF

CHANGEDIR DIRNAME$(ME$)

file$ = TOKEN$(ARGUMENT$, 2)

DECLARE data TYPE unsigned char*

data = BLOAD(file$)

length = FILELEN(file$)

' Initialize first 256 elements
DECLARE dict ASSOC unsigned short
FOR nr = 0 TO 255
    dict("," & STR$(nr)) = nr
NEXT

result = MEMORY(length*2+SIZEOF(short))

FOR x = 0 TO length-1

    ch$ = "," & STR$((unsigned char)data[x])

    IF ISKEY(dict, buf$ & ch$) THEN
        buf$ = buf$ & ch$
    ELSE
        OPTION MEMTYPE unsigned short
        POKE result+byte, dict(buf$)

        INCR byte, 2

        IF nr < 65536 THEN
            dict(buf$ & ch$) = nr
            INCR nr
        ENDIF
        buf$ = ch$
    END IF
NEXT

POKE result+byte, dict(buf$)

BSAVE result TO file$ & ".lzw" SIZE byte+2

FREE data, result

PRINT "Done in ", TIMER, " msecs. New size is ", FILELEN(file$ & ".lzw")*100/length, "% from original."

LZW DECOMPRESSION

OPTION MEMTYPE unsigned short

' Get filename
IF AMOUNT(ARGUMENT$) < 2 THEN
    PRINT "Usage: ", ME$, " <file>"
    END 1
ENDIF

CHANGEDIR DIRNAME$(ME$)

file$ = TOKEN$(ARGUMENT$, 2)

DECLARE data TYPE char*

data = BLOAD(file$)

length = FILELEN(file$)

RESIZE data TO length+SIZEOF(short)
POKE data+length, 0

' Initialize first 256 elements
DECLARE dict$ ASSOC STRING
FOR nr = 0 TO 255
    dict$(STR$(nr)) = CHR$(nr)
NEXT

result = MEMORY(length*2+16)

old = PEEK(data)

ch$ = dict$(STR$(old))

OPTION MEMTYPE unsigned short
POKE result, ASC(ch$)

byte = 1

FOR x = 2 TO length-1 STEP 2

    OPTION MEMTYPE unsigned short
    buf = PEEK(data+x)

    IF NOT( ISKEY( dict$, STR$(buf) ) ) THEN
        out$ = dict$(STR$(old))
        out$ = out$ & ch$
    ELSE
        out$ = dict$(STR$(buf))
    END IF

    OPTION MEMTYPE unsigned char
    FOR y = 1 TO LEN(out$)
        POKE result+byte, ASC(MID$(out$, y, 1))
        INCR byte
    NEXT

    ch$ = LEFT$(out$, 1)

    IF nr < 65536 THEN
        dict$(STR$(nr)) = dict$(STR$(old)) & ch$
        INCR nr
    ENDIF

    old = buf
NEXT

BSAVE result TO BASENAME$(file$, 1) & ".orig" SIZE byte

FREE result, data

PRINT "Done in ", TIMER, " msecs."

Last Edit: Dec 16, 2018 22:37:20 GMT 1 by Pjot: Updated compression so it works with large binary files - PvE

Post by Pjot on May 12, 2018 19:20:44 GMT 1

As an add-on to the LZW compression mentioned in the previous post, the following statistics for the same 'Wuthering Heights' on the same system:

$ ./compress7 pg768.txt
Done in 257 msecs. New size is 41% from original.
$ ./decompress7 pg768.txt.lzw
Done in 614 msecs.
$ diff pg768.txt pg768.txt.lzw-orig

Compared with the binary compressor, the compressed file shrinks from 62% to 41% of the original size, an improvement of 21 percentage points, and the amount of code is less.

BR
Peter

Post by Pjot on Dec 16, 2018 22:39:26 GMT 1

All,

While testing the LZW compression routine above, I discovered a bug when it compressed (large) binary files. I have updated the code so it works properly now!

BR
Peter

Post by bigbass on Dec 17, 2018 3:30:58 GMT 1

hello Peter

offering some feedback

the compression part compiles cleanly and ran correctly

pi@raspberrypi:~/Desktop/Documents $ bacon -c clang lzwc
Converting 'lzwc.bac'... done, 52 lines were processed in 0.092 seconds.
Compiling 'lzwc.bac'... clang -c lzwc.bac.c
clang -o lzwc lzwc.bac.o -L. -lbacon -lm
Done, program 'lzwc' ready.

pi@raspberrypi:~/Desktop/Documents $ ./lzwc myinstalled_packages.txt
Done in 592 msecs. New size is 30% from original.

however the decompression part generated a compile time error

pi@raspberrypi:~/Music $ bacon -v

BaCon version 3.8 on Linux armv7l - (c) Peter van Eerten - MIT License.

I used gcc and clang to see why

first with clang

pi@raspberrypi:~/Desktop/Documents $ bacon -c clang lzwd
WARNING: 6 temporary files found! Do you want to delete them (y/n)? y
Temporary files were deleted.
Converting 'lzwd.bac'... done, 72 lines were processed in 0.098 seconds.
Compiling 'lzwd.bac'... clang -c lzwd.bac.c
Makefile.bacon:6: recipe for target 'lzwd.bac.o' failed
Compiler error:

Description:
file 'lzwd.bac' line 9: CHANGEDIR DIRNAME$(ME$)
Cause:
unterminated function-like macro invocation

now with gcc

bacon lzwd
WARNING: 6 temporary files found! Do you want to delete them (y/n)? y
Temporary files were deleted.
Converting 'lzwd.bac'... done, 72 lines were processed in 0.086 seconds.
Compiling 'lzwd.bac'... cc -c lzwd.bac.c
Makefile.bacon:6: recipe for target 'lzwd.bac.o' failed
Compiler error:

Description:
file 'lzwd.bac' line 53: POKE result+byte, ASC(MID$(out$, y, 1))
Cause:
unterminated argument list invoking macro "ASC"

nothing comes to mind as to why but this may be useful to you

thanks having a small compression /decompression file is a good feature
will test more

Joe

UPDATE I COMPILED AGAIN to get the log file
lzwd.bac.c: In function 'main':
lzwd.bac.c:261:0: error: unterminated argument list invoking macro "ASC"
}

lzwd.bac.c:209:55: error: 'ASC' undeclared (first use in this function)
*(__b2c__MEMTYPE*)( result+byte) = ( __b2c__MEMTYPE)( ASC(MID__b2c__string_var(out__b2c__string_var);
^~~
lzwd.bac.c:209:55: note: each undeclared identifier is reported only once for each function it appears in
lzwd.bac.c:209:1: error: expected ')' at end of input
*(__b2c__MEMTYPE*)( result+byte) = ( __b2c__MEMTYPE)( ASC(MID__b2c__string_var(out__b2c__string_var);
^
lzwd.bac.c:209:1: error: expected declaration or statement at end of input
lzwd.bac.c:209:1: error: expected declaration or statement at end of input
lzwd.bac.c:209:1: error: expected declaration or statement at end of input
lzwd.bac.c:208:1: error: label '__B2C__PROGRAM__EXIT' used but not defined
if (__b2c__trap){if(__b2c__memory__check((char*) result+byte, sizeof(__b2c__MEMTYPE))) {ERROR=1; if(!__b2c__catch_set) RUNTIMEERROR("POKE", 53, "lzwd.bac", ERROR); else if(!setjmp(__b2c__jump)) goto __B2C__PROGRAM__EXIT;} }
^~
make: *** [lzwd.bac.o] Error 1

Last Edit: Dec 17, 2018 4:18:10 GMT 1 by bigbass

Post by Pjot on Dec 17, 2018 7:49:48 GMT 1

Hi Joe,

The fix was applied to the compression only, and the decompression code was not changed. So it is strange it now generates an error for you. On my systems it works both for gcc and clang.

From your logging it seems a '(' or a ')' is missing in your code. Can you make sure there is not a copy&paste error?
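One quick way to look for such a copy-and-paste error is to count parentheses per line of the pasted source. The following Python sketch is only a rough aid: it assumes string literals are enclosed in double quotes, does not know about escaped quotes or continued lines, and the function name `unbalanced_lines` is made up.

```python
def unbalanced_lines(source):
    """Return (line_number, depth) for every line whose parentheses do not balance.

    A positive depth means ')' is missing, a negative depth means '(' is missing.
    """
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        depth = 0
        in_string = False
        for ch in line:
            if ch == '"':
                in_string = not in_string      # skip parentheses inside string literals
            elif not in_string:
                if ch == "(":
                    depth += 1
                elif ch == ")":
                    depth -= 1
        if depth != 0:
            problems.append((number, depth))
    return problems

good = "POKE result+byte, ASC(MID$(out$, y, 1))"
bad = "POKE result+byte, ASC(MID$(out$"
print(unbalanced_lines(good + "\n" + bad))
```

Applied to a saved copy of the program, any reported line is a candidate for a truncated paste.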

Best regards
Peter
