Word counting

Deleted
Deleted Member

Posts: 0

Word counting Apr 25, 2012 3:20:11 GMT 1

Post by Deleted on Apr 25, 2012 3:20:11 GMT 1

This program counts total and unique words in the text file, sort unique words and save a result to the file. It was written in response to the POTW on basicprogramming.org. The task was to count and sort words in text version of the Bible. Some users of that forum complained about BaCon speed. Yes, BaCon probably can't beat text processing oriented languages (like Perl, for example), but it's possible that my program is just poorly coded.

I wonder if it's possible to complete the same task other (faster) way.

Source code is attached. Text of the Bible can be found on forum.basicprogramming.org/index.php?topic=2222.0 .


OPTION BASE 1
OPTION COMPARE 1

GLOBAL SLOWA$ [1000000]
GLOBAL ILOSC [1000000]
GLOBAL LICZBA [14]

LICZBA [0] = 100000
LICZBA [1] = 10000
LICZBA [2] = 5000
LICZBA [3] = 2500
LICZBA [4] = 1000
LICZBA [5] = 500
LICZBA [6] = 250
LICZBA [7] = 100
LICZBA [8] = 50
LICZBA [9] = 25
LICZBA [10] = 10
LICZBA [11] = 5
LICZBA [12] = 1
LICZBA [13] = 0

SLOWA$ [1] = " "
STAN = 0
IN = 1
TOT = 0

OPEN "Bible.txt" FOR READING AS KSIAZKA

WHILE NOT(ENDFILE (KSIAZKA)) DO
		READLN AKAPIT$ FROM KSIAZKA
		IF NOT(ENDFILE (KSIAZKA)) THEN
				SPLIT AKAPIT$ BY " " TO PODZIELONE$ SIZE ROZMIAR
				FOR X = 1 TO ROZMIAR
					SLOWO$ = CHOP$ (PODZIELONE$ [X], "1234567890-,;:()[]'`.?!")
					IF LEN (SLOWO$) > 2 THEN
							FOR Y = 1 TO IN
									IF SLOWO$ = SLOWA$ [Y] THEN
										ILOSC [Y] = ILOSC [Y] + 1
										STAN = 1
										INCR TOT
									END IF
							NEXT Y
					ELSE 
							STAN = 1
					END IF
					IF STAN = 1 THEN
							STAN = 0
					ELSE
							INCR IN
							SLOWA$ [IN] = SLOWO$
							ILOSC [IN] = 1
							INCR TOT
					END IF
			NEXT
	END IF
	CLEAR 
	PRINT "Total number of words: ";
	PRINT TOT
	PRINT "Number of unique words: ";
	PRINT IN
	PRINT "Unique words represent ";
	PRINT 100 * IN / TOT;
	PRINT "% of all words"
WEND

CLOSE FILE KSIAZKA

STAN = 0
WHILE (STAN <= IN)
		FOR Y = 2 TO IN - 1
				A2$ = SLOWA$ [Y]
				A3$ = SLOWA$ [Y + 1]
				L2 = ILOSC [Y]
				L3 = ILOSC [Y + 1]
				IF L2 > L3 THEN
						ILOSC [Y] = L3
						ILOSC [Y + 1] = L2
						SLOWA$ [Y] = A3$
						SLOWA$ [Y + 1] = A2$
				END IF
		NEXT
		INCR STAN
		CLEAR 
		CLEAR 
		PRINT "Total number of words: ";
		PRINT TOT
		PRINT "Number of unique words: ";
		PRINT IN
		PRINT "Unique words represent ";
		PRINT 100 * IN / TOT;
		PRINT "% of all words\n"
		PRINT "Sorting... ";
		PRINT 100 * STAN / IN;
		PRINT " %\n"
WEND

PRINT "Writing to the file...\n"

OPEN "words.txt" FOR WRITING AS LISTA

C$ = CONCAT$ ("Total number of words: ", STR$ (TOT))
WRITELN C$ TO LISTA
C$ = CONCAT$ ("Number of unique words: ", STR$ (IN))
WRITELN C$ TO LISTA
C$ = CONCAT$ ("Unique words represent ", STR$ (100 * IN / TOT), "% of all words")
WRITELN C$ TO LISTA
WRITELN " " TO LISTA

FOR X = 1 TO 12
	LIC = 0
	FOR Y = IN TO 1 STEP -1
		IF ILOSC [Y] >= LICZBA [X] AND ILOSC [Y] < LICZBA [X - 1] THEN
			INCR LIC
		END IF
	NEXT Y
	C$ = CONCAT$ ("Number of words that appear ", STR$ (LICZBA [X]), " times and more, but less than ", STR$ (LICZBA [X - 1]), " times: ", STR$ (LIC))
	WRITELN C$ TO LISTA
NEXT X

WRITELN " " TO LISTA
FOR Y = IN TO 2 STEP -1
	C$ = CONCAT$ (SLOWA$ [Y], " ", STR$ (ILOSC [Y]))
	WRITELN C$ TO LISTA
NEXT

CLOSE FILE LISTA

PRINT "Done!"
END

Attachments:

2lss Full Member Posts: 140	Word counting Apr 26, 2012 3:46:22 GMT 1 Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by 2lss on Apr 26, 2012 3:46:22 GMT 1 Interesting program I wonder if it's possible to complete the same task other (faster) way. What about using multiple threads? Would that be considered cheating?

Pjot
Administrator

Posts: 2,833

Word counting Apr 26, 2012 16:32:07 GMT 1

Post by Pjot on Apr 26, 2012 16:32:07 GMT 1

Hi Tomaasz,

The string performance is something I have been looking at lately. With each string function BaCon will use 'realloc' to make sure no memory leak exists. However, a 'realloc' is a slow thing to do, the kernel needs some time to allocate the memory.

A non-forum member contacted me about it and suggested that I should implemented a string library similar to the Better String Library. It has other advantages too, but the problem is that rewriting the string implementation in BaCon is just a huge task. Also, this library does not exist as a Shared Object from which functions can be IMPORTed.

Therefore I have been playing with the idea to create an INCLUDE file which uses memory streams to achieve fast string processing. Also functions similar to CONCAT$, LEFT$, RIGHT$ etc should be added to such a context. This library could be called 'FastStrings' or something like that and can be created in BaCon completely.

So if I have time I'll look at it...

Regards
Peter

Deleted
Deleted Member

Posts: 0

Word counting Apr 26, 2012 16:44:02 GMT 1

Post by Deleted on Apr 26, 2012 16:44:02 GMT 1

I was thinking more about functions I could miss (I'm still quite new to BaCon). Obviously, dialects with special functions for string and arrays processing were significantly faster in completing the task. BaCon needs about 4 min. for it, ScriptBasic - 28 sec., ThinBASIC - 2 sec., FreeBASIC - 0.2 sec.

I realized yesterday that I made a mistake and my program counts only words that contain 3 or more characters. To fix it, line 36


IF LEN (SLOWO$) > 2 THEN

should be replaced with


IF LEN (SLOWO$) > 1 OR SLOWO$ = "a" OR SLOWO$ = "I" THEN

EDIT: Peter, thanx for an explanation (it was posted when I was writing my post)!

Last Edit: Apr 26, 2012 16:46:37 GMT 1 by Deleted

Deleted Deleted Member Posts: 0	Word counting Apr 26, 2012 18:54:55 GMT 1 Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by Deleted on Apr 26, 2012 18:54:55 GMT 1 ThinBASIC - 2 sec. What you are seeing is OxygenBasic doing 99% of all the work as a JIT compiler extension to thinBasic.
	Last Edit: Apr 26, 2012 18:56:26 GMT 1 by Deleted

aurel Junior Member SolarMM Posts: 61	Word counting Apr 26, 2012 23:03:35 GMT 1 Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by aurel on Apr 26, 2012 23:03:35 GMT 1 Hmm... I'm not so sure, i know that thinbasic use oxygen speed in some programs but i think that original speed of thinBasic become from PowerBasic turbo-assembler ,right?
	---------------------------------- AurelSoft main site: aurelsoft.ucoz.com basic4all.epizy.com -----------------------------------

Deleted
Deleted Member

Posts: 0

Word counting Apr 26, 2012 23:58:40 GMT 1

Post by Deleted on Apr 26, 2012 23:58:40 GMT 1

thinBasic is an interpreter partially written in PowerBasic. It doesn't create tokenized PCODE and each statement is processed as it is read. When the original word count code challenge was done, Eros worked with Charles (O2 author) to achieve the compiler like results.

What helped ScriptBasic is the SPLITA internal function that created an array (over 1.4 million elements) of words in a fraction of a second.

Deleted
Deleted Member

Posts: 0

Word counting May 3, 2012 0:13:16 GMT 1

Post by Deleted on May 3, 2012 0:13:16 GMT 1

I knew there was a better way to code a word counting example. So, I studied BaCon documentation, made my grey cells work and here it is:


OPTION BASE 1
OPTION COMPARE 1

GLOBAL all$[1000000] TYPE STRING
GLOBAL unique$[20000] TYPE STRING
GLOBAL sorted$[20000] TYPE STRING
GLOBAL unique_number[20000] TYPE NUMBER
GLOBAL sorted_number[20000] TYPE NUMBER

CLEAR
PRINT "Counting..."

OPEN "Bible.txt" FOR READING AS ksiazka

counter = 0
unique_counter = 1
status = 0
unique$[1] = ""

WHILE NOT(ENDFILE (ksiazka)) DO
	READLN akapit$ FROM ksiazka
	SPLIT akapit$ BY " " TO slowa$ SIZE siz
	FOR x = 1 TO siz
		word$ = CHOP$ (slowa$[x], "1234567890-,;:()[]'`.?!")
		IF LEN (word$) > 1 OR word$ = "a" OR word$ = "I" THEN
			all$[counter + x] = word$
			FOR y = 1 TO unique_counter
				IF all$[counter + x] = unique$[y] THEN
					status = 1
					unique_number[y] = unique_number[y] + 1
					sorted_number[y] = sorted_number[y] + 1
					BREAK
				END IF
			NEXT y
			IF status = 0 THEN
				unique_counter = unique_counter + 1
				unique$[unique_counter] = all$[counter + x]
				unique_number[unique_counter] = 1
				sorted_number[unique_counter] = 1
			ELSE
				status = 0
			END IF
		counter = counter + 1
		END IF
	NEXT x
WEND

CLOSE FILE ksiazka

PRINT "Sorting..."

SORT sorted_number DOWN

FOR x = 1 TO unique_counter
	FOR y = 1 TO unique_counter
		IF sorted_number[x] = unique_number[y] THEN
			sorted$[x] = unique$[y]
			unique_number[y] = 0
			BREAK
		END IF
	NEXT y
NEXT x

PRINT "Writing to the file..."
		
OPEN "words.txt" FOR WRITING AS results

c$ = CONCAT$ ("Total number of words: ", STR$ (counter))
WRITELN c$ TO results
c$ = CONCAT$ ("Number of unique words: ", STR$ (unique_counter))
WRITELN c$ TO results
WRITELN " " TO results
FOR y = 1 TO unique_counter - 1
	c$ = CONCAT$ (sorted$[y]," - ", STR$ (sorted_number[y]))
	WRITELN c$ TO results
NEXT

CLOSE FILE results

PRINT "Done!"
END

The old program takes more than 4 min. to complete the task. This one needs only 37 sec.

Source is attached.

Attachments:

Pjot
Administrator

Posts: 2,833

Word counting May 3, 2012 14:32:29 GMT 1

Post by Pjot on May 3, 2012 14:32:29 GMT 1

Great!

You can even gain more speed by compiling with GCC speed optimizations:

bacon -o -O2 words

Regards
Peter

EDIT: just FYI, the WRITELN statement can accept comma separated arguments, as follows:


OPEN "words.txt" FOR WRITING AS results

WRITELN "Total number of words: ", STR$(counter) TO results
WRITELN "Number of unique words: ", STR$ (unique_counter) TO results
WRITELN " " TO results

FOR y = 1 TO unique_counter - 1
	WRITELN sorted$[y]," - ", STR$ (sorted_number[y]) TO results
NEXT

CLOSE FILE results

Last Edit: May 3, 2012 14:48:48 GMT 1 by Pjot

Deleted
Deleted Member

Posts: 0

Word counting May 3, 2012 21:55:04 GMT 1

Post by Deleted on May 3, 2012 21:55:04 GMT 1

May 3, 2012 14:32:29 GMT 1 Pjot said:

Great!
You can even gain more speed by compiling with GCC speed optimizations:

31 sec. with GCC speed optimizations.

May 3, 2012 14:32:29 GMT 1 Pjot said:

EDIT: just FYI, the WRITELN statement can accept comma separated arguments...

It won't make the whole program work faster (writing to the file takes less than 1 sec.), but still it's good to know it. Thanx!

Pjot
Administrator

Posts: 2,833

Word counting May 6, 2012 16:39:43 GMT 1

Post by Pjot on May 6, 2012 16:39:43 GMT 1

See if you can beat this.


OPTION BASE 1

CONST txt$ = "Bible.txt"

' Numeric array for words of 1-12 characters
DECLARE words[1000000]

' String array for words > 12 characters
DECLARE term$[10000]

' Initialize variables
DECLARE total, hash, high, length, uniq, tctr, fsize, mem

' Declare memory
fsize = FILELEN(txt$)
mem = MEMORY(fsize)

' Get the complete text into memory
OPEN txt$ FOR READING AS bible
	GETBYTE mem FROM bible SIZE fsize
CLOSE FILE bible

' ------------------------------------------ 1st part ---------------------------------------------------

' Start analyzing memory
FOR pos = 0 TO fsize-1

	' Collect word from memory
	c = PEEK(mem+pos)

	' Map a-z or A-Z to 1-26, ignore other characters
	offset = 0
	IF c >= 97 AND c <= 122 THEN offset = 96
	IF c >= 65 AND c <= 90 THEN offset = 64

	' So we found a valid character?
	IF offset != 0 THEN
		INCR length
		IF length > 12 THEN
			high = high << 5
			high = high | (c - offset)
		ELSE
			hash = hash << 5
			hash = hash | (c - offset)
		END IF
	END IF

	' Do we have a space, CR, LF?
	IF c = 32 OR c = 10 OR c = 13 THEN
		' Is there a word?
		IF hash > 0 THEN
			' Increment total amount
			INCR total
			' Handle long words separately
			IF length > 12 THEN
				build$ = ""
				WHILE high > 0
					build$ = CONCAT$(build$, CHR$((high&31)+96))
					high = high >> 5
					IF high = 0 THEN
						high = hash
						hash = 0
					END IF
				WEND
				INCR tctr
				term$[tctr] = REVERSE$(build$)
			' It is a short word
			ELSE
				words[total] = hash
			END IF
			' Reset
			hash = 0 : high = 0 : length = 0
		END IF
	END IF
NEXT

FREE mem

PRINT "Total words: ", total

' ------------------------------------------ 2nd part ---------------------------------------------------

' Determine unique words
uniq = total

' Small words
SORT words DOWN
FOR x = 1 TO total-1
	FOR y = x+1 TO total
		SELECT words[x]
			CASE 0
				BREAK
			CASE words[y]
				words[y] = 0
				DECR uniq
			DEFAULT
				BREAK
		END SELECT
	NEXT
NEXT

' Long words
FOR x = 1 TO tctr-1
	FOR y = x+1 TO tctr
		IF term$[x] = term$[y] AND LEN(term$[x]) > 0 THEN
			term$[y] = ""
			DECR uniq
		END IF
	NEXT
NEXT

PRINT "Unique words: ", uniq

Total words: 804874
Unique words: 13512

real 0m0.251s
user 0m0.224s
sys 0m0.024s

Regards
Peter

Deleted Deleted Member Posts: 0	Word counting May 6, 2012 17:13:36 GMT 1 Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by Deleted on May 6, 2012 17:13:36 GMT 1 Any chance of seeing a sorted word count list as the original code challenge specified? Attachments:

Pjot
Administrator

Posts: 2,833

Word counting May 6, 2012 18:03:34 GMT 1

Post by Pjot on May 6, 2012 18:03:34 GMT 1

Boring.


OPTION BASE 1

CONST txt$ = "Bible.txt"

' Numeric array for words of 1-12 characters
DECLARE words[1000000]
DECLARE wcnt[1000000]

' String array for words > 12 characters
DECLARE term$[10000]
DECLARE ecnt[10000]

' Array containing the unique words
DECLARE result$[1000000]

' Initialize variables
DECLARE total, hash, high, length, uniq, tctr, fsize, mem, uctr

' Declare memory
fsize = FILELEN(txt$)
mem = MEMORY(fsize)

' Get the complete text into memory
OPEN txt$ FOR READING AS bible
	GETBYTE mem FROM bible SIZE fsize
CLOSE FILE bible

' ------------------------------------------ 1st part ---------------------------------------------------

' Start analyzing memory
FOR pos = 0 TO fsize-1

	' Collect word from memory
	c = PEEK(mem+pos)

	' Map a-z or A-Z to 1-26, ignore other characters
	offset = 0
	IF c >= 97 AND c <= 122 THEN offset = 96
	IF c >= 65 AND c <= 90 THEN offset = 64

	' So we found a valid character?
	IF offset != 0 THEN
		INCR length
		IF length > 12 THEN
			high = high << 5
			high = high | (c - offset)
		ELSE
			hash = hash << 5
			hash = hash | (c - offset)
		END IF
	END IF

	' Do we have a space, CR, LF?
	IF c = 32 OR c = 10 OR c = 13 THEN
		' Is there a word?
		IF hash > 0 THEN
			' Increment total amount
			INCR total
			' Handle long words separately
			IF length > 12 THEN
				build$ = ""
				WHILE high > 0
					build$ = CONCAT$(build$, CHR$((high&31)+96))
					high = high >> 5
					IF high = 0 THEN
						high = hash
						hash = 0
					END IF
				WEND
				INCR tctr
				term$[tctr] = REVERSE$(build$)
			' It is a short word
			ELSE
				words[total] = hash
			END IF
			' Reset
			hash = 0 : high = 0 : length = 0
		END IF
	END IF
NEXT

FREE mem

PRINT "Total words: ", total

' ------------------------------------------ 2nd part ---------------------------------------------------

' Determine unique words
uniq = total

' Small words
SORT words DOWN
FOR x = 1 TO total-1
	IF words[x] THEN wcnt[x] = 1
	FOR y = x+1 TO total
		SELECT words[x]
			CASE 0
				BREAK
			CASE words[y]
				words[y] = 0
				INCR wcnt[x]
				DECR uniq
			DEFAULT
				BREAK
		END SELECT
	NEXT
NEXT

' Long words
FOR x = 1 TO tctr-1
	IF LEN(term$[x]) THEN ecnt[x] = 1
	FOR y = x+1 TO tctr
		IF term$[x] = term$[y] AND LEN(term$[x]) > 0 THEN
			term$[y] = ""
			INCR ecnt[x]
			DECR uniq
		END IF
	NEXT
NEXT

PRINT "Unique words: ", uniq

' ------------------------------------------ 3rd part ---------------------------------------------------

' Small words
FOR x = 1 TO total
	IF words[x] THEN
		build$ = ""
		WHILE words[x] > 0
			build$ = CONCAT$(build$, CHR$((words[x]&31)+96))
			words[x] = words[x] >> 5
		WEND
		INCR uctr
		result$[uctr] = CONCAT$(REVERSE$(build$), " (", STR$(wcnt[x]), ")")
	END IF
NEXT

' Long words
FOR x = 1 TO tctr
	IF LEN(term$[x]) THEN
		INCR uctr
		result$[uctr] = CONCAT$(term$[x], " (", STR$(ecnt[x]), ")")
	END IF
NEXT

' Sort alphabetically
SORT result$ SIZE uniq+1

' Write result
OPEN "Bible_list.txt" FOR WRITING AS bible
	FOR x = 1 TO uniq
		WRITELN result$[x] TO bible
	NEXT
CLOSE FILE bible

Total words: 804874
Unique words: 13512

real 0m0.269s
user 0m0.236s
sys 0m0.032s

Deleted Deleted Member Posts: 0	Word counting May 6, 2012 18:40:08 GMT 1 Select Post Deselect Post Link to Post Member Give Gift Back to Top Post by Deleted on May 6, 2012 18:40:08 GMT 1 Knockout! ;D I hope ScriptBasic will be fine.

Deleted
Deleted Member

Posts: 0

Word counting May 6, 2012 19:13:04 GMT 1

Post by Deleted on May 6, 2012 19:13:04 GMT 1

Lighten up guys.

It only took BaCon two years to submit a meaning entry that most other compilers have already achieved.

I have never claimed that ScriptBasic can run head to head with compilers. It's mission is to provide a flexible, easy to use scripting language using a familiar traditional Basic syntax. It runs on all major platforms (32/64 bit) and now on ARM/Android devices.

Post by Deleted on Apr 25, 2012 3:20:11 GMT 1

Post by 2lss on Apr 26, 2012 3:46:22 GMT 1

Post by Pjot on Apr 26, 2012 16:32:07 GMT 1

Post by Deleted on Apr 26, 2012 16:44:02 GMT 1

Post by Deleted on Apr 26, 2012 18:54:55 GMT 1

Post by aurel on Apr 26, 2012 23:03:35 GMT 1

Post by Deleted on Apr 26, 2012 23:58:40 GMT 1

Post by Deleted on May 3, 2012 0:13:16 GMT 1

Post by Pjot on May 3, 2012 14:32:29 GMT 1

Post by Deleted on May 3, 2012 21:55:04 GMT 1

Post by Pjot on May 6, 2012 16:39:43 GMT 1

Post by Deleted on May 6, 2012 17:13:36 GMT 1

Post by Pjot on May 6, 2012 18:03:34 GMT 1

Post by Deleted on May 6, 2012 18:40:08 GMT 1

Post by Deleted on May 6, 2012 19:13:04 GMT 1