Deleted
Deleted Member
Posts: 0
|
Post by Deleted on Apr 25, 2012 3:20:11 GMT 1
This program counts total and unique words in the text file, sort unique words and save a result to the file. It was written in response to the POTW on basicprogramming.org. The task was to count and sort words in text version of the Bible. Some users of that forum complained about BaCon speed. Yes, BaCon probably can't beat text processing oriented languages (like Perl, for example), but it's possible that my program is just poorly coded. I wonder if it's possible to complete the same task other (faster) way. Source code is attached. Text of the Bible can be found on forum.basicprogramming.org/index.php?topic=2222.0 . OPTION BASE 1 OPTION COMPARE 1
GLOBAL SLOWA$ [1000000] GLOBAL ILOSC [1000000] GLOBAL LICZBA [14]
LICZBA [0] = 100000 LICZBA [1] = 10000 LICZBA [2] = 5000 LICZBA [3] = 2500 LICZBA [4] = 1000 LICZBA [5] = 500 LICZBA [6] = 250 LICZBA [7] = 100 LICZBA [8] = 50 LICZBA [9] = 25 LICZBA [10] = 10 LICZBA [11] = 5 LICZBA [12] = 1 LICZBA [13] = 0
SLOWA$ [1] = " " STAN = 0 IN = 1 TOT = 0
OPEN "Bible.txt" FOR READING AS KSIAZKA
WHILE NOT(ENDFILE (KSIAZKA)) DO READLN AKAPIT$ FROM KSIAZKA IF NOT(ENDFILE (KSIAZKA)) THEN SPLIT AKAPIT$ BY " " TO PODZIELONE$ SIZE ROZMIAR FOR X = 1 TO ROZMIAR SLOWO$ = CHOP$ (PODZIELONE$ [X], "1234567890-,;:()[]'`.?!") IF LEN (SLOWO$) > 2 THEN FOR Y = 1 TO IN IF SLOWO$ = SLOWA$ [Y] THEN ILOSC [Y] = ILOSC [Y] + 1 STAN = 1 INCR TOT END IF NEXT Y ELSE STAN = 1 END IF IF STAN = 1 THEN STAN = 0 ELSE INCR IN SLOWA$ [IN] = SLOWO$ ILOSC [IN] = 1 INCR TOT END IF NEXT END IF CLEAR PRINT "Total number of words: "; PRINT TOT PRINT "Number of unique words: "; PRINT IN PRINT "Unique words represent "; PRINT 100 * IN / TOT; PRINT "% of all words" WEND
CLOSE FILE KSIAZKA
STAN = 0 WHILE (STAN <= IN) FOR Y = 2 TO IN - 1 A2$ = SLOWA$ [Y] A3$ = SLOWA$ [Y + 1] L2 = ILOSC [Y] L3 = ILOSC [Y + 1] IF L2 > L3 THEN ILOSC [Y] = L3 ILOSC [Y + 1] = L2 SLOWA$ [Y] = A3$ SLOWA$ [Y + 1] = A2$ END IF NEXT INCR STAN CLEAR CLEAR PRINT "Total number of words: "; PRINT TOT PRINT "Number of unique words: "; PRINT IN PRINT "Unique words represent "; PRINT 100 * IN / TOT; PRINT "% of all words\n" PRINT "Sorting... "; PRINT 100 * STAN / IN; PRINT " %\n" WEND
PRINT "Writing to the file...\n"
OPEN "words.txt" FOR WRITING AS LISTA
C$ = CONCAT$ ("Total number of words: ", STR$ (TOT)) WRITELN C$ TO LISTA C$ = CONCAT$ ("Number of unique words: ", STR$ (IN)) WRITELN C$ TO LISTA C$ = CONCAT$ ("Unique words represent ", STR$ (100 * IN / TOT), "% of all words") WRITELN C$ TO LISTA WRITELN " " TO LISTA
FOR X = 1 TO 12 LIC = 0 FOR Y = IN TO 1 STEP -1 IF ILOSC [Y] >= LICZBA [X] AND ILOSC [Y] < LICZBA [X - 1] THEN INCR LIC END IF NEXT Y C$ = CONCAT$ ("Number of words that appear ", STR$ (LICZBA [X]), " times and more, but less than ", STR$ (LICZBA [X - 1]), " times: ", STR$ (LIC)) WRITELN C$ TO LISTA NEXT X
WRITELN " " TO LISTA FOR Y = IN TO 2 STEP -1 C$ = CONCAT$ (SLOWA$ [Y], " ", STR$ (ILOSC [Y])) WRITELN C$ TO LISTA NEXT
CLOSE FILE LISTA
PRINT "Done!" END
Attachments:
|
|
2lss
Full Member
Posts: 140
|
Post by 2lss on Apr 26, 2012 3:46:22 GMT 1
Interesting program
What about using multiple threads? Would that be considered cheating?
|
|
|
Post by Pjot on Apr 26, 2012 16:32:07 GMT 1
Hi Tomaasz, The string performance is something I have been looking at lately. With each string function BaCon will use 'realloc' to make sure no memory leak exists. However, a 'realloc' is a slow thing to do, the kernel needs some time to allocate the memory. A non-forum member contacted me about it and suggested that I should implemented a string library similar to the Better String Library. It has other advantages too, but the problem is that rewriting the string implementation in BaCon is just a huge task. Also, this library does not exist as a Shared Object from which functions can be IMPORTed. Therefore I have been playing with the idea to create an INCLUDE file which uses memory streams to achieve fast string processing. Also functions similar to CONCAT$, LEFT$, RIGHT$ etc should be added to such a context. This library could be called 'FastStrings' or something like that and can be created in BaCon completely. So if I have time I'll look at it... Regards Peter
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on Apr 26, 2012 16:44:02 GMT 1
I was thinking more about functions I could miss (I'm still quite new to BaCon). Obviously, dialects with special functions for string and arrays processing were significantly faster in completing the task. BaCon needs about 4 min. for it, ScriptBasic - 28 sec., ThinBASIC - 2 sec., FreeBASIC - 0.2 sec.
I realized yesterday that I made a mistake and my program counts only words that contain 3 or more characters. To fix it, line 36
IF LEN (SLOWO$) > 2 THEN
should be replaced with
IF LEN (SLOWO$) > 1 OR SLOWO$ = "a" OR SLOWO$ = "I" THEN
EDIT: Peter, thanx for an explanation (it was posted when I was writing my post)!
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on Apr 26, 2012 18:54:55 GMT 1
What you are seeing is OxygenBasic doing 99% of all the work as a JIT compiler extension to thinBasic.
|
|
aurel
Junior Member
SolarMM
Posts: 61
|
Post by aurel on Apr 26, 2012 23:03:35 GMT 1
Hmm... I'm not so sure, i know that thinbasic use oxygen speed in some programs but i think that original speed of thinBasic become from PowerBasic turbo-assembler ,right?
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on Apr 26, 2012 23:58:40 GMT 1
thinBasic is an interpreter partially written in PowerBasic. It doesn't create tokenized PCODE and each statement is processed as it is read. When the original word count code challenge was done, Eros worked with Charles (O2 author) to achieve the compiler like results.
What helped ScriptBasic is the SPLITA internal function that created an array (over 1.4 million elements) of words in a fraction of a second.
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on May 3, 2012 0:13:16 GMT 1
I knew there was a better way to code a word counting example. So, I studied BaCon documentation, made my grey cells work and here it is: OPTION BASE 1 OPTION COMPARE 1
GLOBAL all$[1000000] TYPE STRING GLOBAL unique$[20000] TYPE STRING GLOBAL sorted$[20000] TYPE STRING GLOBAL unique_number[20000] TYPE NUMBER GLOBAL sorted_number[20000] TYPE NUMBER
CLEAR PRINT "Counting..."
OPEN "Bible.txt" FOR READING AS ksiazka
counter = 0 unique_counter = 1 status = 0 unique$[1] = ""
WHILE NOT(ENDFILE (ksiazka)) DO READLN akapit$ FROM ksiazka SPLIT akapit$ BY " " TO slowa$ SIZE siz FOR x = 1 TO siz word$ = CHOP$ (slowa$[x], "1234567890-,;:()[]'`.?!") IF LEN (word$) > 1 OR word$ = "a" OR word$ = "I" THEN all$[counter + x] = word$ FOR y = 1 TO unique_counter IF all$[counter + x] = unique$[y] THEN status = 1 unique_number[y] = unique_number[y] + 1 sorted_number[y] = sorted_number[y] + 1 BREAK END IF NEXT y IF status = 0 THEN unique_counter = unique_counter + 1 unique$[unique_counter] = all$[counter + x] unique_number[unique_counter] = 1 sorted_number[unique_counter] = 1 ELSE status = 0 END IF counter = counter + 1 END IF NEXT x WEND
CLOSE FILE ksiazka
PRINT "Sorting..."
SORT sorted_number DOWN
FOR x = 1 TO unique_counter FOR y = 1 TO unique_counter IF sorted_number[x] = unique_number[y] THEN sorted$[x] = unique$[y] unique_number[y] = 0 BREAK END IF NEXT y NEXT x
PRINT "Writing to the file..." OPEN "words.txt" FOR WRITING AS results
c$ = CONCAT$ ("Total number of words: ", STR$ (counter)) WRITELN c$ TO results c$ = CONCAT$ ("Number of unique words: ", STR$ (unique_counter)) WRITELN c$ TO results WRITELN " " TO results FOR y = 1 TO unique_counter - 1 c$ = CONCAT$ (sorted$[y]," - ", STR$ (sorted_number[y])) WRITELN c$ TO results NEXT
CLOSE FILE results
PRINT "Done!" END
The old program takes more than 4 min. to complete the task. This one needs only 37 sec. Source is attached. Attachments:
|
|
|
Post by Pjot on May 3, 2012 14:32:29 GMT 1
Great! You can even gain more speed by compiling with GCC speed optimizations: Regards Peter EDIT: just FYI, the WRITELN statement can accept comma separated arguments, as follows: OPEN "words.txt" FOR WRITING AS results
WRITELN "Total number of words: ", STR$(counter) TO results WRITELN "Number of unique words: ", STR$ (unique_counter) TO results WRITELN " " TO results
FOR y = 1 TO unique_counter - 1 WRITELN sorted$[y]," - ", STR$ (sorted_number[y]) TO results NEXT
CLOSE FILE results
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on May 3, 2012 21:55:04 GMT 1
Great! You can even gain more speed by compiling with GCC speed optimizations: 31 sec. with GCC speed optimizations. EDIT: just FYI, the WRITELN statement can accept comma separated arguments... It won't make the whole program work faster (writing to the file takes less than 1 sec.), but still it's good to know it. Thanx!
|
|
|
Post by Pjot on May 6, 2012 16:39:43 GMT 1
See if you can beat this. OPTION BASE 1
CONST txt$ = "Bible.txt"
' Numeric array for words of 1-12 characters DECLARE words[1000000]
' String array for words > 12 characters DECLARE term$[10000]
' Initialize variables DECLARE total, hash, high, length, uniq, tctr, fsize, mem
' Declare memory fsize = FILELEN(txt$) mem = MEMORY(fsize)
' Get the complete text into memory OPEN txt$ FOR READING AS bible GETBYTE mem FROM bible SIZE fsize CLOSE FILE bible
' ------------------------------------------ 1st part ---------------------------------------------------
' Start analyzing memory FOR pos = 0 TO fsize-1
' Collect word from memory c = PEEK(mem+pos)
' Map a-z or A-Z to 1-26, ignore other characters offset = 0 IF c >= 97 AND c <= 122 THEN offset = 96 IF c >= 65 AND c <= 90 THEN offset = 64
' So we found a valid character? IF offset != 0 THEN INCR length IF length > 12 THEN high = high << 5 high = high | (c - offset) ELSE hash = hash << 5 hash = hash | (c - offset) END IF END IF
' Do we have a space, CR, LF? IF c = 32 OR c = 10 OR c = 13 THEN ' Is there a word? IF hash > 0 THEN ' Increment total amount INCR total ' Handle long words separately IF length > 12 THEN build$ = "" WHILE high > 0 build$ = CONCAT$(build$, CHR$((high&31)+96)) high = high >> 5 IF high = 0 THEN high = hash hash = 0 END IF WEND INCR tctr term$[tctr] = REVERSE$(build$) ' It is a short word ELSE words[total] = hash END IF ' Reset hash = 0 : high = 0 : length = 0 END IF END IF NEXT
FREE mem
PRINT "Total words: ", total
' ------------------------------------------ 2nd part ---------------------------------------------------
' Determine unique words uniq = total
' Small words SORT words DOWN FOR x = 1 TO total-1 FOR y = x+1 TO total SELECT words[x] CASE 0 BREAK CASE words[y] words[y] = 0 DECR uniq DEFAULT BREAK END SELECT NEXT NEXT
' Long words FOR x = 1 TO tctr-1 FOR y = x+1 TO tctr IF term$[x] = term$[y] AND LEN(term$[x]) > 0 THEN term$[y] = "" DECR uniq END IF NEXT NEXT
PRINT "Unique words: ", uniq
Regards Peter
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on May 6, 2012 17:13:36 GMT 1
Any chance of seeing a sorted word count list as the original code challenge specified? Attachments:
|
|
|
Post by Pjot on May 6, 2012 18:03:34 GMT 1
Boring.
OPTION BASE 1
CONST txt$ = "Bible.txt"
' Numeric array for words of 1-12 characters DECLARE words[1000000] DECLARE wcnt[1000000]
' String array for words > 12 characters DECLARE term$[10000] DECLARE ecnt[10000]
' Array containing the unique words DECLARE result$[1000000]
' Initialize variables DECLARE total, hash, high, length, uniq, tctr, fsize, mem, uctr
' Declare memory fsize = FILELEN(txt$) mem = MEMORY(fsize)
' Get the complete text into memory OPEN txt$ FOR READING AS bible GETBYTE mem FROM bible SIZE fsize CLOSE FILE bible
' ------------------------------------------ 1st part ---------------------------------------------------
' Start analyzing memory FOR pos = 0 TO fsize-1
' Collect word from memory c = PEEK(mem+pos)
' Map a-z or A-Z to 1-26, ignore other characters offset = 0 IF c >= 97 AND c <= 122 THEN offset = 96 IF c >= 65 AND c <= 90 THEN offset = 64
' So we found a valid character? IF offset != 0 THEN INCR length IF length > 12 THEN high = high << 5 high = high | (c - offset) ELSE hash = hash << 5 hash = hash | (c - offset) END IF END IF
' Do we have a space, CR, LF? IF c = 32 OR c = 10 OR c = 13 THEN ' Is there a word? IF hash > 0 THEN ' Increment total amount INCR total ' Handle long words separately IF length > 12 THEN build$ = "" WHILE high > 0 build$ = CONCAT$(build$, CHR$((high&31)+96)) high = high >> 5 IF high = 0 THEN high = hash hash = 0 END IF WEND INCR tctr term$[tctr] = REVERSE$(build$) ' It is a short word ELSE words[total] = hash END IF ' Reset hash = 0 : high = 0 : length = 0 END IF END IF NEXT
FREE mem
PRINT "Total words: ", total
' ------------------------------------------ 2nd part ---------------------------------------------------
' Determine unique words uniq = total
' Small words SORT words DOWN FOR x = 1 TO total-1 IF words[x] THEN wcnt[x] = 1 FOR y = x+1 TO total SELECT words[x] CASE 0 BREAK CASE words[y] words[y] = 0 INCR wcnt[x] DECR uniq DEFAULT BREAK END SELECT NEXT NEXT
' Long words FOR x = 1 TO tctr-1 IF LEN(term$[x]) THEN ecnt[x] = 1 FOR y = x+1 TO tctr IF term$[x] = term$[y] AND LEN(term$[x]) > 0 THEN term$[y] = "" INCR ecnt[x] DECR uniq END IF NEXT NEXT
PRINT "Unique words: ", uniq
' ------------------------------------------ 3rd part ---------------------------------------------------
' Small words FOR x = 1 TO total IF words[x] THEN build$ = "" WHILE words[x] > 0 build$ = CONCAT$(build$, CHR$((words[x]&31)+96)) words[x] = words[x] >> 5 WEND INCR uctr result$[uctr] = CONCAT$(REVERSE$(build$), " (", STR$(wcnt[x]), ")") END IF NEXT
' Long words FOR x = 1 TO tctr IF LEN(term$[x]) THEN INCR uctr result$[uctr] = CONCAT$(term$[x], " (", STR$(ecnt[x]), ")") END IF NEXT
' Sort alphabetically SORT result$ SIZE uniq+1
' Write result OPEN "Bible_list.txt" FOR WRITING AS bible FOR x = 1 TO uniq WRITELN result$[x] TO bible NEXT CLOSE FILE bible
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on May 6, 2012 18:40:08 GMT 1
Knockout! ;D I hope ScriptBasic will be fine.
|
|
Deleted
Deleted Member
Posts: 0
|
Post by Deleted on May 6, 2012 19:13:04 GMT 1
Lighten up guys.
It only took BaCon two years to submit a meaning entry that most other compilers have already achieved.
I have never claimed that ScriptBasic can run head to head with compilers. It's mission is to provide a flexible, easy to use scripting language using a familiar traditional Basic syntax. It runs on all major platforms (32/64 bit) and now on ARM/Android devices.
|
|