|
Post by alexfish on Apr 7, 2019 15:22:31 GMT 1
Hi All
The new lib has been written using lib-tidy5. There will be some initial code to get started, then more snippets will be added in due course. Reason: the internet providers here have been behaving badly, so if I get a result here then the "due course" will happen.
The code to get started, and the original intended use, is to use BaCon NETWORK, yet some may find getting https a problem even with ssl; more will be revealed later.
The code is optimized for modern day layouts. 80 col terminals are a pain for this type of lib, hence the typical font size needs to be 9px and the geometry needs to be about 230x90, ie mlterm:
mlterm --deffont mono -w 9 -b black -f white -l unlimited -g 230x90
the code
'!shebang bacon parser2.bac
PRAGMA COMPILER g++
PRAGMA OPTIONS -Wno-write-strings -Wno-pointer-arith
PRAGMA LDFLAGS -ltidy
PRAGMA LDFLAGS -lbacon++
PRAGMA INCLUDE tidy_parser_5_9_3.cpp
PROTO main_loop,Http_File
'===============================================================================
SUB getinet(STRING url$,STRING file$)
    LOCAL myurl$, myget$ TYPE STRING
    myurl$ = url$ & ":80"
    OPEN myurl$ FOR NETWORK AS mynet
    myget$ = "GET / HTTP/1.1\r\nHost:" & url$ & "\r\n\r\n"
    SEND myget$ TO mynet
    REPEAT
        RECEIVE dat$ FROM mynet
        total$ = total$ & dat$
    UNTIL ISFALSE(WAIT(mynet, 5000))
    CLOSE NETWORK mynet
    total$=MID$(total$,INSTR(total$,"bytes")+5)
    total$=total$ & "\n\n<div><br> _Site_Reference :[" & url$ & "]</div>"
    PRINT "Saved as " , file$
    SAVE total$ TO file$
END SUB
'===============================================================================
' getinet address(url),
getinet("www.basic-converter.org","index.html")
' file prompt & enter index.html
Http_File()
press q to quit
BR Alex
archive updated
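A side note on the header stripping in getinet: cutting the reply at the word "bytes" only works when the server happens to send a header ending in that word just before the page starts. A minimal C++ sketch of the more general approach, splitting at the first blank line, is below. The function names are hypothetical and not part of the posted lib, and the "Connection: close" header is an addition that is not in the original request. Chunked transfer encoding is not handled here.

```cpp
#include <string>
#include <utility>

// Build a minimal HTTP/1.1 GET request for the root page of `host`.
// "Connection: close" is an assumption added for this sketch: it asks the
// server to close the socket so the reader knows when the reply is complete.
std::string build_get_request(const std::string& host) {
    return "GET / HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n";
}

// Split a raw HTTP reply into (headers, body) at the first blank line.
// If no blank line is found, everything is treated as body.
std::pair<std::string, std::string> split_http_response(const std::string& raw) {
    const std::string sep = "\r\n\r\n";
    const std::string::size_type pos = raw.find(sep);
    if (pos == std::string::npos) {
        return {"", raw};
    }
    return {raw.substr(0, pos), raw.substr(pos + sep.size())};
}
```

The body returned by split_http_response is what would be saved to the file and handed to the parser.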
|
|
|
Post by alexfish on Apr 7, 2019 15:34:26 GMT 1
Example output of this forum
As you can see there is plenty of info, and hence further parsing can be done, as in how to render the bits. Example: in the headers there is a means to concat the bits into a menu bar. The concat menu items (notice the trailing *):
[20]General*
[21]News*
[22]Documentation*
[23]Code Projects*
[24]Troubleshooting area*
[25]Bugs, features*
then transpose the bits to
[20]General* [21]News* [22]Documentation* [23]Code Projects* [24]Troubleshooting area* [25]Bugs, features*
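The transpose step could be sketched in C++ roughly like this. It is only an illustration under the assumption that every menu item arrives as its own line ending in '*'; join_starred_lines is a hypothetical name, not a function from the parser.

```cpp
#include <string>
#include <vector>

// Merge each run of consecutive lines that end in '*' into a single line,
// separated by one space. Lines without the trailing '*' pass through as-is.
std::vector<std::string> join_starred_lines(const std::vector<std::string>& lines) {
    std::vector<std::string> out;
    std::string run;  // the menu bar being collected
    for (const std::string& line : lines) {
        if (!line.empty() && line.back() == '*') {
            if (!run.empty()) run += ' ';
            run += line;
        } else {
            if (!run.empty()) {  // a run just ended, emit it first
                out.push_back(run);
                run.clear();
            }
            out.push_back(line);
        }
    }
    if (!run.empty()) out.push_back(run);  // run that reaches the end of input
    return out;
}
```

Fed the six menu lines above, this returns the single menu bar line.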
BR Alex
the output
Home | The BAsic CONverter Forum
[0]Info Link -
[1]Info Link -
[2]Info Link -
[3]Info Link -
[4]Info Link -
[5]Info Link -
[6]Info Link -
[7]Info Link -
[8]Info Link - * * [9]The BAsic CONverter Forum [10]Skip Navigation [11]*
[12] Home
[13] Help
[14] Search
[15]Goto the BaCon website Welcome Guest. Please [16]Login or [17]Register. *
[18]Info Link - The BAsic CONverter Forum
[19]Info Link - Home
[20]General*
[21]News*
[22]Documentation*
[23]Code Projects*
[24]Troubleshooting area*
[25]Bugs, features*
*** General Board*|Threads*|Posts*|Last Post
=============================================== [26]//storage.proboards.com/forum/images/icons/board-no-new-post.png - *|[27]News News and announcements Moderator:[28]Pjot*|92*|831*|[29]HAPPY BIRTHDAY BACON!by [30]ptitjozMar 18, 2019 16:38:16 GMT 1 [31]//storage.proboards.com/forum/images/icons/board-no-new-post.png - *|[32]Documentation - 1 Viewing Tutorials & demonstrations Moderator:[33]Pjot*|123*|1,780*|[34]Mental math of GUI placement + widget tutorialby [35]bigbassApr 6, 2019 8:15:12 GMT 1 [36]//storage.proboards.com/forum/images/icons/board-no-new-post.png - *|[37]Code Projects - 5 Viewing Programs, challenges, competitions Moderator:[38]Pjot*|200*|4,123*|[39]Animated graphics in HUG without CANVASby [40]vovchikApr 7, 2019 15:40:42 GMT 1 [41]//storage.proboards.com/forum/images/icons/board-no-new-post.png - * [42]Troubleshooting area Problems, issues, tips & tricks Moderator:[43]Pjot*|391*|2,878*|[44]drag n drop compile *.bac raspberry piby [45]alexfishApr 1, 2019 18:36:31 GMT 1 [46]//storage.proboards.com/forum/images/icons/board-no-new-post.png - *|[47]Bugs, features Report a bug, request a feature Moderator:[48]Pjot*|237*|2,264*|[49]Hug object alert or msgboxby [50]bigbassApr 6, 2019 22:05:56 GMT 1
Legend
=============================================== [51]//storage.proboards.com/forum/images/icons/board-new-post.png - New Posts*|[52]//storage.proboards.com/forum/images/icons/board-no-new-post.png - No New Posts
===============================================
Forum Information & Statistics
=============================================== [53]//storage.proboards.com/forum/images/info/stats.png - *|*|Threads and Posts
=============================================== Total Threads: 1,043 Total Posts: 11,876 Last Updated: [54]Animated graphics in HUG without CANVAS by [55]vovchik (Apr 7, 2019 15:40:42 GMT 1) [56]Recent Threads - [57]Recent Posts - [58]RSS Feed
[59]//storage.proboards.com/forum/images/info/members.png - *|*|Members Total Members: 201 Newest Member: [60]shell Most Users Online: 144 (Aug 22, 2013 23:04:29 GMT 1)
[61]View today's birthdays
[62]//storage.proboards.com/forum/images/info/online.png - *|*|Users Online 0 Staff, 2 Members, 6 Guests. [63]vovchik,[64]alexfish
[65]//storage.proboards.com/forum/images/info/online_24.png - *|*|Users Online in the Last 24 Hours 1 Staff, [66]4 Members, 128 Guests. [67]juppel,[68]bigbass,[69]Pjot
[70]Click here to remove banner ads from this forum. This Forum Is Hosted For FREE By [71]ProBoardsGet Your Own [72]Free Forum !*[73]Terms of Service | [74]Privacy | [75]Cookies | [76]FTC Disclosure | [77]Report Abuse | [78]Report Ad | [79]Consent
Links: 0. http://basic-converter.proboards.com/ 1. http://basic-converter.proboards.com/rss/public 2. //storage.proboards.com/forum/images/favicon.ico 3. //storage.proboards.com/forum/images/favicon.ico 4. android-app://com.quoord.tapatalkpro.activity/tapatalk/basic-converter.proboards.com?location=index 5. ios-app://307880732/tapatalk/basic-converter.proboards.com?location=index 6. //storage.proboards.com/forum/css/0/forum_base_864.css 7. //storage.proboards.com/3081746/css/IZ9ESJACmL0jnG4dNiCP.css 8. //storage.proboards.com/forum/css/0/print_864.css 9. / 10. #content 11. # 12. / 13. /help 14. /search 15. http://www.basic-converter.org 16. https://login.proboards.com/login/3081746/1 17. https://login.proboards.com/register/3081746 18. / 19. / 20. /#category-1 21. http://basic-converter.proboards.com/board/5/news 22. http://basic-converter.proboards.com/board/3/documentation 23. http://basic-converter.proboards.com/board/2/code-projects 24. http://basic-converter.proboards.com/board/4/troubleshooting-area 25. http://basic-converter.proboards.com/board/1/bugs-features 26. //storage.proboards.com/forum/images/icons/board-no-new-post.png 27. /board/5/news 28. /user/1 29. /threads/recent/1064 30. /user/196 31. //storage.proboards.com/forum/images/icons/board-no-new-post.png 32. /board/3/documentation 33. /user/1 34. /threads/recent/263 35. /user/42 36. //storage.proboards.com/forum/images/icons/board-no-new-post.png 37. /board/2/code-projects 38. /user/1 39. /threads/recent/1074 40. /user/7 41. //storage.proboards.com/forum/images/icons/board-no-new-post.png 42. /board/4/troubleshooting-area 43. /user/1 44. /threads/recent/1070 45. /user/57 46. //storage.proboards.com/forum/images/icons/board-no-new-post.png 47. /board/1/bugs-features 48. /user/1 49. /threads/recent/1071 50. /user/42 51. //storage.proboards.com/forum/images/icons/board-new-post.png 52. //storage.proboards.com/forum/images/icons/board-no-new-post.png 53. 
//storage.proboards.com/forum/images/info/stats.png 54. /threads/recent/1074 55. /user/7 56. /threads/recent 57. /posts/recent 58. /rss/public 59. //storage.proboards.com/forum/images/info/members.png 60. /user/218 61. /members?view=birthdays 62. //storage.proboards.com/forum/images/info/online.png 63. /user/7 64. /user/57 65. //storage.proboards.com/forum/images/info/online_24.png 66. /members?dir=desc&sort=last_online&view=today 67. /user/217 68. /user/42 69. /user/1 70. https://www.proboards.com/store/add_cart/ad_free/50000/basic-converter.proboards.com/1 71. https://www.proboards.com 72. https://www.proboards.com/create-free-forum 73. https://www.proboards.com/tos 74. https://www.proboards.com/privacy 75. https://www.proboards.com/privacy#cookies 76. http://www.viglink.com/policies/ftc 77. https://www.proboards.com/report-abuse 78. # 79. #
|
|
|
Post by vovchik on Apr 7, 2019 17:08:54 GMT 1
Dear Alex,
Thanks. I ran into a little compile problem that was solved by modifying one line in the Bacon bit:
PRAGMA LDFLAGS -lcurl -lbacon++
I needed -lbacon++ for it to work, and it seems to be working nicely. I will now test on a few other sites.
With kind regards, vovchik
|
|
|
Post by alexfish on Apr 8, 2019 8:40:22 GMT 1
Hi Vovchik
Thanks for testing
Although it is using html2text for utf8, as in getting the code point char,
I did a utf8 decoder with a look up table; hence if I get this working under this scheme then the python can be dropped. The main sticky bit will be sites of big corps.
And yes, it is a minimal parser, yet comments and/or suggestions are welcome.
Thanks again + BR
Alex
|
|
|
Post by alexfish on Apr 8, 2019 9:11:55 GMT 1
Hi All & vovchik
this is the utf8 code point decoder I did some time ago, and it appears not to compile in bacon any more?
OPTION PARSE FALSE
FUNCTION next_utf8_char(unsigned char *utf8, int *codepoint) TYPE unsigned char *
    LOCAL seqlen=0 TYPE int
    LOCAL p = utf8 TYPE unsigned char *
    '0xxxxxxx
    IF NOT(utf8[0] & 0x80) THEN
        *codepoint = (wchar_t) utf8[0];
        seqlen = 1
    '110xxxxx
    ELIF (utf8[0] & 0xE0) == 0xC0 THEN
        *codepoint = (int)(((utf8[0] & 0x1F) << 6) | (utf8[1] & 0x3F))
        seqlen = 2
    ELIF (utf8[0] & 0xF0) == 0xE0 THEN
        *codepoint = (int)(((utf8[0] & 0x0F) << 12) | ((utf8[1] & 0x3F) << 6) | (utf8[2] & 0x3F))
        seqlen = 3
    ELSE
        RETURN NULL
    END IF
    RETURN p
END FUNCTION
MyString$ ="¶"
PRINT MyString$
LOCAL codepoint TYPE int
LOCAL ss = (unsigned char *)MyString$ TYPE unsigned char *
next_utf8_char (ss,& codepoint)
PRINT codepoint
INPUT a$
PRINT "You entered the following: ", a$
ADDED
had been using lbacon++ using the -a flag, so a rebuild of the lib using
bacon -a -y codepoint.bac
Now works
+ added a bit of utf8
DECLARE hello1[] = {'H','e','j',',',' ','v', 0xc3, 0xa4,'r' , 'l','d' ,'e','n',0} TYPE char
DECLARE a$ TYPE STRING
a$=hello1
PRINT a$
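For comparison, here is a plain C++ sketch of the same decoding idea that also covers 4-byte sequences and reports how many bytes were consumed, so the caller can step to the next character. decode_utf8 is a hypothetical name and not the lookup-table decoder mentioned earlier; overlong forms and surrogates are not rejected in this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at s[pos].
// On success stores the code point in `cp` and returns the sequence length (1-4).
// Returns 0 if the bytes at `pos` are not a complete sequence with valid
// continuation bytes; in that case the value left in `cp` is unspecified.
std::size_t decode_utf8(const std::string& s, std::size_t pos, std::uint32_t& cp) {
    if (pos >= s.size()) return 0;
    const unsigned char b0 = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    if      ((b0 & 0x80) == 0x00) { len = 1; cp = b0; }         // 0xxxxxxx
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }  // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }  // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }  // 11110xxx
    else return 0;                         // stray continuation or invalid lead byte
    if (pos + len > s.size()) return 0;    // sequence is cut off
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) return 0;  // every following byte must be 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    return len;
}
```

A caller would loop with pos += len until len comes back as 0 or pos reaches the end of the string.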
|
|
|
Post by alexfish on Apr 8, 2019 10:11:53 GMT 1
I tested peters utf8.bac
and this is what I get in the terminal
./utf8_2 'Hej, världen'
ASCII values decimal: 34 72 101 106 44 32 118 195 164 114 108 100 101 110 34
ASCII values hex: 22 48 65 6A 2C 20 76 C3 A4 72 6C 64 65 6E 22
ASCII string: "Hej, världen"
UTF-8 values decimal: 34 72 101 106 44 32 118 195 131 194 164 114 108 100 101 110 34
UTF-8 values hex: 22 48 65 6A 2C 20 76 C3 83 C2 A4 72 6C 64 65 6E 22
UTF-8 string: "Hej, världen"
Peter's actual code is more compact and readable from a Basic point of view
IF c > 127 THEN
    REM Binary AND with 11000000, shift 6 positions to the right, add 11000000 to mark the first (lead) byte
    b1 = ((c & 192) >> 6) + 192
    REM Binary AND with 00111111, add 10000000 to mark the second (continuation) byte
    b2 = (c & 63) + 128
    REM Add UTF char with ASCII of byte1 + ASCII of byte 2
    new$ = CONCAT$(new$, CHR$(b1))
    new$ = CONCAT$(new$, CHR$(b2))
ELSE
    new$ = CONCAT$(new$, MID$(arg$[1], t, 1))
ENDIF
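The same two-byte rule written out as a C++ sketch (latin1_to_utf8 is a hypothetical name). It assumes the input is single-byte Latin-1. That assumption seems to explain the C3 83 C2 A4 in the terminal output above: the shell already passed "ä" as the UTF-8 bytes C3 A4, and each of those two bytes was then encoded again.

```cpp
#include <string>

// Treat every byte of `in` as one Latin-1 (ISO-8859-1) character and
// encode it as UTF-8. Bytes up to 127 are copied unchanged.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (const char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c > 127) {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte: 110000xx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation: 10xxxxxx
        } else {
            out += ch;
        }
    }
    return out;
}
```

Passing the single Latin-1 byte E4 gives C3 A4, while passing the already encoded C3 A4 gives C3 83 C2 A4, the double encoding seen above.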
BR Alex
|
|
|
Post by alexfish on Apr 8, 2019 13:49:45 GMT 1
Hi All
Some of what this is about: HTML & HEX format & '&', IE www.ascii.cl/htmlcodes.htm
Have done this example in c++11 using wc32 and including the myparser.cxx, yet here demonstrating a method to compare if certain char (formats) are in the html: if(***)
Here We Go, but pure BaCon can do the same:
PRAGMA INCLUDE <uchar.h>
PRAGMA COMPILER g++
PRAGMA OPTIONS -Wno-write-strings -Wno-pointer-arith
PRAGMA OPTIONS --std=c++11
PRAGMA LDFLAGS -lcurl -lbacon++
PRAGMA INCLUDE myparser.cxx
PROTO ParseHtml, cout,char32_t
OPTION PARSE FALSE
DECLARE str="Hej, världen" TYPE string
DECLARE wc[] = U"Hej, världen" TYPE char32_t
DECLARE wc_sz = sizeof wc / sizeof *wc TYPE size_t
printf("%zu UTF-32 code units: [ ", wc_sz)
for (size_t n = 0; n < wc_sz; ++n) printf("%#x ", wc[n]);
printf("]\n");
for_each( str.begin() , str.end() , []( char32_t codepoint ){ PRINT codepoint , "," ; } )
PRINT "HEX"
for_each( str.begin() , str.end() , []( char32_t codepoint ){ PRINT "0x" ,HEX$(codepoint) , "," ; } )
PRINT "HTML "
for_each( str.begin() , str.end() , []( char32_t codepoint ){ PRINT "&#" ,codepoint ,";", "," ; } )
PRINT
To Note BaCon has UTF* IE::
Example
LET c$ = UTF8$(0x1F600)
PRINT c$
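The HTML loop above prints each code point with a trailing ";", which is the shape of a decimal numeric character reference (&#NNN;). A small C++ sketch of that conversion for a whole UTF-32 string follows. to_html_entities is a hypothetical helper, not part of myparser.cxx, and it deliberately leaves ASCII alone, so '&', '<' and '>' are not escaped.

```cpp
#include <string>

// Render each code point above 127 as a decimal numeric character
// reference (&#NNN;) and copy ASCII code points through unchanged.
std::string to_html_entities(const std::u32string& text) {
    std::string out;
    for (const char32_t cp : text) {
        if (cp < 128) {
            out += static_cast<char>(cp);
        } else {
            out += "&#" + std::to_string(static_cast<unsigned long>(cp)) + ";";
        }
    }
    return out;
}
```

Comparing the result against the raw html is then an ordinary string search for the reference.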
|
|
|
Post by bigbass on Apr 9, 2019 16:54:44 GMT 1
Hello Alex
Thanks for your work on html2text (c++). Please keep working on this; it is a good project to have a complete tool to parse any website, and it will be very fast using just c++, even more now that we can use html with webkit GUI'S
=============================
I thought to make a preparser (homebrewed) just to work on the bacon main page, with the idea that we can simplify the tools needed. Quickly you can see that it is a big job writing a complete tool in any language to parse html, but sometimes we can get by with less. sed can help ease the pain and is fast. It's all stand alone, line by line, so something may be useful to take out and recycle.
noscript-html-preparser.tar.gz (1.32 KB)
Joe
|
|
|
Post by alexfish on Apr 9, 2019 23:15:32 GMT 1
Thanks Joe
have done improvements to the parser
have checked these forum posts/threads and it looks like the line breaks there are written as <br /> // br /
and the problem was compounding inside <Code> blocks, hence producing a very long string; now there is separation
example::
Post by alexfish on Apr 7, 2019 15:22:31 GMT 1 * Hi All
have written a Simple Html to text Parser for BaCon
can't say it is a full blown spec parser but it does the job;
it covers what the system html2text can't do, such as modern Utf8 and html encoding, esp on sites like youtube, & etc etc
a bit of the can do encoding is done in the supplied html2text {python code}
and the main myparser.cxx is {c++}
hence ensure they are in the same directory & set the permissions on the 'html2text' file
it can also download http & https + follow links, hence it is dependent on lib curl
or use Peters Http and point the parser to the html file;
the output has indexed links
A bit of code to get started
PRAGMA COMPILER g++
PRAGMA OPTIONS -Wno-write-strings -Wno-pointer-arith
PRAGMA LDFLAGS -lcurl
PRAGMA INCLUDE myparser.cxx
PROTO ParseHtml, cout
DECLARE myhtml TYPE string
DECLARE myre$ TYPE STRING
'============================================
'Example ParseHtml ( "index.html" , &myhtml)
ParseHtml ( "http://www.basic-converter.org/" , &myhtml)
myre$=myhtml.c_str()
PRINT myre$
'=============================================
will update the parser after a bit more testing
BR Alex
|
|
|
Post by bigbass on Apr 10, 2019 8:59:05 GMT 1
Hello Alex
the mytext.html was one long line, but now there is a fix and a readme that goes into the details of what was done, and a new REPLACEWORD function that should help with parsing using just BaCon code
fix4html2txt.tar.gz (13.04 KB)
hope this will be useful
Joe
|
|
|
Post by vovchik on Apr 11, 2019 11:20:46 GMT 1
Dear Joe and Alex,
Here are a few things I use to parse html in order to make m3u lists from streaming web pages, so I don't have to use a browser to watch live TV. There is a kind of CUT and a kind of GREP, and two functions that obviate the need to use curl or wget to dump pages. Maybe they might be useful.
With kind regards, vovchik
Attachments: utils.tar.gz (105.62 KB)
|
|
|
Post by bigbass on Apr 11, 2019 17:49:14 GMT 1
Hello vovchik
you have been busy. Thanks for sharing your very useful personal solutions using homemade functions in bacon. Oh, forgot to add: very fast parsing too!
I did stumble a bit getting get_https1.bac to compile because I was missing several things on my box. This will get it going as a stand alone demo if it's added as the header. Or were you using something else in the header? Or can you suggest something better?
They are very useful, thanks again! Parsing html is a "job" but with the right tools it can be much easier
Joe
OPTION TLS TRUE
' This is the default
PRAGMA TLS openssl INCLUDE <openssl/ssl.h> LDFLAGS -lssl -lcrypto
UPDATED just the debs are needed for a clean compile
'---sudo apt-get install libssl-dev
'---sudo apt-get install libgcrypt20-dev
|
|
|
Post by alexfish on Apr 20, 2019 23:45:40 GMT 1
Hi All
have read through and tested;
results still need some improvements
All now in New Archive at first post
Hence Read first post again with example code: + still using lib-curl
the output now looks normal as in spacing
have removed indexing for now + removed table indexes;
this will be set again in next update;
BR Alex
|
|
|
Post by alexfish on Apr 20, 2019 23:51:31 GMT 1
Example of forum
PRAGMA COMPILER g++
PRAGMA OPTIONS -Wno-write-strings -Wno-pointer-arith
PRAGMA LDFLAGS -lcurl -lbacon++
PRAGMA INCLUDE myparser.cxx
PROTO ParseHtml, cout
DECLARE myhtml TYPE string
DECLARE myre$ TYPE STRING
'============================================
'/home/pi/retestVG/mybuild/examples/UTF8/index.html
ParseHtml ( "http://basic-converter.proboards.com/" , &myhtml,120)
myre$=myhtml.c_str()
PRINT myre$
'=============================================
The results
Home | The BAsic CONverter Forum The BAsic CONverter Forum Skip Navigation Home Help Search Goto the BaCon website Welcome Guest. Please Login or Register . The BAsic CONverter Forum Home General News Documentation Code Projects Troubleshooting area Bugs, features General Board Threads Posts Last Post News News and announcements Moderator: Pjot 92 831 HAPPY BIRTHDAY BACON! by ptitjoz Mar 18, 2019 16:38:16 GMT 1 Documentation Tutorials & demonstrations Moderator: Pjot 127 1,797 golang embedded in bacon by bigbass Apr 20, 2019 15:01:48 GMT 1 Code Projects - 5 Viewing Programs, challenges, competitions Moderator: Pjot 201 4,176 html2text (c++) by alexfish Apr 20, 2019 23:45:40 GMT 1 Troubleshooting area Problems, issues, tips & tricks Moderator: Pjot 391 2,878 drag n drop compile *.bac raspberry pi by alexfish Apr 1, 2019 18:36:31 GMT 1 Bugs, features Report a bug, request a feature Moderator: Pjot 237 2,264 Hug object alert or msgbox by bigbass Apr 6, 2019 22:05:56 GMT 1 Legend New Posts No New Posts Forum Information & Statistics Threads and Posts Total Threads: 1,048 Total Posts: 11,946 Last Updated: html2text (c++) by alexfish ( Apr 20, 2019 23:45:40 GMT 1 ) Recent Threads - Recent Posts - RSS Feed Members Total Members: 201 Newest Member: shell Most Users Online: 144 ( Aug 22, 2013 23:04:29 GMT 1 ) View today's birthdays Users Online 0 Staff, 2 Members, 7 Guests. vovchik , alexfish Users Online in the Last 24 Hours 1 Staff, 4 Members , 164 Guests. Pjot , bigbass , ptitjoz Click here to remove banner ads from this forum. This Forum Is Hosted For FREE By ProBoards Get Your Own Free Forum ! Terms of Service | Privacy | Cookies | FTC Disclosure | Report Abuse | Report Ad | Consent
BR Alex
|
|
|
Post by alexfish on Apr 21, 2019 0:03:16 GMT 1
|
|