|
Post by alexfish on Apr 24, 2019 16:05:24 GMT 1
Hi Vovchik
Funny thing, that is just what I have been doing.
In the sources there are two Python files: ascii2uni.py and uni2html.py
The funny bit here is that all the libs I have tried in this genre (high-end ascii numbers) have a bit of Python in them, yet looking at the packages via Synaptic, nothing in there indicates these files.
I am thinking that maybe some of the sources are embedded Python, like the demo I posted, but at present I am not 100% sure on this.
Hence I am doing a compile of the source and looking further into it.
BR Alex
|
|
|
Post by alexfish on Apr 24, 2019 16:50:17 GMT 1
Just did a quickie in Python
same codes as posted, but in Python (uni.py)
print u'\u6B63\u8868\u9054\u5F0F'
result:
python uni.py
正表達式
BR Alex
|
|
|
Post by alexfish on Apr 24, 2019 18:11:13 GMT 1
Well
here is something I did not know about C++ strings,
despite googling till the cows come home about this ascii to utf8
AHH!
std::string utf8 = "\u6B63\u898F\u9054\u5F0F";
cout << utf8 << "\n";
正規達式
Alex
|
|
|
Post by alexfish on Apr 24, 2019 18:51:27 GMT 1
Hi Vovchik
Finally worked this one out in C++
not sure if it can be done in BaCon, but here I am testing char[] with hex and ascii (decimal) values
the results
std::string utf8 = "\u6B63\u898F\u9054\u5F0F";
cout << utf8.size() << "\n";
cout << (int)utf8[0] << "\n";
cout << (int)utf8[1] << "\n";
cout << (int)utf8[2] << "\n";
cout << (int)utf8[3] << "\n";
char plop[]= {230,173,163,0};
char plop2[]= {0xe6,0xad,0xa3,0};
cout << plop << "\n";
cout << plop2 << "\n";
Added :: some bits done with BaCon
OPTION PARSE FALSE
LOCAL HTML_UTF8 TYPE STRING
LOCAL plop[]= {230,173,163,0} TYPE char
LOCAL plop2[]= {0xe6,0xad,0xa3,0} TYPE char
HTML_UTF8=plop
PRINT HTML_UTF8
HTML_UTF8=plop2
PRINT HTML_UTF8
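For anyone wondering where 230,173,163 (0xe6,0xad,0xa3) comes from: those are the three UTF-8 bytes of 正 (U+6B63). Below is a minimal Python 3 sketch of the bit shuffling, not part of the code above. The helper name utf8_bytes is made up for illustration, and it does not reject surrogates or out-of-range values.

```python
def utf8_bytes(code_point):
    """Encode one Unicode code point as a list of UTF-8 byte values."""
    if code_point < 0x80:        # 1 byte:  0xxxxxxx
        return [code_point]
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (code_point >> 6),
                0x80 | (code_point & 0x3F)]
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (code_point >> 12),
                0x80 | ((code_point >> 6) & 0x3F),
                0x80 | (code_point & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F)]

zheng = utf8_bytes(0x6B63)
print(zheng)                     # [230, 173, 163]
print([hex(b) for b in zheng])   # ['0xe6', '0xad', '0xa3']
# Four characters of three bytes each, matching what utf8.size() reports
print(len("\u6B63\u898F\u9054\u5F0F".encode("utf-8")))  # 12
```

The same numbers fall out of Python's own encoder, so the hand-rolled version is only there to show the layout of the bits.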
|
|
|
Post by alexfish on Apr 24, 2019 23:51:43 GMT 1
Hi All
this bit requires uni2ascii, which should be in the repos
the example uses myparser2.cxx but should also work with the original myparser.cxx
PRAGMA COMPILER g++
PRAGMA OPTIONS -Wno-write-strings -Wno-pointer-arith
PRAGMA LDFLAGS -lcurl -lbacon++
PRAGMA INCLUDE myparser2.cxx
'PRAGMA INCLUDE myparser.cxx
PROTO ParseHtml, cout, stringstream, getline,TRIML
OPTION PARSE FALSE
' a typical html utf8 entities string
LOCAL html ="正規表達式" TYPE string
LOCAL to TYPE string
LOCAL fmt TYPE string
fmt=html.substr()
stringstream ss (fmt)
html=" "
WHILE (getline(ss,to,';')) DO
TRIML(to,"")
html.append("0")
html.append(to)
WEND
SaveFile("uni.txt",html)
LOCAL str TYPE STRING
LOCAL cmd = "ascii2uni -q uni.txt" TYPE STRING
str=EXEC$(cmd)
PRINT str
BR Alex
actual code in the archive
Attachments:utftest.bac.bz2 (483 B)
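For comparison, here is a rough Python 3 sketch of the same split-on-';' idea without the external ascii2uni step. It assumes the html string held numeric character references before the forum rendered them; the decimal form used here is a guess (the original may well have been hex), and decode_numeric_refs is a made-up name.

```python
import html

# Assumed input: decimal references for the five characters above.
entities = "&#27491;&#35215;&#34920;&#36948;&#24335;"

def decode_numeric_refs(text):
    """Split on ';' like the getline() loop and decode each reference."""
    chars = []
    for chunk in text.split(";"):
        if not chunk.startswith("&#"):
            continue  # this sketch skips anything that is not a numeric reference
        body = chunk[2:]
        if body[:1] in ("x", "X"):
            chars.append(chr(int(body[1:], 16)))  # hex form, e.g. &#x6B63;
        else:
            chars.append(chr(int(body)))          # decimal form, e.g. &#27491;
    return "".join(chars)

print(decode_numeric_refs(entities))  # 正規表達式
print(html.unescape(entities))        # same result from the standard library
```

html.unescape also handles named entities and text mixed in between the references, which the hand-rolled loop does not.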
|
|
|
Post by vovchik on Apr 25, 2019 0:31:25 GMT 1
Dear Alex, Thanks. Working fine for me. Will now go and study the cxx bit. With kind regards, vovchik
|
|
|
Post by alexfish on May 7, 2019 0:41:15 GMT 1
Hi All
this is a Raspberry Pi exec demo ONLY
myparser now has its own encoders & decoders for html (utf8) entities, hence it is not dependent on any system or local html2text converters. The only dependency is LIBCURL.
the parser is almost there in terms of table sorting, say about 90%, and the text formatting is about the same
yet the main object of this demo is to see how well the encoding and decoding of the html utf8 entities is performing
if there are no bugs then the lib will be updated to parser3.01
example usage
myhtml2text http://basic-converter.proboards.com/
the file is saved as index.html
myhtml2text index.html/
BR Alex
the exec is stripped of permissions, so you will have to set them
Attachments:myhtml2text.bz2 (20.57 KB)
|
|
|
Post by alexfish on May 11, 2019 19:52:32 GMT 1
Hi All
have added Lib Tidy to the parser lib
it is possibly the best solution for difficult or broken html
on a straight run through of this forum you can now see the pre-table text, IE the bits before applying the table format
Example output now looks like the below:
updated lib to be posted next week
ADDED: the reason for next week
the encoder and decoder for html entities (codes) like #226 #8364 #162 work, but if the encoding is directly inline with text, say this bit --> t#226 #8364 #162 (I have removed the & in the above), they fail: hence = Bug
example of youtube
English School��" FA National Finals�� Stoke City FC - Duration: 8:17:52.
BR Alex
Example of this forum
Home | The BAsic CONverter Forum The BAsic CONverter Forum Skip Navigation Home
Help
Search
Goto the BaCon website
Welcome Guest. Please Login or Register.
The BAsic CONverter Forum
Home
General
News
Documentation
Code Projects
Troubleshooting area
Bugs, features General
Board Threads Posts Last Post
News
News and announcements
Moderator: Pjot 93 832 BaCon 3.9 released by Pjot May 1, 2019 19:15:40 GMT 1
Documentation - 1 Viewing
Tutorials & demonstrations
Moderator: Pjot 129 1,805 Bacon manual translated into German by Pjot May 7, 2019 19:29:38 GMT 1
Code Projects - 1 Viewing
Programs, challenges, competitions
Moderator: Pjot 204 4,238 HTML lib-tidy by alexfish May 11, 2019 18:37:50 GMT 1
Troubleshooting area
Problems, issues, tips & tricks
Moderator: Pjot 391 2,878 drag n drop compile *.bac raspberry pi by alexfish Apr 1, 2019 18:36:31 GMT 1
Bugs, features
Report a bug, request a feature
Moderator: Pjot 238 2,272 sort command does not work by juppel Apr 28, 2019 8:37:02 GMT 1
Legend
New Posts No New Posts
Forum Information & Statistics
Threads and Posts
Total Threads: 1,055 Total Posts: 12,025
Last Updated: HTML lib-tidy by alexfish (May 11, 2019 18:37:50 GMT 1)
Recent Threads - Recent Posts - RSS Feed Members
Total Members: 201
Newest Member: shell
Most Users Online: 144 (Aug 22, 2013 23:04:29 GMT 1)
View today's birthdays
Users Online
0 Staff, 1 Member, 2 Guests.
alexfish
Users Online in the Last 24 Hours
1 Staff, 3 Members, 97 Guests.
Pjot, vovchik, ptitjoz
Click here to remove banner ads from this forum.
This Forum Is Hosted For FREE By ProBoards Get Your Own Free Forum!
|
|
|
Post by alexfish on May 11, 2019 21:38:17 GMT 1
Hi All
looks like lib tidy is doing something to the encoding, so I need to look further at this bit in the lib tidy attributes etc
here is my original parse, no lib tidy; the copyright sign is ok
© 2019 YouTube, LLC Loading.
and with lib tidy
� 2019 YouTube, LLC Loading..
AGH
looking at the encoding & standards, or are they big: from what I gather the hex codes should be prefixed with & # x, and my parser is showing standard ascii(section) codes below 255
rp :#226 rp :#8364 rp :#8482 rp :#226 rp :#8364 rp :#8220 rp :#226 rp :#8364 rp :#8482 rp :#226 rp :#8364 rp :#8220
so in my parser the likes of rp :#8220 are not legal; my parser has an ascii(section) greater than 255
Now implementing a fix for this
so it is a bit of lib-tidy and a bit of what's in the html
in sum, the original bit of html: take the copyright bits
© = & copy
my decoder gives
© 2019 YouTube
and the tidy parsed string looks like
#194;#169; 2019 YouTube,LLC
without the &
BR Alex
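One possible reading of those codes (an assumption, nothing in the thread confirms it): #226 #8364 #8482 is what the three UTF-8 bytes of a right single quote look like after being read as Windows-1252, and #194;#169; is the two UTF-8 bytes of © written out one code per byte. A small Python 3 sketch to check that reading; the function names are made up.

```python
import re

def refs_to_text(refs):
    """Turn '#226 #8364 #8482' style codes into the characters they name."""
    return "".join(chr(int(n)) for n in re.findall(r"#(\d+)", refs))

def undo_cp1252_mojibake(text):
    """Re-encode as Windows-1252, then decode those bytes as UTF-8."""
    return text.encode("cp1252").decode("utf-8")

def byte_refs_to_text(refs):
    """Treat each '#NNN;' code as one UTF-8 byte rather than a code point."""
    return bytes(int(n) for n in re.findall(r"#(\d+)", refs)).decode("utf-8")

print(undo_cp1252_mojibake(refs_to_text("#226 #8364 #8482")))  # right single quote, U+2019
print(undo_cp1252_mojibake(refs_to_text("#226 #8364 #8220")))  # en dash, U+2013
print(undo_cp1252_mojibake(refs_to_text("#226 #8364 #162")))   # bullet, U+2022
print(byte_refs_to_text("#194;#169;"))                         # ©
```

If this reading is right, the codes above 255 are legal code points in themselves, but they come from decoding UTF-8 input with the wrong character set somewhere before the entity encoder runs.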
|
|
|
Post by alexfish on May 12, 2019 14:19:24 GMT 1
Hi All
have sorted the bugs
and have compared the youtube bits
Chromium vs myparser decoding a file downloaded by CURL; note this is on a DOWNLOADED file, and not live (browser) parsing
typical Chromium
Gourmet Makes S1 • E17 Pastry Chef Attempts to Make Gourmet Almond Joys | Gourmet Makes | Bon Appétit - Duration:
lib tidy set with
tidySetCharEncoding( tdoc, "utf8" ) + rest of decoding is myparser libs own decoder;
Gourmet Makes S1 • E17 Pastry Chef Attempts to Make Gourmet Almond Joys | Gourmet Makes | Bon Appétit - Duration: 31:20.
Chromium Live
GOURMET MAKES S1 • E17 Pastry Chef Attempts to Make Gourmet Almond Joys | Gourmet Makes | Bon Appétit
BR Alex
|
|
|
Post by bigbass on May 13, 2019 19:21:48 GMT 1
Hello Guys
Personal note: Alex, you had posted a ported, baconized tidy at one time in this thread. It had some errors that wouldn't allow me to compile it on RPi3 Stretch. For this reason I found the original C source code, to see why it should work for everyone and so that we have some reference to document the steps along the way.
dependency
sudo apt-get install libtidy-dev
source code and more information
www.html-tidy.org/developer/
I did not port the code; I just modified the C code to take the index as input, and tested it on the Raspberry Pi 3 Stretch to be working correctly.
There is a golang demo that does the same thing, but I will post it in a different thread if anyone is interested. I actually did that first, before modding the tidy C demo.
compile with
gcc -Wall -o "testtidy2" "testtidy2.c" -ltidy
run with
./testtidy2 "index.html" > diagnose-index.html
tidy-index-parser.tar.gz (12.48 KB)
hope this will be useful across different linux boxes
Joe
|
|
|
Post by bigbass on May 15, 2019 18:31:50 GMT 1
Hey Guys
Alex had posted more options in another thread, take a look at it (by mistake I thought it was in this thread). Thanks for all your great work on that!
basic-converter.proboards.com/thread/1086/html-lib-tidy
I just wanted to focus on documenting some of the steps of porting the C code demo I posted above to BaCon, from the original source code referenced here
www.html-tidy.org/developer/
and also didn't want to be lazy and skip the next step of posting a BaCon demo (the C code was needed to show the porting part)
how to get started: the output is just clean tidy html you could use
*you need an index.html in the same directory
name it tidy3.bac
then you can run it easily like this
./tidy3 index.html >tidy-index.html
'--- port of http://www.html-tidy.org/developer/
'--- and takes index.html as input
'--- C code styled for documenting all the steps in the porting of the code
'--- just produce cleaned tidy output html
'--- by bigbass
PRAGMA INCLUDE <tidy/tidy.h>
PRAGMA INCLUDE <tidy/tidybuffio.h>
PRAGMA INCLUDE <errno.h>
PRAGMA INCLUDE <stdbool.h>
PRAGMA INCLUDE <stddef.h>
PRAGMA LDFLAGS -ltidy
PRAGMA COMPILER gcc
OPTION PARSE FALSE
DECLARE input TYPE const char*
DECLARE output = {0} TYPE TidyBuffer
DECLARE errbuf = {0} TYPE TidyBuffer
DECLARE rc TYPE int
rc = -1
DECLARE ok TYPE Bool
DECLARE tdoc TYPE TidyDoc
'--- read in index.html the way it's done in C
'--- only for the reason to document the porting of the code completely
'--- later we could simplify this with an equivalent bacon command
DECLARE buffer TYPE char*
DECLARE len TYPE size_t
DECLARE fp TYPE FILE*
DECLARE bytes_read TYPE ssize_t
buffer = NULL
'---the first command-line parameter is in argv[1]
'---(argv[0] is the name of the program)
'--- "r" = open for reading only
fp = fopen(argv[1], "r")
bytes_read = getdelim( &buffer, &len, '\0', fp)
IF ( bytes_read != -1) THEN
    '---Success, now the entire file is in the buffer
    '---PRINT "Success"
    '---PRINT buffer
ELSE
    PRINT "maybe you didn't pass the second argument index.html"
END IF
'---pass the buffer as input
input = buffer
'--- Initialize "document"
tdoc = tidyCreate()
'--- turn off displaying the original html in the report
'---printf( "Tidying:\t%s\n", input )
'--- Convert to XHTML
ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes )
IF ( ok ) THEN
    '---Capture diagnostics
    rc = tidySetErrorBuffer( tdoc, &errbuf )
END IF
IF ( rc >= 0 ) THEN
    '--- Parse the input
    rc = tidyParseString( tdoc, input )
END IF
IF ( rc >= 0 ) THEN
    '--- Tidy it up!
    rc = tidyCleanAndRepair( tdoc )
END IF
IF ( rc >= 0 ) THEN
    '--- catch errors
    rc = tidyRunDiagnostics( tdoc )
END IF
'---If error, force output.
IF ( rc > 1 ) THEN
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput,yes) ? rc : -1 )
END IF
IF ( rc >= 0 ) THEN
    '---Pretty Print
    rc = tidySaveBuffer( tdoc, &output )
END IF
'---display the results
IF ( rc >= 0 ) THEN
    '--- turn off display of Diagnostics in the report
    '---printf( "\nDiagnostics:\n\n%s", errbuf.bp )
    '---printf( "\nAnd here is the result:\n\n%s", output.bp )
    printf( "%s", output.bp )
ELSE
    printf( "A severe error (%d) occurred.\n", rc )
END IF
tidyBufFree( &output )
tidyBufFree( &errbuf )
tidyRelease( tdoc )
|
|
|
Post by alexfish on May 18, 2019 22:07:53 GMT 1
Hi Joe
Thanks
Now have an exec (Raspberry Pi ARM) of the new lib parser using html tidy
It needs a bit of testing with several sites, hence I need some feedback on whether there are parsing issues; if all is ok I will then post the source codes
note that I am still adding some html5 'named entities', hence some decoding will not show
the search engines are google,yahoo,ask,youtube,duck(duckduckgo),bing & bbc-news
typical
myhtml2text youtube 'pink floyd another brick in the wall'
it will also pull a page, IE: myhtml2text
there is partial formatting at present, esp in tables and nav bars: only well formed navs will show on one line + an indexed list of links
IE myhtml2text http://www.basic-converter.org
the file is saved as index.html, but it can parse other html files: point to the file
myhtml2text foo.html
BR Alex
Added:: did find a bug re bbc-news, the nav-bar
[1]News [2]iWonder [3]Weather [4]Sport [5]Programmes [6]Newsround [7]World Service [8]Bitesize [9]About the BBC [10]Newsbeat [11]School Report [12]Music
the links are not indexing correctly, so now looking into the why
Fixed :: Removed "Nav bars menu concating"
Attachments:myhtml2text.bz2 (23.85 KB)
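To show what is meant by an indexed list of links like the nav-bar above, here is a rough Python 3 sketch using only the standard library. It is not how myparser does it; the class name LinkIndexer and the sample hrefs are made up for illustration.

```python
from html.parser import HTMLParser

class LinkIndexer(HTMLParser):
    """Collect page text, putting an [n] marker in front of each link label."""

    def __init__(self):
        super().__init__()
        self.parts = []   # pieces of the output text
        self.links = []   # href of link n is links[n - 1]

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
                self.parts.append("[%d]" % len(self.links))

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text + " ")

    def text(self):
        return "".join(self.parts).strip()

# Hypothetical nav-bar markup, just for the demo
sample = ('<ul><li><a href="/news">News</a></li>'
          '<li><a href="/weather">Weather</a></li></ul>')
indexer = LinkIndexer()
indexer.feed(sample)
indexer.close()
print(indexer.text())   # [1]News [2]Weather
print(indexer.links)    # ['/news', '/weather']
```

The index is simply the position of the href in the list at the moment the opening tag is seen, so a marker and its link cannot get out of step unless text is merged or reordered afterwards.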
|
|
|
Post by alexfish on May 19, 2019 2:44:13 GMT 1
Hi All
have a new archived exec (Raspberry Pi only), please read the above post
search engines are google,google-news,yahoo,ask,youtube,duck(duckduckgo),bing & bbc-news
had a problem with bbc-news, and it is still ongoing as regards the links
bbc-news now requires one of:
uk : world : politics : technology : science_and_environment : health : education : entertainment_and_arts : stories : video_and_audio headlines : in_pictures : newsbeat : special_reports : explainers : the_reporters : have_your_say : disability : england : localnews : england regions
IE
myhtml2text bbc-news 'video_and_audio headlines'
myhtml2text bbc-news 'world'
myhtml2text bbc-news 'england'
beyond this, a functional app made from the lib would be required, hence added google-news
Now requires testing: if ok, will post the lib source codes for compiling on different architectures
BR Alex
Attachments:myhtml2text.bz2 (24.45 KB)
|
|
|
Post by alexfish on May 19, 2019 3:23:38 GMT 1
Hi All
did a test of wiki pages
this shows how the decoding of html entities is progressing
you can see which ones are converted and which are not
the ones that are not look like
𒅀𒄷𒌑𒈾𒋫𒉡
some bits
List of alternatives [ [53]edit ]
○ [54]Akkadian : 𒅀𒄷𒌑𒈾𒋫𒉡 [55]translit. Yaḫu-natanu
[56]Arabic : يوناثان , [57][7] جوناثان
[58]Amharic : ዮናታን
[59]Aramaic :
○ [60]Assyrian Neo-Aramaic : ܝܘܿܢܵܕ݂ܵܡ , romanized: Yōnāḏām
[61]Classical Syriac : ܝܘܢܬܢ , romanized: Yōnāṯān
[62]Targumic [63]Aramaic : יוֹנָתָן , romanized: Yônāṯān
[64]Armenian : Հովնաթան , [65]romanized : Hovnatan
[66]Chinese : 乔纳森 (simplified), 喬納森 (traditional)
[67]Croatian : Jonatan
[68]Dutch : Jonathan, Jonatan
[69]Finnish : Joonatan
[70]French : Jonathan
[71]Georgian : იონათანი
[72]German : Jonathan, Jonatan
[73]Greek : Ιωνάθαν , [74]romanized : Ionáthan
[75]Hawaiian : Ionakana
[76]Hungarian : Jonatán
[77]Icelandic : Jónatan, Jonathan
[78]Irish : Seanachán, Ionatán
[79]Italian : Gionatan, Gionata, Jonathan
[80]Japanese : ジョナサン , [81]romanized : Jonasan
[82]Korean : 조나단 , [83]romanized : Jonadan
[84]Latin : Ionathan
[85]Lithuanian : Jonatanas
[86]Māori : Honatana
[87]Persian : جاناتان
[88]Polish : Jonatan, Jonathan
[89]Portuguese : Joatan, Jónatas, Jônatas, Jonatã, Jonatão, Jonathan
[90]Romanian : Ionatan, Ion
[91]Russian : Ионафан , [92]romanized : Ionafan
[93]Samoan : Ionatana
[94]Spanish : Jonathan , translit. Yónathan
[95]Swedish : Jonatan
[96]Tongan : Sonatane
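A side note on the Akkadian entry: the cuneiform signs sit above U+FFFF, so each one needs a numeric reference greater than 65535 and four UTF-8 bytes, while everything else in the list fits in three bytes or fewer. Whether that is why they were the ones left undecoded is only a guess. A short Python 3 sketch to see the difference (entity_info is a made-up helper):

```python
def entity_info(ch):
    """Describe one character: decimal reference, UTF-8 bytes, above U+FFFF or not."""
    cp = ord(ch)
    return {
        "ref": "&#%d;" % cp,
        "utf8": list(ch.encode("utf-8")),
        "astral": cp > 0xFFFF,
    }

akkadian = "𒅀𒄷𒌑𒈾𒋫𒉡"
for ch in akkadian:
    info = entity_info(ch)
    print(info["ref"], len(info["utf8"]), "bytes", info["astral"])

# A character from the Chinese entry, for comparison
print(entity_info("乔"))
```

A decoder that stores code points in 16 bits, or that only writes up to three UTF-8 bytes, would handle the rest of the list and still trip over these six.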
|
|