Post by alexfish on May 10, 2019 21:29:13 GMT 1
Hi All
While working on a C++ html2text parser in another thread, I managed to get most things done with no dependencies, especially decoding HTML text entities, UTF-8 in hex formats, etc.
Now I am facing 'how to parse tables' plus how to get past missing end tags.
In the end I decided to use libtidy. The lib should be in most distros,
though the development files may not be: the dependency for this is libtidy-dev.
Here I have managed to get the HTML into blocks ready for parsing, plus repair the broken bits:
Demo Code
PRAGMA LDFLAGS -ltidy
PRAGMA INCLUDE <tidy/tidy.h>
PRAGMA INCLUDE <tidy/buffio.h>
'ensure buffer can be seen by bacon
PROTO output.pb
OPTION PARSE FALSE

DECLARE output = {0} TYPE TidyBuffer
DECLARE errbuf = {0} TYPE TidyBuffer
DECLARE tdoc TYPE TidyDoc
tdoc = tidyCreate()
DECLARE input$ TYPE STRING

'point to file and load
input$ = LOAD$("index.html")

' create output as XHTML
ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes )
ok = tidyOptSetBool( tdoc, TidyDropEmptyParas, yes )
ok = tidyOptSetBool( tdoc, TidyFixComments, yes )
ok = tidyOptSetBool( tdoc, TidyHideComments, yes )
ok = tidyOptSetBool( tdoc, TidyBreakBeforeBR, yes )
' THIS WILL PUT XHTML INTO BLOCKS
ok = tidyOptSetBool( tdoc, TidyVertSpace, yes )

rc = tidySetErrorBuffer( tdoc, &errbuf )
rc = tidyParseString( tdoc, input$ )
'FIX ERRORS
rc = tidyCleanAndRepair( tdoc )
' FORCE OUTPUT
rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 )
rc = tidySaveBuffer( tdoc, &output )
IF rc > 0 THEN
    input$ = output.bp
    PRINT input$
END IF

'CLEAN UP
tidyBufFree( &output )
tidyBufFree( &errbuf )
tidyRelease( tdoc )
BR Alex
Post by vovchik on May 10, 2019 22:32:40 GMT 1
Dear Alex, It works nicely. Thanks. And libtidy is very useful. With kind regards, vovchik
Post by alexfish on May 11, 2019 15:32:17 GMT 1
Hi Vovchik & all
Here I have put the bits into NODE$; from there you can parse them.
I will leave the parser attempts to you all.
Have Fun +
BR Alex
Note that I added white space where the strings are merged,
at Tags$=Tags$&Bits$[ptr] & " "
The code:
PRAGMA LDFLAGS -ltidy
PRAGMA INCLUDE <tidy/tidy.h>
PRAGMA INCLUDE <tidy/buffio.h>
'ensure buffer can be seen by bacon
PROTO output.pb
OPTION PARSE FALSE

DECLARE output = {0} TYPE TidyBuffer
DECLARE errbuf = {0} TYPE TidyBuffer
DECLARE tdoc TYPE TidyDoc
tdoc = tidyCreate()
DECLARE XHTML$ TYPE STRING
DECLARE input$ TYPE STRING

'point to file and load
input$ = LOAD$("bacon.html")

' create output as XHTML
ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes )
ok = tidyOptSetBool( tdoc, TidyDropEmptyParas, yes )
ok = tidyOptSetBool( tdoc, TidyFixComments, yes )
ok = tidyOptSetBool( tdoc, TidyHideComments, yes )
ok = tidyOptSetBool( tdoc, TidyBreakBeforeBR, yes )
' THIS WILL PUT XHTML INTO BLOCKS
ok = tidyOptSetBool( tdoc, TidyVertSpace, yes )

rc = tidySetErrorBuffer( tdoc, &errbuf )
rc = tidyParseString( tdoc, input$ )
'FIX ERRORS
rc = tidyCleanAndRepair( tdoc )
' FORCE OUTPUT
rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 )
rc = tidySaveBuffer( tdoc, &output )
IF rc > 0 THEN
    XHTML$ = output.bp
    'SPLIT <string$> [BY <substr$>|<nr>] TO <array$> SIZE <variable> [STATIC]
    DECLARE tag_numb =0 TYPE int
    DECLARE ptr =0 TYPE int
    DECLARE Tags$ TYPE STRING
    DECLARE XHTML_NODES$ TYPE STRING
    SPLIT XHTML$ BY "\n" TO Bits$ SIZE tag_numb
    ' merge lines until an empty line marks the end of a block
    WHILE (ptr < tag_numb) DO
        Tags$=Tags$&Bits$[ptr] & " "
        IF NOT(LEN(Bits$[ptr])) THEN
            XHTML_NODES$= XHTML_NODES$ & Tags$ & "\n"
            Tags$=""
        ENDIF
        INCR ptr
    WEND
    ' PRINT XHTML_NODES$
    SPLIT XHTML_NODES$ BY "\n" TO NODE$ SIZE tag_numb
    ptr=0
    ' can iterate through the bits
    WHILE (ptr < tag_numb) DO
        ' can parse the node here
        PRINT NODE$[ptr]
        PRINT "---------------------------"
        INCR ptr
    WEND
END IF

'CLEAN UP
tidyBufFree( &output )
tidyBufFree( &errbuf )
tidyRelease( tdoc )
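For readers following the logic rather than the BaCon syntax, the blank-line grouping done by the first WHILE loop can be sketched in Python. This is only an illustration, not part of the posted program: `group_blocks` is a hypothetical name, and unlike the BaCon loop it also keeps a final block that is not followed by an empty line and does not append a trailing space.

```python
def group_blocks(xhtml: str) -> list:
    """Merge consecutive non-empty lines into one node per block.

    Blocks are assumed to be separated by empty lines, which is what
    the vertical-space option in the posted code is used for.
    """
    nodes = []
    current = []
    for line in xhtml.split("\n"):
        if line.strip():
            # still inside a block: collect the line
            current.append(line)
        elif current:
            # empty line: the block is complete, join its lines with spaces
            nodes.append(" ".join(current))
            current = []
    if current:
        # last block had no trailing empty line
        nodes.append(" ".join(current))
    return nodes


if __name__ == "__main__":
    sample = "<p>a\nb</p>\n\n<div>\nc\n</div>\n"
    for node in group_blocks(sample):
        print(node)
        print("---------------------------")
```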
BR Alex
Post by alexfish on May 11, 2019 18:37:50 GMT 1
Hi All
Suppose the HTML is cleaned up (stripped of the likes of JavaScript, etc.).
Then this SUB works well with the tidy output:
SUB PRINT_TEXT(STRING html)
    LOCAL tag_numbs,ptrs TYPE int
    LOCAL text TYPE STRING
    SPLIT html BY "<" TO bits$ SIZE tag_numbs
    WHILE (ptrs < tag_numbs) DO
        text = bits$[ptrs]
        text = MID$(text,INSTR(text,">")+1)
        IF ( LEN(text) AND NOT(INSTR(text,">"))) THEN
            PRINT text;
        END IF
        INCR ptrs
    WEND
END SUB
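The same idea as PRINT_TEXT, sketched in Python for anyone who wants to try the logic outside BaCon: split on "<", drop everything up to the first ">" in each piece, and keep what is left. `extract_text` is a hypothetical name, and like the SUB it assumes cleaned-up markup with no script or comment content left in it.

```python
def extract_text(html: str) -> str:
    """Return the text between tags, mirroring the PRINT_TEXT logic."""
    parts = []
    for piece in html.split("<"):
        # each piece looks like 'tagname attrs>text after the tag';
        # find() returns -1 when there is no '>', so the slice then
        # keeps the whole piece, as MID$ does when INSTR returns 0
        text = piece[piece.find(">") + 1:]
        if text and ">" not in text:
            parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    print(extract_text("<p>Hello <b>world</b></p>"))
```

Splitting on "<" keeps the code short, but a literal "<" or ">" inside text or attribute values will confuse it, which is one reason to run tidy first.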
Example
BR Alex

ALL CODE
PRAGMA LDFLAGS -ltidy
PRAGMA INCLUDE <tidy/tidy.h>
PRAGMA INCLUDE <tidy/buffio.h>
'ensure buffer can be seen by bacon
PROTO output.pb
OPTION PARSE FALSE

SUB PRINT_TEXT(STRING html)
    LOCAL tag_numbs,ptrs TYPE int
    LOCAL text TYPE STRING
    SPLIT html BY "<" TO bits$ SIZE tag_numbs
    WHILE (ptrs < tag_numbs) DO
        text = bits$[ptrs]
        text = MID$(text,INSTR(text,">")+1)
        IF ( LEN(text) AND NOT(INSTR(text,">"))) THEN
            PRINT text;
        END IF
        INCR ptrs
    WEND
END SUB

DECLARE output = {0} TYPE TidyBuffer
DECLARE errbuf = {0} TYPE TidyBuffer
DECLARE tdoc TYPE TidyDoc
tdoc = tidyCreate()
DECLARE XHTML$ TYPE STRING
DECLARE input$ TYPE STRING

'point to file and load
input$ = LOAD$("index.html")

' create output as XHTML
ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes )
ok = tidyOptSetBool( tdoc, TidyDropEmptyParas, yes )
ok = tidyOptSetBool( tdoc, TidyFixComments, yes )
ok = tidyOptSetBool( tdoc, TidyHideComments, yes )
ok = tidyOptSetBool( tdoc, TidyBreakBeforeBR, yes )
' THIS WILL PUT XHTML INTO BLOCKS
ok = tidyOptSetBool( tdoc, TidyVertSpace, yes )

rc = tidySetErrorBuffer( tdoc, &errbuf )
rc = tidyParseString( tdoc, input$ )
'FIX ERRORS
rc = tidyCleanAndRepair( tdoc )
' FORCE OUTPUT
rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 )
rc = tidySaveBuffer( tdoc, &output )
IF rc > 0 THEN
    XHTML$ = output.bp
    'SPLIT <string$> [BY <substr$>|<nr>] TO <array$> SIZE <variable> [STATIC]
    DECLARE tag_numb =0 TYPE int
    DECLARE ptr =0 TYPE int
    DECLARE Tags$ TYPE STRING
    DECLARE XHTML_NODES$ TYPE STRING
    SPLIT XHTML$ BY "\n" TO Bits$ SIZE tag_numb
    WHILE (ptr < tag_numb) DO
        Tags$=Tags$&Bits$[ptr] & " "
        IF NOT(LEN(Bits$[ptr])) THEN
            XHTML_NODES$= XHTML_NODES$ & Tags$ & "\n"
            Tags$=""
        ENDIF
        INCR ptr
    WEND
    ' PRINT XHTML_NODES$
    SPLIT XHTML_NODES$ BY "\n" TO NODE$ SIZE tag_numb
    ptr=0
    ' can iterate through the bits
    WHILE (ptr < tag_numb) DO
        ' can parse the node here
        PRINT_TEXT( NODE$[ptr])
        'PRINT "----------------------------"
        INCR ptr
        PRINT
        PRINT
    WEND
END IF

'CLEAN UP
tidyBufFree( &output )
tidyBufFree( &errbuf )
tidyRelease( tdoc )
Post by bigbass on May 16, 2019 17:16:46 GMT 1
Hello Alex

I am using the latest Raspberry Pi 3 image, 2019-04-08-raspbian-stretch.zip, from the official RPi website. Your first demo compiles, but at run time it gives me:

alex: /build/tidy-html5-2q5e4k/tidy-html5-5.2.0/src/config.c:409: prvTidySetOptionBool: Assertion `option_defs[ optId ].type == TidyBoolean' failed.
ERROR: signal ABORT received - internal error. Try to compile the program with TRAP LOCAL to find the cause.

At first I had no clue as to why. I then tried compiling from the original source to see whether I was missing something on my box. Long story short: if I comment out line 27 of your demo, it runs correctly, i.e.

'---ok = tidyOptSetBool( tdoc, TidyVertSpace, yes )

Maybe something is different in the latest version? Thanks for your work on this; it is a useful tool.

P.S. input$ = LOAD$("index.html") is a lot easier than C.

Joe
Post by bigbass on May 18, 2019 17:06:26 GMT 1
Hello Alex
I figured out what happened. You and vovchik are using older versions of Raspberry Pi 3 Debian.
I just installed Xubuntu 16.04 (2016) on an RPi3. Very nice; as a spare testing box it is well done and stable,
and your code works correctly there, of course.
The reason is that the tidy there is very old (2009).
The later version of tidy has <tidy/tidybuffio.h>,
and that's the reason. So, to keep the latest version from breaking things, that is the only thing that needs to change to adapt the code.
Joe
Post by bigbass on Jun 20, 2019 16:37:49 GMT 1
Get the latest tidy built and installed for you. Check my last two posts to see what happens on a later version of tidy. If you have the dependencies already installed, the building part should take about 3 minutes.
Joe
This compiles with the latest tidy. Test a baconized port of tidy here, so you can use it with BaCon: basic-converter.proboards.com/post/12210

#!/bin/bash
# build-tidy.sh
# run as sudo ./build-tidy.sh

# automated script by bigbass
# we need the latest version of tidy if your default package is very outdated
# or use the latest version of linux with your preferred OS
# then apt-get install libtidy-dev

# source code link and full build info
# https://github.com/htacg/tidy-html5

# I will just script it all to keep it simple step by step

# if you want to ensure that tidy will work in the future test against
# the latest version of tidy from git this is an easy way to build
# from source if you use this script it will build the latest tidy for you

#(1)================================
# We need to have git installed
apt-get install git

#(2)================================
# then run this in the terminal to download the source folder
git clone https://github.com/htacg/tidy-html5.git

#(3)================================
# it will be built by cmake but don't stress, it's easier
# than expected and very verbose and clear to read
apt-get install cmake

#(4)================================
# there is a dependency for the doc creation
apt install xsltproc

#(5)================================
cd tidy-html5
cd build/cmake

# this will work as is with no edits
cmake ../.. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr

# If all went well we can install it now you're done
make install

# get back to where we started
cd ../../..

# so we can delete it as non root later
chmod 777 tidy-html5

#-----------testing tidy in action ------------------------

# get the version
tidy -v

# We will need some index to test the tidy lib with
wget -N -e robots=off http://basic-converter.proboards.com

tidy index.html >fixed-index.html
Post by alexfish on Jun 21, 2019 7:42:14 GMT 1
Hi Joe
I have a couple of versions from GitHub on RPi Raspbian,
hence I trimmed the ENUM to work on all versions. I cannot say much as yet, but there is
one thing I am looking at as regards the 'official demo' in tidy.h:
there is an assert on name, i.e.
assert( name != NULL );
I do know that if you print, say, a Tidy 'ctmbstr' string, that print statement can segfault,
i.e. ctmbstr name;
In one area of my demos I changed assert( name != NULL )
to if(name),
and that part of the GetTidy bits did not crash.
I do think it is wise to try that version, as in another way
with the Tidy lib on Raspbian I am having problems with sites that have
large amounts of script mixed in.
I will follow your suggestion, as we need to be singing from the same hymn sheet.
BR Alex
Post by bigbass on Jun 21, 2019 16:38:35 GMT 1
Hello Alex
If you need testing feedback, we all have to have something in common that is independent of Linux version (RPi3 or Mint/Ubuntu), and the git version is the official future of tidy for everyone.
I would be happy just to use apt-get, but we have different Raspbian versions too.
That said, there are still some pitfalls that are difficult to catch when C code is compiled as C++ with different compilers. Older compilers aren't so picky and pass things that will throw errors with recent compilers, so what works for some of us may not work for others, for those reasons and more (32/64 bit).
And it would be easier for any code to be updated.
All work on tidy is a good thing, and it is a very useful lib.
Thanks for your work with tidy
Joe
This very simple demo compiles in C, g++ and clang, to see if all is set up correctly.
Post by alexfish on Jun 21, 2019 17:44:00 GMT 1
Hi Joe
Although I have tidy 5 versions from GitHub, there was one listed as 5-5 but with a version history of 2015. Now here is the PITA: the dreaded 'Deprecated'. As to the 2019 version: 2015 allowed 'Deprecated', 2019 does not?
So this is where I sit: I will post a lib example with the bits that get to the necessary bit, like the one I have just posted on the html2text thread. It takes time to form the SUB that can do it; there are something like 100 to 150 calls to get there, in three areas. I also need to rewrite how to fill in the missing bits in tags. Here is one example of a browser that gets lost if it cannot find the missing bits, i.e. if the HTML author fails to cross the t's and dot the i's (picture attached).
BR Alex
Post by alexfish on Jun 22, 2019 15:04:11 GMT 1
Hi Joe
I have tested the latest libtidy.
Objective: get attributes: now working
Objective: parse out JavaScript: now working
The next post on the HTML2TEXT thread will be based on the latest HTML Tidy 5.
Added: here is part of extracting the raw text node by node, including other info that is in the CDATA
(part of a YouTube search):
Line =1 : Column =3 :
html Has :1
<!DOCTYPE html>
******************************************************
Tag =html : Line =1 : Column =18 :
Atrr 1 =lang Value 1 :en
Atrr 2 =data-cast-api-enabled Value2 :true
Atrr 3 =xmlns Value3 :http://www.w3.org/1999/xhtml
html Has :0
******************************************************
Tag =head : Line =1 : Column =63 :
head Has :0
******************************************************
Tag =meta : Line =1394 : Column =1 :
Atrr 1 =name Value 1 :generator
Atrr 2 =content Value2 :HTML Tidy for HTML5 for Linux version 5.7.22
meta Has :0
******************************************************
Tag =style : Line =1 : Column =69 :
Atrr 1 =name Value 1 :www-roboto
style Has :0
******************************************************
Line =1 : Column =95 :
style Has :1
******************************************************
Tag =script : Line =1 : Column =924 :
Atrr 1 =name Value 1 :www-roboto
script Has :0
******************************************************
Line =1 : Column =951 :
script Has :1
******************************************************
Tag =script : Line =1 : Column =1093 :
script Has :0
******************************************************
Line =1 : Column =1102 :
script Has :1
******************************************************
Tag =script : Line =1 : Column =2612 :
script Has :0
******************************************************
Line =1 : Column =2621 :
script Has :1
******************************************************
Tag =script : Line =1 : Column =2932 :
script Has :0
******************************************************
Line =1 : Column =2940 :
script Has :1
******************************************************
Tag =script : Line =8 : Column =3 :
script Has :0
******************************************************
Line =9 : Column =1 :
script Has :1
******************************************************
Tag =script : Line =20 : Column =7 :
Atrr 1 =src Value 1 :/yts/jsbin/scheduler-vfl0nTPXR/scheduler.js
Atrr 2 =type Value2 :text/javascript
Atrr 3 =name Value3 :scheduler/scheduler
script Has :0
******************************************************
Tag =link : Line =24 : Column =3 :
Atrr 1 =rel Value 1 :stylesheet
Atrr 2 =href Value2 :/yts/cssbin/www-core-vflMxh-5O.css
Atrr 3 =name Value3 :www-core
link Has :0
******************************************************
Tag =link : Line =25 : Column =7 :
Atrr 1 =rel Value 1 :stylesheet
Atrr 2 =href Value2 :/yts/cssbin/player-vflU0p5Wq/www-player.css
Atrr 3 =name Value3 :player/www-player
link Has :0
******************************************************
Tag =link : Line =27 : Column =3 :
Atrr 1 =rel Value 1 :stylesheet
Atrr 2 =href Value2 :/yts/cssbin/www-pageframe-vfldKQBmr.css
Atrr 3 =name Value3 :www-pageframe
link Has :0
******************************************************
Tag =link : Line =28 : Column =3 :
Atrr 1 =rel Value 1 :stylesheet
Atrr 2 =href Value2 :/yts/cssbin/www-guide-vflybhooe.css
Atrr 3 =name Value3 :www-guide
link Has :0
******************************************************
Tag =title : Line =31 : Column =1 :
title Has :0
******************************************************
Line =31 : Column =8 :
title Has :1
pink floyd - YouTube
******************************************************
Tag =link : Line =31 : Column =36 :
Atrr 1 =rel Value 1 :alternate
Atrr 2 =media Value2 :handheld
Atrr 3 =href Value3 :https://m.youtube.com/results?search_query=pink+floyd
link Has :0
******************************************************
Tag =link : Line =31 : Column =136 :
Atrr 1 =rel Value 1 :alternate
Atrr 2 =media Value2 :only screen and (max-width: 640px)
Atrr 3 =href Value3 :https://m.youtube.com/results?search_query=pink+floyd
link Has :0
******************************************************
Tag =meta : Line =31 : Column =262 :
Atrr 1 =name Value 1 :description
Atrr 2 =content Value2 :Enjoy the videos and music you love, upload original content and share it all with friends, family and the world on YouTube.
meta Has :0
******************************************************
BR Alex
Post by alexfish on Jun 26, 2019 21:57:23 GMT 1
Hi Joe
The new methods to get to the bits created a headache, but I am getting there. So if the latest libtidy 5 is the way to go, then an 8-bit colour terminal is the way to go. As is obvious from the bits posted above, I can now get this terminal looking like a web browser and not just a paging text browser, yet all you see is exactly that: a string of text. I hope to have the rest of the bits put into place by the weekend, especially the C++ lib + BaCon front end.
BR Alex
PI4 picture from the Raspberry Pi blogs; the demo loaded faster than RPi Chromium.
Post by alexfish on Jun 28, 2019 17:55:38 GMT 1
Hi All
I had hoped to have the separated lib ready this weekend; it is now stalled.
Most sites decode, yet as I said previously, RPi.org performance is dodgy, especially the blogs,
hence I had to delve deeper into the why.
In one I noticed the download is just one long string, so I separated the bits
and looked further at the image tags, and this is what I found. And this is just one of them:
<a href="https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web.jpg"> <img class="aligncenter size-full wp-image-52304" src="https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web.jpg" alt="" width="1240" height="1630" srcset="https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web.jpg 1240w, https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-190x250.jpg 190w, https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-768x1010.jpg 768w, https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-500x657.jpg 500w, https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-822x1080.jpg 822w" sizes="(max-width: 1240px) 100vw, 1240px" /> </a>
And now I need to look further into how to get these bits out and stop the lib dump from choking.
If you have an RPi and Chromium, then you can see why.
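One possible way to get those bits out, sketched in Python rather than BaCon: treat the srcset value as a comma-separated list of "URL descriptor" pairs and pick the narrowest image for a terminal. This is only an assumption about how it could be handled, not what the lib does; `parse_srcset` and `smallest_candidate` are hypothetical names, and the simple comma split will break on URLs that themselves contain commas.

```python
def parse_srcset(srcset: str) -> list:
    """Split a srcset attribute value into (url, descriptor) pairs."""
    candidates = []
    for entry in srcset.split(","):
        fields = entry.split()
        if not fields:
            continue  # skip empty entries, e.g. after a trailing comma
        url = fields[0]
        descriptor = fields[1] if len(fields) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


def smallest_candidate(candidates: list):
    """Return the URL with the smallest width descriptor (e.g. '190w'), or None."""
    widths = [
        (int(desc[:-1]), url)
        for url, desc in candidates
        if desc.endswith("w") and desc[:-1].isdigit()
    ]
    return min(widths)[1] if widths else None


if __name__ == "__main__":
    srcset = (
        "https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web.jpg 1240w, "
        "https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-190x250.jpg 190w, "
        "https://www.raspberrypi.org/app/uploads/2019/06/001_Magpi83_COVER-Web-768x1010.jpg 768w"
    )
    print(smallest_candidate(parse_srcset(srcset)))
```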
BR Alex
Post by alexfish on Jun 29, 2019 18:49:47 GMT 1
Hi All
I have managed to get the image sizes reduced to suit the sixel format in a terminal. As a case example, the buffer data needs some method to indicate how many lines there are and where they can be split; I will return to this bit next week.
Added: the reason is that on RPi.org some of the images are base64, hence using pipes I can decode them with base64 -d, and in the case of SVG I can use rsvg-convert to get a PNG image, and then finally img2sixel.
Phew, long-winded, but here is a proof of concept (picture attached).
BR Alex
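The decode step alone can be sketched in Python, assuming the base64 images arrive as data: URIs in the src attribute (the post does not say so explicitly). `decode_data_uri` is a hypothetical helper; it covers only what base64 -d does in the pipe, and the rsvg-convert and img2sixel stages are left out.

```python
import base64


def decode_data_uri(uri: str):
    """Return (mime_type, raw_bytes) for a base64 data: URI, else None."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        return None  # an ordinary URL or a non-base64 data URI
    mime = header[len("data:"):-len(";base64")]
    return mime, base64.b64decode(payload)


if __name__ == "__main__":
    # "aGVsbG8=" is the base64 encoding of b"hello"
    print(decode_data_uri("data:text/plain;base64,aGVsbG8="))
```

The MIME type in the header is what would decide the next stage: an image/svg+xml payload would go through rsvg-convert first, anything else straight to img2sixel.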
Post by alexfish on Jun 29, 2019 22:47:30 GMT 1
Update to the above: part web view (a picture of the RPi site).
BR Alex