devel / comp.lang.c++ / [OT] Help for a RegEx

Subject (Author)
* [OT] Help for a RegEx (Fonntuggnio)
+* Re: [OT] Help for a RegEx (Christian Gollwitzer)
|`- Re: [OT] Help for a RegEx (MarioCPPP)
+* Re: [OT] Help for a RegEx (Paavo Helde)
|+* Re: [OT] Help for a RegEx (Scott Lurndal)
||`- Re: [OT] Help for a RegEx (MarioCPPP)
|`* Re: [OT] Help for a RegEx (MarioCPPP)
| `* Re: [OT] Help for a RegEx (Paavo Helde)
|  `* Re: [OT] Help for a RegEx (MarioCPPP)
|   `* Re: [OT] Help for a RegEx (Ben Bacarisse)
|    `* Re: [OT] Help for a RegEx (MarioCPPP)
|     `* Re: [OT] Help for a RegEx (jak)
|      `* Re: [OT] Help for a RegEx (MarioCCCP)
|       `* Re: [OT] Help for a RegEx (jak)
|        `* Re: [OT] Help for a RegEx (MarioCCCP)
|         `- Re: [OT] Help for a RegEx (jak)
+- Re: [OT] Help for a RegEx (Keith Thompson)
`- Re: [OT] Help for a RegEx (wij)

[OT] Help for a RegEx

<uae4bc$6g1o$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=953&group=comp.lang.c%2B%2B#953

From: JoeFonnt...@libbbero.it (Fonntuggnio)
Newsgroups: comp.lang.c++
Subject: [OT] Help for a RegEx
Date: Wed, 2 Aug 2023 19:38:51 +0200
 by: Fonntuggnio - Wed, 2 Aug 2023 17:38 UTC

Sorry for the total OT, but I failed to build a RegEx with
the "help" (rotfl) of three different so-called AIs, and got
nowhere.

I am scanning an HTML document (not in JavaScript, so I do
not have access to the DOM nodes) and I need to match EVERY
whole <p> element.

By "whole" I mean starting from the <p and ending with the
corresponding </p>, but such paragraphs MAY (or may not)
contain:

a long list of attributes before the >, possibly with \n,
\r or \t characters among them;

an inner text, possibly multiline, which may also contain
\n, \r or \t characters.

I have tried most of the suggestions from Bearly, ChatGPT and
You.com AI, but none worked.

(My test bed is the RegEx engine of the Kate editor with the
HTML loaded. It is handy because it highlights the matches in
yellow, so I can verify that the RegExes I tried fail to
detect perfectly valid paragraphs.)

If somebody happens to be familiar with RegExes that handle
"invisible" characters, I'd be very grateful for any hint.
Ciao!

Re: [OT] Help for a RegEx

<uae5jb$6vko$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=954&group=comp.lang.c%2B%2B#954

From: aurio...@gmx.de (Christian Gollwitzer)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Wed, 2 Aug 2023 20:00:09 +0200
In-Reply-To: <uae4bc$6g1o$1@dont-email.me>
 by: Christian Gollwitzer - Wed, 2 Aug 2023 18:00 UTC

Am 02.08.23 um 19:38 schrieb Fonntuggnio:
>
> Sorry for the total OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so called IA, getting to nothing
>
> I am scanning an HTML document (not in javascript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
>
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragrapha MAY (and may not) contain

This may not be possible at all. RegExes cannot match nested pairs,
i.e. if your <p></p> contains other <p></p> pairs then you have reached
the end of what a RE is capable of. Also, due to the way these tags are
structured, you need at least a negative lookahead for it, which not
all RE engines support either.

If you do
<p.*</p>

then the RE would match from the first <p to the last </p>; hence you
need to guard the .* with a negative lookahead like (?!</p>), or use a
non-greedy RE.
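
The difference between the greedy and the non-greedy form can be checked
with a few lines of C++. This is only a minimal sketch, assuming
std::regex with its default ECMAScript grammar (Kate's engine may behave
differently); the helper name find_all_matches and the constant names are
made up for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper (not from the thread): collect every non-overlapping
// match of `pattern` in `text`, using the default ECMAScript grammar.
std::vector<std::string> find_all_matches(const std::string& text,
                                          const std::string& pattern) {
    assert(!pattern.empty());
    const std::regex re(pattern);
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), last;
         it != last; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// Greedy: .* runs on to the LAST </p> it can reach.
const char* const kGreedy = "<p.*</p>";
// Non-greedy: .*? stops at the FIRST </p> it can reach.
const char* const kLazy = "<p.*?</p>";
// Negative lookahead: consume a character only if </p> does not start there.
const char* const kLookahead = "<p(?:(?!</p>).)*</p>";
```

For the input <p>one</p><p>two</p>, find_all_matches(input, kGreedy)
yields a single match spanning both paragraphs, while kLazy and
kLookahead each yield the two paragraphs separately.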

> a long list of attributes, with or without zero or more \n \r \t
> characters, valid, before the >.

> (my test is the RegEx engine from KATE Editor with the loaded HTML.
> If sb happens to be familiar with RegEx supporting "invisible"
> characters ... I'd be very grateful for any hint.

This may well be the problem. Some RE engines treat newline
characters as special, i.e. it may be that Kate matches only *within* a
line.

In short - maybe a RE engine is simply not a good tool to do that. Then
use an XML parser instead.

Christian

Re: [OT] Help for a RegEx

<uae6gp$76ua$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=955&group=comp.lang.c%2B%2B#955

From: eesn...@osa.pri.ee (Paavo Helde)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Wed, 2 Aug 2023 21:15:52 +0300
In-Reply-To: <uae4bc$6g1o$1@dont-email.me>
 by: Paavo Helde - Wed, 2 Aug 2023 18:15 UTC

02.08.2023 20:38 Fonntuggnio kirjutas:
>
> Sorry for the total OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so called IA, getting to nothing
>
> I am scanning an HTML document (not in javascript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
>
> for whole I mean, starting from the <p and ending with the corresponding
> </p>, but such paragrapha MAY (and may not) contain

I'm afraid HTML cannot be parsed with a regex in general. Also, the HTML
rules are very lax; for example, there is no guarantee that a
corresponding terminating </p> tag actually appears.

Also, there is no guarantee that the actual content is contained in the
<p> tags; it might well be outside, and all the <p> tags might actually
be empty <p/>.

For extracting the content of unknown pages reliably you would probably
need some kind of a state machine, with a fair knowledge of obscure HTML
rules. Of course, there are libraries for that.

Re: [OT] Help for a RegEx

<M7xyM.456117$GMN3.417214@fx16.iad>

https://www.novabbs.com/devel/article-flat.php?id=956&group=comp.lang.c%2B%2B#956

From: sco...@slp53.sl.home (Scott Lurndal)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Wed, 02 Aug 2023 18:20:28 GMT
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
 by: Scott Lurndal - Wed, 2 Aug 2023 18:20 UTC

Paavo Helde <eesnimi@osa.pri.ee> writes:
>02.08.2023 20:38 Fonntuggnio kirjutas:
>>
>> Sorry for the total OT, but I failed to build a RegEx with the "help"
>> (rotfl) of three different so called IA, getting to nothing
>>
>> I am scanning an HTML document (not in javascript, so I do not have
>> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
>>
>> for whole I mean, starting from the <p and ending with the corresponding
>> </p>, but such paragrapha MAY (and may not) contain
>
>I'm afraid HTML cannot be parsed with a regex in general. Also, the HTML
>rules are very lax, for example there is no such guarantee that there
>actually appears a corresponding terminating </p> tag.
>
>Also, there is no guarantee that the actual content is contained in the
><p> tags, it might well be outside and all the <p> tags might actually
>be empty <p/>.
>
>For extracting the content of unknown pages reliably you would probably
>need some kind of a state machine, with a fair knowledge of obscure HTML
>rules. Of course, there are libraries for that.

The best way to deal with HTML is to use XSLT processors. You may want
to run the HTML text through a canonicalizer first.

Re: [OT] Help for a RegEx

<87pm45l19p.fsf@nosuchdomain.example.com>

https://www.novabbs.com/devel/article-flat.php?id=957&group=comp.lang.c%2B%2B#957

From: Keith.S....@gmail.com (Keith Thompson)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Wed, 02 Aug 2023 11:58:26 -0700
References: <uae4bc$6g1o$1@dont-email.me>
 by: Keith Thompson - Wed, 2 Aug 2023 18:58 UTC

Fonntuggnio <JoeFonntuggnio@libbbero.it> writes:
> Sorry for the total OT, but I failed to build a RegEx with the "help"
> (rotfl) of three different so called IA, getting to nothing
>
> I am scanning an HTML document (not in javascript, so I do not have
> access to DOM nodes from inside) and I need to match EVERY <p> whole
> tag.
[...]

https://stackoverflow.com/a/1732454/827263

"TONY THE PONY HE COMES"

--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */

Re: [OT] Help for a RegEx

<uaemnq$achd$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=958&group=comp.lang.c%2B%2B#958

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Thu, 3 Aug 2023 00:52:42 +0200
In-Reply-To: <uae5jb$6vko$1@dont-email.me>
 by: MarioCPPP - Wed, 2 Aug 2023 22:52 UTC

On 02/08/23 20:00, Christian Gollwitzer wrote:
> Am 02.08.23 um 19:38 schrieb Fonntuggnio:
>>
>> Sorry for the total OT, but I failed to build a RegEx with
>> the "help" (rotfl) of three different so called IA,
>> getting to nothing
>>
>> I am scanning an HTML document (not in javascript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY <p> whole tag.
>>
>> for whole I mean, starting from the <p and ending with the
>> corresponding </p>, but such paragrapha MAY (and may not)
>> contain
>
> This may not be possible at all. RegExes cannot match
> nesting pairs, i.e. if your <p></p> contains other <p></p>

This may be safely excluded. Other types of tags (like <i> or
<em>) may be nested, but not <p> itself. Is it still a problem?

> pairs then you have reached the end of what a RE is capable
> of. Also due to the way these tags are structured, you need
> at least negative lookahead for it, which also not all RE
> engines support.
>
> If you do
> <p.*</p>
>
> then the RE would catch from the first <p to the last </p>,
> hence you need to specify the .* with a lookahead like
> (?!</p>), or use a non-greedy RE.
Alas, the .* seems to fail on multiline text and tabs.
>
>> a long list of attributes, with or without zero or more \n
>> \r \t characters, valid, before the >.
>
>> (my test is the RegEx engine from KATE Editor with the
>> loaded HTML. If sb happens to be familiar with RegEx
>> supporting "invisible" characters ... I'd be very grateful
>> for any hint.
>
> This may as well be the problem. Some RE engines treat
> newline characters as special, i.e. it may be that Kate
> matches only *within* a line.
Hmm, interesting.
What other editor would you recommend, then?
>
> In short - maybe a RE engine is simply not a good tool to do
> that. Then use an XML parser instead.
>
>     Christian
>
--
1) Resistere, resistere, resistere.
2) Se tutti pagano le tasse, le tasse le pagano tutti
MarioCPPP

Re: [OT] Help for a RegEx

<uaemss$achd$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=959&group=comp.lang.c%2B%2B#959

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Thu, 3 Aug 2023 00:55:24 +0200
In-Reply-To: <uae6gp$76ua$1@dont-email.me>
 by: MarioCPPP - Wed, 2 Aug 2023 22:55 UTC

On 02/08/23 20:15, Paavo Helde wrote:
> 02.08.2023 20:38 Fonntuggnio kirjutas:
>>
>> Sorry for the total OT, but I failed to build a RegEx with
>> the "help" (rotfl) of three different so called IA,
>> getting to nothing
>>
>> I am scanning an HTML document (not in javascript, so I do
>> not have access to DOM nodes from inside) and I need to
>> match EVERY <p> whole tag.

It is HTML exported by LibreOffice from an .odt file, so rather
well formed (if not elegant).

>>
>> for whole I mean, starting from the <p and ending with the
>> corresponding </p>, but such paragrapha MAY (and may not)
>> contain
>
> I'm afraid HTML cannot be parsed with a regex in general.
> Also, the HTML rules are very lax, for example there is no
> such guarantee that there actually appears a corresponding
> terminating </p> tag.

I have manually read a lot of the generated code, and I found
no evidence of malformed markup.

>
> Also, there is no guarantee that the actual content is
> contained in the <p> tags, it might well be outside and all
> the <p> tags might actually be empty <p/>.

True, <td> and some other elements contain some rendered text. But
for my purpose just the paragraphs should suffice.

>
> For extracting the content of unknown pages

They are not unknown: they are .odt files exported as HTML by
LibreOffice.

> reliably you
> would probably need some kind of a state machine, with a
> fair knowledge of obscure HTML rules. Of course, there are
> libraries for that.
>

--
1) Resistere, resistere, resistere.
2) Se tutti pagano le tasse, le tasse le pagano tutti
MarioCPPP

Re: [OT] Help for a RegEx

<uaemum$achd$3@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=960&group=comp.lang.c%2B%2B#960

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Thu, 3 Aug 2023 00:56:22 +0200
In-Reply-To: <M7xyM.456117$GMN3.417214@fx16.iad>
 by: MarioCPPP - Wed, 2 Aug 2023 22:56 UTC

On 02/08/23 20:20, Scott Lurndal wrote:
> Paavo Helde <eesnimi@osa.pri.ee> writes:
>> 02.08.2023 20:38 Fonntuggnio kirjutas:
>>>
>>> Sorry for the total OT, but I failed to build a RegEx with the "help"
>>> (rotfl) of three different so called IA, getting to nothing
>>>
>>> I am scanning an HTML document (not in javascript, so I do not have
>>> access to DOM nodes from inside) and I need to match EVERY <p> whole tag.
>>>
>>> for whole I mean, starting from the <p and ending with the corresponding
>>> </p>, but such paragrapha MAY (and may not) contain
>>
>> I'm afraid HTML cannot be parsed with a regex in general. Also, the HTML
>> rules are very lax, for example there is no such guarantee that there
>> actually appears a corresponding terminating </p> tag.
>>
>> Also, there is no guarantee that the actual content is contained in the
>> <p> tags, it might well be outside and all the <p> tags might actually
>> be empty <p/>.
>>
>> For extracting the content of unknown pages reliably you would probably
>> need some kind of a state machine, with a fair knowledge of obscure HTML
>> rules. Of course, there are libraries for that.
>
> Best way to deal with HTML is using xslt processors. You may want
> to run the html text through a canonicalizer first.

Both terms were unknown to me until now, so thank you:
now I can do some searching.

>

--
1) Resistere, resistere, resistere.
2) Se tutti pagano le tasse, le tasse le pagano tutti
MarioCPPP

Re: [OT] Help for a RegEx

<uafotr$luld$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=961&group=comp.lang.c%2B%2B#961

From: eesn...@osa.pri.ee (Paavo Helde)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Thu, 3 Aug 2023 11:36:11 +0300
In-Reply-To: <uaemss$achd$2@dont-email.me>
 by: Paavo Helde - Thu, 3 Aug 2023 08:36 UTC

03.08.2023 01:55 MarioCPPP kirjutas:
> On 02/08/23 20:15, Paavo Helde wrote:
>>
>> For extracting the content of unknown pages
>
> they are not unknown : they are .odt exported as HTML, by LibreOffice.

Well, that makes things easier. If we can exclude some complications
like CDATA, HTML comments and nested <p> tags, then it might indeed be
possible to use a regex to extract some content.

Be sure to use a non-greedy regex to match the closest end tag </p>, and
the equivalent of /s or dotall for '.' to match newlines (or use
(.|\r|\n) instead of dot). This seems to work at first glance:

grep -Po '<p (.|\r|\n)*?</p>' abc.xhtml

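(-P is needed for grep to support non-greedy search. Note that grep
matches line by line by default; with GNU grep, adding -z makes it treat
the whole file as a single record, so that paragraphs spanning several
lines can match.)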
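
To see why the dotall equivalent matters, here is a hedged sketch in C++,
assuming std::regex with the default ECMAScript grammar, where '.' does
not match \r or \n. The helper and constant names are invented for the
illustration, and [\s\S] is a common alternative spelling of (.|\r|\n),
not something taken from the post above.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper: every non-overlapping match of `pattern` in `text`.
std::vector<std::string> collect_matches(const std::string& text,
                                         const std::string& pattern) {
    assert(!pattern.empty());
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), last;
         it != last; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// Plain dot: stops at the first line break, so multi-line paragraphs fail.
const char* const kDotOnly = R"re(<p .*?</p>)re";
// The alternation used with grep above: dot, or CR, or LF.
const char* const kDotOrNewline = R"re(<p (?:.|\r|\n)*?</p>)re";
// Equivalent "any character at all" class.
const char* const kAnyChar = R"re(<p [\s\S]*?</p>)re";
```

With a two-paragraph input whose first paragraph contains a line break,
kDotOnly finds only the single-line paragraph, while the other two
patterns find both. Some std::regex implementations can be slow or
recursion-heavy on large inputs, so treat this as an illustration rather
than a recommendation for processing a whole book.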

Re: [OT] Help for a RegEx

<uahf52$vcci$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=963&group=comp.lang.c%2B%2B#963

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Fri, 4 Aug 2023 02:01:38 +0200
In-Reply-To: <uafotr$luld$1@dont-email.me>
 by: MarioCPPP - Fri, 4 Aug 2023 00:01 UTC

On 03/08/23 10:36, Paavo Helde wrote:
> <p (.|\r|\n)*?</p>

Interesting. I tried this one and it detects most of the
paragraphs, except those that do not have attributes
within the opening <p> tag.

Is there a way to include those as well?

--
1) Resistere, resistere, resistere.
2) Se tutti pagano le tasse, le tasse le pagano tutti
MarioCPPP

Re: [OT] Help for a RegEx

<877cqbveje.fsf@bsb.me.uk>

https://www.novabbs.com/devel/article-flat.php?id=964&group=comp.lang.c%2B%2B#964

From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Fri, 04 Aug 2023 01:26:13 +0100
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me>
 by: Ben Bacarisse - Fri, 4 Aug 2023 00:26 UTC

MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:

> On 03/08/23 10:36, Paavo Helde wrote:
>> <p (.|\r|\n)*?</p>
>
> intresting. I tried this one and it detects most of paragraphs, except
> those that does not have attributes within the <p> opening tag.
>
> Is it there a way to also include those ones ?

PH's regex insists on a space after the "<p". Whilst this is not
exactly the same as requiring an attribute it will be effectively the
same. You could try

<p[ >](.|\r|\n)*?</p>

but I can't stress enough -- none of this can really work in all cases.
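
The effect of the character after "<p" can be sketched in C++ as follows,
assuming std::regex with the default ECMAScript grammar. The helper and
constant names are hypothetical, and the third pattern, which also
accepts a tab or line break after "<p", is an assumption added for
illustration rather than something proposed in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Illustrative helper: number of non-overlapping matches of `pattern`.
std::size_t count_matches(const std::string& text,
                          const std::string& pattern) {
    assert(!pattern.empty());
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::size_t count = 0;
    for (std::sregex_iterator it(text.begin(), text.end(), re), last;
         it != last; ++it) {
        ++count;
    }
    return count;
}

// Requires a space after "<p": misses a bare <p>.
const char* const kSpaceOnly = R"re(<p [\s\S]*?</p>)re";
// Space or '>' after "<p": also finds a bare <p>.
const char* const kSpaceOrBracket = R"re(<p[ >][\s\S]*?</p>)re";
// Assumption: any whitespace or '>' after "<p", so that an attribute
// list starting on a new line is found as well.
const char* const kWhitespaceOrBracket = R"re(<p[\s>][\s\S]*?</p>)re";
```

On a sample containing a bare <p>, a <p class="x">, a <p followed by a
line break and a <pre> element, the three patterns find 1, 2 and 3
paragraphs respectively, and none of them matches <pre>. Comments, CDATA
and unclosed tags would still defeat all three.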

--
Ben.

Re: [OT] Help for a RegEx

<uak4dt$1f8eg$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=968&group=comp.lang.c%2B%2B#968

From: NoliMihi...@libero.it (MarioCPPP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Sat, 5 Aug 2023 02:17:01 +0200
In-Reply-To: <877cqbveje.fsf@bsb.me.uk>
 by: MarioCPPP - Sat, 5 Aug 2023 00:17 UTC

On 04/08/23 02:26, Ben Bacarisse wrote:
> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>
>> On 03/08/23 10:36, Paavo Helde wrote:
>>> <p (.|\r|\n)*?</p>
>>
>> intresting. I tried this one and it detects most of paragraphs, except
>> those that does not have attributes within the <p> opening tag.
>>
>> Is it there a way to also include those ones ?
>
> PH's regex insists on a space after the "<p". Whilst this is not
> exactly the same as requiring an attribute it will be effectively the
> same. You could try
>
> <p[ >](.|\r|\n)*?</p>
>
> but I can't stress enough -- none of this can really work in all cases.
>

OK, thanks for the prudent disclaimer.

Your variant found 3883 matches, against 3644 found by the
previous one.

Visually scanning the book, it seems this RegEx is able to
find them ALL.

I'll try variants with h1...h6 to find the headings as well.

Many thanks for this valuable hint.

Now I'll try to understand the expression better :D
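
A heading variant might look like the sketch below. It assumes C++
std::regex with the default ECMAScript grammar and relies on the fact
that HTML defines heading levels h1 to h6 only. The function name
find_elements is hypothetical, and the tag fragment is pasted into the
pattern unescaped, so it is assumed to come from the programmer and not
from untrusted input.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Illustrative helper: find whole elements whose tag name matches
// `tag_pattern` (a regex fragment such as "p" or "h[1-6]").
std::vector<std::string> find_elements(const std::string& html,
                                       const std::string& tag_pattern) {
    assert(!tag_pattern.empty());
    // Group 1 captures the tag name; the backreference \1 requires the
    // closing tag to carry the same name, so <h2 ...> only ends at </h2>.
    const std::regex re("<(" + tag_pattern + ")[\\s>][\\s\\S]*?</\\1>");
    std::vector<std::string> elements;
    for (std::sregex_iterator it(html.begin(), html.end(), re), last;
         it != last; ++it) {
        elements.push_back(it->str());
    }
    return elements;
}
```

Calling find_elements(html, "h[1-6]") returns the headings and
find_elements(html, "p") the paragraphs; a heading opened as <h1> but
closed as </h2> is not matched at all. The same caveats as for the
paragraph regex apply.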

--
1) Resistere, resistere, resistere.
2) Se tutti pagano le tasse, le tasse le pagano tutti
MarioCPPP

Re: [OT] Help for a RegEx

<f321bcfa-a0fb-41f1-b115-df15744db44cn@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=1009&group=comp.lang.c%2B%2B#1009

From: wynii...@gmail.com (wij)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Sat, 12 Aug 2023 23:33:27 -0700 (PDT)
In-Reply-To: <uae4bc$6g1o$1@dont-email.me>
 by: wij - Sun, 13 Aug 2023 06:33 UTC

On Thursday, August 3, 2023 at 1:39:17 AM UTC+8, Fonntuggnio wrote:
> Sorry for the total OT, but I failed to build a RegEx with
> the "help" (rotfl) of three different so called IA, getting
> to nothing
>
> I am scanning an HTML document (not in javascript, so I do
> not have access to DOM nodes from inside) and I need to
> match EVERY <p> whole tag.
>
> for whole I mean, starting from the <p and ending with the
> corresponding </p>, but such paragrapha MAY (and may not)
> contain
>
> a long list of attributes, with or without zero or more \n
> \r \t characters, valid, before the >.

The following is an example program using class Regex (a class wrapper around
the regex(3) functions). The regular expression "<p>.*</p>" should do most of
the job, except that real HTML involves comments, nested tags, malformed
markup, etc.:

[]a_grep "<p>.*</p>" *html

-------------------------------------------------
/* Copyright is licensed by GNU LGPL, see file COPYING. by I.J.Wang 2023

Simulate grep command (Extended regular expression, ERE)

Build: make a_grep
*/
#include <Wy.stdio.h>
#include <Wy.unistd.h>
#include <Wy.regex.h>

using namespace Wy;

constexpr const char Red[]="\x1B[31m";
constexpr const char Reset[]= "\x1B[0m";

void sim_grep(Regex& rexpr, const char* fname)
{ Errno r;
String str;
::regmatch_t mbuf[5];
RegFile regf(fname,O_RDONLY);
RdBuf strm(regf);

for(;strm.is_eof()==false;) {
if((r=strm.read(str))!=Ok) {
WY_THROW(r);
}
if((r=rexpr.regexec(str.c_str(),mbuf,WY_CARR_SIZE(mbuf),0))!=Ok) {
continue;
}
cout << fname << ": ";
cout << StrSeg(str.begin(), str.begin()+mbuf[0].rm_so);
cout << Red << StrSeg(str.begin()+mbuf[0].rm_so,
str.begin()+mbuf[0].rm_eo) << Reset;
cout << StrSeg(str.begin()+mbuf[0].rm_eo, str.end());
}
};

int main(int argc, const char* argv[])
try {
  static const char usage[]="a_grep <pattern> <file>+" WY_ENDL;
  Errno r;

  if(argc<3) {
    cout << "Error: Invalid argument" WY_ENDL "Usage: "
         << usage << WY_ENDL;
    return -1;
  }
  const char* ptn= argv[1];
  Regex rexpr;

  // Compile the pattern as an ERE; on a pattern error print the message
  if((r=rexpr.regcomp(ptn,REG_EXTENDED))!=Ok) {
    if(r!=EBADMSG) {
      WY_THROW(r);
    }
    String str;
    if((r=rexpr.regerror(str))!=Ok) {
      WY_THROW(r);
    }
    cout << str << WY_ENDL;
    return -1;
  }

  // Scan every file named on the command line
  for(int i=2; i<argc; ++i) {
    const char* fname= argv[i];
    sim_grep(rexpr,fname);
  }

  cout << "OK" WY_ENDL;
  return 0;
} catch(const Errno& e) {
  cerr << wrd(e) << WY_ENDL;
  return -1;
} catch(...) {
  cerr << "main() caught(...)" WY_ENDL;
  throw;
};
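
A side note on the pattern itself, independent of the Wy library: ".*" is greedy, so on a line that holds two paragraphs "<p>.*</p>" swallows both, and the literal "<p>" cannot match an opening tag that carries attributes. The following is only a minimal sketch using the standard <regex> header (default ECMAScript grammar, which offers a lazy "*?" that standard POSIX ERE does not define) rather than regex(3); first_match is a hypothetical helper introduced purely for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: return the text of the first match of `pattern`
// in `line`, or an empty string when nothing matches.
std::string first_match(const std::string& line, const std::string& pattern)
{
    const std::regex re(pattern);          // default grammar: ECMAScript
    std::smatch m;
    return std::regex_search(line, m, re) ? m.str(0) : std::string();
}
```

With this helper, first_match("<p>a</p> and <p>b</p>", "<p>.*</p>") yields the whole line, the lazy form "<p>.*?</p>" stops at the first "</p>", and "<p>.*</p>" finds nothing in a line such as <p class="x">a</p>.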

Re: [OT] Help for a RegEx

<ubcp3g$280ql$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=1010&group=comp.lang.c%2B%2B#1010

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Mon, 14 Aug 2023 10:37:03 +0200
Organization: A noiseless patient Spider
Lines: 38
Message-ID: <ubcp3g$280ql$1@dont-email.me>
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me> <877cqbveje.fsf@bsb.me.uk>
<uak4dt$1f8eg$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 14 Aug 2023 08:37:04 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="34b48f9e9a649f7a2de30c5fc390b72f";
logging-data="2360149"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19t5aJuZBsrmgxUEjTOGOOk"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17
Cancel-Lock: sha1:mLXJ6b+0RqFq4lvAEfMMHCHDOwM=
In-Reply-To: <uak4dt$1f8eg$1@dont-email.me>
 by: jak - Mon, 14 Aug 2023 08:37 UTC

MarioCPPP wrote:
> On 04/08/23 02:26, Ben Bacarisse wrote:
>> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>>
>>> On 03/08/23 10:36, Paavo Helde wrote:
>>>> <p (.|\r|\n)*?</p>
>>>
>>> Interesting. I tried this one and it detects most paragraphs, except
>>> those that do not have attributes within the <p> opening tag.
>>>
>>> Is there a way to also include those?
>>
>> PH's regex insists on a space after the "<p".  Whilst this is not
>> exactly the same as requiring an attribute it will be effectively the
>> same.  You could try
>>
>>    <p[ >](.|\r|\n)*?</p>
>>
>> but I can't stress enough -- none of this can really work in all cases.
>>
>
> ok, tnx for the prudent disclaimer
>
> Your variant found 3883 matches against 3644 matches found by the
> previous one.
>
> scanning visually the book, it seems this RegEx is able to find ALL.
>
> I'll try variants with h1...h10 to find headings also.
>
> Many thanks for this precious hint
>
> Now I try to better understand the Expression :D
>

Could I ask you if you could kindly provide a piece of text on which you
are doing the tests? Thanks in advance.
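
In the absence of the original text, a tiny synthetic sample can show why the two patterns quoted above report different counts. This is only a sketch with std::regex (ECMAScript grammar); sample and count_matches are hypothetical names, and the markup is invented rather than taken from the book. The third pattern in the usage note, which uses \s, is an assumption about one way to also accept a tab or newline right after "<p".

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical test input: one <p> with a space-separated attribute, one bare
// <p> whose text spans two lines, and one <p> followed by a tab.
const std::string sample =
    "<p class=\"a\">first</p>\n"
    "<p>second,\nspanning two lines</p>\n"
    "<p\tstyle=\"x\">third</p>\n";

// Count the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);          // default grammar: ECMAScript
    return static_cast<std::size_t>(std::distance(
        std::sregex_iterator(text.begin(), text.end(), re),
        std::sregex_iterator()));
}
```

On this sample, "<p (.|\r|\n)*?</p>" counts 1 paragraph, "<p[ >](.|\r|\n)*?</p>" counts 2, and "<p[\s>][\s\S]*?</p>" counts 3 (in C++ source the backslashes must be doubled or a raw string literal used). None of these copes with comments, CDATA or a ">" inside an attribute value.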

Re: [OT] Help for a RegEx

<ubfq6v$2qgsi$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=1014&group=comp.lang.c%2B%2B#1014

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: NoliMihi...@libero.it (MarioCCCP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Tue, 15 Aug 2023 14:14:22 +0200
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <ubfq6v$2qgsi$1@dont-email.me>
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me> <877cqbveje.fsf@bsb.me.uk>
<uak4dt$1f8eg$1@dont-email.me> <ubcp3g$280ql$1@dont-email.me>
Reply-To: MarioCCCP@CCCP.MIR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64
Injection-Date: Tue, 15 Aug 2023 12:14:25 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="cad71fc7983b10fe67a5bfd7269a15a8";
logging-data="2966418"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19I6e/Wvd/WSRbEbFs9R/VS"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.14.0
Cancel-Lock: sha1:O3rL6K6dQjddYKsdiOf1HeZLydQ=
Content-Language: en-GB, it-IT
In-Reply-To: <ubcp3g$280ql$1@dont-email.me>
 by: MarioCCCP - Tue, 15 Aug 2023 12:14 UTC

On 14/08/23 10:37, jak wrote:
> MarioCPPP wrote:
>> On 04/08/23 02:26, Ben Bacarisse wrote:
>>> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>>>
>>>> On 03/08/23 10:36, Paavo Helde wrote:
>>>>> <p (.|\r|\n)*?</p>
>>>>
>>>> Interesting. I tried this one and it detects most paragraphs, except
>>>> those that do not have attributes within the <p> opening tag.
>>>>
>>>> Is there a way to also include those?
>>>
>>> PH's regex insists on a space after the "<p".  Whilst this is not
>>> exactly the same as requiring an attribute it will be effectively the
>>> same.  You could try
>>>
>>>    <p[ >](.|\r|\n)*?</p>
>>>
>>> but I can't stress enough -- none of this can really work in all cases.
>>>
>>
>> ok, tnx for the prudent disclaimer
>>
>> Your variant found 3883 matches against 3644 matches found by the
>> previous one.
>>
>> scanning visually the book, it seems this RegEx is able to find ALL.
>>
>> I'll try variants with h1...h10 to find headings also.
>>
>> Many thanks for this precious hint
>>
>> Now I try to better understand the Expression :D
>>
>
> Could I ask you if you could kindly provide a piece of text on which you
> are doing the tests? Thanks in advance.
>
>
Well, actually no: it's an .ODF book converted to .HTML,
written in LbO, with just a main structural index, headings
up to Title3 rank, a few tables, and some 300'000 words of
text. But, being unpublished and not going to be given away
for free, I won't share the content.

The <p> tags are often heavily loaded with font style
attributes and some nested <span> tags, and the text is also
full of nested <i> and <b> tags (italics and bold). It's a
very large document, but not complex in structure. Just
badly designed (imvho).
For example, LbO does not smartly track edits that could
be "collapsed". It just dumbly obeys and records formatting
commands as they are, so it does not produce very "CLEAN" HTML.
But I won't revise it manually, it's huge. And actually the
books to be analyzed and steganized are SIX, not just one.

I have decided to use a true XML parser though, even if the
RegEx worked well: the program is growing too complex for a
string-only approach, and I need some true "DOM-like" aware
approach to edit node contents.

I am also considering abandoning GAMBAS and doing that in
JavaScript, which is able to act upon the HTML from inside
and for which injecting stuff is native.

Sorry if my reply is a bit frustrating, but those six books
are 30 years of my life, I've poured blood into them :D
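
For readers who still want a string-only first pass over the headings mentioned above, the same lazy-match idea can be sketched with a back-reference so that the closing tag has to carry the same name as the opening one. This is a rough sketch with std::regex, not a substitute for the parser-based approach: find_headings is a hypothetical helper, and the pattern assumes that no attribute value contains ">" (note that HTML itself only defines h1 to h6).

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collect (level, raw inner HTML) for every
// <h1>..<h6> element found in `html`.
std::vector<std::pair<int, std::string>> find_headings(const std::string& html)
{
    // Group 1 is the tag name, group 2 the content; \1 forces the closing
    // tag to repeat the name captured from the opening tag.
    static const std::regex re(R"(<(h[1-6])(?:\s[^>]*)?>([\s\S]*?)</\1>)");

    std::vector<std::pair<int, std::string>> out;
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.emplace_back(m.str(1)[1] - '0', m.str(2));  // "h2" -> level 2
    }
    return out;
}
```

The inner HTML is returned untouched, so nested <i>, <b> or <span> tags are still present in the second member of each pair and would need further processing.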


--
1) Resist, resist, resist.
2) If everyone pays taxes, the taxes are paid by everyone
MarioCPPP

Re: [OT] Help for a RegEx

<ubj7p0$3dppf$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=1029&group=comp.lang.c%2B%2B#1029

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Wed, 16 Aug 2023 21:24:14 +0200
Organization: A noiseless patient Spider
Lines: 80
Message-ID: <ubj7p0$3dppf$1@dont-email.me>
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me> <877cqbveje.fsf@bsb.me.uk>
<uak4dt$1f8eg$1@dont-email.me> <ubcp3g$280ql$1@dont-email.me>
<ubfq6v$2qgsi$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 16 Aug 2023 19:24:17 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="af226d40079b709cb05d280ab7508a93";
logging-data="3598127"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/oWdVBBIpkJBgTNI5foUUN"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17
Cancel-Lock: sha1:kyP2IViG7L8sHsX/qRwiHdp+5ts=
In-Reply-To: <ubfq6v$2qgsi$1@dont-email.me>
 by: jak - Wed, 16 Aug 2023 19:24 UTC

MarioCCCP wrote:
> On 14/08/23 10:37, jak wrote:
>> MarioCPPP wrote:
>>> On 04/08/23 02:26, Ben Bacarisse wrote:
>>>> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>>>>
>>>>> On 03/08/23 10:36, Paavo Helde wrote:
>>>>>> <p (.|\r|\n)*?</p>
>>>>>
>>>>> Interesting. I tried this one and it detects most paragraphs, except
>>>>> those that do not have attributes within the <p> opening tag.
>>>>>
>>>>> Is there a way to also include those?
>>>>
>>>> PH's regex insists on a space after the "<p".  Whilst this is not
>>>> exactly the same as requiring an attribute it will be effectively the
>>>> same.  You could try
>>>>
>>>>    <p[ >](.|\r|\n)*?</p>
>>>>
>>>> but I can't stress enough -- none of this can really work in all cases.
>>>>
>>>
>>> ok, tnx for the prudent disclaimer
>>>
>>> Your variant found 3883 matches against 3644 matches found by the
>>> previous one.
>>>
>>> scanning visually the book, it seems this RegEx is able to find ALL.
>>>
>>> I'll try variants with h1...h10 to find headings also.
>>>
>>> Many thanks for this precious hint
>>>
>>> Now I try to better understand the Expression :D
>>>
>>
>> Could I ask you if you could kindly provide a piece of text on which you
>> are doing the tests? Thanks in advance.
>>
>
> Well, actually no: it's an .ODF book converted to .HTML, written in
> LbO, with just a main structural index, headings up to Title3 rank, a
> few tables, and some 300'000 words of text. But, being unpublished and
> not going to be given away for free, I won't share the content.
>
> The <p> tags are often heavily loaded with font style attributes and
> some nested <span> tags, and the text is also full of nested <i> and <b>
> tags (italics and bold). It's a very large document, but not complex in
> structure. Just badly designed (imvho).
> For example, LbO does not smartly track edits that could be
> "collapsed". It just dumbly obeys and records formatting commands as they
> are, so it does not produce very "CLEAN" HTML. But I won't revise it
> manually, it's huge. And actually the books to be analyzed and
> steganized are SIX, not just one.
>
> I have decided to use a true XML parser though, even if the RegEx worked
> well: the program is growing too complex for a string-only approach,
> and I need some true "DOM-like" aware approach to edit node contents.
>
> I am also considering abandoning GAMBAS and doing that in JavaScript,
> which is able to act upon the HTML from inside and for which injecting
> stuff is native.
>
> Sorry if my reply is a bit frustrating, but those six books are 30 years
> of my life, I've poured blood into them :D
>
>
>
>
>

Hi Mario,
absolutely no problem. Maybe I misunderstood the thread: I thought
you were referring to an XML document useful for debugging, that is, a
single document containing all the characteristics, peculiarities and
exceptions of XML that would allow exhaustive debugging of a parser.
If I had realized that you were referring to your personal document, I
would not have allowed myself to make this request. Excuse me.

Re: [OT] Help for a RegEx

<ubofd3$c71g$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=1060&group=comp.lang.c%2B%2B#1060

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: NoliMihi...@libero.it (MarioCCCP)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Fri, 18 Aug 2023 21:05:07 +0200
Organization: A noiseless patient Spider
Lines: 35
Message-ID: <ubofd3$c71g$1@dont-email.me>
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me> <877cqbveje.fsf@bsb.me.uk>
<uak4dt$1f8eg$1@dont-email.me> <ubcp3g$280ql$1@dont-email.me>
<ubfq6v$2qgsi$1@dont-email.me> <ubj7p0$3dppf$1@dont-email.me>
Reply-To: MarioCCCP@CCCP.MIR
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 18 Aug 2023 19:05:08 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="aa8ad2c679e2b4872806d9102795807e";
logging-data="400432"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX198+R1hz8Pv8gEoUeN1s3tW"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.14.0
Cancel-Lock: sha1:93anWp5FCF2Of16SwxWMRT4WDdg=
In-Reply-To: <ubj7p0$3dppf$1@dont-email.me>
Content-Language: en-GB, it-IT
 by: MarioCCCP - Fri, 18 Aug 2023 19:05 UTC

On 16/08/23 21:24, jak wrote:
> MarioCCCP wrote:
>> On 14/08/23 10:37, jak wrote:
>>> MarioCPPP wrote:
>>>> On 04/08/23 02:26, Ben Bacarisse wrote:
>>>>> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>>>>>
>>>>>> On 03/08/23 10:36, Paavo Helde wrote:
>>>>>>> <p (.|\r|\n)*?</p>
>>>>>>

>
> Hi Mario,
> absolutely no problem. Maybe I misunderstood the thread: I thought
> you were referring to an XML document useful for debugging, that is, a
> single document containing all the characteristics, peculiarities and
> exceptions of XML that would allow exhaustive debugging of a parser.
> If I had realized that you were referring to your personal document, I
> would not have allowed myself to make this request. Excuse me.
>

Don't mention it! My English is poor at times, so I
cannot always explain myself very well :D

--
1) Resist, resist, resist.
2) If everyone pays taxes, the taxes are paid by everyone
MarioCPPP

Re: [OT] Help for a RegEx

<ubpqbk$na6u$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=1064&group=comp.lang.c%2B%2B#1064

Path: i2pn2.org!i2pn.org!eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: nos...@please.ty (jak)
Newsgroups: comp.lang.c++
Subject: Re: [OT] Help for a RegEx
Date: Sat, 19 Aug 2023 09:18:11 +0200
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <ubpqbk$na6u$1@dont-email.me>
References: <uae4bc$6g1o$1@dont-email.me> <uae6gp$76ua$1@dont-email.me>
<uaemss$achd$2@dont-email.me> <uafotr$luld$1@dont-email.me>
<uahf52$vcci$1@dont-email.me> <877cqbveje.fsf@bsb.me.uk>
<uak4dt$1f8eg$1@dont-email.me> <ubcp3g$280ql$1@dont-email.me>
<ubfq6v$2qgsi$1@dont-email.me> <ubj7p0$3dppf$1@dont-email.me>
<ubofd3$c71g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 19 Aug 2023 07:18:12 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="93303eef0e4bc227d65e3e087c571482";
logging-data="764126"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18+Kulx+wy88kzGgNEhwNtx"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
Firefox/91.0 SeaMonkey/2.53.17
Cancel-Lock: sha1:mOiI+hJF7ctxPNRv7HSHfcTBfDM=
In-Reply-To: <ubofd3$c71g$1@dont-email.me>
 by: jak - Sat, 19 Aug 2023 07:18 UTC

MarioCCCP wrote:
> On 16/08/23 21:24, jak wrote:
>> MarioCCCP wrote:
>>> On 14/08/23 10:37, jak wrote:
>>>> MarioCPPP wrote:
>>>>> On 04/08/23 02:26, Ben Bacarisse wrote:
>>>>>> MarioCPPP <NoliMihiFrangereMentulam@libero.it> writes:
>>>>>>
>>>>>>> On 03/08/23 10:36, Paavo Helde wrote:
>>>>>>>> <p (.|\r|\n)*?</p>
>>>>>>>
>
>
>>
>> Hi Mario,
>> absolutely no problem. Maybe I misunderstood the thread: I thought
>> you were referring to an XML document useful for debugging, that is, a
>> single document containing all the characteristics, peculiarities and
>> exceptions of XML that would allow exhaustive debugging of a parser.
>> If I had realized that you were referring to your personal document, I
>> would not have allowed myself to make this request. Excuse me.
>>
>
> Don't mention it! My English is poor at times, so I cannot always
> explain myself very well :D
>

Then I guess we'll have to talk the way we eat XDD

