
Vanilla regex

From: tux...@mailinator.net (Tuxedo)
Newsgroups: comp.unix.shell
Subject: Vanilla regex
Date: Sun, 02 Jul 2023 19:14:01 +0200
Message-ID: <u7sbtf$22fas$1@solani.org>

Can anyone assist with a regex using fairly standard and cross-system
compatible methods?

It's for files containing wiki markup segments as follows:

[[File:Some File Name 0123.jpg|800px]]

Or maybe:

[[File:Some other file.jpg|250px]]

Or maybe:

[[File:Another file.jpg |600px|thumb]]

etc.

The unique identifiers for the relevant parts are the start of "[[File:"
followed by ASCII making up file names ending in some file type, such as
.jpg, .JPEG, .Jpeg etc. .PNG, .gif, followed by a "|" pipe or closing "]]"
brackets.

The regex needs to grab the filename portion, eg. "Another file.jpg", keep
it in a variable and replace any spaces with underscore(s) within so this
updated variable becomes "Another_file.jpg"

Thereafter, within the existing markup, for example:

[[File:Another file.jpg |600px|thumb]]

Insert the following markup after the first pipe:

link=https://example.com/display.pl?Another_file.jpg|

So the final markup becomes:

[[File:Another file.jpg |link=https://example.com/display.pl?Another_file.jpg|600px|thumb]]

The spaces in the original "File: ..." name parts can remain as it's valid
markup but the underscores need to exist in link=... strings.

There may be instances where "|link=" occurrences already exist between
the opening "[[File:" and its closing "]]" brackets. The regex should
skip such instances so the procedure can be run repeatedly without
conflicting with earlier replacements.

Many thanks for any example code and ideas.

Tuxedo

Re: Vanilla regex

From: janis_pa...@hotmail.com (Janis Papanagnou)
Subject: Re: Vanilla regex
Date: Sun, 2 Jul 2023 20:02:44 +0200
Message-ID: <u7se44$3ck1t$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org>

On 02.07.2023 19:14, Tuxedo wrote:
> Can anyone assist with a regex using fairly standard and cross-system
> compatible methods?
> [...]

You can do such replacements in modern shells, but since "using fairly
standard" isn't exactly an exact specification I provide an example in
(standard) awk...

awk '
BEGIN {
p = "link=https://example.com/display.pl?"
}
$0 !~ p && match($0,/\[\[File:[^]|]+/) {
f = substr($0, RSTART+7, RLENGTH-7)
sub(/ $/, "", f)
gsub(/ /, "_", f)
sub(/[|]/, "|" p f "|")
}
1
'

The first sub-condition skips lines that already contain the pattern
defined in variable p.
The second condition does a substitution where the "[[File:" pattern appears.
It strips a trailing space so that it doesn't get replaced by '_',
and finally composes the link.

This code operates on lines containing only one of these patterns.
It assumes no spaces between '[[' and 'File:'.
It's also unclear whether you need to change multiple patterns in a
line or anything else, so it might need some tweaking or refinement.
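
To make the substr() arithmetic above easier to follow, here is a minimal
sketch of what match() leaves in RSTART and RLENGTH. The sample line (with
a leading "x " so that RSTART is not 1) and the function name extract are
assumptions made for illustration only; they are not part of the posted
script.

```shell
# Hypothetical sample line; the leading "x " just shows that RSTART need not be 1.
line='x [[File:Another file.jpg |600px|thumb]]'

# Hypothetical helper: print RSTART, RLENGTH and the extracted name in <...>.
extract() {
    awk 'match($0, /\[\[File:[^]|]+/) {
        # "[[File:" is 7 characters long, so skip 7 and shorten the length by 7.
        printf "%d %d <%s>\n", RSTART, RLENGTH, substr($0, RSTART+7, RLENGTH-7)
    }'
}

printf '%s\n' "$line" | extract
```

The name printed between the angle brackets still carries the space that
preceded the pipe, which is why the script strips it before replacing the
remaining spaces with underscores.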

Janis

Re: Vanilla regex

From: janis_pa...@hotmail.com (Janis Papanagnou)
Subject: Re: Vanilla regex
Date: Sun, 2 Jul 2023 20:09:03 +0200
Message-ID: <u7sefv$3cle6$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>

On 02.07.2023 20:02, Janis Papanagnou wrote:
> [...]
>
> awk '
> BEGIN {
> p = "link=https://example.com/display.pl?"
> }
> $0 !~ p && match($0,/\[\[File:[^]|]+/) {
> f = substr($0, RSTART+7, RLENGTH-7)
> sub(/ $/, "", f)

sub(/ +$/, "", f)

In case there's more than one spurious space after the file extension.

> gsub(/ /, "_", f)
> sub(/[|]/, "|" p f "|")
> }
> 1
> '

Re: Vanilla regex

From: tux...@mailinator.net (Tuxedo)
Subject: Re: Vanilla regex
Date: Mon, 03 Jul 2023 14:49:13 +0200
Message-ID: <u7ugp3$23jpd$1@solani.org>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me> <u7sefv$3cle6$1@dont-email.me>

Janis Papanagnou wrote:

> [...]
>> This code operates on lines containing only one of these patterns.
>> It assumes no spaces between '[[' and 'File:'.
>> It's also unclear whether you need to change multiple patterns in a
>> line or anything else, so it might need some tweaking or refinement.

Many thanks for posting the example along with the explanatory notes.

There could be a space between '[[' and 'File:' as it's not invalid markup,
which I did not know until testing it now. It's however unlikely.

And there could be more than one pattern instance on a single line, but
again, it's unlikely. Multiple patterns are almost always on different
lines.

I will bear this in mind and tailor to my purpose.
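
For the two cases just mentioned (spaces after '[[' and more than one
instance per line), a possible adaptation is sketched below. It is only a
sketch under stated assumptions, not a tested solution for real wiki data:
it assumes no ']' occurs inside a tag, it leaves tags without any pipe
untouched, and it treats a tag as already processed if it contains the
literal text "|link=". The function name add_links is made up for this
example.

```shell
# Hypothetical helper: add link= to every [[File:...]] tag on each input line.
add_links() {
    awk '
    BEGIN { p = "link=https://example.com/display.pl" }
    {
        out = ""
        rest = $0
        # Consume the line one [[File:...]] tag at a time.
        while (match(rest, /\[\[ *File:[^]]*\]\]/)) {
            tag  = substr(rest, RSTART, RLENGTH)
            out  = out substr(rest, 1, RSTART - 1)
            rest = substr(rest, RSTART + RLENGTH)
            bar  = index(tag, "|")
            # Only touch tags that have a pipe and no link= yet (assumption).
            if (bar && !index(tag, "|link=")) {
                f = substr(tag, 1, bar - 1)
                sub(/^\[\[ *File: */, "", f)   # drop the tag prefix
                sub(/ +$/, "", f)              # drop spaces before the pipe
                gsub(/ /, "_", f)              # remaining spaces become underscores
                tag = substr(tag, 1, bar) p "?" f "|" substr(tag, bar + 1)
            }
            out = out tag
        }
        print out rest
    }'
}

printf '%s\n' 'a [[ File:One two.jpg|250px]] b [[File:Four five.gif |600px|thumb]]' | add_links
```

Building the new tag by string concatenation rather than with sub() also
avoids surprises if a file name ever contains '&', which sub() would
expand to the matched text in its replacement argument.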

Tuxedo

Re: Vanilla regex

From: mortons...@gmail.com (Ed Morton)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 10:39:49 -0500
Message-ID: <u7uq45$3nutv$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>

On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
<snip>
> awk '
> BEGIN {
> p = "link=https://example.com/display.pl?"

That `?` at the end means "0 or 1 occurrences of the preceding
expression". I suspect you meant to make the `?` literal and you should
also make the `.`s literal:

p = "link=https://example[.]com/display[.]pl[?]"

Also consider whether or not word boundaries or, more likely, some other
method of avoiding undesirable substring matches is required.

Regards,

Ed.

Re: Vanilla regex

From: mortons...@gmail.com (Ed Morton)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 10:50:21 -0500
Message-ID: <u7uqnu$3o2bi$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org>

On 7/2/2023 12:14 PM, Tuxedo wrote:
> Can anyone assist with a regex using fairly standard and cross-system
> compatible methods?

There are 2 different POSIX regex standards:

BRE:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03
ERE:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04

and another fairly commonly used regex notation:

PCRE: https://www.pcre.org/

Also, every tool has its own options, extensions, delimiters,
backreferences support, other enhancements/considerations, etc. for
whatever regex flavor(s) it supports.

So there is no "fairly standard and cross-system compatible" regex
notation but Janis gave you an answer using AWK and only using POSIX
constructs within that script and that's the best you could do regarding
portability and usability as AWK is the most powerful mandatory POSIX
tool (i.e. must be present on all Unix boxes) for manipulating text.
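
As a small illustration of how the two POSIX flavors differ in practice,
the sketch below matches the image extensions from the original question
once as an ERE and once as BREs. The file names and the function names ere
and bre are hypothetical; only the POSIX grep options -i, -E and -e are
used.

```shell
# Hypothetical file names, one per line.
names='photo.jpg
scan.JPEG
logo.png
notes.txt'

# ERE (grep -E): "?", "|" and "(...)" are operators.
ere() { grep -iE '\.(jpe?g|png|gif)$'; }

# BRE (plain grep): POSIX BRE has no alternation operator,
# so one pattern is passed per -e option instead.
bre() { grep -i -e '\.jpg$' -e '\.jpeg$' -e '\.png$' -e '\.gif$'; }

printf '%s\n' "$names" | ere
printf '%s\n' "$names" | bre
```

Both commands should select the same three image names and skip
notes.txt; the difference is only in how the pattern has to be spelled.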

Regards,

Ed.

Re: Vanilla regex

From: mortons...@gmail.com (Ed Morton)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 10:56:34 -0500
Message-ID: <u7ur3i$3o2fs$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>
<u7sefv$3cle6$1@dont-email.me> <u7ugp3$23jpd$1@solani.org>

On 7/3/2023 7:49 AM, Tuxedo wrote:
<snip>
> There could be a space between '[[' and 'File:' as it's not invalid markup,
> which I did not know until testing it now. It's however unlikely.
>
> And there could be more than one pattern instance on a single line, but
> again, it's unlikely. Multiple patterns are almost always on different
> lines.

If you post a block of text containing concise, testable sample input
that covers ALL of your use cases and a separate block of text showing
the exact output you expect from that input then we'll have something we
can copy/paste to easily test a potential solution against and so we can
help you.

Regards,

Ed.

Re: Vanilla regex

From: janis_pa...@hotmail.com (Janis Papanagnou)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 18:06:32 +0200
Message-ID: <u7urm8$3o57l$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>
<u7uq45$3nutv$1@dont-email.me>

On 03.07.2023 17:39, Ed Morton wrote:
> On 7/2/2023 1:02 PM, Janis Papanagnou wrote:
> <snip>
>> awk '
>> BEGIN {
>> p = "link=https://example.com/display.pl?"
>
> That `?` at the end means "0 or 1 occurrences of the preceding
> expression". I suspect you meant to make the `?` literal

Actually, no. - What I meant was to keep the _sample_ *simple*.
Since regexp meta-symbols in strings that contain patterns will
become bulky.

What I considered (while preserving the simplicity) was to just
remove the '?' from the string...

p = "link=https://example.com/display.pl"

and later write

sub(/[|]/, "|" p "?" f "|")

but as I wrote, it's not worth the hassle for the sample where
more important questions were still unclear (as previously said).

Of course you could also just simplify writing (without the dot)
for the match

p = "link=https://example.com/display"

and this will still fail if this pattern appears elsewhere in the
data. And how likely is it that the two dots in
link=https://example.com/display.pl
will match, say, link=https://exampleXcom/displayYpl? Not very,
don't you think?

Janis

> and you should
> also make the `.`s literal:
>
> p = "link=https://example[.]com/display[.]pl[?]"
>
> Also consider whether or not word boundaries or, more likely, some other
> method of avoiding undesirable substring matches is required.
>
> Regards,
>
> Ed.
>

Re: Vanilla regex

From: janis_pa...@hotmail.com (Janis Papanagnou)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 19:57:35 +0200
Message-ID: <u7v26f$3opuq$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>
<u7uq45$3nutv$1@dont-email.me> <u7urm8$3o57l$1@dont-email.me>

On 03.07.2023 18:06, Janis Papanagnou wrote:
> On 03.07.2023 17:39, Ed Morton wrote:
>> [...]
>
>> and you should also make the `.`s literal:
>>
>> p = "link=https://example[.]com/display[.]pl[?]"

I forgot to point out (in case it was not obvious)...

Variable p was used in two contexts, as pattern and as variable
to be printed literally; so above expression would not qualify.

Janis

>> [...]

Re: Vanilla regex

From: mortons...@gmail.com (Ed Morton)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 13:07:09 -0500
Message-ID: <u7v2od$3os7v$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>
<u7uq45$3nutv$1@dont-email.me> <u7urm8$3o57l$1@dont-email.me>
<u7v26f$3opuq$1@dont-email.me>

On 7/3/2023 12:57 PM, Janis Papanagnou wrote:
> On 03.07.2023 18:06, Janis Papanagnou wrote:
>> On 03.07.2023 17:39, Ed Morton wrote:
>>> [...]
>>
>>> and you should also make the `.`s literal:
>>>
>>> p = "link=https://example[.]com/display[.]pl[?]"
>
> I forgot to point out (in case it was not obvious)...
>
> Variable p was used in two contexts, as pattern and as variable
> to be printed literally; so above expression would not qualify.

I confess I just glanced at the script. Then leave the definition as it
was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
literally.
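
The difference between the two tests can be seen in a minimal sketch. The
first input line is the hypothetical near-miss mentioned earlier in the
thread; the second line and the function name compare are assumptions
added for illustration.

```shell
# Hypothetical helper: report whether p matches as a regex and as a literal string.
compare() {
    awk -v p='link=https://example.com/display.pl' '{
        r = ($0 ~ p) ? "yes" : "no"          # p interpreted as a dynamic regex
        l = index($0, p) ? "yes" : "no"      # p compared as a literal string
        print "regex: " r ", literal: " l
    }'
}

printf '%s\n' 'link=https://exampleXcom/displayYpl' | compare
printf '%s\n' 'link=https://example.com/display.pl?x.jpg' | compare
```

With index() the dots in p only match real dots, so the first line is no
longer treated as already processed.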

Ed.

Re: Vanilla regex

From: janis_pa...@hotmail.com (Janis Papanagnou)
Subject: Re: Vanilla regex
Date: Mon, 3 Jul 2023 21:15:47 +0200
Message-ID: <u7v6p4$3pbqr$1@dont-email.me>
References: <u7sbtf$22fas$1@solani.org> <u7se44$3ck1t$1@dont-email.me>
<u7uq45$3nutv$1@dont-email.me> <u7urm8$3o57l$1@dont-email.me>
<u7v26f$3opuq$1@dont-email.me> <u7v2od$3os7v$1@dont-email.me>

On 03.07.2023 20:07, Ed Morton wrote:
>
> I confess I just glanced at the script. Then leave the definition as it
> was and change `$0 !~ p` to `!index($0,p)` if you want `p` treated
> literally.

Yes, indeed. It occurred to me only after my post. Maybe it's time for
an update now to add and summarize all the little details in one code
sample...

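awk '
BEGIN { p = "link=https://example.com/display.pl" }

# Skip lines that already contain the link; otherwise locate "[[File:<name>".
!index($0, p) && match($0, /\[\[File:[^]|]+/) {
    f = substr($0, RSTART+7, RLENGTH-7)   # file name without the "[[File:" prefix
    sub(/ +$/, "", f)                     # strip trailing spaces
    gsub(/ /, "_", f)                     # remaining spaces become underscores
    sub(/[|]/, "|" p "?" f "|")           # insert the link after the first "|" on the line
}

{ print }
'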

Of course, the OP has meanwhile mentioned a couple more requirements, and
some further tweaks could be added (replacing the ' ' by a character
class, adding support for white-space after "File:", etc.), but for the
intended purpose it's sufficient, I suppose.

Janis
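
As a usage sketch for the summarized script, the block below wraps it in a
shell function and runs it twice over two of the sample lines from the
original question, to check the requirement that a repeated run leaves
already linked lines alone. The function name add_link and the variable
names are assumptions made for this example.

```shell
# Hypothetical wrapper around the summarized script from the thread.
add_link() {
    awk '
    BEGIN { p = "link=https://example.com/display.pl" }
    !index($0, p) && match($0, /\[\[File:[^]|]+/) {
        f = substr($0, RSTART+7, RLENGTH-7)
        sub(/ +$/, "", f)
        gsub(/ /, "_", f)
        sub(/[|]/, "|" p "?" f "|")
    }
    { print }
    '
}

sample='[[File:Some File Name 0123.jpg|800px]]
[[File:Another file.jpg |600px|thumb]]'

once=$(printf '%s\n' "$sample" | add_link)     # first pass adds the links
twice=$(printf '%s\n' "$once" | add_link)      # second pass should be a no-op
printf '%s\n' "$once"
[ "$once" = "$twice" ] && echo "second pass changed nothing"
```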
