
Subject                                         Author
* Byte-offset of lines in a text file           Janis Papanagnou
+* Re: Byte-offset of lines in a text file      Janis Papanagnou
|`* Re: Byte-offset of lines in a text file     Kenny McCormack
| `* Re: Byte-offset of lines in a text file    Janis Papanagnou
|  `- Re: Byte-offset of lines in a text file   Spiros Bousbouras
+* Re: Byte-offset of lines in a text file      Lew Pitcher
|+- Re: Byte-offset of lines in a text file     Lew Pitcher
|`* Re: Byte-offset of lines in a text file     Lew Pitcher
| +* Re: Byte-offset of lines in a text file    Lew Pitcher
| |+- Re: Byte-offset of lines in a text file   Janis Papanagnou
| |`- Re: Byte-offset of lines in a text file   Janis Papanagnou
| `- Re: Byte-offset of lines in a text file    Janis Papanagnou
+- Re: Byte-offset of lines in a text file      Janis Papanagnou
`- Re: Byte-offset of lines in a text file      Jalen Q

Byte-offset of lines in a text file

<u0ebpb$2ulen$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6071&group=comp.unix.shell#6071
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 13:03:07 +0200
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <u0ebpb$2ulen$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Apr 2023 11:03:07 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2a13eec3e7979a18f2ffcdacecfa2424";
logging-data="3102167"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+CgLMrAHTtTpiIS0UnyxFD"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:Hdsrz/dHKJd2hys5L2q9ZmV79MA=
X-Mozilla-News-Host: news://news.eternal-september.org:119
X-Enigmail-Draft-Status: N1110
 by: Janis Papanagnou - Mon, 3 Apr 2023 11:03 UTC

I just needed to determine the byte-offsets of all lines in a text file
to create an index file.

On a quick search I couldn't find any Unix tool/shell solution[*] so I
wrote this quick hack[**] that I share here in case anyone's interested

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while read -u3
do echo $( 3<# )
done

Even though that code is fast enough for my (MB-sized) files, using
Kornshell's pattern seek-redirections to locate the newlines in the
file seems to be significantly faster than the 'read'-based approach:

#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done

On a 320 MB test file the first script requires ~8 seconds and the
second one ~0.3 seconds.

Janis

[*] Does 'sed' maybe support such a function? Or is there any other
standard tool I missed?

[**] Based on a feature of newer versions of Kornshell (e.g. ksh93u+).

Re: Byte-offset of lines in a text file

<u0edfh$2utkp$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6072&group=comp.unix.shell#6072
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 13:32:01 +0200
Organization: A noiseless patient Spider
Lines: 11
Message-ID: <u0edfh$2utkp$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Apr 2023 11:32:01 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2a13eec3e7979a18f2ffcdacecfa2424";
logging-data="3110553"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/I/CKXX0EfIL8v41H0huP8"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:rTPSPDYNsiMukGf7J0GRz5/N/PY=
In-Reply-To: <u0ebpb$2ulen$1@dont-email.me>
 by: Janis Papanagnou - Mon, 3 Apr 2023 11:32 UTC

On 03.04.2023 13:03, Janis Papanagnou wrote:
> I just needed to determine the byte-offsets of all lines in a text file
> to create an index file.
> [...]

Don't use the second variant (the one using 3<#$'\n' ), it is
*not* running reliably, as I noticed with more tests! The first
one (using read -u3 ) runs just fine, though.

Janis

Re: Byte-offset of lines in a text file

<u0elu0$2vgkm$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6073&group=comp.unix.shell#6073
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: lew.pitc...@digitalfreehold.ca (Lew Pitcher)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 13:56:16 -0000 (UTC)
Organization: The Pitcher Digital Freehold
Lines: 144
Message-ID: <u0elu0$2vgkm$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 3 Apr 2023 13:56:16 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="72e53f448f44a44e38c260ae90e020ad";
logging-data="3130006"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Be28mdO3S7QvARIqkulOdXwlCpBqfxxM="
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
git://git.gnome.org/pan2)
Cancel-Lock: sha1:AVhyQ0V2DB5fN/T6Pnwh/ze+BJE=
 by: Lew Pitcher - Mon, 3 Apr 2023 13:56 UTC

Hi, Janis

On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:

> I just needed to determine the byte-offsets of all lines in a text file
> to create an index file.
>
> On a quick search I couldn't find any Unix tool/shell solution[*] so I
> wrote this quick hack[**] that I share here in case anyone's interested
>
> #!/bin/ksh
>
> # byteoffset - create a byte-offset list for the lines in a given file
> #
> # Usage: byteoffset filename
>
> f=${1:?"Usage: ${0##*/} filename"}
> exec 3<"$f"
>
> echo $( 3<# )
> while read -u3
> do echo $( 3<# )
> done

[snip]

Your script intrigued me. While I don't normally use kornshell, I decided
to try the script out to see what it did.

I have a multi-line lorem ipsum test file that I fed to your script, and
it came up with some funny numbers. Specifically, it missed the first
few lines of the file. I double-checked, both visually and with grep[1]
egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
and it appears that your script somehow ignores the first ~600 bytes
of my test file.

I don't have an explanation for this behaviour.

[1] Your script resulted in
09:50 $ cat ./bo.ksh
#!/bin/ksh

# byteoffset - create a byte-offset list for the lines in a given file
#
# Usage: byteoffset filename

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

echo $( 3<# )
while 3<#$'\n'
do echo $( 3<# )
done
09:50 $ ./bo.ksh lorem_ipsum.txt
0
649
710
767
832
896
957
1019
1085
1152
1153
1213
1277
1340
1401
1464
1530
1594
1660
1724
1788
1850
1913
1925
1926
2478
2479
09:50 $
and egrep tells me
09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
0
57
113
170
229
285
341
396
452
513
574
636
648
649
710
767
832
896
957
1019
1085
1152
1153
1213
1277
1340
1401
1464
1530
1594
1660
1724
1788
1850
1913
1925
1926
2478
2479
09:52 $
The first 12 lines of my test file are
09:55 $ head -12 lorem_ipsum.txt
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Phasellus ultricies, risus sed consectetur mattis, orci
leo eleifend nisl, quis lobortis urna enim at diam. Duis
placerat ac orci ut cursus. Morbi commodo purus et dapibus
lobortis. Maecenas at ante lectus. Duis semper magna in
nisi accumsan pharetra. Mauris porttitor lorem erat, ac
condimentum quam faucibus dictum. Cras et tortor orci.
Quisque fringilla porttitor semper. Nunc imperdiet enim
est, tristique maximus nunc convallis sagittis. Pellentesque
cursus odio elit, ac viverra tortor varius quis. In bibendum
viverra turpis, ut eleifend lectus malesuada at. Vivamus quis
orci nulla.
09:55 $
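
The comparison above (script output against a reference list of line starts) can be automated. A minimal sketch, assuming a POSIX awk and an index file holding one offset per line; the name missing_offsets is made up for illustration and does not come from the thread:

```shell
# missing_offsets FILE INDEX
# Print every line-start byte offset of FILE that does not appear in INDEX.
# No output means the index covers all line starts.
# (missing_offsets is a hypothetical helper name.)
missing_offsets() {
    LC_ALL=C awk -v idx="$2" '
        # Load the index into a lookup table before reading FILE.
        BEGIN { while ((getline o < idx) > 0) seen[o] = 1 }
        # tot is the byte offset at which the current line starts.
        { if (!((tot + 0) in seen)) print tot + 0; tot += length($0) + 1 }
    ' "$1"
}
```

Called as `missing_offsets lorem_ipsum.txt index.txt`, this would list the skipped line starts. An index with extra entries (for instance a final end-of-file offset) still passes, since only missing line starts are reported.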

--
Lew Pitcher
"In Skills We Trust"

Re: Byte-offset of lines in a text file

<u0emto$2vgkm$2@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6074&group=comp.unix.shell#6074
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: lew.pitc...@digitalfreehold.ca (Lew Pitcher)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 14:13:12 -0000 (UTC)
Organization: The Pitcher Digital Freehold
Lines: 70
Message-ID: <u0emto$2vgkm$2@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 3 Apr 2023 14:13:12 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="72e53f448f44a44e38c260ae90e020ad";
logging-data="3130006"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/QafX4TQ6VpyY2JgvQuuvyP3qS4qvFowY="
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
git://git.gnome.org/pan2)
Cancel-Lock: sha1:1gpxaSl6vM94LRlpSbgseLhNSnI=
 by: Lew Pitcher - Mon, 3 Apr 2023 14:13 UTC

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

> Hi, Janis
>
> On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
>
>> I just needed to determine the byte-offsets of all lines in a text file
>> to create an index file.
>>
>> On a quick search I couldn't find any Unix tool/shell solution[*] so I
>> wrote this quick hack[**] that I share here in case anyone's interested
>>
>> #!/bin/ksh
>>
>> # byteoffset - create a byte-offset list for the lines in a given file
>> #
>> # Usage: byteoffset filename
>>
>> f=${1:?"Usage: ${0##*/} filename"}
>> exec 3<"$f"
>>
>> echo $( 3<# )
>> while read -u3
>> do echo $( 3<# )
>> done
>
> [snip]
>
> Your script intrigued me. While I don't normally use kornshell, I decided
> to try the script out to see what it did.
>
> I have a multi-line lorem ipsum test file that I fed to your script, and
> it came up with some funny numbers. Specifically, it missed the first
> few lines of the file. I double-checked, both visually and with grep[1]
> egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
> and it appears that your script somehow ignores the first ~600 bytes
> of my test file.
>
> I don't have an explanation for this behaviour.

Well, I have an observation that may lead to an explanation.

My lorem_ipsum.txt file has a number of "blank" lines, the first of which
is at displacement 648. Your script properly reports all the lines that
follow that blank line. It appears that, somehow, your script ignores
everything before the first blank line.

> [1] Your script resulted in
[snip]
> 09:50 $ ./bo.ksh lorem_ipsum.txt
> 0
> 649
> 710
[snip]
> 09:50 $
> and egrep tells me
> 09:50 $ egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
> 0
[snip]
> 636
> 648
My blank line, above
> 649
> 710
[snip]

Hope this helps in the diagnosis
--
Lew Pitcher
"In Skills We Trust"

Re: Byte-offset of lines in a text file

<u0en35$2vgkm$3@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6075&group=comp.unix.shell#6075
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: lew.pitc...@digitalfreehold.ca (Lew Pitcher)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 14:16:05 -0000 (UTC)
Organization: The Pitcher Digital Freehold
Lines: 67
Message-ID: <u0en35$2vgkm$3@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 3 Apr 2023 14:16:05 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="72e53f448f44a44e38c260ae90e020ad";
logging-data="3130006"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+g33+yxL6rUOIju0e9AO9ZsAfjjVkM4u8="
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
git://git.gnome.org/pan2)
Cancel-Lock: sha1:8WSYQiOxbplEceijwz0rXbR4r+4=
 by: Lew Pitcher - Mon, 3 Apr 2023 14:16 UTC

On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:

> Hi, Janis
>
> On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
>
>> I just needed to determine the byte-offsets of all lines in a text file
>> to create an index file.
>>
>> On a quick search I couldn't find any Unix tool/shell solution[*] so I
>> wrote this quick hack[**] that I share here in case anyone's interested
>>
>> #!/bin/ksh
>>
>> # byteoffset - create a byte-offset list for the lines in a given file
>> #
>> # Usage: byteoffset filename
>>
>> f=${1:?"Usage: ${0##*/} filename"}
>> exec 3<"$f"
>>
>> echo $( 3<# )
>> while read -u3
>> do echo $( 3<# )
>> done
>
> [snip]
>
> Your script intrigued me. While I don't normally use kornshell, I decided
> to try the script out to see what it did.
>
> I have a multi-line lorem ipsum test file that I fed to your script, and
> it came up with some funny numbers. Specifically, it missed the first
> few lines of the file. I double-checked, both visually and with grep[1]
> egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
> and it appears that your script somehow ignores the first ~600 bytes
> of my test file.
>
> I don't have an explanation for this behaviour.
>
> [1] Your script resulted in
> 09:50 $ cat ./bo.ksh
> #!/bin/ksh
>
> # byteoffset - create a byte-offset list for the lines in a given file
> #
> # Usage: byteoffset filename
>
> f=${1:?"Usage: ${0##*/} filename"}
> exec 3<"$f"
>
> echo $( 3<# )
> while 3<#$'\n'
> do echo $( 3<# )
> done

Awwwwww fsck!

I copied the wrong script. Your followup noted that this version
had problems.

I'll retry with the correct script.

Sorry to have been a nuisance :-(
--
Lew Pitcher
"In Skills We Trust"

Re: Byte-offset of lines in a text file

<u0en8j$2vgkm$4@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6076&group=comp.unix.shell#6076
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: lew.pitc...@digitalfreehold.ca (Lew Pitcher)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 14:18:59 -0000 (UTC)
Organization: The Pitcher Digital Freehold
Lines: 75
Message-ID: <u0en8j$2vgkm$4@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
<u0en35$2vgkm$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 3 Apr 2023 14:18:59 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="72e53f448f44a44e38c260ae90e020ad";
logging-data="3130006"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/tS0Oc5qgCLfNQSb4b1Bm/UEt8BS74WG8="
User-Agent: Pan/0.139 (Sexual Chocolate; GIT bf56508
git://git.gnome.org/pan2)
Cancel-Lock: sha1:ckljVYJz2zdU8LnYIFvJsHoMqPw=
 by: Lew Pitcher - Mon, 3 Apr 2023 14:18 UTC

On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:

> On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
>
>> Hi, Janis
>>
>> On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
>>
>>> I just needed to determine the byte-offsets of all lines in a text file
>>> to create an index file.
>>>
>>> On a quick search I couldn't find any Unix tool/shell solution[*] so I
>>> wrote this quick hack[**] that I share here in case anyone's interested
>>>
>>> #!/bin/ksh
>>>
>>> # byteoffset - create a byte-offset list for the lines in a given file
>>> #
>>> # Usage: byteoffset filename
>>>
>>> f=${1:?"Usage: ${0##*/} filename"}
>>> exec 3<"$f"
>>>
>>> echo $( 3<# )
>>> while read -u3
>>> do echo $( 3<# )
>>> done
>>
>> [snip]
>>
>> Your script intrigued me. While I don't normally use kornshell, I decided
>> to try the script out to see what it did.
>>
>> I have a multi-line lorem ipsum test file that I fed to your script, and
>> it came up with some funny numbers. Specifically, it missed the first
>> few lines of the file. I double-checked, both visually and with grep[1]
>> egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
>> and it appears that your script somehow ignores the first ~600 bytes
>> of my test file.
>>
>> I don't have an explanation for this behaviour.
>>
>> [1] Your script resulted in
>> 09:50 $ cat ./bo.ksh
>> #!/bin/ksh
>>
>> # byteoffset - create a byte-offset list for the lines in a given file
>> #
>> # Usage: byteoffset filename
>>
>> f=${1:?"Usage: ${0##*/} filename"}
>> exec 3<"$f"
>>
>> echo $( 3<# )
>> while 3<#$'\n'
>> do echo $( 3<# )
>> done
>
> Awwwwww fsck!
>
> I copied the wrong script. Your followup noted that this version
> had problems.
>
> I'll retry with the correct script.
>
> Sorry to have been a nuisance :-(

Retesting with the /correct/ script shows that it duplicated
the results of my egrep pipe. It looks like this script is a
winner.

Thanks for the education; I learned something new today. :-)
--
Lew Pitcher
"In Skills We Trust"

Re: Byte-offset of lines in a text file

<u0enq2$1tmsj$1@news.xmission.com>

https://www.novabbs.com/devel/article-flat.php?id=6077&group=comp.unix.shell#6077
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!xmission!nnrp.xmission!.POSTED.shell.xmission.com!not-for-mail
From: gaze...@shell.xmission.com (Kenny McCormack)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 14:28:19 -0000 (UTC)
Organization: The official candy of the new Millennium
Message-ID: <u0enq2$1tmsj$1@news.xmission.com>
References: <u0ebpb$2ulen$1@dont-email.me> <u0edfh$2utkp$1@dont-email.me>
Injection-Date: Mon, 3 Apr 2023 14:28:19 -0000 (UTC)
Injection-Info: news.xmission.com; posting-host="shell.xmission.com:166.70.8.4";
logging-data="2022291"; mail-complaints-to="abuse@xmission.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: gazelle@shell.xmission.com (Kenny McCormack)
 by: Kenny McCormack - Mon, 3 Apr 2023 14:28 UTC

In article <u0edfh$2utkp$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>On 03.04.2023 13:03, Janis Papanagnou wrote:
>> I just needed to determine the byte-offsets of all lines in a text file
>> to create an index file.
>> [...]
>
>Don't use the second variant (the one using 3<#$'\n' ), it is
>*not* running reliably, as I noticed with more tests! The first
>one (using read -u3 ) runs just fine, though.

I don't know what the overall goal is, but wouldn't this be easier:

$ awk '{ print tot+0;tot += length + 1 }' file

Note that this works fine in Unix, because in Unix bytes are bytes.
It might need updating to work correctly under DOS/Windows. Or VMS...

Or z/OS...
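
One caveat to "bytes are bytes": POSIX defines awk's length in characters, so in a multibyte locale (UTF-8, say) an implementation such as gawk counts characters, not bytes, and the offsets drift after the first non-ASCII character. A sketch that forces the C locale so that length counts bytes; the function name byte_offsets is only illustrative:

```shell
# byte_offsets FILE - print the byte offset of the start of each line.
# LC_ALL=C makes awk treat every byte as one character, so length()
# yields a byte count even for UTF-8 input.
# (byte_offsets is an illustrative name, not something from the thread.)
byte_offsets() {
    LC_ALL=C awk '{ print tot + 0; tot += length($0) + 1 }' "$1"
}
```

Like the one-liner, this assumes every line ends in a single LF byte. On a Unix system a CRLF file still gets correct byte offsets, because the CR is counted as part of the line.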

--
The randomly chosen signature file that would have appeared here is more than 4
lines long. As such, it violates one or more Usenet RFCs. In order to remain
in compliance with said RFCs, the actual sig can be found at the following URL:
http://user.xmission.com/~gazelle/Sigs/ModernXtian

Re: Byte-offset of lines in a text file

<u0f2rr$321ai$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6081&group=comp.unix.shell#6081
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 19:36:59 +0200
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <u0f2rr$321ai$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0edfh$2utkp$1@dont-email.me>
<u0enq2$1tmsj$1@news.xmission.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Apr 2023 17:36:59 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2a13eec3e7979a18f2ffcdacecfa2424";
logging-data="3212626"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18PXYp8UzBoJSb6i0Or9Qub"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:AciUQl6+DPqimrC4AlvXoFTlJ9A=
In-Reply-To: <u0enq2$1tmsj$1@news.xmission.com>
X-Enigmail-Draft-Status: N1110
 by: Janis Papanagnou - Mon, 3 Apr 2023 17:36 UTC

On 03.04.2023 16:28, Kenny McCormack wrote:
> In article <u0edfh$2utkp$1@dont-email.me>,
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>> On 03.04.2023 13:03, Janis Papanagnou wrote:
>>> I just needed to determine the byte-offsets of all lines in a text file
>>> to create an index file.
>>> [...]
>>
>> Don't use the second variant (the one using 3<#$'\n' ), it is
>> *not* running reliably, as I noticed with more tests! The first
>> one (using read -u3 ) runs just fine, though.
>
> I don't know what the overall goal is,

The goal was to create an index file.[*]

> but wouldn't this be easier:
>
> $ awk '{ print tot+0;tot += length + 1 }' file
>
> Note that this works fine in Unix, because in Unix bytes are bytes.
> It might need updating to work correctly under DOS/Windows. Or VMS...
>
> Or z/OS...

Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
variant; I like it for its readability.

Janis

[*] The task actually is (in a Javascript context) to use a low-level
Javascript file access function without loading the whole (huge) file
into memory; this function was suggested to me, and it requires such
an index that I have to create beforehand. (I thought there'd be some
standard tool for such a (standard-?) task available.)
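
For illustration, here is how such an index can be consumed from the shell. The Javascript function mentioned above is not shown in the thread, so this is only a stand-in, and line_at is a hypothetical name. It assumes the index holds one offset per line, with entry N giving the start of line N:

```shell
# line_at FILE INDEX N - print line N (1-based) of FILE, using a byte-offset
# index with one offset per line.
# (line_at is a hypothetical helper name.)
line_at() {
    # Look up the offset stored on line N of the index.
    off=$(sed -n "${3}p" "$2") || return 1
    [ -n "$off" ] || return 1
    # tail -c +K starts output at byte K (1-based), hence the +1;
    # head -n 1 stops after the first newline.
    tail -c "+$((off + 1))" "$1" | head -n 1
}
```

For example, `line_at big.txt big.idx 1000` would print line 1000. Whether tail seeks directly to the offset or reads up to it depends on the implementation.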

Re: Byte-offset of lines in a text file

<u0f3jr$324lo$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6082&group=comp.unix.shell#6082
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 19:49:46 +0200
Organization: A noiseless patient Spider
Lines: 90
Message-ID: <u0f3jr$324lo$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
<u0en35$2vgkm$3@dont-email.me> <u0en8j$2vgkm$4@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Apr 2023 17:49:47 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2a13eec3e7979a18f2ffcdacecfa2424";
logging-data="3216056"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/SZqOY8DL09hi+Jcq/pfT5"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:P90e+soyfvDbQ3kY5DH/i8OpcoM=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <u0en8j$2vgkm$4@dont-email.me>
 by: Janis Papanagnou - Mon, 3 Apr 2023 17:49 UTC

On 03.04.2023 16:18, Lew Pitcher wrote:
> On Mon, 03 Apr 2023 14:16:05 +0000, Lew Pitcher wrote:
>
>> On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
>>
>>> Hi, Janis
>>>
>>> On Mon, 03 Apr 2023 13:03:07 +0200, Janis Papanagnou wrote:
>>>
>>>> I just needed to determine the byte-offsets of all lines in a text file
>>>> to create an index file.
>>>>
>>>> On a quick search I couldn't find any Unix tool/shell solution[*] so I
>>>> wrote this quick hack[**] that I share here in case anyone's interested
>>>>
>>>> #!/bin/ksh
>>>>
>>>> # byteoffset - create a byte-offset list for the lines in a given file
>>>> #
>>>> # Usage: byteoffset filename
>>>>
>>>> f=${1:?"Usage: ${0##*/} filename"}
>>>> exec 3<"$f"
>>>>
>>>> echo $( 3<# )
>>>> while read -u3
>>>> do echo $( 3<# )
>>>> done
>>>
>>> [snip]
>>>
>>> Your script intrigued me. While I don't normally use kornshell, I decided
>>> to try the script out to see what it did.
>>>
>>> I have a multi-line lorem ipsum test file that I fed to your script, and
>>> it came up with some funny numbers. Specifically, it missed the first
>>> few lines of the file. I double-checked, both visually and with grep[1]
>>> egrep -b '^' lorem_ipsum.txt | awk -F : '{print $1}'
>>> and it appears that your script somehow ignores the first ~600 bytes
>>> of my test file.
>>>
>>> I don't have an explanation for this behaviour.
>>>
>>> [1] Your script resulted in
>>> 09:50 $ cat ./bo.ksh
>>> #!/bin/ksh
>>>
>>> # byteoffset - create a byte-offset list for the lines in a given file
>>> #
>>> # Usage: byteoffset filename
>>>
>>> f=${1:?"Usage: ${0##*/} filename"}
>>> exec 3<"$f"
>>>
>>> echo $( 3<# )
>>> while 3<#$'\n'
>>> do echo $( 3<# )
>>> done
>>
>> Awwwwww fsck!
>>
>> I copied the wrong script. Your followup noted that this version
>> had problems.
>>
>> I'll retry with the correct script.
>>
>> Sorry to have been a nuisance :-(

You haven't been. I appreciate any tests and feedback.

>
> Retesting with the /correct/ script shows that it duplicated
> the results of my egrep pipe. It looks like this script is a
> winner.

You seem to have been using the 3<#$'\n' based variant? - And it
works? - Still not reliably in my environment. I'll have to examine
that further.

>
> Thanks for the education; I learned something new today. :-)

Thanks for your tests! (And for the overall confirmation.) I'm a bit
reluctant to use Kornshell's newer "redirection" operators; in my
experience they have not been reliable [in my environment]. (It may be
advisable to test and confirm that in Martijn's ksh93u+m, which
generally seems to be much more reliable.)

Janis

Re: Byte-offset of lines in a text file

<u0f401$326gr$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6083&group=comp.unix.shell#6083
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Mon, 3 Apr 2023 19:56:16 +0200
Organization: A noiseless patient Spider
Lines: 15
Message-ID: <u0f401$326gr$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
<u0en35$2vgkm$3@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 3 Apr 2023 17:56:17 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2a13eec3e7979a18f2ffcdacecfa2424";
logging-data="3217947"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX193BJyLG83Dbbe+v2ioP8ro"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:bHUFnKeCn+RDDaRLeo3y9eClK8k=
In-Reply-To: <u0en35$2vgkm$3@dont-email.me>
 by: Janis Papanagnou - Mon, 3 Apr 2023 17:56 UTC

On 03.04.2023 16:16, Lew Pitcher wrote:
> On Mon, 03 Apr 2023 13:56:16 +0000, Lew Pitcher wrote:
>
>
> I copied the wrong script. Your followup noted that this version
> had problems.
>
> I'll retry with the correct script.

Argh! - And I missed this post.

So you made the same observation that I made. - Thanks!

Janis

Re: Byte-offset of lines in a text file

<RBdhjtpj3o5yraD1a@bongo-ra.co>

https://www.novabbs.com/devel/article-flat.php?id=6084&group=comp.unix.shell#6084
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: spi...@gmail.com (Spiros Bousbouras)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Tue, 4 Apr 2023 02:45:34 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <RBdhjtpj3o5yraD1a@bongo-ra.co>
References: <u0ebpb$2ulen$1@dont-email.me> <u0edfh$2utkp$1@dont-email.me> <u0enq2$1tmsj$1@news.xmission.com>
<u0f2rr$321ai$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 4 Apr 2023 02:45:34 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2f380866212ae0a958992fc87607e0da";
logging-data="3449689"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/j4c9t1Ouvtwsh+fsZcSc2"
Cancel-Lock: sha1:vU2/UG+XlKJNv+MRlEdXX37zBf4=
In-Reply-To: <u0f2rr$321ai$1@dont-email.me>
X-Server-Commands: nowebcancel
X-Organisation: Weyland-Yutani
 by: Spiros Bousbouras - Tue, 4 Apr 2023 02:45 UTC

On Mon, 3 Apr 2023 19:36:59 +0200
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> On 03.04.2023 16:28, Kenny McCormack wrote:
> > I don't know what the overall goal is,
>
> The goal was to create an index file.[*]
>
> > but wouldn't this be easier:
> >
> > $ awk '{ print tot+0;tot += length + 1 }' file
> >
> > Note that this works fine in Unix, because in Unix bytes are bytes.
> > It might need updating to work correctly under DOS/Windows. Or VMS...
> >
> > Or z/OS...
>
> Ideally a solution would be CR/LF/CRLF agnostic. But thanks for the
> variant; I like it for its readability.

A generalisation is

awk -v nob=$(echo | wc -c) '{ print tot+0 ; tot += length + nob }' file

But I haven't tested it on a system where the newline sequence is different
than the single LF byte. And there are (or used to be) operating systems
where there is no notion of newline sequence and files are made of
records where each record is a line.

A different consideration : is it ok if the output is the same regardless
of whether the input ends in a newline sequence or not ? With the above
awk scripts it will be the same.
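
If the distinction matters, it can be detected separately. A sketch (the function name is illustrative) that compares awk's running total, which assumes a newline after every line, with the actual file size:

```shell
# ends_with_newline FILE - succeed if FILE is empty or its last byte is a
# newline; fail if the final line is unterminated.
# (ends_with_newline is an illustrative name.)
ends_with_newline() {
    # Sum of (line length + 1) over all lines, counted in bytes.
    total=$(LC_ALL=C awk '{ tot += length($0) + 1 } END { print tot + 0 }' "$1")
    # Arithmetic expansion strips any padding some wc versions emit.
    size=$(($(wc -c < "$1")))
    [ "$total" -eq "$size" ]
}
```

When the last line lacks its newline, the total comes out one byte larger than the file size, so the test fails. As above, this assumes a single-byte LF terminator.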

--
As someone once joked, "It's easier to prove the Riemann hypothesis than to get
someone to read your proof!"
http://empslocal.ex.ac.uk/people/staff/mrwatkin/zeta/RHproofs.htm

Re: Byte-offset of lines in a text file

<u0gji0$3bfr4$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6085&group=comp.unix.shell#6085
Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Tue, 4 Apr 2023 09:27:59 +0200
Organization: A noiseless patient Spider
Lines: 34
Message-ID: <u0gji0$3bfr4$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 4 Apr 2023 07:28:00 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="2ac164eed4110e536599f2695af3988f";
logging-data="3522404"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+l14uutmskq9IIpEoVP1Ki"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:yCRdB3ho1nLDahQAEVr1iyhsCrI=
In-Reply-To: <u0ebpb$2ulen$1@dont-email.me>
X-Mozilla-News-Host: news://news.eternal-september.org
 by: Janis Papanagnou - Tue, 4 Apr 2023 07:27 UTC

On 03.04.2023 13:03, Janis Papanagnou wrote:
> I just needed to determine the byte-offsets of all lines in a text file
> to create an index file.
>
> [...]
>
> #!/bin/ksh
>
> # byteoffset - create a byte-offset list for the lines in a given file
> #
> # Usage: byteoffset filename
>
> f=${1:?"Usage: ${0##*/} filename"}
> exec 3<"$f"
>
> echo $( 3<# )
> while read -u3
> do echo $( 3<# )
> done

It just occurred to me that this can be shortened a bit, avoiding the
duplicated code:

f=${1:?"Usage: ${0##*/} filename"}
exec 3<"$f"

while echo $( 3<# ) ; read -u3
do :
done

> [...]

Janis

Re: Byte-offset of lines in a text file

<c9b31c80-5bc8-4adc-a9e1-3a5c7e40de90n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=6092&group=comp.unix.shell#6092

X-Received: by 2002:a05:620a:170f:b0:746:af25:7e8a with SMTP id az15-20020a05620a170f00b00746af257e8amr661925qkb.14.1680671358696;
Tue, 04 Apr 2023 22:09:18 -0700 (PDT)
X-Received: by 2002:a05:6870:24a0:b0:17f:1723:fc82 with SMTP id
s32-20020a05687024a000b0017f1723fc82mr2516985oaq.9.1680671358394; Tue, 04 Apr
2023 22:09:18 -0700 (PDT)
Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!1.us.feeder.erje.net!feeder.erje.net!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!peer01.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.unix.shell
Date: Tue, 4 Apr 2023 22:09:18 -0700 (PDT)
In-Reply-To: <u0ebpb$2ulen$1@dont-email.me>
Injection-Info: google-groups.googlegroups.com; posting-host=2600:1700:e82:d400:107f:a7d7:29c7:8889;
posting-account=rR5tnAoAAAC2kIBHWh0n6frMCTGowyvE
NNTP-Posting-Host: 2600:1700:e82:d400:107f:a7d7:29c7:8889
References: <u0ebpb$2ulen$1@dont-email.me>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <c9b31c80-5bc8-4adc-a9e1-3a5c7e40de90n@googlegroups.com>
Subject: Re: Byte-offset of lines in a text file
From: jalen...@gmail.com (Jalen Q)
Injection-Date: Wed, 05 Apr 2023 05:09:18 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 2694
 by: Jalen Q - Wed, 5 Apr 2023 05:09 UTC

On Monday, April 3, 2023 at 6:03:13 AM UTC-5, Janis Papanagnou wrote:
> I just needed to determine the byte-offsets of all lines in a text file
> to create an index file.
>
> On a quick search I couldn't find any Unix tool/shell solution[*] so I
> wrote this quick hack[**] that I share here in case anyone's interested
>
> #!/bin/ksh
>
> # byteoffset - create a byte-offset list for the lines in a given file
> #
> # Usage: byteoffset filename
>
> f=${1:?"Usage: ${0##*/} filename"}
> exec 3<"$f"
>
> echo $( 3<# )
> while read -u3
> do echo $( 3<# )
> done
>
>
> Even though that code is fast enough for my (MB sized) files using
> Kornshell's pattern seek-redirections to locate the newlines in the
> file seems to be significantly faster than the 'read' based approach
>
> #!/bin/ksh
>
> # byteoffset - create a byte-offset list for the lines in a given file
> #
> # Usage: byteoffset filename
>
> f=${1:?"Usage: ${0##*/} filename"}
> exec 3<"$f"
>
> echo $( 3<# )
> while 3<#$'\n'
> do echo $( 3<# )
> done
>
>
> On a 320 MB test file the first script requires ~8 seconds and the
> second one ~0.3 seconds.
>
> Janis
>
> [*] Does 'sed' maybe support such a function? Or is there any other
> standard tool I missed?
>
> [**] Based on a feature of newer versions of Kornshell (e.g. ksh93u+).
hjuuuyyyy77y

Re: Byte-offset of lines in a text file

<u0lmhr$8do0$1@dont-email.me>

https://www.novabbs.com/devel/article-flat.php?id=6097&group=comp.unix.shell#6097

Path: i2pn2.org!i2pn.org!eternal-september.org!feeder.eternal-september.org!.POSTED!not-for-mail
From: janis_pa...@hotmail.com (Janis Papanagnou)
Newsgroups: comp.unix.shell
Subject: Re: Byte-offset of lines in a text file
Date: Thu, 6 Apr 2023 07:49:46 +0200
Organization: A noiseless patient Spider
Lines: 13
Message-ID: <u0lmhr$8do0$1@dont-email.me>
References: <u0ebpb$2ulen$1@dont-email.me> <u0elu0$2vgkm$1@dont-email.me>
<u0en35$2vgkm$3@dont-email.me> <u0en8j$2vgkm$4@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 6 Apr 2023 05:49:47 -0000 (UTC)
Injection-Info: dont-email.me; posting-host="1cb7eeb8eb3564cea51ad89e7a31687d";
logging-data="276224"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ba+tCpUxRbDlcznpnpJmW"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
Thunderbird/45.8.0
Cancel-Lock: sha1:13XPPn+MB4afCbTBPuNoDQHdfvY=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <u0en8j$2vgkm$4@dont-email.me>
 by: Janis Papanagnou - Thu, 6 Apr 2023 05:49 UTC

On 03.04.2023 16:18, Lew Pitcher wrote:
>
> Retesting with the /correct/ script shows that it duplicated
> the results of my egrep pipe. It looks like this script is a
> winner.

Not really. The shell's read-loop is slow, and the egrep/awk pipe
seems to be a lot faster. As long as I cannot make the shell's
pattern seek functional and reliable, it makes sense to stay with
the pipe. I wasn't aware of grep's '-b' option; thanks for that!
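
For reference, a rough sketch of what a pipe built on grep's '-b' option
can look like. This is not necessarily the exact pipe Lew posted; the
function name line_offsets_grep is made up here, and '-b' is assumed to be
available (it is in GNU and BSD grep but is not required by POSIX).

```shell
# line_offsets_grep FILE
# Print the byte offset at which each line of FILE starts.
# Hypothetical helper; relies on the non-POSIX '-b' option of grep.
line_offsets_grep() {
    # The empty pattern matches every line; -b prefixes "OFFSET:" to it.
    # cut keeps only the text before the first colon, i.e. the offset.
    LC_ALL=C grep -b '' "$1" | cut -d: -f1
}
```

For a file holding "ab\ncde\n" this should print 0 and 3. It lists only
the start of each line, so the end-of-file offset that the ksh loops also
print is missing, and grep may refuse to list lines of files it considers
binary.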

Janis
