devel / comp.unix.shell / Scraping and Less buffering

Subject                                   Author
* Scraping and Less buffering             frogger
`* Re: Scraping and Less buffering        Ben Bacarisse
 `* Re: Scraping and Less buffering       frogger
  `- Re: Scraping and Less buffering      frogger

Scraping and Less buffering

https://www.novabbs.com/devel/article-flat.php?id=5123&group=comp.unix.shell#5123
From: somebody@invalid.com (frogger)
Newsgroups: comp.unix.shell
Subject: Scraping and Less buffering
Date: Sat, 19 Mar 2022 17:57:09 -0300
Organization: A noiseless patient Spider
Message-ID: <t15g37$u91$1@dont-email.me>

Hello all.

I wrote a shell scraper for a news website, and one of its options is to
keep re-accessing the initial webpage (in a while loop) at regular
intervals, grab all the news links, and scrape the text of those that
are new. This option keeps the script running for days straight. When
stdout is redirected to a *file*, it works as expected.
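
Roughly, the loop is shaped like this (a simplified sketch, not my real
code; the URL, `seen.txt' and `scrape_text' are placeholders):

   : >>seen.txt                  # make sure the seen-links list exists
   while :; do
       curl -s "$url" |
           grep -Eo 'https?://[^" ]+' |  # collect candidate links
           grep -vxFf seen.txt |         # drop links already scraped
           while read -r link; do
               printf '%s\n' "$link" >>seen.txt
               scrape_text "$link"       # print article text to stdout
           done
       sleep 300                         # wait before re-checking the page
   done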

If instead we pipe the output to `less', there is a buffering
standstill. When the buffer of `less' fills up, at about 8-64 KB, the
whole while loop in the script hangs and only continues to run when we
scroll down in the `less' display. But if it hangs for a few hours, the
scraping tool only resumes scraping some hours later, too, failing to
scrape a lot of news from the initial webpage during that time.
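
The stall is easy to reproduce with any fast writer (an illustration,
not my script):

   $ yes | less    # `yes' blocks as soon as `less' stops reading the pipe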

I tried piping to `less -B -b4096' and `less -B -b-1' to no avail:

-B or --auto-buffers
       By default, when data is read from a pipe, buffers are allocated
       automatically as needed.  If a large amount of data is read from
       the pipe, this can cause a large amount of memory to be
       allocated.  The -B option disables this automatic allocation of
       buffers for pipes, so that only 64 KB (or the amount of space
       specified by the -b option) is used for the pipe.

-bn or --buffers=n
       Specifies the amount of buffer space less will use for each
       file, in units of kilobytes (1024 bytes).  By default 64 KB of
       buffer space is used for each file (unless the file is a pipe;
       see the -B option).  The -b option specifies instead that n
       kilobytes of buffer space should be used for each file.  If n is
       -1, buffer space is unlimited; that is, the entire file can be
       read into memory.

I use GNU coreutils and Arch Linux. My understanding is that `less
-Bb-1' should have worked, but some internal buffering still gets in
the way. Any suggestions? I know of `stdbuf' and tried it a little, but
could not make it work. If I was not clear, please let me know.
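
What I tried with `stdbuf' was along these lines (from memory, so the
exact invocation may differ):

   $ stdbuf -o0 ./scraper.sh | less -B -b-1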

Thanks,
JSN

Re: Scraping and Less buffering

https://www.novabbs.com/devel/article-flat.php?id=5125&group=comp.unix.shell#5125
From: ben.use...@bsb.me.uk (Ben Bacarisse)
Newsgroups: comp.unix.shell
Subject: Re: Scraping and Less buffering
Date: Sat, 19 Mar 2022 22:06:19 +0000
Organization: A noiseless patient Spider
Message-ID: <87h77tr410.fsf@bsb.me.uk>
References: <t15g37$u91$1@dont-email.me>

frogger <somebody@invalid.com> writes:

> I wrote a shell scraper for a news website, and one of its options is
> to keep re-accessing the initial webpage (in a while loop) at regular
> intervals, grab all the news links, and scrape the text of those that
> are new. This option keeps the script running for days straight. When
> stdout is redirected to a *file*, it works as expected.
>
> If instead we pipe the output to `less', there is a buffering
> standstill. When the buffer of `less' fills up, at about 8-64 KB, the
> whole while loop in the script hangs and only continues to run when we
> scroll down in the `less' display. But if it hangs for a few hours,
> the scraping tool only resumes scraping some hours later, too, failing
> to scrape a lot of news from the initial webpage during that time.

I'm not 100% sure what you want, but does:

$ scraper >file & less file

and then using the F command do something like you want?
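
You can also start in follow mode directly:

   $ scraper >file & less +F file

Ctrl-C stops following so you can scroll; F resumes following.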

--
Ben.

Re: Scraping and Less buffering

https://www.novabbs.com/devel/article-flat.php?id=5126&group=comp.unix.shell#5126
From: somebody@invalid.com (frogger)
Newsgroups: comp.unix.shell
Subject: Re: Scraping and Less buffering
Date: Sat, 19 Mar 2022 19:41:17 -0300
Organization: A noiseless patient Spider
Message-ID: <t15m6f$lu1$1@dont-email.me>
References: <t15g37$u91$1@dont-email.me> <87h77tr410.fsf@bsb.me.uk>
In-Reply-To: <87h77tr410.fsf@bsb.me.uk>

On 19/03/2022 19:06, Ben Bacarisse wrote:
> frogger <somebody@invalid.com> writes:
>
>> I wrote a shell scraper for a news website, and one of its options
>> is to keep re-accessing the initial webpage (in a while loop) at
>> regular intervals, grab all the news links, and scrape the text of
>> those that are new. This option keeps the script running for days
>> straight. When stdout is redirected to a *file*, it works as
>> expected.
>>
>> If instead we pipe the output to `less', there is a buffering
>> standstill. When the buffer of `less' fills up, at about 8-64 KB,
>> the whole while loop in the script hangs and only continues to run
>> when we scroll down in the `less' display. But if it hangs for a few
>> hours, the scraping tool only resumes scraping some hours later,
>> too, failing to scrape a lot of news from the initial webpage during
>> that time.
>
> I'm not 100% sure what you want, but does:
>
> $ scraper >file & less file
>
> and then using the F command do something like you want?
>

Hey Ben!

That is *exactly* what I am trying right now. I was not sure it would
be a good solution, but it seems it is the straightforward one, since
it was so obvious to you.

So what I just did, and it works, is:

scraperFunction >file & tail -f file | less
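
This way the scraper only ever writes to a regular file, so it can
never block on a full pipe; if `less' stops reading, only `tail'
stalls, and the scraping loop keeps running.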

I also added a trap inside the script to kill the forks and remove the
temp file.
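
The trap is along these lines (a sketch; the variable name is made up):

   tmpfile=$(mktemp)
   trap 'kill $(jobs -p) 2>/dev/null; rm -f "$tmpfile"' EXIT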

Thanks, Ben!
JSN

Re: Scraping and Less buffering

https://www.novabbs.com/devel/article-flat.php?id=5133&group=comp.unix.shell#5133
From: somebody@invalid.com (frogger)
Newsgroups: comp.unix.shell
Subject: Re: Scraping and Less buffering
Date: Sun, 20 Mar 2022 11:03:26 -0300
Organization: A noiseless patient Spider
Message-ID: <t17c7g$mu7$1@dont-email.me>
References: <t15g37$u91$1@dont-email.me> <87h77tr410.fsf@bsb.me.uk>
<t15m6f$lu1$1@dont-email.me>
In-Reply-To: <t15m6f$lu1$1@dont-email.me>

I am a little drunk, but stop wars, you English-American people. The
English language is very funny; that is why programming languages are
funny, too.

Cheers, English people! Stop the war. Americans and Europeans are very
funny people; I wish they would be better than us, though.

Because I did not see it here on Usenet: UKRAINE wins the wars.
