novaBBS - comp.lang.python

Re: tail

<mailman.228.1650752390.20749.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=18008&group=comp.lang.python#18008

copy link Newsgroups: comp.lang.python

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sun, 24 Apr 2022 08:19:36 +1000
Lines: 103
Message-ID: <mailman.228.1650752390.20749.python-list@python.org>
References: <CABbU2U98YKdcnJkDPfzE3Pqso+6LL72usB8hrSBVR0WbhauRoQ@mail.gmail.com>
<CAPTjJmr3AiCyvxXt=-nqNLrJfyQHmG=pvSsM7nU_XxhSe94zgA@mail.gmail.com>
<CABbU2U8TAvy0zMhUcNtTD0=WpQ6oNYEeZQuKDjnxhG85FVriDg@mail.gmail.com>
<CAPTjJmqnfoPjoNT2CNsrkMVxkzAMHHXHj-G3DuGrJ21SDRNsPA@mail.gmail.com>
<CABbU2U_sWyEmBXf0Psudwc-FLeRYqLX=B4x-_9TV0qc5ZVt3Bg@mail.gmail.com>
<CAPTjJmrJacamKq1V5T8FECkm4jURdYQgj0VsC+JK5Db0NoFaww@mail.gmail.com>
<55a04f90-8fb8-c585-afae-aca73c7d641f@DancesWithMice.info>
<CAPTjJmr3CjPhHgU8Z2VHHV8gFyGi6xtGS8GETD-eZo_s5hbPiA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
X-Trace: news.uni-berlin.de Q1sivWm1NmpFSe+IXm+NRAbKH6P+I6+kkgDT9MdV7izQ==
Return-Path: <rosuav@gmail.com>
X-Original-To: python-list@python.org
Delivered-To: python-list@mail.python.org
Authentication-Results: mail.python.org; dkim=pass
reason="2048-bit key; unprotected key"
header.d=gmail.com header.i=@gmail.com header.b=JXem7RH/;
dkim-adsp=pass; dkim-atps=neutral
X-Spam-Status: OK 0.003
X-Spam-Evidence: '*H*': 0.99; '*S*': 0.00; '(which': 0.04; '2022':
0.05; 'is.': 0.05; 'sun,': 0.07; 'algorithmic': 0.09; 'angelico':
0.09; 'apparently': 0.09; 'broad': 0.09; 'byte': 0.09; 'depend':
0.09; 'electrical': 0.09; 'indeed.': 0.09; 'linux': 0.09;
'readable': 0.09; 'situations': 0.09; 'though.': 0.09; 'timing':
0.09; 'utility': 0.09; '>>>>': 0.16; '>>>>>': 0.16; 'algorithm,':
0.16; 'algorithms': 0.16; 'app,': 0.16; 'chrisa': 0.16;
'described.': 0.16; 'displays': 0.16; 'frequently': 0.16;
'from:addr:rosuav': 0.16; 'from:name:chris angelico': 0.16;
'general.': 0.16; 'iterate': 0.16; 'logs': 0.16; 'shorter': 0.16;
'size.': 0.16; 'splitting': 0.16; 'spot': 0.16; 'sweet': 0.16;
'trials': 0.16; 'wrote:': 0.16; 'python': 0.16; 'applications':
0.17; 'to:addr:python-list': 0.20; 'basically': 0.22; 'sat,':
0.22; 'lines': 0.23; 'run': 0.23; 'idea': 0.24; 'depends': 0.25;
"wasn't": 0.26; "isn't": 0.27; 'function': 0.27; '>>>': 0.28;
'chris': 0.28; 'sense': 0.28; 'it,': 0.29; 'whole': 0.30; 'think':
0.32; "doesn't": 0.32; 'question': 0.32; '(as': 0.32; 'context':
0.32; 'elements': 0.32; 'feed': 0.32; 'guess': 0.32; 'split':
0.32; 'message-id:@mail.gmail.com': 0.32; 'but': 0.32; "i'm":
0.33; 'there': 0.33; 'server': 0.33; 'same': 0.34; 'header:In-
Reply-To:1': 0.34; 'received:google.com': 0.34; 'running': 0.34;
'from:addr:gmail.com': 0.35; 'files': 0.36; 'really': 0.37;
"it's": 0.37; 'received:209.85': 0.37; 'hard': 0.37; 'file': 0.38;
'way': 0.38; 'read': 0.38; 'received:209': 0.39; 'quite': 0.39;
'adding': 0.39; 'single': 0.39; 'use': 0.39; '(with': 0.39;
'block': 0.39; 'evaluation': 0.39; 'still': 0.40; 'seeking': 0.40;
'should': 0.40; 'best': 0.61; 'likely': 0.61; 'skip:h 10': 0.61;
'reasonable': 0.62; 'limited': 0.62; 'simply': 0.63; 'full': 0.64;
'thus': 0.64; 'your': 0.64; 'produce': 0.65; 'similar': 0.65;
'well': 0.65; 'less': 0.65; 'time.': 0.66; 'back': 0.67; 'accept':
0.67; 'time,': 0.67; 'per': 0.68; 'compare': 0.69; 'generator':
0.69; "you'll": 0.73; 'low': 0.74; 'finds': 0.76; 'choice': 0.76;
'need,': 0.76; 'assessment': 0.81; 'backwards': 0.84; 'lines,':
0.84; 'lose': 0.84; 'sulla': 0.84; 'thus,': 0.84; 'think,': 0.91;
'two.': 0.91; 'line,': 0.93; 'storage': 0.95
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112;
h=mime-version:references:in-reply-to:from:date:message-id:subject:to;
bh=nqczVO+Jy6XBmc3jkON9WsFEBYAzutb9WBxY1lALh54=;
b=JXem7RH/lLm/HvNmZqLOrA/8QDzkd4HbJBDZV2JWauHkmm6OsbeVb5LYuas7eb7M3j
jkQifke7QXs2r/pOn+eIxhR737Dt4c8VW1mjmEI8KKrwbMvfkn2Pmnb+GksCX9TUbM0E
AHYVx+0Aiix2k+jeaQVxipb1CKLui1cvfkkoWbr6uF+nx2/dhJK9JsNjzQ8cXgqI4FoJ
P7P46Q9QwtucIJkFxxIrIq3cyKe7Zju8df2+mBvTL5S3t6ugmA8ZcDvjKBW9wNOnAkJS
dt0wq+fu4XsOslI/mk9qYLnuZ+DziURcRqoY2xrx9X1Hyx8/oqHM5Fq800sZSWFWNdx1
QqGg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=1e100.net; s=20210112;
h=x-gm-message-state:mime-version:references:in-reply-to:from:date
:message-id:subject:to;
bh=nqczVO+Jy6XBmc3jkON9WsFEBYAzutb9WBxY1lALh54=;
b=rgUFJ2v6dLAp0HkRwRJCuC/SLiDAv1l/J1MfhugNZkrVdA9tFq/DRzbIwj3wOUPodk
BMUa7zjbpHgwpPKdOCcQeK5Fi8o7ftV1UbVfMur/HGTLSxVFzPgp0UtyHdY5/xTmpQkz
V6hA5K222mXDsF9ECz1ERyc1gh0bm5XuBSb+1M9cm9a9Md7thNHChdNGpOY6nz4XIYZT
eOk6IPmczPbP3v+ENwOxzydjBCwQg0KDa+DkLKgtTNaaSDL427hD/gMfK6OdN770a/YE
WiIXoAqDSSY4GsV0btPEsxDjAOZQajl0pXmU1gfTugjbuWibEFIfdv5PfJjnAV0o1f7r
tYJw==
X-Gm-Message-State: AOAM530v7Mnuo6sGxowwShyBspUOSOvA4w48T0BPAfo9V3eRU5Tjq7bS
XeKnjwh/Pcct7DtBFOb7SWnv9puwM5EHvSXyRLpBN1dy
X-Google-Smtp-Source: ABdhPJwFSZLQurf3OWmT0nNijpCWtLT+d8JY7oo8q0hXJQFWR86QXPHJXfBJOPv4k55Zyxlf5TSWRKboCgf78j2wknM=
X-Received: by 2002:a05:6000:188d:b0:20a:a014:7ff6 with SMTP id
a13-20020a056000188d00b0020aa0147ff6mr8763397wri.104.1650752387733; Sat, 23
Apr 2022 15:19:47 -0700 (PDT)
In-Reply-To: <55a04f90-8fb8-c585-afae-aca73c7d641f@DancesWithMice.info>
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.1.39
Precedence: list
List-Id: General discussion list for the Python programming language
<python-list.python.org>
List-Unsubscribe: <https://mail.python.org/mailman/options/python-list>,
<mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <https://mail.python.org/pipermail/python-list/>
List-Post: <mailto:python-list@python.org>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Subscribe: <https://mail.python.org/mailman/listinfo/python-list>,
<mailto:python-list-request@python.org?subject=subscribe>
X-Mailman-Original-Message-ID: <CAPTjJmr3CjPhHgU8Z2VHHV8gFyGi6xtGS8GETD-eZo_s5hbPiA@mail.gmail.com>
X-Mailman-Original-References: <CABbU2U98YKdcnJkDPfzE3Pqso+6LL72usB8hrSBVR0WbhauRoQ@mail.gmail.com>
<CAPTjJmr3AiCyvxXt=-nqNLrJfyQHmG=pvSsM7nU_XxhSe94zgA@mail.gmail.com>
<CABbU2U8TAvy0zMhUcNtTD0=WpQ6oNYEeZQuKDjnxhG85FVriDg@mail.gmail.com>
<CAPTjJmqnfoPjoNT2CNsrkMVxkzAMHHXHj-G3DuGrJ21SDRNsPA@mail.gmail.com>
<CABbU2U_sWyEmBXf0Psudwc-FLeRYqLX=B4x-_9TV0qc5ZVt3Bg@mail.gmail.com>
<CAPTjJmrJacamKq1V5T8FECkm4jURdYQgj0VsC+JK5Db0NoFaww@mail.gmail.com>
<55a04f90-8fb8-c585-afae-aca73c7d641f@DancesWithMice.info>

by: Chris Angelico - Sat, 23 Apr 2022 22:19 UTC

On Sun, 24 Apr 2022 at 08:06, dn <PythonList@danceswithmice.info> wrote:
>
> On 24/04/2022 09.15, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >>
> >> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> >>>>> This is quite inefficient in general.
> >>>>
> >>>> Why inefficient? I think that readlines() will be much slower, not
> >>>> only more time consuming.
> >>>
> >>> It depends on which is more costly: reading the whole file (cost
> >>> depends on size of file) or reading chunks and splitting into lines
> >>> (cost depends on how well you guess at chunk size). If the lines are
> >>> all *precisely* the same number of bytes each, you can pick a chunk
> >>> size and step backwards with near-perfect efficiency (it's still
> >>> likely to be less efficient than reading a file forwards, on most file
> >>> systems, but it'll be close); but if you have to guess, adjust, and
> >>> keep going, then you lose efficiency there.
> >>
> >> Emh, why chunks? My function simply reads byte per byte and compares it to b"\n". When it find it, it stops and do a readline():
> ...
>
> > Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> > single byte at a time, then reading forwards. That is NOT going to
> > play nicely with file systems or buffers.
> >
> > Compare reading line by line over the file with readlines() and you'll
> > see how abysmal this is.
> >
> > If you really only need one line (which isn't what your original post
> > suggested), I would recommend starting with a chunk that is likely to
> > include a full line, and expanding the chunk until you have that
> > newline. Much more efficient than one byte at a time.
>
>
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!

tail(1) doesn't read a single byte at a time. It works in chunks,
more-or-less the way I described. (That's where I borrowed the
technique from.) It finds one block of lines, and displays those; it
doesn't iterate backwards.

(By the way, the implementation of "tail -f" is actually less
complicated than you might think, as long as inotify is available.
That's been much more useful to me than reading a file backwards.)

> Agree with @Chris' assessment of the (in)efficiency. It is more likely
> than not, that you will have a good idea of the length of each line.
> Even if the line-length is highly-variable (thinking of some of my
> applications of the Python logging module!), one can still 'take a stab
> at it' (a "thumb suck" as an engineer-colleague used to say - apparently
> not an electrical engineer!) by remembering that lines exceeding
> 80-characters become less readable and thus have likely?hopefully been
> split into two.
>
> Thus,
>
> N*(80+p)
>
> where N is the number of lines desired and p is a reasonable
> 'safety'/over-estimation percentage, would give a good chunk size.
> Binar-ily grab that much of the end of the file, split on line-ending,
> and take the last N elements from that list. (with 'recovery code' just
> in case the 'chunk' wasn't large-enough).

Yup, that's the broad idea of the chunked read. If you know how many
lines you're going to need, that's not too bad. If you need to iterate
backwards over the file (as the original question suggested), that
gets complicated fast.

> Adding to the efficiency (of the algorithm, but not the dev-time),
> consider that shorter files are likely to be more easily--handled by
> reading serially from the beginning. To side-step @Chris' criticism, use
> a generator to produce the individual lines (lazy evaluation and low
> storage requirement) and feed them into a circular-queue which is
> limited to N-entries. QED, as fast as the machine's I/O, and undemanding
> of storage-space!

Still needs to read the whole file though.

> Running a few timing trials should reveal the 'sweet spot', at which one
> algorithm takes-over from the other!

Unfortunately, the sweet spot will depend on a lot of things other
than just the file size. Is it stored contiguously? Is it on hard
drive or SSD? Can it be read from the disk cache? How frequently do
the chunk size guesses fail, and how often do they grab more than is
necessary?

It's basically unknowable, so the best you can do is judge your own
app, pick one, and go with it.

> NB quite a few of IBM's (extensively researched) algorithms which formed
> utility program[me]s on mainframes, made similar such algorithmic
> choices, in the pursuit of efficiencies.

Indeed. Just make a choice and run with it, and accept that there will
be situations where it'll be inefficient.

ChrisA

LOAD "LINUX",8,1 -- Topic on #LinuxGER

devel / comp.lang.python / Re: tail