Re: tail

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: Marco.Su...@gmail.com (Marco Sulla)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sun, 8 May 2022 18:05:11 +0200
Lines: 109
Message-ID: <mailman.348.1652025950.20749.python-list@python.org>
References: <CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<A3773CDA-B6FE-4A51-8D75-362397220F67@barrys-emacs.org>
<CABbU2U_J7HdUjDV8TLjHJkUb7xBTUes6rG0F17sDJNFX13-SNg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
In-Reply-To: <A3773CDA-B6FE-4A51-8D75-362397220F67@barrys-emacs.org>

by: Marco Sulla - Sun, 8 May 2022 16:05 UTC

I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"


def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    # n * chunk_size would be 0 for n == 0 and the loop would never end.
    if n <= 0:
        return ""

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    text = ""

    # With newline="" nothing is translated, so "\n", "\r" and "\r\n"
    # all have to be recognised by hand.
    hard_mode = newline == ""
    # With newline=None every line ending is translated to "\n".
    sep = _lf if newline is None else newline

    with open(filepath, newline=newline, encoding=encoding) as f:
        while pos != 0 and chunk_line_pos == -1:
            pos = max(pos - n_chunk_size, 0)

            # Caveat: in text mode the io docs only define seek() for
            # offsets returned by tell(); an arbitrary byte offset can
            # land inside a multibyte character.
            f.seek(pos)
            text = f.read()

            # Every pass re-reads up to the end of the file, so the
            # count has to start again from n.
            lines_not_found = n
            i = len(text)

            while i > 0:
                # size is the length of the line terminator ending at i.
                if not hard_mode:
                    size = len(sep) if text.endswith(sep, 0, i) else 0
                elif text.endswith(_cr + _lf, 0, i):
                    size = 2
                elif text[i - 1] in (_cr, _lf):
                    size = 1
                else:
                    size = 0

                # A terminator at the very end of the text closes the
                # last line; it does not start a new one.
                if size and i != len(text):
                    lines_not_found -= 1

                    if lines_not_found == 0:
                        chunk_line_pos = i
                        break

                i -= size or 1

    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]

In short, the file is always opened in text mode. It is read from the end in
bigger and bigger chunks, until the start of the file is reached or all the
lines are found.

Why? Because in encodings that use more than 1 byte per character, reading
a chunk of n bytes and then reading the previous chunk can split the bytes
of a single character between the two chunks.
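
To make the problem concrete, here is a small sketch of what happens when a cut falls inside a multibyte UTF-8 character. The helper name decodes_cleanly and the sample string are only illustrative assumptions, not part of the function above.

```python
def decodes_cleanly(chunk: bytes, encoding: str = "utf-8") -> bool:
    """Return True if chunk decodes on its own with strict error handling."""
    try:
        chunk.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


data = "naïve".encode("utf-8")      # "ï" is encoded as the two bytes C3 AF
first, second = data[:3], data[3:]  # the cut falls between C3 and AF

print(decodes_cleanly(data))    # the whole buffer decodes
print(decodes_cleanly(first))   # ends with a lone lead byte
print(decodes_cleanly(second))  # starts with a lone continuation byte
```

Neither half decodes by itself, even though the whole buffer is valid.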

I think one could read chunk by chunk and handle the chunk junction problem
explicitly. I suppose the code would be faster that way. Still, this trick
seems quite fast and it's a lot simpler.
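
A rough sketch of the chunk-by-chunk alternative, assuming the data is valid UTF-8 (other encodings would need a different boundary test). Bytes that belong to a character started in the previous chunk are carried over instead of being decoded in place. Both function names are hypothetical.

```python
def utf8_boundary(buf: bytes) -> int:
    """Index of the first byte in buf that starts a UTF-8 character."""
    i = 0
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i


def decode_backwards(data: bytes, chunk_size: int) -> str:
    """Decode data chunk by chunk from the end, carrying split bytes over."""
    pieces = []
    carry = b""  # leading continuation bytes cut off from a later chunk
    end = len(data)
    while end > 0:
        start = max(end - chunk_size, 0)
        chunk = data[start:end] + carry
        # At the start of the data there is nothing left to carry to.
        cut = utf8_boundary(chunk) if start > 0 else 0
        carry = chunk[:cut]
        pieces.append(chunk[cut:].decode("utf-8"))
        end = start
    return "".join(reversed(pieces))
```

Here the bytes come from memory to keep the sketch short; with a file, each data[start:end] would be a seek() plus read() in binary mode.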

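The final result is taken from the decoded chunk, and not from the file, so
bytes and text can't get misaligned in the result. Furthermore, the builtin
encoding parameter is used, so this should work with all the encodings
(untested). One caveat: a seek() to an arbitrary byte offset in text mode
can land inside a multibyte character, and then read() may fail to decode.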

Furthermore, a newline parameter can be specified, as in open(). If it's
equal to the empty string, things are a little more complicated, but I
suppose the code is clear. That's untested too: I only tested with a UTF-8
Linux file.

Do you think there's a chance of getting this function as a method of the
file object in CPython? The method for a file object opened in bytes mode is
simpler, since there's no encoding and the newline is only \n in that case.
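
For comparison, a minimal sketch of what the bytes-mode variant could look like, assuming lines end with a single b"\n". The name tail_bytes and its default chunk_size are my own choices for illustration, not an existing CPython API.

```python
import os


def tail_bytes(filepath, n=10, chunk_size=4096):
    """Return the last n lines of a file as bytes, assuming LF line endings."""
    if n <= 0:
        return b""

    with open(filepath, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b""
        # A newline at the very end closes the last line, so more than
        # n newlines are needed before the search below can succeed.
        while pos > 0 and buf.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf

    # Skip the final newline, then walk back over n line terminators.
    idx = len(buf) - 1 if buf.endswith(b"\n") else len(buf)
    for _ in range(n):
        idx = buf.rfind(b"\n", 0, idx)
        if idx == -1:
            return buf  # fewer than n lines: return everything read
    return buf[idx + 1:]
```

In binary mode seek() takes plain byte offsets, so there is no alignment issue to worry about; decoding, if needed, is left to the caller.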
