Re: tail

From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Mon, 2 May 2022 09:41:43 +1000
Lines: 51
Message-ID: <mailman.288.1651448516.20749.python-list@python.org>

On Mon, 2 May 2022 at 09:19, Dan Stromberg <drsalists@gmail.com> wrote:
>
> On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs@cskk.id.au> wrote:
>
> > On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >Something like this is OK?
> >
>
> Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.
>
> But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

Most absolutely not. "Unicode" isn't an encoding, but of the Unicode
Transformation Formats and Universal Character Set encodings, most
don't make that guarantee:

* UTF-8 does, as mentioned. It sacrifices some efficiency and
consistency for a guarantee that ASCII characters are represented by
ASCII bytes, and ASCII bytes only ever represent ASCII characters.
* UCS-2 and UTF-16 will both represent BMP characters with two bytes.
Any character U+xx0A or U+0Axx will include an 0x0A in its
representation.
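* UTF-16 will also encode any character beyond the BMP whose code
point ends in 0A (U+xxx0A) with an 0x0A, in the low byte of its low
surrogate. (And I don't think any codepoints have been allocated that
would trigger this, but UTF-16 can also use 0x0A in the low byte of
the high surrogate.)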
* UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and
U+Axxxx (though that plane has no characters on it either)

So, of all the available Unicode standard encodings, only UTF-8 makes
this guarantee.
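
A quick way to check the list above is to encode a few non-newline characters and look for a 0x0A byte. This is only a sketch: the helper name `contains_lf_byte` and the sample characters are picked for illustration, and the explicit little-endian codecs are used so that no BOM gets in the way.

```python
def contains_lf_byte(text, encoding):
    """Return True if the encoded form of text contains a 0x0A byte."""
    return b"\n" in text.encode(encoding)

# None of these characters is a newline:
# U+010A and U+0A05 are in the BMP, U+1F60A is outside it.
samples = ["\u010a", "\u0a05", "\U0001f60a"]

for ch in samples:
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        encoded = ch.encode(enc)
        # Show the code point, the codec, the raw bytes, and the verdict.
        print(f"U+{ord(ch):04X} {enc:9} {encoded.hex(' ')} "
              f"{contains_lf_byte(ch, enc)}")
```

For each sample the UTF-8 bytes are all 0x80 or above, while the UTF-16 and UTF-32 forms contain a stray 0x0A.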

Of course, if you look at documents available on the internet, UTF-8
is the encoding used by the vast majority of them (especially if you
include seven-bit files, which can equally be considered ASCII,
ISO-8859-x, and UTF-8), so while it might only be one encoding out of
many, it's probably the most important :)

In general, you can *only* make this parsing assumption IF you know
for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM
eight-bit encoding (eg Windows-125x), or one of a handful of other
compatible encodings. But it probably will be.
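
Under that assumption, the backward scan can be sketched roughly like this. It is not the code posted earlier in the thread; `tail_lines`, its parameters, and the chunk size are made up for illustration, and it is only valid for UTF-8 or another encoding where byte 0x0A always means a newline.

```python
import os

def tail_lines(path, n, encoding="utf-8", chunk_size=4096):
    """Return the last n lines of path by scanning backward for 0x0A.

    Only safe when every 0x0A byte in the file is a line feed
    (UTF-8, ASCII, ISO-8859-x and similar).
    """
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b""
        # Keep reading earlier chunks until the buffer holds more than
        # n newlines, so the first of the n lines is known to be whole.
        while pos > 0 and buf.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf
    lines = buf.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the file ended with a newline (or was empty)
    # Decoding happens only after splitting on newline bytes, so no
    # multi-byte UTF-8 sequence is ever cut in half.
    return [line.decode(encoding) for line in lines[-n:]]
```

Something like `tail_lines("app.log", 10)` would then return the last ten lines of a hypothetical UTF-8 log file without reading the whole thing.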

> If not, and you have a huge file to reverse, it might be better to use a
> temporary file.

Yeah, or an in-memory deque if you know how many lines you want.
Either way, you can read the file forwards, guaranteeing correct
decoding even of a shifted character set (where a byte value can
change in meaning based on arbitrarily distant context).
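
The forward-reading approach with a deque might look something like this. It is a sketch only; `tail_forward` is a made-up name, and the encoding still has to be supplied by the caller.

```python
from collections import deque

def tail_forward(path, n, encoding="utf-8"):
    """Return the last n lines of path by reading it front to back."""
    with open(path, encoding=encoding) as f:
        # The file object yields decoded lines in order; a deque with
        # maxlen=n silently discards all but the most recent n.
        last = deque(f, maxlen=n)
    return [line.rstrip("\n") for line in last]
```

This reads the entire file, so it is slower on huge inputs, but the codec does the decoding from the start and the result is correct for UTF-16, UTF-32, or a stateful encoding too. Memory use stays bounded by the n lines kept.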

ChrisA
