
Subject  Author
* Re: tail  Marco Sulla
+* Re: tail  Stefan Ram
|+* Re: tail  MRAB
||`* Re: tail  Stefan Ram
|| +- Re: tail  Chris Angelico
|| `* Re: tail  MRAB
||  `- Re: tail  moi
|`- Re: tail  Chris Angelico
+- Re: tail  Dennis Lee Bieber
`- Re: tail  Barry Scott

Re: tail

<mailman.340.1651948573.20749.python-list@python.org>

From: Marco.Su...@gmail.com (Marco Sulla)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sat, 7 May 2022 20:35:34 +0200
Lines: 44
Message-ID: <mailman.340.1651948573.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
 by: Marco Sulla - Sat, 7 May 2022 18:35 UTC

On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
>
> On 2022-05-07 17:28, Marco Sulla wrote:
> > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> >
> > >>> "\n".encode("utf-16")
> > b'\xff\xfe\n\x00'
> > >>> "".encode("utf-16")
> > b'\xff\xfe'
> > >>> "a\nb".encode("utf-16")
> > b'\xff\xfea\x00\n\x00b\x00'
> > >>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > b'\n\x00'
> >
> > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
>
> In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> could be little-endian or big-endian.
>
> As you didn't specify which you wanted, it defaulted to little-endian
> and added a BOM (U+FEFF).
>
> If you specify which endianness you want with "utf-16le" or "utf-16be",
> it won't add the BOM:
>
> >>> # Little-endian.
> >>> "\n".encode("utf-16le")
> b'\n\x00'
> >>> # Big-endian.
> >>> "\n".encode("utf-16be")
> b'\x00\n'

Well, ok, but I need a generic method to get LF and CR for any
encoding a user can input.
Do you think that

"\n".encode(encoding).lstrip("".encode(encoding))

is good for any encoding? Furthermore, is there a way to get the
encoding of an opened file object?
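
A note on the expression above: bytes.lstrip() treats its argument as a set of byte values, not as a prefix, so it can strip too much when the encoded character itself starts with one of the BOM's bytes. A minimal sketch of a prefix-based variant follows, assuming the only thing to drop is whatever the codec emits for an empty string. The helper name newline_bytes is hypothetical and the codec list is just a sample. The last lines show that a file opened in text mode reports its codec through the standard `encoding` attribute.

```python
import tempfile


def newline_bytes(encoding, char="\n"):
    """Return the bytes that `encoding` produces for `char`, without any BOM."""
    # Whatever the codec emits for the empty string (e.g. a BOM) is treated
    # as a fixed prefix and sliced off, rather than stripped as a byte set.
    prefix = "".encode(encoding)
    encoded = char.encode(encoding)
    if prefix and encoded.startswith(prefix):
        return encoded[len(prefix):]
    return encoded


for name in ("utf-8", "utf-8-sig", "utf-16", "utf-16-be", "utf-32", "latin-1"):
    print(name, newline_bytes(name, "\n"), newline_bytes(name, "\r"))

# A text-mode file object exposes the encoding it was opened with.
with tempfile.TemporaryFile("w+", encoding="utf-16") as f:
    opened_with = f.encoding
    print(opened_with)
```

Note that `encoding` only reports what the file object was opened with; it says nothing about what the bytes on disk really are.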

Re: tail

<encoding-20220507194637@ram.dialup.fu-berlin.de>

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: 7 May 2022 18:47:40 GMT
Organization: Stefan Ram
Lines: 30
Expires: 1 Apr 2023 11:59:58 GMT
Message-ID: <encoding-20220507194637@ram.dialup.fu-berlin.de>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com> <60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org> <CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com> <561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com> <CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com> <mailman.340.1651948573.20749.python-list@python.org>
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
 by: Stefan Ram - Sat, 7 May 2022 18:47 UTC

Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>Well, ok, but I need a generic method to get LF and CR for any
>encoding a user can input.

"LF" and "CR" come from US-ASCII. It is theoretically
possible that there might be some encodings out there
(not for Unicode) that are not based on US-ASCII and
have no LF or no CR.

>is good for any encoding? Furthermore, is there a way to get the
>encoding of an opened file object?

I have written a function that might be able to detect one
of a few encodings based on a heuristic algorithm.

import pathlib

def encoding( name ):
    path = pathlib.Path( name )
    # Try each candidate in turn; the first one that decodes
    # the whole file without an error is returned.
    for encoding in ( "utf_8", "latin_1", "cp1252" ):
        try:
            with path.open( encoding=encoding, errors="strict" ) as file:
                text = file.read()
            return encoding
        except UnicodeDecodeError:
            pass
    return "ascii"

Yes, it's potentially slow and might be wrong.
The result "ascii" might mean it's a binary file.

Re: tail

<33gd7hdkl3aj1bicf9c739hdrnbqcvanro@4ax.com>

From: wlfr...@ix.netcom.com (Dennis Lee Bieber)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sat, 07 May 2022 15:08:57 -0400
Organization: IISS Elusive Unicorn
Message-ID: <33gd7hdkl3aj1bicf9c739hdrnbqcvanro@4ax.com>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com> <60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org> <CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com> <561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com> <CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com> <mailman.340.1651948573.20749.python-list@python.org>
 by: Dennis Lee Bieber - Sat, 7 May 2022 19:08 UTC

On Sat, 7 May 2022 20:35:34 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>Well, ok, but I need a generic method to get LF and CR for any
>encoding a user can input.

Other than EBCDIC, <lf> and <cr> AS BYTES should appear as x0A and x0D
in any of the 8-bit encodings (ASCII, ISO-8859-x, CPxxxx, UTF-8). I believe
those bytes also appear in UTF-16 -- BUT, they will have a null (x00) byte
associated with them as padding. As a result, you cannot search for just
x0Dx0A (the Windows line-end convention); in UTF-16 the sequence is
x00x0Dx00x0A or x0Dx00x0Ax00, depending on endianness. Cf.:
https://docs.microsoft.com/en-us/cpp/text/support-for-unicode?view=msvc-170

For EBCDIC <cr> is still x0D, but <lf> is x25 (and there is a separate
<nl> [new line] at x15)
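
The byte layouts described above can be checked with Python's built-in codecs. This is only a sketch: cp037 is used as a representative EBCDIC code page, which is an assumption for illustration (other EBCDIC code pages may map the control characters differently), and the sample strings are made up.

```python
CR, LF = "\r", "\n"

# Encoded forms of CR and LF in a few codecs. cp037 stands in for EBCDIC
# here (an assumption for illustration).
for codec in ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-16-be", "cp037"):
    print(f"{codec:10} CR={CR.encode(codec)!r} LF={LF.encode(codec)!r}")

# A byte-level search for CRLF misses the line end in UTF-16, because each
# code unit carries a padding null byte.
data = "a\r\nb".encode("utf-16-le")
print(b"\r\n" in data)
print("\r\n".encode("utf-16-le") in data)

# A byte-level search for LF alone can also match inside another character:
# U+010A encodes as the bytes 0A 01 in UTF-16-LE.
print(b"\n" in "\u010a".encode("utf-16-le"))
```

Even when searching for the encoded form, a match in UTF-16 data is only meaningful at an even byte offset, since otherwise it straddles two code units.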

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

Re: tail

<mailman.342.1651951563.20749.python-list@python.org>

From: pyt...@mrabarnett.plus.com (MRAB)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sat, 7 May 2022 20:26:01 +0100
Lines: 34
Message-ID: <mailman.342.1651951563.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org>
<encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
 by: MRAB - Sat, 7 May 2022 19:26 UTC

On 2022-05-07 19:47, Stefan Ram wrote:
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>>Well, ok, but I need a generic method to get LF and CR for any
>>encoding a user can input.
>
> "LF" and "CR" come from US-ASCII. It is theoretically
> possible that there might be some encodings out there
> (not for Unicode) that are not based on US-ASCII and
> have no LF or no CR.
>
>>is good for any encoding? Furthermore, is there a way to get the
>>encoding of an opened file object?
>
> I have written a function that might be able to detect one
> of a few encodings based on a heuristic algorithm.
>
> def encoding( name ):
>     path = pathlib.Path( name )
>     for encoding in ( "utf_8", "latin_1", "cp1252" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" ) as file:
>                 text = file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return "ascii"
>
> Yes, it's potentially slow and might be wrong.
> The result "ascii" might mean it's a binary file.
>
"latin-1" will decode any sequence of bytes, so it'll never try
"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
anyway because the file could contain 0x80..0xFF, which aren't supported
by that encoding.
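
A quick way to check both claims, as a sketch: every one of the 256 possible byte values decodes under latin-1, while ascii rejects anything from 0x80 upward. The variable names are made up for illustration.

```python
every_byte = bytes(range(256))

# latin-1 maps each byte value to the code point with the same number,
# so decoding with errors="strict" cannot fail.
text = every_byte.decode("latin-1", errors="strict")
print(len(text), text.encode("latin-1") == every_byte)

# ascii only covers 0x00..0x7F, so the same input is rejected.
ascii_error_offset = None
try:
    every_byte.decode("ascii")
except UnicodeDecodeError as exc:
    ascii_error_offset = exc.start
print("ascii failed at byte offset", ascii_error_offset)
```

Because the latin-1 attempt always succeeds, any candidate listed after it in a try-each-encoding loop is never reached.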

Re: tail

<detection-20220507215306@ram.dialup.fu-berlin.de>

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: 7 May 2022 20:54:43 GMT
Organization: Stefan Ram
Lines: 25
Expires: 1 Apr 2023 11:59:58 GMT
Message-ID: <detection-20220507215306@ram.dialup.fu-berlin.de>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com> <60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org> <CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com> <561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com> <CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com> <mailman.340.1651948573.20749.python-list@python.org> <encoding-20220507194637@ram.dialup.fu-berlin.de> <983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com> <mailman.342.1651951563.20749.python-list@python.org>
X-Copyright: (C) Copyright 2022 Stefan Ram. All rights reserved.
Distribution through any means other than regular usenet
channels is forbidden. It is forbidden to publish this
article in the Web, to change URIs of this article into links,
and to transfer the body without this notice, but quotations
of parts in other Usenet posts are allowed.
X-No-Archive: Yes
Archive: no
X-No-Archive-Readme: "X-No-Archive" is set, because this prevents some
services to mirror the article in the web. But the article may
be kept on a Usenet archive server with only NNTP access.
 by: Stefan Ram - Sat, 7 May 2022 20:54 UTC

MRAB <python@mrabarnett.plus.com> writes:
>On 2022-05-07 19:47, Stefan Ram wrote:
....
>>def encoding( name ):
>>    path = pathlib.Path( name )
>>    for encoding in ( "utf_8", "latin_1", "cp1252" ):
>>        try:
>>            with path.open( encoding=encoding, errors="strict" ) as file:
>>                text = file.read()
>>            return encoding
>>        except UnicodeDecodeError:
>>            pass
>>    return "ascii"
>>Yes, it's potentially slow and might be wrong.
>>The result "ascii" might mean it's a binary file.
>"latin-1" will decode any sequence of bytes, so it'll never try
>"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>anyway because the file could contain 0x80..0xFF, which aren't supported
>by that encoding.

Thank you! It's working for my specific application where
I'm reading from a collection of text files that should be
encoded in either utf_8, latin_1, or ascii.

Re: tail

<mailman.345.1651959121.20749.python-list@python.org>

From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sun, 8 May 2022 07:31:48 +1000
Lines: 43
Message-ID: <mailman.345.1651959121.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org>
<encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
<mailman.342.1651951563.20749.python-list@python.org>
<detection-20220507215306@ram.dialup.fu-berlin.de>
<CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
 by: Chris Angelico - Sat, 7 May 2022 21:31 UTC

On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> MRAB <python@mrabarnett.plus.com> writes:
> >On 2022-05-07 19:47, Stefan Ram wrote:
> ...
> >>def encoding( name ):
> >>    path = pathlib.Path( name )
> >>    for encoding in ( "utf_8", "latin_1", "cp1252" ):
> >>        try:
> >>            with path.open( encoding=encoding, errors="strict" ) as file:
> >>                text = file.read()
> >>            return encoding
> >>        except UnicodeDecodeError:
> >>            pass
> >>    return "ascii"
> >>Yes, it's potentially slow and might be wrong.
> >>The result "ascii" might mean it's a binary file.
> >"latin-1" will decode any sequence of bytes, so it'll never try
> >"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >anyway because the file could contain 0x80..0xFF, which aren't supported
> >by that encoding.
>
> Thank you! It's working for my specific application where
> I'm reading from a collection of text files that should be
> encoded in either utf_8, latin_1, or ascii.
>

In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since
Windows-1252 has some undefined byte values.)
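
The "decode as UTF-8 if possible, otherwise decode as Latin-1" dance can be sketched roughly like this. The function name decode_unit and the sample stream are hypothetical, and a line is assumed to be the unit.

```python
def decode_unit(raw: bytes) -> str:
    """Decode one unit (e.g. a line) as UTF-8 if possible, else as Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 accepts every byte value, so this branch cannot fail.
        return raw.decode("latin-1")


# A stream mixing pure ASCII, UTF-8 and Latin-1 lines.
stream = (
    b"plain ascii\n"
    + "caf\u00e9\n".encode("utf-8")
    + "caf\u00e9\n".encode("latin-1")
)
for line in stream.splitlines():
    print(repr(decode_unit(line)))

# CP-1252 variant: some byte values (0x81 among them) are undefined in
# Python's cp1252 codec, so errors="replace" keeps the decode from raising.
replaced = b"\x81".decode("cp1252", errors="replace")
print(repr(replaced))
```

The obvious limitation is that a Latin-1 unit which happens to be valid UTF-8 will be decoded as UTF-8, which is why this is a heuristic rather than detection.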

ChrisA

Re: tail

<mailman.350.1652033713.20749.python-list@python.org>

From: bar...@barrys-emacs.org (Barry Scott)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sun, 8 May 2022 19:15:09 +0100
Lines: 70
Message-ID: <mailman.350.1652033713.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org>
<encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
<mailman.342.1651951563.20749.python-list@python.org>
<detection-20220507215306@ram.dialup.fu-berlin.de>
<CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
<0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org>
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.80.82.1.1\))
Content-Type: text/plain;
charset=us-ascii
Content-Transfer-Encoding: quoted-printable
In-Reply-To: <CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
 by: Barry Scott - Sun, 8 May 2022 18:15 UTC

> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>
>> MRAB <python@mrabarnett.plus.com> writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> def encoding( name ):
>>>>     path = pathlib.Path( name )
>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>         try:
>>>>             with path.open( encoding=encoding, errors="strict" )as file:
>>>>                 text = file.read()
>>>>             return encoding
>>>>         except UnicodeDecodeError:
>>>>             pass
>>>>     return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>>
>> Thank you! It's working for my specific application where
>> I'm reading from a collection of text files that should be
>> encoded in either utf_8, latin_1, or ascii.
>>
>
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
>
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

A very common error on Windows is that files, and especially web pages, that
claim to be UTF-8 are in fact CP-1252.

There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.

It's usually the left and right "smart" quote chars that cause the issue, as
their CP-1252 bytes are invalid UTF-8.
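
To make the smart-quote point concrete, here is a minimal sketch (not code from the thread; the helper name `is_valid_utf8` is made up for illustration). It encodes curly double quotes as CP-1252 would store them and checks whether the resulting bytes pass a strict UTF-8 decode.

```python
# Curly double quotes stored the CP-1252 way: single bytes 0x93 and 0x94.
raw = "\u201cquoted\u201d".encode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(raw)                   # b'\x93quoted\x94'
print(is_valid_utf8(raw))    # False: 0x93 is a continuation byte with no lead byte
print(raw.decode("cp1252"))  # the original text, curly quotes intact
```

The same text saved as real UTF-8 uses three bytes per quote and passes the check, which is why a strict UTF-8 attempt is a reasonable first test.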

Barry


Re: tail

<mailman.351.1652034444.20749.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=18191&group=comp.lang.python#18191

Path: i2pn2.org!i2pn.org!news.swapon.de!fu-berlin.de!uni-berlin.de!not-for-mail
From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Mon, 9 May 2022 04:27:10 +1000
Lines: 63
Message-ID: <mailman.351.1652034444.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org>
<encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
<mailman.342.1651951563.20749.python-list@python.org>
<detection-20220507215306@ram.dialup.fu-berlin.de>
<CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
<0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org>
<CAPTjJmqG5G5YKf2_GTwYu4x3ReQpUKv1-xWVSmgu-2B9n7Xv8A@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
In-Reply-To: <0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org>
 by: Chris Angelico - Sun, 8 May 2022 18:27 UTC

On Mon, 9 May 2022 at 04:15, Barry Scott <barry@barrys-emacs.org> wrote:
>
>
>
> > On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >>
> >> MRAB <python@mrabarnett.plus.com> writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> def encoding( name ):
> >>>>     path = pathlib.Path( name )
> >>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>         try:
> >>>>             with path.open( encoding=encoding, errors="strict" )as file:
> >>>>                 text = file.read()
> >>>>             return encoding
> >>>>         except UnicodeDecodeError:
> >>>>             pass
> >>>>     return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >> Thank you! It's working for my specific application where
> >> I'm reading from a collection of text files that should be
> >> encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> A very common error on Windows is that files, and especially web pages, that
> claim to be UTF-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.
>
> It's usually the left and right "smart" quote chars that cause the issue, as
> their CP-1252 bytes are invalid UTF-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.
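
A minimal sketch of that one-attempt, one-fallback logic, under stated assumptions: the function name `decode_with_fallback` and its `fallback` parameter are made up for illustration, and the data is taken as already read in bytes.

```python
def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Decode as strict UTF-8; on failure, decode with the fallback codec.

    Hypothetical helper. errors="replace" only matters for fallbacks such
    as cp1252 that leave some byte values undefined; latin-1 maps every byte.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback, errors="replace")


print(decode_with_fallback("café".encode("utf-8")))    # valid UTF-8, decoded as such
print(decode_with_fallback(b"caf\xe9"))                # invalid UTF-8, decoded as Latin-1
print(decode_with_fallback(b"\x93hi\x94", "cp1252"))   # CP-1252 smart quotes
```

Pure ASCII input never reaches the fallback, since every ASCII byte sequence is also valid UTF-8.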

ChrisA

Re: tail

<mailman.353.1652035830.20749.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=18193&group=comp.lang.python#18193

Path: i2pn2.org!i2pn.org!usenet.goja.nl.eu.org!3.eu.feeder.erje.net!feeder.erje.net!news2.arglkargh.de!news.karotte.org!fu-berlin.de!uni-berlin.de!not-for-mail
From: pyt...@mrabarnett.plus.com (MRAB)
Newsgroups: comp.lang.python
Subject: Re: tail
Date: Sun, 8 May 2022 19:50:21 +0100
Lines: 56
Message-ID: <mailman.353.1652035830.20749.python-list@python.org>
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org>
<CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org>
<encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
<mailman.342.1651951563.20749.python-list@python.org>
<detection-20220507215306@ram.dialup.fu-berlin.de>
<CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
<0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org>
<f60c4202-21b5-5cb0-8b15-bc97cb40747c@mrabarnett.plus.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org>
 by: MRAB - Sun, 8 May 2022 18:50 UTC

On 2022-05-08 19:15, Barry Scott wrote:
>
>
>> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>>
>> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>>
>>> MRAB <python@mrabarnett.plus.com> writes:
>>>> On 2022-05-07 19:47, Stefan Ram wrote:
>>> ...
>>>>> def encoding( name ):
>>>>>     path = pathlib.Path( name )
>>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>>         try:
>>>>>             with path.open( encoding=encoding, errors="strict" )as file:
>>>>>                 text = file.read()
>>>>>             return encoding
>>>>>         except UnicodeDecodeError:
>>>>>             pass
>>>>>     return "ascii"
>>>>> Yes, it's potentially slow and might be wrong.
>>>>> The result "ascii" might mean it's a binary file.
>>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>>> by that encoding.
>>>
>>> Thank you! It's working for my specific application where
>>> I'm reading from a collection of text files that should be
>>> encoded in either utf_8, latin_1, or ascii.
>>>
>>
>> In that case, I'd exclude ASCII from the check, and just check UTF-8,
>> and if that fails, decode as Latin-1. Any ASCII files will decode
>> correctly as UTF-8, and any file will decode as Latin-1.
>>
>> I've used this exact fallback system when decoding raw data from
>> Unicode-naive servers - they accept and share bytes, so it's entirely
>> possible to have a mix of encodings in a single stream. As long as you
>> can define the span of a single "unit" (say, a line, or a chunk in
>> some form), you can read as bytes and do the exact same "decode as
>> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
>> perfectly ideal, but it's about as good as you'll get with a lot of
>> US-based servers. (Depending on context, you might use CP-1252 instead
>> of Latin-1, but you might need errors="replace" there, since
>> Windows-1252 has some undefined byte values.)
>
> A very common error on Windows is that files, and especially web pages, that
> claim to be UTF-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.
>
> It's usually the left and right "smart" quote chars that cause the issue, as
> their CP-1252 bytes are invalid UTF-8.
>
Is it CP-1252 or ISO-8859-1 (Latin-1)?
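
The two codecs differ only in the 0x80..0x9F range, which a short sketch can show (assumptions: Python's built-in `latin-1` and `cp1252` codecs; the helper name `decodes` is made up for illustration). Latin-1 maps those bytes to C1 control characters, while CP-1252 maps most of them to printable characters and leaves a few undefined.

```python
def decodes(data: bytes, codec: str) -> bool:
    """Return True if data decodes strictly with the given codec (hypothetical helper)."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


sample = bytes([0x93, 0x80, 0xE9])
print(repr(sample.decode("latin-1")))  # two C1 control characters, then e-acute
print(repr(sample.decode("cp1252")))   # left curly quote, euro sign, e-acute

# Latin-1 accepts every byte value; Python's cp1252 codec rejects a handful.
undefined = [hex(b) for b in range(256) if not decodes(bytes([b]), "cp1252")]
print(undefined)
print(all(decodes(bytes([b]), "latin-1") for b in range(256)))
```

Above 0x9F the two agree byte for byte, so the choice only changes the result for files that contain bytes in 0x80..0x9F.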

Re: tail

<3a3299f2-bcba-41cb-a9bd-e8f58835bf56n@googlegroups.com>

https://www.novabbs.com/devel/article-flat.php?id=18218&group=comp.lang.python#18218

Path: i2pn2.org!i2pn.org!weretis.net!feeder6.news.weretis.net!usenet.blueworldhosting.com!feed1.usenet.blueworldhosting.com!peer02.iad!feed-me.highwinds-media.com!news.highwinds-media.com!news-out.google.com!nntp.google.com!postnews.google.com!google-groups.googlegroups.com!not-for-mail
Newsgroups: comp.lang.python
Date: Mon, 9 May 2022 23:55:20 -0700 (PDT)
In-Reply-To: <mailman.353.1652035830.20749.python-list@python.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2a02:1210:689b:7a00:cc89:b6ed:354d:6f9;
posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
NNTP-Posting-Host: 2a02:1210:689b:7a00:cc89:b6ed:354d:6f9
References: <CABbU2U-_Z546umxtnZXL8b1LUERCnyOxYw6osKTvKncOHFkJ3A@mail.gmail.com>
<60454E09-0ADA-4881-A84B-6C11397D244F@barrys-emacs.org> <CABbU2U99Jpa6nuYg0sXw6=GjBEKVk9u-_oyxSoL8hLrW_2FoBA@mail.gmail.com>
<561ac7a8-2034-c1ce-6fca-f4280baac409@mrabarnett.plus.com>
<CABbU2U-N=YiRYfVkjpv8RP6BCo4VOLL7SWK=vNq8oje7nwuyUw@mail.gmail.com>
<mailman.340.1651948573.20749.python-list@python.org> <encoding-20220507194637@ram.dialup.fu-berlin.de>
<983cc36d-0727-23ab-f168-ae88b14e6934@mrabarnett.plus.com>
<mailman.342.1651951563.20749.python-list@python.org> <detection-20220507215306@ram.dialup.fu-berlin.de>
<CAPTjJmrf3ObZPTUjBZyBHUsu2bjaxsg=1pqnbCbePzYtUH7yHg@mail.gmail.com>
<f60c4202-21b5-5cb0-8b15-bc97cb40747c@mrabarnett.plus.com>
<0DC070E5-2BCC-47AA-9DDE-4EE7B3F3D441@barrys-emacs.org> <mailman.353.1652035830.20749.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3a3299f2-bcba-41cb-a9bd-e8f58835bf56n@googlegroups.com>
Subject: Re: tail
From: wxjmfa...@gmail.com (moi)
Injection-Date: Tue, 10 May 2022 06:55:21 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Received-Bytes: 4907
 by: moi - Tue, 10 May 2022 06:55 UTC

On Sunday, 8 May 2022 at 20:50:50 UTC+2, MRAB wrote:
> On 2022-05-08 19:15, Barry Scott wrote:
> >
> >
> >> On 7 May 2022, at 22:31, Chris Angelico <ros...@gmail.com> wrote:
> >>
> >> On Sun, 8 May 2022 at 07:19, Stefan Ram <r...@zedat.fu-berlin.de> wrote:
> >>>
> >>> MRAB <pyt...@mrabarnett.plus.com> writes:
> >>>> On 2022-05-07 19:47, Stefan Ram wrote:
> >>> ...
> >>>>> def encoding( name ):
> >>>>>     path = pathlib.Path( name )
> >>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>>         try:
> >>>>>             with path.open( encoding=encoding, errors="strict" )as file:
> >>>>>                 text = file.read()
> >>>>>             return encoding
> >>>>>         except UnicodeDecodeError:
> >>>>>             pass
> >>>>>     return "ascii"
> >>>>> Yes, it's potentially slow and might be wrong.
> >>>>> The result "ascii" might mean it's a binary file.
> >>>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>>> by that encoding.
> >>>
> >>> Thank you! It's working for my specific application where
> >>> I'm reading from a collection of text files that should be
> >>> encoded in either utf_8, latin_1, or ascii.
> >>>
> >>
> >> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> >> and if that fails, decode as Latin-1. Any ASCII files will decode
> >> correctly as UTF-8, and any file will decode as Latin-1.
> >>
> >> I've used this exact fallback system when decoding raw data from
> >> Unicode-naive servers - they accept and share bytes, so it's entirely
> >> possible to have a mix of encodings in a single stream. As long as you
> >> can define the span of a single "unit" (say, a line, or a chunk in
> >> some form), you can read as bytes and do the exact same "decode as
> >> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> >> perfectly ideal, but it's about as good as you'll get with a lot of
> >> US-based servers. (Depending on context, you might use CP-1252 instead
> >> of Latin-1, but you might need errors="replace" there, since
> >> Windows-1252 has some undefined byte values.)
> >
> > A very common error on Windows is that files, and especially web pages, that
> > claim to be UTF-8 are in fact CP-1252.
> >
> > There is logic in the HTML standards to try UTF-8 and, if that fails, fall back to CP-1252.
> >
> > It's usually the left and right "smart" quote chars that cause the issue, as
> > their CP-1252 bytes are invalid UTF-8.
> >
> Is it CP-1252 or ISO-8859-1 (Latin-1)?

In good software, latin-1 / ISO-8859-1 does not exist.
This is also the case for (and in) Unicode.
