Re: string storage [was: Re: imaplib: is this really so unwieldy?]

<mailman.360.1622032315.3087.python-list@python.org>


https://www.novabbs.com/devel/article-flat.php?id=13353&group=comp.lang.python#13353

From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: string storage [was: Re: imaplib: is this really so unwieldy?]
Date: Wed, 26 May 2021 22:31:41 +1000
Lines: 81
Message-ID: <mailman.360.1622032315.3087.python-list@python.org>
References: <21fb6c5f-97a4-654b-887f-2c31a549bcbe@adminart.net>
<hd6qag98c37mvqurlu3mfcvie38o63kn6n@4ax.com>
<d0e29810-858a-8a32-fda6-a68c63224606@mrabarnett.plus.com>
<s8jtd7$e0d$1@ciao.gmane.io> <s8ksoo$10pm$1@ciao.gmane.io>
<CAPTjJmosxVRBBziQOD0h40wEdQ7ioOVNx53+5dOR7grZOWcQCA@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
In-Reply-To: <s8ksoo$10pm$1@ciao.gmane.io>
 by: Chris Angelico - Wed, 26 May 2021 12:31 UTC

On Wed, May 26, 2021 at 10:04 PM Alan Gauld via Python-list
<python-list@python.org> wrote:
>
> On 25/05/2021 23:23, Terry Reedy wrote:
>
> > In CPython's Flexible String Representation all characters in a string
> > are stored with the same number of bytes, depending on the largest
> > codepoint.
>
> I'm learning lots of new things in this thread!
>
> Does that mean that if I give Python a UTF8 string that is mostly single
> byte characters but contains one 4-byte character that Python will store
> the string as all 4-byte characters?

Nitpick: It won't be "a UTF-8 string"; it will be "a Unicode string".
UTF-8 is a scheme for representing Unicode as a series of bytes, so if
something is UTF-8, it'll be like b'Stra\xc3\x9fe' (with two bytes
representing one non-ASCII character), whereas the corresponding
Unicode string is 'Stra\xdfe' with a single character. Or, if it were
beyond the first 256 characters, '\u2026' is an ellipsis,
b'\xe2\x80\xa6' is a UTF-8 representation of that same character. And
if it's beyond the BMP, then '\U0001F921' is one of the few non-ASCII
characters that you can legitimately write off as a "funny character",
and b'\xf0\x9f\xa4\xa1' is the UTF-8 byte sequence that would carry
that.
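A minimal sketch for checking those three pairs in any Python 3 session. The names `samples` and `encoded_forms` are only illustrative:

```python
# Each Unicode string below is a single "interesting" character (plus some
# ASCII in the first case); encoding it yields the UTF-8 byte sequence.
samples = ['Stra\xdfe', '\u2026', '\U0001F921']
encoded_forms = {text: text.encode('utf-8') for text in samples}

for text, encoded in encoded_forms.items():
    # len(text) counts code points, len(encoded) counts bytes
    print(len(text), len(encoded), encoded)
    assert encoded.decode('utf-8') == text  # round-trips losslessly
```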

So. Yes, if you give Python a large ASCII string with a single non-BMP
character, the entire string *will* be stored as four-byte characters.

(Or, to nitpick against myself: CPython will do this. Other Python
implementations are free to do differently, and for instance, uPy
actually uses UTF-8 like you were predicting. For the rest of this
post, when I say "Python", I actually mean "CPython 3.3 or later".)
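One way to see this for yourself, as a rough sketch that assumes CPython 3.3 or later (other implementations may report different numbers, or not support sys.getsizeof at all). The helper name `bytes_per_char` is made up for this example:

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage by growing a string of `ch` by n chars.

    The fixed object header cancels out in the subtraction, leaving only
    the cost of the n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

for ch in ['a', '\xdf', '\u2026', '\U0001F921']:
    print(f'U+{ord(ch):04X}: {bytes_per_char(ch)} bytes per character')

# A long ASCII string, then the same string with one non-BMP character added.
ascii_only = 'a' * 1000
with_one_wide = ascii_only + '\U0001F921'
print(sys.getsizeof(ascii_only), sys.getsizeof(with_one_wide))
```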

> If so, doesn't that introduce a pretty big storage overhead for
> large strings?
>
> >
> > >>> sys.getsizeof('\U00011111')
> > 80
> > >>> sys.getsizeof('\U00011111'*2)
> > 84
> > >>> sys.getsizeof('a\U00011111')
> > 84

Correct. Each additional character is going to cost you four bytes.

> Which is what this seems to be saying.
>
> I confess I had just assumed the unicode strings were stored
> in native unicode UTF8 format.
>

UTF-8 isn't native any more than any other encoding. It's a good
compact format for transmission, but it's quite inefficient for
manipulation. Python opts to spend some memory in order to improve
time, because that's usually the correct tradeoff to make - it means
that indexing in a large string is fast, slicing a large string is
fast, etc, etc, etc.
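To illustrate why indexing into UTF-8 is slower, here is a sketch of what finding the Nth character in raw UTF-8 bytes involves. This is not how any particular implementation does it; `utf8_char_at` is a hypothetical helper, and it assumes the input is valid UTF-8:

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the code point at `index` by scanning UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError('negative indexes are not supported in this sketch')
    count = -1      # index of the most recent code point seen so far
    start = None    # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte
            count += 1
            if count == index:
                start = pos
            elif count == index + 1:     # next code point begins here
                return data[start:pos].decode('utf-8')
    if start is None:
        raise IndexError('index out of range')
    return data[start:].decode('utf-8')  # wanted code point was the last one

text = 'Stra\xdfe \u2026 \U0001F921!'
data = text.encode('utf-8')
print([utf8_char_at(data, i) for i in range(len(text))] == list(text))
```

The loop has to walk every byte before the target, so the cost grows with the index. With a fixed width per character, the same lookup is a single offset calculation.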

Also, the truth is that, *in practice*, very few strings will pay this
sort of penalty. If you have a whole lot of (say) Chinese text,
there's going to be a small proportion of ASCII text, but most of the
text is going to be wider characters. Working with most European
languages will require the use of the BMP (which means 16-bit text),
but not anything beyond. And if someone's going to use one emoji from
the supplemental planes (which would require 32-bit text), it's fairly
likely that they'll use multiple.
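As a sketch of the rule at work here (the thresholds are the ones from PEP 393; the function name `predicted_width` is only for illustration), the width follows from the largest code point in the string:

```python
def predicted_width(text: str) -> int:
    """Bytes per character CPython 3.3+ would pick, based on the largest code point."""
    largest = max(map(ord, text), default=0)
    if largest < 0x100:       # ASCII and Latin-1 range
        return 1
    if largest < 0x10000:     # rest of the BMP
        return 2
    return 4                  # supplementary planes

for sample in ['hello', 'Stra\xdfe', '\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac',
               '\u4e2d\u6587', 'a\U0001F921']:
    print(ascii(sample), predicted_width(sample))
```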

And if you look at all strings in the Python interpreter, the vast
majority of them will be ASCII-only, getting optimized all the way
down to a single byte. Remember, every module-level variable is stored
in that module's dictionary, keyed by its name - and *most* variable
names in Python are ASCII.
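A quick way to poke at this claim, as a sketch that assumes Python 3.7 or later for str.isascii(). The json module is an arbitrary choice; any imported module would do:

```python
import json  # arbitrary standard-library module used as the example

names = list(vars(json))  # keys of the module's dictionary
ascii_names = [name for name in names if name.isascii()]
print(f'{len(ascii_names)} of {len(names)} names are ASCII-only')
```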

So while it's true that, in theory, a single wide character can cost
you a lot of memory... in practice, this is still a lot more compact,
overall, than storing all strings in UCS-2.

ChrisA

Re: string storage [was: Re: imaplib: is this really so unwieldy?]

<7675d182-bd7c-4c15-ad42-6c9b15ad6f8fn@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=13404&group=comp.lang.python#13404

Newsgroups: comp.lang.python
Date: Fri, 28 May 2021 02:46:37 -0700 (PDT)
In-Reply-To: <mailman.360.1622032315.3087.python-list@python.org>
Injection-Info: google-groups.googlegroups.com; posting-host=2a02:1206:458f:ce30:171:9d34:c9d2:46b1;
posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
NNTP-Posting-Host: 2a02:1206:458f:ce30:171:9d34:c9d2:46b1
References: <21fb6c5f-97a4-654b-887f-2c31a549bcbe@adminart.net>
<hd6qag98c37mvqurlu3mfcvie38o63kn6n@4ax.com> <d0e29810-858a-8a32-fda6-a68c63224606@mrabarnett.plus.com>
<s8jtd7$e0d$1@ciao.gmane.io> <CAPTjJmosxVRBBziQOD0h40wEdQ7ioOVNx53+5dOR7grZOWcQCA@mail.gmail.com>
<s8ksoo$10pm$1@ciao.gmane.io> <mailman.360.1622032315.3087.python-list@python.org>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <7675d182-bd7c-4c15-ad42-6c9b15ad6f8fn@googlegroups.com>
Subject: Re: string storage [was: Re: imaplib: is this really so unwieldy?]
From: wxjmfa...@gmail.com (moi)
Injection-Date: Fri, 28 May 2021 09:46:37 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: moi - Fri, 28 May 2021 09:46 UTC

On Wednesday, 26 May 2021 at 14:32:15 UTC+2, Chris Angelico wrote:
>
> And if you look at all strings in the Python interpreter, the vast
> majority of them will be ASCII-only, getting optimized all the way
> down to a single byte. Remember, every module-level variable is stored
> in that module's dictionary, keyed by its name - and *most* variable
> names in Python are ASCII.
>

That's the advantage of UTF-8. In UTF-8, what you are
calling a byte is, and can only be, an "ASCII character",
and one never has to look up anything above 127.

In this stupid FSR, your byte is not an "ASCII char";
it is, or can be, a "Latin-1 char", contradicting the Unicode
rules with bad consequences (*).

FSR
- the opposite of UTF-8, memory
- the opposite of UTF-32, performance
- liable to cause issues (*) (this is what is really happening).

Congratulations.

Re: string storage [was: Re: imaplib: is this really so unwieldy?]

<3937618a-4d3d-4ecf-9e94-1d03dc099e25n@googlegroups.com>


https://www.novabbs.com/devel/article-flat.php?id=13405&group=comp.lang.python#13405

Newsgroups: comp.lang.python
Date: Fri, 28 May 2021 02:55:12 -0700 (PDT)
In-Reply-To: <7675d182-bd7c-4c15-ad42-6c9b15ad6f8fn@googlegroups.com>
Injection-Info: google-groups.googlegroups.com; posting-host=2a02:1206:458f:ce30:171:9d34:c9d2:46b1;
posting-account=ung4FAoAAAC46zhHJ0Nsnuox7M5gDvs_
NNTP-Posting-Host: 2a02:1206:458f:ce30:171:9d34:c9d2:46b1
References: <21fb6c5f-97a4-654b-887f-2c31a549bcbe@adminart.net>
<hd6qag98c37mvqurlu3mfcvie38o63kn6n@4ax.com> <d0e29810-858a-8a32-fda6-a68c63224606@mrabarnett.plus.com>
<s8jtd7$e0d$1@ciao.gmane.io> <CAPTjJmosxVRBBziQOD0h40wEdQ7ioOVNx53+5dOR7grZOWcQCA@mail.gmail.com>
<s8ksoo$10pm$1@ciao.gmane.io> <mailman.360.1622032315.3087.python-list@python.org>
<7675d182-bd7c-4c15-ad42-6c9b15ad6f8fn@googlegroups.com>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <3937618a-4d3d-4ecf-9e94-1d03dc099e25n@googlegroups.com>
Subject: Re: string storage [was: Re: imaplib: is this really so unwieldy?]
From: wxjmfa...@gmail.com (moi)
Injection-Date: Fri, 28 May 2021 09:55:12 +0000
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
 by: moi - Fri, 28 May 2021 09:55 UTC

On Friday, 28 May 2021 at 11:46:49 UTC+2, moi wrote:

Addendum
And on top of this, people are discussing how to speed up
Python...
