Re: Beautiful Soup - close tags more promptly?

From: roe...@roelschroeven.net (Roel Schroeven)
Newsgroups: comp.lang.python
Subject: Re: Beautiful Soup - close tags more promptly?
Date: Mon, 24 Oct 2022 09:42:13 +0200
Message-ID: <mailman.782.1666597336.20444.python-list@python.org>
 by: Roel Schroeven - Mon, 24 Oct 2022 07:42 UTC

On 24/10/2022 at 4:29, Chris Angelico wrote:
> Parsing ancient HTML files is something Beautiful Soup is normally
> great at. But I've run into a small problem, caused by this sort of
> sloppy HTML:
>
> from bs4 import BeautifulSoup
> # See: https://gsarchive.net/gilbert/plays/princess/tennyson/tenniv.htm
> blob = b"""
> <OL>
> <LI>'THERE sinks the nebulous star we call the Sun,
> <LI>If that hypothesis of theirs be sound,'
> <LI>Said Ida;' let us down and rest:' and we
> <LI>Down from the lean and wrinkled precipices,
> <LI>By every coppice-feather'd chasm and cleft,
> <LI>Dropt thro' the ambrosial gloom to where below
> <LI>No bigger than a glow-worm shone the tent
> <LI>Lamp-lit from the inner. Once she lean'd on me,
> <LI>Descending; once or twice she lent her hand,
> <LI>And blissful palpitations in the blood,
> <LI>Stirring a sudden transport rose and fell.
> </OL>
> """
> soup = BeautifulSoup(blob, "html.parser")
> print(soup)
>
>
> On this small snippet, it works acceptably, but puts a large number of
> </li> tags immediately before the </ol>. On the original file (see
> link if you want to try it), this blows right through the default
> recursion limit, due to the crazy number of "nested" list items.
>
> Is there a way to tell BS4 on parse that these <li> elements end at
> the next <li>, rather than waiting for the final </ol>? This would
> make tidier output, and also eliminate most of the recursion levels.
>
Using html5lib (install the html5lib package) instead of html.parser seems
to do the trick: it inserts </li> right before the next <li>, and one
before the closing </ol>. On my system the same happens when I don't
specify a parser, but IIRC that's a bit fragile because other systems
can choose a different parser if you don't explicitly specify one.
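
A minimal sketch of the comparison, assuming both beautifulsoup4 and
html5lib are installed. The shortened SNIPPET and the helper name
direct_items are made up for illustration; they are not from the
original file. The helper counts how many <li> elements sit directly
under the <ol>, which is a rough proxy for whether the parser closed
each item at the next <li> or nested them all.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4 html5lib

# Shortened stand-in for the quoted snippet: the <LI> tags are never closed.
SNIPPET = """
<OL>
<LI>first line
<LI>second line
<LI>third line
</OL>
"""


def direct_items(markup, parser):
    """Count the <li> elements that are direct children of the first <ol>."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.ol.find_all("li", recursive=False))


if __name__ == "__main__":
    # Naming the parser explicitly avoids depending on whichever one
    # happens to be installed on a given system.
    for parser in ("html.parser", "html5lib"):
        print(parser, direct_items(SNIPPET, parser))
```

Going by the behaviour described in this thread, html5lib should report
all three items as siblings, while html.parser should report a single
direct item with the others nested inside it.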

--
"I love science, and it pains me to think that so many are terrified
of the subject or feel that choosing science means you cannot also
choose compassion, or the arts, or be awed by nature. Science is not
meant to cure us of mystery, but to reinvent and reinvigorate it."
-- Robert Sapolsky
