Re: Mutating an HTML file with BeautifulSoup

<mailman.308.1660955901.20444.python-list@python.org>
From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: Mutating an HTML file with BeautifulSoup
Date: Sat, 20 Aug 2022 10:38:07 +1000
 by: Chris Angelico - Sat, 20 Aug 2022 00:38 UTC

On Sat, 20 Aug 2022 at 10:19, dn <PythonList@danceswithmice.info> wrote:
>
> On 20/08/2022 09.01, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 05:12, Barry <barry@barrys-emacs.org> wrote:
> >>
> >>
> >>
> >>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav@gmail.com> wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that in bs4 it parses into an object tree and loses the detail of the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no it will not output the same as went in.
> >>
> >> If you can trust the input to be parsed as xml, meaning all the rules of closing
> >> tags have been followed. Then I think you can parse and unparse thru xml to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
> > 2) Instead of doing an intelligent reconstruction, just str.replace()
> > one URL with another within the file
> > 3) Split the file into lines, find the Nth line (elem.sourceline) and
> > str.replace that line only
> > 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> > of the tag, manually find the end, and replace one tag with the
> > reconstructed form.
> >
> > I'm inclined to the first option, honestly. The others just seem like
> > hard work, and I became a programmer so I could be lazy...
> +1 - but I've noticed that sometimes I have to work quite hard to be
> this lazy!

Yeah, that's very true...

> Am assuming that http -> https is not the only 'change' (if it were,
> you'd just do that without BS). How many such changes are planned/need
> checking? Care to list them?
>

Assumption is correct. The changes are more of the form "find all the
problems, add to the list of fixes, try to minimize the ones that need
to be done manually". So far, what I have is:

1) A bunch of http -> https, but not all of them - only domains where
I've confirmed that it's valid
2) Some absolute to relative conversions:
https://www.gsarchive.net/whowaswho/index.htm should be referred to as
/whowaswho/index.htm instead
3) A few outdated URLs for which we know the replacement, e.g.
http://www.cris.com/~oakapple/gasdisc/<anything> to
http://www.gasdisc.oakapplepress.com/<anything> (this one can't go on
HTTPS, which is one reason I can't just do a blanket http -> https
replacement)
4) Some internal broken links where the path is wrong - anything that
resolves to /books/<anything> but can't be found might be better
rewritten as /html/perf_grps/websites/<anything> if the file can be
found there
5) Any external link that yields a permanent redirect should, to save
client-side requests, get replaced by the destination. We have some
Creative Commons badges that have moved to new URLs.
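
As a rough illustration of rules 1 to 3 above, here is a minimal sketch of a per-URL rewrite function. The names rewrite_url, HTTPS_OK, OWN_HOSTS and PREFIX_MOVES are hypothetical, example.org is a placeholder for a confirmed domain, and treating both http and https links to www.gsarchive.net as relative is an assumption. Rules 4 and 5 are left out because they need filesystem and network checks.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical allowlist: domains where HTTPS has been confirmed to work.
HTTPS_OK = {"example.org"}

# The site's own host; links to it become root-relative paths.
# (Assumption: both http and https forms should be converted.)
OWN_HOSTS = {"www.gsarchive.net"}

# Known moves, expressed as old prefix -> new prefix (rule 3).
PREFIX_MOVES = {
    "http://www.cris.com/~oakapple/gasdisc/":
        "http://www.gasdisc.oakapplepress.com/",
}


def rewrite_url(url):
    """Return the fixed form of url, or url unchanged if no rule applies."""
    # Rule 3: outdated URLs with a known replacement prefix.
    for old, new in PREFIX_MOVES.items():
        if url.startswith(old):
            return new + url[len(old):]

    parts = urlsplit(url)

    # Rule 2: absolute links to our own site become root-relative.
    if parts.scheme in ("http", "https") and parts.hostname in OWN_HOSTS:
        return urlunsplit(("", "", parts.path or "/", parts.query, parts.fragment))

    # Rule 1: http -> https, but only for confirmed domains.
    if parts.scheme == "http" and parts.hostname in HTTPS_OK:
        return parts._replace(scheme="https").geturl()

    return url
```

The prefix moves are checked first so that a moved URL is never also upgraded to HTTPS by accident.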

And there'll be other fixes to be done too. So it's a bit complicated,
and no simple solution is really sufficient. At the very very least, I
*need* to properly parse with BS4; the only question is whether I
reconstruct from the parse tree, or go back to the raw file and try to
edit it there.
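
For the "go back to the raw file" route, the following is only a sketch of the idea, not a tested tool. To stay dependency-free it uses the standard library's html.parser directly instead of BS4 (a deliberate swap; elem.sourceline and elem.sourcepos from BS4 would play the role that getpos() plays here). LinkLocator and replace_urls are hypothetical names, and rewrite is any function mapping an old URL to a new one. Only the start tags that actually change are touched; everything else in the file is left byte-for-byte as it was.

```python
from html.parser import HTMLParser


class LinkLocator(HTMLParser):
    """Record where each start tag carrying an href/src begins in the raw text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hits = []  # (line, column, raw start tag text, attribute value)

    def handle_starttag(self, tag, attrs):
        line, col = self.getpos()         # 1-based line, 0-based column of "<"
        raw = self.get_starttag_text()    # the tag exactly as written in the file
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.hits.append((line, col, raw, value))


def replace_urls(html, rewrite):
    """Apply rewrite() to link URLs by editing the raw text in place."""
    locator = LinkLocator()
    locator.feed(html)
    locator.close()

    # Absolute offset of the start of each line (the parser counts "\n" only).
    line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]

    out = html
    # Work backwards so that earlier offsets stay valid after each edit.
    for line, col, raw, value in reversed(locator.hits):
        new_value = rewrite(value)
        if new_value == value:
            continue
        start = line_starts[line - 1] + col
        end = start + len(raw)
        # Be conservative: skip anything ambiguous (escaped entities, a value
        # that appears more than once in the tag) and leave it for manual review.
        if out[start:end] != raw or raw.count(value) != 1:
            continue
        out = out[:start] + raw.replace(value, new_value) + out[end:]
    return out
```

Something like replace_urls(text, lambda u: u.replace("http://", "https://")) would then change only the matching attribute values, which makes a diff of the old and new file easy to verify by eye.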

For the record, I have very long-term plans to migrate parts of the
site to Markdown, which would make a lot of things easier. But for
now, I need to fix the existing problems in the existing HTML files,
without doing gigantic wholesale layout changes.

ChrisA
