Re: Mutating an HTML file with BeautifulSoup

<mailman.318.1661069386.20444.python-list@python.org>

From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: Mutating an HTML file with BeautifulSoup
Date: Sun, 21 Aug 2022 18:09:32 +1000
 by: Chris Angelico - Sun, 21 Aug 2022 08:09 UTC

On Sun, 21 Aug 2022 at 17:26, Barry <barry@barrys-emacs.org> wrote:
>
>
>
> > On 19 Aug 2022, at 22:04, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sat, 20 Aug 2022 at 05:12, Barry <barry@barrys-emacs.org> wrote:
> >>
> >>
> >>
> >>>> On 19 Aug 2022, at 19:33, Chris Angelico <rosuav@gmail.com> wrote:
> >>>
> >>> What's the best way to precisely reconstruct an HTML file after
> >>> parsing it with BeautifulSoup?
> >>
> >> I recall that bs4 parses into an object tree and loses the detail of the input.
> >> I recently ported from very old bs to bs4 and hit the same issue.
> >> So no, it will not output the same as what went in.
> >>
> >> If you can trust the input to be parsed as XML, meaning all the rules of closing
> >> tags have been followed, then I think you can parse and unparse through XML to
> >> do what you want.
> >>
> >
> >
> > Yeah, no I can't, this is HTML 4 with a ton of inconsistencies. Oh
> > well. Thanks for trying, anyhow.
> >
> > So I'm left with a few options:
> >
> > 1) Give up on validation, give up on verification, and just run this
> > thing on the production site with my fingers crossed
>
> Can you build a beta site with the original intact?

In a naive way, a full copy would be quite a few gigabytes. I could
cut that down a good bit by taking only HTML files and the things they
reference, but then we run into the same problem of broken links,
which is what we're here to solve in the first place.
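
As a rough illustration of "the things they reference": the sketch
below collects href and src values from one HTML document using only
the standard library's html.parser. The names ReferenceCollector,
collect_references and URL_ATTRS are hypothetical, and the attribute
list is an assumption; a real site may also reference files from CSS,
scripts, or other attributes.

```python
from html.parser import HTMLParser

# Attributes assumed to hold a reference to another file.
# This list is illustrative, not exhaustive.
URL_ATTRS = {"href", "src"}


class ReferenceCollector(HTMLParser):
    """Collect href/src attribute values seen while parsing."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so messy
        # markup such as <IMG SRC=...> is handled the same way.
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.references.append(value)


def collect_references(html_text):
    """Return the referenced paths/URLs in document order."""
    collector = ReferenceCollector()
    collector.feed(html_text)
    collector.close()
    return collector.references
```

Running this over every HTML file would give the set of files to copy,
but any reference that is already broken would still be broken in the
copy.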

But I would certainly not want to run two copies of the site and then
manually compare.

> I also wonder whether using Selenium to walk the site might work as a verification step.
> I cannot recall whether you can capture an image of the browser window to compare against when looking for rendering differences.

Image recognition won't necessarily even be valid; some of the changes
will have visual consequences (e.g. a broken image reference now
becoming correct), and as soon as that happens, the whole document can
reflow.
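
One possible alternative to comparing screenshots, sketched here as an
assumption rather than anything proposed in this thread: compare the
parse events of a file before and after the change, masking the
attributes that the fix is expected to alter. EventRecorder,
parse_events, same_apart_from_links and EXPECTED_TO_CHANGE are
hypothetical names, and the sketch uses the standard library's
html.parser, not BeautifulSoup.

```python
from html.parser import HTMLParser

# Attributes whose values the link fix may legitimately change.
# This set is an assumption for illustration.
EXPECTED_TO_CHANGE = {"href", "src"}


class EventRecorder(HTMLParser):
    """Record parse events, masking values that are allowed to differ."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        masked = tuple(
            (name, "*" if name in EXPECTED_TO_CHANGE else value)
            for name, value in attrs
        )
        self.events.append(("start", tag, masked))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text so chunking cannot affect the comparison.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))


def parse_events(html_text):
    recorder = EventRecorder()
    recorder.feed(html_text)
    recorder.close()
    return recorder.events


def same_apart_from_links(before, after):
    """True if both documents parse identically once links are masked."""
    return parse_events(before) == parse_events(after)
```

This only checks equivalence at the parse level. It would not notice
changes in whitespace inside tags, attribute quoting, or tag case, so
it cannot confirm that a file was preserved byte for byte.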

> From my one task using bs4 I did not see it produce any bad results.
> In my case the problems were in the code that built on bs1 using bad assumptions.

Did that get run on perfect HTML, or on messy real-world stuff that
uses quirks mode?
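
To make the distinction concrete, here is a minimal sketch of why tidy
and messy input behave differently on a parse-and-reserialize round
trip. It uses the standard library's html.parser with a hypothetical
NaiveSerializer class, not BeautifulSoup, so it illustrates the general
effect only and says nothing about what bs4 itself outputs.

```python
from html import escape
from html.parser import HTMLParser


class NaiveSerializer(HTMLParser):
    """Rebuild markup from parse events in one fixed output style."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # The parser hands over lowercased names and unquoted values,
        # so the original case and quoting are already gone here.
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def roundtrip(html_text):
    """Parse html_text and serialize it again."""
    serializer = NaiveSerializer()
    serializer.feed(html_text)
    serializer.close()
    return "".join(serializer.parts)
```

Input that already matches the serializer's style, such as
'<p class="a">Hello <b>world</b></p>', comes back unchanged, while
'<P CLASS=a>Hello <B>world' does not, because the tag case and the
missing attribute quotes are lost at parse time.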

ChrisA
