Re: HTML extraction

From: ros...@gmail.com (Chris Angelico)
Newsgroups: comp.lang.python
Subject: Re: HTML extraction
Date: Wed, 8 Dec 2021 05:07:36 +1100

by: Chris Angelico - Tue, 7 Dec 2021 18:07 UTC

On Wed, Dec 8, 2021 at 4:55 AM Julius Hamilton
<juliushamilton100@gmail.com> wrote:
>
> Hey,
>
> Could anyone please comment on the purest way simply to strip HTML tags
> from the internal text they surround?
>
> I know Beautiful Soup is a convenient tool, but I’m interested to know what
> the most minimal way to do it would be.

Beautiful Soup is definitely the best and most general way, and would
still be my first choice most of the time.

> People say you usually don’t use Regex for a second order language like
> HTML, so I was thinking about using xpath or lxml, which seem like very
> pure, universal tools for the job.
>
> I did find an example for doing this with the re module, though.
>
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?

Obligatory reference:

https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
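To illustrate the point behind that link, here is a small sketch of how a naive tag-stripping regex can go wrong. The pattern and the helper name strip_tags_regex are assumptions made up for this example, not something from the linked answer; the failure shown is a ">" inside a quoted attribute value.

```python
import re

# Naive pattern: anything from '<' up to the next '>' is treated as a tag.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags_regex(html: str) -> str:
    """Remove tag-like spans with a regex (hypothetical helper, illustration only)."""
    return TAG_RE.sub("", html)


# Works on simple, well-behaved markup.
simple = strip_tags_regex("<p>Hello <b>world</b></p>")

# Breaks when an attribute value contains '>': the match ends too early
# and part of the tag leaks into the output.
broken = strip_tags_regex('<a title="x > y">link</a>')

print(repr(simple))  # 'Hello world'
print(repr(broken))  # ' y">link'
```

So a regex can be fine for markup you control, but it has no notion of attribute quoting, comments, or nesting.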

> Can xpath / lxml do that?
>
> What are the chief differences between xpath / lxml and Beautiful Soup?
>

I've never directly used lxml, mainly because bs4 offers all the same
advantages and more, with about the same costs. However, if you're
looking for a no-external-deps option, Python *does* include an HTML
parser in the standard library:

https://docs.python.org/3/library/html.parser.html
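As a rough sketch of what using that module might look like: subclass HTMLParser and collect the text passed to handle_data. The class name TagStripper and the function strip_tags are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text nodes and ignore all tags (hypothetical name)."""

    def __init__(self):
        # convert_charrefs defaults to True, so "&amp;" arrives as "&".
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(html: str) -> str:
    """Return the text content of an HTML string with the tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Hello <b>world</b> &amp; friends</p>"))
# Hello world & friends
```

One caveat with this sketch: the contents of <script> and <style> elements are also delivered through handle_data, so they would need to be filtered separately if they matter.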

If your purpose is extremely simple (like "strip tags, search for
text"), then it should be easy enough to whip up something using that
module: no external deps, not a lot of code, pretty straightforward.
On the other hand, if you're trying to do an "HTML to text"
conversion, you'd probably need to be aware of which tags are
block-level and which are inline, so that (for instance)
"<div>Hello</div> <div>world</div>" would come out as two separate
paragraphs of text, whereas the same thing with <b> tags would become
just "Hello world". But for the most part, overriding handle_data will
probably do everything you need.
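
A minimal sketch of the block-level versus inline distinction, using only html.parser. The BLOCK_TAGS set is a deliberately incomplete assumption, and TextExtractor and html_to_text are hypothetical names; a real converter would need a fuller tag list and more careful whitespace handling.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags, for illustration.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "blockquote", "pre"}


class TextExtractor(HTMLParser):
    """Split text into paragraphs at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # finished blocks of text
        self.current = []     # text chunks for the block being built

    def _flush(self):
        # Collapse runs of whitespace and keep the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Inline tags such as <b> do not flush, so their text stays together.
        self.current.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep any trailing text outside a block element
    return "\n\n".join(parser.paragraphs)


print(repr(html_to_text("<div>Hello</div> <div>world</div>")))  # 'Hello\n\nworld'
print(repr(html_to_text("<b>Hello</b> <b>world</b>")))          # 'Hello world'
```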

ChrisA