HTML extraction

From: juliusha...@gmail.com (Julius Hamilton)
Newsgroups: comp.lang.python
Subject: HTML extraction
Date: Tue, 7 Dec 2021 12:53:42 +0100
Message-ID: <mailman.30.1638899261.15287.python-list@python.org>

by: Julius Hamilton - Tue, 7 Dec 2021 11:53 UTC

Hey,

Could anyone please comment on the purest way to simply strip the HTML tags
away from the text they surround?

I know Beautiful Soup is a convenient tool, but I’m interested to know what
the most minimal way to do it would be.

People say you usually don’t use regex for a nested, non-regular language like
HTML, so I was thinking about using XPath or lxml, which seem like very
pure, universal tools for the job.

I did find an example for doing this with the re module, though.

Would it be fair to say that to just strip the tags, Regex is fine, but you
need to build a tree-like object if you want the ability to select which
nodes to keep and which to discard?

Can xpath / lxml do that?

What are the chief differences between xpath / lxml and Beautiful Soup?

Thanks,
Julius
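
As one possible answer to the "most minimal way" question above, here is a sketch that uses only the standard library's html.parser module. The names TextExtractor and strip_tags are made up for this illustration; they are not part of any library. Note that it keeps every text node, including the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes and ignore all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def strip_tags(html):
    """Return the text content of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(strip_tags("<p>Hello <b>world</b>!</p>"))
```

This does no node selection at all; choosing which elements to keep or discard needs either extra state in the handler methods or a tree-building parser.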

Re: HTML extraction

From: no.em...@nospam.invalid (Paul Rubin)
Newsgroups: comp.lang.python
Subject: Re: HTML extraction
Date: Wed, 08 Dec 2021 07:35:34 -0800
Message-ID: <871r2n85yx.fsf@nightsong.com>

by: Paul Rubin - Wed, 8 Dec 2021 15:35 UTC

Julius Hamilton <juliushamilton100@gmail.com> writes:
> Would it be fair to say that to just strip the tags, Regex is fine, but you
> need to build a tree-like object if you want the ability to select which
> nodes to keep and which to discard?
>
> Can xpath / lxml do that?

I wouldn't use regexps; there are always too many weird corner cases
that they'll miss.
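
To make the corner cases concrete, here is a small sketch comparing a commonly seen tag-stripping pattern with the standard library's html.parser. The pattern and the helper names (regex_strip, parser_strip) are assumptions chosen for illustration, not code from this thread.

```python
import re
from html.parser import HTMLParser

# A commonly seen "remove anything in angle brackets" pattern (illustrative).
TAG_RE = re.compile(r"<[^>]+>")


def regex_strip(html):
    """Strip tags with the regex above."""
    return TAG_RE.sub("", html)


class _TextCollector(HTMLParser):
    """Hypothetical helper: keep only the text nodes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parser_strip(html):
    """Strip tags by letting html.parser tokenize the input."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


# A ">" inside a quoted attribute value ends the regex match too early.
attr_case = '<a title="x > y">link</a>'
# A ">" inside a comment has the same effect.
comment_case = "<!-- a > b -->text"

for sample in (attr_case, comment_case):
    print(repr(regex_strip(sample)), "vs", repr(parser_strip(sample)))
```

In both samples the regex leaves part of the markup behind, while the parser returns only the visible text.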

BeautifulSoup is great if you don't mind it being slow. You could try
xml.etree, which is based on expat (a fast SAX-style parser) and is
similar to lxml. I've also used expat directly from C++ programs for
even more speed. I haven't used lxml so far.
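
A rough sketch of the tree-based approach with xml.etree.ElementTree, showing how a tree lets you choose which nodes to keep and which to discard. One caveat: ElementTree only parses well-formed XML, so this assumes XHTML-like input; real-world HTML often is not well-formed and may need a tolerant parser such as lxml.html or Beautiful Soup. The sample document and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed sample document (an assumption of this sketch).
doc = """<html><body>
<h1>Title</h1>
<script>var x = 1;</script>
<p>Keep <em>this</em> text.</p>
</body></html>"""


def iter_text(element, skip=("script", "style")):
    """Yield text in document order, skipping subtrees whose tag is in skip."""
    if element.tag in skip:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from iter_text(child, skip)
        # Text after a child's closing tag belongs to the parent, so it is
        # kept even when the child itself was skipped.
        if child.tail:
            yield child.tail


def visible_text(source):
    """Return whitespace-normalized text without script or style content."""
    root = ET.fromstring(source)
    return " ".join("".join(iter_text(root)).split())


# Discard the script element, keep everything else.
print(visible_text(doc))

# Or keep only selected nodes: the text of every <p> element.
paragraphs = ["".join(p.itertext()) for p in ET.fromstring(doc).iter("p")]
print(paragraphs)
```

ElementTree's findall() also accepts a limited subset of XPath; lxml offers full XPath 1.0 through its xpath() method.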

Finally, for just converting html to text, "lynx -dump" does a decent
job and is very fast, way faster than anything like bs4.
