comp.lang.python / Short, perfect program to read sentences of webpage

Subject / Author
* Short, perfect program to read sentences of webpage - Julius Hamilton
+* Re: Short, perfect program to read sentences of webpage - Stefan Ram
|+* Re: Short, perfect program to read sentences of webpage - Cameron Simpson
||`* Re: Short, perfect program to read sentences of webpage - Stefan Ram
|| +- Re: Short, perfect program to read sentences of webpage - Stefan Ram
|| +- Re: Short, perfect program to read sentences of webpage - MRAB
|| `- Re: Short, perfect program to read sentences of webpage - Cameron Simpson
|`- Re: Short, perfect program to read sentences of webpage - Peter J. Holzer
+- Re: Short, perfect program to read sentences of webpage - Stefan Ram
`* Re: Short, perfect program to read sentences of webpage - Jon Ribbens
 `- Re: Short, perfect program to read sentences of webpage - Stefan Ram

Short, perfect program to read sentences of webpage

<mailman.47.1638995449.15287.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=16354&group=comp.lang.python#16354

From: juliusha...@gmail.com (Julius Hamilton)
Newsgroups: comp.lang.python
Date: Wed, 8 Dec 2021 20:39:05 +0100
 by: Julius Hamilton - Wed, 8 Dec 2021 19:39 UTC

Hey,

This is something I have been working on for a very long time. It’s one of
the reasons I got into programming at all. I’d really appreciate it if people
could offer some advice on this.

This is a really simple program which extracts the text from webpages and
displays them one sentence at a time. It’s meant to help you study dense
material, especially documentation, with much more focus and comprehension.
I actually hope it can be of help to people who have difficulty reading. I
know it’s been of use to me at least.

This is a minimally acceptable way to pull it off currently:

deepreader.py:

import sys
import requests
import html2text
import nltk

url = sys.argv[1]

# Get the html, pull out the text, and sentence-segment it in one line of code
sentences = nltk.sent_tokenize(html2text.html2text(requests.get(url).text))

# Activate an elementary reader interface for the text
for index, sentence in enumerate(sentences):

    # Print the sentence
    print("\n" + str(index) + "/" + str(len(sentences)) + ": " + sentence + "\n")

    # Wait for user key-press
    x = input("\n> ")

EOF

That’s it.
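
For reference, the script takes the page URL as its single command-line argument; the URL below is only a placeholder:

python deepreader.py https://example.com/article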

A lot of refining is possible, and I’d really like to see how some more
experienced people might handle it.

1. The HTML extraction is not perfect. It doesn’t produce as clean text as
I would like. Sometimes random links or tags get left in there. And the
sentences are sometimes randomly broken by newlines.

2. Neither is the segmentation perfect. I am currently researching
developing an optimal segmenter with tools from Spacy.
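
A minimal sketch in that direction, assuming the spaCy package is installed (the sample text and the rule-based "sentencizer" pipe are illustrative only; a trained pipeline would likely segment more accurately):

import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. And this is the second one.")
for sent in doc.sents:
    print(sent.text)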

Brevity is greatly valued. I mean, anyone who can make the program more
perfect, that’s hugely appreciated. But if someone can do it in very few
lines of code, that’s also appreciated.

Thanks very much,
Julius

Re: Short, perfect program to read sentences of webpage

<sentences-20211208223927@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=16357&group=comp.lang.python#16357

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 8 Dec 2021 21:41:20 GMT
 by: Stefan Ram - Wed, 8 Dec 2021 21:41 UTC

Julius Hamilton <juliushamilton100@gmail.com> writes:
>This is a really simple program which extracts the text from webpages and
>displays them one sentence at a time.

Our teacher said NLTK will not come up until next year, so
I tried to do with regexps. It still has bugs, for example
it can not tell the dot at the end of an abbreviation from
the dot at the end of a sentence!

import re
import urllib.request
uri = r'''http://example.com/article''' # replace this with your URI!
request = urllib.request.Request( uri )
resource = urllib.request.urlopen( request )
cs = resource.headers.get_content_charset()
content = resource.read().decode( cs, errors="ignore" )
content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )
upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"
digit = r"[0-9]" #"[\\p{Nd}]"
firstwordstart = upper;
firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";
wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
ñòóôõöøùúûüýþÿ0-9-]"
addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"
rawfirstword = "(?:" + firstwordstart + firstwordnext + "*" + ")"
rawnextword = "(?:" + wordcharacter + "+" + ")"
preword = ""
postword = "(?:[,;]?)"
firstword = rawfirstword + postword
nextword = preword + rawnextword + postword
fullStop = "[.?!]"
space = "(?:[\\s]+)"
extension = "(?:" + space + nextword + ")"
extensions = "(?:" + extension + "{7,})"
sentence = firstword + extensions + fullStop
match = "(?:^|[^A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïðñòó\
ôõöøùúûüýþÿ0-9])" + "(" + sentence + ")";
patternString = "(?:" + match + ")";
for sentence in re.finditer( patternString, content ):
    print( sentence.group( 0 ))

Re: Short, perfect program to read sentences of webpage

<sentences-20211208224704@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=16358&group=comp.lang.python#16358

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Supersedes: <sentences-20211208223927@ram.dialup.fu-berlin.de>
Date: 8 Dec 2021 21:51:25 GMT
 by: Stefan Ram - Wed, 8 Dec 2021 21:51 UTC

Supersedes: <sentences-20211208223927@ram.dialup.fu-berlin.de>
[I changed the minimum number of words in a sentence from 7 to 2
in the line starting with "extensions =" and changed a 0 to a 1
in the last line.]

Julius Hamilton <juliushamilton100@gmail.com> writes:
>This is a really simple program which extracts the text from webpages and
>displays them one sentence at a time.

Our teacher said NLTK will not come up until next year, so
I tried to do with regexps. It still has bugs, for example
it can not tell the dot at the end of an abbreviation from
the dot at the end of a sentence!

import re
import urllib.request
uri = r'''http://example.com/article''' # replace this with your URI!
request = urllib.request.Request( uri )
resource = urllib.request.urlopen( request )
cs = resource.headers.get_content_charset()
content = resource.read().decode( cs, errors="ignore" )
content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )
upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"
digit = r"[0-9]" #"[\\p{Nd}]"
firstwordstart = upper;
firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";
wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
ñòóôõöøùúûüýþÿ0-9-]"
addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"
rawfirstword = "(?:" + firstwordstart + firstwordnext + "*" + ")"
rawnextword = "(?:" + wordcharacter + "+" + ")"
preword = ""
postword = "(?:[,;]?)"
firstword = rawfirstword + postword
nextword = preword + rawnextword + postword
fullStop = "[.?!]"
space = "(?:[\\s]+)"
extension = "(?:" + space + nextword + ")"
extensions = "(?:" + extension + "{2,})"
sentence = firstword + extensions + fullStop
match = "(?:^|[^A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïðñòó\
ôõöøùúûüýþÿ0-9])" + "(" + sentence + ")";
patternString = "(?:" + match + ")";
for sentence in re.finditer( patternString, content ):
    print( sentence.group( 1 ))

Re: Short, perfect program to read sentences of webpage

<slrnsr2bsf.4j7.jon+usenet@raven.unequivocal.eu>

https://www.novabbs.com/devel/article-flat.php?id=16360&group=comp.lang.python#16360

From: jon+use...@unequivocal.eu (Jon Ribbens)
Newsgroups: comp.lang.python
Date: Wed, 8 Dec 2021 22:19:59 -0000 (UTC)
 by: Jon Ribbens - Wed, 8 Dec 2021 22:19 UTC

On 2021-12-08, Julius Hamilton <juliushamilton100@gmail.com> wrote:
> 1. The HTML extraction is not perfect. It doesn’t produce as clean text as
> I would like. Sometimes random links or tags get left in there. And the
> sentences are sometimes randomly broken by newlines.

Oh. Leaving tags in suggests you are doing this very wrongly. Python
has plenty of open source libraries you can use that will parse the
HTML reliably into tags and text for you.
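
A minimal sketch of that approach using only the standard library's html.parser (not code from the thread; the URL is a placeholder):

import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

with urllib.request.urlopen("http://example.com/article") as resource:
    cs = resource.headers.get_content_charset() or "utf-8"
    html = resource.read().decode(cs, errors="ignore")

parser = TextExtractor()
parser.feed(html)
text = " ".join(" ".join(parser.parts).split())  # collapse whitespace
print(text)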

> 2. Neither is the segmentation perfect. I am currently researching
> developing an optimal segmenter with tools from Spacy.
>
> Brevity is greatly valued. I mean, anyone who can make the program more
> perfect, that’s hugely appreciated. But if someone can do it in very few
> lines of code, that’s also appreciated.

It isn't something that can be done in a few lines of code. There's the
spaces issue you mention for example. Nor is it something that can
necessarily be done just by inspecting the HTML alone. To take a trivial
example:

powergen<div>italia</div> = powergen <nl> italia

but:

powergen<span>italia</span> = powergenitalia

but the second with the addition of:

<style>span { display: block }</style>

is back to "powergen <nl> italia". So you need to parse and apply styles
(including external stylesheets) as well. Potentially you may also need
to execute JavaScript on the page, which means you also need a JavaScript
interpreter and a DOM implementation. Basically you need a complete
browser to do it on general web pages.

Re: Short, perfect program to read sentences of webpage

<tags-20211208232526@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=16361&group=comp.lang.python#16361

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 8 Dec 2021 22:35:22 GMT
 by: Stefan Ram - Wed, 8 Dec 2021 22:35 UTC

Jon Ribbens <jon+usenet@unequivocal.eu> writes:
>It isn't something that can be done in a few lines of code.

You're right. There are special scraping libraries like
"Selenium" that use a web browser to get the JavaScript
part right; maybe this can help in certain cases.

But many news sites essentially paste plain text into
a page template that will add all the ads and colors,
while the actual text content does not even have basic
markup like <strong> or <em>. In this case, things get easier.

My own program (shown earlier in this thread) might miss or
distort some sentences; the intention was to get an uncluttered
overview of the gist of the text on a page so that one can
quickly get a rough idea of the content, even if the text
shown is somewhat distorted.

Re: Short, perfect program to read sentences of webpage

<mailman.52.1639003341.15287.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=16363&group=comp.lang.python#16363

From: cs...@cskk.id.au (Cameron Simpson)
Newsgroups: comp.lang.python
Date: Thu, 9 Dec 2021 09:42:07 +1100
 by: Cameron Simpson - Wed, 8 Dec 2021 22:42 UTC

On 08Dec2021 21:41, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>Julius Hamilton <juliushamilton100@gmail.com> writes:
>>This is a really simple program which extracts the text from webpages and
>>displays them one sentence at a time.
>
> Our teacher said NLTK will not come up until next year, so
> I tried to do with regexps. It still has bugs, for example
> it can not tell the dot at the end of an abbreviation from
> the dot at the end of a sentence!

This is almost a classic demo of why regexps are a poor tool as a first
choice. You can do much with them, but they are cryptic and bug prone.

I am not seeking to mock you, but trying to make it apparent why regexps
are best avoided a lot of the time. They have their place.

You've read the whole re module docs I hope:

https://docs.python.org/3/library/re.html#module-re

>import re
>import urllib.request
>uri = r'''http://example.com/article''' # replace this with your URI!
>request = urllib.request.Request( uri )
>resource = urllib.request.urlopen( request )
>cs = resource.headers.get_content_charset()
>content = resource.read().decode( cs, errors="ignore" )
>content = re.sub( r'''[\r\n\t\s]+''', r''' ''', content )

You're not multiline, so I would recommend a plain raw string:

content = re.sub( r'[\r\n\t\s]+', r' ', content )

No need for \r in the class, \s covers that. From the docs:

\s
For Unicode (str) patterns:

Matches Unicode whitespace characters (which includes [
\t\n\r\f\v], and also many other characters, for example the
non-breaking spaces mandated by typography rules in many
languages). If the ASCII flag is used, only [ \t\n\r\f\v] is
matched.

>upper = r"[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ]" # "[\\p{Lu}]"
>lower = r"[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ]" # "[\\p{Ll}]"

This is very fragile - you have an arbitrary set of additional uppercase
characters, almost certainly incomplete, and visually hard to inspect
for completeness.

Instead, consider the \b (word boundary) and \w (word character)
markers, which will let you break strings up, and then maybe test the
results with str.isupper().
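
A minimal sketch of that idea (not code from the thread; the sample text is made up):

import re

text = "NLTK is not covered until next year. So I tried regexps."
words = re.findall(r"\w+", text)                  # \w already covers accented letters
capitalised = [w for w in words if w[0].isupper()]
print(capitalised)                                # ['NLTK', 'So', 'I']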

>digit = r"[0-9]" #"[\\p{Nd}]"

There's a \d character class for this; it covers non-ASCII decimal digits too.

>firstwordstart = upper;
>firstwordnext = "(?:[a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ-])";

Again, an inline arbitrary list of characters. This is fragile.

>wordcharacter = "[A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝa-zµàáâãäåæçèéêëìíîïð\
>ñòóôõöøùúûüýþÿ0-9-]"

Again inline. Why not construct it?

wordcharacter = upper + lower + digit

but I recommend \w instead, or for this: [\w\d]

>addition = "(?:(?:[']" + wordcharacter + "+)*[']?)?"

As a matter of good practice with regexp strings, use raw quotes:

addition = r"(?:(?:[']" + wordcharacter + r"+)*[']?)?"

even when there are no backslashes.

Seriously, doing this with regexps is difficult. A useful exercise for
learning regexps, but in the general case not the first tool to reach
for.

Cheers,
Cameron Simpson <cs@cskk.id.au>

Re: Short, perfect program to read sentences of webpage

<mailman.53.1639004988.15287.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=16364&group=comp.lang.python#16364

From: hjp-pyt...@hjp.at (Peter J. Holzer)
Newsgroups: comp.lang.python
Date: Thu, 9 Dec 2021 00:09:47 +0100
 by: Peter J. Holzer - Wed, 8 Dec 2021 23:09 UTC

On 2021-12-09 09:42:07 +1100, Cameron Simpson wrote:
> On 08Dec2021 21:41, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >Julius Hamilton <juliushamilton100@gmail.com> writes:
> >>This is a really simple program which extracts the text from webpages and
> >>displays them one sentence at a time.
> >
> > Our teacher said NLTK will not come up until next year, so
> > I tried to do with regexps. It still has bugs, for example
> > it can not tell the dot at the end of an abbreviation from
> > the dot at the end of a sentence!
>
> This is almost a classic demo of why regexps are a poor tool as a first
> choice. You can do much with them, but they are cryptic and bug prone.

I don't think that's the problem here. The problem is that natural languages
just aren't regular languages. In fact I'm not sure that they fit
anywhere within the Chomsky hierarchy (but if they aren't type-0, that
would be a strong argument against the possibility of human-level AI).

In English, if a sentence ends with an abbreviation you write only a
single dot. So if you look at these two fragments:

For matching strings, numbers, etc. Python provides regular
expressions.

Let's say you want to match strings, numbers, etc. Python provides
regular expressions for these tasks.

In the second case the dot ends a sentence; in the first it doesn't. But to
distinguish those cases you need to at least parse the sentences at the
syntax level (which regular expressions can't do), maybe even understand
them semantically.
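
A minimal sketch of checking this with NLTK's Punkt tokenizer (the one deepreader.py already calls; not code from the thread):

import nltk

nltk.download("punkt", quiet=True)  # fetch the Punkt model once

fragments = [
    "For matching strings, numbers, etc. Python provides regular expressions.",
    "Let's say you want to match strings, numbers, etc. Python provides "
    "regular expressions for these tasks.",
]
for fragment in fragments:
    # Punkt has to guess about the dot after "etc."; whichever way it guesses,
    # one of the two fragments is liable to come out segmented wrongly.
    print(nltk.sent_tokenize(fragment))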

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"

Re: Short, perfect program to read sentences of webpage

<regexps-20211209001519@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=16365&group=comp.lang.python#16365

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 8 Dec 2021 23:17:49 GMT
 by: Stefan Ram - Wed, 8 Dec 2021 23:17 UTC

Cameron Simpson <cs@cskk.id.au> writes:
>Instead, consider the \b (word boundary) and \w (word character)
>markers, which will let you break strings up, and then maybe test the
>results with str.isupper().

Thanks for your comments, most or all of them are
valid, and I will try to take them into account!

Regexps might have their disadvantages, but when I use them,
it is clearer for me to do all the matching with regexps
instead of mixing them with Python calls like str.isupper.
Therefore, it is helpful for me to have a regexp to match
upper and lower case characters separately. Some regexp
dialects support "\p{Lu}" and "\p{Ll}" for this.

I have not yet incorporated (all) your advice into my code,
but I came to the conclusion myself that the repetition of
long sequences like r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ" and
not using f strings to insert other strings was especially
ugly. So, FWIW, here's my current version which addresses
these two shortcomings:

import re
import urllib.request
uri = r'''http://example.com/article''' # replace this with your URI!
request = urllib.request.Request( uri )
resource = urllib.request.urlopen( request )
cs = resource.headers.get_content_charset()
content = resource.read().decode( cs, errors="ignore" )
content = re.sub( r"[\r\n\t\s]+", r" ", content )
upper = r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ" # "[\\p{Lu}]"
lower = r"a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ" # "[\\p{Ll}]"
digit = r"0-9" #"[\\p{Nd}]"
firstwordstart = fr"[{upper}]"
firstwordnext = fr"(?:[{lower}-])"
wordcharacter = fr"[{upper}{lower}{digit}-]"
addition = fr"(?:(?:[']{wordcharacter}+)*[']?)?"
rawfirstword = fr"(?:{firstwordstart}{firstwordnext}*)"
rawnextword = fr"(?:{wordcharacter}+)"
preword = fr""
postword = fr"(?:[,;]?)"
firstword = rawfirstword + postword
nextword = preword + rawnextword + postword
period = fr"[.?!]"
space = fr"(?:[\s]+)"
extension = fr"(?:{space}{nextword})"
extensions = r"(?:"+extension+r"{2,})"
sentence = firstword + extensions + period
match = fr"(?:^|[^{upper}{lower}{digit}])" + fr"({sentence})"
patternString = fr"(?:{match})"
print(patternString)
for sentence in re.finditer( patternString, content ):
    print( sentence.group( 1 ))

Re: Short, perfect program to read sentences of webpage

<sentences-20211209014745@ram.dialup.fu-berlin.de>

https://www.novabbs.com/devel/article-flat.php?id=16367&group=comp.lang.python#16367

From: ram...@zedat.fu-berlin.de (Stefan Ram)
Newsgroups: comp.lang.python
Date: 9 Dec 2021 00:49:17 GMT
 by: Stefan Ram - Thu, 9 Dec 2021 00:49 UTC

ram@zedat.fu-berlin.de (Stefan Ram) writes:
>ugly. So, FWIW, here's my current version which addresses
>these two shortcomings:

I have now revised the program after all. It now has
somewhat improved heuristics for abbreviations, numbers and
some special characters. I also made a point of limiting
myself to the standard library, so that the program can be
run without further preparation.

import re
import urllib.request
uri = r"https://www.example.com/article" # replace this with your URI!
request = urllib.request.Request( uri )
resource = urllib.request.urlopen( request )
cs = resource.headers.get_content_charset()
content = resource.read().decode( cs, errors="ignore" )
content = re.sub( r"\s+", r" ", content )
content = re.sub( r"<[^>]*>", r"", content )
upper = r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ"
lower = r"a-zµàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
digit = r"0-9"
firstwordstart = fr"[{upper}]"
firstwordnext = fr"(?:[{lower}-])"
wordcharacter = fr"[\w-]"
addition = rf"(?:(?:['’]{wordcharacter}+)" + r"{,3}['’]?)"
abbreviation = fr"(?:(?:[{upper}]\.)+)"
number = fr"(?:{digit}+)(?:'|’)?(?:st|th|nd|s)?"
rawfirstword = fr"(?:{abbreviation}|{number}|" \
fr"{firstwordstart}{firstwordnext}*{addition})"
rawnextword = fr"(?:{abbreviation}|{number}|" \
fr"{wordcharacter}+" + f"{addition})"
preword = fr"(?:[“„]?)"
postword = r"(?:[,;:”“]{,3})"
firstword = rawfirstword + postword
nextword = preword + rawnextword + postword
lastword = fr"(?:{wordcharacter}{2,}" + f"{addition})"
period = fr"[.?!]"
space = fr"(?:[\s]+)"
extension = fr"(?:{space}{nextword})"
extensions = r"(?:"+extension+r"+?)"
sentence = firstword + extensions + period + r'["”“]?'
match = fr"(?:^|[^{upper}{lower}{digit}])*" + fr"({sentence})"
patternString = fr"(?:{match})"
previous = ""
for sentence in re.finditer( patternString, content ):
    current = sentence.group( 1 )
    if len( list( re.findall( f"[{upper}{lower}]", current ))) > 5:
        if current != previous:
            print( current )
        previous = current

Re: Short, perfect program to read sentences of webpage

<mailman.55.1639013661.15287.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=16368&group=comp.lang.python#16368

From: pyt...@mrabarnett.plus.com (MRAB)
Newsgroups: comp.lang.python
Date: Thu, 9 Dec 2021 01:31:13 +0000
 by: MRAB - Thu, 9 Dec 2021 01:31 UTC

On 2021-12-08 23:17, Stefan Ram wrote:
> Cameron Simpson <cs@cskk.id.au> writes:
>>Instead, consider the \b (word boundary) and \w (word character)
>>markers, which will let you break strings up, and then maybe test the
>>results with str.isupper().
>
> Thanks for your comments, most or all of them are
> valid, and I will try to take them into account!
>
> Regexps might have their disadvantages, but when I use them,
> it is clearer for me to do all the matching with regexps
> instead of mixing them with Python calls like str.isupper.
> Therefore, it is helpful for me to have a regexp to match
> upper and lower case characters separately. Some regexp
> dialects support "\p{Lu}" and "\p{Ll}" for this.
>
If you want "\p{Lu}" and "\p{Ll}", have a look at the 'regex' module on
PyPI:

https://pypi.org/project/regex/

[snip]
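
A minimal sketch of what that buys you (not code from the thread; assumes the third-party regex package is installed):

import regex  # pip install regex

upper = r"\p{Lu}"
lower = r"\p{Ll}"
# Unicode property classes replace the hand-written A-ZÀÁÂ... ranges
print(regex.findall(fr"{upper}{lower}+", "Ärger über Öl in Ålesund"))
# ['Ärger', 'Öl', 'Ålesund']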

Re: Short, perfect program to read sentences of webpage

<mailman.57.1639018701.15287.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=16370&group=comp.lang.python#16370

From: cs...@cskk.id.au (Cameron Simpson)
Newsgroups: comp.lang.python
Date: Thu, 9 Dec 2021 13:58:15 +1100
 by: Cameron Simpson - Thu, 9 Dec 2021 02:58 UTC

On 08Dec2021 23:17, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> Regexps might have their disadvantages, but when I use them,
> it is clearer for me to do all the matching with regexps
> instead of mixing them with Python calls like str.isupper.
> Therefore, it is helpful for me to have a regexp to match
> upper and lower case characters separately. Some regexp
> dialects support "\p{Lu}" and "\p{Ll}" for this.

Aye. I went looking for that in the Python re module docs and could not
find them. So the compromise is to match any word, then test the word with
isupper() (or whatever is appropriate).

> I have not yet incorporated (all) your advice into my code,
> but I came to the conclusion myself that the repetition of
> long sequences like r"A-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝ" and
> not using f strings to insert other strings was especially
> ugly.

The tricky bit with f-strings and regexps is that \w{3,5} means from 3
through 5 "word characters". So if you've got those in an f-string
you're off to double-the-brackets land, a bit like double backslash land
and non-raw-strings.

Otherwise, yes f-strings are a nice way to compose things.
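
A minimal sketch of the doubling (not code from the thread):

word = r"\w"
pattern = fr"{word}{{3,5}}"   # doubled braces give a literal {3,5} in the regex
print(pattern)                # prints \w{3,5}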

Cheers,
Cameron Simpson <cs@cskk.id.au>
