Re: C API PyObject_CallFunctionObjArgs returns incorrect result

From: pyt...@mrabarnett.plus.com (MRAB)
Newsgroups: comp.lang.python
Subject: Re: C API PyObject_CallFunctionObjArgs returns incorrect result
Date: Mon, 7 Mar 2022 19:03:24 +0000
Lines: 122
Message-ID: <mailman.241.1646679817.2329.python-list@python.org>
References: <MxWmaxK--3-2@tutanota.com>
<5ad962fc-1257-dd8d-96ab-541ae5bae2fa@mrabarnett.plus.com>
<Mx_KxMD--3-2@tutanota.com>
<411854b3-e73d-0908-72f6-4049b87145c2@mrabarnett.plus.com>
 by: MRAB - Mon, 7 Mar 2022 19:03 UTC

On 2022-03-07 17:05, Jen Kris wrote:
> Thank you MRAB for your reply.
>
> Regarding your first question, pSentence is a list.  In the nltk
> library, nltk.word_tokenize takes a string, so we convert sentence to
> string before we call nltk.word_tokenize:
>
> >>> sentence = " ".join(sentence)
> >>> pt = nltk.word_tokenize(sentence)
> >>> print(sentence)
> [ Emma by Jane Austen 1816 ]
>
> But with the C API it looks like this:
>
> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>
> // See what str_sentence looks like:
> PyObject* repr_str = PyObject_Repr(str_sentence);
> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
> const char *bytes_str = PyBytes_AS_STRING(str_str);
> printf("REPR_String: %s\n", bytes_str);
>
> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> So the two string representations are not the same – or at least the  
> PyUnicode_AsEncodedString is not the same, as each item is surrounded
> by single quotes.
>
> Assuming that the conversion to bytes object for the REPR is an
> accurate representation of str_sentence, it looks like I need to strip
> the quotes from str_sentence before “PyObject* pWTok =
> PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>
> So my questions now are (1) is there a C API function that will
> convert a list to a string exactly the same way as ‘’.join, and if not
> then (2) how can I strip characters from a string object in the C API?
>
Your Python code is joining the list with a space as the separator.

The equivalent using the C API is:

    PyObject* separator;
    PyObject* joined;

    separator = PyUnicode_FromString(" ");
    joined = PyUnicode_Join(separator, pSentence);  /* new reference */
    Py_DECREF(separator);
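
To see why the two conversions differ, here is a minimal Python sketch. It assumes CPython (the ctypes call goes straight to the interpreter's own C API), and the `tokens` list is just sample data copied from the output quoted in this thread, not the real nltk corpus object. `str()` does what PyObject_Str does, and `" ".join()` does what PyUnicode_Join does.

```python
import ctypes

# Sample data only: the first sentence as quoted in this thread.
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str() corresponds to PyObject_Str: it gives the list's printable form,
# with the brackets, commas and quotes included.
as_str = str(tokens)

# " ".join() corresponds to PyUnicode_Join: it concatenates the items,
# putting the separator between them.
as_joined = " ".join(tokens)

# Call PyUnicode_Join itself through ctypes (CPython only).
join = ctypes.pythonapi.PyUnicode_Join
join.argtypes = [ctypes.py_object, ctypes.py_object]
join.restype = ctypes.py_object
via_c_api = join(" ", tokens)

print(as_str)
print(as_joined)
print(via_c_api == as_joined)
```

Passing `as_str` to a tokenizer would make it split the brackets, commas and quotes into separate tokens, which would be consistent with the much longer list reported in the original question.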

>
> Mar 6, 2022, 17:42 by python@mrabarnett.plus.com:
>
> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>
> I am using the C API in Python 3.8 with the nltk library, and
> I have a problem with the return from a library call
> implemented with PyObject_CallFunctionObjArgs.
>
> This is the relevant Python code:
>
> import nltk
> from nltk.corpus import gutenberg
> fileids = gutenberg.fileids()
> sentences = gutenberg.sents(fileids[0])
> sentence = sentences[0]
> sentence = " ".join(sentence)
> pt = nltk.word_tokenize(sentence)
>
> I run this at the Python command prompt to show how it works:
>
> sentence = " ".join(sentence)
> pt = nltk.word_tokenize(sentence)
> print(pt)
>
> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> type(pt)
>
> <class 'list'>
>
> This is the relevant part of the C API code:
>
> PyObject* str_sentence = PyObject_Str(pSentence);
> // nltk.word_tokenize(sentence)
> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
> "word_tokenize");
> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
> str_sentence, 0);
>
> (where pModule_mstr is the nltk library).
>
> That should produce a list with a length of 7 that looks like
> it does on the command line version shown above:
>
> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> But instead the C API produces a list with a length of 24, and
> the REPR looks like this:
>
> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
> "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
> \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>
> I also tried this with PyObject_CallMethodObjArgs and
> PyObject_Call without success.
>
> Thanks for any help on this.
>
> What is pSentence? Is it what you think it is?
> To me it looks like it's either the list:
>
> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> or that list as a string:
>
> "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> and that's what you're tokenising.
