Re: C API PyObject_CallFunctionObjArgs returns incorrect result

<mailman.244.1646687308.2329.python-list@python.org>

https://www.novabbs.com/devel/article-flat.php?id=17434&group=comp.lang.python#17434

From: jenk...@tutanota.com (Jen Kris)
Newsgroups: comp.lang.python
Subject: Re: C API PyObject_CallFunctionObjArgs returns incorrect result
Date: Mon, 7 Mar 2022 22:08:26 +0100 (CET)

by: Jen Kris - Mon, 7 Mar 2022 21:08 UTC

Thanks to MRAB and Chris Angelico for your help. Here is how I implemented the string conversion, and it works correctly now for a library call that needs a list converted to a string (error handling not shown):

// Join the list items with a single space, i.e. " ".join(sentence)
PyObject* separator = PyUnicode_FromString(" ");
PyObject* str_join = PyUnicode_Join(separator, pSentence);
Py_DECREF(separator);
// nltk.word_tokenize(str_join); the argument list must end with NULL
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_join, NULL);

That produces what I need (this is the REPR of pWTok):

"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
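For anyone comparing the two conversions, here is a minimal plain-Python sketch of what each C API call corresponds to. The variable names are illustrative only, and the hard-coded list is an assumption standing in for pSentence, using the tokens shown in this thread rather than loading them from nltk.

```python
# `sentence` stands in for pSentence (a list of str tokens); the values are
# copied from the thread, not loaded from nltk.
sentence = ["[", "Emma", "by", "Jane", "Austen", "1816", "]"]

# PyObject_Str(pSentence) corresponds to str(sentence): it renders the
# list's display form, including brackets, commas, and quotes.
as_str = str(sentence)

# PyUnicode_Join(separator, pSentence) corresponds to separator.join(sentence):
# only the items themselves, separated by a single space.
joined = " ".join(sentence)

print(as_str)
print(joined)
```

Passing the first form to a tokenizer means the brackets, commas, and quotes get tokenized as well, which is consistent with the longer result reported earlier in the thread. The second form is the one the Python session passes to nltk.word_tokenize.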

Thanks again to both of you.

Jen

Mar 7, 2022, 11:03 by python@mrabarnett.plus.com:

> On 2022-03-07 17:05, Jen Kris wrote:
>
>> Thank you MRAB for your reply.
>>
>> Regarding your first question, pSentence is a list. In the nltk library, nltk.word_tokenize takes a string, so we convert sentence to string before we call nltk.word_tokenize:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(sentence)
>> [ Emma by Jane Austen 1816 ]
>>
>> But with the C API it looks like this:
>>
>> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>>
>> // See what str_sentence looks like:
>> PyObject* repr_str = PyObject_Repr(str_sentence);
>> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
>> const char *bytes_str = PyBytes_AS_STRING(str_str);
>> printf("REPR_String: %s\n", bytes_str);
>>
>> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> So the two string representations are not the same, or at least the output of PyUnicode_AsEncodedString is not the same, as each item is surrounded by single quotes.
>>
>> Assuming that the conversion to bytes object for the REPR is an accurate representation of str_sentence, it looks like I need to strip the quotes from str_sentence before “PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>>
>> So my questions now are (1) is there a C API function that will convert a list to a string exactly the same way as " ".join, and if not then (2) how can I strip characters from a string object in the C API?
>>
> Your Python code is joining the list with a space as the separator.
>
> The equivalent using the C API is:
>
>     PyObject* separator;
>     PyObject* joined;
>
>     separator = PyUnicode_FromString(" ");
>     joined = PyUnicode_Join(separator, pSentence);
>     Py_DECREF(separator);
>
>>
>> Mar 6, 2022, 17:42 by python@mrabarnett.plus.com:
>>
>> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>>
>> I am using the C API in Python 3.8 with the nltk library, and
>> I have a problem with the return from a library call
>> implemented with PyObject_CallFunctionObjArgs.
>>
>> This is the relevant Python code:
>>
>> import nltk
>> from nltk.corpus import gutenberg
>> fileids = gutenberg.fileids()
>> sentences = gutenberg.sents(fileids[0])
>> sentence = sentences[0]
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>>
>> I run this at the Python command prompt to show how it works:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(pt)
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>> >>> type(pt)
>> <class 'list'>
>>
>> This is the relevant part of the C API code:
>>
>> PyObject* str_sentence = PyObject_Str(pSentence);
>> // nltk.word_tokenize(sentence)
>> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
>> "word_tokenize");
>> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
>> str_sentence, 0);
>>
>> (where pModule_mstr is the nltk library).
>>
>> That should produce a list with a length of 7 that looks like
>> it does on the command line version shown above:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> But instead the C API produces a list with a length of 24, and
>> the REPR looks like this:
>>
>> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
>> "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
>> \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>>
>> I also tried this with PyObject_CallMethodObjArgs and
>> PyObject_Call without success.
>>
>> Thanks for any help on this.
>>
>> What is pSentence? Is it what you think it is?
>> To me it looks like it's either the list:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> or that list as a string:
>>
>> "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> and that's what you're tokenising.