When comparing strings in Python, e.g.
if "Hello" == "Hello":
#execute certain code
I am curious about what code actually compares the strings. If I were to compare these in C, I would just compare each character and break as soon as one character doesn't match. I'm wondering exactly what the process of comparing two strings like this is, i.e. when it will break, and whether there is any difference between this comparison and the method described above, other than redundancy in lines of code.
I'm going to assume you are using CPython here, the standard Python.org implementation. Under the hood, the Python string type is implemented in C, so yes, testing if two strings are equal is done exactly like you'd do it in C.
What it does is use the memcmp() function to test if the two str objects contain the same data; see the unicode_compare_eq() function defined in unicodeobject.c:
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;

    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;

    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
This function is only called if str1 and str2 are not the same object (that's an easy and cheap thing to test). It first checks whether the two objects have the same length and store the same kind of data (string objects use a flexible storage implementation to save memory; different storage kinds mean the strings can't be equal).
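Expressed as a rough pure-Python sketch (this is not the actual implementation, which runs in C, and it glosses over the storage-kind check), the equality test boils down to something like:

def str_eq(str1, str2):
    # Same object: trivially equal (this is checked before unicode_compare_eq is called).
    if str1 is str2:
        return True
    # Different lengths can never be equal; this is a cheap O(1) test.
    if len(str1) != len(str2):
        return False
    # Otherwise compare element by element, stopping at the first
    # mismatch (memcmp does this over the raw bytes in C).
    for a, b in zip(str1, str2):
        if a != b:
            return False
    return True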
There are other Python implementations, like Jython or IronPython, which may use different techniques, but it basically will come down to much the same thing.
Related
I have a numpy.ndarray named values containing numpy.unicode_ strings and I have a C function foo that consumes an array of C-strings. There is a CFFI wrapper interface for foo.
So I have tried to do something like this
p = ffi.from_buffer("char**", values)
and also
p = ffi.from_buffer("char*[]", values)
This doesn't give any errors in CFFI. But once I run the code, it crashes in the C implementation of foo, and indeed when I look at the pointers they look bad:
(gdb) p d
$1 = (char **) 0x1f978a50
(gdb) p d[0]
$2 = 0x7300000061 <error: Cannot access memory at address 0x7300000061>
I am on a 64 bit architecture.
It won't work the way you are trying to do it, because the numpy array contains pointers to Python objects (all of type str), I believe. In any case, it is something other than a raw array of char * pointers to the UTF-8-encoded versions of the strings.
I think there is no automatic way to do the conversion. You need to loop over the items manually, convert each string to a char[] array, and make sure they are all kept alive long enough. This should do it:
items = [ffi.new("char[]", x.encode('utf-8')) for x in values]
p = ffi.new("char *[]", items)
# keep 'items' alive as long as you need 'p'
or, if all you need is to call a C function that expects a char ** argument, you can rely on the automatic Python-list-to-C-array conversion, as long as every item of the Python list is a char *:
items = [ffi.new("char[]", x.encode('utf-8')) for x in values]
lib.my_c_function(items)
The problem is that numpy does not really represent an array of C strings as char*[]. It is more like one big char[] in which the strings are laid out at strides equal to .itemsize, which for an array of strings is the size of the longest occurring string; shorter strings are padded with zero bytes. Also, the optional first argument cdecl of ffi.from_buffer is not involved in any rigorous type checking on the received underlying buffer/memory view; it is the responsibility of the programmer to know the correct type of the perceived buffer/memory view.
The cdecl argument will provide type safety when, for instance, used in conjunction with calls to other CFFI-wrapped functions.
The way I solved this is by allocating a separate array of char pointers with CFFI:
t = ffi.new('char*[]', array_size)
Next, massage the numpy array a bit to guarantee that each string is null-terminated. Then implement some logic in Python (or in C wrapped with CFFI, if performance is required) to point each member of the char*[] array at its corresponding string in the numpy array, as in the sketch below.
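A rough sketch of that approach, assuming the strings are ASCII and that foo has a signature like void foo(char **strings, int n) (the names foo and libfoo.so and the signature are illustrative, not taken from the question):

import numpy as np
from cffi import FFI

ffi = FFI()
# ffi.cdef("void foo(char **strings, int n);")   # hypothetical signature
# lib = ffi.dlopen("libfoo.so")                  # hypothetical library

values = np.array(["alpha", "beta", "gamma"])    # numpy.unicode_ strings

# Convert to fixed-width byte strings, one byte wider than the longest
# string, so that every entry is guaranteed to be null-terminated.
raw = values.astype('S')
raw = raw.astype('S%d' % (raw.itemsize + 1))

buf = ffi.from_buffer("char[]", raw)     # backing buffer; keep alive along with ptrs
ptrs = ffi.new("char *[]", len(raw))     # separate array of char pointers
for i in range(len(raw)):
    ptrs[i] = buf + i * raw.itemsize     # point at the i-th padded string

# lib.foo(ptrs, len(raw))                # call the wrapped C function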
When defining a variable type that will hold a string in Cython + Python 3, I can use (at least):
cdef char* mystring = "foo"
cdef str mystring = "foo"
cdef bytes mystring = "foo"
The documentation page on strings is unclear on this -- it mostly gives examples using char* and bytes, and frankly I'm having a lot of difficulty understanding it.
In my case the strings will be coming from a Python3 program and are assumed to be unicode. They will be used as dict keys and function arguments, but I will do no further manipulation on them. Needless to say I am trying to maximize speed.
This question suggests that under Python 2.7 and without Unicode, typing as str makes string manipulation code run SLOWER than with no typing at all. (But that's not necessarily relevant here, since I won't be doing much string manipulation.)
What are the advantages and disadvantages of each of these options?
If there is no further processing done on a particular string, it would be best and fastest not to type it at all, which means it is treated as a general purpose PyObject *.
The str type is a special case: it means bytes on Python 2 and unicode on Python 3. So code that types a string as str and handles it as unicode will break on Python 2, where str means bytes.
Strings only need to be typed if they are to be converted to C char* or C++ std::string. There, you would use str to handle py2/py3 compatibility, along with helper functions to convert to/from bytes and unicode in order to be able to convert to either char* or std::string.
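As a rough sketch, such helpers might look like the following (plain Python that also compiles under Cython; the function names are illustrative, not part of any API):

def to_bytes(s, encoding='utf-8'):
    # Accept either text or bytes; always hand back bytes, which
    # Cython can auto-convert to a C char* when needed.
    if isinstance(s, bytes):
        return s
    return s.encode(encoding)

def to_text(s, encoding='utf-8'):
    # The reverse direction: always hand back the text (unicode) type.
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s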
Typing of strings is for interoperability with C/C++, not for speed as such. Cython will auto-convert, without copying, a bytes string to a char* for example when it sees something like cdef char* c_string = b_string[:b_len] where b_string is a bytes type.
OTOH, if strings are typed without that type being used, Cython will do a conversion from object to bytes/unicode when it does not need to, which leads to overhead.
This can be seen in the generated C code as calls to __Pyx_PyObject_AsString, __Pyx_PyUnicode_FromString, et al.
This is also true in general - the rule of thumb is that if a specific type is not needed for further processing/conversion, it is best not to type it at all. Everything in Python is an object, so typing will convert from the general purpose PyObject* to something more specific.
Some quick testing revealed that for this particular case, only the str declaration worked -- all other options produced errors. Since the string is generated elsewhere in Python3, evidently the str type declaration is needed.
Whether it is faster not to make any declaration at all remains an open question.
What does sys.getsizeof return for a standard string? I am noticing that this value is much higher than what len returns.
I will attempt to answer your question from a broader point of view. You're referring to two functions and comparing their outputs. Let's take a look at their documentation first:
len():
Return the length (the number of items) of an object. The argument may
be a sequence (such as a string, bytes, tuple, list, or range) or a
collection (such as a dictionary, set, or frozen set).
So in the case of a string, you can expect len() to return the number of characters.
sys.getsizeof():
Return the size of an object in bytes. The object can be any type of
object. All built-in objects will return correct results, but this
does not have to hold true for third-party extensions as it is
implementation specific.
So in the case of a string (as with many other objects), you can expect sys.getsizeof() to return the size of the object in bytes. There is no reason to think it should be the same as the number of characters.
Let's have a look at some examples:
>>> first = "First"
>>> len(first)
5
>>> sys.getsizeof(first)
42
This example confirms that the size is not the same as the number of characters.
>>> second = "Second"
>>> len(second)
6
>>> sys.getsizeof(second)
43
We can notice that if we look at a string one character longer, its size is one byte bigger as well. We don't know if it's a coincidence or not though.
>>> together = first + second
>>> print(together)
FirstSecond
>>> len(together)
11
If we concatenate the two strings, their combined length is equal to the sum of their lengths, which makes sense.
>>> sys.getsizeof(together)
48
Contrary to what someone might expect though, the size of the combined string is not equal to the sum of their individual sizes. But it still seems to be the length plus something - in this particular case, something worth 37 bytes. Now, you need to realize that it's 37 bytes in this particular case, using this particular Python implementation, etc.; you should not rely on that at all. Still, we can take a look at what those 37 bytes are (approximately) used for.
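You can also measure that fixed overhead directly by asking for the size of the empty string; the exact figure depends on the Python version and build (37 bytes here matches a 64-bit CPython 2.7 build, and Python 3 reports larger values):

>>> import sys
>>> sys.getsizeof("")       # fixed per-object overhead, no characters at all
37
>>> sys.getsizeof("a") - sys.getsizeof("")   # each extra character adds one byte
1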
In CPython (probably the most widely used implementation of Python), string objects are implemented as PyStringObject. This is the C source code (I am using the 2.7.9 version):
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;
You can see that there is something called PyObject_VAR_HEAD, one int, one long and a char array. The char array always contains one more character, to store the '\0' at the end of the string. This, along with the int, the long and PyObject_VAR_HEAD, takes up the additional 37 bytes. PyObject_VAR_HEAD is defined in another C source file and refers to further implementation-specific stuff; you need to explore it if you want to find out exactly where the 37 bytes come from. Plus, the documentation mentions that sys.getsizeof()
adds an additional garbage collector overhead if the object is managed
by the garbage collector.
Overall, you don't need to know exactly what accounts for that extra something (the 37 bytes here), but this answer should give you an idea of why the numbers differ and where to find more information should you really need it.
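For what it's worth, on a typical 64-bit (LP64) build the 37 bytes can be accounted for field by field roughly like this (the sizes below are assumptions about that kind of platform, not guarantees):

# Rough field-by-field accounting on a 64-bit (LP64) CPython 2.7 build:
overhead = (
      8   # ob_refcnt  (Py_ssize_t, part of PyObject_VAR_HEAD)
    + 8   # ob_type    (pointer,    part of PyObject_VAR_HEAD)
    + 8   # ob_size    (Py_ssize_t, part of PyObject_VAR_HEAD)
    + 8   # ob_shash   (long)
    + 4   # ob_sstate  (int)
    + 1   # ob_sval[1] (space for the trailing '\0')
)
print(overhead)   # 37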
To quote the documentation:
Return the size of an object in bytes. The object can be any type of object. All built-in objects will return correct results, but this does not have to hold true for third-party extensions as it is implementation specific.
Built-in strings are not simple character sequences - they are full-fledged objects, with garbage collection overhead, which probably explains the size discrepancy you're noticing.
I am extending Python with some C++ code.
One of the functions I'm using has the following signature:
int PyArg_ParseTupleAndKeywords(PyObject *arg, PyObject *kwdict,
char *format, char **kwlist, ...);
(link: http://docs.python.org/release/1.5.2p2/ext/parseTupleAndKeywords.html)
The parameter of interest is kwlist. In the link above, examples on how to use this function are given. In the examples, kwlist looks like:
static char *kwlist[] = {"voltage", "state", "action", "type", NULL};
When I compile this using g++, I get the warning:
warning: deprecated conversion from string constant to ‘char*’
So, I can change the static char* to a static const char*. Unfortunately, I can't change the Python code. So with this change, I get a different compilation error (can't convert const char** to char**). Based on what I've read here, I can either turn on compiler flags to ignore the warning or cast each of the constant strings in the definition of kwlist to char *. Currently, I'm doing the latter. What are other solutions?
Sorry if this question has been asked before. I'm new.
Does PyArg_ParseTupleAndKeywords() expect to modify the data you are passing in? Normally, in idiomatic C++, a const <something> * points to an object that the callee will only read from, whereas <something> * points to an object that the callee can write to.
If PyArg_ParseTupleAndKeywords() expects to be able to write to the char * you are passing in, you've got an entirely different problem over and above what you mention in your question.
Assuming that PyArg_ParseTupleAndKeywords does not want to modify its parameters, the idiomatically correct way of dealing with this problem would be to declare kwlist as const char *kwlist[] and use const_cast to remove its const-ness when calling PyArg_ParseTupleAndKeywords(), which would make it look like this:
PyArg_ParseTupleAndKeywords(..., ..., ..., const_cast<char **>(kwlist), ...);
There is an accepted answer from seven years ago, but I'd like to add an alternative solution, since this topic seems to be still relevant.
If you don't like the const_cast solution, you can also create a writable version of the string array.
char s_voltage[] = "voltage";
char s_state[] = "state";
char s_action[] = "action";
char s_type[] = "type";
char *kwlist[] = {s_voltage, s_state, s_action, s_type, NULL};
The char name[] = ".." form copies the string to a writable location.
While doing some random experimentation with a factorial program in C, Python and Scheme, I came across this fact:
In C, using the 'unsigned long long' data type, the largest factorial I can print is that of 65, which is '9223372036854775808' - that is, 19 digits, as specified here.
In Python, I can find the factorial of a number as large as 999, which consists of a large number of digits, far more than 19.
How does CPython achieve this? Does it use a data type like 'octaword'?
I might be missing some fundamental facts here. So, I would appreciate some insights and/or references to read. Thanks!
UPDATE: Thank you all for the explanation. Does that mean CPython is using the GNU Multi-Precision library (or some other similar library)?
UPDATE 2: I am looking for Python's 'bignum' implementation in the sources. Where exactly is it? It's here: http://svn.python.org/view/python/trunk/Objects/longobject.c?view=markup. Thanks Baishampayan.
It's called Arbitrary Precision Arithmetic. There's more here: http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic
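You can see it in action straight from the interpreter - Python integers simply grow as needed, with no overflow:

import math

# Python ints are arbitrary precision: 999! is computed exactly and
# has a few thousand decimal digits, far beyond any C integer type.
n = math.factorial(999)
print(len(str(n)))   # number of decimal digits (far more than 19)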
Looking at the Python source code, it seems the long type (at least in pre-Python 3 code) is defined in longintrepr.h like this -
/* Long integer representation.
   The absolute value of a number is equal to
        SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
   Negative numbers are represented with ob_size < 0;
   zero is represented by ob_size == 0.
   In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
   digit) is never zero. Also, in all cases, for all valid i,
        0 <= ob_digit[i] <= MASK.
   The allocation function takes care of allocating extra memory
   so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

   CAUTION: Generic code manipulating subtypes of PyVarObject has to
   be aware that longs abuse ob_size's sign bit.
*/

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};
The actual usable interface of the long type is then defined in longobject.h by creating a new type PyLongObject like this -
typedef struct _longobject PyLongObject;
And so on.
There is more stuff happening inside longobject.c; you can take a look at that for more details.
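The representation described in that comment can be sketched in plain Python. Assuming a Python 3 interpreter, sys.int_info exposes the digit size the build uses (the helper below is purely illustrative):

import sys

SHIFT = sys.int_info.bits_per_digit     # typically 30 on 64-bit builds, 15 on 32-bit

def value_from_digits(ob_digit, ob_size):
    # Reconstruct the integer from its internal base-2**SHIFT digits,
    # following the SUM formula in the longintrepr.h comment.
    magnitude = sum(d * 2 ** (SHIFT * i) for i, d in enumerate(ob_digit))
    return -magnitude if ob_size < 0 else magnitude

# Example: digits (1, 2) with ob_size == 2 encode 1 + 2*2**SHIFT.
print(value_from_digits((1, 2), 2) == 1 + 2 * 2 ** SHIFT)   # True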
Data types such as int in C are directly mapped (more or less) to the data types supported by the processor. So the limits on C's int are essentially the limits imposed by the processor hardware.
But one can implement one's own int data type entirely in software. You can, for example, use an array of digits as your underlying representation. Maybe like this:
class MyInt {
    private int[] digits;

    public MyInt(int noOfDigits) {
        digits = new int[noOfDigits];
    }
}
Once you do that, you can use this class to store integers containing as many digits as you want, as long as you don't run out of memory.
Perhaps Python is doing something like this inside its virtual machine. You may want to read this article on Arbitrary Precision Arithmetic to get the details.
Not octaword. It implements a bignum structure to store arbitrary-precision numbers.
Python gives long integers (all ints in Python 3) just as much space as they need -- an array of "digits" (the base being a power of 2), allocated as needed.