how to check python bytes for dataframe string field? - python

I have a string value that I need to compare against another string.
The value df['ACCOUNTMANAGER'][0] contains a special character, so a plain string comparison fails. I also tried comparing it as bytes, but that failed too. Is there a way to inspect how the data is actually stored so that I can compare it correctly?

I figured it out. The value was stored in UTF-8 encoded form, and the comparison now works. I compared against the byte b'\x8a' in the if statement.
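A minimal sketch of how such a value could be inspected, assuming it is a Python 3 str. The sample value below is a hypothetical stand-in for df['ACCOUNTMANAGER'][0], not the asker's data:

```python
# Hypothetical stand-in for df['ACCOUNTMANAGER'][0]; the real value is unknown.
value = "Jos\u00e9"

# ascii() escapes every non-ASCII character, so hidden characters become visible.
escaped = ascii(value)

# The UTF-8 bytes show what would be written to a file or compared as bytes.
utf8 = value.encode("utf-8")

# Code points show what the str itself contains, independent of any encoding.
code_points = [hex(ord(ch)) for ch in value]

print(escaped)
print(utf8)
print(code_points)

# A byte-level check must use the encoded form of the character.
has_e_acute = b"\xc3\xa9" in utf8
```

Comparing the escaped form of both sides usually shows quickly whether the two values differ by an invisible or differently encoded character.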

Related

Python: How can I append literal bytes to a string with no decoding?

In Python, a string can have arbitrary bytes, via "\x??" escaping. These bytes don't necessarily have to map to a char in an encoding. For example, we can have "\xa0", even though 0xa0 isn't a good utf-8 char.
However, if I have a byte array, such as b'\xa0', I can't append it to a string without decoding it. What if I want to just append literally, just like "\xa0"?
How can I append a series of bytes to a string without decoding them at all, just like "\x" escape chars? Is there a "literal decoding" or "no decoding" option to decode()? If not, is there another way to do this?
First, consider whether storing these in a string is truly the best option for your use case. Storing them as bytes or bytearray is usually more idiomatic.
However, if you have considered this and still decided to proceed, then you should pass "latin1" as the encoding option to bytes.decode. This converts the bytes directly to the characters with the corresponding value.
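As a small illustration of the latin1 approach (the variable names are made up for this sketch), each byte maps to the code point with the same numeric value, so the conversion is reversible:

```python
raw = b"\xa0\xff"  # bytes that are not valid UTF-8 on their own

# latin1 maps byte 0xNN to code point U+00NN, so nothing is lost or rejected.
text = "prefix" + raw.decode("latin1")

# Encoding with the same codec recovers the original bytes.
round_trip = text[len("prefix"):].encode("latin1")

print(ascii(text))
print(round_trip)
```

The round trip only works for strings whose characters are all below U+0100; encoding any other character with latin1 raises UnicodeEncodeError.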

Python 3 - Converting byte string to string with same content

I am working on migrating project code from Python 2 to Python 3.
One piece of code uses struct.pack, which returns a string in Python 2 and a byte string in Python 3.
I want to convert the byte string in Python 3 to a normal string. The converted string should have the same content, to stay consistent with existing values.
For example:
in_val = b'\0x01\0x36\0xff\0x27' # Input value
out_val = '\0x01\0x36\0xff\0x27' # Output should be this
I have one solution: convert in_val to a string, then explicitly remove the 'b' and '\' characters that appear after the conversion.
Is there a cleaner way to do this conversion?
Any help is appreciated.
str values are always Unicode code points. The first 256 values are the Latin-1 range, so you can use that codec to decode bytes directly to those codepoints:
out_val = in_val.decode('latin1')
However, you want to re-assess why you are doing this. Don't store binary data in strings, there are almost always better ways to deal with binary data. If you want to store binary data in JSON, for example, then you'd want to use Base64 or some other binary-to-text encoding scheme that better handles edge cases such as binary data containing escape codes when interpreted as text.
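A rough sketch of the Base64 route mentioned above, assuming the goal is to carry the packed bytes inside JSON. The "data" key and the sample bytes are illustrative only:

```python
import base64
import json

in_val = b"\x01\x36\xff\x27"  # sample binary data, e.g. from struct.pack

# b64encode returns ASCII-only bytes; decode them to get a JSON-safe str.
payload = json.dumps({"data": base64.b64encode(in_val).decode("ascii")})

# On the receiving side, reverse both steps to get the original bytes back.
restored = base64.b64decode(json.loads(payload)["data"])

print(payload)
print(restored)
```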

Load JSON file in Python without the 'u in the key

I was doing some work in Python with graphs and wanted to save some structures in files so I could load them quickly when I resumed work. One of those was a dictionary which I saved in JSON format using json.dump.
When I load it back with json.load the keys have changed from "1" to u'1'. Why is that? What does it mean? How can I change it? I use the keys later to make some lists which I will then use with the original graph which nodes are the keys (in integer form) and it causes problem in comparisons...
The u prefix signifies a Unicode string. In Python 2.x, you can convert it to a regular string with str(). That shouldn't really be necessary, though; u'1' == '1' because Python will do any conversion for you before comparing.
The u'' or u"" prefix just means that this is a unicode string, which in general should not be a problem unless you need a byte string. I would expect that your original data was already unicode anyway, so it should not matter.
It is a unicode string. You can treat it as a normal python string in most cases. If you really want to convert it to a normal string use str(). If you need to convert it to a bytes type, use object.encode(encoding) where encoding is the encoding of the Unicode character, usually 'utf-8'.
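One possible cause of the comparison problem, offered here as an assumption rather than something the answers above state, is that JSON object keys are always strings, so integer keys come back as text. A minimal Python 3 sketch with a made-up graph:

```python
import json

graph = {1: [2, 3], 2: [1], 3: [1]}  # hypothetical graph with integer node keys

# json.dumps turns the integer keys into strings, because JSON only has string keys.
text = json.dumps(graph)
loaded = json.loads(text)

# Convert the keys back to integers after loading.
restored = {int(key): neighbours for key, neighbours in loaded.items()}

print(list(loaded))
print(restored == graph)
```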

Converting String. Simple vs Unicode, Python

I am writing a script in Python that validates the value of each cell in a parent table by comparing it to the values in a lookup table.
So, in the parent table I have a number of columns and each column corresponds to a lookup table for the known values that should be in each record in that particular column.
When I read in the values from the parent table, there will be many types (i.e. unicode strings, ints, floats, dates, etc)
The lookup tables have the same variety of types, but when a value is a string, it is a simple string, not a unicode string, which forces me to convert the values to match. That is, if the value in the cell from the parent table is a unicode string, I need a conditional statement that tests whether it is unicode and then converts it to a simple string:
if isinstance(row.getValue(columnname), unicode):
    x = str(row.getValue(columnname))
My question is, would it better to convert the unicode strings to simple strings or vice versa to match the type? Why would it be better?
If it helps, my parent table is all in access and the lookup tables are all in excel. I don't think that really matters, but maybe I am missing something.
It'd be better to decode byte strings to unicode.
Unicode data is the canonical representation; encoded bytes differ based on what encoding was used.
You always want to work with Unicode within your program, then encode back to bytes as needed to send over the network or write data to files.
Compare this to using date/time values; you'd convert those to datetime objects as soon as possible too. Or images; loading an image from PNG or JPG you'd want to get a representation that lets you manipulate the colours and individual pixels, something that is much harder when working with the compressed image format on disk.
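A minimal sketch of that idea in Python 3 terms, where bytes is the byte string type and str is Unicode text (in Python 2 the corresponding types are str and unicode). The helper name, the sample values, and the UTF-8 default are assumptions for illustration:

```python
def normalize(value, encoding="utf-8"):
    """Decode byte strings to text; leave ints, floats, dates, etc. untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

parent_value = "caf\u00e9"                  # text read from the parent table
lookup_values = [b"caf\xc3\xa9", b"tea", 42]  # mixed values from a lookup table

# Decode once when the data enters the program, then compare text with text.
lookup_text = {normalize(v) for v in lookup_values}
match = normalize(parent_value) in lookup_text

print(match)
```

Decoding requires knowing the encoding the lookup data was actually written in; UTF-8 here is only a guess.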

How do you store raw bytes as text without losing information in python 2.x?

Suppose I have any data stored in bytes. For example:
0110001100010101100101110101101
How can I store it as printable text? The obvious way would be to convert every 0 to the character '0' and every 1 to the character '1'. In fact this is what I'm currently doing. I'd like to know how I could pack them more tightly, without losing information.
I thought of converting bits in groups of eight to ASCII, but some bit combinations are not accepted in that format. Any other ideas?
What about an encoding that only uses "safe" characters like base64?
http://en.wikipedia.org/wiki/Base64
EDIT: That is assuming that you want to safely store the data in text files and such?
In Python 2.x, strings should be fine (Python doesn't use null terminated strings, so don't worry about that).
Else in 3.x check out the bytes and bytearray objects.
http://docs.python.org/3.0/library/stdtypes.html#bytes-methods
Not sure what you're talking about.
>>> sample = "".join( chr(c) for c in range(256) )
>>> len(sample)
256
>>> sample
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\
x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABC
DEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83
\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97
\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab
\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf
\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3
\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7
\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb
\xfc\xfd\xfe\xff'
The string sample contains all 256 distinct bytes. There is no such thing as a "bit combinations ... not accepted".
To make it printable, simply use repr(sample): non-printable and non-ASCII bytes are escaped, as you can see above.
Try the standard array module or the struct module. These support storing bytes in a space efficient way -- but they don't support bits directly.
You can also try http://cobweb.ecn.purdue.edu/~kak/dist/BitVector-1.2.html or http://ilan.schnell-web.net/prog/bitarray/
For Python 2.x, your best bet is to store them in a string. Once you have that string, you can encode it into safe ASCII values using the base64 module that comes with python.
import base64
encoded = base64.b64encode(bytestring)
This will be much more condensed than storing "1" and "0".
For more information on the base64 module, see the python docs.
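To make the size difference concrete, here is a rough sketch that packs the '0'/'1' text from the question into real bytes and then Base64-encodes them. It is written for Python 3 (int.to_bytes does not exist in Python 2), the function names are made up for this example, and it assumes a non-empty bit string. The bit count is kept alongside the text so that leading zeros survive the round trip:

```python
import base64

def bits_to_text(bits):
    """Pack a non-empty string of '0'/'1' characters into Base64 text."""
    n_bytes = (len(bits) + 7) // 8            # round up to whole bytes
    raw = int(bits, 2).to_bytes(n_bytes, "big")
    return len(bits), base64.b64encode(raw).decode("ascii")

def text_to_bits(n_bits, text):
    """Reverse bits_to_text, restoring leading zeros from the stored length."""
    raw = base64.b64decode(text)
    return format(int.from_bytes(raw, "big"), "0{}b".format(n_bits))

bits = "0110001100010101100101110101101"
n_bits, encoded = bits_to_text(bits)
restored = text_to_bits(n_bits, encoded)

print(len(bits), len(encoded))
print(restored == bits)
```

Base64 spends four characters on every three bytes, so the text is roughly one sixth the size of the one-character-per-bit form for long inputs.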
