I am writing a Python script that validates the values of each cell in a parent table by comparing them to the values in a lookup table.
In the parent table I have a number of columns, and each column corresponds to a lookup table containing the known values that should appear in each record of that column.
When I read in the values from the parent table, they will be of many types (e.g. unicode strings, ints, floats, dates, etc.).
The lookup tables contain the same variety of types, but when a value is a string it's a plain byte string rather than a unicode string, which forces me to convert values so the types match (i.e. if the value in the cell from the parent table is a unicode string, I need a conditional to test whether it's unicode and then convert it to a plain string):
if isinstance(row.getValue(columnname), unicode):
    x = str(row.getValue(columnname))
My question is: would it be better to convert the unicode strings to plain strings to match the types, or vice versa? Why would one be better than the other?
If it helps, my parent table is in Access and the lookup tables are all in Excel. I don't think that really matters, but maybe I am missing something.
It'd be better to decode byte strings to unicode.
Unicode data is the canonical representation; encoded bytes differ based on what encoding was used.
You always want to work with Unicode within your program, then encode back to bytes as needed to send over the network or write data to files.
Compare this to using date/time values: you'd convert those to datetime objects as soon as possible too. Or images: when loading an image from PNG or JPEG, you'd want a representation that lets you manipulate the colours and individual pixels, something that is much harder when working with the compressed image format on disk.
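A minimal sketch of that approach in Python 2 style, matching the question; row.getValue() and columnname come from the question itself, while lookup_row and the choice of 'latin-1' as the Excel string encoding are assumptions:

parent_value = row.getValue(columnname)            # already a unicode string when it's text

lookup_value = lookup_row.getValue(columnname)     # hypothetical accessor for the lookup table
if isinstance(lookup_value, str):                  # plain byte string coming from Excel
    lookup_value = lookup_value.decode('latin-1')  # decode to unicode rather than encoding the other value

if parent_value == lookup_value:
    print('value is valid')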
I have a string that I would like to compare.
The value df['ACCOUNTMANAGER'][0] contains some special character that I cannot match using a plain string comparison. I tried comparing with bytes but it failed. I would like to check how the data is stored there so I can compare it. Is there a way to do this?
I figured it out. The value was stored in UTF-8 encoded form, and now the comparison works. I compared against the byte b'\x8a' in the if statement.
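For anyone checking the same thing, a minimal sketch (the DataFrame and column name come from the question; treating the stored value as raw UTF-8 bytes is an assumption):

value = df['ACCOUNTMANAGER'][0]
print(repr(value))                    # shows the exact type and any non-ASCII bytes or characters

if isinstance(value, bytes) and b'\x8a' in value:
    print('special character found')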
I am working on migrating project code from Python 2 to Python 3.
One piece of code uses struct.pack, which gives me a string in Python 2 but a byte string in Python 3.
I want to convert the Python 3 byte string to a normal string. The converted string should have the same content, to keep it consistent with the existing values.
For example:
in_val = b'\x01\x36\xff\x27'  # Input value
out_val = '\x01\x36\xff\x27'  # Output should be this
One solution I have is to convert in_val to a string and then explicitly strip out the 'b' and '\' characters that appear after the conversion.
Is there a cleaner way to do the conversion?
Any help appreciated
str values are always sequences of Unicode code points. The first 256 code points are the Latin-1 range, so you can use that codec to decode bytes directly to those code points:
out_val = in_val.decode('latin1')
However, you want to re-assess why you are doing this. Don't store binary data in strings, there are almost always better ways to deal with binary data. If you want to store binary data in JSON, for example, then you'd want to use Base64 or some other binary-to-text encoding scheme that better handles edge cases such as binary data containing escape codes when interpreted as text.
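A short sketch of both routes; the input bytes come from the question, and the Base64 part only illustrates the alternative mentioned above:

import base64

in_val = b'\x01\x36\xff\x27'

out_val = in_val.decode('latin1')                    # maps each byte to the code point with the same value
assert out_val.encode('latin1') == in_val            # the conversion is reversible

encoded = base64.b64encode(in_val).decode('ascii')   # 'ATb/Jw==', safe to embed in JSON
assert base64.b64decode(encoded) == in_val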
I have some ASN.1 BER encoded raw data which looks like this when opened in notepad++:
[screenshot: sample ASN.1 encoded data]
I believe it's in binary octet format, so only the "IA5String" data types are readable/meaningful.
I'm wanting to do a find and replace on certain string data that contains sensitive information (phone numbers, IP address, etc), in order to scramble and anonymise it, while leaving the rest of the encoded data intact.
I've made a Python script to do it, and it works fine on plain-text data, but I'm running into encoding/decoding issues when trying to read and write files in this encoded format, I guess because they contain octet values outside the ASCII range.
What method would I need to use to import this data and do the find and replace on the strings, producing a modified file that leaves everything else intact? I think it should be possible without completely decoding the raw ASN.1 data against a schema, since I only need to work on the IA5String data types.
Thanks
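A minimal sketch of the byte-level approach, assuming the sensitive values are plain ASCII IA5Strings and each replacement has exactly the same length as the original, so the surrounding BER length octets stay valid; the file names and values are made-up examples:

replacements = {
    b'192.168.1.10': b'XXX.XXX.X.XX',   # same length as the original
    b'+4412345678': b'+0000000000',     # same length as the original
}

with open('input.ber', 'rb') as f:      # read raw bytes, no text decoding
    data = f.read()

for old, new in replacements.items():
    assert len(old) == len(new), 'replacement must not change the length'
    data = data.replace(old, new)

with open('anonymised.ber', 'wb') as f:
    f.write(data)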
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
    print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
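A small sketch of why the datatype matters: the same eight bytes swap very differently depending on what they hold (the value 3.14 is just an illustration):

import struct

raw = bytes.fromhex('40091eb851eb851f')          # 3.14 stored as a big-endian double

as_double = struct.unpack('>d', raw)[0]          # 3.14
as_shorts = struct.unpack('>4h', raw)            # four unrelated 16-bit integers

little_double = struct.pack('<d', as_double)     # reverses all 8 bytes at once
little_shorts = struct.pack('<4h', *as_shorts)   # reverses each 2-byte pair separately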
I am trying to do transliteration, where I need to replace every English source character read from a file with its equivalent from a dictionary defined in my source code, mapping to another language in Unicode format. I can now read the file character by character; how do I look up each character's equivalent in the dictionary and make sure it is written to a new, transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
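A minimal sketch of that layout; only the 0x0904 -> u'a' mapping comes from the explanation above, the rest is a made-up illustration:

transliteration = {
    0x0904: u'a',    # DEVANAGARI LETTER SHORT A -> 'a'
    0x0020: None,    # example: drop spaces entirely
}

input_text = u'\u0904 \u0904'
output_text = input_text.translate(transliteration)
print(output_text)    # prints 'aa': each letter becomes 'a' and the space is removed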
Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing. d is a dictionary where the keys are your source syllables and the values are the corresponding output strings. You can also try to read your file line by line instead of reading it all in at once.