Find and replace strings in raw ASN.1 encoded data - python

I have some ASN.1 BER encoded raw data which looks like this when opened in notepad++:
[Screenshot: sample ASN.1 encoded data]
I believe it's in binary octet format, so only the "IA5String" data types are readable/meaningful.
I want to do a find and replace on certain string data that contains sensitive information (phone numbers, IP addresses, etc.), in order to scramble and anonymise it while leaving the rest of the encoded data intact.
I've made a Python script to do it, and it works fine on plain text data, but I'm having encoding/decoding issues when trying to read/write files in this encoded format, I guess because it contains octet values outside the ASCII range.
What method would I need to use to read this data and do a find & replace on the strings, creating a modified file that leaves everything else intact? I think it should be possible without completely decoding the raw ASN.1 data against a schema, since I only need to work on the IA5String data types.
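Roughly what I'm trying to achieve, as a sketch (the file names and the values being scrubbed are placeholders, and each replacement is kept the same length as the original so the BER length octets stay valid):

with open("input.ber", "rb") as f:
    data = f.read()

# Placeholder values; the real ones would come from a config or a regex pass.
replacements = {
    b"+447700900123": b"+440000000000",
    b"192.168.10.15": b"0.0.0.0.0.0.0",
}

for old, new in replacements.items():
    assert len(old) == len(new), "length change would corrupt the BER lengths"
    data = data.replace(old, new)

with open("output.ber", "wb") as f:
    f.write(data)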
Thanks

Related

Python unicode code point issues: \xe2\x82\x82 vs. CO\u2082

My program is required to take in inputs, but I am having issues with subscripts such as CO₂...
So when I use CO₂ as an argument to the function, it seems to be represented as the string 'CO\xe2\x82\x82', which is apparently the string literal?
Further on, I read from a spreadsheet (an xlsx file) using read_excel() from pandas to find entries pertaining to CO₂. I then convert this into a dictionary, but in this case it is represented as 'CO\u2082'.
I use the args from earlier, represented as 'CO\xe2\x82\x82', so it doesn't recognize an entry for 'CO\u2082', which then results in a KeyError.
My question is: what would be a way to convert both these representations of CO₂ so that I can do look-ups in the dictionary? Thank you for any advice.
Looks like your input to the function is encoded as UTF-8, while the XLSX file is in decoded Unicode.
b'\xe2\x82\x82' is the UTF-8 encoding of the Unicode codepoint '\u2082', which is identical to '₂' on Unicode-enabled systems.
Most modern systems are Unicode-enabled, so the most common reason to see the former UTF-8 encoding is reading bytes data, which is always encoded. You can fix that by decoding it like so:
>>> data = b'CO\xe2\x82\x82'
>>> data.decode()
'CO₂'
If the encoded data are somehow in a normal (non-bytes) string, then you can do it by converting the existing string to bytes and then decoding it:
>>> data = 'CO\xe2\x82\x82'
>>> bytes(map(ord, data)).decode()
'CO₂'
As @mark-tolonen points out below, using the latin-1 encoding is functionally identical to bytes(map(ord, data)), but much, much faster:
>>> data = 'CO\xe2\x82\x82'
>>> data.encode('latin1').decode()
'CO₂'

The correct way to load and read a JSON file containing special characters in Python

I'm working with a JSON file that contains some strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text using the json.load() function in a Python 3.7 environment and tried to encode/decode it with some methods I found around the Internet, but I still cannot get the proper string I expected. (In this case, it should be Lê Nguyễn Phú.)
My question is: which encoding did they use, and how can I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot know or change how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
"content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
import json

with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash. But if you try to apply it to a validly-encoded string containing only codepoints up to U+00FF, it will turn your perfectly good string into a different kind of garbage.
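Applied to the file from the question, the whole round trip might look like this (a sketch reusing the json_path variable from the question):

import json

with open(json_path, 'r') as f:
    data = json.load(f)

# json.load() has already turned the \uXXXX escapes into characters, so only
# the latin-1 -> utf-8 round trip described above is needed here.
fixed = data.get('content', '').encode('latin_1').decode('utf_8')
print(fixed)  # Lê Nguyễn Phú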

Decoding a byte with latin-1 characters to string with decimal representation

I am working on a migration project to upgrade a web server layer from Python 2.7.8 to Python 3.6.3, and I have hit a roadblock for some special cases.
When a request is received from a client, the payload is transmitted locally using pyzmq, which now deals in bytes in Python 3 instead of str (as it did in Python 2).
Now, the payload I am receiving is encoded using the iso-8859-1 (latin-1) scheme, and I can easily convert it into a string with payload.decode('latin-1') and pass it to the next service (svc-save-entity), which expects a string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented as ASCII character references (such as &#233; for é) rather than in hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any Python expert guide me here? Essentially I need the definition of a function, say decode_tostring():
payload = b'Banco Santander (M\xe9xico)'  # payload is in bytes
payload_str = decode_tostring(payload)  # function to convert into string
payload_str == 'Banco Santander (M&#233;xico)'  # payload_str is a string with ASCII character references
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
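Applied to the example payload from the question, that should give:

payload = b'Banco Santander (M\xe9xico)'
print(decode_tostring(payload))  # Banco Santander (M&#233;xico)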
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
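For comparison, here are a few of the standard handlers applied to the same character (standard library behaviour in Python 3):

'é'.encode('ascii', errors='xmlcharrefreplace')  # b'&#233;'
'é'.encode('ascii', errors='backslashreplace')   # b'\\xe9'
'é'.encode('ascii', errors='replace')            # b'?'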

Separate binary data (blobs) in csv files

Is there any safe way of mixing binary with text data in a (pseudo)csv file?
One naive and partial solution would be:
using a compound field separator made of more than one character (the \a\b sequence, for example)
saving each field as either text or binary data would require the parser of the pseudo-csv to look for the \a\b sequence and read the data between separators according to a known rule (by means of a known header with field name and field type, for example)
The core issue is that binary data is not guaranteed to not contain the \a\b sequence somewhere inside its body, before the actual end of the data.
The proper solution would be to save the individual blob fields in their own separate physical files and only include the filenames in a .csv, but this is not acceptable in this scenario.
Is there any proper and safe solution, either already implemented or applicable given these restrictions?
If you need everything in a single file, just use one of the methods to encode binary as printable ASCII, and add the result to the CSV fields (letting the CSV module add and escape quotes as needed).
One such method is base64 - and in Python's base64 module there are also more space-efficient codecs like base85 (on newer Pythons, version 3.4 and above, I guess).
So, an example in Python 2.7 would be:
import csv, base64
import random
data = b''.join(chr(random.randrange(0,256)) for i in range(50))
writer = csv.writer(open("testfile.csv", "wt"))
writer.writerow(["some text", base64.b64encode(data)])
Of course, you have to do the proper base64 decoding on reading the file as well - but it is certainly better than trying to create an ad-hoc escaping method.
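For newer Python versions, a rough Python 3 equivalent including the read-back step might look like this (os.urandom simply stands in for whatever binary blob you actually have):

import base64
import csv
import os

data = os.urandom(50)  # placeholder for the real binary blob

# Write: base64-encode the blob and store it as an ordinary text field.
with open("testfile.csv", "w", newline="") as f:
    csv.writer(f).writerow(["some text", base64.b64encode(data).decode("ascii")])

# Read: decode the base64 field to recover the original bytes.
with open("testfile.csv", newline="") as f:
    text, blob_b64 = next(csv.reader(f))
    assert base64.b64decode(blob_b64) == data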

Converting String. Simple vs Unicode, Python

I am writing a script in Python that is used to validate the values of each cell in a parent table and compare them to values in a lookup table.
So, in the parent table I have a number of columns and each column corresponds to a lookup table for the known values that should be in each record in that particular column.
When I read in the values from the parent table, there will be many types (e.g. unicode strings, ints, floats, dates, etc.).
The lookup tables have the same variety of types, but when a value is a string, it's a simple string, not a unicode string, which forces me to convert the values to match. (That is, if the value in the cell from the parent table is a unicode string, I need a conditional statement to test whether it's unicode and then convert it to a simple string.)
if isinstance(row.getValue(columnname), unicode):
    x = str(row.getValue(columnname))
My question is: would it be better to convert the unicode strings to simple strings, or vice versa, to match the types? And why would it be better?
If it helps, my parent table is all in access and the lookup tables are all in excel. I don't think that really matters, but maybe I am missing something.
It'd be better to decode byte strings to unicode.
Unicode data is the canonical representation; encoded bytes differ based on what encoding was used.
You always want to work with Unicode within your program, then encode back to bytes as needed to send over the network or write data to files.
Compare this to using date/time values; you'd convert those to datetime objects as soon as possible too. Or images; loading an image from PNG or JPG you'd want to get a representation that lets you manipulate the colours and individual pixels, something that is much harder when working with the compressed image format on disk.
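A minimal sketch of that boundary discipline for this case (Python 2, assuming the byte strings coming from your tables are UTF-8; lookup_values is a hypothetical stand-in for the values read from the Excel lookup table):

def to_unicode(value, encoding='utf-8'):
    # Decode Python 2 byte strings to unicode; leave other types untouched.
    if isinstance(value, str):
        return value.decode(encoding)
    return value

# Normalise both sides to unicode before comparing, instead of calling
# str() on the unicode side (which fails for non-ASCII characters).
cell = to_unicode(row.getValue(columnname))
valid_values = set(to_unicode(v) for v in lookup_values)

is_valid = cell in valid_values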
