Separate binary data (blobs) in csv files - python

Is there any safe way of mixing binary with text data in a (pseudo)csv file?
One naive and partial solution would be:
using a compound field separator made of more than one character (e.g. the \a\b sequence)
saving each field as either text or binary data; the parser of the pseudo-csv would then look for the \a\b sequence and read the data between separators according to a known rule (e.g. a known header giving each field's name and type)
The core issue is that the binary data is not guaranteed to be free of the \a\b sequence itself, so a blob could contain the separator somewhere in its body, before the actual end of the field.
The proper solution would be to save the individual blob fields in their own separate physical files and only include the filenames in a .csv, but this is not acceptable in this scenario.
Is there any proper and safe solution, either already implemented or applicable given these restrictions?

If you need everything in a single file, just use one of the methods of encoding binary as printable ASCII and add the result to the CSV fields (letting the csv module add and escape quotes as needed).
One such method is base64 - and besides plain base64, Python's base64 module offers more space-efficient codecs such as base85 (on newer Pythons, 3.4 and above).
So, an example in Python 2.7 would be:
import csv, base64
import random

# 50 bytes of random binary data (in Python 2.7, chr() yields byte strings)
data = b''.join(chr(random.randrange(0, 256)) for i in range(50))

# open in binary mode ("wb"), as the Python 2 csv docs recommend
with open("testfile.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerow(["some text", base64.b64encode(data)])
Of course, you have to do the proper base64 decoding on reading the file as well - but it is certainly better than trying to create an ad-hoc escaping method.
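For completeness, the reading side might look like this (a minimal sketch, assuming the same Python 2.7 setup and the testfile.csv written above):
import csv, base64

# read the rows back and decode the base64 field into the original bytes
with open("testfile.csv", "rb") as f:
    for text, encoded in csv.reader(f):
        blob = base64.b64decode(encoded)  # the original 50 bytes of binary data
        print text, len(blob)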

Related

Write a literal newline in ElementTree Attribute

from xml.etree.ElementTree import SubElement
from io import BytesIO
from base64 import encodebytes
# xml_tree and c.image come from the asker's surrounding code
subelement = SubElement(xml_tree, "image")
stream = BytesIO()
c.image.save(stream, format="PNG")
png = encodebytes(stream.getvalue()).decode("utf-8")
subelement.set("xlink:href", f"data:image/png;base64,{png}")
I am doing a very basic write of an SVG image element and attempting to conform to RFC 2045, which requires that I provide the base64 code with line ends within the file.
I get the idiomized version:
<image xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAApUAAALUCAIAAADVN145AAAKMWlDQ1BJQ0MgUHJvZmlsZQAAeJyd
...
The written data replaces the \n with &#10;. I need to have ElementTree literally write the \n to disk. Am I missing something? Or is there a workaround?
I think you already have the correct result, with the XML entity representation of the newline character. You're serializing data as XML, so you need to encode the value in the way XML defines. So you wrap your image data twice: first with base64 encoding, then with XML encoding (which is incidentally 1:1 for most characters you care about).
Actually, if you put the newline character itself into the attribute, the XML parser could probably normalize it to a space when reading.
That RFC is about MIME encoding, and I think you are trying to be too literal in implementing those formatting rules when encoding in XML format for that attribute.
Note that many implementations may elect to encode the
local representation of various content types directly
rather than converting to canonical form first,
encoding, and then converting back to local
representation. In particular, this may apply to plain
text material on systems that use newline conventions
other than a CRLF terminator sequence. Such an
implementation optimization is permissible, but only
when the combined canonicalization-encoding step is
equivalent to performing the three steps separately.
Similarly, a CRLF sequence in the canonical form of the data
obtained after base64 decoding must be converted to a quoted-
printable hard line break, but ONLY when converting text data.
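As a quick sanity check of the point above, here is a minimal sketch using only the standard library (the attribute name and the base64 payload are simplified placeholders): ElementTree writes the newline as the &#10; entity, and parsing the file back yields the original newline in the attribute value.
import xml.etree.ElementTree as ET

root = ET.Element("svg")
image = ET.SubElement(root, "image")
image.set("href", "data:image/png;base64,iVBORw0K\nGgoAAAAN\n")  # placeholder payload

ET.ElementTree(root).write("demo.svg", encoding="utf-8")
print(open("demo.svg").read())  # the newlines appear as &#10; in the attribute

reparsed = ET.parse("demo.svg").getroot().find("image")
assert "\n" in reparsed.get("href")  # the entity decodes back to a real newline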

The correct way to load and read a JSON file containing special characters in Python

I'm working with a JSON file that contains some strings in an unknown encoding, like the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text using the json.load() function in a Python 3.7 environment and tried to encode/decode it with several methods I found around the Internet, but I still cannot get the string I expected. (In this case, it should be Lê Nguyễn Phú.)
My question is: which encoding did they use, and how can I parse this text properly in Python?
The JSON file comes from an external source that I don't control, so I cannot know, or change, how the text was encoded.
[Updated] More details:
The JSON file looks like this:
{
    "content": "L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
import json

with open(json_path, 'r') as f:
    data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
This decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion; there's no completely reliable way to look at an input and decide whether this broken decoding should be applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash. But if you try to apply it to a validly-encoded string containing only codepoints up to U+00FF, it will turn your perfectly good string into a different kind of garbage.
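Applied to the example file above, a minimal sketch (assuming every affected string really is double-encoded in this way):
import json

with open(json_path, 'r') as f:  # json_path as in the question
    data = json.load(f)

# undo the double encoding: back to Latin-1 bytes, then decode them as UTF-8
fixed = data.get('content', '').encode('latin_1').decode('utf_8')
print(fixed)  # Lê Nguyễn Phú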

Find and replace strings in raw ASN.1 encoded data

I have some ASN.1 BER encoded raw data which looks like this when opened in notepad++:
[image: sample of the ASN.1 encoded data]
I believe it's in binary octet format, so only the IA5String data types are readable/meaningful.
I want to do a find and replace on certain strings that contain sensitive information (phone numbers, IP addresses, etc.) in order to scramble and anonymise them, while leaving the rest of the encoded data intact.
I've made a Python script to do it, and it works fine on plain text data, but I'm having encoding/decoding issues when trying to read/write files in this encoded format, I guess because they contain octet values outside the ASCII range.
What method would I need to use to import this data and do find & replace on the strings, creating a modified file that leaves everything else intact? I think it should be possible without completely decoding the raw ASN.1 data against a schema, since I only need to work on the IA5String data types.
Thanks
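If the byte-level approach described above is enough, one hedged sketch (with placeholder filenames and values) is to open the file in binary mode and replace each sensitive string with one of exactly the same length, so the BER length octets and the offsets of the surrounding structure stay valid:
replacements = {
    b"+44123456789": b"+00000000000",  # placeholder phone number
    b"192.168.1.10": b"010.000.0.00",  # placeholder IP address
}

with open("input.ber", "rb") as f:  # binary mode: no text decoding at all
    raw = f.read()

for old, new in replacements.items():
    assert len(old) == len(new), "replacement must keep the same length"
    raw = raw.replace(old, new)

with open("anonymised.ber", "wb") as f:
    f.write(raw)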

Write volatile strings to file in Python

I have a lot of strings: about 14,000 in a list of tuples.
A lot of the strings have commas and newlines, and maybe even unicode delimiters - I'm not 100% sure.
I need to write the tuples to a file, preferably in some format that Excel or Numbers can open. I tried CSV, but all the commas in the strings mess up the file.
How should I write my list of tuples to a file, and what format should the file be, so that the weird content in the strings does not affect the formatting of the file?
In the Python csv module you can define a delimiter other than a comma:
csv.writer(file, delimiter=':')
If the target is Excel then you could use an Excel file writing module such as XlsxWriter or xlwt.
That would avoid any issues with CSV separators.
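A small hedged sketch of the XlsxWriter route (the rows below stand in for the real list of ~14,000 tuples):
import xlsxwriter

rows = [("tweet with, commas", "and\nnewlines"), ("plain", "text")]  # placeholder data

workbook = xlsxwriter.Workbook("tweets.xlsx")
worksheet = workbook.add_worksheet()
for i, row in enumerate(rows):
    worksheet.write_row(i, 0, row)  # no delimiter escaping needed in xlsx
workbook.close()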
Don't change anything.
Since "my sample of tweets covers almost every unicode char", there is no reasonable safe delimiter you can choose.
But CSV has ways of dealing with that: escaping special characters, quoting fields with special characters in them, or both. There are many options to choose from, which you can see in Dialects and Formatting Parameters.
However, the default dialect is specifically designed to work well with Excel. And, since your goal is to put the data into some format that Excel can open, you can just use the defaults as-is. Unless you want to make it more readable and editable in a text editor, there is no problem.
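For example, a minimal sketch (Python 3) of the defaults round-tripping awkward content:
import csv

rows = [("id1", 'text with, commas and "quotes"'), ("id2", "text with\nnewlines")]

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)  # the default dialect quotes fields as needed

with open("tweets.csv", newline="", encoding="utf-8") as f:
    assert [tuple(r) for r in csv.reader(f)] == rows  # values come back intact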

How to properly write utf8 IPTC metadata with python library IPTCInfo?

After generating a JPEG thumbnail file with PIL, I would like to use IPTCInfo in order to write IPTC metadata containing French characters with accents. I was thinking about using the UTF-8 character encoding.
So I tried the following:
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info.data['credit'] = some_unicode_string
info.saveAs(output_file)
and many other variations:
info = IPTCInfo(input_file, force=True)
info = IPTCInfo(input_file, force=True, inp_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8')
info = IPTCInfo(input_file, force=True, inp_charset='utf8', out_charset='utf8')
info = IPTCInfo(input_file, force=True, inp_charset='utf_8', out_charset='utf_8')
...
While the metadata written by IPTCInfo preserves the unicode Python string when read back with IPTCInfo, I always find weird characters when trying to read it with other pieces of software: OS X file information, ExifTool, Photoshop, ViewNX 2.
So what is the right way to write unicode with IPTCInfo and produce a standard-compliant file understandable by all software?
Something related to your question, coming from the IPTC forum:
Using the XMP packet makes things quite easy: UTF-8 is the default character set. Thus you can use and even mix different characters sets and scripts.
The IPTC IIM header is a bit more tricky: it includes a field to indicate which character set has been used for textual fields (for the IIM experts: this is dataset 1:90) but unfortunately this field has not been used by a vast majority of imaging software and only in most recent years some of them are using it.
Also in the IPTC EnvelopeRecord Tags, you will find:
90 CodedCharacterSet string[0,32]!
(values are entered in the form "ESC X Y[, ...]". The escape sequence for UTF-8 character coding is "ESC % G", but this is displayed as "UTF8" for convenience. Either string may be used when writing. The value of this tag affects the decoding of string values in the Application and NewsPhoto records. This tag is marked as "unsafe" to prevent it from being copied by default in a group operation because existing tags in the destination image may use a different encoding. When creating a new IPTC record from scratch, it is suggested that this be set to "UTF8" if special characters are a possibility)
See also -charset CHARSET
Certain meta information formats allow coded character sets other than plain ASCII. When reading, most known encodings are converted to the external character set according to the exiftool "-charset CHARSET" or -L option, or to UTF‑8 by default. When writing, the inverse conversion is performed. Alternatively, special characters may be converted to/from HTML character entities with the -E option.
Though the comment in the code of the IPTCInfo implementation is not very encouraging, there is still a dictionary of encodings in the code which gives more clues.
In your code example, which seems to be correct, you are doing this :)
info.data['credit'] = some_unicode_string
What do you call some_unicode_string? Are you sure it's a UTF-8 encoded byte string (not a unicode object)?
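For what it's worth, a hedged sketch of that last point (the import name, the file paths, and the idea of handing IPTCInfo a UTF-8 encoded byte string rather than a unicode object are assumptions for illustration, not a confirmed fix):
# -*- coding: utf-8 -*-
from iptcinfo import IPTCInfo  # assumed Python 2 era package name

input_file, output_file = 'thumb.jpg', 'thumb_iptc.jpg'  # placeholder paths
some_unicode_string = u'Crédit photo : Éléonore'

info = IPTCInfo(input_file, force=True, inp_charset='utf_8', out_charset='utf_8')
info.data['credit'] = some_unicode_string.encode('utf-8')  # UTF-8 bytes, not unicode
info.saveAs(output_file)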
