Write volatile strings to file in Python

I have a lot of strings: about 14,000 in a list of tuples.
A lot of the strings contain commas and newlines, and maybe even Unicode delimiters - I'm not 100% sure.
I need to write the tuples to a file, preferably in some format that Excel or Numbers can open. I tried CSV, but all the commas in the strings mess up the file.
How should I write my list of tuples to a file, and what format should the file be, so that the weird content in the strings does not affect the formatting of the file?

In the Python csv module you can define a delimiter other than a comma:
csv.writer(file, delimiter=':')

If the target is Excel then you could use an Excel file writing module such as XlsxWriter or xlwt.
That would avoid any issues with CSV separators.
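For instance, a minimal XlsxWriter sketch (the output filename and the sample rows are made up; rows stands in for the 14,000 tuples):
import xlsxwriter

# Two sample tuples standing in for the real list.
rows = [("hello, world", "line one\nline two"), ("plain text", "more text")]

workbook = xlsxwriter.Workbook("output.xlsx")
worksheet = workbook.add_worksheet()

# One tuple per spreadsheet row; commas and newlines inside the strings are irrelevant here.
for row_index, row in enumerate(rows):
    worksheet.write_row(row_index, 0, row)

workbook.close()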

Don't change anything; keep the defaults.
Since "my sample of tweets covers almost every unicode char", there is no reasonable safe delimiter you can choose.
But CSV has ways of dealing with that: escaping special characters, quoting fields with special characters in them, or both. There are many options to choose from, which you can see in Dialects and Formatting Parameters.
However, the default dialect is specifically designed to work well with Excel. And, since your goal is to put the data into some format that Excel can open, you can just use the defaults as-is. Unless you also need the file to be readable and editable in a text editor, there is no problem.
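As a minimal sketch (the filename and sample rows are made up, and the fields are plain byte strings - on Python 2 any unicode fields would first need encoding to UTF-8), the default writer quotes any field that contains a comma, a quote character, or a newline, so Excel reads the values back intact:
import csv

rows = [("hello, world", "line one\nline two"), ("plain field", 'has "quotes"')]

with open("output.csv", "wb") as f:   # Python 2; on Python 3 use open("output.csv", "w", newline="")
    writer = csv.writer(f)            # default dialect: comma-delimited, minimal quoting, Excel-friendly
    writer.writerows(rows)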


Separate binary data (blobs) in csv files

Is there any safe way of mixing binary with text data in a (pseudo)csv file?
One naive and partial solution would be:
using a compound field separator made of more than one character (e.g. the \a\b sequence)
saving each field as either text or binary data, and requiring the parser of the pseudo-CSV to look for the \a\b sequence and read the data between separators according to a known rule (e.g. by means of a known header with field name and field type)
The core issue is that binary data is not guaranteed not to contain the \a\b sequence somewhere inside its body, before the actual end of the data.
The proper solution would be to save the individual blob fields in their own separate physical files and only include the filenames in a .csv, but this is not acceptable in this scenario.
Is there any proper and safe solution, either already implemented or applicable given these restrictions?
If you need everything in a single file, just use one of the methods to encode binary as printable ASCII, and add the result to the CSV fields (letting the csv module add and escape quotes as needed).
One such method is base64 - and within Python's base64 module there are even more space-efficient codecs, like base85 (available in newer Pythons, version 3.4 and above).
So, an example in Python 2.7 would be:
import csv, base64
import random
# 50 random bytes standing in for the binary blob (in Python 2, chr() returns a one-byte str).
data = b''.join(chr(random.randrange(0, 256)) for i in range(50))

with open("testfile.csv", "wt") as csvfile:
    writer = csv.writer(csvfile)
    # The blob is base64-encoded into printable ASCII before being written.
    writer.writerow(["some text", base64.b64encode(data)])
Of course, you have to do the proper base64 decoding on reading the file as well - but it is certainly better than trying to create an ad-hoc escaping method.
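For completeness, a sketch of the matching read side (same made-up filename as above):
import csv, base64

with open("testfile.csv", "rt") as csvfile:
    reader = csv.reader(csvfile)
    for text_field, blob_field in reader:
        blob = base64.b64decode(blob_field)   # back to the original raw bytes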

Messing up my unicode output - but where and how?

I am doing a word count on some text files, storing the results in a dictionary. My problem is that after outputting to file, the words are not displayed correctly even though they were fine in the original text. (I use TextWrangler to look at them.)
For instance, dashes show up as dashes in the original but as \u2014 in the output; in the output, every word is prefixed by a u as well.
Problem
I do not know where, when and how in my script this happens.
I am reading the files with codecs.open() and outputting them with codecs.open() and json.dump(). They both go wrong in the same way. In between, all I do is
tokenizing
applying regular expressions
collecting into a dictionary
And I don't know where I mess things up; I have de-activated tokenizing and most other functions to no effect. All this is happening in Python 2.
Following previous advice, I tried to keep everything within the script in Unicode.
Here is what I do (non-relevant code omitted):
#read in file, iterating over a list of "fileno"s
with codecs.open(os.path.join(dir, unicode(fileno) + ".txt"), "r", "utf-8") as inputfili:
    inputtext = inputfili.read()
#process the text: tokenize, lowercase, remove punctuation and conjugation
content = regular expression to extract text w/out metadata
contentsplit = nltk.tokenize.word_tokenize(content)
text = [i.lower() for i in contentsplit if not re.match(r"\d+", i)]
text = [re.sub(r"('s|s|s's|ed)\b", "", i) for i in text if i not in string.punctuation]
#build the dictionary of word counts
for word in text:
    dicti[word].append(word)
#collect counts for each word, make dictionary of unique words
dicti_nos = {unicode(k): len(v) for k, v in dicti.items()}
hapaxdicti = {k: v for k, v in perioddicti_nos.items() if v == 1}
#sort the dictionary
sorteddict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)
#output the results as .txt and json-file
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([unicode(i) for i in sorteddict]))
with open(file_name + ".json", "w") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, encoding="utf-8")
EDIT: Solution
Looks like my main issue was writing the file in the wrong way. If I change my code to what's reproduced below, things work out. Looks like joining a list of (string, number) tuples messed the string part up; if I join the tuples first, things work.
For the json output, I had to change to codecs.open() and set ensure_ascii to False. Apparently just setting the encoding to utf-8 does not do the trick like I thought.
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([":".join([i[0], unicode(i[1])]) for i in sorteddict]))
with codecs.open(file_name + ".json", "w", "utf-8") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, ensure_ascii=False)
Thanks for your help!
As your example is partially pseudocode there's no way to run a real test and give you something that runs and has been tested, but from reading what you have provided I think you may misunderstand the way Unicode works in Python 2.
The unicode type (such as is produced via the unicode() or unichr() functions) is meant to be an internal representation of a Unicode string that can be used for string manipulation and comparison purposes. It has no associated encoding. The unicode() function will take a buffer as its first argument and an encoding as its second argument and interpret that buffer using that encoding to produce an internally usable Unicode string that is from that point forward unencumbered by encodings.
That Unicode string isn't meant to be written out to a file; all file formats assume some encoding, and you're supposed to provide one again before writing the Unicode string out. Every place you have a construct like unicode(fileno) or unicode(k) or unicode(i) is suspect, both because you're relying on a default encoding (which probably isn't what you want) and because you're going on to expose most of these values directly to the file system.
After you're done working with these Unicode strings, you can use their built-in encode() method, with your desired encoding as an argument, to pack them into ordinary byte strings laid out as required by that encoding.
So, looking back at your example above, your inputtext variable is an ordinary string containing data encoded per the UTF-8 encoding. This isn't Unicode. You could convert it to a Unicode string with an operation like inputuni = unicode(inputtext, 'utf-8') and operate on it like that if you chose, but for what you're doing you may not even find it necessary. If you did convert it to Unicode, though, you'd have to perform the equivalent of an inputuni.encode('UTF-8') on any Unicode string that you were planning on writing out to your file.
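To make that round trip concrete, here is a minimal Python 2 sketch of the decode-then-encode pattern described above (the filenames are hypothetical):
# Read raw UTF-8 bytes and decode them into a unicode object for processing.
with open("input.txt", "rb") as f:
    raw = f.read()                     # str (plain bytes) in Python 2
inputuni = raw.decode("utf-8")         # unicode object, no encoding attached

# ... tokenize, count, and sort the unicode text here ...

# Encode back to UTF-8 bytes only at the point of writing.
with open("output.txt", "wb") as f:
    f.write(inputuni.encode("utf-8"))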

Python, Hex and common file signatures

I’ve got files from a system restore which have odd bits of data padded onto the front, which makes them gobbledegook when opened. I’ve got a text file of file signatures which I’ve collected, and which contains information represented like this at the moment:
Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1
What I am planning on is reading the text file and using the data to identify the correct header in the data of the corrupt file, and stripping everything off before it – hopefully leaving a readable file after. I’m stuck on how best to get this data into Python in a readable format, though.
My first try was simply reading the values from the file, but then Python treats the backslashes as literal characters rather than interpreting the escape sequences. Is this the best method to achieve what I need? Do I need to think about representing the data in the text file some other way? Or maybe in a dictionary? Any help you could provide would be really appreciated.
You can decode the \xhh escapes by using the string_escape codec (Python 2) or the unicode_escape codec (Python 3, or when you have to use Unicode in Python 2):
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
'\\xD0\\xCF\\x11\\xE0\\xA1\\xB1\\x1A\\xE1'
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'.decode('string_escape')
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'
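A sketch of how the decoded signature might then be used to trim a damaged file, assuming Python 2 (the filenames are made up):
signature = r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'.decode('string_escape')

with open('corrupt_restore.doc', 'rb') as f:
    data = f.read()

offset = data.find(signature)
if offset != -1:
    # Drop the padded junk before the real header and keep the remainder.
    with open('recovered.doc', 'wb') as out:
        out.write(data[offset:])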

delimiter [0001] in a text file, reading using np.loadtxt in python

I have a text file with several rows.
An example of a row is :
3578312 10 3 7 8
However, the delimiter is [0001] (shown in a box) instead of a traditional delimiter like a comma or a tab.
I'm using numpy in Python to read this; does anyone know what the delimiter should be?
I've searched the documentation but haven't found anything.
import numpy as np
read_data = np.genfromtxt(fname, delimiter='\u0001')
Gives:
array([ nan, nan, nan, ..., nan, nan, nan])
But when I manually convert that delimiter to a comma delimiter, I can read the file with the proper values.
I know that \u0001 is not the right delimiter; it was just a hypothetical example. I am unable to paste the delimiter here - it looks like a closed square box with 0001 arranged in a 2-row by 2-column fashion.
Most likely, \u0001 is the right delimiter in a sense - you're just doing it wrong.
There are fonts that use symbols like that for displaying non-printing control characters, so that 0001-in-a-box is the representation of U+0001, aka Start of Heading, aka control-A.*
The first problem is that the Python 2.x literal '\u0001' doesn't specify that character. You can't use \u escapes in str literals, only unicode literals. The docs explain this, but it makes sense if you think about it. So, the literal '\u0001' isn't the character U+0001 in your source file's encoding, it's six separate characters (a backslash, a letter, and four numbers).
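For illustration, a quick check in a Python 2 interpreter makes the difference visible:
>>> len('\u0001')     # str literal: the \u escape is left as six separate characters
6
>>> len(u'\u0001')    # unicode literal: a single character, U+0001
1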
So, could you just use u'\u0001'? Well, yes, but then you'd need to decode the text file to Unicode, which is probably not appropriate here. It isn't really a text file at all, it's a binary file. And the key is to look at it that way.
Your text editor can't do that, because it's… well, a text editor, so it decodes your binary file as if it were ASCII (or maybe UTF-8, Latin-1, cp1252, whatever) text, then displays the resulting Unicode, which is why you're seeing your font's representation of U+0001. But Python lets you deal with binary data directly; that's what a str does.
So, what are the actual bytes in the file? If you do this:
b = f.readline()
print repr(b)
You'll probably see something like this:
'3578312\x0110\x013\x017\x018\n'
And that's the key: the actual delimiter you want is '\x01'.**
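In other words, assuming the file really is control-A separated, something along these lines should work (the filename is made up):
import numpy as np

# Split each line on the raw 0x01 byte instead of a comma.
read_data = np.genfromtxt("data.txt", delimiter='\x01')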
Of course you could use u'\u0001'.encode('Latin-1'), or whatever encoding your source file is in… but that's just silly. You know what byte you want to match, why try to come up with an expression that represents that byte instead of just specifying it?
If you wanted to, you could also just convert the control-A delimiters into something more traditional like a comma:
lines = (line.replace('\x01', ',') for line in file)
But there's no reason to go through the extra effort to deal with that. Especially if some of the columns may contain text, which may contain commas… then you'd have to do something like prepend a backslash to every original comma that's not inside quotes, or quote every string column, or whatever, before you can replace the delimiters with commas.
* Technically, it should be shown as a non-composing non-spacing mark… but there are many contexts where you want to see invisible characters, especially control characters, so many fonts have symbols for them, and many text editors display those symbols as if they were normal spacing glyphs. Besides 0001 in a box, common representations include SOH (for "Start of Heading") or A (for "control-A") or 001 (the octal code for the ASCII control character) in different kinds of boxes. This page and this show how a few fonts display it.
** If you knew enough, you could have easily deduced that, because '\x01' in almost any charset will decode to u'\u0001'. But it's more important to know how to look at the bytes directly than to learn other people's guesses…

Search and replace characters in a file with Python

I am trying to do transliteration, where I need to replace every source character in English from a file with its equivalent in another language (in Unicode format) from a dictionary I have defined in the source code. I am now able to read the file character by character; how do I look up each character's equivalent in the dictionary and make sure it is written to a new transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically, doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [cross-linked with both Devanagari and roman-transliterated forms], English translation -- available from this site).
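A minimal sketch of that layout, with a made-up two-letter mapping just to show the shape of the dict (Python 2 style, with unicode literals):
# Keys are Unicode codepoints (integers); values are the replacement Unicode strings.
trans_map = {
    0x0905: u'a',    # DEVANAGARI LETTER A  -> 'a'
    0x092A: u'p',    # DEVANAGARI LETTER PA -> 'p'
}

input_text = u'\u092a\u0905'               # e.g. text read and decoded from the source file
output_text = input_text.translate(trans_map)
print repr(output_text)                    # u'pa'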
Note: Updated after clarifications from questioner. Please read the comments from the OP attached to this answer.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing. d is a dictionary where the keys are your source characters and the values are the output characters. You can also try to read your file line by line instead of reading it all in at once.
