I have a text file with several rows.
An example of a row is:
3578312 10 3 7 8
However, the delimiter is [0001] (in a box) instead of a traditional delimiter like a comma or a tab.
I'm using numpy in Python to read this; does anyone know what the delimiter should be?
I've searched the documentation but haven't found anything.
import numpy as np
read_data = np.genfromtxt(fname, delimiter='\u0001')
Gives:
array([ nan, nan, nan, ..., nan, nan, nan])
But when I manually convert the delimiter to a comma, I can read the file with the proper values.
I know that \u0001 is not the right delimiter; it was just a hypothetical example. I am unable to paste the delimiter here; it looks like a closed square box with 0001 arranged in two rows of two digits.
Most likely, \u0001 is the right delimiter in a sense; you're just specifying it wrong.
There are fonts that use symbols like that for displaying non-printing control characters, so that 0001-in-a-box is the representation of U+0001, aka Start of Heading, aka control-A.*
The first problem is that the Python 2.x literal '\u0001' doesn't specify that character. You can't use \u escapes in str literals, only in unicode literals. The docs explain this, but it makes sense if you think about it. So, the literal '\u0001' isn't the character U+0001 in your source file's encoding, it's six separate characters (a backslash, the letter u, and four digits).
So, could you just use u'\u0001'? Well, yes, but then you'd need to decode the text file to Unicode, which is probably not appropriate here. It isn't really a text file at all, it's a binary file. And the key is to look at it that way.
Your text editor can't do that, because it's… well, a text editor, so it decodes your binary file as if it were ASCII (or maybe UTF-8, Latin-1, cp1252, whatever) text, then displays the resulting Unicode, which is why you're seeing your font's representation of U+0001. But Python lets you deal with binary data directly; that's what a str does.
So, what are the actual bytes in the file? If you do this:
f = open(fname, 'rb')   # open the file in binary mode so nothing gets decoded
b = f.readline()
print repr(b)
You'll probably see something like this:
'3578312\x0110\x013\x017\x018\n'
And that's the key: the actual delimiter you want is '\x01'.**
Of course you could use u'\u0001'.encode('Latin-1'), or whatever encoding your source file is in… but that's just silly. You know what byte you want to match, why try to come up with an expression that represents that byte instead of just specifying it?
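Putting that together with the genfromtxt call from the question, here's a minimal sketch (fname is assumed to hold the path, as in the question):
import numpy as np

# '\x01' is the single byte U+0001 (control-A), which is what actually
# separates the fields in the file.
read_data = np.genfromtxt(fname, delimiter='\x01')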
If you wanted to, you could also just convert the control-A delimiters into something more traditional like a comma:
lines = (line.replace('\x01', ',') for line in file)
But there's no reason to go through the extra effort to deal with that. Especially if some of the columns may contain text, which may contain commas… then you'd have to do something like prepend a backslash to every original comma that's not inside quotes, or quote every string column, or whatever, before you can replace the delimiters with commas.
* Technically, it should be shown as a non-composing non-spacing mark… but there are many contexts where you want to see invisible characters, especially control characters, so many fonts have symbols for them, and many text editors display those symbols as if they were normal spacing glyphs. Besides 0001 in a box, common representations include SOH (for "Start of Heading") or A (for "control-A") or 001 (the octal code for the ASCII control character) in different kinds of boxes.
** If you knew enough, you could have easily deduced that, because '\x01' in almost any charset will decode to u'\u0001'. But it's more important to know how to look at the bytes directly than to learn other people's guesses…
Related
I'm working with regexes on byte strings, that is regexes like
re.compile(b'\x01\x02')
Some characters which correspond to, for instance, letters are automatically formatted. For example, if x = b'\x50', then x will be shown as b'P' instead of keeping the \x50 format.
I would like to be able to force the \x## format since it's causing problems further down the line, especially due to the \ character.
For instance, if you try to execute
re.compile(b'\\x')
This should simply be the equivalent of b'\x5c\x78', yet the regex breaks and says incomplete escape \x at position 0.
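For what it's worth, here's a small sketch (assuming Python 3) of both behaviours described above: the b'P' display is just how repr shows a printable byte, and the backslash has to be escaped for the regex engine as well as for the bytes literal:
import re

x = b'\x50'
print(x == b'P')        # True: same byte either way, repr just shows it as a letter

# To match a literal backslash followed by 'x', escape the backslash for the
# regex engine too (or let re.escape build the pattern from the raw bytes):
pat = re.compile(b'\\\\x')                    # pattern source is \\x
print(pat.search(b'abc\\xdef') is not None)   # True
pat2 = re.compile(re.escape(b'\x5c\x78'))     # same pattern, built with re.escape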
When I write
>>> st = "Piperazine (perphenazine, ﬂuphenazine)"
>>> st
'Piperazine (perphenazine, \xef\xac\x82uphenazine)'
What is happening? Why doesn't it do this for every "fl"? How do I avoid this?
It looks like \xef\xac\x82 is not, in fact, fl. Is there any way to 'translate' this character into fl (as the author intended it), without just excluding it via something like
unicode(st, errors='ignore').encode('ascii')
This is what is called a "ligature".
In printing, the f and l characters were typeset with a different amount of space between them from what normal pairs of sequential letters used - in fact, the f and l would merge into one character. Other ligatures include "th", "oe", and "st".
That's what you're getting in your input - the "fl" ligature character, UTF-8 encoded. It's a three-byte sequence. I would take minor issue with your assertion that it's "not, in fact, fl" - it really is, but your input is UTF-8 and not ASCII :-). I'm guessing you pasted from a Word document or an ebook or something that's designed for presentation instead of data fidelity (or perhaps, from the content, it was a LaTeX-generated PDF?).
If you want to handle this particular case, you could replace that byte sequence with the ASCII letters "fl". If you want to handle all such cases, you will have to use the Unicode Consortium's "UNIDATA" file at: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt . In that file, there is a column for the "decomposition" of a character. The f-l ligature has the identifier "LATIN SMALL LIGATURE FL". There is, by the way, a Python module for this data file at https://docs.python.org/2/library/unicodedata.html . You want the "decomposition" function:
>>> import unicodedata
>>> foo = u"ﬂuphenazine"
>>> unicodedata.decomposition(foo[0])
'<compat> 0066 006C'
0066 006C is, of course, ASCII 'f' and 'l'.
Be aware that if you're trying to downcast Unicode data to ASCII, you're eventually going to have a bad day. There are only 128 ASCII characters, while Unicode has over a million code points, and many of them cannot be readily represented as ASCII in a nonconvoluted way - who wants to have some text end up saying "<TREBLE CLEF> <SNOWMAN> <AIRPLANE> <YELLOW SMILEY FACE>"?
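If you do want to fold ligatures like this back to plain letters programmatically, here's a minimal sketch (Python 2, matching the question) using NFKD compatibility normalization from the same unicodedata module, which decomposes U+FB02 into 'f' followed by 'l':
import unicodedata

st = 'Piperazine (perphenazine, \xef\xac\x82uphenazine)'  # UTF-8 bytes, as in the question
u = st.decode('utf-8')                       # decode to a unicode object first
folded = unicodedata.normalize('NFKD', u)    # the fl ligature becomes 'f' + 'l'
print folded.encode('utf-8')                 # Piperazine (perphenazine, fluphenazine)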
I am running a Python program to process tab-delimited text data.
But it causes trouble because the data often contains characters such as U+001A or the ones listed at http://en.wikipedia.org/wiki/Newline#Unicode
(Worse, these characters are not even visible unless the file is opened in Sublime Text; not even Notepad++ shows them.)
If the Python program is run on Linux it ignores such characters automatically, but on Windows it can't.
For example, if there is a U+001A in the file, the Python program will think that's the end of the file.
For another example, if there is a U+0085 in the file, the Python program will think that's the point where a new line starts.
So I just want a separate program that erases EVERY such character that ordinary editors like Notepad++ don't show (and that program should work on Windows).
I do want to keep things like あ and ä. I only want to delete things like U+001A and U+0085, which Notepad++ doesn't show.
How can this be achieved?
There is no such thing as a "Unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, amongst these UTF-8, EBCDIC, ASCII, etc.
If you want to delete every character that cannot be represented in ASCII, then you can use the following (py3):
a = 'aあäbc'
a.encode('ascii', 'ignore')
This will yield b'abc'.
And if there are really U+001A, i.e. SUBSTITUTE, characters in your document, most probably something has gone haywire in a prior encoding step.
Using unicodedata looks to be the best way to do it, as suggested by Hyperboreus (Stripping non printable characters from a string in python), but as a quick hack you could do the following (in Python 2.x):
Open the source file in binary mode. This prevents Windows from truncating reads when it hits the EOF control character (U+001A, i.e. Ctrl-Z):
my_file = open("filename.txt", "rb")
Decode the file (this assumes the encoding is UTF-8):
my_str = my_file.read().decode("UTF-8")
Replace known "bad" code points:
my_str = my_str.replace(u"\u001A", u"")
You could skip step 2 and replace the UTF-8 encoded value of each "bad" code point in step 3, for example \x1A, but the method above allows for UTF-16/32 source if required.
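And here's a rough sketch of the unicodedata approach mentioned at the top, which drops control and format characters (categories Cc and Cf - an assumption about which categories cover U+001A and U+0085) while keeping あ, ä, tabs and newlines:
import unicodedata

def strip_control_chars(text):
    # Keep tab/newline/carriage return (the file is tab-delimited), drop
    # every other control (Cc) or format (Cf) code point such as U+001A.
    keep = u"\t\n\r"
    return u"".join(ch for ch in text
                    if ch in keep or unicodedata.category(ch) not in ("Cc", "Cf"))

my_file = open("filename.txt", "rb")
my_str = my_file.read().decode("UTF-8")
clean = strip_control_chars(my_str)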
I would like to do the following:
1) Serialize my class
2) Also manually edit the serialization dump file to remove certain objects of my class which I find unnecessary.
I am currently using Python with simplejson. As you know, simplejson escapes all non-ASCII characters, so when I dump a particular object, a character such as 好 becomes something like "\u00bd".
I would like to edit the simplejson file manually, for convenience. Does anyone here know a workaround that would let me do this?
My requirements for this serialization format:
1) Easy to use (just dump and load - done)
2) Allows me to edit them manually without much hassle.
3) Able to display Chinese characters
I use vim. Does anyone know a way to convert "\u00bd" to 好 in vim?
I don't know anything about simplejson or the Serialisation part of the question, but you asked about converting "\u00bd" to 好 in Vim. Here are some vim tips for working with unicode:
You'll need the correct encoding set up in vim, see:
:help 'encoding'
:help 'fileencoding'
Entering unicode characters by number is simply a case of going into insert mode, pressing Ctrl-V and then typing u followed by the four digit number (or U followed by an 8-digit number). See:
:help i_CTRL-V_digit
Also bear in mind that in order for the character to display correctly in Vim, you'll need a fixed-width font containing that character. It appears as a wide space in Envy Code R and as various boxes in Lucida Console, Consolas and Courier New.
To replace \uXXXX with unicode character XXXX (where X is any hexadecimal digit), type this when in normal mode (where <ENTER> means press the ENTER key, don't type it literally):
:%s/\\u\x\{4\}/\=eval('"' . submatch(0) . '"')/g<ENTER>
Note however that u00bd appears to be unicode character ½ (1/2 in case that character doesn't display correctly on your screen), not the 好 character you mentioned (which is u597D I think). See this unicode table. Start vim and type these characters (where <Ctrl-V> is produced by holding CTRL, pressing V, releasing V and then releasing CTRL):
i<Ctrl-V>u00bd
You should see a small character looking like 1/2, assuming your font supports that character.
If you want json/simplejson to produce unicode output instead of str output full of \uXXXX escapes, you need to pass ensure_ascii=False to dump()/dumps(), and then either encode the result yourself before saving or write it through a file object from codecs.
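A rough sketch of that last suggestion (the standard json module is used here; simplejson's dumps takes the same ensure_ascii parameter, and 好 is just an example value):
import codecs
import json  # simplejson.dumps accepts ensure_ascii as well

data = {"name": u"\u597d"}   # u'\u597d' is 好

f = codecs.open("dump.json", "w", encoding="utf-8")
f.write(json.dumps(data, ensure_ascii=False))
f.close()
# dump.json now contains the character 好 itself rather than a \uXXXX escape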
I am trying to do transliteration, where I need to replace every source character in English from a file with its equivalent (in another language, in Unicode format) from a dictionary I have defined in the source code. I am now able to read character by character from a file in English; how do I look up each character's equivalent in that dictionary and make sure it is printed to a new, transliterated output file? Thank you :).
The translate method of Unicode objects is the simplest and fastest way to perform the transliteration you require. (I assume you're using Unicode, not plain byte strings which would make it impossible to have characters such as 'पत्र'!).
All you have to do is lay out your transliteration dictionary in a precise way, as specified in the docs to which I pointed you:
each key must be an integer, the codepoint of a Unicode character; for example, 0x0904 is the codepoint for ऄ, AKA "DEVANAGARI LETTER SHORT A", so for transliterating it you would use as the key in the dict the integer 0x0904 (equivalently, decimal 2308). (For a table with the codepoints for many South-Asian scripts, see this pdf).
the corresponding value can be a Unicode ordinal, a Unicode string (which is presumably what you'll use for your transliteration task, e.g. u'a' if you want to transliterate the Devanagari letter short A into the English letter 'a'), or None (if during the "transliteration" you want to simply remove instances of that Unicode character).
Characters that aren't found as keys in the dict are passed on untouched from the input to the output.
Once your dict is laid out like that, output_text = input_text.translate(thedict) does all the transliteration for you -- and pretty darn fast, too. You can apply this to blocks of Unicode text of any size that will fit comfortably in memory -- basically doing one text file at a time will be just fine on most machines (e.g., the wonderful -- and huge -- Mahabharata takes at most a few tens of megabytes in any of the freely downloadable forms -- Sanskrit [[cross-linked with both Devanagari and roman-transliterated forms]], English translation -- available from this site).
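Here is a tiny sketch of that dict layout and the translate call (the particular Devanagari-to-Latin choices below are only illustrative, not a real transliteration scheme):
# keys are integer code points, values are the replacement strings
thedict = {
    0x0904: u'a',    # DEVANAGARI LETTER SHORT A -> 'a'   (illustrative only)
    0x092A: u'pa',   # DEVANAGARI LETTER PA      -> 'pa'  (illustrative only)
}

input_text = u'\u0904\u092a'
output_text = input_text.translate(thedict)   # -> u'apa'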
Note: Updated after clarifications from the questioner.
Something like this:
for syllable in input_text.split_into_syllables():
    output_file.write(d[syllable])
Here output_file is a file object, open for writing, and d is a dictionary whose keys are your source characters and whose values are the output characters. You can also read your file line by line instead of reading it all in at once.
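A character-level variant of that sketch, for concreteness (split_into_syllables above is pseudocode; the dict d here is a made-up mapping, and unknown characters are passed through unchanged rather than raising a KeyError):
import codecs

d = {u'a': u'\u0905', u'k': u'\u0915'}   # hypothetical English-to-Devanagari mapping

input_file = codecs.open('input.txt', 'r', encoding='utf-8')
output_file = codecs.open('output.txt', 'w', encoding='utf-8')
for line in input_file:
    # look up each character; anything not in the dict is copied as-is
    output_file.write(u''.join(d.get(ch, ch) for ch in line))
input_file.close()
output_file.close()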