I'm processing a string like this:
scrpt = "\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n"
infile = StringIO(scrpt)
#pretend infile was just a regular file...
r = csv.DictReader(infile, dialect=csv.Sniffer().sniff(infile.read(1000)))
infile.seek(0)
Frame, Xco, Yco = [],[],[]
for row in r:
    Frame.append(row['Frame'])
    Xco.append(row['X pixels'])
    Yco.append(row['Y pixels'])
This works fine. I get the string variable 'scrpt' sorted nicely into the variables 'Frame', 'Xco', and 'Yco'.
Now if I do this:
print(scrpt)
I see things neatly laid out in tabbed columns like this:
Frame X pixels Y pixels
2 615.5 334.5
3 615.885 334.136
4 615.937 334.087
5 615.917 334.106
6 615.892 334.129
7 615.905 334.117
8 615.767 334.246
9 615.546 334.456
10 615.352 334.643
But if I have the same string pasted from the clipboard and try to process it, it doesn't work.
In this case, if I print it like this:
print(scrpt)
I see:
\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n
Then when I go to process it the csv module won't sort it out.
What am I doing wrong?
It looks like I'm using the same data in both cases but something is different.
My guess is that your clipboard has literal backslash and t characters, not tab characters. For example, if you just copy from the first line of your source, that's exactly what you'll get.
In other words, it's as if you did this:
scrpt = r"\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n"
… or, equivalently:
scrpt = "\\tFrame\\tX pixels\\tY pixels\\r\\n\\t2\\t615.5\\t334.5\\r\\n\\t3\\t615.885\\t334.136\\r\\n\\t4\\t615.937\\t334.087\\r\\n\\t5\\t615.917\\t334.106\\r\\n\\t6\\t615.892\\t334.129\\r\\n\\t7\\t615.905\\t334.117\\r\\n\\t8\\t615.767\\t334.246\\r\\n\\t9\\t615.546\\t334.456\\r\\n\\t10\\t615.352\\t334.643\\r\\n\\r\\n"
If that's the problem, the fix is pretty easy. In 2.x:
scrpt = scrpt.decode('string_escape')
Or, in 3.x (where you can't call decode on a str):
scrpt = codecs.decode(scrpt, 'unicode_escape')
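If that's what is happening, a minimal sketch (assuming the pasted text is already in scrpt) would decode it first and then run the same csv code from the question:

import codecs
import csv
from io import StringIO

# Pasted text: literal backslash-t / backslash-r-n sequences, not real tabs and newlines
scrpt = r"\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n"

scrpt = codecs.decode(scrpt, 'unicode_escape')   # now contains real \t and \r\n

infile = StringIO(scrpt)
r = csv.DictReader(infile, dialect=csv.Sniffer().sniff(infile.read(1000)))
infile.seek(0)
for row in r:
    print(row['Frame'], row['X pixels'], row['Y pixels'])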
The unicode_escape codec is described in the list of Standard Encodings in the codecs module. It's defined as:
Produce a string that is suitable as Unicode literal in Python source code
In other words, if you encode with this codec, it will replace each non-printing Unicode character with an escape sequence that you can type into your source code. If you've got a tab character, it'll replace that with a backslash character and a t.
You want to do the exact reverse of that: you've got a string you copied out of source code, with source-code-style escape sequences, and you want to interpret it the same way the Python interpreter does. So, you just decode with the same codec. If you've got a backslash followed by a t, it'll replace them with a tab character.
It's worth playing with this in the interactive interpreter (remember to keep the repr and str representations straight while doing so!) until you get it.
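For instance, a quick Python 3 session showing both directions:

>>> 'a\tb\r\n'.encode('unicode_escape')
b'a\\tb\\r\\n'
>>> import codecs
>>> codecs.decode(r'a\tb\r\n', 'unicode_escape')
'a\tb\r\n'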
Related
I am trying to search for emoticons in python strings.
So I have, for example,
em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")
yes, the emoticon is there
and if I search for em_test in
'This is a test string 💰💰🚀'
I can actually find it.
So I have made a csv file with all the emoticons I want defined by their unicode.
The CSV looks like this:
\U0001F600
\U0001F601
\U0001F602
\U0001F923
and when I import it and print it, I actually do not get the emoticons but rather just the text representation:
['\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
...
]
and hence I cannot use this to search for these emoticons in another string...
I know that the double backslash \\ is only the representation of a single backslash, but somehow the Unicode reader does not get it... I do not know what I'm missing.
Any suggestions?
You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.
Just for fun, I'll also use unicodedata to get the names of those emojis.
import unicodedata as ud
emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)
output
\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣
This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.
You can make the decoding a little more robust by using
u.encode('Latin1').decode('unicode-escape')
but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
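A minimal sketch of that binary-mode approach might look like this (the file name emojis.csv and the test string are my own; since the file has a single column, the csv module isn't strictly needed):

# Read the escape sequences as bytes, so no .encode('ASCII') step is needed
with open('emojis.csv', 'rb') as f:
    emojis = [line.strip().decode('unicode-escape') for line in f]

print(emojis)   # e.g. ['😀', '😁', '😂', '🤣']

test = 'This made me laugh 😂'
if any(e in test for e in emojis):
    print("yes, the emoticon is there")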
1. keeping your csv as-is:
it's a bloated solution, but using ast.literal_eval works:
import ast
s = '\\U0001F600'
x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)
I get 0x1f600 (which is the correct char code) and the emoticon character (😀). (Well, I had to copy/paste the character from my console into this answer text field, but that's a console issue on my end; otherwise it works.)
Just surround the value with quotes to allow ast to take the input as a string.
2. using character codes directly
maybe you'd be better off storing the character codes themselves instead of the \U format:
print(chr(0x1F600))
does exactly the same thing (so ast is slightly overkill).
your csv could contain:
0x1F600
0x1F601
0x1F602
0x1F923
then chr(int(row[0], 16)) would do the trick when reading it. For example, if there is one code per row in the CSV (or in the first column):
with open("codes.csv") as f:
cr = csv.reader(f)
codes = [int(row[0],16) for row in cr]
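Once the codes are read, using them to search another string could look like this (the hard-coded list stands in for the values read from codes.csv, and the test string is my own):

codes = [0x1F600, 0x1F601, 0x1F602, 0x1F923]   # as read from codes.csv
emojis = [chr(c) for c in codes]               # 0x1F600 -> '😀', etc.

test = 'This made me laugh 😂'
print(any(e in test for e in emojis))          # True: 😂 is 0x1F602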
Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could this be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
    intro[index] = bold_letters(paragraph, 1)

def bold_letters(string, index):
    return "<b>" + string[0] + "</b>" + string[index:]
And I'm getting output like this:
<b>?</b>?ืจื ืืืื ืืืฉืชืื ืืืืจื ืืืืืชื ืืจืฆืื ื ืื ืฆืื ืืฉืืื ืืจืฅ ืืืืื ืืื ืืืืื ืื.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("ืืงืืื")
"ืืงืื<\b>ื<b>"
BTW, this is for Python 2
You are right: indices work over each byte when you are dealing with raw bytes, i.e. str in Python 2.x.
To work seamlessly with Unicode data, you first need to let Python 2.x know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted, i.e. you take a str and you return a str.
Ideally you should convert all the data from raw UTF-8 bytes to Unicode objects (I am assuming your source encoding is UTF-8 because that is the standard used by most applications these days) at the very beginning of your code, and convert back to raw bytes only at the very end, e.g. when saving to a DB, responding to a client, etc. Some frameworks handle that for you so that you don't have to worry.
def bold_letters(string, index):
    string = string.decode('utf8')
    string = "<b>" + string[0] + "</b>" + string[index:]
    return string.encode('utf8')
This will also work for ASCII because UTF-8 is a superset of ASCII. You can get a better understanding of how Unicode works, in Python specifically, by reading http://nedbatchelder.com/text/unipain.html
In Python 3.x, str is a Unicode type, so you don't have to do any of this explicitly.
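For comparison, a minimal Python 3 version of the same idea (reusing the first_letter_bold name from the question) needs no decoding or encoding at all:

def first_letter_bold(s):
    # In Python 3, indexing a str gives characters, not bytes
    return "<b>" + s[0] + "</b>" + s[1:]

print(first_letter_bold("שלום"))   # <b>ש</b>לום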
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character; Unicode strings use one code unit per character (at least for characters in the BMP, the first 65536 characters, on Python 2):
#coding:utf8
import io

s = u"ืืงืืื"
t = u'<b>' + s[0] + u'</b>' + s[1:]
print(t)

with io.open('out.htm', 'w', encoding='utf-8-sig') as f:
    f.write(t)
Output:
<b>ื</b>ืงืืื
But my Chrome browser displays out.htm as: [screenshot of the rendered page omitted]
I am doing a word count on some text files, storing the results in a dictionary. My problem is that after outputting to file, the words are not displayed right even if they were in the original text. (I use TextWrangler to look at them).
For instance, dashes show up as dashes in the original but as \u2014 in the output; in the output, every word is prefixed by a u as well.
Problem
I do not know where, when and how in my script this happens.
I am reading the files with codecs.open() and outputting them with codecs.open() and with json.dump(). Both go wrong in the same way. In between, all I do is:
tokenizing
regular expressions
collect in dictionary
And I don't know where I mess things up; I have de-activated tokenizing and most other functions to no effect. All this is happening in Python 2.
Following previous advice, I tried to keep everything within the script in Unicode.
Here is what I do (non-relevant code omitted):
#read in file, iterating over a list of "fileno"s
with codecs.open(os.path.join(dir, unicode(fileno)+".txt"), "r", "utf-8") as inputfili:
    inputtext = inputfili.read()

#process the text: tokenize, lowercase, remove punctuation and conjugation
content = <regular expression to extract text w/out metadata>
contentsplit = nltk.tokenize.word_tokenize(content)
text = [i.lower() for i in contentsplit if not re.match(r"\d+", i)]
text = [re.sub(r"('s|s|s's|ed)\b", "", i) for i in text if i not in string.punctuation]

#build the dictionary of word counts
for word in text:
    dicti[word].append(word)

#collect counts for each word, make dictionary of unique words
dicti_nos = {unicode(k): len(v) for k, v in dicti.items()}
hapaxdicti = {k: v for k, v in perioddicti_nos.items() if v == 1}

#sort the dictionary
sorteddict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)

#output the results as .txt and json-file
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([unicode(i) for i in sorteddict]))
with open(file_name+".json", "w") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, encoding="utf-8")
EDIT: Solution
It looks like my main issue was writing the file in the wrong way. If I change my code to what's reproduced below, things work out. It seems that joining a list of (string, number) tuples messed the string part up; if I join each tuple into a string first, things work.
For the json output, I had to change to codecs.open() and set ensure_ascii to False. Apparently just setting the encoding to utf-8 does not do the trick like I thought.
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([":".join([i[0], unicode(i[1])]) for i in sorteddict]))

with codecs.open(file_name+".json", "w", "utf-8") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, ensure_ascii=False)
Thanks for your help!
As your example is partially pseudocode there's no way to run a real test and give you something that runs and has been tested, but from reading what you have provided I think you may misunderstand the way Unicode works in Python 2.
The unicode type (such as is produced via the unicode() or unichr() functions) is meant to be an internal representation of a Unicode string that can be used for string manipulation and comparison purposes. It has no associated encoding. The unicode() function will take a buffer as its first argument and an encoding as its second argument and interpret that buffer using that encoding to produce an internally usable Unicode string that is from that point forward unencumbered by encodings.
That Unicode string isn't meant to be written out to a file; all file formats assume some encoding, and you're supposed to provide one again before writing that Unicode string out to a file. Everyplace you have a construct like unicode(fileno) or unicode(k) or unicode(i) is suspect both because you're relying on a default encoding (which probably isn't what you want) and because you're going on to expose most of these values directly to the file system.
After you're done working with these Unicode strings you can use the built-in method encode() on them with your desired encoding as an argument to pack them into strings of ordinary bytes set as required by your encoding.
So looking back at your example above, your inputtext variable is an ordinary string containing data encoded per the UTF-8 encoding. This isn't Unicode. You could convert it to a Unicode string with an operation like inputuni = unicode(inputtext, 'utf-8') and operate on it like that if you chose, but for what you're doing you may not even find it necessary. If you did convert it to Unicode though you'd have to perform the equivalent of a inputuni.encode('UTF-8') on any Unicode string that you were planning on writing out to your file.
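Concretely, a minimal Python 2 sketch of that decode-at-the-input, encode-at-the-output pattern might look like this (the file names are placeholders):

# -*- coding: utf-8 -*-
import codecs

# Input boundary: decode bytes to a unicode object once
with codecs.open("input.txt", "r", "utf-8") as f:
    text = f.read()            # unicode, not str

# ... all processing happens on unicode objects ...
words = sorted(set(text.lower().split()))

# Output boundary: encode back to UTF-8 bytes once, when writing
with codecs.open("output.txt", "w", "utf-8") as f:
    f.write(u"\n".join(words))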
Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage which contains windows-1250 characters like ž and č.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the characters. Then I have a problem with storing this into the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8 without success. Can I convert these codes back to the original characters somehow?
Try:
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
Âža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false; a sketch of such a guard is shown below.
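Something like this (the helper name checked_ord is mine; using '%02x' instead of hex(n)[2:] also keeps single-digit ordinals padded to two hex digits):

def checked_ord(ch):
    # Like ord(), but refuses characters that don't fit in a single byte
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('%r does not fit in one byte' % ch)
    return n

s = '\x9ea'
raw = bytes.fromhex(''.join('%02x' % checked_ord(ch) for ch in s))
print(raw.decode('cp1250'))   # ža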
I am running a python program to process a tab-delimited txt data.
But it causes trouble because the data often contains Unicode characters such as U+001A or those listed in http://en.wikipedia.org/wiki/Newline#Unicode
(Worse, these characters are not even visible unless the txt file is opened in Sublime Text; they don't show up even in Notepad++.)
If the Python program is run on Linux it automatically ignores such characters, but on Windows it can't.
For example, if there is a U+001A in the txt, the Python program will think that's the end of the file.
For another example, if there is a U+0085 in the txt, the Python program will think that's the point where a new line starts.
So I just want a separate program that will erase every Unicode character that is not shown by ordinary file viewers like Notepad++ (and that program should work on Windows).
I do want to keep things like あ and ä. But I only want to delete things like U+001A and U+0085, which are not shown by Notepad++.
How can this be achieved?
There is no such thing as a "Unicode character". A character is a character, and how it is encoded is a different matter. The capital letter "A" can be encoded in a lot of ways, amongst them UTF-8, EBCDIC, ASCII, etc.
If you want to delete every character that cannot be represented in ASCII, then you can use the following (py3):
a = 'aあäbc'
a.encode('ascii', 'ignore')
This will yield b'abc'.
And if there are really U+001A, i.e. SUBSTITUTE, characters in your document, most probably something has gone haywire in a prior encoding step.
Using unicodedata looks to be the best way to do it, as suggested by @Hyperboreus (Stripping non printable characters from a string in python); a sketch of that is included at the end of this answer. But as a quick hack you could do (in Python 2.x):
Open the source file in binary mode. This prevents Windows from truncating reads when it finds the EOF control character:
my_file = open("filename.txt", "rb")
Decode the file (this assumes the encoding was UTF-8):
my_str = my_file.read().decode("UTF-8")
Replace known "bad" code points:
my_str = my_str.replace(u"\u001A", "")
You could skip step 2 and replace the UTF-8 encoded value of each "bad" code point in step 3, for example \x1A, but the method above allows for UTF-16/32 source if required.
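And for the unicodedata route mentioned above, a rough Python 3 sketch (the function name is mine; it drops control and format characters except tabs and line breaks, while keeping ordinary non-ASCII text):

import unicodedata

def strip_invisible(text):
    # Drop control (Cc) and format (Cf) characters such as U+001A and U+0085,
    # but keep tabs/newlines and all printable characters, ASCII or not
    return ''.join(ch for ch in text
                   if ch in '\t\r\n'
                   or unicodedata.category(ch) not in ('Cc', 'Cf'))

sample = 'keep \u00e4 and \u3042\tdrop \u001a and \u0085'
print(repr(strip_invisible(sample)))   # 'keep ä and あ\tdrop  and '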