I have a lot of txt files, and I need to replace some text in them. Almost all of them have this non-ASCII character (I thought it was "...", but … is not the same).
I've tried with replace(), but I can't make it work. I need some help!! Thanks in advance.
If you use codecs.open() to open the files, you will get all strings as unicode objects, which are much easier to handle.
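For example, a minimal sketch that rewrites one file in place (assuming the files are UTF-8; input.txt is a placeholder name):

import codecs

# Read the file as unicode, replace the '…' character, write it back.
with codecs.open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

text = text.replace(u'\u2026', u'...')  # u'\u2026' is the '…' ellipsis

with codecs.open('input.txt', 'w', encoding='utf-8') as f:
    f.write(text)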
Use unicode type strings. For example,
>>> print u'\xe2'.replace(u'\xe2','a')
a
The problem is that these characters are not plain (byte) strings; they are unicode.
import re
# Note: the fourth positional argument of re.sub is count, not flags,
# so pass re.U by keyword (Python 2.7+).
re.sub(u'<string to replace>', u'', text, flags=re.U)
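For example, with the '…' character from the question (text here is a hypothetical unicode string):

import re

text = u'wait\u2026 what'
print(re.sub(u'\u2026', u'...', text, flags=re.U))  # wait... what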
Most other answers will work too.
I've got a string which looks like this, made up of normal characters and one single escaped Unicode character in the middle:
reb\u016bke
I want to have Python convert the whole string to the normal Unicode version, which should be rebūke. I've tried using str.encode(), but it doesn't seem to do very much, and apparently decode doesn't exist anymore? I'm really stuck!
EDIT: Output from repr is reb\\\u016bke
If I try reproducing your issue:
s="reb\\u016bke";
print(s);
# reb\u016bke
print(repr(s));
# 'reb\\u016bke'
print(s.encode().decode('unicode-escape'));
# rebūke
I have string data in various languages where parts of the strings have gone through a wrong encode/decode round-trip while others are correct; I need to fix the wrong ones.
Here's an example for the German word "Zubehör":
correct = "ZUBEHÖR"
incorrect = "ZUBEHÃ\x96R"
I already found out that I can correct the errors like this:
incorrect.encode("raw_unicode_escape").decode("utf8")
However, using this on the correct strings raises an error. I could iterate over all strings and use a try/except, but I don't know whether that will work reliably, and I'd like to know a more elegant way.
Also, while the \x96 is written out when printing, it's actually only one character:
incorrect[-3]
Out[34]: 'Ã'
incorrect[-2]
Out[33]: '\x96'
How can I reliably only find those strings that have these odd unicode characters in them like ZUBEHÃ\x96R?
EDIT:
Here's something else I stumbled upon while experimenting:
When I do incorrect.encode("raw_unicode_escape") then the result is b'ZUBEH\xc3\x96R'.
But when I do this with e.g. a cyrillic word like this:
"Персонализированные".encode("raw_unicode_escape")
Then the result is b'\\u041f\\u0435\\u0440\\u0441\\u043e\\u043d\\u0430\\u043b\\u0438\\u0437\\u0438\\u0440\\u043e\\u0432\\u0430\\u043d\\u043d\\u044b\\u0435'
Why am I getting \x-escapes in the first case and \u-escapes in the second case while doing the exact same thing?
And why can I .decode("utf8") back the \x-escapes into a readable format but not the \u-escapes?
You should try the fixes-text-for-you library (ftfy):
>>> import ftfy
>>> ftfy.fix_text("ZUBEHÃ\x96R")
'ZUBEHÖR'
It operates line by line, so if a single string mixes clean and corrupt text on separate lines, ftfy can probably handle it.
Note: This is not an exact science.
The way ftfy works involves a lot of educated guesses.
The tool is very well made, but it may not guess correctly in every case.
If you can, it is always better to fix the errors at the source (i.e. make sure all text is decoded correctly in the first place).
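If you'd rather detect the damaged strings yourself, a round-trip check works for this particular kind of damage (UTF-8 bytes that were mistakenly decoded as Latin-1). A minimal sketch, assuming that is the only corruption in the data:

def fix_mojibake(s):
    """Undo UTF-8 text that was wrongly decoded as Latin-1; leave clean strings alone."""
    try:
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the string has characters outside Latin-1 (already correct text),
        # or its Latin-1 bytes are not valid UTF-8. Leave it untouched.
        return s

print(fix_mojibake("ZUBEHÃ\x96R"))  # ZUBEHÖR
print(fix_mojibake("ZUBEHÖR"))      # ZUBEHÖR (unchanged)

This is a heuristic too: a clean string whose Latin-1 bytes happen to form valid UTF-8 would be "repaired" incorrectly, which is exactly why ftfy's guessing is more elaborate.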
I have what feels to me like a really basic question, but for the life of me I can't figure it out.
I have a whole bunch of text I'm going through and converting to the International Phonetic Alphabet. I'm using the re.sub() method a lot, and in many cases this means replacing a character of string type with a character of unicode type. For example:
for row in responsesIPA:
    re.sub("3", u"\u0259", row)
I'm getting TypeError: expected string or buffer. The docs on Python re say that the type for the replacement has to match the type for what you're searching, so maybe that's the problem? I tried putting str() around u"\u0259", but I'm still getting the type error. Is there a way for me to do this replacement?
The error you're getting is telling you that row isn't a valid string or buffer (str, bytes, unicode, anything readable). You will need to double-check what is stored in row by adding a print(row) before the call.
Just to prove that this is the case, the following works:
import re
print(re.sub("3", u"\u0259", "12345"))
I'm trying to write a CSV file from JSON data. During that, I want to write '001023472', but it's being written as '1023472'. I have searched a lot but didn't find an answer.
The value is of type string before writing. The problem is during writing it into the file.
Thanks in advance.
Convert the number to a string with the formatting operator; in your case: "%09d" % number.
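For example, with the value from the question:

>>> "%09d" % 1023472
'001023472'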
Use the format builtin or format string method.
>>> format(1023472, '09')
'001023472'
>>> '{:09}'.format(1023472)
'001023472'
If your "number" is actually a string, you can also just left-pad it with '0''s:
>>> format('1023472', '>09')
'001023472'
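If you prefer a method call, str.zfill does the same zero-padding:

>>> '1023472'.zfill(9)
'001023472'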
The Python docs generally eschew %-formatting, saying it may go away in the future and is also more finicky; for new code there is no real reason to use it, especially in 2.7+.
Hello, I was wondering if you know any other way to encode a string to be URL-safe, because urllib.quote is doing it wrong; the output is different than expected:
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
As demonstrated by the tool provided on this site.
And this is not me being difficult; the incorrect output of quote is preventing the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the same path with the JavaScript tool I mentioned, I get these strings respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how it is almost the same, but the URL produced by quote doesn't work while the other one does.
I tried messing with encode('utf-8') on the string passed to quote, but it does not make a difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know some module that gets this right?
According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
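In Python 3 this behavior is explicit: urllib.parse.quote percent-encodes as UTF-8 by default, and you can opt into another encoding only if a legacy consumer really requires it:

>>> from urllib.parse import quote
>>> quote('á')                         # UTF-8, per RFC 3986
'%C3%A1'
>>> quote('á', encoding='iso-8859-1')  # what the outdated tool expects
'%E1'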
OK, got it. I have to encode to ISO-8859-1, like this:
import urllib

word = u'á'
word = word.encode('iso-8859-1')  # Latin-1 bytes: '\xe1'
print urllib.quote(word)          # prints %E1
Python 2 interprets source files as ASCII by default, so even though your file may be saved in a different encoding, your UTF-8 char is interpreted as two ASCII chars.
Try putting a comment like this on the first or second line of your code to match the file encoding; you might also need to use u'á'.
# coding: utf-8
What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1
In this question it seems someone wrote a pretty large function to convert to ASCII URLs; that's what I need. But I was hoping there was some encoding tool in the standard library for the job.