How to replace Unicode values using re in Python?

I'm looking for something like this:
line.replace('Ã','')
line.replace('¢','')
line.replace('â','')
Or is there any way which will replace all the non-ASCII characters from a file. Actually I converted PDF file to ASCII, where I'm getting some non-ASCII characters [e.g. bullets in PDF]
Please help me.

Edit after feedback in comments.
Another solution is to check the numeric value of each character and keep it only if it is under 128, since ASCII ranges from 0 to 127. Like so:
# coding=utf-8
def removeUnicode():
    text = "hejsanäöåbadasd wodqpwdk"
    asciiText = ""
    for char in text:
        if ord(char) < 128:
            asciiText = asciiText + char
    return asciiText

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Here's an altered version of jd's answer with benchmarks:
# coding=utf-8
def removeUnicode():
    text = u"hejsanäöåbadasd wodqpwdk"
    if isinstance(text, str):
        return text.decode('utf-8').encode("ascii", "ignore")
    else:
        return text.encode("ascii", "ignore")

import timeit
start = timeit.Timer("removeUnicode()", "from __main__ import removeUnicode")
print "Time taken: " + str(start.timeit())
Output first solution using a str string as input:
computer:~ Ancide$ python test1.py
Time taken: 5.88719677925
Output first solution using a unicode string as input:
computer:~ Ancide$ python test1.py
Time taken: 7.21077990532
Output second solution using a str string as input:
computer:~ Ancide$ python test1.py
Time taken: 2.67580914497
Output second solution using a unicode string as input:
computer:~ Ancide$ python test1.py
Time taken: 1.740680933
Conclusion
Encoding is the faster solution, and it also takes less code; thus it is the better approach.
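Those benchmarks are Python 2. For reference, under Python 3 the winning encode-and-ignore approach becomes a one-liner (my own sketch, not part of the original answers):

```python
def remove_non_ascii(text):
    # Drop every code point outside ASCII by round-tripping
    # through an ASCII encode that ignores unencodable characters.
    return text.encode("ascii", "ignore").decode("ascii")

print(remove_non_ascii("hejsanäöåbadasd wodqpwdk"))  # hejsanbadasd wodqpwdk
```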

Why replace at all? If you just need a different encoding:
title.decode('latin-1').encode('utf-8')
or, if you want to paper over the problem characters:
unicode(title, errors='replace')
(errors='ignore' drops them instead of replacing them)
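A Python 3 sketch of the same idea (my own port; the byte string below is an invented example), using the errors parameter of bytes.decode:

```python
raw = b'Comp\xe9tition'  # invented example: Latin-1 bytes with one non-ASCII byte

print(raw.decode('latin-1'))                  # Compétition
print(raw.decode('ascii', errors='replace'))  # Comp�tition (U+FFFD replacement char)
print(raw.decode('ascii', errors='ignore'))   # Comptition
```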

You have to encode your Unicode string to ASCII, ignoring any error that occurs. Here's how:
>>> u'uéa&à'.encode('ascii', 'ignore')
'ua&'

Try passing the re.UNICODE flag when you compile your pattern, like this:
re.compile("pattern", re.UNICODE)
See the re module documentation for more information.
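Alternatively, to answer the title question with re itself: a negated character class covering the ASCII range matches every non-ASCII character (a Python 3 sketch of my own, not from the answer):

```python
import re

def strip_non_ascii(line):
    # [^\x00-\x7F] matches any character outside the ASCII range 0-127;
    # the + collapses each run of them into a single removal.
    return re.sub(r'[^\x00-\x7F]+', '', line)

print(strip_non_ascii('Ã¢â bullets'))  # prints " bullets" (leading space kept)
```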


using variables inside regex patterns in Python

I'm trying to preprocess a text file that is in Persian, but the problem is that the digits are sometimes written with Arabic digits instead of Persian ones. I want to fix this using regex. Here is my snippet of code:
def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = rf"\u066{d}"
        persian_digit = rf"\u06F{d}"
        content = re.sub(arabic_digit, persian_digit, content)
    return content
but it gives this error message:
error: bad escape \u at position 0
I wonder how should I use variables inside the regex patterns. The weird thing is that the problem is with the second pattern (persian_digit) and when I change it to a static string, there are no errors. Thanks for your time.
chr() is the way to generate Unicode code points:
def preprocessing(content):
    import re
    for d in range(10):
        arabic_digit = chr(0x660 + d)
        persian_digit = chr(0x6f0 + d)
        content = re.sub(arabic_digit, persian_digit, content)
    return content
But str has a built-in .translate method for making mass substitutions, which is much more efficient. Give it a string of characters to replace and a same-length string of new characters:
arabic_digits = ''.join([chr(i) for i in range(0x660, 0x66a)])
persian_digits = ''.join([chr(i) for i in range(0x6f0, 0x6fa)])
print('Arabic: ', arabic_digits)
print('Persian:', persian_digits)

# compute the translation table once
_xlat = str.maketrans(arabic_digits, persian_digits)

def preprocessing(content):
    return content.translate(_xlat)

test = '4\u06645\u06656\u0666'
print('before:', test)
print('after: ', preprocessing(test))
Output:
Arabic: ٠١٢٣٤٥٦٧٨٩
Persian: ۰۱۲۳۴۵۶۷۸۹
before: 4٤5٥6٦
after: 4۴5۵6۶
According to this, re.sub() does not allow unknown escapes (a backslash followed by an unrecognized character, such as \u in the replacement string), which is the error you ran into.
What you can do is turn the raw string back into a "normal" string like this, though I am not sure it is best practice:
import codecs
import re

def preprocessing(content):
    for d in range(10):
        arabic_digit = codecs.decode(rf"\u066{d}", 'unicode_escape')
        persian_digit = codecs.decode(rf"\u06F{d}", 'unicode_escape')
        content = re.sub(arabic_digit, persian_digit, content)
    return content

How can I change string encoding? [duplicate]

I have a string where special characters like ' or " or & (...) can appear. In the string:
string = """ Hello "XYZ" this 'is' a test & so on """
how can I automatically escape every special character, so that I get this:
string = " Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on "
In Python 3.2, you could use the html.escape function, e.g.
>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
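html.unescape (available since Python 3.4) reverses the conversion, so the pair round-trips (my own quick check, not part of the answer):

```python
import html

s = ''' Hello "XYZ" this 'is' a test & so on '''
escaped = html.escape(s)            # " becomes &quot;, ' becomes &#x27;, & becomes &amp;
assert html.unescape(escaped) == s  # round-trips back to the original
print(escaped)
```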
For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:
The cgi module that comes with Python has an escape() function:
import cgi
s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"
However, it doesn't escape characters beyond &, <, and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".
Here's a small snippet that will let you escape quotes and apostrophes as well:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)
You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function from the same module can be passed the same tables to decode a string.
from xml.sax.saxutils import escape, unescape

# escape() and unescape() take care of &, < and > by default.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
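A quick usage check of this approach, as a self-contained snippet (the tables are repeated here so it runs on its own; the sample sentence is my own):

```python
from xml.sax.saxutils import escape, unescape

# Same extra tables as in the answer above, repeated for self-containment.
table = {'"': "&quot;", "'": "&apos;"}
inverse = {v: k for k, v in table.items()}

s = '''He said "it's <fine> & done"'''
e = escape(s, table)
print(e)  # He said &quot;it&apos;s &lt;fine&gt; &amp; done&quot;
assert unescape(e, inverse) == s  # round-trips
```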
The cgi.escape method converts special characters to valid HTML entities:
import cgi
original_string = 'Hello "XYZ" this \'is\' a test & so on '
escaped_string = cgi.escape(original_string, True)
print original_string
print escaped_string
will result in
Hello "XYZ" this 'is' a test & so on
Hello &quot;XYZ&quot; this 'is' a test &amp; so on
The optional second parameter on cgi.escape escapes quotes. By default, they are not escaped.
A simple string function will do it:
def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
    )
Other answers in this thread have minor problems: The cgi.escape method for some reason ignores single-quotes, and you need to explicitly ask it to do double-quotes. The wiki page linked does all five, but uses the XML entity &apos;, which isn't an HTML entity.
This code function does all five all the time, using HTML-standard entities.
The other answers here will help with characters like the ones you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something more. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you there. You'll want something like the following, which uses html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):
# -*- coding: utf-8 -*-
import sys

if sys.version_info[0] > 2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs

text = ";\"áèïøæỳ"  # This is your string variable containing the stuff you want to convert
text = text.replace(";", "$ஸ$")  # $ஸ$ is just something random the user isn't likely to have in the document. We're converting it so we don't turn the semi-colons inside the entity names into entity names.
text = text.replace("$ஸ$", "&semi;")  # Converting semi-colons to entity names
if sys.version_info[0] > 2:  # Using appropriate code for each Python version.
    for k, v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
else:
    for k, v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text = text.replace(v, "&" + k + ";")  # You have to add the & and ; manually.
# The above code doesn't cover every single entity name, although I believe it covers
# everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text = text.replace("ŷ", "&ycirc;")
text = text.replace("Ŷ", "&Ycirc;")
text = text.replace("ŵ", "&wcirc;")
text = text.replace("Ŵ", "&Wcirc;")
text = text.replace("ỳ", "&ygrave;")
text = text.replace("Ỳ", "&Ygrave;")
text = text.replace("ẃ", "&wacute;")
text = text.replace("Ẃ", "&Wacute;")
text = text.replace("ẁ", "&wgrave;")
text = text.replace("Ẁ", "&Wgrave;")
print(text)
# Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&ygrave;
# The Python 2.x version outputs the wrong stuff, so you'll clearly have to adjust the code for it.

Convert escaped UTF-8 string to UTF-8 in Python 3

I have a Python 3 string that includes escaped UTF-8 sequences, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct UTF-8 string (in this example "Company®", since the escaped sequence is c2 ae). I've tried
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: Company®
(wrong, since the characters are treated separately, but they should be treated together.)
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can I achieve that programmatically (i.e. starting from a Python 3 str)?
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
Following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. The even-indexed items are the plain ASCII parts, e.g. "Company"; each odd-indexed item corresponds to one escaped UTF-8 byte, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re

REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)

def convert(estr):
    def split(estr):
        for i, substr in enumerate(REGEXP.split(estr)):
            if i % 2:
                yield bytes.fromhex(substr[-2:])
            elif substr:
                yield bytes(substr, 'ascii')
    return b''.join(split(estr)).decode('utf-8')

test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized: the ASCII parts do not need an encode/decode round-trip, and consecutive hex codes could be concatenated.
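One possible optimization along those lines is a single re.sub with a replacement callback, relying on the fact that code points below 256 round-trip through Latin-1 (my own sketch, assuming the input is plain ASCII apart from the escapes):

```python
import re

def convert(estr):
    # Replace each "\\ffffffXX" escape with the character for byte XX,
    # then reinterpret the whole string as UTF-8 bytes in one pass.
    # Assumes everything outside the escapes is plain ASCII.
    step = re.sub(r'\\\\ffffff([0-9a-f]{2})',
                  lambda m: chr(int(m.group(1), 16)),
                  estr, flags=re.I)
    return step.encode('latin-1').decode('utf-8')

print(convert("Company\\\\ffffffc2\\\\ffffffae"))  # Company®
```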

Search and replace using regular expressions in Python

I have a log file that is full of tweets. Each tweet is on its own line so that I can iterate though the file easily.
An example tweet would be like this:
# sample This is a sample string $ 1.00 # sample
I want to clean this up a bit by removing the whitespace between each special character and the following alphanumeric character: "# s", "$ 1", "# s".
So that it would look like this:
#sample This is a sample string $1.00 #sample
I'm trying to use regular expressions to match these instances because they can be variable, but I am unsure of how to go about doing this.
I've been using re.sub() and re.search() to find the instances, but am struggling to figure out how to only remove the white space while leaving the string intact.
Here is the code I have so far:
#!/usr/bin/python
import csv
import re
import sys
import pdb
import urllib

f = open('output.csv', 'w')
with open('retweet.csv', 'rb') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        a = row[0]
        matchObj = re.search("\W\s\w", a)
        print matchObj.group()
f.close()
Thanks for any help!
Something like this using re.sub:
>>> import re
>>> strs = "# sample This is a sample string $ 1.00 # sample"
>>> re.sub(r'([#$])(\s+)([a-z0-9])', r'\1\3', strs, flags=re.I)
'#sample This is a sample string $1.00 #sample'
>>> re.sub("([#$]) ", r"\1", "# sample This is a sample string $ 1.00 # sample")
'#sample This is a sample string $1.00 #sample'
This seemed to work pretty nicely:
print re.sub(r'([#$])\s+',r'\1','# blah $ 1')

Python 3: unescaping non-ASCII characters

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I see here and here methods that don't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug ? Have I misunderstood something ?
Any help would be appreciated !
PS : I edited my post thanks to the Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = "€\\n"?
mystring = "€\n"   # that's 2 chars: "€" and a newline
mystring = "€\\n"  # that's 3 chars: "€", "\" and "n"
I don't really understand what's going wrong inside encode() and decode() in Python 3, but my friend solved this problem while we were writing some tools.
What we did is bypass the encode("utf_8") step after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (in UTF-8 encoding), in this case b'\xe2\x82\xac\n'.
So we do not print the str object directly, nor do we use encode("utf_8"); we use ord() to build the bytes object b'\xe2\x82\xac\n'.
You can then get the correct str from this bytes object by passing it to str() together with the encoding.
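The ord() trick works because every code point produced by unicode_escape here is below 256, so it is equivalent to encoding with Latin-1 (my own observation, not from the answer):

```python
s = "€\\n"  # 3 chars: "€", backslash, "n"
step = s.encode("utf_8").decode("unicode_escape")  # the weird-looking intermediate string

# Code points below 256 map 1:1 to bytes under Latin-1,
# so this equals the bytes([ord(c) ...]) construction.
assert bytes([ord(c) for c in step]) == step.encode("latin-1")
print(step.encode("latin-1").decode("utf_8"))  # € followed by a real newline
```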
BTW, the tool my friend and I wanted to make is a wrapper that lets the user input C-like string literals and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21   # 20 characters, which represent 6 characters semantically
Output:
        # \n
ab      # \x61\x62\n
 !      # \x20\x21
That's a powerful tool for the user to input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh:  # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
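Putting the two directions together (my own minimal round-trip check):

```python
s = '€\n'
# Leaving the app: escape to an ASCII-safe byte string
escaped = s.encode('unicode_escape')
print(escaped)  # b'\\u20ac\\n'
# Coming back in: decoding restores the original string
assert escaped.decode('unicode_escape') == s
```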
import string

printable = string.printable
printable = printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy, as I did here, and extend it with any Unicode characters you'd like, such as €.
