how to remove '\xe2' from a list - python

I am new to Python and am using it with nltk in my project. After word-tokenizing the raw data obtained from a webpage, I got a list containing '\xe2', '\xe3', '\x98', etc. However, I do not need these and want to delete them.
I simply tried
if '\x' in a
and
if a.startswith('\xe')
and both give me an error saying invalid \x escape.
But when I try a regular expression
re.search('^\\x',a)
I get
Traceback (most recent call last):
File "<pyshell#83>", line 1, in <module>
print re.search('^\\x',a)
File "C:\Python26\lib\re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "C:\Python26\lib\re.py", line 245, in _compile
raise error, v # invalid expression
error: bogus escape: '\\x'
Even re.search('^\\x',a) is not identifying it.
I am confused by this, and even googling didn't help (I might be missing something). Please suggest a simple way to remove such strings from the list, and explain what was wrong with the above.
Thanks in advance!

You can use unicode(a, 'ascii', 'ignore') to remove all non-ascii characters in the string at once.
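A minimal sketch of that approach (Python 2, with a hypothetical token list named tokens), dropping any tokens that end up empty:

tokens = ['hello', '\xe2', 'world', '\xe3\x98']
cleaned = [unicode(t, 'ascii', 'ignore') for t in tokens]  # strip non-ascii bytes
cleaned = [t for t in cleaned if t]                        # drop now-empty tokens
print cleaned  # [u'hello', u'world']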

It helps here to understand the difference between a string literal and a string.
A string literal is a sequence of characters in your source code. When parsed and compiled by the Python interpreter, it produces a string, which is a sequence of characters in memory.
For example, the string literal "a" produces the string a.
String literals can take a number of forms. All of these produce the same string a:
"a"
'a'
r"a"
"""a"""
r'''a'''
Source code is traditionally ASCII-only, but we'd like it to contain string literals that can produce characters beyond ASCII. To do this escapes can be used. For example, the string literal "\xe2" produces a single-character string, with a character with integer value E2 hexadecimal, or 226 decimal.
This explains the error about "\x" being an invalid escape: the parser is expecting you to specify the hexadecimal value of a character.
To detect if a string has any characters in a certain range, you can use a regex with a character class specifying the lower and upper bounds of the characters you don't want:
if re.search(r"[\x90-\xff]", a):
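Applied to the list from the question, a minimal sketch (Python 2, hypothetical token values) that filters out any token containing a character in that range:

import re

tokens = ['hello', '\xe2', 'world', '\x98']
kept = [t for t in tokens if not re.search(r"[\x90-\xff]", t)]
print kept  # ['hello', 'world']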

'\xe2' is one character; \x is an escape sequence that's followed by a hex number and used to specify a byte literally.
That means you have to specify the whole expression:
>>> s = '\xe2hello'
>>> s
'\xe2hello'
>>> s.replace('\xe2', '')
'hello'
More information can be found in the Python docs.
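Applied to the list from the question, a minimal sketch (Python 2, hypothetical token and byte values) that strips a set of unwanted bytes from every item:

unwanted = ['\xe2', '\xe3', '\x98']
tokens = ['\xe2hello', 'world\xe3']
cleaned = []
for t in tokens:
    for bad in unwanted:
        t = t.replace(bad, '')  # remove each unwanted byte
    cleaned.append(t)
print cleaned  # ['hello', 'world']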

I see other answers have done a good job of explaining your confusion with respect to '\x', but, while suggesting that you may not want to completely remove non-ASCII characters, they have not provided a specific way to do other normalization beyond such removal.
If you want to obtain some "reasonably close ASCII character" (e.g., strip accents from letters but leave the underlying letter, &c), this SO answer may help -- the code in the accepted answer, using only the standard Python library, is:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
Of course, you'll need to apply this function to each string item in the list you mention in the title, e.g.:
cleanedlist = [strip_accents(s) for s in mylist]
if all items in mylist are strings.

Let's stand back and think about this a little bit ...
You're using nltk (natural language toolkit) to parse (presumably) natural language.
Your '\xe2' is highly likely to represent U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX (â).
Your '\xe3' is highly likely to represent U+00E3 LATIN SMALL LETTER A WITH TILDE (ã).
They look like natural language letters to me. Are you SURE that you don't need them?

If you only want to enter this pattern and avoid the error, you can try inserting a + between the \ and the x, like this:
re.search('\+x[0123456789abcdef]*', a)


In 'Automating Boring Stuff Using Python' Page 208, I cannot understand this line of code [duplicate]

While asking this question, I realized I didn't know much about raw strings. For somebody claiming to be a Django trainer, this sucks.
I know what an encoding is, and I know what u'' alone does since I get what Unicode is.
But what does r'' do exactly? What kind of string does it result in?
And above all, what the heck does ur'' do?
Finally, is there any reliable way to go back from a Unicode string to a simple raw string?
Ah, and by the way, if your system and your text editor charset are set to UTF-8, does u'' actually do anything?
There's not really any "raw string"; there are raw string literals, which are exactly the string literals marked by an 'r' before the opening quote.
A "raw string literal" is a slightly different syntax for a string literal, in which a backslash, \, is taken as meaning "just a backslash" (except when it comes right before a quote that would otherwise terminate the literal) -- no "escape sequences" to represent newlines, tabs, backspaces, form-feeds, and so on. In normal string literals, each backslash must be doubled up to avoid being taken as the start of an escape sequence.
This syntax variant exists mostly because the syntax of regular expression patterns is heavy with backslashes (but never at the end, so the "except" clause above doesn't matter) and it looks a bit better when you avoid doubling up each of them -- that's all. It also gained some popularity to express native Windows file paths (with backslashes instead of regular slashes like on other platforms), but that's very rarely needed (since normal slashes mostly work fine on Windows too) and imperfect (due to the "except" clause above).
r'...' is a byte string (in Python 2.*), ur'...' is a Unicode string (again, in Python 2.*), and any of the other three kinds of quoting also produces exactly the same types of strings (so for example r'...', r'''...''', r"...", r"""...""" are all byte strings, and so on).
Not sure what you mean by "going back" - there are no intrinsic back and forward directions, because there's no raw string type; it's just an alternative syntax to express perfectly normal string objects, byte or unicode as they may be.
And yes, in Python 2.*, u'...' is of course always distinct from just '...' -- the former is a unicode string, the latter is a byte string. What encoding the literal might be expressed in is a completely orthogonal issue.
E.g., consider (Python 2.6):
>>> import sys
>>> sys.getsizeof('ciao')
28
>>> sys.getsizeof(u'ciao')
34
The Unicode object of course takes more memory space (very small difference for a very short string, obviously ;-).
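To make the "no raw string type" point concrete, a small REPL sketch (Python 2):

>>> r'\n' == '\\n'   # same value, different literal syntax
True
>>> len(r'\n')
2
>>> type(r'\n'), type(ur'\n')
(<type 'str'>, <type 'unicode'>)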
There are two types of string in Python 2: the traditional str type and the newer unicode type. If you type a string literal without the u in front you get the old str type which stores 8-bit characters, and with the u in front you get the newer unicode type that can store any Unicode character.
The r doesn't change the type at all, it just changes how the string literal is interpreted. Without the r, backslashes are treated as escape characters. With the r, backslashes are treated as literal. Either way, the type is the same.
ur is of course a Unicode string where backslashes are literal backslashes, not part of escape codes.
You can try to convert a Unicode string to an old string using the str() function, but if there are any unicode characters that cannot be represented in the old string, you will get an exception. You could replace them with question marks first if you wish, but of course this would cause those characters to be unreadable. It is not recommended to use the str type if you want to correctly handle unicode characters.
'Raw string' means it is stored as it appears. For example, '\' is just a backslash rather than the start of an escape sequence.
Let me explain it simply:
In Python 2, you can store a string in 2 different types.
The first one is str, the byte-string type; each character takes 1 byte of memory (256 possible values, enough for English letters and simple symbols).
The second type is unicode, which can store text from all languages.
By default, Python will prefer the str type, but if you want to store a string as unicode you can put u in front of the literal, like u'text', or call unicode('text').
So u is just a short way of saying the literal should produce a unicode string rather than a str. That's it!
Now the r part: you put it in front of the text to tell Python that the text is raw, so a backslash is not an escape character. r'\n' will not create a new line character; it's just plain text containing 2 characters.
If you want to convert str to unicode and also put raw text in there, use ur because ru will raise an error.
NOW, the important part:
You cannot store a single backslash by using r; it's the only exception.
So this code will produce an error: r'\'
To store a single backslash you need to use '\\'
If you want to store more than 1 character you can still use r; for example, r'\\' will produce 2 backslashes, as you expected.
The reason r doesn't work for a single trailing backslash is that even in a raw literal the backslash still escapes the quote that follows it, so r'\' never terminates the string (other answers below explain this in detail).
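A short REPL sketch of that trailing-backslash behavior (Python 2):

>>> r'\'
SyntaxError: EOL while scanning string literal
>>> '\\'        # one backslash, via a doubled escape
'\\'
>>> r'\\'       # two backslashes
'\\\\'
>>> len(r'\\')
2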
A "u" prefix denotes the value has type unicode rather than str.
Raw string literals, with an "r" prefix, leave escape sequences uninterpreted, so len(r"\n") is 2. Because a backslash would still escape the closing quote, you cannot end a string literal with a single backslash (e.g. r"\" is invalid).
"Raw" is not part of the type, it's merely one way to represent the value. For example, "\\n" and r"\n" are identical values, just like 32, 0x20, and 0b100000 are identical.
You can have unicode raw string literals:
>>> u = ur"\n"
>>> print type(u), len(u)
<type 'unicode'> 2
The source file encoding just determines how to interpret the source file, it doesn't affect expressions or types otherwise. However, it's recommended to avoid code where an encoding other than ASCII would change the meaning:
Files using ASCII (or UTF-8, for Python 3.0) should not have a coding cookie. Latin-1 (or UTF-8) should only be used when a comment or docstring needs to mention an author name that requires Latin-1; otherwise, using \x, \u or \U escapes is the preferred way to include non-ASCII data in string literals.
Unicode string literals
Unicode string literals (string literals prefixed by u) are no longer needed in Python 3, where every string is Unicode. They are still valid, but only for compatibility with Python 2.
Raw string literals
If you want to create a string literal consisting only of easily typable characters like English letters or numbers, you can simply type them: 'hello world'. But if you want to include some more exotic characters as well, you'll have to use a workaround.
One of the workarounds is escape sequences. This way you can, for example, represent a new line in your string simply by adding the two easily typable characters \n to your string literal. So when you print the 'hello\nworld' string, the words will be printed on separate lines. That's very handy!
On the other hand, sometimes you might want to include the actual characters \ and n into your string – you might not want them to be interpreted as a new line. Look at these examples:
'New updates are ready in c:\windows\updates\new'
'In this lesson we will learn what the \n escape sequence does.'
In such situations you can just prefix the string literal with the r character like this: r'hello\nworld' and no escape sequences will be interpreted by Python. The string will be printed exactly as you created it.
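For example, the same text with and without the r prefix:

>>> print('hello\nworld')
hello
world
>>> print(r'hello\nworld')
hello\nworld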
Raw string literals are not completely "raw"?
Many people expect the raw string literals to be raw in a sense that "anything placed between the quotes is ignored by Python". That is not true. Python still recognizes all the escape sequences, it just does not interpret them - it leaves them unchanged instead. It means that raw string literals still have to be valid string literals.
From the lexical definition of a string literal:
string ::= "'" stringitem* "'"
stringitem ::= stringchar | escapeseq
stringchar ::= <any source character except "\" or newline or the quote>
escapeseq ::= "\" <any source character>
It is clear that string literals (raw or not) containing a bare quote character: 'hello'world' or ending with a backslash: 'hello world\' are not valid.
Maybe this is obvious, maybe not, but you can make the string '\' by calling chr(92):
x=chr(92)
print type(x), len(x) # <type 'str'> 1
y='\\'
print type(y), len(y) # <type 'str'> 1
x==y # True
x is y # False

Replace Unicode code point with actual character using regex

I have a large file where any unicode character that wasn't in UTF-8 got replaced by its code point in angle brackets (e.g. the "👍" was converted to "<U+0001F44D>"). Now I want to revert this with a regex substitution.
I've tried to accomplish this with
re.sub(r'<U\+([A-F0-9]+)>',r'\U\1', str)
but obviously this won't work because we cannot insert the group into this unicode escape.
What's the best/easiest way to do this? I found many questions trying to do the exact opposite but nothing useful to 're-encode' these code points as actual characters...
When you have the number of a character, you can use chr(number) to get the character with that number.
Because we have a string, we need to read it as int with base 16.
Both of those together:
>>> chr(int("0001F44D", 16))
'👍'
However, now we have a small function, not a string, to substitute! A quick search shows that you can pass a function to re.sub.
Now we get:
re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
PS Don't name your string just str - you'll shadow the built-in str type.
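Putting it all together, a small sketch (Python 3, with a hypothetical sample string):

import re

my_str = 'Nice job <U+0001F44D>, see you at <U+0001F550>'
result = re.sub(r'<U\+([A-F0-9]+)>', lambda x: chr(int(x.group(1), 16)), my_str)
print(result)  # Nice job 👍, see you at 🕐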

Python: how to split a string by a delimiter that is invalid <0x0c>

I have a document which contains the character <0x0c>.
I'm using re.split. The problem is that it looks like this:
import re
re.split('', text)
(the delimiter between the quotes is the invisible form-feed character). Although it works, you CAN'T see the character, and short of leaving a nice comment it is a great candidate to be one of those pieces of legacy code that only I would understand.
How can I write it in a different, readable way?
You can express any character using escape codes. The 0x0C Form Feed ASCII codepoint can be expressed as \f or as \x0c:
re.split('\f', text)
See the Python string and byte literals syntax for more details on what escape sequences Python supports when defining a string literal value.
Note: you don't need to use the regex module to split on straight-up character sequences, you can just as well use str.split() here:
text.split('\f')

How can I replace '%' with '\x' in Python

My aim is to convert the percent-encoded (URL-encoded) string "%EB" to "\xEB". However, as soon as I tried, I found that it is hard and can't be achieved by string.replace nor re.sub.
My code failed as below:
target = '%EB%AF%B8%EB%9F%AC%EC%8A%A4%20%EC%97%A3%EC%A7%80'
target.replace('%','\x')
-> ValueError: invalid \x escape
re.sub('%','\x',target)
-> ValueError: invalid \x escape
UPDATED:
Thanks for the comments. I tried r'\x' and '\\x'; however, it seems that those aren't a solution.
for example,
target = '%EB%AF%B8%EB%9F%AC%EC%8A%A4%20%EC%97%A3%EC%A7%80'
converted1 = target.replace('%',r'\x')
converted2 = target.replace('%','\\x')
converted1
-> '\\xEB\\xAF\\xB8\\xEB\\x9F\\xAC\\xEC\\x8A\\xA4\\x20\\xEC\\x97\\xA3\\xEC\\xA7\\x80'
converted2
-> '\\xEB\\xAF\\xB8\\xEB\\x9F\\xAC\\xEC\\x8A\\xA4\\x20\\xEC\\x97\\xA3\\xEC\\xA7\\x80'
Results:
print converted1
\xEB\xAF\xB8\xEB\x9F\xAC\xEC\x8A\xA4\x20\xEC\x97\xA3\xEC\xA7\x80
print converted2
\xEB\xAF\xB8\xEB\x9F\xAC\xEC\x8A\xA4\x20\xEC\x97\xA3\xEC\xA7\x80
What I want to have is:
print "\xEB\xAF\xB8\xEB\x9F\xAC\xEC\x8A\xA4\x20\xEC\x97\xA3\xEC\xA7\x80"
미러스 엣지
The replace method cannot decode a URL-encoded string. It just replaces the character % with the two literal characters \x.
If you want to decode a URL-encoded string, you should use urllib.unquote.
import urllib
target = '%EB%AF%B8%EB%9F%AC%EC%8A%A4%20%EC%97%A3%EC%A7%80'
print urllib.unquote(target)
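Note that urllib.unquote returns the raw UTF-8 bytes; if you want a unicode object, decode them afterwards (a sketch, assuming the data really is UTF-8):

import urllib

target = '%EB%AF%B8%EB%9F%AC%EC%8A%A4%20%EC%97%A3%EC%A7%80'
text = urllib.unquote(target).decode('utf-8')  # u'\ubbf8\ub7ec\uc2a4 \uc5e3\uc9c0'
print text  # 미러스 엣지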
>>> target = '%EB%AF%B8%EB%9F%AC%EC%8A%A4%20%EC%97%A3%EC%A7%80'
>>> target.replace('%',r'\x')
'\\xEB\\xAF\\xB8\\xEB\\x9F\\xAC\\xEC\\x8A\\xA4\\x20\\xEC\\x97\\xA3\\xEC\\xA7\\x80'
Why is '\x' invalid in Python?
For the second part of your code, use:
print target.replace('%',r'\x').decode('string-escape')
Though this fixes your error, the best solution is the one by @kamae.
I think you missed the difference between the interactive Python CLI and Python source code. What your code actually does is replace the character "%" in the string with the two characters "\x".
What you did at the Python command line was enter a string whose escape codes were interpreted at the moment of string creation (when you pressed Enter). The resulting string already contains the binary (byte) representation of your Korean characters.
Converting unicode codepoints to UTF8 hex in Python may help you.

How to check if a string in Python is in ASCII?

I want to check whether a string is in ASCII or not.
I am aware of ord(), however when I try ord('é'), I have TypeError: ord() expected a character, but string of length 2 found. I understood it is caused by the way I built Python (as explained in ord()'s documentation).
Is there another way to check?
I think you are not asking the right question--
A string in python has no property corresponding to 'ascii', utf-8, or any other encoding. The source of your string (whether you read it from a file, input from a keyboard, etc.) may have encoded a unicode string in ascii to produce your string, but that's where you need to go for an answer.
Perhaps the question you can ask is: "Is this string the result of encoding a unicode string in ascii?" -- This you can answer
by trying:
try:
    mystring.decode('ascii')
except UnicodeDecodeError:
    print "it was not a ascii-encoded unicode string"
else:
    print "It may have been an ascii-encoded unicode string"
def is_ascii(s):
    return all(ord(c) < 128 for c in s)
In Python 3, we can encode the string as UTF-8, then check whether the length stays the same. If so, then the original string is ASCII.
def isascii(s):
    """Check if the characters in string s are in ASCII, U+0-U+7F."""
    return len(s) == len(s.encode())
To check, pass the test string:
>>> isascii("♥O◘♦♥O◘♦")
False
>>> isascii("Python")
True
New in Python 3.7 (bpo32677)
No more tiresome/inefficient ASCII checks on strings: the new built-in str/bytes/bytearray method .isascii() will check whether the string is ASCII.
print("is this ascii?".isascii())
# True
Vincent Marchetti has the right idea, but str.decode is gone in Python 3. In Python 3 you can make the same test with str.encode:
try:
    mystring.encode('ascii')
except UnicodeEncodeError:
    pass  # string is not ascii
else:
    pass  # string is ascii
Note the exception you want to catch has also changed from UnicodeDecodeError to UnicodeEncodeError.
Your question is incorrect; the error you see is not a result of how you built python, but of a confusion between byte strings and unicode strings.
Byte strings (e.g. "foo", or 'bar', in python syntax) are sequences of octets; numbers from 0-255. Unicode strings (e.g. u"foo" or u'bar') are sequences of unicode code points; numbers from 0-1112064. But you appear to be interested in the character é, which (in your terminal) is a multi-byte sequence that represents a single character.
Instead of ord(u'é'), try this:
>>> [ord(x) for x in u'é']
That tells you which sequence of code points "é" represents. It may give you [233], or it may give you [101, 770].
Instead of chr() to reverse this, there is unichr():
>>> unichr(233)
u'\xe9'
This character may actually be represented as either a single unicode "code point" or multiple code points, which themselves represent either graphemes or characters. It's either "e with an acute accent (i.e., code point 233)", or "e" (code point 101) followed by "an acute accent on the previous character" (code point 770). So this exact same character may be presented as the Python data structure u'e\u0301' or u'\u00e9'.
Most of the time you shouldn't have to care about this, but it can become an issue if you are iterating over a unicode string, as iteration works by code point, not by decomposable character. In other words, len(u'e\u0301') == 2 and len(u'\u00e9') == 1. If this matters to you, you can convert between composed and decomposed forms by using unicodedata.normalize.
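A small sketch of the composed/decomposed difference (Python 2, using the code points above):

import unicodedata

composed = u'\u00e9'     # e with acute accent, one code point
decomposed = u'e\u0301'  # e followed by a combining acute accent

print len(composed), len(decomposed)                        # 1 2
print unicodedata.normalize('NFC', decomposed) == composed  # True
print unicodedata.normalize('NFD', composed) == decomposed  # True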
The Unicode Glossary can be a helpful guide to understanding some of these issues, by pointing out how each specific term refers to a different part of the representation of text, which is far more complicated than many programmers realize.
Ran into something like this recently - for future reference
import chardet

encoding = chardet.detect(string)
if encoding['encoding'] == 'ascii':
    print 'string is in ascii'
which you could use with:
string_ascii = string.decode(encoding['encoding']).encode('ascii')
How about doing this?
import string

def isAscii(s):
    # Note: string.ascii_letters contains only letters, so spaces,
    # digits and punctuation will make this return False.
    for c in s:
        if c not in string.ascii_letters:
            return False
    return True
I found this question while trying determine how to use/encode/decode a string whose encoding I wasn't sure of (and how to escape/convert special characters in that string).
My first step should have been to check the type of the string -- I didn't realize I could get good data about its formatting from type(s). This answer was very helpful and got to the real root of my issues.
If you're getting a rude and persistent
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 263: ordinal not in range(128)
particularly when you're ENCODING, make sure you're not trying to unicode() a string that already IS unicode- for some terrible reason, you get ascii codec errors. (See also the Python Kitchen recipe, and the Python docs tutorials for better understanding of how terrible this can be.)
Eventually I determined that what I wanted to do was this:
escaped_string = unicode(original_string.encode('ascii','xmlcharrefreplace'))
Also helpful in debugging was setting the default coding in my file to utf-8 (put this at the beginning of your python file):
# -*- coding: utf-8 -*-
That allows you to test special characters ('àéç') without having to use their unicode escapes (u'\xe0\xe9\xe7').
>>> specials='àéç'
>>> specials.decode('latin-1').encode('ascii','xmlcharrefreplace')
'&#224;&#233;&#231;'
To improve on Alexander's solution, from Python 2.6 (and in Python 3.x) you can use the helper module curses.ascii and its curses.ascii.isascii() function, or various others: https://docs.python.org/2.6/library/curses.ascii.html
from curses import ascii

def isascii(s):
    return all(ascii.isascii(c) for c in s)
You could use the third-party regex library, which accepts the POSIX [[:ascii:]] character-class definition (the built-in re module does not support POSIX classes).
A string (str type) in Python is a series of bytes. There is no way of telling, just from looking at the string, whether this series of bytes represents an ASCII string, a string in an 8-bit charset like ISO-8859-1, or a string encoded with UTF-8 or UTF-16 or whatever.
However if you know the encoding used, then you can decode the str into a unicode string and then use a regular expression (or a loop) to check if it contains characters outside of the range you are concerned about.
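For example, a minimal sketch (Python 2 byte strings, assuming the encoding is known to be UTF-8):

def has_only_ascii(byte_string, encoding='utf-8'):
    # decode with the known encoding, then check every code point
    text = byte_string.decode(encoding)
    return all(ord(c) < 128 for c in text)

print has_only_ascii('hello')        # True
print has_only_ascii('caf\xc3\xa9')  # False ('café' encoded as UTF-8)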
Like @RogerDahl's answer, but it's more efficient to short-circuit by negating the character class and using search instead of find_all or match.
>>> import re
>>> re.search('[^\x00-\x7F]', 'Did you catch that \x00?') is not None
False
>>> re.search('[^\x00-\x7F]', 'Did you catch that \xFF?') is not None
True
I imagine a regular expression is well-optimized for this.
import re

def is_ascii(s):
    return bool(re.match(r'[\x00-\x7F]+$', s))
To include an empty string as ASCII, change the + to *.
To prevent your code from crashing, you may want to use a try/except to catch TypeErrors.
>>> ord("¶")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found
For example
def is_ascii(s):
    try:
        return all(ord(c) < 128 for c in s)
    except TypeError:
        return False
I use the following to determine if the string is ascii or unicode:
>>> print 'test string'.__class__.__name__
str
>>> print u'test string'.__class__.__name__
unicode
>>>
Then just use a conditional block to define the function:
def is_ascii(input):
    if input.__class__.__name__ == "str":
        return True
    return False
