python join/format possible hex values for regex - python

I'd like to create a template string as possible values for an expression:
'\x1C,\x2C,\x3C,\x4C,\x5C,\x6C,\x7C,\x8C,\x9C,\xAC,\xBC,\xCC,\xDC,\xEC,\xFC'
in a manner like this:
from string import digits, ascii_uppercase
','.join(['\x'+i+'C' for i in digits+ascii_uppercase[:6]])
but unfortunately '\x' is not treated literally; the error comes from the string literal itself, before join ever runs:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \xXX escape
Doubled backslashes, by contrast, are accepted, but the result shows two backslashes:
','.join(['\\x'+i+'C' for i in digits+ascii_uppercase[:6]])
'\\x0C,\\x1C,\\x2C,\\x3C,\\x4C,\\x5C,\\x6C,\\x7C,\\x8C,\\x9C,\\xAC,\\xBC,\\xCC,\\xDC,\\xEC,\\xFC'
Any ideas around this? Maybe another encoding?

Since you're dealing with characters, deal with characters.
','.join(chr(x) for x in range(0x1c, 0x100, 0x10))
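To make that concrete for the regex use case, here is a minimal sketch. Building a character class with re.escape is an assumption about how the values will be used, since the question does not spell that out; the names chars, template and pattern are illustrative only.

```python
import re

# The 15 characters 0x1C, 0x2C, ..., 0xFC as real one-character strings.
chars = [chr(x) for x in range(0x1C, 0x100, 0x10)]
template = ','.join(chars)

# Assumed use: a character class that matches any one of them.
# re.escape protects members such as the backslash (0x5C) and '|' (0x7C).
pattern = re.compile('[' + ''.join(re.escape(c) for c in chars) + ']')

print(len(chars))
print(bool(pattern.fullmatch('\x5c')))  # 0x5C is the backslash itself
print(bool(pattern.fullmatch('A')))     # 0x41 is not in the set
```

Note that 0x2C is the comma, so a comma-joined template is ambiguous if it is later split on commas; a list of characters avoids that.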

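In a normal string literal, \x starts an escape sequence, just as \n means newline, so it must be followed by two hex digits. To get a literal backslash, write \\: the first \ escapes the second.
The doubled backslash only shows up when the shell displays the string's repr; when you print the string, each pair appears as a single backslash: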
>>> text = '\\x0C,\\x1C,\\x2C,\\x3C,\\x4C,\\x5C,\\x6C,\\x7C,\\x8C,\\x9C,\\xAC,\\xBC,\\xCC,\\xDC,\\xEC,\\xFC'
>>> text
'\\x0C,\\x1C,\\x2C,\\x3C,\\x4C,\\x5C,\\x6C,\\x7C,\\x8C,\\x9C,\\xAC,\\xBC,\\xCC,\\xDC,\\xEC,\\xFC'
>>> print(text)
\x0C,\x1C,\x2C,\x3C,\x4C,\x5C,\x6C,\x7C,\x8C,\x9C,\xAC,\xBC,\xCC,\xDC,\xEC,\xFC
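The same point as a small script, assuming the goal is the literal text \x0C, \x1C, and so on rather than the characters themselves. The variable names are illustrative.

```python
from string import digits, ascii_uppercase

hex_digits = digits + ascii_uppercase[:6]   # '0123456789ABCDEF'

# '\\x' in source code is two characters: a backslash and an 'x'.
text = ','.join('\\x' + d + 'C' for d in hex_digits)

print(text)          # printed with single backslashes
print(repr(text))    # the repr doubles them again for display
print(len('\\x0C'))  # length of one item: backslash, x, 0, C
```

A raw string, r'\x', would produce the same two characters without the doubling.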

Related

Can't do ASCII with 'u' character in python

I'm trying to print an ASCII image in Python, but it gives me this error:
File "main.py", line 1
teste = print('''
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 375-376: truncated \UXXXXXXXX escape
I think it's because of the U character. Why does that happen, and is there any way to solve it?
You've got \U in your string, which is being interpreted as the beginning of a Unicode ordinal escape (it expects it to be followed by 8 hex characters representing a single Unicode ordinal).
You could double the escape, making it \\U, but that would make it harder to see the image in the code itself. The simplest approach is to make it a raw string that ignores all escapes save escapes applied to the quote character, by putting an r immediately before the literal:
teste = print(r'''
Note the r immediately after the (, before the '''.
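A minimal sketch of the raw-string fix. The drawing below is a made-up stand-in, since the original image is not shown in the question.

```python
# Hypothetical drawing; the r prefix keeps \U as two plain characters
# instead of the start of a \UXXXXXXXX escape.
art = r'''
 \U/
  |
 / \
'''
print(art)
```

Without the r prefix, the second line alone would trigger the truncated \UXXXXXXXX error.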

How to replace all instances of \N with NaN in a csv file using Python

I've read a csv file into Python, and it contains many objects for which the value is \N. I need to replace all of those instances with 'NaN'.
I've gotten the file to read in correctly, but I get an error when I try to replace the \Ns.
import pandas as pd
df = pd.read_csv(r'file.csv')
df.replace('\N', 'NaN')
File "<ipython-input-63-a631ab1f5217>", line 3
df.replace('\N', 'NaN')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape
Python uses backslashes as a symbol to signify escape sequences like newlines, tabs, quotes, etc. So if you want to use a backslash in a string, you must replace each single backslash with a double backslash. Note also that replace returns a new DataFrame, so assign the result:
df = df.replace('\\N', 'NaN')
Alternatively, pass the na_values="\\N" parameter to read_csv so the values are treated as missing while the file is read:
df = pd.read_csv('file.csv',na_values="\\N")
A double backslash is needed to escape the backslash.
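The escaping rule can be checked without pandas. The sketch below uses the standard csv module and made-up sample data purely so that it runs without third-party packages; it is not a replacement for the pandas calls above.

```python
import csv
import io

# Hypothetical CSV content in which \N marks a missing value.
raw = "id,score\n1,\\N\n2,7\n"

marker = r'\N'  # the same two-character string as '\\N'
rows = [
    ['NaN' if cell == marker else cell for cell in row]
    for row in csv.reader(io.StringIO(raw))
]
print(rows)
```

Either spelling of the marker, '\\N' or r'\N', can be passed to df.replace or na_values.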

How to list Amharic (Unicode) code points in python 3.6

I want a list containing Amharic alphabet from utf-8. The character ranges are from U+1200 to U+1399. I am using windows 8. I encountered SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape.
I tried this:
[print(c) for c in u'U1399']
How can I list the characters?
To print the characters from U+1200 to U+1399, I would use a for loop with an int control variable. It's easy enough to convert numbers to characters using chr().
The integer value 0x1200 (i.e. 1200 in hexadecimal) can be converted to the Unicode code point U+1200 like so: chr(0x1200) == '\u1200'.
Similarly for 0x1201, 0x1202, ... 0x1399.
Note that we use .isprintable() to filter out some of the useless entries, such as unassigned code points.
print(' '.join(chr(x) for x in range(0x1200, 0x139A) if chr(x).isprintable()))
or
for x in range(0x1200, 0x139A):
    if chr(x).isprintable():
        print(hex(x), chr(x))
Note that the code samples require Python 3.
Your posted code doesn't produce any errors at all:
>>> [print(c) for c in u'U1399']
U
1
3
9
9
[None, None, None, None, None]
It also doesn't have any non-ASCII characters in it.
You probably wanted to use a Unicode backslash escape. And your problem is probably more like this:
>>> u'\U1399'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-5: truncated \UXXXXXXXX escape
The reason is that—as the error message implies—a \U escape requires 8 hex digits, and you've only provided 4. So:
>>> u'\U00001399'
'᎙'
But there's a different escape sequence, \u (notice the lowercase u), which takes only 4 digits:
>>> u'\u1399'
'᎙'
If you're using Python 2.7, and possibly even with Python 3 on Windows, you may not see that nice output, but instead something with backslash escapes in it. But if you print that string, you will see the right character.
The full details for \U and \u escapes (and other escapes) are documented in String and Bytes literals (make sure to switch to the Python version you're actually using, because the details can be different, especially between 2.x and 3.x), but usually you don't need to know much more than explained above.
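As a quick check of the points above, this sketch (assuming Python 3) compares the three spellings of one code point and looks up a character name with the standard unicodedata module. The variable names are illustrative.

```python
import unicodedata

# Three spellings of the same code point, U+1399.
short_escape = '\u1399'      # \u takes exactly 4 hex digits
long_escape = '\U00001399'   # \U takes exactly 8 hex digits
from_int = chr(0x1399)

print(short_escape == long_escape == from_int)

# Look up the name of the first character in the range.
first = chr(0x1200)
print(hex(ord(first)), unicodedata.name(first))
```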

Error "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape" [duplicate]

This question already has answers here:
How should I write a Windows path in a Python string literal?
(5 answers)
Closed 3 years ago.
I'm trying to read a CSV file into Python (Spyder), but I keep getting an error. My code:
import csv
data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
data = csv.reader(data)
print(data)
I get the following error:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 2-3: truncated \UXXXXXXXX escape
I have tried to replace the \ with \\ or with / and I've tried to put an r before "C.., but all these things didn't work.
This error occurs because you are using a normal string as a path. You can use one of the three following solutions to fix your problem:
1: Just put r before your normal string. It converts a normal string to a raw string:
pandas.read_csv(r"C:\Users\DeePak\Desktop\myac.csv")
2: Use forward slashes instead of backslashes:
pandas.read_csv("C:/Users/DeePak/Desktop/myac.csv")
3: Double every backslash:
pandas.read_csv("C:\\Users\\DeePak\\Desktop\\myac.csv")
The first backslash in your string is being interpreted as a special character. In fact, because it's followed by a "U", it's being interpreted as the start of a Unicode code point.
To fix this, you need to escape the backslashes in the string. The direct way to do this is by doubling the backslashes:
data = open("C:\\Users\\miche\\Documents\\school\\jaar2\\MIK\\2.6\\vektis_agb_zorgverlener")
If you don't want to escape backslashes in a string, and you don't have any need for escape codes or quotation marks in the string, you can instead use a "raw" string, using "r" just before it, like so:
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
You can just put r in front of the string with your actual path, which denotes a raw string. For example:
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
Consider it as a raw string. Just as a simple answer, add r before your Windows path.
import csv
data = open(r"C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
data = csv.reader(data)
print(data)
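Try writing the file path as "C:\\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener", i.e. with a double backslash after the drive, as opposed to "C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener". This avoids the \U error, but \2 and \v later in this path are still valid escape sequences, so doubling every backslash or using a raw string is safer.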
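A short sketch of why doubling only the first backslash is not enough for this particular path: \2 is an octal escape and \v is a vertical tab, so Python accepts them silently and the path changes. Only the tail of the path is used here, to isolate those two escapes.

```python
# Normal string: \2 becomes chr(2) and \v becomes a vertical tab.
partial = "MIK\2.6\vektis"
# Raw string: both backslashes are kept as written.
intended = r"MIK\2.6\vektis"

print(repr(partial))
print(len(partial), len(intended))
print(partial == intended)
```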
Add r before your string. It converts a normal string to a raw string.
As per String literals:
String literals can be enclosed within single quotes (i.e. '...') or double quotes (i.e. "..."). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings).
The backslash character (i.e. \) is used to escape characters which otherwise will have a special meaning, such as newline, backslash itself, or the quote character. String literals may optionally be prefixed with a letter r or R. Such strings are called raw strings and use different rules for backslash escape sequences.
In triple-quoted strings, unescaped newlines and quotes are allowed, except that the three unescaped quotes in a row terminate the string.
Unless an r or R prefix is present, escape sequences in strings are interpreted according to rules similar to those used by Standard C.
So you need to replace the line:
data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener")
with any one of the following:
Using raw prefix and single quotes (i.e. '...'):
data = open(r'C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener')
Using double quotes (i.e. "...") and escaping backslash character (i.e. \):
data = open("C:\\Users\\miche\\Documents\\school\\jaar2\\MIK\\2.6\\vektis_agb_zorgverlener")
Using double quotes (i.e. "...") and forwardslash character (i.e. /):
data = open("C:/Users/miche/Documents/school/jaar2/MIK/2.6/vektis_agb_zorgverlener")
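The three spellings can be compared directly. This sketch uses a shortened, hypothetical path (data.csv is not from the question) and PureWindowsPath from the standard pathlib module, which parses Windows paths on any operating system without touching the file system.

```python
from pathlib import PureWindowsPath

raw = r"C:\Users\miche\Documents\data.csv"         # raw string
doubled = "C:\\Users\\miche\\Documents\\data.csv"  # escaped backslashes
forward = "C:/Users/miche/Documents/data.csv"      # forward slashes

# The raw and doubled forms are the very same string.
print(raw == doubled)

# The forward-slash form differs as a string but names the same path.
print(PureWindowsPath(forward) == PureWindowsPath(raw))
```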
Just putting an r in front works well.
eg:
white = pd.read_csv(r"C:\Users\hydro\a.csv")
It worked for me by neutralizing the '\' with a second backslash: f = open('F:\\file.csv')
The double \ should work for Windows, but you still need to take care of the folders you mention in your path. All of them (except the filename) must exist. Otherwise you will get an error.

How do I treat an ASCII string as unicode and unescape the escaped characters in it in python?

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding name is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure whether newer versions of Python changed the codec name, but for me it only worked with the underscore.
Anyway, this is it.
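The answers above are written for Python 2, where str has a decode method. A rough Python 3 equivalent, assuming the input really is pure ASCII as in the question, is to encode to bytes first and then decode with the unicode_escape codec:

```python
# Raw string: the backslashes are literal characters, not escapes.
s = r'\u003cfoo/\u003e'

# str has no .decode() in Python 3, so go through bytes first.
decoded = s.encode('ascii').decode('unicode_escape')
print(decoded)
```

If the input may contain non-ASCII characters, the encode step raises UnicodeEncodeError, so this sketch does not cover that case.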
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
It's a little dangerous depending on where the string is coming from, but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
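A different technique from the eval calls above, shown as a hedged alternative for Python 3: ast.literal_eval from the standard library only parses Python literals and does not execute arbitrary expressions. The quoting step is the same assumption the eval version makes, so malformed input can still raise ValueError or SyntaxError.

```python
import ast

# Raw string: the backslashes are literal characters, not escapes.
s = r'\u003cfoo\u003e'

# Wrap the text in quotes so it parses as a string literal.
literal = '"' + s.replace('"', r'\"') + '"'
unescaped = ast.literal_eval(literal)
print(unescaped)
```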
