Reasons for a different interpretation of a python code?

What are the possible reasons for the same Python code being interpreted differently on two machines?
I have code that runs with no errors on one computer but raises errors on another.
The Python versions are the same (2.7.12).
The encoding of the scripts is the same.
I wonder what could explain this, because these are the only two reasons I can see for code being interpreted differently.
Here is what the code looks like, using luigi (this is only part of the code):
class ExampleClass(luigi.postgres.CopyToTable):
    def rows(self):
        """
        Return/yield tuples or lists corresponding to each row to be inserted.
        """
        with self.input()['Data'].open('r') as fobj:
            for line in fobj:
                yield line.strip('\n').split('\t')
When I run the whole code on the computer where I do get an error (which is caused by the lines above), I get this:
IndentationError: unexpected indent
And there is, indeed, a mix of spaces and tabs in the code.
That is easy to solve, no problem there, but my question is:
What could explain that difference in interpretation?
I'm asking because after fixing this by replacing spaces with tabs, I got other errors that should not appear either and that are harder to solve. The code is supposedly correct, since it works on the other computer, so I should not have to fix these errors at all.
To give more details, here is the second error I get after fixing the indentation problems (I have not managed to solve this one):
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: malformed \N character escape
It is caused by the following part of the code:
class AnotherClass(ExampleClass):
    def copy(self, cursor, file):
        if isinstance(self.columns[0], six.string_types):
            column_names = self.columns
        elif len(self.columns[0]) == 2:
            column_names = [c[0] for c in self.columns]
        else:
            raise Exception('columns must consist of column strings or (column string, type string) tuples (was %r ...)' % (self.columns[0],))
        cursor.copy_expert('copy ' + self.table + ' from stdin with header csv delimiter \'' + self.column_separator + '\' null \'''\\\N''\' quote \'''\"''\' ', file)

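As the error says:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: malformed \N character escape
The line in question is this one:
cursor.copy_expert('copy ' + self.table + ' from stdin with header csv delimiter \'' + self.column_separator + '\' null \'''\\\N''\' quote \'''\"''\' ', file)
In a unicode string literal, \N starts a named-character escape of the form \N{name}, so the bare \N at positions 2-3 of '\\\N' is rejected. Changing it to a lowercase \n would make the line compile, but it would replace the \N null marker with a newline, which is probably not what you want. Escape the backslash instead:
cursor.copy_expert('copy ' + self.table + ' from stdin with header csv delimiter \'' + self.column_separator + '\' null \'''\\\\N''\' quote \'''\"''\' ', file)
In Python 2 this error is only raised for unicode literals (a u'' prefix or from __future__ import unicode_literals); in a plain byte string, '\\\N' is accepted and yields the same text as '\\\\N'. That may be what differs between the two copies of your code.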
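A minimal sketch, assuming Python 3 (where every plain string literal is a unicode literal), of why a bare \N is rejected and how a literal backslash followed by N can be kept. The names broken_source, fixed_source, and compiles are illustrative only and are not part of luigi or psycopg2.

```python
# Source text of two assignments, held in raw strings so the backslashes
# reach compile() exactly as they would appear in a .py file.
broken_source = r"null_marker = '\\\N'"    # escaped backslash, then a bare \N
fixed_source = r"null_marker = '\\\\N'"    # two escaped backslashes, then N


def compiles(source):
    """Return True if the source text parses without a SyntaxError."""
    try:
        compile(source, "<demo>", "exec")
        return True
    except SyntaxError:
        return False


namespace = {}
exec(fixed_source, namespace)

print(compiles(broken_source))             # reports whether the bare \N parses
print(namespace["null_marker"] == r"\\N")  # compares with the raw-string form
```

Writing the marker as a raw string (r'\\N') is an equivalent alternative to doubling every backslash.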

Related

How to extract features from text data set? [duplicate]

This question already has answers here:
Error "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape" [duplicate]
I am trying to tokenize the text file that I get from my zip folder, but I am facing this error:
My Error
TypeError: expected string or bytes-like object

Add r in front of your "C:\Users\killer\Desktop\User1.txt" string so that each backslash is kept literally (\\ instead of \), because the \U in \Users is otherwise interpreted as the start of a unicode escape:
pd.read_csv(r"C:\Users\killer\Desktop\User1.txt")
Alternatively, escape each backslash manually (\\) or just change \ to /.

Try the following code:
Data = pd.read_csv(r"C:\Users\killer\Desktop\User1.txt", sep=", ")
Just add , sep=", " after the path of the file you want to read (the r prefix keeps the backslashes in the path literal).
Between the quotation marks, put whatever separates the fields. In most cases the text is separated by a comma ",", but you can check by opening the file in your default text reader to see what separates it.

What you are doing is right, but the path string contains a sequence that Python cannot decode. The \U in the file path you have given (from \Users) is recognized by default as the start of a \UXXXXXXXX unicode escape, and the characters that follow it do not form a valid one. For the file path to be read as a path, you have to do one of the following:
A) write it with \\, e.g. "C:\\Users\\killer\\..."
B) write it with /, e.g. "C:/Users/killer/..."
C) put r in front, e.g. r"C:\Users\killer\...", to make it a raw string, i.e. everything is plain text and no escape sequences are processed. (A raw string cannot end in a single backslash.)
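The three options above can be sketched as follows. This is only an illustration: it uses the standard-library ntpath module (the Windows flavor of os.path, importable on any platform) to show that the forward-slash form normalizes to the same path, and the variable names are made up for the example.

```python
import ntpath  # Windows path rules, available on every platform

escaped = "C:\\Users\\killer\\Desktop\\User1.txt"   # option A: doubled backslashes
forward = "C:/Users/killer/Desktop/User1.txt"       # option B: forward slashes
raw = r"C:\Users\killer\Desktop\User1.txt"          # option C: raw string

# A and C are the very same string; B becomes identical once normalized.
same_literal = escaped == raw
same_path = ntpath.normpath(forward) == raw
print(same_literal, same_path)
```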

Why does "SyntaxError: (unicode error)" occur when raw string is in triple quotes?

Whenever I put triple quotes around a raw string, the following error occurs:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 28-29: malformed \N character escape
I was wondering why this is the case and if there is any way to avoid it.
I have tried moving the triple quotes to align with various parts of the code but nothing has worked so far.
This runs without error:
final_dir = (r'C:\Documents\Newsletters')
'''
path_list = []
for file in os.listdir(final_dir):
    path = os.path.join(final_dir, file)
    path_list.append(path)
'''
But then this creates an error:
'''
final_dir = (r'C:\Documents\Newsletters')
path_list = []
for file in os.listdir(final_dir):
    path = os.path.join(final_dir, file)
    path_list.append(path)
'''

In a string literal like '\N', \N has a special meaning:
\N{name} Character named name in the Unicode database
from String and Bytes literals - Python 3 documentation
For example, '\N{tilde}' becomes '~'.
Since you're quoting code, you probably want to use a raw string literal:
r'\N'
For example:
>>> r"""r'C:\Documents\Newsletters'"""
"r'C:\\Documents\\Newsletters'"
Or you could escape the backslash:
'\\N'
The error doesn't occur for \D because it doesn't have a special meaning.
Thanks to deceze for practically writing this answer in the comments
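As a small illustration of the points above (a sketch assuming Python 3; the variable names are invented for the example): a named escape is resolved to a character, while a raw triple-quoted string leaves the quoted code, including its inner r prefix and backslashes, untouched.

```python
named = "\N{TILDE}"            # named escape: looked up in the Unicode database
doubled = "\\Newsletters"      # escaped backslash followed by ordinary text

# Raw triple-quoted string: no escape processing, so \N is just two characters.
quoted_code = r'''
final_dir = (r'C:\Documents\Newsletters')
'''

print(named)
print(doubled in quoted_code)
```

The inner r'...' has no effect on its own here, because it is only text inside the outer string; it is the outer literal that has to be raw.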

Subtitle Project: How to solve the unicode reading failure?

Basically, I'm working on a subtitle project.
It is fairly complicated, but all I want to do here is append a tag to every dialogue line of a converted ASS file (currently still a .txt file while I experiment).
Untouched lines (I won't go into Aegisub problems here):
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.
Objective:
Every line in the dialogue section appended with
'\N{\3c&HAA0603&\fs31\b1}'
Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:03.90,0:00:07.04,Default,,0,0,0,,the show where we take everyday objects in their component form\N{\3c&HAA0603&\fs31\b1}
Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.\N{\3c&HAA0603&\fs31\b1}
The Python 3.x code:
text1 = open('d:\Programs\sub1.txt','r')
text2 = open('e:\modsub.ass','w+')
alltext1 = text1.read()
lines = alltext1.split('\n')
for i in range(lines.index('[Events]')+1,len(lines)):
    lines[i] += ' hello '
print(lines)
text2.write(str(lines))
text1.close()
text2.close()
1. Python apparently doesn't recognize one or two characters in it as unicode:
'\N{\3c&HAA0603&\fs31\b1}'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
How do I deal with it without affecting the output?
2. When I used ' hello ' instead of the subtitling code, the output was this:
'Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly. hello ', 'Dialogue: 0,0:00:10.24,0:00:11.72,Default,,0,0,0,,That feels very nice. hello ', 'Dialogue: 0,0:00:11.72,0:00:13.36,Default,,0,0,0,,Oh, yes. Look at that! hello ',
and so on, instead of one line after another.
How can I write the strings out line by line, without the quotes and commas?

Use a raw string literal, i.e. replace:
'\N{\3c&HAA0603&\fs31\b1}'
with:
r'\N{\3c&HAA0603&\fs31\b1}'
In this way the interpreter will not try to look for the unicode character named \3c&HAA0603&\fs31\b1 which does not exist.
>>> '\N{\3c&HAA0603&\fs31\b1}'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-23: unknown Unicode character name
>>> r'\N{\3c&HAA0603&\fs31\b1}'
'\\N{\\3c&HAA0603&\\fs31\\b1}'
>>> print(r'\N{\3c&HAA0603&\fs31\b1}')
\N{\3c&HAA0603&\fs31\b1}

The problem is that you're using a string with \ characters in it, without escaping them. You need to double them up or use the r'' notation.
lines[i] += '\\N{\\3c&HAA0603&\\fs31\\b1}'
or
lines[i] += r'\N{\3c&HAA0603&\fs31\b1}'
As for your other problem, you're writing str(lines) which shows a literal representation. Use '\n'.join(lines) + '\n' instead.
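Putting both fixes together, a minimal sketch could look like this. It works on an in-memory string rather than on the files from the question, the function name append_tag is hypothetical, and restricting the change to lines that start with 'Dialogue:' is an assumption about what should be tagged.

```python
ASS_TAG = r'\N{\3c&HAA0603&\fs31\b1}'   # raw string: backslashes stay literal


def append_tag(text, tag=ASS_TAG):
    """Append tag to each Dialogue line that follows the [Events] header."""
    lines = text.splitlines()
    start = lines.index('[Events]') + 1
    for i in range(start, len(lines)):
        # Assumption: only lines starting with 'Dialogue:' should be tagged.
        if lines[i].startswith('Dialogue:'):
            lines[i] += tag
    # Join with newlines instead of writing str(lines), which is a list repr.
    return '\n'.join(lines) + '\n'


sample = (
    '[Events]\n'
    'Dialogue: 0,0:00:00.00,0:00:03.90,Default,,0,0,0,,Hello, viewers. This is The Reassembler,\n'
    'Dialogue: 0,0:00:07.04,0:00:10.24,Default,,0,0,0,,and put them back together, very slowly.\n'
)
result = append_tag(sample)
print(result, end='')
```

The returned text can then be passed to a single write() call on the output file.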

Special character encoding is lost when string is passed to function

string = "Magic Cookie® Extra"
print string
Will give the output:
"Magic Cookie® Extra"
However, if I pass the string into this function, which combines it with another string:
def label_print(label, string):
    print label + ": " + string
label_print("Product name", string)
Will give the output:
"Product name: Magic Cookie?? Extra"
Why is this and how do I prevent it?
Does the concatenation with the first string reset the encoding so that the ® character becomes ??
I have tried editing the function so that the local variable label is label.encode("utf-8") but that doesn't help.
I also have # -*- coding: utf-8 -*- at the very top of my Python file.

As you said in the comments that the string was scraped from a web page, here is a possible explanation of what happens. UTF-8 encodes characters above 127 as multi-byte sequences. For example, the ® character has code point 0xae and is encoded in UTF-8 as '\xc2\xae'.
So your string is actually 'Magic Cookie\xc2\xae Extra' and, when concatenated, leads to 'Product name: Magic Cookie\xc2\xae Extra'.
As @AaronDigulla explained, the two bytes are then each translated to ?, giving the result you see.
A consistent way to reproduce it is to decode and re-encode with the 'replace' error handler:
>>> print 'Product name: Magic Cookie\xc2\xae Extra'.decode('ascii', 'replace').encode('ascii', 'replace')
Product name: Magic Cookie?? Extra
But until you say exactly what you do and what you want, I cannot tell you how to fix...
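The same reproduction can be sketched in Python 3, where byte strings and text are separate types (an assumption of this example, since the question uses Python 2). The variable names are illustrative.

```python
registered = "\u00ae"                        # the ® character as text
utf8_bytes = registered.encode("utf-8")      # its two-byte UTF-8 encoding

# UTF-8 bytes as they might arrive from a scraped page.
scraped = b"Product name: Magic Cookie\xc2\xae Extra"

# Decoding as ASCII with 'replace' turns each non-ASCII byte into U+FFFD;
# encoding back to ASCII with 'replace' turns each U+FFFD into '?'.
lossy = scraped.decode("ascii", "replace").encode("ascii", "replace")

print(utf8_bytes)
print(lossy)
```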

If I run your code, I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 12: ordinal not in range(128)
when I try to call string.encode('UTF-8'), so there is something else at work here.
Generally speaking, you must not mix strings which are UTF-8 encoded with ones that are not. Either everything is encoded or nothing. No mixing.
One way to solve these problems in Python 2 is to use unicode strings:
string = u"Magic Cookie® Extra"
print repr(string)
print repr('a ' + string + ' b')
which prints:
u'Magic Cookie\xae Extra'
u'a Magic Cookie\xae Extra b'
As you can see, even though the other strings in the concatenation aren't unicode strings, Python "upgrades" them. This will work pretty well ... unless you have UTF-8 encoded byte strings somewhere ...
Note: The ? means that someone has installed an output converter for sys.stdout which converts unknown/unprintable characters into ?. Search all your sources for sys.stdout to find out why this happens.
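For comparison, here is a sketch of the "no mixing" rule in Python 3, which is an assumption of this example since the answer targets Python 2. Python 3 refuses to concatenate text with bytes at all, so the bytes have to be decoded first. The names are illustrative.

```python
raw_bytes = b"Magic Cookie\xc2\xae Extra"   # UTF-8 encoded bytes
label = "Product name"                       # ordinary text

# Python 3 does not silently upgrade bytes: mixing raises TypeError.
try:
    label + ": " + raw_bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decode once at the boundary, then work with text everywhere.
text = raw_bytes.decode("utf-8")
combined = label + ": " + text
print(mixed_ok)
print(combined)
```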

Printing all unicode characters in Python

I've written some code to create all 4-digit combinations of the hexadecimal system, and now I'm trying to use that to print out all the unicode characters that are associated with those values. Here's the code I'm using to do this:
char_list = ["0","1","2","3","4","5","6","7","8","9","A","B","C","D","E","F"]
pairs = []
all_chars = []
# Construct pairs list
for char1 in char_list:
    for char2 in char_list:
        pairs.append(char1 + char2)
# Create every combination of unicode characters ever
for pair1 in pairs:
    for pair2 in pairs:
        all_chars.append(pair1 + pair2)
# Print all characters
for code in all_chars:
    expression = "u'\u" + code + "'"
    print "{}: {}".format(code,eval(expression))
And here is the error message I'm getting:
Traceback (most recent call last):
  File "C:\Users\andr7495\Desktop\unifun.py", line 18, in <module>
    print "{}: {}".format(code,eval(expression))
UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
The exception is thrown when the code tries to print u"\u0080", however, I can do this in the interactive interpreter without a problem.
I've tried casting the results to unicode and specifying to ignore errors, but it's not helping. I feel like I'm missing a basic understanding about how unicode works, but is there anything I can do to get my code to print out all valid unicode expressions?

import sys
# sys.maxunicode is itself a valid code point, hence the + 1
for i in xrange(sys.maxunicode + 1):
    print unichr(i)

You're trying to format a Unicode character into a byte string. You can remove the error by using a Unicode string instead:
print u"{}: {}".format(code,eval(expression))
      ^
The other answers are better at simplifying the original problem, however; you're definitely doing things the hard way.

It is likely a problem with your terminal (cmd.exe is notoriously bad at this). Most of the time when you print, you are printing to a terminal, which then has to encode the text. If you run your code in IDLE or some other environment that can render unicode, you should see the characters. Also, you should not use eval; try this:
for uni_code in range(...):
    print hex(uni_code),unichr(uni_code)

Here's a rewrite of examples in this article that saves the list to a file.
Python 3.x:
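import sys
txtfile = "unicode_table.txt"
print("creating file: " + txtfile)
# errors='ignore' drops code points (such as lone surrogates) that UTF-16 cannot encode
with open(txtfile, "w", encoding="utf-16", errors="ignore") as F:
    # sys.maxunicode is itself a valid code point, hence the + 1
    for uc in range(sys.maxunicode + 1):
        line = "%s %s" % (hex(uc), chr(uc))
        print(line, file=F)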
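A possible refinement, sketched under the assumption that surrogates, unassigned code points, and control characters are not wanted in the table: the standard-library unicodedata module reports a general category for each code point, which can be used to skip them. The function name printable_table is hypothetical.

```python
import unicodedata

# General categories to skip: Cs = surrogate, Cn = unassigned, Cc = control.
SKIPPED = ("Cs", "Cn", "Cc")


def printable_table(start, stop):
    """Return 'hex char' rows for code points in [start, stop) worth printing."""
    rows = []
    for code_point in range(start, stop):
        char = chr(code_point)
        if unicodedata.category(char) in SKIPPED:
            continue
        rows.append("%s %s" % (hex(code_point), char))
    return rows


for row in printable_table(0x41, 0x44):
    print(row)
```

Calling it with the range 0 to sys.maxunicode + 1 would cover every code point, at the cost of a much longer run.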
