"ValueError: embedded null character" when using open() - python

I am taking python at my college and I am stuck with my current assignment. We are supposed to take 2 files and compare them. I am simply trying to open the files so I can use them but I keep getting the error "ValueError: embedded null character"
file1 = input("Enter the name of the first file: ")
file1_open = open(file1)
file1_content = file1_open.read()
What does this error mean?

It seems that you have a problem with the characters "\" and "/". If you use them in the input, try swapping one for the other.

Python 3.5 assumes 'utf-8' only for source files; open() without an encoding argument falls back to the locale's preferred encoding.
Text files created on Windows tend to use something else.
If you intend to open two text files, you may try this:
import locale
locale.getdefaultlocale()
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=locale.getdefaultlocale()[1])
file1_content = file1_open.read()
The standard library has no general-purpose encoding detection, so you may write your own heuristic:
def guess_encoding(csv_file):
    """guess the encoding of the given file"""
    import io
    import locale
    with io.open(csv_file, "rb") as f:
        data = f.read(5)
    if data.startswith(b"\xEF\xBB\xBF"):  # UTF-8 with a "BOM"
        return "utf-8-sig"
    elif data.startswith(b"\xFF\xFE") or data.startswith(b"\xFE\xFF"):
        return "utf-16"
    else:  # in Windows, guessing utf-8 doesn't work, so we have to try
        try:
            with io.open(csv_file, encoding="utf-8") as f:
                preview = f.read(222222)
                return "utf-8"
        except UnicodeDecodeError:  # not valid UTF-8, fall back to the locale
            return locale.getdefaultlocale()[1]
and then
file1 = input("Enter the name of the first file: ")
file1_open = open(file1, encoding=guess_encoding(file1))
file1_content = file1_open.read()

Try prefixing the path with r to make it a raw string, so that backslashes are not treated as escape characters.
r'D:\python_projects\templates\0.html'

On Windows, when specifying the full path of a file in a string literal, use a double backslash as the separator rather than a single backslash.
For instance, C:\\FileName.txt instead of C:\FileName.txt

I got this error when copying a file to a folder whose name starts with a number. If you write the folder path with a double backslash (\\) before the number, the problem will be solved.
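The reason a leading digit matters can be shown with a short sketch (Python 3, using a hypothetical path that is never expected to exist): in a normal string literal, a backslash followed by 0 is the NUL character, and open() rejects any path that contains it before touching the file system.

```python
def try_open(path):
    """Return the name of the exception raised by open(), or None on success."""
    try:
        with open(path):
            return None
    except (ValueError, OSError) as exc:
        return type(exc).__name__

# Hypothetical path: in a normal literal, "\0" becomes the NUL character.
bad_path = "C:\\python_projects\\templates\0.html"
# In a raw literal, the backslash and the zero stay two separate characters.
raw_path = r"C:\python_projects\templates\0.html"

print("\0" in bad_path)    # True
print("\0" in raw_path)    # False
print(try_open(bad_path))  # ValueError
```

A path typed at an input() prompt is not subject to escape processing, so this particular cause applies only to paths written as literals in the source code.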

The first backslash of the file path throws the error, so you need a raw string (the r prefix):
FileHandle = open(r'..', encoding='utf8')
FilePath='C://FileName.txt'
FilePath=r'C:/FileName.txt'

The problem is due to bytes data that needs to be decoded.
When you enter a variable in the interpreter, it displays its repr, whereas print() uses its str (the two are the same in this scenario) and ignores unprintable characters such as \x00 and \x01, replacing them with something else.
A solution is to "decode" file1_content (ignore bytes):
file1_content = ''.join(x for x in file1_content if x.isprintable())

I was also getting the same error with the following code:
with zipfile.ZipFile(r"C:\local_files\REPORT.zip",mode='w') as z:
    z.writestr(data)
It was happening because I was passing the bytestring (data) to the writestr() method without specifying the name of the file, i.e. Report.zip, under which it should be saved.
So I changed my code and it worked.
with zipfile.ZipFile(r"C:\local_files\REPORT.zip",mode='w') as z:
    z.writestr('Report.zip', data)

If you are trying to open a file then you should use the path generated by os, like so:
import os
os.path.join("path","to","the","file")

Related

How can I write characters such as § into a file using Python?

This is my code for creating the string to be written ('result' is the variable that holds the final text):
fileobj = open('file_name.yml','a+')
begin = initial+":0 "
n_name = '"§'+tag+name+'§!"'
begin_d = initial+"_desc:0 "
n_desc = '"§3'+desc+'§!"'
title = ' '+begin + n_name
descript = ' '+begin_d + n_desc
result = title+'\n'+descript
print()
fileobj.close()
return result
This is my code for actually writing it into the file:
text = writing(initial, tag, name, desc)
override = inserter(fileobj, country, text)
fileobj.close()
fileobj = open('file_name.yml','w+')
fileobj.write(override)
fileobj.close()
(P.S: Override is a function which works perfectly. It returns a longer string to be written into the file.)
I have tried this with .txt and .yml files but in both cases, instead of §, this is what takes its place: xA7 (I cannot copy the actual text into the internet as it changes into the correct character. It is, however, appearing as xA7 in the file.) Everything else is unaffected, and the code runs fine.
Do let me know if I can improve the question in any way.
You're running into a problem called character encoding. There are two parts to the problem - first is to get the encoding you want in the file, the second is to get the OS to use the same encoding.
The most flexible and common encoding is UTF-8, because it can handle any Unicode character while remaining backwards compatible with the very old 7-bit ASCII character set. Most Unix-like systems like Linux will handle it automatically.
fileobj = open('file_name.yml','w+',encoding='utf-8')
Setting the PYTHONIOENCODING environment variable changes only the encoding of stdin, stdout and stderr. To make UTF-8 the default for open() as well, Python 3.7 and later offer UTF-8 mode, enabled with the PYTHONUTF8=1 environment variable.
Windows operating systems are a little trickier because they'll rarely assume UTF-8, especially if it's a Microsoft program opening the file. There's a magic byte sequence called a BOM that will trigger Microsoft programs to use UTF-8 if it's at the beginning of a file. Python can add that automatically for you:
fileobj = open('file_name.yml','w+',encoding='utf_8_sig')
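To see what each encoding actually puts on disk, here is a minimal sketch (Python 3). The helper name written_bytes and the sample text are made up for illustration; it writes to a temporary file and reads the raw bytes back.

```python
import os
import tempfile

# Sample text with two section signs (U+00A7), similar to the strings above.
text = '"\u00a73Some description\u00a7!"'

def written_bytes(text, encoding):
    """Write text to a temporary file with the given encoding; return the raw bytes."""
    fd, path = tempfile.mkstemp(suffix=".yml")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

utf8 = written_bytes(text, "utf-8")
sig = written_bytes(text, "utf-8-sig")
legacy = written_bytes(text, "cp1252")

print(utf8)     # the section sign is stored as the two bytes C2 A7
print(sig[:3])  # the BOM: EF BB BF
print(legacy)   # the section sign is stored as the single byte A7
```

A lone A7 byte in the file is consistent with a legacy single-byte encoding having been used for writing, which is why passing encoding explicitly matters.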

Find TM superscript in python 2 using regex

My text file includes "SSS™" as one of its words and I am trying to find it using regular expression. My problem is with finding ™ superscript. My code is:
import re
path='G:\python_code\A.txt'
f_general=open(path, 'r')
special=re.findall(r'\U2122',f_general.read())
print(special)
but it doesn't print anything. How can I fix it?
It may have to do with the encoding of your file. Try this:
import re
path = "g:\python_code\A.txt"
f_general=open(path, "r", encoding="UTF-16")
data = f_general.read()
special=re.findall(chr(8482), data)
print(special)
print(chr(8482))
Note I'm using the decimal value for Trade mark. This is the site I use:
https://www.ascii.cl/htmlcodes.htm
So, open the file in Notepad, do a Save As, choose the Unicode encoding, and this should all work. Working with extended ASCII can be a hassle. I am using Python 3.6; the built-in open() in Python 2.x does not accept an encoding argument, so 2.x needs a different approach.
Note when it prints out the chr(8482) in your command line it will probably just be a T, at least that is what I get in windows.
update
Try this for Python 2 and this should capture the word before trademark:
import re
with open("g:\python_code\A.txt", "rb") as f:
    data = f.read().decode("UTF-16")
# Python 2: chr() only covers 0-255, so use unichr() for U+2122
regex = re.compile(u"\\S+" + unichr(8482))
match = re.search(regex, data)
if match:
    print (match.group(0))
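For comparison, a Python 3 sketch of the same idea that works on an in-memory string instead of the file (the sample sentence and the function name find_trademarked are invented for illustration). The key point is that the pattern must contain the real character U+2122, which a normal string literal can spell as "\u2122".

```python
import re

TRADEMARK = "\u2122"  # same code point as chr(8482)

# One or more non-space characters immediately followed by the trademark sign.
pattern = re.compile(r"\S+" + TRADEMARK)

def find_trademarked(text):
    """Return every word that ends with the trademark sign."""
    return pattern.findall(text)

sample = "We sell SSS\u2122 and other things."
print(find_trademarked(sample))

# Round-trip through UTF-16 to mimic reading a file saved as "Unicode" in Notepad.
decoded = sample.encode("utf-16").decode("utf-16")
print(find_trademarked(decoded))
```

If the file is decoded with the wrong encoding, the trademark sign never appears as a single character in the string, and no pattern for it can match.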

Correctly reading text from Windows-1252(cp1252) file in python

As the title suggests, the problem I have is with correctly reading input from a Windows-1252 encoded file in Python and inserting that input into an SQLAlchemy/MySQL table.
The current system setup:
Windows 7 VM with "Roger Access Control System" which outputs the file;
Ubuntu 12.04 LTS VM with a shared-folder to the Windows system so I can access the file, using "Python 2.7.3".
Now to the actual problem: for the input file I have a "VM shared-folder" that contains a file generated on a Windows 7 system by Roger Access Control System (roger.pl for more details). This file is called "PREvents.csv" and, as the name suggests, it contains a ";"-separated list of data.
An example format of the data:
2013-03-19;15:58:30;100;Jānis;Dumburs;1;Uznemums1;0;Ieeja;
2013-03-19;15:58:40;100;Jānis;Dumburs;1;Uznemums1;2;Izeja;
The 4th field contains the card owner's first name, the 5th contains the owner's last name, and the 6th contains the owner's assigned group.
The issue comes from the fact that any one of the 3 fields mentioned above can contain characters specific to the Latvian language. In the example file, the word "Jānis" contains the letter "ā", which in Unicode is code point 257 (U+0101).
As I'm used to, I open the file as such:
try:
    f = codecs.open(file, 'rb', 'cp1252')
except IOError:
    f = codecs.open(file, 'wb', 'cp1252')
So far, everything works: it opens the file, and so I move on to iterate over each line of the file (this is a continuously running script, so pardon the loop):
while True:
    line = f.readline()
    if not line:
        # Pause loop for 1 second
        time.sleep(1)
    else:
        # Split the line into list
        date, timed, userid, firstname, lastname, groupid, groupname, typed, pointname, empty = line.split(';')
And this is where the issues start: if I print repr(firstname) it prints u'J\xe2nis' which is, as far as I understand, not correct. \xe2 does not represent the Latvian character "ā".
Further down the loop depending on event type I assign the variables to SQLAlchemy object and insert/update:
if typed == '0': # Entry type
    event = Events(
        period,
        fullname,
        userid,
        groupname,
        timestamp,
        0,
        0
    )
    session.add(event)
else: # Exit type
    event = session.query(Events).filter(
        Events.period == period,
        Events.exit == 0,
        Events.userid == userid
    ).first()
    if event is not None:
        event.exit = timestamp
        event.spent = timestamp - event.entry
# Commit changes to database
session.commit()
In my search for answers I've found how to define the default encoding to use:
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Which hasn't helped me in any way.
Basically, this all leads to me not being able to insert the correct owner's first/last name, as well as the owner's assigned group name, if they contain any Latvian-specific characters. For example:
Instead of the character "ā" it inserts "â"
I'd also like to add that I cannot change the "PREvents.csv" file encoding, and the "RACS" system does not support inserting into UTF-8 or Unicode files. If you try either way, the system inserts random symbols for the Latvian-specific characters.
Please let me know if any other information is needed, I'll gladly provide it :)
Any help would be highly appreciated.
CP1252 cannot represent ā; your input contains the similar character â. repr just displays an ASCII representation of a unicode string in Python 2.x:
>>> print(repr(b'J\xe2nis'.decode('cp1252')))
u'J\xe2nis'
>>> print(b'J\xe2nis'.decode('cp1252'))
Jânis
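A small Python 3 sketch of the point above. It also tries one extra codec, cp1257 (Windows Baltic), purely as an assumption: nothing in the question confirms that the file is really written in that code page, but the byte 0xE2 maps to "ā" there, which would be consistent with the symptoms described.

```python
raw = b"J\xe2nis"  # the bytes as they appear in the file

as_cp1252 = raw.decode("cp1252")  # what the script in the question does
as_cp1257 = raw.decode("cp1257")  # assumption: the file is really Windows Baltic

print(as_cp1252)
print(as_cp1257)

# cp1252 has no slot for U+0101 at all, so encoding it fails outright.
try:
    "J\u0101nis".encode("cp1252")
    cp1252_can_encode = True
except UnicodeEncodeError:
    cp1252_can_encode = False
print(cp1252_can_encode)
```

If the second decode yields the expected names for real rows of the file, opening it with 'cp1257' instead of 'cp1252' may be worth testing.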
I think u'J\xe2nis' is correct, see:
>>> print u'J\xe2nis'.encode('utf-8')
Jânis
Are you getting actual errors from SQLAlchemy or in your application's output?
I had the same problem with some XML files. I solved it by reading the file with ANSI encoding (Windows-1252) and writing a file with UTF-8 encoding:
import os
import sys
path = os.path.dirname(__file__)
file_name = 'my_input_file.xml'
if __name__ == "__main__":
    with open(os.path.join(path, './' + file_name), 'r', encoding='cp1252') as f1:
        lines = f1.read()
    f2 = open(os.path.join(path, './' + 'my_output_file.xml'), 'w', encoding='utf-8')
    f2.write(lines)
    f2.close()

Encoding file names to base64 on OS X not correct when using Japanese characters

I have a bunch of files named after people (e.g. "john.txt", "mary.txt"), but among them are also Japanese names (e.g. "fūka.txt", "tetsurō.txt").
What I'm trying to do is convert the part of each name before ".txt" to Base64.
The only problem is that when I take a file name (without the extension) and use a web-based converter, I get a different result than when I encode it with my Python script.
For example, when I copy the file name without its extension and encode "fūka" at http://www.base64encode.org, I get "ZsWra2E=". I get the same result when I take the person's name from a UTF-8 encoded PostgreSQL database, make it lowercase, and Base64-encode it.
But when I use the Python script below, I get "ZnXMhGth":
import glob, os
import base64
def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        t = title.lower().encode("utf-8")
        encoded_string = base64.b64encode(t) + ext
        p = os.path.join(dir, encoded_string)
        os.rename(pathAndFilename, p)
rename(u'./test', u'*.txt')
I get the same results on OS X 10.8 and Linux (files uploaded from the Mac to a Linux server). Python is 2.7. I also tried a PHP script (the result was the same as for the Python script).
A similar difference appears when I use names with other characters (e.g. "tetsurō").
One more strange thing: when I output the file name part with a Python script in OS X's Terminal application, copy that text as a file name, and THEN encode the file name to Base64, I get the same result as on the web page I mentioned above. Terminal uses UTF-8 encoding.
Could somebody please explain what I am doing (or thinking) wrong? Is there some little character substitution going on somewhere in between? How can I make the Python script produce the same result as the web page mentioned above? Any hints will be greatly appreciated.
SOLUTION:
With the help of Mark's answer I modified the script and it worked like a charm! Thanks Mark!
import glob, os
import base64
from unicodedata import normalize
def rename(dir, pattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        t = normalize('NFC', title.lower()).encode("utf-8") # <-- NORMALIZE !!!
        encoded_string = base64.b64encode(t) + ext
        p = os.path.join(dir, encoded_string)
        os.rename(pathAndFilename, p)
rename(u'./test', u'*.txt')
It appears that the Python script is seeing a decomposed normalization form of Unicode, where the ū has been split into two characters, u and a combining macron. The other form uses a single character, latin small letter u with macron. As far as Unicode is concerned, they're equivalent strings even though they don't have the same binary representation.
You might get some more information from this Unicode FAQ: http://www.unicode.org/faq/normalization.html
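The two Base64 strings from the question can be reproduced directly from the two normalization forms. This is a minimal Python 3 sketch; the helper name b64_name is made up for illustration.

```python
import base64
from unicodedata import normalize

def b64_name(name, form):
    """Normalize name to the given Unicode form, then Base64-encode its UTF-8 bytes."""
    data = normalize(form, name).encode("utf-8")
    return base64.b64encode(data).decode("ascii")

name = "f\u016bka"  # "fūka" written with the single precomposed character

print(b64_name(name, "NFC"))  # ZsWra2E=  (precomposed, as on the web page)
print(b64_name(name, "NFD"))  # ZnXMhGth  (decomposed, as read from the file system)
```

Normalizing to NFC before encoding, as the modified script does, makes both sources agree.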

Opening a file and matching Engr Label

1. Get the buildid from a build location; it is the last word after "\", which is "A1234ABCDE120083.1" in this case.
2. After getting the buildid, open a file and try to match the line "Engr Label: Data_CRM_PL_177999" to get the label name, which is "Data_CRM_PL_177999".
3. The final output should be "Data_CRM_PL_177999".
For some reason I am getting the following syntax error:
import re
Buildlocation= '\\umor\locations455\INT\A1234ABCDE120083.1'
Labelgetbuildlabel(Buildlocation)
def getbuildlabel(BuildLocation):
    buildid=BuildLocation.split('\')[-1]
    Notes=os.path.join(BuildLocation,Buildid + '_notes.txt')
    if os.path.exists(Notes):
        try:
            open(Notes)
        except IOError as er:
            pass
        else:
            for i in Notes.splitlines:
                if i.find(Engr Label)
                    label=i.split(:)[-1]
    print label//output should be Data_CRM_PL_177999
Output should be:-
Line looks like below in the file
Engr Label: Data_CRM_PL_177999
SYNTAX ERROR
buildid=BuildLocation.split('\')[-1]
^
SyntaxError: EOL while scanning string literal
In the line
buildid=BuildLocation.split('\')[-1]
The backslash is actually escaping the following quotation mark
So, Python thinks this is actually your string:
'[-1])
Instead, you should do the following:
buildid=BuildLocation.split('\\')[-1]
And Python will interpret your string to be a single backslash:
\
Interestingly, StackOverflow's syntax highlighter hints at this issue. If you look at your code, it treats everything after that first slash as part of the string, all the way to the end of your code sample.
You also have a few other issues in your code, so I tried cleaning it up a bit for you. (However, I don't have a copy of the file, so obviously, I wasn't able to test this)
import os.path
# Python prefers function and variable names to be all lowercase with
# underscore separating words.
def get_build_label(build_location):
    build_id = build_location.split('\\')[-1]
    notes_path = os.path.join(build_location, build_id + '_notes.txt')
    # notes_path is the filename (a string)
    try:
        with open(notes_path) as notes:
            # The 'with' statement will automatically close
            # the file for you
            for line in notes:
                # str.find() returns -1 (which is truthy) on a miss,
                # so test membership with 'in' instead
                if 'Engr Label' in line:
                    return line.split(':')[-1].strip()
    except IOError:
        # No need to do 'os.path.exists': if notes_path doesn't
        # exist, then the IOError exception will be raised.
        pass
    return None
# Call the function only after it has been defined
build_location = r'\\umor\locations455\INT\A1234ABCDE120083.1'
label = get_build_label(build_location)
print(label)
The backslash is escaping the ' character (see the escape codes documentation)
Try this line instead:
buildid=BuildLocation.split('\\')[-1]
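Now you have a backslash escaping the backslash, so your string is a literal backslash. Note that prefixing the string with an r (a raw string) does not help in this particular case: a raw string literal cannot end in a single backslash, so r'\' is a syntax error as well. A raw string is still a good fit for the path itself, which does not end in a backslash:
Buildlocation = r'\\umor\locations455\INT\A1234ABCDE120083.1'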
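A quick sketch of the splitting step on its own (runs on Python 3 and needs no file). The use of ntpath.basename is an alternative added here for illustration, not something the answers above propose; ntpath applies Windows path rules on any platform.

```python
import ntpath

sep = "\\"  # two characters in the source, one backslash in the string
path = r"\\umor\locations455\INT\A1234ABCDE120083.1"

print(len(sep))                  # 1
build_id = path.split(sep)[-1]   # last component after the final backslash
print(build_id)                  # A1234ABCDE120083.1

# Alternative: let the Windows path module find the last component.
print(ntpath.basename(path))     # A1234ABCDE120083.1
```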
You've got a number of other problems as well.
The comment character in Python is #, not //.
I think you're also confusing a filename with a file object.
Notes is the name of the file you're trying to open. Then, when you call open(Notes), you will get back a file object that you can read data from.
So you should probably replace:
open(Notes)
with
f = open(Notes)
And then replace:
for i in Notes.splitlines:
with
for line in f:
When you do a for loop over a file object, Python will automatically give you a line at a time.
Now you can check each line like this:
if line.find("Engr Label") != -1:
    label = line.split(':')[-1]
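Putting the line check together, here is a self-contained sketch that works on an in-memory list of lines rather than the real notes file. The function name find_label and the first sample line are invented for illustration; only the "Engr Label" line comes from the question.

```python
def find_label(lines, key="Engr Label"):
    """Return the text after the first colon on the first line containing key."""
    for line in lines:
        if key in line:
            # Split once on the first colon and drop surrounding whitespace.
            return line.split(":", 1)[1].strip()
    return None

# Stand-in for the contents of the notes file.
notes = [
    "Some other line\n",
    "Engr Label: Data_CRM_PL_177999\n",
]

print(find_label(notes))  # Data_CRM_PL_177999
```

Because iterating over an open file object also yields one line at a time, the same function can be given the file object directly.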
