I am new to python and stackoverflow.
I have a folder with csv files and I am trying to read field name from each file and write them on new csv file.
Thanks to stackoverflow, I was able to make and edit my code until unicode error came out.
I tried my best trying to solve this error and did research.
I found out that files created in Mac or Linux have utf8 unicode and files created in windows have cp949.
Thus, I have to open them by utf8.
My code first looked like this :
import csv
import glob
lst=[]
files=glob.glob('C:/dataset/*.csv')
with open('test.csv','w',encoding='cp949',newline='') as testfile:
csv_writer=csv.writer(testfile)
for file in files:
with open(file,'r') as infile:
file=file[file.rfind('\\')+1:]
reader=csv.reader(infile)
headers=next(reader)
headers=[str for str in headers if str]
while len(headers) < 3 :
headers=next(reader)
headers=[str for str in headers if str]
lst=[file]+headers
csv_writer.writerow(lst)
Then this error came out :
Traceback (most recent call last):
File "C:\Python35\2.py", line 12, in <module>
headers=next(reader)
UnicodeDecodeError: 'cp949' codec can't decode byte 0xec in position 6: illegal multibyte sequence
Here is how I tried to fix unicode error :
import csv
import glob
lst=[]
files=glob.glob('C:/dataset/*.csv')
with open('test.csv','w',encoding='cp949',newline='') as testfile:
csv_writer=csv.writer(testfile)
for file in files:
try:
with open(file,'r') as infile:
file=file[file.rfind('\\')+1:]
reader=csv.reader(infile)
headers=next(reader)
headers=[str for str in headers if str]
while len(headers) < 3 :
headers=next(reader)
headers=[str for str in headers if str]
lst=[file]+headers
csv_writer.writerow(lst)
except:
with open(file,'r',encoding='utf8') as infile:
file=file[file.rfind('\\')+1:]
reader=csv.reader(infile)
headers=next(reader)
headers=[str for str in headers if str]
while len(headers) < 3 :
headers=next(reader)
headers=[str for str in headers if str]
lst=[file]+headers
csv_writer.writerow(lst)
And this error came out :
Traceback (most recent call last):
File "C:\Python35\2.py", line 12, in <module>
headers=next(reader)
UnicodeDecodeError: 'cp949' codec can't decode byte 0xec in position 6: illegal multibyte sequence
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python35\2.py", line 20, in <module>
with open(file,'r',encoding='utf8') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '2010_1_1.csv'
File '2010_1_1.csv' definitely exists in my directory ('C:/dataset/*.csv')
When I try to open this file individually using open('C:/dataset/2010_1_1.csv','r',encoding='utf8') it works but there is '\ufeff' next to filename.
I am not sure but my guess is that this file is being opened in try: and not yet closed thus python can't open this file at except.
How can I edit my code to solve this Unicode problem?
import glob
from chardet.universaldetector import UniversalDetector
files=glob.glob('C:/example/*.csv')
for filename in files:
print(filename.ljust(60)),
detector.reset()
for line in file(filename, 'rb'):
detector.feed(line)
if detector.done: break
detector.close()
print(detector.result)
Error :
Traceback (most recent call last):
File "<pyshell#20>", line 4, in <module>
for line in file(filename, 'rb'):
TypeError: 'str' object is not callable
I'm not very experienced with Python so call me out of this is not possible, but you could simply attempt ignoring the encoding of the file when opening it. I'm a Java programmer and from my experience, encoding only needs to be specifyed when creating a new file, and not when opening one.
It looks like your file is not written in cp949 if it won't decode properly. You'll have to figure out the correct encoding. A module like chardet can help.
On Windows, when reading a file open it with the encoding it was written in. If UTF-8, use utf-8-sig, which will automatically handle and remove the byte-order-mark (BOM) U+FEFF character if present. When writing, the best bet is to use utf-8-sig because it handles all possible Unicode characters and will add a BOM so Windows tools like Notepad and Excel will recognize UTF-8-encoded files. Without it, most Windows tools will assume the ANSI encoding, which varies per localized version of Windows.
Error Comes Out in Form of Unicode Decode Error its just Somewhere Missed Decoding it Could be both due to Format is Not Supported in Decoding the Particular File or Decoding is perfect but Error in file while it was Written in Whatever Format it had been ex:>json,xml,csv....
The Only way to Avoid when Stuck in this Problem is to Ignore errors in Decode in first part of Code by using errors='ignore' argument in open():>
with open('test.csv','w',encoding='cp949',newline='') as testfile:
#to
with open(r'test.csv','w',encoding='cp949',newline='',errors='ignore') as testfile:
#or
data = open(r'test.csv',errors='ignore').read()#read the file as a data
if error persist check with diffrent encoding using errors='ignore'
Related
So while trying to mess around with python i tried making a program which would get me the content from pastebin url's and then save each ones content into a file of their own. I got an error
This is the code :-
import requests
file = open("file.txt", "r", encoding="utf-8").readlines()
for line in file:
link = line.rstrip("\n")
n_link = link.replace("https://pastebin.com/", "https://pastebin.com/raw/")
pastebin = n_link.replace("https://pastebin.com/raw/", "")
r = requests.get(n_link, timeout=3)
x = open(f"{pastebin}.txt", "a+")
x.write(r.text)
x.close
I get the following error :-
Traceback (most recent call last):
File "C:\Users\Lenovo\Desktop\Py\Misc. Scripts\ai.py", line 9, in <module>
x.write(r.text)
File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2694' in position 9721: character maps to <undefined>
Can somebody help?
You’re doing good at the start by reading in the input file as UTF-8. The only thing you’re missing is to do the same thing with your output file:
x = open(f"{pastebin}.txt", "a+", encoding="utf-8")
I've written a script that is supposed to retrieve html pages off a site and update their contents. The following function looks for a certain file on my system, then attempts to open it and edit it:
def update_sn(files_to_update, sn, table, title):
paths = files_to_update['files']
print('updating the sn')
try:
sn_htm = [s for s in paths if re.search('^((?!(Default|Notes|Latest_Addings)).)*htm$', s)][0]
notes_htm = [s for s in paths if re.search('_Notes\.htm$', s)][0]
except Exception:
print('no sns were found')
pass
new_path_name = new_path(sn_htm, files_to_update['predecessor'], files_to_update['original'])
new_sn_number = sn
htm_text = open(sn_htm, 'rb').read().decode('cp1252')
content = re.findall(r'(<table>.*?<\/table>.*)(?:<\/html>)', htm_text, re.I | re.S)
minus_content = htm_text.replace(content[0], '')
table_soup = BeautifulSoup(table, 'html.parser')
new_soup = BeautifulSoup(minus_content, 'html.parser')
head_title = new_soup.title.string.replace_with(new_sn_number)
new_soup.link.insert_after(table_soup.div.next)
with open(new_path_name, "w+") as file:
result = str(new_soup)
try:
file.write(result)
except Exception:
print('Met exception. Changing encoding to cp1252')
try:
file.write(result('cp1252'))
except Exception:
print('cp1252 did\'nt work. Changing encoding to utf-8')
file.write(result.encode('utf8'))
try:
print('utf8 did\'nt work. Changing encoding to utf-16')
file.write(result.encode('utf16'))
except Exception:
pass
This works in the majority of cases, but sometimes it fails to write, at which point the exception kicks in and I try every feasible encoding without success:
updating the sn
Met exception. Changing encoding to cp1252
cp1252 did'nt work. Changing encoding to utf-8
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 145, in update_sn
file.write(result)
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 4006-4007: character maps to <undefined>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scraper.py", line 79, in <module>
get_latest(entries[0], int(num), entries[1])
File "scraper.py", line 56, in get_latest
update_files.update_sn(files_to_update, data['number'], data['table'], data['title'])
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 152, in update_sn
file.write(result.encode('utf8'))
TypeError: write() argument must be str, not bytes
Can anyone give me any pointers on how to better handle html data that might have inconsistent encoding?
In your code you open the file in text mode, but then you attempt to write bytes (str.encode returns bytes) and so Python throws an exception:
TypeError: write() argument must be str, not bytes
If you want to write bytes, you should open the file in binary mode.
BeautifulSoup detects the document’s encoding (if it is bytes) and converts it to string automatically. We can access the encoding with .original_encoding, and use it to encode the content when writting to file. For example,
soup = BeautifulSoup(b'<tag>ascii characters</tag>', 'html.parser')
data = soup.tag.text
encoding = soup.original_encoding or 'utf-8'
print(encoding)
#ascii
with open('my.file', 'wb+') as file:
file.write(data.encode(encoding))
In order for this to work you should pass your html as bytes to BeautifulSoup, so don't decode the response content.
If BeautifulSoup fails to detect the correct encoding for some reason, then you could try a list of possible encodings, like you have done in your code.
data = 'Somé téxt'
encodings = ['ascii', 'utf-8', 'cp1252']
with open('my.file', 'wb+') as file:
for encoding in encodings:
try:
file.write(data.encode(encoding))
break
except UnicodeEncodeError:
print(encoding + ' failed.')
Alternatively, you could open the file in text mode and set the encoding in open (instead of encoding the content), but note that this option is not available in Python2.
Just out of curiosity, is this line of code a typo file.write(result('cp1252'))? Seems like it is missing .encode method.
Traceback (most recent call last):
File "C:\Users\Joseph\Desktop\SN Script\update_files.py", line 149, in update_sn
file.write(result('cp1252'))
TypeError: 'str' object is not callable
Will it work perfectly if you modify the code to: file.write(result.encode('cp1252'))
I once had this write to file with encoding problem and brewed my own solution through the following thread:
Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence
.
My problem solved by changing the html.parser parsing mode to html5lib. I root-caused my problem due to malformed HTML tag and solved it with html5lib parser. For your reference, this is the documentation for each parser provided by BeautifulSoup.
Hope this helps
I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys
infile = open(sys.argv[1])
for line in infile:
print line,
domain = unicode(line.strip())
print type(domain)
print "IDN:", domain.encode("idna")
print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs
infile = codecs.open(sys.argv[1], "r", "utf8")
for line in infile:
print line,
domain = line.strip()
print type(domain)
print "IDN:", domain.encode("idna")
print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
you need to know in which encoding you file was saved. This would be something like 'utf-8' (which is NOT Unicode) or 'iso-8859-1' or 'cp1252' or alike.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])
for line in infile:
print line,
domain = line.strip().decode('utf-8')
print type(domain)
print "IDN:", domain.encode("idna")
print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.
I have a lengthy json file that contains utf-8 characters (and is encoded in utf-8). I want to read it in python using the built-in json module.
My code looks like this:
dat = json.load(open("data.json"), "utf-8")
Though I understand the "utf-8" argument should be unnecessary as it is assumed as the default. However, I get this error:
Traceback (most recent call last):
File "winratio.py", line 9, in <module>
dat = json.load(open("data.json"), "utf-8")
File "C:\Python33\lib\json\__init__.py", line 271, in load
return loads(fp.read(),
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 28519: ch
aracter maps to <undefined>
My question is: Why does python seem to ignore my encoding specification and try to load the file in cp1252?
Try this:
import codecs
dat = json.load(codecs.open("data.json", "r", "utf-8"))
Also here are described some tips about a writing mode in context of the codecs library: Write to UTF-8 file in Python
From a few days I'm struggling this annoying problem with file encoding in my little program in Python.
I work a lot with MediaWiki - recently I do documents conversion from .doc to Wikisource.
Document in Microsoft Word format is opened in Libre Office and then exported to .txt file with Wikisource format. My program is searching for a [[Image:]] tag and replace it with a name of image taken from a list - and that mechanism works really fine (Big Thanks for help brjaga!).
When I did some test on .txt files created by me everything worked just fine but when I put a .txt file with Wikisource whole thing is not so funny anymore :D
I got this message prom Python:
Traceback (most recent call last):
File "C:\Python33\final.py", line 15, in <module>
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>
And this is my Python code:
li = [
"[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
]
with open ("C:\\124_BPP_PL_PL.txt") as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')
for item in li:
s = s.replace("[[Image:]]", item, 1)
dest.write(s)
dest.close()
OK, so I did some research and found that this is a problem with encoding. So I installed a program Notepad++ and changed the encoding of my .txt file with Wikisource to: UTF-8 and saved it. Then I did some change in my code:
with open ("C:\\124_BPP_PL_PL.txt", encoding="utf8') as myfile:
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
But I got this new error message:
Traceback (most recent call last):
File "C:\Python33\final.py", line 22, in <module>
dest.write(s)
File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
And I'm really stuck on this one. I thought, when I change the encoding manually in Notepad++ and then I will tell the encoding which I set - everything will be good.
Please help, Thank You in advance.
When Python 3 opens a text file, it uses the default encoding for your system when trying to decode the file in order to give you full Unicode text (the str type is fully Unicode aware). It does the same when writing out such Unicode text values.
You already solved the input side; you specified an encoding when reading. Do the same when writing: specify a codec to use to write out the file that can handle Unicode, including the non-breaking whitespace character at codepoint U+FEFF. UTF-8 is usually a good default choice:
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
You can use the with statement when writing too and save yourself the .close() call:
for item in li:
s = s.replace("[[Image:]]", item, 1)
with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
dest.write(s)