Unable to encode Unicode text into a .txt file in Python

While experimenting with Python, I tried writing a program that fetches the content of Pastebin URLs and saves each one into its own file, but I got an error.
This is the code:
import requests
file = open("file.txt", "r", encoding="utf-8").readlines()
for line in file:
    link = line.rstrip("\n")
    n_link = link.replace("https://pastebin.com/", "https://pastebin.com/raw/")
    pastebin = n_link.replace("https://pastebin.com/raw/", "")
    r = requests.get(n_link, timeout=3)
    x = open(f"{pastebin}.txt", "a+")
    x.write(r.text)
    x.close()
I get the following error:
Traceback (most recent call last):
File "C:\Users\Lenovo\Desktop\Py\Misc. Scripts\ai.py", line 9, in <module>
x.write(r.text)
File "C:\Users\Lenovo\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2694' in position 9721: character maps to <undefined>
Can somebody help?

You’re on the right track by reading the input file as UTF-8. The only thing missing is to do the same with your output file:
x = open(f"{pastebin}.txt", "a+", encoding="utf-8")
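A minimal sketch of that fix, with the output file opened as UTF-8 inside a with block so that it is closed automatically. The helper name save_text and the sample string are made up for illustration; in the script above, r.text would take the place of the sample text.

```python
import os
import tempfile


def save_text(path, text):
    """Append text to path, always encoding it as UTF-8 (hypothetical helper)."""
    # An explicit encoding avoids the platform default (cp1252 in the traceback).
    with open(path, "a+", encoding="utf-8") as out:
        out.write(text)


# Sample text containing U+2694, the character named in the traceback.
sample = "crossed swords: \u2694"
out_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
save_text(out_path, sample)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as check:
    round_trip = check.read()
```

Using with also sidesteps the easy mistake of writing x.close without parentheses, which references the method but never calls it.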

Related

json.load() function doesn't work in Python

json.load() fails whenever I run my code.
This is the part of the code that reads the file:
import json
filename = 'eq_1_day_m1.json'
with open(filename) as f:
    all_eq_data = json.load(f)
readable_file = 'readable_eq_data.json'
with open(readable_file, 'w') as c:
    json.dump(all_eq_data, c, indent=4)
Then it gives me a long traceback that mentions charmap. I suspect this is because the file exceeds some maximum size. Can I do something about this?
C:\Users\PC\AppData\Local\Microsoft\WindowsApps\python.exe "C:/Users/PC/PycharmProjects/Learning/Learning Matplotlib/eq_explore_data.py"
Traceback (most recent call last):
File "C:\Users\PC\PycharmProjects\Learning\Learning Matplotlib\eq_explore_data.py", line 5, in <module>
all_eq_data = json.load(f)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\json\__init__.py", line 293, in load
return loads(fp.read(),
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.9_3.9.496.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 10292: character maps to <undefined>
Process finished with exit code 1
In case you are wondering, I do have the JSON file 'eq_1_day_m1.json'. It is too big to post on Stack Overflow, so I didn't add it to the question.
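No answer is recorded for this question here. The traceback shows the file being decoded with cp1252, which has no character for byte 0x81, so file size is unlikely to be the cause. One thing worth trying, assuming the file is actually UTF-8 (the question does not say), is to pass an explicit encoding to open(). The sketch below uses a small made-up file as a stand-in for eq_1_day_m1.json.

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
filename = os.path.join(workdir, "eq_sample.json")  # stand-in for eq_1_day_m1.json
readable_file = os.path.join(workdir, "readable_eq_data.json")

# Create a small UTF-8 file (made-up data). U+00C1 encodes to the bytes
# C3 81 in UTF-8, so it contains the 0x81 byte that cp1252 cannot decode.
with open(filename, "w", encoding="utf-8") as f:
    json.dump({"place": "\u00c1vila"}, f, ensure_ascii=False)

# Read with an explicit encoding instead of the platform default.
with open(filename, encoding="utf-8") as f:
    all_eq_data = json.load(f)

# Write the readable copy with the same explicit encoding.
with open(readable_file, "w", encoding="utf-8") as c:
    json.dump(all_eq_data, c, indent=4, ensure_ascii=False)
```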

How to avoid automatic ASCII encoding on Python 3?

I'm working on an encryption program in Python 3 right now, but I am having some problems with ASCII encoding. For example, if I want to write Ϩ (which is chr(1000)) into a text file from Python, and I do:
a_file = open('chr_ord.txt', 'w')
a_file.write(chr(1000))
a_file.close()
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "C:/Comp_Sci/Coding/printRAW.py", line 3, in <module>
a_file.write(chr(1000))
File "C:\WinPython-64bit-3.4.3.4\python-3.4.3.amd64\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03e8' in position 0: character maps to <undefined>
And if I try:
a_file = open('chr_ord.txt', 'w')
a_file.write(ascii(chr(1000)))
a_file.close()
Python doesn't crash, but the text file contains '\u03e8' instead of the desired Ϩ.
Is there any way I can get around this?
The Python 3 way is to use the encoding parameter when opening the file. For example, to encode the file as UTF-8:
a_file = open('chr_ord.txt', 'w', encoding='utf-8')
The default is your system ANSI code page, which doesn't contain the Ϩ character.
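As a rough illustration, the snippet below prints the encoding that text-mode open() typically falls back to when none is given, then writes chr(1000) with an explicit UTF-8 encoding. The temporary path is an assumption made only so the example can run anywhere.

```python
import locale
import os
import tempfile

# The locale's preferred encoding, which open() typically uses in text mode
# when no encoding argument is passed (cp1252 on the asker's machine).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

path = os.path.join(tempfile.mkdtemp(), "chr_ord.txt")
with open(path, "w", encoding="utf-8") as a_file:
    a_file.write(chr(1000))

# Inspect the raw bytes: U+03E8 is stored as the two bytes CF A8 in UTF-8.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
```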

UnicodeDecodeError when reading a .txt file

This may be a very basic fix, but I've dived through every example online trying to sort this out. I'm loading in a text file with Python 3.4 like so:
text = open("/Users/Stu/python/extext.txt")
text = unidecode(text)
text = open(text, "r").read()
and then I get thrown this error:
Traceback (most recent call last):
File "/Users/Stu/Twitter Python/Victoria.py", line 46, in <module>
short_pos = unidecode(short_pos)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/unidecode/__init__.py", line 37, in unidecode
for char in string:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 4645: ordinal not in range(128)
I'm assuming that it's finding a character that it can't decode, but all there is in this doc is English and basic punctuation. Any support you guys could give would be greatly appreciated.
Cheers!
This seemed to allow me to read the text:
short_pos = open("/Users/Stu/Twitter Python/short_reviews/positive1.txt","r", encoding = "latin-1").read()
Thanks for everyone's support!
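One caveat about this fix: Latin-1 assigns a character to every possible byte, so decoding with it never raises, whether or not the file was really saved as Latin-1. The short sketch below shows this with byte 0xf3 from the traceback; it makes no claim about what encoding the original file actually used.

```python
raw = bytes([0xF3])  # the byte reported in the traceback

# Latin-1 maps byte 0xF3 to U+00F3 (o with acute accent).
as_latin1 = raw.decode("latin-1")

# ASCII only covers bytes 0-127, so the same byte fails there.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Every one of the 256 byte values decodes under Latin-1 without error.
all_bytes = bytes(range(256)).decode("latin-1")
```

If the decoded text shows unexpected accented letters or symbols, the file was probably written in some other encoding, such as UTF-8 or Windows-1252.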

Unicode problems in Python again

I have a strange problem. A friend of mine sent me a text file. If I copy the text, paste it into my text editor, and save it, the following code works. If I choose the option to save the file directly from the browser, the following code breaks. What's going on? Is it the browser's fault for saving invalid characters?
This is an example line.
When I save it, the line says
What�s going on?
When I copy/paste it, the line says
What’s going on?
This is the code:
import codecs
def do_stuff(filename):
    with codecs.open(filename, encoding='utf-8') as f:
        def process_line(line):
            return line.strip()
        lines = f.readlines()
        for line in lines:
            line = process_line(line)
            print line
do_stuff('stuff.txt')
This is the traceback I get:
Traceback (most recent call last):
File "test-encoding.py", line 13, in <module>
do_stuff('stuff.txt')
File "test-encoding.py", line 8, in do_stuff
lines = f.readlines()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 679, in readlines
return self.reader.readlines(sizehint)
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 588, in readlines
data = self.read()
File "/home/somebody/.python/lib64/python2.7/codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 4: invalid start byte
What can I do in such cases?
How can I distribute the script if I don't know what encoding the user who runs it will use?
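Fixed by telling the decoder to skip bytes it cannot decode (note that this silently drops the affected characters):
with codecs.open(filename, encoding='utf-8', errors='ignore') as f: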
The "file-oriented" part of the browser works with raw bytes, not characters. The specific encoding used by the page should be specified either in the HTTP headers or in the HTML itself. You must use this encoding instead of assuming that you have UTF-8 data.
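To make the difference concrete, here is a small Python 3 sketch using the example line from the question. It assumes the browser saved the file as Windows-1252, where byte 0x92 is the right single quotation mark; the question does not confirm the actual encoding.

```python
# The example line as raw bytes, with 0x92 at position 4 as in the traceback.
raw = b"What\x92s going on?"

# Decoding with the (assumed) real encoding recovers the apostrophe.
as_cp1252 = raw.decode("cp1252")

# errors='ignore' decodes as UTF-8 but silently drops the bad byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' keeps a visible U+FFFD marker where the bad byte was.
replaced = raw.decode("utf-8", errors="replace")

print(as_cp1252)
print(ignored)
print(replaced)
```

Using the correct encoding keeps the text intact, while errors='ignore' only hides the problem by discarding data.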

Python: File encoding errors

For a few days I have been struggling with an annoying file encoding problem in my little Python program.
I work a lot with MediaWiki; recently I have been converting documents from .doc to Wikisource.
A document in Microsoft Word format is opened in LibreOffice and then exported to a .txt file in Wikisource format. My program searches for an [[Image:]] tag and replaces it with the name of an image taken from a list, and that mechanism works really well (big thanks for the help, brjaga!).
When I tested it on .txt files I had created myself, everything worked fine, but when I feed it a .txt file with Wikisource, the whole thing is not so funny anymore :D
I got this message from Python:
Traceback (most recent call last):
File "C:\Python33\final.py", line 15, in <module>
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>
And this is my Python code:
li = [
"[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
]
with open("C:\\124_BPP_PL_PL.txt") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')
for item in li:
    s = s.replace("[[Image:]]", item, 1)
dest.write(s)
dest.close()
OK, so I did some research and found that this is a problem with encoding. So I installed Notepad++, changed the encoding of my Wikisource .txt file to UTF-8, and saved it. Then I changed my code:
with open("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
But I got this new error message:
Traceback (most recent call last):
File "C:\Python33\final.py", line 22, in <module>
dest.write(s)
File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
And I'm really stuck on this one. I thought that if I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would work.
Please help. Thank you in advance.
When Python 3 opens a text file, it uses the default encoding for your system when trying to decode the file in order to give you full Unicode text (the str type is fully Unicode aware). It does the same when writing out such Unicode text values.
You already solved the input side by specifying an encoding when reading. Do the same when writing: specify a codec that can handle all of Unicode, including the zero-width no-break space at code point U+FEFF (here, the byte order mark at the start of your file). UTF-8 is usually a good default choice:
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
You can use the with statement when writing too and save yourself the .close() call:
for item in li:
    s = s.replace("[[Image:]]", item, 1)
with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
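The '\ufeff' at position 0 in the traceback is a byte order mark (BOM) that was read in as part of the text. If you would rather not carry it into the output, one option is to read the file with Python's utf-8-sig codec, which strips a leading BOM when one is present. This is a sketch with a made-up temporary file, not the original document:

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "with_bom.txt")  # made-up stand-in file

# Write UTF-8 bytes preceded by a BOM (EF BB BF), as some editors do.
with open(src, "wb") as f:
    f.write(b"\xef\xbb\xbf[[Image:]] text")

# Plain utf-8 keeps the BOM as a U+FEFF character at the start of the text.
with open(src, encoding="utf-8") as f:
    plain = f.read()

# utf-8-sig removes the BOM while decoding.
with open(src, encoding="utf-8-sig") as f:
    stripped = f.read()
```

The utf-8-sig codec also reads files that have no BOM, so it is a reasonable choice when you do not know how the file was saved.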
