I have a zip file (file link).
Its files are encoded as UTF-8; how can I decode every file within it? I tried the following, but it failed with: TypeError: decoding with 'utf-8' codec failed (TypeError: a bytes-like object is required, not 'ZipFile')
from zipfile import ZipFile
import codecs

with ZipFile('articles.zip', 'r') as zip:
    with zip.open('articles/document0001.txt') as file:
        codecs.decode(file, encoding='utf-8', errors='strict')
Also, there are 100 files in that zip; is there a smart way to do the decoding for all the files in one go?
You can use bytes.decode on the text:
from zipfile import ZipFile

with ZipFile('articles.zip', 'r') as z:
    with z.open('articles/document0001.txt') as file:
        file_text = file.read().decode('utf-8')
        print(file_text)  # or do whatever else you want to do with it.
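For the second part of your question: you can list every entry in the archive with namelist() and decode them all in one loop. A minimal sketch (it assumes every .txt entry in the archive is UTF-8 text):

from zipfile import ZipFile

decoded = {}
with ZipFile('articles.zip', 'r') as z:
    for name in z.namelist():
        if name.endswith('.txt'):  # skip directory entries and non-text files
            with z.open(name) as file:
                decoded[name] = file.read().decode('utf-8')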
Related
I need to convert multiple CSV files (with different encodings) into UTF-8.
Here is my code:
# find encoding and if not in UTF-8 convert it
import os
import sys
import glob
import chardet
import codecs

myFiles = glob.glob('/mypath/*.csv')
csv_encoding = []
for file in myFiles:
    with open(file, 'rb') as opened_file:
        bytes_file = opened_file.read()
        result = chardet.detect(bytes_file)
        my_encoding = result['encoding']
        csv_encoding.append(my_encoding)
print(csv_encoding)

for file in myFiles:
    if csv_encoding in ['utf-8', 'ascii']:
        print(file + ' in utf-8 encoding')
    else:
        with codecs.open(file, 'r') as file_for_conversion:
            read_file_for_conversion = file_for_conversion.read()
        with codecs.open(file, 'w', 'utf-8') as converted_file:
            converted_file.write(read_file_for_conversion)
        print(file + ' converted to utf-8')
When I try to run this code I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 5057: invalid continuation byte
Can someone help me? Thanks!!!
You need to zip the lists myFiles and csv_encoding to get their values aligned:
for file, encoding in zip(myFiles, csv_encoding):
    ...
And you need to specify that value in the open() call:
    ...
    with codecs.open(file, 'r', encoding=encoding) as file_for_conversion:
Note: in Python 3 there's no need to use the codecs module for opening files.
Just use the built-in open function and specify the encoding with the encoding parameter.
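Putting both fixes together, the conversion loop might look like this (a minimal sketch using the built-in open; it assumes the detection loop above has already filled csv_encoding and that chardet returned an encoding name for every file):

for file, encoding in zip(myFiles, csv_encoding):
    if encoding.lower() in ('utf-8', 'ascii'):
        print(file + ' already in utf-8 encoding')
    else:
        # read with the detected encoding, then rewrite the file as UTF-8
        with open(file, 'r', encoding=encoding) as file_for_conversion:
            read_file_for_conversion = file_for_conversion.read()
        with open(file, 'w', encoding='utf-8') as converted_file:
            converted_file.write(read_file_for_conversion)
        print(file + ' converted to utf-8')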
So I'm trying to create a very simple program that opens a file, reads it, and converts its contents from hex to base64 using Python 3.
I tried this:
file = open("test.txt", "r")
contenu = file.read()
encoded = contenu.decode("hex").encode("base64")
print (encoded)
but I get the error:
AttributeError: 'str' object has no attribute 'decode'
I tried multiple other things but always get the same error.
Inside test.txt is:
4B
If you guys can explain what I'm doing wrong, that would be awesome.
Thank you
EDIT:
I should get Sw== as output.
This should do the trick. Your code works for Python <= 2.7 but needs updating in later versions.
import base64
file = open("test.txt", "r")
contenu = file.read()
bytes = bytearray.fromhex(contenu)
encoded = base64.b64encode(bytes).decode('ascii')
print(encoded)
You need to convert the hex string from test.txt to a bytes-like object using bytes.fromhex() before encoding it to base64.
import base64

with open("test.txt", "r") as file:
    content = file.read()
    encoded = base64.b64encode(bytes.fromhex(content))
print(encoded)
You should always use the with statement when opening your file so the I/O is closed automatically when you're finished.
In IDLE:

>>> import base64
>>>
>>> with open('test.txt', 'r') as file:
...     content = file.read()
...     encoded = base64.b64encode(bytes.fromhex(content))
...
>>> encoded
b'Sw=='
I'm just trying to load this JSON file (with non-ASCII characters) as a Python dictionary with Unicode encoding, but I still get this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:
"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8', 'replace')))
You have several problems, as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with the platform's default (whatever locale.getpreferredencoding() returns). Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Judging by your error message, I suspect the file was opened with an ASCII encoding.
Next, the file is decoded into Python strings as it is read by the file object. Each line has already been decoded to a str and is ready for json to read. When you do line.encode('utf-8', 'replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid JSON on its own, but it does look like one line of a pretty-printed JSON file containing a single JSON object. My guess is that you should read the entire file as one JSON object.
Putting it all together you get:
import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded with a Windows Portuguese code page. If so, the "cp860" encoding may work better.
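If that turns out to be the case, the only change needed is the encoding argument (cp860 here is just the guess above, not something confirmed from your file):

import json

with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="cp860") as f:
    data = json.load(f)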
I had the same problem; what worked for me was creating a regular expression and parsing every line from the JSON file:

import re

REGEXP = r'[^A-Za-z0-9\'\:\.\;\-\?\!]+'
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours, I can read the file in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json

data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.
I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it; I'd then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector

# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
    if isinstance(obj, basestring):
        if not isinstance(obj, unicode):
            obj = unicode(obj, encoding)
    return obj

def enforce_unicode():
    detector = UniversalDetector()
    for root, dirnames, filenames in os.walk('.'):
        for filename in fnmatch.filter(filenames, '*.cs'):
            detector.reset()
            filepath = os.path.join(root, filename)
            with open(filepath, 'r') as f:
                for line in f:
                    detector.feed(line)
                    if detector.done:
                        break
            detector.close()
            encoding = detector.result['encoding']
            if encoding and not encoding == 'UTF-8':
                print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
                with codecs.open(filepath, 'r', encoding=encoding) as f:
                    content = ''.join(f.readlines())
                content = to_unicode_or_bust(content)
                with codecs.open(filepath, 'w', encoding='utf-8') as f:
                    f.write(content)

enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
  File "./enforce-unicode.py", line 48, in <module>
    enforce_unicode()
  File "./enforce-unicode.py", line 43, in enforce_unicode
    content = content.decode(encoding).encode('utf-8')
  File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
    (output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?
chardet simply got the detected codec wrong; your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses; it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded in the one can be decoded without error using the other. Telling the difference between a control code in one and a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guesses for each file; if a file turns out to be wrongly re-coded, you then need to look at what other codecs could be close. All of the 1250-series codepages look a lot alike.
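A minimal sketch of recording those guesses (files_to_convert is a hypothetical list of paths, e.g. gathered with os.walk() as in your script, and the log file name is also just an example):

import json
from chardet.universaldetector import UniversalDetector

detector = UniversalDetector()
guesses = {}
for filepath in files_to_convert:  # hypothetical list of file paths
    detector.reset()
    with open(filepath, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    # the result includes a confidence score, useful when a re-coded file looks wrong later
    guesses[filepath] = detector.result  # e.g. {'encoding': 'windows-1252', 'confidence': 0.73, ...}

# keep the guesses next to the converted files so a bad guess can be traced afterwards
with open('chardet-guesses.json', 'w') as f:
    json.dump(guesses, f, indent=2)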
I have received a large set of SAS files which all need to have their filepaths altered.
The code I've written for that task is as follows:
import glob
import os
import sys

os.chdir(r"C:\path\subdir")
glob.glob('*.sas')

import os
fileLIST = []
for dirname, dirnames, filenames in os.walk('.'):
    for filename in filenames:
        fileLIST.append(os.path.join(dirname, filename))
print fileLIST

import re
for fileITEM in set(fileLIST):
    dataFN = r"//path/subdir/{0}".format(fileITEM)
    dataFH = open(dataFN, 'r+')
    for row in dataFH:
        print row
        if re.findall('\.\.\.', str(row)) != []:
            dataSTR = re.sub('\.\.\.', "//newpath/newsubdir", row)
            print >> dataFH, dataSTR.encode('utf-8')
        else:
            print >> dataFH, row.encode('utf-8')
    dataFH.close()
The issues I have are twofold. First, it seems as though my code does not recognize the three sequential periods, even when each is escaped with a backslash. Second, I receive the error "UnicodeDecodeError: 'ascii' codec can't decode byte...".
Is it possible that SAS program files (.sas) are not utf-8? If so, is the fix as simple as knowing what file encoding they use?
The full traceback is as follows:
Traceback (most recent call last):
  File "stringsubnew.py", line 26, in <module>
    print >> dataFH, row.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x83 in position 671: ordinal not in range(128)
Thanks in advance
The problem lies with the reading rather than writing. You have to know what encoding lies within the source file you are reading from and decode it appropriately.
Let's say the source file contains data encoded with iso-8859-1
You can do this when reading using str.decode()
my_row = row.decode('iso-8859-1')
Or you can open the file using codecs to take care of it for you.
import codecs
dataFH = codecs.open(dataFN, 'r+', 'iso-8859-1')
A good talk on this can be found at http://nedbatchelder.com/text/unipain.html
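For what it's worth, here is a rough sketch of the whole loop with the reading fixed. It assumes iso-8859-1 really is the source encoding, reuses your placeholder paths, and swaps the regex for a plain string replace; reading the whole file first and then rewriting it also avoids printing back into a file you are still iterating over:

import codecs

for fileITEM in set(fileLIST):
    dataFN = r"//path/subdir/{0}".format(fileITEM)
    # read the whole file, decoding with the real source encoding
    with codecs.open(dataFN, 'r', 'iso-8859-1') as dataFH:
        content = dataFH.read()
    # replace the literal "..." placeholder with the new path
    content = content.replace('...', '//newpath/newsubdir')
    # write the result back out as UTF-8
    with codecs.open(dataFN, 'w', 'utf-8') as dataFH:
        dataFH.write(content)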