I am working on a script that fetches a zip file from a URL using the requests library. That zip file contains a CSV file, which I'm trying to read without saving it to disk. But while parsing I get this error: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
import csv
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    csvreader = csv.reader(csvfile)
    # _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
    for row in csvreader:
        print(row)
Try this:
import pandas as pd
import requests
from io import BytesIO, StringIO
from zipfile import ZipFile
response = requests.get(url)
zip_file = ZipFile(BytesIO(response.content))
files = zip_file.namelist()
with zip_file.open(files[0]) as csvfile:
    print(pd.read_csv(csvfile, encoding='utf8', sep=","))
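As a side note, newer pandas versions can read such an archive directly, without ZipFile, as long as the zip contains exactly one CSV; a minimal sketch under that assumption:
import pandas as pd

# url as in the question; pandas handles the zip compression itself,
# but only when the archive contains a single CSV file
df = pd.read_csv(url, compression='zip')
print(df.head())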
As @Aran-Fey alluded to:
import zipfile
import csv
import io
with open('/path/to/archive.zip', 'rb') as f:  # ZipFile needs a binary file object
    with zipfile.ZipFile(f) as zf:
        csv_filename = zf.namelist()[0]  # see namelist() for the list of files in the archive
        with zf.open(csv_filename) as csv_f:
            csv_f_as_text = io.TextIOWrapper(csv_f)
            reader = csv.reader(csv_f_as_text)
csv.reader (and csv.DictReader) require a file-like object opened in text mode. Normally this is not a problem when simply open(...)ing a file in 'r' mode, since, as the Python 3 docs say, text mode is the default: "The default mode is 'r' (open for reading text, synonym of 'rt')". But if you try to pass 'rt' to ZipFile.open(), you'll see an error, because ZipFile.open() requires mode "r" or "w":
with zf.open(csv_filename, 'rt') as csv_f:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: open() requires mode "r" or "w"
That's what io.TextIOWrapper is for -- for wrapping byte streams to be readable as text, decoding them on the fly.
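Putting it together for the original requests-based example, a minimal sketch (assuming the CSV inside the archive is UTF-8 encoded):
import csv
import io
import requests
from io import BytesIO
from zipfile import ZipFile

response = requests.get(url)  # url as in the question
zip_file = ZipFile(BytesIO(response.content))
with zip_file.open(zip_file.namelist()[0]) as csvfile:
    # wrap the binary stream so csv.reader receives str, not bytes
    for row in csv.reader(io.TextIOWrapper(csvfile, encoding='utf-8')):
        print(row)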
Related
I have the following code:
import re
#open the xml file for reading:
file = open('path/test.xml','r+')
#convert to string:
data = file.read()
file.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>",data))
file.close()
where I'd like to replace the old content that's in the file with the new content. However, when I execute my code, the file "test.xml" is appended to, i.e. I have the old content followed by the new "replaced" content. What can I do in order to delete the old stuff and only keep the new?
You need to seek to the beginning of the file before writing, and then use file.truncate() if you want to do an in-place replace:
import re
myfile = "path/test.xml"
with open(myfile, "r+") as f:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
The other way is to read the file then open it again with open(myfile, 'w'):
with open(myfile, "r") as f:
    data = f.read()
with open(myfile, "w") as f:
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
Neither truncate nor open(..., 'w') will change the inode number of the file (I tested twice, once with Ubuntu 12.04 NFS and once with ext4).
By the way, this is not really specific to Python; the interpreter just calls the corresponding low-level API. truncate() works the same way in the C programming language: see http://man7.org/linux/man-pages/man2/truncate.2.html
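If you want to check the inode claim yourself, os.stat() exposes the inode number; a quick sketch, assuming a POSIX filesystem where st_ino is meaningful:
import os

myfile = "path/test.xml"
inode_before = os.stat(myfile).st_ino
with open(myfile, "w") as f:  # truncates and rewrites in place
    f.write("replacement content")
inode_after = os.stat(myfile).st_ino
print(inode_before == inode_after)  # True on the filesystems tested above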
file = 'path/test.xml'
with open(file, 'w') as filetowrite:
    filetowrite.write('new content')
Open the file in 'w' mode and you will be able to replace its current text, saving the file with the new contents.
Using truncate(), the solution could be
import re
#open the xml file for reading:
with open('path/test.xml', 'r+') as f:
    # convert to string:
    data = f.read()
    f.seek(0)
    f.write(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", data))
    f.truncate()
import os  # must import this library

if os.path.exists('TwitterDB.csv'):
    os.remove('TwitterDB.csv')  # this deletes the file
else:
    print("The file does not exist")  # add this to prevent errors
I had a similar problem, and instead of overwriting my existing file using the different 'modes', I just deleted the file before using it again, so that it would be as if I was appending to a new file on each run of my code.
See How to Replace String in File for an answer that works in a simple way with replace:
fin = open("data.txt", "rt")
fout = open("out.txt", "wt")
for line in fin:
    fout.write(line.replace('pyton', 'python'))
fin.close()
fout.close()
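If the rewritten text should end up back under the original name, os.replace() can then move the new file over the old one (atomic when both are on the same filesystem); a small follow-up sketch:
import os

# swap the rewritten copy into place of the original
os.replace("out.txt", "data.txt")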
In my case, the following code did the trick:
import json

with open("output.json", "w+") as outfile:  # w+ creates the file if it doesn't exist and overwrites any existing content
    json.dump(result_plot, outfile)
Using the Python 3 pathlib library:
import re
from pathlib import Path
import shutil
shutil.copy2("/tmp/test.xml", "/tmp/test.xml.bak") # create backup
filepath = Path("/tmp/test.xml")
content = filepath.read_text()
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>",r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
A similar method, using a different approach to backups:
from pathlib import Path

filepath = Path("/tmp/test.xml")
backup = filepath.with_suffix('.bak')
filepath.rename(backup)  # move the original aside as the backup
content = backup.read_text()  # read from the backup, since the original path no longer exists
filepath.write_text(re.sub(r"<string>ABC</string>(\s+)<string>(.*)</string>", r"<xyz>ABC</xyz>\1<xyz>\2</xyz>", content))
I am trying to use the Python CSV reader to read a CSV file that I extract from a .tar.gz file using Python's tarfile library.
I have this:
tarFile = tarfile.open(name=tarFileName, mode="r")
for file in tarFile.getmembers():
    tarredCSV = tarFile.extractfile(file)
    reader = csv.reader(tarredCSV)
    next(reader)  # skip header
    for row in reader:
        if row[3] not in CSVRows.values():
            CSVRows[row[3]] = row
The files in the tar file are all CSVs.
I am getting an exception on the first file, at the first next(reader) line:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
How do I open said file (without extracting the file then opening it)?
tarfile.extractfile returns an io.BufferedReader object, a bytes stream, and yet csv.reader expects a text stream. You can use io.TextIOWrapper to convert the bytes stream to a text stream instead:
import io
...
reader = csv.reader(io.TextIOWrapper(tarredCSV, encoding='utf-8'))
You need to provide a text-mode file-like object to csv.reader.
Probably the best solution, without having to consume a complete file at once, is this approach (thanks to blhsing and damon for suggesting it):
import csv
import io
import tarfile
tarFile = tarfile.open(name=tarFileName, mode="r")
for file in tarFile.getmembers():
    csv_file = io.TextIOWrapper(tarFile.extractfile(file), encoding="utf-8")
    reader = csv.reader(csv_file)
    next(reader)  # skip header
    for row in reader:
        print(row)
Alternatively, a possible solution from Python3 working with csv files in tar files would be:
import csv
import io
import tarfile
tarFile = tarfile.open(name=tarFileName, mode="r")
for file in tarFile.getmembers():
    csv_file = io.StringIO(tarFile.extractfile(file).read().decode('utf-8'))
    reader = csv.reader(csv_file)
    next(reader)  # skip header
    for row in reader:
        print(row)
Here an io.StringIO object is used to make csv.reader happy. However, this might not scale well for larger files contained in the tar, as each file is read in one single step.
I'm not sure how to continue with my code. I need to take all the files from a folder and merge them into one file with a different text format.
Example:
The Input files are of text format like this:
"{'nr': '3173391045', 'data': '27/12/2017'}"
"{'nr': '2173391295', 'data': '05/01/2017'}"
"{'nr': '5173351035', 'data': '07/03/2017'}"
The Output files must be lines like this:
"3173391045","27/09/2017"
"2173391295","05/01/2017"
"5173351035","07/03/2017"
This is my working code; it handles the merge and removes the blank lines:
import glob2
import datetime
filenames = glob2.glob("*.txt")
with open(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f") + ".SAI", 'w') as file:
    for filename in filenames:
        with open(filename, "r") as f:
            file.write(f.read())
I'm trying something with .replace but it's not working; I get syntax errors or blank files:
filedata = filedata.replace("{", "") for line in filedata
If your input files had contained valid JSON strings, the correct way would have been to parse the lines as JSON and write them back out as CSV. But since the strings are enclosed in single quotes ('), they are rejected by the json module of the standard library, so my advice is to use a regex to parse them. The code could become:
import glob2
import datetime
import csv
import re
# the regex to parse the line
rx = re.compile(r".*'nr'\s*:\s*'(\d+)'.*'data'\s*:\s*'([/\d]+)'")
filenames = glob2.glob("*.txt")
with open(datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f") + ".SAI", 'w') as file:
    wr = csv.writer(file, quoting=csv.QUOTE_ALL)
    for filename in filenames:
        with open(filename, "r") as f:
            for line in f:  # process line by line
                m = rx.match(line)
                wr.writerow(m.groups())
With a few tweaks, the input data can be coerced into a form suitable for JSON parsing:
from datetime import datetime
import json
import glob2
import csv
with open(datetime.now().strftime("%Y-%m-%d-%H-%M-%S-%f") + ".SAI", 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, quoting=csv.QUOTE_ALL)
    for filename in glob2.glob('*.txt'):
        with open(filename) as f_input:
            for row in f_input:
                row_dict = json.loads(row.strip('"\n').replace("'", '"'))
                csv_output.writerow([row_dict['nr'], row_dict['data']])
Giving you:
"3173391045","27/12/2017"
"2173391295","05/01/2017"
"5173351035","07/03/2017"
Note, in Python 3.x the output file should be opened with newline=''. Without this, extra blank lines can appear in the output file.
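To see what the strip/replace chain does, here is the transformation applied to one of the sample lines from the question:
import json

line = '"{\'nr\': \'3173391045\', \'data\': \'27/12/2017\'}"\n'
cleaned = line.strip('"\n').replace("'", '"')
print(cleaned)  # {"nr": "3173391045", "data": "27/12/2017"}
row_dict = json.loads(cleaned)
print(row_dict['nr'], row_dict['data'])  # 3173391045 27/12/2017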
Using regexes/replaces to parse those strings is dangerous: you could always stumble on data containing the delimiter, a comma, etc. And in this case, even if json cannot read those lines, ast.literal_eval can parse the dict literals without any quote juggling:
import ast
import csv

with open("output.csv", "w", newline="") as fw:
    cw = csv.writer(fw)
    for filename in filenames:
        with open(filename) as f:
            for line in f:
                d = ast.literal_eval(line.strip('"\n'))  # drop the enclosing quotes, then safely parse the dict literal
                cw.writerow([d['nr'], d['data']])
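For example, one of the sample lines parses like this once the enclosing quotes are stripped:
import ast

d = ast.literal_eval("{'nr': '3173391045', 'data': '27/12/2017'}")
print(d['nr'], d['data'])  # 3173391045 27/12/2017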
I saved my Excel file as a CSV and received this error:
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
This is my code:
import csv
import sys

filepath = sys.argv[1]
csvdata = list(csv.reader(open(filepath)))
How can I fix it?
An Excel file is not a csv file. First export / save the file as csv.
There are differences between Python versions regarding whether to open the file as binary or text, which affects how newlines are handled.
In Python 2.x, open as binary: open(filepath, 'rb')
In Python 3.x, don't : open('file.csv', 'r')
The second part I learned from this link about reading in csv files
For some operating systems (Mac OS for sure) you need to open with the mode 'rU'. See: this link about the same problem specifically on Mac OS.
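In Python 3 specifically, the csv module's documentation recommends opening the file with newline='' so the reader handles newlines embedded in quoted fields itself; a minimal sketch, reusing filepath from the question:
import csv

with open(filepath, 'r', newline='') as csvfile:
    for row in csv.reader(csvfile):
        print(row)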
Try this (put the actual location of your csv file):
import csv

with open(r'c:\pytest.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)
    mylist = list(data)
    print mylist
from tkFileDialog import askopenfilename
import csv
filename = askopenfilename()
with open(filename, 'rb') as csvfile:
    data = csv.reader(csvfile)
    mylist = list(data)
    print mylist
I need to extract a gz file that I have downloaded from an FTP site to a local Windows file server. I have variables set for the local path of the file, and I know it can be handled by the gzip module.
How can I do this? The file inside the GZ file is an XML file.
import gzip
import shutil
with gzip.open('file.txt.gz', 'rb') as f_in:
    with open('file.txt', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)
From the documentation:
import gzip
with gzip.open('file.txt.gz', 'rb') as f:
    file_content = f.read()
Maybe you want to pass it to pandas as well:
import gzip
import pandas as pd

with gzip.open('features_train.csv.gz') as f:
    features_train = pd.read_csv(f)
features_train.head()
from sh import gunzip
gunzip('/tmp/file1.gz')
Not an exact answer because you're using xml data and there is currently no pd.read_xml() function (as of v0.23.4), but pandas (starting with v0.21.0) can uncompress the file for you! Thanks Wes!
import pandas as pd
import os
fn = '../data/file_to_load.json.gz'
print(os.path.isfile(fn))
df = pd.read_json(fn, lines=True, compression='gzip')
df.tail()
If you are parsing the file after unzipping it, don't forget to use the decode() method; it is necessary when you open a file in binary mode:
import gzip
with gzip.open('file.gz', 'rb') as f:
    for line in f:
        print(line.decode().strip())
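Alternatively, gzip.open() also accepts text mode, which decodes on the fly and makes the manual decode() unnecessary; a minimal sketch, assuming UTF-8 content:
import gzip

with gzip.open('file.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        print(line.strip())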
It is very simple. Here you go:
import gzip

# path to the file to be extracted
ip = 'sample.gzip'
# output file to be filled
op = open("output_file", "w")
with gzip.open(ip, "rb") as ip_byte:
    op.write(ip_byte.read().decode("utf-8"))
op.close()
You can use gzip.decompress() to do it:
read the input file using rb mode;
open the output file using w mode and utf8 encoding;
gzip.decompress() the input bytes;
decode the result to str;
write the str to the output file.
import gzip

def decompress(infile, tofile):
    with open(infile, 'rb') as inf, open(tofile, 'w', encoding='utf8') as tof:
        decom_str = gzip.decompress(inf.read()).decode('utf-8')
        tof.write(decom_str)
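Called like this (the file names are just placeholders):
decompress('file.txt.gz', 'file.txt')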
If you have the gzip (and gunzip) programs installed on your computer, a simple way is to call that command from Python:
import os
filename = 'file.txt.gz'
os.system('gunzip ' + filename)
Optionally, if you want to preserve the original file, use:
os.system('gunzip --keep ' + filename)
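Note that building the command by string concatenation can break on file names containing spaces or shell metacharacters; passing an argument list to subprocess avoids that, for example:
import subprocess

# the list form passes the filename as a single argument, with no shell quoting needed
subprocess.run(['gunzip', '--keep', filename], check=True)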
If you have a Linux environment, it is very easy to decompress using the gunzip command. Go to the file's folder and run:
gunzip file-name