How to read in a huge csv zipped file in python?

I have a few really big .zip files. Each contains 1 huge .csv.
When I try to read it in, I either get a memory error or everything freezes/crashes.
I've tried this:
zf = zipfile.ZipFile('Eve.zip', 'r')
df1 = zf.read('Eve.csv')
but this gives a MemoryError.
I've done some research and tried this:
import zipfile

with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        for line in f:
            df = pd.DataFrame(f)
            print(line)
but I can't get it into a dataframe.
Any ideas please?
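One approach that usually avoids the MemoryError (a sketch, assuming the file names from the question; process_chunk is a hypothetical placeholder for your own logic): let pandas stream the CSV out of the archive in fixed-size chunks instead of materialising the whole file at once:
import zipfile

import pandas as pd

# Stream the CSV in chunks so the whole file never sits in memory at once.
# File names come from the question; the chunk size is arbitrary.
with zipfile.ZipFile('Events_WE20200308.zip', 'r') as z:
    with z.open('Events_WE20200308.csv') as f:
        for chunk in pd.read_csv(f, chunksize=100_000):
            process_chunk(chunk)  # hypothetical per-chunk handler
pandas can also read a zip archive that contains a single CSV directly, e.g. pd.read_csv('Eve.zip', compression='zip', chunksize=100_000), which skips the explicit zipfile handling.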

Related

Saving output to a csv file

I'm trying to save the output to a csv file. The code below prints the information to the screen fine, but when I try to save it to a csv or text file, I get one letter at a time. I'm trying to understand why.
data = json.loads(response.text)
info = data['adapterInstancesInfoDto']
for x in range(len(info)):
    val = info[x]['resourceKey']['name']
    print(val)
I tried writing to a csv and a text file; same issue. I tried pandas; same result. I am thinking I need to convert it into a tuple or dictionary to save it to a csv file.
Use the built-in csv module for working with csv files. Here's an example of writing to a file:
import csv

with open('filename.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["SNo", "Name"])
    writer.writerow([1, "Python"])
    writer.writerow([2, "Csv"])
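The one-letter-at-a-time symptom usually means a bare string was passed to writer.writerow(), which iterates it character by character. Wrapping each value in a list fixes it; here is a sketch that reuses the info list from the question (names.csv is a hypothetical output path):
import csv

# `info` is the list parsed from the JSON response in the question.
names = [item['resourceKey']['name'] for item in info]

with open('names.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])             # header row
    writer.writerows([n] for n in names)  # one single-column row per name, not a bare string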

Exporting and Importing a list of Data Frames as json file in Pandas

Pandas has the DataFrame.to_json and pd.read_json functions that work for single Data Frames. However, I have been trying to figure out a way to export and import a list of many Data Frames to and from a single json file. So far, I have managed to export the list with this code:
with open('my_file.json', 'w') as outfile:
    outfile.writelines([json.dumps(df.to_dict()) for df in list_of_df])
This creates a json file with all the Data Frames converted to dicts. However, when I try to do the reverse to read the file and extract my Data Frames, I get an error. This is the code:
with open('my_file.json', 'r') as outfile:
    list_of_df = [pd.DataFrame.from_dict(json.loads(item)) for item in outfile]
The error I get is:
JSONDecodeError: Extra data
I think the problem is that I have to somehow include the opposite of 'writelines', which is 'readlines', in the code that reads the json file, but I do not know how to do it. Any help will be appreciated!
With writelines, the file ends up containing several JSON documents back to back rather than a single JSON list, which is why json.loads fails with Extra data and makes reading it tricky. I'd recommend instead writing to your file like this:
with open('my_file.json', 'w') as outfile:
    outfile.write(json.dumps([df.to_dict() for df in list_of_df]))
Which means we can read it back just as simply using:
with open('my_file.json', 'r') as outfile:
    list_of_df = [pd.DataFrame.from_dict(item) for item in json.loads(outfile.read())]
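An alternative sketch, if you want to keep one document per DataFrame (the JSON Lines convention; my_file.jsonl is a hypothetical name): write an explicit newline after each dict, then read the file back line by line:
import json

import pandas as pd

# Write one JSON document per line (JSON Lines).
with open('my_file.jsonl', 'w') as outfile:
    for df in list_of_df:
        outfile.write(json.dumps(df.to_dict()) + '\n')

# Read it back by decoding each line independently.
with open('my_file.jsonl') as infile:
    list_of_df = [pd.DataFrame.from_dict(json.loads(line)) for line in infile]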

Read and write large CSV file in python

I use the following code to read a LARGE CSV file (6-10 GB), insert a header row, and then export it to CSV again.
df = pd.read_csv('read file')
df.columns = ['list of headers']
df.to_csv('outfile', index=False, quoting=csv.QUOTE_NONNUMERIC)
But this methodology is extremely slow and I run out of memory. Any suggestions?
Rather than reading in the whole 6 GB file, could you not just add the headers to a new file and then cat in the rest? Something like this:
import csv
from fileinput import FileInput

columns = ['list of headers']

# Write the header line first, then append the original file's
# contents line by line without loading it all into memory.
with open('outfile.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerow(columns)
    with FileInput(files=('infile.csv',)) as f:
        for line in f:
            outfile.write(line)
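If you'd rather stay in pandas, a chunked variant along these lines should also keep memory flat (a sketch; the column names and chunk size are placeholders, and the column count must match your file):
import pandas as pd

# Read the headerless file in chunks, writing the header only once.
reader = pd.read_csv('infile.csv', header=None, chunksize=500_000)
with open('outfile.csv', 'w', newline='') as out:
    for i, chunk in enumerate(reader):
        chunk.columns = ['col_a', 'col_b']  # placeholder header list
        chunk.to_csv(out, index=False, header=(i == 0))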

Zipping a file in memory from a byte string

Currently I am writing a file like this:
with open('derp.bin', 'wb') as f:
f.write(file_data)
However, sometimes this file is very large and I need to zip it. I'd like to minimize the number of writes to disk and do this in memory. I understand I can use BytesIO and ZipFile to create a zip file from the data. This is what I have so far:
zipbuf = io.BytesIO(file_data)
z = zipfile.ZipFile(zipbuf, 'w')
z.close()
with open('derp.zip', 'wb') as f:
    shutil.copyfileobj(zipbuf, f)
How can I make it so that when you extract the zip file, the original derp.bin is inside?
z = zipfile.ZipFile('derp.zip', 'w')
z.writestr('derp.bin', file_data, zipfile.ZIP_DEFLATED)
z.close()
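For the fully in-memory route the question was aiming at, a sketch like this should also work: build the archive in a BytesIO buffer, then flush it to disk in a single write:
import io
import shutil
import zipfile

# file_data is the byte string from the question.
zipbuf = io.BytesIO()
with zipfile.ZipFile(zipbuf, 'w', zipfile.ZIP_DEFLATED) as z:
    z.writestr('derp.bin', file_data)  # the member name you want on extraction
zipbuf.seek(0)  # rewind before copying the buffer out
with open('derp.zip', 'wb') as f:
    shutil.copyfileobj(zipbuf, f)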

Convert JSON *files* to CSV *files* using Python (Idle)

This question piggybacks a question I had posted yesterday. I actually got my code to work fine. I was starting small. I switched out the JSON in the Python code for multiple JSON files outside of the Python code. I actually got that to work beautifully. And then there was some sort of catastrophe, and my code was lost.
I have spent several hours trying to recreate it to no avail. I am actually using arcpy (ArcGIS's Python module) since I will later on be using it to perform some spatial analysis, but I don't think you need to know much about arcpy to help me out with this part (I don't think, but it may help).
Here is one version of my latest attempts, but it is not working. I switched out my actual path for just "Pathname". Everything works up to the point where I try to populate the rows of the CSV with the latitude and longitude values (the latitude/longitude headers are written to the CSV files successfully). So apparently whatever is below dict_writer.writerows(openJSONfile) is not working:
import json, csv, arcpy
from arcpy import env

arcpy.env.workspace = r"C:\GIS\1GIS_DATA\Pathname"
workspaces = arcpy.ListWorkspaces("*", "Folder")
for workspace in workspaces:
    arcpy.env.workspace = workspace
    JSONfiles = arcpy.ListFiles("*.json")
    for JSONfile in JSONfiles:
        descJSONfile = arcpy.Describe(JSONfile)
        JSONfileName = descJSONfile.baseName
        openJSONfile = open(JSONfile, "wb+")
        print "JSON file is open"
        fieldnames = ['longitude', 'latitude']
        with open(JSONfileName+"test.csv", "wb+") as f:
            dict_writer = csv.DictWriter(f, fieldnames=fieldnames)
            dict_writer.writerow(dict(zip(fieldnames, fieldnames)))
            dict_writer.writerows(openJSONfile)

#Do I have to open the CSV files? Aren't they already open?
#openCSVfile = open(CSVfile, "r+")
for row in openJSONfile:
    f.writerow( [row['longitude'], row['latitude']] )
Any help is greatly appreciated!!
You're not actually loading the JSON file.
You're trying to write rows from an open file instead of writing rows from json.
You will need to add something like this:
rows = json.load(openJSONfile)
and later:
dict_writer.writerows(rows)
The last two lines you have should be removed, since all the csv writing is done before you reach them, and they are outside of the loop, so they would only apply to the last file anyway (they don't write anything, since there are no lines left in the file at that point).
Also, I see you're using with open... to open the csv file, but not the json file.
You should always use it rather than using open() without the with statement.
You should use a csv.DictWriter object to do everything. Here's something similar to your code with all the Arc stuff removed because I don't have it, that worked when I tested it:
import json, csv

JSONfiles = ['sample.json']
for JSONfile in JSONfiles:
    with open(JSONfile, "rb") as openJSONfile:
        rows = json.load(openJSONfile)
    fieldnames = ['longitude', 'latitude']
    with open(JSONfile+"test.csv", "wb") as f:
        dict_writer = csv.DictWriter(f, fieldnames=fieldnames)
        dict_writer.writeheader()
        dict_writer.writerows(rows)
It was unnecessary to write out each row because your json file was a list of row dictionaries (assuming it was what you had embedded in your linked question).
I can't say I know for sure what was wrong, but putting all of the .JSON files in the same folder as my code (and changing my code appropriately) works. I will have to keep investigating why, when trying to read from other folders, it gives me the error:
IOError: [Errno 2] No such file or directory:
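A likely explanation (an assumption, not confirmed in the thread): arcpy.ListFiles returns bare file names relative to the arcpy workspace, while Python's open() resolves paths against the process's current working directory. Joining the workspace path onto each name, roughly like this, should let the code read from other folders:
import os

# Hypothetical fix: resolve the bare name against the workspace first.
json_path = os.path.join(arcpy.env.workspace, JSONfile)
with open(json_path, "rb") as openJSONfile:
    rows = json.load(openJSONfile)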
For now, the following code DOES work :)
import json, csv, arcpy, os
from arcpy import env

arcpy.env.workspace = r"C:\GIS\1GIS_DATA\MyFolder"
JSONfiles = arcpy.ListFiles("*.json")
print JSONfiles
for JSONfile in JSONfiles:
    print "Current JSON file is: " + JSONfile
    descJSONfile = arcpy.Describe(JSONfile)
    JSONfileName = descJSONfile.baseName
    with open(JSONfile, "rb") as openJSONfile:
        rows = json.load(openJSONfile)
        print "JSON file is loaded"
    fieldnames = ['longitude', 'latitude']
    with open(JSONfileName+"test.csv", "wb") as f:
        dict_writer = csv.DictWriter(f, fieldnames=fieldnames)
        dict_writer.writerow(dict(zip(fieldnames, fieldnames)))
        dict_writer.writerows(rows)
    print "CSVs are Populated with headers and rows from JSON file.", '\n'
Thanks everyone for your help.
