Apply GZIP compression to a CSV in Python Pandas - python

I am trying to write a dataframe to a gzipped csv in python pandas, using the following:
import pandas as pd
import datetime
import csv
import gzip
# Get data (with previous connection and script variables)
df = pd.read_sql_query(script, conn)
# Create today's date, to append to file
todaysdatestring = str(datetime.datetime.today().strftime('%Y%m%d'))
print(todaysdatestring)
# Create csv with gzip compression
df.to_csv('foo-%s.csv.gz' % todaysdatestring,
          sep='|',
          header=True,
          index=False,
          quoting=csv.QUOTE_ALL,
          compression='gzip',
          quotechar='"',
          doublequote=True,
          line_terminator='\n')
This just creates a csv called 'foo-YYYYMMDD.csv.gz', not an actual gzip archive.
I've also tried adding this:
#Turn to_csv statement into a variable
d = df.to_csv('foo-%s.csv.gz' % todaysdatestring,
              sep='|',
              header=True,
              index=False,
              quoting=csv.QUOTE_ALL,
              compression='gzip',
              quotechar='"',
              doublequote=True,
              line_terminator='\n')
# Write above variable to gzip
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as output:
    output.write(d)
Which fails as well. Any ideas?

Using df.to_csv() with the keyword argument compression='gzip' should produce a gzip archive. I tested it with the same keyword arguments as you, and it worked.
You may need to upgrade pandas: gzip compression was not implemented until version 0.17.1, and trying to use it on earlier versions will not raise an error; it will just produce a regular csv. You can check your installed version by looking at the output of pd.__version__.
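One way to confirm that the output really is gzipped is to check the file's magic bytes. A minimal sketch, using a small hypothetical frame in place of the real query result and a temporary path:

```python
import os
import tempfile

import pandas as pd

# Hypothetical small frame standing in for the real query result.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
path = os.path.join(tempfile.mkdtemp(), "foo-test.csv.gz")
df.to_csv(path, compression="gzip", index=False)

# A real gzip file starts with the magic bytes 0x1f 0x8b; a plain csv
# produced by an older pandas would start with the header text instead.
with open(path, "rb") as fh:
    magic = fh.read(2)
is_gzip = magic == b"\x1f\x8b"
```

If `is_gzip` is False, the file is a plain csv despite its `.gz` name, which points to an older pandas.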

This is done very easily with pandas:
import pandas as pd
Write a pandas dataframe to disk as a gzip-compressed csv:
df.to_csv('dfsavename.csv.gz', compression='gzip')
Read it back from disk:
df = pd.read_csv('dfsavename.csv.gz', compression='gzip')
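Put together, the round trip can be sketched like this (using a temporary directory and an illustrative frame):

```python
import os
import tempfile

import pandas as pd

# Round trip: recent pandas also infers compression from the .gz
# suffix, but passing compression='gzip' explicitly always works.
path = os.path.join(tempfile.mkdtemp(), "dfsavename.csv.gz")
df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df.to_csv(path, index=False, compression="gzip")
df2 = pd.read_csv(path, compression="gzip")
```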

From the gzip documentation (in Python 3, open in text mode 'wt' to write a str; 'wb' expects bytes):
import gzip

content = "Lots of content here"
with gzip.open('file.txt.gz', 'wt') as f:
    f.write(content)
The same works with pandas:
import csv
import gzip

content = df.to_csv(sep='|',
                    header=True,
                    index=False,
                    quoting=csv.QUOTE_ALL,
                    quotechar='"',
                    doublequote=True,
                    line_terminator='\n')
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wt') as f:
    f.write(content)
The trick here is that to_csv returns the csv as a string when you don't pass it a filename. You then just write that text with gzip's write method.
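That return-value behaviour is easy to check in isolation; a small sketch with an illustrative one-row frame:

```python
import csv

import pandas as pd

df = pd.DataFrame({"a": [1], "b": ["x"]})
# With no path argument, to_csv returns the csv text
# instead of writing a file and returning None.
text = df.to_csv(sep="|", index=False, quoting=csv.QUOTE_ALL)
```

With `quoting=csv.QUOTE_ALL`, every field is quoted, including the numeric ones.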

with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as f:
f.write(df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL))

Related

write from panda dataframe to s3 with utf-8 encoding

This code works:
os.system('pip install "s3fs==0.4"')
df.to_csv("s3://path",storage_options=aws_credentials,index=False, sep=';', encoding='utf-8-sig')
However, this code does not work (it runs successfully, but the encoding is not applied):
csv_buffer = StringIO()
df.to_csv(csv_buffer, sep=";", index=False, encoding='utf-8-sig')
s3_resource = boto3.resource("s3")
s3_resource.Object('bucket', 'filename').put(Body=csv_buffer.getvalue())
How can I use utf-8-sig encoding without the 's3fs' library?
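One approach, sketched under the assumption that the encoding is lost because `encoding=` is ignored when to_csv writes into a text buffer: encode the finished string yourself and upload the resulting bytes. The boto3 upload line is shown only as a comment since it needs real credentials and a real bucket.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"name": ["café"]})
csv_buffer = StringIO()
# encoding= has no effect when writing into a text buffer,
# so encode the finished string explicitly instead:
df.to_csv(csv_buffer, sep=";", index=False)
body = csv_buffer.getvalue().encode("utf-8-sig")
# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF). The bytes can then
# be uploaded directly, e.g.:
# boto3.resource("s3").Object("bucket", "filename").put(Body=body)
```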

Pandas read_csv throws ValueError while reading gzip file

I am trying to read a gzip file using pandas.read_csv like so:
import pandas as pd
df = pd.read_csv("data.ZIP.gz", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
But it throws this error:
ValueError: Passed header names mismatches usecols
However, if I manually extract the zip file from the gz file, then read_csv is able to read the data without errors:
df = pd.read_csv("data.ZIP", usecols=[*range(0, 39)], encoding="latin1", skipinitialspace=True)
Since I have to read a lot of these files I don't want to manually extract them. So, how can I fix this error?
You have two levels of compression, gzip and zip, but pandas knows how to handle only one level.
You can use the gzip and zipfile modules with io.BytesIO to extract the inner archive to a file-like object in memory.
Here is minimal working code. It is useful if the zip holds many files and you want to select which one to extract:
import pandas as pd
import gzip
import zipfile
import io
with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

with zipfile.ZipFile(file_like_object_1) as f2:
    #print([x.filename for x in f2.filelist])   # list all filenames
    #data = f2.read('data.csv')                 # extract selected filename
    #data = f2.read(f2.filelist[0])             # extract first file
    data = f2.read(f2.filelist[0].filename)     # extract first file
    file_like_object_2 = io.BytesIO(data)
    df = pd.read_csv(file_like_object_2)
    print(df)
But if the zip holds only one file, you can let read_csv extract it directly. You need to add compression='zip' because a file-like object has no filename, so read_csv cannot use the extension to recognize the compression.
import pandas as pd
import gzip
import io
with gzip.open('data.csv.zip.gz') as f1:
    data = f1.read()

file_like_object_1 = io.BytesIO(data)

df = pd.read_csv(file_like_object_1, compression='zip')
print(df)
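Since the real data.csv.zip.gz isn't available here, the same two-level decompression can be demonstrated end to end by building a zip-inside-gzip payload in memory; this is a sketch with made-up csv contents:

```python
import gzip
import io
import zipfile

import pandas as pd

# Build a zip-inside-gzip payload in memory, standing in for data.csv.zip.gz.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zf.writestr("data.csv", "a,b\n1,2\n3,4\n")
payload = gzip.compress(zip_buf.getvalue())

# Strip the outer gzip layer, then let read_csv handle the zip layer.
inner = io.BytesIO(gzip.decompress(payload))
df = pd.read_csv(inner, compression="zip")
```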
Use the gzip module to unzip all your files, something like this:
for file in list_file_names:
    file_name = file.replace(".gz", "")
    with gzip.open(file, 'rb') as f:
        file_content = f.read()
    with open(file_name, "wb") as r:
        r.write(file_content)
You can use the zipfile module, for example:
import zipfile

with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(directory_to_extract_to)

How to write dataframe to csv [duplicate]

I would like to use df.to_csv to write "filename" (with headers) if "filename" doesn't exist, and to append to "filename" if it does exist. If I simply use the command:
df.to_csv('filename.csv', mode='a', header='column_names')
The write or append succeeds, but it seems like the header is written every time an append takes place.
How can I only add the header if the file doesn't exist, and append without header if the file does exist?
Not sure there is a way in pandas, but checking whether the file exists is a simple approach:
import os

# if file does not exist write header
if not os.path.isfile('filename.csv'):
    df.to_csv('filename.csv', header=True)
else:  # else it exists so append without writing the header
    df.to_csv('filename.csv', mode='a', header=False)
with open(filename, 'a') as f:
    df.to_csv(f, mode='a', header=f.tell() == 0)
This adds the header only the first time it writes to the file.
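The f.tell() trick can be verified directly: appending the same frame twice should emit the header exactly once. A sketch with an illustrative frame and a temporary path:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
path = os.path.join(tempfile.mkdtemp(), "filename.csv")

# f.tell() is 0 only before the first write, so the header
# is emitted exactly once across repeated appends.
for _ in range(2):
    with open(path, "a") as f:
        df.to_csv(f, header=f.tell() == 0, index=False)

with open(path) as f:
    lines = f.read().splitlines()
```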
In the pandas DataFrame to_csv function, use header=False if the csv file already exists, and append to the existing file:
import os

hdr = False if os.path.isfile('filename.csv') else True
df.to_csv('filename.csv', mode='a', header=hdr)
The above solutions are great, but I have a moral obligation to include the pathlib solution here:
from pathlib import Path

file_path = Path(filename)
if file_path.exists():
    df.to_csv(file_path, header=False, mode='a')
else:
    df.to_csv(file_path, header=True, mode='w')
Alternatively (depending on your inlining preferences):
file_exists = file_path.exists()
df.to_csv(file_path, header=not file_exists, mode='a' if file_exists else 'w')
Apart from the file-exists check, you can also check for a non-zero file size: it makes sense to write the header if the file exists but is empty, i.e. has no content. I find this helpful in some exceptional cases.
import os.path
header_flag = False if (os.path.exists(fpath) and (os.path.getsize(fpath) > 0)) else True
df.to_csv(fpath, mode='a', index=False, header=header_flag)
In case you have a dict and want to write and append it to a CSV file (note the values are wrapped in lists, since pd.DataFrame raises an error on all-scalar values without an index):
import pandas as pd

file_name = 'data.csv'
my_dict = {"column_1": ["Apple"], "column_2": ["Mango"]}
with open(file_name, 'a') as f:
    df = pd.DataFrame(my_dict)
    df.to_csv(f, mode='a', header=f.tell() == 0)

What function does `with open(...)` serve while parsing a csv file with `pandas`?

I just found a notebook from a book that has the following construction:
filename = 'data/counts.txt'
with open(filename, 'rt') as f:
data_table = pd.read_csv(f, index_col=0) # Parse file with pandas
In what way is that different from simply data_table = pd.read_csv(filename, index_col=0)?
read_csv accepts several different types as its first argument. The documentation says filepath_or_buffer : str, path object, or file-like object.
When you run pd.read_csv(filename, index_col=0) you are asking pandas to find and open the file for reading.
When you run
with open(filename, 'rt') as f:
    data_table = pd.read_csv(f, index_col=0)
you are opening the file beforehand and passing pandas a file object/buffer for it to read.
Both accomplish the same thing. If you want more control about how the file is opened/read, you would do the latter.
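That equivalence is easy to demonstrate; a sketch that writes a small illustrative file to a temporary path and parses it both ways:

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "counts.txt")
pd.DataFrame({"x": [1, 2]}, index=["r1", "r2"]).to_csv(path)

# One call passes a path, the other an already-open file object;
# both parse the same file into the same frame.
by_name = pd.read_csv(path, index_col=0)
with open(path, "rt") as f:
    by_handle = pd.read_csv(f, index_col=0)
```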

