write from pandas dataframe to s3 with utf-8 encoding - python

This code works:
os.system('pip install "s3fs==0.4"')
df.to_csv("s3://path",storage_options=aws_credentials,index=False, sep=';', encoding='utf-8-sig')
However this code doesnot works(code runs successfully without encoding):
csv_buffer = StringIO()
df.to_csv(csv_buffer, sep=";", index=False, encoding='utf-8-sig')
s3_resource = boto3.resource("s3")
s3_resource.Object('bucket', 'filename').put(Body=csv_buffer.getvalue())
How can I use utf-8-sig encoding without the 's3fs' library?
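One way to do this without s3fs, sketched below: to_csv's encoding argument is ignored when you write into a text buffer such as StringIO, so apply utf-8-sig yourself when converting the text to bytes for put(). The bucket and key names are placeholders, as in the question.
import boto3
from io import StringIO

# Build the CSV as text; encoding is not applied to a StringIO buffer.
csv_buffer = StringIO()
df.to_csv(csv_buffer, sep=";", index=False)

# Encode to UTF-8 with a BOM ourselves before uploading.
body = csv_buffer.getvalue().encode("utf-8-sig")

s3_resource = boto3.resource("s3")
s3_resource.Object("bucket", "filename").put(Body=body)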

Related

batch tsv to csv script

I'm pretty new to Python. I wrote this script that batch converts tsv files to csv. I keep getting an error message and have spent hours trying to see what I did wrong. Any help on this will truly be appreciated. The error is "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"
import os
import sys
import shutil
import pandas as pd
import argparse

def main():
    if len(sys.argv) == 1:
        files = [x for x in os.listdir('.') if x.endswith('.tsv')]
    else:
        files = [sys.argv[1]]
    for file in files:
        df = pd.read_csv(file, header=0, sep='\t', encoding='utf-8', quoting=3)
        new_filename = f'{file.replace(".tsv", "")}.csv'
        df.to_csv(new_filename, encoding='utf-8', index=False)
        print(f'Converted file: {new_filename}')
    print('Done!')

if __name__ == '__main__':
    main()
When the file is read into pandas, your code assumes utf-8 encoding; however, a different encoding could have been used when the file was created.
In this line of code:
df = pd.read_csv(file, header=0, sep='\t', encoding='utf-8', quoting=3)
Try setting encoding to a different format.
There are many different encodings you can try; Python's codecs documentation lists the standard ones. I would also recommend opening the file in Notepad, or another text editor, and then saving it as a CSV with UTF-8 encoding.
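In this particular case, a 0xff byte at position 0 is often the first byte of a UTF-16 byte-order mark, so one guess worth trying (a sketch only, assuming the files came from a tool that exports UTF-16) is:
import pandas as pd

# 0xff 0xfe at the start of a file is the UTF-16 little-endian BOM,
# so try decoding the TSV as UTF-16 instead of UTF-8.
df = pd.read_csv(file, header=0, sep='\t', encoding='utf-16', quoting=3)
df.to_csv(new_filename, encoding='utf-8', index=False)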

What function does `with open(...)` serve while parsing a csv file with `pandas`?

I just found a notebook from a book that has the following construction:
filename = 'data/counts.txt'
with open(filename, 'rt') as f:
    data_table = pd.read_csv(f, index_col=0)  # Parse file with pandas
In what way is that different from simply data_table = pd.read_csv(filename, index_col=0)?
read_csv can accept a few different types as its first argument. The documentation describes it as filepath_or_buffer: str, path object, or file-like object.
When you run pd.read_csv(filename, index_col=0) you are asking pandas to find and open the file for reading.
When you run
with open(filename, 'rt') as f:
    data_table = pd.read_csv(f, index_col=0)
you are opening the file yourself beforehand and passing pandas a file object/buffer for it to read.
Both accomplish the same thing. If you want more control about how the file is opened/read, you would do the latter.
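For example, here is a sketch of the kind of control the second form gives you; the encoding and error handling chosen here are purely illustrative assumptions, not anything the book requires:
import pandas as pd

filename = 'data/counts.txt'

# Opening the file ourselves lets us pick the encoding and decide how
# decoding errors are handled before pandas ever sees the data.
with open(filename, 'rt', encoding='latin-1', errors='replace') as f:
    data_table = pd.read_csv(f, index_col=0)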

How to import a text file on AWS S3 into pandas without writing to disk

I have a text file saved on S3 which is a tab-delimited table. I want to load it into pandas, but I cannot save it to disk first because I am running on a Heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
the error is
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: initial_value must be str or None, not StreamingBody
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
pandas uses boto for read_csv, so you should be able to:
import boto
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on Python 3.4+, you can do:
import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1 pandas uses s3fs, see answer below.
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs if you don't have it. pip install s3fs
Authentication
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials config file
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2- Set the following environment variables with their proper values:
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN
This is now supported in latest pandas. See
http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files
eg.,
df = pd.read_csv('s3://pandas-test/tips.csv')
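If the bucket is private, newer pandas (1.2 and later) can also take credentials inline via storage_options, which is passed through to s3fs; a sketch with placeholder credential values:
import pandas as pd

# storage_options is forwarded to s3fs; 'key'/'secret' are its credential names.
df = pd.read_csv(
    's3://pandas-test/tips.csv',
    storage_options={'key': 'YOUR_ACCESS_KEY_ID', 'secret': 'YOUR_SECRET_ACCESS_KEY'},
)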
For Python 3.6+, Amazon now has a really nice library for using pandas with their services, called awswrangler.
import awswrangler as wr
import boto3
# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX',
                                aws_secret_access_key='XXXX')
# awswrangler passes forward all pd.read_csv() function args
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])
To install awswrangler: pip install awswrangler
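awswrangler also has a matching writer if you need to go the other way; a minimal sketch (the output path is a placeholder, and the keyword arguments are forwarded to pandas' to_csv):
import awswrangler as wr

# Write the dataframe back to S3 as CSV, reusing the session from above.
wr.s3.to_csv(df=df, path='s3://bucket/path/output.csv',
             index=False, sep=';', boto3_session=session)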
With s3fs it can be done as follows:
import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)
# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)
# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
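Writing back through s3fs works the same way; a sketch that opens the object in binary mode so the encoding step is explicit (the path is a placeholder):
# Reuse the same filesystem object; encode the CSV text ourselves.
with fs.open('mybucket/path/to/object/out.csv', 'wb') as f:
    f.write(df.to_csv(index=False).encode('utf-8'))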
Since the files can be too large, it is not wise to load them into the dataframe all at once. Instead, read line by line and build the dataframe from that. Yes, we can also pass a chunk size to read_csv, but then we have to keep track of the number of rows read.
Hence, I came up with this approach:
import codecs
from io import StringIO

import pandas as pd

def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# Decode the streaming body line by line and parse each line separately.
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")
I also delete the df once work is done.
del df
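For comparison, the built-in chunking mentioned above looks roughly like this; the whole object is still downloaded, but it is parsed in fixed-size pieces (bucket, key, and the per-chunk handler are placeholders):
import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')

# Parse the CSV in chunks of 10,000 rows instead of one huge dataframe.
for chunk in pd.read_csv(io.BytesIO(obj['Body'].read()), chunksize=10000):
    process(chunk)  # hypothetical per-chunk handler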
For text files, you can use the code below with a pipe-delimited file, for example:
import pandas as pd
import io
import boto3
s3_client = boto3.client('s3', use_ssl=False)
bucket = #
prefix = #
obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
An option is to convert the dataframe to a dict string via df.to_dict() and then store that as a string. Note this is only relevant if CSV is not a requirement and you just want to quickly put the dataframe in an S3 bucket and retrieve it again.
from boto.s3.connection import S3Connection
import pandas as pd
import yaml
conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")
myKey.set_contents_from_string(str(df.to_dict()))
This will convert the df to a dict string and save it in S3. You can later read it back in the same format:
df = pd.DataFrame(yaml.load(myKey.get_contents_as_string()))
The other solutions are also good, but this is a little simpler. Yaml may not necessarily be required but you need something to parse the json string. If the S3 file doesn't necessarily need to be a CSV this can be a quick fix.
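If proper JSON is acceptable, a slightly cleaner variant (a sketch, assuming the same boto key object as above) is to let pandas do the serialisation, which removes the yaml dependency:
from io import StringIO
import pandas as pd

# Serialise with pandas' own JSON round trip instead of str(df.to_dict()).
myKey.set_contents_from_string(df.to_json())
df = pd.read_json(StringIO(myKey.get_contents_as_string().decode('utf-8')))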
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open(<s3_path>))

Apply GZIP compression to a CSV in Python Pandas

I am trying to write a dataframe to a gzipped csv in python pandas, using the following:
import pandas as pd
import datetime
import csv
import gzip
# Get data (with previous connection and script variables)
df = pd.read_sql_query(script, conn)
# Create today's date, to append to file
todaysdatestring = str(datetime.datetime.today().strftime('%Y%m%d'))
print todaysdatestring
# Create csv with gzip compression
df.to_csv('foo-%s.csv.gz' % todaysdatestring,
          sep='|',
          header=True,
          index=False,
          quoting=csv.QUOTE_ALL,
          compression='gzip',
          quotechar='"',
          doublequote=True,
          line_terminator='\n')
This just creates a csv called 'foo-YYYYMMDD.csv.gz', not an actual gzip archive.
I've also tried adding this:
#Turn to_csv statement into a variable
d = df.to_csv('foo-%s.csv.gz' % todaysdatestring,
              sep='|',
              header=True,
              index=False,
              quoting=csv.QUOTE_ALL,
              compression='gzip',
              quotechar='"',
              doublequote=True,
              line_terminator='\n')
# Write above variable to gzip
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as output:
    output.write(d)
Which fails as well. Any ideas?
Using df.to_csv() with the keyword argument compression='gzip' should produce a gzip archive. I tested it using the same keyword arguments as you, and it worked.
You may need to upgrade pandas, as gzip was not implemented until version 0.17.1; trying to use it on prior versions will not raise an error, it will just produce a regular csv. You can determine your current version of pandas by looking at the output of pd.__version__.
It is done very easily with pandas
import pandas as pd
Write a pandas dataframe to disk as a gzip-compressed csv
df.to_csv('dfsavename.csv.gz', compression='gzip')
Read from disk
df = pd.read_csv('dfsavename.csv.gz', compression='gzip')
From documentation
import gzip
content = "Lots of content here"
with gzip.open('file.txt.gz', 'wb') as f:
    f.write(content)
with pandas
import gzip
content = df.to_csv(
    sep='|',
    header=True,
    index=False,
    quoting=csv.QUOTE_ALL,
    quotechar='"',
    doublequote=True,
    line_terminator='\n')
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as f:
    f.write(content)
The trick here is that to_csv returns the output as text if you don't pass it a filename. Then you just redirect that text to gzip's write method.
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wb') as f:
    f.write(df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL))
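One caveat, hedged since the question is written for Python 2: under Python 3, gzip.open(..., 'wb') expects bytes, so either encode the string returned by to_csv or open the archive in text mode, for example:
import csv
import gzip

# Python 3: text mode lets us write the str returned by to_csv directly.
with gzip.open('foo-%s.csv.gz' % todaysdatestring, 'wt', newline='') as f:
    f.write(df.to_csv(sep='|', index=False, quoting=csv.QUOTE_ALL))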

Which encoding to use while reading Excel using xlrd

I am trying to read an Excel file using xlrd and write it out to txt files. Everything is written fine except for some rows which have Spanish characters like 'Téd'. I can encode those using latin-1 encoding. However, the code then fails for other rows which have a '–' with unicode u'\u2013'; u'\u2013' can't be encoded using latin-1. When using UTF-8, the '–' characters are written out fine, but 'Téd' is written as 'TÃ©d', which is not acceptable. How do I correct this?
Code below :
#!/usr/bin/python
import xlrd
import csv
import sys

filePath = sys.argv[1]

with xlrd.open_workbook(filePath) as wb:
    shNames = wb.sheet_names()
    for shName in shNames:
        sh = wb.sheet_by_name(shName)
        csvFile = shName + ".csv"
        with open(csvFile, 'wb') as f:
            c = csv.writer(f)
            for row in range(sh.nrows):
                sh_row = []
                cell = ''
                for item in sh.row_values(row):
                    if isinstance(item, float):
                        cell = item
                    else:
                        cell = item.encode('utf-8')
                    sh_row.append(cell)
                    cell = ''
                c.writerow(sh_row)
        print shName + ".csv File Created"
Python's csv module doesn't support Unicode input.
You are correctly encoding your input before writing it -- so you don't need codecs. Just open(csvFile, "wb") (the b is important) and pass that object to the writer:
with open(csvFile, "wb") as f:
    writer = csv.writer(f)
    writer.writerow([entry.encode("utf-8") for entry in row])
Alternatively, unicodecsv is a drop-in replacement for csv that handles encoding.
You are getting Ã© instead of é because you are mistaking UTF-8 encoded text for latin-1. This is probably because you're encoding twice, once with .encode("utf-8") and once via codecs.open.
By the way, the right way to check the type of an xlrd cell is to compare cell.ctype against one of the xlrd.XL_CELL_* constants (for example xlrd.XL_CELL_NUMBER or xlrd.XL_CELL_TEXT).
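As an aside, hedged because the question is clearly Python 2: on Python 3 the per-cell encode and the 'wb' mode disappear, since csv handles Unicode natively when the file is opened as text. A sketch reusing the names from the question:
import csv

# Python 3: open as text with an explicit encoding; no .encode('utf-8') needed.
with open(csvFile, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for row in range(sh.nrows):
        writer.writerow(sh.row_values(row))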
