Read Excel file with pandas from URL response - python

I am looking for how to read an xlsx file with pandas when the file is hosted on SharePoint. The contents, when shown through response.text, are a string holding the binary representation of the file:
PK... [Content_Types].xml ... (garbled binary of the xlsx ZIP container; only the PK signature and the [Content_Types].xml entry are recognizable)
I would like to know how to read this format into memory so that I can call pd.read_excel with it.
I've tried to use urllib and openpyxl, in this manner:
import openpyxl as excel
import pandas as pd
from io import BytesIO
import urllib.request

req = urllib.request.Request(url=url, data=payload, headers=headers)
with urllib.request.urlopen(req) as response:
    rsp = response.read()
    # openpyxl expects a path or a file-like object, not raw bytes
    excel.load_workbook(filename=BytesIO(rsp))
But I am getting a 400 Bad Request error from the urllib request module.
The url looks something like this:
https://company.sharepoint.com/sites/test-department/_api/Web/GetFileByServerRelativeUrl('/sites/test-department/DepartmentDocuments/test/Book1.xlsx')/$value?binaryStringResponseBody=true

I found a way to do it. The key was to seek back to the start of the buffer before passing the file to pandas.
file_ext = self.file_name.split('.')[-1]
if file_ext == 'xlsx':
    import pandas as pd
    from io import BytesIO

    # copy the raw response bytes into an in-memory buffer
    xl = bytes(memoryview(response.content))
    memfile = BytesIO()
    memfile.write(xl)
    memfile.seek(0)  # rewind so pandas reads from the beginning
    df = pd.read_excel(memfile, engine='openpyxl')
    print(df.head(10))
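For what it's worth, the intermediate buffer can be skipped: a fresh BytesIO built over the response body is already positioned at the start. A minimal sketch, assuming the request is made with requests and that url and headers carry a valid SharePoint endpoint and authentication (neither is defined here):

import pandas as pd
import requests
from io import BytesIO

# url and headers are assumed to come from the question's setup
response = requests.get(url, headers=headers)
response.raise_for_status()  # surfaces the 400 instead of handing bad bytes to pandas

# a new BytesIO over the body starts at position 0, so no explicit seek is needed
df = pd.read_excel(BytesIO(response.content), engine='openpyxl')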

Related

Read compressed GRIB file from URL into xarray Dataset entirely in memory in Python

I am trying to read the gzipped grib2 files at this URL: https://mtarchive.geol.iastate.edu/2022/12/24/mrms/ncep/SeamlessHSR/
I want to read the grib file into an xarray Dataset. I know I could write a script to download the file to disk, decompress it, and read it in, but ideally I want to be able to do this entirely in memory.
I feel like I should be able to do this with some combination of the urllib and gzip packages, but I can't quite figure it out.
I have the following code so far:
import urllib.request
import io
import gzip

URL = 'https://mtarchive.geol.iastate.edu/2022/12/24/mrms/ncep/SeamlessHSR/SeamlessHSR_00.00_20221224-000000.grib2.gz'
response = urllib.request.urlopen(URL)
compressed_file = io.BytesIO(response.read())
decompressed_file = gzip.GzipFile(fileobj=compressed_file)
But I can't figure out how to read decompressed_file into xarray.
Bonus points if you can figure out how to open_mfdataset on all of the URLs there at once.
One way that works for me is writing the decompressed data into a temporary file which can then be opened with xarray.
import urllib.request
import gzip
import tempfile
import xarray as xr

URL = 'https://mtarchive.geol.iastate.edu/2022/12/24/mrms/ncep/SeamlessHSR/SeamlessHSR_00.00_20221224-000000.grib2.gz'
response = urllib.request.urlopen(URL)
compressed_file = response.read()

with tempfile.NamedTemporaryFile(suffix=".grib2") as f:
    f.write(gzip.decompress(compressed_file))
    f.flush()  # make sure the bytes are on disk before xarray opens the path
    xx = xr.load_dataset(f.name)

display(xx)  # display() assumes an IPython/Jupyter session
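For the bonus question, a sketch rather than a tested solution: it assumes the cfgrib engine is installed (pip install cfgrib) and that the directory index exposes plain href links to the .gz files. Each file is decompressed to a temporary .grib2 path and the list of paths goes to xr.open_mfdataset:

import gzip
import re
import tempfile
import urllib.request
import xarray as xr

BASE = 'https://mtarchive.geol.iastate.edu/2022/12/24/mrms/ncep/SeamlessHSR/'

# scrape .gz hrefs from the listing (a simple regex, assuming Apache-style index HTML)
listing = urllib.request.urlopen(BASE).read().decode()
names = sorted(set(re.findall(r'href="([^"]+\.grib2\.gz)"', listing)))

paths = []
for name in names[:4]:  # a few files, to keep the example small
    raw = gzip.decompress(urllib.request.urlopen(BASE + name).read())
    tmp = tempfile.NamedTemporaryFile(suffix='.grib2', delete=False)
    tmp.write(raw)
    tmp.close()
    paths.append(tmp.name)

# stack the per-file datasets along time; requires the cfgrib engine
ds = xr.open_mfdataset(paths, engine='cfgrib', combine='nested', concat_dim='time')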

How do I get my .xlsx file I created using Pandas (python) to save to a different file location?

I've just started out with Pandas and I have gotten my xls file to convert into an xlsx file using Pandas. However, I now want the file to save to a different location, such as OneDrive. I was wondering if you could help me out?
Here is the code I have written for it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
#Deleting original file
path = (r"C:\Users\MQ\Downloads\Incident Report.xls")
os.remove(path)
print("Original file has been deleted :)")
#Identifying the xls file
excel_file_1 = 'Incident Report.xls'
df_first_shift = pd.read_excel(r'C:\Users\MQ\3D Objects\New Folder\Incident Report.xls')
print(df_first_shift)
#combining data
df_all = pd.concat([df_first_shift])
print(df_all)
#Creating the .xlsx file
df_all.to_excel("Incident_Report.xlsx")
Use pd.ExcelWriter by passing in your destination path!
destination_path = "path\\to\\your\\onedrive\\filename.xlsx"
writer = pd.ExcelWriter(destination_path, engine='xlsxwriter')
df.to_excel(writer, sheet_name='sheetname')
writer.save()
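One caveat, not in the original answer: newer pandas releases (2.0 and later) removed ExcelWriter.save(), so with a current install the context-manager form is the safer idiom. A minimal sketch, reusing destination_path and df from above:

import pandas as pd

# the with-block flushes and closes the workbook on exit, replacing writer.save()
with pd.ExcelWriter(destination_path, engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='sheetname')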
To write to cloud OneDrive, the following code is suggested. I did not run it, but offer it as a starting point.
Refer to www.lieben.nu's example for uploading a file to OneDrive for Business.
import requests
import io
import pandas as pd

def cloudOneDrive(filename, bytesIO):
    '''
    Reference: https://www.lieben.nu/liebensraum/2019/04/uploading-a-file-to-onedrive-for-business-with-python/
    Write to cloud OneDrive from an in-memory bytes buffer.
    '''
    data = {'grant_type': "client_credentials",
            'resource': "https://graph.microsoft.com",
            'client_id': 'XXXXX',
            'client_secret': 'XXXXX'}
    URL = "https://login.windows.net/YOURTENANTDOMAINNAME/oauth2/token?api-version=1.0"
    # FIXME: exchange `data` for a bearer token, build `headers`, and PUT to the
    # Graph upload URL for the file -- the original suggestion leaves these steps out
    r = requests.put(URL + "/" + filename + ":/content", data=bytesIO, headers=headers)
    if r.status_code in (200, 201):
        print("succeeded")
        return True
    else:
        print("Fail", r.status_code)

fn = 'junk.xlsx'
with io.BytesIO() as bio:
    with pd.ExcelWriter(bio) as xio:
        df.to_excel(xio, sheet_name='sh1')  # df is the dataframe to upload
    bio.seek(0)
    cloudOneDrive(fn, bio)
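The FIXME step above, trading the client credentials for a bearer token, might look like the following sketch. This is an assumption based on the Azure AD v1 client-credentials flow that the lieben.nu post describes, not code from the original answer; the tenant name and client values are the same placeholders as above:

import requests

# placeholders carried over from the suggestion above
token_url = "https://login.windows.net/YOURTENANTDOMAINNAME/oauth2/token?api-version=1.0"
data = {'grant_type': "client_credentials",
        'resource': "https://graph.microsoft.com",
        'client_id': 'XXXXX',
        'client_secret': 'XXXXX'}

# a successful response carries the bearer token in the 'access_token' field
token = requests.post(token_url, data=data).json()["access_token"]
headers = {"Authorization": "Bearer " + token}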

Download a dataset which is a zip file containing lots of csv files, in a notebook, for data analysis

I am doing a data science project.
I am using a Google notebook for the job.
My dataset resides here (link), and I want to access it directly from the Python notebook.
I am using the following line of code to read it:
df = pd.read_csv('link')
But the command is throwing an error like the one below.
What should I do?
It's difficult to answer exactly as the question lacks detail, but here is an approach for this kind of request:
import ZipFile and urlopen to fetch the data from the URL, extract the contents of the zip, and then read the csv files with pandas.
from zipfile import ZipFile
from urllib.request import urlopen
import pandas as pd
import os

URL = 'https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip'
zip_name = '05d2b4ea-c-Dataset.zip'

# open the URL and save the zip file onto the computer
url = urlopen(URL)
output = open(zip_name, 'wb')  # note the flag: "wb"
output.write(url.read())
output.close()

# read the zip file as a pandas dataframe
# (pd.read_csv can only read a zip archive that contains a single file)
df = pd.read_csv(zip_name)

# if keeping the zip file on disk is not wanted, then:
os.remove(zip_name)  # remove the copy of the zipfile on disk
Use the urllib module to download the zip file into memory; it returns a file-like object that you can read() and pass to ZipFile (from the standard library).
Since there are multiple files in the archive, like
['test_data/AggregateData_Test.csv', 'test_data/TransactionData_Test.csv', 'train_data/AggregateData_Train.csv', 'train_data/Column_Descriptions.xlsx', 'train_data/sample_submission.csv', 'train_data/TransactionData_Train.csv']
load it into a dict of dataframes with the filename as the key. Altogether the code will be:
from urllib.request import urlopen
from zipfile import ZipFile
from io import BytesIO
import pandas as pd

zip_in_memory = urlopen("https://he-s3.s3.amazonaws.com/media/hackathon/hdfc-bank-ml-hiring-challenge/application-scorecard-for-customers/05d2b4ea-c-Dataset.zip").read()
z = ZipFile(BytesIO(zip_in_memory))

# one dataframe per csv member, keyed by its path inside the archive
dict_of_dfs = {file.filename: pd.read_csv(z.open(file.filename))
               for file in z.infolist()
               if file.filename.endswith('.csv')}
Now you can access the dataframe for each csv like dict_of_dfs['test_data/AggregateData_Test.csv'].
Of course, all of this is unnecessary if you are willing to download the zip from the link and pass the local file to ZipFile.
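A side note, not from the original answers: when a zip archive holds exactly one csv, pandas can read it straight from the URL in a single call, so the ZipFile machinery is only needed for multi-file archives like this one. The URL below is a hypothetical single-file example:

import pandas as pd

# works only when the archive contains exactly one file;
# the multi-csv dataset in this question still needs the ZipFile approach above
df = pd.read_csv("https://example.com/single-file-dataset.zip", compression="zip")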

How to import a text file on AWS S3 into pandas without writing to disk

I have a text file saved on S3 which is a tab delimited table. I want to load it into pandas but cannot save it first because I am running on a heroku server. Here is what I have so far.
import io
import boto3
import os
import pandas as pd
os.environ["AWS_ACCESS_KEY_ID"] = "xxxxxxxx"
os.environ["AWS_SECRET_ACCESS_KEY"] = "xxxxxxxx"
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket",Key="filename.txt")
file = response["Body"]
pd.read_csv(file, header=14, delimiter="\t", low_memory=False)
the error is
OSError: Expected file path name or file-like object, got <class 'bytes'> type
How do I convert the response body into a format pandas will accept?
pd.read_csv(io.StringIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: initial_value must be str or None, not StreamingBody
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
returns
TypeError: 'StreamingBody' does not support the buffer interface
UPDATE - Using the following worked
file = response["Body"].read()
and
pd.read_csv(io.BytesIO(file), header=14, delimiter="\t", low_memory=False)
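As an aside (my note, not part of the original update): pd.read_csv accepts any object with a read() method, and boto3's StreamingBody has one, so with recent pandas/botocore versions the body can usually be passed straight through without buffering it first:

import boto3
import pandas as pd

s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket="my_bucket", Key="filename.txt")

# StreamingBody is file-like, so pandas can consume it directly
df = pd.read_csv(response["Body"], header=14, delimiter="\t", low_memory=False)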
pandas uses boto for read_csv, so you should be able to:
import boto
data = pd.read_csv('s3://bucket....csv')
If you need boto3 because you are on python3.4+, you can
import boto3
import io
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket', Key='key')
df = pd.read_csv(io.BytesIO(obj['Body'].read()))
Since version 0.20.1 pandas uses s3fs, see answer below.
Now pandas can handle S3 URLs. You could simply do:
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/file.csv')
You need to install s3fs if you don't have it. pip install s3fs
Authentication
If your S3 bucket is private and requires authentication, you have two options:
1- Add access credentials to your ~/.aws/credentials config file
[default]
aws_access_key_id=AKIAIOSFODNN7EXAMPLE
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Or
2- Set the following environment variables with their proper values:
aws_access_key_id
aws_secret_access_key
aws_session_token
This is now supported in recent pandas versions. See
http://pandas.pydata.org/pandas-docs/stable/io.html#reading-remote-files
eg.,
df = pd.read_csv('s3://pandas-test/tips.csv')
For python 3.6+, Amazon now has a really nice library for using pandas with their services, called awswrangler.
import awswrangler as wr
import boto3

# Boto3 session
session = boto3.session.Session(aws_access_key_id='XXXX',
                                aws_secret_access_key='XXXX')

# awswrangler passes all pd.read_csv() keyword args forward
df = wr.s3.read_csv(path='s3://bucket/path/',
                    boto3_session=session,
                    skiprows=2,
                    sep=';',
                    decimal=',',
                    na_values=['--'])
To install awswrangler: pip install awswrangler
With s3fs it can be done as follows:
import s3fs
import pandas as pd

fs = s3fs.S3FileSystem(anon=False)

# CSV
with fs.open('mybucket/path/to/object/foo.csv') as f:
    df = pd.read_csv(f)

# Pickle
with fs.open('mybucket/path/to/object/foo.pkl') as f:
    df = pd.read_pickle(f)
Since the files can be too large to load into the dataframe all at once, it is better to read them in pieces. Yes, read_csv can also take a chunk size, but then we have to keep track of the number of rows read. Hence, I came up with this approach:
import codecs
from io import StringIO
import pandas as pd

# excerpt from a class; self.bucket, self.response and self.encoding are set elsewhere
def create_file_object_for_streaming(self):
    print("creating file object for streaming")
    self.file_object = self.bucket.Object(key=self.package_s3_key)
    print("File object is: " + str(self.file_object))
    print("Object file created.")
    return self.file_object

# decode the streaming body line by line and build a dataframe per row
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    df = pd.read_csv(row_string, sep=",")
I also delete the df once work is done.
del df
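The chunksize route mentioned above is usually simpler than decoding line by line, and it streams too, since the body is consumed incrementally. A sketch, with placeholder bucket/key names and a hypothetical process() handler:

import boto3
import pandas as pd

s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket='my_bucket', Key='big_file.csv')

# read_csv pulls from the streaming body in chunks of at most 10,000 rows,
# so the full file never sits in memory as one dataframe
for chunk in pd.read_csv(obj['Body'], chunksize=10_000):
    process(chunk)  # hypothetical per-chunk handler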
For text files, you can use the code below with a pipe-delimited file, for example:
import pandas as pd
import io
import boto3

s3_client = boto3.client('s3', use_ssl=False)
bucket = ...   # fill in your bucket name
prefix = ...   # fill in your key prefix

obj = s3_client.get_object(Bucket=bucket, Key=prefix + filename)
df = pd.read_fwf(io.BytesIO(obj['Body'].read()), encoding='unicode_escape',
                 delimiter='|', error_bad_lines=False, header=None, dtype=str)
An option is to convert the dataframe to a dict string via df.to_dict() and then store that as text. Note this is only relevant if the CSV format is not a requirement, but you just want to quickly put the dataframe in an S3 bucket and retrieve it again.
from boto.s3.connection import S3Connection
import pandas as pd
import yaml
conn = S3Connection()
mybucket = conn.get_bucket('mybucketName')
myKey = mybucket.get_key("myKeyName")
myKey.set_contents_from_string(str(df.to_dict()))
This will convert the df to a dict string and save it in S3. You can later read it back in the same format:
df = pd.DataFrame(yaml.load(myKey.get_contents_as_string()))
The other solutions are also good, but this is a little simpler. YAML isn't strictly required, but you need something to parse the dict string (the Python repr uses single quotes, so it is not valid JSON, which is why yaml.load is used rather than json.loads). If the S3 file doesn't necessarily need to be a CSV, this can be a quick fix.
import s3fs
import pandas as pd
s3 = s3fs.S3FileSystem(profile='<profile_name>')
pd.read_csv(s3.open(<s3_path>))

Django Pandas to http response (download file)

Python: 2.7.11
Django: 1.9
Pandas: 0.17.1
How should I go about creating a potentially large xlsx file download? I'm creating a xlsx file with pandas from a list of dictionaries and now need to give the user possibility to download it. The list is in a variable and is not allowed to be saved locally (on server).
Example:
df = pandas.DataFrame(self.csvdict)
writer = pandas.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
writer.save()
This example would just create the file and save it where the executing script is located. What I need is to create it to a http response so that the user would get a download prompt.
I have found a few posts about doing this with xlsxwriter but none for pandas. I also think that I should be using 'StreamingHttpResponse' for this and not a 'HttpResponse'.
I will elaborate on what @jmcnamara wrote. This is for the latest versions of Excel, pandas, and Django. The import statements would be at the top of your views.py and the remaining code could be in a view:
import pandas as pd
from django.http import HttpResponse

try:
    from io import BytesIO as IO  # for modern python
except ImportError:
    from io import StringIO as IO  # for legacy python

# this is my output data, a list of lists
output = some_function()
df_output = pd.DataFrame(output)

# my "Excel" file, which is an in-memory output file (buffer)
# for the new workbook
excel_file = IO()

xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')
df_output.to_excel(xlwriter, 'sheetname')
xlwriter.save()
xlwriter.close()

# important step, rewind the buffer or when it is read() you'll get nothing
# but an error message when you try to open your zero length file in Excel
excel_file.seek(0)

# set the mime type so that the browser knows what to do with the file
response = HttpResponse(excel_file.read(),
                        content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

# set the file name in the Content-Disposition header
response['Content-Disposition'] = 'attachment; filename=myfile.xlsx'

return response
Jmcnamara is pointing you in the right direction. Translated to your question, you are looking for the following code:
from io import BytesIO  # xlsx data is binary, so on Python 3 this must be BytesIO, not StringIO
from django.http import StreamingHttpResponse
import pandas

sio = BytesIO()
PandasDataFrame = pandas.DataFrame(self.csvdict)
PandasWriter = pandas.ExcelWriter(sio, engine='xlsxwriter')
PandasDataFrame.to_excel(PandasWriter, sheet_name=sheetname)
PandasWriter.save()
sio.seek(0)
workbook = sio.getvalue()

response = StreamingHttpResponse(workbook, content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
response['Content-Disposition'] = 'attachment; filename=%s' % filename
Notice that you are saving the data to an in-memory buffer and not to a file location. This way you prevent the file being written to disk before you generate the response.
Maybe a bit off-topic, but it's worth pointing out that the to_csv method is generally faster than to_excel, since an xlsx file also carries the sheets' formatting information. If you only have data and no formatting requirements, consider to_csv. Microsoft Excel can view and edit csv files with no problem.
One gain of using to_csv is that it can take any file-like object as the first argument, not only a filename string. Since the Django response object is file-like, to_csv can write to it directly. The code in your view function would look like:
df = <your dataframe to be downloaded>
response = HttpResponse(content_type='text/csv')
response['Content-Disposition'] = 'attachment; filename=<default filename you wanted to give to the downloaded file>'
df.to_csv(response, index=False)
return response
Reference:
https://gist.github.com/jonperron/733c3ead188f72f0a8a6f39e3d89295d
https://docs.djangoproject.com/en/2.1/howto/outputting-csv/
With Pandas 0.17+ you can use a StringIO/BytesIO object as a filehandle to pd.ExcelWriter. For example:
import pandas as pd
import io

output = io.BytesIO()

# Use the BytesIO object as the filehandle.
writer = pd.ExcelWriter(output, engine='xlsxwriter')

# Write the data frame to the BytesIO object.
pd.DataFrame().to_excel(writer, sheet_name='Sheet1')
writer.save()

xlsx_data = output.getvalue()
print(len(xlsx_data))
After that follow the XlsxWriter Python 2/3 HTTP examples.
For older versions of Pandas you can use this workaround.
Just wanted to share a class-based view approach to this, using elements from the answers above. Just override the get method of a Django View. My model has a JSON field which contains the results of dumping a dataframe to JSON with the to_json method.
Python version is 3.6 with Django 1.11.
# models.py
from django.db import models
from django.contrib.postgres.fields import JSONField

class myModel(models.Model):
    json_field = JSONField(verbose_name="JSON data")

# views.py
import pandas as pd
from io import BytesIO as IO
from django.http import HttpResponse
from django.views import View

from .models import myModel

class ExcelFileDownloadView(View):
    """
    Allows the user to download records in an Excel file
    """

    def get(self, request, *args, **kwargs):
        obj = myModel.objects.get(pk=self.kwargs['pk'])

        excel_file = IO()
        xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')
        pd.read_json(obj.json_field).to_excel(xlwriter, "Summary")
        xlwriter.save()
        xlwriter.close()

        excel_file.seek(0)

        response = HttpResponse(excel_file.read(),
                                content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
        response['Content-Disposition'] = 'attachment; filename="excel_file.xlsx"'
        return response

# urls.py
from django.conf.urls import url

from .views import ExcelFileDownloadView

urlpatterns = [
    url(r'^mymodel/(?P<pk>\d+)/download/$', ExcelFileDownloadView.as_view(), name="excel-download"),
]
You're mixing two requirements that should be separate:
Creating a .xlsx file using python or pandas--it looks like you're good on this part.
Serving a downloadable file (django); see this post or maybe this one
