Requesting a .gz file then opening as a dataframe - python

Very new Python user here. I have a link to a zipped data set (csv.gz) that I'm trying to get into Python without first downloading it to my desktop, unzipping it, and uploading it to GitHub in order to use pd.read_csv. The link is of a form where I think I'm meant to request it using requests, but I'm getting a 403 status code. I can't tell if I'm requesting it wrong or if the .gz is the problem.
This is how I'm trying to import it.
import requests
query = {'AWSAccessKeyId':'theaccesskey', 'Signature':'thesignature','Expires':'12345'}
url = 'https://s3.amazonaws.com/research.insideairbnb.com/data/united-states/or/portland/2015-12-02/data/listings.csv.gz'
response = requests.get(url, params=query)
But even assuming I can figure out how to get a 200 status code, how do I go from the .gz response to a DataFrame I can work with in pandas?

You can use io.BytesIO with pandas.read_csv:
import requests
import pandas as pd
from io import BytesIO
url = 'https://github.com/apache-superset/examples-data/blob/master/san_francisco.csv.gz'
# This is required to access raw binary files on Github
# i.e. it appends the following to the URL: `?raw=true`.
query = {'raw': 'true'}
# Use the requests module to parse the URL with the provided parameters.
response = requests.get(url, params=query)
# Create a file pointer initialized to the content of the response
# using `BytesIO`. This is a pseudo-file, which can now be read
# using `pandas.read_csv`. Since `response.content` is binary data
# i.e. bytes, we use `BytesIO`. If the response was text, we would
# have used `StringIO`.
fp = BytesIO(response.content)
# Finally, parse the content into a DataFrame
# (populate other parameters as needed).
df = pd.read_csv(fp, compression='gzip')
print(df)
This should return the contents of the csv.gz file as a DataFrame. Using the URL in this example yields the following output:
LON LAT NUMBER STREET UNIT CITY DISTRICT REGION POSTCODE ID
0 -122.391267 37.769093 1550 04th Street NaN NaN NaN NaN 94158 NaN
1 -122.390850 37.769426 1505 04th Street NaN NaN NaN NaN 94158 NaN
2 -122.428577 37.780627 1160 Buchanan Street NaN NaN NaN NaN 94115 NaN
3 -122.428534 37.780385 1142 Buchanan Street NaN NaN NaN NaN 94115 NaN
4 -122.428525 37.780317 1140 Buchanan Street NaN NaN NaN NaN 94115 NaN
... ... ... ... ... ... ... ... ... ... ..
261547 -122.418380 37.808349 360 Jefferson Street NaN NaN NaN NaN 94133 NaN
261548 -122.418380 37.808349 350 Jefferson Street NaN NaN NaN NaN 94133 NaN
261549 -122.417829 37.807479 333 Jefferson Street NaN NaN NaN NaN 94133 NaN
261550 -122.418916 37.809044 1965 Al Scoma Way NaN NaN NaN NaN 94133 NaN
261551 -122.444322 37.749124 350 Glenview Drive NaN NaN NaN NaN 94131 NaN
I used a sample csv.gz file I found online, since the URL and parameters from the question yield a JPEG image instead, which is a bit puzzling. In any case, adapt this snippet of code to your case and it should produce the desired results.
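As a side note, if the file is reachable with a plain GET request (no signed query parameters needed), pandas can usually read a gzipped CSV straight from the URL and infer the compression from the .csv.gz extension, so the requests step can be skipped entirely. A minimal sketch, assuming the Inside Airbnb file from the question is publicly accessible without the signature:
import pandas as pd

url = 'https://s3.amazonaws.com/research.insideairbnb.com/data/united-states/or/portland/2015-12-02/data/listings.csv.gz'
# read_csv accepts URLs; compression='infer' (the default) picks gzip from the
# extension, or pass compression='gzip' explicitly if the name is ambiguous.
df = pd.read_csv(url, compression='gzip')
print(df.head())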

Related

Pandas dataframe only reading first value, NaN for everything else

I am attempting to read a csv with pandas and then insert into a SQL table. I am reading the data from the csv correctly when I print(data), but once I add it into the dataframe it is only reading the very first column, and is inserting NaN for every other value in the csv. Code and output below:
data = pd.read_csv(localFilePath)
print(data)
df = pd.DataFrame(data, columns= ['Date','EECode','LastName','FirstName', \
    'HomeDepartmentCode','HomeDepartmentDesc','PayClass','InPunchTime', \
    'OutPunchTime','DepartmentCode','DepartmentDesc','JobCodesCode', \
    'JobCodesDesc','TeamCode','TeamDesc','EarnCode'])
print(df)
for row in df.itertuples():
    SQLInsert = ('''
        INSERT INTO [Reporting].[dbo].[Paycom_Missing_Punch]
        (Date, EECode, LastName, FirstName, HomeDepartmentCode,
        HomeDepartmentDesc, PayClass, InPunchTime, OutPunchTime,
        DepartmentCode, DepartmentDesc, JobCodesCode, JobCodesDesc,
        TeamCode, TeamDesc, EarnCode)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
        ''')
    args = row.Date, row.EECode, row.LastName, row.FirstName, \
        row.HomeDepartmentCode, row.HomeDepartmentDesc, row.PayClass, row.InPunchTime, \
        row.OutPunchTime, row.DepartmentCode, row.DepartmentDesc, row.JobCodesCode, \
        row.JobCodesDesc, row.TeamCode, row.TeamDesc, row.EarnCode
    #print(SQLInsert)
    #print(args)
    cursor.execute(SQLInsert, args)
conn.commit()
Output when I print(data):
Date EE Code ... Team Desc Earn Code
0 01/21/2021 1435 ... Indiana DWD NaN
1 01/21/2021 1435 ... Indiana DWD NaN
2 01/22/2021 1180 ... Supervisors NaN
3 01/21/2021 1664 ... Technical Support Desk NaN
4 01/21/2021 1078 ... Supervisors NaN
Output once I add it to the dataframe:
Date EECode LastName ... TeamCode TeamDesc EarnCode
0 01/21/2021 NaN NaN ... NaN NaN NaN
1 01/21/2021 NaN NaN ... NaN NaN NaN
2 01/22/2021 NaN NaN ... NaN NaN NaN
3 01/21/2021 NaN NaN ... NaN NaN NaN
4 01/21/2021 NaN NaN ... NaN NaN NaN
I assume the problem is how I am passing the values to the dataframe, but from everything I have read or seen, the way I am doing it looks correct.
The problem is the way you're building the df. pd.read_csv already returns a DataFrame whose columns are named exactly as in the CSV (e.g. 'EE Code', with a space). When you then build a second DataFrame from it with columns=, pandas selects columns by those new names, and since they don't exist (apart from Date) you get NaN everywhere. To fix your problem, read the CSV once and rename the columns afterwards:
>>> col_names = ['Date','EECode','LastName','FirstName', \
'HomeDepartmentCode','HomeDepartmentDesc','PayClass','InPunchTime', \
'OutPunchTime','DepartmentCode','DepartmentDesc','JobCodesCode', \
'JobCodesDesc','TeamCode','TeamDesc','EarnCode']
>>> df = pd.read_csv(localFilePath)
>>> df.columns = col_names
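Equivalently, you can skip the rename step and hand your own header names to read_csv up front. A small sketch, assuming the file's first row is the original header that you want to override:
col_names = ['Date','EECode','LastName','FirstName',
             'HomeDepartmentCode','HomeDepartmentDesc','PayClass','InPunchTime',
             'OutPunchTime','DepartmentCode','DepartmentDesc','JobCodesCode',
             'JobCodesDesc','TeamCode','TeamDesc','EarnCode']
# header=0 tells pandas the file's first row is a header to be replaced;
# names= supplies the replacement column names.
df = pd.read_csv(localFilePath, header=0, names=col_names)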

How to get Dataframe with Table ID in Pandas?

I want to extract dataframe from HTML using URL.
The page contains 59 table/dataframe.
I want to extract 1 particular table which can be identified by its ID "ctl00_Menu1"
Following is my attempt, which is giving an error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As I'm at a very early stage with Python, this may have a simple solution, but I am unable to find it. Appreciate the help.
I would look at how the URL passes parameters and try to read the dataframe directly from it. I'm unsure whether you are trying to develop a function, a script, or just experimenting.
If you do (notice the 58 at the end of the url)
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258",attrs = {'id':
'ctl00_Menu1'})
It works and gives you table 59.
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a URL parameter. If you try it without the 58 it works too. I believe the 'ElectionResult' argument might not be a table identifier, which is why you can't find any tables with that name.
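If the table ID ever fails to match, a simple fallback is to let read_html parse every table on the page and inspect the list it returns. A minimal sketch, assuming the page is still reachable and that lxml or beautifulsoup4 is installed for HTML parsing:
import pandas as pd

url = "http://eciresults.nic.in/statewiseS12.htm?st=S12"
tables = pd.read_html(url)    # list of DataFrames, one per <table> element
print(len(tables))            # the question mentions 59 tables on this page
for i, t in enumerate(tables):
    print(i, t.shape)         # inspect shapes to spot the table you want
df = tables[58]               # then pick it by position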

Python pandas - value_counts not working properly

Based on this post on Stack Overflow, I tried the value_counts function like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
and it works, apart from the fact that my data has 22 unique genres but after the split I get 42 values, which of course are not unique.
Data example:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(I have pasted the header and the first row only.)
I have a feeling that the problem is caused by my original data. My column (genres) was a list of lists which contained brackets,
for example: [Action, Indie]
so when Python read it, it would treat '[Action', 'Action' and 'Action]' as different values, and the output was 303 different values.
So what I did is this:
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')
You have to remove the leading and trailing [] from the genres column with str.strip, and then replace spaces with an empty string with str.replace:
import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
#temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print df
#remove for clarity
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')
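For what it's worth, a similar genre-per-column encoding can also be built with Series.str.get_dummies, which avoids the apply(pd.value_counts) round trip. A minimal sketch, assuming genres holds strings such as '[Action, Indie]' (shown in Python 3 syntax with a recent pandas):
import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep='\t')
genres = df['genres'].str.strip('[]').str.replace(' ', '', regex=False)
# one column per genre: 1 where the row lists that genre, 0 otherwise
dummies = genres.str.get_dummies(sep=',')
df = df.join(dummies)
print(df.columns)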

Replacing/Stripping certain text from data in pandas?

I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
Yet when I print, it hasn't replaced any of them, as seen below by running print csvdata[-50:].head(50):
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT, I've tried .str.replace("", "") and tried just .replace("", "")
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses using dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", " Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but don't make it clear that they always interpret their argument as a regular expression.
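Applied to the example above, a minimal sketch (the regex keyword is assumed to be available, as it is in recent pandas versions; passing it explicitly removes any ambiguity, since the default handling of the pattern has changed across versions):
import pandas as pd

csvdata = pd.read_csv('test.csv', dtype=object)
csvdata['CTemp'] = (csvdata['CDPure']
                    .str.replace(r' \(AMI\)', '', regex=True)   # parentheses escaped for the regex engine
                    .str.replace(' N/A', ' Non', regex=False))  # plain literal replacement
print(csvdata.head(11))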

Python Reading Multiple NetCDF Rainfall files of variable size

The issue I have is that the Australian Bureau of Meteorology has supplied me with rainfall data files that contain rainfall records recorded every 30 minutes for all active gauges. The problem is that for one day there are 48 30-minute files. I want to create a time series for a particular gauge, which means reading all 48 files and searching for the gauge ID, while making sure it doesn't fail if the gauge did not record anything for a given 30-minute period.
Here are links to the file format:
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_000000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_003000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_010000.nc
This is what I have tried so far:
"""
This script is used to read a directory of raingauge data from a Data Directory
"""
from anuga.file.netcdf import NetCDFFile
from anuga.config import netcdf_mode_r, netcdf_mode_w, netcdf_mode_a, \
netcdf_float
import os
import glob
from easygui import *
import string
import numpy
"""
print 'Default file Extension...'
msg="Enter 3 letter extension."
title = "Enter the 3 letter file extension to search for in DIR "
default = "csv"
file_extension = enterbox(msg,title,default)
"""
print 'Present Directory Open...'
title = "Select Directory to Read Multiple rainfall .nc files"
msg = "This is a test of the diropenbox.\n\nPick the directory that you wish to open."
d = diropenbox(msg, title)
fromdir = d
filtered_list = glob.glob(os.path.join(fromdir, '*.nc'))
filtered_list.sort()
nf = len(filtered_list)
print nf
import numpy
rain = numpy.zeros(nf,'float')
t = numpy.arange(nf)
Stn_Loc_File='Station_Location.csv'
outfid = open(Stn_Loc_File, 'w')
prec = numpy.zeros((nf,1752),numpy.float)
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
'570047','570772','570781','570910','570903','570916','570931','570943','570965',
'570968','570983','570986','70214','70217','70349','70351']
"""
title = "Select Gauge to plot"
msg = "Select Gauge"
gauge_id = int(choicebox(msg=msg,title=title, choices=gauge_id_list))
"""
#for gauge_id in gauge_id_list:
# gauge_id = int(gauge_id)
try:
    for i, infile in enumerate(filtered_list):
        infilenet = NetCDFFile(infile, netcdf_mode_r)
        print infilenet.variables
        raw_input('Hold.... check variables...')
        stn_lats = infilenet.variables['latitude']
        stn_longs = infilenet.variables['longitude']
        stn_ids = infilenet.variables['station_id']
        stn_rain = infilenet.variables['precipitation']
        print stn_ids.shape
        #print stn_lats.shape
        #print stn_longs.shape
        #print infile.dimensions
        stn_ids = numpy.array(stn_ids)
        l_id = numpy.where(stn_ids == gauge_id)
        if stn_ids in gauge_id_list:
            try:
                l_id = l_id[0][0]
                rain[i] = stn_rain[l_id]
            except:
                rain[i] = numpy.nan
    print 'End for i...'
    #print rain
    import pylab as pl
    pl.bar(t,rain)
    pl.title('Rain Gauge data')
    pl.xlabel('time steps')
    pl.ylabel('rainfall (mm)')
    pl.show()
except:
    pass
raw_input('END....')
OK, you got the data in a format that's more convoluted than it would need to be. They could easily have stuffed the whole day into a netCDF file. And indeed, one option for you to solve this would have been to combine all files into one with a times dimension, using for example the NCO command line tools.
But here is a solution that uses the scipy netcdf module (I believe it is deprecated; myself, I prefer the netCDF4 library). The main approach is: preset your output data structure with np.nan values; loop through your input files and retrieve precipitation and station IDs; for each of your station IDs of interest, retrieve its index, and then the precipitation at that index; add it to the output structure. (I didn't do the work to extract timestamps; that's up to you.)
import glob
import numpy as np
from scipy.io import netcdf
# load data file names
stationdata = glob.glob('gauge*.nc')
stationdata.sort()
# initialize np arrays of integer gauging station ids
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
'570047','570772','570781','570910','570903','570916','570931','570943','570965',
'570968','570983','570986','70214','70217','70349','70351']
gauge_ids = np.array(gauge_id_list).astype('int32')
ngauges = len(gauge_ids)
ntimesteps = 48
# initialize output dictionary
dtypes = zip(gauge_id_list, ['float32']*ngauges)
timeseries_per_station = np.empty((ntimesteps,))
timeseries_per_station.fill(np.nan)
timeseries_per_station = timeseries_per_station.astype(dtypes)
# Instead of using the index, you could extract the datetime stamp
for timestep, datafile in enumerate(stationdata):
    data = netcdf.NetCDFFile(datafile, 'r')
    precip = data.variables['precip'].data
    stid = data.variables['stid'].data
    # create np array of indices of the gauge ids present in this file
    idx = np.where(np.in1d(stid, gauge_ids))[0]
    for i in idx:
        timeseries_per_station[str(stid[i])][timestep] = precip[i]
    data.close()
np.set_printoptions(precision=1)
for gauge_id in gauge_id_list:
    print "Station %s:" % gauge_id
    print timeseries_per_station[gauge_id]
The output looks like this:
Station 570002:
[ 1.9 0.3 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
Station 570021:
[ 0. 0. 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
...
(Obviously, there were only three files.)
Edit: The OP noted that the code wasn't running without errors for him because his variable names are "precipitation" and "station_id". The code runs for me on the files he posted. Obviously, he should be using whatever variable names are used in the files that he was supplied with. As they seem to be custom-produced files for his use, it is conceivable that the authors may not be consistent in variable naming.
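As an aside, since the answer above mentions preferring the netCDF4 library, here is roughly how the same loop could look with it, using the variable names from the OP's files ('precipitation' and 'station_id'). A hedged sketch, not tested against the actual BoM files:
import glob
import numpy as np
from netCDF4 import Dataset

gauge_id = 570002                       # gauge of interest, as an integer
files = sorted(glob.glob('gauge*.nc'))
rain = np.full(len(files), np.nan)      # one value per 30-minute file

for t, fname in enumerate(files):
    with Dataset(fname) as ds:
        station_ids = ds.variables['station_id'][:]
        precip = ds.variables['precipitation'][:]
    match = np.where(station_ids == gauge_id)[0]
    if match.size:                      # the gauge may be absent from a given file
        rain[t] = precip[match[0]]

print(rain)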
