The issue I have is that the Australian Bureau of Meteorology has supplied me with rainfall data files containing rainfall records recorded every 30 minutes for all active gauges. The problem is that for one day there are 48 30-minute files. I want to create a time series for a particular gauge, which means reading all 48 files and searching for the gauge ID, while making sure it doesn't fail if the gauge did not record anything for one 30-minute period.
Here are links to sample files showing the format:
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_000000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_003000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_010000.nc
This is what I have tried so far:
"""
This script is used to read a directory of raingauge data from a Data Directory
"""
from anuga.file.netcdf import NetCDFFile
from anuga.config import netcdf_mode_r, netcdf_mode_w, netcdf_mode_a, \
netcdf_float
import os
import glob
from easygui import *
import string
import numpy
"""
print 'Default file Extension...'
msg="Enter 3 letter extension."
title = "Enter the 3 letter file extension to search for in DIR "
default = "csv"
file_extension = enterbox(msg,title,default)
"""
print 'Present Directory Open...'
title = "Select Directory to Read Multiple rainfall .nc files"
msg = "This is a test of the diropenbox.\n\nPick the directory that you wish to open."
d = diropenbox(msg, title)
fromdir = d
filtered_list = glob.glob(os.path.join(fromdir, '*.nc'))
filtered_list.sort()
nf = len(filtered_list)
print nf
import numpy
rain = numpy.zeros(nf,'float')
t = numpy.arange(nf)
Stn_Loc_File='Station_Location.csv'
outfid = open(Stn_Loc_File, 'w')
prec = numpy.zeros((nf,1752),numpy.float)
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
'570047','570772','570781','570910','570903','570916','570931','570943','570965',
'570968','570983','570986','70214','70217','70349','70351']
"""
title = "Select Gauge to plot"
msg = "Select Gauge"
gauge_id = int(choicebox(msg=msg,title=title, choices=gauge_id_list))
"""
#for gauge_id in gauge_id_list:
# gauge_id = int(gauge_id)
try:
    for i, infile in enumerate(filtered_list):
        infilenet = NetCDFFile(infile, netcdf_mode_r)
        print infilenet.variables
        raw_input('Hold.... check variables...')
        stn_lats = infilenet.variables['latitude']
        stn_longs = infilenet.variables['longitude']
        stn_ids = infilenet.variables['station_id']
        stn_rain = infilenet.variables['precipitation']
        print stn_ids.shape
        #print stn_lats.shape
        #print stn_longs.shape
        #print infile.dimensions
        stn_ids = numpy.array(stn_ids)
        l_id = numpy.where(stn_ids == gauge_id)
        if stn_ids in gauge_id_list:
            try:
                l_id = l_id[0][0]
                rain[i] = stn_rain[l_id]
            except:
                rain[i] = numpy.nan
    print 'End for i...'
    #print rain
    import pylab as pl
    pl.bar(t, rain)
    pl.title('Rain Gauge data')
    pl.xlabel('time steps')
    pl.ylabel('rainfall (mm)')
    pl.show()
except:
    pass
raw_input('END....')
OK, you got the data in a format that's more convoluted than it would need to be. They could easily have stuffed the whole day into a single netCDF file. And indeed, one option for you to solve this would have been to combine all files into one with a time dimension, using for example the NCO command line tools.
But here is a solution that uses the scipy netcdf module. I believe it is deprecated -- myself, I prefer the netCDF4 library. The main approach is: preset your output data structure with np.nan values; loop through your input files and retrieve precipitation and station ids; for each of your station ids of interest, retrieve the index, and then the precipitation at that index; add it to the output structure. (I didn't do the work to extract timestamps - that's up to you.)
import glob
import numpy as np
from scipy.io import netcdf

# load data file names
stationdata = glob.glob('gauge*.nc')
stationdata.sort()

# initialize np array of integer gauging station ids
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
                 '570047','570772','570781','570910','570903','570916','570931','570943','570965',
                 '570968','570983','570986','70214','70217','70349','70351']
gauge_ids = np.array(gauge_id_list).astype('int32')
ngauges = len(gauge_ids)
ntimesteps = 48

# initialize output: a structured array with one float32 field per station, pre-filled with NaN
dtypes = zip(gauge_id_list, ['float32'] * ngauges)
timeseries_per_station = np.empty((ntimesteps,))
timeseries_per_station.fill(np.nan)
timeseries_per_station = timeseries_per_station.astype(dtypes)

# Instead of using the index, you could extract the datetime stamp
for timestep, datafile in enumerate(stationdata):
    data = netcdf.NetCDFFile(datafile, 'r')
    precip = data.variables['precip'].data
    stid = data.variables['stid'].data
    # create np array of indices of the gauge ids present in this file
    idx = np.where(np.in1d(stid, gauge_ids))[0]
    for i in idx:
        timeseries_per_station[str(stid[i])][timestep] = precip[i]
    data.close()

np.set_printoptions(precision=1)
for gauge_id in gauge_id_list:
    print "Station %s:" % gauge_id
    print timeseries_per_station[gauge_id]
The output looks like this:
Station 570002:
[ 1.9 0.3 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
Station 570021:
[ 0. 0. 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
...
(Obviously, there were only three files.)
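As a side note, if you want actual timestamps rather than the file index on the x-axis, one option is to parse them from the file names, assuming they follow the gauge_30min_YYYYMMDD_HHMMSS.nc pattern shown in the question. A minimal sketch (timestamp_from_filename is just a hypothetical helper):

import os
import datetime

def timestamp_from_filename(path):
    # 'gauge_30min_20100214_003000.nc' -> datetime(2010, 2, 14, 0, 30)
    stem = os.path.splitext(os.path.basename(path))[0]
    date_part = stem.split('_', 2)[2]
    return datetime.datetime.strptime(date_part, '%Y%m%d_%H%M%S')

timestamps = [timestamp_from_filename(f) for f in stationdata]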
Edit: The OP noted that the code wasn't running without errors for him because his variable names are "precipitation" and "station_id". The code runs for me on the files he posted. Obviously, he should be using whatever variable names are used in the files that he was supplied with. As they seem to be custom-produced files for his use, it is conceivable that the authors may not be consistent in variable naming.
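If your files do use those names, the only change needed should be the two variable lookups inside the loop (everything else stays the same):

precip = data.variables['precipitation'].data
stid = data.variables['station_id'].data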
Related
Very new Python user here. I have a link to a zipped data set (csv.gz) that I'm trying to get into Python without first downloading it to my desktop, unzipping it, and uploading it to GitHub in order to use pd.read_csv. The link is of a form where I think I am meant to request it using requests, but I am getting a 403 status code. I can't tell if it's because I'm requesting it wrong, or if the .gz is the problem.
This is how I'm trying to import it.
import requests
query = {'AWSAccessKeyId':'theaccesskey', 'Signature':'thesignature','Expires':'12345'}
url = 'https://s3.amazonaws.com/research.insideairbnb.com/data/united-states/or/portland/2015-12-02/data/listings.csv.gz'
response = requests.get(url, params=query)
But even assuming that I can figure out how to get a 200 status code, how do I go from the .gz to a dataframe that I can work with in pandas?
You can use io.BytesIO with pandas.read_csv:
import requests
import pandas as pd
from io import BytesIO
url = 'https://github.com/apache-superset/examples-data/blob/master/san_francisco.csv.gz'
# This is required to access raw binary files on Github
# i.e. it appends the following to the URL: `?raw=true`.
query = {'raw': 'true'}
# Use the requests module to parse the URL with the provided parameters.
response = requests.get(url, params=query)
# Create a file pointer initialized to the content of the response
# using `BytesIO`. This is a pseudo-file, which can now be read
# using `pandas.read_csv`. Since `response.content` is binary data
# i.e. bytes, we use `BytesIO`. If the response was text, we would
# have used `StringIO`.
fp = BytesIO(response.content)
# Finally, parse the content into a DataFrame
# (populate other parameters as needed).
df = pd.read_csv(fp, compression='gzip')
print(df)
This should return the contents of the csv.gz file as a DataFrame. Using the URL in this example yields the following output:
LON LAT NUMBER STREET UNIT CITY DISTRICT REGION POSTCODE ID
0 -122.391267 37.769093 1550 04th Street NaN NaN NaN NaN 94158 NaN
1 -122.390850 37.769426 1505 04th Street NaN NaN NaN NaN 94158 NaN
2 -122.428577 37.780627 1160 Buchanan Street NaN NaN NaN NaN 94115 NaN
3 -122.428534 37.780385 1142 Buchanan Street NaN NaN NaN NaN 94115 NaN
4 -122.428525 37.780317 1140 Buchanan Street NaN NaN NaN NaN 94115 NaN
... ... ... ... ... ... ... ... ... ... ..
261547 -122.418380 37.808349 360 Jefferson Street NaN NaN NaN NaN 94133 NaN
261548 -122.418380 37.808349 350 Jefferson Street NaN NaN NaN NaN 94133 NaN
261549 -122.417829 37.807479 333 Jefferson Street NaN NaN NaN NaN 94133 NaN
261550 -122.418916 37.809044 1965 Al Scoma Way NaN NaN NaN NaN 94133 NaN
261551 -122.444322 37.749124 350 Glenview Drive NaN NaN NaN NaN 94131 NaN
I used a sample csv.gz file I found online, since the URL and parameters in the question yield a JPEG image instead, which is a bit puzzling. In any case, adapt this snippet of code to your case, and it should produce the desired results.
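As a further note, recent pandas versions can often read a gzipped CSV straight from a URL without requests at all, as long as the file is publicly accessible (which may not hold for the signed S3 link in the question). A minimal sketch, assuming the GitHub sample file above:

import pandas as pd

# read the gzipped CSV directly; compression is passed explicitly because
# the '?raw=true' suffix can defeat automatic inference from the extension
url = 'https://github.com/apache-superset/examples-data/blob/master/san_francisco.csv.gz?raw=true'
df = pd.read_csv(url, compression='gzip')
print(df.head())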
I want to extract a dataframe from HTML using a URL.
The page contains 59 tables/dataframes.
I want to extract one particular table, which can be identified by its ID "ctl00_Menu1".
Following is my trial, which is giving an error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As I am at a very early stage with Python, there may be a simple solution, but I am unable to find it. I appreciate any help.
I would look at how the URL passes parameters and probably try to read a dataframe directly from it. I'm unsure whether you are trying to develop a function, a script, or are just experimenting.
If you do the following (notice the 58 at the end of the URL)
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258", attrs={'id': 'ctl00_Menu1'})
It works and gives you table 59.
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a URL parameter. If you try it without the 58, it works too. I believe the 'ElectionResult' argument might not be a table classifier, which is why you can't find any tables with that name.
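For completeness, note that pd.read_html always returns a list of DataFrames, even when attrs narrows the match down to a single table, so you still need to index into the result. A minimal sketch using the same URL and id as above:

import pandas as pd

tables = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258",
                      attrs={'id': 'ctl00_Menu1'})
df = tables[0]  # first (and here only) table matching the id
print(df)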
Based on this post on Stack Overflow, I tried the value_counts function like this:
df2 = df1.join(df1.genres.str.split(",").apply(pd.value_counts).fillna(0))
and it works fine, apart from the fact that although my data has 22 unique genres, after the split I get 42 values, which of course are not unique.
Data example:
Action Adventure Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing Accounting Action Adventure Animation & Modeling Audio Production Casual Design & Illustration Early Access Education Free to Play Indie Massively Multiplayer Photo Editing RPG Racing Simulation Software Training Sports Strategy Utilities Video Production Web Publishing nan
0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 1.0 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
(I have pasted the header and the first row only.)
I have a feeling that the problem is caused by my original data. My column (genres) was a list of lists which contained brackets,
for example: [Action,Indie]
so when Python read it, it would treat '[Action', 'Action' and 'Action]' as different values, and the output was 303 different values.
So what I did is this:
new = []
for i in df1['genres'].tolist():
    if str(i) != 'nan':
        i = i[1:-1]
        new.append(i)
    else:
        new.append('nan')
You have to remove the first and last [] from the column genres with the function str.strip, and then replace the spaces with an empty string using the function str.replace:
import pandas as pd
df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")
df['genres'] = df['genres'].str.strip('[]')
df['genres'] = df['genres'].str.replace(' ', '')
df = df.join(df.genres.str.split(",").apply(pd.value_counts).fillna(0))
#temporarily display 30 rows and 60 columns
with pd.option_context('display.max_rows', 30, 'display.max_columns', 60):
    print df
#remove for clarity
print df.columns
Index([u'Unnamed: 0', u'appid', u'currency', u'final_price', u'genres',
u'initial_price', u'is_free', u'metacritic', u'release_date',
u'Accounting', u'Action', u'Adventure', u'Animation&Modeling',
u'AudioProduction', u'Casual', u'Design&Illustration', u'EarlyAccess',
u'Education', u'FreetoPlay', u'Indie', u'MassivelyMultiplayer',
u'PhotoEditing', u'RPG', u'Racing', u'Simulation', u'SoftwareTraining',
u'Sports', u'Strategy', u'Utilities', u'VideoProduction',
u'WebPublishing'],
dtype='object')
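As an alternative to split + value_counts, pandas also has Series.str.get_dummies, which builds one indicator (0/1) column per genre in a single step once the brackets and spaces are stripped. A minimal sketch on the same column (it yields indicators rather than counts, which for genres is usually what you want):

import pandas as pd

df = pd.read_csv('test/Copy of AppCrawler.csv', sep="\t")

# strip the surrounding [] and the spaces, then expand into 0/1 columns
genres = df['genres'].str.strip('[]').str.replace(' ', '')
dummies = genres.str.get_dummies(sep=',')
df = df.join(dummies)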
I have gathered data from the penultimate worksheet in this Excel file, along with all the data in the last worksheet from a "Maturity Years" value of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe so that it has the following columns, and am struggling to do this:
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The dataframe currently looks as follows; it is pretty disorganized, unfortunately:
0 1 2 3 4 5 6 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Maturity NaN NaN NaN NaN NaN NaN
3 years: NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2005-01-03 00:00:00 NaN NaN NaN NaN NaN NaN
6 2005-01-04 00:00:00 NaN NaN NaN NaN NaN NaN
... ... ... .. .. ... ... ...
5410 2015-04-20 00:00:00 NaN NaN NaN NaN 0.367987 0.357069
5411 2015-04-21 00:00:00 NaN NaN NaN NaN 0.362478 0.352581
It has 5440 rows and 61 columns.
However, I want the dataframe to have the following columns:
Date (which is the 2nd column in the current dataframe)
Update time (which would just be a column with datetime.datetime.now())
Currency (which would just be a column with 'GBP')
Maturity Date
Yield data from the spreadsheet
I think columns 1, 2, 3, 4, 5 and 6 contain yield curve data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
I use the pandas.io.excel.read_excel function to read the xls from the URL. Here is one way to clean up this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite a bit of time (about 2 minutes on my PC). It interpolates from about 100 points to 300 points, row by row (2,628 rows in total).
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel(url, sheetname=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
spot_curve.shape
Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
short_end_spot_curve.shape
Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, interpolation is likely to be unstable, so remove all rows with a NaN count greater than 50
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape
combined_data.shape
Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
    # transpose row vector to column vector and drop all nans
    a = group.T.dropna().reset_index()
    f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
    return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while.... apply provides flexibility but speed is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')
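Finally, if, as the question suggests, you want a long table with one row per (date, maturity) pair plus constant Currency and Update time columns, you can stack the cleaned matrix produced above. A minimal sketch (the column names are only placeholders):

import datetime

# stack the (date x maturity) matrix into long format: one row per pair
long_curve = cleaned_spot_curve.stack().reset_index()
long_curve.columns = ['Date', 'Maturity (years)', 'Yield']
long_curve['Update time'] = datetime.datetime.now()
long_curve['Currency'] = 'GBP'
long_curve.head()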
I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
But when I print it, nothing has been replaced, as seen below by running print csvdata[-50:].head(50):
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT: I've tried .str.replace("", "") and tried just .replace("", "").
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses, using dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", " Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but they don't make it clear that these methods always interpret their argument as a regular expression.
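If you would rather not escape the pattern by hand, two common alternatives are re.escape, or, in newer pandas versions, passing regex=False so that str.replace treats the pattern literally. A minimal sketch:

import re
import pandas as pd

s = pd.Series(["Over (AMI)", "Data N/A"])

# Option 1: let re.escape insert the backslashes for you
print(s.str.replace(re.escape(" (AMI)"), "").str.replace(" N/A", " Non"))

# Option 2 (newer pandas only): disable regex interpretation entirely
print(s.str.replace(" (AMI)", "", regex=False).str.replace(" N/A", " Non", regex=False))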