Restructuring Dataframe in Python

Restructuring Dataframe in Python - python

I have gathered data from the penultimate worksheet in this Excel file along with all the data in the last Worksheet from "Maturity Years" of 5.5 onward. I have code that does this. However, I am now looking to restructure the dataframe such that it has the following columns and am struggling to do this:
My code is below.
import urllib2
import pandas as pd
import os
import xlrd
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
socket = urllib2.urlopen(url)
xd = pd.ExcelFile(socket)
#Had to do this based on actual sheet_names rather than index as there are some extra sheet names in xd.sheet_names
df1 = xd.parse('4. spot curve', header=None)
df1 = df1.loc[:, df1.loc[3, :] >= 5.5] #Assumes the maturity is always on the 4th line of the sheet
df2 = xd.parse('3. spot, short end', header=None)
bigdata = df1.append(df2,ignore_index = True)
Edit: The Dataframe currently looks as follows. The current Dataframe is pretty disorganized unfortunately:
0 1 2 3 4 5 6 \
0 NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN
2 Maturity NaN NaN NaN NaN NaN NaN
3 years: NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN
5 2005-01-03 00:00:00 NaN NaN NaN NaN NaN NaN
6 2005-01-04 00:00:00 NaN NaN NaN NaN NaN NaN
... ... ... .. .. ... ... ...
5410 2015-04-20 00:00:00 NaN NaN NaN NaN 0.367987 0.357069
5411 2015-04-21 00:00:00 NaN NaN NaN NaN 0.362478 0.352581
It has 5440 rows and 61 columns
However, I want the dataframe to be of the format:
I think Columns 1,2,3,4,5 and 6 contain Yield Curve Data. However, I am unsure where the data associated with "Maturity Years" is in the current DataFrame.
Date(which is the 2nd Column in the current Dataframe) Update time(which would just be a column with datetime.datetime.now()) Currency(which would just be a column with 'GBP') Maturity Date Yield Data from SpreadSheet

I use the pandas.io.excel.read_excel function to read xls from url. Here is one way to clean this UK yield curve dataset.
Note: executing the cubic spline interpolation via the apply function takes quite a mount of time (about 2 minutes in my PC). It interpolates from about 100 points to 300 points, row by row (2638 in total).
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
# check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8)
short_end_spot_curve = read_excel('uknom05_mdaily.xls', sheetname=6)
# preprocessing spot_curve
# ==============================================
# do a few inspection on the table
spot_curve.shape
spot_curve.iloc[:, 0]
spot_curve.iloc[:, -1]
spot_curve.iloc[0, :]
spot_curve.iloc[-1, :]
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
spot_curve.columns.name = 'years'
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
# now spot_curve is ready, check it
spot_curve.head()
spot_curve.tail()
spot_curve.shape
spot_curve.shape
Out[184]: (2715, 40)
# preprocessing short end spot_curve
# ==============================================
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
short_end_spot_curve.columns.name = 'years'
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
short_end_spot_curve.head()
short_end_spot_curve.tail()
short_end_spot_curve.shape
short_end_spot_curve.shape
Out[185]: (2715, 60)
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
combined_data.head()
combined_data.tail()
combined_data.shape
# deal with NaN: the most sound approach is fit the non-arbitrage NSS curve
# however, this is not currently supported in python.
# do a cubic spline instead
# ==============================================
# if more than half of the maturity points are NaN, then interpolation is likely to be unstable, so I'll remove all rows with NaNs count greater than 50
def filter_func(group):
return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
# no. of rows down from 2715 to 2628
combined_data.shape
combined_data.shape
Out[186]: (2628, 100)
from scipy.interpolate import interp1d
# mapping points, monthly frequency, 1 mon to 25 years
maturity = pd.Series((np.arange(12 * 25) + 1) / 12)
# do the interpolation day by day
key = lambda x: x.date
by_day = combined_data.groupby(level=0)
# write out apply function
def interpolate_maturities(group):
# transpose row vector to column vector and drops all nans
a = group.T.dropna().reset_index()
f = interp1d(a.iloc[:, 0], a.iloc[:, 1], kind='cubic', bounds_error=False, assume_sorted=True)
return pd.Series(maturity.apply(f).values, index=maturity.values)
# this may take a while .... apply provides flexibility but spead is not good
cleaned_spot_curve = by_day.apply(interpolate_maturities)
# a quick look on the data
cleaned_spot_curve.iloc[[1,1000, 2000], :].T.plot(title='Cross-Maturity Yield Curve')
cleaned_spot_curve.iloc[:, [23, 59, 119]].plot(title='Time-Series')

Related

Compare and find Duplicated values (not entire columns) across data frame with python

I have a large data frame of schedules, and I need to count the numbers of experiments run. The challenge is that usage for is repeated in rows (which is ok), but is duplicated in some, but not all columns. I want to remove the second entry (if duplicated), but I can't delete the entire second column because it will contain some new values too. How can I compare individual entries for two columns in a side by side fashion and delete the second if there is a duplicate?
The duration for this is a maximum of two days, so three days in a row is a new event with the same name starting on the third day.
The actual text for the experiment names is complicated and the data frame is 120 columns wide, so typing this in as a list or dictionary isn't possible. I'm hoping for a python or numpy function, but could use a loop.
Here are pictures for an example of the starting data frame and the desired output.starting data frame example
de-duplicated data frame example

This a hack and similar to #Params answer, but would be faster because you aren't calling .iloc a lot. The basic idea is to transpose the data frame and repeat an operation for as many times as you need to compare all of the columns. Then transpose it back to get the result in the OP.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Monday':['exp_A','exp_A','exp_A','exp_A','exp_B',np.nan,np.nan,np.nan,'exp_D','exp_D'],
'Tuesday':['exp_A','exp_A','exp_A','exp_A','exp_B','exp_B','exp_C','exp_C','exp_D','exp_D'],
'Wednesday':['exp_A','exp_D',np.nan,np.nan,np.nan,'exp_B','exp_C','exp_C','exp_C',np.nan],
'Thursday':['exp_A','exp_D',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,'exp_C',np.nan]
})
df = df.T
for i in range(int(np.ceil(df.shape[0]/2))):
df[(df == df.shift(1))& (df != df.shift(2))] = np.nan
df = df.T
Monday Tuesday Wednesday Thursday
0 exp_A NaN exp_A NaN
1 exp_A NaN exp_D NaN
2 exp_A NaN NaN NaN
3 exp_A NaN NaN NaN
4 exp_B NaN NaN NaN
5 NaN exp_B NaN NaN
6 NaN exp_C NaN NaN
7 NaN exp_C NaN NaN
8 exp_D NaN exp_C NaN
9 exp_D NaN NaN NaN

Python Pandas - Rolling regressions for multiple columns in a dataframe

I have a large dataframe containing daily timeseries of prices for 10,000 columns (stocks) over a period of 20 years (5000 rows x 10000 columns). Missing observations are indicated by NaNs.
0 1 2 3 4 5 6 7 8 \
31.12.2009 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
01.01.2010 30.75 66.99 NaN NaN NaN NaN 393.87 57.04 NaN
04.01.2010 31.85 66.99 NaN NaN NaN NaN 404.93 57.04 NaN
05.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
06.01.2010 33.26 66.99 NaN NaN NaN NaN 400.00 58.75 NaN
Now I want to run a rolling regression for a 250 day window for each column over the whole sample period and save the coefficient in another dataframe
Iterating over the colums and rows using two for-loops isn't very efficient, so I tried this but getting the following error message
def regress(start, end):
y = df_returns.iloc[start:end].values
if np.isnan(y).any() == False:
X = np.arange(len(y))
X = sm.add_constant(X, has_constant="add")
model = sm.OLS(y,X).fit()
return model.params[1]
else:
return np.nan
regression_window = 250
for t in (regression_window, len(df_returns.index)):
df_coef[t] = df_returns.apply(regress(t-regression_window, t), axis=1)
TypeError: ("'float' object is not callable", 'occurred at index 31.12.2009')

here is my version, using df.rolling() instead and iterating over the columns.
I am not completely sure it is what you were looking for don't hesitate to comment
import statsmodels.regression.linear_model as sm
import statsmodels.tools.tools as sm2
df_returns =pd.DataFrame({'0':[30,30,31,32,32],'1':[60,60,60,60,60],'2':[np.NaN,np.NaN,np.NaN,np.NaN,np.NaN]})
def regress(X,Z):
if np.isnan(X).any() == False:
model = sm.OLS(X,Z).fit()
return model.params[1]
else:
return np.NaN
regression_window = 3
Z = np.arange(regression_window)
Z= sm2.add_constant(Z, has_constant="add")
df_coef=pd.DataFrame()
for col in df_returns.columns:
df_coef[col]=df_returns[col].rolling(window=regression_window).apply(lambda col : regress(col, Z))
df_coef

Fastest way to add rows to existing pandas dataframe

I'm currently trying to create a new csv based on an existing csv.
I can't find a faster way to set values of a dataframe based on an existing dataframe values.
import pandas
import sys
import numpy
import time
# path to file as argument
path = sys.argv[1]
df = pandas.read_csv(path, sep = "\t")
# only care about lines with response_time
df = df[pandas.notnull(df['response_time'])]
# new empty dataframe
new_df = pandas.DataFrame(index = df["datetime"])
# new_df needs to have datetime as index
# and columns based on a combination
# of 2 columns name from previous dataframe
# (there are only 10 differents combinations)
# and response_time as values, so there will be lots of
# blank cells but I don't care
for i, row in df.iterrows():
start = time.time()
new_df.set_value(row["datetime"], row["name"] + "-" + row["type"], row["response_time"])
print(i, time.time() - start)
Original dataframe is:
datetime name type response_time
0 2018-12-18T00:00:00.500829 HSS_ANDROID audio 0.02430
1 2018-12-18T00:00:00.509108 HSS_ANDROID video 0.02537
2 2018-12-18T00:00:01.816758 HSS_TEST audio 0.03958
3 2018-12-18T00:00:01.819865 HSS_TEST video 0.03596
4 2018-12-18T00:00:01.825054 HSS_ANDROID_2 audio 0.02590
5 2018-12-18T00:00:01.842974 HSS_ANDROID_2 video 0.03643
6 2018-12-18T00:00:02.492477 HSS_ANDROID audio 0.01575
7 2018-12-18T00:00:02.509231 HSS_ANDROID video 0.02870
8 2018-12-18T00:00:03.788196 HSS_TEST audio 0.01666
9 2018-12-18T00:00:03.807682 HSS_TEST video 0.02975
new_df will look like this:
I takes 7ms per loop.
It takes an eternity to process a (only ?) 400 000 rows Dataframe. How can I make it faster ?

Indeed, using pivot will do what you look for such as:
import pandas as pd
new_df = pd.pivot(df.datetime, df.name + '-' + df.type, df.response_time)
print (new_df.head())
HSS_ANDROID-audio HSS_ANDROID-video \
datetime
2018-12-18T00:00:00.500829 0.0243 NaN
2018-12-18T00:00:00.509108 NaN 0.02537
2018-12-18T00:00:01.816758 NaN NaN
2018-12-18T00:00:01.819865 NaN NaN
2018-12-18T00:00:01.825054 NaN NaN
HSS_ANDROID_2-audio HSS_ANDROID_2-video \
datetime
2018-12-18T00:00:00.500829 NaN NaN
2018-12-18T00:00:00.509108 NaN NaN
2018-12-18T00:00:01.816758 NaN NaN
2018-12-18T00:00:01.819865 NaN NaN
2018-12-18T00:00:01.825054 0.0259 NaN
HSS_TEST-audio HSS_TEST-video
datetime
2018-12-18T00:00:00.500829 NaN NaN
2018-12-18T00:00:00.509108 NaN NaN
2018-12-18T00:00:01.816758 0.03958 NaN
2018-12-18T00:00:01.819865 NaN 0.03596
2018-12-18T00:00:01.825054 NaN NaN
and to not have NaN, you can use fillna with any value you want such as:
new_df = pd.pivot(df.datetime, df.name +'-'+df.type, df.response_time).fillna(0)

you can also use unstack as well just another option
new = df.set_index(['type','name', 'datetime']).unstack([0,1])
new.columns = ['{}-{}'.format(z,y) for x,y,z, in new.columns]
using f-strings will be a little faster than format:
new.columns = [f'{z}-{y}' for x,y,z, in new.columns]

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

I have a pandas Dataframe containing EOD financial data (OHLC) for analysis.
I'm using https://github.com/cirla/tulipy library to generate technical indicator values, that have a certain timeperiod as option. For Example. ADX with timeperiod=5 shows ADX for last 5 days.
Because of this timeperiod, the generated array with indicator values is always shorter in length than the Dataframe. Because the prices of first 5 days are used to generate ADX for day 6..
pdi14, mdi14 = ti.di(
high=highData, low=lowData, close=closeData, period=14)
df['mdi_14'] = mdi14
df['pdi_14'] = pdi14
>> ValueError: Length of values does not match length of index
Unfortunately, unlike TA-LIB for example, this tulip library does not provide NaN-values for these first couple of empty days...
Is there an easy way to prepend these NaN to the ndarray?
Or insert into df at a certain index & have it create NaN for the rows before it automatically?
Thanks in advance, I've been researching for days!

Maybe make the shift yourself in the code ?
period = 14
pdi14, mdi14 = ti.di(
high=highData, low=lowData, close=closeData, period=period
)
df['mdi_14'] = np.NAN
df['mdi_14'][period - 1:] = mdi14
I hope they will fill the first values with NAN in the lib in the future. It's dangerous to leave time series data like this without any label.

Full MCVE
df = pd.DataFrame(1, range(10), list('ABC'))
a = np.full((len(df) - 6, df.shape[1]), 2)
b = np.full((6, df.shape[1]), np.nan)
c = np.row_stack([b, a])
d = pd.DataFrame(c, df.index, df.columns)
d
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 2.0 2.0 2.0
7 2.0 2.0 2.0
8 2.0 2.0 2.0
9 2.0 2.0 2.0

The C version of the tulip library includes a start function for each indicator (reference: https://tulipindicators.org/usage) that can be used to determine the output length of an indicator given a set of input options. Unfortunately, it does not appear that the python bindings library, tulipy, includes this functionality. Instead you have to resort to dynamically reassigning your index values to align the output with the original DataFrame.
Here is an example that uses the price series from the tulipy docs:
#Create the dataframe with close prices
prices = pd.DataFrame(data={81.59, 81.06, 82.87, 83, 83.61, 83.15, 82.84, 83.99, 84.55,
84.36, 85.53, 86.54, 86.89, 87.77, 87.29}, columns=['close'])
#Compute the technical indicator using tulipy and save the result in a DataFrame
bbands = pd.DataFrame(data=np.transpose(ti.bbands(real = prices['close'].to_numpy(), period = 5, stddev = 2)))
#Dynamically realign the index; note from the tulip library documentation that the price/volume data is expected be ordered "oldest to newest (index 0 is oldest)"
bbands.index += prices.index.max() - bbands.index.max()
#Put the indicator values with the original DataFrame
prices[['BBANDS_5_2_low', 'BBANDS_5_2_mid', 'BBANDS_5_2_up']] = bbands
prices.head(15)
close BBANDS_5_2_low BBANDS_5_2_mid BBANDS_5_2_up
0 81.06 NaN NaN NaN
1 81.59 NaN NaN NaN
2 82.87 NaN NaN NaN
3 83.00 NaN NaN NaN
4 83.61 80.530042 82.426 84.321958
5 83.15 81.494061 82.844 84.193939
6 82.84 82.533343 83.094 83.654657
7 83.99 82.471983 83.318 84.164017
8 84.55 82.417750 83.628 84.838250
9 84.36 82.435203 83.778 85.120797
10 85.53 82.511331 84.254 85.996669
11 86.54 83.142618 84.994 86.845382
12 86.89 83.536488 85.574 87.611512
13 87.77 83.870324 86.218 88.565676
14 87.29 85.288871 86.804 88.319129

Python Reading Multiple NetCDF Rainfall files of variable size

The issue I have is that the Australian Bureau of Meteorology has supplied me with Rainfall Data Files, that contains rainfall records recorded every 30 minutes for all active gauges. The problem is that for 1 day there are 48 30Minute files. I want to create time series of a particular Gauge. Which means reading all 48 files and searching for the Gauge ID, making sure it doesn't fail if for 1 30 minute period the gauge did not record anything??
here is link to file format:
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_000000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_003000.nc
https://dl.dropboxusercontent.com/u/15223371/14/gauge_30min_20100214_010000.nc
This is what I have tried so far:
"""
This script is used to read a directory of raingauge data from a Data Directory
"""
from anuga.file.netcdf import NetCDFFile
from anuga.config import netcdf_mode_r, netcdf_mode_w, netcdf_mode_a, \
netcdf_float
import os
import glob
from easygui import *
import string
import numpy
"""
print 'Default file Extension...'
msg="Enter 3 letter extension."
title = "Enter the 3 letter file extension to search for in DIR "
default = "csv"
file_extension = enterbox(msg,title,default)
"""
print 'Present Directory Open...'
title = "Select Directory to Read Multiple rainfall .nc files"
msg = "This is a test of the diropenbox.\n\nPick the directory that you wish to open."
d = diropenbox(msg, title)
fromdir = d
filtered_list = glob.glob(os.path.join(fromdir, '*.nc'))
filtered_list.sort()
nf = len(filtered_list)
print nf
import numpy
rain = numpy.zeros(nf,'float')
t = numpy.arange(nf)
Stn_Loc_File='Station_Location.csv'
outfid = open(Stn_Loc_File, 'w')
prec = numpy.zeros((nf,1752),numpy.float)
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
'570047','570772','570781','570910','570903','570916','570931','570943','570965',
'570968','570983','570986','70214','70217','70349','70351']
"""
title = "Select Gauge to plot"
msg = "Select Gauge"
gauge_id = int(choicebox(msg=msg,title=title, choices=gauge_id_list))
"""
#for gauge_id in gauge_id_list:
# gauge_id = int(gauge_id)
try:
for i, infile in enumerate(filtered_list):
infilenet = NetCDFFile(infile, netcdf_mode_r)
print infilenet.variables
raw_input('Hold.... check variables...')
stn_lats = infilenet.variables['latitude']
stn_longs = infilenet.variables['longitude']
stn_ids = infilenet.variables['station_id']
stn_rain = infilenet.variables['precipitation']
print stn_ids.shape
#print stn_lats.shape
#print stn_longs.shape
#print infile.dimensions
stn_ids = numpy.array(stn_ids)
l_id = numpy.where(stn_ids == gauge_id)
if stn_ids in gauge_id_list:
try:
l_id = l_id[0][0]
rain[i] = stn_rain[l_id]
except:
rain[i] = numpy.nan
print 'End for i...'
#print rain
import pylab as pl
pl.bar(t,rain)
pl.title('Rain Gauge data')
pl.xlabel('time steps')
pl.ylabel('rainfall (mm)')
pl.show()
except:
pass
raw_input('END....')

OK, you got the data in a format that's more convoluted than it would need to be. They could easily have stuffed the whole day into a netCDF file. And indeed, one option for you to solve this would have been to combine all files into one with a times dimension, using for example the NCO command line tools.
But here is a solution that uses the scipy netcdf module. I believe it is deprecated -- myself, I prefer the NetCDF4 library. The main approach is: preset your output data structure with np.nan values; loop through your input files and retrieve precipitation and station ids; for each of your stationids of interest, retrieve index, and then precipitation at that index; add to the output structure. (I didn't do the work to extract timestamps - that's up to you.)
import glob
import numpy as np
from scipy.io import netcdf
# load data file names
stationdata = glob.glob('gauge*.nc')
stationdata.sort()
# initialize np arrays of integer gauging station ids
gauge_id_list = ['570002','570021','570025','570028','570030','570032','570031','570035','570036',
'570047','570772','570781','570910','570903','570916','570931','570943','570965',
'570968','570983','570986','70214','70217','70349','70351']
gauge_ids = np.array(gauge_id_list).astype('int32')
ngauges = len(gauge_ids)
ntimesteps = 48
# initialize output dictionary
dtypes = zip(gauge_id_list, ['float32']*ngauges)
timeseries_per_station = np.empty((ntimesteps,))
timeseries_per_station.fill(np.nan)
timeseries_per_station = timeseries_per_station.astype(dtypes)
# Instead of using the index, you could extract the datetime stamp
for timestep, datafile in enumerate(stationdata):
data = netcdf.NetCDFFile(datafile, 'r')
precip = data.variables['precip'].data
stid = data.variables['stid'].data
# create np array of indices of the gaugeid present in file
idx = np.where(np.in1d(stid, gauge_ids))[0]
for i in idx:
timeseries_per_station[str(stid[i])][timestep] = precip[i]
data.close()
np.set_printoptions(precision=1)
for gauge_id in gauge_id_list:
print "Station %s:" % gauge_id
print timeseries_per_station[gauge_id]
The output looks like this:
Station 570002:
[ 1.9 0.3 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
Station 570021:
[ 0. 0. 0. nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan]
...
(Obviously, there were only three files.)
Edit: The OP noted that the code wasn't running without errors for him because his variable names are "precipitation" and "station_id". The code runs for me on the files he posted. Obviously, he should be using whatever variable names are used in the files that he was supplied with. As they seem to be custom-produced files for his use, it is conceivable that the authors may not be consistent in variable naming.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Restructuring Dataframe in Python - python

Related

Compare and find Duplicated values (not entire columns) across data frame with python

Python Pandas - Rolling regressions for multiple columns in a dataframe

Fastest way to add rows to existing pandas dataframe

Python Pandas Dataframe: length of index does not match - df['column'] = ndarray

Python Reading Multiple NetCDF Rainfall files of variable size

Categories

Resources