Data: Here
Question:
I have several data sheets which I export to Python as dataframes. I want to perform multiplications across these dataframes, which will generate another dataframe that takes the same dimension as the dataframes I use and/or augment the dimension (i.e. the index) based on the combination from the different dataframes used. However, I stumble upon some issues to which I could not find a solution. Below is the code.
Code:
#---------------------------------------------------------------------------------------------------
#Load the pandas library
#---------------------------------------------------------------------------------------------------
import numpy as np
import pandas as pd
#---------------------------------------------------------------------------------------------------
#Load the dataframes
#---------------------------------------------------------------------------------------------------
##Supply at the gridcell level (in Pj per year)
biosup = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biosup', skiprows = 5, index_col = 0, usecols = 'A:K')
##Cost at the gridcell level (in MEUR per Pj)
biocost = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biocost', skiprows = 5, index_col = 0, usecols = 'A:K')
##Demand at the gridcell level (in Pj per year)
biodem = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'biodem', skiprows = 5, index_col = [0,1], usecols = 'A:L')
##Inter-gridcell distance matrix (in km)
dist = pd.read_excel('01EconMod_EU1.xlsx', sheet_name = 'distance', skiprows = 5, index_col = 0, usecols = 'A:AE')
#---------------------------------------------------------------------------------------------------
#Definition of model parameter
#---------------------------------------------------------------------------------------------------
##Power parameter for the distance-decay component (gamma)
gamma = pd.DataFrame({'sim1':[1.06],'sim2':[1.59],'sim3':[2.12]})
gamma = gamma.transpose()
gamma.columns = ['val']
##Inter-gridcell distance range for the supply curve determination (dmaxsup in km)
dmaxsup = pd.DataFrame({'dsup1':[390],'dsup2':[770],'dsup3':[1050]})
dmaxsup = dmaxsup.transpose()
dmaxsup.columns = ['dmax']
##Inter-gridcell distance range for the distance-decay (dmaxdem in km)
dmaxdem = pd.DataFrame({'ddem1':[750],'ddem2':[1000]})
dmaxdem = dmaxdem.transpose()
dmaxdem.columns = ['dmax']
#---------------------------------------------------------------------------------------------------
#New parameter calculation
#---------------------------------------------------------------------------------------------------
##The ratio of the inter-gridcell distance and the dmaxdem
dist1 = pd.DataFrame(np.concatenate(dist.values / dmaxdem.values[:, None]), pd.MultiIndex.from_product([dmaxdem.index, dist.index]), dist.columns)
##The decay coefficients
decay = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist1.values)**gamma.values[:, None])))), pd.MultiIndex.from_product([gamma.index, dist1.index]), dist1.columns)
decay1 = pd.DataFrame(np.concatenate(2 * (1 / (1 + (np.exp(dist.values / dmaxdem.values[:, None])**gamma.values[:, None])))), pd.MultiIndex.from_product([dmaxdem.index, gamma.index, dist.index]), dist.columns)
Comments on the code:
1/ The parameter "dist1" represents the division of the "dist" dataframe by each element of the "dmaxdem" dataframe. Think of the values of the "dmaxdem" dataframe as distance scenarios. In other words, this operation computes the ratio for each of the distance values provided.
2/ I try to compute the distance-decay coefficients, i.e. the "decay" dataframe, as defined by the formula inside the brackets. However, I get the following error message
NotImplementedError: isna is not defined for MultiIndex
which I believe has something to do with the multiindex structure of the "dist1" dataframe. I have tried a direct approach by embedding the previous operation, and which will require the use of the 3 different dataframes as illustrated by the code for "decay1". I get the following error
ValueError: operands could not be broadcast together with shapes (2,30,30) (3,1,1)
Any help would be appreciated.
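One possible way around both errors, sketched here under assumptions: give every scenario its own array axis so that NumPy can broadcast, then flatten the result and build the MultiIndex from the three flat indexes (rather than from dist1.index, which is already a MultiIndex). The 4x4 "dist" matrix and its cell labels below are made-up stand-ins for the real 30x30 sheet, and gamma and dmaxdem are held as Series instead of one-column dataframes.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 30x30 distance sheet (4x4 here, made-up labels)
cells = ['c1', 'c2', 'c3', 'c4']
rng = np.random.default_rng(0)
dist = pd.DataFrame(rng.uniform(0, 1000, (4, 4)), index=cells, columns=cells)

# Same parameter values as in the question, stored as Series
gamma = pd.Series({'sim1': 1.06, 'sim2': 1.59, 'sim3': 2.12})
dmaxdem = pd.Series({'ddem1': 750, 'ddem2': 1000})

# One axis per scenario: gamma -> (3,1,1,1), dmaxdem -> (1,2,1,1), dist -> (1,1,4,4)
g = gamma.to_numpy()[:, None, None, None]
d = dmaxdem.to_numpy()[None, :, None, None]
ratio = dist.to_numpy()[None, None, :, :] / d

# Same formula as in the question; the result has shape (3, 2, 4, 4)
decay_values = 2 / (1 + np.exp(ratio) ** g)

# Flatten the three leading axes into rows; the C-order reshape matches from_product
index = pd.MultiIndex.from_product(
    [gamma.index, dmaxdem.index, dist.index],
    names=['gamma', 'dmaxdem', 'cell'],
)
decay = pd.DataFrame(
    decay_values.reshape(-1, dist.shape[1]), index=index, columns=dist.columns
)
```

A single scenario can then be selected with, for example, decay.loc[('sim2', 'ddem1')].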
Pardon me if I have misunderstood you; I am unable to comment before posting an answer.
Well, if they all have the same length and the same index, you can start by concatenating them along axis 0. This will create a larger dataframe. Next, you can add whatever calculated column or columns you need:
largerdf = pd.concat([df1, df2, df3, dfn], axis=0)
largerdf["calculationcolumn"] = largerdf["columnvalue1"] * largerdf["columnvalue2"]
Or change the operator to whichever one you need.
I am trying to use linear regression using data pulled from yfinance to predict future stock prices, but I am having trouble using linear regression after transposing my data's shape.
Here I create a normalization function
def normalize_data(df):
    # df on input should contain only one column with the price data (plus dataframe index)
    min = df.min()
    max = df.max()
    x = df
    # time series normalization part
    # y will be a column in a dataframe
    y = (x - min) / (max - min)
    return y
And another function to pull stock prices from Yfinance that calls the normalization function
def closing_price(ticker):
    #Asset = pd.DataFrame(yf.download(ticker, start=Start,end=End)['Adj Close'])
    Asset = pd.DataFrame(yf.download(ticker, start='2022-07-13',end='2022-09-16')['Adj Close'])
    Asset = normalize_data(Asset)
    return Asset.to_numpy()
I then pull 11 different stocks using the function
MRO= closing_price('MRO')
HES= closing_price('HES')
FANG= closing_price('FANG')
DVN= closing_price('DVN')
PXD= closing_price('PXD')
COP= closing_price('COP')
CVX= closing_price('CVX')
APA= closing_price('APA')
EOG= closing_price('EOG')
HAL= closing_price('HAL')
BLK = closing_price('BLK')
Which works so far
But when I try to merge the first 10 numpy arrays together,
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL])[:, :, 0]
X = np.transpose(X)
it gives me the error for the first line when I merge the numpy arrays
<ipython-input-53-a30faf3e4390>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
Have you tried passing a dtype explicitly, as your error message suggests (the message itself names dtype=object)?
X = np.array([MRO, HES, FANG, DVN, PXD, COP, CVX, APA, EOG, HAL], dtype=float)[:, :, 0]
Alternatively, what are you trying to do with your data afterwards, run a linear regression? Does the data have to be an np array? Often working with data is a lot easier using pandas.DataFrame, and basically all machine learning libraries such as sklearn or statsmodels or any other you might want to use will have pandas support.
To create one big dataset out of these you could try the following:
# assumes closing_price() returns the normalized DataFrame (return Asset) instead of Asset.to_numpy()
frames = {'MRO': MRO, 'HES': HES, 'FANG': FANG, 'DVN': DVN, 'PXD': PXD, 'COP': COP, 'CVX': CVX, 'APA': APA, 'EOG': EOG, 'HAL': HAL, 'BLK': BLK}
renamed = []
for name, frame in frames.items():
    # every frame has a single column labelled "Adj Close" and column names must be unique,
    # so prefix each one with its ticker: "MRO_Adj Close", "HES_Adj Close", etc.
    renamed.append(frame.add_prefix(name + "_"))
data = pd.concat(renamed, axis=1)  # aligns the frames on their date index
Additionally, this neatly avoids the problems that arise when different tickers cover different sets of dates, as Kevin Choon Liang Yew correctly pointed out in the comments above.
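As a rough illustration of the date-alignment point, the sketch below uses three made-up series (the names AAA, BBB and CCC and all values are hypothetical, not real quotes) that cover slightly different dates. Concatenating them column-wise aligns them on the index, and dropping incomplete rows leaves a rectangular array instead of a ragged one.

```python
import numpy as np
import pandas as pd

# Hypothetical normalized price series with different date coverage
dates = pd.date_range('2022-07-13', periods=6, freq='D')
prices = {
    'AAA': pd.Series([0.0, 0.2, 0.4, 0.6, 0.8, 1.0], index=dates),
    'BBB': pd.Series([1.0, 0.5, 0.0, 0.25, 0.75], index=dates[:5]),  # ends a day early
    'CCC': pd.Series([0.1, 0.0, 1.0, 0.9, 0.3], index=dates[1:]),    # starts a day late
}

# Outer join on the dates: NaN wherever a ticker has no row for that day
data = pd.concat(prices, axis=1)

# Keep only the dates that every ticker has, then convert to a 2-D array
aligned = data.dropna()
X = aligned.to_numpy()  # shape (n_dates, n_tickers)
```

Whether dropping rows or filling the gaps is the better choice depends on the model; dropna() is only the simplest option.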
I need to calculate the SMA (simple moving average) for time series data.
In particular, I want to plot that moving average on the x-axis and another array on the y-axis.
At the beginning, before I computed the SMA, these 2 arrays had the same length.
Then I found this example online to calculate the SMA:
# Python program to calculate
# simple moving averages using pandas
import pandas as pd
arr = [1, 2, 3, 7, 9]
window_size = 3
# Convert array of integers to pandas series
numbers_series = pd.Series(arr)
# Get the window of series
# of observations of specified window size
windows = numbers_series.rolling(window_size)
# Create a series of moving
# averages of each window
moving_averages = windows.mean()
# Convert pandas series back to list
moving_averages_list = moving_averages.tolist()
# Remove null entries from the list
final_list = moving_averages_list[window_size - 1:]
print(final_list)
I tried to replicate that in my case, but I get this error:
"ValueError: x and y must have same first dimension, but have shapes
(261228,) and (261237,)"
I paste a bit of my code, maybe it can be useful to understand better:
y_Int_40_dmRe=pd.Series(y_Int_40_dmRe)
windows_40_dmRe = y_Int_40_dmRe.rolling(window_size)
moving_averages_40_dmRe = windows_40_dmRe.mean()
moving_averages_40_dmRe_list = moving_averages_40_dmRe.tolist()
final_list_40_dmRe = moving_averages_40_dmRe_list[window_size - 1:]
plt.plot(final_list_40_dmRe,y_TotEn_40_dmRe, linewidth=2, label="40° - dmRe")
I'm here if you need more information. Thank you in advance for your help.
Chiara
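A possible explanation, offered as a sketch rather than a confirmed diagnosis: a rolling mean has no value for the first window_size - 1 positions, and the slice [window_size - 1:] removes exactly those entries from one array only. The reported shapes differ by 9, which would be consistent with a window of 10. Trimming the other array by the same amount keeps the two aligned. The arrays below are made-up stand-ins for y_Int_40_dmRe and y_TotEn_40_dmRe.

```python
import numpy as np
import pandas as pd

window_size = 10

# Hypothetical stand-ins for y_Int_40_dmRe and y_TotEn_40_dmRe
y_int = pd.Series(np.linspace(0.0, 1.0, 50))
y_tot = np.linspace(5.0, 6.0, 50)

# The first window_size - 1 entries of the rolling mean are NaN
moving_avg = y_int.rolling(window_size).mean()

# Drop the leading NaNs and the matching rows of the other array
x_trimmed = moving_avg.to_numpy()[window_size - 1:]
y_trimmed = y_tot[window_size - 1:]

# Both arrays now have the same length and can be passed to plt.plot
```

Another option is to skip the trimming and plot moving_avg directly against the full array, since matplotlib leaves NaN points out of the line.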
I have some skin temperature data (collected at 1Hz) which I intend to analyse.
However, the sensors were not always in contact with the skin. So I have a challenge of removing this non-skin temperature data, whilst preserving the actual skin temperature data. I have about 100 files to analyse, so I need to make this automated.
I'm aware that there is already this similar post, however I've not been able to use that to solve my problem.
My data roughly looks like this:
df =
timeStamp Temp
2018-05-04 10:08:00 28.63
. .
. .
2018-05-04 21:00:00 31.63
The first step I've taken is to simply apply a minimum threshold. This has got rid of the majority of the non-skin data. However, I'm left with the sharp jumps where the sensor was either removed or attached.
To remove these jumps, I was thinking about taking an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
e.g.
df_diff = df.diff(60) # period of about 60 makes jumps stick out
filter_index = np.nonzero((df_diff.Temp < -1) | (df_diff.Temp > 0.5)) # when diff is less than -1 or greater than 0.5, most likely data jumps.
However, I find myself stuck here. The main problem is that:
1) I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
The more minor problem is that
2) I think I will still be left with some residual artefacts from the data jumps near the edges (e.g. where a tighter threshold would start to chuck away good data). Is there either a better filtering strategy or a way to then get rid of these artefacts?
*Edit as suggested I've also calculated the second order diff, but to be honest, I think the first order diff would allow for tighter thresholds (see below):
*Edit 2: Link to sample data
Try the code below (I used a tangent function to generate data). I used the second order difference idea from Mad Physicist in the comments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
df[0] = np.arange(0,10,0.005)
df[1] = np.tan(df[0])
#the following line calculates the absolute value of a second order finite
#difference (derivative)
df[2] = 0.5*(df[1].diff()+df[1].diff(periods=-1)).abs()
df.loc[df[2] < .05][1].plot() #select out regions of a high rate-of-change
df[1].plot() #plot original data
plt.show()
Following is a zoom of the output showing what got filtered. Matplotlib plots a line from beginning to end of the removed data.
Your first question I believe is answered with the .loc selection above.
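For the temperature data specifically, the same idea can be sketched with a boolean mask instead of an index list. The ten readings below are invented for illustration, the thresholds are the ones from the question, and a one-sample diff is used only to keep the example short.

```python
import numpy as np
import pandas as pd

# Hypothetical 1 Hz trace with a drop while the sensor is off the skin
idx = pd.Timestamp('2018-05-04 10:08:00') + pd.to_timedelta(np.arange(10), unit='s')
df = pd.DataFrame(
    {'Temp': [31.0, 31.1, 31.0, 25.0, 24.9, 25.1, 31.2, 31.1, 31.0, 31.1]},
    index=idx,
)

# Change from the previous sample; the first value is NaN and compares as False
step = df['Temp'].diff()

# Thresholds taken from the question
is_jump = (step < -1) | (step > 0.5)

# Keep every row that is not flagged as a jump
df_clean = df.loc[~is_jump]
```

Note that this drops only the samples where the jump happens. The low plateau in between still has to be removed by the minimum threshold described in the question.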
Your second question will take some experimentation with your dataset. The code above only selects out high-derivative data. You'll also need your threshold selection to remove zeroes or the like. You can experiment with where to make the derivative selection. You can also plot a histogram of the derivative to give you a hint as to what to select out.
Also, higher order difference equations are possible to help with smoothing. This should help remove artifacts without having to trim around the cuts.
Edit:
A fourth-order finite difference can be applied using this:
df[2] = (df[1].diff(periods=1)-df[1].diff(periods=-1))*8/12 - \
(df[1].diff(periods=2)-df[1].diff(periods=-2))*1/12
df[2] = df[2].abs()
It's reasonable to think that it may help. The coefficients above can be worked out or derived from the following link for higher orders.
Finite Difference Coefficients Calculator
Note: The above second and fourth order central difference equations are not proper first derivatives. One must divide by the interval length (in this case 0.005) to get the actual derivative.
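To illustrate the note, here is a small sketch that applies the fourth-order stencil from the edit to a sine wave (chosen only because its derivative, the cosine, is known exactly) and divides by the step of 0.005. The variable names are made up for this example.

```python
import numpy as np
import pandas as pd

h = 0.005  # sample spacing, same as in the tangent example
x = pd.Series(np.arange(0, 10, h))
y = np.sin(x)

# diff(1) - diff(-1) equals y[i+1] - y[i-1]; diff(2) - diff(-2) equals y[i+2] - y[i-2]
stencil = (y.diff(periods=1) - y.diff(periods=-1)) * 8 / 12 \
    - (y.diff(periods=2) - y.diff(periods=-2)) * 1 / 12

# Dividing by the interval length turns the difference into a derivative estimate
derivative = stencil / h

# Largest deviation from the exact derivative; NaN rows at the edges are skipped
max_error = (derivative - np.cos(x)).abs().max()
```

The two samples at each end have no complete stencil and come out as NaN, so they need separate handling if the edges matter.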
Here's a suggestion that targets your issues regarding
[...]an approach where I use the first order differential of the temp and then use another set of thresholds to get rid of the data I'm not interested in.
[..]I don't know how to now use this index list to delete the non-skin data in df. How is best to do this?
using stats.zscore() and pandas.merge()
As it is, it will still have a minor issue with your concerns regarding
[...]left with some residual artefacts from the data jumps near the edges[...]
But we'll get to that later.
First, here's a snippet to produce a dataframe that shares some of the challenges with your dataset:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    # pd.datetime was removed from pandas; a date string works directly
    dates = pd.date_range('2016-01-01', periods=nsample)
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    df=df['y4'].to_frame()
    df.columns = ['Temp']
    # Insert missing values and spikes (positional .iloc avoids chained assignment)
    df.iloc[20:31, 0] = np.nan
    df.iloc[19, 0] = df.iloc[39, 0]/4000
    df.iloc[31, 0] = df.iloc[15, 0]/4000
    return df
# Dataframe with random data
df_raw = sample()
df_raw.plot()
As you can see, there are two distinct spikes with missing numbers between them. And it's really the missing numbers that are causing the problems here if you prefer to isolate values where the differences are large. The first spike is not a problem since you'll find the difference between a very small number and a number that is more similar to the rest of the data.
But for the second spike, you're going to get the (nonexisting) difference between a very small number and a non-existing number, so that the extreme data-point you'll end up removing is the difference between your outlier and the next observation.
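The effect of the gap can be seen on a tiny made-up series (the six values below are invented and are not taken from the sample dataframe): the spike before the gap gets a large difference of its own, while the spike after the gap gets NaN and the large difference lands on the following observation.

```python
import numpy as np
import pandas as pd

# Hypothetical series: normal value, spike, two missing values, spike, normal value
s = pd.Series([100.0, 0.02, np.nan, np.nan, 0.03, 101.0])

# First difference: any difference that involves a NaN is itself NaN
d = s.diff()

# Position 1 (first spike) holds a large negative difference.
# Position 4 (second spike) is NaN because the previous value is missing.
# Position 5 (the next good observation) holds the large positive difference.
large = d.abs() > 50
```

So a filter on the size of the difference flags position 5 rather than the second spike at position 4.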
This is not a huge problem for one single observation. You could just fill it right back in there. But for larger data sets that would not be a very viable solution. Anyway, if you can manage without that particular value, the code below should solve your problem. You will also have a similar problem with your very first observation, but I think it would be far more trivial to decide whether or not to keep that one value.
The steps:
# 1. Get some info about the original data:
firstVal = df_raw[:1]
colName = df_raw.columns
# 2. Take the first difference
df_diff = df_raw.diff()
# 3. Remove missing values
df_clean = df_diff.dropna()
# 4. Select a level for a Z-score to identify and remove outliers
level = 3
df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
ix_keep = df_Z.index
# 5. Subset the raw dataframe with the indexes you'd like to keep
df_keep = df_raw.loc[ix_keep]
# 6.
# df_keep will be missing some indexes.
# Do the following if you'd like to keep those indexes
# and, for example, fill missing values with the previous values
df_out = pd.merge(df_keep, df_raw, how='outer', left_index=True, right_index=True)
# 7. Keep only the first column
df_out = df_out.iloc[:,0].to_frame() # .ix was removed from pandas; .iloc selects by position
# 8. Fill missing values
df_complete = df_out.ffill() # same as fillna(method='ffill'), which is deprecated in recent pandas
# 9. Replace first value
df_complete.iloc[0] = firstVal.iloc[0]
# 10. Reset column names
df_complete.columns = colName
# Result
df_complete.plot()
Here's the whole thing for an easy copy-paste:
# Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(22)
# A function for noisy data with a trend element
def sample():
    base = 100
    nsample = 50
    sigma = 10
    # Basic df with trend and sinus seasonality
    trend1 = np.linspace(0,1, nsample)
    y1 = np.sin(trend1)
    # pd.datetime was removed from pandas; a date string works directly
    dates = pd.date_range('2016-01-01', periods=nsample)
    df = pd.DataFrame({'dates':dates, 'trend1':trend1, 'y1':y1})
    df = df.set_index(['dates'])
    df.index = pd.to_datetime(df.index)
    # Gaussian Noise with amplitude sigma
    df['y2'] = sigma * np.random.normal(size=nsample)
    df['y3'] = df['y2'] + base + (np.sin(trend1))
    df['trend2'] = 1/(np.cos(trend1)/1.05)
    df['y4'] = df['y3'] * df['trend2']
    df=df['y4'].to_frame()
    df.columns = ['Temp']
    # Insert missing values and spikes (positional .iloc avoids chained assignment)
    df.iloc[20:31, 0] = np.nan
    df.iloc[19, 0] = df.iloc[39, 0]/4000
    df.iloc[31, 0] = df.iloc[15, 0]/4000
    return df
# A function for removing outliers
def noSpikes(df, level, keepFirst):
    # 1. Get some info about the original data:
    firstVal = df[:1]
    colName = df.columns
    # 2. Take the first difference
    df_diff = df.diff()
    # 3. Remove missing values
    df_clean = df_diff.dropna()
    # 4. Select a level for a Z-score to identify and remove outliers
    df_Z = df_clean[(np.abs(stats.zscore(df_clean)) < level).all(axis=1)]
    ix_keep = df_Z.index
    # 5. Subset the input dataframe with the indexes you'd like to keep
    #    (uses the df argument rather than the global df_raw)
    df_keep = df.loc[ix_keep]
    # 6.
    # df_keep will be missing some indexes.
    # Do the following if you'd like to keep those indexes
    # and, for example, fill missing values with the previous values
    df_out = pd.merge(df_keep, df, how='outer', left_index=True, right_index=True)
    # 7. Keep only the first column (.ix was removed from pandas)
    df_out = df_out.iloc[:,0].to_frame()
    # 8. Fill missing values
    df_complete = df_out.ffill()
    # 9. Reset column names
    df_complete.columns = colName
    # Keep the first value
    if keepFirst:
        df_complete.iloc[0] = firstVal.iloc[0]
    return df_complete
# Dataframe with random data
df_raw = sample()
df_raw.plot()
# Remove outliers
df_cleaned = noSpikes(df=df_raw, level = 3, keepFirst = True)
df_cleaned.plot()
The goal is to calculate RMSE between two groups of columns in a pandas dataframe. The problem is that the amount of memory actually used is almost 10x the size of the dataframe. Here is the code I used to calculate RMSE:
import pandas as pd
import numpy as np
from random import shuffle
# set up test df (actual data is a pre-computed DF stored in HDF5)
dim_x, dim_y = 50, 1000000 # actual dataset dim_y = 56410949
cols = ["a_"+str(i) for i in range(1,(dim_x//2)+1)]
cols_b = ["b_"+str(i) for i in range(1,(dim_x//2)+1)]
cols.extend(cols_b)
df = pd.DataFrame(np.random.uniform(0,10,[dim_y, dim_x]), columns=cols)
# calculate rmse : https://stackoverflow.com/a/46349518
a = df.values
diffs = a[:,1:26] - a[:,26:27]
rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
df['rmse_out'] = rmse_out  # store the result in the dataframe before pickling it
df['rmse_out'].to_pickle('results_rmse.p')
When I get the values from the df with a = df.values, the memory usage for that routine approaches 100GB according to top. The routine that calculates the difference between these columns, diffs = a[:,1:26] - a[:,26:27], approaches 120GB and then produces a MemoryError. How can I modify my code to make it more memory-efficient, avoid the error, and actually calculate my RMSE values?
The solution I used was to chunk the dataframe down:
import glob
import numpy as np
import pandas as pd
df = pd.read_hdf('madre_merge_sort32.h5')
for i,d in enumerate(np.array_split(df, 10)):
    d.to_pickle(str(i)+".p")
Then I ran through those pickled mini-dfs and calculated rmse in each:
for fn in glob.glob("*.p"):
    # process df values
    df = pd.read_pickle(fn)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df.dropna(inplace=True)
    # first two cols are non-numeric, so skip; to_numpy() replaces the removed as_matrix()
    a = df[df.columns[2:]].to_numpy()
    # calculate rmse
    diffs = a[:,:25] - a[:,25:]
    rmse_out = np.sqrt(np.einsum('ij,ij->i',diffs,diffs)/3.0)
    df['rmse_out'] = rmse_out
    df.to_pickle("out"+fn)
Then I concatenated them:
dfls = []
for fn in glob.glob("out*.p"):
    df = pd.read_pickle(fn)
    dfls.append(df)
dfcat = pd.concat(dfls)
Chunking seemed to work for me.
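If writing intermediate pickles is not wanted, the same chunking idea can be sketched in memory: process a block of rows at a time and write each block's result into a preallocated output array, so only one block of differences exists at once. The function name rowwise_rmse, its parameters and the small test frame are all hypothetical. One deliberate difference from the code above: this sketch divides by the number of column pairs rather than by 3.0.

```python
import numpy as np
import pandas as pd

def rowwise_rmse(df, cols_a, cols_b, chunk_rows=100_000):
    """Row-wise RMSE between two column groups, computed block by block."""
    out = np.empty(len(df), dtype=np.float64)
    for start in range(0, len(df), chunk_rows):
        stop = min(start + chunk_rows, len(df))
        block = df.iloc[start:stop]
        # Only this block's numeric values are materialised as arrays
        diffs = block[cols_a].to_numpy() - block[cols_b].to_numpy()
        # Sum of squares per row, divided by the number of column pairs
        out[start:stop] = np.sqrt(np.einsum('ij,ij->i', diffs, diffs) / diffs.shape[1])
    return pd.Series(out, index=df.index, name='rmse_out')

# Small made-up frame: two "a" columns compared with two "b" columns
rng = np.random.default_rng(1)
small = pd.DataFrame(rng.uniform(0, 10, (7, 4)), columns=['a_1', 'a_2', 'b_1', 'b_2'])
rmse = rowwise_rmse(small, ['a_1', 'a_2'], ['b_1', 'b_2'], chunk_rows=3)
```

Selecting only the numeric columns before converting also matters for memory: calling .values on a frame that mixes numeric and non-numeric columns produces an object array, which is much larger than a float array.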
Assume two dataframes, each with a datetime index, and each with one column of unnamed data. The dataframes are of different lengths and the datetime indexes may or may not overlap.
df1 is length 20. df2 is length 400. The data column consists of random floats.
I want to iterate through df2 taking 20 units per iteration, with each iteration incrementing the start vector by one unit - and similarly the end vector by one unit. On each iteration I want to calculate the correlation between the 20 units of df1 and the 20 units I've selected for this iteration of df2. This correlation coefficient and other statistics will then be recorded.
Once the loop is complete I want to plot df1 with the 20-unit vector of df2 that satisfies my statistical search - thus needing to keep up with some level of indexing to reacquire the vector once analysis has been completed.
Any thoughts?
Without knowing more specifics, such as why you are doing this or whether the dates matter, this will do what you asked. I'm happy to update based on your feedback.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
df1 = pd.DataFrame({'a':[random.randint(0, 20) for x in range(20)]}, index = pd.date_range(start = '2013-01-01',periods = 20, freq = 'D'))
df2 = pd.DataFrame({'b':[random.randint(0, 20) for x in range(400)]}, index = pd.date_range(start = '2013-01-10',periods = 400, freq = 'D'))
corr_values = {}
for i in range(len(df2) - len(df1) + 1): # every 20-row window of df2, including the last one
    t0 = df1.reset_index()['a'] # grab the numbers from df1
    t1 = df2.iloc[i:i+20].reset_index()['b'] # grab 20 days, incrementing by one each time
    t2 = df2.iloc[i:i+20].index[0] # label the window by its first day in df2
    corr_values[t2] = t0.corr(t1) # calculate the correlation and store it
corr = pd.DataFrame({'corr': pd.Series(corr_values)}) # DataFrame.append was removed from pandas, so build the frame once
# plot it and save the graph
corr.plot()
plt.title("Correlation Graph")
plt.ylabel("(%)")
plt.grid(True)
plt.savefig('corr.png') # save before show(), otherwise the saved figure can be blank
plt.show()
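To get back the 20-unit vector of df2 once the loop is done, one option (a sketch, assuming "best" means the highest correlation) is to index the correlations by the start date of each window, look up the best start date, and slice df2 again from that position. The data below is random and seeded only so the example is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df1 = pd.DataFrame({'a': rng.integers(0, 21, 20)},
                   index=pd.date_range('2013-01-01', periods=20, freq='D'))
df2 = pd.DataFrame({'b': rng.integers(0, 21, 400)},
                   index=pd.date_range('2013-01-10', periods=400, freq='D'))

window = len(df1)
a = df1['a'].to_numpy()
b = df2['b'].to_numpy()
n_windows = len(b) - window + 1

# Correlation of df1 with every window of df2, labelled by the window's first day
corr = pd.Series(
    [np.corrcoef(a, b[i:i + window])[0, 1] for i in range(n_windows)],
    index=df2.index[:n_windows],
    name='corr',
)

# Start date of the best window, then the window itself
best_start = corr.idxmax()
pos = df2.index.get_loc(best_start)
best_window = df2.iloc[pos:pos + window]
```

best_window can then be plotted against df1, for example by resetting both indexes so the two 20-unit vectors share an x-axis.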