Pandas Time series manipulation with large panel data - python

Here is my large panel dataset:
Date        x1   x2   x3
2017-07-20  50   60   Kevin
2017-07-21  51   80   Kevin
2016-05-23  100  200  Cathy
2016-04-20  20   20   Cathy
2019-01-02  50   60   Leo
This dataset contains billions of rows. What I would like to do is calculate the 1-day difference, in percentage terms, between x1 and x2: denote by t and t+1 the times representing today and tomorrow; I would like to calculate (x1_{t+1} - x2_t) / x2_t. For example, for Kevin's first row that is (51 - 60) / 60 = -0.15.
First I used the fastest way in terms of writing: I created a nested list containing all the target values for each group of x3:
nested_list = []
flatten_list = []
for group in df.x3.unique():
    df_ = df[df.x3 == group]
    nested_list.append((df_.x1.shift(-1) / df_.x2) / df_.x2)
for lst in nested_list:
    for i in lst:
        flatten_list.append(i)
df["target"] = flatten_list
However, this method will literally take a year to run, which is not feasible.
I also tried the native pandas groupby method for a potentially runnable outcome, but it did not seem to work:
def target_calculation(x):
    target = (x.x1.shift(-1) - x.x2) / x.x2
    return target

df["target"] = df.groupby("x3")[["x1", "x2"]].apply(target_calculation)
How can I calculate this without using a for loop, or possibly vectorize the whole process?

You could groupby + shift "x1" and subtract "x2" from it:
df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']
Output:
Date x1 x2 x3 target
0 2017-07-20 50 60 Kevin -0.15
1 2017-07-21 51 80 Kevin NaN
2 2016-05-23 100 200 Cathy -0.90
3 2016-04-20 20 20 Cathy NaN
4 2019-01-02 50 60 Leo NaN
Note that
(df.groupby('x3')['x1'].shift(-1) / df['x2']) / df['x2']
produces output equivalent to flatten_list, but I don't think that is your true desired output; it looks like a typo in the question.
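For reference, a minimal self-contained sketch (using the sample rows above) that reproduces this output; note that groupby().shift(-1) relies on row order within each x3 group, so sort by Date within each group first if the frame is not already in chronological order:
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"],
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})
# shift x1 up by one row within each x3 group, then compute the percentage difference
df["target"] = (df.groupby("x3")["x1"].shift(-1) - df["x2"]) / df["x2"]
print(df)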

Related

pythonic way to make sure data is always decreasing within group in pandas

I have a dataset that looks like this:
Id  date      value
x1  01-01-22  46
x1  02-01-22  46
x1  03-01-22  45.8
....
x2  03-04-22  57
x2  03-04-22  62
....
The number in value should always decrease (or stay the same) as time goes on, so the next observation for x2 would fail.
What's the most pythonic way to append a column of 1s and 0s if the value increases by, say, more than 3% (as there could be some measurement error)? In R I would just use dplyr & group_by, and I was hoping for something as elegant in pandas.
Edit for clarity: the decrease must be within the Id (e.g. per item).
I think this should do it (edited: I had the minus flipped; edited again after clarification):
def f(gdf):
    return ((gdf.value - gdf.value.shift()) / gdf.value.shift()) > .03

df['flag'] = df.groupby('Id').apply(f).values
You can use pct_change:
df['big_change'] = df.groupby('Id')['value'].pct_change().gt(.03).astype(int)
Output:
Id date value big_change
0 x1 01-01-22 46.0 0
1 x1 02-01-22 46.0 0
2 x1 03-01-22 45.8 0
3 x2 03-04-22 57.0 0
4 x2 03-04-22 62.0 1
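For what it's worth, here is a small self-contained sketch (built from the question's sample rows) showing that the per-group pct_change is the same quantity as the manual shift formula in the first answer:
import pandas as pd

df = pd.DataFrame({
    "Id": ["x1", "x1", "x1", "x2", "x2"],
    "date": ["01-01-22", "02-01-22", "03-01-22", "03-04-22", "03-04-22"],
    "value": [46, 46, 45.8, 57, 62],
})
# pct_change within each Id computes (value_t - value_{t-1}) / value_{t-1}
prev = df.groupby("Id")["value"].shift()
manual = (df["value"] - prev) / prev
df["big_change"] = df.groupby("Id")["value"].pct_change().gt(0.03).astype(int)
assert df["big_change"].equals(manual.gt(0.03).astype(int))
print(df)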

Compare every row against time ranges in a second dataframe; on a match, copy values from the columns of the matching row

I have two DFs:
DF1 has a column for time
MsgTime
13:45:33
14:13:25
15:16:43
16:51:19
DF2 has two columns of time, and two additional columns of information:
MsgTime1  MsgTime2  Temperature  Humidity
13:40:33  13:50:13  21           45
14:16:43  14:26:43  22           56
16:49:11  16:59:02  32           40
So if MsgTime is in between MsgTime1 and MsgTime2 from DF2, copy the Temperature and Humidity to DF1 as output. The loop should compare every row to find a match; if there is no match, leave it blank. The desired output should look like this:
MsgTime   Temperature  Humidity
13:45:33  21           45
14:13:25  NaN          NaN
15:16:43  NaN          NaN
16:51:19  32           40
I tried a double for loop, but it didn't seem to work:
for i, row in DF1.iterrows():
    for j, row2 in DF2.iterrows():
        if (row2['MsgTime1'] <= row['MsgTime']) and (row['MsgTime'] <= row2['MsgTime2']):
            row['Temperature'] = row2['Temperature']
            row['Humidity'] = row2['Humidity']
        else:
            row2 += 1
The package piso (pandas interval set operations) provides a very fast implementation for this:
df1 = pd.DataFrame(
    pd.to_datetime(["13:45:33", "14:13:25", "15:16:43", "16:51:19"]),
    columns=["MsgTime"],
)
df2 = pd.DataFrame(
    {
        "MsgTime1": pd.to_datetime(["13:40:33", "14:16:43", "16:49:11"]),
        "MsgTime2": pd.to_datetime(["13:50:13", "14:26:43", "16:59:02"]),
        "Temperature": [21, 22, 32],
        "Humidity": [45, 56, 40],
    }
)
Solution
Create a dataframe with Temperature and Humidity columns, indexed by a pandas.IntervalIndex:
df3 = df2[["Temperature", "Humidity"]].set_index(
    pd.IntervalIndex.from_arrays(df2["MsgTime1"], df2["MsgTime2"])
)
df3 looks like this
Temperature Humidity
(2021-11-03 13:40:33, 2021-11-03 13:50:13] 21 45
(2021-11-03 14:16:43, 2021-11-03 14:26:43] 22 56
(2021-11-03 16:49:11, 2021-11-03 16:59:02] 32 40
Note that since no date component was provided, pandas just assumes today's date. You could alternatively use pandas.Timedelta instead of pandas.Timestamp; the approach will work the same.
Next use piso.lookup
piso.lookup(df3, df1["MsgTime"])
which produces
Temperature Humidity
2021-11-03 13:45:33 21.0 45.0
2021-11-03 14:13:25 NaN NaN
2021-11-03 15:16:43 NaN NaN
2021-11-03 16:51:19 32.0 40.0
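Following the earlier note, here is a minimal sketch of the pandas.Timedelta variant; it assumes piso.lookup handles timedelta-typed intervals the same way, as the answer suggests:
import pandas as pd
import piso

# same clock times, parsed as Timedeltas (time-of-day offsets) instead of Timestamps
msg_times = pd.to_timedelta(["13:45:33", "14:13:25", "15:16:43", "16:51:19"])
intervals = pd.IntervalIndex.from_arrays(
    pd.to_timedelta(["13:40:33", "14:16:43", "16:49:11"]),
    pd.to_timedelta(["13:50:13", "14:26:43", "16:59:02"]),
)
df3_td = pd.DataFrame({"Temperature": [21, 22, 32], "Humidity": [45, 56, 40]}, index=intervals)
piso.lookup(df3_td, msg_times)  # same Temperature/Humidity lookup, with no date component involved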
Note: I am the creator of piso. Please feel free to reach out with feedback or questions if you have any.
One way using pandas.Interval with pandas.to_datetime:
# build an interval-valued index on df2 from the two time columns
df2.index = df2[["MsgTime1", "MsgTime2"]].apply(lambda x: pd.Interval(*pd.to_datetime(x)), axis=1)
s = pd.to_datetime(df["MsgTime"])
# mapping each timestamp through df2[k] looks up the interval that contains it
for k in ["Temperature", "Humidity"]:
    df[k] = s.map(df2[k])
print(df)
Output:
MsgTime Temperature Humidity
0 13:45:33 21.0 45.0
1 14:13:25 NaN NaN
2 15:16:43 NaN NaN
3 16:51:19 32.0 40.0

Create new Pandas Dataframe from observations which meet specific criteria

I have two original dataframes.
One contains limits: df_limits
feat_1 feat_2 feat_3
target 12 9 90
UL 15 10 120
LL 9 8 60
where target is the ideal value, UL is the upper limit, and LL is the lower limit.
And another one with the original data: df_to_check
ID feat_1 feat_2 feat_3
123 12.5 9.6 100
456 18 3 100
789 9 11 100
I'm creating a function whose desired output is the ID and the features which are below or above the threshold (the limits from the first dataframe). So far I'm able to recognise which features are out of limits, but I'm getting the full output of the original dataframe...
def table(df_limits, df_to_check, column):
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    if UL_index is not None:
        above_limit = {'ID': df_to_check['ID'],
                       'column': df_to_check[column],
                       'target': df_limits[column].loc['target']}
        return pd.DataFrame(above_limit)
What should I change so that my desired output would be the following (showing only the ID and the column where observations are out of limit)? It would be best if it also showed by how many percent the original value deviates from the ideal target value (I would be glad for advice on how to add such a column):
ID column target value deviate(%)
456 feat_1 12 18 50
456 feat_2 9 3 ...
789 feat_2 9 11 ...
After running this function it returns the whole dataset, because the statement says "if not None"... and UL_index is never None. I understand why I have this issue, but I don't know how to change it.
The issue is with the statement if UL_index is not None: since an empty index is still not None, and I'm looking for a way to replace this part.
Approach: reshape, merge, calculate.
new_df = (
    df_to_check.set_index("ID").unstack().reset_index()
    .rename(columns={"level_0": "column", 0: "value"})
    .merge(df_limits.T, left_on="column", right_index=True)
    .assign(deviate=lambda dfa: (dfa.value - dfa.target) / dfa.target)
)
column  ID   value  target  UL   LL  deviate
feat_1  123  12.5   12      15   9   0.0416667
feat_1  456  18     12      15   9   0.5
feat_1  789  9      12      15   9   -0.25
feat_2  123  9.6    9       10   8   0.0666667
feat_2  456  3      9       10   8   -0.666667
feat_2  789  11     9       10   8   0.222222
feat_3  123  100    90      120  60  0.111111
feat_3  456  100    90      120  60  0.111111
feat_3  789  100    90      120  60  0.111111
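If only the out-of-limit rows are wanted, as in the desired output above, a possible follow-up filter on new_df could be:
# keep rows whose value falls outside [LL, UL]
out_of_limits = new_df[(new_df["value"] > new_df["UL"]) | (new_df["value"] < new_df["LL"])]
print(out_of_limits[["ID", "column", "target", "value", "deviate"]])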
First of all, you have not provided a reproducible example (https://stackoverflow.com/help/minimal-reproducible-example) because you have not shared the code which produces the two initial dataframes. Next time you ask a question, please keep this in mind. Without those, I made a toy example with my own (random) data.
I start by unpivoting what you call dataframe_to_check: that's because, if you want to check each feature independently, then that dataframe is not normalised (you might want to look up what database normalisation means).
The next step is a left outer join between the unpivoted dataframe you want to check and the (transposed) dataframe with the limits.
Once you have that, you can easily calculate whether a row is within the range, the deviation between value and target, etc, and you can of course group this however you want.
My code is below. It should be easy enough to customise it to your case.
import pandas as pd
import numpy as np

df_limits = pd.DataFrame(index=['min val', 'max val', 'target'])
df_limits['a'] = [2, 4, 3]
df_limits['b'] = [3, 5, 4.5]
df = pd.DataFrame(columns=df_limits.columns, data=np.random.rand(100, 2) * 6)
df_unpiv = pd.melt(df.reset_index().rename(columns={'index': 'id'}), id_vars='id', var_name='feature', value_name='value')
# I reset the index because I couldn't get a join on a column and index, but there is probably a better way to do it
df_joined = pd.merge(df_unpiv, df_limits.transpose().reset_index().rename(columns={'index': 'feature'}), how='left', on='feature')
df_joined['abs diff from target'] = abs(df_joined['value'] - df_joined['target'])
df_joined['outside range'] = (df_joined['value'] < df_joined['min val']) | (df_joined['value'] > df_joined['max val'])
df_outside_range = df_joined.query("`outside range` == True")
df_inside_range = df_joined.query("`outside range` == False")
I solved my issue, maybe in a bit of a clumsy way, but it works as desired... If someone has a better answer I will still appreciate it.
Here is an example of how to get only the observations above the limits; to get both, just concatenate the observations from UL_index and LL_index:
def table(df_limits, df_to_check, column):
    above_limit = []
    df_above_limit = pd.DataFrame()
    UL = df_limits[column].loc['target'] + df_limits[column].loc['UL']
    LL = df_limits[column].loc['target'] + df_limits[column].loc['LL']
    UL_index = df_to_check.loc[df_to_check[column] > UL].index
    LL_index = df_to_check.loc[df_to_check[column] < LL].index
    df_to_check_UL = df_to_check.loc[UL_index]
    df_to_check_LL = df_to_check.loc[LL_index]
    above_limit = {
        'ID': df_to_check_UL['ID'],
        'feature value': df_to_check[column],
        'target': df_limits[column].loc['target']
    }
    df_above_limit = pd.DataFrame(above_limit, index=df_to_check_UL.index)
    return df_above_limit

I want to read each cell of a pandas df one after another and do some calculation on them

I want to read each cell of a pandas df one after another and do some calculation on them, but I have a problem using dictionaries or lists. For example, I want to check for the i-th cell whether the outdoor temperature is more than X and whether humidity is more/less than Y, and then do a special calculation for that row.
Here is the body of the loaded df:
data=pd.read_csv('/content/drive/My Drive/Thesis/DS1.xlsx - Sheet1.csv')
data=data.drop(columns=["Date","time","real feel","Humidity","indoor temp"])
print(data)
and here is the data:
outdoor temp Unnamed: 6 Humidity Estimation: (poly3)
0 26 NaN 64.1560
1 25 NaN 68.6875
2 25 NaN 68.6875
3 24 NaN 72.4640
4 24 NaN 72.4640
.. ... ... ...
715 35 NaN 22.5625
716 33 NaN 28.1795
717 32 NaN 32.3680
718 31 NaN 37.2085
719 30 NaN 42.5000
[720 rows x 3 columns]
Create a function and then use .apply() to run it on each row. You can set temp and humid to your desired values. If you want to reference a specific row, just use data.iloc[row_index]. I am not sure what calculation you want to do, so I just added one to the value.
def calculation(row, temp, humid):
    # adjust the column names to match your DataFrame
    # (e.g. "Humidity Estimation: (poly3)" in the printout above)
    if row["outdoor temp"] > temp:
        row["outdoor temp"] += 1
    if row["humidity"] > humid:
        row["humidity"] += 1
    return row  # apply(axis=1) needs the modified row back

data = data.apply(lambda row: calculation(row, temp, humid), axis=1)
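If the special calculation only depends on those two conditions, a vectorized sketch (with X, Y and the +1 adjustment as placeholders, and column names taken from the printout in the question) avoids the row-wise apply entirely:
# placeholder thresholds and adjustment; replace with the real calculation
X, Y = 30, 50
mask = (data["outdoor temp"] > X) & (data["Humidity Estimation: (poly3)"] < Y)
data.loc[mask, "outdoor temp"] += 1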

Correlation between two dataframes' columns with matched headers

I have two dataframes from Excel files which look like the ones below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more panda-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't make much sense since it is random.
>>> print(stock_prices.head())
BYZ6DH BLZGSL MBT BAP
KRW THB USD USD
2019-02-01 15 10 19 19
2019-02-02 5 9 19 5
2019-02-03 19 7 18 10
2019-02-04 1 6 7 18
2019-02-05 11 17 6 7
>>> print(fx.head())
KRW THB USD
2019-02-01 15 11 10
2019-02-02 6 5 3
2019-02-03 13 1 3
2019-02-04 19 8 14
2019-02-05 6 13 2
Use apply to calculate the correlation between each stock column and the currency with the same name.
def f(x, fx):
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH KRW -0.247529
BLZGSL THB 0.043084
MBT USD -0.471750
BAP USD 0.314969
dtype: float64
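A possible alternative sketch uses DataFrame.corrwith, assuming (as above) that the second level of each stock column names its currency:
# align a copy of each currency series to its stock column, then correlate pairwise
fx_aligned = fx[stock_prices.columns.get_level_values(1)]
fx_aligned.columns = stock_prices.columns
print(stock_prices.corrwith(fx_aligned))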
