How do you fill a dataframe index in a sequential fashion? - python

I have the following dataframe, df:
Date Number
2022-01-01 1
2022-01-08 2
2022-01-15 5
I wish to have the following, where "Date" is the index column. How can I add rows whose dates continue the sequence?
Date Number
2022-01-01 1
2022-01-08 2
2022-01-15 5
2022-01-22 NaN
2022-01-29 NaN

Does the following help?
import pandas as pd
from datetime import datetime

df = pd.DataFrame([(datetime(2022, 1, 1), 1), (datetime(2022, 1, 8), 2),
                   (datetime(2022, 1, 15), 5)],
                  columns=['Date', 'Number']).set_index('Date')
n = 2  # number of extra rows
# Spacing between consecutive dates, taken from the last two entries
diff = df.index[-1] - df.index[-2]
# Start the range at 1 so the first new date falls after the last existing one
add_keys = [df.index[-1] + x * diff for x in range(1, n + 1)]
df1 = pd.DataFrame(index=add_keys)
df3 = pd.concat([df, df1], axis=1)
print(df3)
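A variant of the same idea, as a sketch: if the index is evenly spaced, pd.date_range plus reindex extends it in one step (inferring the step from the last two dates is an assumption about the data):

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame([(datetime(2022, 1, 1), 1), (datetime(2022, 1, 8), 2),
                   (datetime(2022, 1, 15), 5)],
                  columns=['Date', 'Number']).set_index('Date')

n = 2  # number of extra rows to add
step = df.index[-1] - df.index[-2]  # assumes evenly spaced dates
# Build the full index from the first date, extended by n more steps
full_index = pd.date_range(df.index[0], periods=len(df) + n, freq=step)
df = df.reindex(full_index)
print(df)
```

reindex fills the new rows with NaN, which matches the desired output above.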

Resampling timeseries dataframe with multi-index

Generate data:
import pandas as pd
import numpy as np

FREQ = 5  # sampling frequency in minutes
df = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T', start='2020-10-01', periods=12 * 24))
df['col1'] = np.random.normal(size=df.shape[0])
# randint replaces the deprecated random_integers (upper bound is exclusive)
df['col2'] = np.random.randint(1, 101, size=df.shape[0])
df['uid'] = 1
df2 = pd.DataFrame(index=pd.date_range(freq=f'{FREQ}T', start='2020-10-01', periods=12 * 24))
df2['col1'] = np.random.normal(size=df2.shape[0])
df2['col2'] = np.random.randint(1, 51, size=df2.shape[0])
df2['uid'] = 2
df3 = pd.concat([df, df2]).reset_index()
df3 = df3.set_index(['index', 'uid'])
I am trying to resample the data to 30-minute intervals and specify how each column is aggregated for each uid individually. I have many columns and need to choose mean, median, std, max, or min for each one. Since there are duplicate timestamps, I need to do this operation per user; that is why I set the MultiIndex and do the following:
df3.groupby(pd.Grouper(freq='30Min', closed='right', label='right')).agg(
    {"col1": "max", "col2": "min", 'uid': 'max'})
but I get the following error
ValueError: MultiIndex has no single backing array. Use
'MultiIndex.to_numpy()' to get a NumPy array of tuples.
How can I do this operation?
You have to specify the level name when you use pd.Grouper on a MultiIndex, and add 'uid' as a second grouping key:
out = (df3.groupby([pd.Grouper(level='index', freq='30T', closed='right', label='right'), 'uid'])
          .agg({"col1": "max", "col2": "min"}))
print(out)
print(out)
# Output
col1 col2
index uid
2020-10-01 00:00:00 1 -0.222489 77
2 -1.490019 22
2020-10-01 00:30:00 1 1.556801 16
2 0.580076 1
2020-10-01 01:00:00 1 0.745477 12
... ... ...
2020-10-02 23:00:00 2 0.272276 13
2020-10-02 23:30:00 1 0.378779 20
2 0.786048 5
2020-10-03 00:00:00 1 1.716791 20
2 1.438454 5
[194 rows x 2 columns]
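An equivalent route, sketched here on small synthetic data (column and level names follow the snippets above), moves uid out of the index and chains groupby with resample:

```python
import numpy as np
import pandas as pd

# Small stand-in for df3: two uids, one reading every 5 minutes
idx = pd.date_range('2020-10-01', periods=12, freq='5min')
frames = []
for uid in (1, 2):
    f = pd.DataFrame({'col1': np.arange(12, dtype=float),
                      'col2': np.arange(12, 24),
                      'uid': uid}, index=idx)
    frames.append(f)
df3 = pd.concat(frames).reset_index().set_index(['index', 'uid'])

# Move uid back to a column, group by it, then resample the DatetimeIndex
out = (df3.reset_index('uid')
          .groupby('uid')
          .resample('30min', closed='right', label='right')
          .agg({'col1': 'max', 'col2': 'min'}))
print(out)
```

The result is indexed by (uid, timestamp) rather than (timestamp, uid), so pick whichever ordering suits the downstream code.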

How to replace negative values in dataframe with specified values?

I have a df and need to replace negative values with a specified value. How can I make my code simpler and avoid the warning?
Before replacement:
datetime a0 a1 a2
0 2022-01-01 0.097627 0.430379 0.205527
1 2022-01-02 0.089766 -0.152690 0.291788
2 2022-01-03 -0.124826 0.783546 0.927326
3 2022-01-04 -0.233117 0.583450 0.057790
4 2022-01-05 0.136089 0.851193 -0.857928
5 2022-01-06 -0.825741 -0.959563 0.665240
6 2022-01-07 0.556314 0.740024 0.957237
7 2022-01-08 0.598317 -0.077041 0.561058
8 2022-01-09 -0.763451 0.279842 -0.713293
9 2022-01-10 0.889338 0.043697 -0.170676
After replacing,
datetime a0 a1 a2
0 2022-01-01 9.762701e-02 4.303787e-01 2.055268e-01
1 2022-01-02 8.976637e-02 1.000000e-13 2.917882e-01
2 2022-01-03 1.000000e-13 7.835460e-01 9.273255e-01
3 2022-01-04 1.000000e-13 5.834501e-01 5.778984e-02
4 2022-01-05 1.360891e-01 8.511933e-01 1.000000e-13
5 2022-01-06 1.000000e-13 1.000000e-13 6.652397e-01
6 2022-01-07 5.563135e-01 7.400243e-01 9.572367e-01
7 2022-01-08 5.983171e-01 1.000000e-13 5.610584e-01
8 2022-01-09 1.000000e-13 2.798420e-01 1.000000e-13
9 2022-01-10 8.893378e-01 4.369664e-02 1.000000e-13
<ipython-input-5-887189ce29a9>:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df2[df2 < 0] = float(1e-13)
My code is as follows; the generate_data function produces demo data.
import numpy as np
import pandas as pd

np.random.seed(0)

# This function generates demo data.
def generate_data():
    datetime1 = pd.date_range(start='20220101', end='20220110')
    df = pd.DataFrame(data=datetime1, columns=['datetime'])
    col = [f'a{x}' for x in range(3)]
    df[col] = np.random.uniform(-1, 1, (10, 3))
    return df

def main():
    df = generate_data()
    print(df)
    col = list(df.columns)[1:]
    df2 = df[col]
    df2[df2 < 0] = float(1e-13)  # this line triggers the warning
    df[col] = df2
    print(df)

if __name__ == '__main__':
    main()
You get the warning because df2 is a slice of df, so assigning into it may write to a copy. Use df2.mask(...), which returns a new DataFrame instead of mutating the slice:
df2 = df2.mask(df2 < 0, float(1e-13))
You may try to use np.where:
pd.concat([df.datetime,
           df.iloc[:, 1:4].apply(lambda x: np.where(x < 0, float(1e-13), x), axis=0)],
          axis=1)
Btw, thanks for the beautiful reproducible example!
The loc indexer from pandas may help. Once your df is generated:
# get columns to check for the condition
cols = list(df.columns)[1:]
# iterate through columns and replace
for col in cols:
    df.loc[df[col] < 0, col] = float(1e-13)
This should do the trick, hope it helps!
Maybe this:
df1 = df[['datetime']].copy()  # keep a copy of the datetime column
df = df.mask(df.loc[:, df.columns != 'datetime'] < 0, float(1e-13))
df['datetime'] = df1['datetime']
print(df)
All the code:
import numpy as np
import pandas as pd

np.random.seed(0)

# This function generates demo data.
def generate_data():
    datetime1 = pd.date_range(start='20220101', end='20220110')
    df = pd.DataFrame(data=datetime1, columns=['datetime'])
    col = [f'a{x}' for x in range(3)]
    df[col] = np.random.uniform(-1, 1, (10, 3))
    return df

def main():
    df = generate_data()
    df1 = df[['datetime']].copy()  # keep a copy of the datetime column
    df = df.mask(df.loc[:, df.columns != 'datetime'] < 0, float(1e-13))
    df['datetime'] = df1['datetime']
    print(df)

if __name__ == '__main__':
    main()
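A compact variant, sketched below on fresh demo data: restrict the replacement to numeric columns with select_dtypes, which leaves the datetime column alone and avoids the warning entirely:

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Demo data shaped like the question's: one datetime column plus numeric columns
df = pd.DataFrame({'datetime': pd.date_range('20220101', periods=10)})
df[[f'a{x}' for x in range(3)]] = np.random.uniform(-1, 1, (10, 3))

# Replace negatives in the numeric columns only
num = df.select_dtypes('number')
df[num.columns] = num.mask(num < 0, float(1e-13))
print(df)
```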

Python Pandas merge on row index and column index across 2 dataframes

I am trying to calculate a new dataframe which is dataframe1 divided by dataframe2, where the column names match and the date index matches on the closest date (non-exact match).
idx1 = pd.DatetimeIndex(['2017-01-01','2018-01-01','2019-01-01'])
idx2 = pd.DatetimeIndex(['2017-02-01','2018-03-01','2019-04-01'])
df1 = pd.DataFrame(index = idx1,data = {'XYZ': [10, 20, 30],'ABC': [15, 25, 30]})
df2 = pd.DataFrame(index = idx2,data = {'XYZ': [1, 2, 3],'ABC': [3, 5, 6]})
#looking for some code
#df3 = df1/df2 on matching column and closest matching row
This should produce a dataframe which looks like this
XYZ ABC
2017-01-01 10 5
2018-01-01 10 5
2019-01-01 10 5
You can use an asof merge to match on the closest row. Then we'll group over the columns axis and divide:
df3 = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                    direction='nearest')
#             XYZ_x  ABC_x  XYZ_y  ABC_y
# 2017-01-01     10     15      1      3
# 2018-01-01     20     25      2      5
# 2019-01-01     30     30      3      6
df3 = (df3.groupby(df3.columns.str.split('_').str[0], axis=1)
          .apply(lambda x: x.iloc[:, 0] / x.iloc[:, 1]))
#              ABC   XYZ
# 2017-01-01   5.0  10.0
# 2018-01-01   5.0  10.0
# 2019-01-01   5.0  10.0
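If both frames share the same column names, a shorter route (a sketch that assumes sorted DatetimeIndexes) aligns df2 onto df1's dates with reindex(method='nearest') and divides directly:

```python
import pandas as pd

idx1 = pd.DatetimeIndex(['2017-01-01', '2018-01-01', '2019-01-01'])
idx2 = pd.DatetimeIndex(['2017-02-01', '2018-03-01', '2019-04-01'])
df1 = pd.DataFrame(index=idx1, data={'XYZ': [10, 20, 30], 'ABC': [15, 25, 30]})
df2 = pd.DataFrame(index=idx2, data={'XYZ': [1, 2, 3], 'ABC': [3, 5, 6]})

# Snap df2's rows onto df1's dates (closest match), then divide elementwise
df3 = df1 / df2.reindex(df1.index, method='nearest')
print(df3)
```

Because the columns already align, no suffix handling or column-axis groupby is needed.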

How do I add values to an existing column in Pandas?

I have a Pandas columns as such:
Free-Throw Percentage
0 .371
1 .418
2 .389
3 .355
4 .386
5 .605
And I have a list of values: [.45,.31,.543]
I would like to append these values to the above column such that the final result would be:
Free-Throw Percentage
0 .371
1 .418
2 .389
3 .355
4 .386
5 .605
6 .45
7 .31
8 .543
How can I achieve this?
pd.concat([df, pd.DataFrame({'Free-Throw Percentage': [.45, .31, .543]})], ignore_index=True)
should do the job. (The older df.append did the same thing, but it was deprecated in pandas 1.4 and removed in 2.0; ignore_index=True renumbers the rows 0 through 8 as shown above.)
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(data=zip(np.random.randint(0, 20, 10), np.random.randint(0, 10, 10)), columns=['A', 'B'])
new_vals = [3, 4, 5]
# pd.concat replaces the removed DataFrame.append
df_1 = pd.concat([df_1, pd.DataFrame({'B': new_vals})], sort=False, ignore_index=True)
print(df_1)

Pandas dataframe resample and count events per day

I have a dataframe with time-index. I can resample the data to get (e.g) mean per-day, however I would like also to get the counts per day. Here is a sample:
import datetime
import pandas as pd
import numpy as np

dates = pd.date_range(datetime.datetime(2012, 4, 5, 11, 0),
                      datetime.datetime(2012, 4, 7, 7, 0), freq='5H')
var1 = np.random.sample(dates.size) * 10.0
var2 = np.random.sample(dates.size) * 10.0
df = pd.DataFrame(data={'var1': var1, 'var2': var2}, index=dates)
df1 = df.resample('D').mean()
I'd like to get also a 3rd column 'count' which counts per day:
count
3
5
1
Thank you very much!
Use Resampler.agg and then flatten the MultiIndex in columns:
df1 = df.resample('D').agg({'var1': 'mean', 'var2': ['mean', 'size']})
df1.columns = df1.columns.map('_'.join)
df1 = df1.rename(columns={'var2_size': 'count'})
print(df1)
var1_mean var2_mean count
2012-04-05 3.992166 4.968410 3
2012-04-06 6.843105 6.193568 5
2012-04-07 4.568436 3.135089 1
Alternative solution with Grouper:
df1 = df.groupby(pd.Grouper(freq='D')).agg({'var1': 'mean', 'var2': ['mean', 'size']})
df1.columns = df1.columns.map('_'.join)
df1 = df1.rename(columns={'var2_size': 'count'})
print(df1)
var1_mean var2_mean count
2012-04-05 3.992166 4.968410 3
2012-04-06 6.843105 6.193568 5
2012-04-07 4.568436 3.135089 1
EDIT: A shorter variant computes the means and the sizes separately and joins them:
r = df.resample('D')
df1 = r.mean().add_suffix('_mean').join(r.size().rename('count'))
print(df1)
var1_mean var2_mean count
2012-04-05 7.840487 6.885030 3
2012-04-06 4.762477 5.091455 5
2012-04-07 2.702414 6.046200 1
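The same result can also be spelled with named aggregation (a sketch on data built like the question's; the seed is arbitrary), which avoids flattening a column MultiIndex by hand:

```python
import datetime
import numpy as np
import pandas as pd

np.random.seed(1)
dates = pd.date_range(datetime.datetime(2012, 4, 5, 11, 0),
                      datetime.datetime(2012, 4, 7, 7, 0), freq='5h')
df = pd.DataFrame({'var1': np.random.sample(dates.size) * 10.0,
                   'var2': np.random.sample(dates.size) * 10.0}, index=dates)

# Named aggregation: output_name=(input column, aggregation function)
df1 = df.resample('D').agg(var1_mean=('var1', 'mean'),
                           var2_mean=('var2', 'mean'),
                           count=('var2', 'size'))
print(df1)
```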
