I am trying to create new columns where each row holds the value of the previous row (the day before).
My data is formatted like this (the original file has 12 columns plus the timestamp, and thousands of rows):
import numpy as np
import pandas as pd
df = pd.DataFrame({"Timestamp": ['1993-11-01', '1993-11-02', '1993-11-03', '1993-11-04', '1993-11-15'],
                   "Austria": [6.11, 6.18, 6.17, 6.17, 6.40],
                   "Belgium": [7.01, 7.05, 7.2, 7.5, 7.6],
                   "France": [7.69, 7.61, 7.67, 7.91, 8.61]},
                  index=[1, 2, 3, 4, 5])
What I have:
Timestamp Austria Belgium France
1 1993-11-01 6.11 7.01 7.69
2 1993-11-02 6.18 7.05 7.61
3 1993-11-03 6.17 7.20 7.67
4 1993-11-04 6.17 7.50 7.91
5 1993-11-15 6.40 7.60 8.61
What I want:
Timestamp Austria t-1 Belgium t-1 France t-1
1 1993-11-01 NaN NaN NaN
2 1993-11-02 6.11 7.01 7.69
3 1993-11-03 6.18 7.05 7.61
4 1993-11-04 6.17 7.20 7.67
5 1993-11-15 6.17 7.50 7.91
It's easy in Excel, but I cannot find a way to do it in Python. Surely there is a way. Does anyone know how to do it?
Use shift on the columns to compute the previous day's values:
cols = ["Austria", "Belgium", "France"]
df[cols] = df[cols].shift()
print(df)
Output
Timestamp Austria Belgium France
1 1993-11-01 NaN NaN NaN
2 1993-11-02 6.11 7.01 7.69
3 1993-11-03 6.18 7.05 7.61
4 1993-11-04 6.17 7.20 7.67
5 1993-11-15 6.17 7.50 7.91
As an alternative:
df.iloc[:, 1:] = df.iloc[:, 1:].shift()
print(df)
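If you would rather keep the original columns and add the lagged copies under the t-1 names from the question, a minimal sketch:
lagged = df[["Austria", "Belgium", "France"]].shift().add_suffix(" t-1")
df = df.join(lagged)  # join on the shared index; the originals stay intact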
First use df.set_index on the Timestamp column, then df.shift:
In [4400]: d = df.set_index('Timestamp').shift()
In [4403]: d.columns = [i + ' t-1' for i in d.columns]
In [4406]: d.reset_index(inplace=True)
In [4407]: d
Out[4407]:
Timestamp Austria t-1 Belgium t-1 France t-1
0 1993-11-01 NaN NaN NaN
1 1993-11-02 6.11 7.01 7.69
2 1993-11-03 6.18 7.05 7.61
3 1993-11-04 6.17 7.20 7.67
4 1993-11-15 6.17 7.50 7.91
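One caveat about both answers: shift() is positional, so the 1993-11-15 row receives the value from 1993-11-04 even though that is not the previous calendar day. If you need a true one-day lag, a sketch using a DatetimeIndex (assuming that is the behaviour you want):
s = df.set_index(pd.to_datetime(df['Timestamp'])).drop(columns='Timestamp')
lagged = s.shift(freq='D')        # relabel each row's values to the next day
lagged = lagged.reindex(s.index)  # dates with no previous day become NaN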
I have a dataframe like so:
index date symbol stock_id open high low close volume vwap
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539
... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256
[227779 rows x 10 columns]
And then a function that returns a boolean mask indicating whether or not the df is consolidating inside a range:
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
Which I then call like:
df['t'] = df.groupby("stock_id").apply(is_consolidating)
The problem is that when I print the df, I get NaN for all the values of my new column:
dan#danalgo:~/Documents/code/wolfhound$ python3 add_indicators_daily.py
index date symbol stock_id open high low close volume vwap t
0 0 2021-10-11 BVN 13 7.69 7.98 7.5600 7.61 879710 7.782174 NaN
1 1 2021-10-12 BVN 13 7.67 8.08 7.5803 8.02 794436 7.967061 NaN
2 2 2021-10-13 BVN 13 8.12 8.36 8.0900 8.16 716012 8.231286 NaN
3 3 2021-10-14 BVN 13 8.26 8.29 8.0500 8.28 586091 8.185899 NaN
4 4 2021-10-15 BVN 13 8.18 8.44 8.0600 8.44 1278409 8.284539 NaN
... ... ... ... ... ... ... ... ... ... ... ...
227774 227774 2022-10-04 ERIC 11000 6.27 6.32 6.2400 6.29 14655189 6.280157 NaN
227775 227775 2022-10-05 ERIC 11000 6.17 6.31 6.1500 6.29 10569193 6.219965 NaN
227776 227776 2022-10-06 ERIC 11000 6.20 6.25 6.1800 6.22 7918812 6.217198 NaN
227777 227777 2022-10-07 ERIC 11000 6.17 6.19 6.0800 6.10 9671252 6.135976 NaN
227778 227778 2022-10-10 ERIC 11000 6.13 6.15 6.0200 6.04 6310661 6.066256 NaN
[227779 rows x 11 columns]
Full code:
import pandas as pd
from IPython.display import display
import sqlite3 as sql
import numpy as np
conn = sql.connect('allStockData.db')
# get everything inside daily_ohlc and add to a dataframe
df = pd.read_sql_query("SELECT * from daily_ohlc_init", conn)
def is_consolidating(df, window=2, minp=2, percentage=0.95):
    rolling_min = pd.Series(df['close']).rolling(window=window, min_periods=minp).min()
    rolling_max = pd.Series(df['close']).rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    return consolidation
df['t'] = df.groupby("stock_id").apply(is_consolidating)
print(df)
df.to_sql('daily_ohlc_init_with_indicators', if_exists='replace', con=conn, index=True)
You could do it like this. The NaNs appear because the original apply returns a bare NumPy array per group, which pandas cannot align back to the original row index; returning a Series indexed like the group fixes the alignment:
def is_consolidating(grp, window=2, minp=2, percentage=0.95):
    # grp is already the 'close' Series for one stock_id, so no pd.Series() wrapper is needed
    rolling_min = grp.rolling(window=window, min_periods=minp).min()
    rolling_max = grp.rolling(window=window, min_periods=minp).max()
    consolidation = np.where((rolling_min / rolling_max) >= percentage, True, False)
    # indexing the result like the group lets pandas align it back to df
    return pd.Series(consolidation, index=grp.index)
df['t'] = df.groupby("stock_id")['close'].apply(is_consolidating)
print(df)
Output (part of it):
volume vwap t
0 879710 7.782174 False
1 794436 7.967061 False
2 716012 8.231286 True
3 586091 8.185899 True
4 1278409 8.284539 True
227774 14655189 6.280157 False
227775 10569193 6.219965 True
227776 7918812 6.217198 True
227777 9671252 6.135976 True
227778 6310661 6.066256 True
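As a side note, the same mask can be computed without a custom function by using transform per group; a sketch assuming the same window and threshold:
g = df.groupby('stock_id')['close']
rolling_min = g.transform(lambda s: s.rolling(2, min_periods=2).min())
rolling_max = g.transform(lambda s: s.rolling(2, min_periods=2).max())
df['t'] = (rolling_min / rolling_max) >= 0.95  # NaN comparisons evaluate to False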
I have Excel files with multiple sheets, each of which looks a little like this (but much longer):
Sample CD4 CD8
Day 1 8311 17.3 6.44
8312 13.6 3.50
8321 19.8 5.88
8322 13.5 4.09
Day 2 8311 16.0 4.92
8312 5.67 2.28
8321 13.0 4.34
8322 10.6 1.95
The first column is actually four cells merged vertically.
When I read this using pandas.read_excel, I get a DataFrame that looks like this:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
How can I either get Pandas to understand merged cells, or quickly and easily remove the NaN and group by the appropriate value? (One approach would be to reset the index, step through to find the values and replace NaNs with values, pass in the list of days, then set the index to the column. But it seems like there should be a simpler approach.)
You could use the Series.fillna method to forward-fill the NaN values:
df.index = pd.Series(df.index).fillna(method='ffill')
For example,
In [42]: df
Out[42]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
NaN 8312 13.60 3.50
NaN 8321 19.80 5.88
NaN 8322 13.50 4.09
Day 2 8311 16.00 4.92
NaN 8312 5.67 2.28
NaN 8321 13.00 4.34
NaN 8322 10.60 1.95
[8 rows x 3 columns]
In [43]: df.index = pd.Series(df.index).fillna(method='ffill')
In [44]: df
Out[44]:
Sample CD4 CD8
Day 1 8311 17.30 6.44
Day 1 8312 13.60 3.50
Day 1 8321 19.80 5.88
Day 1 8322 13.50 4.09
Day 2 8311 16.00 4.92
Day 2 8312 5.67 2.28
Day 2 8321 13.00 4.34
Day 2 8322 10.60 1.95
[8 rows x 3 columns]
df = df.fillna(method='ffill', axis=0)  # resolved by forward-filling the missing row entries
(In newer pandas, fillna(method='ffill') is deprecated; df.ffill() is the equivalent.)
To casually come back 8 years later, pandas.read_excel() can solve this internally for you with the index_col parameter.
df = pd.read_excel('path_to_file.xlsx', index_col=[0])
Passing index_col as a list makes pandas look for a MultiIndex. When the list has length one, pandas creates a regular Index and fills in the merged-cell values for you.
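Once the day labels are filled in (by either route), grouping on the index works as usual; for example, per-day means of the measurement columns from the question:
df.groupby(level=0)[['CD4', 'CD8']].mean()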
I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs at every row that is all NaN.
I explored the following links but could not figure out how to apply it to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would get 4 dataframes with 5, 5, 1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby.
First we check whether all the values in a row are NaN, then use cumsum to create a group indicator, and finally we collect these dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [df.dropna() for _, df in df.groupby(grps)]
for df in dfs:
    print(df)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
Something like this should do the trick:
import pandas as pd
import numpy as np
data_frame = pd.DataFrame({"a": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "b": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "c": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "d": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "e": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "f": [1, np.nan, 3, np.nan, 4, np.nan, 5]})
all_nan = data_frame.index[data_frame.isnull().all(axis=1)]
df_list = []
prev = 0
for i in all_nan:
    df_list.append(data_frame[prev:i])
    prev = i + 1
df_list.append(data_frame[prev:])  # keep the rows after the last all-NaN row too
for i in df_list:
    print(i)
Just another flavor of doing the same thing. np.split cuts at the all-NaN rows, so every chunk after the first starts with its NaN row, and dropna removes it:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [df.dropna() for df in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
I have this dataframe; please note the last column ("Yr_Mo_Dy") on the right
In[38]: data.head()
Out[38]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy
0 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 61-1-1
1 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 61-1-2
2 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 61-1-3
3 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 61-1-4
4 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 61-1-5
The type of the "Yr_Mo_Dy" column is object while the others are float64.
I simply want to change the order of the columns so that "Yr_Mo_Dy" is the first column in the dataframe.
I tried the following but I get a TypeError. What's wrong?
In[39]: cols = data.columns.tolist()
In[40]: cols
Out[40]:
['RPT',
'VAL',
'ROS',
'KIL',
'SHA',
'BIR',
'DUB',
'CLA',
'MUL',
'CLO',
'BEL',
'MAL',
'Yr_Mo_Dy']
In[41]: cols = cols[-1] + cols[:-1]
TypeError Traceback (most recent call last)
<ipython-input-59-c0130d1863e8> in <module>()
----> 1 cols = cols[-1] + cols[:-1]
TypeError: must be str, not list
You need to add : to get a one-element list, because you need to concatenate two lists:
#string
print (cols[-1])
Yr_Mo_Dy
#one element list
print (cols[-1:])
['Yr_Mo_Dy']
cols = cols[-1:] + cols[:-1]
Alternatively you can wrap the element in [], but it is less readable:
cols = [cols[-1]] + cols[:-1]
print (cols)
['Yr_Mo_Dy', 'RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR',
'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL']
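Then select the columns in the new order:
data = data[cols]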
Option 1
Use pd.DataFrame.insert and pd.DataFrame.pop to alter the dataframe in place. This is a very general solution, since you can pop from or insert at any column position.
c = df.columns[-1]
df.insert(0, c, df.pop(c))
df
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
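The same pattern moves any column to any position; for instance, a sketch assuming you wanted 'BEL' in front instead:
df.insert(0, 'BEL', df.pop('BEL'))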
Option 2
pd.DataFrame.reindex with np.roll (reindex_axis was deprecated and has since been removed from pandas, so use reindex instead):
df.reindex(columns=np.roll(df.columns, 1))
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83