I have the following df:
        sales2001  sales2002  sales2003  sales2004
200012      19.12       0.98        NaN        NaN
200101      19.1        0.98        2.3        NaN
200102      21          0.97        0.8        NaN
...
200112      19.12       0.99        2.4        NaN
200201       0.98       2.5         NaN        NaN
200202       0.97       0.8         1.2        NaN
I would like to shift the contents in order to align them in a time-gap view, as follows:
        sales+1y  sales+2y
200012     19.12      0.98
200101      0.98      2.3
200102      0.97      0.8
...
200112      0.99      2.4
200201      0.98      2.5
200202      0.8       1.2
Basically, I want to align the forecasted data points at a fixed time gap from the index.
I tried iterrows, dynamically calling the columns given the index, but cannot make it work. Do you have any suggestions?
Use justify with DataFrame.dropna and axis=1 to remove all columns with at least one NaN:
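Note that justify is not a pandas built-in; it is a small NumPy helper (widely shared on Stack Overflow) that pushes each row's valid values to one side. A minimal sketch of it, in case it needs defining:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    #mask of valid (non-invalid) entries
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    #sorting the boolean mask pushes True values to one end
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out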
df1 = (pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right'), index=df.index)
.dropna(axis=1))
If you need to select the last columns by position:
df1 = pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right')[:, -2:], index=df.index)
Or:
df1 = (pd.DataFrame(justify(df.values, invalid_val=np.nan, side='right'), index=df.index)
.iloc[:, -2:])
df1.columns = [f'sales+{i+1}y' for i in range(len(df1.columns))]
print (df1)
sales+1y sales+2y
200012 19.12 0.98
200101 0.98 2.30
200102 0.97 0.80
200112 0.99 2.40
200201 0.98 2.50
200202 0.80 1.20
Another option is to use pd.wide_to_long and pivot:
# here I assume the index name is index
new_df = pd.wide_to_long(df.reset_index(), 'sales', i='index', j='sale_end').reset_index()
# if index is datetime, then use dt.year
new_df['periods'] = new_df['sale_end'] - new_df['index']//100
# pivot
new_df.dropna().pivot(index='index',columns='periods', values='sales')
output:
periods -1 0 1 2
index
200012 NaN NaN 19.12 0.98
200101 NaN 19.10 0.98 2.30
200102 NaN 21.00 0.97 0.80
200112 NaN 19.12 0.99 2.40
200201 0.98 2.50 NaN NaN
200202 0.97 0.80 1.20 NaN
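If a strict +1y/+2y horizon relative to the index year is wanted, the pivoted frame can then be trimmed and renamed (a sketch; note this differs from the justify answer for rows such as 200201, whose values sit at periods -1 and 0 and would come out as NaN here):
res = (new_df.dropna()
             .pivot(index='index', columns='periods', values='sales')
             [[1, 2]]
             .rename(columns={1: 'sales+1y', 2: 'sales+2y'}))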
I have a pandas data frame like this
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
df
01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
probes
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NGS_38 0.52 0.52 0.49 0.51 0.52 0.45
This is actually a minimal reproducible example. The real data frame is formed by many paired columns like 01_PLAGL1 and 02_PLAGL1 in the example, then two sets of three columns like H19 H19 H19 in the example, and two unique columns. With this explanation and the columns of my real dataset below, I think you will understand the input data of my problem.
data_no_control.columns.values
array(['PLAGL1', 'PLAGL1', 'GRB10', 'GRB10', 'MEST', 'MEST', 'H19', 'H19',
'H19', 'KCNQ1OT1', 'KCNQ1OT1', 'MEG3', 'MEG3', 'MEG8', 'MEG8',
'SNRPN', 'SNRPN', 'PEG3', 'PEG3', 'PEG3', 'NESP55', 'GNAS-AS1',
'GNASXL', 'GNASXL', 'GNAS_A/B'], dtype=object)
The final output I would like to achieve should be like this
01_PLAGL1 H19 GNAS A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
(One empty row)
(Second empty row)
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
(One empty row)
(Second empty row)
NGS_41 ...
I have tried this
df = data_no_control.reset_index(level=0)
empty_rows = 5
df.index = range(0, empty_rows*len(df), empty_rows)
new_df = df.reindex(index=range(empty_rows*len(df)))
new_df = new_df.set_index('index')
new_df
index 01_PLAGL1 02_PLAGL1 H19 H19 H19 GNAS_A/B
NGS_34 0.47 0.55 0.51 0.53 0.54 0.62
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NGS_38  0.52  0.52  0.49  0.51  0.52  0.45
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
NaN     NaN   NaN   NaN   NaN   NaN   NaN
Use:
data = [['NGS_34',0.47,0.55,0.51,0.53,0.54,0.62], ['NGS_38',0.52,0.52,0.49,0.51,0.52,0.45]]
df = pd.DataFrame(data, columns = ['probes','01_PLAGL1', '02_PLAGL1','H19','H19', 'H19','GNAS_A/B'])
df = df.set_index('probes')
#No of new rows
new = 2
#remove the prefix before _ from paired column names
s = df.columns.str.split('_').str[-1].to_series()
#create MultiIndex with a duplicate counter
df.columns = [s, s.groupby(s).cumcount()]
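#e.g. the columns are now ('PLAGL1', 0), ('PLAGL1', 1), ('H19', 0),
#('H19', 1), ('H19', 2), ('A/B', 0)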
#reshape
df = df.stack()
#create MultiIndex to add new rows, keeping the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
np.arange(df.index.levels[1].max() + new + 1)])
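#for this sample: counters 0..4 per probe (max counter 2, plus new=2 blank rows)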
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0 0.47 0.51 0.62
1 0.55 0.53 NaN
2 NaN 0.54 NaN
3 NaN NaN NaN
4 NaN NaN NaN
NGS_38 0 0.52 0.49 0.45
1 0.52 0.51 NaN
2 NaN 0.52 NaN
3 NaN NaN NaN
4 NaN NaN NaN
Last, if you need empty values instead of missing values, and no counter level, use:
df = df.droplevel(1).fillna('')
df.index = df.index.where(~df.index.duplicated(), '')
print (df)
PLAGL1 H19 A/B
probes
NGS_34 0.47 0.51 0.62
0.55 0.53
0.54
NGS_38 0.52 0.49 0.45
0.52 0.51
0.52
EDIT: The real data has no duplicate column names, so the output is different:
d = {'PLAGL1': {'NGS_34': 0.55, 'NGS_38': 0.52}, 'GRB10': {'NGS_34': 0.48, 'NGS_38': 0.49}, 'MEST': {'NGS_34': 0.56, 'NGS_38': 0.5}, 'H19': {'NGS_34': 0.54, 'NGS_38': 0.52}, 'KCNQ1OT1': {'NGS_34': 0.41, 'NGS_38': 0.49}, 'MEG3': {'NGS_34': 0.5, 'NGS_38': 0.55}, 'MEG8': {'NGS_34': 0.46, 'NGS_38': 0.5}, 'SNRPN': {'NGS_34': 0.55, 'NGS_38': 0.46}, 'PEG3': {'NGS_34': 0.51, 'NGS_38': 0.51}, 'NESP55': {'NGS_34': 0.55, 'NGS_38': 0.53}, 'GNAS-AS1': {'NGS_34': 0.52, 'NGS_38': 0.48}, 'GNASXL': {'NGS_34': 0.49, 'NGS_38': 0.44}, 'GNAS A/B': {'NGS_34': 0.62, 'NGS_38': 0.45}}
df = pd.DataFrame(d)
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 NESP55 \
NGS_34 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51 0.55
NGS_38 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51 0.53
GNAS-AS1 GNASXL GNAS A/B
NGS_34 0.52 0.49 0.62
NGS_38 0.48 0.44 0.45
#No of new rows
new = 2
#remove the prefix before _ from paired column names
s = df.columns.str.split('_').str[-1].to_series()
#create MultiIndex with a duplicate counter
df.columns = [s, s.groupby(s).cumcount()]
#reshape
df = df.stack()
#create MultiIndex to add new rows, keeping the original column order
mux = pd.MultiIndex.from_product([df.index.levels[0],
                                  np.arange(df.index.levels[1].max() + new + 1)])
df = df.reindex(index=mux, columns=s.unique())
print (df)
PLAGL1 GRB10 MEST H19 KCNQ1OT1 MEG3 MEG8 SNRPN PEG3 \
NGS_34 0 0.55 0.48 0.56 0.54 0.41 0.50 0.46 0.55 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NGS_38 0 0.52 0.49 0.50 0.52 0.49 0.55 0.50 0.46 0.51
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
NESP55 GNAS-AS1 GNASXL GNAS A/B
NGS_34 0 0.55 0.52 0.49 0.62
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
NGS_38 0 0.53 0.48 0.44 0.45
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
I have a DataFrame with multiple columns, a few of which contain only NaN values below the first row. The dataframe is quite big, having around 5,000 columns. Below is a sample from it:
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 NaN NaN 0.26 0.89 NaN
3 30-Jun-15 NaN NaN NaN 0.90 NaN
4 30-Sep-15 NaN NaN 0.31 0.90 NaN
5 31-Dec-15 NaN NaN 0.41 0.91 NaN
I want to copy the value of column 'EZ19' to all columns where all values for row 2 and below are NaN. I tried the following code and it works:
nan_cols = df.columns[df[2:].isnull().all()].to_list()
for c in nan_cols:
df.loc[2:,c]= df.loc[2:,'EZ19']
But I was thinking there should be a way to assign the value of column 'EZ19' to the target columns without using a loop, and I am surprised that there doesn't seem to be a straightforward way to do this. Other questions here don't seem to handle this exact issue, and I couldn't find a solution that worked for me.
Given the size of my dataframe (and it is expected to grow larger over time) I really want to avoid using a loop in my final code, so any help with this will be greatly appreciated.
If you're interested in replacing the values of columns that are entirely null, you can take a shortcut and simply overwrite everything below row 2 in those columns once you have identified that they are entirely null.
# Identify columns that contain null values from row 2 onwards
all_null_cols = df.loc[2:].isnull().all()
# overwrite row 2 onwards in only our null columns with values from "EZ19"
df.loc[2:, all_null_cols] = df.loc[2:, ["EZ19"]].values
print(df)
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
Not sure if this is what you have in mind:
outcome = df.loc[2:, df.loc[2:].isna().all()].mask(
lambda df: df.isna(), df.loc[2:, "EZ19"], axis=0
)
outcome
ESP FIN PRT
2 0.89 0.89 0.89
3 0.90 0.90 0.90
4 0.90 0.90 0.90
5 0.91 0.91 0.91
df.update(outcome)
df
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
It only fills columns that are completely null from row 2 downwards; USA is not completely null from row 2, which is why it was not altered.
A simple one-liner that replaces every empty value in a row with that row's EZ19 value (note that, unlike the answers above, this also fills the partially-null USA column, and a row-wise apply can be slow on a frame with around 5,000 columns):
df = df.apply(lambda row: row.where(pd.notnull(row), row.EZ19), axis=1)
Output:
GeoCode ESP FIN USA EZ19 PRT
0 Geography Spain Finland USA EZ Portugal
1 31-Mar-15 0.89 0.89 0.26 0.89 0.89
2 30-Jun-15 0.90 0.90 0.90 0.90 0.90
3 30-Sep-15 0.90 0.90 0.31 0.90 0.90
4 31-Dec-15 0.91 0.91 0.41 0.91 0.91
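A vectorized equivalent of that one-liner (my own sketch, not from the original answer) transposes the frame so fillna can align the EZ19 column against it:
#fill each row's NaNs with that row's EZ19 value, no Python-level loop
df = df.T.fillna(df['EZ19']).T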
I am working with a dataframe where every column represents a company. The index is a datetime index with daily frequency. My problem is the following: for each company, I would like to fill a month with NaN if there are fewer than 20 values in that month. In the example below, this would mean that Company_1's entry 0.91 on 2012-08-31 would be changed to NaN, while Company_2 and Company_3 would be unchanged.
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
... ... ... ...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 0.91 0.51 -0.33
Total Values: 1 22 21
I am struggling to find an efficient way to count the number of values for each month of each stock. I could theoretically write a function that creates a new dataframe reporting the number of values for each month (and for each stock) and then use that to mask the original company information, but I am sure there has to be an easier way. Any help is highly appreciated. Thanks in advance.
groupby the dataframe on monthly frequency and transform using count, then create a boolean mask with Series.lt and use it in DataFrame.mask to replace values in sparse months with NaN:
df1 = df.mask(df.groupby(pd.Grouper(freq='M')).transform('count').lt(20))
print(df1)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
...
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
IIUC (note that this counts non-null values over each entire column rather than per month, so it matches the requirement only when the data covers a single month, as in this sample):
df.loc[:, df.apply(lambda d: d.notnull().sum()<20)] = np.NaN
print (df)
Company_1 Company_2 Company_3
2012-08-01 NaN 0.99 0.11
2012-08-02 NaN 0.21 NaN
2012-08-03 NaN 0.32 0.40
2012-08-29 NaN 0.50 -0.36
2012-08-30 NaN 0.48 -0.32
2012-08-31 NaN 0.51 -0.33
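For data that spans several months, the same column-wise rule could be applied within each calendar month; a sketch (the helper name is my own):
import numpy as np
import pandas as pd

def blank_sparse_months(g, min_count=20):
    #columns with fewer than min_count values in this month
    bad = g.columns[g.count() < min_count]
    g = g.copy()
    g[bad] = np.nan
    return g

df = df.groupby(pd.Grouper(freq='M'), group_keys=False).apply(blank_sparse_months)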
I want to calculate a simple momentum signal. The method I am following is: 1-month-lagged cumret divided by 12-month-lagged cumret, minus 1.
date starts at 1/5/14 and ends at 1/5/16. As a 12-month lag is required, the first mom signal has to start 12 months after the start of date; hence the first mom signal starts at 1/5/15.
Here is the data utilized:
import pandas as pd
data = {'date':['1/5/14','1/6/14','1/7/14','1/8/14','1/9/14','1/10/14','1/11/14','1/12/14','1/1/15','1/2/15','1/3/15','1/4/15','1/5/15','1/6/15','1/7/15','1/8/15','1/9/15','1/10/15','1/11/15','1/12/15','1/1/16','1/2/16','1/3/16','1/4/16','1/5/16'],
'id': ['a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a','a' ],
'ret':[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.20, 0.21, 0.22, 0.23, 0.24, 0.25],
'cumret':[1.01,1.03, 1.06,1.1 ,1.15,1.21,1.28, 1.36,1.45,1.55,1.66, 1.78,1.91,2.05,2.2,2.36, 2.53,2.71,2.9,3.1,3.31,3.53, 3.76,4,4.25]}
df = pd.DataFrame(data).set_index(['date', 'id'])
Desired output
ret cumret mom
date id
1/5/14 a .01 1.01
1/6/14 a .02 1.03
1/7/14 a .03 1.06
1/8/14 a .04 1.1
1/9/14 a .05 1.15
1/10/14 a .06 1.21
1/11/14 a .07 1.28
1/12/14 a .08 1.36
1/1/15 a .09 1.45
1/2/15 a .1 1.55
1/3/15 a .11 1.66
1/4/15 a .12 1.78
1/5/15 a .13 1.91 .8
1/6/15 a .14 2.05 .9
1/7/15 a .15 2.2 .9
1/8/15 a .16 2.36 1
1/9/15 a .17 2.53 1.1
1/10/15 a .18 2.71 1.1
1/11/15 a .19 2.9 1.1
1/12/15 a .2 3.1 1.1
1/1/16 a .21 3.31 1.1
1/2/16 a .22 3.53 1.1
1/3/16 a .23 3.76 1.1
1/4/16 a .24 4 1.1
1/5/16 a .25 4.25 1.1
This is the code I tried for calculating mom:
df['mom'] = ((df['cumret'].shift(-1) / (df['cumret'].shift(-12))) - 1).groupby(level = ['id'])
The entire dataset has more ids, e.g. a, b, c; I just included one for this example.
Any help would be awesome! :)
As far as I know, momentum is simply rate of change. Pandas has a built-in method for this:
df['mom'] = df['ret'].pct_change(12) # 12 month change
Also, I am not sure why you are using cumret instead of ret to calculate momentum.
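If you do want the literal formula from the question (1-month-lagged cumret divided by 12-month-lagged cumret, minus 1), a per-id sketch would be:
df['mom'] = (df.groupby(level='id')['cumret']
               .transform(lambda s: s.shift(1) / s.shift(12) - 1))
For 1/5/15 this gives 1.78/1.01 - 1 ≈ 0.76, which matches the .8 in the desired output up to rounding.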
Update: If you have multiple IDs that you need to go through, I'd recommend:
for i in df.index.levels[1]:
temp = df.loc[(slice(None), i), "ret"].pct_change(11)
df.loc[(slice(None), i), "mom"] = temp
# or df.loc[(slice(None), i), "mom"] = df.loc[(slice(None), i), "ret"].pct_change(11) for short
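A loop-free equivalent is to let groupby handle the ids:
df['mom'] = df.groupby(level='id')['ret'].pct_change(11)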
Output:
ret cumret mom
date id
1/5/14 a 0.01 1.01 NaN
1/6/14 a 0.02 1.03 NaN
1/7/14 a 0.03 1.06 NaN
1/8/14 a 0.04 1.10 NaN
1/9/14 a 0.05 1.15 NaN
1/10/14 a 0.06 1.21 NaN
1/11/14 a 0.07 1.28 NaN
1/12/14 a 0.08 1.36 NaN
1/1/15 a 0.09 1.45 NaN
1/2/15 a 0.10 1.55 NaN
1/3/15 a 0.11 1.66 NaN
1/4/15 a 0.12 1.78 11.000000
1/5/15 a 0.13 1.91 5.500000
1/6/15 a 0.14 2.05 3.666667
1/7/15 a 0.15 2.20 2.750000
1/8/15 a 0.16 2.36 2.200000
1/9/15 a 0.17 2.53 1.833333
1/10/15 a 0.18 2.71 1.571429
1/11/15 a 0.19 2.90 1.375000
1/12/15 a 0.20 3.10 1.222222
1/1/16 a 0.21 3.31 1.100000
1/2/16 a 0.22 3.53 1.000000
1/3/16 a 0.23 3.76 0.916667
1/4/16 a 0.24 4.00 0.846154
1/5/16 a 0.25 4.25 0.785714
1/5/14 b 0.01 1.01 NaN
1/6/14 b 0.02 1.03 NaN
1/7/14 b 0.03 1.06 NaN
1/8/14 b 0.04 1.10 NaN
1/9/14 b 0.05 1.15 NaN
1/10/14 b 0.06 1.21 NaN
1/11/14 b 0.07 1.28 NaN
1/12/14 b 0.08 1.36 NaN
1/1/15 b 0.09 1.45 NaN
1/2/15 b 0.10 1.55 NaN
1/3/15 b 0.11 1.66 NaN
1/4/15 b 0.12 1.78 11.000000
1/5/15 b 0.13 1.91 5.500000
1/6/15 b 0.14 2.05 3.666667
1/7/15 b 0.15 2.20 2.750000
1/8/15 b 0.16 2.36 2.200000
1/9/15 b 0.17 2.53 1.833333
1/10/15 b 0.18 2.71 1.571429
1/11/15 b 0.19 2.90 1.375000
1/12/15 b 0.20 3.10 1.222222
1/1/16 b 0.21 3.31 1.100000
1/2/16 b 0.22 3.53 1.000000
1/3/16 b 0.23 3.76 0.916667
1/4/16 b 0.24 4.00 0.846154
1/5/16 b 0.25 4.25 0.785714
I would appreciate it if you could let me know how to apply describe() to calculate summary statistics by group. My data (TrainSet) is like the following, but there are a lot of columns:
Financial Distress x1 x2 x3
0 1.28 0.02 0.87
0 1.27 0.01 0.82
0 1.05 -0.06 0.92
1 1.11 -0.02 0.86
0 1.06 0.11 0.81
0 1.06 0.08 0.88
1 0.87 -0.03 0.79
I want to compute the summary statistics by "Financial Distress" as it is shown below:
count mean std min 25% 50% 75% max
cat index
x1 0 2474 1.4 1.3 0.07 0.95 1.1 1.54 38.1
1 95 0.7 -1.7 0.02 2.9 2.1 1.75 11.2
x2 0 2474 0.9 1.7 0.02 1.9 1.4 1.75 11.2
1 95 .45 1.95 0.07 2.8 1.6 2.94 20.12
x3 0 2474 2.4 1.5 0.07 0.85 1.2 1.3 30.1
1 95 1.9 2.3 0.33 6.1 0.15 1.66 12.3
I wrote the following code but it does not provide the answer in the aforementioned format.
Statistics=pd.concat([TrainSet[TrainSet["Financial Distress"]==0].describe(),TrainSet[TrainSet["Financial Distress"]==1].describe()])
Statistics.to_csv("Descriptive Statistics1.csv")
Thanks in advance.
The result of coldspeed's solution:
Financial Distress count mean std
x1 0 2474 1.398623286 1.320468688
x1 1 95 1.028107053 0.360206966
x10 0 2474 0.143310534 0.136257947
x10 1 95 -0.032919408 0.080409407
x100 0 2474 0.141875505 0.348992946
x100 1 95 0.115789474 0.321669776
You can use DataFrameGroupBy.describe with unstack, but by default this changes the ordering, so restore the original column order with reindex:
print (df)
Financial Distress x1 x2 x10
0 0 1.28 0.02 0.87
1 0 1.27 0.01 0.82
2 0 1.05 -0.06 0.92
3 1 1.11 -0.02 0.86
4 0 1.06 0.11 0.81
5 0 1.06 0.08 0.88
6 1 0.87 -0.03 0.79
df1 = (df.groupby('Financial Distress')
         .describe()      #columns become MultiIndex (variable, statistic)
         .unstack()       #to a Series indexed by (variable, statistic, group)
         .unstack(1)      #statistics back to columns
         .reindex(df.columns[1:], level=0))  #restore original variable order
print (df1)
count mean std min 25% 50% 75% \
Financial Distress
x1 0 5.0 1.144 0.119708 1.05 1.0600 1.060 1.2700
1 2.0 0.990 0.169706 0.87 0.9300 0.990 1.0500
x2 0 5.0 0.032 0.066106 -0.06 0.0100 0.020 0.0800
1 2.0 -0.025 0.007071 -0.03 -0.0275 -0.025 -0.0225
x10 0 5.0 0.860 0.045277 0.81 0.8200 0.870 0.8800
1 2.0 0.825 0.049497 0.79 0.8075 0.825 0.8425
max
Financial Distress
x1 0 1.28
1 1.11
x2 0 0.11
1 -0.02
x10 0 0.92
1 0.86
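The reshaped frame can then be exported as in the question:
df1.to_csv("Descriptive Statistics1.csv")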