I have a dataframe that I want to use to calculate rolling sums relative to an event date. The event date is different for each column and is given by the latest date on which that column still has a value.
Here is a toy example:
import pandas as pd

rng = pd.date_range('1/1/2011', periods=8, freq='D')
df = pd.DataFrame({
    '1': [56, 2, 3, 4, 5, None, None, None],
    '2': [51, 2, 3, 4, 5, 6, None, None],
    '3': [51, 2, 3, 4, 5, 6, 0, None]}, index=rng)
df.rolling(3).sum()
The dataframe it produces looks like this:
1 2 3
2011-01-01 NaN NaN NaN
2011-01-02 NaN NaN NaN
2011-01-03 61 56 56
2011-01-04 9 9 9
2011-01-05 12 12 12
2011-01-06 NaN 15 15
2011-01-07 NaN NaN 11
2011-01-08 NaN NaN NaN
I now want to align the last event dates on the final row of the dataframe and set the index to 0, with each preceding row indexed -1, -2, -3 and so on, so that the periods are no longer absolute but relative to each column's event date.
The desired dataframe would look like this:
1 2 3
-7.00 NaN NaN NaN
-6.00 NaN NaN NaN
-5.00 NaN NaN NaN
-4.00 NaN NaN 56
-3.00 NaN 56 9
-2.00 61 9 12
-1.00 9 12 15
0.00 12 15 11
Thanks for any guidance.
I don't see any easy way to do this. The following will work, but it's a bit messy.
In [37]: def f(x):
   ....:     y = x.dropna()
   ....:     return pd.Series(y.values, index=x.index[len(x) - len(y):])
   ....:
In [40]: roller = df.rolling(3).sum().reset_index(drop=True)
In [41]: roller
Out[41]:
1 2 3
0 NaN NaN NaN
1 NaN NaN NaN
2 61 56 56
3 9 9 9
4 12 12 12
5 NaN 15 15
6 NaN NaN 11
7 NaN NaN NaN
[8 rows x 3 columns]
In [43]: roller.apply(f).reindex_like(roller)
Out[43]:
1 2 3
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 56
4 NaN 56 9
5 61 9 12
6 9 12 15
7 12 15 11
[8 rows x 3 columns]
In [44]: result = roller.apply(f).reindex_like(roller)
In [49]: result.index = result.index.values-len(result.index)+1
In [50]: result
Out[50]:
1 2 3
-7 NaN NaN NaN
-6 NaN NaN NaN
-5 NaN NaN NaN
-4 NaN NaN 56
-3 NaN 56 9
-2 61 9 12
-1 9 12 15
0 12 15 11
[8 rows x 3 columns]
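For what it's worth, on modern pandas (0.18+) pd.rolling_sum is gone, and the whole thing can be done with a per-column shift: count the rows after each column's last valid value and shift the column down by that amount, so the event dates all land on the final row. A minimal sketch, assuming every column has at least one valid value:

import pandas as pd

def align_last_valid(col):
    # rows after the column's last valid value = amount to shift down
    n = len(col) - 1 - col.index.get_loc(col.last_valid_index())
    return col.shift(n)

rolled = df.rolling(3).sum()
result = rolled.apply(align_last_valid)
result.index = range(-len(result) + 1, 1)   # ..., -2, -1, 0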
Related
Say I have just 2 columns in pandas.
Column 1 has all numerical values and column 2 has values only at every 16th position (so column 2 has a value at index 0 followed by 15 NaNs, then a value at index 16 followed by 15 NaNs, and so on).
How do I create a new column that, wherever column 2 is not null, contains that row's value and the next 15 values of column 1 as a list ([value1, value2, ..., value16])?
Can someone suggest a time-efficient solution?
Here is the pandas code to reproduce the sample data (A is column 1, B is column 2, and C holds the expected output):
import pandas as pd

df = pd.DataFrame({
    'A': range(1, 33),
    'B': ['xyz'] + [None] * 15 + ['abc'] + [None] * 15,
    'C': [list(range(1, 17))] + [None] * 15 + [list(range(17, 33))] + [None] * 15,
})
Use a boolean mask:
m = df['B'].notna()
df.loc[m, 'C'] = df.groupby(m.cumsum())['A'].agg(list).values
print(df)
# Output
    A    B                                                  C
0 1 xyz [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
5 6 NaN NaN
6 7 NaN NaN
7 8 NaN NaN
8 9 NaN NaN
9 10 NaN NaN
10 11 NaN NaN
11 12 NaN NaN
12 13 NaN NaN
13 14 NaN NaN
14 15 NaN NaN
15 16 NaN NaN
16 17 abc [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2...
17 18 NaN NaN
18 19 NaN NaN
19 20 NaN NaN
20 21 NaN NaN
21 22 NaN NaN
22 23 NaN NaN
23 24 NaN NaN
24 25 NaN NaN
25 26 NaN NaN
26 27 NaN NaN
27 28 NaN NaN
28 29 NaN NaN
29 30 NaN NaN
30 31 NaN NaN
31 32 NaN NaN
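Why this works: m.cumsum() increments at each non-null B, so every block of 16 rows shares one group label, and agg(list) then collects each block's A values. A quick sanity check on the sample data:

m = df['B'].notna()
print(m.cumsum().unique())   # [1 2] -> two groups of 16 rows each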
I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe by Serial and find the maximum value of column A per group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then retain only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now, for the B_corresponding column, I would like to get the B values corresponding to A_MAX. I thought of locating the A_MAX values in A, but there are duplicate maximal A values within a group. As an additional condition: in Serial 2, for example, I would prefer the smallest of the B values tied at A = 32.
The idea is to use DataFrame.sort_values to put the maximal values first per group, then remove missing values with DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates. Create a Series with DataFrame.set_index and finally use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
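One caveat: the sort above is stable on the original row order, so among rows tied at the maximal A it keeps whichever non-null B happens to come first. If the smallest B among ties must be guaranteed, a small variation of the same Series also sorts by B ascending:

s = (df.sort_values(['Serial', 'A', 'B'], ascending=[True, False, True])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])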
Converting missing values to empty strings is possible, but it produces mixed numeric and string values, so further processing is likely to be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN
You could also use dictionaries to achieve the same result, if you are not inclined to use only pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
serial_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_df = []
for serial, a in serial_to_a_mapping.items():
    agg_df.append((serial, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Serial', 'A_max', 'B_corresponding'])
agg_df.head()
Serial A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this back to the original dataframe and mask the duplicates per Serial.
dft = df.join(agg_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['Serial'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['Serial'].duplicated(), '')
dft
This question already has answers here:
How do I operate on a DataFrame with a Series for every column?
(3 answers)
Closed 4 years ago.
I have two dataframes with identical indexes. I want to perform a subtract operation, i.e. subtract the single df2 column from all the columns in df1, where df2 has only one column.
Input:
df1
col1 col2 col3
0 10 34 6
1 3 23 123
2 23 45 23
3 5 1 5
4 1 45 6
5 65 6 88
df2
base
0 12
1 43
2 435
3 76
4 23
5 12
I tried:
df1-df2['base']
Result:
0 1 2 3 4 5 col1 col2 col3
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN
But I expected:
col1 col2 col3
0 -2 22 -6
1 -40 -20 80
2 -412 -390 -412
3 -71 -75 -71
4 -22 22 -17
5 53 -6 76
Why am I getting NaN, and how are the two dataframes being combined?
How do I get the expected result?
Plain df1 - df2['base'] aligns df1's column labels against the Series' index labels, which have nothing in common, hence the all-NaN frame over the union of labels. Use DataFrame.subtract with the argument axis=0 to align on the row index instead:
df1.subtract(df2['base'], axis=0)
[out]
col1 col2 col3
0 -2 22 -6
1 -40 -20 80
2 -412 -390 -412
3 -71 -75 -71
4 -22 22 -17
5 53 -6 76
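Two equivalent sketches, for reference: sub is an alias of subtract, and converting the column to a NumPy array sidesteps pandas alignment entirely (at the cost of losing the index check):

df1.sub(df2['base'], axis=0)
df1 - df2['base'].to_numpy()[:, None]   # broadcast a (6, 1) array across the columns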
I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe at every 3rd row, starting from the 3rd row.
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But I am not getting the expected results.
Use values.
Otherwise, pandas tries to align on the index values from DF1, and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
Alternatively, build the result anew from DF1:
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
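The index arithmetic is what does the placement there; a quick check, independent of the data:

import pandas as pd

DF1 = pd.DataFrame({'Count': [98.0, 176.0, 260.5, 389.0]})
print((DF1.index * 3 + 2).tolist())   # [2, 5, 8, 11] -> every 3rd row from row 2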
I had the following data frame (the real data frame is much larger than this one):
sale_user_id sale_product_id count
1 1 1
1 8 1
1 52 1
1 312 5
1 315 1
Then I reshaped it to move the values of sale_product_id into the column headers, using the following code:
reshaped_df=id_product_count.pivot(index='sale_user_id',columns='sale_product_id',values='count')
and the resulting data frame is:
sale_product_id -1057 1 2 3 4 5 6 8 9 10 ... 98 980 981 982 983 984 985 986 987 99
sale_user_id
1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
As you can see, we have what looks like a multilevel index. What I need is to have sale_user_id as the first column, without the multilevel indexing.
I take the following approach:
reshaped_df.reset_index()
Then the result looks like this; I still have the sale_product_id columns name, but I do not need it anymore:
sale_product_id sale_user_id -1057 1 2 3 4 5 6 8 9 ... 98 980 981 982 983 984 985 986 987 99
0 1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I can subset this data frame to get rid of sale_product_id, but I don't think that would be efficient. I am looking for an efficient way to get rid of the multilevel indexing while reshaping the original data frame.
You need to remove only the index name; use rename_axis (new in pandas 0.18.0):
print (reshaped_df)
sale_product_id 1 8 52 312 315
sale_user_id
1 1 1 1 5 1
print (reshaped_df.index.name)
sale_user_id
print (reshaped_df.rename_axis(None))
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
Another solution, working in pandas below 0.18.0:
reshaped_df.index.name = None
print (reshaped_df)
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
If you need to remove the columns name as well:
print (reshaped_df.columns.name)
sale_product_id
print (reshaped_df.rename_axis(None).rename_axis(None, axis=1))
1 8 52 312 315
1 1 1 1 5 1
Another solution:
reshaped_df.columns.name = None
reshaped_df.index.name = None
print (reshaped_df)
1 8 52 312 315
1 1 1 1 5 1
EDIT (by comment):
You need reset_index with the parameter drop=True:
reshaped_df = reshaped_df.reset_index(drop=True)
print (reshaped_df)
sale_product_id 1 8 52 312 315
0 1 1 1 5 1
# if you need to reset the index and remove the columns name
reshaped_df = reshaped_df.reset_index(drop=True).rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
0 1 1 1 5 1
Or, if you need to remove only the columns name:
reshaped_df = reshaped_df.rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
sale_user_id
1 1 1 1 5 1
Edit 1:
And if you need to create a new column from the index and remove the columns name:
reshaped_df = reshaped_df.rename_axis(None, axis=1).reset_index()
print (reshaped_df)
sale_user_id 1 8 52 312 315
0 1 1 1 1 5 1
Make a DataFrame
import random
import pandas as pd

d = {'Country': ['Afghanistan','Albania','Algeria','Andorra','Angola']*2,
     'Year': [2005]*5 + [2006]*5, 'Value': random.sample(range(1,20),10)}
df = pd.DataFrame(data=d)
df:
Country Year Value
0 Afghanistan 2005 6
1 Albania 2005 13
2 Algeria 2005 10
3 Andorra 2005 11
4 Angola 2005 5
5 Afghanistan 2006 3
6 Albania 2006 2
7 Algeria 2006 7
8 Andorra 2006 3
9 Angola 2006 6
Pivot
table = df.pivot(index='Country',columns='Year',values='Value')
Table:
Year         2005  2006
Country
Afghanistan    16     9
Albania        17    19
Algeria        11     7
Andorra         5    12
Angola          6    18
I want to drop the 'Year' columns name and turn 'Country' back into a regular column:
clean_tbl = table.rename_axis(None, axis=1).reset_index()
clean_tbl:
Country 2005 2006
0 Afghanistan 16 9
1 Albania 17 19
2 Algeria 11 7
3 Andorra 5 12
4 Angola 6 18
Done!
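Side note: Value comes from random.sample, so the numbers change on every run (which is why the two printouts above don't agree); seed the generator if you want a reproducible example:

import random
random.seed(0)   # any fixed seed makes df and table consistent across runs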
You can also use the to_flat_index method of the MultiIndex object to convert it into a list of tuples, which you can then concatenate with a list comprehension and use to overwrite the .columns attribute of your dataframe.
import pandas as pd

# create a dataframe
df = pd.DataFrame({"a": [1, 2, 3, 1], "b": ["x", "x", "y", "y"], "c": [0.1, 0.2, 0.1, 0.2]})
a b c
0 1 x 0.1
1 2 x 0.2
2 3 y 0.1
3 1 y 0.2
# pivot the dataframe
df_pivoted = df.pivot(index="a", columns="b")
c
b x y
a
1 0.1 0.2
2 0.2 NaN
3 NaN 0.1
Now let's overwrite the .columns attribute and .reset_index():
df_pivoted.columns = ["_".join(tup) for tup in df_pivoted.columns.to_flat_index()]
df_pivoted.reset_index()
a c_x c_y
0 1 0.1 0.2
1 2 0.2 NaN
2 3 NaN 0.1
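If some of the levels are not strings (the year columns from the pivot examples above, say), "_".join will raise a TypeError; mapping each tuple through str first is a safe variation:

df_pivoted.columns = ["_".join(map(str, tup)) for tup in df_pivoted.columns.to_flat_index()]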
We need reset_index() to move the index back into the dataframe as a column, then rename_axis(None, axis=1) to clear the leftover columns name:
reshaped_df = reshaped_df.reset_index().rename_axis(None, axis=1)
Pivot from long to wide format using pivot:
import pandas
df = pandas.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
df_wide = df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
df_wide
# lev2 1 2
# lev3 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
Rename the (sometimes confusing) axis names
df_wide.rename_axis(columns=[None, None])
# 1 2
# 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
The way that works for me is:
df_cross=pd.DataFrame(pd.crosstab(df[c1], df[c2]).to_dict()).reset_index()
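That dict round-trip mostly just strips the axis names that crosstab sets; assuming c1 and c2 are the same placeholder column names as above, a more direct variant is:

df_cross = pd.crosstab(df[c1], df[c2]).rename_axis(None, axis=1).reset_index()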