How to get rid of multilevel index after using pivot table pandas?

How to get rid of multilevel index after using pivot table pandas? - python

I had following data frame (the real data frame is much more larger than this one ) :
sale_user_id sale_product_id count
1 1 1
1 8 1
1 52 1
1 312 5
1 315 1
Then reshaped it to move the values in sale_product_id as column headers using the following code:
reshaped_df=id_product_count.pivot(index='sale_user_id',columns='sale_product_id',values='count')
and the resulting data frame is:
sale_product_id -1057 1 2 3 4 5 6 8 9 10 ... 98 980 981 982 983 984 985 986 987 99
sale_user_id
1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
as you can see we have a multililevel index , what i need is to have sale_user_is in the first column without multilevel indexing:
i take the following approach :
reshaped_df.reset_index()
the the result would be like this i still have the sale_product_id column , but i do not need it anymore:
sale_product_id sale_user_id -1057 1 2 3 4 5 6 8 9 ... 98 980 981 982 983 984 985 986 987 99
0 1 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 3 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 4 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN
i can subset this data frame to get rid of sale_product_id but i don't think it would be efficient.I am looking for an efficient way to get rid of multilevel indexing while reshaping the original data frame

You need remove only index name, use rename_axis (new in pandas 0.18.0):
print (reshaped_df)
sale_product_id 1 8 52 312 315
sale_user_id
1 1 1 1 5 1
print (reshaped_df.index.name)
sale_user_id
print (reshaped_df.rename_axis(None))
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
Another solution working in pandas below 0.18.0:
reshaped_df.index.name = None
print (reshaped_df)
sale_product_id 1 8 52 312 315
1 1 1 1 5 1
If need remove columns name also:
print (reshaped_df.columns.name)
sale_product_id
print (reshaped_df.rename_axis(None).rename_axis(None, axis=1))
1 8 52 312 315
1 1 1 1 5 1
Another solution:
reshaped_df.columns.name = None
reshaped_df.index.name = None
print (reshaped_df)
1 8 52 312 315
1 1 1 1 5 1
EDIT by comment:
You need reset_index with parameter drop=True:
reshaped_df = reshaped_df.reset_index(drop=True)
print (reshaped_df)
sale_product_id 1 8 52 312 315
0 1 1 1 5 1
#if need reset index nad remove column name
reshaped_df = reshaped_df.reset_index(drop=True).rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
0 1 1 1 5 1
Of if need remove only column name:
reshaped_df = reshaped_df.rename_axis(None, axis=1)
print (reshaped_df)
1 8 52 312 315
sale_user_id
1 1 1 1 5 1
Edit1:
So if need create new column from index and remove columns names:
reshaped_df = reshaped_df.rename_axis(None, axis=1).reset_index()
print (reshaped_df)
sale_user_id 1 8 52 312 315
0 1 1 1 1 5 1

Make a DataFrame
import random
d = {'Country': ['Afghanistan','Albania','Algeria','Andorra','Angola']*2,
'Year': [2005]*5 + [2006]*5, 'Value': random.sample(range(1,20),10)}
df = pd.DataFrame(data=d)
df:
Country Year Value
1 Afghanistan 2005 6
2 Albania 2005 13
3 Algeria 2005 10
4 Andorra 2005 11
5 Angola 2005 5
6 Afghanistan 2006 3
7 Albania 2006 2
8 Algeria 2006 7
9 Andorra 2006 3
10 Angola 2006 6
Pivot
table = df.pivot(index='Country',columns='Year',values='Value')
Table:
Year Country 2005 2006
0 Afghanistan 16 9
1 Albania 17 19
2 Algeria 11 7
3 Andorra 5 12
4 Angola 6 18
I want 'Year' to be 'index':
clean_tbl = table.rename_axis(None, axis=1).reset_index(drop=True)
clean_tbl:
Country 2005 2006
0 Afghanistan 16 9
1 Albania 17 19
2 Algeria 11 7
3 Andorra 5 12
4 Angola 6 18
Done!

You can also use a to_flat_index method of MultiIndex object to convert it into a list of tuples, which you can then concatenate with list comprehension and use it to overwrite the .columns attribute of your dataframe.
# create a dataframe
df = pd.DataFrame({"a": [1, 2, 3, 1], "b": ["x", "x", "y", "y"], "c": [0.1, 0.2, 0.1, 0.2]})
a b c
0 1 x 0.1
1 2 x 0.2
2 3 y 0.1
3 1 y 0.2
# pivot the dataframe
df_pivoted = df.pivot(index="a", columns="b")
c
b x y
a
1 0.1 0.2
2 0.2 NaN
3 NaN 0.1
Now let's overwrite the .columns attribute and .reset_index():
df_pivoted.columns = ["_".join(tup) for tup in df_pivoted.columns.to_flat_index()]
df_pivoted.reset_index()
a c_x c_y
0 1 0.1 0.2
1 2 0.2 NaN
2 3 NaN 0.1

We need to reset_index() to reset the index columns back into the dataframe, then rename_axis() to rename the index to None and the columns to their axis=1 (column headers) values.
reshaped_df = reshaped_df.reset_index().rename_axis(None, axis=1)

Pivot from long to wide format using pivot:
import pandas
df = pandas.DataFrame({
"lev1": [1, 1, 1, 2, 2, 2],
"lev2": [1, 1, 2, 1, 1, 2],
"lev3": [1, 2, 1, 2, 1, 2],
"lev4": [1, 2, 3, 4, 5, 6],
"values": [0, 1, 2, 3, 4, 5]})
df_wide = df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
df_wide
# lev2 1 2
# lev3 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0
Rename the (sometimes confusing) axis names
df_wide.rename_axis(columns=[None, None])
# 1 2
# 1 2 1 2
# lev1
# 1 0.0 1.0 2.0 NaN
# 2 4.0 3.0 NaN 5.0

The way it works for me is
df_cross=pd.DataFrame(pd.crosstab(df[c1], df[c2]).to_dict()).reset_index()

Related

Joining or merging multiple columns within one dataframe and keeping all data

I have this dataframe:
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],\
'Position2':[4,2,7], 'Count2':[15,35,75],\
'Position3':[3,5,6], 'Count3':[45,95,105]})
print(df)
Position1 Count1 Position2 Count2 Position3 Count3
0 1 55 4 15 3 45
1 2 35 2 35 5 95
2 3 45 7 75 6 105
I want to join the Position columns into one column named "Positions" while sorting the data in the Counts columns like so:
Positions Count1 Count2 Count3
0 1 55 Nan Nan
1 2 35 35 Nan
2 3 45 NaN 45
3 4 NaN 15 Nan
4 5 NaN NaN 95
5 6 Nan NaN 105
6 7 Nan 75 NaN
I've tried melting the dataframe, combining and merging columns but I am a bit stuck.
Note that the NaN types can easily be replaced by using df.fillna to get a dataframe like so:
df = df.fillna(0)
Positions Count1 Count2 Count3
0 1 55 0 0
1 2 35 35 0
2 3 45 0 45
3 4 0 15 0
4 5 0 0 95
5 6 0 0 105
6 7 0 75 0

Here is a way to do what you've asked:
df = df[['Position1', 'Count1']].rename(columns={'Position1':'Positions'}).join(
df[['Position2', 'Count2']].set_index('Position2'), on='Positions', how='outer').join(
df[['Position3', 'Count3']].set_index('Position3'), on='Positions', how='outer').sort_values(
by=['Positions']).reset_index(drop=True)
Output:
Positions Count1 Count2 Count3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
Explanation:
Use join first on Position1, Count1 and Position2, Count2 (with Position1 renamed as Positions) then on that join result and Position3, Count3.
Sort by Positions and use reset_index to create a new integer range index (ascending with no gaps).

Does this achieve what you are after?
import pandas as pd
df = pd.DataFrame({'Position1':[1,2,3], 'Count1':[55,35,45],\
'Position2':[4,2,7], 'Count2':[15,35,75],\
'Position3':[3,5,6], 'Count3':[45,95,105]})
df1, df2, df3 = df.iloc[:,:2], df.iloc[:, 2:4], df.iloc[:, 4:6]
df1.columns, df2.columns, df3.columns = ['Positions', 'Count1'], ['Positions', 'Count2'], ['Positions', 'Count3']
df1.merge(df2, on='Positions', how='outer').merge(df3, on='Positions', how='outer').sort_values('Positions')
Output:

wide_to_long unpivots the DF from Long to wide and that is what's used here.
columns names are also renamed here, with this edit
df['id'] = df.index
df2=pd.wide_to_long(df, stubnames=['Position','Count'], i='id', j='pos').reset_index()
df2=df2.pivot(index=['id','Position'], columns='pos', values='Count').reset_index().fillna(0).add_prefix('count_')
df2.rename(columns={'count_id': 'id', 'count_Position' :'Position'}, inplace=True)
df2
RESULT:
pos id Position 1 2 3
0 0 1 55.0 0.0 0.0
1 0 3 0.0 0.0 45.0
2 0 4 0.0 15.0 0.0
3 1 2 35.0 35.0 0.0
4 1 5 0.0 0.0 95.0
5 2 3 45.0 0.0 0.0
6 2 6 0.0 0.0 105.0
7 2 7 0.0 75.0 0.0
PS: I'm unable to format the output, I'll appreciate if someone guide me here. Thanks!

One option is to flip to long form with pivot_longer before flipping back to wide form with pivot_wider from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
index = None,
names_to = ('.value', 'num'),
names_pattern = r"(.+)(\d+)")
.pivot_wider(index = 'Position', names_from = 'num')
)
Position Count_1 Count_2 Count_3
0 1 55.0 NaN NaN
1 2 35.0 35.0 NaN
2 3 45.0 NaN 45.0
3 4 NaN 15.0 NaN
4 5 NaN NaN 95.0
5 6 NaN NaN 105.0
6 7 NaN 75.0 NaN
In the pivot_longer section, the .value determines which part of the column names remain as column headers - in this case it is is Position and Count.

Getting corresponding values in a groupby

I have a dataset similar to this
Serial A B
1 12
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100
2 32 242
2 3
3 2
3 23 100
3
3 23
I group the dataframe based on Serial and find the maximum value of each A column by df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values and retain the first value by df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
Serial A B A_MAX B_corresponding
1 12 31 203
1 31
1
1 12
1 31 203
1 10
1 2
2 32 100 32 100
2 32 242
2 3
3 2 23 100
3 23 100
3
3 23
Now for the B_corresponding column, I would like to get the corresponding B values of the A_MAX. I thought of locating the A_MAX values in A but there are similar max A values per group. Additional condition, for example in Serial 2 I would also prefer to get the smallest B values between the 32

Idea is use DataFrame.sort_values for maximal values per groups, then remove missing values by DataFrame.dropna and get first rows by Serial by DataFrame.drop_duplicates. Create Series by DataFrame.set_index and last use Series.map:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated())
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated())
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31.0 203.0
1 1 31.0 NaN NaN NaN
2 1 NaN NaN NaN NaN
3 1 12.0 NaN NaN NaN
4 1 31.0 203.0 NaN NaN
5 1 10.0 NaN NaN NaN
6 1 2.0 NaN NaN NaN
7 2 32.0 100.0 32.0 100.0
8 2 32.0 242.0 NaN NaN
9 2 3.0 NaN NaN NaN
10 3 2.0 NaN 23.0 100.0
11 3 23.0 100.0 NaN NaN
12 3 NaN NaN NaN NaN
13 3 23.0 NaN NaN NaN
Converting missing values to empty strings is possible, but get mixed values - numeric and strings, so next processing should be problematic:
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
s = (df.sort_values(['Serial','A'], ascending=[True, False])
.dropna(subset=['B'])
.drop_duplicates('Serial')
.set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print (df)
Serial A B A_MAX B_corresponding
0 1 12.0 NaN 31 203
1 1 31.0 NaN
2 1 NaN NaN
3 1 12.0 NaN
4 1 31.0 203.0
5 1 10.0 NaN
6 1 2.0 NaN
7 2 32.0 100.0 32 100
8 2 32.0 242.0
9 2 3.0 NaN
10 3 2.0 NaN 23 100
11 3 23.0 100.0
12 3 NaN NaN
13 3 23.0 NaN

You could also use dictionaries to achieve the same if you are not so inclined to only use pandas.
a_to_b_mapping = df.groupby('A')['B'].min().to_dict()
series_to_a_mapping = df.groupby('Series')['A'].max().to_dict()
agg_df = {}
for series, a in series_to_a_mapping.items():
agg_df.append((series, a, a_to_b_mapping.get(a, None)))
agg_df = pd.DataFrame(agg_df, columns=['Series', 'A_max', 'B_corresponding'])
agg_df.head()
Series A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0
If you want, you could join this to original dataframe and mask duplicates.
dft = df.join(final_df.set_index('Serial'), on='Serial', how='left')
dft['A_max'] = dft['A_max'].mask(dft['A_max'].duplicated(), '')
dft['B_corresponding'] = dft['B_corresponding'].mask(dft['B_corresponding'].duplicated(), '')
dft

Reshape dataframe from vertical to horizontal based on multiple columns in Pandas

Given a toy data as follows:
id group_name v1 v2
0 C45C6DA8-0721-40F3-B5CE-CA72DE102707 a_13 110 70
1 74D067B1-819B-4E9A-A1A7-2CD2E70577A9 a_0 118 76
2 65376D7B-8816-4FA0-9A2D-401D15808F92 b_39 130 80
3 CABB6BFA-98A8-417F-B765-D9C2C69511FC a_15 125 75
4 43D115F4-AA1F-4241-9AE0-2947986D9ED0 a_13 130 75
I need to groupby id and group_name, then reshape dataframe from vertical to horizontal, please note that new columns name are renamed based on the value after _ in the group_name column.
id group_name v1_0 v2_0 v1_13 v2_13 v1_15 v2_15 v1_39 v2_39
0 C45C6DA8-0721-40F3-B5CE-CA72DE102707 a_13 NaN NaN 110.0 70.0 NaN NaN NaN NaN
1 74D067B1-819B-4E9A-A1A7-2CD2E70577A9 a_0 118.0 76.0 NaN NaN NaN NaN NaN NaN
2 65376D7B-8816-4FA0-9A2D-401D15808F92 b_39 NaN NaN NaN NaN NaN NaN 130.0 80.0
3 CABB6BFA-98A8-417F-B765-D9C2C69511FC a_15 NaN NaN NaN NaN 125.0 75.0 NaN NaN
4 43D115F4-AA1F-4241-9AE0-2947986D9ED0 a_13 NaN NaN 130.0 75.0 NaN NaN NaN NaN
How could I do that in Pandas?

Looks like this can be done with pivot table with some addtional work:
out = (df.pivot_table(index = ['id', 'group_name'],
columns = df['group_name'].str.split('_').str[1])
# columns = df['group_name'].str.extract('(\d+)',expand=False)
.sort_index(level = 1, axis = 1))
out.columns = out.columns.map('{0[0]}_{0[1]}'.format)
out = out.reset_index()
display(out)

In my opinion, you should you pandas.melt function, as described in documentation. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html
df.columns = [list('ABC'), list('DEF')]
df
A B C
D E F
0 a 1 2
1 b 3 4
2 c 5 6
pd.melt(df, col_level=0, id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
pd.melt(df, id_vars=[('A', 'D')], value_vars=[('B', 'E')])
(A, D) variable_0 variable_1 value
0 a B E 1
1 b B E 3
2 c B E 5

Pandas assign series to a column based on index

I have 2 dataframes:
DF1:
Count
0 98.0
1 176.0
2 260.5
3 389.0
I have to assign these values to a column in another dataframe for every 3rd row starting from 3rd row.
The Output of DF2 should look like this:
Count
0
1
2 98.0
3
4
5 176.0
6
7
8 260.5
9
10
11 389.0
I am doing
DF2.loc[2::3,'Count'] = DF1['Count']
But, I am not getting the expected results.

Use values
Ohterwise, Pandas tries to align the index values from DF1 and that messes you up.
DF2.loc[2::3, 'Count'] = DF1['Count'].values
DF2
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0
New From DF1
DF1.set_index(DF1.index * 3 + 2).reindex(range(len(DF1) * 3))
Count
0 NaN
1 NaN
2 98.0
3 NaN
4 NaN
5 176.0
6 NaN
7 NaN
8 260.5
9 NaN
10 NaN
11 389.0

Python Pandas - turn absolute periods into relative periods

I have a dataframe that I want to use to calculate rolling sums relative to an event date. The event date is different for each column and is represented by the latest date in which there is a value in each column.
Here is a toy example:
rng = pd.date_range('1/1/2011', periods=8, freq='D')
df = pd.DataFrame({
'1' : [56, 2, 3, 4, 5, None, None, None],
'2' : [51, 2, 3, 4, 5, 6, None, None],
'3' : [51, 2, 3, 4, 5, 6, 0, None]}, index = rng)
pd.rolling_sum(df,3)
The dataframe it produces looks like this:
1 2 3
2011-01-01 NaN NaN NaN
2011-01-02 NaN NaN NaN
2011-01-03 61 56 56
2011-01-04 9 9 9
2011-01-05 12 12 12
2011-01-06 NaN 15 15
2011-01-07 NaN NaN 11
2011-01-08 NaN NaN NaN
I now want to align the last event dates on the final row of the dataframe and set the index to 0 with each preceding row index -1,-2,-3 and so on. The periods no longer being absolute but relative to the event date.
The desired dataframe would look like this:
1 2 3
-7.00 NaN NaN NaN
-6.00 NaN NaN NaN
-5.00 NaN NaN NaN
-4.00 NaN NaN 56
-3.00 NaN 56 9
-2.00 61 9 12
-1.00 9 12 15
0.00 12 15 11
Thanks for any guidance.

I don't see any easy ways to do this. The following will work, but a bit messy.
In [37]: def f(x):
....: y = x.dropna()
....: return Series(y.values,x.index[len(x)-len(y):])
....:
In [40]: roller = pd.rolling_sum(df,3).reset_index(drop=True)
In [41]: roller
Out[41]:
1 2 3
0 NaN NaN NaN
1 NaN NaN NaN
2 61 56 56
3 9 9 9
4 12 12 12
5 NaN 15 15
6 NaN NaN 11
7 NaN NaN NaN
[8 rows x 3 columns]
In [43]: roller.apply(f).reindex_like(roller)
Out[43]:
1 2 3
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 56
4 NaN 56 9
5 61 9 12
6 9 12 15
7 12 15 11
[8 rows x 3 columns]
In [44]: result = roller.apply(f).reindex_like(roller)
In [49]: result.index = result.index.values-len(result.index)+1
In [50]: result
Out[50]:
1 2 3
-7 NaN NaN NaN
-6 NaN NaN NaN
-5 NaN NaN NaN
-4 NaN NaN 56
-3 NaN 56 9
-2 61 9 12
-1 9 12 15
0 12 15 11
[8 rows x 3 columns]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.