Create Pivot table for each column in Pandas df - python

I have a DataFrame where I have many columns (there is one dependent variable and many independent variables)
variable_id  dep_var  variable_1  variable_2
new          1        6           3
new          0        3           6
new          0        8           7
new          1        11          1
new          0        17          9
new          1        1           2
I want to create a Pivot table such as this:
pd.pivot_table(df,index=['variable_1'], columns=['dep_var'], values=['variable_id'],aggfunc='count')
I want to create it for each column separately (so I need to change index in pd.pivot_table).
I have written some sample code:
def pivot_table(df):
    df_columns = list(df)
    for column in df_columns:
        print("indexing by: ", column)
        print(pd.pivot_table(df,index=[column], columns=['dep_var'], values=['variable_id'],aggfunc='count'))
but I want each result to be saved as a pandas DataFrame.
Desired output: one table like the above for each variable separately.

Use:
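def pivot_table(df):
    # collect one pivot table per column
    dfs = []
    for column in df:
        # dep_var is the value being aggregated, so it cannot also be the index
        if column == 'dep_var':
            continue
        print("indexing by: ", column)
        # use a new name here: reassigning df would break the next iteration
        table = pd.pivot_table(df,index=[column], values=['dep_var'])
        dfs.append(table)
    return dfs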
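If the count table from the question is what is needed for each independent variable, a dictionary keyed by column name may be easier to work with than a list. The following is only a sketch under that assumption; the function name pivot_tables_by_column and its dep_var and id_col parameters are made up for illustration, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "variable_id": ["new"] * 6,
    "dep_var": [1, 0, 0, 1, 0, 1],
    "variable_1": [6, 3, 8, 11, 17, 1],
    "variable_2": [3, 6, 7, 1, 9, 2],
})


def pivot_tables_by_column(df, dep_var="dep_var", id_col="variable_id"):
    """Return {column name: count pivot table} for every independent variable.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    tables = {}
    for column in df.columns:
        # Skip the columns that are already used as `columns` and `values`
        if column in (dep_var, id_col):
            continue
        tables[column] = pd.pivot_table(
            df,
            index=[column],
            columns=[dep_var],
            values=[id_col],
            aggfunc="count",
        )
    return tables


tables = pivot_tables_by_column(df)
for name, table in tables.items():
    print("indexing by:", name)
    print(table)
```

Each value in tables is an ordinary DataFrame, so tables["variable_1"] can be saved or processed further like any other frame.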

Related

Put level of dataframe index at the same level of columns on a Multi-Index Dataframe

Context: I'd like to "bump" the index level of a multi-index dataframe up. In other words, I'd like to put the index level of a dataframe at the same level as the columns of a multi-indexed dataframe
Let's say we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.index.name = 'Index Column'
And we perform this change to add a multi-index level (like a label of a table)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
Which results in this:
             Multi-Index Table Label
                                   A  B  C
Index Column
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Desired Output: How can I make it so that the dataframe looks like this instead (notice the removal of the empty level on the dataframe/table):
             Multi-Index Table Label
Index Column                       A  B  C
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Attempts: I was testing something out and you can essentially remove the index level by doing this:
tt.index.name = None
Which would result in :
  Multi-Index Table Label
                        A  B  C
0                       1  4  7
1                       2  5  8
2                       3  6  9
Essentially removing that extra level/empty line, but the thing is that I do want to keep the Index Column as it will give information about the type of data present on the index (which in this example are just 0,1,2 but can be years, dates, etc).
How could I do that?
Thank you all in advance :)
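How about this:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
# copy the index into an ordinary first column so it sits on the same level as A, B, C
tt.insert(loc=0, column='Index Column', value=tt.index)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
# hide the original index when rendering; this returns a Styler, not a DataFrame
# (Styler.hide_index() was removed in pandas 2.0; older versions use tt.style.hide_index())
tt = tt.style.hide(axis='index')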
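If a plain DataFrame is needed rather than a Styler, one possible variant is to move the named index into the columns with reset_index before adding the label level, and to drop the index only when printing. This is a sketch of that idea, not a definitive solution; the variable names flat and labeled are illustrative.

```python
import pandas as pd

tt = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
tt.index.name = 'Index Column'

# Turn the named index into a regular column called 'Index Column'
flat = tt.reset_index()

# Add the table label as an extra column level above all columns
labeled = pd.concat([flat], keys=['Multi-Index Table Label'], axis=1)

# The result is still a DataFrame; the default integer index is
# only left out of the printed text
print(labeled.to_string(index=False))
```

Here 'Index Column' ends up on the same column level as A, B and C, at the cost of no longer being the index of the frame.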

Update dataframe 1 using two columns in dataframe 2 in python

I want to update the Freq column in df1 using the Freq column in df2, as shown below.
data = {'Cell':[1,2,3,4,'10-05','10-09'], 'Freq':[True, True,True,True,True,True]}
df1 = pd.DataFrame(data)
Dataframe 1
Dataframe 2
data2 = {'Cell-1':[1,1,1,1,1,1,2,2,2,2,2,2],'Cell-2':[1,2,3,4,'10-05','10-09',1,2,3,4,'10-05','10-09'] ,'Freq':[True, False,True,False,True,True,True, False,True,False,True,False]}
df2 = pd.DataFrame(data2)
Dataframe 2
In df1, the first column (Cell) holds the keys and the second column (Freq) holds the corresponding value, which in this case is either True or False.
Take key = 1 in Dataframe 1 as an example. This key has multiple rows in Dataframe 2, as shown in the figure, one for each value in the second column (Cell-2) of Dataframe 2. Those Cell-2 values are in turn keys of Dataframe 1, and they point to the Freq entries in df1 that I want to update.
Algorithm in action figure

Stick the columns based on the one columns keeping ids

I have a DataFrame with 100 columns (however I provide only three columns here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in one column. For example, for id=1 I want to put 1 and -1 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my problem, since I also want to keep the ids.
Note 2: I have already tried stack and reset_index, and that does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5
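For comparison, melt can also keep the ids when the id column is passed as id_vars. The sketch below assumes the same three-column frame as in the question; the stable sort is only there to bring the rows back into per-id order.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})

out = (
    df.melt(id_vars='id', value_name='c')  # one row per (id, original column)
      .drop(columns='variable')            # drop the original column names
      .sort_values('id', kind='stable')    # group rows by id, keep c1 before c2
      .reset_index(drop=True)
)
print(out)
```

With 100 columns the call stays the same, because melt treats every column not listed in id_vars as a value column.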

How to re-index as multi-index pandas dataframe from index value that repeats

I have a pandas dataframe whose index contains repeated values. I want to re-index it as a multi-index where the repeated index values are grouped.
The index looks like this:
so I would like all the 112335586 index values to be grouped under the same index entry.
I have looked at the question "Create pandas dataframe by repeating one row with new multiindex", but there the index values are pre-defined, which is not possible here because my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10 1
10 2
20 3
20 4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()
s.index = [s.index, s2]
print (s)
EVENT_ID
10        0    1
          1    2
20        0    3
          1    4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
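The same idea can be written without inplace operations by numbering the rows inside each EVENT_ID group and appending that counter as a second index level. This is a sketch on a small made-up frame; the column name value and the level name sub_idx are illustrative.

```python
import pandas as pd

# Small stand-in for the real data: EVENT_ID repeats in the index
df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.Index([10, 10, 20, 20], name='EVENT_ID'),
)

# Running counter within each repeated index value: 0, 1, 0, 1
sub_idx = df.groupby(level='EVENT_ID').cumcount().rename('sub_idx')

# Append the counter as a second index level, leaving df itself unchanged
out = df.set_index(sub_idx, append=True)
print(out)
```

Nothing is hard-coded here, so the approach does not depend on knowing the index values in advance.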

Sort a column containing string in Pandas

I am new to Pandas, and looking to sort a column containing strings and generate a numerical value to uniquely identify the string. My data frame looks something like this:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
First I would like to sort the 'year_week' column in ascending order (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10) and then generate a numerical value for each unique 'year_week' string.
You can first convert the year_week column with to_datetime, then sort by it with sort_values, and finally use factorize:
df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})
#http://stackoverflow.com/a/17087427/2901002
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
#sort by column date
df.sort_values('date', inplace=True)
#create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print (df)
   key year_week       date  num
1    1    2015_1 2015-01-11    0
0    0   2015_10 2015-03-15    1
2    2   2015_11 2015-03-22    2
5    5    2016_3 2016-01-24    3
3    3    2016_9 2016-03-06    4
6    6    2016_9 2016-03-06    4
4    4   2016_10 2016-03-13    5
7    7   2016_10 2016-03-13    5
## 1st method: this applies to large datasets
## Split the "year_week" column into 2 columns
df[['year', 'week']] = df['year_week'].str.split("_", expand=True)
## Change the datatype of the newly created columns
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')
## Sort the dataframe by the newly created columns
df = df.sort_values(['year','week'], ascending=True)
## Drop the year and week columns
df.drop(['year','week'], axis=1, inplace=True)
## Sorted dataframe
df
## 2nd method: this applies to small datasets
## Change the datatype of the column
df['year_week'] = df['year_week'].astype('str')
## List the categories in the order you want
cats = ['2015_1', '2015_10','2015_11','2016_3','2016_9', '2016_10']
## Use pd.Categorical() to make the column an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
## Sort by the 'year_week' column
df = df.sort_values('year_week')
## Sorted dataframe
df
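Neither method above produces the numerical identifier the question asks for. One way to get it, sketched here under the assumption that every value has the form year_week with integer parts, is to sort on the two integer parts and then call factorize on the sorted strings. The names parts, order and df_sorted are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Split into integer year (column 0) and week (column 1) so that
# week 9 sorts before week 10, unlike a plain string sort
parts = df['year_week'].str.split('_', expand=True).astype(int)

# Row labels ordered by year first, then week
order = parts.sort_values([0, 1]).index

df_sorted = df.loc[order].copy()

# factorize numbers the unique strings in order of first appearance,
# which after sorting is chronological order
df_sorted['num'] = pd.factorize(df_sorted['year_week'])[0]
print(df_sorted)
```

This avoids both the temporary year and week columns in df and the hand-written category list.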
