I want to create a pivot with average values over multiple tables. Here is an example of what I want to create: the inputs are df1 and df2, and res is the result I want to calculate from them.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"2000": ["A", "A", "B"],
                    "2001": ["A", "B", "B"],
                    "2002": ["B", "B", "B"]},
                   index=['Item1', 'Item2', 'Item3'])
df2 = pd.DataFrame({"2000": [0.5, 0.7, 0.1],
                    "2001": [0.6, 0.6, 0.3],
                    "2002": [0.7, 0.4, 0.2]},
                   index=['Item1', 'Item2', 'Item3'])
display(df1)
display(df2)
res = pd.DataFrame({"2000": [0.6, 0.1],
                    "2001": [0.6, 0.45],
                    "2002": [np.nan, 0.43]},
                   index=['A', 'B'])
display(res)
Both DataFrames have years as columns and one item per row. The items change state over time; the state per year is given in df1, and the value per year is given in df2. I want to calculate the average value per year for each state group, A and B.
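For example, in 2000 Item1 and Item2 are in state A, so the A average is (0.5 + 0.7) / 2 = 0.6, while only Item3 is in state B, giving 0.1; in 2002 no item is in state A, hence the NaN in res.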
I did not manage to calculate res; any suggestions?
To solve this, first merge both DataFrames into one. For example, you can use the code below to convert each DataFrame from wide to long, merge the two on the (year, item) index, and finally reset the index so those levels can be used as columns in the pivot:
df_full = pd.concat([df1.unstack(), df2.unstack()], axis=1).reset_index()
Then, if you want, you can rename the columns to build a clearer pivot:
df_full = df_full.rename(columns={'level_0': 'year', 'level_1': 'item', 0: 'DF1', 1:'DF2'})
And finally build a pivot table.
res_out = pd.pivot_table(data=df_full, index='DF1', columns='year', values='DF2', aggfunc='mean')
It's not a one-line solution, but it works. The complete code:
df_full = pd.concat([df1.unstack(), df2.unstack()], axis=1).reset_index()
df_full = df_full.rename(columns={'level_0': 'year', 'level_1': 'item', 0: 'DF1', 1:'DF2'})
res_out = pd.pivot_table(data=df_full, index='DF1', columns='year', values='DF2', aggfunc='mean')
display(res_out)
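On the sample data this reproduces res (0.6, 0.6 and NaN for state A; 0.1, 0.45 and roughly 0.43 for state B), only with the index named DF1 and the columns named year by pivot_table.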
This code using stack, join and unstack should work:
df1_long = df1.stack().to_frame().rename({0:'category'}, axis=1)
df2_long = df2.stack().to_frame().rename({0:'values'}, axis=1)
joined_data = df1_long.join(df2_long).reset_index().rename({'level_0':'item','level_1':'year'}, axis=1)
res = joined_data.groupby(['category', 'year'])['values'].mean().unstack()  # select 'values' so only that column is averaged and the result keeps flat year columns
display(res)
I have two DataFrames, df and df_copy. I would like to copy the data from df_copy, but only where the data is identical. How do I do that?
import pandas as pd
d = {'Nameid': [100, 200, 300, 100],
     'Name': ['Max', 'Michael', 'Susan', 'Max'],
     'Projectid': [100, 200, 200, 100]}
df = pd.DataFrame(data=d)
display(df.head(5))
df['nameid_index'] = df['Nameid'].astype('category').cat.codes
df['projectid_index'] = df['Projectid'].astype('category').cat.codes
display(df.head(5))
df_copy = df.copy()
df.drop(['Nameid', 'Name', 'Projectid'], axis=1, inplace=True)
df = df.drop([1, 3])
display(df.head(5))
df
df_copy
What I want is shown as a table in the original post.
I looked at Pandas Merging 101 and tried:
df.merge(df_copy, on=['nameid_index', 'projectid_index'])
But in the result the same row appears twice, and I only want it once.
Use DataFrame.drop_duplicates first:
df1 = (df.drop_duplicates(['nameid_index', 'projectid_index'])
         .merge(df_copy, on=['nameid_index', 'projectid_index']))
If you need to merge on the intersection of column names in both DataFrames, the on parameter should be removed:
df1 = df.drop_duplicates(['nameid_index', 'projectid_index']).merge(df_copy)
I'm trying to combine 2 pandas DataFrames and rank the combined DataFrame again; after that, I want to extract the row with the highest score (ranking 1) as the return value. I tried the code shown below, but it returns the row at the first index instead of the first rank.
df = pd.DataFrame({'Name': ['Peter', 'James', 'John', 'Marry'], 'Score': [6.1, 5.6, 6.8, 7.99],
                   'ranking': [3, 4, 2, 1]})
df1 = pd.DataFrame({'Name': ['Albert', 'Kelsey', 'Janice'], 'Score': [1.1, 8.2, 7],
                    'ranking': [3, 1, 2]})
pd_combine = pd.concat([df,df1], sort=False)
pd_combine.iloc[pd_combine['Score'].idxmax()]
Just change the third line to this:
pd_combine = pd.concat([df,df1], sort=False).reset_index()
The problem was that you had duplicated indices. This change ensures that every row has its own unique index in the new, combined DataFrame.
If you want to re-rank the entries in the new, combined DataFrame, use:
res = pd_combine.sort_values('Score', ascending=False)
res['ranking'] = range(1, len(res) + 1)
Instead of sort=False, specify ignore_index=True
pd_combine = pd.concat([df,df1], ignore_index=True)
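With either fix in place, the lookup from the question then picks the top-scoring row as intended. A minimal sketch; the .loc line is my illustration, not part of either answer:
pd_combine = pd.concat([df, df1], ignore_index=True)
# the index is now a unique 0..n-1 range, so idxmax points at exactly one row
top_row = pd_combine.loc[pd_combine['Score'].idxmax()]
print(top_row)  # Kelsey, Score 8.2, ranking 1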
I have 2 Excel CSV files, as below:
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
In df1 I can see that there is one extra transaction, SC-003_Homepage, which is not in df2. Can someone help me find only the transactions that are missing from df2?
So far I have done the following to get the transactions:
merged_df = pd.merge(df1, df2, on = 'Transaction_Name', suffixes=('_df1', '_df2'), how='inner')
Maybe a simple set difference will do the job:
set(df1['Transaction_Name']) - set(df2['Transaction_Name'])
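If you need the missing rows back as a DataFrame rather than as a set of names, a boolean mask with isin should also work; this isin variant is my addition, not part of the answer above:
missing_in_df2 = df1[~df1['Transaction_Name'].isin(df2['Transaction_Name'])]
print(missing_in_df2)  # only the SC-003_Homepage row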
Add a merge indicator column and then filter the missing data based on it; see the example below.
For more information, see the merge documentation.
import pandas as pd
df1 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-003_Homepage', 'SC-001_Signinlink'], 'Count': [1, 0, 2, 1]}
df1 = pd.DataFrame(df1, columns=df1.keys())
df2 = {'Transaction_Name':['SC-001_Homepage', 'SC-002_Homepage', 'SC-001_Signinlink', 'SC-002_Signinlink'], 'Count': [2, 1, 2, 1]}
df2 = pd.DataFrame(df2, columns=df2.keys())
#create a merged df
merge_df = df1.merge(df2, on='Transaction_Name', how='outer', suffixes=['', '_'], indicator=True)
#filter rows which are missing in df2
missing_df2_rows = merge_df[merge_df['_merge'] =='left_only'][df1.columns]
#filter rows which are missing in df1
missing_df1_rows = merge_df[merge_df['_merge'] =='right_only'][df2.columns]
print(missing_df2_rows)
print(missing_df1_rows)
Output:
Count Transaction_Name
2 2.0 SC-003_Homepage
Count Transaction_Name
4 NaN SC-002_Signinlink
I have a dataframe, which consists of summary statistics of another dataframe:
df = sample[['Place','Lifeexp']]
df = df.groupby('Place').agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values([('Lifeexp', 'count')], ascending=False)
Looking at the structure, the DataFrame has a MultiIndex in its columns, which makes plot creation difficult:
df.columns
MultiIndex(levels=[['Lifeexp', 'Place'], ['count', 'mean', 'max', 'min', '']],
labels=[[1, 0, 0, 0, 0], [4, 0, 1, 2, 3]])
I tried the solutions from several other questions here (e.g. this one), but somehow I don't get the desired result. I want df to have Place, count, mean, max, min as column names and to drop Lifeexp, so that I can create simple plots, e.g. df.plot.bar(x='Place', y='count').
I think the solution is to simplify: select the column right after groupby to prevent a MultiIndex in the columns:
df = df.groupby('Place')['Lifeexp'].agg(['count','mean', 'max','min']).reset_index()
df = df.sort_values('count', ascending=False)
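A minimal, self-contained sketch, assuming some made-up sample data (the question does not show the original sample DataFrame):
import pandas as pd
# hypothetical data standing in for the real 'sample'
sample = pd.DataFrame({'Place': ['A', 'A', 'B'], 'Lifeexp': [70.0, 80.0, 75.0]})
df = sample.groupby('Place')['Lifeexp'].agg(['count', 'mean', 'max', 'min']).reset_index()
df = df.sort_values('count', ascending=False)
print(df.columns.tolist())  # ['Place', 'count', 'mean', 'max', 'min'], flat columns, no MultiIndex
df.plot.bar(x='Place', y='count')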
Given two dataframes:
df1 = pd.DataFrame([
    ['Red', 'Blu', 1.1],
    ['Yel', 'Blu', 2.1],
    ['Grn', 'Grn', 3.1]], columns=['col_1a', 'col_1b', 'score_1'])
df2 = pd.DataFrame([
    ['Blu', 'Red', 1.2],
    ['Yel', 'Blu', 2.2],
    ['Vio', 'Vio', 3.2]], columns=['col_2a', 'col_2b', 'score_2'])
I want to merge them on two columns like below:
df3 = pd.DataFrame([
    ['Blu', 'Red', 1.1, 1.2],
    ['Yel', 'Blu', 2.1, 2.2],
], columns=['col_a', 'col_b', 'score_1', 'score_2'])
Caveat 1: The contents of the two columns can be swapped between the DataFrames to merge. The first row, for example, should be merged because it contains both "Red" and "Blu", even though they appear in different columns.
Caveat 2: The order of columns in the final df3 is unimportant. Whether "Blu" is in col_a or col_b doesn't mean anything.
Caveat 3: Anything else that doesn't match, like the last row of each input, is ignored.
You can sort the first two columns within each row, then merge on them:
# rename column names
cols = ['col_a', 'col_b']
df1.columns = cols + ['score_1']
df2.columns = cols + ['score_2']
# sort the two id columns within each row (pd.np is deprecated, so use numpy directly)
import numpy as np
df1[cols] = np.sort(df1[cols], axis=1)
df2[cols] = np.sort(df2[cols], axis=1)
# merge
df1.merge(df2)
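On the sample data this merge should return two rows, (Blu, Red, 1.1, 1.2) and (Blu, Yel, 2.1, 2.2), which matches the desired df3 up to which value ends up in col_a versus col_b.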