How to sum columns from three different dataframes with a common key - python

I am reading in an excel spreadsheet about schools with three sheets as follows.
import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print xl.sheet_names
df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)
df1.columns = [chr(65+i) for i in xrange(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns
The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.
Filter rows for which unique column 'D' ids actually exist in all three dataframes. If the school doesn't appear in all three sheets then I discard it.
For each remaining row (school), add up the values in column 'G' in the three dataframes.
Do the same for column 'K'.
I am new to pandas but how should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.
Attempted solution
df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] + df2['K'] + df3['K']
df1['SumG']= df1['G'] + df2['G'] + df3['G']

After concatenating the dataframes, you can use groupby and count to get a list of values for "D" that exist in all three dataframes since there is only one in each dataframe. You can then use this to filter concatenated dataframe to sum whichever columns you need, e.g.:
df = pd.concat([df1, df2, df3])
criteria = df.D.isin((df.groupby('D').count() == 3).index)
df[criteria].groupby('D')[['G', 'K']].sum()

Related

creating a new column in a dataframe based on 4 other dataframes

Imagine we have 4 dataframes
df1(35000, 20)
df2(12000, 21)
df3(323, 18)
df4(220, 6)
Here is where it is get tricky:
df4 was created by a merge of df3 and df2 based on 1 column.
It took 3 columns from df3 and 3 columns from df2. (that is why it has 6 cols in total)
what I want is the following: I wish to create an extra column in df1 and insert specific values for the rows that have the same value in a specific column in df1 and df3. For this reason I have done the following
df1['new col'] = df1['Name'].isin(df3['Name'])
Now my new column is filled with values True/False whether the value in the column name is the same for both df1 and df2. So far so good, but what I want to fill this new column with the values of a specific column from df2. I tried the following
df1['new col'] = df1['Name'].map({True:df2['Address'],False:'no address inserted'})
However, it inserts all the values of addresses from df2 in that cell instead only the 1 value that is needed. Any ideas?
I also tried the following
merged = df2(df4, how='left', left_on='Name',right_on = 'First Name', indicator=True)
df1['Code'] = np.where(merged['_merge'] == 'both', merged['Address'], 'n.a.')
but I get the following error
Length of values (1210) does not match length of index (35653)
merge using the how='left' and then fill the missing values with fillna.
merged = df2(df4, how='left', left_on='Name',right_on = 'First Name', indicator=True)
merged[address_column].fillna('n.a.', inplace=True) #address column is the name or list of names of columns that you want the replace the nan's with

Mapping Two dataframes Pandas

I want to map two dataframes in pandas , in DF1 I have
df1
my second dataframe looks like
df2
I want to merge the two dataframes and get something like this
merged DF
on the basis of the 1 occuring in the DF1 , it should be replaced by the value after merging
so far i have tried
mergedDF = pd.merge(df1,df2, on=companies)
Seems like you need .idxmax() method.
merged = df1.merge(df2, on='Company')
merged['values'] = merged[[x for x in merged.columns if x != 'Company']].idxmax(axis=1)

Get the missing columns from one dataframe and append it to another dataframe

I have a Dataframe df1 with the columns. I need to compare the headers of columns in df1 with a list of headers from df2
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare the df1 with df2 and if any missing columns, I need to add them to df1 with blank values.
I tried concat and also append and both didn't work. with concat, I'm not able to add the column e and with append, it is appending all the columns from df1 and df2. How would I get only missing column added to df1 in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
df1.to_csv(path + file_name, sep='|')
else:
print("something is missing, continuing")
#pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
By default this does an 'outer' join
By specifying axis=1 we focus on columns
This returns a tuple of both an aligned df1 and df2 with the calling dataframe being the first element. So I grab the first element with [0]
pandas.DataFrame.reindex
df1.reindex(columns=df1.columns | df2.columns)
You can treat pandas.Index objects like sets most of the time. So df1.columns | df2.columns is the union of those two index objects. I then reindex using the result.
Lets first create the two dataframes as:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add those columns of df2 to df1 (with nan values), which are not in df1:
for i in list(df2):
if i not in list(df1):
df1[i] = np.nan
Now display the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]

Pandas: Merge 2 dataframes based on a column values; for mulitple rows containing same column value, append those to different columns

I have two dataframes, dataframe1 and dataframe2. They both share the same data in a particular column for both, lets call this column 'share1' and 'share2' for dataframe1 and dataframe2 respectively.
The issue is, there are instances where in dataframe1 , there is only one row in 'share1' with a particular value (lets call it 'c34z'), but in dataframe2 there are multiple rows with the value 'c34z' in the 'share2' column.
What I would like to do is, in the new merged dataframe, when there are new values, I would just like to place them in a new column.
So the number of columns in the new dataframe will be the maximum number of duplicates for a particular value in 'share2' . And for rows where there was only a unique value in 'share2', the rest of the added columns will be blank, for that row.
You can using cumcount create the additional key then, pivot df2
newdf2=df2.assign(key=df2.groupby('share2').cumcount(),v=df2.share2).pivot_table(index='share2',columns='key',values='v',aggfunc='first')
After this ,I am using .loc or reindex concat df2 to df1
df2=df2.reindex(df1.share1)
df2.index=df1.index
yourdf=pd.concat([df1,df2],axis=1)
Loading Data:
import pandas as pd
df1 = {'key': ['c34z', 'c34z_2'], 'value': ['x', 'y']}
df2 = {'key': ['c34z', 'c34z_2', 'c34z_2'], 'value': ['c34z_value', 'c34z_2_value', 'c34z_2_value']}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
Convert df2 by grouping and pivoting
df2_pivot = df2.groupby('key')['value'].apply(lambda df: df.reset_index(drop=True)).unstack().reset_index()
merge df1 and df2_pivot
df_merged = pd.merge(df1, df2_pivot, on='key')

Perform full Merge between the columns of two data frames, based on starting alphabet

I want to perform full merge between the values of two columns (Name) of two different data frames. Merge should only be done between Names starting with same alphabet. For eg. ABC should be merged with all Names of other data frame which start with letter 'A'. And this should be done for all letters 'A' to 'Z'. I am writing the following code. But length of full merge shows 0. I also want to append the result obtained after merging based on each letter into a new data frame. What changes should I make? Here's my code -
for c in ascii_uppercase:
df1 = df1[df1.Name.str[0] == c ].copy()
df2 = df2[df2.Name.str[0] == c].copy()
df1['Join'] =1
df2['Join'] =1
FullMerge = pd.merge(df2,df1, left_on='Join',right_on='Join')
len(FullMerge)
I'd create a column of 'FirstLetter' and [merge][1] on that.
import pandas as pd
import numpy as np
from string import ascii_uppercase
df1 = pd.DataFrame(np.random.choice(list(ascii_uppercase), (5, 3)))
df1 = df1.apply(lambda x: pd.Series([''.join(x)], index=['Name']), axis=1)
df1['FirstLetter'] = df1.Name.str.get(0)
df2 = pd.DataFrame(np.random.choice(list(ascii_uppercase), (1000, 10)))
df2 = df2.apply(lambda x: pd.Series([''.join(x)], index=['Name']), axis=1)
df2['FirstLetter'] = df2.Name.str.get(0)
df1.merge(df2, on='FirstLetter')
All you should have to do with your dataframes is:
df1['FirstLetter'] = df1.Name.str.get(0)
df2['FirstLetter'] = df2.Name.str.get(0)
df1.merge(df2, on='FirstLetter')
columns with common names will be have a suffix appended (which you can control: docs). All columns should be represented. Caveat, you may need to use the how parameter to change the merge behavior to one of either 'inner' (default), 'outer', 'left', 'right'.
df1
df2.head()
df1.merge(df2, on='FirstLetter').head()

Categories