Comparing two Excel files with pandas - Python

I have two Excel files, A and B. A is the master copy, where the up-to-date record of employee Name and Organization Name (Name and Org) is kept. File B contains the Name and Org columns with slightly older records, plus many other columns we are not interested in.
Name Org
0 abc ddc systems
1 sdc ddc systems
2 csc ddd systems
3 rdc kbf org
4 rfc kbf org
I want to do two operations on this:
1) Compare Excel B (columns Name and Org) with Excel A (columns Name and Org) and update file B with all the missing Name entries and their corresponding Org.
2) For all existing entries in file B (columns Name and Org), compare with file A and update the Org column if any employee's organization has changed.
For 1), to find the new entries I tried the approach below (not sure if this approach is correct, though); the output is a set of tuples, which I was not sure how to write back into a DataFrame.
diff = set(zip(new_df.Name, new_df.Org)) - set(zip(old_df.Name, old_df.Org))
Any help will be appreciated. Thanks.

If names are unique, just concatenate A and B, and drop duplicates. Assuming A and B are your DataFrames,
df = pd.concat([A, B]).drop_duplicates(subset=['Name'], keep='first')
Or,
A = A.set_index('Name')
B = B.set_index('Name')
idx = B.index.difference(A.index)
df = pd.concat([A, B.loc[idx]]).reset_index()
Both should be approximately the same in terms of performance.
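A quick runnable sketch of this approach, using toy frames in place of the asker's Excel files (names and values here are made up):

```python
import pandas as pd

# A is the up-to-date master copy; B is the older file to be refreshed.
A = pd.DataFrame({'Name': ['abc', 'sdc', 'csc'],
                  'Org':  ['ddc systems', 'ddc systems', 'ddd systems']})
B = pd.DataFrame({'Name': ['abc', 'rdc'],
                  'Org':  ['old org', 'kbf org']})

# keep='first' prefers A's row whenever a Name appears in both frames,
# so changed Orgs are updated and B-only names are preserved in one step.
merged = (pd.concat([A, B])
            .drop_duplicates(subset=['Name'], keep='first')
            .sort_values('Name')
            .reset_index(drop=True))
print(merged)
```

The result could then be written back out with `merged.to_excel(...)`.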

Solution:
diff = pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),
                    columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))
Example:
import pandas as pd
aa = ['aa1', 'aa2', 'aa3', 'aa4', 'aa5']
bb = ['bb1', 'bb2', 'bb3', 'bb4','bb5']
nest = [aa, bb]
df = pd.DataFrame(nest, ['aa', 'bb']).T
df2 = pd.DataFrame(nest, ['aa', 'bb']).T
df2['aa']=df2['aa'].shift(2)
diff = pd.DataFrame(list(set(zip(df['aa'], df['bb'])) - set(zip(df2['aa'], df2['bb']))),
                    columns=df.columns)
print(diff.sort_values(by='aa').reset_index(drop=True))
Output:
aa bb
0 aa1 bb1
1 aa2 bb2
2 aa3 bb3
3 aa4 bb4
4 aa5 bb5
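If you would rather stay inside pandas than round-trip through Python sets, `merge` with `indicator=True` finds the rows of one frame that are missing from the other; a sketch with the same toy columns as above:

```python
import pandas as pd

df = pd.DataFrame({'aa': ['aa1', 'aa2', 'aa3'], 'bb': ['bb1', 'bb2', 'bb3']})
df2 = pd.DataFrame({'aa': ['aa1'], 'bb': ['bb1']})

# indicator=True adds a '_merge' column recording where each row was found.
merged = df.merge(df2, on=['aa', 'bb'], how='left', indicator=True)
diff = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(diff.reset_index(drop=True))
```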


Pivot DataFrame with Multiple Dimensions

I have two reports, one with training status and then one master roster. The training report has 15 columns. The master roster has 9 columns. I have created a small sample below. My terminology might not be correct since I'm new to Python.
Training Report (I add the Training column with some conditional logic from the Training Code column. Please note, a name can be repeated if they have completed multiple training such as Name2.)
import pandas as pd
df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name2', 'Name3'],
                   'Office': ['A', 'B', 'B', 'A'],
                   'Position': ['Director', 'Manager', 'Manager', 'Analyst'],
                   'Training Code': ['C3', 'C1-L', 'C2', 'C1-B'],
                   'Training': ['ADV', 'BEG', 'INT', 'BEG']})
Output
Name Office Position Training Code Training
0 Name1 A Director C3 ADV
1 Name2 B Manager C1-L BEG
2 Name2 B Manager C2 INT
3 Name3 A Analyst C1-B BEG
Master Roster (I add the Required column based on the condition of the Status column. This is a unique list of names of everyone on the roster.)
df4 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required']})
Output
Name Office Position Symbol Status Required
0 Name1 A Director OS 1 Required
1 Name2 B Manager BP 3 Required
2 Name3 A Analyst OD 8 Recommended
3 Name4 C Supervisor EO 2 Required
I need to merge the master roster and training data so it looks like below.
df3 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required'],
                    'ADV': [1, 0, 0, 0],
                    'INT': [0, 1, 0, 0],
                    'BEG': [0, 1, 1, 0]})
DESIRED OUTPUT (Unique list of names and information about each name - the master roster, merged with a pivoted version of the training report.)
Name Office Position Symbol Status Required ADV INT BEG
0 Name1 A Director OS 1 Required 1 0 0
1 Name2 B Manager BP 3 Required 0 1 1
2 Name3 A Analyst OD 8 Recommended 0 0 1
3 Name4 C Supervisor EO 2 Required 0 0 0
I need to use the master roster to get all the names and the other fields in that report. Then, I need to merge that report with a pivoted training report with the Training column being broken apart into multiple columns with a count.
My first step was to try to pivot the training report data (not using all the columns) and then merge it with the master roster.
pvt = df.pivot_table(index=['Name', 'Office', 'Position'],
                     columns='Training',
                     fill_value=0,
                     aggfunc='count')
However, I'm not sure if that is the best way, and the pivot output doesn't seem to be merge-friendly (I could be wrong). In SQL I would just LEFT JOIN the training report to the pivoted master roster on the Name column.
Any guidance would be greatly appreciated on the easiest and best way to accomplish merging those 2 reports to get my final desired outcome. Please let me know if I need to clarify anything further!
----- UPDATE 2 -------
I was able to merge and then pivot the data set, but it's not quite how I want it to look. The merge looks good, and I only bring in the columns I need.
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')
I then replace the 'NaN' values in the Training column with 'NONE'.
result.update(result[['Training']].fillna('NONE'))
Merge Output
Name Office Position Symbol Status Required Training
0 Name1 A Director OS 1 Required ADV
1 Name2 B Manager BP 3 Required BEG
2 Name2 B Manager BP 3 Required INT
3 Name3 A Analyst OD 8 Recommended BEG
4 Name4 C Supervisor EO 2 Required NONE
However, when I try to pivot the result dataframe, I get 'Empty DataFrame' now.
cols = ['Name', 'Office', 'Position', 'Symbol', 'Status', 'Required']
pvt2 = result.pivot_table(index=cols,
                          columns='Training',
                          fill_value=0,
                          aggfunc='count')
-------- FINAL UPDATE ---------
I got it to work! Yay!
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')
result.update(result[['Training']].fillna('NONE'))
cols = ['Name', 'Office', 'Position', 'Symbol', 'Status', 'Required']
pvt2 = result.pivot_table(index=cols,
                          columns=['Training'],
                          fill_value=0,
                          aggfunc=len)
All I had to do was change `aggfunc='count'` to `aggfunc=len`. I hope that ends up helping someone else! If anyone has improvements on this, I'm definitely open to those as well.
There might be a better way, but this solution works for me! Again, I'm happy to accept feedback or improvements!
import pandas as pd

# Create DataFrame for training
df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name2', 'Name3', 'Name1'],
                   'Office': ['A', 'B', 'B', 'A', 'A'],
                   'Position': ['Director', 'Manager', 'Manager', 'Analyst', 'Director'],
                   'Training Code': ['C3', 'C1-L', 'C2', 'C1-B', 'C3'],
                   'Training': ['ADV', 'BEG', 'INT', 'BEG', 'ADV']})

# Create DataFrame for master roster
df4 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Office': ['A', 'B', 'A', 'C'],
                    'Position': ['Director', 'Manager', 'Analyst', 'Supervisor'],
                    'Symbol': ['OS', 'BP', 'OD', 'EO'],
                    'Status': [1, 3, 8, 2],
                    'Required': ['Required', 'Required', 'Recommended', 'Required']})

# Left join the training DataFrame to the master roster DataFrame
# using the 'Name' column as the join key.
result = pd.merge(df4,
                  df[['Name', 'Training']],
                  on='Name',
                  how='left')

# Substitute any NaN values with 'NONE' so the pivot doesn't drop rows with NaN
result.update(result[['Training']].fillna('NONE'))

# Store all the column headers of the master roster in the 'cols' list
# (the roster DataFrame is named df4 here)
cols = list(df4.columns)

# Pivot the combined 'result' DataFrame using all the columns from
# the master roster DataFrame as the index. The 'Training' column is
# broken apart into one column per value; 'aggfunc=len' counts the
# instances of each 'Training' element.
pvt2 = result.pivot_table(index=cols,
                          columns=['Training'],
                          fill_value=0,
                          aggfunc=len)
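For what it's worth, `pd.crosstab` can produce the same per-Training counts more directly, and a plain left merge then attaches them to the roster; a sketch using trimmed versions of the frames above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name2', 'Name3'],
                   'Training': ['ADV', 'BEG', 'INT', 'BEG']})
df4 = pd.DataFrame({'Name': ['Name1', 'Name2', 'Name3', 'Name4'],
                    'Status': [1, 3, 8, 2]})

# One row per Name, one column per Training value, cells are counts.
counts = pd.crosstab(df['Name'], df['Training'])

# A left join keeps every roster row; names with no training get NaN -> 0.
out = df4.merge(counts, left_on='Name', right_index=True, how='left').fillna(0)
print(out)
```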

What's the fastest way to select values from columns based on keys in another columns in pandas?

I need a fast way to extract the right values from a pandas dataframe:
Given a dataframe with (a lot of) data in several named columns and an additional columns whose values only contains names of the other columns, how do I select values from the data-columns with the additional columns as keys?
It's simple to do via an explicit loop, but this is extremely slow with something like .iterrows() directly on the DataFrame. If converting to numpy-arrays, it's faster, but still not fast. Can I combine methods from pandas to do it even faster?
Example: This is the kind of DataFrame structure, where columns A and B contain data and column keys contains the keys to select from:
import pandas
df = pandas.DataFrame(
    {'A': [1, 2, 3, 4],
     'B': [5, 6, 7, 8],
     'keys': ['A', 'B', 'B', 'A']},
)
print(df)
output:
Out[1]:
A B keys
0 1 5 A
1 2 6 B
2 3 7 B
3 4 8 A
Now I need some fast code that returns a DataFrame like
Out[2]:
val_keys
0 1
1 6
2 7
3 4
I was thinking something along the lines of this:
tmp = df.melt(id_vars=['keys'], value_vars=['A', 'B'])
out = tmp.loc[tmp['keys'] == tmp['variable']]
which produces:
Out[2]:
keys variable value
0 A A 1
3 A A 4
5 B B 6
6 B B 7
but doesn't have the right order or index. So it's not quite a solution.
Any suggestions?
See if either of these works for you (both need import numpy as np):
df['val_keys'] = np.where(df['keys'] == 'A', df['A'], df['B'])
or
df['val_keys'] = np.select([df['keys'] == 'A', df['keys'] == 'B'], [df['A'], df['B']])
No need to specify the column names by hand for the code below:
def value(row):
    a = row.name       # the row's index label
    b = row['keys']    # which column to pull from
    c = df.loc[a, b]
    return c

df['val_keys'] = df.apply(value, axis=1)
Have you tried filtering and then assigning back by index:
df_A = df[df['keys'] == 'A']
df_B = df[df['keys'] == 'B']
df.loc[df_A.index, 'val_keys'] = df_A['A']
df.loc[df_B.index, 'val_keys'] = df_B['B']
Your df['val_keys'] column will now contain the result as in your val_keys output.
If you want you can just retain that column as in your expected output by:
df = df[['val_keys']]
Hope this helps :))
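If the number of data columns can grow beyond two, a fully vectorized variant (my own sketch, not from the answers above) uses plain NumPy row-wise indexing, which also preserves the original order and index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'keys': ['A', 'B', 'B', 'A']})

cols = ['A', 'B']
# For each row, find the position of its key within cols...
col_idx = pd.Index(cols).get_indexer(df['keys'])
# ...then pick exactly one value per row from the data-column block.
df['val_keys'] = df[cols].to_numpy()[np.arange(len(df)), col_idx]
print(df['val_keys'].tolist())  # [1, 6, 7, 4]
```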

How to modify data after replicate in Pandas?

I am trying to edit values after making duplicate rows in Pandas.
I want to edit only one column ("code"), but I see that since the values are duplicated, an edit affects all matching rows.
Is there any method to first create duplicates and then modify the data only in the duplicates created?
import pandas as pd
df = pd.read_excel('so.xlsx')  # note: read_excel has no index argument; that belongs to to_excel
a = df['code'] == 1234
b = df[a]
df = df.append(b)
print('\n\nafter replicate')
print(df)
Current output after making duplicates is as below:
coun code name
0 A 123 AR
1 F 123 AD
2 N 7 AR
3 I 0 AA
4 T 10 AS
2 N 7 AR
3 I 7 AA
Now I expect to change values only in the duplicates created, in this case the bottom two rows. But I see the indexes are duplicated as well.
You can avoid the duplicate indices by using the ignore_index argument to append.
df=df.append(b, ignore_index=True)
You may also find it easier to modify your data in b before appending it to the frame.
import pandas as pd

df = pd.read_excel('so.xlsx')
a = df['code'] == 3
b = df[a].copy()          # copy so the edit below doesn't also change the original rows
b.loc[2, "region"] = "N"  # .loc avoids chained-assignment pitfalls
df = df.append(b, ignore_index=True)
print('\n\nafter replicate')
print(df)
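DataFrame.append has since been deprecated (and removed in pandas 2.0), so here is a hedged sketch of the same idea with pd.concat, using a toy frame in place of so.xlsx:

```python
import pandas as pd

df = pd.DataFrame({'coun': ['A', 'F', 'N'],
                   'code': [123, 123, 7],
                   'name': ['AR', 'AD', 'AR']})

# Copy the rows to duplicate, modify only the copies, then concatenate.
dup = df[df['code'] == 123].copy()
dup['code'] = 999
df = pd.concat([df, dup], ignore_index=True)
print(df)
```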

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame whose dimension is n×k, and during processing I will need to add columns of dimension m×1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns of different lengths?
If you use accepted answer, you'll lose your column names, as shown in the accepted answer example, and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To save column names, use pandas.concat, but don't ignore_index (default value of ignore_index is false; so you can omit that argument altogether). Continue to use axis=1:
import pandas

# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})

# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})

new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
#    Age Gender      Name
# 0   10      M    Nate A
# 1   12      F  Jessie A
# 2   13      F  Daniel H
# 3  NaN    NaN    John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame by padding the shorter ones.
Example:
a = [0, 1, 2, 3]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0, 1]
Find the length of each list:
la, lb, lc = len(a), len(b), len(c)
# now find the max
max_len = max(la, lb, lc)
Resize all lists to the determined max length, padding with empty strings:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length, so create the DataFrame:
pd.DataFrame({'A': a, 'B': b, 'C': c})
Final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue: two different dataframes without a common column, and I just needed to put them beside each other in a CSV file.
Merge:
In this case, "merge" does not work, even after adding a temporary column to both dfs and then dropping it, because merge forces both dfs to the same length: it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create a dict from the enumerated list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    # create a new DataFrame from the dict and return it
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
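A quick usage sketch of the fill_column helper above (the frame and values here are my own illustration):

```python
import pandas as pd

def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create a dict from the enumerated list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T

df = pd.DataFrame({'A': [10, 20, 30], 'B': ['x', 'y', 'z']})
# Replace column 'B' with a shorter list; the missing cell becomes NaN.
out = fill_column(df, ['p', 'q'], 'B')
print(out)
```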

programmatically add pandas DataFrame name to columns

This should be a pretty simple question, but I'm looking to programmatically insert the name of a pandas DataFrame into that DataFrame's column names.
Say I have the following DataFrame:
name_of_df = pandas.DataFrame({1: ['a','b','c','d'], 2: [1,2,3,4]})
print name_of_df
1 2
0 a 1
1 b 2
2 c 3
3 d 4
I want to have following:
name_of_df = %%some_function%%(name_of_df)
print name_of_df
name_of_df1 name_of_df2
0 a 1
1 b 2
2 c 3
3 d 4
..where, as you can see, the name of the DataFrame is programmatically inserted into the column names. I know pandas DataFrames don't have a __name__ attribute, so I'm drawing a blank on how to do this.
Please note that I want to do this programmatically, so altering the column names with a hardcoded 'name_of_df' string won't work.
From the linked question, you can do something like this. Multiple names can point to the same DataFrame, so this will just grab the "first" one.
def first_name(obj):
    return [k for k in globals() if globals()[k] is obj and not k.startswith('_')][0]
In [24]: first_name(name_of_df)
Out[24]: 'name_of_df'
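To finish the renaming the question actually asks for, the retrieved name can be prefixed onto each column; a sketch building on the first_name helper (it relies on the DataFrame being bound to a module-level variable):

```python
import pandas as pd

def first_name(obj):
    # First global variable name bound to this exact object.
    return [k for k in globals() if globals()[k] is obj and not k.startswith('_')][0]

def prefix_columns(df):
    name = first_name(df)
    return df.rename(columns=lambda c: f'{name}{c}')

name_of_df = pd.DataFrame({1: ['a', 'b', 'c', 'd'], 2: [1, 2, 3, 4]})
name_of_df = prefix_columns(name_of_df)
print(list(name_of_df.columns))  # ['name_of_df1', 'name_of_df2']
```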
