Merging two dataframes without creating suffix - python

I want to create a new dataframe by merging two separate dataframes. The dataframes share a common key and some common columns. These common columns contain some, but not all, of the same values. Where a cell holds the same value in both dataframes I want to keep a single copy, and where the values differ I want to keep both. My data looks something like this:
Left:
key1  valueZ  valueX  valueY
A     bob     1       4
B     jes     8       5
C     joe     3       6
Right:
key1  valueZ  valueX  valueY  valueK
A     sam     7       4       hill town
B     beth    8       11      market
C     joe     9       12      mall
The expected output would be:
key1  valueZ    valueX  valueY  valueK
A     bob/sam   1/7     4       hill town
B     jes/beth  8       5/11    market
C     joe       3/9     6/12    mall

You will need to do this in a few steps.
Here is my setup for reference:
import pandas as pd

# define DataFrames
left = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'valueZ': ['bob', 'jes', 'joe'],
    'valueX': [1, 8, 3],
    'valueY': [4, 5, 6]
})
right = pd.DataFrame({
    'key1': ['A', 'B', 'C'],
    'valueZ': ['sam', 'beth', 'joe'],
    'valueX': [7, 8, 9],
    'valueY': [4, 11, 12],
    'valueK': ['hill town', 'market', 'mall']
})
Now I have two DataFrame objects, left and right, that match your example.
To combine them the way you want, I need to know which columns the two DataFrames have in common, as well as the final list of columns. I also define the key column here for ease of configuration. You can do that like this:
# determine important columns
keyCol = 'key1'
commonCols = list(set(left.columns) & set(right.columns))
finalCols = list(set(left.columns) | set(right.columns))
print('Common = ' + str(commonCols) + ', Final = ' + str(finalCols))
Which gives:
Common = ['valueZ', 'valueX', 'valueY', 'key1'], Final = ['valueZ', 'key1', 'valueK', 'valueX', 'valueY']
Next, you will join the two DataFrames as normal, but give the columns that appear in both DataFrames a suffix (documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
# join dataframes with suffixes
mergeDf = left.merge(right, how='left', on=keyCol, suffixes=('_left', '_right'))
Finally, you will combine the common columns using whatever logic you desire. Once they are combined, you can drop the suffixed columns from your DataFrame. An example is below; you can do this in more efficient ways (one vectorized version is sketched after the output), but I wanted to break it down for clarity.
# combine the common columns
for col in commonCols:
    if col != keyCol:
        for i, row in mergeDf.iterrows():
            leftVal = str(row[col + '_left'])
            rightVal = str(row[col + '_right'])
            print(leftVal + ',' + rightVal)
            if leftVal == rightVal:
                mergeDf.loc[i, col] = leftVal
            else:
                mergeDf.loc[i, col] = leftVal + '/' + rightVal
# only use the finalCols
mergeDf = mergeDf[finalCols]
This gives:
valueZ key1 valueK valueX valueY
0 bob/sam A hill town 1/7 4
1 jes/beth B market 8 5/11
2 joe C mall 3/9 6/12
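As a faster, vectorized alternative to the iterrows loop above, you could build the combined columns with numpy.where. This is just a sketch that assumes the same mergeDf, commonCols, keyCol and finalCols defined earlier:
import numpy as np
# combine each pair of suffixed columns without iterating row by row
for col in commonCols:
    if col != keyCol:
        leftVals = mergeDf[col + '_left'].astype(str)
        rightVals = mergeDf[col + '_right'].astype(str)
        # keep a single value when both sides agree, otherwise join them with '/'
        mergeDf[col] = np.where(leftVals == rightVals, leftVals, leftVals + '/' + rightVals)
mergeDf = mergeDf[finalCols]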

Related

How to create multiple summary statistics for each column in a grouping?

Using groupby().agg() lets you calculate summary statistics for specifically named columns. However, what if I want to calculate 'min', 'max' and 'mean' for every column of the data frame per group? Is there a way to have pandas add the statistic's name to each column name automatically? I do not want to enumerate each basic column name within the agg() function.
You can iterate through every column, then build the new column names from the original column name and the operation. As far as I can tell, if you pass several operations for the same column to .agg via a dict, only the last one is kept, so in this example I do one operation at a time. Here's one way to do what you want, assuming there is a column 'col1' that you will use to line up all the groupby data:
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B'], 'col2': [1, 2, 3, 4], 'col3': [5, 6, 7, 8]})
col_list = df.columns.tolist()
col_list.remove('col1')  # the column you will use for the groupby output
dfg_all = df[['col1']].drop_duplicates()
for col in col_list:
    for op in ['min', 'max', 'mean']:
        if op == 'min':
            dfg = df.groupby('col1', as_index=False)[col].min()
        elif op == 'max':
            dfg = df.groupby('col1', as_index=False)[col].max()
        else:
            dfg = df.groupby('col1', as_index=False)[col].mean()
        dfg = dfg.rename(columns={col: col + '_' + op})
        dfg_all = dfg_all.merge(dfg, on='col1', how='left')
to get
col1 col2_min col2_max col2_mean col3_min col3_max col3_mean
0 A 1 2 1.5 5 6 5.5
1 B 3 4 3.5 7 8 7.5
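For reference, a more compact sketch using a single groupby call with a list of aggregations (this assumes the same df as above); the resulting MultiIndex columns are flattened by hand to get names like col2_min:
# one groupby call computing all three statistics for every remaining column
dfg = df.groupby('col1').agg(['min', 'max', 'mean'])
# flatten ('col2', 'min') -> 'col2_min' and bring col1 back as a column
dfg.columns = ['_'.join(pair) for pair in dfg.columns]
dfg = dfg.reset_index()
print(dfg)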
You could also get there using describe() (note that the output below comes from a different, all-numeric example DataFrame):
df1 = pd.DataFrame(df.describe().unstack())
n_label = pd.Series(['_'.join(map(str,i)) for i in df1.index.tolist()])
df1 = df1.reset_index(drop=True)
df1['label'] = n_label
print(df1[df1['label'].str.contains('_m')].reset_index(drop=True))
0 label
0 4.0105 col1_mean
1 0.0000 col1_min
2 12.0000 col1_max
3 3.9639 col2_mean
4 0.0000 col2_min
5 12.0000 col2_max
6 4.0256 col3_mean
7 0.0000 col3_min
8 12.0000 col3_max

How do I compare each row with all the others and if it's the same I concatenate to a new dataframe? Python

I have a DataFrame with 2 columns:
import pandas as pd
data = {'Country': ['A', 'A', 'A' ,'B', 'B'],'Capital': ['CC', 'CD','CE','CF','CG'],'Population': [5, 35, 20,34,65]}
df = pd.DataFrame(data,columns=['Country', 'Capital', 'Population'])
I want to compare each row with all others, and if it has the same Country, I would like to concatenate the pair into a new dataframe (and export it to a new csv).
new_data = {'Country': ['A', 'A','B'],'Capital': ['CC', 'CD','CF'],'Population': [5, 35,34],'Country_2': ['A', 'A' ,'B'],'Capital_2': ['CD','CE','CG'],'Population_2': [35, 20,65]}
df_new = pd.DataFrame(new_data,columns=['Country', 'Capital', 'Population','Country_2','Capital_2','Population_2'])
NOTE: This is a simplification of my data; I have more than 5000 rows and I would like to do it automatically.
I tried comparing dictionaries, and also comparing one row at a time, but I couldn't do it.
Thank you for your attention.
>>> df.join(df.groupby('Country').shift(-1), rsuffix='_2')\
... .dropna(how='any')
Country Capital Population Capital_2 Population_2
0 A CC 5 CD 35.0
1 A CD 35 CE 20.0
3 B CF 34 CG 65.0
This pairs every row with the next one using join + shift, but we restrict the shifting to rows within the same country using groupby. See what groupby + shift does on its own:
>>> df.groupby('Country').shift(-1)
Capital Population
0 CD 35.0
1 CE 20.0
2 NaN NaN
3 CG 65.0
4 NaN NaN
Then once these values are added to the right of your data with the _2 suffix, the rows that have NaNs are dropped with dropna().
Finally, note that Country_2 is not repeated as it is always the same as Country, but it would be very easy to add, as shown below.
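For completeness, a small sketch of adding it back (the variable name paired is just for illustration):
paired = df.join(df.groupby('Country').shift(-1), rsuffix='_2').dropna(how='any')
paired['Country_2'] = paired['Country']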
To get all combinations you can try:
import numpy as np
from itertools import combinations, chain

df = (
    pd.concat(
        [pd.DataFrame(
            np.array(list(chain(*combinations(k.values, 2)))).reshape(-1, len(df.columns) * 2),
            columns=df.columns.append(df.columns.map(lambda x: x + '_2')))
         for g, k in df.groupby('Country')]
    )
)
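Another sketch is a self-merge on 'Country', applied to the original df from the question (note that the snippet above reassigns df, so this assumes the original frame); comparing the row positions keeps each unordered pair only once:
pairs = df.reset_index().merge(df.reset_index(), on='Country', suffixes=('', '_2'))
pairs = pairs[pairs['index'] < pairs['index_2']].drop(columns=['index', 'index_2'])
print(pairs)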

Sorting and ranking subsets of complex data

I have a large and complex GIS datafile on road accidents in 'cities' within 'counties'. Rows represent roads. Columns provide 'City', 'County' and 'Sum of accidents in city'. A city thus contains several roads (with the accident sum repeated on each of them) and a county contains several cities.
For each 'County', I now want to rank cities according to the number of accidents, so that within each 'County' the city with the most accidents is ranked '1' and cities with fewer accidents get rank '2' and higher. This rank value shall be written into the original datafile.
My original approach was to:
1. Sort the data by 'County_ID' and 'Accidents' (descending)
2. Then calculate for each row:
if ('County' in row n+1 = 'County' in row n) AND ('Accidents' in row n+1 = 'Accidents' in row n):
    return value: n  ## maintain same rank for cities within 'County'
else if ('County' in row n+1 = 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: n+1  ## increasing rank value within 'County'
else if ('County' in row n+1 < 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: 1  ## new 'County', i.e. start ranking from 1
else:
    return value: 0  ## error
However, I could not figure out how to code this properly; and maybe this approach is not an appropriate way either; maybe a loop would do the trick?
Any recommendations?
I suggest using the Python pandas module.
Fictitious data
Create data with columns county, accidents, and city. You would use pandas read_csv to load your actual data.
import pandas as pd

df = pd.DataFrame([
    ['a', 1, 'A'],
    ['a', 2, 'B'],
    ['a', 5, 'C'],
    ['b', 5, 'D'],
    ['b', 5, 'E'],
    ['b', 6, 'F'],
    ['b', 8, 'G'],
    ['c', 2, 'H'],
    ['c', 2, 'I'],
    ['c', 7, 'J'],
    ['c', 7, 'K']
], columns=['county', 'accidents', 'city'])
Resultant Dataframe
df:
county accidents city
0 a 1 A
1 a 2 B
2 a 5 C
3 b 5 D
4 b 5 E
5 b 6 F
6 b 8 G
7 c 2 H
8 c 2 I
9 c 7 J
10 c 7 K
Group data rows by county, and rank rows within group by accidents
Ranking Code
# ascending=False causes cities with the most accidents to be ranked 1
df["rank"] = df.groupby("county")["accidents"].rank("dense", ascending=False)
Result
df:
county accidents city rank
0 a 1 A 3.0
1 a 2 B 2.0
2 a 5 C 1.0
3 b 5 D 3.0
4 b 5 E 3.0
5 b 6 F 2.0
6 b 8 G 1.0
7 c 2 H 2.0
8 c 2 I 2.0
9 c 7 J 1.0
10 c 7 K 1.0
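If you would rather have integer ranks than the float output above, the result can be cast (a small sketch assuming the same df):
df["rank"] = df.groupby("county")["accidents"].rank("dense", ascending=False).astype(int)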
I think the approach by @DarrylG is correct, but it doesn't take into account that the environment is ArcGIS.
Since you tagged your question with Python, I came up with a workflow using pandas. There are other ways to do the same thing using ArcGIS tools and/or the Field Calculator.
import arcpy  # if you are using this script outside ArcGIS
import pandas as pd

# change this to your actual shapefile, you might have to include a path
filename = "road_accidents"
sFields = ['County', 'City', 'SumOfAccidents']  # consider these to be your columns

# read everything in your file into a pandas DataFrame with a SearchCursor
with arcpy.da.SearchCursor(filename, sFields) as sCursor:
    df = pd.DataFrame(data=[row for row in sCursor], columns=sFields)
df = df.drop_duplicates()  # since each row represents a street, we can remove duplicates

# we use this code from DarrylG to calculate a rank (most accidents -> rank 1)
df['Rank'] = df.groupby('County')['SumOfAccidents'].rank('dense', ascending=False)

# set a multiindex, since there might be duplicate city names
df = df.set_index(['County', 'City'])
dct = df.to_dict()  # convert the dataframe into a dictionary

# add a field to your shapefile
arcpy.AddField_management(filename, 'Rank', 'SHORT')

# now we can update the shapefile
uFields = ['County', 'City', 'Rank']
with arcpy.da.UpdateCursor(filename, uFields) as uCursor:  # open an UpdateCursor on the file
    for row in uCursor:  # for each row (street)
        # get the county/city combo
        County_City = (row[uFields.index('County')], row[uFields.index('City')])
        if County_City in dct['Rank']:  # see if it is in your dictionary (it should be)
            # give it the value from the dictionary
            row[uFields.index('Rank')] = dct['Rank'][County_City]
        else:
            # otherwise...
            row[uFields.index('Rank')] = 999
        uCursor.updateRow(row)  # update the row
You can run this code inside ArcGIS Pro Python console. Or using Jupyter Notebooks. Hope it helps!

find difference between any two columns of dataframes with a common key column pandas

I have two dataframes. The first has
Title Name Quantity ID
as its columns, and the second has
ID Quantity
as its columns, with fewer rows than the first dataframe.
I need to find the difference between the Quantity of both dataframes based on matches in the ID columns, and I want to store this difference in a separate column in the first dataframe.
I tried this (didn't work):
DF1[['ID','Quantity']].reset_index(drop=True).apply(lambda id_qty_tup : DF2[DF2.ID==asin_qty_tup[0]].quantity - id_qty_tup[1] , axis = 1)
Another approach is to apply over the ID and Quantity of DF1 and iterate through each row of DF2, but that takes more time. I'm sure there is a better way.
You can perform index-aligned subtraction, and pandas takes care of the rest.
df['Diff'] = df.set_index('ID').Quantity.sub(df2.set_index('ID').Quantity).values
Demo
Here, changetype is the index, and I've already set it, so pd.Series.sub will align subtraction by default. Otherwise, you'd need to set the index as above.
df1
strings test
changetype
0 a very -1.250150
1 very boring text -1.376637
2 I cannot read it -1.011108
3 Hi everyone -0.527900
4 please go home -1.010845
5 or I will go 0.008159
6 now -0.470354
df2
strings test
changetype
0 a very very boring text 0.625465
1 I cannot read it -1.487183
2 Hi everyone 0.292866
3 please go home or I will go now 1.430081
df1.test.sub(df2.test)
changetype
0 -1.875614
1 0.110546
2 -1.303974
3 -1.957981
4 NaN
5 NaN
6 NaN
Name: test, dtype: float64
You can use map in this case:
df['diff'] = df['ID'].map(df2.set_index('ID').Quantity) - df.Quantity
Some Data
import pandas as pd

df = pd.DataFrame({'Title': ['A', 'B', 'C', 'D', 'E'],
                   'Name': ['AA', 'BB', 'CC', 'DD', 'EE'],
                   'Quantity': [1, 21, 14, 15, 611],
                   'ID': ['A1', 'A1', 'B2', 'B2', 'C1']})
df2 = pd.DataFrame({'Quantity': [11, 51, 44],
                    'ID': ['A1', 'B2', 'C1']})
We will use df2 to create a dictionary that maps ID to Quantity. So anywhere there is ID == A1 in df it gets assigned the Quantity 11, B2 gets assigned 51 and C1 gets assigned 44. Here I'll add it as another column just for illustration purposes.
df['Quantity2'] = df['ID'].map(df2.set_index('ID').Quantity)
print(df)
ID Name Quantity Title Quantity2
0 A1 AA 1 A 11
1 A1 BB 21 B 11
2 B2 CC 14 C 51
3 B2 DD 15 D 51
4 C1 EE 611 E 44
Then you can just subtract df['Quantity'] from the column we just created to get the difference (or swap the order of the subtraction if you want the opposite sign).
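An equivalent merge-based sketch (assuming the same df and df2 as above), which is easier to extend if you ever need to match on more than one key column:
merged = df.merge(df2, on='ID', how='left', suffixes=('', '_df2'))
df['diff'] = merged['Quantity_df2'] - merged['Quantity']
print(df)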

Anti-Join Pandas

I have two tables and I would like to append them so that all the data in table A is retained, and data from table B is only added if its key is not already in table A (key values are unique within table A and within table B, but in some cases a key will occur in both tables).
I think the way to do this will involve some sort of filtering join (anti-join) to get the values in table B that do not occur in table A, and then append the two tables.
I am familiar with R and this is the code I would use to do this in R.
library("dplyr")
## Filtering join to remove values already in "TableA" from "TableB"
FilteredTableB <- anti_join(TableB,TableA, by = "Key")
## Append "FilteredTableB" to "TableA"
CombinedTable <- bind_rows(TableA,FilteredTableB)
How would I achieve this in python?
Passing indicator = True to the merge command will tell you which table each row came from by creating a new column _merge with three possible values:
left_only
right_only
both
Keep right_only and left_only. That is it.
outer_join = TableA.merge(TableB, how = 'outer', indicator = True)
anti_join = outer_join[~(outer_join._merge == 'both')].drop('_merge', axis = 1)
easy!
Here is a comparison with a solution from piRSquared:
1) When run on this example, matching on one column, piRSquared's solution is faster.
2) But it only works for matching on one column. If you want to match on several columns, my solution works just as well as it does with one.
So it's up to you to decide.
Consider the following dataframes
import numpy as np
import pandas as pd

TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableA
TableB
This is one way to do what you want
Method 1
# Identify what values are in TableB and not in TableA
key_diff = set(TableB.Key).difference(TableA.Key)
where_diff = TableB.Key.isin(key_diff)
# Slice TableB accordingly and append to TableA
TableA.append(TableB[where_diff], ignore_index=True)
Method 2
rows = []
for i, row in TableB.iterrows():
    if row.Key not in TableA.Key.values:
        rows.append(row)
pd.concat([TableA.T] + rows, axis=1).T
Timing
With 4 rows and 2 overlapping keys, Method 1 is much quicker.
With 10,000 rows and 5,000 overlapping keys, loops are bad.
I had the same problem. This answer using how='outer' and indicator=True of merge inspired me to come up with this solution:
import pandas as pd
import numpy as np
TableA = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('abcd'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
TableB = pd.DataFrame(np.random.rand(4, 3),
                      pd.Index(list('aecf'), name='Key'),
                      ['A', 'B', 'C']).reset_index()
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')
print('TableA', TableA, sep='\n')
print('TableB', TableB, sep='\n')
TableB_only = pd.merge(
    TableA, TableB,
    how='outer', on='Key', indicator=True, suffixes=('_foo', '')).query(
        '_merge == "right_only"')
print('TableB_only', TableB_only, sep='\n')
Table_concatenated = pd.concat((TableA, TableB_only), join='inner')
print('Table_concatenated', Table_concatenated, sep='\n')
Which prints this output:
TableA
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
TableB
Key A B C
0 a 0.754538 0.692902 0.537704
1 e 0.499092 0.864145 0.004559
2 c 0.082087 0.682573 0.421654
3 f 0.768914 0.281617 0.924693
TableB_only
Key A_foo B_foo C_foo A B C _merge
4 e NaN NaN NaN 0.499092 0.864145 0.004559 right_only
5 f NaN NaN NaN 0.768914 0.281617 0.924693 right_only
Table_concatenated
Key A B C
0 a 0.035548 0.344711 0.860918
1 b 0.640194 0.212250 0.277359
2 c 0.592234 0.113492 0.037444
3 d 0.112271 0.205245 0.227157
4 e 0.499092 0.864145 0.004559
5 f 0.768914 0.281617 0.924693
Easiest answer imaginable:
tableB = pd.concat([tableB, pd.Series(1, index=tableB.index)], axis=1)
mergedTable = tableA.merge(tableB, how="left", on="key")
answer = mergedTable[mergedTable.iloc[:, -1].isnull()][tableA.columns.tolist()]
Should be the fastest proposed as well.
One liner
TableA.append(TableB.loc[~TableB.Key.isin(TableA.Key)], ignore_index=True)
%%timeit gives about the same timing as the accepted answer.
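On pandas 2.x, where DataFrame.append has been removed, an equivalent sketch (assuming the same TableA and TableB) is:
pd.concat([TableA, TableB.loc[~TableB.Key.isin(TableA.Key)]], ignore_index=True)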
Suppose you have two tables, TableA and TableB, where the key column has unique values within each table, but some key values occur in both tables.
We then want to append to TableA the rows of TableB whose key does not match any key in TableA. The idea is to compare the two key columns as two series of variable length and keep a row of TableB only if its key value does not appear in TableA. The following code solves this exercise:
import pandas as pd

TableA = pd.DataFrame([[2, 3, 4], [5, 6, 7], [8, 9, 10]])
TableB = pd.DataFrame([[1, 3, 4], [5, 7, 8], [9, 10, 0]])

removeTheseIndexes = []
keyColumnA = TableA.iloc[:, 1]  # your 'Key' column here
keyColumnB = TableB.iloc[:, 1]  # same

for i in range(0, len(keyColumnA)):
    firstValue = keyColumnA[i]
    for j in range(0, len(keyColumnB)):
        copycat = keyColumnB[j]
        if firstValue == copycat:
            removeTheseIndexes.append(j)

TableB.drop(removeTheseIndexes, inplace=True)
TableA = TableA.append(TableB)
TableA = TableA.reset_index(drop=True)
Note this modifies TableB's data as well. Alternatively, you can use inplace=False, re-assign the result to a newTable, and then call TableA.append(newTable).
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
# Table B
0 1 2
0 1 3 4
1 5 7 8
2 9 10 0
# Using column 1 as the 'Key' column
# After running the script:
# Table A
0 1 2
0 2 3 4
1 5 6 7
2 8 9 10
3 5 7 8
4 9 10 0
# Table B
0 1 2
1 5 7 8
2 9 10 0
Based on one of the other suggestions, here's a function that should do it, using only pandas functions and no looping. You can use multiple columns as the key as well. If you change the line
output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]
to
output = merged.loc[~merged.dummy_col.isna(), tableA.columns.tolist()]
you have a semi-join.
def anti_join(tableA, tableB, on):
    # if joining on the index, make it into a column
    if tableB.index.name is not None:
        dummy = tableB.reset_index()[on]
    else:
        dummy = tableB[on]
    # create a dummy column of 1s
    if isinstance(dummy, pd.Series):
        dummy = dummy.to_frame()
    dummy.loc[:, 'dummy_col'] = 1
    # preserve the index of tableA if it has one
    if tableA.index.name is not None:
        idx_name = tableA.index.name
        tableA = tableA.reset_index(drop=False)
    else:
        idx_name = None
    # do a left-join
    merged = tableA.merge(dummy, on=on, how='left')
    # keep only the non-matches
    output = merged.loc[merged.dummy_col.isna(), tableA.columns.tolist()]
    # reset the index (if applicable)
    if idx_name is not None:
        output = output.set_index(idx_name)
    return output
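A quick usage sketch, reusing the TableA/TableB frames with a 'Key' column from earlier in this thread, to reproduce the asker's append behaviour:
filtered_b = anti_join(TableB, TableA, on='Key')
combined = pd.concat([TableA, filtered_b], ignore_index=True)
print(combined)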
