I want to join two dataframes. I have already tried concat, merge and join, but I must be doing something wrong.
df 1:
index cnpj country state
1 7468 34 23
4 3421 23 12
7 2314 12 45
df 2:
index cnpj street number
2 7468 32 34
5 3421 18 89
546 2314 92 73
I want them to be merged using 'cnpj' as the join key, preserving the index of df1. It should look like this:
df 1:
index cnpj country state street number
1 7468 34 23 32 34
4 3421 23 12 18 89
7 2314 12 45 92 73
Any suggestions on how to do that?
Let's use merge with suffixes and drop:
df1.merge(df2, on='cnpj', suffixes=('', '_y')).drop('index_y', axis=1)
Output:
index cnpj country state street number
0 1 7468 34 23 32 34
1 4 3421 23 12 18 89
2 7 2314 12 45 92 73
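An equivalent sketch, assuming the sample data above: drop df2's own index column before merging, so no suffix/drop step is needed.
import pandas as pd

df1 = pd.DataFrame({'index': [1, 4, 7], 'cnpj': [7468, 3421, 2314],
                    'country': [34, 23, 12], 'state': [23, 12, 45]})
df2 = pd.DataFrame({'index': [2, 5, 546], 'cnpj': [7468, 3421, 2314],
                    'street': [32, 18, 92], 'number': [34, 89, 73]})

# drop df2's own index column so only the new columns are brought in
out = df1.merge(df2.drop(columns='index'), on='cnpj')
print(out)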
I have two DataFrames shown below. The DataFrames in reality are larger than the sample below.
df1
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 max min location
0 0010 20 22 21 23 26 26 20 NY
1 0011 30 25 23 31 33 33 23 CA
2 0012 67 68 68 69 65 69 67 GA
3 0013 34 33 31 30 35 35 31 MO
4 0014 44 42 40 39 50 50 39 WA
df2
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
4 0025 41 42 40 39 50 WA
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
7 0034 40 41 39 39 50 WA
The idea is to compare each row of df2 against the appropriate max and min values specified in df1. The threshold values to compare against depend on the match in the location column. If any of the values in a row fall outside the range defined by the min and max values, that row will be put in a separate dataframe. Please note that the number of cost segments can vary.
Solution
# Merge the dataframes on location to append the min/max columns to df2
df3 = df2.merge(df1[['location', 'max', 'min']], on='location', how='left')
# select the cost-like columns
cost = df3.filter(like='cost')
# Check whether the cost values satisfy the interval condition
mask = cost.ge(df3['min'], axis=0) & cost.le(df3['max'], axis=0)
# filter the rows where one or more values in the row do not satisfy the condition
df4 = df2[~mask.all(axis=1)]
Result
print(df4)
route_no cost_h1 cost_h2 cost_h3 cost_h4 cost_h5 location
0 0020 19 27 21 24 20 NY
1 0021 31 22 23 30 33 CA
2 0023 66 67 68 70 65 GA
3 0022 34 33 31 30 35 MO
5 0030 19 26 20 24 20 NY
6 0032 37 31 31 20 35 MO
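If you also want the rows that do stay within range, the same mask can be reused; a minimal follow-up sketch (df5 is just an illustrative name):
# rows of df2 where every cost value lies inside [min, max]
df5 = df2[mask.all(axis=1)]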
I have a DataFrame with 22 rows and 78 columns. An internet-friendly version of the file can be found here. This is a sample:
item_no code group gross_weight net_weight value ... ... +70 columns more
1 7417.85.24.25 0 18 17 13018.74
2 1414.19.00.62 1 35 33 0.11
3 7815.80.99.96 0 49 48 1.86
4 1414.19.00.62 1 30 27 2.7
5 5867.21.36.92 1 31 24 94
6 9227.71.84.12 1 24 17 56.4
7 1414.19.00.62 0 42 35 0.56
8 4465.58.84.31 0 50 42 0.94
9 1596.09.32.64 1 20 13 0.75
10 2194.64.27.41 1 38 33 1.13
11 1596.09.32.64 1 53 46 1.9
12 1596.09.32.64 1 18 15 10.44
13 1596.09.32.64 1 35 33 15.36
14 4835.09.81.44 1 55 47 10.44
15 5698.44.72.13 1 51 49 15.36
16 5698.44.72.13 1 49 45 2.15
17 5698.44.72.13 0 41 33 16
18 3815.79.80.69 1 25 21 4
19 3815.79.80.69 1 35 30 2.4
20 4853.40.53.94 1 53 46 3.12
21 4853.40.53.94 1 50 47 3.98
22 4853.40.53.94 1 16 13 6.53
The group column tells me that I should group all equal values in the code column and sum the values in the columns 'gross_weight', 'net_weight', 'value', and 'item_quantity'. Additionally, I have to modify two more columns as shown below:
#Group DF
grouped_df = df.groupby(['group', 'code'], as_index=False).agg({'item_quantity':'sum', 'gross_weight':'sum','net_weight':'sum', 'value':'sum'}).copy()
#Total items should be equal to the length of the DF
grouped_df['total_items'] = len(grouped_df)
#Item No.
grouped_df['item_no'] = [x+1 for x in range(len(grouped_df))]
This is the result:
group code item_quantity gross_weight net_weight value total_items item_no
0 0 1414.19.00.62 75.0 42 35 0.56 14 1
1 0 4465.58.84.31 125.0 50 42 0.94 14 2
2 0 5698.44.72.13 200.0 41 33 16.0 14 3
3 0 7417.85.24.25 1940.2 18 17 13018.74 14 4
4 0 7815.80.99.96 200.0 49 48 1.86 14 5
5 1 1414.19.00.62 275.0 65 60 2.81 14 6
6 1 1596.09.32.64 515.0 126 107 28.45 14 7
7 1 2194.64.27.41 151.0 38 33 1.13 14 8
8 1 3815.79.80.69 400.0 60 51 6.4 14 9
9 1 4835.09.81.44 87.0 55 47 10.44 14 10
10 1 4853.40.53.94 406.0 119 106 13.63 14 11
11 1 5698.44.72.13 328.0 100 94 17.51 14 12
12 1 5867.21.36.92 1000.0 31 24 94.0 14 13
13 1 9227.71.84.12 600.0 24 17 56.4 14 14
All of the columns in the grouped DF exist in the original DF but some have different values.
How can I merge, update, join, concat, or filter the original DF correctly so that I can have the complete 78 columns?
The objective DataFrame is the grouped DF.
The columns in the original DF that already exist in the Grouped DF should be omitted.
I should be able to take the first value of the columns in the original DF that aren't in the Grouped DF.
The column code does not have unique values.
The column part_number in the complete file does not have unique values.
I tried:
pd.merge(how='left') after creating a unique ID; it duplicates existing columns instead of updating values or overwriting.
join, concat, update: does not yield the expected results.
.agg(lambda x: x.iloc[0]) adds all the columns, but I don't know how to combine it with the current .agg({'item_quantity':'sum', 'gross_weight':'sum', 'net_weight':'sum', 'value':'sum'})
I know that .agg({'column_name': 'first'}) returns the first value, but I don't know how to make it work for over 70 columns automatically.
You can achieve this by dynamically creating the aggregation dictionary with a dict comprehension, like this:
df.groupby(['group', 'code'], as_index=False).agg({col: 'sum' for col in df.columns[3:]})
If item_no is your index, then change df.columns[3:] to df.columns[2:]
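To also keep the first value of every column that is not summed, as asked above, one possible sketch builds the dictionary in two parts; the list of summed columns below is taken from the question's own .agg call:
sum_cols = ['item_quantity', 'gross_weight', 'net_weight', 'value']
# 'sum' for the four accumulated columns, 'first' for every other non-key column
agg_dict = {col: ('sum' if col in sum_cols else 'first')
            for col in df.columns if col not in ('group', 'code')}
grouped_df = df.groupby(['group', 'code'], as_index=False).agg(agg_dict)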
Hi, I'm trying to look up a value from selected columns using a value from my DataFrame. My lookup value needs to identify which column name it matches out of the selected columns; for example, below I only want to consider columns ending in JT in my vlookup.
Example of dataframe:
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT
89 67 25 67 90 Plan1
9 45 7 6 5 Plan3
45 3 2 6 23 Plan1
Outcome:
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT Plan_JT
89 67 25 67 90 Plan1 89
9 45 7 6 5 Plan3 7
45 3 2 6 23 Plan1 45
Example code (my attempt, which does not work):
df2['Plan_JT'].astype(str)=df2.loc[:,('Plan1_JT','Plan2_JT','Plan3_JT')].str.contains.iloc[1:5]
Solution for older pandas versions with DataFrame.lookup:
df['new'] = df.lookup(df.index, df['JT'] + '_JT')
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
And for recent versions, with DataFrame.melt:
melt = df.melt('JT', ignore_index=False)
df['new'] = melt.loc[melt['JT'] + '_JT' == melt['variable'], 'value']
print (df)
Plan1_JT Plan2_JT Plan3_JT Plan1_T Plan2_T JT new
0 89 67 25 67 90 Plan1 89
1 9 45 7 6 5 Plan3 7
2 45 3 2 6 23 Plan1 45
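DataFrame.lookup was deprecated and later removed, so on current pandas a common numpy-based replacement is the following sketch, assuming every JT value has a matching *_JT column:
import numpy as np

rows = np.arange(len(df))
codes, cols = pd.factorize(df['JT'] + '_JT')
# pick, for each row, the value from the column named by JT + '_JT'
df['new'] = df.reindex(cols, axis=1).to_numpy()[rows, codes]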
I have a pandas dataframe, df1.
I want to overwrite its values with values in df2, where the index and column name match.
I've found a few answers on this site, but nothing that quite does what I want.
df1
A B C
0 33 44 54
1 11 32 54
2 43 55 12
3 43 23 34
df2
A
0 5555
output
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
You can use combine_first, converting to integer if necessary:
df = df2.combine_first(df1).astype(int)
print (df)
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
If you need to check the intersection of index and columns between both DataFrames:
df2 = pd.DataFrame({'A': [5555, 2222],
                    'D': [3333, 4444]}, index=[0, 10])
idx = df2.index.intersection(df1.index)
cols = df2.columns.intersection(df1.columns)
df = df2.loc[idx, cols].combine_first(df1).astype(int)
print (df)
A B C
0 5555 44 54
1 11 32 54
2 43 55 12
3 43 23 34
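If overwriting df1 in place is acceptable, DataFrame.update does the same index/column alignment; a short sketch (depending on the pandas version, integer columns may be upcast to float):
# overwrite matching cells of df1 with non-NA values from df2, in place
df1.update(df2)
print(df1)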
I actually have 2 dataframes; one is like:
seq1_id seq2_id dN dS Dist1 Dist_brute kingdom
seq1 seq2 45 56 23 455 eucaryota
seq6 seq9 34 43 34 453 procaryota
seq3 seq98 32 34 21 90 Virus
seq21 seq87 32 12 35 211 Virus
and the other like:
seq1_id seq2_id dN dS Dist1 Dist_brute
seq1 seq2 45 56 23 455
seq4 seq12 78 45 32 789
seq3 seq98 32 34 21 90
seq21 seq87 32 12 35 211
seq45 seq90 21 23 12 123
seq6 seq9 34 43 34 453
and what I would like to do is to get a new dataframe such:
seq1_id seq2_id dN dS Dist1 Dist_brute kingdom
seq1 seq2 45 56 23 455 eucaryota
seq4 seq12 78 45 32 789 NaN
seq3 seq98 32 34 21 90 Virus
seq21 seq87 32 12 35 211 Virus
seq45 seq90 21 23 12 123 NaN
seq6 seq9 34 43 34 453 procaryota
Does someone have an idea?
Thanks :)
Omitting the on parameter makes merge join on all common columns; use a left join:
df = df2.merge(df1, how='left')
If you need to define the columns for the merge explicitly:
df = df2.merge(df1, on=['seq1_id','seq2_id','dN','dS','Dist1','Dist_brute'], how='left')
print (df)
seq1_id seq2_id dN dS Dist1 Dist_brute kingdom
0 seq1 seq2 45 56 23 455 eucaryota
1 seq4 seq12 78 45 32 789 NaN
2 seq3 seq98 32 34 21 90 Virus
3 seq21 seq87 32 12 35 211 Virus
4 seq45 seq90 21 23 12 123 NaN
5 seq6 seq9 34 43 34 453 procaryota
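If you want to check which rows of df2 actually found a match in df1, merge's indicator flag can be added; a small follow-up sketch:
# adds a _merge column: 'both' for matched rows, 'left_only' for unmatched ones
df = df2.merge(df1, how='left', indicator=True)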