Merge two dataframes on nearest value while duplicating rows - python

I have two dataframes,
DF1 = NUM1 Car COLOR
100 Honda blue
100 Honda yellow
200 Volvo red
DF2 = NUM2 Car STATE
110 Honda good
110 Honda bad
230 Volvo not bad
230 Volvo excellent
I want to merge them on nearest value in columns NUM1 & NUM2 in order to get this desired dataframe:
DF3 = NUM CAR COLOR STATE
100 HONDA blue good
100 HONDA blue bad
100 HONDA yellow good
100 HONDA yellow bad
200 VOLVO red not bad
200 VOLVO red excellent
I've tried this:
df3 = pd.merge_asof(df1, df2, left_on="NUM1", right_on="NUM2")
But this is the result I get:
DF3 = NUM CAR COLOR STATE
100 HONDA blue good
100 HONDA yellow good
200 VOLVO red not bad

IIUC, you might need to combine merge_asof and merge:
key = pd.merge_asof(DF1.reset_index().sort_values(by='NUM1'),
                    DF2['NUM2'],
                    left_on='NUM1', right_on='NUM2',
                    direction='nearest')['NUM2']

DF1.merge(DF2.drop(columns=DF1.columns.intersection(DF2.columns)),
          left_on=key, right_on='NUM2')
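merge_asof keeps only a single (nearest) match per left row, which is why the duplicated STATE rows are lost in the attempt above; here the asof step is only used to compute the matching key, and the row duplication comes from the ordinary merge. A minimal runnable sketch of this idea, with the sample frames built inline and DF2[['NUM2']] passed so the right-hand side is a DataFrame (the out name is mine):

import pandas as pd

DF1 = pd.DataFrame({'NUM1': [100, 100, 200],
                    'Car': ['Honda', 'Honda', 'Volvo'],
                    'COLOR': ['blue', 'yellow', 'red']})
DF2 = pd.DataFrame({'NUM2': [110, 110, 230, 230],
                    'Car': ['Honda', 'Honda', 'Volvo', 'Volvo'],
                    'STATE': ['good', 'bad', 'not bad', 'excellent']})

# Nearest NUM2 for each DF1 row (both frames must be sorted on their keys).
key = pd.merge_asof(DF1.reset_index().sort_values(by='NUM1'), DF2[['NUM2']],
                    left_on='NUM1', right_on='NUM2',
                    direction='nearest')['NUM2']

# An ordinary merge on that key duplicates each DF1 row for every matching DF2 row.
out = DF1.merge(DF2.drop(columns=DF1.columns.intersection(DF2.columns)),
                left_on=key, right_on='NUM2')
print(out)

This yields the six desired rows, plus a NUM2 column that can be dropped if it is not wanted.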

Related

Matching column values with elements in a list of lists of lists and adding corresponding values from another list

assets = [[['Ferrari', 'BMW', 'Suzuki'], ['Ducati', 'Honda']], [['Apple', 'Samsung', 'Oppo']]]
price = [[[853600, 462300, 118900], [96500, 16700]], [[1260, 750, 340]]]
I have a dataframe as follows:
Car      Bike    Phone
BMW      Ducati  Apple
Ferrari  Honda   Oppo
Looking for code to get the Total_Cost, i.e. 462300 + 96500 + 1260 = 560060:
Car      Bike    Phone  Total Cost
BMW      Ducati  Apple  560060
Ferrari  Honda   Oppo   870640
I tried a for loop and it worked; I am looking for a more advanced approach, if there is one.
Here is a possible solution:
df = pd.DataFrame({'Car': ['BMW', 'Ferrari'], 'Bike': ['Ducati', 'Honda'], 'Phone': ['Apple', 'Oppo']})
asset_price = {asset: price[a][b][c]
               for a, asset_list in enumerate(assets)
               for b, asset_sub_list in enumerate(asset_list)
               for c, asset in enumerate(asset_sub_list)}
df['Total_Cost'] = df.apply(lambda row: sum([asset_price[asset] for asset in row]), axis=1)
print(df)
Car Bike Phone Total_Cost
0 BMW Ducati Apple 560060
1 Ferrari Honda Oppo 870640
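A possible variant of the same dictionary lookup without apply (a sketch that assumes, as above, that every cell of df has an entry in asset_price) replaces each name by its price and sums across the row:

# Replace each asset name with its price from asset_price, then sum per row.
df['Total_Cost'] = df[['Car', 'Bike', 'Phone']].replace(asset_price).sum(axis=1)
print(df)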
You can also use a numpy approach (import numpy as np), depending on your use case, but I would suggest the first approach, which is simpler and easier to understand.
import numpy as np

df = pd.DataFrame({'Car': ['BMW', 'Ferrari'], 'Bike': ['Ducati', 'Honda'], 'Phone': ['Apple', 'Oppo']})
flat_assets = np.concatenate([np.concatenate(row) for row in assets])
flat_price = np.concatenate([np.concatenate(row) for row in price])
asset_dict = dict(zip(flat_assets, flat_price))
asset_prices = np.array([asset_dict[name] for name in df.values.flatten()
                         if name in asset_dict])
df['Total Cost'] = np.sum(asset_prices.reshape(-1, 3), axis=1)
print(df)
Car Bike Phone Total Cost
0 BMW Ducati Apple 560060
1 Ferrari Honda Oppo 870640
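Note that this numpy version relies on every cell of df having an entry in asset_dict and on df having exactly three columns, since the filtered prices are reshaped with reshape(-1, 3); a name missing from asset_dict would silently shift the rows.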
An alternative approach:
First build a dataframe df_price which maps prices onto the assets and the classification (Car, Bike, and Phone):
df_price = (
    pd.DataFrame({"assets": assets, "price": price}).explode(["assets", "price"])
    .assign(cols=["Car", "Bike", "Phone"]).explode(["assets", "price"])
)
Result:
assets price cols
0 Ferrari 853600 Car
0 BMW 462300 Car
0 Suzuki 118900 Car
0 Ducati 96500 Bike
0 Honda 16700 Bike
1 Apple 1260 Phone
1 Samsung 750 Phone
1 Oppo 340 Phone
(I have inserted the classification here because of this comment on the other answer: "... But if the nested lists of assets have a common name (say, Honda in place of Suzuki), then the Honda car and the Honda bike will take one price.")
Then join the prices onto the .melted main dataframe df, .pivot (using the auxiliary column idx), sum up the prices across the rows, and bring the result into shape.
res = (
    df.melt(var_name="cols", value_name="assets", ignore_index=False)
    .merge(df_price, on=["cols", "assets"])
    .assign(idx=lambda df: df.groupby("cols").cumcount())
    .pivot(index="idx", columns="cols")
    .assign(total=lambda df: df.loc[:, "price"].sum(axis=1))
    .loc[:, ["assets", "total"]]
    .droplevel(0, axis=1).rename(columns={"": "Total_Costs"})
)
Result:
cols Bike Car Phone Total_Costs
idx
0 Ducati BMW Apple 560060.0
1 Honda Ferrari Oppo 870640.0
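Note that exploding several columns at once, as in .explode(["assets", "price"]), requires pandas 1.3 or newer.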

check if a name in one dataframe exists in another dataframe python

I am a beginner in Python and trying to find a solution for the following problem.
I have a csv file:
name, mark
Anna,24
John,19
Mike,22
Monica,20
Alex, 17
Daniel, 26
And an xls file:
name, group
John, red
Anna, blue
Monica, blue
Mike, yellow
Alex, red
I am trying to get the result:
group, mark
Red, 26
Blue, 44
Yellow, 22
The number in the result shows the total mark for the whole group.
I tried to find similar problems but was not successful, and I do not have enough experience to work out what exactly I need to do and which commands to use.
Use pd.read_csv with df.merge and GroupBy.sum:
In [89]: df1 = pd.read_csv('file1.csv')
In [89]: df1
Out[89]:
name mark
0 Anna 24
1 John 19
2 Mike 22
3 Monica 20
4 Alex 17
5 Daniel 26
In [90]: df2 = pd.read_csv('file2.csv')
In [90]: df2
Out[90]:
name group
0 John red
1 Anna blue
2 Monica blue
3 Mike yellow
4 Alex red
In [94]: df = df1.merge(df2).groupby('group').sum().reset_index()
In [95]: df
Out[95]:
group mark
0 blue 44
1 red 36
2 yellow 22
EDIT: If you have other columns that you don't want to sum, do this:
In [284]: df1.merge(df2).groupby('group').agg({'mark': 'sum'}).reset_index()
Out[284]:
group mark
0 blue 44
1 red 36
2 yellow 22
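Since the second file in the question is an Excel workbook rather than a CSV, a possible variant (assuming it is saved as file2.xlsx with the name/group columns on the first sheet, and that openpyxl is installed) reads it with pd.read_excel and sums only the mark column:

import pandas as pd

df1 = pd.read_csv('file1.csv', skipinitialspace=True)  # drop the stray spaces after the commas
df2 = pd.read_excel('file2.xlsx')

result = (df1.merge(df2, on='name')
             .groupby('group', as_index=False)['mark']
             .sum())
print(result)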

Splitting a dataframe column on a pattern of characters and numerals

I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
I can't split on the first instance of ", " because that would make:
A B
1 king crab, 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I can't split after the last instance of ", " because that would make:
A B
1 king crab 2008
2 green 2010
3 blue
4 green no. 4
5 green house
I also can't separate it by numbers because that would make:
A B
1 king, crab 2008
2 green 2010
3 blue
4 green no. 4
5 green, house
Is there some way to split by ", " followed by a 4-digit number that is between two values? The two-values condition would be extra safety to filter out accidental 4-digit numbers that are clearly not years. For example:
Split by:
", " + (four digit number between 1000 - 2021)
Also appreciated are answers that split by:
", " + four digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.
Or you can just use Series.str.extract and Series.str.replace:
df = pd.DataFrame({"A":["king, crab, 2008","green, 2010","blue","green no. 4","green, house"]})
df["year"] = df["A"].str.extract("(\d{4})")
df["A"] = df["A"].str.replace(",\s\d{4}","")
print (df)
A year
0 king, crab 2008
1 green 2010
2 blue NaN
3 green no. 4 NaN
4 green, house NaN
import pandas as pd

list_dict_Input = [{'A': 'king, crab, 2008'},
                   {'A': 'green, 2010'},
                   {'A': 'green no. 4'},
                   {'A': 'green no. 4'}]
df = pd.DataFrame(list_dict_Input)

for row_Index in range(len(df)):
    text = df.iloc[row_Index]['A'].strip()
    last_4_Char = text[-4:]
    # Only treat the trailing characters as a year if they are digits in the 1000-2021 range.
    if last_4_Char.isdigit() and int(last_4_Char) >= 1000 and int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
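For completeness, a sketch that encodes all three constraints from the question, i.e. the ", " separator, a 4-digit number anchored to the end of the string, and the 1000-2021 range check (the column name B and the handling of out-of-range numbers are my assumptions):

import pandas as pd

df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue",
                         "green no. 4", "green, house"]})

# Capture a trailing ", <4 digits>"; the $ anchors the number to the end of the string.
year = df["A"].str.extract(r",\s(\d{4})$")[0].astype(float)

# Keep only plausible years, and strip them from column A for just those rows.
mask = year.between(1000, 2021)
df.loc[mask, "B"] = year[mask].astype(int)
df.loc[mask, "A"] = df.loc[mask, "A"].str.replace(r",\s\d{4}$", "", regex=True)
print(df)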

For loop and if for multiple conditions and change another column in the same row in pandas

I am trying to change a column when certain strings appear in the other column of the same row. I am new to Pandas.
I need to change the price of some oranges to 200, but not the price of 'Red Orange'. I cannot change the names in "fruits". Each entry is a much longer string; I have shortened it here for convenience.
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 16
Red Orange from Costa 15
Pink Orange from Brazil 19
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
so that the final result would be
fruits price
Green apple from us 10
Orange Apple from US 11
Mango from Canada 15
Blue Orange from Mexico 200
Red Orange from Costa 15
Pink Orange from Brazil 200
Yellow Pear from Guatemala 32
Black Melon from Guatemala 4
Purple orange from Honduras 5
I tried
df.loc[df['fruits'].str.lower().str.contains('orange'), 'price'] = 200
But this changes the price of 4 items instead of only the 2 items I want.
I also used a for loop once, and that changed the price of the entire column.
You can use regex:
import re
df.loc[df['fruits'].str.lower().str.contains(r'(?<!red) orange', regex = True), 'price'] = 200
(?<!red) is a negative lookbehind, so orange will not match when the word right before it is red. The mandatory space before orange also ensures it is not the first word of the string, so you don't have to worry about it being the color describing something else (as in Orange Apple).
df.loc[((df['fruits'].str.contains('orange')) & (~df['fruits'].str.contains('Red'))),'price'] = 200
We check for 'orange' and use ~ to confirm 'Red' is not present in the string. If both conditions are true, the price changes to 200.
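For reference, here is a self-contained sketch on the sample data using a case-sensitive variant of the lookbehind idea; it assumes, reading the desired output, that only the capital-O "Orange" rows other than "Red Orange" should be repriced (which is why "Purple orange" keeps its price here):

import pandas as pd

df = pd.DataFrame({
    'fruits': ['Green apple from us', 'Orange Apple from US', 'Mango from Canada',
               'Blue Orange from Mexico', 'Red Orange from Costa',
               'Pink Orange from Brazil', 'Yellow Pear from Guatemala',
               'Black Melon from Guatemala', 'Purple orange from Honduras'],
    'price': [10, 11, 15, 16, 15, 19, 32, 4, 5],
})

# " Orange" not preceded by "Red": skips 'Red Orange', the leading 'Orange Apple',
# and the lowercase 'Purple orange'.
mask = df['fruits'].str.contains(r'(?<!Red) Orange', regex=True)
df.loc[mask, 'price'] = 200
print(df)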

pandas dataframe columns to subcolumns

I would like to convert columns to subcolumns.
Supposing that the data looks like this:
Q1 Q2:Q21 Q2:Q22 Q2:Q23 Q3:Q31 Q3:Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
After reshaping the columns, I would like to have:
Q1 Q2 Q3
Q1 Q21 Q22 Q23 Q31 Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
Here is what I tried:
import pandas as pd

survey = pd.read_csv('survey.csv')

# first column names
survey_cols = [col.split(':')[0] for col in survey.columns]

# unique column names
survey_ucols = []
for e in survey_cols:
    if e not in survey_ucols:
        survey_ucols.append(e)

# second column names, subcolumns
survey_subcols = []
for col in survey_ucols:
    survey_subcols.append([subcol.split(':')[-1] for subcol in survey.columns if col in subcol])

# create new df
tuples = list(zip(survey_ucols, survey_subcols))
cols = pd.MultiIndex.from_tuples(tuples, names=['mainQ', 'subQ'])
survey_new = pd.DataFrame(survey, columns=cols)
Thanks in advance
You can create a helper DataFrame with Index.to_series and Series.str.split, forward fill the missing values per row with ffill, and finally assign the result back as a MultiIndex via MultiIndex.from_arrays:
df = survey.columns.to_series().str.split(':', expand=True).ffill(axis=1)
survey.columns = pd.MultiIndex.from_arrays([df[0].tolist(), df[1].tolist()])
#simplified
#survey.columns = [df[0].tolist(), df[1].tolist()]
print (survey)
Q1 Q2 Q3
Q1 Q21 Q22 Q23 Q31 Q32
0 yes green blue green bus car
1 no red orange blue car bike
2 yes green yellow black car walk
3 yes yellow green brown bus walk
4 no green green red car bus
Detail:
print (df)
0 1
Q1 Q1 Q1
Q2:Q21 Q2 Q21
Q2:Q22 Q2 Q22
Q2:Q23 Q2 Q23
Q3:Q31 Q3 Q31
Q3:Q32 Q3 Q32
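For a quick check without the CSV file, a self-contained sketch (building the sample survey frame inline from the data shown in the question) would be:

import pandas as pd

survey = pd.DataFrame({'Q1': ['yes', 'no', 'yes', 'yes', 'no'],
                       'Q2:Q21': ['green', 'red', 'green', 'yellow', 'green'],
                       'Q2:Q22': ['blue', 'orange', 'yellow', 'green', 'green'],
                       'Q2:Q23': ['green', 'blue', 'black', 'brown', 'red'],
                       'Q3:Q31': ['bus', 'car', 'car', 'bus', 'car'],
                       'Q3:Q32': ['car', 'bike', 'walk', 'walk', 'bus']})

# Split each label on ':' and forward fill, so 'Q1' becomes ('Q1', 'Q1').
parts = survey.columns.to_series().str.split(':', expand=True).ffill(axis=1)
survey.columns = pd.MultiIndex.from_arrays([parts[0].tolist(), parts[1].tolist()])
print(survey)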
