I have two DataFrames and would like to create a new column in DataFrame 1 based on DataFrame 2's values.
But I don't want to join the two DataFrames per se into one big DataFrame; rather, I want to use the second DataFrame simply as a lookup.
#Main Dataframe:
df1 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantity':[10, 6, 40]})
#Lookup Dataframe
df2 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean':[10, 20, 30]})
#Create column in Dataframe 1 based on lookup dataframe values:
df1['New_Column'] = when df1['Size'] = df2['Size'] and df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean'] then 'Below Average Sales' else 'Above Average Sales!' end
One approach is to use np.where:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
condition = (df1['Size'] == df2['Size']) & (df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean'])
df1['New_Column'] = np.where(condition, 'Below Average Sales', 'Above Average Sales!')
print(df1)
Output
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
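Note that the condition above compares df1 and df2 row by row, so it only works because both frames list the sizes in the same order. If the lookup rows can be in a different order, a merge keyed on Size avoids that assumption (a sketch, not part of the answer above; the shuffled df2 is illustrative):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
# Same lookup values as before, deliberately in a different row order
df2 = pd.DataFrame({'Size': ["Small", "Big", "Medium"],
                    'Sold_Quantiy_Score_Mean': [30, 10, 20]})

# A left merge keeps df1's row order and aligns the mean by Size, not by position
merged = df1.merge(df2, on='Size', how='left')
df1['New_Column'] = np.where(merged['Sold_Quantity'] < merged['Sold_Quantiy_Score_Mean'],
                             'Below Average Sales', 'Above Average Sales!')
print(df1)
```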
Given that df2 is essentially a lookup keyed on Size, it would make sense for its Size column to be its index:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
lookup = df2.set_index("Size")
You can then map the Sizes in df1 to their mean and compare each with the sold quantity:
is_below_mean = df1["Sold_Quantity"] < df1["Size"].map(lookup["Sold_Quantiy_Score_Mean"])
and finally map the boolean values to the respective strings using np.where:
df1["New_Column"] = np.where(is_below_mean, 'Below Average Sales', 'Above Average Sales!')
df1:
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
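One caveat with the map approach: a Size in df1 that is missing from the lookup maps to NaN, the comparison evaluates to False, and the row is silently labeled 'Above Average Sales!'. A sketch that flags such rows separately (the extra 'Tiny' row and the 'No Benchmark' label are illustrative, not from the question):

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small", "Tiny"],
                    'Sold_Quantity': [10, 6, 40, 3]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"],
                    'Sold_Quantiy_Score_Mean': [10, 20, 30]})

mean = df1['Size'].map(df2.set_index('Size')['Sold_Quantiy_Score_Mean'])
# Conditions are checked in order: missing lookups first, then below-mean rows
df1['New_Column'] = np.select(
    [mean.isna(), df1['Sold_Quantity'] < mean],
    ['No Benchmark', 'Below Average Sales'],
    default='Above Average Sales!',
)
print(df1)
```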
I have two dataframes.
DF1
DF2
I want to add a column to DF1, 'Speed', that references the track category and the LocationFrom and LocationTo range, to produce the result below.
I have looked at merge_asof and IntervalIndex, but I am unable to figure out how to reference the category before the range.
Thanks.
Check the below code (SQLite):
import pandas as pd
import sqlite3
conn = sqlite3.connect(':memory:')
DF1.to_sql('DF1', con=conn, index=False)
DF2.to_sql('DF2', con=conn, index=False)

pd.read_sql("""SELECT DF1.*, DF2.Speed
               FROM DF1
               JOIN DF2 ON DF1.Track = DF2.Track
                  AND DF1.Location BETWEEN DF2.LocationFrom AND DF2.LocationTo""", con=conn)
Output:
As hinted in your question, this is a perfect use case for merge_asof:
pd.merge_asof(df1, df2, by='Track',
              left_on='Location', right_on='LocationTo',
              direction='forward'
              )  # .drop(columns=['LocationFrom', 'LocationTo'])
output:
Track Location LocationFrom LocationTo Speed
0 A 1 0 5 45
1 A 2 0 5 45
2 A 6 5 10 50
3 B 24 20 50 100
NB. uncomment the drop to remove the extra columns.
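The question also mentions IntervalIndex. For completeness, a possible sketch of that route, building an IntervalIndex per Track and locating each Location in it (sample frames taken from the loop answer below; `lookup_speed` is a helper name I made up):

```python
import pandas as pd

df1 = pd.DataFrame({'Track': list('AAAB'), 'Location': [1, 2, 6, 24]})
df2 = pd.DataFrame({'Track': list('AABB'),
                    'LocationFrom': [0, 5, 0, 20],
                    'LocationTo': [5, 10, 20, 50],
                    'Speed': [45, 50, 80, 100]})

def lookup_speed(group):
    # Ranges for this Track only; group.name is the Track label
    sub = df2[df2['Track'] == group.name]
    intervals = pd.IntervalIndex.from_arrays(sub['LocationFrom'], sub['LocationTo'],
                                             closed='left')
    # Position of the interval containing each Location
    pos = intervals.get_indexer(group['Location'])
    return pd.Series(sub['Speed'].to_numpy()[pos], index=group.index)

df1['Speed'] = df1.groupby('Track', group_keys=False).apply(lookup_speed)
print(df1)
```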
It works, but I would like to see someone do this without a for loop and without creating mini dataframes.
import pandas as pd
data1 = {'Track': list('AAAB'), 'Location': [1, 2, 6, 24]}
df1 = pd.DataFrame(data1)
data2 = {'Track': list('AABB'), 'LocationFrom': [0, 5, 0, 20], 'LocationTo': [5, 10, 20, 50], 'Speed': [45, 50, 80, 100]}
df2 = pd.DataFrame(data2)
speeds = []
for k in range(len(df1)):
    track = df1['Track'].iloc[k]
    location = df1['Location'].iloc[k]
    df1_track = df1.loc[df1['Track'] == track]
    df2_track = df2.loc[df2['Track'] == track]
    speeds.append(df2_track['Speed'].loc[(df2_track['LocationFrom'] <= location) & (location < df2_track['LocationTo'])].iloc[0])
df1['Speed'] = speeds
print(df1)
Output:
Track Location Speed
0 A 1 45
1 A 2 45
2 A 6 50
3 B 24 100
This approach is probably not viable if your tables are large. It creates an intermediate table which has a merge of all pairs of matching Tracks between df1 and df2, then removes the rows where the location is not between the boundaries. Thanks @Aeronatix for the dfs.
The all_merge intermediate table gets really big really fast: if a1 rows of df1 are Track A, a2 rows of df2 are Track A, and so on, then the total number of rows in all_merge will be a1*a2 + b1*b2 + ... + z1*z2, which might or might not be gigantic depending on your dataset.
all_merge = df1.merge(df2)
results = all_merge[all_merge.Location.between(all_merge.LocationFrom,all_merge.LocationTo)]
print(results)
I currently have it working comparing two values in different columns of the same row, but I need it to compare one value to the previous row.
For example:
I need to compare the column 'Close' of the row at index 0 to the column 'Open' of the row at index 1.
df['HigherLower'] = 'No Change'
for index, row in df.iterrows():
    if row['Open'] < row['Close']:
        df['HigherLower'] = df['HigherLower'].replace(['No Change'], 'Lower')
    else:
        df['HigherLower'] = df['HigherLower'].replace(['No Change'], 'Higher')
Use np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Open': [10, 20, 30], 'Close': [5, 25, 15]})
df['HigherLower'] = np.where(df['Open'].shift() < df['Close'], 'Lower', 'Higher')
print(df)
# Output:
Open Close HigherLower
0 10 5 Higher
1 20 25 Lower
2 30 15 Higher
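Note that the code above compares the previous row's Open with the current row's Close. The question as worded compares the previous row's Close with the current row's Open; if that direction is wanted, shift Close instead (a sketch on the same sample data; labeling the first row 'No Change', since it has no predecessor, is my choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Open': [10, 20, 30], 'Close': [5, 25, 15]})
# Previous row's Close vs current row's Open
df['HigherLower'] = np.where(df['Close'].shift() < df['Open'], 'Higher', 'Lower')
df.loc[0, 'HigherLower'] = 'No Change'  # first row has no previous Close
print(df)
```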
I have two dataframes:
df1 = pd.DataFrame({'Code' : ['10', '100', '1010'],
'Value' : [25, 50, 75]})
df2 = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'Codes' : ['10', '100;1010', '100'],
'Value' : [25, 125, 50]})
Column "Codes" in df2 can contain multiple codes separated by ";". If this is the case, I need to sum up their values from df1.
I tried .map(), but it did not work for rows with multiple codes. Also, I ended up converting code '1010' to value '2525'.
How do I specify an exact match and the summation for ';'-separated values?
explode() the list of Codes
merge() with df1 and calculate the total, grouping on the index of df2
create a new column with this calculated total
df1 = pd.DataFrame({"Code": ["10", "100", "1010"], "Value": [25, 50, 75]})
df2 = pd.DataFrame(
    {"ID": ["A", "B", "C"], "Codes": ["10", "100;1010", "100"], "Value": [25, 125, 50]}
)

df2.join(
    df2["Codes"]
    .str.split(";")
    .explode()
    .reset_index()
    .merge(df1, left_on="Codes", right_on="Code")
    .groupby("index")
    .agg({"Value": "sum"}),
    rsuffix="_calc",
)
  ID     Codes  Value  Value_calc
0  A        10     25          25
1  B  100;1010    125         125
2  C       100     50          50
If the code in df2 is in df1, then add the values:
def add_values(df1, df2):
    df1['sum'] = df1['Value'] + df2['Value']
    print(df1)

add_values(df1.loc[df2['Codes'].isin(df1['Code'])], df2)
We can make a lookup table of Code to Value mapping from df1, then use .map() on df2 to map the expanded list of Codes to the mapping. Finally, sum up the mapped values for the same ID to arrive at the desired value, as follows:
1. Make a lookup table of Code to Value mapping from df1:
mapping = df1.set_index('Code')['Value']
2. Use .map() on df2 to map the expanded list of Codes to the mapping. Sum up the mapped values for the same ID to arrive at the desired value:
df2a = df2.set_index('ID') # set `ID` as index
df2a['value_map'] = (
    df2a['Codes'].str.split(';')  # split by semicolon
    .explode()                    # expand split values into rows
    .map(mapping)                 # map Code from mapping
    .groupby('ID').sum()          # group sum by ID
)
df2 = df2a.reset_index() # reset `ID` from index back to data column
Result:
print(df2)
ID Codes Value value_map
0 A 10 25 25
1 B 100;1010 125 125
2 C 100 50 50
I would like to know how I can update two DataFrames, df1 and df2, from another DataFrame df3. All of this is done within a for loop that iterates over all the rows of df3:
for i in range(len(df3)):
    df1.p_mw = ...
    df2.p_mw = ...
The initial DataFrames df1 and df2 are as follows:
df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
['GH_2', 20, 'Hidro'],
['GH_3', 30, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
['GT_2', 50, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
The DataFrame from which I want to update the data is:
df3 = pd.DataFrame([[150,57,110,20,10],
[120,66,110,20,0],
[90,40,105,20,0],
[60,40,90,20,0]],
columns= ['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
As you can see the DataFrame df3 contains data from the corresponding column p_mw for both DataFrames df1 and df2. Furthermore, the DataFrame df2 has an element named GF_1 for which there is no update and should remain the same.
After updating for the last iteration, the desired output is the following:
df1 = pd.DataFrame([['GH_1', 60, 'Hidro'],
['GH_2', 40, 'Hidro'],
['GH_3', 90, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 20, 'Termo'],
['GT_2', 0, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
Create a mapping series by selecting the last row from df3, then map it onto the name column and fill the NaN values using the values from the p_mw column:
s = df3.iloc[-1]
df1['p_mw'] = df1['name'].map(s).fillna(df1['p_mw'])
df2['p_mw'] = df2['name'].map(s).fillna(df2['p_mw'])
If there are multiple DataFrames that need to be updated, we can use a for loop to avoid repeating the code:
for df in (df1, df2):
    df['p_mw'] = df['name'].map(s).fillna(df['p_mw'])
>>> df1
name p_mw type
0 GH_1 60 Hidro
1 GH_2 40 Hidro
2 GH_3 90 Hidro
>>> df2
name p_mw type
0 GT_1 20.0 Termo
1 GT_2 0.0 Termo
2 GF_1 10.0 Fict
This should do as you ask. No need for a for loop.
import pandas as pd
import numpy as np

df1 = pd.DataFrame([['GH_1', 10, 'Hidro'],
['GH_2', 20, 'Hidro'],
['GH_3', 30, 'Hidro']],
columns= ['name','p_mw','type'])
df2 = pd.DataFrame([['GT_1', 40, 'Termo'],
['GT_2', 50, 'Termo'],
['GF_1', 10, 'Fict']],
columns= ['name','p_mw','type'])
df3 = pd.DataFrame([[150,57,110,20,10],
[120,66,110,20,0],
[90,40,105,20,0],
[60,40,90,20,0]],
columns= ['GH_1', 'GH_2', 'GH_3', 'GT_1', 'GT_2'])
updates = df3.iloc[-1].values
df1["p_mw"] = updates[:3]
df2["p_mw"] = np.append(updates[3:], df2["p_mw"].iloc[-1])
I have two DataFrames, df1 and df2. I want to divide all of the Sale column values in df1 by the Expectation value in df2 and store the result in a Percentage column in df1.
df1 = pd.DataFrame({'Period' : ['Jan', 'Feb', 'Mar'],
'Sale': [10 , 20, 30],
})
df2 = pd.DataFrame({'Loc': ['UAE'],
'Expectation': [98],
})
Please refer to the attached DataFrame screenshots.
To apply an operation along an axis of the DataFrame you can always use apply. For example:
df1['Percentage'] = df1['Sale'].apply(lambda x: x / df2['Expectation'].iloc[0])
or, if instead of a simple division you want to compute a percentage:
df1['Percentage'] = df1['Sale'].apply(lambda x: x * df2['Expectation'].iloc[0] / 100)
Details are in the documentation.
You can use pandas.apply method:
import pandas as pd
df1 = pd.DataFrame({"Period": ["Jan", "Feb", "Mar"], "Sale": [10, 20, 30]})
df2 = pd.DataFrame({"Loc": ["UAE"], "Expectation": [98]})
df1['Percentage'] = df1['Sale'].apply(lambda x: x * df2['Expectation'].iloc[0] / 100)
print(f"df1 = {df1}")
output:
df1 = Period Sale Percentage
0 Jan 10 9.8
1 Feb 20 19.6
2 Mar 30 29.4
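Since df2 holds a single Expectation value here, apply is not strictly necessary: pulling the scalar out and relying on vectorized arithmetic gives the same result (a sketch, using the same percentage formula as the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({"Period": ["Jan", "Feb", "Mar"], "Sale": [10, 20, 30]})
df2 = pd.DataFrame({"Loc": ["UAE"], "Expectation": [98]})

expectation = df2["Expectation"].iloc[0]  # scalar, broadcast over the whole column
df1["Percentage"] = df1["Sale"] * expectation / 100
print(df1)
```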