I have two dataframes, df1 and df2. I want to divide every value in df1's Sale column by df2's Expectation value and store the result in a new Percentage column in df1.
df1 = pd.DataFrame({'Period': ['Jan', 'Feb', 'Mar'], 'Sale': [10, 20, 30]})
df2 = pd.DataFrame({'Loc': ['UAE'], 'Expectation': [98]})
To apply an operation along an axis of the DataFrame you can always use apply. Note that df2['Expectation'] is a one-element Series, not a scalar, so pull the value out with .iloc[0]. For example:
df1['Percentage'] = df1['Sale'].apply(lambda x: x / df2['Expectation'].iloc[0])
or, if instead of a simple division you want a percentage:
df1['Percentage'] = df1['Sale'].apply(lambda x: x * df2['Expectation'].iloc[0] / 100)
Details in the documentation.
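Since df2 holds a single Expectation value, a vectorized alternative (a minimal sketch, assuming you want Sale expressed as a percentage of Expectation) avoids apply entirely:

import pandas as pd

df1 = pd.DataFrame({'Period': ['Jan', 'Feb', 'Mar'], 'Sale': [10, 20, 30]})
df2 = pd.DataFrame({'Loc': ['UAE'], 'Expectation': [98]})

# Take the single expectation as a scalar and let pandas broadcast
# the division over the whole Sale column.
expectation = df2['Expectation'].iloc[0]
df1['Percentage'] = df1['Sale'] / expectation * 100
print(df1)  # Percentage: 10.20..., 20.40..., 30.61...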
You can use the pandas apply method:
import pandas as pd
df1 = pd.DataFrame({"Period": ["Jan", "Feb", "Mar"], "Sale": [10, 20, 30]})
df2 = pd.DataFrame({"Loc": ["UAE"], "Expectation": [98]})
df1['Percentage'] = df1['Sale'].apply(lambda x: x * df2['Expectation'].iloc[0] / 100)
print(f"df1 = {df1}")
output:
df1 = Period Sale Percentage
0 Jan 10 9.8
1 Feb 20 19.6
2 Mar 30 29.4
I have two DataFrames and would like to create a new column in DataFrame 1 based on DataFrame 2's values.
But I don't want to join the two dataframes per se and make one big dataframe; rather, I want to use the second DataFrame simply as a lookup.
#Main Dataframe:
df1 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantity':[10, 6, 40]})
#Lookup Dataframe
df2 = pd.DataFrame({'Size':["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean':[10, 20, 30]})
# Create column in DataFrame 1 based on lookup dataframe values (SQL-like pseudocode):
# df1['New_Column'] = when df1['Size'] = df2['Size'] and df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean']
#                     then 'Below Average Sales' else 'Above Average Sales!' end
One approach is to use np.where:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
condition = (df1['Size'] == df2['Size']) & (df1['Sold_Quantity'] < df2['Sold_Quantiy_Score_Mean'])
df1['New_Column'] = np.where(condition, 'Below Average Sales', 'Above Average Sales!')
print(df1)
Output
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
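Note that the elementwise comparison above relies on df1 and df2 having their rows in the same order. If that isn't guaranteed, a sketch that merges on Size first avoids depending on row alignment (a left merge preserves df1's row order):

merged = df1.merge(df2, on='Size', how='left')
condition = merged['Sold_Quantity'] < merged['Sold_Quantiy_Score_Mean']
df1['New_Column'] = np.where(condition, 'Below Average Sales', 'Above Average Sales!')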
Given that df2 is sort of like a lookup based on Size, it would make sense if your Size column was its index:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantity': [10, 6, 40]})
df2 = pd.DataFrame({'Size': ["Big", "Medium", "Small"], 'Sold_Quantiy_Score_Mean': [10, 20, 30]})
lookup = df2.set_index("Size")
You can then map the Sizes in df1 to their mean and compare each with the sold quantity:
is_below_mean = df1["Sold_Quantity"] < df1["Size"].map(lookup["Sold_Quantiy_Score_Mean"])
and finally map the boolean values to the respective strings using np.where
df1["New_Column"] = np.where(is_below_mean, 'Below Average Sales', 'Above Average Sales!')
df1:
Size Sold_Quantity New_Column
0 Big 10 Above Average Sales!
1 Medium 6 Below Average Sales
2 Small 40 Above Average Sales!
I have a dataframe that looks like this:
total_customers total_customer_2021-03-31 total_purchases total_purchases_2021-03-31
1 10 4 6
3 14 3 2
Now, I want to sum up, row-wise, the columns that are the same except for the suffix. I.e. the expected output is:
total_customers total_purchases
11 10
17 5
The reason I cannot do this manually is that I have 100+ column pairs, so I need an efficient way to do this. Also, the order of the columns is not predictable. What do you recommend?
Thanks!
We need to build an Index of column labels in which the columns of each pair share the same name; then we can groupby-sum on axis=1:
cols = pd.Index(['total_customers', 'total_customers',
'total_purchases', 'total_purchases'])
result_df = df.groupby(cols, axis=1).sum()
With the shown example, we can use str.replace to replace an optional s, followed by an underscore, followed by the date format (four digits-two digits-two digits), with a single s. This pattern may need to be modified depending on the actual column names:
cols = df.columns.str.replace(r's?_\d{4}-\d{2}-\d{2}$', 's', regex=True)
result_df = df.groupby(cols, axis=1).sum()
result_df:
total_customers total_purchases
0 11 10
1 17 5
Setup and imports:
import pandas as pd
df = pd.DataFrame({
'total_customers': [1, 3],
'total_customer_2021-03-31': [10, 14],
'total_purchases': [4, 3],
'total_purchases_2021-03-31': [6, 2]
})
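Note: newer pandas versions deprecate DataFrame.groupby with axis=1. A sketch of an equivalent, reusing the same cols index built above:

# Transpose so the columns become rows, group those rows by base name,
# sum, then transpose back.
result_df = df.T.groupby(cols).sum().T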
Assuming that your dataframe is called df, one solution is:
sum_customers = df['total_customers'] + df['total_customer_2021-03-31']
sum_purchases = df['total_purchases'] + df['total_purchases_2021-03-31']
df_total = pd.DataFrame({'total_customers': sum_customers,
                         'total_purchases': sum_purchases})
and that will give you the output you want.
import pandas as pd
data = {"total_customers": [1, 3], "total_customer_2021-03-31": [10, 14], "total_purchases": [4, 3], "total_purchases_2021-03-31": [6, 2]}
df = pd.DataFrame(data=data)
final_df = pd.DataFrame()
final_df["total_customers"] = df.filter(regex='total_customers*').sum(1)
final_df["total_purchases"] = df.filter(regex='total_purchases*').sum(1)
output
final_df
total_customers total_purchases
0 11 10
1 17 5
Using @HenryEcker's sample data, and building off of the example in the docs, you can create a function and groupby on the column axis:
def get_column(column):
    if column.startswith('total_customer'):
        return 'total_customers'
    return 'total_purchases'
df.groupby(get_column, axis=1).sum()
total_customers total_purchases
0 11 10
1 17 5
I shortened the column names while coding, just FYI.
data = {"total_c" : [1,3], "total_c_2021" :[10,14],
"total_p": [4,3], "total_p_2021": [6,2]}
df = pd.DataFrame(data)
df["total_costumers"] = df["total_c"] + df["total_c_2021"]
df["total_purchases"] = df["total_p"] + df["total_p_2021"]
If you don't want to see the other columns, you can keep only these two:
df = df.loc[:, ['total_costumers','total_purchases']]
Update
This might be a starting point for your solution. I don't know your actual column names, but the following code can be adapted if the names follow a pattern (dates, common prefixes, etc.). Could you rename the columns with a loop?
df['total_customer'] = df[[col for col in df.columns if col.startswith('total_c')]].sum(axis=1)
And this solution might be helpful for you with some alterations.
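As a concrete sketch of that loop idea (assuming every suffixed column is its base name plus an underscore-and-digits tail; adjust the pattern to your real names):

import re
import pandas as pd

df = pd.DataFrame({"total_c": [1, 3], "total_c_2021": [10, 14],
                   "total_p": [4, 3], "total_p_2021": [6, 2]})

# Derive each column's base name by stripping a trailing _<digits...> suffix,
# then sum every group of columns that shares a base name.
base = {c: re.sub(r'_\d.*$', '', c) for c in df.columns}
totals = pd.DataFrame({p: df[[c for c in df.columns if base[c] == p]].sum(axis=1)
                       for p in sorted(set(base.values()))})
print(totals)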
I have a dataframe in which some [Year] & [Week] rows are missing. I have another dataframe that serves as a reference calendar, from which I can get these missing values. How do I fill in the missing rows using pandas?
I have tried using reindex to set them up, but I am getting the following error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7,8,9,10]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
df1 = df1.reindex(df2, fill_value=0)
print(df1)
You should pass the index, i.e. df2.index:
df1.reindex(df2.index,fill_value=0)
Out[851]:
Value
Year Week
2019 1 20
2 40
3 0
4 60
5 0
6 75
7 90
df2.index.difference(df1.index)
Out[854]:
MultiIndex(levels=[[2019], [3, 5]],
labels=[[0, 0], [0, 1]],
names=['Year', 'Week'],
sortorder=0)
Update: to fill gaps only up to df1's last existing week (dropping the trailing weeks that df1 never reached):
s=df1.reindex(df2.index)
s[s.bfill().notnull().values].fillna(0)
Out[877]:
Value
Year Week
2019 1 20.0
2 40.0
3 0.0
4 60.0
5 0.0
6 75.0
7 90.0
import pandas as pd
d1 = {'Year': [2019,2019,2019,2019,2019], 'Week': [1,2,4,6,7], 'Value': [20,40,60,75,90]}
d2 = {'Year': [2019,2019,2019,2019,2019,2019,2019], 'Week':[1,2,3,4,5,6,7]}
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df1 = df1.set_index(['Year', 'Week'])
df2 = df2.set_index(['Year', 'Week'])
fill_value = df1['Value'].mean() #value to fill `NaN` rows with - can choose another logic if you do not want the mean
df1 = df1.join(df2, how='right')
df1 = df1.fillna(value=fill_value)  # fill missing data here; the result must be assigned back
print(df1)
I have the following df:
date amount code id
2018-01-01 50 12 1
2018-02-03 100 12 1
2017-12-30 1 13 2
2017-11-30 2 14 2
I want to groupby id, with the dates within each group sorted in ascending or descending order, so that I can do the following:
grouped = df.groupby('id')
a = np.where(grouped['code'].transform('nunique') == 1, 20, 0)
b = np.where(grouped['amount'].transform('max') > 100, 20, 0)
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)
You can sort the data within each group by using apply and sort_values:
grouped = df.groupby('id').apply(lambda g: g.sort_values('date', ascending=True))
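An often simpler equivalent is to sort the whole frame once and then group; the rows stay date-ordered inside each group (a minimal sketch, assuming date is already a datetime column):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-01', '2018-02-03', '2017-12-30', '2017-11-30']),
    'amount': [50, 100, 1, 2],
    'code': [12, 12, 13, 14],
    'id': [1, 1, 2, 2],
})

# Sort once; every group then sees its dates in ascending order.
df = df.sort_values(['id', 'date'])
grouped = df.groupby('id')

# e.g. the date-diff check from the question now works on ordered dates
c = np.where(grouped['date'].transform(lambda x: x.diff().dropna().sum()).dt.days < 5, 30, 0)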
Adding to the previous answer, if you wish the indexes to remain as they were, you might consider the following:
import pandas as pd
df = {'a':[1,2,3,0,5], 'b':[2,2,3,2,5], 'c':[22,11,11,42,12]}
df = pd.DataFrame(df)
e = (df.groupby(['c','b', 'a']).size()).reset_index()
e = e[['a', 'b', 'c']]
e = e.sort_values(['c','a'])
print(e)
I want to create a new column in a pandas data frame by applying a function to two existing columns. Following this answer I've been able to create a new column when I only need one column as an argument:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
def fx(x):
    return x * x
print(df)
df['newcolumn'] = df.A.apply(fx)
print(df)
However, I cannot figure out how to do the same thing when the function requires multiple arguments. For example, how do I create a new column by passing column A and column B to the function below?
def fxy(x, y):
    return x * y
You can go with @greenAfrican's example if it's possible for you to rewrite your function. But if you don't want to rewrite it, you can wrap it in an anonymous function inside apply, like this:
>>> def fxy(x, y):
...     return x * y
>>> df['newcolumn'] = df.apply(lambda x: fxy(x['A'], x['B']), axis=1)
>>> df
A B newcolumn
0 10 20 200
1 20 30 600
2 30 10 300
Alternatively, you can use the underlying NumPy function:
>>> import numpy as np
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 10 300
or vectorize an arbitrary function for the general case:
>>> def fx(x, y):
...     return x * y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
A B new_column
0 10 20 200
1 20 30 600
2 30 10 300
This solves the problem:
df['newcolumn'] = df.A * df.B
You could also do:
def fab(row):
    return row['A'] * row['B']
df['newcolumn'] = df.apply(fab, axis=1)
If you need to create multiple columns at once:
Create the dataframe:
import pandas as pd
df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
Create the function:
def fab(row):
    return row['A'] * row['B'], row['A'] + row['B']
Assign the new columns:
df['newcolumn'], df['newcolumn2'] = zip(*df.apply(fab, axis=1))
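An alternative sketch, using the same fab, is apply with result_type='expand', which spreads the returned tuple straight into columns:

df[['newcolumn', 'newcolumn2']] = df.apply(fab, axis=1, result_type='expand')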
One more clean, dict-style syntax:
df["new_column"] = df.apply(lambda x: x["A"] * x["B"], axis = 1)
or,
df["new_column"] = df["A"] * df["B"]
This will dynamically give you the desired result. It works even if the function takes more than two arguments:
df['anothercolumn'] = df[['A', 'B']].apply(lambda x: fxy(*x), axis=1)
print(df)
A B newcolumn anothercolumn
0 10 20 100 200
1 20 30 400 600
2 30 10 900 300