Pandas: Reorder a data frame based on conditions - python

I would like to reorder my dataframe based on certain conditions.
My original dataframe looks like this:
Merchant name original_rnk
Boohoo 1
PRETTYLITTLETHING 2
ASOS US 3
PRINCESS POLLY 4
URBAN OUTFITTERS 5
KIM+ONO 6
And there is a reference dataframe that has some merchant information
Merchant name order_cnt profit epc
Boohoo 200 30 0.6
PRETTYLITTLETHING 100 -60 -0.4
ASOS US 50 100 1.0
PRINCESS POLLY 80 50 0.8
URBAN OUTFITTERS 120 -20 -0.1
KIM+ONO 500 90 0.7
I would like to give a new rank to these merchants based on their epc if their order_cnt >= 100 and profit >= 0. The first merchant always stays first, no matter what its order_cnt and profit are, but the rest, whose order_cnt < 100 or profit < 0, keep their original order.
So my desired output is
Merchant name new_rnk original_rnk
Boohoo 1 1
PRETTYLITTLETHING 3 2
ASOS US 4 3
PRINCESS POLLY 5 4
URBAN OUTFITTERS 6 5
KIM+ONO 2 6

Using the data provided in the question:
import pandas as pd
info = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]
})
Split the data into the first row (head) and the remaining rows (tail), then use a boolean mask to split the tail into the rows that satisfy the condition (pos) and the rows that don't (neg):
head = info.head(1)
tail = info.iloc[1:]
mask = tail.eval('order_cnt >= 100 and profit >= 0')
pos = tail[mask]
neg = tail[~mask]
Sort the positive rows using the desired criteria (epc) and concatenate the three partitions back together:
df = pd.concat([head, pos.sort_values('epc', ascending=False), neg])
To get output as presented in the original question (with both the original and the new ranks and sorted by the original rank) add these lines:
df['new_rank'] = range(1, 7)
df['original_rank'] = df['Merchant name'].map(ranks.set_index('Merchant name')['original_rnk'])
df.sort_values('original_rank')[['Merchant name', 'new_rank', 'original_rank']]
where ranks is the "original data frame":
ranks = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'original_rnk': range(1, 7)
})

You can use the following code for the desired output:
import pandas as pd
original_rank_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'original_rnk': [1, 2, 3, 4, 5 ,6]})
reference_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]})
final_table = pd.concat([reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))],
reference_frame[~((reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0))]], axis=0)
final_table = final_table.reset_index().rename({'index':'original_rnk'}, axis = 'columns').reset_index().rename({'index':'new_rnk'}, axis = 'columns')[['Merchant name', 'new_rnk', 'original_rnk']]
final_table[['new_rnk', 'original_rnk']] += 1
final_table.sort_values('original_rnk')
Output
Merchant name new_rnk original_rnk
0 Boohoo 1 1
2 PRETTYLITTLETHING 3 2
3 ASOS US 4 3
4 PRINCESS POLLY 5 4
5 URBAN OUTFITTERS 6 5
1 KIM+ONO 2 6
Explanation
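The first step is to filter the dataframe by the desired conditions with reference_frame[(reference_frame['order_cnt'] >= 100) & (reference_frame['epc'] >= 0)]. Since every row either satisfies the condition or does not, the negation (~) selects the rest. Then we concatenate these two dataframes and extract the original index as original_rnk. We assign a new index by resetting, which becomes new_rnk. In the last step, we increment both columns, since the index starts at 0 but the desired ranks start at 1. Note that this code filters on epc >= 0 rather than profit >= 0 and does not sort the qualifying rows by epc. It matches the desired output for this sample data, but in general the qualifying rows would also need to be sorted by epc with the first merchant kept in place.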
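The rule as stated in the question can also be written as a single function. This is a minimal sketch, not a definitive implementation: the function name rerank is hypothetical, and it assumes both frames share the 'Merchant name' column and that each merchant appears exactly once.

```python
import pandas as pd


def rerank(ranks, info):
    """Hypothetical helper: pin the first merchant, promote qualifying
    merchants by epc (descending), and keep the rest in original order."""
    df = ranks.merge(info, on='Merchant name').sort_values('original_rnk')
    head, tail = df.iloc[:1], df.iloc[1:]
    # condition taken from the question: order_cnt >= 100 and profit >= 0
    qualifies = (tail['order_cnt'] >= 100) & (tail['profit'] >= 0)
    promoted = tail[qualifies].sort_values('epc', ascending=False, kind='stable')
    ordered = pd.concat([head, promoted, tail[~qualifies]])
    ordered['new_rnk'] = range(1, len(ordered) + 1)
    return ordered.sort_values('original_rnk')[['Merchant name', 'new_rnk', 'original_rnk']]


merchants = ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO']
ranks = pd.DataFrame({'Merchant name': merchants, 'original_rnk': range(1, 7)})
info = pd.DataFrame({
    'Merchant name': merchants,
    'order_cnt': [200, 100, 50, 80, 120, 500],
    'profit': [30, -60, 100, 50, -20, 90],
    'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7],
})
result = rerank(ranks, info)
print(result)
```

A stable sort is used so that qualifying merchants with equal epc keep their original relative order.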

Related

Sort the products based on the frequency of changes in customer demand

Imagine following dataframe is given.
import pandas as pd
products = ['Apple', 'Apple', 'Carrot', 'Egg', 'Egg', 'Egg']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07', '2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
print(df)
Products customer_demand_date 02_2023 03_2023 04_2023 05_2023
0 Apple 2023-01-01 0 20 10 40
1 Apple 2023-01-07 20 30 40 40
2 Carrot 2023-01-01 0 10 50 60
3 Egg 2023-01-01 0 0 30 50
4 Egg 2023-01-07 0 10 40 60
5 Egg 2023-01-14 0 0 10 20
I have the columns Products, customer_demand_date (every week there is new customer demand for products for the upcoming months), and one column per month with the quantity demanded.
How can I determine which product has experienced the most frequent changes in customer demand over the months, and sort the products in descending order of frequency of change?
I have tried to group by product, accumulate the demand quantity but none of them can analyze the data both horizontally (per customer demand date) and vertically (per months).
Desired output:
Sorted products Ranking(or %, or count of changes)
Egg 1 (or 70% or 13)
Apple 2 (or 52% or 8)
Carrot 3 (22% or 3)
Either ranking or % of change frequency or count of changes.
Note: percentages in desired output are random numbers
I'd really appreciate any clever approach to solve this problem.
Thanks
One way is to define a function that counts horizontal and vertical changes which you can apply to each group individually.
import pandas as pd
from io import StringIO
def change_freq(x, months):
    # count horizontal changes
    chngs_horizontal = x[months].diff(axis=1).fillna(0).astype(bool).sum().sum()
    # count vertical changes
    chngs_vertical = x[months].diff(axis=0).fillna(0).astype(bool).sum().sum()
    return chngs_horizontal + chngs_vertical
# sample data
data = StringIO("""
Products,customer_demand_date,02_2023,03_2023,04_2023,05_2023
Apple,2023-01-01,0,20,10,40
Apple,2023-01-07,20,30,40,40
Carrot,2023-01-01,0,10,50,60
Egg,2023-01-01,0,0,30,50
Egg,2023-01-07,0,10,40,60
Egg,2023-01-14,0,0,10,20
""")
df = pd.read_csv(data, sep=",")
# count horizontal and vertical changes by product
result = df.groupby('Products').apply(change_freq, ['02_2023','03_2023','04_2023','05_2023'])
result = result.sort_values(ascending=False).to_frame('count_changes')
result['rank'] = result['count_changes'].rank(ascending=False)
This returns
count_changes rank
Products
Egg 13 1.0
Apple 8 2.0
Carrot 3 3.0
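The question also asks for a percentage. One possible reading, shown here as a sketch under that assumption, is the share of changed comparisons among all possible neighbouring comparisons in a product's block (horizontal plus vertical). The column name pct_changes and the definition of "possible" are assumptions, not part of the original answer.

```python
import pandas as pd

months = ['02_2023', '03_2023', '04_2023', '05_2023']
df = pd.DataFrame({
    'Products': ['Apple', 'Apple', 'Carrot', 'Egg', 'Egg', 'Egg'],
    '02_2023': [0, 20, 0, 0, 0, 0],
    '03_2023': [20, 30, 10, 0, 10, 0],
    '04_2023': [10, 40, 50, 30, 40, 10],
    '05_2023': [40, 40, 60, 50, 60, 20],
})

records = []
for product, grp in df.groupby('Products'):
    values = grp[months]
    n_rows, n_cols = values.shape
    # same counting as above: non-zero differences between neighbours
    horizontal = int(values.diff(axis=1).fillna(0).astype(bool).sum().sum())
    vertical = int(values.diff(axis=0).fillna(0).astype(bool).sum().sum())
    changes = horizontal + vertical
    # assumed denominator: number of neighbouring pairs in the block
    possible = n_rows * (n_cols - 1) + (n_rows - 1) * n_cols
    records.append({'Products': product,
                    'count_changes': changes,
                    'pct_changes': 100 * changes / possible})

share = pd.DataFrame(records).sort_values('pct_changes', ascending=False)
print(share)
```

Under this definition the ordering differs from the raw count: Carrot changes in all 3 of its 3 possible comparisons, while Egg has more changes in absolute terms but also more rows to compare.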
To count the changes in the y direction (between demand dates), a groupby("Products") with a lambda function can be used:
var_y=(df.loc[:,~df.columns.isin(['customer_demand_date','HEHE'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
To count the changes in the x direction (between months), pct_change can be interpreted as True or False using astype(bool):
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
Adding and grouping both together finally looks like this:
Products sum_x sum_y sum_xy
0 Apple 5 3 8
1 Carrot 3 0 3
2 Eggplant 7 6 13
Below is the complete code:
import pandas as pd
import numpy as np
products = ['Apple', 'Apple', 'Carrot', 'Eggplant', 'Eggplant', 'Eggplant']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07','2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
var_y=(df.loc[:,~df.columns.isin(['customer_demand_date','HEHE'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
var_y["sum_y"]=var_y.iloc[:,1:].sum(axis="columns")
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
var_x_sum=var_x.groupby("Products", as_index=False).agg(sum_x=('sum_x','sum'))
var_total=pd.concat([var_x_sum,var_y["sum_y"]],axis=1)
var_total["sum_xy"]=var_total.iloc[:,1:].sum(axis="columns")
print(var_total)

Is it possible to summarize or group every row with a specific column value? - python

Picture of my dataframe
Is it possible to summarize or group every country's info into something like a 'total info' row?
This df is fluid: it will change each month, and having a "quick access" view of how it looks would be very beneficial.
Take the picture as an example: I would like to have Albania's (and every other country's) info in one row, something like this:
**ORIGINATING COUNTRY Calls Made Actual Qty Billable Qty. Cost (€)**
Albania 10 190 600 7
Zambia total total total
and total total total
every total total total
other total total total
country in my df total total total
I've tried groupby() and sum() but can't figure it out.
import pandas as pd
df = pd.DataFrame(
data=[
['Albania', 1, 10, 100, 0.1],
['Albania', 2, 20, 200, 0.2],
['Zambia', 3, 30, 300, 0.3],
['Zambia', 4, 40, 400, 0.4],
[None, 5, 50, 500, 0.5],
[None, 6, 60, 600, 0.6],
],
columns=[
'ORIGINATING COUNTRY',
'Calls Made',
'Actual Qty. (s)',
'Billable Qty. (s)',
'Cost (€)',
],
)
# label missing countries explicitly so they are not dropped by groupby
df['ORIGINATING COUNTRY'] = df['ORIGINATING COUNTRY'].fillna('Unknown')
df.groupby('ORIGINATING COUNTRY', as_index=False).sum()
Output:
ORIGINATING COUNTRY Calls Made Actual Qty. (s) Billable Qty. (s) Cost (€)
0 Albania 3 30 300 0.3
1 Unknown 11 110 1100 1.1
2 Zambia 7 70 700 0.7
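If relabelling the missing countries is not wanted, groupby can keep them as their own group through its dropna parameter. A minimal sketch using the same sample data; the variable name totals is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[
        ['Albania', 1, 10, 100, 0.1],
        ['Albania', 2, 20, 200, 0.2],
        ['Zambia', 3, 30, 300, 0.3],
        ['Zambia', 4, 40, 400, 0.4],
        [None, 5, 50, 500, 0.5],
        [None, 6, 60, 600, 0.6],
    ],
    columns=[
        'ORIGINATING COUNTRY',
        'Calls Made',
        'Actual Qty. (s)',
        'Billable Qty. (s)',
        'Cost (€)',
    ],
)

# dropna=False keeps rows whose country is missing as a separate NaN group
totals = df.groupby('ORIGINATING COUNTRY', dropna=False).sum().reset_index()
print(totals)
```

Without dropna=False, rows with a missing country are silently left out of the result, which is why the answer above fills them in first.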

How to drop part of the values from one column by condition from another column in Python, Pandas?

I have real estate dataframe with many outliers and many observations.
I have variables: total area, number of rooms (if rooms = 0, then it's a studio apartment) and kitchen_area.
"Minimalized" extraction from my dataframe:
import pandas as pd
dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000 },
{'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
{'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
{'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
{'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2','apt3','apt4', 'apt5'])
My target would be to eliminate apt3, because by law the kitchen area cannot be smaller than 5 square meters in non-studio apartments.
In other words, I would like to eliminate all rows from my dataframe that describe non-studio apartments (rooms > 0) with kitchen_area < 5.
I have tried code like this:
df1 = df.drop(df[(df.rooms > 0) & (df.kitchen_area < 5)].index)
But it just eliminated all data from both columns kitchen_area and rooms according to the multiple conditions I put.
Clean
mask1 = df.rooms > 0
mask2 = df.kitchen_area < 5
df1 = df[~(mask1 & mask2)]
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
pd.DataFrame.query
df1 = df.query('rooms == 0 | kitchen_area >= 5')
df1
area kitchen_area rooms price
apt1 40 10 1 50000
apt2 20 0 0 50000
apt4 29 9 1 30000
apt5 15 0 0 25000
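As a quick check, assuming the sample data from the question, the boolean-mask form, the query form and the drop-based attempt all remove only apt3. This is just a sketch on the minimal extract; the behaviour described in the question may come from a different step in the original code.

```python
import pandas as pd

dic = [{'area': 40, 'kitchen_area': 10, 'rooms': 1, 'price': 50000},
       {'area': 20, 'kitchen_area': 0, 'rooms': 0, 'price': 50000},
       {'area': 60, 'kitchen_area': 0, 'rooms': 2, 'price': 70000},
       {'area': 29, 'kitchen_area': 9, 'rooms': 1, 'price': 30000},
       {'area': 15, 'kitchen_area': 0, 'rooms': 0, 'price': 25000}]
df = pd.DataFrame(dic, index=['apt1', 'apt2', 'apt3', 'apt4', 'apt5'])

# rows that break the rule: non-studio apartments with a kitchen under 5
violates = (df['rooms'] > 0) & (df['kitchen_area'] < 5)

by_mask = df[~violates]
by_drop = df.drop(df[violates].index)
by_query = df.query('rooms == 0 | kitchen_area >= 5')

print(by_mask.index.tolist())
print(by_mask.equals(by_drop), by_mask.equals(by_query))
```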

Map counts of a numerical column from a new DataFrame to the bin range column of training data

I am trying to get the count of the Age column and append it to my existing bin-range column. I am able to do it for the training df and want to do the same for the prediction data. How do I map the counts of the Age column from the prediction data to the Age_bin column in my training data? The first one is my output DF, whereas the second one is the sample DF. I can get the count using value_counts() for the file I am reading.
First image - bin and count from training data
Second image - Training data
Third image - Prediction data
Fourth image - Final output
The Data
import pandas as pd
data = {
0: 0,
11: 1500,
12: 1000,
22: 3000,
32: 35000,
34: 40000,
44: 55000,
65: 7000,
80: 8000,
100: 1000000,
}
df = pd.DataFrame(data.items(), columns=['Age', 'Salary'])
Age Salary
0 0 0
1 11 1500
2 12 1000
3 22 3000
4 32 35000
5 34 40000
6 44 55000
7 65 7000
8 80 8000
9 100 1000000
The Code
bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# create a "binned" column
df['binned'] = pd.cut(df['Age'], bins)
# add bin count
df['count'] = df.groupby('binned')['binned'].transform('count')
The Output
Age Salary binned count
0 0 0 (-0.1, 10.0] 1
1 11 1500 (10.0, 20.0] 2
2 12 1000 (10.0, 20.0] 2
3 22 3000 (20.0, 30.0] 1
4 32 35000 (30.0, 40.0] 2
5 34 40000 (30.0, 40.0] 2
6 44 55000 (40.0, 50.0] 1
7 65 7000 (60.0, 70.0] 1
8 80 8000 (70.0, 80.0] 1
9 100 1000000 (90.0, 100.0] 1
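To carry counts from a second (prediction) frame over to the training bins, one option is to cut both frames with the same bin edges and map the prediction counts by bin number. This is a sketch under assumptions: the prediction ages below are made up for illustration, and the names train, pred, bin_id and pred_count are hypothetical.

```python
import pandas as pd

bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

train = pd.DataFrame({'Age': [0, 11, 12, 22, 32, 34, 44, 65, 80, 100]})
pred = pd.DataFrame({'Age': [5, 15, 18, 19, 33, 95]})  # hypothetical prediction data

# labels=False returns the integer position of the bin instead of an interval
train['bin_id'] = pd.cut(train['Age'], bins, labels=False)
pred_bin_id = pd.cut(pred['Age'], bins, labels=False)

# number of prediction rows per bin, looked up for every training row
pred_counts = pred_bin_id.value_counts()
train['pred_count'] = train['bin_id'].map(pred_counts).fillna(0).astype(int)
print(train)
```

Bins that contain no prediction rows get a count of 0 through fillna.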

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product. Therefore the simulation value is equal to the State_6 value, which is 60.
Product B - sim_2
The input here is 1.5. The LB to UB range is (1,2), therefore the 6 states are {1,1.2,1.4,1.6,1.8,2}. 1.5 is exactly in the middle between state 3, which has a value of 31, and state 4, which has a value of 41. Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB to UB range is (.5,.625), therefore the 6 states are {.5,.525,.55,.575,.6,.625}. .61 is between state 5 and 6. Specifically the bucket it would fall under would be 5*(.61-.5)/(.625-.5)+1 = 5.4 (it is multiplied by 5 as that is the number of intervals - you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighing of the values for state 5 and state 6: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1. Therefore we need to extrapolate the value. We use the same formula as above, but with the values of state 1 and state 2. The bucket would be 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to try to visualize the solution, my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation although I'm not committed to them specifically.
Bucket = N*(sim_value-LB)/(UB-LB) + 1
where N is the number of intervals
Then nLower is the state value directly below the bucket, and nHigher is the state value directly above the bucket. If the bucket is outside the UB/LB range, then force nLower and nHigher to be either the first two or the last two values.
Final_value = (nHigher-nLower)*(Bucket - number_value_of_nLower)+nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
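For a single input, the formulas above can be sketched in plain Python as follows. The function name interpolate and its argument names are hypothetical; the clamping of the lower state number to the range 1..N is how the "first two or last two values" rule is read here.

```python
def interpolate(states, lower, upper, x):
    """Hypothetical scalar version of the Bucket / Final_value formulas."""
    n = len(states) - 1                      # number of intervals
    bucket = n * (x - lower) / (upper - lower) + 1
    # state number directly below the bucket, clamped to 1..n so that
    # out-of-range inputs extrapolate from the first or last two states
    lo = min(max(int(bucket), 1), n)
    n_lower, n_higher = states[lo - 1], states[lo]
    return (n_higher - n_lower) * (bucket - lo) + n_lower


product_a = [10, 20, 30, 40, 50, 60]
product_b = [11, 21, 31, 41, 51, 61]
product_c = [12, 22, 32, 42, 52, 62]

print(interpolate(product_a, -1, 1, 1.0))       # Product A - sim_2
print(interpolate(product_b, 1, 2, 1.5))        # Product B - sim_2
print(interpolate(product_c, 0.5, 0.625, 0.61)) # Product C - sim_1
print(interpolate(product_b, 1, 2, 0))          # Product B - sim_1
```

The four calls correspond to the four worked examples and give 60, 36, 56 and -39 up to floating-point rounding.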
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    # bucket position of the simulated input (1 = lower bound, 6 = upper bound)
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    # clamp the state numbers (.loc replaces .ix, which was removed from pandas)
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
# bucket position of each simulated input (1 = lower bound, 6 = upper bound)
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound']), axis=0) * 5 + 1
# neighbouring state numbers, clamped so out-of-range inputs use the first or last two states
low = buckets.astype(int).clip(lower=1, upper=5)
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)
# 'Product Type' is column 0 of the filtered frame, so state k sits at column k
state_values = df.filter(regex="State|Type").values
rows = np.arange(len(df))[:, None]
low_value = pd.DataFrame(state_values[rows, low.values])
high_value = pd.DataFrame(state_values[rows, high.values])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
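The same calculation can also be sketched directly on NumPy arrays, which avoids the object-dtype lookup through the 'Product Type' column. This is an illustrative variant under assumptions: it uses zero-based positions instead of bucket numbers, np.floor instead of int truncation (equivalent after clamping), and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})

states = df.filter(regex=r'^State_\d+_Value$').to_numpy(dtype=float)  # (rows, 6)
sims = df[['sim_1', 'sim_2']].to_numpy(dtype=float)                   # (rows, 2)
lb = df['Lower_Bound'].to_numpy(dtype=float)[:, None]
ub = df['Upper_Bound'].to_numpy(dtype=float)[:, None]

n_intervals = states.shape[1] - 1
pos = n_intervals * (sims - lb) / (ub - lb)        # zero-based bucket position
# index of the lower state, clamped so out-of-range inputs extrapolate
lo = np.clip(np.floor(pos).astype(int), 0, n_intervals - 1)
low_val = np.take_along_axis(states, lo, axis=1)
high_val = np.take_along_axis(states, lo + 1, axis=1)
final = low_val + (high_val - low_val) * (pos - lo)

result = pd.DataFrame(final, columns=['Final_value_1', 'Final_value_2'])
result.insert(0, 'Product Type', df['Product Type'])
print(result)
```

On the sample frame this reproduces the Final_value_1 and Final_value_2 columns from the loop-based output shown earlier.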
