I have two dataframes:
One - a score card for scoring student marks.
Second one - a student dataset.
I want to apply the score card on a given student dataset to compute scores and aggregate them. I'm trying to develop a generic function that takes the score card and applies it to any student marks dataset.
import pandas as pd
score_card_data = {
'subject_id': ['MATHS', 'SCIENCE', 'ARTS'],
'bin_list': [[0,25,50,75,100], [0,20,40,60,80,100], [0,20,40,60,80,100]],
'bin_value': [[1,2,3,4], [1,2,3,4,5], [3,4,5,6,7] ]}
score_card_data = pd.DataFrame(score_card_data, columns = ['subject_id', 'bin_list', 'bin_value'])
score_card_data
student_scores = {
'STUDENT_ID': ['S1', 'S2', 'S3','S4','S5'],
'MATH_MARKS': [10,15,25,65,75],
'SCIENCE_MARKS': [8,15,20,35,85],
'ARTS_MARKS':[55,90,95,88,99]}
student_scores = pd.DataFrame(student_scores, columns = ['STUDENT_ID', 'MATH_MARKS', 'SCIENCE_MARKS','ARTS_MARKS'])
student_scores
Functions
Define bins
Apply the bins over columns
bins = list(score_card_data.loc[score_card_data['subject_id'] == 'MATHS', 'bin_list'])
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'],bins, labels='MATHS_MARKS')
Error: ValueError: object too deep for desired array
I'm trying to convert the cell value to a string and it is getting detected as an object. Is there any way to resolve this?
How can I make the function more generic?
Thanks
Pari
You can just use bins[0] to extract the inner list; passing the nested list is what raises the ValueError:
bins[0]
[0, 25, 50, 75, 100]
type(bins[0])
<class 'list'>
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'], bins[0])
STUDENT_ID MATH_MARKS SCIENCE_MARKS ARTS_MARKS MATH_SCORE
0 S1 10 8 55 (0, 25]
1 S2 15 15 90 (0, 25]
2 S3 25 20 95 (0, 25]
3 S4 65 35 88 (50, 75]
4 S5 75 85 99 (50, 75]
I left out the labels because you'd need to provide a list of four labels given there are five cutoffs / bin edges.
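To make this more generic, one option is a small helper that looks up a subject's bin edges and bin values in the score card and applies pd.cut to the matching marks column. This is a minimal sketch, assuming the score_card_data and student_scores frames defined above; the subject-to-column mapping is passed explicitly because the score card says 'MATHS' while the marks column is 'MATH_MARKS':

import pandas as pd

def apply_score_card(score_card, students, subject_id, marks_col, score_col):
    # look up the bin edges and bin values for this subject
    row = score_card.loc[score_card['subject_id'] == subject_id].iloc[0]
    edges, values = row['bin_list'], row['bin_value']
    # pd.cut needs one label per bin, i.e. len(edges) - 1 labels
    students[score_col] = pd.cut(students[marks_col], bins=edges, labels=values)
    return students

for subject, col in [('MATHS', 'MATH_MARKS'),
                     ('SCIENCE', 'SCIENCE_MARKS'),
                     ('ARTS', 'ARTS_MARKS')]:
    student_scores = apply_score_card(score_card_data, student_scores,
                                      subject, col, col.replace('MARKS', 'SCORE'))

# aggregate the per-subject scores; marks outside the outermost edges would
# become NaN and make astype(int) fail, so this assumes in-range marks
score_cols = [c for c in student_scores.columns if c.endswith('_SCORE')]
student_scores['TOTAL_SCORE'] = student_scores[score_cols].astype(int).sum(axis=1)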
Related
Imagine the following dataframe is given.
import pandas as pd
products = ['Apple', 'Apple', 'Carrot', 'Egg', 'Egg', 'Egg']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07', '2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 10]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
print(df)
Products customer_demand_date 02_2023 03_2023 04_2023 05_2023
0 Apple 2023-01-01 0 20 10 40
1 Apple 2023-01-07 20 30 40 40
2 Carrot 2023-01-01 0 10 50 60
3 Egg 2023-01-01 0 0 30 50
4 Egg 2023-01-07 0 10 40 60
5 Egg 2023-01-14 0 0 10 20
I have the columns Products, customer_demand_date (every week there is new customer demand for products for the upcoming months), and one column per month with the demanded quantity.
How can I determine which product has experienced the most frequent changes in customer demand over the months, and sort the products in descending order of frequency of change?
I have tried grouping by product and accumulating the demand quantity, but none of my attempts analyze the data both horizontally (per customer demand date) and vertically (per month).
Desired output:
Sorted products   Ranking (or %, or count of changes)
Egg               1 (or 70% or 13)
Apple             2 (or 52% or 8)
Carrot            3 (or 22% or 3)
Either ranking or % of change frequency or count of changes.
Note: the percentages in the desired output are made-up numbers.
I'd really appreciate any clever approach to solving this problem.
Thanks
One way is to define a function that counts horizontal and vertical changes, which you can apply to each group individually.
import pandas as pd
from io import StringIO
def change_freq(x, months):
# count horizontal changes
chngs_horizontal = x[months].diff(axis=1).fillna(0).astype(bool).sum().sum()
# count vertical changes
chngs_vertical = x[months].diff(axis=0).fillna(0).astype(bool).sum().sum()
return chngs_horizontal+chngs_vertical
# sample data
data = StringIO("""
Products,customer_demand_date,02_2023,03_2023,04_2023,05_2023
Apple,2023-01-01,0,20,10,40
Apple,2023-01-07,20,30,40,40
Carrot,2023-01-01,0,10,50,60
Egg,2023-01-01,0,0,30,50
Egg,2023-01-07,0,10,40,60
Egg,2023-01-14,0,0,10,20
""")
df = pd.read_csv(data, sep=",")
# count horizontal and vertical changes by product
result = df.groupby('Products').apply(change_freq, ['02_2023','03_2023','04_2023','05_2023'])
result = result.sort_values(ascending=False).to_frame('count_changes')
result['rank'] = result['count_changes'].rank(ascending=False)
This returns
count_changes rank
Products
Egg 13 1.0
Apple 8 2.0
Carrot 3 3.0
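For reference, the same counts can also be computed without a custom function; this is a sketch on the df defined above:

months = ['02_2023', '03_2023', '04_2023', '05_2023']

# horizontal changes: diff across the month columns within each row
horiz = df[months].diff(axis=1).fillna(0).ne(0).sum(axis=1)
# vertical changes: diff down the rows within each product group
vert = df.groupby('Products')[months].diff().fillna(0).ne(0).sum(axis=1)

# add the per-row counts and aggregate per product
result = (horiz + vert).groupby(df['Products']).sum().sort_values(ascending=False)
print(result)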
To find the variance in the y direction, a groupby("Products") with a lambda function can be used:
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
To find the variance in the x direction, pct_change can be interpreted as True or False using astype(bool):
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
Adding / grouping both together finally looks like this:
Products sum_x sum_y sum_xy
0 Apple 5 3 8
1 Carrot 3 0 3
2 Eggplant 7 6 13
Below is the complete code:
import pandas as pd
import numpy as np
products = ['Apple', 'Apple', 'Carrot', 'Eggplant', 'Eggplant', 'Eggplant']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07','2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
var_y["sum_y"]=var_y.iloc[:,1:].sum(axis="columns")
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
var_x_sum=var_x.groupby("Products", as_index=False).agg(sum_x=('sum_x','sum'))
var_total=pd.concat([var_x_sum,var_y["sum_y"]],axis=1)
var_total["sum_xy"]=var_total.iloc[:,1:].sum(axis="columns")
print(var_total)
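To get the sorted ranking the question asks for, a short follow-up on var_total would be:

# sort by the total number of changes and assign ranks 1..n
var_total = var_total.sort_values('sum_xy', ascending=False).reset_index(drop=True)
var_total['rank'] = var_total.index + 1
print(var_total)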
An error is returned when I want to plot an interval.
I created an interval for my age column, so now I want to show on a chart how the age interval compares to the revenue.
My code:
import pandas as pd
import matplotlib.pyplot as plt

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
client_id sales revenue birth age sex tranche
0 c_1 39 558.18 1955 66 m (60, 70]
1 c_10 58 1353.60 1956 65 m (60, 70]
2 c_100 8 254.85 1992 29 m (20, 30]
3 c_1000 125 2261.89 1966 55 f (50, 60]
4 c_1001 102 1812.86 1982 39 m (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error appears ending by
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use an interval for plotting?
You'll need to add labels. (I tried to convert them to str using .astype(str), but that did not seem to work in Python 3.9.)
If you do the following, it will work just fine; note that pd.cut needs one label per bin, so the ten edges require nine labels:
labels = ['10-20', '20-30', '30-40', '40-50', '50-60', '60-70', '70-80', '80-90', '90-100']
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)
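Putting it together, here is a minimal runnable sketch (with a small stand-in for the clients frame, built from the head() output above) that generates the labels from the bin edges and plots the grouped revenue; since the labels are plain strings, the TypeError goes away:

import pandas as pd
import matplotlib.pyplot as plt

# a minimal stand-in for the clients frame from the question
clients = pd.DataFrame({'age': [66, 65, 29, 55, 39],
                        'revenue': [558.18, 1353.60, 254.85, 2261.89, 1812.86]})

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# one string label per bin: nine labels for ten edges
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)

# sum revenue per age bracket and plot
grouped = clients.groupby('tranche', observed=True)['revenue'].sum().reset_index()
plt.scatter(grouped.tranche, grouped.revenue)
plt.xlabel('Age bracket')
plt.ylabel('Revenue')
plt.show()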
I would like to reorder my dataframe based on certain conditions
My original dataframe looks like
Merchant name original_rnk
Boohoo 1
PRETTYLITTLETHING 2
ASOS US 3
PRINCESS POLLY 4
URBAN OUTFITTERS 5
KIM+ONO 6
And there is a reference dataframe that has some merchant information
Merchant name order_cnt profit epc
Boohoo 200 30 0.6
PRETTYLITTLETHING 100 -60 -0.4
ASOS US 50 100 1.0
PRINCESS POLLY 80 50 0.8
URBAN OUTFITTERS 120 -20 -0.1
KIM+ONO 500 90 0.7
I would like to give a new rank to these merchants based on their epc if their order_cnt >= 100 and profit >= 0. The first merchant will always stay first no matter what its order_cnt and profit are, but the rest, whose order_cnt < 100 or profit < 0, keep their original order.
So my desired output is
Merchant name new_rnk original_rnk
Boohoo 1 1
PRETTYLITTLETHING 3 2
ASOS US 4 3
PRINCESS POLLY 5 4
URBAN OUTFITTERS 6 5
KIM+ONO 2 6
Using the data provided in the question:
info = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]
})
Split the data into the first row (head), a boolean mask for the condition (mask), the rows that satisfy it (pos) and the rows that don't (neg):
head = info.head(1)
tail = info.iloc[1:]
mask = tail.eval('order_cnt >= 100 and profit >= 0')
pos = tail[mask]
neg = tail[~mask]
Sort the positive rows using the desired criteria (epc) and concatenate the three partitions back together:
df = pd.concat([head, pos.sort_values('epc', ascending=False), neg])
To get the output as presented in the original question (with both the original and the new ranks, sorted by the original rank), add these lines:
df['new_rank'] = range(1, 7)
df['original_rank'] = df['Merchant name'].map(ranks.set_index('Merchant name')['original_rnk'])
df.sort_values('original_rank')[['Merchant name', 'new_rank', 'original_rank']]
where ranks is the "original data frame":
ranks = pd.DataFrame({
'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'original_rnk': range(1, 7)
})
You can use the following code for the desired output:
import pandas as pd
original_rank_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'original_rnk': [1, 2, 3, 4, 5 ,6]})
reference_frame = pd.DataFrame({'Merchant name': ['Boohoo', 'PRETTYLITTLETHING', 'ASOS US', 'PRINCESS POLLY', 'URBAN OUTFITTERS', 'KIM+ONO'],
'order_cnt': [200, 100, 50, 80, 120, 500],
'profit': [30, -60, 100, 50, -20, 90],
'epc': [0.6, -0.4, 1.0, 0.8, -0.1, 0.7]})
final_table = pd.concat([reference_frame[((reference_frame['order_cnt'] >= 100) & (reference_frame['profit'] >= 0))],
                         reference_frame[~((reference_frame['order_cnt'] >= 100) & (reference_frame['profit'] >= 0))]], axis=0)
final_table = (final_table.reset_index()
                          .rename({'index': 'original_rnk'}, axis='columns')
                          .reset_index()
                          .rename({'index': 'new_rnk'}, axis='columns')
                          [['Merchant name', 'new_rnk', 'original_rnk']])
final_table[['new_rnk', 'original_rnk']] += 1
final_table.sort_values('original_rnk')
Output
Merchant name new_rnk original_rnk
0 Boohoo 1 1
2 PRETTYLITTLETHING 3 2
3 ASOS US 4 3
4 PRINCESS POLLY 5 4
5 URBAN OUTFITTERS 6 5
1 KIM+ONO 2 6
Explanation
The first step is to filter the dataframe by the desired qualities: reference_frame[(reference_frame['order_cnt'] >= 100) & (reference_frame['profit'] >= 0)]. Since these are mutually exclusive sets, we can use the negation (~) to get the rest. Then we concat these two dataframes and extract the original index. We assign a new index by resetting. In the last step, we increment the index values, since they start at 0 but the desired output starts at 1.
I'm a new student to Python and struggling with a task at the moment. I'm trying to produce a scatter plot from some data in a pandas table and can't seem to work it out.
Here is a sample of my data set:
import pandas as pd
data = {'housing_age': [14, 11, 3, 4],
'total_rooms': [25135, 32627, 39320, 37937],
'total_bedrooms': [4819, 6445, 6210, 5471],
'population': [35682, 28566, 16305, 16122]}
df = pd.DataFrame(data)
I'm trying to draw a scatter plot of the data in housing_age, but I'm having some difficulty figuring it out.
Initially I tried for the x axis to be 'housing_age' and the y axis to be a count of the housing data, but couldn't get the code to work. Then I read somewhere that the x axis should be the variable and the y axis the constant, so I tried this code:
x='housing_data'
y=[0,5,10,15,20,25,30,35,40,45,50,55]
plt.scatter(x,y)
ax.set_xlabel("Number of buildings")
ax.set_ylabel("Age of buildings")
but get this error:
ValueError: x and y must be the same size
Note - the data in 'housing_age' ranges from 1-53 years.
I imagine this should be a pretty easy thing, but for some reason I can't figure it out.
Does anyone have any tips?
I understand you are just starting, so confusion is common. Please bear with me.
From your description, it looks like you swapped x and y:
# x is the categories: 0-5 yrs, 5-10 yrs, ...
x = [0,5,10,15,20,25,30,35,40,45,50,55]
# y is the number of observations in each category
# I just assigned it some random numbers
y = [78, 31, 7, 58, 88, 43, 47, 87, 91, 87, 36, 78]
plt.scatter(x,y)
plt.title('Housing Data')
Generally, if you have a list of observations and you want to count them across a number of categories, it's called a histogram. pandas has many convenient functions to give you a quick look at the data. The one of interest for this question is hist - create a histogram:
import numpy as np

# A series of 100 random buildings whose ages are between 1 and 54 (np.random.randint's upper bound is exclusive)
buildings = pd.Series(np.random.randint(1, 55, 100))
# Make a histogram with 10 bins
buildings.hist(bins=10)
# The edges of those bins were determined automatically so they appear a bit weird:
pd.cut(buildings, bins=10)
0 (22.8, 28.0]
1 (7.2, 12.4]
2 (33.2, 38.4]
3 (38.4, 43.6]
4 (48.8, 54.0]
...
95 (48.8, 54.0]
96 (22.8, 28.0]
97 (12.4, 17.6]
98 (43.6, 48.8]
99 (1.948, 7.2]
You can also set the bins explicitly: 0-5, 5-10, ..., 50-55
buildings.hist(bins=range(0,60,5))
# Then the edges are no longer random
pd.cut(buildings, bins=range(0,60,5))
0 (25, 30]
1 (5, 10]
2 (30, 35]
3 (40, 45]
4 (45, 50]
...
95 (45, 50]
96 (25, 30]
97 (15, 20]
98 (40, 45]
99 (0, 5]
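If the original goal was still a scatter of building age against the number of buildings at that age, a sketch using value_counts (shown here on the small df from the question) would be:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'housing_age': [14, 11, 3, 4],
                   'total_rooms': [25135, 32627, 39320, 37937]})

# count how many buildings there are at each age
counts = df['housing_age'].value_counts().sort_index()

plt.scatter(counts.index, counts.values)
plt.xlabel('Age of buildings')
plt.ylabel('Number of buildings')
plt.show()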
I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign a score between 50 and 600 to all values, but the lowest value must get the highest score. Do you have an idea?
I'm not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
This is my idea, but I think there is a scale on your scores that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax,
                                       50,
                                       600 - ((df['value'] - dfmin) * (1 / score_value))))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
Not matching your output, because of the missing scale.
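For completeness, if scikit-learn is available, the same inverted linear scaling can be written as a one-liner with minmax_scale; this is a sketch, negating the values so the smallest value gets the largest score:

import pandas as pd
from sklearn.preprocessing import minmax_scale

df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
# negate so the order flips, then scale linearly into [50, 600]
df['score'] = minmax_scale(-df['value'], feature_range=(50, 600))
print(df)  # 78 -> 50.0, 12 -> 600.0, matching the first answer's linear scaling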