I'm working with eBay, and I have a list of 100 prices of items that sold. What I want to do is separate each floating-point price into groups, and then count up the groups to sort of determine the most common general price for this item, so I can automate the pricing of my own item.
Initially, I thought to separate prices by their $10 value, but I realized this isn't a good method of grouping, because prices can vary greatly due to outliers, unrelated items, etc.
If I have a list of prices like so: [90, 92, 95, 99, 1013, 1100]
my desire is for the application to separate the values into:
{nineties: 4, thousands: 2}
but I'm just not sure how to tell Python to do this. Preferably, the simpler it is to integrate this snippet into my code, the better!
Any help or suggestions would be appreciated!
The technique you use depends on your notion of what a group is.
If the number of groups is known, use k-means with k=2. See this link for working code in pure Python:
from kmeans import k_means, assign_data
prices = [90, 92, 95, 99, 1013, 1100]
points = [(x,) for x in prices]
centroids = k_means(points, k=2)
labeled = assign_data(centroids, points)
for centroid, group in labeled.items():
    print('Group centered around:', centroid[0])
    print([x for (x,) in group])
    print()
This outputs:
Group centered around: 94.0
[90, 92, 95, 99]
Group centered around: 1056.5
[1013, 1100]
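Not part of the original answer, but if you would rather not track down that pure-Python module, a rough equivalent with scikit-learn's KMeans (assuming scikit-learn is installed) is sketched below; it produces the same two groups for this data:
import numpy as np
from sklearn.cluster import KMeans

prices = [90, 92, 95, 99, 1013, 1100]
X = np.array(prices).reshape(-1, 1)   # one feature: the price

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label in range(km.n_clusters):
    cluster = [p for p, l in zip(prices, km.labels_) if l == label]
    print('Group centered around:', km.cluster_centers_[label][0])
    print(cluster)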
Alternatively, if a fixed maximum distance between elements defines the groupings, then just sort and loop over the elements, checking the distance between them to see whether a new group has started:
max_gap = 100
prices.sort()
groups = []
last_price = prices[0] - (max_gap + 1)
for price in prices:
    if price - last_price > max_gap:
        groups.append([])
    groups[-1].append(price)
    last_price = price
print(groups)
This outputs:
[[90, 92, 95, 99], [1013, 1100]]
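As a small follow-up of my own, building on the groups list above: the most common price range is simply the largest group, and its mean is a reasonable starting point for pricing.
biggest = max(groups, key=len)                      # group with the most sold items
print(len(biggest), sum(biggest) / len(biggest))    # 4 94.0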
A naive approach to point you in the right direction:
from math import log10
from collections import Counter

def f(i):
    x = 10 ** int(log10(i))   # largest power of 10 that is <= i
    return i // x * x

lst = [90, 92, 95, 99, 1013, 1100]
c = Counter(map(f, lst))
print(c)
Counter({90: 4, 1000: 2})
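Since the original goal is the most common general price, Counter.most_common gives you the dominant bucket directly (a small follow-up of my own):
bucket, count = c.most_common(1)[0]
print(bucket, count)   # 90 4 -> most sales landed in the 90-99 bucket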
Assume that your buckets are somewhat arbitrary in size (like between 55 and 95, or between 300 and 366); then you can use a binning approach to classify a value into a bin range. The cut-offs for the various bins can be anything you want, as long as they increase from left to right.
Assume these bin values:
bins = [0, 100, 1000, 10000]
Then:
bin 1 -- 0    <= x < 100
bin 2 -- 100  <= x < 1000
bin 3 -- 1000 <= x < 10000
You can use numpy digitize to do this:
import numpy as np
bins=np.array([0.0,100,1000,10000])
prices=np.array([90, 92, 95, 99, 1013, 1100])
inds=np.digitize(prices,bins)
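Counting how many prices fall into each bin is then one more call (my own follow-up; np.bincount counts the occurrences of each index):
counts = np.bincount(inds, minlength=len(bins))
print(counts)   # [0 4 0 2] -> 4 prices in bin 1, 2 prices in bin 3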
You can also do this in pure Python:
bins = [0.0, 100, 1000, 10000]
tests = list(zip(bins, bins[1:]))   # list() so the pairs can be reused for every price
prices = [90, 92, 95, 99, 1013, 1100]

inds = []
for price in prices:
    if price < min(bins) or price > max(bins):
        idx = -1
    else:
        for idx, test in enumerate(tests, 1):
            if test[0] <= price < test[1]:
                break
    inds.append(idx)
Then classify by bin (from the result of either approach above):
for i, e in enumerate(prices):
    print("{} <= {} < {} bin {}".format(bins[inds[i] - 1], e, bins[inds[i]], inds[i]))
0.0 <= 90 < 100 bin 1
0.0 <= 92 < 100 bin 1
0.0 <= 95 < 100 bin 1
0.0 <= 99 < 100 bin 1
1000 <= 1013 < 10000 bin 3
1000 <= 1100 < 10000 bin 3
Then filter out the values of interest (bin 1) versus the outliers (bin 3):
>>> my_prices = [price for price, b in zip(prices, inds) if b == 1]
>>> my_prices
[90, 92, 95, 99]
I think scatter plots are underrated for this sort of thing. I recommend plotting the distribution of prices, then choosing threshold(s) that look right for your data, then adding any descriptive stats by group that you want.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Reproduce your data
prices = pd.DataFrame(pd.Series([90, 92, 95, 99, 1013, 1100]), columns=['price'])
# Add an arbitrary second column so I have two columns for the scatter plot
prices['label'] = 'price'

# jitter=True spreads your data points out horizontally, so you can see
# clearly how much data you have in each group (groups based on vertical space)
sns.stripplot(data=prices, x='label', y='price', jitter=True)
plt.show()
Any number between 200 and 1,000 separates your data nicely. I'll arbitrarily choose 200; maybe you'll choose different threshold(s) with more data.
# Add group labels, Get average by group
prices['price group'] = pd.cut(prices['price'], bins=(0,200,np.inf))
prices['group average'] = prices.groupby('price group')['price'].transform(np.mean)
price label price group group average
0 90 price (0, 200] 94.0
1 92 price (0, 200] 94.0
2 95 price (0, 200] 94.0
3 99 price (0, 200] 94.0
4 1013 price (200, inf] 1056.5
5 1100 price (200, inf] 1056.5
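To close the loop on the original pricing question, here is a small sketch of my own, building on the prices frame above: take the group with the most observations and use its average as the suggested price.
# Count rows per price group and keep the group with the most sales
biggest_group = prices['price group'].value_counts().idxmax()
suggested_price = prices.loc[prices['price group'] == biggest_group, 'price'].mean()
print(suggested_price)   # 94.0 for this data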
Imagine the following dataframe is given.
import pandas as pd
products = ['Apple', 'Apple', 'Carrot', 'Eggplant', 'Eggplant', 'Eggplant']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07', '2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
        'customer_demand_date': customer_demand_date,
        '02_2023': col_02_2023,
        '03_2023': col_03_2023,
        '04_2023': col_04_2023,
        '05_2023': col_05_2023}
df = pd.DataFrame(data)
print(df)
   Products customer_demand_date  02_2023  03_2023  04_2023  05_2023
0     Apple           2023-01-01        0       20       10       40
1     Apple           2023-01-07       20       30       40       40
2    Carrot           2023-01-01        0       10       50       60
3  Eggplant           2023-01-01        0        0       30       50
4  Eggplant           2023-01-07        0       10       40       60
5  Eggplant           2023-01-14        0        0       10       20
I have the columns Products and customer_demand_date (every week there is a new customer demand for products for the upcoming months), plus one column per month with the demanded quantity.
How can I determine which product has experienced the most frequent changes in customer demand over the months, and sort the products in descending order of change frequency?
I have tried grouping by product and accumulating the demand quantity, but none of my attempts analyze the data both horizontally (per customer demand date) and vertically (per month).
Desired output:
Sorted products    Ranking (or %, or count of changes)
Eggplant           1 (or 70% or 13)
Apple              2 (or 52% or 8)
Carrot             3 (or 22% or 3)
Either a ranking, a % change frequency, or a count of changes is fine.
Note: the percentages in the desired output are made-up numbers.
I'd really appreciate any clever approach to solving this problem.
Thanks
One way is to define a function that counts horizontal and vertical changes, which you can then apply to each group individually.
import pandas as pd
from io import StringIO
def change_freq(x, months):
    # count horizontal changes
    chngs_horizontal = x[months].diff(axis=1).fillna(0).astype(bool).sum().sum()
    # count vertical changes
    chngs_vertical = x[months].diff(axis=0).fillna(0).astype(bool).sum().sum()
    return chngs_horizontal + chngs_vertical
# sample data
data = StringIO("""
Products,customer_demand_date,02_2023,03_2023,04_2023,05_2023
Apple,2023-01-01,0,20,10,40
Apple,2023-01-07,20,30,40,40
Carrot,2023-01-01,0,10,50,60
Eggplant,2023-01-01,0,0,30,50
Eggplant,2023-01-07,0,10,40,60
Eggplant,2023-01-14,0,0,10,20
""")
df = pd.read_csv(data, sep=",")
# count horizontal and vertical changes by product
result = df.groupby('Products').apply(change_freq, ['02_2023','03_2023','04_2023','05_2023'])
result = result.sort_values(ascending=False).to_frame('count_changes')
result['rank'] = result['count_changes'].rank(ascending=False)
This returns
          count_changes  rank
Products
Eggplant             13   1.0
Apple                 8   2.0
Carrot                3   3.0
To count the changes in the y direction (down the demand-date rows), a groupby("Products") with a lambda function can be used:
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
To count the changes in the x direction (across the month columns), pct_change along the columns can be used, interpreted as True/False via astype(bool):
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
Adding / grouping both together finally looks like this:
Products sum_x sum_y sum_xy
0 Apple 5 3 8
1 Carrot 3 0 3
2 Eggplant 7 6 13
Below is the complete code:
import pandas as pd
import numpy as np
products = ['Apple', 'Apple', 'Carrot', 'Eggplant', 'Eggplant', 'Eggplant']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07','2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
var_y["sum_y"]=var_y.iloc[:,1:].sum(axis="columns")
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
var_x_sum=var_x.groupby("Products", as_index=False).agg(sum_x=('sum_x','sum'))
var_total=pd.concat([var_x_sum,var_y["sum_y"]],axis=1)
var_total["sum_xy"]=var_total.iloc[:,1:].sum(axis="columns")
print(var_total)
I have two DataFrames as follows:
import numpy as np
import pandas as pd

df_discount = pd.DataFrame(data={'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)})
df_values = pd.DataFrame(data={'Sum': [20, 801, 972, 1061, 1251]})
Now my goal is to get a new column df_values['New Sum'] that applies the corresponding discount to df_values['Sum'] based on the value of df_discount['Graduation']. If the Sum is >= the Graduation, the corresponding discount is applied.
Examples: Sum 801 should get a discount of 40%, resulting in 480.6; Sum 1061 gets 45%, resulting in 583.55.
I know I could write a function with if/else conditions and return the values. However, is there a better way to do this if you have very many different conditions?
You could check whether pd.merge_asof() works for you:
df_discount = pd.DataFrame({
'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)
})
df_values = pd.DataFrame({'Sum': [20, 100, 101, 350, 801, 972, 1061, 1251]})
df_values = (
pd.merge_asof(
df_values, df_discount,
left_on="Sum", right_on="Graduation",
direction="backward"
)
.assign(New_Sum=lambda df: df["Sum"] * (1 - df["Discount %"] / 100))
.drop(columns=["Graduation", "Discount %"])
)
Result (without the last .drop(columns=...) to see what's happening):
Sum Graduation Discount % New_Sum
0 20 0 0 20.00
1 100 100 5 95.00
2 101 100 5 95.95
3 350 300 15 297.50
4 801 800 40 480.60
5 972 900 45 534.60
6 1061 900 45 583.55
7 1251 900 45 688.05
pandas.cut() is made for problems like this, where you need to segment your data into bins (i.e. a discount % based on a value range).
First define the column, the ranges, and the corresponding bins.
# The column we need to segment
col = df_values['Sum']
# The bin edges: [0, 100, 200, ..., 900, inf] gives the ranges [0, 100), [100, 200), ..., [900, inf)
graduation = np.append(df_discount['Graduation'], np.inf)
# For each range, the corresponding label (i.e. the discount)
discount = df_discount['Discount %']
Now call pandas.cut() and do the discount calculation.
# right=False makes the bins left-inclusive, so a Sum exactly equal to a
# Graduation still gets that Graduation's discount (Sum >= Graduation)
df_values['Discount %'] = pd.cut(col, graduation, labels=discount, right=False)
# Convert the categorical label to an int for the calculation
df_values['Discount %'] = df_values['Discount %'].astype(int)
df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)
Sum Discount % New Sum
0 20 0 20.00
1 801 40 480.60
2 972 45 534.60
3 1061 45 583.55
4 1251 45 688.05
You can use pandas.DataFrame.mask. Basically, it replaces a value wherever your condition is true. For that to work, your Sum column has to be in the first dataframe.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html
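As a rough sketch of that idea (my own addition, assuming df_discount and df_values as defined in the question): walk the graduations in ascending order and let mask() overwrite the discount wherever Sum clears the next threshold.
import numpy as np
import pandas as pd

df_discount = pd.DataFrame({'Graduation': np.arange(0, 1000, 100), 'Discount %': np.arange(0, 50, 5)})
df_values = pd.DataFrame({'Sum': [20, 801, 972, 1061, 1251]})

df_values['Discount %'] = 0
for grad, disc in zip(df_discount['Graduation'], df_discount['Discount %']):
    # mask() replaces values where the condition is True; later (higher)
    # graduations overwrite earlier ones, so each row ends up with the
    # discount of the highest graduation its Sum reaches
    df_values['Discount %'] = df_values['Discount %'].mask(df_values['Sum'] >= grad, disc)

df_values['New Sum'] = df_values['Sum'] * (1 - df_values['Discount %'] / 100)
print(df_values)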
New student to Python and struggling with a task at the moment. I'm trying to produce a scatter plot from some data in a pandas table and can't seem to work it out.
Here is a sample of my data set:
import pandas as pd
data = {'housing_age': [14, 11, 3, 4],
'total_rooms': [25135, 32627, 39320, 37937],
'total_bedrooms': [4819, 6445, 6210, 5471],
'population': [35682, 28566, 16305, 16122]}
df = pd.DataFrame(data)
I'm trying to draw a scatter plot of the data in housing_age, but I'm having some difficulty figuring it out.
Initially I tried for the x axis to be 'housing_data' and the y axis to be a count of the housing data, but couldn't get the code to work. Then I read somewhere that the x axis should be the variable and the y axis should be constant, so I tried this code:
x='housing_data'
y=[0,5,10,15,20,25,30,35,40,45,50,55]
plt.scatter(x,y)
ax.set_xlabel("Number of buildings")
ax.set_ylabel("Age of buildings")
but get this error:
ValueError: x and y must be the same size
Note - the data in 'housing_age' ranges from 1-53 years.
I imagine this should be a pretty easy thing, but for some reason I can't figure it out.
Does anyone have any tips?
I understand you are just getting started, so confusion is common. Please bear with me.
From your description, it looks like you swapped x and y:
import matplotlib.pyplot as plt

# x is the categories: 0-5 yrs, 5-10 yrs, ...
x = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
# y is the number of observations in each category
# I just assigned it some random numbers
y = [78, 31, 7, 58, 88, 43, 47, 87, 91, 87, 36, 78]
plt.scatter(x, y)
plt.title('Housing Data')
Generally, if you have a list of observations and you want to count them across a number of categories, it's called a histogram. pandas has many convenient functions to give you a quick look at the data. The one of interest for this question is hist - create a histogram:
import numpy as np
import pandas as pd

# A series of 100 random building ages between 1 and 54 (randint's upper bound is exclusive)
buildings = pd.Series(np.random.randint(1, 55, 100))
# Make a histogram with 10 bins
buildings.hist(bins=10)

# The edges of those bins were determined automatically, so they appear a bit weird:
pd.cut(buildings, bins=10)
0 (22.8, 28.0]
1 (7.2, 12.4]
2 (33.2, 38.4]
3 (38.4, 43.6]
4 (48.8, 54.0]
...
95 (48.8, 54.0]
96 (22.8, 28.0]
97 (12.4, 17.6]
98 (43.6, 48.8]
99 (1.948, 7.2]
You can also set the bins explicitly: 0-5, 5-10, ..., 50-55
buildings.hist(bins=range(0,60,5))
# Then the edges are no longer random
pd.cut(buildings, bins=range(0,60,5))
0 (25, 30]
1 (5, 10]
2 (30, 35]
3 (40, 45]
4 (45, 50]
...
95 (45, 50]
96 (25, 30]
97 (15, 20]
98 (40, 45]
99 (0, 5]
I have a dataframe as follows,
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
Expected output,
value score
54 scaled value
74 scaled value
71 scaled value
78 50.000
12 600.00
I want to assign a score between 50 and 600 to every value, but the lowest value must get the highest score. Do you have an idea?
Not sure what you want to achieve; maybe you could provide the exact expected output for this input.
But if I understand correctly, you could try:
import pandas as pd
df = pd.DataFrame({'value': [54, 74, 71, 78, 12]})
vmin = df['value'].min()
vmax = df['value'].max()
step = 550 / (vmax - vmin)
df['score'] = 600 - (df['value'] - vmin) * step
print(df)
This will output
value score
0 54 250.000000
1 74 83.333333
2 71 108.333333
3 78 50.000000
4 12 600.000000
This is my idea, but I think there is a scale for your scores that is missing from your question.
import numpy as np

dfmin = df['value'].min()
dfmax = df['value'].max()
dfrange = dfmax - dfmin
score_value = (600 - 50) / dfrange
df.loc[:, 'score'] = np.where(df['value'] == dfmin, 600,
                              np.where(df['value'] == dfmax, 50,
                                       600 - (df['value'] - dfmin) * (1 / score_value)))
df
that produces:
value score
0 54 594.96
1 74 592.56
2 71 592.92
3 78 50.00
4 12 600.00
This doesn't match your output because of the missing scale.
I have two DataFrames:
One - a score card for scoring student marks.
Two - a student dataset.
I want to apply the score card to a given student dataset to compute scores and aggregate them. I'm trying to develop a generic function that takes the score card and applies it to any student marks dataset.
import pandas as pd
score_card_data = {
'subject_id': ['MATHS', 'SCIENCE', 'ARTS'],
'bin_list': [[0,25,50,75,100], [0,20,40,60,80,100], [0,20,40,60,80,100]],
'bin_value': [[1,2,3,4], [1,2,3,4,5], [3,4,5,6,7] ]}
score_card_data = pd.DataFrame(score_card_data, columns = ['subject_id', 'bin_list', 'bin_value'])
score_card_data
student_scores = {
'STUDENT_ID': ['S1', 'S2', 'S3','S4','S5'],
'MATH_MARKS': [10,15,25,65,75],
'SCIENCE_MARKS': [8,15,20,35,85],
'ARTS_MARKS':[55,90,95,88,99]}
student_scores = pd.DataFrame(student_scores, columns = ['STUDENT_ID', 'MATH_MARKS', 'SCIENCE_MARKS','ARTS_MARKS'])
student_scores
What I tried: define the bins, then apply the bins over the columns:
bins = list(score_card_data.loc[score_card_data['subject_id'] == 'MATHS', 'bin_list'])
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'],bins, labels='MATHS_MARKS')
Error: ValueError: object too deep for desired array
I'm trying to convert the cell value to a string, but it is getting detected as an object. Is there any way to resolve this?
How can I make the function more generic?
Thanks
Pari
You can just use bins[0] to extract the inner list; passing the nested list is what raises the ValueError:
>>> bins[0]
[0, 25, 50, 75, 100]
>>> type(bins[0])
<class 'list'>
student_scores['MATH_SCORE'] = pd.cut(student_scores['MATH_MARKS'], bins[0])
STUDENT_ID MATH_MARKS SCIENCE_MARKS ARTS_MARKS MATH_SCORE
0 S1 10 8 55 (0, 25]
1 S2 15 15 90 (0, 25]
2 S3 25 20 95 (0, 25]
3 S4 65 35 88 (50, 75]
4 S5 75 85 99 (50, 75]
I left out the labels because you'd need to provide a list of four labels given there are five cutoffs / bin edges.
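To address the "more generic" part of the question, here is a rough sketch of my own (not part of the accepted approach). It assumes the column naming pattern <SUBJECT>_MARKS from the question and that each bin_value list has one label per bin; subjects whose marks column is missing are skipped, which is where the MATHS vs MATH_MARKS mismatch in the sample data would land.
def apply_score_card(score_card, marks_df):
    """Add a <SUBJECT>_SCORE column for every subject in the score card."""
    scored = marks_df.copy()
    for _, row in score_card.iterrows():
        mark_col = f"{row['subject_id']}_MARKS"   # assumed naming pattern, e.g. SCIENCE_MARKS
        if mark_col not in scored.columns:        # e.g. MATHS_MARKS vs MATH_MARKS in the sample data
            continue
        scored[f"{row['subject_id']}_SCORE"] = pd.cut(scored[mark_col],
                                                      bins=row['bin_list'],
                                                      labels=row['bin_value'])
    return scored

student_scores = apply_score_card(score_card_data, student_scores)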