I am dealing with a dataframe in Python. Here is what I want to do:
1. The same value gets the same rank.
2. The next rank should be incremented by the count of the tied values above it.
This is what I intended:
price  rank
5300   1
5300   1
5300   1
5200   4    < previous rank: 1 + count of 5300s: 3
5200   4    < same value, same rank
5100   6    < previous rank: 4 + count of 5200s: 2
First, I tried the rank(method="dense") function, but it did not work as I expected:
df_sales["rank"] = df_sales["price"].rank(ascending=False, method="dense")
Thank you in advance.
You need to use method='min' and ascending=False:
import pandas as pd

df = pd.DataFrame({'x': [5300, 5300, 5300, 5200, 5200, 5100]})
df['r'] = df['x'].rank(method='min', ascending=False)
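For the sample data above this gives the expected ranks (note that rank returns floats; cast with .astype(int) if you want integers):

      x    r
0  5300  1.0
1  5300  1.0
2  5300  1.0
3  5200  4.0
4  5200  4.0
5  5100  6.0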
From the pandas.Series.rank documentation:
method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}
average: average rank of group
min: lowest rank in group
max: highest rank in group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups
Note that dense specifically increases the rank by only 1 between groups, regardless of how many ties there are, which is why it didn't give the jumps you expected. You want the min option.
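For comparison, a quick check of what method='dense' returns on the same data shows why it didn't match your expectation; it compresses the ranks instead of skipping over ties:

df['x'].rank(method='dense', ascending=False)
# 0    1.0
# 1    1.0
# 2    1.0
# 3    2.0
# 4    2.0
# 5    3.0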
I have a dataframe from which I'm hoping to return a list of all the part IDs that match the minimum cost per segment. The dataframe looks like this:
Segment  Part ID  Cost
1        1        $0.5
1        2        $0.6
1        3        $0.5
1        4        $0.7
2        5        $0.4
2        6        $0.5
2        7        $0.6
Etc.
What I am hoping to end up with is a new dataframe like this:
Segment  Part List  Min. Cost
1        [1,3]      $0.5
2        [5]        $0.4
I'm struggling to get this completed. I've tried a few things:
df['Min Segment Price'] = df.groupby('Segment')['Cost'].transform('min')
This line correctly adds a column to the full dataframe with the minimum price for each segment.
min_part_list = df['Part ID'].loc[df['Cost'].eq(df['Cost'].min())].to_list()
This seems to return only the parts matching the overall minimum cost of the dataframe, not the minimum of each segment.
I also tried this:
df['Segment Min Part ID'] = df['Part ID'].loc[df['Cost'].eq(df['Cost'].min())]
And this returns the part ID only on the row with the cheapest price in the whole dataframe, not the cheapest price per segment. I'm unsure how to add the extra qualification about the per-segment minimum price.
Thanks!
You can use a double groupby: one to filter the rows at each segment's minimum cost, the other to aggregate:
s = pd.to_numeric(df['Cost'].str.strip('$'))  # parse '$0.5' -> 0.5

out = (df[s.eq(s.groupby(df['Segment']).transform('min'))]  # keep rows at each segment's minimum
         .groupby('Segment', as_index=False)
         .agg({'Part ID': list, 'Cost': 'first'})
       )
Output:
   Segment Part ID  Cost
0        1  [1, 3]  $0.5
1        2     [5]  $0.4
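Note that the parsed series s is only used for the comparison; the original '$'-prefixed strings survive into the output via 'first'. If you'd rather have a numeric Cost in the result, one option (assuming '$' is the only non-numeric part) is to assign the parsed values back before aggregating:

df['Cost'] = pd.to_numeric(df['Cost'].str.strip('$'))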
I have a dataset in which I have to identify whether the sale value of a toy is greater than the column average, and count in how many different sale areas the value is greater than the average.
For example: find the average of column "SaleB" (2.5), check for each row whether the value is greater than 2.5, then perform the same exercise for "SaleA" and "SaleC", and add up the results per row.
input_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                           'Color': ['Red','Green','Orange','Blue'],
                           'SaleA': [1,2,0,1],
                           'SaleB': [1,3,4,2],
                           'SaleC': [5,2,3,5]})
A new column "Count_Sale_Average" is created. For example, for toy "A" the sale was greater than the column average at only one position.
output_data = pd.DataFrame({'Toy': ['A','B','C','D'],
                            'Color': ['Red','Green','Orange','Blue'],
                            'SaleA': [1,2,0,1],
                            'SaleB': [1,3,4,2],
                            'SaleC': [5,2,3,5],
                            'Count_Sale_Average': [1,2,1,1]})
My code is working and giving the desired output. Any suggestions on other ways of doing it, maybe more efficient and in fewer lines?
list_var = ['SaleA','SaleB','SaleC']
df = input_data[list_var]
for i in range(0, len(list_var)):
    var = list_var[i]
    mean_var = df[var].mean()
    df[var] = df[var].apply(lambda x: 1 if x > mean_var else 0)
df['Count_Sale_Average'] = df[list_var].sum(axis=1)
output_data = pd.concat([input_data, df[['Count_Sale_Average']]], axis=1)
output_data
You could filter the Sale columns, compare against the column means, and sum along the rows:
filtered = input_data.filter(like='Sale')
input_data['Count_Sale_Average'] = filtered.gt(filtered.mean()).sum(axis=1)
Output:
  Toy   Color  SaleA  SaleB  SaleC  Count_Sale_Average
0   A     Red      1      1      5                   1
1   B   Green      2      3      2                   2
2   C  Orange      0      4      3                   1
3   D    Blue      1      2      5                   1
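For reference, these are the column means the comparison uses, computed from the sample data:

filtered.mean()
# SaleA    1.00
# SaleB    2.50
# SaleC    3.75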
Below is a dataframe with from and to coordinate values, each row having a corresponding value column.
I want to find the ranges of coordinates where the value column doesn't exceed 5. Below is the dataframe input:
import pandas as pd

From = [10,20,30,40,50,60,70]  # capitalised because 'from' is a Python keyword
to = [20,30,40,50,60,70,80]
value = [2,3,5,6,1,3,1]
df = pd.DataFrame({'from': From, 'to': to, 'value': value})
print(df)
Hence I want to convert the table printed above into the following outcome:

from  to  value
10    30  5
30    40  5
40    50  6
50    80  5
Further explanation:
Coordinates from 10 to 30 are joined and the value column becomes 5,
as it's the sum of the values from 10 to 30 (not exceeding 5).
Coordinates 30 to 40 keep the value 5.
Coordinates 40 to 50 keep the value 6 (more than 5; however, it is included as it cannot be divided further).
The remaining coordinates sum up to a value of 5.
What code is required to achieve the above?
We can do a groupby on a cumulative sum: each row whose value is already at least 5 becomes its own group, and the runs of smaller values between them are grouped together:
s = df['value'].ge(5)  # True for rows whose value is already >= 5

(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
   .agg({'from': 'min', 'to': 'max', 'value': 'sum'})
)
Output:
   from  to  value
0    10  30      5
1    30  40      5
2    40  50      6
3    50  80      5
Update: it looks like you want to accumulate the values so that the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a loop, so we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:  # adding v would exceed the threshold: start a new group
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)

df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})
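For the sample data this loop yields the same grouping as above; the resulting index holds the group labels:

   from  to  value
1    10  30      5
2    30  40      5
3    40  50      6
4    50  80      5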
I have a data frame as such:

   Customer  Day
0         A    1
1         A    1
2         A    1
3         A    2
4         B    3
5         B    4
and I want to sample from it, but with a different sample size for each customer. I have the size for each customer in another dataframe. For example:
   Customer  Day
0         A    2
1         B    1
Suppose I want to sample per customer per day. So far I have this function:
def sampling(frame, a):
    return np.random.choice(frame.Id, size=a)

grouped = frame.groupby(['Customer','Day'])
sampled = grouped.apply(sampling, a=??).reset_index()
If I set the size parameter to a global constant, it runs with no problem. But I don't know how to set it when the different values live in a separate dataframe.
You can create a mapper from df1 (the dataframe with the sample sizes) and use the mapped value as the sample size:
mapper = df1.set_index('Customer')['Day'].to_dict()
df.groupby('Customer', as_index=False).apply(lambda x: x.sample(n = mapper[x.name]))
     Customer  Day
0 3         A    2
  2         A    1
1 4         B    3
This returns a MultiIndex; you can always reset_index:
df.groupby('Customer').apply(lambda x: x.sample(n = mapper[x.name])).reset_index(drop = True)
   Customer  Day
0         A    1
1         A    1
2         B    3
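One caveat: sample draws rows at random, so the output will differ between runs. If you need reproducible samples, pass the optional random_state parameter:

df.groupby('Customer').apply(lambda x: x.sample(n=mapper[x.name], random_state=0)).reset_index(drop=True)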
I have data with the variables participants, origin, and score, and I want those participants and origins whose score is greater than 60. That means that to filter participants I have to consider both origin and score; filtering only on origin or only on score will not work.
If anyone can help me, that would be great!
I think you can use boolean indexing:
df = pd.DataFrame({'participants': ['a','s','d'],
                   'origin': ['f','g','t'],
                   'score': [7,80,9]})
print(df)

  origin participants  score
0      f            a      7
1      g            s     80
2      t            d      9
df = df[df.score > 60]
print(df)

  origin participants  score
1      g            s     80
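Since you mentioned having to consider both origin and score: to combine several conditions, chain boolean masks with & and wrap each condition in parentheses. The 'g' below is just an illustrative value, not something from your data:

df[(df.origin == 'g') & (df.score > 60)]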