Brand new to Pandas (Python) here and am cutting my teeth with some lightweight analytics, but am having some difficulty getting started.
I have a spreadsheet with the following data in it:
Fruit,HarvestCount,HarvestDate
Apple,100,08/03/2022
Banana,2500,04/15/2022
Apple,4000,10/11/2022
Pineapple,5,02/07/2022
Pear,250,06/09/2022
Banana,1000,08/11/2022
Orange,20,07/23/2022
Orange,140,11/29/2022
Strawberry,600,12/11/2022
Apple,5000,04/01/2022
Pear,10,07/07/2022
Banana,50,10/19/2022
I am reading this Excel into a dataframe like so:
data = pd.read_excel('fruit-harvests.xlsx', sheet_name='Harvests')
df_temp = pd.DataFrame(data)
Now what I am trying to do is:
collapse the dataframe by "Fruit" name and sum each fruit's Harvest Count; then
figure out who the bottom-quartile performers were (that is, the 25% of fruits with the lowest summed harvest count)
Hence if I did this manually, the collapse + sum would look like:
Apple,9100
Banana,3550
Pineapple,5
Pear,260
Orange,160
Strawberry,600
Sorted by HarvestCount descending it looks like:
Apple,9100
Banana,3550
Strawberry,600
Pear,260
Orange,160
Pineapple,5
Since after the collapse we see there are six (6) distinct fruits, the bottom quartile would be the worst-performing 1.5 fruits, or rounded up, the worst 2 fruits:
Orange,160
Pineapple,5
So from the time I read my Excel into a dataframe, I have to:
Sum/aggregate/collapse
Sort by HarvestCount descending (or ascending, whichever is easier for the next step)
And finally create a new dataframe of the worst-performing 25% of fruits (rows)
Can anyone point me in the right direction here please?
EDITED
Grouping and aggregating the count:
df_temp = df_temp.groupby('Fruit', as_index=False).agg(count=('HarvestCount', 'sum'))
Sorting:
df_temp = df_temp.sort_values('count', ascending=False, ignore_index=True)
25% worst:
import math
percent = math.ceil((25/100) * len(df_temp))
df = df_temp.tail(percent)
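Putting it all together, here is a minimal sketch of the whole pipeline starting again from the raw spreadsheet (assuming the same file, sheet, and column names as above):

import math
import pandas as pd

df = pd.read_excel('fruit-harvests.xlsx', sheet_name='Harvests')

# collapse to one row per fruit with the summed harvest count
summed = (
    df.groupby('Fruit', as_index=False)
      .agg(count=('HarvestCount', 'sum'))
      .sort_values('count', ascending=False, ignore_index=True)
)

# bottom quartile by row count, rounded up (2 of the 6 fruits here)
n_worst = math.ceil(0.25 * len(summed))
worst = summed.tail(n_worst)

With the sample data this leaves Orange (160) and Pineapple (5), as expected.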
Related
I have a dataset that I am trying to group by some common values and then sum up some other values. The tricky part is that I want to add some sort of weighting that keeps the largest number; I'll try to elaborate more below:
I've created a dummy data frame that is along the lines of my data just for example purposes:
df = pd.DataFrame({'Family': ['Contactors', 'Contactors', 'Contactors'],
                   'Cell': ['EP&C', 'EXR', 'C&S'],
                   'Visits': ['25620', '626', '40']})
This produces a table like so:
       Family  Cell Visits
0  Contactors  EP&C  25620
1  Contactors   EXR    626
2  Contactors   C&S     40
So, in this example I would want all of the 'Contactors' to be grouped up under EP&C (as this has the highest visits to start with), but I would like all of the visits summed up and the other 'Cell' values dropped, so I would be left with something like this:
       Family  Cell  Visits
0  Contactors  EP&C   26286
Could anyone advise?
Thanks.
IIUC, you can use:
(df
# convert to numeric
.assign(Visits=pd.to_numeric(df['Visits']))
# ensure the top row per group is the highest visits
.sort_values(by=['Family', 'Visits'], ascending=False)
# group rows per Family
.groupby('Family', sort=False, as_index=False)
# aggregate per group: Cell (first row, i.e. the top one) and Visits (sum of rows)
.agg({'Cell': 'first', 'Visits': 'sum'})
)
output:
Family Cell Visits
0 Contactors EP&C 26286
I have the following setting:
Column of IDs
Column with a binary condition
Column with bonds between IDs
What I am trying to obtain is the distance of each ID from condition==1.
Basically, if an ID has condition==0, then through its bonds with other IDs it can have a distance from the condition > 1.
e.g. 1: an ID has condition==0; it has bonds with 3 other IDs; one of those IDs has condition==1; therefore the initial ID has distance==1.
e.g. 2: take e.g. 1, but this time none of the bonded IDs has condition==1; then the function should look into the bonds of each of those IDs (in the list of bonds) and see if any of them has condition==1 (which would result in distance==2), and so on...
I hope the setting is clear. Here's the dummy dataframe I was using; if you know a computationally better way to structure the bond column, please let me know.
import random
import pandas as pd

id = [i for i in range(50)]

condition = []
for i in range(50):
    n = random.randint(0, 1)
    condition.append(n)

bond = []
for i in range(50):
    x = random.sample(range(50), random.randint(0, 10))
    bond.append(x)

df = pd.DataFrame({'id': id, 'condition': condition})
df['bond'] = pd.Series(bond)
Conceptually, I am working on something like these but its been already a few days since I am stuck:
def distance(id, condition, bond):
    level = 1
    increment = 0
    if condition[id] == 0:
        list_bonds = [bond[id]]
        for i in list_bonds:
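Since this is essentially a shortest-path question, one way to finish it is a breadth-first search from each id, walking outward through the bonds until an id with condition==1 is reached. Below is a minimal sketch against the dummy dataframe above; the function name and lookup dicts are illustrative, it treats an id that itself has condition==1 as distance 0, returns None when no such id is reachable, and follows the bonds only in the direction they are stored:

from collections import deque

# lookup tables built from the dummy dataframe above
cond = dict(zip(df['id'], df['condition']))
bonds = dict(zip(df['id'], df['bond']))

def distance_to_condition(start):
    # breadth-first search outward through the bonds
    if cond[start] == 1:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in bonds[node]:
            if neighbour in seen:
                continue
            seen.add(neighbour)
            if cond[neighbour] == 1:
                return dist + 1
            queue.append((neighbour, dist + 1))
    return None  # no id with condition==1 is reachable

df['distance'] = df['id'].map(distance_to_condition)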
I first found a cycle in a manufacturing process. I collected the 2 largest pressure values from the given cycles and printed them to a new sheet. I now need to capture the corresponding times at which those largest values occur. This portion of my code looks like this:
df2 = df.groupby('group')['Pressure'].nlargest(2).rename_axis(index=['group', 'row_index'])
df2 = df.groupby('group')['Date/Time']
A sample snippet of the data I am trying to extract can be seen here:
Any help on this would be appreciated!
You can sort the data frame and take the last 2 rows per group. Typing this in the blind as you did not provide sample data:
df2 = (
df.sort_values(['group', 'Pressure'])
.groupby('group', sort=False)
.tail(2)
)
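If you prefer to stay closer to the nlargest attempt from the question, another sketch (using the 'group', 'Pressure', and 'Date/Time' column names shown above) is to keep the row index that nlargest returns and use it to look up the matching times:

# the last index level holds the original row labels of the two largest pressures per group
idx = df.groupby('group')['Pressure'].nlargest(2).index.get_level_values(-1)
df2 = df.loc[idx, ['group', 'Date/Time', 'Pressure']]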
I am trying to create a column with the 10-day moving average of points for NBA players. My dataframe has game-by-game statistics for each player, and I would like the moving average column to contain the 10-day moving average at that point. I have tried df.groupby('player')['points'].rolling(10, 1).mean(), but this is just giving me the number of points scored on that day as the moving average. All of the players from each day are listed and then the dataframe moves on to the following day, so I could have a couple hundred rows with the same date but different players' stats. Any help would be greatly appreciated. Thanks.
As stated, you really should provide a sample dataset and show what you are trying to achieve. However, I love working with sports data, so I don't mind putting in the minute or so to get a sample set.
So basically you need to do a rolling mean on a groupby. You'll notice the first nine rows for each player are blank, because there aren't yet 10 dates to take the mean of. You can change that by setting min_periods to 1. Also, when you do this, you want to make sure your data is sorted by date (which here it already is).
import pandas as pd
player_link_list = ['https://www.basketball-reference.com/players/l/lavinza01/gamelog/2021/',
'https://www.basketball-reference.com/players/v/vucevni01/gamelog/2021/',
'https://www.basketball-reference.com/players/j/jamesle01/gamelog/2021/',
'https://www.basketball-reference.com/players/d/davisan02/gamelog/2021/']
dfs = []
for link in player_link_list:
    # the last table on each player's page is the game log
    df = pd.read_html(link)[-1]
    # drop the repeated header rows and games the player was inactive for
    df = df[df['Rk'].ne('Rk')]
    df = df[df['PTS'].ne('Inactive')]
    # player slug taken from the URL
    df['Player'] = link.split('/')[-4]
    df['PTS'] = df['PTS'].astype(int, errors='ignore')
    dfs.append(df)

df = pd.concat(dfs)
df['rolling_10_avg'] = df.groupby('Player')['PTS'].transform(lambda s: s.rolling(10, min_periods=10).mean())
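Stripped of the scraping, the core pattern is a per-player rolling mean via groupby plus transform after sorting by date (a sketch assuming the 'Date', 'Player', and 'PTS' columns used above):

# sort chronologically so each rolling window only looks backwards in time
df = df.sort_values('Date')

# 10-game rolling average per player; min_periods=1 also fills the early games
df['rolling_10_avg'] = (
    df.groupby('Player')['PTS']
      .transform(lambda s: s.rolling(10, min_periods=1).mean())
)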
I need to plot a pie chart of value frequencies from a column of a dataframe, but a lot of low-frequency values appear and the visualization is poor.
The code I wrote is:
df[column].value_counts(normalize=True).plot(kind="pie")
I know that df[column].value_counts(normalize=True) will give me percentages of every unique value, but I want to apply the filter percentage>0.05
What I tried?:
new_df = df[column].value_counts(normalize=True)
but this gives me column as index, so I reset the index
new_df = new_df.reset_index()
and then tried
new_df.plot(kind = "pie")
but nothing appears.
I want some 1 line code that can make something like:
df[column].value_counts(normalize=True).plot(kind="pie" if value_counts > 0.05)
Try this:
df['column'].value_counts()[df['column'].value_counts(normalize=True)>0.05].plot(kind='pie')
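If you also want the slices to keep their normalized shares (and avoid computing value_counts twice), a small variation is:

freq = df['column'].value_counts(normalize=True)
freq[freq > 0.05].plot(kind='pie')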