Max Value in a Data Frame with Groupby using idmax() - python

I have a dataframe that has 10 columns.
I used this code to filter to the rows I want: basically, the rows where the Revision Date is less than the cutoff date (declared variable) and the Job Title is in the provided list.
aggregate = df.loc[(df['RevisionDate'] <= cutoff_date) & (df['JobTitle'].isin(['Production Control Clerk', 'Customer Service Representative III', 'Data Entry Operator I', 'Accounting Clerk II', 'General Clerk III', 'Technical Instructor']))]
Then, I need to group them by the column WD (there are multiple of these), and then by Job Title (again, multiple of these). So I did that by:
aggregate1 = aggregate.groupby(['WD','JobTitle'])
This produces a groupby object that still holds the required rows and all 10 columns.
Then, from this smaller dataframe, I need to pull out only the rows with the highest (max) Revision Number.
aggregate1 = aggregate1.max()
However, this last step produces a dataframe, but with only 3 of the columns: WD, Job Title and Revision Number. I need ALL 10 of the columns.
Based on other questions I've seen posted here, I have tried to use idmax():
df2 = aggregate.loc[aggregate.groupby(['WD','JobTitle'])['RevisionNumber'].idmax()]
but I get this error:
AttributeError: 'SeriesGroupBy' object has no attribute 'idmax'
What am I doing wrong?

If you sort first, you can take the top row of each group:
aggregate.sort_values(by='RevisionNumber', ascending=False).groupby(['WD','JobTitle']).head(1)
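For completeness, the AttributeError in the question comes from the spelling: the pandas reducer is idxmax, not idmax. A minimal sketch of that approach, assuming RevisionNumber is never all-NaN within a group:
idx = aggregate.groupby(['WD', 'JobTitle'])['RevisionNumber'].idxmax()
df2 = aggregate.loc[idx]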

Related

In python pandas, how do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data for marketing campaigns. There are several campaigns, and not all of them are active during all 24 hours of the day. My goal is to eliminate the rows of every campaign-day for which I do not have all 24 hourly rows.
The raw data contains a lot of information, like this:
[original data set image]
I created a dummy variable of ones so that I could count individual rows. This is the code I applied to get the results I want:
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
[results of the two lines of code image]
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data for each campaign-day whose count does not reach 24? The objective is not the counts but the real, ungrouped data, rather than what I show in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
               .loc[lambda x: x > 23].index]
Drop the data you don't want before you do the groupby. You can use .loc or .drop; I am unfamiliar with .query.
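An equivalent way to express the transform-based answer, sketched under the assumption of the same id, date, and Hour column names, is to keep the broadcast count as a boolean mask instead of going through the index:
mask = df.groupby(['id', 'date'])['Hour'].transform('count') > 23
out = df[mask]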

Pandas dataframe: summing cell data from a group of rows, storing in a new column

As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a dataframe with several months of such records.
I want to sum my daily amount into an additional column (in red in the image below).
As you may see, I would like to store it in the first row of each slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But that seems not to be the way to do it.
Grateful for any advice.
Thanks,
Michael
Use the fact that False == 0.
The first row of each date is the one where Date is not equal to Date.shift().
Then merge() the per-date sum back.
import numpy as np
import pandas as pd

## construct a data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20, 300, len(d)).astype(float)
A[np.random.choice(A.size, A.size // 2, replace=False)] = np.nan
df = pd.DataFrame({"datetime": d, "Drank": A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:, ["Date", "Time", "Drank"]]
## construction done
# the first row of each date is the one where Date differs from Date.shift()
# merge the per-date Total back onto those first rows only
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank", "sum")).assign(row=0),
    on=["Date", "row"], how="left").drop(columns="row")
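If broadcasting the daily total to every row is acceptable, a simpler sketch, assuming the same Date and Drank column names as above, uses groupby().transform():
df['Total'] = df.groupby('Date')['Drank'].transform('sum')
# optionally blank it out everywhere except the first row of each date
df.loc[df['Date'].duplicated(), 'Total'] = np.nan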

groupby.mean function dividing by pre-group count rather than post-group count

So I have the following dataset of trade flows that tracks imports and exports by reporting country and partner country. After I remove some unwanted columns, I edit my data frame so that the trade flows between country A and country B are showing. I'm left with something like this:
[my data frame image]
My issue is that I want to be able to take the average of imports and exports for every partner country ('partner_code') per year, but when I run the following:
x = df[(df.location_code.isin(["IRN"])) &
       df.partner_code.isin(['TCD'])]
grouped = x.groupby(['partner_code']).mean()
I end up getting an average of all exports divided by all instances where there is a 'product_id' (so a much higher number) rather than averaging imports or exports by total for all the years.
Taking the average of the following 5 export values gives an incorrect average:
[5 export values and wrong average images]
In pandas, we can group by multiple columns; based on my understanding, you want to group by partner code, location code, and year.
The following line would work:
df = df.groupby(['partner_code', 'location_code', 'year'])[['import_value', 'export_value']].mean()
Please note that the resulting dataframe has a MultiIndex.
For reference, the official documentation: DataFrame.groupby documentation
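If the MultiIndex gets in the way of later processing, a sketch of the same aggregation with flat columns, assuming the same column names as above, is:
out = (df.groupby(['partner_code', 'location_code', 'year'], as_index=False)
         [['import_value', 'export_value']].mean())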

I am not able to correctly assign a value to a df row based on 3 conditions (checking values in 3 other columns)

I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique product-month relative to the sum of sales of the family-month. For example, the family FISH sold 100,000 in month 1, so in this specific case it would be calculated as 10,000 / 100,000 (product-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family']==family)&(testingAgain['Month']==month)&(testingAgain['SKU']==sku)]['Qty'].sum()
            proporcion = salesSKUMonth/salesFamilyMonth
            testingAgain[(testingAgain['SKU']==sku)&(testingAgain['Family']==family)&(testingAgain['Month']==month)]['ProporcionVenta'] = proporcion
The code works and runs, and I have even individually printed the proportions and recalculated them in Excel, and they are correct, but the problem is with the last line. As soon as the code finishes running, I print testingAgain and see all proportions listed as 0.0, even though the new values should have been assigned.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (even NumPy), unlike general-purpose Python, analysts should avoid using for loops, as there are many vectorized options to run conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, aggregates broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU']==sku) &
                 (testingAgain['Family']==family) &
                 (testingAgain['Month']==month), 'ProporcionVenta'] = proporcion
However, avoid looping, since transform works nicely to assign new data frame columns. Also, below, div is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                   )
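A quick sanity check of the transform approach on toy data (using the question's column names; the numbers here are made up) shows the proportions summing to 1 within each Family-Month group:
import pandas as pd

toy = pd.DataFrame({'SKU': [1, 2, 1, 2],
                    'Family': ['FISH'] * 4,
                    'Month': [1, 1, 2, 2],
                    'Qty': [10000.0, 90000.0, 5000.0, 15000.0]})
toy['ProporcionVenta'] = (toy.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                          .div(toy.groupby(['Family', 'Month'])['Qty'].transform('sum')))
print(toy.groupby(['Family', 'Month'])['ProporcionVenta'].sum())  # each group sums to 1.0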

how to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create one dataframe per unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code, so technically I want to find the length and iterate over each dataframe normally, as I would by typing its name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the dataframe has more than two columns, where the key would be the unique group and the value contains the remaining columns of that group.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group  Description                         Document
Group A           Text to be updated on the ticket 1  doc1.pdf
Group B           Text to be updated on the ticket 2  doc2.pdf
Group A           Text to be updated on the ticket 3  doc3.pdf
Group B           Text to be updated on the ticket 4  doc4.pdf
Group A           Text to be updated on the ticket 5  doc5.pdf
Group B           Text to be updated on the ticket 6  doc6.pdf
Group C           Text to be updated on the ticket 7  doc7.pdf
Group C           Text to be updated on the ticket 8  doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, a unique task has to be created for each description. We can club 10 tasks together and submit them as one request, so if I divide the dataframe into separate dataframes based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of a price column for each group if you had one.
I think you want to try something like len(eval('df%s' % 0))
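A safer pattern than eval or globals is to build the per-group dataframes with groupby and iterate over a dictionary, much like the df_dict in the question. A sketch, assuming the column is named 'Assignment Group' as above:
df_dict = {name: group for name, group in df.groupby('Assignment Group')}
for name, group in df_dict.items():
    print(name, len(group))  # e.g. the length of each per-group dataframe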
