python pandas averaging columns to produce new ones

I have a Pandas DataFrame with the following data, displaying the hours worked per week for employees at a company:
name    week 1  week 2  week 3  week 4 ...
joey    20      15      35      10
thomas  20      10      25      15
mark    30      20      25      10
sal     25      25      15      20
amy     25      30      20      10
Assume the data carries on in the same way for 100+ weeks.
What I want to produce is a biweekly average of hours for each employee,
i.e. the average hours worked over each two-week span, as shown in the following DataFrame:
name    weeks 1-2  weeks 3-4 ...
joey    17.5       22.5
thomas  15         20
mark    25         17.5
sal     25         17.5
amy     27.5       15
How could I make this work? I'm trying iteration right now but I'm stuck.

You can achieve that with the following (assuming 'name' is the index, so every remaining column is a week):
for i in range(0, len(df.columns), 2):
    df[f'weeks {i+1}-{i+2}'] = df.iloc[:, i:i+2].mean(axis=1)
This code iterates through the column positions with a step of 2. On each pass it selects the column at the current position (variable i) together with the following one, averages the two with mean(axis=1), and stores the result in a new column. Note the slice must be i:i+2, since the end of an iloc slice is exclusive; also, len(df.columns) is evaluated once when the range is built, so the newly added columns do not extend the loop.
It assumes the columns are properly ordered.
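If you'd rather avoid the Python-level loop, here is a hedged vectorized sketch (assuming, as above, that 'name' has been set as the index and the week count is even): group the columns in positional pairs and take the mean of each pair.
import numpy as np

pairs = np.arange(df.shape[1]) // 2          # [0, 0, 1, 1, ...]
biweekly = df.T.groupby(pairs).mean().T      # mean of each column pair
biweekly.columns = [f'weeks {2*i + 1}-{2*i + 2}' for i in biweekly.columns]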


Add a column in pandas based on sum of the subgroup values in another column [duplicate]

This question already has answers here:
How do I create a new column from the output of pandas groupby().sum()?
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
df = pd.DataFrame({'Person': ['John','David','Mary','John','David','Mary'],
                   'Sales': [10,15,20,11,12,18],
                  })
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this dataframe, holding the total sales per person:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method, which applies a function to each group and returns a result aligned with the original index:
df['Total'] = df.groupby('Person')['Sales'].transform('sum')
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
The easiest way to achieve this is by using the pandas groupby and sum functions, then mapping the per-person sums back onto each row (assigning the grouped sum directly would not align, because its index is 'Person' rather than the original row labels):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the result of a plain groupby cannot be assigned directly as a new column (its shape does not match df). I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = df.groupby('Person')['Sales'].sum().reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.
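If you then want those totals back on the original rows, one hedged option is to rename the summed column and merge (a sketch based on the newDf above):
totals = newDf.rename(columns={'Sales': 'Total'})  # per-person totals
df = df.merge(totals, on='Person')                 # adds 'Total' to every row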

How to prioritize specific item data when dropping from a data frame in Python

Hi, I have a question about dataframes in Python.
There is a dataframe shown below, and I want to remove some duplicate data.
If all the conditions are the same, remove the row above (Jack's case).
If all conditions except the name and quarter are the same, remove David's row.
The first is possible, but I don't know how to do the second.
Thank you.
drop_df = df.drop_duplicates(subset=['Name'], keep='last')
(input data)
Name   quarter  math  physics
Jack   1Q       90    100
Jack   2Q       90    100
Kevin  1Q       45    20
David  1Q       15    60
Adam   1Q       15    60
David  2Q       40    75
Adam   2Q       40    75
(wanted data)
Name   quarter  math  physics
Jack   2Q       90    100
Kevin  1Q       45    20
Adam   1Q       15    60
Adam   2Q       40    75
You mentioned a pair of DROP criteria:
1. certain conditions C are the same
2. the same conditions C plus a matching quarter
So (2.) is more specific than (1.): the rows matched by (2.) are a subset of the rows matched by (1.).
Begin by dropping rows using (2.), then go on to drop the relevant surviving rows using (1.). A sketch of both steps follows.
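Here is a minimal sketch of that two-step drop, assuming the dataframe from the question; the rule for step (2.) hard-codes the name 'David' because the question singles out his rows:
import pandas as pd

df = pd.DataFrame({
    'Name':    ['Jack', 'Jack', 'Kevin', 'David', 'Adam', 'David', 'Adam'],
    'quarter': ['1Q', '2Q', '1Q', '1Q', '1Q', '2Q', '2Q'],
    'math':    [90, 90, 45, 15, 15, 40, 40],
    'physics': [100, 100, 20, 60, 60, 75, 75],
})

# step (2.): same quarter and same scores under different names ->
# the question says to drop David's row in that case
dup = df.duplicated(subset=['quarter', 'math', 'physics'], keep=False)
df = df[~(dup & (df['Name'] == 'David'))]

# step (1.): same Name and same scores -> keep the later row (Jack's case)
df = df.drop_duplicates(subset=['Name', 'math', 'physics'], keep='last')
print(df)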

Make new column from slice of string from one column pandas python

I read the answer at the link Pandas make new column from string slice of another column, but it does not solve my problem.
df
SKU          Noodles  FaceCream  BodyWash  Powder  Soap
Jan10_Sales      122        100        50     200   300
Feb10_Sales      100         50        80      90   250
Mar10_sales       40         30       100      10    11
and so on
Now I want month and year columns that take their values from the SKU column: Jan for the month and 10 for the year (2010).
df['month']=df['SKU'].str[0:3]
df['year']=df['SKU'].str[4:5]
I get KeyError: 'SKU'
Doing other things to understand why the error, I perform the following:
[IN]df.index.name
[OUT]None
[IN]df.columns
[OUT]Index(['Noodles','FaceCream','BodyWash','Powder','Soap'], dtype='object', name='SKU')
Please help
I think the first column is the index, so use .index. Also, for the year, change the 4:5 slice to 3:5 (the 0 in 0:3 can be omitted):
df['month']=df.index.str[:3]
df['year']=df.index.str[3:5]
print (df)
             Noodles  FaceCream  BodyWash  Powder  Soap month year
SKU
Jan10_Sales      122        100        50     200   300   Jan   10
Feb10_Sales      100         50        80      90   250   Feb   10
Mar10_sales       40         30       100      10    11   Mar   10
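A hedged alternative, assuming the index always looks like '<Mon><YY>_Sales': one regex pass with str.extract pulls both pieces at once.
parts = df.index.str.extract(r'([A-Za-z]{3})(\d{2})')  # 3-letter month, 2-digit year
df['month'] = parts[0].values
df['year'] = parts[1].values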

pandas histogram: extracting column and group by from data

I have a dataframe for which I'm looking at histograms of subsets of the data using the column and by arguments of pandas' hist() method, as in:
ax = df.hist(column='activity_count', by='activity_month')
(then I go along and plot this info). I'm trying to determine how to programmatically pull out two pieces of data as I loop over the axes: the value of 'activity_month' for each subplot, and the number of records with that value:
for i, x in enumerate(ax):
    print("the value of a is", a)
    print("the number of rows with value of a is", b)
so that I'd get:
January 1002
February 4305
etc
Now, I can easily get the list of unique values of "activity_month", as well as a count of how many rows have activity_month equal to a given value:
a = "January"
len(df[df["activity_month"] == a])
but I'd like to do that within the loop, for a particular iteration of i, x. How do I get a handle on the subsetted data within "x" on each iteration, so I can look at the value of "activity_month" and the number of rows with that value on that iteration?
Here is a short example dataframe:
import pandas as pd
df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
['November',4],['February',98],['January',44],['October',47],['January',4],
['April',8],['March',21],['April',41],['June',34],['March',63]],
columns=['activity_month','activity_count'])
Yields:
activity_month activity_count
0 January 19
1 March 6
2 January 24
3 November 83
4 February 23
5 November 4
6 February 98
7 January 44
8 October 47
9 January 4
10 April 8
11 March 21
12 April 41
13 June 34
14 March 63
If you want the sum of the values for each group from your df.groupby('activity_month'), then this will do:
df.groupby('activity_month')['activity_count'].sum()
Gives:
activity_month
April 49
February 121
January 91
June 34
March 90
November 87
October 47
Name: activity_count, dtype: int64
To get the number of rows that correspond to a given group:
df.groupby('activity_month')['activity_count'].agg('count')
Gives:
activity_month
April 2
February 2
January 4
June 1
March 3
November 2
October 1
Name: activity_count, dtype: int64
After re-reading your question, I'm convinced that you are not approaching this problem in the most efficient manner. I would highly recommend that you do not explicitly loop through the axes you have created with df.hist(), especially when this information is quickly (and directly) accessible from df itself.
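For instance, a minimal sketch of that direct route: iterating over the groupby yields each group's label and its rows, which are exactly the two numbers the question asks for.
for month, group in df.groupby('activity_month'):
    print(month, len(group))   # e.g. January 4, February 2, ...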

data cleaning a python dataframe

I have a Python dataframe with 1408 lines of data. My goal is to compare the largest and smallest numbers associated with a given weekday during one week to the next week's numbers on the same day of the week. Essentially, I want to look at quintiles (since there are 5 days in a business week), ranks 1 and 5, and see how they change from week to week, building a cdf of the numbers associated with each weekday.
1. To clean the data, I need to remove 18 weeks in total: every week in the dataframe associated with a holiday, plus the entire week following the week in which the holiday occurred.
2. After this, I think I should insert a column in the dataframe that labels all my data Monday through Friday, for all the dates in the file (there are 6 years of data). The reason for labeling M-F is so that I can sort the numbers associated with each day of the week in ascending order, and query on the day of the week.
Methodological suggestions on either 1. or 2. or both would be immensely appreciated.
Thank you!
#2 seems like it's best tackled with a combination of df.groupby() and apply() on the resulting Groupby object. Perhaps an example is the best way to explain.
Given a dataframe:
In [53]: df
Out[53]:
Value
2012-08-01 61
2012-08-02 52
2012-08-03 89
2012-08-06 44
2012-08-07 35
2012-08-08 98
2012-08-09 64
2012-08-10 48
2012-08-13 100
2012-08-14 95
2012-08-15 14
2012-08-16 55
2012-08-17 58
2012-08-20 11
2012-08-21 28
2012-08-22 95
2012-08-23 18
2012-08-24 81
2012-08-27 27
2012-08-28 81
2012-08-29 28
2012-08-30 16
2012-08-31 50
In [54]: def rankdays(df):
.....: if len(df) != 5:
.....: return pandas.Series()
.....: return pandas.Series(df.Value.rank(), index=df.index.weekday)
.....:
In [52]: df.groupby(lambda x: x.week).apply(rankdays).unstack()
Out[52]:
0 1 2 3 4
32 2 1 5 4 3
33 5 4 1 2 3
34 1 3 5 2 4
35 2 5 3 1 4
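The answer above covers #1... sorry, #2; for #1, here is a hedged sketch of dropping each holiday week plus the following week, assuming df has a DatetimeIndex and a hypothetical holidays list of dates (the date below is a placeholder):
import pandas as pd

holidays = pd.to_datetime(['2012-09-03'])

# flag each holiday's (iso year, iso week), and the week after it
bad_weeks = set()
for h in holidays:
    for d in (h, h + pd.Timedelta(days=7)):
        y, w, _ = d.isocalendar()
        bad_weeks.add((y, w))

# keep only the rows whose week is not flagged
keep = [ts.isocalendar()[:2] not in bad_weeks for ts in df.index]
df_clean = df[keep]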
