Transforming Dataframe from days to weeks and aggregating quantity column - python

This is a tricky one and I'm having a difficult time aggregating this data by week. Starting on 5/26/20, what is the total quantity for each week? That is the desired dataframe. My data has 3 months' worth of data points, and some products have 0 quantities in a given week; this needs to be reflected in the desired df.
Original DF:
Product Date Qty
A 5/26/20 4
A 5/28/20 2
A 5/31/20 2
A 6/02/20 1
A 6/03/20 5
A 6/05/20 2
B 5/26/20 1
B 5/27/20 8
B 6/02/20 2
B 6/06/20 10
B 6/14/20 7
Desired DF
Product Week Qty
A 1 9
A 2 7
A 3 0
B 1 11
B 2 10
B 3 7

We can do it with transform: subtract each product's minimum date to build a week number, then group and sum.
# week number: days since each product's first date, floor-divided by 7
# (Date must already be a datetime column for the .dt accessor to work)
s = (df.Date - df.groupby('Product').Date.transform('min')).dt.days // 7 + 1
# sum Qty per Product and week; unstack/stack fills the missing weeks with 0
s = df.groupby([df.Product, s]).Qty.sum().unstack(fill_value=0).stack().reset_index()
s
Out[348]:
Product Date 0
0 A 1 8
1 A 2 8
2 A 3 0
3 B 1 9
4 B 2 12
5 B 3 7
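
For reference, a minimal self-contained sketch of the same approach (an illustration, not the original answer's exact code), assuming the dates first need to be parsed with pd.to_datetime, since the .dt accessor requires datetime values:

import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({
    'Product': ['A'] * 6 + ['B'] * 5,
    'Date': ['5/26/20', '5/28/20', '5/31/20', '6/02/20', '6/03/20', '6/05/20',
             '5/26/20', '5/27/20', '6/02/20', '6/06/20', '6/14/20'],
    'Qty': [4, 2, 2, 1, 5, 2, 1, 8, 2, 10, 7],
})
df['Date'] = pd.to_datetime(df['Date'])

# week number relative to each product's first date
week = ((df.Date - df.groupby('Product').Date.transform('min')).dt.days // 7 + 1).rename('Week')

out = (df.groupby([df.Product, week]).Qty.sum()
         .unstack(fill_value=0).stack().reset_index(name='Qty'))
print(out)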

Related

How to sort the data wrt final output?

I want to group my dataframe by two columns and then sort the aggregated results within the groups.
In [167]:df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
df.groupby(['job','source']).agg({'count':sum})
Out[168]:
job source count
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
I would now like to sort the count column in descending order within each of the groups, and then take only the top three rows, to get something like:
job source count
market A 5
D 4
B 3
sales E 7
C 6
B 4
I also want to order the groups themselves by job: if the sum of count for sales is higher, I want the data to be printed as
job source count
sales E 7
C 6
B 4
market A 5
D 4
B 3
I am unable to get the top rows per job.
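
For reference, a minimal construction of the sample frame used in the answers below (a sketch, assuming the column order shown above):

import pandas as pd

df = pd.DataFrame({
    'count': [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    'job': ['sales'] * 5 + ['market'] * 5,
    'source': list('ABCDE') * 2,
})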
IIUC, we can do a further groupby and use nlargest(3) to get the top n values.
Then we can build an ordered list of the jobs by their totals and use it to create a categorical column for sorting.
s = df.groupby(['job','source']).agg({'count':sum}).groupby(level=0)['count']\
      .nlargest(3).reset_index(0, drop=True).to_frame()
# see which of your indices is higher and create a sorting list.
sorter = s.groupby(level=0)['count'].sum().sort_values(ascending=False).index
# Index(['sales', 'market'], dtype='object', name='job')
s['sort'] = pd.Categorical(s.index.get_level_values(0), sorter)
df2 = s.sort_values('sort').drop('sort', axis=1)
print(df2)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3
You could use sort_values, as mentioned in another similar answer, after the aggregation, and then group by job again to get the top N per job, like:
>>> df
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
>>> agg = df.groupby(['job','source']).agg({'count':sum})
>>> agg
count
job source
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
>>> agg.reset_index().sort_values(['job', 'count'], ascending=False).set_index(['job', 'source']).groupby('job').head(3)
count
job source
sales E 7
C 6
B 4
market A 5
D 4
B 3

Assign the frequency of each value to dataframe with new column

I am trying to set up a DataFrame that contains a column called Frequency.
For every row, this column should show how often that row's value appears in a specific column of the dataframe. Something like this:
Index Category Frequency
0 1 1
1 3 2
2 3 2
3 4 1
4 7 3
5 7 3
6 7 3
7 8 1
This is just an example.
I already tried it with value_counts(); however, I only get a value in the last row in which each number appears.
In the case of the example:
Index Category Frequency
0 1 1
1 3 N.A
2 3 2
3 4 1
4 7 N.A
5 7 N.A
6 7 3
7 8 1
It is very important that the column has the same number of rows as the dataframe, preferably appended to the same dataframe.
# count how many times each Category value occurs and broadcast it back to every row
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
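
A minimal sketch of this on the example data (illustrative only, assuming a frame with just the Category column shown above):

import pandas as pd

df = pd.DataFrame({'Category': [1, 3, 3, 4, 7, 7, 7, 8]})
# count of each Category value, broadcast back to every row of the frame
df['Frequency'] = df.groupby('Category')['Category'].transform('count')
print(df)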
Use pandas.Series.map:
df['Frequency'] = df['Category'].map(df['Category'].value_counts())
or pandas.Series.replace:
df['Frequency'] = df['Category'].replace(df['Category'].value_counts())
Output:
Index Category Frequency
0 0 1 1
1 1 3 2
2 2 3 2
3 3 4 1
4 4 7 3
5 5 7 3
6 6 7 3
7 7 8 1
Details
df['Category'].value_counts()
7 3
3 2
4 1
1 1
8 1
Name: Category, dtype: int64
Using value_counts you get a Series whose index holds the elements of Category and whose values are the counts. So you can use map or pandas.Series.replace to create a Series with the Category values replaced by their counts, and finally assign this Series to the Frequency column.
You can do it using groupby like below:
df.groupby("Category") \
  .apply(lambda g: g.assign(frequency=len(g))) \
  .reset_index(level=0, drop=True)

creating dataframe efficiently without for loop

I am working with some advertising data, such as email data. I have two data sets:
one at the mail level that, for each person, states what days they were mailed, and one that states what day they converted.
import pandas as pd
df_emailed=pd.DataFrame()
df_emailed['person']=['A','A','A','A','B','B','B']
df_emailed['day']=[2,4,8,9,1,2,5]
df_emailed
print(df_emailed)
person day
0 A 2
1 A 4
2 A 8
3 A 9
4 B 1
5 B 2
6 B 5
I have a summary dataframe that says whether someone converted, and which day they converted.
df_summary=pd.DataFrame()
df_summary['person']=['A','B']
df_summary['days_max']=[10,5]
df_summary['convert']=[1,0]
print(df_summary)
person days_max convert
0 A 10 1
1 B 5 0
I would like to combine these into a final dataframe that has, for each person:
one row per day from 1 to their max date,
whether they were emailed (0/1) on that day, and
whether they converted (0/1) on the last day in the dataframe.
We are assuming they convert on the last day in the dataframe.
I know how to do this using a nested for loop, but I think that is just incredibly inefficient and sort of dumb. Does anyone know an efficient way of getting this done?
Desired result
df_final=pd.DataFrame()
df_final['person']=['A','A','A','A','A','A','A','A','A','A','B','B','B','B','B']
df_final['day']=[1,2,3,4,5,6,7,8,9,10,1,2,3,4,5]
df_final['emailed']=[0,1,0,1,0,0,0,1,1,0,1,1,0,0,1]
df_final['convert']=[0,0,0,0,0,0,0,0,0,1,0,0,0,0,0]
print(df_final)
person day emailed convert
0 A 1 0 0
1 A 2 1 0
2 A 3 0 0
3 A 4 1 0
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 1 0
8 A 9 1 0
9 A 10 0 1
10 B 1 1 0
11 B 2 1 0
12 B 3 0 0
13 B 4 0 0
14 B 5 1 0
Thank you and happy holidays!
A high-level approach involves modifying df_summary (alias df2) to get our output. We'll need to:
set_index on the days_max column of df2, renaming the index to day (which will help later on)
groupby to group on person
apply a reindex on the index (day), so we get a row for each day leading up to the last day
fillna to fill the NaNs in the convert column generated as a result of the reindex
assign to create a dummy column for emailed that we'll set later.
Next, index into the result of the previous operation using df_emailed. We'll use those values to set the corresponding emailed cells to 1. This is done by MultiIndexing with loc.
Finally, use reset_index to bring the index out as columns.
import numpy as np

def f(x):
    # extend each person's index to cover every day from 1 to their last day
    return x.reindex(np.arange(1, x.index.max() + 1))

df = df2.set_index('days_max')\
        .rename_axis('day')\
        .groupby('person')['convert']\
        .apply(f)\
        .fillna(0)\
        .astype(int)\
        .to_frame()\
        .assign(emailed=0)

# flag the (person, day) pairs that appear in df1 as emailed
df.loc[df1[['person', 'day']].apply(tuple, 1).values, 'emailed'] = 1
df.reset_index()
person day convert emailed
0 A 1 0 0
1 A 2 0 1
2 A 3 0 0
3 A 4 0 1
4 A 5 0 0
5 A 6 0 0
6 A 7 0 0
7 A 8 0 1
8 A 9 0 1
9 A 10 1 0
10 B 1 0 1
11 B 2 0 1
12 B 3 0 0
13 B 4 0 0
14 B 5 0 1
Where
df1 = df_emailed
and,
df2 = df_summary
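
A small optional follow-up (a sketch, not part of the original answer): the result above lists the columns in (convert, emailed) order, so selecting the columns explicitly reproduces the exact layout of the desired output:

# assumes `df` is the frame built in the answer above
df.reset_index()[['person', 'day', 'emailed', 'convert']]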

aggregate under certain condition

I have this data frame.
df = pd.DataFrame({'day':[1,2,1,4,2,3], 'user':['A','B','B','B','A','A'],
'num_posts':[1,2,3,4,5,6]})
I want a new column containing the total number of posts for that user up to the date of that post, excluding that day. What I want looks like this:
user day num_posts total_todate
A 1 1 0
B 2 2 3
B 1 3 0
B 4 4 5
A 2 5 1
A 3 6 6
Any ideas?
You can sort the data frame by day, group by user, calculate the cumulative sum of the num_posts column, and then shift it down by 1:
df['total_todate'] = (df.sort_values('day').groupby('user').num_posts
                        .transform(lambda p: p.cumsum().shift())
                        .fillna(0))
df
# day num_posts user total_todate
#0 1 1 A 0.0
#1 2 2 B 3.0
#2 1 3 B 0.0
#3 4 4 B 5.0
#4 2 5 A 1.0
#5 3 6 A 6.0
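
As an optional follow-up (a sketch, not part of the original answer): fillna promotes the column to float, as the 0.0 values above show, so a final cast restores integers:

df['total_todate'] = df['total_todate'].astype(int)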

How to sum values by value in other columns in pandas in Python?

Hi, I am dealing with some data using pandas.
I am facing a problem, but I'll try to simplify it here.
Suppose I have a dataset that looks like this:
# Incidents Place Month
0 3 A 1
1 5 B 1
2 2 C 2
3 2 B 2
4 6 C 3
5 3 A 1
So I want to sum the # of incidents by the place, that is, I want to have a result like
P #
A 6(3+3)
B 7(5+2)
C 8(2+6)
stored in a pandas DataFrame. I don't care about other columns at this point.
Next question: if I now want to use the data in the Month column as well, I'd like to have a result that looks like
P M #
A 1 6(3+3)
B 1 5
B 2 2
C 2 2
C 3 6
How can I achieve these results in pandas? I have tried groupby and some other functions but I cannot get there.
Any help is appreciated!
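
For reference, a minimal construction of the sample frame used in the answer below (a sketch, assuming the column names shown above):

import pandas as pd

df = pd.DataFrame({
    '# Incidents': [3, 5, 2, 2, 6, 3],
    'Place': ['A', 'B', 'C', 'B', 'C', 'A'],
    'Month': [1, 1, 2, 2, 3, 1],
})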
You can do it in this way:
In [35]: df
Out[35]:
# Incidents Place Month
0 3 A 1
1 5 B 1
2 2 C 2
3 2 B 2
4 6 C 3
5 3 A 1
In [36]: df.groupby('Place')['# Incidents'].sum().reset_index()
Out[36]:
Place # Incidents
0 A 6
1 B 7
2 C 8
In [37]: df.groupby(['Place', 'Month'])['# Incidents'].sum().reset_index()
Out[37]:
Place Month # Incidents
0 A 1 6
1 B 1 5
2 B 2 2
3 C 2 2
4 C 3 6
Please see the pandas documentation for lots of examples.
