Combine rows in a DataFrame & add the value as a column - python

My Dataframe looks like this:
campaign_name campaign_id event_name clicks installs conversions
campaign_1 1234 registration 100 5 1
campaign_1 1234 hv_users_r 100 5 2
campaign_2 2345 registration 500 10 3
campaign_2 2345 hv_users_w 500 10 2
campaign_3 3456 registration 1000 50 10
campaign_4 3456 hv_users_r 1000 50 15
campaign_4 3456 hv_users_w 1000 50 25
I want to categorize all the event names into 2 new columns, where the 1st new column represents "registration" and the 2nd new column represents "hv_users", which will be the sum of all rows having event names of "hv_users_r" and "hv_users_w".
To keep this simple: the "registration" column will hold only the rows whose event_name is "registration"; all non-"registration" event_names go into the new column "hv_users".
This is my expected new Dataframe:
campaign_name campaign_id clicks installs registrations hv_users
campaign_1 1234 100 5 1 2
campaign_2 2345 500 10 3 2
campaign_3 3456 1000 50 10 40
Can someone please give me directions on how to go from the input DataFrame to the output DataFrame?

# Keep conversions only for the matching event types, zero elsewhere
df['hv_users'] = df.conversions.where(df.event_name.str.match(r'hv_users_[rw]'), 0)
df['registrations'] = df.conversions.where(df.event_name == 'registration', 0)
# Sum the hv_users conversions within each campaign
df.hv_users = df.groupby('campaign_id').hv_users.transform('sum')
# Keep one row per campaign (the registration row comes first in your data) and drop event_name
df = df.groupby('campaign_id').head(1).drop('event_name', axis=1)

You can use split + join, then groupby + unstack:
df.assign(event_name=df['event_name'].apply(lambda x: "_".join(x.split("_", 2)[:2]))).\
    groupby(['campaign_name', 'campaign_id', 'clicks', 'installs', 'event_name'])['conversions'].sum().\
    unstack(fill_value=0).reset_index()
Out[302]:
event_name campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0

pd.crosstab() and pd.pivot_table() should do the trick.
# df is your input dataframe
# Collapse both hv_users_* event names into a single "hv_users" category
replacement = {'hv_users_w': 'hv_users', 'hv_users_r': 'hv_users', 'registration': 'registration'}
df.event_name = df.event_name.map(replacement)
# Sum conversions per campaign and event category
df1 = pd.crosstab(df.campaign_name, df.event_name, values=df.conversions, aggfunc='sum').fillna(0)
# Carry over the campaign-level columns (they are constant within a campaign)
df2 = pd.pivot_table(df, index='campaign_name', values=['campaign_id', 'clicks', 'installs'])
output = pd.concat([df1, df2], axis=1)

Try using pivot_table:
# Strip the trailing _r / _w suffix so both variants become "hv_users"
mask = df['event_name'].str.contains('_')
df.loc[mask, 'event_name'] = df.loc[mask, 'event_name'].str.extract('(.*_.*)_.*', expand=False)
new_df = df.pivot_table(index=['campaign_name', 'campaign_id', 'clicks', 'installs'], columns='event_name',
                        values='conversions', aggfunc='sum', fill_value=0).reset_index().rename_axis(None, axis=1)
campaign_name campaign_id clicks installs hv_users registration
0 campaign_1 1234 100 5 2 1
1 campaign_2 2345 500 10 2 3
2 campaign_3 3456 1000 50 0 10
3 campaign_4 3456 1000 50 40 0

Related

pyspark append very large multiple dataframes after each process under for loop (eg: append after daily ETL)

I have to do ETL for each day and then add the result to a single dataframe.
E.g., after each day's ETL, the outputs are as follows:
df1:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
.....
df2:
id category quantity date
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
.....
df3:
id category quantity date
1 abc 500 03-07-18
.....
df4:
id category quantity date
5 jjj 200 04-07-18
7 ddd 100 04-07-18
.....
For each day's ETL, one dataframe needs to be created like df1, df2, df3, ..., and after each day's ETL that dataframe should be appended to the earlier dates' ETL output.
Final output expected:
After day 2 output should be:
finaldf:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
.....
After day 4 output should be:
finaldf:
id category quantity date
1 abc 100 01-07-18
2 deg 175 01-07-18
1 abc 50 02-07-18
2 deg 300 02-07-18
3 zzz 250 02-07-18
1 abc 500 03-07-18
5 jjj 200 04-07-18
7 ddd 100 04-07-18
.....
I have done this in Pandas using the append function, but as the data size is very large I am getting a MemoryError.
Answer for PySpark
Put all the DataFrames into a list and union them:
from functools import reduce

df_list = [df1, df2, df3, df4]
finaldf = reduce(lambda x, y: x.union(y), df_list)
finaldf will contain all the data.
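If the daily frames are produced inside a loop, a minimal sketch of the same idea (run_daily_etl and dates are hypothetical placeholders for your own ETL step and date list):
finaldf = None
for day in dates:  # dates: your list of ETL days (assumption)
    daily_df = run_daily_etl(day)  # hypothetical: returns that day's Spark DataFrame
    finaldf = daily_df if finaldf is None else finaldf.union(daily_df)
Because Spark evaluates lazily, the union only builds up a query plan; nothing is materialized in driver memory the way a pandas append is.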

Calculations and update results in Python data frames

I'm a recent convert from Excel to python. I think that what I'm trying to do here would traditionally be done with a VLOOKUP of sorts, but I might be struggling with the terminology and not finding the python solution. I have been using the pandas library for most of my data analysis.
I have two different data frames: one with the weight changes (DF1), and the other with the weights (DF2). I want to go line by line (changes are chronological) and:
create a new column in DF1 with the weight before the change
(basically extracted from DF2).
update the results in DF2 where Weight = Weight + WeightChange
Note: the data frames do not have the same dimensions; an individual has several weight changes (DF1) but only one weight (DF2):
DF1:
Name WeightChange
1 John 5
2 Peter 10
3 John 7
4 Mary -20
5 Gary -3
DF2:
Name Weight
1 John 180
2 Peter 160
3 Mary 120
4 Gary 150
Firstly I'd merge df1 and df2 on the 'Name' column to add the weight column to df1.
Then I'd groupby df1 on 'Name' and apply a transform to calculate the total weight change for each person; transform returns a Series aligned to the original df, so you can add the aggregated column back to the df.
Then I'd merge this column into df2, and it's then a simple case of adding this total weight change to the existing weight column:
In [242]:
df1 = df1.merge(df2, on='Name', how='left')
df1['WeightChangeTotal'] = df1.groupby('Name')['WeightChange'].transform('sum')
df1
Out[242]:
Name WeightChange Weight WeightChangeTotal
0 John 5 180 12
1 Peter 10 160 10
2 John 7 180 12
3 Mary -20 120 -20
4 Gary -3 150 -3
In [243]:
df2 = df2.merge(df1[['Name','WeightChangeTotal']], on='Name')
df2
Out[243]:
Name Weight WeightChangeTotal
0 John 180 12
1 John 180 12
2 Peter 160 10
3 Mary 120 -20
4 Gary 150 -3
In [244]:
df2['Weight'] = df2['Weight'] + df2['WeightChangeTotal']
df2
Out[244]:
Name Weight WeightChangeTotal
0 John 192 12
1 John 192 12
2 Peter 170 10
3 Mary 100 -20
4 Gary 147 -3
EDIT
To address your desired behaviour for the 'WeightBefore' column:
In [267]:
df1['WeightBefore'] = df1['Weight'] + df1.groupby('Name')['WeightChange'].transform(lambda s: s.shift().cumsum()).fillna(0)
df1
Out[267]:
Name WeightChange Weight WeightBefore
0 John 5 180 180
1 Peter 10 160 160
2 John 7 180 185
3 Mary -20 120 120
4 Gary -3 150 150
So the above groups on 'Name', applies a shift to the column and then a cumsum so we add the incremental differences; we have to call fillna because the shift produces NaN for each person's first weight change.
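Note that merging df1 back into df2 duplicates rows for names with more than one change (see the two John rows above). Assuming 'Name' uniquely identifies a person in df2, one way to collapse them back to a single row each is:
# Keep one row per person after the merge has duplicated names with several changes
df2 = df2.drop_duplicates(subset='Name').reset_index(drop=True)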

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning python (+pandas) and hope I can explain this well enough. I have a large time series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting the average per hour over several years; however, I run into trouble when including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, with the datetime set as the index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I'm looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour. I want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
You can use the groupby function on the 'Id' column and then use resample on the datetime index to sum the counts per hour.
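A minimal sketch of that idea, assuming Start_date is the datetime index and the frame has the Count column shown above:
# Hourly totals per Id (the datetime index is resampled within each group)
hourly = df.groupby('Id')['Count'].resample('H').sum().reset_index()
# Mean per Id, day of week and hour of day
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
mean_counts = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean()
Note that resample fills empty hours with 0, which pulls the mean down; drop the zero rows first if you only want hours with at least one ticket.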

Update Specific Pandas Rows with Value from Different Dataframe

I have a pandas dataframe that contains budget data, but my sales data is located in another dataframe that is not the same size. How can I get the sales data updated in my budget data, and how do I write the conditions so that these updates are made?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming that the 'cust' column is unique in your other df, you can call map on the sales df after setting its index to the 'cust' column. This maps each 'cust' in the budget df to its sales value; you will get NaN where there are missing values, so call fillna(0) to fill those:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
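An equivalent sketch using DataFrame.update, assuming 'cust' is unique in both frames (df is the budget frame and df1 the sales frame, as above):
# Align both frames on 'cust'; update overwrites matching, non-NaN sales values in place
budget = df.set_index('cust')
budget.update(df1.set_index('cust')[['sales']])
budget = budget.reset_index()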

Pandas Pivot tables row subtotals

I'm using Pandas 0.10.1
Considering this Dataframe:
Date State City SalesToday SalesMTD SalesYTD
20130320 stA ctA 20 400 1000
20130320 stA ctB 30 500 1100
20130320 stB ctC 10 500 900
20130320 stB ctD 40 200 1300
20130320 stC ctF 30 300 800
How can I group subtotals per state?
State City SalesToday SalesMTD SalesYTD
stA ALL 50 900 2100
stA ctA 20 400 1000
stA ctB 30 500 1100
I tried with a pivot table, but I can only get subtotals in columns:
table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State','City'], aggfunc=np.sum, margins=True)
I can achieve this in Excel with a pivot table.
If you don't put both State and City in the rows, but State in the rows and City in the columns, you'll get separate margins. Reshape, and you get the table you're after:
In [10]: table = pivot_table(df, values=['SalesToday', 'SalesMTD','SalesYTD'],\
rows=['State'], cols=['City'], aggfunc=np.sum, margins=True)
In [11]: table.stack('City')
Out[11]:
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
All All 1900 130 5100
ctA 400 20 1000
ctB 500 30 1100
ctC 500 10 900
ctD 200 40 1300
ctF 300 30 800
I admit this isn't totally obvious.
You can get the summarized values by using groupby() on the State column.
Let's make some sample data first:
import pandas as pd
import StringIO
incsv = StringIO.StringIO("""Date,State,City,SalesToday,SalesMTD,SalesYTD
20130320,stA,ctA,20,400,1000
20130320,stA,ctB,30,500,1100
20130320,stB,ctC,10,500,900
20130320,stB,ctD,40,200,1300
20130320,stC,ctF,30,300,800""")
df = pd.read_csv(incsv, index_col=['Date'], parse_dates=True)
Then apply the groupby function and add a column City:
dfsum = df.groupby('State', as_index=False).sum()
dfsum['City'] = 'All'
print dfsum
State SalesToday SalesMTD SalesYTD City
0 stA 50 900 2100 All
1 stB 50 700 2200 All
2 stC 30 300 800 All
We can append the original data to the summed df by using append:
dfsum = dfsum.append(df).set_index(['State','City']).sort_index()
print dfsum
SalesMTD SalesToday SalesYTD
State City
stA All 900 50 2100
ctA 400 20 1000
ctB 500 30 1100
stB All 700 50 2200
ctC 500 10 900
ctD 200 40 1300
stC All 300 30 800
ctF 300 30 800
I added the set_index and sort_index to make it look more like your example output; it's not strictly necessary to get the results.
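In newer pandas versions DataFrame.append has been removed, so the same step can be written with pd.concat:
# Equivalent to the append above in current pandas
dfsum = pd.concat([dfsum, df]).set_index(['State', 'City']).sort_index()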
I think this subtotal example code is what you want (similar to Excel's subtotal).
I assume that you want to group by columns A, B, C, then subtotal by D, counting the values of column E.
main_df.groupby(['A', 'B', 'C']).apply(lambda sub_df:
    sub_df.pivot_table(index=['D'], values=['E'], aggfunc='count', margins=True))
output:
E
A B C D
a a a a 1
b 2
c 2
all 5
b b a a 3
b 2
c 2
all 7
b b b a 3
b 6
c 2
d 3
all 14
How about this one?
table = pd.pivot_table(data, index=['State'],columns = ['City'],values=['SalesToday', 'SalesMTD','SalesYTD'],\
aggfunc=np.sum, margins=True)
If you are interested, I have just created a little function to make this easier, as you might want to apply this 'subtotal' function to many tables. It works for tables created via both pivot_table() and groupby(). An example of a table to use it on is provided on this Stack Overflow page: Sub Total in pandas pivot Table
def get_subtotal(table, sub_total='subtotal', get_total=False, total='TOTAL'):
    """
    Parameters
    ----------
    table : dataframe, table with multi-index resulting from pd.pivot_table() or
        df.groupby().
    sub_total : str, optional
        Name given to the subtotal. The default is 'subtotal'.
    get_total : boolean, optional
        Specify whether you want to add the final total (in case you used groupby()).
        The default is False.
    total : str, optional
        Name given to the total. The default is 'TOTAL'.

    Returns
    -------
    A table with the total and subtotal added.
    """
    index_name1 = table.index.names[0]
    index_name2 = table.index.names[1]
    pvt = table.unstack(0)
    mask = pvt.columns.get_level_values(index_name1) != 'All'
    #print (mask)
    pvt.loc[sub_total] = pvt.loc[:, mask].sum()
    pvt = pvt.stack().swaplevel(0, 1).sort_index()
    pvt = pvt[pvt.columns[1:].tolist() + pvt.columns[:1].tolist()]
    if get_total:
        mask = pvt.index.get_level_values(index_name2) != sub_total
        pvt.loc[(total, ''), :] = pvt.loc[mask].sum()
    print(pvt)
    return pvt
table = pd.pivot_table(df, index=['A'], values=['B', 'C'], columns=['D', 'E'], fill_value='0', aggfunc=np.sum/'count'/etc., margins=True, margins_name='Total')
print(table)
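A minimal usage sketch with the question's DataFrame, assuming the get_subtotal function above is in scope (the 'All' subtotal label is just a choice):
import numpy as np
import pandas as pd

# Two-level (State, City) index table like the one the question builds
pvt = pd.pivot_table(df, index=['State', 'City'],
                     values=['SalesToday', 'SalesMTD', 'SalesYTD'],
                     aggfunc=np.sum)
# Add a per-State subtotal row plus a grand total
result = get_subtotal(pvt, sub_total='All', get_total=True)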
