Pandas: mean of one column for each subset defined by another column - python

I have a dataframe called houses:
   transaction_id  house_id    date_sale  sale_price  boolean_2015 postcode
0               1         1  31 Mar 2016    £880,000          True     EC2Y
3               4         2  31 Mar 2016    £450,000          True     EC2Y
4               5         3  31 Mar 2016    £680,000          True     EC1Y
6               7         4  31 Mar 2016  £1,850,000          True     EC2Y
I was wondering how to compute the average of sale_price for each postcode, so that the output is:
  postcode  Average
0     EC1Y  £123220
1     EC2Y  £434930
I did this with:
averages = data.groupby(['postcode'], as_index=False).mean()
but this did not return sale_price. Any thoughts?

You can first replace £ and , with an empty string, then convert the sale_price column with to_numeric. Finally, if you need to add £ back to sale_price, cast to string with astype:
data.sale_price = pd.to_numeric(data.sale_price.str.replace('[£,]', '', regex=True))
averages = data.groupby(['postcode'], as_index=False)['sale_price'].mean()
averages.sale_price = '£' + averages.sale_price.astype(str)
print(averages)
postcode sale_price
0 EC1Y £680000
1 EC2Y £1060000
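For reference, a minimal self-contained version of the same steps (the sample frame below is reconstructed from the question, so the exact values are assumptions):

import pandas as pd

# Sample data reconstructed from the question
data = pd.DataFrame({
    'postcode': ['EC2Y', 'EC2Y', 'EC1Y', 'EC2Y'],
    'sale_price': ['£880,000', '£450,000', '£680,000', '£1,850,000'],
})
# Strip the currency symbol and thousands separators, then convert to numbers
data['sale_price'] = pd.to_numeric(data['sale_price'].str.replace('[£,]', '', regex=True))
# One mean per postcode
averages = data.groupby('postcode', as_index=False)['sale_price'].mean()
print(averages)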

Related

Compare two dataframes column values. Find which values are in one df and not the other

I have the following dataset
df=pd.read_csv('https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/data/sales.csv')
df["OrderYear"] = pd.DatetimeIndex(df['Order Date']).year
I want to compare the customers in 2017 and 2018 and see if the store has lost customers.
I made two subsets corresponding to 2017 and 2018:
Customer_2018 = df.loc[(df.OrderYear == 2018)]
Customer_2017 = df.loc[(df.OrderYear == 2017)]
I then tried this to compare the two:
Churn = Customer_2017['Customer ID'].isin(Customer_2018['Customer ID']).value_counts()
Churn
And I get the following output:
True 2206
False 324
Name: Customer ID, dtype: int64
The problem is that some customers may appear several times in the dataset, since they made several orders.
I would like to get only unique customers (Customer ID is the only unique attribute) and then compare the two dataframes to see how many customers the store lost between 2017 and 2018.
To go further in the analysis, you can use pd.crosstab:
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
At this point your dataframe looks like:
>>> out
OrderYear 2015 2016 2017 2018
Customer ID
AA-10315 4 1 4 2
AA-10375 2 4 4 5
AA-10480 1 0 10 1
AA-10645 6 3 8 1
AB-10015 4 0 2 0 # <- lost customer
... ... ... ... ...
XP-21865 10 3 9 6
YC-21895 3 1 3 1
YS-21880 0 5 0 7
ZC-21910 5 9 9 8
ZD-21925 3 0 5 1
Values are the number of orders per customer and year.
Now it's easy to get "lost customers":
>>> sum((out[2017] != 0) & (out[2018] == 0))
83
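If the IDs themselves are wanted rather than just the count, the same boolean mask can index the crosstab (a small sketch extending the line above, reusing the out frame already defined):

lost_ids = out.index[(out[2017] != 0) & (out[2018] == 0)]
print(list(lost_ids))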
If only one comparison is required, I would use Python sets:
c2017 = set(Customer_2017['Customer ID'])
c2018 = set(Customer_2018['Customer ID'])
print(f'lost customers between 2017 and 2018: {len(c2017 - c2018)}')
print(f'customers from 2017 remaining in 2018: {len(c2017 & c2018)}')
print(f'new customers in 2018: {len(c2018 - c2017)}')
output:
lost customers between 2017 and 2018: 83
customers from 2017 remaining in 2018: 552
new customers in 2018: 138
Building on the crosstab suggestion from @Corralien: encode presence in a year as 0/1, then diff along the columns marks each customer's transition between consecutive years (-1 = present the year before but absent now, i.e. lost; 1 = new; 0 = remained, though 0 also covers customers absent in both years):
out = pd.crosstab(df['Customer ID'], df['OrderYear'])
(out.gt(0).astype(int).diff(axis=1)
    .replace({0: 'remained', 1: 'new', -1: 'lost'})
    .apply(pd.Series.value_counts)
)
output:
OrderYear 2015 2016 2017 2018
lost NaN 163 123 83
new NaN 141 191 138
remained NaN 489 479 572
You could just use normal sets to get unique customer ids for each year and then subtract them appropriately:
set_lost_cust = set(Customer_2017["Customer ID"]) - set(Customer_2018["Customer ID"])
len(set_lost_cust)
Out: 83
For your original approach to work you would need to drop the duplicates from the DataFrames, to make sure each customer appears only a single time:
Customer_2018 = df.loc[(df.OrderYear == 2018), "Customer ID"].drop_duplicates()
Customer_2017 = df.loc[(df.OrderYear == 2017), "Customer ID"].drop_duplicates()
Churn = Customer_2017.isin(Customer_2018)
Churn.value_counts()
#Out:
True 552
False 83
Name: Customer ID, dtype: int64
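If you also want the IDs of the lost customers and not just the count, the same mask can be inverted (a small extension using the deduplicated Series defined above):

lost = Customer_2017[~Customer_2017.isin(Customer_2018)]
print(lost.tolist())  # the 83 customers who did not return in 2018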

How to create a new Pandas DataFrame from alternating boolean rows such that the new DataFrame is ready to plot?

I was hoping someone could help me figure out the best way to arrange my DataFrame to do a scatter plot. The scatter plot should have year on the x axis and percent Foreign Players on the y axis. The DataFrame has about 400 rows and for convenience I will show a head with fewer values.
I began with this DataFrame from a larger DataFrame:
df1 = df.head(5).loc[:, ['Year', 'Nationality', 'Foreign Player']]
Year Nationality Foreign Player
0 2016 United States False
1 2016 United States False
2 2016 United States False
3 2016 United States False
4 2016 United States False
I did a groupby on year and foreign player, making this a multi-index DataFrame:
df2 = df.groupby(['Year','Foreign Player']).count()[['Player']].head(6)
Player
Year Foreign Player
2000 False 26
True 2
2001 False 21
True 5
2002 False 20
True 5
I reset the index to make a single index DataFrame:
df3 = df2.reset_index(level = [0,1]).head(6)
Year Foreign Player Player
0 2000 False 26
1 2000 True 2
2 2001 False 21
3 2001 True 5
4 2002 False 20
As you can see, True and False alternate, with the corresponding values in a different column.
I wanted to do something like:
df3['Percent Foreign'] = df3[['Foreign Player']= False] / (df3[['Foreign Player']= False ] + df3[['Foreign Player']= True)
Obviously that will not work. My objective is a new DataFrame:
Year Percent Foreign
0 2000 15
1 2001 12
2 2002 5
3 2003 22
4 2004 17
so that I can then plot x = Year and y = Percent Foreign using Matplotlib. If there is an easier way to plot this from an earlier step, that would be even better.
Thanks again!
To select the False values, invert the boolean mask with ~, convert the Year values to the index, and divide by the aggregated sum:
print(df3)
   Year  Foreign Player  Player
0  2000           False      26
1  2000            True       2
2  2001           False      21
3  2001            True       5
4  2002           False      20
5  2002            True      10
df4 = (df3[~df3['Foreign Player']].set_index('Year')['Player'] /
       df3.groupby('Year')['Player'].sum()).mul(100).reset_index(name='Percent Foreign')
print(df4)
Year Percent Foreign
0 2000 92.857143
1 2001 80.769231
2 2002 66.666667
Another idea is to reshape df2 with Series.unstack:
df22 = df.groupby(['Year','Foreign Player'])['Player'].count().unstack()
print(df22)
Foreign Player False True
Year
2000 26 2
2001 21 5
2002 20 10
And then divide the False column by the sum of both columns:
df4 = (df22[False] / df22.sum(axis=1)).mul(100).reset_index(name='Percent Foreign')
print(df4)
Year Percent Foreign
0 2000 92.857143
1 2001 80.769231
2 2002 66.666667
For the percentage of Trues:
df5 = (df22[True] / df22.sum(axis=1)).mul(100).reset_index(name='Percent Foreign')
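As a side note, pd.crosstab can also produce the shares directly through its normalize parameter, which skips the manual division (a sketch, not part of the original answers; df6 is just a hypothetical name):

shares = pd.crosstab(df['Year'], df['Foreign Player'], normalize='index')
df6 = shares[True].mul(100).reset_index(name='Percent Foreign')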
To get the ratio of players, one idea is to make two new columns that count total players and total foreign players, and then a third column that divides the two aggregated columns.
Example - simplified dataframe:
df = pd.DataFrame(
    {'Year': [2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011],
     'Foreign Player': [True, True, False, True, False, False, True, True]}
)
Year Foreign Player
0 2010 True
1 2010 True
2 2010 False
3 2010 True
4 2011 False
5 2011 False
6 2011 True
7 2011 True
Count rows and foreign players:
df_agg = df.groupby('Year')['Foreign Player'].agg(['count', 'sum'])
Find ratio:
df_agg['ratio'] = df_agg['sum']/df_agg['count']
df_agg
count sum ratio
Year
2010 4 3 0.75
2011 4 2 0.50
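Since the mean of a boolean column is the fraction of True values, the same ratio can also be computed in one step (a shorter equivalent of the count/sum division above):

ratio = df.groupby('Year')['Foreign Player'].mean()
print(ratio)
# Year
# 2010    0.75
# 2011    0.50
# Name: Foreign Player, dtype: float64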

How to remove leading '0' from my column? Python

I am trying to remove the leading '0' from my data.
My dataframe looks like this
Id Year Month Day
1 2019 01 15
2 2019 03 30
3 2019 10 20
4 2019 11 18
Note: the 'Year', 'Month' and 'Day' columns have dtype object.
I got these columns by extracting them from a date.
I want to remove the '0' at the beginning of each month.
Desired Output:
Id Year Month Day
1 2019 1 15
2 2019 3 30
3 2019 10 20
4 2019 11 18
What I tried to do so far:
df['Month'].str.lstrip('0')
But it did not work.
Any solution? Thank you!
You could use the re package and apply a regex to the column:
import re
import pandas as pd

# Create sample data
d = pd.DataFrame(data={"Month": ["01", "02", "03", "10", "11"]})
d["Month"] = d["Month"].apply(lambda x: re.sub(r"^0+", "", x))
Result:
0 1
1 2
2 3
3 10
4 11
Name: Month, dtype: object
If you are 100% sure that the Month column will always contain numbers, then you could simply do:
d["Month"] = d["Month"].astype(int)

Cumulative sum (pandas)

Apologies if this has been asked already.
I am trying to create a yearly cumulative sum for all order-points within a certain customer account, and am struggling.
Essentially, I want to create 'YearlyTotal' below:
Customer Year Date Order PointsPerOrder YearlyTotal
123456 2016 11/2/16 A939 1 20
123456 2016 3/13/16 A102 19 19
789089 2016 7/15/16 A123 7 7
I've tried:
df['YEARLYTOTAL'] = df.groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
But this produces YearlyTotal in the wrong order (i.e., the YearlyTotal of A939 is 1 instead of 20).
Not sure if this matters, but Customer is a string (the database has leading zeroes -- don't get me started). Putting sort_values(by=['Customer','Year','Date'], ascending=True) at the front also produces an error.
Help?
Use [::-1] to reverse the dataframe:
df['YEARLYTOTAL'] = df[::-1].groupby(by=['Customer','Year'])['PointsPerOrder'].cumsum()
print(df)
Customer Year Date Order PointsPerOrder YearlyTotal YEARLYTOTAL
0 123456 2016 11/2/16 A939 1 20 20
1 123456 2016 3/13/16 A102 19 19 19
2 789089 2016 7/15/16 A123 7 7 7
First make sure Date is a datetime column:
In [35]: df.Date = pd.to_datetime(df.Date)
now we can do:
In [36]: df['YearlyTotal'] = df.sort_values('Date').groupby(['Customer','Year'])['PointsPerOrder'].cumsum()
In [37]: df
Out[37]:
Customer Year Date Order PointsPerOrder YearlyTotal
0 123456 2016 2016-11-02 A939 1 20
1 123456 2016 2016-03-13 A102 19 19
2 789089 2016 2016-07-15 A123 7 7
PS: this solution does NOT depend on the order of the records...
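A minimal reproduction of why this works (sample data reconstructed from the question): the cumulative sums are computed in date order, but the assignment writes them back to each row's original position via index alignment.

import pandas as pd

df = pd.DataFrame({
    'Customer': ['123456', '123456', '789089'],
    'Year': [2016, 2016, 2016],
    'Date': ['11/2/16', '3/13/16', '7/15/16'],
    'Order': ['A939', 'A102', 'A123'],
    'PointsPerOrder': [1, 19, 7],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
# cumsum runs in chronological order; index alignment puts results back
df['YearlyTotal'] = df.sort_values('Date').groupby(['Customer', 'Year'])['PointsPerOrder'].cumsum()
print(df)  # row 0, the latest order, gets 20 = 19 + 1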

selecting a particular row from groupby object in python

id marks year
1 18 2013
1 25 2012
3 16 2014
2 16 2013
1 19 2013
3 25 2013
2 18 2014
Suppose now I group the above on id with the following Python command:
grouped = file.groupby(file.id)
I would like to get a new file with only the rows in each group whose year is the most recent, i.e. the highest of all years in the group.
Please let me know the command; I have been trying with apply, but it only gives me a boolean expression. I want the entire rows with the latest year.
I cobbled this together using this: Python : Getting the Row which has the max value in groups using groupby
So basically we can groupby the 'id' column, then call transform on the 'year' column and create a boolean index where the year matches the max year value for each 'id':
In [103]:
df[df.groupby(['id'])['year'].transform('max') == df['year']]
Out[103]:
id marks year
0 1 18 2013
2 3 16 2014
4 1 19 2013
6 2 18 2014
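An alternative sketch (not from the original answer): SeriesGroupBy.idxmax returns the index label of the max per group, which can feed df.loc. Note it keeps only one row per group, whereas the transform approach above keeps every row that ties for the latest year (e.g. both 2013 rows for id 1):

df.loc[df.groupby('id')['year'].idxmax()]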
