How to update a column using another column in pandas - python

I'm trying to create a dataframe that keeps track of the number of public schools opened between 2010-2016.
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 0 0 2005 2015.0
2 Active Alameda 0 0 2006 NaN
3 Closed Alameda 0 0 2008 2015.0
4 Active Alameda 0 0 2011 NaN
5 Active Alameda 0 0 2011 NaN
6 Active Alameda 0 0 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 0 0 1980 NaN
9 Active Alameda 0 0 1980 NaN
I want to update the 2010-2016 columns to keep track of the number of schools open per year. For example, the first school in the dataframe opens in 2005 and closes in 2015, so the iterator should check the "ClosedYear" column and add 1 to every year column earlier than 2015 (2010, 2011, ..., 2014). If the "ClosedYear" column is "NaN", then starting at the year in the "OpenYear" column, add 1 to every year column >= "OpenYear" (e.g. for school #4, columns [2011, 2012, ..., 2016] get +1 and column [2010] is unchanged).
I was thinking about using "apply" to apply a function to the dataframe, but that might not be the most efficient way to solve the problem. I need help figuring out how to make this work. Thanks!
Extra Step:
After finishing the counts, I want to group the year columns by county. I'm leaning towards using "groupby" with a sum to add up the open-school counts per county per year. If someone could include that with the answer to the question above, it would be very helpful.
Expected Output:
StatusType County 2010 ...2016 OpenYear ClosedYear
1 Closed Alameda 1 0 2005 2015.0
2 Active Alameda 1 1 2006 NaN
3 Closed Alameda 1 0 2008 2015.0
4 Active Alameda 0 1 2011 NaN
5 Active Alameda 0 1 2011 NaN
6 Active Alameda 0 1 2012 NaN
7 Closed Alameda 0 0 1980 1989.0
8 Active Alameda 1 1 1980 NaN
9 Active Alameda 1 1 1980 NaN

I feel like there should be a way to do this without using a for loop, but I cannot think of it at the moment, so here's my solution:
# Read example data
import pandas as pd
from io import StringIO  # Python 3 import path
df = pd.read_fwf(StringIO(
"""StatusType County OpenYear ClosedYear
Closed Alameda 2005 2015.0
Active Alameda 2006 NaN
Closed Alameda 2008 2015.0
Active Alameda 2011 NaN
Active Alameda 2011 NaN
Active Alameda 2012 NaN
Closed Alameda 1980 1989.0
Active Alameda 1980 NaN
Active Alameda 1980 NaN"""))
# For each year
for year in range(2010, 2016 + 1):
    # Create a column of 0s
    df[str(year)] = 0
    # Where the year is between OpenYear and ClosedYear (or ClosedYear is NaN), set it to 1
    df.loc[(df['OpenYear'] <= year) & (pd.isna(df['ClosedYear']) | (df['ClosedYear'] >= year)), str(year)] = 1
print(df.to_string())
Output:
StatusType County OpenYear ClosedYear 2010 2011 2012 2013 2014 2015 2016
0 Closed Alameda 2005 2015.0 1 1 1 1 1 1 0
1 Active Alameda 2006 NaN 1 1 1 1 1 1 1
2 Closed Alameda 2008 2015.0 1 1 1 1 1 1 0
3 Active Alameda 2011 NaN 0 1 1 1 1 1 1
4 Active Alameda 2011 NaN 0 1 1 1 1 1 1
5 Active Alameda 2012 NaN 0 0 1 1 1 1 1
6 Closed Alameda 1980 1989.0 0 0 0 0 0 0 0
7 Active Alameda 1980 NaN 1 1 1 1 1 1 1
8 Active Alameda 1980 NaN 1 1 1 1 1 1 1
(PS: I'm not quite sure what you were trying to do with the groupby)
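If the intent of the extra step was simply to sum the per-year counts by county, here is a minimal sketch, assuming the year columns have been filled by the loop above:
# Sum the yearly open-school indicators per county
year_cols = [str(y) for y in range(2010, 2017)]
counts_by_county = df.groupby('County')[year_cols].sum()
print(counts_by_county)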

Unless there is really a need to create those intermediate columns, you can get the counts directly with a groupby and .size. Depending upon whether you want to include the closing year, change the inequalities from <= to <. If you want to group this by county as well, you can do that in the same step.
Here's the starting df
StatusType County OpenYear ClosedYear
1 Closed Alameda 2005 2015.0
2 Active Alameda 2006 NaN
3 Closed Alameda 2008 2015.0
4 Active Alameda 2011 NaN
5 Active Alameda 2011 NaN
6 Active Alameda 2012 NaN
7 Closed Alameda 1980 1989.0
8 Active Alameda 1980 NaN
9 Active Alameda 1980 NaN
import pandas as pd

year_list = [2010, 2011, 2012, 2013, 2014, 2015, 2016]
df_list = []
for year in year_list:
    # Schools open in this year: already opened and not yet closed (or never closed)
    group = ((df.ClosedYear.isnull()) | (df.ClosedYear >= year)) & (df.OpenYear <= year)
    n_schools = df.groupby([group, df.County]).size()[True]
    df_list.append(pd.DataFrame({'n_schools': n_schools, 'year': year}))
ndf = pd.concat(df_list)
# n_schools year
#County
#Alameda 5 2010
#Alameda 7 2011
#Alameda 8 2012
#Alameda 8 2013
#Alameda 8 2014
#Alameda 8 2015
#Alameda 6 2016
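If you want the wide layout from the question (one column per year) rather than this long format, the result can be pivoted. A sketch, assuming ndf was built by the loop above:
# Reshape the long (County, year) counts into one column per year
wide = ndf.reset_index().pivot(index='County', columns='year', values='n_schools')
print(wide)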

Related

How to count the total sales by year, month

I have a big csv (17985 rows) with sales on different days. The csv looks like this:
Customer Date Sale
Larry 1/2/2018 20$
Mike 4/3/2020 40$
John 12/5/2017 10$
Sara 3/2/2020 90$
Charles 9/8/2022 75$
Below is how many times that exact day appears in my csv (how many sales were made that day):
occur = df.groupby(['Date']).size()
occur
2018-01-02 32
2018-01-03 31
2018-01-04 42
2018-01-05 192
2018-01-06 26
I used crosstab, groupby and several other methods, but the problem is that the results don't add up or come out as NaN.
new_df['total_sales_that_month'] = df.groupby('Date')['Sale'].sum()
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
17980 NaN
17981 NaN
17982 NaN
17983 NaN
17984 NaN
I want to group them by year and month in a dataframe, based on total sales. Using dt.year and dt.month I managed to do this:
   month  year
       1  2020
       1  2020
       7  2019
       8  2019
       2  2018
     ...   ...
       4  2020
       4  2020
       4  2020
       4  2020
       4  2020
What I want to have is: month/year/total_sales_that_month. What method should I apply? This is the expected output:
Month Year Total_sale_that_month
1 2018 420$
2 2018 521$
3 2018 124$
4 2018 412$
5 2018 745$
You can use groupby + sum, but first you have to strip the '$' from the Sale column and convert it to numeric:
# Clean your dataframe first
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Sale'] = df['Sale'].str.strip('$').astype(float)
out = (df.groupby([df['Date'].dt.month.rename('Month'),
df['Date'].dt.year.rename('Year')])
['Sale'].sum()
.rename('Total_sale_that_month')
# .astype(str).add('$') # uncomment if '$' matters
.reset_index())
Output:
>>> out
Month Year Total_sale_that_month
0 2 2018 20.0
1 2 2020 90.0
2 3 2020 40.0
3 5 2017 10.0
4 8 2022 75.0
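An alternative, if a single combined year-month column is acceptable, is to group by a monthly period instead of separate Month/Year columns. A sketch, assuming df has already been cleaned as above:
# Group by a single monthly period (Date already datetime, Sale already numeric)
out2 = (df.groupby(df['Date'].dt.to_period('M'))['Sale']
          .sum()
          .rename('Total_sale_that_month')
          .reset_index())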
I'll share my code with you: pivot_table, reset_index and sorting. Convert your column names to match first:
df["Dt_Customer_Y"] = pd.DatetimeIndex(df['Dt_Customer']).year
df["Dt_Customer_M"] = pd.DatetimeIndex(df['Dt_Customer']).month
pvtt = df.pivot_table(index=['Dt_Customer_Y', 'Dt_Customer_M'], aggfunc={'Income':sum})
pvtt.reset_index().sort_values(['Dt_Customer_Y', 'Dt_Customer_M'])
Dt_Customer_Y Dt_Customer_M Income
0 2012 1 856039.0
1 2012 2 487497.0
2 2012 3 921940.0
3 2012 4 881203.0
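Adapted to the question's Date and Sale columns, a sketch, assuming Sale has already been converted to numeric as in the previous answer:
# pivot_table variant using the question's column names
df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date']).month
pvt = df.pivot_table(index=['Year', 'Month'], values='Sale', aggfunc='sum')
print(pvt.reset_index().sort_values(['Year', 'Month']))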

How to find first occurence of a specified integer over multiple columns using Pandas?

I have this dataset:
2010 2011 2012
0 NaN NaN 505303.0
1 542225.0 NaN 210530.0
2 123210.0 429439.0 543964.0
3 434304.0 540325.0 NaN
4 750450.0 143430.0 540425.0
5 543015.0 549320.0 104365.0
and first I want to find the first digit of each cell, like this (see the MWE below):
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
but finally I want to find the first occurrence of 5 in each row, and in which year it occurred. If 5 occurs in several places, I only want to know the first one. How do I accomplish this?
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Below you will find the MWE:
import numpy as np
import pandas as pd

data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
        "2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
        "2012": [505303, 210530, 543964, np.nan, 540425, 104365]
        }
df_t = pd.DataFrame(data)
for col in df_t.columns:
    df_t[col] = (df_t[col]
                 .fillna(-1)
                 .astype(str)
                 .str[0]
                 )
Your solution can be simplified with DataFrame.apply instead of the loop:
df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
2010 2011 2012
0 - - 5
1 5 - 2
2 1 4 5
3 4 5 -
4 7 1 5
5 5 5 1
Then compare against the string '5' and get the first matching year with DataFrame.idxmax; if there is no match, return None:
m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 - - 5 2012
1 5 - 2 2010
2 1 4 5 2012
3 4 5 - 2011
4 7 1 5 2012
5 5 5 1 2010
Another idea with numeric-only values: floor-divide each value by the largest power of 10 it contains, which leaves just the leading digit:
df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
2010 2011 2012
0 NaN NaN 5.0
1 5.0 NaN 2.0
2 1.0 4.0 5.0
3 4.0 5.0 NaN
4 7.0 1.0 5.0
5 5.0 5.0 1.0
m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
2010 2011 2012 Year
0 NaN NaN 5.0 2012
1 5.0 NaN 2.0 2010
2 1.0 4.0 5.0 2012
3 4.0 5.0 NaN 2011
4 7.0 1.0 5.0 2012
5 5.0 5.0 1.0 2010

Resampling DataFrame while keeping my other columns untouched

I have a dataset that tells me the frequency of an event for each demographic (for example, the first row says there are 13 white men, who are 11 years old, in the county Alameda in the year 2006 who experienced an event). Here is the original DataFrame:
Year County age Race Sex freq
0 2006 Alameda 11 1 0 13
1 2006 Alameda 11 1 1 9
2 2006 Alameda 11 2 0 9
3 2006 Alameda 11 2 1 16
4 2006 Alameda 11 3 0 2
Now, I want to compute the two-year average of the "freq" column, by demographic. This is the code I tried and the output:
dc = dc.dropna()
dc['date'] = dc.apply(lambda x: pd.Timestamp('{year}'.format(year=int(x.Year))), axis=1)
dc.set_index('date', inplace=True)
dc = dc.resample('2A', how='mean')
Date age_range Race Sex freq
2006-12-31 14.507095 1.637789 0.489171 10.451830
2008-12-31 14.543697 1.664187 0.493120 10.285980
2010-12-31 14.516471 1.670205 0.489019 10.349927
2012-12-31 14.512953 1.675056 0.486677 10.109178
2014-12-31 14.568190 1.699817 0.485923 10.134186
It's computing the averages for each column, but how do I do it for just the freq column, by the demographic cuts (like the original DataFrame)?
Combining groupby(), Grouper() and transform() should give you the series you want. Your sample data set does not have enough rows to fully demonstrate the two-year averaging.
import io
import pandas as pd

dc = pd.read_csv(io.StringIO("""   Year   County  age  Race  Sex  freq
0  2006  Alameda   11     1    0    13
1  2006  Alameda   11     1    1     9
2  2006  Alameda   11     2    0     9
3  2006  Alameda   11     2    1    16
4  2006  Alameda   11     3    0     2"""), sep=r"\s+")
dc["date"] = pd.to_datetime(dc["Year"], format="%Y")
dc["freq_2y"] = dc.groupby(["County", "age", "Race", "Sex",
                            pd.Grouper(key="date", freq="2A", origin="start")])["freq"].transform("mean")
   Year   County  age  Race  Sex  freq                date  freq_2y
0  2006  Alameda   11     1    0    13 2006-01-01 00:00:00       13
1  2006  Alameda   11     1    1     9 2006-01-01 00:00:00        9
2  2006  Alameda   11     2    0     9 2006-01-01 00:00:00        9
3  2006  Alameda   11     2    1    16 2006-01-01 00:00:00       16
4  2006  Alameda   11     3    0     2 2006-01-01 00:00:00        2
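If a collapsed table (one row per demographic per two-year bin) is preferred over broadcasting the mean back onto every row, the same grouping can be aggregated instead of transformed. A sketch under the same assumptions as the code above:
# Aggregate instead of transform: one row per demographic per 2-year bin
avg_2y = (dc.groupby(["County", "age", "Race", "Sex",
                      pd.Grouper(key="date", freq="2A", origin="start")])
            ["freq"].mean()
            .reset_index(name="freq_2y"))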

Pandas group by id and year(date), but show year for all years, not just those which are present in id?

I have years of transaction data which I am working with by customer id. The transaction information is at an invoice level, and an id could easily have multiple invoices on the same day or no invoices for years. I am attempting to create dataframes which contain sums of invoices by customer for each year, but also show years where no invoices were added. Something akin to:
tmp = invoices[invoices['invoice_year'].isin([2018, 2019, 2020])]
tmp = tmp.groupby(['id', pd.Grouper(key = 'invoice_date', freq = 'Y')])['sales'].sum()
This would return something akin to:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
However the desired output would be:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2019 nan
2 2020 23423
3 2018 nan
3 2019 nan
3 2020 2330202
Ideas?
Let's suppose the original values are in a dataframe named df; then you can try the following:
output = (df.groupby(['id', 'invoice_date'])['val'].sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='val'))
Otherwise you can first create the column invoice_year:
df['invoice_year'] = df['invoice_date'].dt.year
And repeat the same code; this outputs:
id invoice_year val
0 1 2018 1
1 1 2019 1
2 1 2020 0
3 2 2018 1
4 2 2019 0
5 2 2020 1
6 3 2018 0
7 3 2019 1
8 3 2020 1
Using the following data as an example:
df = pd.DataFrame({'id':[1]*2+[2]*2+[3]*2,'invoice_date':pd.to_datetime(['2018-12-01','2019-12-01','2020-12-01']*2,infer_datetime_format=True),'val':[1]*6})
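If NaN (as in the desired output) is preferred over 0 for the missing years, one variant keeps the gaps when stacking back. A sketch using the same example df, with invoice_year created as above:
# Keep NaN for missing (id, year) combinations instead of filling with 0
output_nan = (df.groupby(['id', 'invoice_year'])['val'].sum()
                .unstack()               # missing years become NaN
                .stack(dropna=False)     # keep the NaN rows
                .reset_index(name='val'))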
Stefan has posted a comment that may help. Simply passing dropna=False to your .groupby seems like the best bet, but you could also take the approach where you bring the NaNs back afterward, which may be required on earlier versions of pandas that don't have the dropna=False parameter:
id invoice_year sales
1 2018 483982.20
1 2019 3453
1 2020 453533
2 2018 243
2 2020 23423
3 2020 2330202
You can use pd.MultiIndex.from_product and reindex the dataframe from a newly created index called idx:
i, iy = df['id'], df['invoice_year']
idx = pd.MultiIndex.from_product([range(i.min(), i.max()+1),
range(iy.min(), iy.max()+1)],
names=[i.name, iy.name])
df = df.set_index([i.name, iy.name]).reindex(idx).reset_index()
df
Out[1]:
id invoice_year sales
0 1 2018 483982.2
1 1 2019 3453.0
2 1 2020 453533.0
3 2 2018 243.0
4 2 2019 NaN
5 2 2020 23423.0
6 3 2018 NaN
7 3 2019 NaN
8 3 2020 2330202.0
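If the ids are not a contiguous numeric range, the same idea works with the observed unique ids in place of range. A sketch:
# Build the full (id, year) grid from the observed ids rather than a numeric range
years = range(df['invoice_year'].min(), df['invoice_year'].max() + 1)
idx = pd.MultiIndex.from_product([df['id'].unique(), years],
                                 names=['id', 'invoice_year'])
out = df.set_index(['id', 'invoice_year']).reindex(idx).reset_index()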

Create variable with multiple return of numpy where

Hi, I am a Stata user and now I am trying to translate my Stata code to python/pandas. In this case I want to create a new variable size that assigns the value 1 if the number of jobs is between 1 and 9, the value 2 if jobs is between 10 and 49, 3 if between 50 and 199, and 4 for 200 or more jobs.
Afterwards, if possible, I want to label them (1: 'Micro', 2: 'Small', 3: 'Median', 4: 'Big').
id year entry cohort jobs
1 2009 0 NaN 3
1 2012 1 2012 3
1 2013 0 2012 4
1 2014 0 2012 11
2 2010 1 2010 11
2 2011 0 2010 12
2 2012 0 2010 13
3 2007 0 NaN 38
3 2008 0 NaN 58
3 2012 1 2012 58
3 2013 0 2012 70
4 2007 0 NaN 231
4 2008 0 NaN 241
I tried using this code but couldn't succeed:
df['size'] = np.where((1 <= df['jobs'] <= 9),'Micro',np.where((10 <= df['jobs'] <= 49),'Small'),np.where((50 <= df['jobs'] <= 200),'Median'),np.where((200 <= df['empleo']),'Big','NaN'))
What you are trying to do is called binning; use pd.cut, i.e.
df['new'] = pd.cut(df['jobs'],bins=[1,10,50,201,np.inf],labels=['micro','small','medium','big'])
Output:
id year entry cohort jobs new
0 1 2009 0 NaN 3 micro
1 1 2012 1 2012.0 3 micro
2 1 2013 0 2012.0 4 micro
3 1 2014 0 2012.0 11 small
4 2 2010 1 2010.0 11 small
5 2 2011 0 2010.0 12 small
6 2 2012 0 2010.0 13 small
7 3 2007 0 NaN 38 small
8 3 2008 0 NaN 58 medium
9 3 2012 1 2012.0 58 medium
10 3 2013 0 2012.0 70 medium
11 4 2007 0 NaN 231 big
12 4 2008 0 NaN 241 big
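If the numeric codes 1-4 from the question are also wanted before labelling, the same bins can be cut with labels=False. A sketch using the answer's bin boundaries:
# Numeric size codes 1..4 plus a label column (bins follow the answer above)
bins = [1, 10, 50, 201, np.inf]
df['size'] = pd.cut(df['jobs'], bins=bins, labels=False) + 1
df['size_label'] = df['size'].map({1: 'Micro', 2: 'Small', 3: 'Median', 4: 'Big'})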
For multiple conditions you have to go for np.select, not np.where. Hope that helps.
numpy.select(condlist, choicelist, default=0)
where condlist is the list of your conditions and choicelist is the list of choices when the corresponding condition is met. The default is 0; here you can set it to np.nan.
Using np.select to do the same with the help of .between, i.e.
df['size'] = np.select([df['jobs'].between(1, 10),
                        df['jobs'].between(10, 50),
                        df['jobs'].between(50, 200),
                        df['jobs'].between(200, np.inf)],
                       ['Micro', 'Small', 'Median', 'Big'],
                       'NaN')
