My pandas df has a column containing the birth years of the household members and looks like this:
Birthyear_household_members
1960
1982 + 1989
1941
1951 + 1953
1990 + 1990
1992
I want to create a column containing the number of people above 64 years old in each household.
Therefore, for each row, I need to split the string and count the number of people with a birth year before 1956.
How can I do this using pandas? My original df is very large.
Try the apply method of your df:
df['cnt'] = df['Birthyear_household_members'].apply(lambda x: sum(int(year) < 1956 for year in x.split(' + ')))
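On a very large df a vectorized version may be faster than apply. A minimal sketch using explode (this assumes pandas >= 1.4 for the regex= keyword and that the separator is always ' + '):
df['cnt'] = (
    df['Birthyear_household_members']
    .str.split(' + ', regex=False)   # literal split, one list per household
    .explode()                       # one row per member, original index kept
    .astype(int)
    .lt(1956)                        # True for birth years before 1956
    .groupby(level=0)                # fold back to one row per household
    .sum()
)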
I am using Python to analyze a data set that has a column with a year range (see below for example):
Name   Years Range
Andy   1985 - 1987
Bruce  2011 - 2018
I am trying to convert the "Years Range" column that has a string of start and end years into two separate columns within the data frame to: "Year Start" and "Year End".
Name   Years Range  Year Start  Year End
Andy   1985 - 1987  1985        1987
Bruce  2011 - 2018  2011        2018
You can use expand=True within the split function:
df[['Year Start','Year End']] = df['Years Range'].str.split(' - ', expand=True)
Output:
    Name  Years Range Year Start Year End
0   Andy  1985 - 1987       1985     1987
1  Bruce  2011 - 2018       2011     2018
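If the spacing around the dash varies, a regex split is a bit more robust (a sketch of the same idea; regex= again assumes pandas >= 1.4):
# '\s*-\s*' matches the dash plus any surrounding whitespace
df[['Year Start','Year End']] = df['Years Range'].str.split(r'\s*-\s*', expand=True, regex=True)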
I think str.extract can do the job.
Here is an example:
import pandas as pd

df = pd.DataFrame(["1985 - 1987"], columns=["Years Range"])
df['Year Start'] = df['Years Range'].str.extract(r'(\d{4})')
df['Year End'] = df['Years Range'].str.extract(r'- (\d{4})')
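The two extract calls can also be collapsed into a single one with two capture groups (a sketch of the same idea):
# each capture group becomes its own column
df[['Year Start', 'Year End']] = df['Years Range'].str.extract(r'(\d{4})\s*-\s*(\d{4})')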
df['start'] = ''  # create a blank column named 'start'
df['end'] = ''    # create a blank column named 'end'

# loop over the data frame
for i in range(len(df)):
    parts = df['Years Range'][i].split('-')
    df.loc[i, 'start'] = parts[0].strip()  # first element of the split
    df.loc[i, 'end'] = parts[1].strip()    # second element of the split
https://colab.research.google.com/drive/1Kemzk-aSUKRfE_eSrsQ7jS6e0NwhbXWp#scrollTo=esXNvRpnSN9I&line=1&uniqifier=1
I have the following dataframe:
name age year salary1 salary2 salary3 salary4
Rodrigo 28 2021 1945 2312 4567 3214
Neil 26 2021 3546 6657 -3200 1855
Loti 34 2021 4671 3895 5512 7864
...
I would like to create a new column that holds a list of the values of salary1, salary2, salary3 and salary4, so that this is the result output:
name age year salary1 salary2 salary3 salary4 new_colum
Rodrigo 28 2021 1945 2312 4567 3214 [1945,2312,4567,3214]
Neil 26 2021 3546 6657 -3200 1855 [3546,6657,-3200,1855]
Loti 34 2021 4671 3895 5512 7864 [4671,3895,5512,7864]
I have tried to concatenate the relevant columns by changing the type of each column to string and then adding them:
df['new_column'] = df['salary1'].astype(str) + ',' + \
df['salary2'].astype(str) + ',' + \
df['salary3'].astype(str) + ',' + \
df['salary4'].astype(str)
That indeed concatenates the columns, but it does not produce a list, and it changes the type to string while I still need the values to be numerical.
My question is: how can I create a new column with a list of the four column values?
Try this:
df['new_column'] = df[['salary1', 'salary2', 'salary3', 'salary4']].values.tolist()
Another possibility using apply is
df['new_col'] = df[['salary1', 'salary2', 'salary3', 'salary4']].apply(lambda r: list(r), axis=1)
Note, however, that this is probably slower than the .values.tolist() approach suggested in the other answer.
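For reference, .values.tolist() can also be written with the newer to_numpy() (same result; a small sketch):
# to_numpy() is the modern spelling of .values
df['new_column'] = df[['salary1', 'salary2', 'salary3', 'salary4']].to_numpy().tolist()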
I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...
My dataframe consists of all United States and territories for years 2013-2016 with several attributes.
>>> df.head(2)
state enrollees utilizing enrol_age65 util_age65 year
1 Alabama 637247 635431 473376 474334 2013
2 Alaska 30486 28514 21721 20457 2013
>>> df.tail(2)
state enrollees utilizing enrol_age65 util_age65 year
214 Puerto Rico 581861 579514 453181 450150 2016
215 U.S. Territories 24329 16979 22608 15921 2016
I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.
Desired Output:
enrollees utilizing
year state
2013 California 3933310 3823455
New York 3133980 3002948
Florida 2984799 2847574
...
2016 California 4516216 4365896
Florida 4186823 3984756
New York 4009829 3874682
So far I've tried the following:
df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)
Which yields just the first 3 rows in the GroupBy object:
enrollees utilizing
year state
2013 Alabama 637247 635431
Alaska 30486 28514
Arizona 707683 683273
I've also tried a lambda function:
df.groupby(['year','state'])['enrollees','utilizing']\
.apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')
Which yields the absolute largest 3 in the GroupBy object:
enrollees utilizing
year state
2016 California 4516216 4365896
2015 California 4324304 4191704
2014 California 4133532 4011208
I think it may have to do with the indexing of the GroupBy object, but I am not sure... Any guidance would be appreciated!
Well, you could do something not that pretty.
First, get a list of unique years using set():
years_list = list(set(df.year))
Create a dummy dataframe and a concat helper function I've made in the past:
def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoids retyping the same line of code for every df.
    The parameters should be the temporary df created at each loop and the
    concatenated DF that will contain all values; the latter must first be
    initialized (outside the loop) as df_name = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)
    return df_full
Create the dummy final df:
df_final = pd.DataFrame()
Now loop over each year, concatenating into the new DF:
for year in years_list:
    # query() filters rows; @year refers to the external variable,
    # in this case the loop variable.
    # This yields a temporary DF with only that year; sort it and keep the top 3.
    df2 = df.query("year == @year")
    df_temp = df2.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum().sort_values(by="enrollees", ascending=False).head(3)
    # finally, call our function, which keeps concatenating the tmp DFs
    df_final = concatenate_loop_dfs(df_temp, df_final)
and done.
print(df_final)
You then need to sort your grouped result with .sort_values('enrollees', ascending=False).
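For what it's worth, the loop can be avoided entirely: aggregate once, then keep the top 3 rows per year. A minimal sketch using the column names from the question:
# sum per (year, state), then the 3 largest 'enrollees' within each year
agg = df.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum()
top3 = (
    agg.sort_values('enrollees', ascending=False)
       .groupby(level='year')
       .head(3)
       .sort_index(level='year', sort_remaining=False, kind='stable')
)
print(top3)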
Currently I am trying to read in a .csv file and then use to_html() to create a table with an index down the side. All the code is here:
import pandas as pd
df = pd.read_csv('file.csv')
df.to_html('example.html')
As expected I am currently getting:
Year Population Annual Growth Rate
0 1950 2557628654 1.458
1 1951 2594919657 1.611
2 1952 2636732631 1.717
3 1953 2681994386 1.796
4 1954 2730149884 1.899
However I want to start the indexing at 2 instead of 0. For example:
Year Population Annual Growth Rate
2 1950 2557628654 1.458
3 1951 2594919657 1.611
4 1952 2636732631 1.717
5 1953 2681994386 1.796
6 1954 2730149884 1.899
I know I could achieve this outcome by adding two dummy rows in the .csv file and then deleting them with df.ix[], but I do not want to do this.
Is there a way to change the indexing to start at something other than 0 without having to add or delete rows in the .csv file?
Thanks!
I know it looks like a hack, but what if you just change the index? For example:
df.index = df.index + 2
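Applied to the code from the question, a minimal end-to-end sketch:
import pandas as pd

df = pd.read_csv('file.csv')
df.index = df.index + 2   # shift the default RangeIndex so it starts at 2
df.to_html('example.html')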
Suppose I have a dataframe indexed with a monthly timestep. I know I can use dataframe.groupby(lambda x: x.year) to group monthly data into yearly and apply other operations. Is there some way I could quickly group them, let's say by decade?
thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year // 10) * 10.
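For example, with the DatetimeIndex above that would be (a sketch of the same trick):
# each index label is a Timestamp, so the lambda can read its .year
df.groupby(lambda ts: (ts.year // 10) * 10).sum()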
If your DataFrame has headers, say ['Population', 'Salary', 'vehicle count'],
make Year your index: dataframe = dataframe.set_index('Year')
Then use the code below to resample the data into 10-year decades; it also gives you the sum of all the other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
Note that resample needs a datetime-like index, so Year must be parsed to datetimes first.
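A minimal sketch of that conversion, assuming a 'Year' column of plain integers:
import pandas as pd

# parse integer years like 1995 into timestamps (1995-01-01), then resample
dataframe['Year'] = pd.to_datetime(dataframe['Year'].astype(str), format='%Y')
dataframe = dataframe.set_index('Year')
decades = dataframe.resample('10AS').sum()
Be aware that resample bins start from the first year in the data, so they may not line up with calendar decades; the (year // 10) * 10 groupby above does align exactly.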
Use the year attribute of the index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up with
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: the iloc[:, 0] here chooses the first column in your dataframe.
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases