I am using Python to analyze a data set that has a column with a year range (see below for example):
Name     Years Range
Andy     1985 - 1987
Bruce    2011 - 2018
I am trying to convert the "Years Range" column, which holds a string with start and end years, into two separate columns within the data frame: "Year Start" and "Year End".
Name     Years Range    Year Start    Year End
Andy     1985 - 1987    1985          1987
Bruce    2011 - 2018    2011          2018
You can use expand=True within the split function:
df[['Year Start', 'Year End']] = df['Years Range'].str.split(' - ', expand=True)
Output:
    Name  Years Range Year Start Year End
0   Andy  1985 - 1987       1985     1987
1  Bruce  2011 - 2018       2011     2018
I think str.extract can do the job.
Here is an example :
import pandas as pd

df = pd.DataFrame(["1985 - 1987"], columns=["Years Range"])
df['Year Start'] = df['Years Range'].str.extract(r'(\d{4})')
df['Year End'] = df['Years Range'].str.extract(r'- (\d{4})')
(Raw strings for the patterns avoid invalid-escape-sequence warnings.)
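Building on the same idea, both columns can come out of a single extract call by using two capture groups; here is a minimal, self-contained sketch of that variation:

```python
import pandas as pd

df = pd.DataFrame({"Years Range": ["1985 - 1987", "2011 - 2018"]})

# One extract call with two capture groups fills both columns at once;
# \s* tolerates varying whitespace around the dash.
df[["Year Start", "Year End"]] = df["Years Range"].str.extract(r"(\d{4})\s*-\s*(\d{4})")

print(df)
```

The extracted values are still strings; cast with .astype(int) if you need numeric years.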
df['start'] = ''  # create a blank column named 'start'
df['end'] = ''    # create a blank column named 'end'
# loop over the data frame; .at avoids chained-assignment warnings
for i in range(len(df)):
    parts = df.at[i, 'Years Range'].split('-')
    df.at[i, 'start'] = parts[0].strip()  # first element of the split
    df.at[i, 'end'] = parts[1].strip()    # second element of the split
My pandas df has a column containing the birth year of the household members and looks like this:
Birthyear_household_members
1960
1982 + 1989
1941
1951 + 1953
1990 + 1990
1992
I want to create a column with a variable that contains the number of people above 64 years old in a household.
Therefore, for each row, I need to separate the string and count the number of people with a birthyear before 1956.
How can I do this using pandas? My original df is very large.
Try using the apply method of your df:
df['cnt'] = df['Birthyear_household_members'].apply(lambda x: sum(int(year) < 1956 for year in x.split(" + ")))
Comparing the years as integers is safer than comparing strings.
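Here is a minimal runnable sketch of that approach, using the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({"Birthyear_household_members":
                   ["1960", "1982 + 1989", "1941", "1951 + 1953", "1990 + 1990", "1992"]})

# Split each cell on " + ", convert each year to int, and count those before 1956.
df["cnt"] = df["Birthyear_household_members"].apply(
    lambda x: sum(int(year) < 1956 for year in x.split(" + ")))

print(df["cnt"].tolist())  # [0, 0, 1, 2, 0, 0]
```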
I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month between 2011 and 2018, how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the column drop_off_date and strip them to keep only the year and month:
df['drop_off_ym'] = df['drop_off_date'].astype(str).str[:-3]
Then apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
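A minimal end-to-end sketch of this string-stripping approach, using the three sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "store ID": [12876, 12877, 12877],
    "drop_off_date": ["2011-01-27", "2011-02-23", "2018-12-02"],
})

# Keep only "YYYY-MM" from each date string (assumes the dates are stored as text).
df["drop_off_ym"] = df["drop_off_date"].str[:-3]
counts = df.groupby("drop_off_ym")["store ID"].count()

print(counts)
```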
Using your data, I'm assuming your date has been cast to a datetime value, with errors='coerce' to handle outliers.
You should then drop any NAs so you're only dealing with customers who dropped off.
You can do this in a multitude of ways; I would do a simple df = df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregate:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby on the Record ID as a count (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: To account for years.
first lets create a year helper column
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
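One caveat with month-name strings is that they sort alphabetically, not chronologically. A sketch of an alternative that keeps the months in calendar order is to group on a monthly Period instead:

```python
import pandas as pd

df = pd.DataFrame({
    "Record ID": ["a1274c212", "a1534c543", "a1232c952"],
    "drop_off_date": pd.to_datetime(["2011-01-27", "2011-02-23", "2018-12-02"]),
})

# A monthly Period ("2011-01", "2011-02", ...) groups and sorts chronologically.
counts = df.groupby(df["drop_off_date"].dt.to_period("M"))["Record ID"].count()

print(counts)
```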
I've properly converted my year column into a datetime index; however, the month and day are inaccurate and unneeded, since my dataset only includes the year. I've used the format parameter to set year only, but the index still shows full "%Y-%m-%d" dates.
Original data:
index song year artist genre
0 0 ego-remix 2009 beyonce knowles Pop
1 1 shes-tell-me 2009 save Rock
2 2 hello 2009 yta Pop
3 3 the rock 2009 term R&B
4 4 black-culture 2009 hughey Country
I conducted a few more scrubbing steps, then ran the code below; here are example rows from my dataframe:
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y')
clean_df = clean_df.drop(['index', 'year'], axis=1)
clean_df.sort_index(inplace=True)
clean_df.head()
year song artist genre
1970-01-01 hey now caravan Rock
1970-01-01 show me abc Rock
1970-01-01 hey now xyz Pop
1970-01-01 tell me foxy R&B
1970-01-01 move up curtis R&B
Is there any other method to be used to set index as annual only?
You were close:
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y').dt.year
It's hard to give the exact format string without your original data, but you just need to convert to a datetime and then take the year attribute (via .dt.year on the Series).
I had a similar issue. Solved it this way:
df['Year'] = df.Year.astype('datetime64[ns]')
df['Year'] = df.Year.dt.year
df = df.set_index('Year')
Note that set_index returns a new dataframe, so assign it back.
Output should only show the year with 4 digits.
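Putting the pieces together, here is a minimal sketch (with made-up sample rows) that ends with a plain integer year index:

```python
import pandas as pd

clean_df = pd.DataFrame({"year": ["2009", "1970"], "song": ["ego-remix", "hey now"]})

# Parse the year strings as datetimes, then keep only the integer year for the index.
clean_df.index = pd.to_datetime(clean_df["year"], format="%Y").dt.year
clean_df = clean_df.drop("year", axis=1)

print(clean_df.index.tolist())  # [2009, 1970]
```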
My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'
You can use .str to treat the whole column as strings and split at the dot.
Then apply a function that turns the suffix into a new year value where possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        value += int(entry[-1])
    finally:
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller's answer
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
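If the trailing .0 in the output above is unwanted, the same one-liner can be finished with an integer cast; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"Month": ["Apr", "Apr.1", "Apr.2"]})

# fillna(0) covers months with no ".n" suffix; astype(int) drops the trailing .0.
suffix = pd.to_numeric(df["Month"].str.split(".").str[-1], errors="coerce").fillna(0)
df["Year"] = (suffix + 2015).astype(int)

print(df["Year"].tolist())  # [2015, 2016, 2017]
```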
Suppose I have a dataframe indexed by monthly timesteps. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data into yearly groups and apply other operations. Is there some way I could quickly group them by, say, decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year // 10) * 10.
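For completeness, a small self-contained sketch of the lambda variant on the same example data (using freq="MS" for month starts):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2001", periods=500, freq="MS")
df = pd.DataFrame({"A": 5 * np.arange(len(dates)) + 2}, index=dates)

# The groupby key function receives each index Timestamp; the integer-divide
# trick maps every year onto the start of its decade.
by_decade = df.groupby(lambda ts: (ts.year // 10) * 10).sum()

print(by_decade)
```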
If your DataFrame has headers, say DataFrame[['Population', 'Salary', 'vehicle count']], make your index the year: DataFrame = DataFrame.set_index('Year').
Then use the code below to resample the data into decades of 10 years; it also gives you the sum of all other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group up:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: iloc here chooses the first column in your dataframe (the older .ix accessor is deprecated).
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases