Map dataframe column value by another column's value - python

My dataframe has a month column with values that repeat as Apr, Apr.1, Apr.2 etc. because there is no year column. I added a year column based on the month value using a for loop as shown below, but I'd like to find a more efficient way to do this:
Products['Year'] = '2015'
for i in range(0, len(Products.Month)):
    if '.1' in Products['Month'][i]:
        Products['Year'][i] = '2016'
    elif '.2' in Products['Month'][i]:
        Products['Year'][i] = '2017'

You can use the .str accessor to treat the whole column as strings and split at the dot.
Then apply a function that takes the piece after the dot and turns it into a year value where possible.
Starting dataframe:
Month
0 Apr
1 Apr.1
2 Apr.2
Solution:
def get_year(entry):
    value = 2015
    try:
        # entry is the list produced by str.split('.'); its last element is
        # the suffix ('1', '2', ...) or the month name itself if there was no dot
        value += int(entry[-1])
    finally:
        # the return inside finally swallows the ValueError raised for plain 'Apr'
        return str(value)

df['Year'] = df.Month.str.split('.').apply(get_year)
Now df is:
Month Year
0 Apr 2015
1 Apr.1 2016
2 Apr.2 2017
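Since the question title asks about mapping, a vectorized map with an explicit lookup table is another option (a minimal sketch; the dict only needs the suffixes that actually occur in the data):
df['Year'] = df.Month.str.split('.').str[-1].map({'1': '2016', '2': '2017'}).fillna('2015')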

You can use pd.to_numeric after splitting and add 2015, i.e.
df['new'] = pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce').fillna(0) + 2015
# Sample DataFrame from Mike Muller's answer above
Month Year new
0 Apr 2015 2015.0
1 Apr.1 2016 2016.0
2 Apr.2 2017 2017.0
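Note that pd.to_numeric plus fillna leaves float values, as the new column above shows; if integer years are wanted, an .astype(int) can be chained (a minor variation on the same line):
df['new'] = (pd.to_numeric(df['Month'].str.split('.').str[-1], errors='coerce')
               .fillna(0)
               .astype(int) + 2015)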

Related

How to make a year range column into separate columns?

I am using Python to analyze a data set that has a column with a year range (see below for example):
Name   Years Range
Andy   1985 - 1987
Bruce  2011 - 2018
I am trying to convert the "Years Range" column, which holds a string with start and end years, into two separate columns within the data frame: "Year Start" and "Year End".
Name   Years Range   Year Start   Year End
Andy   1985 - 1987   1985         1987
Bruce  2011 - 2018   2011         2018
You can use expand=True within the split function:
df[['Year Start','Year End']] = df['Years Range'].str.split('-', expand=True)
Output:
    Name  Years Range Year Start Year End
0   Andy  1985 - 1987      1985      1987
1  Bruce  2011 - 2018      2011      2018
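Splitting on a bare '-' leaves the surrounding spaces inside the new columns; splitting on ' - ' instead (or chaining .str.strip() afterwards) avoids that, assuming the separator is always spaced as in the sample:
df[['Year Start','Year End']] = df['Years Range'].str.split(' - ', expand=True)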
I think str.extract can do the job.
Here is an example :
df = pd.DataFrame(["1985 - 1987"], columns=["Years Range"])
df['Year Start'] = df['Years Range'].str.extract(r'(\d{4})')
df['Year End'] = df['Years Range'].str.extract(r'- (\d{4})')
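Both years can also be pulled in a single pass with two capture groups, a compact variant of the same idea:
df[['Year Start','Year End']] = df['Years Range'].str.extract(r'(\d{4})\s*-\s*(\d{4})')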
df['start'] = ''  # create a blank column named 'start'
df['end'] = ''    # create a blank column named 'end'
# loop over the data frame
for i in range(len(df)):
    df['start'][i] = df['Years Range'][i].split('-')[0]  # split each value and keep the first element
    df['end'][i] = df['Years Range'][i].split('-')[1]    # split each value and keep the second element
https://colab.research.google.com/drive/1Kemzk-aSUKRfE_eSrsQ7jS6e0NwhbXWp#scrollTo=esXNvRpnSN9I&line=1&uniqifier=1
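As a caveat on the loop above: chained assignment like df['start'][i] = ... commonly triggers pandas' SettingWithCopyWarning and may silently fail to write; the supported spelling uses .loc (same logic, only the indexer changes, assuming a default integer index):
for i in range(len(df)):
    parts = df.loc[i, 'Years Range'].split('-')
    df.loc[i, 'start'] = parts[0]
    df.loc[i, 'end'] = parts[1]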

Pandas replace NaN values with zeros after pivot operation

I have a DataFrame like this:
Year Month Day Rain (mm)
2021 1 1 15
2021 1 2 NaN
2021 1 3 12
And so on (there are multiple years). I have used pivot_table function to convert the DataFrame into this:
Year        2021  2020  2019  2018  2017
Month Day
1     1       15
      2      NaN
      3       12
I used:
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first')
Now I would like to replace all NaN values and also possible -1 values with zeros from every column (by columns I mean years) but I have not been able to do so. I have tried:
df = df.fillna(0)
And also:
df.loc[df['Rain (mm)'] == NaN, 'Rain (mm)'] = 0
But neither works: there is no error message or exception, the dataframe just remains unchanged. What am I doing wrong? Any advice is highly appreciated.
I think the problem is that the NaN values are strings, so fillna cannot replace them; first try converting the values to numeric:
df['Rain (mm)'] = pd.to_numeric(df['Rain (mm)'], errors='coerce')
df = df.pivot_table(index=['Month', 'Day'], columns='Year',
                    values='Rain (mm)', aggfunc='first').fillna(0)
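The question also mentions possible -1 values; those can be zeroed with an extra replace on the pivoted frame (a small addition, assuming -1 only ever marks missing data):
df = df.replace(-1, 0)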

Dataframe groupby to new dataframe

I have a table as below.
Month,Count,Parameter
March 2015,1,40
March 2015,1,10
March 2015,1,1
March 2015,1,25
March 2015,1,50
April 2015,1,15
April 2015,1,1
April 2015,1,1
April 2015,1,15
April 2015,1,15
I need to create a new table from above as shown below.
Unique Month,Total Count,<=30
March 2015,5,3
April 2015,5,5
The logic for the new table is as follows. "Unique Month" holds the unique months from the original table and needs to be sorted. "Total Count" is the sum of the "Count" column from the original table for that particular month. "<=30" is the count of rows with Parameter <= 30 for that particular month.
Is there an easy way to do this in dataframes?
Thanks in advance.
IIUC, just check for Parameter <= 30 and then groupby:
(df.assign(le_30=df.Parameter.le(30))
   .groupby('Month', as_index=False)   # pass sort=False if needed
   [['Count', 'le_30']].sum()
)
Or
(df.Parameter.le(30)
   .groupby(df['Month'])   # pass sort=False if needed
   .agg(['count', 'sum'])
)
Output:
Month Count le_30
0 April 2015 5 5.0
1 March 2015 5 3.0
Update: as commented above, adding sort=False to groupby will respect your original sorting of Month. For example:
(df.Parameter.le(30)
   .groupby(df['Month'], sort=False)
   .agg(['count', 'sum'])
   .reset_index()
)
Output:
Month count sum
0 March 2015 5 3.0
1 April 2015 5 5.0
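On pandas 0.25+, named aggregation expresses the same computation while naming the output columns up front (a sketch of that variant; rename afterwards if you need the exact headers from the question):
(df.groupby('Month', sort=False)
   .agg(total_count=('Count', 'sum'),
        le_30=('Parameter', lambda s: s.le(30).sum()))
   .reset_index())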

How to get cumsum using custom aggregation function in pandas

I have a DataFrame like the one below:
df = pd.DataFrame({'year':[2014,2017,2014,2016,2016],'prod':['A','B','C','D','E']})
I can get it using this
df.groupby('year').count().cumsum()
prod
year
2014 2
2016 4
2017 5
I want to produce this result using a custom function only, where custom_func can be passed like this:
df.groupby('year').agg({'year': custom_func})
Here is what I have tried so far:
def count_sum(series):
    # build a series of ones, one per row in the group, and sum them
    se = pd.Series(np.ones(series.shape[0]))
    return se.sum()

df.groupby('year').agg({'year': count_sum})  # this just behaves like the built-in 'count'
cumsum is applied to the output of the aggregation (here a one-column DataFrame), so it is necessary to chain it after agg:
print (df.groupby('year').agg({'year':count_sum}))
year
year
2014 2
2016 2
2017 1
df1 = df.groupby('year').agg({'year':count_sum}).cumsum()
print (df1)
year
year
2014 2
2016 4
2017 5
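Any reducing custom function works the same way, because agg collapses each group to a single value and the running total comes from chaining cumsum over the aggregated frame; for instance (a minimal sketch):
def custom_func(series):
    # any per-group reduction works here
    return len(series)

df1 = df.groupby('year').agg({'year': custom_func}).cumsum()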

How to count the number of dropoffs per month for dataframe column

I have a dataframe with records from 2011 to 2018. One of the columns holds drop_off_date, the date when the customer left the rewards program. For each month between 2011 and 2018 I want to count how many people dropped off during that month, i.e. for the 84-month period I want the per-month count of drop-offs based on the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212 | 12876    | 2011-01-27
a1534c543 | 12877    | 2011-02-23
a1232c952 | 12877    | 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the drop_off_date column and strip them to keep only the year and month:
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
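If the column is already a datetime (as the question states), the string handling can be skipped entirely by grouping on a monthly period instead (a sketch under that assumption):
counts = df.groupby(df['drop_off_date'].dt.to_period('M')).size()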
Using your data,
I'm assuming the date column has been cast to datetime with errors='coerce' to handle outliers.
You should then drop any NAs from it, so you're only dealing with customers who actually dropped off.
You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date'])
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregation key:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby with a count on Record ID (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: to account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month','Year' ])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
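To get labels in the exact "Jan 2011" shape from the question while keeping chronological order, the two helper columns can be replaced by a monthly period that is formatted at the end (a sketch, again assuming drop_off_date is datetime64):
per = df['drop_off_date'].dt.to_period('M')
df1 = df.groupby(per)['Record ID'].count().rename_axis('Month').reset_index()
df1['Month'] = df1['Month'].dt.strftime('%b %Y')  # periods sort chronologically before formatting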
