use only year for datetime index - pandas dataframe - python

I've converted my year column into a datetime index, but the month and day components are inaccurate and unneeded, since my dataset only includes the year. I've used the format parameter to parse the year only, yet the index still displays in full "%Y-%m-%d" format.
Original data:
   index  song           year  artist           genre
0  0      ego-remix      2009  beyonce knowles  Pop
1  1      shes-tell-me   2009  save             Rock
2  2      hello          2009  yta              Pop
3  3      the rock       2009  term             R&B
4  4      black-culture  2009  hughey           Country
I conducted a few more scrubbing steps on the data above. Here is my dataframe code, followed by example rows:
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y')
clean_df = clean_df.drop(['index', 'year'], axis=1)
clean_df.sort_index(inplace=True)
clean_df.head()
            song     artist   genre
year
1970-01-01  hey now  caravan  Rock
1970-01-01  show me  abc      Rock
1970-01-01  hey now  xyz      Pop
1970-01-01  tell me  foxy     R&B
1970-01-01  move up  curtis   R&B
Is there another method I can use to set the index to the year only?

You were close
clean_df.index = pd.to_datetime(clean_df['year'].astype(str), format='%Y').dt.year
It's hard to give the exact format string without your original data, but you just need to convert to a datetime (casting to string first if the column holds integers) and then access the year attribute, via the .dt accessor when starting from a Series.
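For example, with year-only values like those in the question, a minimal runnable sketch (data reconstructed from the question's sample) might look like this:
import pandas as pd

# small reconstruction of the question's data (assumed shape)
clean_df = pd.DataFrame({
    'song': ['ego-remix', 'shes-tell-me', 'hello'],
    'year': [2009, 2009, 2009],
    'artist': ['beyonce knowles', 'save', 'yta'],
    'genre': ['Pop', 'Rock', 'Pop'],
})

# parse the year column, then keep only the integer year as the index
clean_df.index = pd.to_datetime(clean_df['year'].astype(str), format='%Y').dt.year
clean_df = clean_df.drop(columns='year')
print(clean_df.head())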

I had a similar issue. Solved it this way:
import numpy as np

df['Year'] = df.Year.astype(np.datetime64)  # assumes the column holds date strings; plain ints would be misread
df['Year'] = df.Year.dt.year
df = df.set_index('Year')  # set_index returns a new frame, so assign it back
The output should then show only the 4-digit year.
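If you want the index to stay explicitly annual rather than become plain integers, a PeriodIndex is another option (my suggestion, not from either answer above):
# assumes clean_df still has its original 'year' column
clean_df.index = pd.to_datetime(clean_df['year'].astype(str), format='%Y').dt.to_period('Y')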

Related

Selecting columns and rows in a dataframe

Here I am trying to count the times a police officer is present at an accident (a 1 value; 2 and 3 mean not present), and whether they are more likely to be present on a weekday or at the weekend. So far I have put my data into day-of-the-week form; I now need to select the 1 values and compare them, if anyone knows how to do this. The code I have used and the pandas dataframe are below:
#first we need to modify the date so we can find days of the week
accidents['Date'] = pd.to_datetime(accidents['Date'], format="%d/%m/%Y")
accidents.sort_values(['Date', 'Time'], inplace=True)
#now we can assign days of the week
accidents['day'] = accidents['Date'].dt.strftime('%A')
#now we can count the number of police at each day of the week
accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'])
What I'm looking for in this bottom line is something like accidents.value_counts(['Did_Police_Officer_Attend_Scene_of_Accident','day'] == 1), but I'm unsure how to write it.
Data preview:
Accident_Index Location_Easting_OSGR Location_Northing Did_Police_Officer_Attend_Scene_of_Accident day
2019320634369 521429.0 21973.0 1 Tuesday
2019320634368 521429.0 21970.0 2 Tuesday
2019320634367 521429.0 21972.0 1 Wednesday
2019320634366 521429.0 21972.0 3 Sunday
2019320634366 521429.0 21971.0 1 Sunday
2019320634365 521429.0 21975.0 2 Monday
Update, desired outcome.
So here is the code I had for counting all of the attended accidents. I now wish to do this again, but split into weekdays and weekends.
#when did an officer attend
attended = (accidents.Did_Police_Officer_Attend_Scene_of_Accident == 1).sum()
This bit of code now needs to include the weekday (and a second version the weekend) before calling .sum().
My desired output would be similar to this but would also count the weekday and weekend values, preferably returned in two dataframes. That would let me compare the weekday dataframe to the weekend one and return a single value for each, showing which has more officers attending. (A sketch of one approach follows.)
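A minimal sketch of one way to do that (column names taken from the preview above; the weekend test via dt.dayofweek is my assumption):
# assumes 'accidents' already has the parsed 'Date' column from the code above
attended = accidents['Did_Police_Officer_Attend_Scene_of_Accident'] == 1
is_weekend = accidents['Date'].dt.dayofweek >= 5  # Monday=0 ... Saturday=5, Sunday=6

weekday_attended = (attended & ~is_weekend).sum()
weekend_attended = (attended & is_weekend).sum()
print(weekday_attended, weekend_attended)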

Count occurrences in column based on another column (date)

I am trying to count the number of "Type" occurrences by what month they are in.
Daily data is given, so to group by month I tried using .resample(), but the problem with using it is that it combines all the strings together into one LONG string, and then I can't count the number of occurrences using str.count() as it returns the wrong value (it finds too many matches because it isn't looking for the EXACT pattern).
I think it has to be done in more than one step...
I have tried SO many things... I even heard there is a pivot table?
Sample data:
Type  Date
Cat   2020-01-01
Cat   2020-01-01
Bird  2020-01-01
Dog   2020-01-01
Cat   2020-02-01
Cat   2020-03-01
Bird  2020-03-01
Cat   2020-05-02
... For all the months over a few years...
Converted to the following format (the header titles can be in numeric form as well):
      January 2020  February 2020
Cat   4             1
Bird  1             0
Dog   1             0
As far as I know, Pandas does not have a single standard function or typical approach to obtain your desired result. Below I've included a code snippet that gets there.
If you do not mind using extra packages, there exist some packages you can use for quicker/easier binary encoding (e.g. category_encoders).
import pandas as pd

# your data in dictionary format
d = {
    "Type": ["Cat","Cat","Bird","Dog","Cat","Cat","Bird","Cat"],
    "Date": ["2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-03-01","2020-03-01","2020-05-02"]
}
# create a dataframe with the dates as index
df = pd.DataFrame(data=d['Type'], index=pd.to_datetime(d['Date']))
animals = list(df[0].unique())     # a list containing all unique animals
ndf = pd.DataFrame(index=animals)  # empty new dataframe with all animals as index
for animal in animals:
    ndf.loc[animal, df.index.month.unique()] = (  # at row = animal, insert all unique months
        (df == animal).groupby(df.index.month)    # group by month, using .month (returns 1 for Jan)
        .sum()        # sum, since we use bool comparison
        .transpose()  # transpose due to desired output format
        .values       # array of values to insert
    )
# convert column names back to datetime and save as string in desired format
ndf.columns = pd.to_datetime(ndf.columns, format='%m').strftime('%B 2020')
Result
      January 2020  February 2020  March 2020  May 2020
Cat   2             1              1           1
Bird  1             0              1           0
Dog   1             0              0           0
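For what it's worth, a more compact route to the same table (my sketch, not part of the original answer) is pd.crosstab of the type against the month period:
import pandas as pd

df = pd.DataFrame({
    "Type": ["Cat","Cat","Bird","Dog","Cat","Cat","Bird","Cat"],
    "Date": pd.to_datetime(["2020-01-01","2020-01-01","2020-01-01","2020-01-01",
                            "2020-02-01","2020-03-01","2020-03-01","2020-05-02"])
})

# cross-tabulate animal type against calendar month; missing combinations become 0
out = pd.crosstab(df['Type'], df['Date'].dt.to_period('M'))
out.columns = out.columns.strftime('%B %Y')  # e.g. 'January 2020'
print(out)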

Is there a way to count and calculate mean for text columns using groupby?

I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing groupby for three variables, I keep running into a DataError: No numeric types to aggregate error while working with the cancelled column.
To describe my data: Year and Month contain yearly and monthly data (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains yes or no string values indicating whether an order was cancelled.
I am hoping to plot a graph and show a table to show what the cancellation rate (and success rate) is by order item. The following is what I'm using so far
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample
Year Month Type cancelled
2012 1 electronics yes
2012 10 fiber yes
2012 9 clothes no
2013 4 vegetables yes
2013 5 appliances no
2016 3 fiber no
2017 1 clothes yes
Use:
df = pd.DataFrame({
    'Year': [2020] * 6,
    'Month': [7,8,7,8,7,8],
    'cancelled': ['yes','no'] * 3,
    'Type': list('aaaaba')
})
print (df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled no yes
Year Month Type
2020 7 a 0 2
b 0 1
8 a 3 0
Then divide by the column totals and multiply by 100 to get percentages:
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled no yes
Year Month Type
2020 7 a 0.0 66.666667
b 0.0 33.333333
8 a 100.0 0.000000
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0
df['cancelled'] = df['cancelled'].astype(int)  # ensure a numeric dtype so .mean() works

# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()

# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
Rate
electronics 1.0
fiber 0.5
clothes 0.5
vegetables 1.0
appliances 0.0
Note: If you want to specify specific years or months, you can do that with loc as well, but given that your example data did not have any repeats within a given year or month, this would return your original dataframe for your given example.
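As a side note, once 'cancelled' is numeric, the loop above collapses into a single groupby (my addition, equivalent under the same assumptions, using the question's sample rows):
import pandas as pd

df = pd.DataFrame({
    'Type': ['electronics', 'fiber', 'clothes', 'vegetables',
             'appliances', 'fiber', 'clothes'],
    'cancelled': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'yes'],
})
df['cancelled'] = df['cancelled'].map({'yes': 1, 'no': 0})

# the mean of a 0/1 column per group is exactly the cancellation rate
results = df.groupby('Type')['cancelled'].mean().to_frame('Rate')
print(results)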

How to count the number of dropoffs per month for dataframe column

I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. I want to count, for each month between 2011 and 2018, how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off, using the drop_off_date column.
I changed the column to datetime and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in the drop_off_date column and strip them to keep only the year and month:
df['drop_off_ym'] = df['drop_off_date'].str[:-3]
Then you apply a groupby on the newly created column and then a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
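Put together with the preview data, that looks like this (a small sketch; it assumes drop_off_date is still a string column):
import pandas as pd

df = pd.DataFrame({
    'Record ID': ['a1274c212', 'a1534c543', 'a1232c952'],
    'store ID': [12876, 12877, 12877],
    'drop_off_date': ['2011-01-27', '2011-02-23', '2018-12-02'],
})

df['drop_off_ym'] = df['drop_off_date'].str[:-3]  # keep only 'YYYY-MM'
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
print(df_counts_by_month)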
Using your data, I'm assuming your date has been cast to a datetime value with errors='coerce' to handle outliers.
You should then drop any NAs from it so you're only dealing with customers who dropped off.
You can do this in a multitude of ways; I would do a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregate:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby count on Record ID (this counts rows per month; use .nunique() instead if you only want unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: to account for years, first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
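Alternatively, since the column is already a datetime, a one-step year-aware monthly count is possible with to_period (my sketch, not from the answers above):
import pandas as pd

df = pd.DataFrame({
    'Record ID': ['a1274c212', 'a1534c543', 'a1232c952'],
    'drop_off_date': pd.to_datetime(['2011-01-27', '2011-02-23', '2018-12-02']),
})

# count dropoffs per calendar month, keeping the year, sorted chronologically
counts = df['drop_off_date'].dt.to_period('M').value_counts().sort_index()
print(counts)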

Python Pandas: Change value associated with each first day entry in every month

I'd like to change the value associated with the first day in every month for a pandas.Series I have. For example, given something like this:
Date
1984-01-03 0.992701
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 1.009894
1984-02-02 0.996608
1984-02-03 0.996595
...
I'd like to change the values associated with 1984-01-03, 1984-02-01 and so on. I've racked my brain for hours on this one and have looked around Stack Overflow a fair bit. Some solutions have come close. For example, using:
[In]: series.groupby((m_ret.index.year, m_ret.index.month)).first()
[Out]:
Date Date
1984 1 0.992701
2 1.009894
3 1.005963
4 0.997899
5 1.000342
6 0.995429
7 0.994620
8 1.019377
9 0.993209
10 1.000992
11 1.009786
12 0.999069
1985 1 0.981220
2 1.011928
3 0.993042
4 1.015153
...
Is almost there, but I'm struggling to proceed further.
What I'd like to do is set the value associated with the first day present in each month of every year to 1.
series[m_ret.index.is_month_start] = 1 comes close, but the problem is that is_month_start only selects rows where the day value is 1. However, in my series this isn't always the case, as you can see: the first day present in January 1984 is 1984-01-03.
series.groupby(pd.TimeGrouper('BM')).nth(0) doesn't appear to return the first day either; instead I get the last day:
Date
1984-01-31 0.992701
1984-02-29 1.009894
1984-03-30 1.005963
1984-04-30 0.997899
1984-05-31 1.000342
1984-06-29 0.995429
1984-07-31 0.994620
1984-08-31 1.019377
...
I'm completely stumped. Your help is as always, greatly appreciated! Thank you.
One way would be to use your .groupby((m_ret.index.year, m_ret.index.month)) idea, but use idxmin instead on the index itself converted into a Series:
In [74]: s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
Out[74]:
Date Date
1984 1 1984-01-03
2 1984-02-01
Name: Date, dtype: datetime64[ns]
In [75]: start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
In [76]: s.loc[start] = 999
In [77]: s
Out[77]:
Date
1984-01-03 999.000000
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 999.000000
1984-02-02 0.996608
1984-02-03 0.996595
dtype: float64
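The same idea also works with to_period instead of the (year, month) pair; a variant sketch on a small reconstruction of the data:
import pandas as pd

s = pd.Series(
    [0.992701, 1.003614, 1.009894, 0.996608],
    index=pd.to_datetime(['1984-01-03', '1984-01-04',
                          '1984-02-01', '1984-02-02']),
)

# earliest timestamp actually present in each calendar month
first_days = s.index.to_series().groupby(s.index.to_period('M')).idxmin()
s.loc[first_days] = 1
print(s)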