I have a data frame of client names, dates and transactions. I'm not sure how far back my error goes, so here is all the pre-processing I do:
data = pd.read_excel('Test.xls')
## convert to datetime object
data['Date Order'] = pd.to_datetime(data['Date Order'], format = '%d.%m.%Y')
## add columns for month and year of each row for easier analysis later
data['month'] = data['Date Order'].dt.month
data['year'] = data['Date Order'].dt.year
So the data frame becomes something like:
Date Order NameCustomers SumOrder month year
2019-01-02 00:00:00 Customer 1 290 1 2019
2019-02-02 00:00:00 Customer 1 50 2 2019
-----
2020-06-28 00:00:00 Customer 2 900 6 2020
------
..etc.
You get the idea. Next I group by both month and year and calculate the mean.
groupedMonthYearMean = data.groupby(['month', 'year'])['SumOrder'].mean().reset_index()
Output:
month year SumOrder
1 2019 233.08
1 2020 303.40
2 2019 255.34
2 2020 842.24
--------------------------
I use the resulting dataframe to make a lineplot, which tracks the SumOrder for each month, and displays it for each year.
linechart = sns.lineplot(x = 'month',
y = 'SumOrder',
hue = 'year',
data = groupedMonthYearMean).set_title('Mean Sum Order by month')
plt.show()
I have attached a screenshot of the resulting plot - overall it seems to show what I expected to create.
In my entire data, the 'year' column has only two values: 2019 and 2020. For some reason, whatever I do, they show up as 0, -1 and -2. Any ideas what is going on?
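You want to change the dtype of the year column from int to category:
groupedMonthYearMean['year'] = groupedMonthYearMean['year'].astype('category')
This is due to how hue treats ints: seaborn maps a numeric hue as a continuous variable and builds the legend from tick-like levels rather than from the actual values. A categorical column gets one legend entry per value.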
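As a minimal sketch of that conversion (the small frame below is made up from the sample output in the question, not the real data), you can check that the column really is categorical before passing it to seaborn:

```python
import pandas as pd

# Made-up stand-in for groupedMonthYearMean, using the sample rows from the question
grouped = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "year": [2019, 2020, 2019, 2020],
    "SumOrder": [233.08, 303.40, 255.34, 842.24],
})

# Convert the integer years to a categorical column
grouped["year"] = grouped["year"].astype("category")

print(grouped["year"].dtype)                 # category
print(list(grouped["year"].cat.categories))  # [2019, 2020]
```

The same frame can then be passed to sns.lineplot with hue='year' exactly as in the question.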
Related
I've created the following dataframe from data given on CDC link.
googledata = pd.read_csv('/content/data_table_for_daily_case_trends__the_united_states.csv', header=2)
# Inspect data
googledata.head()
id State         Date        New Cases
0  United States Oct 2 2022  11553
1  United States Oct 1 2022  8024
2  United States Sep 30 2022 46383
3  United States Sep 29 2022 89873
4  United States Sep 28 2022 63763
After converting the Date column to datetime, I trimmed the data to the last year with a mask:
googledata['Date'] = pd.to_datetime(googledata['Date'])
df = googledata
start_date = '2021-10-1'
end_date = '2022-10-1'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
But the problem is that the data is daily and I want it weekly, i.e. I want to convert the 365 rows to 52 rows, where each row is the mean of New Cases over the 7 days of that week.
I tried the following method from a previous post (link), but I don't think I am even applying it correctly, because this code never asks me to put my dataframe anywhere!
logic = {'New Cases' : 'mean'}
offset = pd.offsets.timedelta(days=-6)
f = pd.read_clipboard(parse_dates=['Date'], index_col=['Date'])
f.resample('W', loffset=offset).apply(logic)
But I am getting the following error:
AttributeError: module 'pandas.tseries.offsets' has no attribute 'timedelta'
If I understand correctly, you want to resample:
df = df.set_index("Date")
df.index = df.index - pd.tseries.frequencies.to_offset("6D")
df = df.resample("W").agg({"New Cases": "mean"}).reset_index()
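A related sketch, assuming made-up daily data and that you want each week labelled by its first day (which is what the removed loffset argument was used for): resample first, then shift the labels afterwards. The frame and its values are invented for illustration.

```python
import pandas as pd

# Made-up daily data: two full Monday-to-Sunday weeks
daily = pd.DataFrame({
    "Date": pd.date_range("2022-01-03", "2022-01-16", freq="D"),
    "New Cases": range(1, 15),
})

# Weekly mean; "W" bins end on Sunday and are labelled by that Sunday
weekly = (
    daily.set_index("Date")
    .resample("W")
    .agg({"New Cases": "mean"})
    .reset_index()
)

# Relabel each bin by its Monday instead of its Sunday
weekly["Date"] = weekly["Date"] - pd.Timedelta(days=6)
print(weekly)
```

Shifting the labels after resampling leaves the bin membership untouched; shifting the index before resampling, as above, also changes which days fall into each week.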
You can use strftime to convert date to week number before applying groupby
df['Week'] = df['Date'].dt.strftime('%Y-%U')
df.groupby('Week')['New Cases'].mean()
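One thing to watch with '%Y-%U' keys is that the week number restarts at 00 on 1 January, so a week that straddles New Year is split into several groups. A small sketch with made-up numbers, comparing it with weekly periods from dt.to_period (note that %U weeks start on Sunday, while "W" periods run Monday to Sunday):

```python
import pandas as pd

# Made-up data: one Monday-to-Sunday week that straddles New Year
df = pd.DataFrame({
    "Date": pd.date_range("2021-12-27", "2022-01-02", freq="D"),
    "New Cases": [10, 20, 30, 40, 50, 60, 70],
})

# '%Y-%U' keys: the year change breaks this week into separate groups
by_strftime = df.groupby(df["Date"].dt.strftime("%Y-%U"))["New Cases"].mean()

# Weekly periods keep the seven days together in one group
by_period = df.groupby(df["Date"].dt.to_period("W"))["New Cases"].mean()

print(by_strftime)
print(by_period)
```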
I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year week starts at 2021-06-28, which is the first day of July.
I tried:
df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24).timetuple()
.tm_yday)).dt.isocalendar().week
I played around with the timedelta days values so that the 2021-06-28 has a value of 1.
But then I got problems with previous & dates exceeding my start date + 1 year:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is from 2021-06-28 + 1 year.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? As I am aggregating the data by year week, I get incorrect results due to the past and upcoming dates. I would like negative values for the days before 2021-06-28, or LY38 to denote that it's the year week of the last year, and accordingly year weeks of 52+, or NY8 to denote that this is the 8th week of the next year.
Here is a way; I added two dates more than a year away. You need the isocalendar of the difference between the date column and the day of year of your specific date. Then you can pick a scenario depending on how the resulting ISO year compares with the year of your specific date. Use np.select to build the different result formats.
import numpy as np
import pandas as pd

# dummy dataframe
df = pd.DataFrame(
{'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
'2022-05-21', '2022-08-17', '2023-08-17']
}
)
# define start date
d = pd.to_datetime('2021-6-24')
# subtract the start date's day of year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
).dt.isocalendar()
# get the difference in year
m = (s['year'].astype('int32') - d.year)
# all condition of result depending on year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY','NY',(m+1).astype(str)+'LY', '+'+(m-1).astype(str)+'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
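If plain numbers are enough (negative before the start date, above 52 after the first year, as the question suggests), a simpler sketch is to count whole weeks from the start date directly. This assumes week 1 begins on 2021-06-28 and every week is exactly 7 days long, so it does not follow ISO week numbering; the column name year_week is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2021-03-12", "2021-06-28", "2022-05-21", "2022-08-17"])})

start = pd.Timestamp("2021-06-28")  # first day of week 1

# Whole weeks elapsed since the start date (floor division),
# shifted by one so that the start week is numbered 1
df["year_week"] = (df["date"] - start).dt.days // 7 + 1

print(df["year_week"].tolist())  # [-15, 1, 47, 60]
```

With this numbering the week immediately before the start date is 0, and earlier weeks are negative.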
I think pandas period_range can be of some help:
n_weeks = 52  # placeholder: set this to the number of weeks you want
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))
Python and Pandas beginner here.
I want to round off a pandas dataframe column to years. Dates before the 1st of July must be rounded off to the current year and dates after and on the 1st of July must be rounded up to the next year.
For example:
2011-04-05 must be rounded to 2011
2011-08-09 must be rounded to 2012
2011-06-30 must be rounded to 2011
2011-07-01 must be rounded to 2012
What I've tried:
pd.series.dt.round(freq='Y')
Gives the error: ValueError: <YearEnd: month=12> is a non-fixed frequency
The dataframe column has a wide variety of dates, starting from 1945 all the way up to 2021. Therefore a simple if df.date < 2011-07-01: df['Date']+ pd.offsets.YearBegin(-1) is not working.
I also tried the dt.to_period('Y') function, but then I can't give the before and after the 1st of July argument.
Any tips on how I can solve this issue?
Suppose you have this dataframe:
dates
0 2011-04-05
1 2011-08-09
2 2011-06-30
3 2011-07-01
4 1945-06-30
5 1945-07-01
Then:
import numpy as np
import pandas as pd

# convert to datetime:
df["dates"] = pd.to_datetime(df["dates"])
df["year"] = np.where(
(df["dates"].dt.month < 7), df["dates"].dt.year, df["dates"].dt.year + 1
)
print(df)
Prints:
dates year
0 2011-04-05 2011
1 2011-08-09 2012
2 2011-06-30 2011
3 2011-07-01 2012
4 1945-06-30 1945
5 1945-07-01 1946
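Another way to get the same result, sketched here as an alternative rather than as part of the answer above: shift every date forward by six months and take the calendar year. Dates on or after the 1st of July land in the next year, earlier dates stay in the current one.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2011-04-05", "2011-08-09", "2011-06-30",
    "2011-07-01", "1945-06-30", "1945-07-01",
]))

# Adding six months pushes July-December dates into the following year
rounded = (dates + pd.DateOffset(months=6)).dt.year

print(rounded.tolist())  # [2011, 2012, 2011, 2012, 1945, 1946]
```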
A somewhat roundabout way is to convert the date values to strings, split them, and then classify them in a loop, like so:
rounded_years = []
for i in df["Date"]:  # assuming the column's name is "Date"
    thisdate = str(i)  # convert the current element to a string
    datesplit = thisdate.split("-")  # split into year, month and the rest
    Yr = int(datesplit[0])  # convert the year back to a number
    Mth = int(datesplit[1])  # convert the month back to a number
    if Mth < 7:  # any date before July
        rnd_Yr = Yr
    else:  # any date on or after July 1st
        rnd_Yr = Yr + 1
    rounded_years.append(rnd_Yr)
df["year"] = rounded_years  # store the rounded years in a new column
I have a dataframe (df) with a column in datetime format YYYY-MM-DD ('date'). I am trying to create a new column that returns the policy year, which always starts on April 1st, and thus the policy year for January through March will always be the prior calendar year. There are dates that are rather old, so setting up individual date ranges for the sample size below wouldn't be ideal.
The dataframe would look like this
df['date']
2020-12-10
2021-02-10
2019-03-31
and output should look like this
2020
2020
2018
I now know how to get the year using df['date'].dt.year. However, I am having trouble getting the dataframe to convert each year to the respective policy year so that if df['date'].dt.month >= 4 then df['date'].dt.year, else df['date'].dt.year - 1
I am not quite sure how to set this up exactly. I have been trying to avoid setting up multiple columns to do a bool for month >= 4 and then setting up different columns. I've gone so far as to set up this but get ValueError stating the series is too ambiguous
def PolYear(x):
    y = x.dt.month
    if y >= 4:
        x.dt.year
    else:
        x.dt.year - 1
df['Pol_Year'] = PolYear(df['date'])
I wasn't sure if this was the right way to go about it, so I also tried a df.loc format for >= and < 4, but len key and value are not equal. Definitely think I'm missing something super simple.
I previously had mentioned 'fiscal year', but this is incorrect.
Quang Hoand had the right idea but used the incorrect frequency in the call to to_period(self, freq). For your purposes you want to use the following code:
df.date.dt.to_period('Q-MAR').dt.qyear
This will give you:
0 2021
1 2021
2 2019
Name: date, dtype: int64
Q-MAR defines fiscal year end in March
These values are the correct fiscal years (fiscal years use the year in which they end, not the one in which they begin [reference]). If you want the output to use the year in which they begin, it's simple:
df.date.dt.to_period('Q-MAR').dt.qyear - 1
Giving you
0 2020
1 2020
2 2018
Name: date, dtype: int64
qyear docs
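For comparison, here is a sketch of the month-based rule exactly as the question states it, without periods. The ValueError in the question arises because `if y >= 4` asks for a single truth value from a whole Series; a vectorised Series.where avoids that. The column name Pol_Year is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(
    ["2020-12-10", "2021-02-10", "2019-03-31"])})

year = df["date"].dt.year

# Keep the calendar year from April onwards, otherwise use the previous year
df["Pol_Year"] = year.where(df["date"].dt.month >= 4, year - 1)

print(df["Pol_Year"].tolist())  # [2020, 2020, 2018]
```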
This is qyear:
df.date.dt.to_period('Q').dt.qyear
Output:
0 2020
1 2021
2 2019
Name: date, dtype: int64
I have two time series, df1
day cnt
2020-03-01 135006282
2020-03-02 145184482
2020-03-03 146361872
2020-03-04 147702306
2020-03-05 148242336
and df2:
day cnt
2017-03-01 149104078
2017-03-02 149781629
2017-03-03 151963252
2017-03-04 147384922
2017-03-05 143466746
The problem is that the sensors I'm measuring are sensitive to the day of the week, so on Sunday, for instance, they will produce less cnt. Now I need to compare the time series over 2 different years, 2017 and 2020, but to do that I have to align (March, in this case) to the matching day of the week, and plot them accordingly. How do I "shift" the data to make the series comparable?
The ISO calendar represents a date as a tuple (year, week number, weekday). In pandas, dt.isocalendar() returns these as the columns year, week and day (the older dt.weekofyear accessor was deprecated in pandas 1.1 and removed in 2.0). So assuming that the day column actually contains Timestamps (convert it first with to_datetime if it does not), you could do:
df1['Y'] = df1.day.dt.year
df1['W'] = df1.day.dt.isocalendar().week
df1['D'] = df1.day.dt.weekday
Then you could align the dataframes on the W and D columns.
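A minimal sketch of that alignment, assuming both frames cover the whole of March and using made-up counts instead of the real sensor data. The helper add_iso_keys is hypothetical, introduced only for this example.

```python
import pandas as pd

def add_iso_keys(df):
    """Add ISO week number (W) and ISO weekday (D, Monday=1) columns."""
    iso = df["day"].dt.isocalendar()
    return df.assign(W=iso["week"].astype(int), D=iso["day"].astype(int))

# Made-up counts for March 2020 and March 2017
df1 = pd.DataFrame({"day": pd.date_range("2020-03-01", "2020-03-31")})
df1["cnt"] = range(100, 100 + len(df1))
df2 = pd.DataFrame({"day": pd.date_range("2017-03-01", "2017-03-31")})
df2["cnt"] = range(200, 200 + len(df2))

# Inner merge on week and weekday pairs each 2020 day with the
# 2017 day that has the same ISO week number and the same weekday
merged = add_iso_keys(df1).merge(
    add_iso_keys(df2), on=["W", "D"], suffixes=("_2020", "_2017")
)
print(merged[["W", "D", "day_2020", "cnt_2020", "day_2017", "cnt_2017"]].head())
```

Days that have no counterpart in the other year (for example the last two days of March 2020, which fall in ISO week 14) are dropped by the inner merge.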
March 2017 started on a Wednesday.
March 2020 started on a Sunday.
So delete the first Sunday, Monday and Tuesday from March 2020, and delete the last 3 days of March 2017.
This way you have comparable days (28 in each series, both starting on a Wednesday):
df1['cnt2020'] = df1['cnt']
df2['cnt2017'] = df2['cnt']
# df1 holds 2020: drop its first 3 rows; df2 holds 2017: drop its last 3 rows
df1 = df1.iloc[3:, 2].reset_index(drop=True)
df2 = df2.iloc[:-3, 2].reset_index(drop=True)
Since you don't want to plot the date but want the months to align, make a new dataframe with both columns and an index column. This way you will have 3 columns: index (0-27), 2020 and 2017. The index represents the position of each day within the aligned window.
new_df = pd.concat([df1, df2], axis=1)
If you also want to plot the days of the week on the x axis, check out this link to see how to get the day of the week from a date, and then change the x tick labels.
Sorry for the written step-by-step; if it all sounds confusing, I can type the whole code later for you.