I have a dataframe of downsampled Open/High/Low/Last/Change/Volume values for a security over ten years.
I'm trying to get the weekly count of samples i.e. how many samples did my downsampling method, in this case a Volume bar, sample per week over the entire dataset so that I can plot it and compare to other downsampling methods.
So far I've tried creating a series in the df called 'Year-Week' following the answers prescribed here and here.
The problem with these answers is that my EOY dates such as '1997-12-30' get transformed to '1997-01' because of the ISO calendar system used as described in this answer, which breaks my results when I apply the value_counts method.
My code is the following:
volumeBar['Year/Week'] = (pd.Series(volumeBar.index).dt.year.astype(str) + "/" + pd.Series(volumeBar.index).dt.week.astype(str)).values
So my question is: As it stand the following sample DateTimeIndex
Date
1997-12-22
1997-12-29
1997-12-30
becomes
Year/Week
1997/52
1997/1
1997/1
How could I get the following expected result?
Year/Week
1997/52
1997/52
1997/52
Please keep in mind that I cannot manually correct this behavior because of the size of the dataset and the erradict nature of these appearing results due to the way the ISO calendar works.
Many thanks in advance!
You can use the below function get_years_week to get years and weeks without ISO formating.
import pandas as pd
import datetime
a = {'Date': ['1997-11-29', '1997-12-22',
'1997-12-29',
'1997-12-30']}
data = pd.DataFrame(a)
data['Date'] = pd.to_datetime(data['Date'])
# Function for getting weeks and years
def get_years_week(data):
# Get year from date
data['year'] = data['Date'].dt.year
# loop over each row of date column and get week number
for i in range(len(data)):
data['week'] = (((data['Date'][i] - datetime.datetime\
(data['Date'][i].year,1,1)).days // 7) + 1)
# create column for week and year
data['year/week'] = pd.Series(data_2['year'].astype('str'))\
+ '/' + pd.Series(data_2['week'].astype('str'))
return data
Related
Trying to use the shift function for Feature Extraction to create 3 additional columns: same day last week, same day last month, same day last year. Data I am using is found here
Initially, I am trying to just use the shift function before creating a new column.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data['year'] = data['timestamp'].dt.year
data['month'] = data['timestamp'].dt.month
data['day'] = data['timestamp'].dt.day
data['day'] = pd.to_datetime(data['day'])
data.info()
the_7_days_diff = data['day'] - data.shift(freq='7D')['day']
Getting an error "This method is only implemented for DatetimeIndex, PeriodIndex and TimedeltaIndex; Got type RangeIndex"
Any help would be appreciated to understand what i am doing wrong.
The error implies that shift is applied on the index of the dataframe, not the value. You need to set the timestamp column as index after converting it to datetime data type.
data['timestamp'] = pd.to_datetime(data['timestamp'])
data = data.set_index('timestamp')
week_diff = (data - data.shift(freq='7D')).dropna()
I have a dataset of highest and lowest temperatures recorded for each day of the year, for the years 2005-2014. I want to create a graph where I plot the max and min temperatures for each day of the year for this period (so there will be only one max and min temperature for each day plotted). I was able to create a df from the data set of the absolute min and maxs for each day, here's the example of the max:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
# splitting 2005-2014 df dates into separate columns for easier analysis
weather_05_14['Year'] = weather_05_14['Date'].dt.strftime('%Y')
weather_05_14['Month'] = weather_05_14['Date'].dt.strftime('%m')
weather_05_14['Day'] = weather_05_14['Date'].dt.strftime('%d')
# extracting the min and max temperatures for each day, regardless of year
max_temps = weather_05_14.loc[weather_05_14.groupby(['Day', 'Month'], sort=False)
['Data_Value'].idxmax()][['Data_Value', 'Date']]
max_temps.rename(columns={'Data_Value': 'Max'}, inplace=True)
This is what the data frame looks like:
Now here's where my issue is. I want to plot this data in a line plot based on month/day, disregarding the year so it's in order. My thought was that I could do this by changing the year to be the same for every data point (as it won't be data that will be in the final graph anyway) and this is what I did to try to accomplish that:
max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005)
but I got this error:
ValueError: day is out of range for month
I have also tried to take my separate Day, Month, Year columns that I used to group by, include those with the max_temps df, change the year, and then move those all to a new column and convert them to a datetime object, but I get a similar error
max_temps['Year'] = 2005
max_temps['New Date'] = pd.to_datetime[max_temps[['Year', 'Month', 'Day']])
Error: ValueError: cannot assemble the datetimes: day is out of range for month
I have also tried to ignore this issue and then plot with the pandas plot function like:
max_temps.plot(x=['Month', 'Day'], y=['Max'])
Which does work but then I don't get the full functionality of matplotlib (as far as I can tell anyway, I'm new to these libraries).
It gives me this graph:
This is close to the result I'm looking for, but I'd like to use matplotlib to do it.
I feel like I'm making the problem harder than it needs to be but I don't know how. If anyone has any advice or suggestions I would greatly appreciate it, thanks!
As #Jody Klymak pointed out, the reason max_temps['Date'] = max_temps['Date'].apply(lambda x: x.replace(year=2005) isn't working is because in your full dataset, there's probably a leap year and the 29th is included. That means that when you try to set the year to 2005, pandas is trying to create the date 2005-02-29 which will throw
ValueError: day is out of range for month. You can fix this by choosing the year 2004 instead of 2005.
My solution would be to disregard the year entirely, and create a new column that includes the month and day in the format "01-01". Since the month comes first, then all of these strings are guaranteed to be in chronological order regardless of the year.
Here's an example:
import pandas as pd
import matplotlib.pyplot as plt
max_temps = pd.DataFrame({
'Max': [15.6,13.9,13.3,10.6,12.8,18.9,21.7],
'Date': ['2005-01-01','2005-01-02','2005-01-03','2007-01-04','2007-01-05','2008-01-06','2008-01-07']
})
max_temps['Date'] = pd.to_datetime(max_temps['Date'])
## use string formatting to create a new column with Month-Day
max_temps['Month_Day'] = max_temps['Date'].dt.strftime('%m') + "-" + max_temps['Date'].dt.strftime('%d')
plt.plot(max_temps['Month_Day'], max_temps['Max'])
plt.show()
I had a list of dates that turned into week number and years using;
dfweek['weeknum'] = df['Date'].dt.strftime('%U_%Y')
This would output: 34_2019
34 being the 34th week of 2019
How would I go about sorting data by this string in chronological order since the order comes out:
00_2018
00_2019
01_2018
01_2019
I tried converting back to datetime by:
dfweek['weeknum1'] = pd.to_datetime(dfweek['weeknum'], format = '%W_%Y')
This kept returning the error: ValueError: Cannot use '%W' or '%U' without day and year
Tried adding a day in the form of %w just to see what happens
dfweek['weeknum'] = df['Date'].dt.strftime('%U_%Y_%w')
dfweek['weeknum1'] = pd.to_datetime(dfweek['weeknum'], format = '%W_%Y_%w')
but it just spits back the original date without the week number
My desired output would be
00_2018
01_2018
02_2018
...
51_2019
52_2019
You can use the following for the sorting:
dfweek = dfweek.assign(weeknum1= df['Date'].dt.strftime('%Y_%U')).sort_values('weeknum1')
Here, we made a temporary column weeknum1 using format e.g. '2018_00' and then sort using this format. As a result, it is sorting in year + week number as required.
I have a data frame consisting of hourly wind speed measurements for the year 2012 for different locations as seen below:
I was able to change the index to datetime format for just one location using the code:
dfn=df1_s.reset_index(drop=True)
import datetime
dfn['datetime']=pd.to_datetime(dfn.index,
origin=pd.Timestamp('2012-01-01 00:00:00'), unit= 'H')
When using the same code over the entire dataframe, i obtain the error: cannot convert input with unit 'h'. This is probably because there are way to much data than the number of years to be represented by 'hours', i am not sure. Nevertheless, it works when i use units in minutes i.e. units='m'.
What i want to do is to set the datetime in such a way that it repeats itself after every 8784 hours i.e. having the same replicate datetime format for each location on the same dataframe as seen in the image below(expected results produced on excel).
When trying the following, all i obtained was a column with a series on NaNs:
import pdb, random
dates = pd.date_range('2012-01-01', '2013-01-01', freq='H')
data = [int(1000*random.random()) for i in range(len(dates))]
dfn['cum_data'] = pd.Series(data, index=dates)
Can you please direct me on how to go about this?
I would like to do some annual statistics (cumulative sum) on an daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so the get an offset of one day per leap year and
(2) the beginning of the first year (until end of June) is appended to the end of the rolled time series, which creates some "fake year" where the cumulative sums doesn't make sense anymore.
I tried also to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling to me also did not seem to be an option, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use a xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
# create an array of years (modify day/month for your use case)
my_years = xr.DataArray([t.year if ((t.month < 9) or ((t.month==9) and (t.day < 15))) else (t.year + 1) for t in foo.indexes['time']],
dims='time', name='my_years', coords={'time': dr})
# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))
# Voila!
foo_cumsum['data'].plot()