I have two dataframes.
Dataframe 'prices' contains minute-by-minute pricing:
ts average
2017-12-13 15:55:00-05:00 339.389
2017-12-13 15:56:00-05:00 339.293
2017-12-13 15:57:00-05:00 339.172
2017-12-13 15:58:00-05:00 339.148
2017-12-13 15:59:00-05:00 339.144
Dataframe 'articles' contains articles:
ts title
2017-10-25 11:45:00-04:00 Your Evening Briefing
2017-11-24 14:15:00-05:00 Tesla's Grand Designs Distract From Model 3 Bo...
2017-10-26 11:09:00-04:00 UAW Files Claim That Tesla Fired Workers Who S...
2017-10-25 11:42:00-04:00 Forget the Grid of the Future, Puerto Ricans J...
2017-10-22 09:54:00-04:00 Tesla Reaches Deal for Shanghai Facility, WSJ ...
When an article is published, I want the average stock price at that minute (easy), plus the stock price at the end of that day (the problem).
My current approach:
articles['t-eod'] = prices.loc[articles.index.strftime('%Y-%m-%d')[0]].between_time('15:30','15:31')
However, it gives a warning:
/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""Entry point for launching an IPython kernel.
Reading the docs didn't make it much clearer to me.
So my question: how can I, for every article, get the last average price in 'prices' on that day?
Thanks!
/Maurice
You could try using idxmax on ts to identify the index of the latest timestamp on each date, then extract the average value with loc:
import pandas as pd

#Reset our index
prices_df.reset_index(inplace=True)
articles_df.reset_index(inplace=True)
#Ensure our ts field is datetime
prices_df['ts'] = pd.to_datetime(prices_df['ts'])
articles_df['ts'] = pd.to_datetime(articles_df['ts'])
#Get the row with the latest timestamp per date from prices_df
df_max = prices_df.loc[prices_df.groupby(prices_df.ts.dt.date).ts.idxmax()]
#We need to join df_max and articles on the date so we make a new index
df_max['date'] = df_max.ts.dt.date
articles_df['date'] = articles_df.ts.dt.date
df_max.set_index('date',inplace=True)
articles_df.set_index('date',inplace=True)
#Set the end-of-day average for each article
articles_df['max'] = df_max['average']
articles_df.set_index('ts',inplace=True)
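As a follow-up, here is a shorter sketch of the same idea (a hedged alternative, assuming both frames keep the tz-aware DatetimeIndex 'ts' shown in the question; the column name 'eod_price' is just illustrative):
import pandas as pd

#Take the last 'average' of each calendar day, then map it onto each article's date
eod = prices.groupby(prices.index.date)['average'].last()
articles['eod_price'] = eod.reindex(articles.index.date).to_numpy()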
Related
I have a pandas dataframe of stock records. My goal is to pass in a particular day, e.g. 8, and get the filtered dataframe for the 8th of each month and year in the dataset.
I have gone through some SO questions and managed one part of my requirement, getting the records for a particular day. However, if no data exists for, say, the 8th of a particular month and year, I need to get the records for the closest day that has data in that month and year.
As an example, if I pass in the 8th and there is no record for 8 Jan 2022, I need to check whether records exist for 7 and 9 Jan 2022, and so on, and get the record for the nearest date.
If records are present for both the 7th and the 9th, I take the 9th (the higher date).
However, if the record for the 7th exists and the 9th does not, I take the 7th (the closest date that exists).
Code I have written so far:
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I have tried to make the question clear, but if anything is in doubt, just ask. Any help in the right direction is appreciated.
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate dates/times of your choice, then use reindex() and method 'nearest'. It is more general because you can use any series of timestamps you could come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
    start=df.first_valid_index().normalize(), end=df.last_valid_index(),
    freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31', interval='1d')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day falls after the missing one) is called peeking ahead, and it can introduce look-ahead bias into quant research that uses the data. It is usually considered dangerous. You'd be safer using method='ffill'.
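For instance, the backwards-only versions of the two alternatives above would be:
df2 = df.resample('D').ffill()           # Alternative 1, never looks forward
df2 = df.reindex(dates, method='ffill')  # Alternative 2, never looks forward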
I am trying to forecast daily profit using time series analysis, but the daily profit is not only recorded unevenly; some of the data is also missing entirely.
Raw Data:
Date       Revenue
2020/1/19  10$
2020/1/20  7$
2020/1/25  14$
2020/1/29  18$
2020/2/1   12$
2020/2/2   17$
2020/2/9   28$
The table above is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 simply do not exist. On top of that, say the profit recorded between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and then use time series analysis to predict the profit from 2020/2/9 onwards.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. My cleaned data would then look something like this:
Date                    Revenue
2020/1/16 ~ 2020/1/21   17$
2020/1/22 ~ 2020/1/27   14$
2020/1/28 ~ 2020/2/2    47$
2020/2/3 ~ 2020/2/8     ? (to predict)
After feeding this into a time series model, I would like to predict the profit from 2020/2/9 onwards.
This is my general idea, but as a beginner at Python and the pandas library, I have trouble executing it. Could you please show me how to aggregate the profit every 6 days so the data looks like the table above?
The easiest way is to use the pandas resample function.
Provided you have an index of type datetime, aggregating profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
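For concreteness, here is a minimal self-contained sketch on the question's sample data (note that resample's bins start at the first timestamp by default, so the bin edges differ from the question's 2020/1/16 grouping; the answers below show how to control that):
import pandas as pd

df = pd.DataFrame({'Date': ['2020/1/19', '2020/1/20', '2020/1/25',
                            '2020/1/29', '2020/2/1', '2020/2/2', '2020/2/9'],
                   'Revenue': [10, 7, 14, 18, 12, 17, 28]})
df['Date'] = pd.to_datetime(df['Date'])
# Sum revenue into consecutive 6-day bins; days with no record contribute nothing
six_day = df.set_index('Date').resample('6D').sum()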
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19',10],
                   ['2020/1/20',7],
                   ['2020/1/25',14],
                   ['2020/1/29',18],
                   ['2020/2/1',12],
                   ['2020/2/2',17],
                   ['2020/2/9',28]],columns=['Date','Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last step is to format summary_df so that it clearly states the date range each row refers to:
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # a '6d' window covers the end day plus the 5 days before it
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
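With the sample data above, summary_df should come out roughly as:
   Revenue Start Date   End Date
0     17.0 2020-01-16 2020-01-21
1     14.0 2020-01-22 2020-01-27
2     47.0 2020-01-28 2020-02-02
3      0.0 2020-02-03 2020-02-08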
You can use resample for this.
Make sure the "Date" column is of datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
I have this S&P 500 historical data sample, and I want to compare dates inside it.
>> df
High Low Open Close Volume Adj Close
Date
2011-01-03 127.599998 125.699997 126.709999 127.050003 138725200.0 104.119293
2011-01-04 127.370003 126.190002 127.330002 126.980003 137409700.0 104.061905
2011-01-05 127.720001 126.459999 126.580002 127.639999 133975300.0 104.602806
2011-01-06 127.830002 127.010002 127.690002 127.389999 122519000.0 104.397934
2011-01-07 127.769997 126.150002 127.559998 127.139999 156034600.0 104.193031
... ... ... ... ... ... ...
2020-12-14 369.799988 364.470001 368.640015 364.660004 69216200.0 363.112183
2020-12-15 369.589996 365.920013 367.399994 369.589996 64071100.0 368.021240
2020-12-16 371.160004 368.869995 369.820007 370.170013 58420500.0 368.598816
2020-12-17 372.459991 371.049988 371.940002 372.239990 64119500.0 370.660004
2020-12-18 371.149994 367.019989 370.970001 369.179993 135359900.0 369.179993
Let latest be the most recent S&P OHLC prices:
latest = df.iloc[-1]
How can I find the date inside this dataframe index that is closest to latest lagged by 1 year (latest.replace(year=latest.year-1))? Just using the pd.Timestamp.replace method sometimes doesn't work: it can generate a date that is not inside my index.
This approach only works if your index column ('Date') contains datetime objects. If it contains strings, you first have to convert the index to datetime format:
df.index = pd.to_datetime(df.index)
With that, you can get the latest time either with latest = df.index[-1] or df.index.max().
Then we offset the latest date by one year using pd.DateOffset to get the theoretical lagged date:
lagged_theoretical = latest - pd.DateOffset(years=1)
To obtain the closest date actually present in your DataFrame, we calculate the time delta between every date in the index and the calculated date, take the minimum to find the closest one, and use the position of that minimum in the timedelta array to fetch the actual date from the DataFrame's index. Here is the whole code:
import numpy as np
import pandas as pd

latest = df.index[-1]
lagged_theoretical = latest - pd.DateOffset(years=1)
td = (abs(df.index - lagged_theoretical)).values
idx = np.where(td == td.min())[0][0]
lagged_actual = df.index[idx]
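Alternatively (pandas 1.4 and later), you could let get_indexer with method='nearest' do the distance computation for you; a hedged equivalent sketch:
#Find the position of the nearest existing index entry directly
idx = df.index.get_indexer([lagged_theoretical], method='nearest')[0]
lagged_actual = df.index[idx]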
Say I have a dataset at daily scale, but not all days have valid data; in other words, some days are missing from the data. I want to compute the summer season mean from the dataset, removing any month that has fewer than 20 days of valid data.
How do I achieve this (in a pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can use df_count to filter out the months that have fewer than 20 days of valid data, and then compute the summer season mean with your formula.
df_count = df.resample('MS').count()
relevant_month = df_count[df_count['VAR'] >= 20].index.to_period('M')
df_summer = df[df.index.to_period('M').isin(relevant_month)].resample('Q-NOV').mean()
I assume you store the date in the index. If the month or time is stored in a different column, change the df.index.to_period('M') part to use that column instead.
I also don't know the exact format of your time column (date or datetime), so you might need to modify that part of the code accordingly. It is just the general idea.
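As an alternative sketch of the same idea, a groupby-transform keeps the day counts aligned with the daily rows, so no intermediate index is needed (again assuming a DatetimeIndex and a 'VAR' column as in the question):
#Count valid days per calendar month, broadcast back to each daily row,
#keep only rows from months with at least 20 valid days, then average
valid_days = df.groupby(df.index.to_period('M'))['VAR'].transform('count')
df_summer = df[valid_days >= 20].resample('Q-NOV').mean()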
I have data that I get from an API (of course in JSON form) that looks like this:
0,1500843600,8872
1,1500807600,18890
2,1500811200,2902
.
.
.
where the second column is the date/time in ticks and the third column is some value. I basically have data for every hour of every day, for a couple of months. What I want to achieve is to get the minimum value of the third column for every week. The code segment below correctly returns the minimum value for me, but besides the minimum value I also want the specific Timestamp telling me on which date/time the lowest value of that week occurred. How can I modify the code so that I get the Timestamp along with the minimum value?
df = pandas.DataFrame(columns=['Timestamp', 'Value'])
# dic holds the data that I get from my API.
for i in range(len(dic)):
    df.loc[i] = [dic[i][1], dic[i][2]]
df['Timestamp'] = pandas.to_datetime(df['Timestamp'], unit='s')
df.sort_values(by=['Timestamp'])
df.set_index(df['Timestamp'], inplace=True)
df.groupby([pandas.Grouper(key='Timestamp', freq='W-MON')])['Value'].min()
I think you need DataFrameGroupBy.idxmin to get the index of the minimum Value per week, and then select rows with loc:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
df = df.loc[df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin()]
print (df)
Timestamp Value
2 2017-07-23 12:00:00 2902
Detail:
print (df.groupby([pd.Grouper(key='Timestamp', freq='W-MON')])['Value'].idxmin())
Timestamp
2017-07-24 2
Freq: W-MON, Name: Value, dtype: int64
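An equivalent sketch without idxmin: sort by Value first, then keep the first row of each weekly group, which carries its Timestamp along:
#Smallest Value per W-MON week, keeping the full row (Timestamp included)
weekly_min = (df.sort_values('Value')
                .groupby(pd.Grouper(key='Timestamp', freq='W-MON'))
                .head(1))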