Python resample function not resampling

I'm trying to resample the data, but it does not seem to be working properly. I want the data resampled from start-of-month to start-of-month.
The code is the following
df = pd.read_csv('OSEBX_daily.csv')
df = df[['time', 'OSEBX GR']]
df['time'] = pd.to_datetime(df['time']).dt.normalize()
df.set_index('time', inplace=True)
df.index = pd.to_datetime(df.index)
df.resample('1M').mean()
df['returns'] = df['OSEBX GR'].pct_change()
plt.plot(df['returns'])

You forgot to assign the result back:
df = df.resample('1M').mean()
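Note that the '1M' alias labels each bin with the end of the month; since you want start-of-month timestamps, the 'MS' (month start) alias is likely closer to what you're after. A minimal end-to-end sketch, assuming the same CSV layout as in your question:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('OSEBX_daily.csv')
df['time'] = pd.to_datetime(df['time']).dt.normalize()
df = df.set_index('time')[['OSEBX GR']]

# 'MS' labels each bin with the first day of the month
monthly = df.resample('MS').mean()
monthly['returns'] = monthly['OSEBX GR'].pct_change()
plt.plot(monthly['returns'])
plt.show()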

Related

Converting separate hour/min/sec columns into a single time column with pandas?

I'm trying to create a single time column so that I can build a time-series plot by resampling on a date/time index. However, I'm having trouble combining the columns into a single column and/or indexing it. Below is my code and what I've tried. Any suggestions would be appreciated!
colnames=['time_ms','power','chisq','stations','alt','hour','min','sec','time_frac','lat','lon']
df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None, names=colnames)
#df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None,names=colnames,parse_dates=[[5, 7]], index_col=0)
#df = pd.read_csv('/data/selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None,names=colnames,infer_datetime_format=True,parse_dates=[[5, 6]], index_col=0)
I did try this method to include the date as well, which I believe isn't strictly necessary but would be nice for consistency. However, I wasn't able to get it to work.
s = df['hour'].mul(10000) + df['min'].mul(100) + df['sec']
df['date'] = pd.to_datetime('2021-09-14 ' + s.astype(int), format='%Y-%m-%d %H%M%S.%f')
This method did work to create a new column, but I had trouble indexing it.
df['time'] = (pd.to_datetime(df['hour'].astype(str) + ':' + df['min'].astype(str), format='%H:%M')
              .dt.time)
df['Datetime'] = pd.to_datetime(df['time'])
# set_index returns a new DataFrame; without assignment or inplace=True the index never changes
df.set_index('Datetime')
Creating this column to get counts for a time series:
df['tot'] = 1
Using this to resample the data needed for the time series into a new df:
df2 = df[['tot']].resample('5min').sum()
However, I keep getting datetime/index errors despite everything I've tried above.
Link to data: https://drive.google.com/file/d/16GmXfQNMK81aAbB6C-W_Bjm2mcOVrILP/view?usp=sharing
You should keep the separate columns as strings, concatenate them, and then convert the result to datetime. The updated code below does this:
colnames=['time_ms','power','chisq','stations','alt','hour','min','sec','time_frac','lat','lon']
df = pd.read_csv('selected_lma_matlab_20210914.txt',delim_whitespace=True, header=None, names=colnames)
df['date'] = '2021-09-14 ' + df['hour'].astype('string') + ":" + df['min'].astype('string') + ":" + df['sec'].astype('string')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S')
df.set_index('date', inplace=True)
After this you can do the plots you need. I tried these and they appear to work fine:
df.alt.plot(kind='line')
df.plot('lat', 'lon', kind='scatter')
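With the datetime index in place, the 5-minute counts you were originally after should follow directly; a small sketch continuing from the code above:

# count rows per 5-minute bin now that the index is a DatetimeIndex
df['tot'] = 1
df2 = df[['tot']].resample('5min').sum()
df2.plot(kind='line')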

How can I deal with Expect data.index as DatetimeIndex?

I am planning to make a candlestick plot from bitcoin data.
Here is my code for selecting the dataframe I want after loading the csv file:
df['Date'] = pd.to_datetime(df['Date'])
start_date = '2016-02-27'
end_date = '2021-02-27'
mask = (df['Date'] >= start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
df
and then I entered the code for making the candlestick plot, as below:
import matplotlib.pyplot as plt
! pip install --upgrade mplfinance
import mplfinance as mpf
import matplotlib.dates as mpl_dates
mpf.plot(df, type='candle', style='charles',
         title='Bitcoin Price',
         ylabel='Price (USD$)',
         volume=True,
         ylabel_lower='Shares \nTraded',
         mav=(3, 6, 9),
         savefig='chart-mplfinance.png')
It said "TypeError: Expect data.index as DatetimeIndex".
So I looked up solutions on Google and tried this:
df = dict()
df['Date'] = []
df['High'] = []
df['Low'] = []
df['Open'] = []
df['Close'] = []
df['Volume'] = []
for dict in df:
    df['Date'].append(datetime.datetime.fromtimestamp(t).strftime('%Y-%m-%d %H:%M:%S')
    df['High'].append(dict['High'])
    df['Low'].append(dict['Low'])
    df['Open'].append(dict['Open'])
    df['Close'].append(dict['Close'])
    df['Volume'].append(dict['Vol'])
print("df:", df)
pdata = pd.DataFrame.from_dict(df)
pdata.set_index('Date', inplace=True)
mpf.plot(pdata)
This time, it said "invalid syntax".
I'm not sure where I got this wrong; is there anything I have missed?
There are two easy ways to make sure your dataframe has a pandas.DatetimeIndex as the dataframe index:
When calling read_csv(), indicate which column you want to use for the index (the column that contains the dates/datetimes), and also set the kwarg parse_dates=True. This will automatically convert the datetime column (normally strings within a csv file) into a DatetimeIndex object and set it as the index. You can see this being done in the examples in the mplfinance repository: under basic usage, read_csv() is called with index_col=0, parse_dates=True.
Use the pandas.DatetimeIndex() constructor. For example, instead of what you wrote,
df['Date'] = pd.to_datetime(df['Date']), you would write:
df.index = pd.DatetimeIndex(df['Date'])
As a side note, once the dataframe has a DatetimeIndex, you don't need the mask in the next section of code, but can simply slice as follows:
start_date = '2016-02-27'
end_date = '2021-02-27'
df = df.loc[start_date:end_date]
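(As a further aside, the "invalid syntax" in your second attempt comes from the df['Date'].append(...) line, which is missing its closing parenthesis.) Putting the first approach together, a minimal sketch, assuming a csv file with the dates in its first column and the Open/High/Low/Close/Volume columns mplfinance expects; the filename is hypothetical:

import pandas as pd
import mplfinance as mpf

# index_col=0 plus parse_dates=True yields a DatetimeIndex directly
df = pd.read_csv('bitcoin_daily.csv', index_col=0, parse_dates=True)

# plain label-based slicing works on a DatetimeIndex
df = df.loc['2016-02-27':'2021-02-27']

mpf.plot(df, type='candle', volume=True)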
hth.

Sort a Pandas DataFrame using both Date and Time

I'm trying to sort my dataframe using sort_values, but I'm not getting the desired output:
df1 = pd.read_csv('raw data/120_FT DDMG.csv')
df2 = pd.read_csv('raw data/120_FT MG.csv')
df3 = pd.read_csv('raw data/120_FT DD.csv')
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.sort_values(by='date')
dconcat = dconcat.set_index('date')
print(dconcat)
sort_values returns a sorted copy of the DataFrame unless inplace=True, so you need to assign the result back:
dconcat = dconcat.sort_values(by='date')
or sort in place:
dconcat.sort_values(by='date', inplace=True)
Alternatively, you can set the index first and then sort by it:
dconcat = pd.concat([df1,df2,df3])
dconcat['date'] = pd.to_datetime(dconcat['ActivityDates(Individual)']+' '+dconcat['ScheduledStartTime'])
dconcat.set_index('date', inplace=True)
dconcat.sort_index(inplace=True)
print(dconcat)
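As a quick sanity check after sorting, you can confirm the combined date-and-time index is in order; a small optional addition:

# True once the datetime index is sorted ascending
print(dconcat.index.is_monotonic_increasing)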

Pandas groupby (+15mins runtime)

I'm trying to analyze a network traffic dataset with over 1,000,000 packets, and I have the following code:
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv')
pcap_data.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data.parse_dates=["time"]
pcap_data['num'] = 1
df = pcap_data
df
%%time
df['time'] = pd.to_datetime(df['time'])
df.index = df['time']
data = df.copy()
data_group = pd.DataFrame({'count': data.groupby(['ipdst', 'proto', data.index]).size()}).reset_index()
pd.options.display.float_format = '{:,.0f}'.format
data_group.index = data_group['time']
data_group
data_group2 = data_group.groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index().dropna()
data_group2
The first part of the script, where I import the .csv, runs in 5 seconds, but when pandas groups by IP + PROTO and resamples the time into 5s bins, the runtime is 15 minutes. Does anyone know how I can get better performance?
EDIT:
Now I'm trying to use dask, and I have the following code:
Import the .csv
filename = '/home/alexfrancow/AAA/data1.csv'
df = dd.read_csv(filename)
df.columns = ['no', 'time', 'ipsrc', 'ipdst', 'proto', 'info']
df.parse_dates=["time"]
df['num'] = 1
%time df.head(2)
Group by ipdst + proto by 5S freq
df.set_index('time').groupby(['ipdst','proto']).resample('5S', on='time').sum().reset_index()
How can I group by IP + PROTO by 5S frequency?
I tried to simplify your code a bit, but with a large DataFrame the performance should be only slightly better:
pd.options.display.float_format = '{:,.0f}'.format
#convert time column to DatetimeIndex
pcap_data = pd.read_csv('/home/alexfrancow/AAA/data1.csv',
                        parse_dates=['time'],
                        index_col=['time'])
#'time' is now the index, so only the five remaining columns are renamed
pcap_data.columns = ['no', 'ipsrc', 'ipdst', 'proto', 'len']
pcap_data['info'] = "null"
pcap_data['num'] = 1
#remove DataFrame constructor
data_group = pcap_data.groupby(['ipdst', 'proto', 'time']).size().reset_index(name='count')
data_group2 = (data_group.set_index('time')
                         .groupby(['ipdst','proto'])
                         .resample('5S')
                         .sum()
                         .reset_index()
                         .dropna())
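As a further speed-up to try, a single groupby with pd.Grouper can replace the groupby-plus-resample combination and avoid resampling each group separately; a hedged sketch, assuming pcap_data is indexed by the parsed 'time' column as above:

# one pass: group by ipdst, proto and 5-second bins of the DatetimeIndex together
data_group2 = (pcap_data.groupby(['ipdst', 'proto', pd.Grouper(freq='5S')])
                        .size()
                        .reset_index(name='count'))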
in dask:
meta = pd.DataFrame(columns=['no', 'ipsrc', 'info'],
                    dtype=object,
                    index=pd.MultiIndex.from_arrays([[], [], []],
                                                    names=['ipdst', 'proto', 'time']))
df = df.set_index('time').groupby(['ipdst', 'proto']).apply(lambda x: x.resample('5S').sum(), meta=meta)
df = df.reset_index()
Hope it works for you.

How to extrapolate a periodic time series in Pandas?

In Python 3.5, pandas 0.20, say I have a one-year periodic time series:
import pandas as pd
import numpy as np
start_date = pd.to_datetime("2015-01-01T01:00:00.000Z", infer_datetime_format=True)
end_date = pd.to_datetime("2015-12-31T23:00:00.000Z", infer_datetime_format=True)
index = pd.DatetimeIndex(start=start_date,
                         freq="60min",
                         end=end_date)
time = np.array((index - start_date)/ np.timedelta64(1, 'h'), dtype=int)
df = pd.DataFrame(index=index)
df["foo"] = np.sin( 2 * np.pi * time / len(time))
df.plot()
I want to do some periodic extrapolation of the time series for a new index, i.e. with:
new_start_date = pd.to_datetime("2017-01-01T01:00:00.000Z", infer_datetime_format=True)
new_end_date = pd.to_datetime("2019-12-31T23:00:00.000Z", infer_datetime_format=True)
new_index = pd.DatetimeIndex(start=new_start_date,
                             freq="60min",
                             end=new_end_date)
I would like to use some kind of extrapolate_periodic method to get:
# DO NOT RUN
new_df = df.extrapolate_periodic(index=new_index)
# END DO NOT RUN
new_df.plot()
What is the best way do such a thing in pandas?
How can I define a periodicity and get data from a new index easily?
I think I have what you are looking for, though it is not a simple pandas method.
Carrying on directly from where you left off:
def extrapolate_periodic(df, new_index):
    df_right = df.groupby([df.index.dayofyear, df.index.hour]).mean()
    df_left = pd.DataFrame({'new_index': new_index}).set_index('new_index')
    df_left = df_left.assign(dayofyear=lambda x: x.index.dayofyear,
                             hour=lambda x: x.index.hour)
    df = (pd.merge(df_left, df_right, left_on=['dayofyear', 'hour'],
                   right_index=True, suffixes=('', '_y'))
          .drop(['dayofyear', 'hour'], axis=1))
    return df.sort_index()
new_df = extrapolate_periodic(df, new_index)
# or as a method style
# new_df = df.pipe(extrapolate_periodic, new_index)
new_df.plot()
If you have more than a year's worth of data, it will take the mean of each duplicated day-hour combination. Here mean could be changed to last if you wanted just the most recent reading.
This will not work if you do not have a full year's worth of data, but you could fix that by reindexing to complete the year and then using interpolate with a polynomial to fill in the missing foo column, as sketched below.
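A minimal sketch of that fix, assuming hourly, UTC-indexed data as in the question (interpolate with method='polynomial' also requires scipy):

# hypothetical: rebuild the full hourly index for the year, then fill the gaps
full_index = pd.date_range("2015-01-01 01:00", "2015-12-31 23:00", freq="60min", tz="UTC")
df = df.reindex(full_index).interpolate(method='polynomial', order=3)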
Here is some code I've used to solve my problem. The assumption is that the initial series corresponds to exactly one period of data.
def extrapolate_periodic(df, new_index):
    index = df.index
    start_date = np.min(index)
    end_date = np.max(index)
    period = np.array((end_date - start_date) / np.timedelta64(1, 'h'), dtype=int)
    time = np.array((new_index - start_date) / np.timedelta64(1, 'h'), dtype=int)
    new_df = pd.DataFrame(index=new_index)
    for col in list(df.columns):
        new_df[col] = np.array(df[col].iloc[time % period])
    return new_df
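This version is used the same way as the first:

new_df = extrapolate_periodic(df, new_index)
new_df.plot()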
