Selecting one date at a time from a resampled dataframe - python

I have a dataframe of tick data which I have resampled into minute data. A vanilla
df.resample('1Min').ohlc().fillna(method='ffill')
is super easy.
I now need to iterate over that resampled dataframe one day at a time, but I can't figure out the best way to do it.
I've tried taking my 1-minute resampled dataframe, resampling that to '1D', and converting the index to a list to iterate over and filter, but that gives me a list of
Timestamp('2011-09-13 00:00:00', freq='D')
objects, and it won't let me slice the dataframe based on those.
This seems like it should be easy, but I just can't find the answer. Thanks.
#sample data_1m dataframe
data_1m.head()
open high low close
timestamp
2011-09-13 13:53:00 5.8 6.0 5.8 6.0
2011-09-13 13:54:00 5.8 6.0 5.8 6.0
2011-09-13 13:55:00 5.8 6.0 5.8 6.0
2011-09-13 13:56:00 5.8 6.0 5.8 6.0
2011-09-13 13:57:00 5.8 6.0 5.8 6.0
...
# I want to get everything for date 2011-09-13; I'm trying:
days_in_df = data_1m.resample('1D').ohlc().fillna(method='ffill').index.to_list()
data_1m.loc[days_in_df[0]]
KeyError: Timestamp('2011-09-13 00:00:00', freq='D')

Here's my two cents. I don't resample the data so much as add another index level to the frame:
data_1m = data_1m.reset_index()
data_1m['date'] = data_1m['timestamp'].dt.normalize()  # midnight of each day; astype('datetime64[D]') errors on recent pandas
data_1m = data_1m.set_index(['date', 'timestamp'])
And to select an entire day:
data_1m.loc['2011-09-13']
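Worth adding: since the frame already has a DatetimeIndex, partial string indexing and pd.Grouper give day-level access without building a second index. A minimal sketch, assuming data_1m keeps its original timestamp index:
import pandas as pd

# partial string indexing selects a whole calendar day
one_day = data_1m.loc['2011-09-13']

# or walk the frame one day at a time without a second resample
for day, day_df in data_1m.groupby(pd.Grouper(freq='D')):
    if day_df.empty:  # skip any empty daily bins
        continue
    print(day.date(), len(day_df))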

Related

Reorder pandas dataframe columns vertically

I'm a bit new to pandas and have some diabetic data that I would like to reorder.
I'd like to copy the data from column 'wakeup' through '23:00:00'
and stack it vertically so I get a new dataframe column:
5.6
8.1
9.9
6.3
4.1
13.3
NAN
3.9
3.3
6.8
.....etc
I'm assuming the data is in a dataframe already. You can index the columns you want and then use melt. Without any parameters, melt will 'stack' all your data into one column of a new dataframe. It also creates a column identifying the original column names, but you can drop that if needed.
df.loc[:, 'wakeup':'23:00:00'].melt()
variable value
0 wakeup 5.6
1 wakeup 8.1
2 wakeup 9.9
3 wakeup 6.3
4 wakeup 4.1
5 wakeup 13.3
6 wakeup NaN
7 09:30:00 3.9
8 09:30:00 3.3
9 09:30:00 6.8
...
You mention you want this as another column, but there's no way to sensibly add it into your existing dataframe; the shape likely won't match, either.
Solved it myself; it took me quite some time. Note the original data is in df1 and the result in dfAllMeasurements:
dfAllMeasurements = df1.loc[:, 'weekday':'23:00:00']
temp = dfAllMeasurements.set_index(['weekday', 'ID']).stack(dropna=False)  # dropna=False keeps the NaNs
dfAllMeasurements = temp.reset_index(drop=False, level=0).reset_index()
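For comparison, melt keeps NaNs by default, so the earlier one-liner gives the same stacked values without extra flags; a sketch assuming the same df1 layout (note it drops the weekday/ID identifiers that the set_index/stack version keeps):
stacked = df1.loc[:, 'wakeup':'23:00:00'].melt()['value']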

Maximum monthly values whilst retaining the date on which each value occurred

I have daily rainfall data that looks like the following:
Date Rainfall (mm)
1922-01-01 0.0
1922-01-02 0.0
1922-01-03 0.0
1922-01-04 0.0
1922-01-05 31.5
1922-01-06 0.0
1922-01-07 0.0
1922-01-08 0.0
1922-01-09 0.0
1922-01-10 0.0
1922-01-11 0.0
1922-01-12 9.1
1922-01-13 6.4
.
.
.
I am trying to work out the maximum value for each month for each year, and also what date the maximum value occurred on. I have been using the code:
rain_data.groupby(pd.Grouper(freq = 'M'))['Rainfall (mm)'].max()
This returns the correct maximum value, but gives the end date of each month rather than the date on which the maximum event occurred.
1974-11-30 0.0
1974-12-31 0.0
1975-01-31 0.0
1975-02-28 65.0
1975-03-31 129.5
1975-11-30 59.9
1975-12-31 7.1
1976-01-31 10.0
1976-11-30 0.0
1976-12-31 0.0
1977-01-31 4.3
Any suggestions on how I could get the correct date?
I'm new to this, but I think pd.Grouper(freq='M') groups all the values within each month and labels every group with the month-end date. I think this is why your groupby isn't returning the dates you're looking for.
I think your question is answered here. Alexander suggests using:
df.groupby(pd.TimeGrouper('M')).Close.agg({'max date': 'idxmax', 'max rainfall': np.max})
The agg works without the .Close, so if it's problematic (as I found) you might want to take it out.
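pd.TimeGrouper and the dict-renaming form of agg have since been removed from pandas; a modern equivalent uses named aggregation. A sketch assuming the rain_data frame from the question:
monthly = rain_data.groupby(pd.Grouper(freq='M'))['Rainfall (mm)']
result = monthly.agg(max_date='idxmax', max_rainfall='max')  # idxmax gives the index (date) of each monthly maximum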

Create multiple dataframe columns containing calculated values from an existing column

I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns by applying a formula to each existing column in my dataframe (in short, roughly doubling the number of columns). The formula is (100 - [5*cell])*0.2.
For example, November for Sonic gives (100-[5*12.0])*0.2 = 8.0, and December for Sonic gives (100-[5*3.0])*0.2 = 17.0. My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is what I tried when only one month was in consideration:
for w in range(1, len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I haven't gotten the skills or experience yet to create multiple columns. I've looked for other questions that may answer what exactly I am doing but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
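The same result also stays in pandas without the round trip through NumPy; a sketch assuming the column labels are strings:
res = (100 - 5 * sega_df) * 0.2          # apply the formula to every column at once
res.columns = 'Weighted_' + res.columns  # prefix the new column names
sega_df = sega_df.join(res)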

Grouping columns of pandas dataframe in datetime format

I have two questions:
1) Is there something like pandas groupby but applicable to columns (df.columns, not the data within)?
2) How can I extract the "date" from a datetime object?
I have lots of pandas dataframes (or csv files) that have a position column (which I use as the index) and then columns of values measured at each position at different times. The column headers are datetime objects (via pd.to_datetime).
I would like to extract data from the same date and save them into a new file.
Here is a simple example of two such dataframes.
df1:
2015-03-13 14:37:00 2015-03-13 14:38:00 2015-03-13 14:38:15 \
0.0 24.49393 24.56345 24.50552
0.5 24.45346 24.54904 24.60773
1.0 24.46216 24.55267 24.74365
1.5 24.55414 24.63812 24.80463
2.0 24.68079 24.76758 24.78552
2.5 24.79236 24.83005 24.72879
3.0 24.83691 24.78308 24.66727
3.5 24.78452 24.73071 24.65085
4.0 24.65857 24.79398 24.72290
4.5 24.56390 24.93515 24.83267
5.0 24.62161 24.96939 24.87366
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
and df2:
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30 \
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
2015-05-23 12:25:00 2015-05-23 12:26:00 2015-05-23 12:26:30
0.0 10.31052 10.132660 10.176910
0.5 10.26834 10.086910 10.252720
1.0 10.27393 10.165890 10.276670
1.5 10.29330 10.219090 10.335910
2.0 10.24432 10.193940 10.406430
2.5 10.11618 10.157470 10.323120
3.0 10.02454 10.110720 10.115360
3.5 10.08716 10.010680 9.997345
4.0 10.23868 9.905670 10.008090
4.5 10.27216 9.879425 9.979645
5.0 10.10693 9.919800 9.870361
df1 has data from 13 March and 19 May, df2 has data from 19 May and 23 May. From these two dataframes containing data from 3 days, I would like to get 3 dataframes (or csv files or any other object), one for each day.
(And for a real-life example, multiply the number of lines, columns and files by some hundred.)
In the worst case I can specify the dates in a separate list, but I am still failing to extract these dates from the dataframes.
I did have an idea of a nested loop:
for df in dataframes:
    for d in dates:
        new_df = df[d]
but I can't get the date from the datetime.
First concatenate all the DataFrames along the columns, then group the columns by their date part (via strftime) and convert the groupby object into a dictionary of DataFrames keyed by date string:
df = pd.concat([df1,df2, dfN], axis=1)
dfs = dict(tuple(df.groupby(df.columns.strftime('%Y-%m-%d'), axis=1)))
#select DataFrame
print (dfs['2015-03-13'])
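From that dictionary, writing one file per day is a short loop; a sketch assuming one CSV per date is the goal:
for day, frame in dfs.items():
    frame.to_csv(f'{day}.csv')  # e.g. 2015-03-13.csv, 2015-05-19.csv, ...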

Python Pandas combining timestamp columns and fillna in read_csv

I'm reading a csv file with Pandas. The format is:
Date Time x1 x2 x3 x4 x5
3/7/2012 11:09:22 13.5 2.3 0.4 7.3 6.4
12.6 3.4 9.0 3.0 7.0
3.6 4.4 8.0 6.0 5.0
10.6 3.5 1.0 3.0 8.0
...
3/7/2012 11:09:23 10.5 23.2 0.3 7.8 4.4
11.6 13.4 19.0 13.0 17.0
...
As you can see, not every row has a timestamp. Every row without a timestamp is from the same 1-second interval as the closest row above it that does have a timestamp.
I am trying to do 3 things:
1. combine the Date and Time columns to get a single timestamp column.
2. convert that column to have units of seconds.
3. fill empty cells to have the appropriate timestamp.
The desired end result is an array with the timestamp, in seconds, at each row.
I am not sure how to quickly convert the timestamps into units of seconds, other than with a slow for loop and Python's built-in time.mktime method.
Then, when I fill in missing timestamp values, the problem is that the Date and Time cells without a timestamp each get a "nan" value, which when merged gives a cell with the value "nan nan". The fillna() method then doesn't interpret "nan nan" as a NaN.
I am using the following code to get the problem result (not including the part of trying to convert to seconds):
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',', parse_dates={'CorrectTime':[0,1]}, usecols=[0,1,2,4,6], names=['Date','Time','x1','x3','x5'])
df.fillna(method='ffill', axis=0, inplace=True)
Thanks for your help.
Assuming you want seconds since Jan 1, 1900...
import pandas
from io import StringIO
import datetime
data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pandas.read_csv(data, parse_dates=['Date']).fillna(method='ffill')
def dealwithdates(row):
    datestring = row['Date'].strftime('%Y-%m-%d')
    dtstring = '{} {}'.format(datestring, row['Time'])
    date = datetime.datetime.strptime(dtstring, '%Y-%m-%d %H:%M:%S')
    refdate = datetime.datetime(1900, 1, 1)
    return (date - refdate).total_seconds()
df['ordinal'] = df.apply(dealwithdates, axis=1)
print(df)
Date Time x1 x2 x3 x4 x5 ordinal
0 2012-03-07 11:09:22 13.5 2.3 0.4 7.3 6.4 3540107362
1 2012-03-07 11:09:22 12.6 3.4 9.0 3.0 7.0 3540107362
2 2012-03-07 11:09:22 3.6 4.4 8.0 6.0 5.0 3540107362
3 2012-03-07 11:09:22 10.6 3.5 1.0 3.0 8.0 3540107362
4 2012-03-07 11:09:23 10.5 23.2 0.3 7.8 4.4 3540107363
5 2012-03-07 11:09:23 11.6 13.4 19.0 13.0 17.0 3540107363
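A vectorised sketch of the same conversion, assuming the df built above; it avoids the row-wise apply:
# recombine the parsed date with the time string, then measure from the 1900 epoch
ts = pandas.to_datetime(df['Date'].dt.strftime('%Y-%m-%d') + ' ' + df['Time'])
df['ordinal'] = (ts - pandas.Timestamp('1900-01-01')).dt.total_seconds()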
