Conditional test within pandas dataframe - python

Can someone help me with a pandas question? I have a timeseries dataframe such as this:
GOOG AAPL
2010-12-09 16:00:00 591.50 551
2010-12-10 16:00:00 592.21 523
2010-12-13 16:00:00 594.62 578
2010-12-14 16:00:00 594.91 567
2010-12-15 16:00:00 590.30 577
...
I need to loop through each timestamp and test whether AAPL is > 570. If it is, then I want to print the date and the price of AAPL for that entry. Is this possible?

There's no need for any looping; one of the main benefits of pandas being built on NumPy is that it can easily operate on whole columns. It's as simple as:
df['AAPL'][df['AAPL'] > 570]
Output:
2010-12-13 16:00:00 578
2010-12-15 16:00:00 577
Name: AAPL, dtype: int64
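If you also want to print the date next to each price, a minimal sketch using .loc could look like the following (the frame here is a hypothetical reconstruction of the data in the question):
import pandas as pd

# Hypothetical reconstruction of the frame shown in the question
df = pd.DataFrame(
    {'GOOG': [591.50, 592.21, 594.62, 594.91, 590.30],
     'AAPL': [551, 523, 578, 567, 577]},
    index=pd.to_datetime(['2010-12-09 16:00', '2010-12-10 16:00',
                          '2010-12-13 16:00', '2010-12-14 16:00',
                          '2010-12-15 16:00']))

above_570 = df.loc[df['AAPL'] > 570, 'AAPL']
for ts, price in above_570.items():
    print(ts, price)   # prints the timestamp and the AAPL price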

Ah ha I got it:
What you can do, since it is built on top of NumPy, is this:
my_dataframe[my_dataframe.AAPL > 570]
and you're almost done.
From here you have all the rows where AAPL > 570; now it's just a matter of printing out the values you need:
valid_rows = my_dataframe[my_dataframe.AAPL > 570]
for row in valid_rows.to_records():
    # row[0] is the index (the date), row[2] is the AAPL price
    print(row[0], row[2])
DataFrame.where can also be used for searching the entire frame.
I had forgotten that pandas made it extremely easy to reference columns.

Related

For loops in pandas dataframe, how to filter by days of hourly range and associate a value?

I have a dataframe named df_sub like this:
date open high low close volume
405 2022-01-03 08:00:00 4293.5 4295.5 4291.5
406 2022-01-03 08:01:00 4294.0 4295.5 4294.0
407 2022-01-03 08:02:00 4295.5 4297.5 4295.5
408 2022-01-03 08:03:00 4297.0 4298.0 4296.0
409 2022-01-03 08:04:00 4296.5 4296.5 4295.0
... ... ... ... ... ... ... ... ...
5460 2022-01-07 08:55:00 4311.0 4312.0 4310.5
5461 2022-01-07 08:56:00 4311.5 4311.5 4311.0
5462 2022-01-07 08:57:00 4311.0 4312.0 4310.0
I need to create a loop of this type:
for row in df_sub:
    take a single day (so, in this case 2022-01-03, 04, ... 07) and create a column with the df_sub["high"].max() value,
so I will have the maximum value of the high across all rows of the same day. Naturally, this implies that on other days the maximum will be different from the previous one, because the highs are different.
You can use resample:
df_sub = df_sub.set_index('date')
df_new = df_sub.resample('d')['high'].max()
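The resample above gives one row per day. If the goal is instead a new column on df_sub holding that day's maximum (as described in the question), a groupby/transform sketch keeps the original shape; this assumes the date column has already been converted with pd.to_datetime:
import pandas as pd

# Hypothetical slice of df_sub, just to make the sketch runnable
df_sub = pd.DataFrame({
    'date': pd.to_datetime(['2022-01-03 08:00', '2022-01-03 08:01',
                            '2022-01-07 08:55', '2022-01-07 08:56']),
    'high': [4295.5, 4295.5, 4312.0, 4311.5],
})

# One value per day (what resample gives you)
daily_max = df_sub.set_index('date').resample('D')['high'].max()

# The daily maximum repeated on every row of the same day, as a new column
df_sub['day_high'] = df_sub.groupby(df_sub['date'].dt.date)['high'].transform('max')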

Slicing pandas dataframe by custom months and days -- is there a way to avoid for loops?

The problem
Suppose I have a time series dataframe df (a pandas dataframe) and some days I want to slice from it, contained in another dataframe called sample_days:
>>> df
foo bar
2020-01-01 00:00:00 0.360049 0.897839
2020-01-01 01:00:00 0.285667 0.409544
2020-01-01 02:00:00 0.323871 0.240926
2020-01-01 03:00:00 0.921623 0.766624
2020-01-01 04:00:00 0.087618 0.142409
... ... ...
2020-12-31 19:00:00 0.145111 0.993822
2020-12-31 20:00:00 0.331223 0.021287
2020-12-31 21:00:00 0.531099 0.859035
2020-12-31 22:00:00 0.759594 0.790265
2020-12-31 23:00:00 0.103651 0.074029
[8784 rows x 2 columns]
>>> sample_days
month day
0 3 16
1 7 26
2 8 15
3 9 26
4 11 25
I want to slice df with the days specified in sample_days. I can do this with for loops (see below). However, is there a way to avoid for loops (as this is more efficient)? The result should be a dataframe called sample like the following:
>>> sample
foo bar
2020-03-16 00:00:00 0.707276 0.592614
2020-03-16 01:00:00 0.136679 0.357872
2020-03-16 02:00:00 0.612331 0.290126
2020-03-16 03:00:00 0.276389 0.576996
2020-03-16 04:00:00 0.612977 0.781527
... ... ...
2020-11-25 19:00:00 0.904266 0.825501
2020-11-25 20:00:00 0.269589 0.050304
2020-11-25 21:00:00 0.271814 0.418235
2020-11-25 22:00:00 0.595005 0.973198
2020-11-25 23:00:00 0.151149 0.024057
[120 rows x 2 columns]
which is just the df sliced across the correct days.
My (slow) solution
I've managed to do this using for loops and pd.concat:
sample = pd.concat([df.loc[df.index.month.isin([sample_day.month]) &
                           df.index.day.isin([sample_day.day])]
                    for sample_day in sample_days.itertuples()])
which is based on concatenating multiple days as sliced by the method indicated here. This gives the desired result but is rather slow. For example, using this method to get the first day of each month takes 0.2 seconds on average, whereas just calling df.loc[df.index.day == 1] (presumably avoiding Python for loops under the hood) is around 300 times faster. However, that is a slice on just the day, whereas I am slicing on both month and day.
Apologies if this has been answered somewhere else -- I've searched for quite a while but perhaps was not using the correct keywords.
You can do a string comparison of the month and days at the same time.
You need the space to differentiate between 11 2 and 1 12 for example, otherwise both would be regarded as the same.
df.loc[(df.index.month.astype(str) + ' ' + df.index.day.astype(str))
       .isin(sample_days['month'].astype(str) + ' ' + sample_days['day'].astype(str))]
After getting a bit of inspiration from @Ben Pap's solution (thanks!), I've found a solution that is both fast and avoids any "hacks" like converting the datetimes to strings. It combines the month and day into a single MultiIndex, as below (you can make this a single line, but I've expanded it into multiple lines to make the idea clear).
full_index = pd.MultiIndex.from_arrays([df.index.month, df.index.day],
                                       names=['month', 'day'])
sample_index = pd.MultiIndex.from_frame(sample_days)
sample = df.loc[full_index.isin(sample_index)]
If I run this code along with my original for loop and @Ben Pap's answer, sampling 100 days from a one-year time series for 2020 (8784 hours with the leap day), I get the following solution times:
Original for loop: 0.16s
@Ben Pap's solution, combining month and day into a single string: 0.019s
Above solution using MultiIndex: 0.006s
so I think using a MultiIndex is the way to go.

How to resample yearly starting from 1st of June to 31st of May?

How do I resample a dataframe with a daily time-series index to yearly, but not from 1 Jan to 31 Dec? Instead I want the yearly sum from 1 June to 31 May.
First I did this, which gives me the yearly sum from 1 Jan to 31 Dec:
df.resample(rule='A').sum()
I have tried using the base-parameter, but it does not change the resample sum.
df.resample(rule='A', base=100).sum()
Here is a part of my dataframe:
In []: df
Out[]:
Index ET P R
2010-01-01 00:00:00 -0.013 0.0 0.773
2010-01-02 00:00:00 0.0737 0.21 0.797
2010-01-03 00:00:00 -0.048 0.0 0.926
...
In []: df.resample(rule='A', base = 0, label='left').sum()
Out []:
Index
2009-12-31 00:00:00 424.131138 871.48 541.677405
2010-12-31 00:00:00 405.625780 939.06 575.163096
2011-12-31 00:00:00 461.586365 1064.82 710.507947
...
I would really appreciate it if anyone could help me figure out how to do this.
Thank you
Use 'AS-JUN' as the rule with resample:
# Example data
idx = pd.date_range('2017-01-01', '2018-12-31')
s = pd.Series(1, idx)
# Resample
s = s.resample('AS-JUN').sum()
The resulting output:
2016-06-01 151
2017-06-01 365
2018-06-01 214
Freq: AS-JUN, dtype: int64
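Applied to a daily frame like the one in the question, a minimal sketch (with hypothetical stand-in data) would be the following; note that on recent pandas releases (>= 2.2) the 'AS' aliases are deprecated in favour of 'YS':
import pandas as pd

# Hypothetical daily data standing in for the question's df
idx = pd.date_range('2010-01-01', '2012-12-31', freq='D')
df = pd.DataFrame({'ET': 1.0, 'P': 2.0, 'R': 3.0}, index=idx)

yearly = df.resample('AS-JUN').sum()       # each bucket runs 1 June .. 31 May
# On pandas >= 2.2 the equivalent, non-deprecated alias is:
# yearly = df.resample('YS-JUN').sum()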

Resampling dataframe in pandas as a checking operation

I have a DataFrame like this:
A B value
2014-11-14 12:00:00 30.5 356.3 344
2014-11-15 00:00:00 30.5 356.3 347
2014-11-15 12:00:00 30.5 356.3 356
2014-11-16 00:00:00 30.5 356.3 349
...
2017-01-06 00:00:00 30.5 356.3 347
I want to check whether the index runs every 12 hours; some data may be missing, so there can be a jump of 24 or more hours. In that case I want to introduce NaN in the value column and copy the values from columns A and B.
I thought of using resample:
df = df.resample('12H')
but I don't know how to handle the different columns, or whether this is the right approach.
EDIT: If there is a value missing, for instance at 2015-12-12 12:00:00, I would like to add a row like this:
...
2015-12-12 00:00:00 30.5 356.3 323
2015-12-12 12:00:00 30.5 356.3 NaN *<- add this*
2015-12-13 00:00:00 30.5 356.3 347
...
You can use the asfreq method to produce an evenly spaced index every 12 hours, which will automatically put np.nan values in every gap. Then you can just forward fill columns A and B.
df1 = df.asfreq('12H')
df1[['A','B']] = df1[['A','B']].fillna(method='ffill')
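For a concrete, hypothetical frame with one 12-hour stamp missing, those two lines behave like this sketch:
import pandas as pd

# Hypothetical data: the 2015-12-12 12:00 row is missing
idx = pd.to_datetime(['2015-12-12 00:00', '2015-12-13 00:00', '2015-12-13 12:00'])
df = pd.DataFrame({'A': 30.5, 'B': 356.3, 'value': [323, 347, 356]}, index=idx)

df1 = df.asfreq('12H')                       # inserts the missing row with NaN everywhere
df1[['A', 'B']] = df1[['A', 'B']].ffill()    # restore A and B; 'value' stays NaN
print(df1)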
I would go for simply sorting your dataframe on the index and creating a new column that takes the time from the next row. The current row's time would be called "from" and the next row's time would be called "to".
The next step would be to use the two columns ("from" and "to") to build, for every row, the list of 12-hour timestamps between this row and the next one (a range, basically).
The final step would be to "explode" every line for each value in that range. Look at How to explode a list inside a Dataframe cell into separate rows
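A rough, hypothetical sketch of that shift-and-explode idea (as an alternative to the asfreq route above; all names and data are made up for illustration):
import pandas as pd
import numpy as np

# Hypothetical frame with the 2015-12-12 12:00 stamp missing
idx = pd.to_datetime(['2015-12-12 00:00', '2015-12-13 00:00'])
df = pd.DataFrame({'A': 30.5, 'B': 356.3, 'value': [323.0, 347.0]}, index=idx)

tmp = df.sort_index().reset_index().rename(columns={'index': 'from'})
tmp['to'] = tmp['from'].shift(-1)

# Every 12-hour stamp from this row up to (but not including) the next row
tmp['stamp'] = [pd.date_range(a, b, freq='12H', inclusive='left') if pd.notna(b) else [a]
                for a, b in zip(tmp['from'], tmp['to'])]

out = tmp.explode('stamp')
out.loc[out['stamp'] != out['from'], 'value'] = np.nan   # blank out the inserted rows
out = out.set_index('stamp')[['A', 'B', 'value']]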
Hope this helps :)

Pandas, Python. How to filter out days depending on number of observations?

I would like to filter out days which have fewer than 200 minute observations in them. My data looks as follows:
Time
2009-01-30 09:30:00 85.1100 100.1100
2009-01-30 09:39:00 84.9300 100.0500
2009-01-30 09:40:00 84.9000 100.0000
2009-01-30 09:45:00 84.9100 99.9400
2009-01-30 09:48:00 84.8100 99.9000
2009-01-30 09:55:00 84.7800 100.0000
... ...
2016-02-29 15:58:00 193.7200 24.8300
2016-02-29 15:59:00 193.4800 24.8700
2016-02-29 16:00:00 193.6100 24.8300
2016-03-01 09:30:00 195.2200 24.3099
2016-03-01 09:31:00 195.1000 24.3300
2016-03-01 09:32:00 195.1500 24.3100
2016-03-01 09:33:00 195.1100 24.3800
The first column is a DatetimeIndex; as you have probably noted, this is minute data and some minutes are missing from the dataset. I would like to avoid resampling to minute frequency and dealing with NA values, and instead find a way of filtering out days based on the index (if a day has more than 200 minute observations it stays; with fewer than 200 it is dropped).
Assuming that Time is a column (not an index), try something like the following:
df.ix[df.groupby(df['Time'].dt.date)['col1'].transform('count') > 200]
where col1 is a column name
If the Time column is the index:
df.ix[df.groupby(df.index.date)['col1'].transform('count') > 200]
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the stricter .iloc and .loc indexers.
So use df.loc[...] instead of the deprecated df.ix[...]
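Putting that together with .loc on a small, hypothetical minute-level frame (here 'price' stands in for col1 above):
import pandas as pd
import numpy as np

# Hypothetical minute data: one full trading day and one short day
idx = pd.date_range('2016-02-29 09:30', periods=390, freq='min').append(
      pd.date_range('2016-03-01 09:30', periods=50, freq='min'))
df = pd.DataFrame({'price': np.random.rand(len(idx))}, index=idx)

# Keep only the days that have more than 200 minute observations
counts = df.groupby(df.index.date)['price'].transform('count')
filtered = df.loc[counts > 200]          # 2016-03-01 (only 50 rows) is dropped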
