Number of occurrences where a column difference in a DataFrame satisfies a condition - python

I have a DataFrame with lots of rows, and I am just looking for a count of the rows which fulfil a criterion.
Data snippet:
mydf:
Date Time Open High Low Close
143 07:08:2015 14:55:00 300.10 300.45 300.10 300.45
144 07:08:2015 15:00:00 300.50 300.95 300.45 300.90
145 07:08:2015 15:05:00 300.90 301.20 300.75 300.90
146 07:08:2015 15:10:00 300.85 301.40 300.75 301.40
147 07:08:2015 15:15:00 301.40 301.60 301.20 301.55
148 07:08:2015 15:20:00 301.45 301.55 301.10 301.40
My current code first splits the required columns into 2 Series, and then counts the number of occurrences in the last 6 elements:
openpr = mydf['Open']
closepr = mydf['Close']          # 2 Series, one for Open and one for Close data
differ = abs(closepr - openpr)   # Series of absolute differences
myarr = differ[142:].values == 0 # last X elements as a Boolean array
sum(myarr)                       # number of occurrences with zero difference
From what I understand there is a much simpler way of achieving the above result, with minimal code and using the DataFrame directly.
TIA

I think you need to compare with eq (for ==) on the last 6 values, selected by tail, and count the matches with sum:
out = mydf['Close'].tail(6).eq(mydf['Open'].tail(6)).sum()
Your solution should be changed to use the last 6 values as well; sub is also used for - so there are fewer parentheses () in the code:
out = mydf['Close'].tail(6).sub(mydf['Open'].tail(6)).abs().eq(0).sum()

You don't need to take the difference and then its absolute value just to find where it is zero. Just find where the columns are equal in the first place.
eval
This is a pandas.DataFrame method that allows for strings to represent formulas. It turns out to be pretty quick on large datasets. I find it very readable in many circumstances.
mydf.tail(6).eval('Close == Open').sum()
If you needed to be within some delta and had to difference the columns
mydf.tail(6).eval('abs(Close - Open) < 1e-6').sum()
isclose
This is a Numpy function that acknowledges that floats are inherently a little off due to lack of precision. So we just want to know if values are close enough.
np.isclose(mydf.Open.tail(6), mydf.Close.tail(6)).sum()
However, determining whether the difference is within some delta is easier with isclose because of the built-in tolerance argument:
np.isclose(mydf.Open.tail(6), mydf.Close.tail(6), atol=1e-6).sum()
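For completeness, here is a minimal, self-contained sketch (my own reconstruction of the six rows from the question, not the asker's actual frame) showing that the three approaches above agree on the sample data:
import numpy as np
import pandas as pd
# Rebuild only the Open/Close columns of the snippet shown above.
mydf = pd.DataFrame({
    'Open':  [300.10, 300.50, 300.90, 300.85, 301.40, 301.45],
    'Close': [300.45, 300.90, 300.90, 301.40, 301.55, 301.40],
})
print(mydf['Close'].tail(6).eq(mydf['Open'].tail(6)).sum())     # 1
print(mydf.tail(6).eval('Close == Open').sum())                 # 1
print(np.isclose(mydf.Open.tail(6), mydf.Close.tail(6)).sum())  # 1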

Related

How to clean up or smoothen a time series using two criteria in Pandas

Sorry for the confusing title. I'm trying to clean up a dataset that has engine hours reported on different time intervals. I'm trying to detect and address two situations:
Engine hours reported are less than the last record's engine hours
Engine hours reported between two dates are greater than the hour difference between said dates
Sampled Date Meter eng_hours_diff date_hours_diff
2017-02-02 5336 24 24
2017-02-20 5578 242 432
2017-02-22 5625 47 48
2017-03-07 5930 305 312
2017-05-16 6968 1038 1680
2017-06-01 7182 214 384
2017-06-22 7527 345 504
2017-07-10 7919 392 432
2017-07-25 16391 8472 360
2017-08-20 8590 -7801 624
2017-09-05 8827 237 384
2017-09-26 9106 279 504
2017-10-16 9406 300 480
2017-10-28 9660 254 288
2017-11-29 10175 515 768
What I would like to do is re-write the ['Meter'] series if either of the two scenarios above comes up, and take the average between the points around it.
I'm thinking that this might require two steps: one to eliminate any inaccuracy due to the difference in engine hours being greater than the hours between the dates, and then re-calculate the ['eng_hours_diff'] column and check if there are still any that are negative.
The last two columns I've calculated like this:
dfa['eng_hours_diff'] = dfa['Meter'].diff().fillna(24)
dfa['date_hours_diff'] = dfa['Sampled Date'].diff().apply(lambda x:str(x)).apply(lambda x: x.split(' ')[0]).apply(lambda x:x.replace('NaT',"1")).apply(lambda x: int(x)*24)
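(An aside, not part of the original question: assuming 'Sampled Date' is, or can be converted to, a datetime column, the same date_hours_diff can be computed without the string-parsing chain by using the .dt accessor on the timedeltas returned by diff(). A sketch:)
import pandas as pd
dfa['Sampled Date'] = pd.to_datetime(dfa['Sampled Date'])
# diff() on a datetime column yields timedeltas; .dt.days extracts whole days.
dfa['date_hours_diff'] = dfa['Sampled Date'].diff().dt.days.fillna(1).mul(24).astype(int)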
EDIT:
Ok I think I'm getting somewhere but not quite there yet..
dfa['MeterReading'] = [x if y>0 & y<z else 0 for x, y, z in
                       zip(dfa['Meter'], dfa['eng_hours_diff'], dfa['date_hours_diff'])]
EDIT 2:
I'm much closer thanks to Bill's answer.
Applying this function will replace any record that doesn't meet the criteria with a zero. Then I'm replacing those zeros with np.nan and using the interpolate method.
The only thing that I'm missing is how to fill out the last values when they also come out as np.nan; I'm looking to see if there's an extrapolate method.
Here is the function in case anyone stumbles upon a similar problem in the future:
dfa['MeterReading'] = dfa['MeterReading'].replace({0:np.nan}).interpolate(method='polynomial', order=2, limit=5, limit_direction='both').bfill()
This is the issue that I'm having at the end. Two values were missed but since the difference becomes negative it discards all 4.
One problem with your code is that the logic condition is not doing what you want, I think. y>0 & y<z is not the same as (y>0) & (y<z), because & binds more tightly than the comparison operators (e.g. for the first row).
Putting that aside, there are in general three ways to do operations on the elements of rows in a pandas DataFrame.
For simple cases like yours where the operations are vectorizable you can do them without a for loop or list comprehension:
dfa['MeterReading'] = dfa['Meter']
condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])
dfa.loc[~condition, 'MeterReading'] = 0
For more complex logic, you can use a for loop like this:
dfa['MeterReading'] = 0
for i, row in dfa.iterrows():
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        dfa.loc[i, 'MeterReading'] = row['Meter']
Or, use apply with a custom function like:
def calc_meter_reading(row):
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        return row['Meter']
    else:
        return 0
dfa['MeterReading'] = dfa.apply(calc_meter_reading, axis=1)
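(A possible follow-up, not from the original answer: the same vectorized condition can feed straight into the interpolation step the asker describes, by marking bad readings as NaN with where() instead of 0 and then interpolating, mirroring the question's own interpolate/bfill approach.)
condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])
# where() keeps Meter where the condition holds and puts NaN elsewhere;
# interpolate() then fills the gaps and bfill() covers any leading NaNs.
dfa['MeterReading'] = dfa['Meter'].where(condition).interpolate().bfill()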

Pandas: How to efficiently diff() after a groupby() operation?

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to use the diff() function in a performant manner on a subset of the data.
Here is what my dataset looks like:
prec type
location_id hours
135 78 12.0 A
79 14.0 A
80 14.3 A
81 15.0 A
82 15.0 A
83 15.0 A
84 15.5 A
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is apply the diff() function for each location on the prec column. The original dataset piles up the prec numbers; by applying diff() I will get the appropriate prec value for each hour.
With these in mind, I have implemented the following algorithm in Pandas:
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
for location_id, data_of_location in df_filtered.groupby(level="location_id"):
    df_data.loc[data_of_location.index, "prec"] = data_of_location.prec.diff().replace(np.nan, 0.0)
del df_filtered
This works really well functionally; however, the performance and the memory consumption are horrible. It is taking around 30 minutes on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
Also, the overall memory consumption of the Python script is sky-rocketing during this operation; it grows around 300%! The memory consumed by the main df_data data frame doesn't change but the overall process memory consumption rises.
With the input from #Quang Hoang and #Ben. T, I figured out a solution that is pretty fast but still consumes a lot of memory.
# Filter the data first
df_filtered = df_data[df_data.type == "A"] # only work on locations with 'A' type
df_filtered = df_filtered.query('hours > 0 & hours <= 120') # only work on certain hours
# Apply the diff()
df_diffed = df_filtered.groupby(level="location_id").prec.diff().replace(np.nan, 0.0)
df_data.loc[df_diffed.index, "prec"] = df_diffed
del df_diffed
del df_filtered
I am guessing 2 things can be done to improve memory usage:
df_filtered seems like a copy of the data; that should increase the memory a lot.
df_diffed is also a copy.
The memory usage is very intensive while computing these two variables. I am not sure if there is any in-place way to execute such operations.
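(A sketch of one way to cut down on the intermediate copies, not a tested answer: build a boolean mask instead of materializing df_filtered, and assign the grouped diff back with .loc. The index and column names follow the question; the masked selection still allocates a temporary, but nothing needs to be kept around or deleted afterwards.)
mask = (
    (df_data["type"] == "A")
    & (df_data.index.get_level_values("hours") > 0)
    & (df_data.index.get_level_values("hours") <= 120)
)
df_data.loc[mask, "prec"] = (
    df_data.loc[mask]
           .groupby(level="location_id")["prec"]
           .diff()
           .fillna(0.0)
)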

How to *extract* latitude and longitude greedily in Pandas?

I have a dataframe in Pandas like this:
id loc
40 100005090 -38.229889,-72.326819
188 100020985 ut: -33.442101,-70.650327
249 10002732 ut: -33.437478,-70.614637
361 100039605 ut: 10.646041,-71.619039 \N
440 100048229 4.666439,-74.071554
I need to extract the GPS points. I first ask for a contains match of a certain regex (found here on SO, see below) to match all cells that have a "valid" lat/long value. However, I also need to extract these numbers and either put them in a Series of their own (and then call split on the comma) or put them in two new pandas Series. I have tried the following for the extraction part:
ids_with_latlong["loc"].str.extract("[-+]?([1-8]?\d(\.\d+)?|90(\.0+)?),\s*[-+]?(180(\.0+)?|((1[0-7]\d)|([1-9]?\d))(\.\d+)?)$")
but it looks, from the output, like the regexp is not doing the matching greedily, because I get something like this:
0 1 2 3 4 5 6 7 8
40 38.229889 .229889 NaN 72.326819 NaN 72 NaN 72 .326819
188 33.442101 .442101 NaN 70.650327 NaN 70 NaN 70 .650327
Obviously it's matching more than I want (I would just need cols 0, 1, and 4), but simply dropping them is too much of a hack for me to do. Notice that the extract function also got rid of the +/- signs at the beginning. If anyone has a solution, I'd really appreciate it.
#HYRY's answer looks pretty good to me. This is just an alternate approach that uses built-in pandas methods rather than a regex. I think it's a little simpler to read, though I'm not sure if it will be sufficiently general for all your cases (it works fine on this sample data though).
df['loc'] = df['loc'].str.replace('ut: ','')
df['lat'] = df['loc'].apply( lambda x: x.split(',')[0] )
df['lon'] = df['loc'].apply( lambda x: x.split(',')[1] )
id loc lat lon
0 100005090 -38.229889,-72.326819 -38.229889 -72.326819
1 100020985 -33.442101,-70.650327 -33.442101 -70.650327
2 10002732 -33.437478,-70.614637 -33.437478 -70.614637
3 100039605 10.646041,-71.619039 10.646041 -71.619039
4 100048229 4.666439,-74.071554 4.666439 -74.071554
As a general suggestion for this type of approach you might think about doing it in the following steps:
1) remove extraneous characters with replace (or maybe this is where the regex is best)
2) split into pieces
3) check that each piece is valid (all you need to do is check that it's a number, although you could take the extra step of checking that it falls into the valid range for a lat or lon)
You can use (?:) to make a group non-capturing, so it is not returned as a separate column:
df["loc"].str.extract(r"((?:[\+-])?\d+\.\d+)\s*,\s*((?:[\+-])?\d+\.\d+)")

How to get average of a column inside a time era?

I need to get the average of a column (which I will pass as an input to my function) during a precise era:
In my case the date is the index, so I can get the week with index.week.
Then I would like to compute some basic statistics every 2 weeks, for instance.
So I will need to "slice" the dataframe every 2 weeks and then compute. It can destroy the part of the dataframe already computed, but what's still in the dataframe mustn't be erased.
My first guess was to parse the data with a row iterator and then compare it:
# get the week num. of the first row
start_week = temp.data.index.week[0]
# temp.data is my data frame
for index, row in temp.data.iterrows():
    while index.week < start_week + 2:
        print index.week
but it's really slow, so it probably isn't the proper way.
Welcome to Stack Overflow. Please note that your question is not very specific, which makes it difficult to supply you with exactly what you want. Optimally, you would supply code to recreate your dataset and also post the expected outcome. I'll post regarding two parts: (i) working with dataframes sliced using time-specific functions and (ii) applying statistical functions using rolling window operations.
Working with Dataframes and time indices
The question is not how to get the mean of x, because you know how to do that (x.mean()). The question is, how to get x: How do you select elements of a dataframe which satisfy certain conditions on their timestamp? I will use a series generated by the documentation which I found after googling for one minute:
In[13]: ts
Out[13]:
2011-01-31 0.356701
2011-02-28 -0.814078
2011-03-31 1.382372
2011-04-29 0.604897
2011-05-31 1.415689
2011-06-30 -0.237188
2011-07-29 -0.197657
2011-08-31 -0.935760
2011-09-30 2.060165
2011-10-31 0.618824
2011-11-30 1.670747
2011-12-30 -1.690927
Then, you can select some time series based on index weeks using
ts[(ts.index.week > 3) & (ts.index.week < 10)]
And specifically, if you want to get the mean of this series, you can do
ts[(ts.index.week > 3) & (ts.index.week < 10)].mean()
If you work with a dataframe, you might want to select the column first:
df[(df.index.week > 3) & (df.index.week < 10)]['someColumn'].mean()
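(An aside, not in the original answer: if the goal is literally a statistic for every consecutive two-week block, rather than one hand-picked week range, resample may be closer to what is wanted. A sketch, assuming the datetime index from the question and the same placeholder column name:)
df['someColumn'].resample('2W').mean()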
Rolling window operations
Now, if you want to apply rolling statistics to a time-series-indexed pandas object, have a look at this part of the manual.
Given that I have a monthly time series, say I want the mean for 3 months, I'd do:
rolling_mean(ts, window=3)
Out[25]:
2011-01-31 NaN
2011-02-28 NaN
2011-03-31 0.308331
2011-04-29 0.391064
2011-05-31 1.134319
2011-06-30 0.594466
2011-07-29 0.326948
2011-08-31 -0.456868
2011-09-30 0.308916
2011-10-31 0.581076
2011-11-30 1.449912
2011-12-30 0.199548
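(A present-day note, not part of the original answer: the top-level rolling_mean function has since been removed from pandas; with current versions the same result comes from the rolling accessor.)
ts.rolling(window=3).mean()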

Indexing by row counts in a pandas dataframe

I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? This would work, but is not leveraging the power of pandas whatsoever.
Adding some example data:
item_id date X DUMMY_ROWS
20 2010-11-01 16759 0
2010-12-01 16961 1
2011-01-01 17126 2
2011-02-01 17255 3
2011-03-01 17400 4
2011-04-01 17551 5
21 2007-09-01 4 6
2007-10-01 5 7
2007-11-01 6 8
2007-12-01 10 9
22 2006-05-01 10 10
2006-07-01 13 11
23 2006-05-01 2 12
24 2008-01-01 2 13
2008-02-01 9 14
2008-03-01 18 15
2008-04-01 19 16
2008-05-01 23 17
2008-06-01 32 18
I've added a dummy rows column that does not exist in the data for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
This leaves your original df intact and gets you the mean across all months for each item id.
For your updated question (great example, by the way) I think the approach would be to add an "item_sequence_id". I've done this in the past with similar data.
df.sort(['item_id', 'date'], inplace=True)
def sequence_id(item):
    item['seq_id'] = range(0, len(item), 1)
    return item
df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id allows you to identify the position of each data point in time per item_id. Assigning non-unique seq_id values to the items will allow you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc... actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method for this I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by #cwharland:
def sequence_id(item):
    item['seq'] = range(0, len(item), 1)
    return item
dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, BUT it's almost 4 times faster (I can't compare the methods on the full 13 million row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
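(Similarly, for anyone on a recent pandas version, and not from the original thread: DataFrame.mean(level=...) has since been removed; the groupby form below gives the same result.)
dfWithSeqID_new.groupby(level=1).mean().head()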
