Calculating distance between a sequence of low points in dataset - python

I have a dataset that is composed of 360 measurements stored in a python dictionary looking something like this:
data = {137: 0.0, 210: 102.700984375, 162: 0.7173203125, 39: 134.47830729166665, 78: 10.707765625, 107: 0.0, 194: 142.042953125, 316: 2.6041666666666666e-06, 329: 0.0, 240: 46.4257578125, ...}
All measurements are stored in a key-value-pair.
Plotted as a scatter plot (key on x, value on y) the data looks like this:
Scatter plot of data
As you can see, there are sections in the data where the stored value is (close to) 0. I would now like to write a script that calculates the distance between those sections - you could also call it the 'period' of the data.
What I have come up with feels very crude:
I go through all items in sequence and record the first key that has a value of 0. Then I continue through the data until I find a key with a value above 0 and record the key before it. (I throw out all sequences shorter than 5 consecutive 0s.)
Now I have the start and the end of my first sequence of 0s. I continue like this until I have all of those sequences.
As there are ALWAYS two of these sequences in the data (there is no way for there to be more), I now calculate the midpoint of each sequence and subtract one midpoint from the other.
This gives me the distance.
But:
This method is very prone to error. Sometimes there are artifacts in the middle of a sequence of 0s (slightly higher values every 2-4 data points).
Also, if the data starts partway through a sequence of 0s, I end up with three sequences.
There has to be a more elegant way of doing this.
I already looked into some scipy functions for determining the period of an oscillating signal, but the data seems to be too messy to get good results.
EDIT 1:
Here is the full dataset (should be easily importable as a python dictionary).
Python dictionary of sample data
EDIT 2:
Following Droid's method, I get this nicely structured DataFrame:
(...)
79 79 9.831346 False 1
80 80 10.168792 False 1
81 81 10.354690 False 1
82 82 10.439753 False 1
83 83 10.714523 False 1
84 84 10.859503 False 1
85 85 10.809422 False 1
86 86 10.257599 False 1
87 87 0.159802 True 2
88 88 0.000000 True 2
89 89 0.000000 True 2
90 90 0.000000 True 2
91 91 0.000000 True 2
92 92 0.000000 True 2
93 93 0.000000 True 2
(...)

First of all, do yourself a favour and convert the data into a DataFrame :) by doing something like pd.DataFrame.from_records(data).T.
Then the problem looks a lot like finding the length of sequences of the same value, the "value" being a boolean indicating whether the signal is below a certain arbitrary threshold (say 0.05, but you can make this exactly zero if you want).
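For instance, a minimal conversion sketch for a dict of scalars (from_dict with orient='index' is one construction I know works here; the column name 'y' is just a choice to match the code below):

import pandas as pd

# keys become the index (your x), values become the single column 'y'
df = pd.DataFrame.from_dict(data, orient='index', columns=['y']).sort_index()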
You can do that by defining a grouper that identifies all the values pertaining to the same sequence.
For example, if df is your dataframe, the index is your x and y are the signal values, you can do something like (after having ordered by the index x)
df['is_less'] = df['y'] < 0.05
df['grouper'] = df['is_less'].diff().ne(0).cumsum()
What the second line does is take the discrete difference between consecutive rows, flag the rows where that difference is non-zero (i.e. where the boolean flips), and then take a cumulative sum to get integer labels. This gives you a grouper that you can use to count the length of each run, which is exactly the distance between the start and the end of your "valleys" since you have an integer index.
So you can simply do
df[df.is_less].groupby('grouper').count()
You can play around with the threshold to get the results exactly the way you want. This method will count all segments made of contiguous values (according to your initial condition); as soon as the condition is false you'll get a new grouper.
I tested with your data and verified that it is working.
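For completeness, here is a hedged end-to-end sketch of the whole approach, through to the distance the question asks for. The 0.05 threshold and the minimum run length of 5 come from the question and answer above; the shift/ne form of the grouper is an equivalent variant of the diff-based one, and the last line assumes exactly two valleys survive the length filter.

import pandas as pd

# 'data' is the {x: value} dict from the question
df = pd.DataFrame.from_dict(data, orient='index', columns=['y']).sort_index()

df['is_less'] = df['y'] < 0.05                                     # low-value flag
df['grouper'] = df['is_less'].ne(df['is_less'].shift()).cumsum()   # run labels

# keep only the low runs that are at least 5 consecutive points long
runs = df[df['is_less']].groupby('grouper').filter(lambda g: len(g) >= 5)

# midpoint (mean x) of each surviving run, then the gap between the two midpoints
midpoints = runs.index.to_series().groupby(runs['grouper']).mean()
distance = abs(midpoints.iloc[1] - midpoints.iloc[0])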

Related

How to clean up or smoothen a time series using two criteria in Pandas

Sorry for the confusing title. I'm trying to clean up a dataset that has engine hours reported at different time intervals. I'm trying to detect and address two situations:
Engine hours reported are less than the previous record's engine hours
Engine hours reported between two dates are greater than the number of hours between those dates
Sampled Date Meter eng_hours_diff date_hours_diff
2017-02-02 5336 24 24
2017-02-20 5578 242 432
2017-02-22 5625 47 48
2017-03-07 5930 305 312
2017-05-16 6968 1038 1680
2017-06-01 7182 214 384
2017-06-22 7527 345 504
2017-07-10 7919 392 432
2017-07-25 16391 8472 360
2017-08-20 8590 -7801 624
2017-09-05 8827 237 384
2017-09-26 9106 279 504
2017-10-16 9406 300 480
2017-10-28 9660 254 288
2017-11-29 10175 515 768
What I would like to do is rewrite the ['Meter'] series whenever either of the two scenarios above comes up, replacing the value with the average of the points around it.
I'm thinking that this might require two steps: one to eliminate any inaccuracy due to the difference in engine hours being greater than the hours between the dates, and then re-calculating the ['eng_hours_diff'] column and checking whether any values are still negative.
The last two columns I've calculated like this:
dfa['eng_hours_diff'] = dfa['Meter'].diff().fillna(24)
dfa['date_hours_diff'] = dfa['Sampled Date'].diff().apply(lambda x:str(x)).apply(lambda x: x.split(' ')[0]).apply(lambda x:x.replace('NaT',"1")).apply(lambda x: int(x)*24)
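As an aside, if 'Sampled Date' is parsed as a datetime column, the same column can be built more directly. This is a hedged alternative rather than a required change (it assumes pandas is imported as pd; the fillna(1) mirrors the NaT -> 1 replacement above):

dfa['Sampled Date'] = pd.to_datetime(dfa['Sampled Date'])
# day difference between samples, first row defaults to 1 day, expressed in hours
dfa['date_hours_diff'] = dfa['Sampled Date'].diff().dt.days.fillna(1).mul(24).astype(int)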
EDIT:
Ok I think I'm getting somewhere but not quite there yet..
dfa['MeterReading'] = [x if y>0 & y<z else 0 for x, y, z in
                       zip(dfa['Meter'], dfa['eng_hours_diff'], dfa['date_hours_diff'])]
EDIT 2:
I'm much closer thanks to Bill's answer.
Applying this function will replace any record that doesn't meet the criteria with a zero. Then I'm replacing those zeros with np.nan and using the interpolate method.
The only thing I'm missing is how to fill in the last values when they also come out as np.nan; I'm looking to see whether there's an extrapolate method.
Here is the function in case anyone stumbles upon a similar problem in the future:
dfa['MeterReading'] = dfa['MeterReading'].replace({0:np.nan}).interpolate(method='polynomial', order=2, limit=5, limit_direction='both').bfill()
This is the issue that I'm having at the end. Two values were missed but since the difference becomes negative it discards all 4.
One problem with your code is that the logic condition is not doing what you want, I think: y>0 & y<z is not the same as (y>0) & (y<z) (e.g. for the first row).
Putting that aside, there are in general three ways to do operations on the elements of rows in a pandas DataFrame.
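To see why: & binds more tightly than the comparison operators, so the first form is parsed as the chained comparison y > (0 & y) < z. A quick check with the values from the first row of the table above (24 and 24):

y, z = 24, 24                # eng_hours_diff and date_hours_diff of the first row

print(y > 0 & y < z)         # True  -- parsed as y > (0 & y) < z, i.e. 24 > 0 < 24
print((y > 0) & (y < z))     # False -- (24 > 0) is True, (24 < 24) is False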
For simple cases like yours where the operations are vectorizable you can do them without a for loop or list comprehension:
dfa['MeterReading'] = dfa['Meter']
condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])
dfa.loc[~condition, 'MeterReading'] = 0
For more complex logic, you can use a for loop like this:
dfa['MeterReading'] = 0
for i, row in dfa.iterrows():
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        dfa.loc[i, 'MeterReading'] = row['Meter']
Or, use apply with a custom function like:
def calc_meter_reading(row):
    if (row['eng_hours_diff'] > 0) & (row['eng_hours_diff'] < row['date_hours_diff']):
        return row['Meter']
    else:
        return 0

dfa['MeterReading'] = dfa.apply(calc_meter_reading, axis=1)
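Since the question's EDIT 2 then replaces the zeros with np.nan and interpolates, a hedged sketch that feeds the vectorized condition straight into that step (column names as above, interpolation settings taken from the asker's own line) could be:

condition = (dfa['eng_hours_diff'] > 0) & (dfa['eng_hours_diff'] < dfa['date_hours_diff'])

# invalid readings become NaN directly instead of 0, then get filled by interpolation
dfa['MeterReading'] = (dfa['Meter']
                       .where(condition)
                       .interpolate(method='polynomial', order=2,
                                    limit=5, limit_direction='both')
                       .bfill())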

Median and quantile values in Pyspark

In my dataframe I have an age column. The total number of rows is approx. 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code but the computation time is huge (maybe my process is very bad).
Is there any good way to improve this?
Dataframe example:
id age
1 18
2 32
3 54
4 63
5 42
6 23
What I have done so far:
#Summary stats
df.describe('age').show()
#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)
The first improvement to make would be to do all the quantile calculations at the same time:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
Also, note that you use the exact calculation of the quantiles. From the documentation we can see that (emphasis added by me):
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe I expect that some error is acceptable in these calculations, but it will be a trade-off between speed and precision (although anything more than 0 could have a significant speed improvement).
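For example, a minimal sketch with a small nonzero relativeError (the 0.001 here is purely illustrative; pick whatever error your use case tolerates):

# one pass over the data, approximate quantiles: much cheaper than relativeError=0
q25, q50, q75 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.001)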

Num of occurrences where Column difference in DataFrame satisfies a condition

I have a Dataframe with lots of Rows, and I am just looking for a count of the rows which fulfil a criteria.
Data snippet:
mydf:
Date Time Open High Low Close
143 07:08:2015 14:55:00 300.10 300.45 300.10 300.45
144 07:08:2015 15:00:00 300.50 300.95 300.45 300.90
145 07:08:2015 15:05:00 300.90 301.20 300.75 300.90
146 07:08:2015 15:10:00 300.85 301.40 300.75 301.40
147 07:08:2015 15:15:00 301.40 301.60 301.20 301.55
148 07:08:2015 15:20:00 301.45 301.55 301.10 301.40
My current code first splits the required columns into two series, and then counts the number of occurrences of zero difference among the last 6 elements:
openpr = mydf['Open']
closepr = mydf['Close'] # 2 Series, one for Open and One for Close data
differ = abs(closepr - openpr) #I have a series list with absolute Difference.
myarr = differ[142:].values == 0 # last X elements
sum(myarr) #Num of occurances with Zero Difference.
From what I understand there is a much simpler way of achieving the above result with minimal code, using the DataFrame directly.
TIA
I think you need to compare with eq (for ==) on the last 6 values selected by tail, and count the matches with sum:
out = mydf['Close'].tail(6).eq(mydf['Open'].tail(6)).sum()
Your solution should also be changed to use the last 6 values; sub is used for the subtraction to avoid extra parentheses:
out = mydf['Close'].tail(6).sub(mydf['Open'].tail(6)).abs().eq(0).sum()
You don't need to take the difference and then the absolute value just to find where it is zero. Just find where they're equal in the first place.
eval
This is a pandas.DataFrame method that allows for strings to represent formulas. It turns out to be pretty quick on large datasets. I find it very readable in many circumstances.
mydf.tail(6).eval('Close == Open').sum()
If you needed to be within some delta and had to difference the columns
mydf.tail(6).eval('abs(Close - Open) < 1e-6').sum()
isclose
This is a Numpy function that acknowledges that floats are inherently a little off due to lack of precision. So we just want to know if values are close enough.
np.isclose(mydf.Open.tail(6), mydf.Close.tail(6)).sum()
However, determining whether the difference is within some delta is easier when using isclose because of the built-in tolerance argument:
np.isclose(mydf.Open.tail(6), mydf.Close.tail(6), atol=1e-6).sum()
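As a quick sanity check, a sketch that rebuilds the six rows from the question and counts the matches (only row 145 has Open equal to Close, so each variant should return 1):

import numpy as np
import pandas as pd

mydf = pd.DataFrame({
    'Open':  [300.10, 300.50, 300.90, 300.85, 301.40, 301.45],
    'Close': [300.45, 300.90, 300.90, 301.40, 301.55, 301.40],
})

print(mydf['Close'].tail(6).eq(mydf['Open'].tail(6)).sum())      # 1
print(mydf.tail(6).eval('Close == Open').sum())                  # 1
print(np.isclose(mydf.Open.tail(6), mydf.Close.tail(6)).sum())   # 1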

Python, Pandas: Filter dataframe to a subset and update this subset in place

I have a pandas dataframe that looks like:
cleanText.head()
source word count
0 twain_ess 988
1 twain_ess works 139
2 twain_ess short 139
3 twain_ess complete 139
4 twain_ess would 98
5 twain_ess push 94
And a dictionary that contains the total word count for each source:
titles
{'orw_ess': 1729, 'orw_novel': 15534, 'twain_ess': 7680, 'twain_novel': 60004}
My goal is to normalize the word counts for each source by the total number of words in that source, i.e. turn them into percentages. This seems like it should be trivial, but Python seems to make it very difficult (if anyone could explain the rules for in-place operations to me, that would be great).
The caveat comes from needing to filter the entries in cleanText down to just those from a single source, and then trying to divide the counts for this subset in place by the value in the dictionary.
# Adjust total word counts and normalize
for key, value in titles.items():
    # This corrects the total words for overcounting the '' entries
    overcounted = cleanText[cleanText.iloc[:, 0] == key].iloc[0, 2]
    titles[key] = titles[key] - overcounted
    # This is where I divide by total words, however it does not save inplace, or at all for that matter
    cleanText[cleanText.iloc[:, 0] == key].iloc[:, 2] = cleanText[cleanText.iloc[:, 0] == key]['count'] / titles[key]
If anyone could explain how to alter this division statement so that the output is actually saved in the original column that would be great.
Thanks
If I understand correctly:
cleanText['count']/cleanText['source'].map(titles)
Which gives you:
0 0.128646
1 0.018099
2 0.018099
3 0.018099
4 0.012760
5 0.012240
dtype: float64
To re-assign these percentage values into your count column, use:
cleanText['count'] = cleanText['count']/cleanText['source'].map(titles)
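If you also want to keep the correction for the over-counted '' entries from the original loop, a hedged sketch of a fully vectorized version (it assumes, as the loop does, that the '' entry is the first row of each source):

import pandas as pd

# count attributed to the '' entry of each source (the loop's .iloc[0, 2])
overcounted = cleanText.groupby('source')['count'].first()

# corrected totals per source, then normalize every row in one shot
corrected_totals = pd.Series(titles) - overcounted
cleanText['count'] = cleanText['count'] / cleanText['source'].map(corrected_totals)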

Comparison between one element and all the others of a DataFrame column

I have a list of tuples which I turned into a DataFrame with thousands of rows, like this:
frag mass prot_position
0 TFDEHNAPNSNSNK 1573.675712 2
1 EPGANAIGMVAFK 1303.659458 29
2 GTIK 417.258734 2
3 SPWPSMAR 930.438172 44
4 LPAK 427.279469 29
5 NEDSFVVWEQIINSLSALK 2191.116099 17
...
and I have the follow rule:
def are_dif(m1, m2, ppm=10):
    if abs((m1 - m2) / m1) < ppm * 0.000001:
        v = False
    else:
        v = True
    return v
So, I only want the "frag"s whose mass differs from all the other fragments' masses. How can I achieve that "selection"?
Then, I have a list named "pinfo" that contains:
d = {'id':id, 'seq':seq_code, "1HW_fit":hits_fit}
# one for each protein
# each dictionary has the position of the protein that it describes.
So, I want to add 1 to the "hits_fit" value in the dictionary corresponding to that protein.
If I'm understanding correctly (not sure if I am), you can accomplish quite a bit by just sorting. First though, let me adjust the data to have a mix of close and far values for mass:
Unnamed: 0 frag mass prot_position
0 0 TFDEHNAPNSNSNK 1573.675712 2
1 1 EPGANAIGMVAFK 1573.675700 29
2 2 GTIK 417.258734 2
3 3 SPWPSMAR 417.258700 44
4 4 LPAK 427.279469 29
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17
Then I think you can do something like the following to select the "good" ones. First, create 'pdiff' (percent difference) to see how close mass is to the nearest neighbors:
ppm = .00001
df = df.sort_values('mass')   # DataFrame.sort() has since been removed; sort_values is the current API
df['pdiff'] = (df.mass - df.mass.shift()) / df.mass
Unnamed: 0 frag mass prot_position pdiff
3 3 SPWPSMAR 417.258700 44 NaN
2 2 GTIK 417.258734 2 8.148421e-08
4 4 LPAK 427.279469 29 2.345241e-02
1 1 EPGANAIGMVAFK 1573.675700 29 7.284831e-01
0 0 TFDEHNAPNSNSNK 1573.675712 2 7.625459e-09
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 2.817926e-01
The first and last data lines make this a little tricky so this next line backfills the first line and repeats the last line so that the following mask works correctly. This works for the example here, but might need to be tweaked for other cases (but only as far as the first and last lines of data are concerned).
df = df.iloc[list(range(len(df))) + [-1]].bfill()   # list(...) is needed on Python 3, where a range cannot be concatenated with a list
df[ (df['pdiff'] > ppm) & (df['pdiff'].shift(-1) > ppm) ]
Results:
Unnamed: 0 frag mass prot_position pdiff
4 4 LPAK 427.279469 29 0.023452
5 5 NEDSFVVWEQIINSLSALK 2191.116099 17 0.281793
Sorry, I don't understand the second part of the question at all.
Edit to add: As mentioned in a comment to #AmiTavory's answer, I think possibly the sorting approach and groupby approach could be combined for a simpler answer than this. I might try at a later time, but everyone should feel free to give this a shot themselves if interested.
Here's something that's slightly different from what you asked, but it is very simple, and I think gives a similar effect.
Using numpy.round, you can create a new column
import numpy as np
df['roundedMass'] = np.round(df.mass, 6)
Following that, you can group the frags by the rounded mass and use nunique to count the distinct frags in each group. Then filter for the groups of size 1 (a sketch of that step follows the code below).
So, the number of frags per bin is:
df.frag.groupby(np.round(df.mass, 6)).nunique()
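To finish the "filter for the groups of size 1" step, one possible follow-up (assuming the same df; the variable names are mine):

rounded = np.round(df.mass, 6)
counts = df.frag.groupby(rounded).nunique()

# keep only the rows whose rounded mass maps to a single frag
unique_masses = counts[counts == 1].index
result = df[rounded.isin(unique_masses)]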
Another solution is to create a duplicate of your list (if you need to preserve it for further processing later), iterate over it, and remove every element that does not satisfy your rule (comparing m1 and m2).
You will get a new list with only the unique masses.
Just don't forget that if you do need to use the original list later, you will need to use deepcopy.
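A minimal sketch of that brute-force idea, reusing are_dif from the question (it is quadratic in the number of fragments, so only practical for modest sizes; the variable names are mine):

masses = df['mass'].tolist()
frags = df['frag'].tolist()

# keep a fragment only if its mass differs (per the ppm rule) from every other mass
unique_frags = []
for i, m1 in enumerate(masses):
    if all(are_dif(m1, m2) for j, m2 in enumerate(masses) if i != j):
        unique_frags.append(frags[i])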
