I have a pandas dataframe with two columns, both datetime instances. The first column contains measurement times and the second column is the first column plus a constant offset. E.g. assuming a constant offset of 1 (simple numbers used for illustration) gives:
index  Measurement_time  offset_time
0      0.1               1.2
1      0.5               1.5
2      1.2               2.2
3      2.4               3.4
I would like to find, for each offset_time, the index of the Measurement_time that most closely matches it, with the condition that the Measurement_time must be smaller than or equal to the offset_time. The solution for the given example would therefore be:
index = [2, 2, 2, 3]
I have tried using get_loc and making a mask but because my dataframe is large, these solutions are too inefficient.
Any help would be greatly appreciated!
Let's use np.searchsorted to find the indices of the closest matches:
import numpy as np

s = df['Measurement_time'].sort_values()
# last position where Measurement_time <= offset_time
np.searchsorted(s, df['offset_time'], side='right') - 1
Result:
array([2, 2, 2, 3], dtype=int64)
Note: you may skip the .sort_values() step if your dataframe is already sorted on the Measurement_time column.
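For completeness, here is a minimal, self-contained sketch of the approach on the example data from the question (the small float values stand in for the datetimes):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Measurement_time': [0.1, 0.5, 1.2, 2.4],
                   'offset_time': [1.2, 1.5, 2.2, 3.4]})
s = df['Measurement_time'].sort_values()
idx = np.searchsorted(s, df['offset_time'], side='right') - 1
print(idx)  # [2 2 2 3]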
This is an easy question since it is so fundamental. See, in R, when you want to slice rows from a dataframe based on some condition, you just write the condition and it selects the corresponding rows. For example, if you have a condition such that only the third row in the dataframe meets it, it returns the third row. Easy peasy.
In Python, you have to use loc. If the index matches the row numbers, then everything is great. If you have been removing rows or re-ordering them for any reason, you have to remember that loc is based on the INDEX, NOT the row position. So if in your current dataframe the third row matches your boolean condition in the loc statement, it will retrieve the row whose index label is 3, which could be the 50th row rather than your current third row. This seems to be an incredibly dangerous way to select rows, so I know I am doing something wrong.
So what is the best-practice method of ensuring you select the nth row based on a boolean condition? Is it just to use loc and "always remember to reset_index, otherwise if you miss it even once your entire dataframe is wrecked"? This can't be it.
Use iloc instead of loc for integer-based (positional) indexing:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=[1, 2, 3])
df
Dataset:
A B C
1 1 4 7
2 2 5 8
3 3 6 9
Label-based:
df.loc[1]
Results:
A 1
B 4
C 7
Integer-based (positional):
df.iloc[1]
Results:
A 2
B 5
C 8
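If the underlying goal is "give me the nth row among the rows that satisfy a condition", one sketch (on the same made-up frame) is to filter with a boolean mask first and then index the result by position with iloc, so no reset_index is needed:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}, index=[1, 2, 3])
matches = df[df['A'] > 1]   # rows meeting the condition, original index labels preserved
print(matches.iloc[0])      # first matching row by position: A=2, B=5, C=8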
I am currently trying to apply the Hampel filter to my dataframe in Python. I have looked around and there isn't a lot of documentation for its implementation in Python. I found one post, but it looks like it was created before there was an actual hampel package/function: someone wrote a function to do a rolling mean calculation rather than using the filter from the package itself, and even the site for the hampel package is minimal.
I am looking at the number of Covid cases per day by FIPS code. I have a dataframe of 470 time series (in days); each column is a different FIPS code and each row has the number of Covid cases per day (indexed by date, not by day number from the start).
The hampel package is very straightforward: it has two options for output. It will either return a list of the indices where it thinks there are outliers, or it will replace the outliers with the median within the data.
The two ways of calling hampel are:
[IN]:
import pandas as pd
from hampel import hampel

ts = pd.Series([1, 2, 1, 1, 1, 2, 13, 2, 1, 2, 15, 1, 2])
[IN]: # to return indices:
outlier_indices = hampel(ts, window_size=5, n=3)
print("Outlier Indices: ", outlier_indices)
[OUT]:
Outlier Indices: [6, 10]
[IN]: # to return series with rolling medians replaced*** I'm using this format
ts_imputation = hampel(ts, window_size=5, n=3, imputation=True)
ts_imputation
[OUT]:
0 1.0
1 2.0
2 1.0
3 1.0
4 1.0
5 2.0
6 2.0
7 2.0
8 1.0
9 2.0
10 2.0
11 1.0
12 2.0
dtype: float64
So with my dataframe I want to replace the outliers in each column with the column median; I am using a window of 21 and a threshold of 6 (because of the data setup). I should mention that each column starts with a different number of 0's in its rows. For example, the first 80 rows of column one may be 0's while the first 95 rows of column two may be 0's, because each FIPS code has a different number of days. Given this, I tried to use the .apply method with the following function:
[IN]:
def hamp(col):
    no_out = hampel(col, window_size=21, n=6, imputation=True)
    return no_out
[IN]:
df = df.apply(hamp, axis=1)
However, when I print it, my dataframe is now just all 0's. Can someone tell me what I am doing wrong?
Thank you!
Recently, sktime added a HampelFilter:
from sktime.transformations.series.outlier_detection import HampelFilter
y = your_data
transformer = HampelFilter(window_length=10)
y_hat = transformer.fit_transform(y)
You can also read the documentation here: HampelFilter_sktime
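As a quick illustration on the small series from the question (a sketch; in the sktime versions I have used, HampelFilter replaces detected outliers with NaN by default, which you can then fill yourself):
import pandas as pd
from sktime.transformations.series.outlier_detection import HampelFilter
ts = pd.Series([1, 2, 1, 1, 1, 2, 13, 2, 1, 2, 15, 1, 2], dtype=float)
transformer = HampelFilter(window_length=5)
y_hat = transformer.fit_transform(ts)  # the spikes (13 and 15) should come back as NaN
# impute the flagged points yourself, e.g. with a centred rolling median
y_filled = y_hat.fillna(ts.rolling(5, center=True, min_periods=1).median())
print(y_filled)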
I have some code right now that works fine, but it is entirely too slow. I'm trying to add up the weighted sum of squares for every row in a pandas dataframe. I'd like to vectorize the operations, since that seems to run much, much faster, but there's a wrinkle in the code that has defeated my attempts to vectorize.
totalDist = 0.0
for index, row in pU.iterrows():
    totalDist += row['distance'][row['schoolChoice']] ** 2.0 * float(row['students'])
Each row has 'students' (an integer), 'distance' (a numpy array of length n), and 'schoolChoice' (an integer less than or equal to n-1, which designates which element of the distance array I'm using for the calculation). Basically, I'm pulling a row-specific value from the numpy array. I've used df.lookup, but that actually seems to be slower and is being deprecated. Any suggestions on how to make this run faster? Thanks in advance!
If all else fails, you can use .apply() on each row:
totalSum = df.apply(lambda row: row.distance[row.schoolChoice] ** 2 * row.students, axis=1).sum()
To go faster, you can use numpy:
import numpy
totalSum = (numpy.stack(df.distance)[range(len(df.schoolChoice)), df.schoolChoice] ** 2 * df.students).sum()
The numpy method requires distance be the same length for each row - however it is possible to pad them to the same length if needed. (Though this may affect any gains made.)
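For example, a sketch of zero-padding (this assumes each distance entry is a list or array of varying length, and that schoolChoice never points at a padded slot):
import numpy as np
max_len = df['distance'].str.len().max()
padded = np.zeros((len(df), max_len))
for i, arr in enumerate(df['distance']):
    padded[i, :len(arr)] = arr
totalSum = (padded[np.arange(len(df)), df['schoolChoice']] ** 2 * df['students']).sum()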
Tested on a df of 150,000 rows like:
distance schoolChoice students
0 [1, 2, 3] 0 4
1 [4, 5, 6] 2 5
2 [7, 8, 9] 2 6
3 [1, 2, 3] 0 4
4 [4, 5, 6] 2 5
Timings:
method time
0 for loop 15.9s
1 df.apply 4.1s
2 numpy 0.7s
I'm trying to understand the axis parameter in python pandas. I understand that it's analogous to the numpy axis, but the following example still confuses me:
a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])
print(a)
0 1 2
0 0 1 4
1 1 2 3
According to this post, axis=0 runs along the rows (fixed column), while axis=1 runs along the columns (fixed row). Running print(a.drop(1, axis=1)) yields
0 2
0 0 4
1 1 3
which results in a dropped column, while print(a.drop(1, axis=0)) drops a row. Why? That seems backwards to me.
It's slightly confusing, but axis=0 operates on rows, axis=1 operates on columns.
So when you use df.drop(1, axis=1) you are saying drop column number 1.
The other post has df.mean(axis=1), which essentially says calculate the mean across the columns, i.e. one mean per row.
This is similar to indexing numpy arrays, where the first index specifies the row number (0th dimension), the second index the column number (1st dimension), and so on.
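A quick way to see this on the frame above:
import pandas as pd
a = pd.DataFrame([[0, 1, 4], [1, 2, 3]])
print(a.sum(axis=0))  # collapses the rows: one sum per column -> 1, 3, 7
print(a.sum(axis=1))  # collapses the columns: one sum per row -> 5, 6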
I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criteria.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in int_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map, apply or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
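Applied to the sample frame from the question, only the row where float_col is 10.1 (the only value greater than 5) is affected:
import numpy as np
import pandas as pd
df = pd.DataFrame({'float_col': [0.1, 0.2, 0.2, 10.1, np.nan],
                   'int_col': [1, 2, 6, 8, -1],
                   'str_col': ['a', 'b', None, 'c', 'a']})
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
print(df['int_col'].tolist())  # [1, 2, 6, 16, -1]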