I have a pandas data frame (about 500,000 rows) with a datetime index and 3 columns (a, b, c):
a b c
2016-03-30 09:59:36.619 0 55 0
2016-03-30 09:59:41.979 0 20 0
2016-03-30 09:59:41.986 0 1 0
2016-03-30 09:59:45.853 0 1 3
2016-03-30 09:59:51.265 0 20 9
2016-03-30 10:00:03.273 0 55 26
2016-03-30 10:00:05.658 0 55 28
2016-03-30 10:00:17.416 0 156 0
2016-03-30 10:00:17.928 0 122 1073
2016-03-30 10:00:21.933 0 122 0
2016-03-30 10:00:31.937 0 122 10
2016-03-30 10:00:40.941 0 122 0
2016-03-30 10:00:51.147 10 2 0
2016-03-30 10:01:27.060 0 156 0
I want to search within a 10 minute rolling window and remove duplicate items from one of the columns (column b), to get something like this:
a b c
2016-03-30 09:59:36.619 0 55 0
2016-03-30 09:59:41.979 0 20 0
2016-03-30 09:59:41.986 0 1 0
2016-03-30 09:59:51.265 0 20 9
2016-03-30 10:00:03.273 0 55 26
2016-03-30 10:00:17.416 0 156 0
2016-03-30 10:00:17.928 0 122 1073
2016-03-30 10:00:51.147 10 2 0
2016-03-30 10:01:27.060 0 156 0
Using drop_duplicates with rolling_apply comes to mind, but these two functions don't play well together, i.e.:
pd.rolling_apply(df, '10T', lambda x: x.drop_duplicates(subset='b'))
raises an error, since the applied function must return a single value, not a DataFrame.
So this is what I have so far:
import datetime as dt
import numpy

windows = []
for ind in range(len(df)):
    t0 = df.index[ind]
    t1 = df.index[ind] + dt.timedelta(minutes=10)
    # Deduplicate column b within the 10 minute window starting at t0
    windows.append(df[numpy.logical_and(t0 < df.index,
                                        df.index <= t1)].drop_duplicates(subset='b'))
Here I end up with a list of 10 min dataframes with duplicates removed, but there are a lot of overlapping values as the window rolls on to the next 10 min segment. To keep the unique values, I've tried something like:
new_df = []
for ind in range(len(windows) - 1):
    new_df.append(pd.unique(pd.concat([pd.Series(windows[ind].index),
                                       pd.Series(windows[ind + 1].index)])))
But this doesn't work, and it's already starting to get messy. Does anyone have any bright ideas how to solve this as efficiently as possible?
Thanks in advance.
I hope this is useful. I roll a function that checks if the last value is a duplicate of an earlier element over a 10 minute window. The result can be used with boolean indexing.
import pandas as pd

# Simple example
dates = pd.date_range('2017-01-01', periods=5, freq='4min')
col1 = [1, 2, 1, 3, 2]
df = pd.DataFrame({'col1': col1}, index=dates)

# Make a function that checks whether the last element is a duplicate
def last_is_duplicate(a):
    if len(a) > 1:
        return a[-1] in a[:len(a)-1]
    else:
        return False

# Roll over a 10 minute window to find duplicates of recent elements
dup = df.col1.rolling('10T').apply(last_is_duplicate).astype('bool')

# Keep only those rows for which col1 is not a recent duplicate
df[~dup]
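Applied to the original question, the same idea would look roughly like this (a sketch, assuming df here refers back to the 500,000-row frame with the datetime index and that column b drives the duplicate check; '10min' is the same 10-minute window as '10T'):
# Flag rows whose b value already appeared within the preceding 10 minutes,
# then keep only the rows that are not recent duplicates
dup_b = df['b'].rolling('10min').apply(last_is_duplicate, raw=True).astype(bool)
df_deduped = df[~dup_b]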
lvalues = [0,0,0,0,0,0,0,0,242,222,183,149,121,102,91,84,0,0,0,0,0,0,0,0,0,230,218,209,197,162,156,144,0,0,0,0,0,0,0,0]
idx = range(0,len(lvalues))
dfSample = pd.DataFrame(lvalues, index=idx)
I have a column with several subsets in between zeros. I would like to loop through it, take the highest value of each subset, and repeat that value until the zeros start again. For example, once the loop reaches 242, it should repeat 242 until the next run of zeros begins. Thanks in advance.
If you want to group by consecutive 0/non-0 and get the max, use:
g = dfSample[0].eq(0).diff().fillna(False).cumsum()
dfSample.groupby(g).transform('max')
Logic: transform the series to booleans and get the diff. There will be True on each group start (except the very first item that we fill). Get the cumsum to form groups. Use the grouper to get the max per group.
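A minimal sketch of the intermediate grouper on a shortened list (my illustration, not part of the original answer):
import pandas as pd

sample = pd.DataFrame([0, 0, 242, 222, 183, 0, 0, 230, 218, 0])
g = sample[0].eq(0).diff().fillna(False).cumsum()
# g labels each consecutive run of zeros / non-zeros: 0 0 1 1 1 2 2 3 3 4
print(sample.groupby(g).transform('max')[0].tolist())
# [0, 0, 242, 242, 242, 0, 0, 230, 230, 0]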
If you would rather replace with the first value of each group, a simple mask and fill should work:
dfSample.mask(dfSample[0].shift(fill_value=0).ne(0)).ffill(downcast='infer')
Logic: mask the values that are not preceded by 0, ffill the NaNs.
Use shift to align the zero with the start of each run of numbers, then combine that with the original column using &. Use a similar procedure to mark the end of each run. After that you can use cumsum to form groups, then groupby and return the first value of each group. Use:
g = (((dfSample[0].shift() == 0) & (dfSample[0] != 0))
     | ((dfSample[0].shift(-1) == 0) & (dfSample[0] != 0)).shift()).astype(int).cumsum()
dfSample.groupby(g).transform(lambda x: x.iloc[0])
Output:
0
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 242
9 242
10 242
11 242
12 242
13 242
14 242
15 242
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
24 0
25 230
26 230
27 230
28 230
29 230
30 230
31 230
32 0
33 0
34 0
35 0
36 0
37 0
38 0
39 0
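For reference (my note, not part of the original answer), transform('first') is a built-in equivalent of the lambda above:
dfSample.groupby(g).transform('first')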
I'm writing code to merge several dataframes together using pandas.
Here is my first table :
Index Values Intensity
1 11 98
2 12 855
3 13 500
4 24 140
and here is the second one:
Index Values Intensity
1 21 1000
2 11 2000
3 24 0.55
4 25 500
With these two dfs, I concatenate and drop_duplicates on the Values column, which gives me the following df:
Index Values Intensity_df1 Intensity_df2
1 11 0 0
2 12 0 0
3 13 0 0
4 24 0 0
5 21 0 0
6 25 0 0
I would like to recover the intensity of each value in each DataFrame. For this purpose, I'm iterating through each line of each df, which is very inefficient. Here is the code I use:
m = 0
while m < len(num_df):
    n = 0
    while n < len(df3):
        temp_intens_abs = df[m]['Intensity'][df3['Values'][n] == df[m]['Values']]
        if temp_intens_abs.empty:
            merged.at[n, "Intensity_df%s" % df[m]] = 0
        else:
            merged.at[n, "Intensity_df%s" % df[m]] = pandas.to_numeric(temp_intens_abs, errors='coerce')
        n = n + 1
    m = m + 1
The resulting df3 looks like this at the end:
Index Values Intensity_df1 Intensity_df2
1 11 98 2000
2 12 855 0
3 13 500 0
4 24 140 0.55
5 21 0 1000
6 25 0 500
My question is: is there a way to directly recover the "present" values in a df by comparing two columns using pandas? I've tried several solutions using numpy but without success. Thanks in advance for your help.
You can try joining these dataframes: df3 = df1.merge(df2, on="Values")
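Note that the default merge is inner, which would keep only the shared values (11 and 24). To reproduce the table above, one option is an outer merge with suffixes, filling the missing intensities with 0; a sketch, assuming the two frames are named df1 and df2:
import pandas as pd

df1 = pd.DataFrame({'Values': [11, 12, 13, 24], 'Intensity': [98, 855, 500, 140]})
df2 = pd.DataFrame({'Values': [21, 11, 24, 25], 'Intensity': [1000, 2000, 0.55, 500]})

merged = (df1.merge(df2, on='Values', how='outer', suffixes=('_df1', '_df2'))
             .fillna(0))
# merged now has Values, Intensity_df1, Intensity_df2, with 0 where a value
# is missing from one of the frames (row order may differ from the table above)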
I have a Pandas data frame like below.
X Y Z
0 10 101 1
0 12 120 2
0 15 112 3
0 06 115 4
0 07 125 1
0 17 131 2
0 14 121 1
0 11 127 2
0 13 107 3
0 02 180 4
0 19 114 1
I want to calculate the average of the values in column X according to the group values in Z.
That is something like
X Z
(10+7+14+19)/4 1
(12+17+11)/3 2
(15+13)/2 3
(2+6)/2 4
What is an optimal way of doing this using pandas?
This is what I have working so far, in plain Python:
from statistics import mean

sample_data = [['X', 'Y', 'Z'], [10, 101, 1], [12, 120, 2], [15, 112, 3],
               [6, 115, 4], [7, 125, 1], [17, 131, 2]]

def group_X_based_on_Z(data):
    # Pair each Z key with its X value, skipping the header row
    value_pair = [(row[2], row[0]) for row in data[1:]]
    dictionary_with_grouped_values = {}
    for z, x in value_pair:
        dictionary_with_grouped_values.setdefault(z, []).append(x)
    return dictionary_with_grouped_values

def cal_avg_values(data):
    # Average the X values collected for each Z key
    grouped_dictionary = group_X_based_on_Z(data)
    avg_value_dictionary = {}
    for z, x in grouped_dictionary.items():
        avg_value_dictionary[z] = mean(x)
    return avg_value_dictionary

print(cal_avg_values(sample_data))
I want to know whether there is a Pandas specific method for this?
Use the groupby function.
df.groupby('Z').agg(x_avg = ('X', 'mean'))
Try
s = df.groupby('Z', as_index=False).X.mean()
Z X
0 1 12.500000
1 2 13.333333
2 3 14.000000
3 4 4.000000
I have a pandas DataFrame as shown below. What I'm trying to do is partition (or groupby) by BlockID, LineID and WordID, and then within each group use current WordStartX - previous (WordStartX + WordWidth) to derive another column, e.g. WordDistance, to indicate the distance between this word and the previous word.
This post Row operations within a group of a pandas dataframe is very helpful, but in my case multiple columns are involved (WordStartX and WordWidth).
BlockID LineID WordID WordStartX WordWidth WordDistance
0 0 0 0 275 150 0
1 0 0 1 431 96 431-(275+150)=6
2 0 0 2 642 90 642-(431+96)=115
3 0 0 3 746 104 746-(642+90)=14
4 1 0 0 273 69 ...
5 1 0 1 352 151 ...
6 1 0 2 510 92
7 1 0 3 647 90
8 1 0 4 752 105
The diff() and shift() functions are usually helpful for calculations referring to previous or next rows:
df['WordDistance'] = (df.groupby(['BlockID', 'LineID'])
.apply(lambda g: g['WordStartX'].diff() - g['WordWidth'].shift()).fillna(0).values)
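For reference, a sketch of the same idea without apply (my reformulation, not the original answer): per-group diff/shift return Series aligned on the original index, so the result can be assigned directly:
grp = df.groupby(['BlockID', 'LineID'])
df['WordDistance'] = (grp['WordStartX'].diff() - grp['WordWidth'].shift()).fillna(0)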
I'm brand new to Python and am stuck on how to conditionally offset values. I've successfully been able to use the shift function when I just need to create a new column. However, this doesn't seem to work with a function.
Original df:
BEGIN SPEED SPEED_END
322 28 0
341 0 23
496 5 1
500 0 0
775 0 0
979 0 0
1015 0 0
1022 0 14
1050 11 6
I want the BEGIN value to be changed to the previous record's BEGIN value, and the SPEED value to be changed to the previous record's SPEED value, on records where SPEED=0 and the previous SPEED_END=0.
So the table above should be:
BEGIN SPEED SPEED_END
322 28 0
322 28 23
496 5 1
500 0 0
500 0 0
500 0 0
500 0 0
500 0 14
1050 11 6
I've tried a lot of different things. Currently, I've tried:
def cont(row, param):
    if row['SPEED'] == 0 and row['SPEED_END'].shift(1) == 0:
        val = row[param].shift(1)
    else:
        val = row[param]
    return val

df['BEGIN'] = df.apply(cont, param='BEGIN', axis=1)
But this gives me the error:
AttributeError: ("'float' object has no attribute 'shift'", u'occurred at index 0')
Any suggestions are appreciated!!
You can use mask and ffill:
begin_cond = (df['SPEED'] == 0) & (df['SPEED_END'].shift(1) == 0)
df['BEGIN'] = df['BEGIN'].mask(begin_cond).ffill().astype(int)
Essentially, mask will replace the values in df['BEGIN'] where begin_cond is True with NaN. Then, ffill will forward fill the NaN values with the last valid value in df['BEGIN'].
The resulting output:
BEGIN SPEED SPEED_END
0 322 28 0
1 322 0 23
2 496 5 1
3 500 0 0
4 500 0 0
5 500 0 0
6 500 0 0
7 500 0 14
8 1050 11 6
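A side note (my addition, not part of the original answer): the desired table also carries SPEED forward under the same condition, so the same mask can be reused:
# Reuse begin_cond (computed on the original values) to carry SPEED forward too
df['SPEED'] = df['SPEED'].mask(begin_cond).ffill().astype(int)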
I will propose a two-step solution that will SHOCK you.
df['begin_temp'] = df['BEGIN'].shift(1)
df['begin_shifted'] = df.loc[(df.SPEED == 0) | (df.SPEED_END == 0), 'begin_temp']
and then
df.loc[df.begin_shifted.isnull(), 'begin_shifted'] = df.loc[df.begin_shifted.isnull(), 'BEGIN']