Create a new column in a dataframe equal to the differenced series - python

I want to create a new column holding the differences between consecutive values of a series in another column.
The following is my dataframe:
df=pd.DataFrame({
'series_1' : [10.1, 15.3, 16, 12, 14.5, 11.8, 2.3, 7.7,5,10],
'series_2' : [9.6,10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17, 8.16]
})
It has the following display:
series_1 series_2
0 10.1 9.60
1 15.3 10.40
2 16.0 11.20
3 12.0 3.30
4 14.5 6.00
5 11.8 4.00
6 2.3 1.94
7 7.7 15.44
8 5.0 6.17
9 10.0 8.16
Goal
Is to get the following output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
My code
To reach the desired output I used the following code and it worked:
diff_2=[np.nan]
l=len(df)
for i in range(1, l):
    diff_2.append(df['series_2'][i] - df['series_2'][i-1])
df['diff_2'] = diff_2
Issue with my code
I have replicated a simplified dataframe here; the real one I am working on is extremely large, and my code took almost 9 minutes to run!
I want an alternative that produces the output quickly.
Any suggestions will be highly appreciated, thanks.

Here is one way to do it, using diff:
# create a new column by taking the difference between consecutive rows using diff
df['diff_2']=df['series_2'].diff()
df
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99

You might want to add the following line of code:
df["diff_2"] = df["series_2"].sub(df["series_2"].shift(1))
to achieve your goal output:
series_1 series_2 diff_2
0 10.1 9.60 NaN
1 15.3 10.40 0.80
2 16.0 11.20 0.80
3 12.0 3.30 -7.90
4 14.5 6.00 2.70
5 11.8 4.00 -2.00
6 2.3 1.94 -2.06
7 7.7 15.44 13.50
8 5.0 6.17 -9.27
9 10.0 8.16 1.99
That is a built-in pandas feature, so it should be optimized for good performance.
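As a quick sanity check, the sketch below (using only the series_2 values from the question) compares the original loop with both vectorized forms. The names loop_diff, by_diff, by_shift and all_equal are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'series_2': [9.6, 10.4, 11.2, 3.3, 6, 4, 1.94, 15.44, 6.17, 8.16]
})

# Row-by-row loop, as in the question (slow on large frames)
loop_diff = [np.nan]
for i in range(1, len(df)):
    loop_diff.append(df['series_2'].iloc[i] - df['series_2'].iloc[i - 1])

# Vectorized alternatives from the two answers
by_diff = df['series_2'].diff()
by_shift = df['series_2'].sub(df['series_2'].shift(1))

# All three should agree, treating the leading NaN as equal
all_equal = (
    np.allclose(loop_diff, by_diff, equal_nan=True)
    and np.allclose(by_diff, by_shift, equal_nan=True)
)
print(all_equal)
```

Both vectorized forms operate on the whole column at once, which is why they avoid the per-row overhead of the loop.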

Related

Avoid SettingWithCopyWarning in python using iloc

Usually, to avoid SettingWithCopyWarning, I replace values using .loc or .iloc, but this does not work when I want to forward fill my column (from the first to the last non-nan value).
Do you know why it does that and how to get around it?
My test dataframe :
df3 = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
and the code that raises the warning:
df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1] = df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1].fillna(method="ffill")
I would like something like this in the end:
Use first_valid_index and last_valid_index to determine the range that you want to forward fill, then select that range of your dataframe:
df = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
first=df['test'].first_valid_index()
last=df['test'].last_valid_index()+1
df['test']=df['test'][first:last].ffill()
print(df)
Timestamp test
0 11.1 NaN
1 11.2 NaN
2 11.3 NaN
3 11.4 2.0
4 11.5 22.0
5 11.6 8.0
6 11.7 8.0
7 11.8 4.0
8 11.9 5.0
9 12.0 4.0
10 12.1 5.0
11 12.2 5.0
12 12.3 -3.0
13 12.4 -54.0
14 12.5 -23.0
15 12.6 -23.0
16 12.7 89.0
17 12.8 NaN
18 12.9 NaN
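The warning in the question most likely comes from chained indexing: df3['test'].iloc[...] = ... first selects the column and then assigns into that selection. A minimal sketch of a single-step .loc assignment follows, assuming the default integer index and a shortened, hypothetical test column.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical version of the test column
df = pd.DataFrame({'test': [np.nan, 2, np.nan, 4, np.nan, -3, np.nan]})

first = df['test'].first_valid_index()
last = df['test'].last_valid_index()

# .loc slices by label and includes both endpoints, so no "+1" is needed.
# Rows and column are selected in one indexing step, not two.
df.loc[first:last, 'test'] = df.loc[first:last, 'test'].ffill()
print(df)
```

Only the gaps between the first and last valid values are filled; the leading and trailing NaN values are left untouched.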

Weird bug when changing boolean number to classify dataframe

I have a dataframe where I am trying to add some boolean constraints that are numbers.
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>0]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0
11 1991 3.13 -0.41 2.69 -0.41 75.1 1.0 78.4 -0.3
However, when I change the second constraint to 1 like this:
hrw_hotdry=combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t'])>1]
hrw_hotdry.head()
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
There is no output. How does this make sense?
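The parentheses were misplaced. In Python, & binds more tightly than >, so the original expression compared the result of (June_anom < 0) & June_anom_t with the number, rather than comparing June_anom_t itself. That boolean result is never greater than 1, hence the empty output. Put the comparison inside the parentheses: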
hrw_hotdry = combined_hrw[(combined_hrw['June_anom']<0) & (combined_hrw['June_anom_t']>1.0)]
Year June_val June_anom July_val July_anom June_val_t June_anom_t July_val_t July_anom_t
0 1980 2.14 -1.40 0.99 -2.11 76.7 2.6 83.7 5.0
1 1981 2.85 -0.69 4.01 0.91 75.5 1.4 79.1 0.4
8 1988 2.08 -1.46 3.22 0.12 76.2 2.1 77.5 -1.2
10 1990 1.88 -1.66 3.16 0.06 77.3 3.2 76.7 -2.0
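The precedence issue can be reproduced without pandas. The sketch below first shows it with plain Python values, then applies the corrected mask to a small, hypothetical stand-in for combined_hrw (only three of its columns, with one made-up 1992 row added).

```python
import pandas as pd

# & is evaluated before >, so the first line is (True & 3) > 1, i.e. 1 > 1
wrong = (True) & (3) > 1
right = (True) & (3 > 1)
print(wrong, right)

# Hypothetical miniature of combined_hrw
combined_hrw = pd.DataFrame({
    'Year': [1980, 1981, 1991, 1992],
    'June_anom': [-1.40, -0.69, -0.41, 0.50],
    'June_anom_t': [2.6, 1.4, 1.0, 3.0],
})

# Each comparison sits inside its own parentheses before combining with &
mask = (combined_hrw['June_anom'] < 0) & (combined_hrw['June_anom_t'] > 1)
hrw_hotdry = combined_hrw[mask]
print(hrw_hotdry)
```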

delete consecutive rows conditionally pandas

I have a df with columns (A, B, C, D, E). I want to:
1) Compare consecutive rows
2) if the absolute difference between consecutive E <=1 AND absolute difference between consecutive C>7, then delete the row with the lowest C value.
Sample Data:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 98.0 8.2 13.0 193.7 5.5
5 20.5 9.6 17.0 157.3 5.3
6 32.9 5.4 24.5 45.9 79.8
Desired result:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.9 79.8
Row 4 was deleted when compared with row 3. Row 5 is now row 4 and it was deleted when compared to row 3.
This code returns the results as boolean (not df with values) and does not satisfy all the conditions.
df = (abs(df.E.diff(-1)) <= 1) & (abs(df.C.diff(-1)) > 7.)
The result of the code:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Any help appreciated.
Using shift() to compare the rows, and a while loop to iterate until no new change happens:
while(True):
    rows = len(df)
    df = df[~((abs(df.E - df.E.shift(1)) <= 1)&(abs(df.C - df.C.shift(1)) > 7))]
    df.reset_index(inplace = True, drop = True)
    if (rows == len(df)):
        break
It produces the desired output:
A B C D E
0 94.5 4.3 26.0 79.00 NaN
1 34.0 8.8 23.0 58.00 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.90 79.8
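The loop above always removes the later row of a flagged pair, which happens to match the sample data. If the row with the lower C value should be removed regardless of its position, one possible variant is sketched below. It assumes a default integer index, and the names flagged, later and to_drop are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [94.5, 34.0, 54.2, 42.2, 98.0, 20.5, 32.9],
    'B': [4.3, 8.8, 5.4, 3.5, 8.2, 9.6, 5.4],
    'C': [26.0, 23.0, 25.5, 26.0, 13.0, 17.0, 24.5],
    'D': [79.0, 58.0, 9.91, 4.91, 193.7, 157.3, 45.9],
    'E': [np.nan, 54.5, 50.2, 5.1, 5.5, 5.3, 79.8],
})

while True:
    # True where a row and the row before it satisfy both conditions
    flagged = (df['E'].diff().abs() <= 1) & (df['C'].diff().abs() > 7)
    if not flagged.any():
        break
    later = flagged.idxmax()  # label of the first flagged row
    # Drop whichever row of the pair has the lower C value
    to_drop = later if df.at[later, 'C'] < df.at[later - 1, 'C'] else later - 1
    df = df.drop(index=to_drop).reset_index(drop=True)

print(df)
```

This handles one pair per pass, so it is slower than the masked version and is meant only to illustrate the "lowest C" rule.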

Merging pandas dataframes: empty columns created in left

I have several datasets, which I am trying to merge into one. Below, I created fictitious, simpler, smaller datasets to test the method, and it worked perfectly fine.
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30,40,50,60],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20,30,40,50,60],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
The result is, as expected:
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Happy with the result, I applied this method to my actual data, but T3 and T4 in the resulting dataframe were just empty columns (all values were NaN). I suspect the problem is with floating-point numbers: my datasets were created on different machines by different software, and although "Depth" has a precision of two decimal places in all of the files, I am afraid the values may not be identical. One file might hold 20.049999999999999 where another holds 20.05000000000001. The merge then fails to match, as shown in the following example:
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN NaN
2 30 11.5 11.8 28.8 NaN NaN
3 40 12.0 12.2 37.7 NaN NaN
4 50 12.3 12.4 46.6 NaN NaN
5 60 12.6 12.7 55.5 NaN NaN
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Do you know how to fix this?
Thanks!
Round the Depth values to the appropriate precision:
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
import numpy as np
import pandas as pd
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],
'T4':[12.0,12.2,12.4,13.2,14.1]})
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
yields
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Per the comments, rounding does not appear to work for the OP on the actual data. To debug the problem, find some rows which should merge:
subframes = []
for frame in [examplelog, log2]:
    mask = (frame['Depth'] < 20.051) & (frame['Depth'] >= 20.0)
    subframes.append(frame.loc[mask])
Then post the output of
for frame in subframes:
    print(frame.to_dict('list'))
    frame.info() # prints the dtypes of the columns
This might give us the info we need to reproduce the problem.
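If rounding does not line the keys up, another option, not part of the answer above, is pandas.merge_asof, which matches each left row to the nearest right key within a tolerance. This is a minimal sketch with shortened, hypothetical data; the tolerance of 0.06 is an assumption chosen for this example and would need adjusting for real logs.

```python
import pandas as pd

# merge_asof needs both frames sorted by the key and matching key dtypes,
# so Depth is written as float in both frames here.
examplelog = pd.DataFrame({'Depth': [10.0, 20.0, 30.0, 40.0],
                           'T1': [11.0, 11.3, 11.5, 12.0]})
log1 = pd.DataFrame({'Depth': [30.05, 40.05],
                     'T3': [12.1, 12.6]})

# Match each examplelog row to the closest log1 depth, but only if the
# two depths differ by no more than the tolerance.
result = pd.merge_asof(examplelog, log1, on='Depth',
                       direction='nearest', tolerance=0.06)
print(result)
```

The Depth column in the result keeps the values from the left frame, and rows with no right key inside the tolerance get NaN.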

Updating Pandas DataFrame column conditionally using other columns

With a DataFrame like the one below, how do I set c1len equal to zero when c1pos equals zero? I would then like to do the same for c2len/c2pos. Is there an easy way to do it without creating a bunch of columns to arrive at the desired answer?
distance c1pos c1len c2pos c2len daysago
line_date
2013-06-22 7.00 9 0.0 9 6.4 27
2013-05-18 8.50 6 4.6 7 4.9 62
2012-12-31 8.32 5 4.6 5 2.1 200
2012-12-01 8.00 7 7.1 6 8.6 230
2012-11-03 7.00 7 0.0 7 2.7 258
2012-10-15 7.00 7 0.0 8 5.2 277
2012-09-22 8.32 10 10.1 8 4.1 300
2012-09-15 9.00 10 12.5 9 12.1 307
2012-08-18 7.00 8 0.0 8 9.2 335
2012-08-02 9.00 5 3.5 5 2.2 351
2012-07-14 12.00 3 4.5 3 3.5 370
2012-06-16 8.32 7 3.7 7 5.1 398
I don't think you have any rows that actually satisfy those conditions, but this will work.
It creates a boolean mask for the rows where the column in question (e.g. c2pos) is 0, then sets the column c2len to 0 for the rows where the mask is True:
In [15]: df.loc[df.c2pos==0,'c2len'] = 0
In [16]: df.loc[df.c1pos==0,'c1len'] = 0
In [17]: df
Out[17]:
distance c1pos c1len c2pos c2len daysago
2013-06-22 7.00 9 0.0 9 6.4 27
2013-05-18 8.50 6 4.6 7 4.9 62
2012-12-31 8.32 5 4.6 5 2.1 200
2012-12-01 8.00 7 7.1 6 8.6 230
2012-11-03 7.00 7 0.0 7 2.7 258
2012-10-15 7.00 7 0.0 8 5.2 277
2012-09-22 8.32 10 10.1 8 4.1 300
2012-09-15 9.00 10 12.5 9 12.1 307
2012-08-18 7.00 8 0.0 8 9.2 335
2012-08-02 9.00 5 3.5 5 2.2 351
2012-07-14 12.00 3 4.5 3 3.5 370
2012-06-16 8.32 7 3.7 7 5.1 398
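Since the sample data has no zero positions, the effect is easier to see on made-up values. The sketch below uses a small hypothetical frame and applies the same .loc assignment to both column pairs in a loop.

```python
import pandas as pd

# Hypothetical data that does contain zero positions
df = pd.DataFrame({
    'c1pos': [9, 0, 5],
    'c1len': [0.0, 4.6, 4.6],
    'c2pos': [0, 7, 5],
    'c2len': [6.4, 4.9, 2.1],
})

# For each (position, length) pair, zero the length where the position is 0
for pos, length in [('c1pos', 'c1len'), ('c2pos', 'c2len')]:
    df.loc[df[pos] == 0, length] = 0

print(df)
```

No helper columns are created; each assignment modifies the existing length column in place.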
