Function on dataframe rows to reduce duplicate pairs in Python

I've got a dataframe that looks like:
0 1 2 3 4 5 6 7 8 9 10 11
12 13 13 13.4 13.4 12.4 12.4 16 0 0 0 0
14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
.
.
.
Each 'layer'/row has pairs that are duplicates that I want to reduce.
The one problem is that there are repeating 0s as well, so I cannot simply remove duplicates per row, or the rows will end up with different lengths.
My desired output would be a lambda function that I could apply to all rows of this dataframe to get this:
0 1 2 3 4 5 6
12 13 13.4 12.4 16 0 0
14 12.2 13.4 12.6 19 5 6.7
.
.
.
Is there a simple function I could write to do this?

Method 1 using transpose
As mentioned by Yuca in the comments:
df = df.T.drop_duplicates().T
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7
Method 2 using list comprehension with even numbers
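We can build a list of the even column positions and select those columns by index. Note that this assumes every value is paired starting from column 0, so with the sample data it drops the unpaired column 7 (the values 16 and 19):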
idxcols = [x-1 for x in range(len(df.columns)) if x % 2]
df = df.iloc[:, idxcols]
df.columns = range(len(df.columns))
print(df)
0 1 2 3 4 5
0 12 13.0 13.4 12.4 0 0.0
1 14 12.2 13.4 12.6 5 6.7

In your case
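import pandas as pd
# drop repeated values within each row, keeping first-occurrence order
l=[sorted(set(x), key=x.index) for x in df.values.tolist()]
# shorter rows are padded with NaN; fill those from the value to their left
newdf=pd.DataFrame(l).ffill(axis=1)
newdf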
Out[177]:
0 1 2 3 4 5 6
0 12.0 13.0 13.4 12.4 16.0 0.0 0.0
1 14.0 12.2 13.4 12.6 19.0 5.0 6.7

You can use functools.reduce to sequentially concatenate columns to your output DataFrame if the next column is not equal to the last column added:
from functools import reduce
import pandas as pd
output_df = reduce(
    # append column c only if it differs from the last column kept so far
    lambda d, c: d if (d.iloc[:,-1] == df[c]).all() else pd.concat([d, df[c]], axis=1),
    df.columns[1:],
    df[df.columns[0]].to_frame()
)
print(output_df)
# 0 1 3 5 7 8 10
#0 12 13.0 13.4 12.4 16 0 0.0
#1 14 12.2 13.4 12.6 19 5 6.7
This method also maintains the column names of the columns which were picked, if that's important.
Assuming this is your input df:
print(df)
# 0 1 2 3 4 5 6 7 8 9 10 11
#0 12 13.0 13.0 13.4 13.4 12.4 12.4 16 0 0 0.0 0.0
#1 14 12.2 12.2 13.4 13.4 12.6 12.6 19 5 5 6.7 6.7
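The same "drop a column if it equals the column right before it" idea can also be sketched without reduce, by comparing neighbouring columns of the underlying array in one step. This is only an illustration on the sample data from the question; the names arr, changed, keep and deduped are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [12, 13, 13, 13.4, 13.4, 12.4, 12.4, 16, 0, 0, 0, 0],
    [14, 12.2, 12.2, 13.4, 13.4, 12.6, 12.6, 19, 5, 5, 6.7, 6.7],
])

arr = df.to_numpy()
# True where a column differs from its left neighbour in at least one row
changed = (arr[:, 1:] != arr[:, :-1]).any(axis=0)
# the first column has no left neighbour, so it is always kept
keep = np.concatenate(([True], changed))

kept_labels = df.columns[keep].tolist()
deduped = df.loc[:, keep].copy()
deduped.columns = range(deduped.shape[1])
print(deduped)
```

Because the comparison looks at whole columns, the repeated 0s in the first row are only dropped where the second row repeats as well.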

Related

Lookup group quartile value based on individual value in Python DataFrame

I have a DataFrame df1 containing daily time-series of IDs and Scores in different countries C. For the countries, I have an additional DataFrame df2 which contains for each country 4 quartiles Q with quantile scores Q_Scores.
df1:
Date ID C Score
20220102 A US 12.6
20220103 A US 11.3
20220104 A US 13.2
20220105 A US 14.5
20220102 B US 9.8
20220103 B US 19.8
20220104 B US 12.3
20220105 B US 15.1
20220102 C GB 13.5
20220103 C GB 14.5
20220104 C GB 11.5
20220105 C GB 14.8
df2:
Date C Q Q_Score
20220102 US 1 10
20220103 US 2 13
20220104 US 3 16
20220105 US 4 20
20220102 GB 1 12
20220103 GB 2 13
20220104 GB 3 14
20220105 GB 4 15
I am trying to look up the quartile scores Q_Score and create df3 with an additional column called Q_Score. Each score should be matched to the next-larger quartile score from df2 for its country. For example:
20220104 / A / US: Score = 13.2 --> next bigger quartile score on that date in the US is 16 --> Q-Score: 16
df3:
Date ID C Score Q_Score
20220102 A US 12.6 13
20220103 A US 11.3 13
20220104 A US 13.2 16
20220105 A US 14.5 16
20220102 B US 9.8 10
20220103 B US 19.8 20
20220104 B US 12.3 13
20220105 B US 15.1 16
20220102 C GB 13.5 14
20220103 C GB 14.5 15
20220104 C GB 11.5 12
20220105 C GB 14.8 15
Because the Score and Q_Score don't match, I wasn't able to do it with a simple pd.merge().
You can use pd.merge_asof, but you need some processing:
# both merge keys must have the same dtype
df2['Q_Score'] = df2['Q_Score'].astype('float64')
# keys must be sorted
pd.merge_asof(df1.sort_values('Score'),
df2.drop(['Date','Q'], axis=1).sort_values('Q_Score'),
by=['C'],
left_on='Score',
right_on='Q_Score',
direction='forward'
).sort_values(['ID','Date'])
Output:
Date ID C Score Q_Score
4 20220102 A US 12.6 13.0
1 20220103 A US 11.3 13.0
5 20220104 A US 13.2 16.0
7 20220105 A US 14.5 16.0
0 20220102 B US 9.8 10.0
11 20220103 B US 19.8 20.0
3 20220104 B US 12.3 13.0
10 20220105 B US 15.1 16.0
6 20220102 C GB 13.5 14.0
8 20220103 C GB 14.5 15.0
2 20220104 C GB 11.5 12.0
9 20220105 C GB 14.8 15.0
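To see what direction='forward' does in isolation, here is a smaller self-contained sketch. The frames scores and quartiles are cut-down stand-ins for df1 and df2 (an assumption made for brevity), with the quartile scores already stored as floats.

```python
import pandas as pd

scores = pd.DataFrame({"C": ["US", "US", "GB"], "Score": [12.6, 9.8, 13.5]})
quartiles = pd.DataFrame({
    "C": ["US"] * 4 + ["GB"] * 4,
    "Q_Score": [10.0, 13.0, 16.0, 20.0, 12.0, 13.0, 14.0, 15.0],
})

# both frames must be sorted by their merge key;
# 'forward' picks the first Q_Score >= Score within the same country
looked_up = pd.merge_asof(
    scores.sort_values("Score"),
    quartiles.sort_values("Q_Score"),
    by="C",
    left_on="Score",
    right_on="Q_Score",
    direction="forward",
)
print(looked_up)
```

A score larger than every quartile score of its country would get NaN, since there is nothing "forward" to match.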

Avoid SettingWithCopyWarning in python using iloc

Usually, to avoid SettingWithCopyWarning, I replace values using .loc or .iloc, but this does not work when I want to forward fill my column (from the first to the last non-nan value).
Do you know why it does that and how to bypass it ?
My test dataframe :
df3 = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
and the code that raises the warning:
df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1] = df3['test'].iloc[df3['test'].first_valid_index():df3['test'].last_valid_index()+1].fillna(method="ffill")
I would like something like that in the end :
Use first_valid_index and last_valid_index to determine the range that you want to ffill, then select that range of your dataframe:
df = pd.DataFrame({'Timestamp':[11.1,11.2,11.3,11.4,11.5,11.6,11.7,11.8,11.9,12.0,12.10,12.2,12.3,12.4,12.5,12.6,12.7,12.8,12.9],
'test':[np.nan,np.nan,np.nan,2,22,8,np.nan,4,5,4,5,np.nan,-3,-54,-23,np.nan,89,np.nan,np.nan]})
first=df['test'].first_valid_index()
last=df['test'].last_valid_index()+1
df['test']=df['test'][first:last].ffill()
print(df)
Timestamp test
0 11.1 NaN
1 11.2 NaN
2 11.3 NaN
3 11.4 2.0
4 11.5 22.0
5 11.6 8.0
6 11.7 8.0
7 11.8 4.0
8 11.9 5.0
9 12.0 4.0
10 12.1 5.0
11 12.2 5.0
12 12.3 -3.0
13 12.4 -54.0
14 12.5 -23.0
15 12.6 -23.0
16 12.7 89.0
17 12.8 NaN
18 12.9 NaN
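The warning in the question comes from the chained form df3['test'].iloc[...] = ..., which assigns into an intermediate object. A single .loc call on the dataframe itself avoids that. A minimal sketch on a shortened column, assuming the default integer index; note that .loc slices by label and includes both endpoints, so no +1 is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"test": [np.nan, 2, np.nan, 4, np.nan, np.nan]})

first = df["test"].first_valid_index()
last = df["test"].last_valid_index()

# one indexing operation on df, so there is no chained assignment
df.loc[first:last, "test"] = df.loc[first:last, "test"].ffill()
print(df)
```

Only the gap between the first and last valid values is filled; the leading and trailing NaN stay as they are.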

delete consecutive rows conditionally pandas

I have a df with columns (A, B, C, D, E). I want to:
1) Compare consecutive rows
2) if the absolute difference between consecutive E <=1 AND absolute difference between consecutive C>7, then delete the row with the lowest C value.
Sample Data:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 98.0 8.2 13.0 193.7 5.5
5 20.5 9.6 17.0 157.3 5.3
6 32.9 5.4 24.5 45.9 79.8
Desired result:
A B C D E
0 94.5 4.3 26.0 79.0 NaN
1 34.0 8.8 23.0 58.0 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.9 79.8
Row 4 was deleted when compared with row 3. Row 5 is now row 4 and it was deleted when compared to row 3.
This code returns the results as boolean (not df with values) and does not satisfy all the conditions.
df = (abs(df.E.diff(-1)) <= 1) & (abs(df.C.diff(-1)) > 7.)
The result of the code:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
dtype: bool
Any help appreciated.
Using shift() to compare the rows, and a while loop to iterate until no new change happens:
while True:
    rows = len(df)
    df = df[~((abs(df.E - df.E.shift(1)) <= 1)&(abs(df.C - df.C.shift(1)) > 7))]
    df.reset_index(inplace = True, drop = True)
    if (rows == len(df)):
        break
It produces the desired output:
A B C D E
0 94.5 4.3 26.0 79.00 NaN
1 34.0 8.8 23.0 58.00 54.5
2 54.2 5.4 25.5 9.91 50.2
3 42.2 3.5 26.0 4.91 5.1
4 32.9 5.4 24.5 45.90 79.8
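A single-pass variant is also possible. The sketch below is not the loop above: it compares each row with the last row that was kept and, like the loop, always drops the later row of a matching pair. The function name drop_close_rows and its parameters e_tol and c_gap are made up for this example.

```python
import numpy as np
import pandas as pd

def drop_close_rows(df, e_tol=1.0, c_gap=7.0):
    """Drop a row when it is close in E and far in C from the last kept row."""
    if df.empty:
        return df.copy()
    kept = [0]  # positions of the rows kept so far
    for i in range(1, len(df)):
        prev, cur = df.iloc[kept[-1]], df.iloc[i]
        close_e = abs(cur["E"] - prev["E"]) <= e_tol  # False when either E is NaN
        far_c = abs(cur["C"] - prev["C"]) > c_gap
        if not (close_e and far_c):
            kept.append(i)
    return df.iloc[kept].reset_index(drop=True)

df = pd.DataFrame({
    "A": [94.5, 34.0, 54.2, 42.2, 98.0, 20.5, 32.9],
    "B": [4.3, 8.8, 5.4, 3.5, 8.2, 9.6, 5.4],
    "C": [26.0, 23.0, 25.5, 26.0, 13.0, 17.0, 24.5],
    "D": [79.0, 58.0, 9.91, 4.91, 193.7, 157.3, 45.9],
    "E": [np.nan, 54.5, 50.2, 5.1, 5.5, 5.3, 79.8],
})
result = drop_close_rows(df)
print(result)
```

On the sample data this removes original rows 4 and 5, the same rows the while loop removes.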

Merging pandas dataframes: empty columns created in left

I have several datasets, which I am trying to merge into one. Below, I created fictive simpler smaller datasets to test the method and it worked perfectly fine.
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30,40,50,60],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20,30,40,50,60],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
The result is, as expected:
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Happy with the result, I applied this method to my actual data, but for T3 and T4 in the resulting dataframes, I received just empty columns (all values were NaN). I suspect that the problem is with floating numbers, because my datasets were created on different machines by different software and although the "Depth" has the precision of two decimal numbers in all of the files, I am afraid that it may not be 20.05 in both of them, but one might be 20.049999999999999 while in the other it might be 20.05000000000001. Then, the merge function will not work, as shown in the following example:
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],'T4':[12.0,12.2,12.4,13.2,14.1]})
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN NaN
2 30 11.5 11.8 28.8 NaN NaN
3 40 12.0 12.2 37.7 NaN NaN
4 50 12.3 12.4 46.6 NaN NaN
5 60 12.6 12.7 55.5 NaN NaN
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Do you know how to fix this?
Thanks!
Round the Depth values to the appropriate precision:
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
import numpy as np
import pandas as pd
examplelog = pd.DataFrame({'Depth':[10,20,30,40,50,60,70,80],
'TVD':[10,19.9,28.8,37.7,46.6,55.5,64.4,73.3],
'T1':[11,11.3,11.5,12.,12.3,12.6,13.,13.8],
'T2':[11.3,11.5,11.8,12.2,12.4,12.7,13.1,14.1]})
log1 = pd.DataFrame({'Depth':[30.05,40.05,50.05,60.05],'T3':[12.1,12.6,13.7,14.]})
log2 = pd.DataFrame({'Depth':[20.01,30.01,40.01,50.01,60.01],
'T4':[12.0,12.2,12.4,13.2,14.1]})
for df in [examplelog, log1, log2]:
    df['Depth'] = df['Depth'].round(1)
logs=[log1,log2]
result=examplelog.copy()
for i in logs:
    result=result.merge(i,how='left', on='Depth')
print(result)
yields
Depth T1 T2 TVD T3 T4
0 10 11.0 11.3 10.0 NaN NaN
1 20 11.3 11.5 19.9 NaN 12.0
2 30 11.5 11.8 28.8 12.1 12.2
3 40 12.0 12.2 37.7 12.6 12.4
4 50 12.3 12.4 46.6 13.7 13.2
5 60 12.6 12.7 55.5 14.0 14.1
6 70 13.0 13.1 64.4 NaN NaN
7 80 13.8 14.1 73.3 NaN NaN
Per the comments, rounding does not appear to work for the OP on the actual
data. To debug the problem, find some rows which should merge:
subframes = []
for frame in [examplelog, log2]:
    mask = (frame['Depth'] < 20.051) & (frame['Depth'] >= 20.0)
    subframes.append(frame.loc[mask])
Then post
for frame in subframes:
    print(frame.to_dict('list'))
    print(frame.info()) # shows the dtypes of the columns
This might give us the info we need to reproduce the problem.
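If rounding cannot bring the keys into exact agreement, a tolerance-based merge may be an alternative. The sketch below uses pd.merge_asof with direction='nearest' on a cut-down version of the example frames; the tolerance of 0.1 is an assumption and would have to be chosen to suit the real depth spacing.

```python
import pandas as pd

examplelog = pd.DataFrame({"Depth": [10, 20, 30, 40],
                           "T1": [11.0, 11.3, 11.5, 12.0]})
log1 = pd.DataFrame({"Depth": [30.05, 40.05], "T3": [12.1, 12.6]})

# merge_asof needs both keys sorted and of the same dtype
left = examplelog.astype({"Depth": "float64"}).sort_values("Depth")
right = log1.sort_values("Depth")

# match each left Depth to the nearest right Depth, but only within 0.1
result = pd.merge_asof(left, right, on="Depth",
                       direction="nearest", tolerance=0.1)
print(result)
```

The Depth column of the result comes from the left frame, and rows with no partner inside the tolerance get NaN, as in a left merge.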

Slice pandas DataFrame where column's value exists in another array

I have a pandas.DataFrame with a large amount of data. One column contains randomly repeating keys. In a separate array I have a list of the keys whose rows I would like to slice from the DataFrame, along with the data from the other columns in each row.
keys:
keys = numpy.array([1,5,7])
data:
indx a b c d
0 5 25.0 42.1 13
1 2 31.7 13.2 1
2 9 16.5 0.2 9
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
I would like to slice out all rows of the DataFrame where the value in column a matches a value from keys.
Desired result:
indx a b c d
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
You can use isin:
>>> df[df.a.isin(keys)]
a b c d
indx
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
[5 rows x 4 columns]
or query:
>>> df.query("a in @keys")
a b c d
indx
0 5 25.0 42.1 13
3 7 43.1 11.0 10
4 1 11.2 31.6 10
5 5 15.6 2.8 11
6 7 14.2 19.0 4
[5 rows x 4 columns]
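Both variants can be checked against each other with a self-contained version of the sample data. The frame is rebuilt here by hand (with the default index standing in for indx), and the names by_isin and by_query are only for the example.

```python
import numpy as np
import pandas as pd

keys = np.array([1, 5, 7])
df = pd.DataFrame({
    "a": [5, 2, 9, 7, 1, 5, 7],
    "b": [25.0, 31.7, 16.5, 43.1, 11.2, 15.6, 14.2],
    "c": [42.1, 13.2, 0.2, 11.0, 31.6, 2.8, 19.0],
    "d": [13, 1, 9, 10, 10, 11, 4],
})

# boolean mask: True where column a holds one of the keys
by_isin = df[df["a"].isin(keys)]

# the @ prefix makes query read the local Python variable `keys`
by_query = df.query("a in @keys")
print(by_isin)
```

isin builds an explicit mask that can be combined with other conditions, while query keeps the whole filter in one string.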
