Removing duplicates from pandas data frame with condition based on another column

Removing duplicates from pandas data frame with condition based on another column - python

Assuming I have the following DataFrame:
Row | Temperature | Measurement
A1 | 26.7 | 12
A1 | 25.7 | 13
A2 | 27.3 | 11
A2 | 28.3 | 12
A3 | 25.6 | 17
A3 | 23.4 | 14
----------------------------
P3 | 25.7 |14
I want to remove the duplicate rows with respect to column 'Row', and I want to retain only the rows with value closest to 25 in column Temperature.
For example:
Row | Temperature | Measurement
A1 | 25.7 | 13
A2 | 27.3 | 11
A3 | 25.6 | 17
----------------------------
P3 | 25.7 |14
I am trying to use this function to find the nearest within an array:
array = np.asarray(array)
idx = (np.abs(array - value)).argmin()
return array[idx]
array = df['Temperature']
value = 25
But I am not sure how to go about pandas.drop_duplicates in the df. Thank you!
python pandas dataframe

One way to do is create a temporary column and sort on that, then drop duplicates:
df['key'] = df['Temperature'].sub(25).abs()
# sort by key, drop duplicates, and resort
df.sort_values('key').drop_duplicates('Row').sort_index()
Output:
Row Temperature Measurement key
1 A1 25.7 13 0.7
2 A2 27.3 11 2.3
4 A3 25.6 17 0.6
6 P3 25.7 14 0.7
Another option, similar to what you are trying to do, is to use np.argsort on the key, and sort by iloc. This avoids creation of a new column in the data:
orders = np.argsort(df['Temperature'].sub(25).abs())
df.iloc[orders].drop_duplicates('Row').sort_index()
Output:
Row Temperature Measurement
1 A1 25.7 13
2 A2 27.3 11
4 A3 25.6 17
6 P3 25.7 14

Related

Downsample to quarter level and get quarter end date value in Pandas

my data frame has daily value from 2005-01-01 to 2021-10-31.
| C1 | C2
-----------------------------
2005-01-01 | 2.7859 | -7.790
2005-01-02 |-0.7756 | -0.97
2005-01-03 |-6.892 | 2.770
2005-01-04 | 2.785 | -0.97
. . .
. . .
2021-10-28 | 6.892 | 2.785
2021-10-29 | 2.785 | -6.892
2021-10-30 |-6.892 | -0.97
2021-10-31 |-0.7756 | 2.34
I want to downsample this data frame to get quarter value as follows.
| C1 | C2
------------------------------
2005-03-01 | 2.7859 | -7.790
2005-06-30 |-0.7756 | -0.97
2005-09-30 |-6.892 | 2.770
2005-12-31 | 2.785 | -0.97
I tried to do it with Pandas resample method but it requires an aggregation method.
df = df.resample('Q').mean()
I don't want the aggregated value I want the current value at the quarter-end date as it is.

Your code works except you are not using the right function. Replace mean by last:
dti = pd.date_range('2005-01-01', '2021-10-31', freq='D')
df = pd.DataFrame(np.random.random((len(dti), 2)), columns=['C1', 'C2'], index=dti)
dfQ = df.resample('Q').last()
print(dfQ)
# Output:
C1 C2
2005-03-31 0.653733 0.334182
2005-06-30 0.425229 0.316189
2005-09-30 0.055675 0.746406
2005-12-31 0.394051 0.541684
2006-03-31 0.525208 0.413624
... ... ...
2020-12-31 0.662081 0.887147
2021-03-31 0.824541 0.363729
2021-06-30 0.064824 0.621555
2021-09-30 0.126891 0.549009
2021-12-31 0.126217 0.044822
[68 rows x 2 columns]

You can do this,
df = df[df.index.is_quarter_end]
You will filter out the dates only at the end of each quarter.

Pandas combining sparse columns in dataframe

I am using Python, Pandas for data analysis. I have sparsely distributed data in different columns like following
| id | col1a | col1b | col2a | col2b | col3a | col3b |
|----|-------|-------|-------|-------|-------|-------|
| 1 | 11 | 12 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | 21 | 86 | NaN | NaN |
| 3 | 22 | 87 | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | 545 | 32 |
I want to combine this sparsely distributed data in different columns to tightly packed column like following.
| id | group | cola | colb |
|----|-------|-------|-------|
| 1 | g1 | 11 | 12 |
| 2 | g2 | 21 | 86 |
| 3 | g1 | 22 | 87 |
| 4 | g3 | 545 | 32 |
What I have tried is doing following, but not able to do it properly
df['cola']=np.nan
df['colb']=np.nan
df['cola'].fillna(df.col1a,inplace=True)
df['colb'].fillna(df.col1b,inplace=True)
df['cola'].fillna(df.col2a,inplace=True)
df['colb'].fillna(df.col2b,inplace=True)
df['cola'].fillna(df.col3a,inplace=True)
df['colb'].fillna(df.col3b,inplace=True)
But I think there must be more concise and efficient way way of doing this. How to do this in better way?

You can use df.stack() assuming 'id' is your index else set 'id' as index. Then use pd.pivot_table.
df = df.stack().reset_index(name='val',level=1)
df['group'] = 'g'+ df['level_1'].str.extract('col(\d+)')
df['level_1'] = df['level_1'].str.replace('col(\d+)','')
df.pivot_table(index=['id','group'],columns='level_1',values='val')
level_1 cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0

Another alternative with pd.wide_to_long
m = pd.wide_to_long(df,['col'],'id','j',suffix='\d+\w+').reset_index()
(m.join(pd.DataFrame(m.pop('j').agg(list).tolist()))
.assign(group=lambda x:x[0].radd('g'))
.set_index(['id','group',1])['col'].unstack().dropna()
.rename_axis(None,axis=1).add_prefix('col').reset_index())
id group cola colb
0 1 g1 11 12
1 2 g2 21 86
2 3 g1 22 87
3 4 g3 545 32

Use:
import re
def fx(s):
s = s.dropna()
group = 'g' + re.search(r'\d+', s.index[0])[0]
return pd.Series([group] + s.tolist(), index=['group', 'cola', 'colb'])
df1 = df.set_index('id').agg(fx, axis=1).reset_index()
# print(df1)
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g1 22.0 87.0
3 4 g3 545.0 32.0

This would a way of doing it:
df = pd.DataFrame({'id':[1,2,3,4],
'col1a':[11,np.nan,22,np.nan],
'col1b':[12,np.nan,87,np.nan],
'col2a':[np.nan,21,np.nan,np.nan],
'col2b':[np.nan,86,np.nan,np.nan],
'col3a':[np.nan,np.nan,np.nan,545],
'col3b':[np.nan,np.nan,np.nan,32]})
df_new = df.copy(deep=False)
df_new['group'] = 'g'+df_new['id'].astype(str)
df_new['cola'] = df_new[[x for x in df_new.columns if x.endswith('a')]].sum(axis=1)
df_new['colb'] = df_new[[x for x in df_new.columns if x.endswith('b')]].sum(axis=1)
df_new = df_new[['id','group','cola','colb']]
print(df_new)
Output:
id group cola colb
0 1 g1 11.0 12.0
1 2 g2 21.0 86.0
2 3 g3 22.0 87.0
3 4 g4 545.0 32.0
So if you have more suffixes (colc, cold, cole, colf, etc...) you can create a loop and then use:
suffixes = ['a','b','c','d','e','f']
cols = ['id','group'] + ['col'+x for x in suffixes]
for i in suffixes:
df_new['col'+i] = df_new[[x for x in df_new.columns if x.endswith(i)]].sum(axis=1)
df_new = df_new[cols]

Thanks to #CeliusStingher for providing the code for the dataframe :
One suggestion is to set the id as index, rearrange the columns, with the numbers extracted from the text. Create a multiIndex, and stack to get the final result :
#set id as index
df = df.set_index("id")
#pull out the numbers from each column
#so that you have (cola,1), (colb,1) ...
#add g to the numbers ... (cola, g1),(colb,g1), ...
#create a MultiIndex
#and reassign to the columns
df.columns = pd.MultiIndex.from_tuples([("".join((first,last)), f"g{second}")
for first, second, last
in df.columns.str.split("(\d)")],
names=[None,"group"])
#stack the data
#to get your result
df.stack()
cola colb
id group
1 g1 11.0 12.0
2 g2 21.0 86.0
3 g1 22.0 87.0
4 g3 545.0 32.0

Efficiency: Check if value in Pandas DataFrame has changed with an specific threshold outside an interval of measurement

I have a column which a sensor recorded.
This Data has some noise on it, so the values are not exactly the same for each point of time while nothing was detected.
I want to split that recorded DataFrame into new DataFrames only containing the "interesting" Data (with values bigger than a certain threshold of column 'B', in this example bigger than 5). In this example 'A' represents a timestamp, and 'B' represents the sensor data, with noise.The desired outcome of this example would be two DataFrames. One with the rows from 5 to 6, the other one with the rows from 10 to 15. A normal loop over the DataFrame is very time consuming, as the DataFrame has ~24mio rows. Is there a efficient way to deal with such an issue in pandas or similar?
Example:
# | A | B
--+-----+-----
1 | 1 | 0.10
2 | 2 | 0.11
3 | 3 | 0.09
4 | 4 | 0.12
5 | 5 | 5.24
6 | 6 | 6.33
7 | 7 | 0.08
8 | 8 | 0.09
9 | 9 | 0.10
10| 10 | 7.54
11| 11 | 8.33
12| 12 | 9.03
13| 13 | 1.43
14| 14 | 9.64
15| 15 | 9.03
16| 16 | 0.43
17| 17 | 0.53
18| 18 | 0.62
19| 19 | 0.73
20| 20 | 0.51
It can occur, that in between the "interesting interval" a value below the threshold occurs. A indicator of an ended interval would be that 1000 values in a row are below the threshold.
Thank you!

Here's a solution which is generalisable and tries to catch edge cases:
# all rows where B > 5
mask1 = df['B'].gt(5)
# all rows where Bt-1 > 5 & Bt+1 > 5
mask2 = df['B'].shift().gt(5) & df['B'].shift(-1).gt(5)
# all rows where mask1 OR mask2 is True
mask3 = (mask1 | mask2)
# turn rows where mask 3 is False to NaN
mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)
# put each group of turned on sensor into a different dataframe
dfs = [dfg.reset_index(drop=True) for _, dfg in df.groupby(mask4)]
Output
for d in dfs:
print(d, '\n')
A B
4 5 5.24
5 6 6.33
A B
9 10 7.54
10 11 8.33
11 12 9.03
12 13 1.43
13 14 9.64
14 15 9.03
Or in a function:
def split_turn_on_off(dataframe):
mask1 = dataframe['B'].gt(5)
mask2 = dataframe['B'].shift().gt(5) & dataframe['B'].shift(-1).gt(5)
mask3 = (mask1 | mask2)
mask4 = mask3.astype(int).diff().eq(1).cumsum().where(mask3)
# put each group of turned on sensor into a different dataframe
dataframes = [dataframeg.reset_index(drop=True) for _, dataframeg in dataframe.groupby(mask4)]
return dataframes

Pandas/Python: Groupby and transform against a reference table

I have a Target Table with two types of categories: stationID and Month. I need to standardise the Temperature values of that table against the values of another Reference Table (by matching the stationID). What would be the best way to do that with pandas?
For example:
Reference Table: it contains mean and standard deviation reference values for unique stations
stationID | Temp_mean | Temp_std |...
----------+-------------+----------+
A | 30.0 | 3.4 |
B | 31.1 | 4.5 |
C | 24.5 | 0.2 |
...
Target Table: it contains the raw data for each station and month
stationID | Mon | Temperature |...
----------+------+-------------+
A | 1 | 30.1 |
A | 2 | 31.2 |
A | 3 | 24.0 |
B | 1 | 30.3 |
C | 2 | 20.4 |
C | 1 | 24.3 |
C | 2 | 25.4 |
...
So, from the temperature values in the Target table, I need to subtract the mean and divide by the standard deviation of the reference table.
What I have so far is the code below
df['Temperature_Stdized']=df(['stationID','Mon'])['Temperature'].transform(lambda x: (x - x.mean()) / x.std())
But, instead of using the mean and std from "x", I would like to use the values from the Reference Table, by matching the stationID values.
Any help is appreciated. Thanks.

Considering your Reference Table to be ref and Target Table to be tar, you could do:
tar['Temprature'] = (ref.merge(tar, on = 'stationID')
.eval('(Temperature - Temp_mean) / Temp_std'))
stationID Mon Temperature
0 A 1 0.029412
1 A 2 0.352941
2 A 3 -1.764706
3 B 1 -0.177778
4 C 2 -20.500000
5 C 1 -1.000000
6 C 2 4.500000
Details
The first step is a merge of both dataframes on stationID:
x = ref.merge(tar, on = 'stationID')
print(x)
stationID Temp_mean Temp_std Mon Temperature
0 A 30.0 3.4 1 30.1
1 A 30.0 3.4 2 31.2
2 A 30.0 3.4 3 24.0
3 B 31.1 4.5 1 30.3
4 C 24.5 0.2 2 20.4
5 C 24.5 0.2 1 24.3
6 C 24.5 0.2 2 25.4
and then eval with the following expression to normalise each row:
x.eval('(Temperature - Temp_mean) / Temp_std')
0 0.029412
1 0.352941
2 -1.764706
3 -0.177778
4 -20.500000
5 -1.000000
6 4.500000
dtype: float64

Pandas calculate and apply weighted rolling average on another column

I am having a hard time figuring out how to get "rolling weights" based off of one of my columns, then factor these weights onto another column.
I've tried groupby.rolling.apply (function) on my data but the main problem is just conceptualizing how I'm going to take a running/rolling average of the column I'm going to turn into weights, and then factor this "window" of weights onto another column that isn't rolled.
I'm also purposely setting min_period to 1, so you'll notice my first two rows in each group final output "rwag" mirror the original.
W is the rolling column to derive the weights from.
B is the column to apply the rolled weights to.
Grouping is only done on column a.
df is already sorted by a and yr.
def wavg(w,x):
return (x * w).sum() / w.sum()
n=df.groupby(['a1'])[['w']].rolling(window=3,min_periods=1).apply(lambda x: wavg(df['w'],df['b']))
Input:
id | yr | a | b | w
---------------------------------
0 | 1990 | a1 | 50 | 3000
1 | 1991 | a1 | 40 | 2000
2 | 1992 | a1 | 10 | 1000
3 | 1993 | a1 | 20 | 8000
4 | 1990 | b1 | 10 | 500
5 | 1991 | b1 | 20 | 1000
6 | 1992 | b1 | 30 | 500
7 | 1993 | b1 | 40 | 4000
Desired output:
id | yr | a | b | rwavg
---------------------------------
0 1990 a1 50 50
1 1991 a1 40 40
2 1992 a1 10 39.96
3 1993 a1 20 22.72
4 1990 b1 10 10
5 1991 b1 20 20
6 1992 b1 30 20
7 1993 b1 40 35.45

apply with rolling usually have some wired behavior
df['Weight']=df.b*df.w
g=df.groupby(['a']).rolling(window=3,min_periods=1)
g['Weight'].sum()/g['w'].sum()
df['rwavg']=(g['Weight'].sum()/g['w'].sum()).values
Out[277]:
a
a1 0 50.000000
1 46.000000
2 40.000000
3 22.727273
b1 4 10.000000
5 16.666667
6 20.000000
7 35.454545
dtype: float64

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Removing duplicates from pandas data frame with condition based on another column - python

Related

Downsample to quarter level and get quarter end date value in Pandas

Pandas combining sparse columns in dataframe

Efficiency: Check if value in Pandas DataFrame has changed with an specific threshold outside an interval of measurement

Pandas/Python: Groupby and transform against a reference table

Pandas calculate and apply weighted rolling average on another column

Categories

Resources