Pandas: Delete rows based on other rows

I have a pandas dataframe that looks like this:
qseqid  sseqid  qstart  qend
     2       1     125   345
     4       1     150   320
     3       2     150   450
     6       2      25   300
     8       2      50   500
I would like to remove rows based on other rows' values, with this criterion: a row r1 must be removed if another row r2 exists with the same sseqid, r1[qstart] > r2[qstart], and r1[qend] < r2[qend].
Is this possible with pandas?

import pandas as pd

df = pd.DataFrame({'qend': [345, 320, 450, 300, 500],
                   'qseqid': [2, 4, 3, 6, 8],
                   'qstart': [125, 150, 150, 25, 50],
                   'sseqid': [1, 1, 2, 2, 2]})

def remove_rows(df):
    # Pair every row with every other row that shares the same sseqid.
    merged = pd.merge(df.reset_index(), df, on='sseqid')
    # A row (x) must go if some row (y) starts before it and ends after it.
    mask = ((merged['qstart_x'] > merged['qstart_y'])
            & (merged['qend_x'] < merged['qend_y']))
    # Keep only the rows whose index labels never appear as a contained row.
    df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
    result = df.loc[df_mask]
    return result

result = remove_rows(df)
print(result)
yields
   qend  qseqid  qstart  sseqid
0   345       2     125       1
3   300       6      25       2
4   500       8      50       2
The idea is to use pd.merge to form a DataFrame with every pairing of rows
with the same sseqid:
In [78]: pd.merge(df.reset_index(), df, on='sseqid')
Out[78]:
    index  qend_x  qseqid_x  qstart_x  sseqid  qend_y  qseqid_y  qstart_y
0       0     345         2       125       1     345         2       125
1       0     345         2       125       1     320         4       150
2       1     320         4       150       1     345         2       125
3       1     320         4       150       1     320         4       150
4       2     450         3       150       2     450         3       150
5       2     450         3       150       2     300         6        25
6       2     450         3       150       2     500         8        50
7       3     300         6        25       2     450         3       150
8       3     300         6        25       2     300         6        25
9       3     300         6        25       2     500         8        50
10      4     500         8        50       2     450         3       150
11      4     500         8        50       2     300         6        25
12      4     500         8        50       2     500         8        50
Each row of merged contains data from two rows of df. You can then compare every two rows using
mask = ((merged['qstart_x'] > merged['qstart_y'])
        & (merged['qend_x'] < merged['qend_y']))
and find the labels in df.index that do not match this condition:
df_mask = ~df.index.isin(merged.loc[mask, 'index'].values)
and select those rows:
result = df.loc[df_mask]
Note that this assumes df has a unique index.
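If the all-pairs merge becomes too large, a per-group containment check is another option. This is a minimal sketch of that idea (my own illustration, not part of the answer above), assuming the same df:
import numpy as np

def remove_rows_grouped(df):
    def drop_contained(group):
        starts = group['qstart'].to_numpy()
        ends = group['qend'].to_numpy()
        # contained[i, j] is True when row j lies strictly inside row i's interval.
        contained = (starts[:, None] < starts[None, :]) & (ends[:, None] > ends[None, :])
        return group[~contained.any(axis=0)]
    return df.groupby('sseqid', group_keys=False).apply(drop_contained)
This avoids materializing every pairing across the whole frame at once, at the cost of a Python-level loop over the groups.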

Related

How to bin positive and negative values to get counts for a time series plot

I'm trying to recreate a time-series plot similar to the one below (not including the 'HLMA Flashes' data).
This is what my data file looks like; the polarity is in the "Charge" column. I used pandas to load the file and set up the table in a Jupyter notebook. The value of the charge does not matter, only whether it is positive or negative.
Once I get the count of the totals/negatives/positives, I know how to plot this against time, but I'm not sure how to approach binning to get the counts (or whatever is needed) to make the time series. Preferably I need this in 5-minute bin periods during the timeframe of my dataframe (0000-0700 UTC). Apologies if this question is worded poorly, but any leads would be appreciated.
Link to .txt file: https://drive.google.com/file/d/13XEc74LO3cZQhylAdSfhLeUn7GFgtiKT/view?usp=sharing
Here's a way to do what I believe you are asking:
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime(df.agg(lambda x: f"{x['Date']} {x['Time']}", axis=1)),
    'Neg': df.Charge < 0,
    'Pos': df.Charge > 0,
    'Tot': [1] * len(df)})
df2['minutes'] = (df2.Datetime.dt.hour * 60 + df2.Datetime.dt.minute) // 5 * 5
df3 = df2[['minutes','Neg','Pos','Tot']].groupby('minutes').sum()
Output:
         Neg  Pos  Tot
minutes
45         0    1    1
55         0    1    1
65         0    2    2
85         0    2    2
90         0    2    2
95         0    1    1
100        0    3    3
105        1    4    5
110        2   11   13
115        0   10   10
120        0    6    6
125        1   13   14
130        3   70   73
135        2   20   22
140        1    5    6
165        0    2    2
170        3    1    4
175        2    5    7
180        2   12   14
185        3   26   29
190        1   11   12
195        0    4    4
200        1   14   15
205        1    4    5
210        0    1    1
215        0    1    1
220        0    1    1
225        3    0    3
230        1    5    6
235        0    4    4
240        1    2    3
245        0    3    3
260        0    1    1
265        0    1    1
Explanation:
Create a 'Datetime' column from the 'Date' and 'Time' columns using to_datetime().
Create Neg and Pos columns based on the sign of Charge, and create a Tot column equal to 1 for each row.
Create a minutes column to bin the rows into 5-minute intervals.
Use groupby() and sum() to aggregate Neg, Pos and Tot for each interval with at least one row.
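An equivalent approach (a sketch, assuming the same df2 with its Datetime column) is to let pandas do the binning with resample:
counts = df2.set_index('Datetime')[['Neg', 'Pos', 'Tot']].resample('5min').sum()
Unlike the groupby above, resample also emits empty 5-minute bins, which can be convenient for a continuous time axis.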

Count consecutive numbers from a column of a dataframe in Python

I have a dataframe that has segments of consecutive values appearing in column a (the value in column b does not matter):
import pandas as pd
import numpy as np
np.random.seed(150)
df = pd.DataFrame(data={'a':[1,2,3,4,5,15,16,17,18,203,204,205],'b':np.random.randint(50000,size=(12))})
>>> df
      a      b
0     1  27066
1     2  28155
2     3  49177
3     4    496
4     5   2354
5    15  23292
6    16   9358
7    17  19036
8    18  29946
9   203  39785
10  204  15843
11  205  21917
I would like to add a column c whose values count up sequentially within each run of consecutive values in column a, as shown below:
  a      b  c
  1  27066  1
  2  28155  2
  3  49177  3
  4    496  4
  5   2354  5
 15  23292  1
 16   9358  2
 17  19036  3
 18  29946  4
203  39785  1
204  15843  2
205  21917  3
How can I do this?
One solution:
df["c"] = (s := df["a"] - np.arange(len(df))).groupby(s).cumcount() + 1
print(df)
Output
      a      b  c
0     1  27066  1
1     2  28155  2
2     3  49177  3
3     4    496  4
4     5   2354  5
5    15  23292  1
6    16   9358  2
7    17  19036  3
8    18  29946  4
9   203  39785  1
10  204  15843  2
11  205  21917  3
The original idea comes from the old Python docs.
To use the walrus operator (:=, an assignment expression) you need Python 3.8+; on earlier versions you can do the same in two steps:
s = df["a"] - np.arange(len(df))
df["c"] = s.groupby(s).cumcount() + 1
print(df)
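To see why this key works: within a run of consecutive values, a minus its position is constant, so grouping by s isolates each run. For the data above:
print(s.tolist())
# [1, 1, 1, 1, 1, 10, 10, 10, 10, 194, 194, 194]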
A simple solution is to find the consecutive groups, build a running count with cumsum, and then subtract the count accumulated before each group starts.
a = df['a'].add(1).shift(1).eq(df['a'])  # True where the previous value + 1 equals the current value
df['c'] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int) + 1
df
Result:
      a      b  c
0     1  27066  1
1     2  28155  2
2     3  49177  3
3     4    496  4
4     5   2354  5
5    15  23292  1
6    16   9358  2
7    17  19036  3
8    18  29946  4
9   203  39785  1
10  204  15843  2
11  205  21917  3
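Another common idiom for the same task (a sketch, not taken from either answer above): mark the start of each run wherever the step from the previous value is not 1, label the runs with cumsum, and count within each label:
breaks = df['a'].diff().ne(1)  # True at the first row of each run
df['c'] = df.groupby(breaks.cumsum()).cumcount() + 1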

Adding dataframe columns from shorter lists

I have a dataframe with three columns. The first column specifies a group into which each row is classified. Each group normally consists of 3 data points (rows), but it is possible for the last group to be "cut off" and contain fewer than three data points. In the real world, this could be due to the experiment or data collection process being cut off prematurely. In the example below, group 3 is cut off and contains only one data point.
import pandas as pd
data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# Load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
I also have two lists with additional values.
x_list = [1, 3, 5]
y_list = [2, 4, 6]
I want to add these lists to my dataframe as new columns, and have the values repeat for each group. In other words, I want my output to look like this:
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
Notice that even though the length of a column is not divisible by the length of the shorter lists, the number of rows in the dataframe does not change.
How do I achieve this without losing dataframe rows or adding new rows with NaN values?
You can use GroupBy.cumcount to generate an indexer, then use it to repeat the values in group order:
new = pd.DataFrame({'x': x_list, 'y': y_list})
idx = df.groupby('group_id').cumcount()
df[['x', 'y']] = new.reindex(idx).to_numpy()
Output:
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
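For this data the indexer idx is simply each row's position within its group, so the reindex repeats the x/y rows in that order:
print(idx.tolist())
# [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]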
As your lists' length matches the group size, you can use:
df[['x', 'y']] = (pd.DataFrame({'x': x_list, 'y': y_list})
                  .reindex(df.groupby('group_id').cumcount().mod(3)).values)
print(df)
# Output
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
Let's use np.resize:
import pandas as pd
import numpy as np
data = {
    "group_id": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3],
    "valueA": [420, 380, 390, 500, 270, 220, 150, 400, 330, 170],
    "valueB": [50, 40, 45, 22, 20, 50, 10, 60, 90, 10]
}
# Load data into a DataFrame object:
df = pd.DataFrame(data)

# np.resize repeats the values cyclically until the requested length is reached.
df['x'] = np.resize(x_list, len(df))
df['y'] = np.resize(y_list, len(df))
df
Output:
   group_id  valueA  valueB  x  y
0         0     420      50  1  2
1         0     380      40  3  4
2         0     390      45  5  6
3         1     500      22  1  2
4         1     270      20  3  4
5         1     220      50  5  6
6         2     150      10  1  2
7         2     400      60  3  4
8         2     330      90  5  6
9         3     170      10  1  2
An alternative in case the lists have different lengths:
lambda_duplicator = lambda lista, lenn, shape: (lista * int(1 + shape / lenn))[:shape]
df['x'] = lambda_duplicator(x_list, len(x_list), df.shape[0])
df['y'] = lambda_duplicator(y_list, len(y_list), df.shape[0])
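A standard-library sketch of the same cyclic fill (my own variant, not from the answer above):
from itertools import cycle, islice

# Repeat each list cyclically and cut it off at the dataframe's length.
df['x'] = list(islice(cycle(x_list), len(df)))
df['y'] = list(islice(cycle(y_list), len(df)))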

Split a data frame into six equal parts based on number of rows, without knowing the number of rows

I have a df as shown below.
df:
ID  Job  Salary
 1    A     100
 2    B     200
 3    B      20
 4    C     150
 5    A     500
 6    A     600
 7    A     200
 8    B     150
 9    C     110
10    B     200
11    B     220
12    A     150
13    C      20
14    B      50
I would like to split the df into 6 equal parts based on the number of rows.
Expected Output
df1:
ID  Job  Salary
 1    A     100
 2    B     200
 3    B      20
df2:
ID  Job  Salary
 4    C     150
 5    A     500
 6    A     600
df3:
ID  Job  Salary
 7    A     200
 8    B     150
df4:
ID  Job  Salary
 9    C     110
10    B     200
df5:
ID  Job  Salary
11    B     220
12    A     150
df6:
ID  Job  Salary
13    C      20
14    B      50
Note: since there are 14 rows, the first two dfs can have 3 rows and the remaining 4 dfs should have 2 rows.
I would also like to save all the dfs as CSV files dynamically.
You can use np.array_split():
import numpy as np

dfs = np.array_split(df, 6)
for index, df in enumerate(dfs):
    df.to_csv(f'df{index+1}.csv')
>>> print(dfs)
[   ID Job  Salary
0   1   A     100
1   2   B     200
2   3   B      20,
    ID Job  Salary
3   4   C     150
4   5   A     500
5   6   A     600,
    ID Job  Salary
6   7   A     200
7   8   B     150,
    ID Job  Salary
8   9   C     110
9  10   B     200,
     ID Job  Salary
10  11   B     220
11  12   A     150,
     ID Job  Salary
12  13   C      20
13  14   B      50]
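np.array_split hands the remainder to the leading chunks, so 14 rows over 6 parts gives sizes 3, 3, 2, 2, 2, 2, matching the note in the question:
print([len(part) for part in dfs])
# [3, 3, 2, 2, 2, 2]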

Add a column in a dataframe from another dataframe, doing some arithmetic calculations

I have a table in a pandas dataframe df:
id  product_1  product_2  count
 1        100        200     10
 2        200        600     20
 3        100        500     30
 4        400        100     40
 5        500        700     50
 6        200        500     60
 7        100        400     70
I also have another table in a dataframe df2:
product  price
    100      5
    200     10
    300     15
    400     20
    500     25
    600     30
    700     35
I have to create a new column price_product_2 in my first df, taking the price values from df2 according to product_2. I also need to find the percentage difference of product_2 with respect to product_1 and add one more column, %_diff. For example, if product_1 = 100 and product_2 = 200, then product_2 is 200% of the price of 100. Similarly, if product_1 = 400 and product_2 = 100, the price declines, so product_2 is -25% of product_1.
My final output should be:
id  product_1  product_2  count  price_product_2  %_diff
 1        100        200     10               10    +200
 2        200        600     20               30    +300
 3        100        500     30               25    +500
 4        400        100     40                5     -25
 5        500        700     50               35    +140
 6        200        500     60               25    +250
 7        100        400     70               20  -71.42
Any ideas how to achieve this? I was trying to use map functions:
df['price_product_2'] = df['product_2'].map(df2.set_index('product')['price'])
but I could only get the one column; how do I get the %_diff column?
Use merge (or map) twice, once for each product, then calculate the difference.
# Add prices for products 1 and 2
df3 = (df1
       .merge(df2, left_on='product_1', right_on='product')
       .merge(df2, left_on='product_2', right_on='product'))
# Calculate the percent difference
df3['pct_diff'] = (df3.price_y - df3.price_x) / df3.price_x
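If a percentage on the scale of the question's %_diff column is wanted rather than a fraction, scale by 100:
df3['pct_diff'] = (df3.price_y - df3.price_x) / df3.price_x * 100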
Suppose you have the following data frames:
In [32]: df1
Out[32]:
   index  id  product_1  product_2  count
0      0   1        100        200     10
1      1   2        200        600     20
2      2   3        100        500     30
3      3   4        400        100     40
4      4   5        500        700     50
5      5   6        200        500     60
6      6   7        100        400     70
In [33]: df2
Out[33]:
   product  price
0      100      5
1      200     10
2      300     15
3      400     20
4      500     25
5      600     30
6      700     35
It is probably easier simply to set product as the index for df2:
In [35]: df2.set_index('product', inplace=True)
In [36]: df2
Out[36]:
         price
product
100          5
200         10
300         15
400         20
500         25
600         30
700         35
Then you can do things like the following:
In [37]: df2.loc[df1['product_2']]
Out[37]:
         price
product
200         10
600         30
500         25
100          5
700         35
500         25
400         20
Use the values explicitly to set, or else the product index will screw things up:
In [38]: df1['price_product_2'] = df2.loc[df1['product_2']].values
In [39]: df1
Out[39]:
   index  id  product_1  product_2  count  price_product_2
0      0   1        100        200     10               10
1      1   2        200        600     20               30
2      2   3        100        500     30               25
3      3   4        400        100     40                5
4      4   5        500        700     50               35
5      5   6        200        500     60               25
6      6   7        100        400     70               20
For the percentage column you can also use vectorized operations (dividing the product codes works here only because each price is proportional to its code):
In [40]: df1.product_2 / df1.product_1 * 100
Out[40]:
0    200.0
1    300.0
2    500.0
3     25.0
4    140.0
5    250.0
6    400.0
dtype: float64
Another solution: map via the dict d, then divide with div:
d = df2.set_index('product')['price'].to_dict()
df['price_product_2'] = df['product_2'].map(d)
df['price_product_1'] = df['product_1'].map(d)
df['diff'] = df['price_product_2'].div(df['price_product_1']).mul(100)
print (df)
   id  product_1  product_2  count  price_product_2  price_product_1   diff
0   1        100        200     10               10                5  200.0
1   2        200        600     20               30               10  300.0
2   3        100        500     30               25                5  500.0
3   4        400        100     40                5               20   25.0
4   5        500        700     50               35               25  140.0
5   6        200        500     60               25               10  250.0
6   7        100        400     70               20                5  400.0
But it seems only the division is necessary: because each price here is the product code multiplied by the same constant, dividing product_2 by product_1 directly gives the same ratio:
df['diff1'] = df['product_2'].div(df['product_1']).mul(100)
print (df)
   id  product_1  product_2  count  diff1
0   1        100        200     10  200.0
1   2        200        600     20  300.0
2   3        100        500     30  500.0
3   4        400        100     40   25.0
4   5        500        700     50  140.0
5   6        200        500     60  250.0
6   7        100        400     70  400.0
