Pandas: Find values within multiple ranges defined by start- and stop-columns - python

I'm trying to use two columns start and stop to define multiple ranges of values in another dataframe's age column. Ranges are defined in a df called intervals:
start stop
1 3
5 7
Ages are defined in another df:
age some_random_value
1 100
2 200
3 300
4 400
5 500
6 600
7 700
8 800
9 900
10 1000
Desired output is the rows where age falls within the ranges defined in intervals (1-3 and 5-7):
age some_random_value
1 100
2 200
3 300
5 500
6 600
7 700
I've tried using numpy.r_ but it doesn't work quite as I want it to:
df.age.loc[pd.np.r_[intervals.start, intervals.stop]]
Which yields:
age some_random_value
2 200
6 600
4 400
8 800
Any ideas are much appreciated!

I believe you need the parameter closed='both' in IntervalIndex.from_arrays (here df2 is the dataframe holding the start/stop ranges):
intervals = pd.IntervalIndex.from_arrays(df2['start'], df2['stop'], closed='both')
And then select matching values:
df = df[intervals.get_indexer(df.age.values) != -1]
print (df)
age some_random_value
0 1 100
1 2 200
2 3 300
4 5 500
5 6 600
6 7 700
Detail:
print (intervals.get_indexer(df.age.values))
[ 0 0 0 -1 1 1 1 -1 -1 -1]
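If you prefer to avoid IntervalIndex, an equivalent boolean mask can be built with NumPy broadcasting; a minimal sketch, assuming (as in the answer above) that df holds the age column and df2 holds the start/stop ranges:
import numpy as np
ages = df['age'].to_numpy()
# compare every age against every [start, stop] pair (both ends inclusive)
# and keep rows that fall inside at least one interval
mask = ((ages[:, None] >= df2['start'].to_numpy()) &
        (ages[:, None] <= df2['stop'].to_numpy())).any(axis=1)
print(df[mask])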

Related

Create an incremental serial no for filtered rows in pandas dataframe

Can you please help me change my code so it produces the expected output instead of the current output? I am using the DataFrame apply() function. It would be great if this could also be done more efficiently with a vectorized operation.
Current output:
col1 col2 serialno
0 100 0 4
1 100 100 0
2 100 100 0
3 200 100 4
4 200 200 0
5 300 200 4
6 300 300 0
Expected output:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
My current code contains a static value (4). I need to increment it by one based on a condition (col1 != col2). In addition, I need to repeat the serial no for all rows that meet the condition (col1 == col2).
My code:
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
print(df)
start = 4
series = (df['col1']!=df['col2']).apply(lambda x: start if x==True else 0)
df['serialno'] = series
print(df)
You can try this
import itertools
start = 4
counter = itertools.count(start) # to have incremental counter
df["serialno"] = [start if (x["col1"]==x["col2"] or (start:=next(counter))) else start for _, x in df.iterrows()]
The condition works in two ways: if col1 and col2 have the same value, the or short-circuits, so start keeps its current value; if they differ, next(counter) is assigned to start via the walrus operator, incrementing it by one (this requires Python 3.8+).
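For the fully vectorized version the question asks about, the change flags can simply be cumulatively summed; a minimal sketch, assuming the same df and start = 4 as above:
start = 4
changed = df['col1'] != df['col2']  # True where a new serial no should begin
df['serialno'] = start - 1 + changed.cumsum()  # 4, 4, 4, 5, 5, 6, 6 for the sample data
print(df)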
Here is how you can do it with the apply function:
ID = 3
def check_value(A, B):
    global ID
    if A != B:
        ID += 1
    return ID
df['id'] = df.apply(lambda row: check_value(row['col1'], row['col2']), axis=1)
You just need to start from 3 since the first row will increment it.
print(df) will give you this:
col1 col2 id
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Another way could be using itertools.accumulate as follows:
import pandas as pd
import numpy as np
from itertools import accumulate
df['serialno'] = list(accumulate(np.arange(1, len(df.index)), lambda x, y: x + 1 if df.iloc[y, 0] != df.iloc[y, 1] else x, initial=4))
df
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
First, I reused the idea posted by @BehRouz above of creating a user-defined function that increments the serial no. Second, I transform the boolean series into a dynamic serial no, as shown below for reference.
STEP 1: Create a dataframe and initialize the incremental counter (serialno)
import pandas as pd
columns = ['col1']
data = ['100','100','100','200','200','300','300']
df = pd.DataFrame(data=data,columns=columns)
df['col2'] = df.col1.shift(1).fillna(0)
df['serialno']=0 #initialize new column
print(df)
col1 col2 serialno
0 100 0 0
1 100 100 0
2 100 100 0
3 200 100 0
4 200 200 0
5 300 200 0
6 300 300 0
STEP 2: Create a boolean series, then use the transform method with the user-defined function posted by @BehRouz. This will create a dynamic serial no each time new rows are added to the dataframe.
start = df['serialno'].max()
def getvalue(x):
    global start
    if x:
        start += 1
    return start
df['serialno'] = (df['col1'] != df['col2']).transform(func=lambda x: getvalue(x))
print(df)
Iteration 1:
col1 col2 serialno
0 100 0 1
1 100 100 1
2 100 100 1
3 200 100 2
4 200 200 2
5 300 200 3
6 300 300 3
Iteration 2:
col1 col2 serialno
0 100 0 4
1 100 100 4
2 100 100 4
3 200 100 5
4 200 200 5
5 300 200 6
6 300 300 6
Iteration 3:
col1 col2 serialno
0 100 0 7
1 100 100 7
2 100 100 7
3 200 100 8
4 200 200 8
5 300 200 9
6 300 300 9

How to drop rows with a value of less than a percentage of the maximum per group

I have a pandas dataframe with a time series of a signal with some peaks identified:
Time (s) Intensity Peak
1 1 a
2 10 a
3 30 a
4 100 a
5 40 a
6 20 a
7 2 a
1 20 b
2 100 b
3 300 b
4 80 b
5 20 b
6 2 b
I would like to drop the rows where the Intensity value is less than 10% of the maximum Intensity value for each peak in order to obtain:
Time (s) Intensity Peak
3 30 a
4 100 a
5 40 a
6 20 a
2 100 b
3 300 b
4 80 b
How would I do that? I tried looking for a groupby function that would do that but I just cannot seem to find something that fits.
Thank you!
Use groupby to generate a mask:
filtered = df[df.groupby('Peak')['Intensity'].apply(lambda x: x > x.max() / 10)]
Output:
>>> filtered
Time(s) Intensity Peak
2 3 30 a
3 4 100 a
4 5 40 a
5 6 20 a
8 2 100 b
9 3 300 b
10 4 80 b
You could use GroupBy.transform with max to get max from each group and take 10% using Series.div. Now, compare that with df['Intensity'] and use it for boolean indexing.
max_vals = df.groupby('Peak')['Intensity'].transform('max').div(10)
mask = df['Intensity'] > max_vals
df[mask]
# Time (s) Intensity Peak
# 2 3 30 a
# 3 4 100 a
# 4 5 40 a
# 5 6 20 a
# 8 2 100 b
# 9 3 300 b
# 10 4 80 b
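If you prefer a single expression, the same mask can be written inline; a minimal sketch, assuming the same df:
filtered = df[df['Intensity'] > df.groupby('Peak')['Intensity'].transform('max') / 10]
print(filtered)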

How to remove rows of a data frame when specific amounts are not in specific columns?

I have two data frames with four and two columns. For example:
A B C D
0 4 2 320 700
1 5 7 400 800
2 2 6 300 550
3 4 6 100 300
4 5 2 250 360
and
A B
0 2 4
1 5 7
2 2 5
I need to compare the first data frame with the second one: a row of the first data frame should be kept if its A and B values also appear together as a row in the second data frame.
(Order doesn't matter: for example, in the first row of the first data frame A is 4 and B is 2, and the second data frame has a row with A = 2 and B = 4; that counts as a match, since both numbers are present.) Otherwise the row should be removed, so the output will be:
A B C D
0 4 2 320 700
1 5 7 400 800
2 5 2 250 360
How can I get this output? (My actual data frames are huge, so I can't iterate through them and need a fast, efficient way.)
I would do this by first sorting, then performing a LEFT OUTER JOIN using merge with an indicator to determine which rows to keep. Example:
u = df.loc[:, ['A', 'B']]
u.values.sort() # sort columns of `u`
df2.values.sort() # sort columns of `df2`
df[u.merge(df2, how='left', indicator='ind').eval('ind == "both"').values]
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
More info on joins with indicator can be found in my post: Pandas Merging 101
If you don't mind A and B ending up sorted within each row (and the original index being reset), you can simplify this to an inner join.
df[['A', 'B']] = np.sort(df[['A', 'B']])
df2[:] = np.sort(df2)
df.merge(df2, on=['A', 'B'])
A B C D
0 2 4 320 700
1 5 7 400 800
2 2 5 250 360
What I would do: use frozenset + isin
yourdf = df[df[['A','B']].apply(frozenset, axis=1).isin(df2.apply(frozenset, axis=1))].copy()
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
Using np.equal.outer
# arr[i, c, j, k] is True when df.iloc[i, c] == df2.iloc[j, k]
arr = np.equal.outer(df, df2)
# keep a df row if some df2 row has each of its values matched
# by at least one column of that df row
df.loc[arr.any(1).all(-1).any(-1)]
Outputs
A B C D
0 4 2 320 700
1 5 7 400 800
4 5 2 250 360
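Another option is to sort A and B within each row and test membership with a MultiIndex, which avoids mutating df or df2; a minimal sketch, assuming the original (unsorted) df and df2 from the answers above:
import numpy as np
import pandas as pd
a = np.sort(df[['A', 'B']].to_numpy())   # (4, 2) becomes (2, 4), etc.
b = np.sort(df2[['A', 'B']].to_numpy())
key1 = pd.MultiIndex.from_arrays([a[:, 0], a[:, 1]])
key2 = pd.MultiIndex.from_arrays([b[:, 0], b[:, 1]])
print(df[key1.isin(key2)])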

Changing values of a pandas dataframe based on other values in the dataframe

I am learning machine learning and generated a pandas dataframe containing the following columns: Id, Category, Cost_price, Sold. The shape of the dataframe is (100000, 4).
Here the target variable is the Sold column (1 = sold, 0 = not sold). But no machine learning algorithm can reach a good accuracy because all the columns in the dataframe are essentially random. To introduce a pattern into the dataframe I am trying to manipulate some of the values in the Sold column.
What I want to do is change 6000 of the Sold values to 1 where the cost_price is less than 800, but I am not able to do that.
I am new to machine learning and Python. Please help me.
Thanks in advance
Use:
df.loc[np.random.choice(df.index[df['cost_price'] < 800], 6000, replace=False), 'Sold'] = 1
Sample:
df = pd.DataFrame({
'Sold':[1,0,0,1,1,0] * 3,
'cost_price':[500,300,6000,900,100,400] * 3,
})
print (df)
Sold cost_price
0 1 500
1 0 300
2 0 6000
3 1 900
4 1 100
5 0 400
6 1 500
7 0 300
8 0 6000
9 1 900
10 1 100
11 0 400
12 1 500
13 0 300
14 0 6000
15 1 900
16 1 100
17 0 400
df.loc[np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False), 'Sold'] = 1
print (df)
Sold cost_price
0 1 500
1 1 300
2 0 6000
3 1 900
4 1 100
5 1 400
6 1 500
7 1 300
8 0 6000
9 1 900
10 1 100
11 1 400
12 1 500
13 1 300
14 0 6000
15 1 900
16 1 100
17 1 400
Explanation:
First filter index values by condition with boolean indexing:
print (df.index[df['cost_price'] < 800])
Int64Index([0, 1, 4, 5, 6, 7, 10, 11, 12, 13, 16, 17], dtype='int64')
Then select N random values with numpy.random.choice:
print (np.random.choice(df.index[df['cost_price'] < 800], 10, replace=False))
[16 1 7 13 17 12 10 6 5 11]
And finally, set 1 at the selected index values with DataFrame.loc.
I will assume you want to randomly choose those 6000 rows.
import random
idx = df.index[df.Cost_price < 800].tolist()
r = random.sample(idx, 6000)
df.loc[r, 'Sold'] = 1
IIUC use DataFrame.loc
df.loc[df.Sold[df.cost_price < 800][:6000].index, 'Sold'] = 1
If you want to choose the rows randomly, use .sample
df.loc[df[df.cost_price < 800].sample(6000).index, 'Sold'] = 1

Conditional shift in pandas

The following pandas DataFrame is an example that I need to deal with:
Group Amount
1 1 100
2 1 300
3 1 400
4 1 700
5 2 500
6 2 900
Here's the result that I want after calculation:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
I know that df["Difference"] = df["Amount"] - df["Amount"].shift(-1) produces the difference between consecutive rows, but what can I do for a problem like this, where the difference has to be computed within each group?
Groupby on 'Group' and call transform on the 'Amount' column, then call fillna and pass the 'Amount' column:
In [110]:
df['Difference'] = df.groupby('Group')['Amount'].transform(pd.Series.diff).fillna(df['Amount'])
df
Out[110]:
Group Amount Difference
1 1 100 100
2 1 300 200
3 1 400 100
4 1 700 300
5 2 500 500
6 2 900 400
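A slightly shorter equivalent, assuming the same df, is to call diff directly on the groupby object; a minimal sketch:
df['Difference'] = df.groupby('Group')['Amount'].diff().fillna(df['Amount'])
print(df)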
