I would like to filter my DataFrame by evaluating some conditions against several columns of the DataFrame. I illustrate what I want to do with the following example:
import pandas as pd

data = {'user': [1, 1, 1, 2, 2, 2],
        'speed': [10, 20, 90, 15, 39, 10],
        'acceleration': [9.8, 29, 5, 4, 7, 3],
        'jerk': [50, 60, 60, 40, 20, -50],
        'mode': ['car', 'car', 'car', 'metro', 'metro', 'metro']}
df = pd.DataFrame.from_dict(data)
df
user speed acceleration jerk mode
0 1 10 9.8 50 car
1 1 20 29.0 60 car
2 1 90 5.0 60 car
3 2 15 4.0 40 metro
4 2 39 7.0 20 metro
5 2 10 3.0 -50 metro
In the given example, I would like to filter the dataframe based on thresholds set against the speed, acceleration and jerk columns, as in the table below:
+-------+-------+--------------+------+------+
|       | speed | acceleration |     jerk    |
|       | max   | max          | min  | max  |
+-------+-------+--------------+------+------+
| car   | 50    | 10           | -100 | 100  |
| metro | 35    | 5            | -40  | 60   |
+-------+-------+--------------+------+------+
So only rows where the speed and acceleration are below their max thresholds, and the jerk lies within its min/max range, should be kept (i.e. rows not satisfying the stated conditions are dropped).
You can use reindex to align the thresholds with df['mode'], and then build a boolean mask:
threshold = threshold.reindex(df['mode'])
threshold = threshold.reset_index(drop=True)

msk = (
    df.acceleration.lt(threshold[('acceleration', 'max')])
    & df.speed.lt(threshold[('speed', 'max')])
    & df.jerk.ge(threshold[('jerk', 'min')])
    & df.jerk.le(threshold[('jerk', 'max')])
)
df[msk]
Details
Taking this threshold dataframe:
threshold = pd.DataFrame({'s': ['car', 'car', 'metro', 'metro'],
                          'acceleration': [10, 5, 5, 2],
                          'speed': [50, 5, 35, 2],
                          'jerk': [-100, 100, 60, -40]})
threshold = threshold.groupby('s').agg({'acceleration': 'max',
                                        'speed': 'max',
                                        'jerk': ['min', 'max']})
threshold
# acceleration speed jerk
# max max min max
#s
#car 10 50 -100 100
#metro 5 35 -40 60
You can use the 'mode' column to do the reindexing:
threshold=threshold.reindex(df['mode'])
# acceleration speed jerk
# max max min max
#mode
#car 10 50 -100 100
#car 10 50 -100 100
#car 10 50 -100 100
#metro 5 35 -40 60
#metro 5 35 -40 60
#metro 5 35 -40 60
threshold = threshold.reset_index(drop=True)

msk = (
    df.acceleration.lt(threshold[('acceleration', 'max')])
    & df.speed.lt(threshold[('speed', 'max')])
    & df.jerk.ge(threshold[('jerk', 'min')])
    & df.jerk.le(threshold[('jerk', 'max')])
)
df[msk]
# user speed acceleration jerk mode
#0 1 10 9.8 50 car
#3 2 15 4.0 40 metro
Maybe DataFrame.where is what you're looking for.
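As a minimal, hedged sketch (reusing the msk built in the answer above): where() keeps matching rows and fills the rest with NaN, so a dropna() is needed to actually remove them.
# Sketch: rows failing msk become all-NaN and are then dropped.
# Note: the NaNs upcast integer columns to float.
filtered = df.where(msk).dropna()
print(filtered)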
Related
I am trying to run a simple calculation over the values of each row within a group of a dataframe, but I'm having trouble with the syntax. I think I'm specifically getting confused about what data object I should return, i.e. DataFrame vs. Series, etc.
For context, I have a bunch of stock values for each product I am tracking and I want to estimate the number of sales via a custom function which essentially does the following:
# Because stock can go up and down, I'm looking to record the difference
# whenever the stock is less than the stock number in the previous row.
# How do I access each row of the dataframe and then return the series I need?
def get_stock_sold(x):
    # Written in pseudocode
    stock_sold = previous_stock_no - current_stock_no if current_stock_no < previous_stock_no else 0
    return pd.Series(stock_sold)
I then have the following dataframe:
# 'Order' is a date in the real dataset.
data = {
'id' : ['1', '1', '1', '2', '2', '2'],
'order' : [1, 2, 3, 1, 2, 3],
'current_stock' : [100, 150, 90, 50, 48, 30]
}
df = pd.DataFrame(data)
df = df.sort_values(by=['id', 'order'])
df['previous_stock'] = df.groupby('id')['current_stock'].shift(1)
I'd like to create a new column (stock_sold) and apply the logic from above to each row within the grouped dataframe object:
df['stock_sold'] = df.groupby('id').apply(get_stock_sold)
Desired output would look as follows:
| id | order | current_stock | previous_stock | stock_sold |
|----|-------|---------------|----------------|------------|
| 1 | 1 | 100 | NaN | 0 |
| | 2 | 150 | 100.0 | 0 |
| | 3 | 90 | 150.0 | 60 |
| 2 | 1 | 50 | NaN | 0 |
| | 2 | 48 | 50.0 | 2 |
| | 3 | 30 | 48.0 | 18 |
Try:
df["previous_stock"] = df.groupby("id")["current_stock"].shift()
df["stock_sold"] = np.where(
df["current_stock"] > df["previous_stock"].fillna(0),
0,
df["previous_stock"] - df["current_stock"],
)
print(df)
Prints:
id order current_stock previous_stock stock_sold
0 1 1 100 NaN 0.0
1 1 2 150 100.0 0.0
2 1 3 90 150.0 60.0
3 2 1 50 NaN 0.0
4 2 2 48 50.0 2.0
5 2 3 30 48.0 18.0
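The same logic can also be written without np.where, as a hedged one-liner built on the per-group diff:
# Alternative sketch: the negated per-group diff is the stock drop;
# clip(lower=0) zeroes out restocks and fillna(0) covers each group's
# first row, where diff() is NaN.
df["stock_sold"] = (-df.groupby("id")["current_stock"].diff()).clip(lower=0).fillna(0)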
I have a csv file that looks like this:
Signal Channel
-60 1
-40 6
-70 11
-80 3
-80 4
-66 1
-60 11
-50 6
I want to create a new csv file using those data in matrix form:
channel 1 | channel 2 | channel 3 | channel 4 | channel 5 | channel 6 | channel 7 | channel 8 | channel 9 | channel 10 | channel 11
-60 | | -80 | -80 | | -40 | | | | | -70
-66 | | | | | -50 | | | | | -60
But I don't know how to do this.
I think you can manage with a pivot table; you just have to pass the arguments you want to the to_csv function to make it display the way you want:
import pandas as pd

data = {"Signal": [-60, -40, -70, -80, -80, -66, -60, -50],
        "Channel": [1, 6, 11, 3, 4, 1, 11, 6]}
df = pd.DataFrame(data)
# Number each occurrence within a channel so the pivot has a unique row index.
df['count'] = df.groupby('Channel').cumcount()
pivot = pd.pivot_table(df, values="Signal", columns=["Channel"], index=['count'])
pivot = pivot.add_prefix('Channel_')
pivot.to_csv("test.csv", index=False)
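With the groupby('Channel') fix above, printing the pivot before writing it out gives roughly the following (derived from the sample data):
print(pivot)
# Channel  Channel_1  Channel_3  Channel_4  Channel_6  Channel_11
# count
# 0            -60.0      -80.0      -80.0      -40.0       -70.0
# 1            -66.0        NaN        NaN      -50.0       -60.0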
You can use df.pivot() from pandas (see the documentation). First, cumcount is used to determine the row position for the pivot.
from io import StringIO
import pandas as pd
csv = """
Signal,Channel
-60,1
-40,6
-70,11
-80,3
-80,4
-66,1
-60,11
-50,6"""
df = pd.read_csv(StringIO(csv))
df['index'] = df.groupby('Channel').cumcount()
df.pivot(index='index', columns="Channel", values="Signal")
gives you
Channel 1 3 4 6 11
index
0 -60.0 -80.0 -80.0 -40.0 -70.0
1 -66.0 NaN NaN -50.0 -60.0
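If you also want the exact "channel N" headers and the empty channels from the desired output, here is a hedged follow-up (the output filename is an assumption):
out = df.pivot(index='index', columns="Channel", values="Signal")
# Include every channel from 1 to 11, even those that never appear in the data.
out = out.reindex(columns=range(1, 12))
out.columns = [f"channel {c}" for c in out.columns]
# NaN cells are written as empty fields, matching the blanks in the question.
out.to_csv("signal_matrix.csv", index=False)  # filename is an assumption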
This answer is adapted from #unutbu's answer here
I have created a sample datatable as follows:
import datatable as dt
from datatable import f

DT_EX = dt.Frame({'recency': ['current', 'savings', 'fixex', 'current',
                              'savings', 'fixed', 'savings', 'current'],
                  'amount': [4200, 2300, 1500, 8000, 1200, 6500, 4500, 9010],
                  'no_of_pl': [3, 2, 1, 5, 1, 2, 5, 4],
                  'default': [True, False, True, False, True, True, True, False]})
and it can be viewed as:
| recency amount no_of_pl default
-- + ------- ------ -------- -------
0 | current 4200 3 1
1 | savings 2300 2 0
2 | fixex 1500 1 1
3 | current 8000 5 0
4 | savings 1200 1 1
5 | fixed 6500 2 1
6 | savings 4500 5 1
7 | current 9010 4 0
[8 rows x 4 columns]
I'm doing some data manipulations as explained in the steps below:
Step 1: Two new columns are added to the datatable:
DT_EX[:, f[:].extend({"total_amount": f.amount*f.no_of_pl,
'test_col': f.amount/f.no_of_pl})]
output:
| recency amount no_of_pl default total_amount test_col
-- + ------- ------ -------- ------- ------------ --------
0 | current 4200 3 1 12600 1400
1 | savings 2300 2 0 4600 1150
2 | fixex 1500 1 1 1500 1500
3 | current 8000 5 0 40000 1600
4 | savings 1200 1 1 1200 1200
5 | fixed 6500 2 1 13000 3250
6 | savings 4500 5 1 22500 900
7 | current 9010 4 0 36040 2252.5
[8 rows x 6 columns]
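One aside worth noting (not from the original post): DT[i, j] returns a new frame, so DT_EX itself is unchanged here; to keep the columns you would assign the result, e.g. to a hypothetical DT_EX2:
# extend() does not modify DT_EX in place; assign the result to keep it.
DT_EX2 = DT_EX[:, f[:].extend({"total_amount": f.amount * f.no_of_pl,
                               'test_col': f.amount / f.no_of_pl})]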
Step 2:
A dictionary is created as follows; note that its values are stored in lists:
test_dict = {'discount': [10,20,30,40,50,60,70,80],
'charges': [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8]}
Step 3:
A new datatable is created from the above dict and appended to DT_EX with cbind:
dt.cbind(DT_EX, dt.Frame(test_dict))
output:
| recency amount no_of_pl default discount charges
-- + ------- ------ -------- ------- -------- -------
0 | current 4200 3 1 10 0.1
1 | savings 2300 2 0 20 0.2
2 | fixex 1500 1 1 30 0.3
3 | current 8000 5 0 40 0.4
4 | savings 1200 1 1 50 0.5
5 | fixed 6500 2 1 60 0.6
6 | savings 4500 5 1 70 0.7
7 | current 9010 4 0 80 0.8
[8 rows x 6 columns]
Here we can see a datatable with the newly added columns (discount, charges)
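As with DT[i, j], dt.cbind returns a new frame rather than modifying DT_EX in place, so assign the result if you want to keep it (DT_EX_new is a hypothetical name):
# cbind returns a new frame; DT_EX itself is unchanged unless reassigned.
DT_EX_new = dt.cbind(DT_EX, dt.Frame(test_dict))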
Step 4:
Since the extend function can be used to add columns, I tried passing in the dictionary test_dict directly:
DT_EX[:, f[:].extend(test_dict)]
Output:
Out[18]:
| recency amount no_of_pl default discount discount.0 discount.1 discount.2 discount.3 discount.4 … charges.2 charges.3 charges.4 charges.5 charges.6
-- + ------- ------ -------- ------- -------- ---------- ---------- ---------- ---------- ---------- --------- --------- --------- --------- ---------
0 | current 4200 3 1 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
1 | savings 2300 2 0 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
2 | fixex 1500 1 1 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
3 | current 8000 5 0 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
4 | savings 1200 1 1 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
5 | fixed 6500 2 1 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
6 | savings 4500 5 1 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
7 | current 9010 4 0 10 20 30 40 50 60 … 0.4 0.5 0.6 0.7 0.8
[8 rows x 20 columns]
Note: in the output there are 8 new columns (one per list element) for each dictionary key (discount, charges), so 16 new columns are added in total.
Step 5:
I then thought of creating a dictionary whose values are numpy arrays:
import numpy as np

test_dict_1 = {'discount': np.array([10, 20, 30, 40, 50, 60, 70, 80]),
               'charges': np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])}
I then passed test_dict_1 to the extend function:
DT_EX[:, f[:].extend(test_dict_1)]
output:
Out[20]:
| recency amount no_of_pl default discount charges
-- + ------- ------ -------- ------- -------- -------
0 | current 4200 3 1 10 0.1
1 | savings 2300 2 0 20 0.2
2 | fixex 1500 1 1 30 0.3
3 | current 8000 5 0 40 0.4
4 | savings 1200 1 1 50 0.5
5 | fixed 6500 2 1 60 0.6
6 | savings 4500 5 1 70 0.7
7 | current 9010 4 0 80 0.8
[8 rows x 6 columns]
At this step, extend took the dictionary and added the new columns to DT_EX, which is the expected output.
So I would like to understand what happened in step 4. Why didn't it take each key's list of values as a single new column? And why did the step 5 case work?
Could you please share your comments/answers?
You could wrap the dictionary in a Frame constructor to get the desired result:
>>> DT_EX[:, f[:].extend(dt.Frame(test_dict))]
| recency amount no_of_pl default discount charges
-- + ------- ------ -------- ------- -------- -------
0 | current 4200 3 1 10 0.1
1 | savings 2300 2 0 20 0.2
2 | fixex 1500 1 1 30 0.3
3 | current 8000 5 0 40 0.4
4 | savings 1200 1 1 50 0.5
5 | fixed 6500 2 1 60 0.6
6 | savings 4500 5 1 70 0.7
7 | current 9010 4 0 80 0.8
[8 rows x 6 columns]
As to what happens in step 4, the following logic is applied: when we evaluate a dictionary for the DT[] call, we treat it simply as a list of elements, where each item in the list is named by the corresponding key. If an "item" produces multiple columns, then each of the columns gets the same name from the key. Now, in this case each "item" is a list again, and we don't have any special rules for evaluating such lists of primitives. So they end up expanding into a list of columns where each column is a constant.
You are right that the end result looks quite counterintuitive, so we'd probably want to adjust the rules for evaluating lists inside DT[] expressions.
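A minimal sketch of that rule on a tiny frame (a hypothetical example built from the evaluation rule described above; the de-duplicated names follow the discount.N pattern seen in step 4):
small = dt.Frame({'x': [1, 2, 3]})
# Each list element becomes its own constant column, all named from the key
# 'k'; duplicate names are then de-duplicated (k, k.0, ...), as in step 4.
small[:, f[:].extend({'k': [10, 20]})]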
I want to create a new column based on the values of other columns in a pandas DataFrame. My data is about a truck that moves back and forth from a loading to a dumping location. I want to calculate the distance from the current road segment to the end of the road. An example of the data is shown below:
State | segment length |
-----------------------------
Loaded | 20 |
Loaded | 10 |
Loaded | 10 |
Empty | 15 |
Empty | 10 |
Empty | 10 |
Loaded | 30 |
Loaded | 20 |
Loaded | 10 |
So the end of the road is the record where the State changes. Hence I want to calculate, for each row, the remaining distance to the end of the road. The final dataframe will be:
State | segment length | Distance to end
Loaded | 20 | 40
Loaded | 10 | 20
Loaded | 10 | 10
Empty | 15 | 35
Empty | 10 | 20
Empty | 10 | 10
Loaded | 30 | 60
Loaded | 20 | 30
Loaded | 10 | 10
Can anyone help?
Thank you in advance
Use GroupBy.cumsum on the reversed rows (via DataFrame.iloc[::-1]), together with a helper Series, built with shift and cumsum, that labels each run of consecutive equal State values:
g = df['State'].ne(df['State'].shift()).cumsum()
df['Distance to end'] = df.iloc[::-1].groupby(g)['segment length'].cumsum()
print(df)
State segment length Distance to end
0 Loaded 20 40
1 Loaded 10 20
2 Loaded 10 10
3 Empty 15 35
4 Empty 10 20
5 Empty 10 10
6 Loaded 30 60
7 Loaded 20 30
8 Loaded 10 10
Detail:
print (g)
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 3
Name: State, dtype: int32
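To see why the reversal gives distance-to-end, here is the intermediate reversed cumsum before the assignment realigns it to df by index:
print(df.iloc[::-1].groupby(g)['segment length'].cumsum())
# 8    10
# 7    30
# 6    60
# 5    10
# 4    20
# 3    35
# 2    10
# 1    20
# 0    40
# Name: segment length, dtype: int64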
Another option is to chain everything into a single pipeline:
df['Distance to end'] = (
    df.assign(i=df.State.ne(df.State.shift()).cumsum())
      .assign(s=lambda x: x.groupby(by='i')['segment length'].transform('sum'))
      .groupby(by='i')
      .apply(lambda x: x.s.sub(x['segment length'].shift().cumsum().fillna(0)))
      .values
)
State segment length Distance to end
0 Loaded 20 40.0
1 Loaded 10 20.0
2 Loaded 10 10.0
3 Empty 15 35.0
4 Empty 10 20.0
5 Empty 10 10.0
6 Loaded 30 60.0
7 Loaded 20 30.0
8 Loaded 10 10.0
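The same run-based idea can also be phrased without reversing or apply, as a hedged sketch: the distance to the end is the run's total length minus the length already travelled within the run:
g = df['State'].ne(df['State'].shift()).cumsum()
run_total = df.groupby(g)['segment length'].transform('sum')
travelled = df.groupby(g)['segment length'].cumsum() - df['segment length']
df['Distance to end'] = run_total - travelled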
I am having a hard time figuring out how to get "rolling weights" based off of one of my columns, and then factor these weights onto another column.
I've tried groupby.rolling.apply(function) on my data, but the main problem is conceptualizing how to take a running/rolling average of the column I'm going to turn into weights, and then factor this "window" of weights onto another column that isn't rolled.
I'm also purposely setting min_periods to 1, so you'll notice that the first two rows of "rwavg" in each group of my final output mirror the original.
w is the rolling column to derive the weights from.
b is the column to apply the rolled weights to.
Grouping is only done on column a.
df is already sorted by a and yr.
def wavg(w, x):
    return (x * w).sum() / w.sum()

n = df.groupby(['a1'])[['w']].rolling(window=3, min_periods=1).apply(lambda x: wavg(df['w'], df['b']))
Input:
id | yr | a | b | w
---------------------------------
0 | 1990 | a1 | 50 | 3000
1 | 1991 | a1 | 40 | 2000
2 | 1992 | a1 | 10 | 1000
3 | 1993 | a1 | 20 | 8000
4 | 1990 | b1 | 10 | 500
5 | 1991 | b1 | 20 | 1000
6 | 1992 | b1 | 30 | 500
7 | 1993 | b1 | 40 | 4000
Desired output:
id | yr | a | b | rwavg
---------------------------------
0 1990 a1 50 50
1 1991 a1 40 40
2 1992 a1 10 39.96
3 1993 a1 20 22.72
4 1990 b1 10 10
5 1991 b1 20 20
6 1992 b1 30 20
7 1993 b1 40 35.45
apply with rolling usually has some weird behavior, so compute the rolling weighted sums directly instead:
# Precompute b*w so the rolling sums give the weighted-average numerator.
df['Weight'] = df.b * df.w
g = df.groupby(['a']).rolling(window=3, min_periods=1)
g['Weight'].sum() / g['w'].sum()
df['rwavg'] = (g['Weight'].sum() / g['w'].sum()).values
Out[277]:
a
a1 0 50.000000
1 46.000000
2 40.000000
3 22.727273
b1 4 10.000000
5 16.666667
6 20.000000
7 35.454545
dtype: float64
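If you want something closer to the wavg function from the question, the same numerator/denominator trick can be wrapped in a helper (a sketch; rolling.apply only sees one column at a time, which is why the b*w product is rolled instead):
def rolling_wavg(df, value_col, weight_col, group_col, window=3):
    # Roll the precomputed numerator (value * weight) and the denominator
    # (weight) separately, then divide; the MultiIndexed results align.
    num = ((df[value_col] * df[weight_col])
           .groupby(df[group_col])
           .rolling(window, min_periods=1).sum())
    den = (df[weight_col]
           .groupby(df[group_col])
           .rolling(window, min_periods=1).sum())
    return (num / den).values

# Column names b, w and a come from the question's input data.
df['rwavg'] = rolling_wavg(df, 'b', 'w', 'a')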