How to separate out a dataframe by ID? - python

I have a df which has object ID as the index, and then x values and y values in the columns, giving coordinates for where the object moved over time. For example:
id x y
1 100 400
1 110 390
1 115 385
2 110 380
2 115 380
3 200 570
3 210 580
I would like to calculate the change in x and the change in y for each object, so I can see the direction of travel (e.g. north-east) and how linear or non-linear each route is. I can then filter out objects moving in a way I am not interested in.
How do I create a loop that iterates over each object (i.e. each ID) separately? Something like for i in range(len(df)) would loop over every row; it would not discriminate based on ID.
Thank you

# if id is your index, fix that:
df = df.reset_index()
# groupby id, getting the difference row by row within each group:
df[['chngX', 'chngY']] = df.groupby('id')[['x', 'y']].diff()
print(df)
Output:
id x y chngX chngY
0 1 100 400 NaN NaN
1 1 110 390 10.0 -10.0
2 1 115 385 5.0 -5.0
3 2 110 380 NaN NaN
4 2 115 380 5.0 0.0
5 3 200 570 NaN NaN
6 3 210 580 10.0 10.0
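If you specifically want a loop over each object (e.g. to inspect or filter one track at a time), note that groupby itself is iterable. A minimal sketch, assuming the df produced above (the per-group logic is only illustrative):
# groupby yields (id, sub-DataFrame) pairs, one per object
for obj_id, track in df.groupby('id'):
    dx = track['x'].diff()
    dy = track['y'].diff()
    # e.g. total displacement of this object, usable for direction/linearity checks
    print(obj_id, dx.sum(), dy.sum())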

Related

How to use each row's value as the comparison value and count the rows in the whole DataFrame satisfying a condition?

date data1
0 2012/1/1 100
1 2012/1/2 109
2 2012/1/3 108
3 2012/1/4 120
4 2012/1/5 80
5 2012/1/6 130
6 2012/1/7 100
7 2012/1/8 140
Given the dataframe above, I want to get the number of rows whose data1 value is within ±10 of each row's data1 value, and append that count to each row, such that:
date data Count
0 2012/1/1 100.0 4.0
1 2012/1/2 109.0 4.0
2 2012/1/3 108.0 4.0
3 2012/1/4 120.0 2.0
4 2012/1/5 80.0 1.0
5 2012/1/6 130.0 3.0
6 2012/1/7 100.0 4.0
7 2012/1/8 140.0 2.0
Since each row's value is the comparison target for its own rule, I used iterrows, although I know this is not elegant:
result = pd.DataFrame(index=df.index)
for i, r in df.iterrows():
    high = r['data1'] + 10
    low = r['data1'] - 10
    df2 = df.loc[(df['data1'] <= high) & (df['data1'] >= low)]
    result.loc[i, 'date'] = r['date']
    result.loc[i, 'data'] = r['data1']
    result.loc[i, 'count'] = df2.shape[0]
result
Is there any more Pandas-style way to do that?
Thank you for any help!
Use NumPy broadcasting to build a boolean mask, then count the True values with sum:
arr = df['data1'].to_numpy()
df['count'] = ((arr[:, None] <= arr + 10) & (arr[:, None] >= arr - 10)).sum(axis=1)
print(df)
date data1 count
0 2012/1/1 100 4
1 2012/1/2 109 4
2 2012/1/3 108 4
3 2012/1/4 120 2
4 2012/1/5 80 1
5 2012/1/6 130 3
6 2012/1/7 100 4
7 2012/1/8 140 2
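An equivalent mask can be written with a single absolute difference, which reads closer to the "within ±10" wording; this is only a minor variant of the same broadcasting idea:
import numpy as np

arr = df['data1'].to_numpy()
# |value_j - value_i| <= 10 for every pair (i, j), counted per row i
df['count'] = (np.abs(arr[:, None] - arr) <= 10).sum(axis=1)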

Calculate the sum of values replacing NaN

I have a data frame with some NaNs in column B.
df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})
df
A B
0 654 987.0
1 987 NaN
2 321 741.0
3 654 NaN
4 987 65.0
5 15 35.0
6 98 94.0
7 338 NaN
I replace the NaNs in B with the numbers from A:
df.B.fillna(df.A, inplace = True)
df
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0
What's the easiest way to calculate the sum of the values that have replaced the NaNs in B?
You can use Series.isna() with .loc[] to filter column A to the rows where column B is null, and then sum:
df.loc[df['B'].isna(),'A'].sum()
Alternative:
df['B'].fillna(df['A']).sum() - df['B'].sum()
Note: you should do this before the inplace fillna, or preferably keep a copy of the original under a different variable for later reference.
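A minimal sketch of that ordering, saving the null mask before filling (variable names are only illustrative):
# remember which rows of B were NaN before filling them from A
was_nan = df['B'].isna()
df['B'] = df['B'].fillna(df['A'])
# sum of the values that replaced the NaNs: 987 + 654 + 338 = 1979
replaced_sum = df.loc[was_nan, 'A'].sum()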
Alternatively, loop over the rows and use math.isnan to check for NaN values:
import numpy as np
import pandas as pd
import math
df = pd.DataFrame({
    'A': [654, 987, 321, 654, 987, 15, 98, 338],
    'B': [987, np.nan, 741, np.nan, 65, 35, 94, np.nan]})
for i in range(len(df['B'])):
    if math.isnan(df['B'][i]):
        # use .loc to avoid chained-assignment issues
        df.loc[i, 'B'] = df['A'][i]
print(df)
Output :
A B
0 654 987.0
1 987 987.0
2 321 741.0
3 654 654.0
4 987 65.0
5 15 35.0
6 98 94.0
7 338 338.0

Calculations between different rows

I am trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and shift, but did not manage to get the result I need.
Here's a simple example to explain better what i want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column 'd' that holds the row-to-row diff of column 'a', but only where the diff of column 'b' equals 1; otherwise the value should be null, like the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
You can also create a group key and take the diff within each group:
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
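To see why this works, it helps to print the intermediate key (a small sketch; key is just an illustrative name):
# rows where b does not increase by exactly 1 start a new group
key = df1.b.diff().ne(1).cumsum()
print(key.tolist())  # [1, 1, 1, 2, 2] for b = [1, 2, 3, 5, 6]
# a.diff() is then taken within each group, so the jump from b=3 to b=5
# starts a new group and yields NaN instead of 401 - 351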

Create an indicator column based on one column being within +/- 5% of another column

I would like to populate the 'Indicator' column based on both charge columns. If 'Charge1' is within plus or minus 5% of the 'Charge2' value, set the 'Indicator' to RFP, otherwise leave it blank (see example below).
ID Charge1 Charge2 Indicator
1 9.5 10 RFP
2 22 20
3 41 40 RFP
4 65 80
5 160 160 RFP
6 315 320 RFP
7 613 640 RFP
8 800 700
9 759 800
10 1480 1500 RFP
I tried using a .loc approach, but struggled to establish if 'Charge1' was within +/- 5% of 'Charge2'.
In [190]: df.loc[df.eval("Charge2*0.95 <= Charge1 <= Charge2*1.05"), 'RFP'] = 'RFP'
In [191]: df
Out[191]:
ID Charge1 Charge2 RFP
0 1 9.5 10 RFP
1 2 22.0 20 NaN
2 3 41.0 40 RFP
3 4 65.0 80 NaN
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 NaN
8 9 759.0 800 NaN
9 10 1480.0 1500 RFP
Pretty simple: create an 'indicator' Series of booleans that depends on the percentage difference between Charge1 and Charge2.
df = pd.read_clipboard()
threshold = 0.05
indicator = ( (df['Charge1'] / df['Charge2']) - 1).abs() <= threshold
df.loc[indicator]
Set a threshold figure and compare the values against it.
Wherever the value is within the threshold the mask is True, so you can use the indicator (a boolean Series) directly as input to .loc.
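To actually write the requested column from that mask, a small follow-up sketch using the indicator Series defined above:
# blank by default, 'RFP' where Charge1 is within 5% of Charge2
df['Indicator'] = ''
df.loc[indicator, 'Indicator'] = 'RFP'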
Try
cond = ((df['Charge2'] - df['Charge1'])/df['Charge2']*100).abs() <= 5
df['Indicator'] = np.where(cond, 'RFP', np.nan)
ID Charge1 Charge2 Indicator
0 1 9.5 10 RFP
1 2 22.0 20 nan
2 3 41.0 40 RFP
3 4 65.0 80 nan
4 5 160.0 160 RFP
5 6 315.0 320 RFP
6 7 613.0 640 RFP
7 8 800.0 700 nan
8 9 759.0 800 nan
9 10 1480.0 1500 RFP
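Note that np.where builds a string array here, so the np.nan fill value is cast to the literal string 'nan', which is what the output above shows. If a genuinely blank Indicator is wanted, an empty string works as the fill value; a small variant of the same idea:
import numpy as np

cond = ((df['Charge2'] - df['Charge1']) / df['Charge2'] * 100).abs() <= 5
df['Indicator'] = np.where(cond, 'RFP', '')  # '' instead of the string 'nan'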
You can use pct_change:
df[['Charge2','Charge1']].T.pct_change().dropna().T.abs().mul(100).astype(int)<=(5)
Out[245]:
Charge1
0 True
1 False
2 True
3 False
4 True
5 True
6 True
7 False
8 True
9 True
Be very careful!
In Python floating-point arithmetic, 9.5/10 - 1 == -0.050000000000000044
This is one way to explicitly account for this issue via numpy.
import numpy as np
vals = np.abs(df.Charge1.values / df.Charge2.values - 1)
cond1 = vals <= 0.05
cond2 = np.isclose(vals, 0.05, atol=1e-08)
df['Indicator'] = np.where(cond1 | cond2, 'RFP', '')
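The first row of the example is exactly the edge case this guards against; a quick check with the same 0.05 threshold:
import numpy as np

x = abs(9.5 / 10 - 1)        # 0.050000000000000044
print(x <= 0.05)             # False: the plain comparison misses the boundary row
print(np.isclose(x, 0.05))   # True: the tolerance check recovers it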

Problems with combining columns from dataframes in pandas

I have two dataframes that I'm trying to merge.
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 NaN NaN
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 NaN NaN
5 313 3 NaN NaN
...
df2
code scale R1 R2...
0 121 2 30 20
3 313 2 15 10
...
I need to copy the values from df2 into df1 where the code and scale columns are equal.
The result should look like this:
df1
code scale R1 R2...
0 121 1 80 110
1 121 2 30 20
2 121 3 NaN NaN
3 313 1 60 60
4 313 2 15 10
5 313 3 NaN NaN
...
The problem is that there can be many columns like R1 and R2, so I cannot check each one separately. I wanted to use something from this instruction, but nothing gives me the desired result. I'm doing something wrong, but I can't understand what. I really need advice.
What do you want to happen if the two dataframes both have values for R1/R2? If you want to keep df1's values, you could do:
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
To keep df2's values, just do the fillna the other way round. To combine them in some other way, please clarify the question!
Try this:
pd.concat([df1, df2], axis=0).sort_values(['code', 'scale']).drop_duplicates(['code', 'scale'], keep='last')
Out[21]:
code scale R1 R2
0 121 1 80.0 110.0
0 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
3 313 2 15.0 10.0
5 313 3 NaN NaN
This is a good situation for combine_first. It replaces the nulls in the calling dataframe with values from the passed dataframe.
df1.set_index(['code', 'scale']).combine_first(df2.set_index(['code', 'scale'])).reset_index()
code scale R1 R2
0 121 1 80.0 110.0
1 121 2 30.0 20.0
2 121 3 NaN NaN
3 313 1 60.0 60.0
4 313 2 15.0 10.0
5 313 3 NaN NaN
Other solutions
with fillna:
df1.set_index(['code', 'scale']).fillna(df2.set_index(['code', 'scale'])).reset_index()
with add (a bit faster):
df1.set_index(['code', 'scale']).add(df2.set_index(['code', 'scale']), fill_value=0)
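Note that none of these set_index / fillna / combine_first chains modify df1 in place; assign the result back if df1 should hold the merged data, for example:
# reassign, since the chained calls return a new DataFrame
df1 = (df1.set_index(['code', 'scale'])
          .combine_first(df2.set_index(['code', 'scale']))
          .reset_index())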
