Python Pandas - difference between 'loc' and 'where'?

Just curious about the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Run Distance': [234, 35, 77, 787, 243, 5435, 775, 123, 355, 123],
                   'Goals': [12, 23, 56, 7, 8, 0, 4, 2, 1, 34],
                   'Gender': ['m', 'm', 'm', 'f', 'f', 'm', 'f', 'm', 'f', 'm']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 and turns everything else into NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?

Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from NumPy. It runs over an array and checks whether each element fits a condition, so it gives you back the entire array, with either the original value or NaN. A nice feature of where is that you can also substitute something else, e.g. df2 = df.where(df['Goals']>10, other=0), to replace the values that don't meet the condition with 0:
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses their integer positions. So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
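For comparison, a minimal sketch (assuming the df built in the question, with columns in the order ID, Run Distance, Goals, Gender) of the positional equivalent using iloc:
# positional counterpart of df.loc[0:1, ['Gender', 'Goals']];
# note that iloc's 0:2 excludes the stop position, while loc's 0:1 includes both labels
df.iloc[0:2, [3, 2]]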

If you check the docs for DataFrame.where, it replaces the values where the condition is False - with NaN by default, but it is possible to specify another value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax, boolean indexing, is for filtering rows - it keeps only the rows that match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
If you use loc, it is also possible to filter rows by a condition and columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m

loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (with NaN by default).
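To tie the two together, a small illustrative sketch (assuming the df from the question): dropping the NaN rows that where produces gets you back to what loc returns, and mask is the complement of where.
# keeps the shape; rows that fail the condition become NaN
masked = df.where(df['Goals'] > 10)
# dropping the all-NaN rows yields the same rows as df.loc[df['Goals'] > 10],
# though integer columns are upcast to float along the way
same_rows = masked.dropna()
# mask is the inverse: it blanks out the rows that DO match the condition
inverse = df.mask(df['Goals'] > 10)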

Related

Modify column B values based on the range of column A

Following is my test dataframe:
df.head(25)
freq pow group avg
0 0.000000 10.716615 0 NaN
1 0.022888 -9.757687 0 NaN
2 0.045776 -9.203844 0 NaN
3 0.068665 -8.746512 0 NaN
4 0.091553 -8.725540 0 NaN
...
95 2.174377 -12.697743 0 NaN
96 2.197266 -7.398328 0 NaN
97 2.220154 -23.002036 0 NaN
98 2.243042 -22.591483 0 NaN
99 2.265930 -13.686127 0 NaN
I am trying to assign values from 1-24 in the group column based on ranges of values in the freq column.
For example, using df.loc[(df['freq'] >= 0.1) & (df['freq'] <= 0.2)] yields the following:
freq pow group avg
5 0.114441 -8.620905 0 NaN
6 0.137329 -10.633629 0 NaN
7 0.160217 -9.098974 0 NaN
8 0.183105 -9.381907 0 NaN
So, if I select any particular range as shown above, I would want to change the values in group column from 0 to 1 as shown below.
freq pow group avg
5 0.114441 -8.620905 1 NaN
6 0.137329 -10.633629 1 NaN
7 0.160217 -9.098974 1 NaN
8 0.183105 -9.381907 1 NaN
Similarly, I want to change more values in group column from 0 to anything from 1-24 depending on the range I provide.
For example, df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2)] results in
freq pow group avg
44 1.007080 -13.826930 0 NaN
45 1.029968 -12.892703 0 NaN
46 1.052856 -14.353349 0 NaN
47 1.075745 -15.389498 0 NaN
48 1.098633 -12.519916 0 NaN
49 1.121521 -15.118952 0 NaN
50 1.144409 -15.986558 0 NaN
51 1.167297 -13.262798 0 NaN
52 1.190186 -12.629713 0 NaN
But I want to change the value of group column for this range from 0 to 2
freq pow group avg
44 1.007080 -13.826930 2 NaN
45 1.029968 -12.892703 2 NaN
46 1.052856 -14.353349 2 NaN
47 1.075745 -15.389498 2 NaN
48 1.098633 -12.519916 2 NaN
49 1.121521 -15.118952 2 NaN
50 1.144409 -15.986558 2 NaN
51 1.167297 -13.262798 2 NaN
52 1.190186 -12.629713 2 NaN
Once I have allocated custom group values across the entire table, I would need to calculate the mean of the pow values for the respective ranges and fill in the avg column accordingly.
I cannot seem to figure out how to modify multiple values in the group column to a single value for a particular range of the freq column. Any help would be greatly appreciated. Thank you in advance.
You can use:
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'group'] = 2
and in the same way for the mean:
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'avg'] = df[(df['freq'] >= 1) & (df['freq'] <= 1.2)]['pow'].mean()
You can loop these commands, changing the ranges of df['freq'] and the integer you assign (1-24, as mentioned in the question).
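As a rough sketch of that loop (the ranges below are made-up placeholders, not the actual 24 bands from the question):
# hypothetical (low, high) bands; group numbers are assigned from their position in the list
ranges = [(0.1, 0.2), (1.0, 1.2)]
for group_id, (low, high) in enumerate(ranges, start=1):
    in_range = (df['freq'] >= low) & (df['freq'] <= high)
    df.loc[in_range, 'group'] = group_id
    df.loc[in_range, 'avg'] = df.loc[in_range, 'pow'].mean()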

I want to change the column value for a specific index

df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this DataFrame at the following indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update, which works in place:
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
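An alternative sketch using plain label-based assignment instead of update (assuming every index label in dff also exists in df):
# values align on dff's index labels (1 and 2 here); other rows are untouched
df.loc[dff.index, 'HE'] = dff['HE']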

Is there a way to apply a function on a MultiIndex column?

I have a dataframe that looks like this:
id sex isActive score
0 1 M 1 10
1 2 F 0 20
2 2 F 1 30
3 2 M 0 40
4 3 M 1 50
I want to pivot the dataframe on the index id and the columns sex and isActive (the value should be score). I want each id's score to be expressed as a fraction of that id's total score within each sex group.
In the end, my dataframe should look like this:
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
I tried pivoting first:
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='score')
print(p)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 10.0
2 20.0 30.0 40.0 NaN
3 NaN NaN NaN 50.0
Then, I summed up the scores for each group:
row_sum = p.sum(axis=1, level=[0])
print(row_sum)
sex F M
id
1 0.0 10.0
2 50.0 40.0
3 0.0 50.0
This is where I'm getting stuck. I'm trying to use DataFrame.apply to perform a column-wise division by the second dataframe, but I keep getting errors with attempts like this:
p.apply(lambda col: col/row_sum)
I may be overthinking this problem. Is there some better approach out there?
I think a simple division of p by row_sum works:
print (p/row_sum)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
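As an alternative sketch (assuming the pivoted frame p from above), you can build the per-sex totals with the same MultiIndex columns as p and divide element-wise, which makes the alignment explicit:
# totals has exactly p's columns: each cell holds that row's sum within its sex group
totals = p.T.groupby(level='sex').transform('sum').T
result = p / totals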

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff inside groupby.apply:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
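For completeness, a small sketch (assuming the df from the question) that skips apply entirely by calling diff on the grouped columns directly:
# diff restarts within each ID_col group, so the first row of each group stays NaN
df['rate_of_change'] = (
    df.groupby('ID_col')['data_col'].diff()
    / df.groupby('ID_col')['time_in_hours'].diff().abs()
)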

Best way to join / merge by range in pandas

I frequently use pandas to merge (join) using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way I have found in pandas (without using loops) is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out the unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had was to apply, for each value x of A, a search on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The "easiest" way is to use NumPy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
import numpy as np

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
I am not sure that it is more efficient, but you can use SQL directly (via the sqlite3 module, for instance) together with pandas (inspired by this question):
import sqlite3

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed for your application.
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
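A minimal sketch of what that could look like with pandasql, assuming it is installed (pip install pandasql) and using the A/B schema from the question:
from pandasql import sqldf

query = """
SELECT *
FROM A JOIN B
  ON A.A_value BETWEEN B.B_low AND B.B_high
"""
# sqldf evaluates the query against the DataFrames found in the passed namespace
result = sqldf(query, locals())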
conditional_join from pyjanitor may be helpful for the abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
Inner join:
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
Left join:
A.conditional_join(
    B,
    ('A_value', 'B_low', '>='),
    ('A_value', 'B_high', '<='),
    how='left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df = pd.DataFrame([2, 3, 4, 5, 6], columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second DataFrame:
df2 = pd.DataFrame([1, 6, 2, 3, 5], columns=['B_low'])
df2['B_high'] = [2, 8, 4, 6, 6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with A value 5:
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
I know this is an old question, but for newcomers there is now the pandas.merge_asof function, which performs a join based on the closest match.
In case you want a merge such that a column of one DataFrame (df_right) falls between two columns of another DataFrame (df_left), you can do the following:
df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find the matches from the right DataFrame that are closest to, but not smaller than, the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see, the candidate match at index 2 is wrong, as 16 is not between 10 and 15.
Then, find the matches from the right DataFrame that are closest to, but not larger than, the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the matches where the two candidates are the same, meaning that the value of the right DataFrame is between the values of the two columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
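As a follow-up sketch: if you only want the rows that actually fall inside an interval, you can simply filter on the agreement of the two candidates instead of building the match column.
# keep only the rows where both asof matches point at the same right-hand value
inner = merged[merged["candidate_match_1"] == merged["candidate_match_2"]]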
