Following is my test dataframe:
df.head(25)
freq pow group avg
0 0.000000 10.716615 0 NaN
1 0.022888 -9.757687 0 NaN
2 0.045776 -9.203844 0 NaN
3 0.068665 -8.746512 0 NaN
4 0.091553 -8.725540 0 NaN
...
95 2.174377 -12.697743 0 NaN
96 2.197266 -7.398328 0 NaN
97 2.220154 -23.002036 0 NaN
98 2.243042 -22.591483 0 NaN
99 2.265930 -13.686127 0 NaN
I am trying to assign values from 1 to 24 in the group column based on ranges of values in the freq column.
For example, using df.loc[(df['freq'] >= 0.1) & (df['freq'] <= 0.2)] yields the following:
freq pow group avg
5 0.114441 -8.620905 0 NaN
6 0.137329 -10.633629 0 NaN
7 0.160217 -9.098974 0 NaN
8 0.183105 -9.381907 0 NaN
So, if I select any particular range as shown above, I would want to change the values in group column from 0 to 1 as shown below.
freq pow group avg
5 0.114441 -8.620905 1 NaN
6 0.137329 -10.633629 1 NaN
7 0.160217 -9.098974 1 NaN
8 0.183105 -9.381907 1 NaN
Similarly, I want to change more values in group column from 0 to anything from 1-24 depending on the range I provide.
For example, df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2)] results in
freq pow group avg
44 1.007080 -13.826930 0 NaN
45 1.029968 -12.892703 0 NaN
46 1.052856 -14.353349 0 NaN
47 1.075745 -15.389498 0 NaN
48 1.098633 -12.519916 0 NaN
49 1.121521 -15.118952 0 NaN
50 1.144409 -15.986558 0 NaN
51 1.167297 -13.262798 0 NaN
52 1.190186 -12.629713 0 NaN
But I want to change the value of group column for this range from 0 to 2
freq pow group avg
44 1.007080 -13.826930 2 NaN
45 1.029968 -12.892703 2 NaN
46 1.052856 -14.353349 2 NaN
47 1.075745 -15.389498 2 NaN
48 1.098633 -12.519916 2 NaN
49 1.121521 -15.118952 2 NaN
50 1.144409 -15.986558 2 NaN
51 1.167297 -13.262798 2 NaN
52 1.190186 -12.629713 2 NaN
Once I have filled the entire table with custom group values, I need to calculate the mean of the pow values for each range and set the avg column accordingly.
I cannot figure out how to set multiple values in the group column to a single value for a particular range of the freq column. Any help would be greatly appreciated. Thank you in advance.
You can use a boolean mask with df.loc to assign the group:
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'group'] = 2
and in the same way for the mean (note the parentheses, which actually call mean):
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'avg'] = df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'pow'].mean()
You can loop over these commands, changing the range of df['freq'] and the group integer (1-24, as mentioned in the question) each time.
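The loop suggested above could look like the following sketch. The `ranges` list and its bounds are made-up values for illustration, and the DataFrame is a synthetic stand-in for the one in the question, not the asker's real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "freq": np.arange(100) * 0.022888,
    "pow": rng.normal(-10, 3, size=100),
    "group": 0,
    "avg": np.nan,
})

# Hypothetical (low, high, group) triples; extend up to group 24 as needed.
ranges = [(0.1, 0.2, 1), (1.0, 1.2, 2)]

for low, high, label in ranges:
    mask = df["freq"].between(low, high)  # inclusive on both ends
    df.loc[mask, "group"] = label
    # Mean of pow over the rows in this range, written to every row in it.
    df.loc[mask, "avg"] = df.loc[mask, "pow"].mean()
```

Rows that fall in none of the ranges keep group 0 and a NaN avg.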
df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite the values in this DataFrame at the following indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
df._set_value(i,'HE',dff._get_value(i,'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update, which works in place:
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
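Two properties of DataFrame.update are worth keeping in mind: it returns None rather than a new frame, and NaN values in the other frame do not overwrite existing values. A small sketch with made-up numeric data (not the frames from the question):

```python
import numpy as np
import pandas as pd

# Made-up frames for illustration.
df = pd.DataFrame({"HE": [1.0, 2.0, 3.0, 4.0], "WE": [10.0, 20.0, 30.0, 40.0]})
dff = pd.DataFrame({"HE": [100.0, np.nan]}, index=[1, 2])

# Aligns on index and column labels, then modifies df in place.
result = df.update(dff)

# result is None; df["HE"] is now [1.0, 100.0, 3.0, 4.0]
# (index 2 is untouched because the replacement value is NaN).
```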
I have a dataframe that looks like this:
id sex isActive score
0 1 M 1 10
1 2 F 0 20
2 2 F 1 30
3 2 M 0 40
4 3 M 1 50
I want to pivot the dataframe on the index id and columns sex and isActive (the value should be score). I want each id to have their score be a percentage of their total score associated with the sex group.
In the end, my dataframe should look like this:
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
I tried pivoting first:
p = df.pivot_table(index='id', columns=['sex', 'isActive'], values='score')
print(p)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 10.0
2 20.0 30.0 40.0 NaN
3 NaN NaN NaN 50.0
Then, I summed up the scores for each group:
row_sum = p.sum(axis=1, level=[0])
print(row_sum)
sex F M
id
1 0.0 10.0
2 50.0 40.0
3 0.0 50.0
This is where I'm getting stuck. I'm trying to use DataFrame.apply to perform a column-wise sum based on the second dataframe. However, I keep getting errors following this format:
p.apply(lambda col: col/row_sum)
I may be overthinking this problem. Is there some better approach out there?
I think a simple division of p by row_sum would work:
print (p/row_sum)
sex F M
isActive 0 1 0 1
id
1 NaN NaN NaN 1.0
2 0.4 0.6 1.0 NaN
3 NaN NaN NaN 1.0
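An alternative sketch that avoids aligning a MultiIndex frame against row_sum: compute each score's share of its (id, sex) total on the long-form data first, then pivot. This is a different route from the division above, and the `share` column name is introduced here only for illustration. It also sidesteps sum(level=...), which, as far as I know, newer pandas versions no longer accept.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 2, 3],
    "sex": ["M", "F", "F", "M", "M"],
    "isActive": [1, 0, 1, 0, 1],
    "score": [10, 20, 30, 40, 50],
})

# Share of each score within its (id, sex) total, computed before pivoting.
df["share"] = df["score"] / df.groupby(["id", "sex"])["score"].transform("sum")

p = df.pivot_table(index="id", columns=["sex", "isActive"], values="share")
```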
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff within each group:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
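The same per-group calculation can also be written without apply, by calling diff directly on the grouped columns. A minimal sketch using the data from the question; the result keeps the original row index, so it can be assigned straight back as a column:

```python
import pandas as pd

df = pd.DataFrame({
    "ID_col": [1, 1, 1, 2, 2, 3],
    "time_in_hours": [62.5, 40, 20, 30, 20, 50],
    "data_col": [4, 3, 3, 1, 5, 6],
})

g = df.groupby("ID_col")
# diff() restarts at each group, so the first row of every ID is NaN.
df["rate_of_change"] = g["data_col"].diff() / g["time_in_hours"].diff().abs()
```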
I'm frequently using pandas for merge (join) by using a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, in pandas the only way I have found that avoids loops is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had is to apply, for each value x of A, a search on B using the mask B[(x>=B.B_low) & (x<=B.B_high)], but that sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
import numpy as np
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
A.loc[i, :].reset_index(drop=True),
B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
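matched = pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
# DataFrame.append was removed in pandas 2.0, so concatenate instead
unmatched = A[~np.isin(np.arange(len(A)), np.unique(i))]
pd.concat([matched, unmatched], ignore_index=True, sort=False)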
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
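One caveat with broadcasting is memory: the comparison builds a len(A) x len(B) boolean array, which is impractical for the 2M-row frames mentioned in the question. A possible workaround, sketched below, is to process A in blocks. The function name `range_join_chunked` and the `chunk_size` parameter are hypothetical names introduced here, and the sketch only covers the inner join.

```python
import numpy as np
import pandas as pd

def range_join_chunked(A, B, chunk_size=1000):
    """Inner join where B_low <= A_value <= B_high, one block of A at a time."""
    a = A["A_value"].to_numpy()
    bl = B["B_low"].to_numpy()
    bh = B["B_high"].to_numpy()
    parts = []
    for start in range(0, len(a), chunk_size):
        # Only a (chunk_size x len(B)) boolean array exists at any one time.
        block = a[start:start + chunk_size, None]
        i, j = np.nonzero((block >= bl) & (block <= bh))
        if len(i) == 0:
            continue
        parts.append(pd.concat([
            A.iloc[i + start].reset_index(drop=True),
            B.iloc[j].reset_index(drop=True),
        ], axis=1))
    if not parts:
        # No matches at all: return an empty frame with the joined columns.
        return pd.concat([A.iloc[:0], B.iloc[:0]], axis=1)
    return pd.concat(parts, ignore_index=True)

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))
result = range_join_chunked(A, B, chunk_size=3)
```

The total work is still proportional to len(A) * len(B); chunking only bounds the peak memory.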
I am not sure it is more efficient, but you can use SQL directly (for instance via the sqlite3 module) with pandas (inspired by this question):
import sqlite3
import numpy as np
import pandas as pd
conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 and df1.col1<0.5"
tt = pd.read_sql_query(qry,conn)
You can adapt the query as needed in your application
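As one possible adaptation, here is a sketch of the range join from the question run through an in-memory SQLite database, using the small A and B frames from the setup in the other answer. The table names "A" and "B" are arbitrary choices, and no claim is made about performance on large inputs.

```python
import sqlite3
import pandas as pd

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))

conn = sqlite3.connect(":memory:")
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)

# Inner join on the range condition, as in the SQL from the question.
qry = """
SELECT *
FROM A
JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high
"""
result = pd.read_sql_query(qry, conn)
conn.close()
```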
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
conditional_join from pyjanitor may be helpful as a convenient abstraction:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<='),
how = 'left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df=pd.DataFrame([2,3,4,5,6],columns=['A'])
returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second dataframe:
df2=pd.DataFrame([1,6,2,3,5],columns=['B_low'])
df2['B_high']=[2,8,4,6,6]
results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
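Here we go: we want the output to be the row at index 3, where A is 5. Note that this compares the two frames row by row (by index) rather than joining every row of df against every row of df2.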
df.where(df['A']>=df2['B_low']).where(df['A']<df2['B_high']).dropna()
results in
A
3 5.0
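The same row-by-row filter can be written as a single boolean mask, which also keeps the integer dtype of A because no NaN values are introduced along the way. A minimal sketch with the frames defined above:

```python
import pandas as pd

df = pd.DataFrame([2, 3, 4, 5, 6], columns=["A"])
df2 = pd.DataFrame({"B_low": [1, 6, 2, 3, 5], "B_high": [2, 8, 4, 6, 6]})

# Rows are compared by index label: B_low <= A < B_high.
mask = (df["A"] >= df2["B_low"]) & (df["A"] < df2["B_high"])
result = df[mask]
```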
I know this is an old question but for newcomers there is now the pandas.merge_asof function that performs join based on closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) is between 2 columns of another DataFrame (df_left) you can do the following:
df_left = pd.DataFrame({
"time_from": [1, 4, 10, 21],
"time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
"time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find the values of the right DataFrame that are closest to, and no smaller than, the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
left_on="time_from",
right_on="candidate_match_1",
direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find the values of the right DataFrame that are closest to, and no larger than, the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
left_on="time_to",
right_on="candidate_match_2",
direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the rows where both candidate matches are the same, meaning that the value from the right DataFrame lies between the values of the two columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
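A shorter variant of the same idea, sketched here under the assumption that the intervals in df_left do not overlap: match each time to the nearest time_from at or below it with one backward merge_asof, then drop the rows where the time lies past time_to. This is a different arrangement from the two-step merge above (the right DataFrame drives the merge), and the names `candidates` and `matches` are introduced only for this example.

```python
import pandas as pd

df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27],
})
df_right = pd.DataFrame({"time": [2, 6, 16, 25]})

# merge_asof needs both sides sorted on their keys.
candidates = pd.merge_asof(
    df_right.sort_values("time"),
    df_left.sort_values("time_from"),
    left_on="time",
    right_on="time_from",
    direction="backward",
)

# Keep only times that fall inside the matched interval.
matches = candidates[candidates["time"] <= candidates["time_to"]]
```

With overlapping intervals this would report at most one interval per time, so it is not a general replacement for a full range join.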