The DataFrame has two columns A and B of integers.
a b
1 3
4 2
2 0
6 1
...
I need to swap in the following way:
if df.a > df.b:
    temp = df.b
    df.b = df.a
    df.a = temp
expected output:
a b
1 3
2 4 <----
0 2 <----
1 6 <----
Basically, column a should always hold the smaller of the two values.
I feel I should use .loc, but I couldn't find the right way to do it yet.
In [443]: df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
In [444]: df
Out[444]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
or
pd.DataFrame(np.sort(df.values, axis=1), df.index, df.columns)
Using np.where you can do
In [21]: df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
In [23]: df
Out[23]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or, using .loc
In [35]: cond = df.a > df.b
In [36]: df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
In [37]: df
Out[37]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Or, use .apply(np.sort, axis=1) if you want the smaller values in a and the larger ones in b
In [54]: df.apply(np.sort, axis=1)
Out[54]:
a b
0 1 3
1 2 4
2 0 2
3 1 6
Seeing the methods proposed by #JohnGait and #MaxU, I did a small speed comparison.
import time
import numpy as np
import pandas as pd

arr = np.random.randint(low = 100, size = (10000000, 2))
# using np.where
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_0 = time.time()
df.a, df.b = np.where(df.a > df.b, [df.b, df.a], [df.a, df.b])
t_1 = time.time()
# using df.loc
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_2 = time.time()
cond = df.a > df.b
df.loc[cond, ['a', 'b']] = df.loc[cond, ['b', 'a']].values
t_3 = time.time()
# using df.min
df = pd.DataFrame(arr, columns = ['a', 'b'])
t_4 = time.time()
df['a'], df['b'] = df.min(axis=1), df.max(axis=1)
t_5 = time.time()
# using np.sort
t_6 = time.time()
df_ = pd.DataFrame(np.sort(arr, axis=1), df.index, df.columns)
t_7 = time.time()
t_1 - t_0 # using np.where: 5.759037971496582
t_3 - t_2 # using .loc: 0.12156987190246582
t_5 - t_4 # using df.min: 1.0503261089324951
t_7 - t_6 # using np.sort: 0.20351791381835938
Although the second approach is the fastest, the practical gain is insignificant; I am adding it here for pedantic reasons. I didn't include the .apply(np.sort, axis=1) method as I am convinced it would be a lot slower.
EDIT
I had wrongly reported the computation time of np.where due to a mistake I made. I have corrected that (it turns out it's the slowest of the lot!) and added another method (following #MaxU's comment).
Solution
It's as simple as
df.values.sort(1)
df
a b
0 1 3
1 2 4
2 0 2
3 1 6
What Happened
I can sort a numpy.array in place with its sort method. I pass 1 as the axis argument to indicate that I want to sort along axis 1, i.e. within each row. The values attribute of a dataframe accesses the underlying numpy array, so df.values.sort(1) sorts the underlying values in place, row-wise... done.
We can be a bit more explicit with
df.values[:] = np.sort(df.values, 1)
This gives us a lot of flexibility, for example to perform the sort over a subset of columns or to reverse the sort order:
df.values[:, ::-1] = np.sort(df.values, 1)
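For example, to run the row-wise sort over just a subset of columns, slice the same columns on both sides of the assignment. A minimal sketch (the extra column c is mine, added purely for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 2, 6], 'b': [3, 2, 0, 1], 'c': [9, 8, 7, 6]})

# sort a and b within each row, leaving c untouched
df[['a', 'b']] = np.sort(df[['a', 'b']].to_numpy(), axis=1)
print(df)
Reassigning the result of np.sort, as above, also avoids relying on df.values being a writable view of the underlying data.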
This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in the NaN values with the mean value within each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
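To make the alignment concrete, here is a minimal sketch (reusing the question's data) showing that transform('mean') returns a Series indexed like the original frame, which is why it can be fed straight into fillna:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3]})

# transform('mean') broadcasts each group's mean back to the original index
group_means = df.groupby('name')['value'].transform('mean')
print(group_means)  # one value per original row, aligned to df.index

df['value'] = df['value'].fillna(group_means)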
#DSM has IMO the right answer, but I'd like to share my generalization and optimization of it: multiple group-by columns and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the column selection at the end, but then you would run the transformation for all columns only to throw out all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
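For reference, a plausible generate_data for re-running this comparison; this is my reconstruction from the setup shown above, not the author's original function:
import numpy as np
import pandas as pd

def generate_data():
    # the small frame from above, concatenated 10,000 times as in the loop shown earlier
    small = pd.DataFrame(
        {
            'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
            'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
            'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
            'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
        }
    )
    return pd.concat([small] * 10000)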
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value'] = df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
...                    'class': list('ppqqrrsss')})
>>> df['value'] = df.groupby(['name', 'class'])['value'].apply(lambda x: x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured, highly ranked answer only works for a pandas DataFrame with only two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above concerning the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
I only ran the next solution on a subset (the first 10,000 rows), since it was taking too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up sharply (which makes sense, as I was computing the median).
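A minimal sketch of one way to guard against that, restricting continuous_variables to numeric columns up front (the toy df_merged below, with its made-up session_length and country columns, is mine; only the df_merged and domain_userid names come from the text above):
import numpy as np
import pandas as pd

# toy stand-in for df_merged: one id column, one numeric column with gaps, one text column
df_merged = pd.DataFrame({
    'domain_userid': ['u1', 'u1', 'u2', 'u2', 'u2'],
    'session_length': [10.0, np.nan, 3.0, np.nan, 5.0],
    'country': ['DE', 'DE', 'US', 'US', 'US'],
})

# keep only numeric columns before computing group medians
continuous_variables = df_merged.select_dtypes(include='number').columns

df_merged[continuous_variables] = df_merged[continuous_variables].fillna(
    df_merged.groupby('domain_userid')[continuous_variables].transform('median')
)
print(df_merged)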
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of the apply/lambda answers here.
Generally speaking, from a timing point of view, that is the second-worst thing to do after iterating over rows.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity for apply/lambda based solutions made me doubt my instinct), and it is indeed about 2.5x faster than the most upvoted solutions.
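A minimal sketch of how such a comparison could be run (this is not the author's exact benchmark; the toy data is just the question's frame tiled to a measurable size):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'] * 1000,
                   'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3] * 1000})

# vectorized: one group-wise mean, then a single fillna
t_transform = timeit.timeit(
    lambda: df['value'].fillna(df.groupby('name')['value'].transform('mean')),
    number=100)

# apply/lambda: fillna runs once per group inside Python
t_apply = timeit.timeit(
    lambda: df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean())),
    number=100)

print(t_transform, t_apply)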
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df['value'] = df['value'].fillna(df['name'].map(df.groupby('name')['value'].mean()))
You can also call apply on the grouped column, e.g. df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean())).
I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if/else cases I do np.where(df.A > df.B, 1, -1). Does pandas provide a special syntax for solving my problem in one step (without the need to create three new columns and then combine the results)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized, so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
This is easy to solve using indexing. The first line of code reads: if column A is equal to column B, then create column C and set it to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Let's say the above is your original dataframe and you want to add a new column 'elderly'.
If age is greater than or equal to 50 we mark the row as elderly ("yes"), otherwise not ("no").
Step 1: get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
The same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
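A self-contained sketch of the same idea (the age values below are made up for illustration; the original question's frame has no age column):
import pandas as pd

df = pd.DataFrame({'age': [25, 61, 47, 73]})

df.loc[df[df['age'] >= 50].index, 'elderly'] = "yes"
df.loc[df[df['age'] < 50].index, 'elderly'] = "no"
print(df)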
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
I have a Pandas DataFrame, df:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
and a dict, mask:
mask = {1:32,2:64,3:100,4:200}
I want my end result to be a DataFrame like this:
A B C
1 1 32
2 2 64
2 3 96
4 4 400
nan 5 nan
Right now I am doing this, which seems inefficient:
for idx, row in df.iterrows():
    if not math.isnan(row['A']):
        if row['A'] != 1:
            df.loc[idx, 'C'] = row['B'] * mask[row['A'] - 1]
        else:
            df.loc[idx, 'C'] = row['B'] * mask[row['A']]
Is there an easy way to vectorize this?
This should work:
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)
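For completeness, a self-contained version of this one-liner with the question's data; the comments spell out why the boolean subtraction gives the right dictionary key (a sketch of the same idea, not a different method):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 4, np.nan], 'B': [1, 2, 3, 4, 5]})
mask = {1: 32, 2: 64, 3: 100, 4: 200}

# (df.A != 1) is 0 where A == 1 and 1 elsewhere, so the lookup key is
# A when A == 1 and A - 1 otherwise; a NaN key simply maps to NaN
df['C'] = df.B * (df.A - (df.A != 1)).map(mask)
print(df)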
Timing
10,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(2000)])
100,000 rows
# Initialize each run with
df = pd.DataFrame({'A':[1,2,2,4,np.nan],'B':[1,2,3,4,5]})
df = pd.concat([df for _ in range(20000)])
Here is an option using apply and the dictionary get method, which returns None if the key is not in the dictionary:
df['C'] = df.apply(lambda r: mask.get(r.A) if r.A == 1 else mask.get(r.A - 1), axis = 1) * df.B
df
# A B C
#0 1 1 32
#1 2 2 64
#2 2 3 96
#3 4 4 400
#4 NaN 5 NaN
I am trying to make the last two rows of my dataframe df its first two rows, with the previous first row becoming the third row after the shift. This is because I just added the rows [3, 0.3232, 0, 0, 2, 0.500] and [6, 0.3232, 0, 0, 2, 0.500]; however, these get added to the end of df and therefore become the last two rows, when I want them to be the first two. I was wondering how to do this.
df = df.T
df[0] = [3,0.3232, 0, 0, 2,0.500]
df[1] = [6,0.3232, 0, 0, 2,0.500]
df = df.T
df = df.reset_index()
You can just call reindex and pass the new desired order:
In [14]:
df = pd.DataFrame({'a':['a','b','c']})
df
Out[14]:
a
0 a
1 b
2 c
In [16]:
df.reindex([1,2,0])
Out[16]:
a
1 b
2 c
0 a
EDIT
Another method would be to use np.roll. Note that this returns a np.array, so we have to explicitly select the columns from the df to overwrite them:
In [30]:
df = pd.DataFrame({'a':['a','b','c'], 'b':np.arange(3)})
df
Out[30]:
a b
0 a 0
1 b 1
2 c 2
In [42]:
df[df.columns] = np.roll(df, shift=-1, axis=0)
df
Out[42]:
a b
0 b 1
1 c 2
2 a 0
The axis=0 param seems to be necessary; otherwise the column order is not preserved:
In [44]:
df[df.columns] = np.roll(df, shift=-1)
df
Out[44]:
a b
0 0 b
1 1 c
2 2 a
Unless I'm missing something, the easiest solution is just to add the new rows to the beginning in the first place:
existing_rows = pd.DataFrame( np.random.randn(4,3) )
new_rows = pd.DataFrame( np.random.randn(2,3) )
new_rows.append( existing_rows )
0 1 2
0 0.406690 -0.699925 0.449278
1 1.729282 0.387896 0.652381
0 0.091711 1.634247 0.749282
1 1.354132 -0.180248 -1.880638
2 -0.151871 -1.266152 0.333071
3 1.351072 -0.421404 -0.951583
If you really want to switch rows you can do as EdChum suggests. Another way is like this:
df.iloc[-2:].append( df.iloc[:-2] )
I think this is slightly simpler than np.roll as suggested by EdChum, but numpy is generally faster so I'd use np.roll if you care about speed. (And doing some quick tests on 1,000x3 data suggests it is about 3x to 4x faster than append.)
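Note that DataFrame.append was deprecated and later removed in newer pandas releases; a minimal sketch of the same reordering with pd.concat (toy data, not the question's frame):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 3))

# move the last two rows to the top, keeping everything else in order
df = pd.concat([df.iloc[-2:], df.iloc[:-2]])
print(df)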