What's a simple and efficient way to shuffle a dataframe in pandas, by rows or by columns? I.e. how to write a function shuffle(df, n, axis=0) that takes a dataframe, a number of shuffles n, and an axis (axis=0 is rows, axis=1 is columns) and returns a copy of the dataframe that has been shuffled n times.
Edit: key is to do this without destroying the row/column labels of the dataframe. If you just shuffle df.index that loses all that information. I want the resulting df to be the same as the original except with the order of rows or order of columns different.
Edit2: My question was unclear. When I say shuffle the rows, I mean shuffle each column independently. So if you have two columns a and b, I want each column shuffled on its own, so that you don't keep the same associations between a and b that you would if you just re-ordered whole rows. Something like:
for 1...n:
    for each col in df: shuffle column
return new_df
But hopefully more efficient than naive looping. This does not work for me:
def shuffle(df, n, axis=0):
    shuffled_df = df.copy()
    for k in range(n):
        shuffled_df.apply(np.random.shuffle(shuffled_df.values), axis=axis)
    return shuffled_df
df = pandas.DataFrame({'A':range(10), 'B':range(10)})
shuffle(df, 5)
Use numpy's random.permutation function:
In [1]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [2]: df
Out[2]:
A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
In [3]: df.reindex(np.random.permutation(df.index))
Out[3]:
A B
0 0 0
5 5 5
6 6 6
3 3 3
8 8 8
7 7 7
9 9 9
1 1 1
2 2 2
4 4 4
Sampling randomizes, so just sample the entire data frame.
df.sample(frac=1)
As @Corey Levinson notes, you have to be careful when you reassign:
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)
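To see why the reset_index matters, here is a minimal sketch of the pitfall (toy frame, for illustration only). Assigning the sampled Series back without resetting aligns on the index, so nothing changes:
df = pd.DataFrame({'column': [1, 2, 3]})
df['column'] = df['column'].sample(frac=1)  # aligns back on the index: no visible change
df['column'] = df['column'].sample(frac=1).reset_index(drop=True)  # actually shuffles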
In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: df = shuffle(df)
In [19]: df
Out[19]:
A B
0 8 5
1 1 7
2 7 3
3 6 2
4 3 4
5 0 1
6 9 0
7 4 6
8 2 8
9 5 9
You can use sklearn.utils.shuffle() (requires sklearn 0.16.1 or higher to support Pandas data frames):
# Generate data
import pandas as pd
df = pd.DataFrame({'A':range(5), 'B':range(5)})
print('df: {0}'.format(df))
# Shuffle Pandas data frame
import sklearn.utils
df = sklearn.utils.shuffle(df)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
df: A B
1 1 1
0 0 0
3 3 3
4 4 4
2 2 2
Then you can use df.reset_index() to reset the index column, if need be:
df = df.reset_index(drop=True)
print('\n\ndf: {0}'.format(df))
outputs:
df: A B
0 1 1
1 0 0
2 4 4
3 2 2
4 3 3
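For reproducibility, sklearn.utils.shuffle also accepts a random_state argument, so the same shuffle can be repeated on every run:
df = sklearn.utils.shuffle(df, random_state=0)  # deterministic shuffle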
A simple solution in pandas is to use the sample method independently on each column. Use apply to iterate over each column:
df = pd.DataFrame({'a':[1,2,3,4,5,6], 'b':[1,2,3,4,5,6]})
df
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
df.apply(lambda x: x.sample(frac=1).values)
a b
0 4 2
1 1 6
2 6 5
3 5 3
4 2 4
5 3 1
You must use .values so that you return a numpy array and not a Series; otherwise the returned Series will align to the original DataFrame's index, not changing a thing:
df.apply(lambda x: x.sample(frac=1))
a b
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
From the docs, use sample():
In [79]: s = pd.Series([0,1,2,3,4,5])
# When no arguments are passed, returns 1 row.
In [80]: s.sample()
Out[80]:
0 0
dtype: int64
# One may specify either a number of rows:
In [81]: s.sample(n=3)
Out[81]:
5 5
2 2
4 4
dtype: int64
# Or a fraction of the rows:
In [82]: s.sample(frac=0.5)
Out[82]:
5 5
4 4
1 1
dtype: int64
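sample() also accepts a random_state argument, which makes a full shuffle reproducible:
df.sample(frac=1, random_state=42)  # same shuffled order on every run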
I resorted to adapting @root's answer slightly and using the raw values directly. Of course, this means you lose the ability to do fancy indexing, but it works perfectly for just shuffling the data.
In [1]: import numpy
In [2]: import pandas
In [3]: df = pandas.DataFrame({"A": range(10), "B": range(10)})
In [4]: %timeit df.apply(numpy.random.shuffle, axis=0)
1000 loops, best of 3: 406 µs per loop
In [5]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 1):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 22.8 µs per loop
In [6]: %timeit df.apply(numpy.random.shuffle, axis=1)
1000 loops, best of 3: 746 µs per loop
In [7]: %%timeit
   ...: for view in numpy.rollaxis(df.values, 0):
   ...:     numpy.random.shuffle(view)
   ...:
10000 loops, best of 3: 23.4 µs per loop
Note that numpy.rollaxis brings the specified axis to the first dimension and then lets us iterate over arrays with the remaining dimensions, i.e., if we want to shuffle along the first dimension (down each column), we need to roll the second dimension to the front, so that we apply the shuffling to views that are whole columns.
In [8]: numpy.rollaxis(df.values, 0).shape
Out[8]: (10, 2) # we can iterate over 10 arrays with shape (2,) (rows)
In [9]: numpy.rollaxis(df.values, 1).shape
Out[9]: (2, 10) # we can iterate over 2 arrays with shape (10,) (columns)
Your final function then uses a trick to bring the result in line with the expectation for applying a function to an axis:
def shuffle(df, n=1, axis=0):
    df = df.copy()
    axis = int(not axis)  # pandas.DataFrame is always 2D
    for _ in range(n):
        for view in numpy.rollaxis(df.values, axis):
            numpy.random.shuffle(view)
    return df
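On newer numpy (1.20+), Generator.permuted() shuffles each 1-D slice along an axis independently in a single vectorized call. A minimal sketch of the same idea (the function name is illustrative; axis=0 shuffles within each column):
import numpy as np
import pandas as pd

def shuffle_independent(df, axis=0, seed=None):
    # permuted() shuffles each slice along `axis` independently
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.permuted(df.to_numpy(), axis=axis),
                        index=df.index, columns=df.columns)

df = pd.DataFrame({'A': range(10), 'B': range(10)})
shuffled = shuffle_independent(df)  # each column shuffled on its own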
This might be more useful when you want your index shuffled.
import random

def shuffle(df):
    index = list(df.index)
    random.shuffle(index)
    df = df.loc[index]  # .ix is deprecated; .loc selects by label
    df = df.reset_index()  # reset_index is not in-place, so reassign
    return df
It selects a new df using the shuffled index, then resets it.
I know the question is about a pandas df, but in the case where the shuffle occurs by row (column order changed, row order unchanged), the column names no longer matter and it could be interesting to use an np.array instead; np.apply_along_axis() is then what you are looking for.
If that is acceptable then this would be helpful; note it is easy to switch the axis along which the data is shuffled.
If your pandas data frame is named df, you can:
get the values of the dataframe with values = df.values,
create an np.array from values
apply the method shown below to shuffle the np.array by row or column
recreate a new (shuffled) pandas df from the shuffled np.array (a round-trip sketch follows the examples below)
Original array
a = np.array([[10, 11, 12], [20, 21, 22], [30, 31, 32],[40, 41, 42]])
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
Keep row order, shuffle columns within each row
print(np.apply_along_axis(np.random.permutation, 1, a))
[[11 12 10]
[22 21 20]
[31 30 32]
[40 41 42]]
Keep column order, shuffle rows within each column
print(np.apply_along_axis(np.random.permutation, 0, a))
[[40 41 32]
[20 31 42]
[10 11 12]
[30 21 22]]
Original array is unchanged
print(a)
[[10 11 12]
[20 21 22]
[30 31 32]
[40 41 42]]
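Putting the four steps above together, a minimal round-trip sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40], 'b': [11, 21, 31, 41]})
values = df.values  # steps 1-2: raw np.array
shuffled = np.apply_along_axis(np.random.permutation, 0, values)  # step 3: shuffle rows within each column
new_df = pd.DataFrame(shuffled, index=df.index, columns=df.columns)  # step 4: rebuild the frame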
Here is a workaround I found if you want to shuffle only a subset of the DataFrame:
shuffle_to_index = 20
df = pd.concat([df.iloc[np.random.permutation(range(shuffle_to_index))], df.iloc[shuffle_to_index:]])
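A slightly more general form of the same trick, shuffling an arbitrary positional range (the helper name and signature are illustrative):
def shuffle_slice(df, start, stop):
    # permute only positions start..stop, leave everything else in order
    idx = np.arange(len(df))
    idx[start:stop] = np.random.permutation(idx[start:stop])
    return df.iloc[idx]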
Related
I'm using the following code to find rows whose values are all the same across a subset of columns. I would like to increase computation speed, since I have a very big dataframe and want to run the same operation on other column subsets:
dfSPSSstudent[dfSPSSstudent.loc[:,['Q4_1a_1', 'Q4_1a_2', 'Q4_1a_3', 'Q4_1a_4', 'Q4_1a_5', 'Q4_1a_6']].nunique(axis=1) == 1]
What would you recommend? Many thanks for your help.
You can try the following:
import pandas as pd
import numpy as np
#generate sample data
np.random.seed(0)
arr = np.random.randint(0, 5, (10**6, 5))
df = pd.DataFrame(arr, columns=list("abcde"))
print(df.head())
It gives:
a b c d e
0 4 0 3 3 3
1 1 3 2 4 0
2 0 4 2 1 0
3 1 1 0 1 4
4 3 0 3 0 2
Select rows where the columns a, b and c have equal values:
a = df[['a', 'b', 'c']].to_numpy()
ddf = df[np.all(a == a[:, 0].reshape(-1, 1), axis=1)]
print(ddf.head())
It gives:
a b c d e
11 1 1 1 3 3
21 3 3 3 2 3
41 1 1 1 3 2
52 1 1 1 1 2
137 2 2 2 3 2
The above code will omit rows where the columns a, b and c all have the NaN value. In order to include such rows in the results the code can be modified as follows:
a = df[['a', 'b', 'c']].to_numpy()
ddf = df[(np.all(a == a[:, 0].reshape(-1, 1), axis=1)) |
         (df.loc[:, ['a', 'b', 'c']].isna().all(axis=1))]
Timing tests:
the above code: 14.7 ms ± 358 µs
the original code with nunique(): 5 s ± 291 ms
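The speedup comes from a single broadcasted comparison of every column against the first; a tiny illustration:
a = np.array([[1, 1, 1], [1, 2, 1]])
a == a[:, 0].reshape(-1, 1)
# array([[ True,  True,  True],
#        [ True, False,  True]])
np.all(a == a[:, 0].reshape(-1, 1), axis=1)
# array([ True, False])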
I want to slice the data frame by rows or columns using iloc, while wrapping around the out of the bound indices. Here is an example:
import pandas as pd
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]],columns=['a', 'b', 'c'])
# Slice rows 2 to 4, though the dataframe only has 3 rows
print(df.iloc[2:4,:])
Data frame:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
The output will be:
a b c
2 7 8 9
But I want it to wrap around the out-of-bounds indices, like:
a b c
2 7 8 9
0 1 2 3
In numpy, it is possible to use numpy.take with mode='wrap' to wrap out-of-bounds indices when slicing:
import numpy as np
array = np.array([[1,2,3], [4,5,6], [7,8,9]])
print(array.take(range(2,4) , axis = 0, mode='wrap'))
The output is:
[[7 8 9]
[1 2 3]]
A possible solution for wrapping around in pandas is to use numpy.take:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9]],columns=['a', 'b', 'c'])
# Get the integer indices of the dataframe
row_indices = np.arange(df.shape[0])
# Wrap the slice explicitly
wrap_slice = row_indices.take(range(2,4),axis = 0, mode='wrap')
print(df.iloc[wrap_slice, :])
The output will be the output I want:
a b c
2 7 8 9
0 1 2 3
I looked into pandas.DataFrame.take, and there is no "wrap" mode. What is a good and easy way to solve this problem? Thank you very much!
Let's try using np.roll:
df.reindex(np.roll(df.index, shift=-2)[0:2])
Output:
a b c
2 7 8 9
0 1 2 3
And, to make it a little more generic:
startidx = 2
endidx = 4
df.iloc[np.roll(df.index, shift=-1*startidx)[0:endidx-startidx]]
You could use remainder (modulo) division:
import numpy as np
start_id = 2
end_id = 4
idx = np.arange(start_id, end_id, 1)%len(df)
df.iloc[idx]
# a b c
#2 7 8 9
#0 1 2 3
This method actually allows you to loop around multiple times:
start_id = 2
end_id = 10
idx = np.arange(start_id, end_id, 1)%len(df)
df.iloc[idx]
# a b c
#2 7 8 9
#0 1 2 3
#1 4 5 6
#2 7 8 9
#0 1 2 3
#1 4 5 6
#2 7 8 9
#0 1 2 3
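If this pattern comes up often, the modulo trick packages neatly into a small helper (take_wrap is an illustrative name):
def take_wrap(df, start, stop):
    # positional slice with wrap-around, mimicking np.take(..., mode='wrap')
    idx = np.arange(start, stop) % len(df)
    return df.iloc[idx]

take_wrap(df, 2, 4)  # rows 2 and 0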
I have a problem when using Pandas. My task is like this:
df=pd.DataFrame([(1,2,3,4,5,6),(1,2,3,4,5,6),(1,2,3,4,5,6)],columns=['a','b','c','d','e','f'])
Out:
a b c d e f
0 1 2 3 4 5 6
1 1 2 3 4 5 6
2 1 2 3 4 5 6
what I want to do is the output dataframe looks like this:
Out:
s1 s2 s3
0 3 7 11
1 3 7 11
2 3 7 11
That is to say, sum the column pairs (a,b), (c,d), (e,f) separately and rename the resulting columns (s1,s2,s3). Could anyone help solve this problem in Pandas? Thank you so much.
1) Perform a groupby over the columns by supplying axis=1. Per @Boud's comment, you get exactly what you want with a minor tweak in the grouping array:
df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
Grouping gets performed according to this condition:
np.arange(len(df.columns)) // 2
# array([0, 0, 1, 1, 2, 2], dtype=int32)
2) Use np.add.reduceat which is a faster alternative:
df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
df.columns = df.columns + 1
df.add_prefix('s')
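For reference, the index array handed to reduceat marks where each running sum restarts; for the six columns here:
np.arange(len(df.columns))[::2]
# array([0, 2, 4]) -> sums over the column blocks [0:2], [2:4], [4:6]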
Timings:
For a DF of 1 million rows spanning 20 columns:
from string import ascii_lowercase
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 10, (10**6,20)), columns=list(ascii_lowercase[:20]))
df.shape
(1000000, 20)
def with_groupby(df):
return df.groupby((np.arange(len(df.columns)) // 2) + 1, axis=1).sum().add_prefix('s')
def with_reduceat(df):
df = pd.DataFrame(np.add.reduceat(df.values, np.arange(len(df.columns))[::2], axis=1))
df.columns = df.columns + 1
return df.add_prefix('s')
# test whether they give the same o/p
with_groupby(df).equals(with_reduceat(df))
True
%timeit with_groupby(df.copy())
1 loop, best of 3: 1.11 s per loop
%timeit with_reduceat(df.copy()) # <--- (>3X faster)
1 loop, best of 3: 345 ms per loop
Is there an efficient way to delete columns that have at least 20% missing values?
Suppose my dataframe is like:
A B C D
0 sg hh 1 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
After removing the columns, the dataframe becomes like this:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 6
5 y 8
You can use boolean indexing on the columns, keeping those where the count of notnull values is larger than 80% of the length:
df.loc[:, pd.notnull(df).sum()>len(df)*.8]
This is useful in many cases, e.g., keeping only the columns where more than 80% of the values are larger than 1:
df.loc[:, (df > 1).sum() > len(df) * .8]
Alternatively, for the .dropna() case, you can also specify the thresh keyword of .dropna() as illustrated by #EdChum:
df.dropna(thresh=0.8*len(df), axis=1)
The latter will be slightly faster:
df = pd.DataFrame(np.random.random((100, 5)), columns=list('ABCDE'))
for col in df:
df.loc[np.random.choice(list(range(100)), np.random.randint(10, 30)), col] = np.nan
%timeit df.loc[:, pd.notnull(df).sum()>len(df)*.8]
1000 loops, best of 3: 716 µs per loop
%timeit df.dropna(thresh=0.8*len(df), axis=1)
1000 loops, best of 3: 537 µs per loop
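One caveat on thresh: per the pandas docs it is an int (the minimum number of non-NA values required to keep the column), so casting explicitly may be safer on newer versions:
df.dropna(thresh=int(0.8 * len(df)), axis=1)  # keep columns with >= 80% non-NA values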
You can call dropna and pass a thresh value to drop the columns that don't meet your threshold criteria:
In [10]:
frac = len(df) * 0.8
df.dropna(thresh=frac, axis=1)
Out[10]:
A D
0 sg 7
1 gf 9
2 hh 10
3 dd 8
4 NaN 6
5 y 8
Considering the following DataFrames
In [136]:
df = pd.DataFrame({'A':[1,1,2,2],'B':[1,2,1,2],'C':np.arange(10,30,5)}).set_index(['A','B'])
df
Out[136]:
C
A B
1 1 10
2 15
2 1 20
2 25
In [130]:
vals = pd.DataFrame({'A':[1,2],'values':[True,False]}).set_index('A')
vals
Out[130]:
values
A
1 True
2 False
How can I select only the rows of df with corresponding True values in vals?
If I reset_index on both frames I can now merge/join them and slice however I want, but how can I do it using the (multi)indexes?
boolean indexing all the way...
In [65]: df[pd.Series(df.index.get_level_values('A')).isin(vals[vals['values']].index)]
Out[65]:
C
A B
1 1 10
2 15
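As an aside, Index objects have their own isin in more recent pandas, so the pd.Series wrapper above should not be needed; an equivalent sketch:
df[df.index.get_level_values('A').isin(vals[vals['values']].index)]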
Note that you can use xs on a multiindex.
In [66]: df.xs(1)
Out[66]:
C
B
1 10
2 15
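Note that xs can also select by level name, which reads more clearly on deeper MultiIndexes:
df.xs(1, level='A')  # same rows, selected by level name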