Pandas dataframe randomly shuffle some column values in groups - python

I would like to shuffle some column values but only within a certain group and only a certain percentage of rows within the group. For example, per group, I want to shuffle n% of values in column b with each other.
df = pd.DataFrame({'grouper_col':[1,1,2,3,3,3,3,4,4], 'b':[12, 13, 16, 21, 14, 11, 12, 13, 15]})
grouper_col b
0 1 12
1 1 13
2 2 16
3 3 21
4 3 14
5 3 11
6 3 12
7 4 13
8 4 15
Example output:
grouper_col b
0 1 13
1 1 12
2 2 16
3 3 21
4 3 11
5 3 14
6 3 12
7 4 15
8 4 13
I found
df.groupby("grouper_col")["b"].transform(np.random.permutation)
but then I have no control over the percentage of shuffled values.
Thank you for any hints!

You can use numpy to create a function like this (it takes a numpy array for input)
import numpy as np
def shuffle_portion(arr, percentage):
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    return arr
np.random.choice picks a random set of indexes of the size you need. The values at those positions are then rearranged among themselves in the shuffled order. With the example data, this should shuffle 3 of the 9 values in column 'b':
df['b'] = shuffle_portion(df['b'].values, 33)
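To see what the assignment arr[np.sort(shuf)] = arr[shuf] does, here is a small deterministic sketch. The array values and the picked positions are made up for illustration; in the function above the positions come from np.random.choice.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Pretend np.random.choice returned these three positions, in this order.
shuf = np.array([3, 0, 4])

# The right-hand side is evaluated first, giving the values 40, 10, 50.
# They are then written into the sorted positions 0, 3, 4.
arr[np.sort(shuf)] = arr[shuf]

print(arr)  # [40 20 30 10 50]
```

Positions 1 and 2 were not picked, so they keep their values; only the picked positions trade values with each other.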
EDIT:
To use with apply, you need to convert the passed dataframe to an array inside the function (explained in the comments) and create the return dataframe as well
def shuffle_portion(_df, percentage=50):
    arr = _df['b'].values
    shuf = np.random.choice(np.arange(arr.shape[0]),
                            round(arr.shape[0]*percentage/100),
                            replace=False)
    arr[np.sort(shuf)] = arr[shuf]
    _df['b'] = arr
    return _df
Now you can just do
df.groupby("grouper_col", as_index=False).apply(shuffle_portion)
It would be better practice if you pass the name of the column which you need to shuffle, to the function (def shuffle_portion(_df, col='b', percentage=50): arr = _df[col].values ...)
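As a possible variant (not the code from this answer), the same idea can be sketched with groupby(...).transform, so that the column to shuffle is chosen at the call site. The helper name shuffle_within_groups, the seeded generator, and the choice to work on a copy are assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def shuffle_portion(values, percentage, rng):
    """Return a copy of `values` with about `percentage` percent of entries shuffled."""
    arr = np.array(values, copy=True)            # leave the caller's data untouched
    n_pick = round(len(arr) * percentage / 100)  # note: round() rounds halves to even
    picked = rng.choice(len(arr), size=n_pick, replace=False)
    arr[np.sort(picked)] = arr[picked]
    return arr

def shuffle_within_groups(df, group_col, col, percentage, rng):
    """Hypothetical wrapper: shuffle `col` separately inside each group of `group_col`."""
    return df.groupby(group_col)[col].transform(
        lambda s: shuffle_portion(s.to_numpy(), percentage, rng)
    )

df = pd.DataFrame({'grouper_col': [1, 1, 2, 3, 3, 3, 3, 4, 4],
                   'b': [12, 13, 16, 21, 14, 11, 12, 13, 15]})

rng = np.random.default_rng(0)  # seeded only to make the run repeatable
shuffled = shuffle_within_groups(df, 'grouper_col', 'b', 50, rng)
print(df.assign(b_shuffled=shuffled))
```

Keep in mind that picking a single position in a group cannot change anything, so with 50% a group of two rows stays as it is; at least two picked positions are needed before values can move.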

Related

Pandas: Apply a function to exploded values

Assume I have the following dataframe:
df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
a  b                c
1  [2]              [3]
8  [10,11,12,13]    [6,7,8,9]
Columns b and c can consist of 1 to n elements, but in a row they will always have the same number of elements.
I am looking for a way to explode the columns b and c, while applying a function to the whole row, here for example in order to divide the column a by the number of elements in b and c (I know this is a stupid example as this would be easily solvable by first dividing, then exploding. The real use case is a bit more complicated but of no importance here.)
So the result would look something like this:
a  b   c
1  2   3
2  10  6
2  11  7
2  12  8
2  13  9
I tried using the apply-method like in the following snippet, but this only produced garbage and does not work, when the number of elements in the list does not fit the number of columns:
def fun(row):
    if isinstance(row.c, list):
        result = [[row.a, row.b, c] for c in row.c]
        return result
    return row
df.apply(fun, axis=1)
Panda's explode function also doesn't really fit here for me, because afterwards there is no chance of telling anymore, whether the rows were exploded or not.
Is there an easier way than to iterate through the data-frame, exploding the values and manually building up a new data-frame in the way I need it here?
Thank you already for your help.
Edit:
The real use case is basically a mapping from b+c to a.
So I have another file that looks something like that:
b   c  a
2   3  1
10  6  1
11  7  1
12  8  2
13  9  4
So coming from this example, the result would actually be as follows:
a  b   c
1  2   3
1  10  6
1  11  7
2  12  8
4  13  9
The problem is that there is no 1:1 relation between those two files, as it might seem here.
You say:
Panda's explode function also doesn't really fit here for me, because afterwards there is no chance of telling anymore, whether the rows were exploded or not.
This isn't true. Consider this:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]], columns=list("abc"))
# explode
df2 = df.explode(['b','c'])
As you can see from the print below, the result also explodes the index:
a b c
0 1 2 3
1 8 10 6
1 8 11 7
1 8 12 8
1 8 13 9
So, we can use the index to track how many elements per row got exploded. Try this:
# reset index, add old index as column and create a Series with the frequency
# of each old index value; now populate 'a' with 'a' divided by this Series
df2.reset_index(drop=False, inplace=True)
df2['a'] = df2['a']/df2.loc[:,'index'].map(df2.loc[:,'index'].value_counts())
# drop the old index, now as column
df2.drop('index', axis=1, inplace=True)
Result:
a b c
0 1.0 2 3
1 2.0 10 6
2 2.0 11 7
3 2.0 12 8
4 2.0 13 9
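For the mapping use case described in the question's edit, one possible approach (a sketch, not part of the answer above) is to explode first and then merge on b and c. The lookup frame below is a hypothetical stand-in for the second file, and the column name source_row is made up; exploding several columns at once needs pandas 1.3 or newer.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [8, [10, 11, 12, 13], [6, 7, 8, 9]]],
                  columns=list("abc"))

# Hypothetical stand-in for the second file (the b+c -> a mapping).
lookup = pd.DataFrame({"b": [2, 10, 11, 12, 13],
                       "c": [3, 6, 7, 8, 9],
                       "a": [1, 1, 1, 2, 4]})

exploded = (
    df.explode(["b", "c"])
      .rename_axis("source_row")              # keep the original row label ...
      .reset_index()                          # ... as an ordinary column
      .astype({"b": "int64", "c": "int64"})   # explode leaves object dtype
)

# Replace the old 'a' with the value looked up for each (b, c) pair.
merged = exploded.drop(columns="a").merge(lookup, on=["b", "c"], how="left")
print(merged)
```

The source_row column still tells you which rows came from the same original row, which covers the concern about not being able to tell afterwards whether a row was exploded.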

Pandas: Keep only first occurrence of value in group of consecutive values

I have a dataframe that looks like the following (actually, this is the abstracted result of a calculation):
import pandas as pd
data = {"A":[i for i in range(10)]}
index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame(index=index, data=data)
print(df)
yields:
A
1 0
3 1
4 2
5 3
9 4
10 5
12 6
13 7
15 8
20 9
Now I want to filter the index values to only show the first value in each group of consecutive values, e.g. the following result:
A
1 0
3 1
9 4
12 6
15 8
20 9
Any hints on how to achieve this efficiently?
Use Series.diff, which is not implemented for Index, so convert the index to a Series and compare for not equal to 1:
df = df[df.index.to_series().diff().ne(1)]
print (df)
A
1 0
3 1
9 4
12 6
15 8
20 9
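Broken into steps, the one-liner can be read like this (a sketch with made-up intermediate names steps and is_run_start, using the data from the question):

```python
import pandas as pd

index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame({"A": range(10)}, index=index)

# Gap between each label and the previous one; NaN for the very first label.
steps = df.index.to_series().diff()

# A label starts a new run when the gap is anything other than 1.
# NaN != 1 is True, so the first label always counts as a start.
is_run_start = steps.ne(1)

result = df[is_run_start]
print(result.index.tolist())  # [1, 3, 9, 12, 15, 20]
```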
Try this one:
import numpy as np
df.iloc[np.unique(np.array(index)-np.arange(len(index)), return_index=True)[1]]
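Why this works, as far as I can tell: subtracting the position from each label gives the same number for all labels of one consecutive run, and np.unique with return_index=True reports where each number first appears. The sketch below spells that out; the names run_id and first_pos are made up, and it assumes an integer index sorted in ascending order without duplicates.

```python
import numpy as np
import pandas as pd

index = [1, 3, 4, 5, 9, 10, 12, 13, 15, 20]
df = pd.DataFrame({"A": range(10)}, index=index)

# Label minus position is constant inside a run of consecutive labels.
run_id = np.array(index) - np.arange(len(index))
print(run_id.tolist())     # [1, 2, 2, 2, 5, 5, 6, 6, 7, 11]

# Position of the first occurrence of each distinct run id.
_, first_pos = np.unique(run_id, return_index=True)
print(first_pos.tolist())  # [0, 1, 4, 6, 8, 9]

result = df.iloc[first_pos]
```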
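Try this, which labels each run of consecutive index values and keeps the first row of each run:
runs = df.index.to_series().diff().ne(1).cumsum()
df.groupby(runs.to_numpy()).head(1)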

Retrieving Unknown Column Names from DataFrame.apply

How I can retrieve column names from a call to DataFrame apply without knowing them in advance?
What I'm trying to do is apply a mapping from column names to functions to arbitrary DataFrames. Those functions might return multiple columns. I would like to end up with a DataFrame that contains the original columns as well as the new ones, the amount and names of which I don't know at build-time.
Other solutions here are Series-based. I'd like to do the whole frame at once, if possible.
What am I missing here? Are the columns coming back from apply lost in destructuring unless I know their names? It looks like assign might be useful, but will likely require a lot of boilerplate.
import pandas as pd
def fxn(col):
    return pd.Series(col * 2, name=col.name+'2')
df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})
print(df)
# [Edit:]
# A B
# 0 0 10
# 1 1 9
# 2 2 8
# 3 3 7
# 4 4 6
# 5 5 5
# 6 6 4
# 7 7 3
# 8 8 2
# 9 9 1
df = df.apply(fxn)
print(df)
# [Edit:]
# Observed: columns changed in-place.
# A B
# 0 0 20
# 1 2 18
# 2 4 16
# 3 6 14
# 4 8 12
# 5 10 10
# 6 12 8
# 7 14 6
# 8 16 4
# 9 18 2
df[['A2', 'B2']] = df.apply(fxn)
print(df)
# [Edit: I am doubling column values, so missing something, but the question about the column counts stands.]
# Expected: new columns added. How can I do this at runtime without knowing column names?
# A B A2 B2
# 0 0 40 0 80
# 1 4 36 8 72
# 2 8 32 16 64
# 3 12 28 24 56
# 4 16 24 32 48
# 5 20 20 40 40
# 6 24 16 48 32
# 7 28 12 56 24
# 8 32 8 64 16
# 9 36 4 72 8
You need to concat the result of your function with the original df.
Use pd.concat:
In [8]: x = df.apply(fxn) # Apply function on df and store result separately
In [10]: df = pd.concat([df, x], axis=1) # Concat with original df to get all columns
Rename duplicate column names by adding suffixes:
In [82]: from collections import Counter
In [38]: mylist = df.columns.tolist()
In [41]: d = {a:list(range(1, b+1)) if b>1 else '' for a,b in Counter(mylist).items()}
In [62]: df.columns = [i+str(d[i].pop(0)) if len(d[i]) else i for i in mylist]
In [63]: df
Out[63]:
A1 B1 A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
You can assign directly with:
df[df.columns + '2'] = df.apply(fxn)
Output:
A B A2 B2
0 0 10 0 20
1 1 9 2 18
2 2 8 4 16
3 3 7 6 14
4 4 6 8 12
5 5 5 10 10
6 6 4 12 8
7 7 3 14 6
8 8 2 16 4
9 9 1 18 2
Alternatively, you can build on @MayankPorwal's answer by applying .add_suffix('2') to the output of your apply call:
pd.concat([df, df.apply(fxn).add_suffix('2')], axis=1)
which will return the same output.
In your function, name=col.name+'2' is doing nothing (it's basically returning just col * 2). That's because apply returns the values back to the original column.
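A quick check of that claim, as a sketch on a smaller frame than the one in the question: the name set inside fxn does not show up in the result, which keeps the original column labels.

```python
import pandas as pd

def fxn(col):
    # The name given here is not used for the resulting column labels.
    return pd.Series(col * 2, name=col.name + '2')

df = pd.DataFrame({'A': range(3), 'B': range(3, 0, -1)})
result = df.apply(fxn)

print(list(result.columns))  # ['A', 'B']
print(result['B'].tolist())  # [6, 4, 2]
```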
Anyways, it's possible to take the MayankPorwal approach: pd.concat + managing duplicated columns (make them unique). Another possible way to do that:
# Use pd.concat as mentioned in the first answer from Mayank Porwal
df = pd.concat([df, df.apply(fxn)], axis=1)
# Rename duplicated columns
suffix = (pd.Series(df.columns).groupby(df.columns).cumcount()+1).astype(str)
df.columns = df.columns + suffix.replace('1', '')
which returns the same output and additionally handles further duplicated columns.
Answer on the behalf of OP:
This code does what I wanted:
import pandas as pd
# Simulated business logic: for an input row, return a number of columns
# related to the input, and generate names for them, such that we don't
# know the shape of the output or the names of its columns before the call.
def fxn(row):
    length = row.iloc[0]  # positional access; row[0] is deprecated for labelled rows
    indices = [row.index[0] + str(i) for i in range(0, length)]
    series = pd.Series([i for i in range(0, length)], index=indices)
    return series
# Sample data: 0 to 18, inclusive, counting by 2.
df1 = pd.DataFrame(list(range(0, 20, 2)), columns=['A'])
# Randomize the rows to simulate different input shapes.
df1 = df1.sample(frac=1)
# Apply fxn to rows to get new columns (with expand). Concat to keep inputs.
df1 = pd.concat([df1, df1.apply(fxn, axis=1, result_type='expand')], axis=1)
print(df1)
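For the column-wise version of the original question (a mapping from column names to functions that may each return several columns), a possible sketch is to collect whatever each function returns and concat everything once. The transforms dict and the generated column names are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': range(0, 10), 'B': range(10, 0, -1)})

# Hypothetical mapping: column name -> function returning a DataFrame
# with as many columns (and whatever names) as it likes.
transforms = {
    'A': lambda s: pd.DataFrame({s.name + '_double': s * 2,
                                 s.name + '_square': s ** 2}),
    'B': lambda s: (s + 1).to_frame(s.name + '_plus1'),
}

# Run each function on its column; skip names the frame does not have.
pieces = [func(df[name]) for name, func in transforms.items()
          if name in df.columns]

# One concat keeps the inputs and adds every new column, whatever its name.
result = pd.concat([df, *pieces], axis=1)
print(result.columns.tolist())
```

The new names are read off the returned frames at runtime, so nothing about their number or labels has to be known in advance.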

Creating a dataframe in pandas by multiplying two series together

Say I have two series in pandas, series A and series B. How do I create a dataframe in which all of those values are multiplied together, i.e. with series A down the left hand side and series B along the top. Basically the same concept as this, where series A would be the yellow on the left and series B the yellow along the top, and all the values in between would be filled in by multiplication:
http://www.google.co.uk/imgres?imgurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables/times-table-12x12.gif&imgrefurl=http://www.vaughns-1-pagers.com/computer/multiplication-tables.htm&h=533&w=720&sz=58&tbnid=9B8R_kpUloA4NM:&tbnh=90&tbnw=122&zoom=1&usg=__meqZT9kIAMJ5b8BenRzF0l-CUqY=&docid=j9BT8tUCNtg--M&sa=X&ei=bkBpUpOWOI2p0AWYnIHwBQ&ved=0CE0Q9QEwBg
Sorry, should probably have added that my two series are not the same length. I'm getting an error now that 'matrices are not aligned' so I assume that's the problem.
You can use matrix multiplication with dot, but first you have to convert each Series to a DataFrame (because the dot method on a Series implements the dot product):
>>> B = pd.Series(range(1, 5))
>>> A = pd.Series(range(1, 5))
>>> dfA = pd.DataFrame(A)
>>> dfB = pd.DataFrame(B)
>>> dfA.dot(dfB.T)
0 1 2 3
0 1 2 3 4
1 2 4 6 8
2 3 6 9 12
3 4 8 12 16
You can create a DataFrame from multiplying two series of unequal length by broadcasting each value of the row (or column) with the other series. For example:
> row = pd.Series(np.arange(1, 6), index=np.arange(1, 6))
> col = pd.Series(np.arange(1, 4), index=np.arange(1, 4))
> row.apply(lambda r: r * col)
1 2 3
1 1 2 3
2 2 4 6
3 3 6 9
4 4 8 12
5 5 10 15
First create a DataFrame of 1's. Then broadcast multiply along each axis in turn.
>>> from pandas import Series, DataFrame
>>> s1 = Series([1,2,3,4,5])
>>> s2 = Series([10,20,30])
>>> df = DataFrame(1, index=s1.index, columns=s2.index)
>>> df
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
>>> df.multiply(s1, axis='index') * s2
0 1 2
0 10 20 30
1 20 40 60
2 30 60 90
3 40 80 120
4 50 100 150
You need to use df.multiply in order to specify that the series will line up with the row index. You can use the normal multiplication operator * with s2 because matching on columns is the default way of doing multiplication between a DataFrame and a Series.
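The difference between the two alignments can be checked with a short sketch. The names ones, table and misaligned are made up; the last line shows what happens if the plain * operator is used with s1 as well.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([10, 20, 30])
ones = pd.DataFrame(1, index=s1.index, columns=s2.index)

# s1 is matched against the row index, s2 against the column labels.
table = ones.multiply(s1, axis='index') * s2
print(table.shape)       # (5, 3)

# Plain * matches s1 against the columns instead: labels 3 and 4 exist only
# in s1, so the result gains two columns that are entirely NaN.
misaligned = ones * s1
print(misaligned.shape)  # (5, 5)
```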
So I think this may get you most of the way there if you have two series of different lengths. This seems like a very manual process but I cannot think of another way using pandas or NumPy functions.
>>> a = Series([1, 3, 3, 5, 5])
>>> b = Series([5, 10])
First convert your row values a to a DataFrame, then copy this Series into as many new columns as there are values in your column Series b.
>>> result = DataFrame(a)
>>> for i in range(len(b)):
...     result[i] = a
...
>>> result
0 1
0 1 1
1 3 3
2 3 3
3 5 5
4 5 5
You can then broadcast your Series b over your DataFrame result:
>>> result = result.mul(b)
0 1
0 5 10
1 15 30
2 15 30
3 25 50
4 25 50
In the example I have chosen, you will end up with duplicate index labels because of the values in your initial Series. I would recommend keeping the index labels as unique identifiers; otherwise you will get back more than one row when you select a label that is assigned to several rows. If you must, you can relabel your rows and columns like this:
>>> result.columns = b
>>> result.set_index(a)
5 10
1 5 10
3 15 30
3 15 30
5 25 50
5 25 50
Example of duplicate indexing:
>>> result.loc[3]
5 10
3 15 30
3 15 30
In order to use the DataFrame.dot method, you need to transpose one of the series:
>>> a = pd.Series([1, 2, 3, 4])
>>> b = pd.Series([10, 20, 30])
>>> a.to_frame().dot(b.to_frame().transpose())
0 1 2
0 10 20 30
1 20 40 60
2 30 60 90
3 40 80 120
Also make sure the series have the same name.
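Another option that none of the answers above use is NumPy's outer product, which does not care about the Series names or lengths. A minimal sketch, assuming you want the Series values themselves as row and column labels (as in a times table):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, 4])
b = pd.Series([10, 20, 30])

# np.outer multiplies every element of a with every element of b.
table = pd.DataFrame(np.outer(a.to_numpy(), b.to_numpy()),
                     index=a.to_numpy(),     # values of a down the left side
                     columns=b.to_numpy())   # values of b along the top
print(table)
```

If a or b contains repeated values, the labels will be duplicated as well, so a plain positional index may be the safer choice in that case.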

How to find the top 10 pairs that appear the most in a pandas dataframe in python

I have a pandas dataframe in python with columns 'a', 'b', 'c'. The 'a','b' pairs are unique and repeat multiple times. 'c' is changing all the time. I want to find the 10 pairs 'a','b' that appear the most and put them in a dataframe but don't know how. Any help is appreciated.
I'm not entirely sure I follow you, but assuming you mean you have a DataFrame looking something like
>>> N = 1000
>>> df = pd.DataFrame(np.random.randint(0, 10, (N, 3)), columns="A B C".split())
>>> df.head()
A B C
0 7 4 5
1 5 1 3
2 8 9 8
3 2 3 0
4 2 3 0
and you simply want to count (A, B) pairs, that's easy enough:
>>> df.groupby(["A", "B"]).size().sort_values().iloc[-10:]
A B
6 1 13
1 0 14
4 0 14
7 2 14
1 6 15
8 2 15
1 8 16
2 6 16
6 4 16
7 4 16
dtype: int64
That can be broken down into four parts:
groupby, which groups the data by (A, B) tuples
size, which computes the size of each group
sort_values, which returns the Series sorted by value (it replaces the order method, which has been removed from pandas)
iloc, which lets us select the last 10 entries in the Series by position
That results in a Series, but you could make a DataFrame out of it simply by passing it to pd.DataFrame.
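Put together as a script, the steps above might look like the sketch below. It swaps the sort and iloc steps for Series.nlargest and uses reset_index(name=...) to get a DataFrame directly; the random data and the column name count are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only to make the run repeatable
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 3)), columns=["A", "B", "C"])

top10 = (
    df.groupby(["A", "B"])
      .size()                      # number of rows per (A, B) pair
      .nlargest(10)                # the ten largest counts, biggest first
      .reset_index(name="count")   # back to a DataFrame with columns A, B, count
)
print(top10)
```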
