I need to conditionally randomly allocate users to groups. The table governing the process is as follows:
   A  B  C
0  9  1  1
1  1  7  8
2  0  2  1
According to the above matrix, there's a total of 11 users from area 0, 16 from area 1, and 3 from area 2.
Furthermore, of the 11 users from area 0, 9 should be allocated to group A, and 1 each to groups B and C. The process is analogous for the other areas.
I have some code in Python:
import random
import pandas as pd
df = pd.DataFrame({"A": [9,1,0], "B": [1,7,2], "C": [1,8,1]})
random.sample(range(1, df.sum(axis=1)[0] + 1), df.sum(axis=1)[0])
The last line creates a random permutation of the integers 1 to 11, e.g. [1, 4, 10, 2, 5, 11, 9, 3, 8, 7, 6]. Users whose value is between 1 and 9 go to group A, the user whose value is 10 goes to group B, and the user whose value is 11 goes to group C. In other words, user 3 goes to group B, user 6 goes to group C, and all the rest go to group A.
The desired output would be [A,A,B,A,A,C,A,A,A,A,A], or even better, a pandas dataframe like:
1 A
2 A
3 B
4 A
5 A
6 C
...
How can I automate the process I described in words above? (the actual allocation matrix is 10 x 10)
You could use np.repeat to get an array with the right number of users:
In [38]: [np.repeat(df.columns, row) for row in df.values]
Out[38]:
[Index(['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'C'], dtype='object'),
Index(['A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C',
'C', 'C'],
dtype='object'),
Index(['B', 'B', 'C'], dtype='object')]
And then permute them:
In [39]: [np.random.permutation(np.repeat(df.columns, row)) for row in df.values]
Out[39]:
[array(['C', 'A', 'A', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'A'], dtype=object),
array(['A', 'B', 'C', 'C', 'B', 'C', 'B', 'C', 'C', 'B', 'B', 'C', 'C',
'C', 'B', 'B'], dtype=object),
array(['B', 'C', 'B'], dtype=object)]
and then you could call pd.Series on each array if you wanted.
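For example, to get the per-user dataframe described in the question (index = user number within the area, value = assigned group), a minimal sketch building one Series per area; the 1-based numbering of users is an assumption taken from the question's example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 1, 0], "B": [1, 7, 2], "C": [1, 8, 1]})

# one Series per area: index = user number within the area, value = group
allocations = [
    pd.Series(np.random.permutation(np.repeat(df.columns, row)),
              index=range(1, row.sum() + 1))
    for row in df.values
]
print(allocations[0])  # 11 rows for area 0, e.g. 1 A / 2 A / 3 B / ...
```

The same comprehension scales unchanged to the real 10 x 10 allocation matrix, since it iterates over whatever rows and columns df has.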
I want to collect the elements that appear more than two times in a list.
For this, I wrote the code below.
A = ['a', 'a', 'b', 'b', 'a', 'c', 'd', 'b']
for ab in A:
    ab_list = list()
    for _ in range(A.count(ab)):
        ab_list.append(A.pop(A.index(ab)))
    # other code ~
When I ran the code, it never got to 'c' and 'd'.
It just stops once all the 'b's have been removed from list A.
For me it's okay, because 'c' and 'd' each appear only once, but I want to know why it stops before reaching them.
Please help a newbie,
thanks experts
Try this:
A = ['a', 'a', 'b', 'b', 'a', 'c', 'd', 'b']
ab_list = [character for character in set(A) if A.count(character) > 0]
print(ab_list)
Output:
['b', 'a', 'c', 'd']
Your problem is in this line:
for _ in range(A.count(ab)):
You take the count of the current character, for example 3, and loop that many times while popping those occurrences out of A. Because you are removing elements from A while the outer for loop is still iterating over it, the iteration skips ahead and finishes before it ever reaches 'c' and 'd'.
But if you just want to count the elements in the list, you can use numpy:
import numpy as np

A = ['a', 'a', 'b', 'b', 'a', 'c', 'd', 'b']
# count occurrences of each element in A
B = np.unique(A, return_counts=True)
Results:
(array(['a', 'b', 'c', 'd'], dtype='<U1'), array([3, 3, 1, 1], dtype=int64))
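If you'd rather stay in the standard library, collections.Counter gives the same counts without mutating the list, and also lets you group each character's occurrences safely, a sketch:

```python
from collections import Counter

A = ['a', 'a', 'b', 'b', 'a', 'c', 'd', 'b']
counts = Counter(A)  # Counter({'a': 3, 'b': 3, 'c': 1, 'd': 1})
print(counts)

# group each character's occurrences into its own list, without popping from A
groups = [[ch] * n for ch, n in counts.items()]
print(groups)  # [['a', 'a', 'a'], ['b', 'b', 'b'], ['c'], ['d']]
```

Since A is never modified, the grouping cannot skip elements the way the original pop-while-iterating loop does.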
I have a correlation dataframe, and I'm trying to turn it into different lists:
          A         B         C
A  1.000000  0.932159 -0.976221
B  0.932159  1.000000 -0.831509
C -0.976221 -0.831509  1.000000
The output I need is:
[A, B, 0.932159]
[A, C, -0.976221]
[B, A, 0.932159]
[B, C, -0.831509]
[C, A, -0.976221]
[C, B, -0.831509]
I have tried converting the dataframe into a list, but I don't get what I need.
Thanks
Stack the dataframe, reset the index, exclude the rows whose first and second column values are identical, and then turn the result into a list:
out = df.stack().reset_index()
out = out[out.iloc[:, 0].ne(out.iloc[:, 1])].values.tolist()
Output:
[['A', 'B', 0.932159],
['A', 'C', -0.976221],
['B', 'A', 0.932159],
['B', 'C', -0.831509],
['C', 'A', -0.976221],
['C', 'B', -0.831509]]
The simplest way I can think of for a tiny DataFrame like this is with a list comprehension:
corr
          A         B         C
A  1.000000  0.932159 -0.976221
B  0.932159  1.000000 -0.831509
C -0.976221 -0.831509  1.000000
[[row, col, corr[col][row]] for row in corr.index for col in corr if row != col]
[['A', 'B', 0.932159],
['A', 'C', -0.976221],
['B', 'A', 0.932159],
['B', 'C', -0.831509],
['C', 'A', -0.976221],
['C', 'B', -0.831509]]
The longer-form way may be easier to read for a general audience:
result = []
for row in corr.index:
    for col in corr:
        if row != col:
            result.append([row, col, corr[col][row]])
result
[['A', 'B', 0.932159],
['A', 'C', -0.976221],
['B', 'A', 0.932159],
['B', 'C', -0.831509],
['C', 'A', -0.976221],
['C', 'B', -0.831509]]
A list comprehension over itertools.permutations is enough:
from itertools import permutations
elements = df.index
out = [[val[0], val[1], df.loc[val[0], val[1]]]
for val in permutations(elements, 2)]
Output
[['A', 'B', 0.932159],
['A', 'C', -0.976221],
['B', 'A', 0.932159],
['B', 'C', -0.831509],
['C', 'A', -0.976221],
['C', 'B', -0.831509]]
I have a df with columns that represents a stratum (strat). I want to loop over those stratum and pull out rows to a new df, df_sample. I want to pull out all rows in a stratum if cases are few.
I've tried the below, and it works. But I wonder if there is a better solution to this problem. Perhaps pd.concat is slow when I later use the real much larger data for example.
df=pd.DataFrame({'ID': range(0,120),
'strat': ['A', 'B', 'B', 'A', 'B', 'A', 'D', 'A', 'B', 'C',
'A', 'D', 'A', 'A', 'A', 'D', 'F', 'D', 'F', 'C',
'B', 'A', 'A', 'C', 'A', 'A', 'B', 'D', 'B', 'C',
'C', 'A', 'C', 'A', 'C', 'A', 'D', 'C', 'C', 'A',
'B', 'F', 'F', 'C', 'B', 'D', 'A', 'A', 'B', 'B',
'A', 'C', 'A', 'A', 'F', 'A', 'A', 'B', 'A', 'D',
'C', 'B', 'B', 'A', 'B', 'C', 'B', 'A', 'D', 'B',
'B', 'A', 'A', 'C', 'D', 'F', 'F', 'A', 'B', 'C',
'F', 'B', 'D', 'A', 'A', 'F', 'B', 'D', 'B', 'A',
'F', 'D', 'A', 'A', 'C', 'B', 'B', 'C', 'C', 'B',
'F', 'A', 'A', 'B', 'B', 'B', 'F', 'A', 'B', 'C',
'A', 'A', 'A', 'B', 'B', 'A', 'A', 'A', 'B', 'B']})
df_sample = pd.DataFrame()
for i in df.strat.unique():
    temp = df[df['strat'] == i]
    if len(temp) < 21:
        strat = temp.sample(len(temp))
    elif len(temp) > 20:
        strat = temp.sample(frac=0.5)
    df_sample = pd.concat([df_sample, strat])
Other solutions may be faster. Here is another one in case readability/maintainability is more important.
def sample_stratum(stratum):
    nrows = stratum.shape[0]
    if nrows < 21:
        output = stratum.sample(nrows)
    else:
        output = stratum.sample(frac=0.5)
    return output

# Index may be retained if needed
sampled_df = df.groupby(by=['strat']).apply(sample_stratum).reset_index(drop=True)
# ID strat
# 0 12 A
# 1 7 A
# 2 50 A
# 3 58 A
# 4 0 A
# .. .. ...
# 77 41 F
# 78 42 F
# 79 16 F
# 80 76 F
# 81 90 F
# [82 rows x 2 columns]
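If the apply indirection feels heavy, the same sampling rule can be written by iterating over the groups directly and calling pd.concat once at the end, which also addresses the concat-in-a-loop concern from the question. A sketch on toy data, with the threshold shrunk from the question's 21 to fit:

```python
import pandas as pd

# toy frame; the threshold below is shrunk to 4 to suit its size
df_small = pd.DataFrame({'ID': range(7),
                         'strat': ['A', 'A', 'A', 'B', 'B', 'B', 'B']})

# shuffle small strata entirely, sample half of the large ones,
# and concatenate once at the end instead of inside the loop
parts = [g.sample(len(g)) if len(g) < 4 else g.sample(frac=0.5)
         for _, g in df_small.groupby('strat')]
sampled = pd.concat(parts)
print(sampled)  # all 3 'A' rows (shuffled) plus 2 of the 4 'B' rows
```

Building a list of pieces and concatenating once is generally cheaper than growing a DataFrame with pd.concat inside the loop, since each in-loop concat copies all the rows accumulated so far.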
You can groupby "strat" and count the number of entries in each "strat", then identify the strats that have less than 21 entries and shuffle them. Then take the remaining strats (those with more than 20 entries) and sample 50% of them. Finally concatenate the two DataFrames:
msk1 = df.groupby('strat')['strat'].count() < 21
less_than_21 = msk1.index[msk1]
msk2 = df['strat'].isin(less_than_21)
out = pd.concat((df[~msk2].groupby('strat').sample(frac=0.5), df[msk2].sample(msk2.sum())))
Output:
ID strat
110 110 A
72 72 A
46 46 A
31 31 A
92 92 A
.. ... ...
18 18 F
9 9 C
23 23 C
42 42 F
82 82 D
[82 rows x 2 columns]
Create a mask of the rows belonging to small groups from the group sizes, then process the two sets separately:
m = df.groupby('strat')['strat'].transform('size').lt(21)
df = pd.concat((df[~m].groupby('strat').sample(frac=0.5),
df[m].sample(frac=1)),
ignore_index=True)
print (df)
ID strat
0 71 A
1 31 A
2 72 A
3 39 A
4 83 A
.. .. ...
77 37 C
78 85 F
79 19 C
80 34 C
81 73 C
[82 rows x 2 columns]
Alternative solution:
m = df['strat'].map(df['strat'].value_counts()).lt(21)
df = pd.concat((df[~m].groupby('strat').sample(frac=0.5), df[m].sample(frac=1)))
I am having a problem with my list in Python.
I am printing out the list (working), a number that shows the line number (working), and an item in the list that should change every time the list is printed (not working).
a = ["A", "B", "C", "D", "E"]
b = 0
for x in a:
    while b <= 10:
        print(a, x, b)
        b += 1
My current program output is
['A', 'B', 'C', 'D', 'E'] A 0
['A', 'B', 'C', 'D', 'E'] A 1
['A', 'B', 'C', 'D', 'E'] A 2
['A', 'B', 'C', 'D', 'E'] A 3
and so on
the output I would like
['A', 'B', 'C', 'D', 'E'] A 0
['A', 'B', 'C', 'D', 'E'] B 1
['A', 'B', 'C', 'D', 'E'] C 2
['A', 'B', 'C', 'D', 'E'] D 3
and so on
Although, when I try a different program it works perfectly?
list = ["a", "b", "c"]
for a in list:
    print(a)
Why does this happen and how can I fix it?
That is because you have the while loop inside the outer for loop (the one that iterates over the elements of the list). So the inner while loop only exits once b becomes greater than 10, and until then the value of x is still A.
For what you want I would suggest using itertools.cycle(). Example -
>>> a = ["A", "B", "C", "D", "E"]
>>>
>>> b = 0
>>> import itertools
>>> acycle = itertools.cycle(a)
>>> for i in range(11):
... print(a,next(acycle),i)
...
['A', 'B', 'C', 'D', 'E'] A 0
['A', 'B', 'C', 'D', 'E'] B 1
['A', 'B', 'C', 'D', 'E'] C 2
['A', 'B', 'C', 'D', 'E'] D 3
['A', 'B', 'C', 'D', 'E'] E 4
['A', 'B', 'C', 'D', 'E'] A 5
['A', 'B', 'C', 'D', 'E'] B 6
['A', 'B', 'C', 'D', 'E'] C 7
['A', 'B', 'C', 'D', 'E'] D 8
['A', 'B', 'C', 'D', 'E'] E 9
['A', 'B', 'C', 'D', 'E'] A 10
You have a double loop here (a while loop inside the for loop) and you never reset b to 0, so the while loop only runs during the first for iteration. To get the result you expected, you should use enumerate:
for idx, x in enumerate(a):
    print(a, x, idx)
If A and B are two arrays corresponding to two orderings of the same (distinct) elements, there is a unique indexing array P such that A[P] is equal to B. For example, if A and B are
A = ['b', 'c', 'e', 'd', 'a']
B = ['a', 'd', 'c', 'b', 'e']
then the desired P is
P = [4, 3, 1, 0, 2]
Does numpy (or standard Python) have a function for computing such a P?
Using standard python
>>> A = ['b', 'c', 'e', 'd', 'a']
>>> B = ['a', 'd', 'c', 'b', 'e']
>>> P = [ A.index(i) for i in B ]
>>> P
[4, 3, 1, 0, 2]
Using Numpy
import numpy as np
A = np.array(['b', 'c', 'e', 'd', 'a'])
B = np.array(['a', 'd', 'c', 'b', 'e'])
P = np.empty(len(A), int)
P[B.argsort()] = A.argsort()
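Either way, the result is easy to sanity-check: np.array_equal confirms that A[P] reproduces B.

```python
import numpy as np

A = np.array(['b', 'c', 'e', 'd', 'a'])
B = np.array(['a', 'd', 'c', 'b', 'e'])

# scatter A's sort order into the positions given by B's sort order
P = np.empty(len(A), int)
P[B.argsort()] = A.argsort()

print(P)                        # [4 3 1 0 2]
print(np.array_equal(A[P], B))  # True
```

The numpy version avoids the O(n^2) cost of calling A.index once per element, since both argsort calls are O(n log n).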