Generating a range of unique random numbers for each group in pandas - python

I have to create a range of unique random numbers between 0 and 99999 for each group.
I have the following dataset:
District Prefix Quota
A 98426 783
A 98427 223
A 98446 127
A 98626 51
B 98049 167
B 98079 153
B 98140 120
B 98159 139
B 98169 182
B 98249 86
B 98426 588
B 98446 96
C 98049 104
C 98060 68
C 98149 65
C 98150 68
C 98159 86
C 98160 80
C 98169 113
Code to reproduce:
import pandas as pd
import numpy as np

df = pd.DataFrame([
    ['A', 98426, 783],
    ['A', 98427, 223],
    ['A', 98446, 127],
    ['A', 98626, 51],
    ['B', 98049, 167],
    ['B', 98079, 153],
    ['B', 98140, 120],
    ['B', 98159, 139],
    ['B', 98169, 182],
    ['B', 98249, 86],
    ['B', 98426, 588],
    ['B', 98446, 96],
    ['C', 98049, 104],
    ['C', 98060, 68],
    ['C', 98149, 65],
    ['C', 98150, 68],
    ['C', 98159, 86],
    ['C', 98160, 80],
    ['C', 98169, 113],
], columns=['District', 'Prefix', 'Quota'])
So I have to generate "Quota" unique numbers from 0 to 99999 for each prefix and append them to the prefix, creating 10-digit numbers.
I tried:
numbers = np.random.choice(range(99999), size=df.Quota.sum(), replace=False)
random = df.Prefix.repeat(df.Quota)*100000 + numbers
The problem is that this draws from a single 0-99999 pool for the whole column: once a number has been used for one prefix, it is no longer available for any other. For example, suppose np.random.choice(range(99999), size=df.Quota.sum(), replace=False) generates 16195 and it is added to the first prefix, 98426; then 16195 can never be generated again and is unavailable to every other prefix. Instead, the full 0-99999 range should be available independently for each prefix.
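For reference, a minimal loop-based sketch of the required behavior (rng and parts are illustrative names; each prefix gets its own fresh 0-99999 pool, drawn once per prefix so numbers stay unique across all of that prefix's rows):

import numpy as np

rng = np.random.default_rng()
parts = []
for prefix, group in df.groupby('Prefix'):
    # a fresh pool of unique numbers for every prefix
    pool = rng.choice(100000, size=group['Quota'].sum(), replace=False)
    parts.append(prefix * 100000 + pool)
numbers = np.concatenate(parts)

The vectorized answers below achieve the same thing without the Python-level loop.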

Here's a solution using transform, random.choice, and explode.
def make_random_numbers(x):
    # x is the Quota series of one Prefix group; draw enough unique
    # numbers for the whole group in one go (100000, not 99999, so
    # that 99999 itself can also be drawn)
    total = x.sum()
    r = np.random.choice(100000, total, replace=False)
    # split the draw into one chunk per row, sized by each row's Quota
    chunks = x.cumsum()[:-1]
    return np.hsplit(r, chunks)

df["rand_items"] = df.groupby("Prefix")["Quota"].transform(make_random_numbers)
df.explode("rand_items")
The result is:
District Prefix Quota rand_items
0 A 98426 783 2681
0 A 98426 783 94952
0 A 98426 783 79496
0 A 98426 783 58361
0 A 98426 783 54883
0 A 98426 783 44819
0 A 98426 783 36209
0 A 98426 783 91710
...
18 C 98169 113 41859
18 C 98169 113 92311
18 C 98169 113 18572
18 C 98169 113 72492
18 C 98169 113 39188
18 C 98169 113 36808
18 C 98169 113 32055
18 C 98169 113 74678
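To get the requested 10-digit numbers from the exploded frame, one more step combines the prefix with the drawn remainder (a small sketch; out is an illustrative name):

out = df.explode('rand_items')
out['number'] = out['Prefix'] * 100000 + out['rand_items'].astype(int)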

This approach returns, for each prefix, an array of the finished 10-digit numbers:
def gen_rand(x):
    # one independent 0-99999 pool per prefix (100000 so that 99999
    # itself can also be drawn)
    return (x['Prefix'].min() * 100000
            + np.random.choice(100000, size=x['Quota'].sum(),
                               replace=False)).astype(int)

df.groupby('Prefix').apply(gen_rand)
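A possible usage sketch: the groupby-apply returns one array per prefix, which explode can flatten into a single Series of numbers (numbers is an illustrative name):

numbers = df.groupby('Prefix').apply(gen_rand).explode().astype('int64')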

Related

Dynamically differencing columns in a pandas dataframe using similar column names

The following are the first couple of columns of a data frame, and I want to calculate V1_x - V1_y, V2_x - V2_y, V3_x - V3_y, etc. The names in each difference pair differ only by the last character (either x or y):
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'John'],
        'Address': ['xx', 'yy', 'zz', 'ww'],
        'V1_x': [20, 21, 19, 18], 'V2_x': [233, 142, 643, 254], 'V3_x': [343, 543, 254, 543],
        'V1_y': [20, 21, 19, 18], 'V2_y': [233, 142, 643, 254], 'V3_y': [343, 543, 254, 543]}
df = pd.DataFrame(data)
df
Name Address V1_x V2_x V3_x V1_y V2_y V3_y
0 Tom xx 20 233 343 20 233 343
1 Joseph yy 21 142 543 21 142 543
2 Krish zz 19 643 254 19 643 254
3 John ww 18 254 543 18 254 543
I currently do the calculation by manually defining the column names:
new_df = pd.DataFrame()
new_df['Name'] = df['Name']
new_df['Address'] = df['Address']
new_df['Col1'] = df['V1_x'] - df['V1_y']
new_df['Col2'] = df['V2_x'] - df['V2_y']
new_df['Col3'] = df['V3_x'] - df['V3_y']
Is there an approach I can use to check whether two column names differ only by their last character and, if so, difference them automatically?
Try creating a MultiIndex header using .str.split, then reshape the dataframe and use pd.DataFrame.eval for the calculation, then reshape back to the original form with the additional columns. Lastly, flatten the MultiIndex header using a list comprehension with f-string formatting:
dfi = df.set_index(['Name', 'Address'])
dfi.columns = dfi.columns.str.split('_', expand=True)
dfs = dfi.stack(0).eval('diff=x-y').unstack()
dfs.columns = [f'{j}_{i}' for i, j in dfs.columns]
dfs
Output:
V1_x V2_x V3_x V1_y V2_y V3_y V1_diff V2_diff V3_diff
Name Address
John ww 18 254 543 18 254 543 0 0 0
Joseph yy 21 142 543 21 142 543 0 0 0
Krish zz 19 643 254 19 643 254 0 0 0
Tom xx 20 233 343 20 233 343 0 0 0
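A more literal alternative sketch, assuming every pair ends in '_x'/'_y': loop over the columns and subtract each '_y' partner directly (out is an illustrative name):

out = df[['Name', 'Address']].copy()
for col in df.columns:
    if col.endswith('_x') and col[:-2] + '_y' in df.columns:
        # e.g. 'V1_x' -> out['V1_diff'] = df['V1_x'] - df['V1_y']
        out[col[:-2] + '_diff'] = df[col] - df[col[:-2] + '_y']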

How to read data that has been split into multiple columns?

I have the following dataframe:
q
1 0.83 97 0.7 193 0.238782 289 0.129692 385 0.090692
2 0.75 98 0.7 194 0.238782 290 0.129692 386 0.090692
...
96 0.94693 192 0.299753 288 0.145046 384 0.0965338 480 0.0823061
This data comes from somewhere else and has been split into multiple columns. However, the values all correspond to a single variable 'q', together with its indices. To clarify: even though there are many columns, they all represent one 'q' column plus an index column (notice that the starting index of each column continues where the previous column ended).
How can I read the data with pandas? I believe I can do it by assigning names to each column and then merging them all together, but I was looking for a more elegant solution. Plus, the number of columns is not fixed.
This is the code that I am using at the moment:
q_param = pd.read_csv('Initial_solutions/initial_q_20y.dat', delim_whitespace=True)
Which does not do the trick. I would prefer to use pandas to solve this issue, but I can also work without it.
EDIT:
At the request of @user17242583, the following command:
print(q_param.head().to_dict())
Gives this output:
{'q': {(1, 0.83, 97, 0.7, 193, 0.238782, 289, 0.129692, 385): 0.090692, (2, 0.75, 98, 0.7, 194, 0.238782, 290, 0.129692, 386): 0.090692, (3, 0.64, 99, 0.64, 195, 0.238782, 291, 0.129692, 387): 0.090692, (4, 0.7, 100, 0.7, 196, 0.238782, 292, 0.129692, 388): 0.0884839, (5, 0.64, 101, 0.64, 197, 0.238782, 293, 0.129692, 389): 0.090692}}
It seems most of your data is index. Try:
# each row's index tuple holds alternating (position, value) pairs;
# flatten them back into a single position -> value mapping
df = pd.DataFrame({k: v
                   for lst in [list(k) + [v] for k, v in q_param['q'].items()]
                   for k, v in zip(lst[::2], lst[1::2])},
                  index=['q']).T.sort_index()
Try this:
data = {
    0: pd.concat(q[c] for c in q.columns[0::2]).reset_index(drop=True),
    1: pd.concat(q[c] for c in q.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 1 0.830000
1 2 0.750000
2 3 0.640000
3 4 0.700000
4 5 0.640000
5 97 0.700000
6 98 0.700000
7 99 0.640000
8 100 0.700000
9 101 0.640000
10 193 0.238782
11 194 0.238782
12 195 0.238782
13 196 0.238782
14 197 0.238782
15 289 0.129692
16 290 0.129692
17 291 0.129692
18 292 0.129692
19 293 0.129692
20 385 0.090692
21 386 0.090692
22 387 0.090692
23 388 0.088484
24 389 0.090692
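If the goal is to go straight from the file, a sketch of the same idea applied at read time (assuming a whitespace-delimited file whose single header line is 'q'; raw and parts are illustrative names):

raw = pd.read_csv('Initial_solutions/initial_q_20y.dat',
                  sep=r'\s+', skiprows=1, header=None)
# stack every (index, value) column pair on top of each other
parts = [raw[[i, v]].set_axis(['idx', 'q'], axis=1)
         for i, v in zip(raw.columns[0::2], raw.columns[1::2])]
q_param = pd.concat(parts, ignore_index=True).dropna().set_index('idx').sort_index()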

Pandas: How to (cleanly) unpivot two columns with same category?

I'm trying to unpivot two columns inside a pandas dataframe. The transformation I seek would be the inverse of this question.
We start with a dataset that looks like this:
import pandas as pd
import numpy as np
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df_orig
accuracy time_a time_b memory_a memory_b
0 6 118 170 102 239
1 241 9 166 159 162
2 164 70 76 228 121
3 228 121 135 128 92
I wish to unpivot both the memory and time columns, obtaining this dataset as the result:
df
accuracy memory category time
0 6 102 a 118
1 241 159 a 9
2 164 228 a 70
3 228 128 a 121
12 6 239 b 170
13 241 162 b 166
14 164 121 b 76
15 228 92 b 135
So far I have managed to get my desired output using df.melt() twice plus some extra commands:
df = df_orig.copy()

# Unpivot the memory columns
df = df.melt(id_vars=['accuracy', 'time_a', 'time_b'],
             value_vars=['memory_a', 'memory_b'],
             value_name='memory',
             var_name='mem_cat')

# Unpivot the time columns
df = df.melt(id_vars=['accuracy', 'memory', 'mem_cat'],
             value_vars=['time_a', 'time_b'],
             value_name='time',
             var_name='time_cat')

# Keep only the 'a'/'b' suffix as the category
df.mem_cat = df.mem_cat.str[-1]
df.time_cat = df.time_cat.str[-1]

# Keep only the rows whose categories match (DIRTY!)
df = df[df.mem_cat == df.time_cat]

# Remove the duplicated category column
df = df.drop(columns='time_cat').rename(columns={'mem_cat': 'category'})
Given how easy it was to solve the inverse question, I believe my code is way too complex. Can anyone do it better?
Use wide_to_long:
np.random.seed(123)
df_orig = pd.DataFrame(data=np.random.randint(255, size=(4, 5)),
                       columns=['accuracy', 'time_a', 'time_b', 'memory_a', 'memory_b'])
df = (pd.wide_to_long(df_orig.reset_index(),
                      stubnames=['time', 'memory'],
                      i='index',
                      j='category',
                      sep='_',
                      suffix=r'\w+')
        .reset_index(level=1)
        .reset_index(drop=True)
        .rename_axis(None))
print(df)
category accuracy time memory
0 a 254 109 66
1 a 98 230 83
2 a 123 57 225
3 a 113 126 73
4 b 254 126 220
5 b 98 17 106
6 b 123 214 96
7 b 113 47 32
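For comparison, a sketch of the same reshape using a split MultiIndex header and stack (equivalent result, modulo row and column order; tmp is an illustrative name):

tmp = df_orig.set_index('accuracy')
tmp.columns = tmp.columns.str.rsplit('_', n=1, expand=True)
df = (tmp.stack(1)
         .rename_axis(['accuracy', 'category'])
         .reset_index())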

Numpy: Use vectorization for loop while referring to previous row value?

I have the following dataframe, for which I want to create a column named 'Value' using numpy for fast looping while referring to the previous row's value in the same column.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
        "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
        "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
        "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
    }
)
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
The formula for Value column in pseudocode is:
if ['Is First?'] = 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]
The straightforward way of creating the Value column right now is a for loop that refers to the previous row's value. But since I will be applying this over a giant dataset, I want to use numpy vectorization instead.
for i in range(len(df)):
    if df.loc[i, "Is First?"] == "Yes":
        df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
    else:
        # previous row's Value minus this row's Outbound
        df.loc[i, "Value"] = df.loc[i - 1, "Value"] - df.loc[i, "Outbound"]
One way:
You may use np.subtract.accumulate with transform
s = df['Is First?'].eq('Yes').cumsum()
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
.groupby(s)
.transform(np.subtract.accumulate))
Out[1749]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
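For intuition, np.subtract.accumulate keeps a group's first element and subtracts each following element from the running result, which is exactly the 'previous value minus Outbound' recurrence:

np.subtract.accumulate([125, 20, 24, 52])
# array([125, 105,  81,  29])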
Another way:
Assign Value for the 'Yes' rows. Create a group id s to use for groupby. Group and shift Outbound to calculate the cumulative sum, and subtract it from each group's 'Yes' value. Finally, use the result to fillna:
df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)
Out[1671]:
Product Inbound Outbound Is First? value
0 A 115 10 Yes 125.0
1 A 220 20 No 105.0
2 A 200 24 No 81.0
3 A 402 52 No 29.0
4 B 313 40 Yes 353.0
5 B 434 12 No 341.0
6 B 321 43 No 298.0
7 C 343 23 Yes 366.0
8 C 120 16 No 350.0
This is not a trivial task; the difficulty lies in the consecutive 'No' rows. It's necessary to group each 'Yes' row together with the 'No' rows that follow it. The code below should do it:
col_sum = df.Inbound + df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
# each 'Yes' starts a new block of consecutive rows
consec_no = mask_yes.cumsum()
result = (col_sum.groupby(consec_no).transform('first')
          - df['Outbound'].where(mask_no, 0).groupby(consec_no).cumsum())
Use:
m = df['Is First?'].eq('Yes')
df.loc[m, 'Value'] = df['Inbound'] + df['Outbound']
# forward-fill each block's starting value, then subtract the running
# Outbound total within the block (0 on the 'Yes' row itself)
df['Value'] = (df['Value'].ffill()
               - df['Outbound'].where(~m, 0).groupby(m.cumsum()).cumsum())
Annotated numpy code:
## 1. line up values to sum
ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]
## 2. calculate block sums and subtract each from the
## first element of the **next** block
ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()
Result:
Product Inbound Outbound Is First? Value
0 A 115 10 Yes 125
1 A 220 20 No 105
2 A 200 24 No 81
3 A 402 52 No 29
4 B 313 40 Yes 353
5 B 434 12 No 341
6 B 321 43 No 298
7 C 343 23 Yes 366
8 C 120 16 No 350
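For intuition, np.add.reduceat computes each block sum over the slices [fi[k], fi[k+1]) in a single call, which is where the per-block totals above come from:

np.add.reduceat([125, -20, -24, -52, 353, -12, -43, 366, -16], [0, 4, 7])
# array([ 29, 298, 350])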

Compare two dataframes based off of key

I have two dataframes, df1 and df2, which have the exact same columns and most of the time the same values for each key.
df1:
Country A B C D E F G H Key
Argentina xylo 262 4632 0 0 26.12 2 0 Argentinaxylo
Argentina phone 6860 155811 48 0 4375.87 202 0 Argentinaphone
Argentina land 507 1803728 2 117 7165.810566 3 154 Argentinaland
Australia xylo 7650 139472 69 0 16858.42 184 0 Australiaxylo
Australia mink 1284 2342788 1 0 39287.71 53 0 Australiamink
df2:
Country A B C D E F G H Key
Argentina xylo 262 4632 0 0 26.12 2 0 Argentinaxylo
Argentina phone 6860 155811 48 0 4375.87 202 0 Argentinaphone
Argentina land 507 1803728 2 117 7165.810566 3 154 Argentinaland
Australia xylo 7650 139472 69 0 16858.42 184 0 Australiaxylo
Australia mink 1284 2342788 1 0 39287.71 53 0 Australiamink
I want a snippet that compares the keys (key = column Country + column A) in each dataframe against each other and calculates the percent difference for each of columns B-H, if there is any. If there isn't, output nothing.
Hope the code below helps solve your problem. I compare both datasets based on the Key column and compute (B - H) / 100 for each of them. I then merge the two datasets on the Key column, compare the two results, and put the final difference in the df3diff column of the df3 dataset.
import pandas as pd
df1 = pd.DataFrame([['Argentina', 'xylo',  262,   4632,    0,  0,   26.12,       2,   0,   'Argentinaxylo'],
                    ['Argentina', 'phone', 6860,  155811,  48, 0,   4375.87,     202, 0,   'Argentinaphone'],
                    ['Argentina', 'land',  507,   1803728, 2,  117, 7165.810566, 3,   154, 'Argentinaland'],
                    ['Australia', 'xylo',  7650,  139472,  69, 0,   16858.42,    184, 0,   'Australiaxylo'],
                    ['Australia', 'mink',  1284,  2342788, 1,  0,   39287.71,    53,  0,   'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df1['df1BH'] = (df1['B'] - df1['H']) / 100.00
print(df1)

df2 = pd.DataFrame([['Argentina', 'xylo',  262,   4632,    0,  0,   26.12,       2,   0,   'Argentinaxylo'],
                    ['Argentina', 'phone', 6860,  155811,  48, 0,   4375.87,     202, 0,   'Argentinaphone'],
                    ['Argentina', 'land',  507,   1803728, 2,  117, 7165.810566, 3,   154, 'Argentinaland'],
                    ['Australia', 'xylo',  97650, 139472,  69, 0,   96858.42,    184, 0,   'Australiaxylo'],
                    ['Australia', 'mink',  1284,  2342788, 1,  0,   39287.71,    53,  0,   'Australiamink']],
                   columns=['Country', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Key'])
df2['df2BH'] = (df2['B'] - df2['H']) / 100.00
print(df2)

df3 = pd.merge(df1[['Key', 'df1BH']], df2[['Key', 'df2BH']], on=['Key'], how='outer')
df3['df3diff'] = df3['df1BH'] - df3['df2BH']
print(df3)
Output:
              Key  df1BH   df2BH  df3diff
0   Argentinaxylo   2.62    2.62      0.0
1  Argentinaphone  68.60   68.60      0.0
2   Argentinaland   3.53    3.53      0.0
3   Australiaxylo  76.50  976.50   -900.0
4   Australiamink  12.84   12.84      0.0
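The question actually asks for a percent difference on every column B through H; a hedged sketch of that (merged and the _pct column names are illustrative; it merges on Key and guards against division by zero, since several columns contain zeros):

import numpy as np

merged = pd.merge(df1, df2, on='Key', suffixes=('_1', '_2'))
for col in 'BCDEFGH':
    a, b = merged[col + '_1'], merged[col + '_2']
    # percent difference relative to df1; NaN where df1's value is 0
    merged[col + '_pct'] = np.where(a != 0, (b - a) / a * 100, np.nan)
# report only the keys where at least one column differs
diff_cols = [c + '_pct' for c in 'BCDEFGH']
print(merged.loc[merged[diff_cols].fillna(0).ne(0).any(axis=1), ['Key'] + diff_cols])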
