Generate column of unique ID in pandas

Generate column of unique ID in pandas - python

I have a dataframe with three columns, bins_x, bins_y and z. I wish to add a new column unique that is an "index" of sorts for that unique combination of bins_x and bins_y. Below is an example of what I would like to append.
Note that I ordered the dataframe for clarity, but order does not matter in this context.
import numpy as np
import pandas as pd
np.random.seed(12)
n = 1000
height = 20
width = 20
bins_x = np.random.randint(1, width, size=n)
bins_y = np.random.randint(1, height, size=n)
z = np.random.randint(1, 500, size=n)
df = pd.DataFrame({'bins_x': bins_x, 'bins_y': bins_y, 'z': z})
print(df.sort_values(['bins_x', 'bins_y'])
bins_x bins_y z unique
23 0 0 462 0
531 0 0 199 1
665 0 0 176 2
363 0 1 219 0
468 0 1 450 1
593 0 1 385 2
609 0 1 74 3
663 0 1 46 4
14 0 2 242 0
208 0 2 381 1
600 0 2 445 2
865 0 2 221 3
400 0 3 178 0
75 0 4 281 0
140 0 4 205 1
282 0 4 47 2
838 0 4 212 3

Use groupby and cumcount:
df['unique'] = df.groupby(['bins_x','bins_y']).cumcount()
>>> df.sort_values(['bins_x', 'bins_y']).head(10)
bins_x bins_y z unique
207 1 1 4 0
259 1 1 313 1
327 1 1 300 2
341 1 1 64 3
440 1 1 398 4
573 1 1 96 5
174 1 2 219 0
563 1 2 398 1
796 1 2 417 2
809 1 2 167 3

Related

Split columns conditionally on string

I have a data frame with the following shape:
0 1
0 OTT:81 DVBC:398
1 OTT:81 DVBC:474
2 OTT:81 DVBC:474
3 OTT:81 DVBC:454
4 OTT:81 DVBC:443
5 OTT:1 DVBC:254
6 DVBC:151 None
7 OTT:1 DVBC:243
8 OTT:1 DVBC:254
9 DVBC:227 None
I want for column 1 to be same as column 0 if column 1 contains "DVBC".
The split the values on ":" and the fill the empty ones with 0.
The end data frame should look like this
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227
I try to do this starting with:
if df[0].str.contains("DVBC") is True:
df[1] = df[0]
But after this the data frame looks the same not sure why.
My idea after is to pass the values to the respective columns then split by ":" and rename the columns.
How can I implement this?

Universal solution for split values by : and pivoting- first create Series by DataFrame.stack, split by Series.str.splitSeries.str.rsplit and last reshape by DataFrame.pivot:
df = df.stack().str.split(':', expand=True).reset_index()
df = df.pivot('level_0',0,1).fillna(0).rename_axis(index=None, columns=None)
print (df)
DVBC OTT
0 398 81
1 474 81
2 474 81
3 454 81
4 443 81
5 254 1
6 151 0
7 243 1
8 254 1
9 227 0

Here is one way that should work with any number of columns:
(df
.apply(lambda c: c.str.extract(':(\d+)', expand=False))
.ffill(axis=1)
.mask(df.replace('None', pd.NA).isnull().shift(-1, axis=1, fill_value=False), 0)
)
output:
OTT DVBC
0 81 398
1 81 474
2 81 474
3 81 454
4 81 443
5 1 254
6 0 151
7 1 243
8 1 254
9 0 227

How to calculate min and max of a column for particular rows?

I have a csv file as following:
0 2 1 1 464 385 171 0:44:4
1 1 2 26 254 444 525 0:56:2
2 3 1 90 525 785 522 0:52:8
3 8 2 3 525 233 555 0:52:8
4 7 1 10 525 433 522 1:52:8
5 9 2 55 525 555 522 1:52:8
6 6 3 3 392 111 232 1:43:4
7 1 4 23 322 191 112 1:43:4
8 1 3 30 322 191 112 1:43:4
9 1 5 2 322 191 112 1:43:4
10 1 3 22 322 191 112 1:43:4
11 1 4 44 322 191 112 1:43:4
12 1 5 1 322 191 112 1:43:4
12 1 4 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
12 1 6 1 322 191 112 1:43:4
12 1 5 3 322 191 112 1:43:4
12 1 6 33 322 191 112 1:43:4
.
.
Third column has numbers between 1 to 6. I want to read information of columns #4 and #5 for all the rows that have number 1 to 6 in the third columns and find the maximum and minmum amount for each row that has number 1 to 6 seprately. For example output like this:
Mix for row with 1: 1
Max for row with 1: 90
Min for row with 2: 3
Max for row with 2: 55
and so on
I can plot the figure using following code. How to get summary statistics by group? What I'm looking for is to get multiple statistics for the same group like mean, min, max, number of each group in one call, is that doable?
import matplotlib.pyplot as plt
import csv
x= []
y= []
with open('mydata.csv','r') as csvfile:
ap = csv.reader(csvfile, delimiter=',')
for row in ap:
x.append(int(row[2]))
y.append(int(row[7]))
plt.scatter(x, y, color = 'g',s = 4, marker='o')
plt.show()

One easy way would be to use Pandas with read_csv(), .groupby() and .agg():
import pandas as pd
df = pd.read_csv("mydata.csv", header=None)
def min_max_avg(col):
return (col.min() + col.max()) / 2
result = df[[2, 3, 4]].groupby(2).agg(["min", "max", "mean", min_max_avg])
Result:
3 4
min max mean min_max_avg min max mean min_max_avg
2
1 1 90 33.666667 45.5 464 525 504.666667 494.5
2 3 55 28.000000 29.0 254 525 434.666667 389.5
3 3 30 18.333333 16.5 322 392 345.333333 357.0
4 3 44 23.333333 23.5 322 322 322.000000 322.0
5 1 3 2.000000 2.0 322 322 322.000000 322.0
6 1 33 22.333333 17.0 322 322 322.000000 322.0
If you don't like that you could do it with pure Python, it's only a little bit more work:
import csv
data = {}
with open("mydata.csv", "r") as file:
for row in csv.reader(file):
dct = data.setdefault(row[2], {})
for col in (3, 4):
dct.setdefault(col, []).append(row[col])
min_str = "Min for group {} - column {}: {}"
max_str = "Max for group {} - column {}: {}"
for row in data:
for col in (3, 4):
print(min_str.format(row, col, min(data[row][col])))
print(max_str.format(row, col, max(data[row][col])))
Result:
Min for group 1 - column 3: 1
Max for group 1 - column 3: 90
Min for group 1 - column 4: 464
Max for group 1 - column 4: 525
Min for group 2 - column 3: 26
Max for group 2 - column 3: 55
Min for group 2 - column 4: 254
Max for group 2 - column 4: 525
Min for group 3 - column 3: 22
Max for group 3 - column 3: 30
Min for group 3 - column 4: 322
Max for group 3 - column 4: 392
...
mydata.csv:
0,2,1,1,464,385,171,0:44:4
1,1,2,26,254,444,525,0:56:2
2,3,1,90,525,785,522,0:52:8
3,8,2,3,525,233,555,0:52:8
4,7,1,10,525,433,522,1:52:8
5,9,2,55,525,555,522,1:52:8
6,6,3,3,392,111,232,1:43:4
7,1,4,23,322,191,112,1:43:4
8,1,3,30,322,191,112,1:43:4
9,1,5,2,322,191,112,1:43:4
10,1,3,22,322,191,112,1:43:4
11,1,4,44,322,191,112,1:43:4
12,1,5,1,322,191,112,1:43:4
12,1,4,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4
12,1,6,1,322,191,112,1:43:4
12,1,5,3,322,191,112,1:43:4
12,1,6,33,322,191,112,1:43:4

How to merge multiple sheets and rename column names with the names of the sheet names?

I have the following data. It is all in one excel file.
Sheet name: may2019
Productivity Count
Date : 01-Apr-2020 00:00 to 30-Apr-2020 23:59
Date Type: Finalized Date Modality: All
Name MR DX CT US MG BMD TOTAL
Svetlana 29 275 101 126 5 5 541
Kate 32 652 67 171 1 0 923
Andrew 0 452 0 259 1 0 712
Tom 50 461 61 104 4 0 680
Maya 0 353 0 406 0 0 759
Ben 0 1009 0 143 0 0 1152
Justin 0 2 9 0 1 9 21
Total 111 3204 238 1209 12 14 4788
Sheet Name: June 2020
Productivity Count
Date : 01-Jun-2019 00:00 to 30-Jun-2019 23:59
Date Type: Finalized Date Modality: All
NAme US DX CT MR MG BMD TOTAL
Svetlana 4 0 17 6 0 4 31
Kate 158 526 64 48 1 0 797
Andrew 154 230 0 0 0 0 384
Tom 1 0 19 20 2 8 50
Maya 260 467 0 0 1 1 729
Ben 169 530 59 40 3 0 801
Justin 125 164 0 0 4 0 293
Alvin 0 1 0 0 0 0 1
Total 871 1918 159 114 11 13 3086
I want to merge all the sheets into on sheet, drop the first 3 rows of all the sheets and and this is the output I am looking for
Sl.No Name US_jun2019 DX_jun2019 CT_jun2019 MR_jun2019 MG_jun2019 BMD_jun2019 TOTAL_jun2019 MR_may2019 DX_may2019 CT_may2019 US_may2019 MG_may2019 BMD_may2019 TOTAL_may2019
1 Svetlana 4 0 17 6 0 4 31 29 275 101 126 5 5 541
2 Kate 158 526 64 48 1 0 797 32 652 67 171 1 0 923
3 Andrew 154 230 0 0 0 0 384 0 353 0 406 0 0 759
4 Tom 1 0 19 20 2 8 50 0 2 9 0 1 9 21
5 Maya 260 467 0 0 1 1 729 0 1009 0 143 0 0 1152
6 Ben 169 530 59 40 3 0 801 50 461 61 104 4 0 680
7 Justin 125 164 0 0 4 0 293 0 452 0 259 1 0 712
8 Alvin 0 1 0 0 0 0 1 #N/A #N/A #N/A #N/A #N/A #N/A #N/A
I tried the following code but the output is not the one i am looking for.
df=pd.concat(df,sort=False)
df= df.drop(df.index[[0,1]])
df=df.rename(columns=df.iloc[0])
df= df.drop(df.index[[0]])
df=df.drop(['Sl.No'], axis = 1)
print(df)

First, read both Excel sheets.
>>> df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019")
>>> df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019")
Drop the first three rows.
>>> df1.drop(index=range(3), inplace=True)
>>> df2.drop(index=range(3), inplace=True)
Rename columns to the first row, and drop the first row
>>> df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
>>> df1.drop(index=[0], inplace=True)
>>> df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
>>> df2.drop(index=[0], inplace=True)
Add suffixes to the columns.
>>> df1.rename(columns=lambda col_name: col_name + '_may2019', inplace=True)
>>> df2.rename(columns=lambda col_name: col_name + '_jun2019', inplace=True)
Remove the duplicate name column in the second DF.
>>> df2.drop(columns=['Name'], inplace=True)
Concatenate both the dataframes
>>> df = pd.concat([df1, df2], axis=1, inplace=True)
All the code in one place:
import pandas as pd
df1 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="may2019")
df2 = pd.read_excel('path/to/excel/file.xlsx', sheet_name="jun2019")
df1.drop(index=range(3), inplace=True)
df2.drop(index=range(3), inplace=True)
df1.rename(columns=dict(zip(df1.columns, df1.iloc[0])), inplace=True)
df1.drop(index=[0], inplace=True)
df2.rename(columns=dict(zip(df2.columns, df2.iloc[0])), inplace=True)
df2.drop(index=[0], inplace=True)
df1.rename(columns=lambda col_name: col_name + '_may2019', inplace=True)
df2.rename(columns=lambda col_name: col_name + '_jun2019', inplace=True)
df2.drop(columns=['Name'], inplace=True)
df = pd.concat([df2, df1], axis=1, inplace=True)
print(df)

use result of a function applied to groupby for calculation on the original df

I am having some data which look like as shown below df.
I am trying to calculate first the mean angle for each group using the function mean_angle. The calculated mean angle is then used to do another calculation per group using the function fun.
import pandas as pd
import numpy as np
generate sample data
a = np.array([1,2,3,4]).repeat(4)
x1 = 90 + np.random.randint(-15, 15, size=a.size//2 - 2 )
x2 = 270 + np.random.randint(-50, 50, size=a.size//2 + 2 )
b = np.concatenate((x1, x2))
np.random.shuffle(b)
df = pd.DataFrame({'a':a, 'b':b})
The returned dataframe is printed below.
a b
0 1 295
1 1 78
2 1 280
3 1 94
4 2 308
5 2 227
6 2 96
7 2 299
8 3 248
9 3 288
10 3 81
11 3 78
12 4 103
13 4 265
14 4 309
15 4 229
My functions are mean_angle and fun
def mean_angle(deg):
deg = np.deg2rad(deg)
deg = deg[~np.isnan(deg)]
S = np.sum(np.sin(deg))
C = np.sum(np.cos(deg))
mu = np.arctan2(S,C)
mu = np.rad2deg(mu)
if mu <0:
mu = 360 + mu
return mu
def fun(x, mu):
return np.where(abs(mu - x) < 45, x, np.where(x+180<360, x+180, x-180))
what I have tried
mu = df.groupby(['a'])['b'].apply(mean_angle)
df2 = df.groupby(['a'])['b'].apply(fun, args = (mu,)) #this function should be element wise
I know it is totally wrong but I could not come up with a better way.
The desired output is something like this where mu the mean_angle per group
a b c
0 1 295 np.where(abs(mu - 295) < 45, 295, np.where(295 +180<360, 295 +180, 295 -180))
1 1 78 np.where(abs(mu - 78) < 45, 78, np.where(78 +180<360, 78 +180, 78 -180))
2 1 280 np.where(abs(mu - 280 < 45, 280, np.where(280 +180<360, 280 +180, 280 -180))
3 1 94 ...
4 2 308 ...
5 2 227 .
6 2 96 .
7 2 299 .
8 3 248 .
9 3 288 .
10 3 81 .
11 3 78 .
12 4 103 .
13 4 265 .
14 4 309 .
15 4 229 .
Any help is appreciated

You don't need your second function, just pass the necessary columns to np.where(). So creating your dataframe in the same manner and not modifying your mean_angle function, we have the following sample dataframe:
a b
0 1 228
1 1 291
2 1 84
3 1 226
4 2 266
5 2 311
6 2 82
7 2 274
8 3 79
9 3 250
10 3 222
11 3 88
12 4 80
13 4 291
14 4 100
15 4 293
Then create your c column (containing your mu values) using groupby() and transform(), and finally apply your np.where() logic:
df['c'] = df.groupby(['a'])['b'].transform(mean_angle)
df['c'] = np.where(abs(df['c'] - df['b']) < 45, df['b'], np.where(df['b']+180<360, df['b']+180, df['b']-180))
Yields:
a b c
0 1 228 228
1 1 291 111
2 1 84 264
3 1 226 226
4 2 266 266
5 2 311 311
6 2 82 262
7 2 274 274
8 3 79 259
9 3 250 70
10 3 222 42
11 3 88 268
12 4 80 260
13 4 291 111
14 4 100 280
15 4 293 113

pandas number >mean(), or <mean() , than output a number

I have a dataframe like this:
Id F M R
7 1 286 907
12 1 286 907
17 1 186 1271
21 1 296 905
30 1 308 908
32 1 267 905
40 2 591 788
41 1 486 874
47 1 686 906
74 1 230 907
for each row if f> f's mean() and M> M's mean() and R>R's mean(),then output in new column is "1".
like this:
Id F M R score
7 1 286 907 1
12 1 286 907 0
17 1 186 1271 1
21 1 296 905
30 1 308 908
32 1 267 905
40 2 591 788
41 1 486 874
47 1 686 906
74 1 230 907

You can use numpy.where with mask created with comparing 3 columns with their mean and then use all for check all rows are True:
# I modify last value in row with index 6 to 1000
print (df)
Id F M R
0 7 1 286 907
1 12 1 286 907
2 17 1 186 1271
3 21 1 296 905
4 30 1 308 908
5 32 1 267 905
6 40 2 591 1000
7 41 1 486 874
8 47 1 686 906
9 74 1 230 907
print (df.F.mean())
1.1
print (df.M.mean())
362.2
print (df.R.mean())
949.0
print (df[['F','M','R']] > df[['F','M','R']].mean())
F M R
0 False False False
1 False False False
2 False False True
3 False False False
4 False False False
5 False False False
6 True True True
7 False True False
8 False True False
9 False False False
mask = (df[['F','M','R']] > df[['F','M','R']].mean()).all(1)
print (mask)
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 False
8 False
9 False
dtype: bool
df['score'] = np.where(mask,1,0)
print (df)
Id F M R score
0 7 1 286 907 0
1 12 1 286 907 0
2 17 1 186 1271 0
3 21 1 296 905 0
4 30 1 308 908 0
5 32 1 267 905 0
6 40 2 591 1000 1
7 41 1 486 874 0
8 47 1 686 906 0
9 74 1 230 907 0
If condition is changed:
mask = (df.F > df.F.mean()) & (df.M < df.M.mean()) & (df.R < df.R.mean())
print (mask)
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
df['score'] = np.where(mask,2,0)
print (df)
Id F M R score
0 7 1 286 907 0
1 12 1 286 907 0
2 17 1 186 1271 0
3 21 1 296 905 0
4 30 1 308 908 0
5 32 1 267 905 0
6 40 2 591 1000 0
7 41 1 486 874 0
8 47 1 686 906 0
9 74 1 230 907 0
EDIT:
I think you can first check if in some conditions are not in some row more as one values by:
mask1 = (df.F > df.F.mean()) & (df.M > df.M.mean()) & (df.R > df.R.mean())
mask2 = (df.F > df.F.mean()) & (df.M < df.M.mean()) & (df.R < df.R.mean())
mask3 = (df.F < df.F.mean()) & (df.M < df.M.mean()) & (df.R < df.R.mean())
df['score1'] = np.where(mask1,1,0)
df['score2'] = np.where(mask2,2,0)
df['score3'] = np.where(mask3,3,0)
If not, use:
df.loc[mask1, 'score'] = 1
df.loc[mask2, 'score'] = 2
df.loc[mask3, 'score'] = 3
df.score.fillna(0, inplace=True)

df.loc[df['f']>df['f'].mean(),['f']] += 1
df.loc[df['m']>df['m'].mean(),['m']] += 1
df.loc[df['r']>df['r'].mean(),['r']] += 1
Have not tested this,please try and comment if works.
Or try this one
df['f'] = [x+1 for x in df['f'] if x>df['f'].mean()]
df['m'] = [x+1 for x in df['m'] if x>df['m'].mean()]
df['r'] = [x+1 for x in df['r'] if x>df['r'].mean()]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Generate column of unique ID in pandas - python

Use groupby and cumcount: df['unique'] = df.groupby(['bins_x','bins_y']).cumcount() >>> df.sort_values(['bins_x', 'bins_y']).head(10) bins_x bins_y z unique 207 1 1 4 0 259 1 1 313 1 327 1 1 300 2 341 1 1 64 3 440 1 1 398 4 573 1 1 96 5 174 1 2 219 0 563 1 2 398 1 796 1 2 417 2 809 1 2 167 3

Related

Split columns conditionally on string

How to calculate min and max of a column for particular rows?

How to merge multiple sheets and rename column names with the names of the sheet names?

use result of a function applied to groupby for calculation on the original df

pandas number >mean(), or <mean() , than output a number

Categories

Resources