I have the following dataframe:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1 CZ
1 nan CZ
2 2 SK
3 4 AE
4 nan DK
5 nan CZ
6 nan DK
7 nan ES
For all blank values in the "UNIQUE_IDENTIFIER" column, I would like to create a value that takes the "COUNTRY_CODE" and adds an incremental number (with a space between the country code and the number), starting from 1 for each different country code. So the final dataframe would be this:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1 CZ
1 CZ 1 CZ
2 2 SK
3 4 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
What would be the best way to do it?
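For reference, a minimal construction of the example frame (a sketch, assuming the blank identifiers are genuine NaN values):
import numpy as np
import pandas as pd

# blanks in UNIQUE_IDENTIFIER are represented as NaN
df = pd.DataFrame({'UNIQUE_IDENTIFIER': [1, np.nan, 2, 4, np.nan, np.nan, np.nan, np.nan],
                   'COUNTRY_CODE': ['CZ', 'CZ', 'SK', 'AE', 'DK', 'CZ', 'DK', 'ES']})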
Use GroupBy.cumcount only on the rows where UNIQUE_IDENTIFIER is missing, then prepend the COUNTRY_CODE values with a space separator:
m = df.UNIQUE_IDENTIFIER.isna()
s = df[m].groupby('COUNTRY_CODE').cumcount().add(1).astype(str)
df.loc[m, 'UNIQUE_IDENTIFIER'] = df.loc[m, 'COUNTRY_CODE'] + ' ' + s
print (df)
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
Or use Series.fillna to replace the missing values:
s = df[df.UNIQUE_IDENTIFIER.isna()].groupby('COUNTRY_CODE').cumcount().add(1).astype(str)
df['UNIQUE_IDENTIFIER'] = df['UNIQUE_IDENTIFIER'].fillna(df['COUNTRY_CODE'] + ' ' + s)
print (df)
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
Details:
print (df[m].groupby('COUNTRY_CODE').cumcount().add(1).astype(str))
1 1
4 1
5 2
6 2
7 1
dtype: object
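If the trailing .0 on the existing numeric identifiers (e.g. 1.0) is unwanted, one option is to cast them to strings before filling; a minimal sketch, assuming plain string identifiers are acceptable:
# Cast via the nullable Int64 dtype so NaN survives, then to string,
# so existing ids print as "1" instead of "1.0" after the fill.
df['UNIQUE_IDENTIFIER'] = df['UNIQUE_IDENTIFIER'].astype('Int64').astype('string')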
You can set up an incremental count with GroupBy.cumcount, then add 1 and convert to string, and use it either to fillna (option #1) or to replace the values with boolean indexing (option #2):
s = df['COUNTRY_CODE'].where(df['UNIQUE_IDENTIFIER'].isna(), '')
df['UNIQUE_IDENTIFIER'] = (
    df['UNIQUE_IDENTIFIER']
    .fillna(s + ' ' + s.groupby(s).cumcount().add(1).astype(str))
)
or:
m = df['UNIQUE_IDENTIFIER'].isna()
s = df['COUNTRY_CODE'].where(m, '')
df.loc[m, 'UNIQUE_IDENTIFIER'] = s+' '+s.groupby(s).cumcount().add(1).astype(str)
output:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
I have two dataframes with identical columns. However, the 'labels' column can have different labels. All labels are comma-separated strings. I want to make a union of the labels in order to go from this:
df1:
id1 id2 labels language
0 1 1 1 en
1 2 3 en
2 3 4 4 en
3 4 5 en
4 5 6 en
df2:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 5,7 en
3 4 5 en
4 5 6 3 en
to this:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 3 en
I've tried this:
df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['id1', 'id2', 'language'], how='outer')
result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'labels_y'], axis=1)
but I get a weird df with odd commas in some places, e.g. the ",3":
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 ,3 en
How can I properly fix the commas? Any help is appreciated!
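For reproduction, a minimal construction of the two frames (a sketch, assuming the blank labels are missing values, i.e. NaN):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 2, 3, 4, 5],
                    'id2': [1, 3, 4, 5, 6],
                    'labels': ['1', np.nan, '4', np.nan, np.nan],
                    'language': ['en'] * 5})
df2 = pd.DataFrame({'id1': [1, 2, 3, 4, 5],
                    'id2': [1, 3, 4, 5, 6],
                    'labels': ['1,2', np.nan, '5,7', np.nan, '3'],
                    'language': ['en'] * 5})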
Here is a possible solution with pandas.merge:
out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().str.split(",")
                                 .explode().drop_duplicates()
                                 .groupby(level=0).agg(",".join))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]
)
Output:
print(out)
id1 id2 labels language
0 1 1 1,2 en
1 2 3 NaN en
2 3 4 4,5,7 en
3 4 5 NaN en
4 5 6 3 en
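The drop_duplicates above removes duplicates across the whole stacked series; if the same label could legitimately appear in two different rows, a variant of the same idea (a sketch, not part of the original answer) deduplicates within each row instead:
merged = df1.merge(df2, on=["id1", "id2", "language"])
labels = (merged.filter(like="labels")          # labels_x and labels_y
                .stack()                        # drops NaN, keeps (row, column) index
                .str.split(",")
                .explode()
                .groupby(level=0)               # back to one group per original row
                .agg(lambda s: ",".join(dict.fromkeys(s))))  # order-preserving, per-row dedup
out = (merged.assign(labels=labels)
             .drop(columns=["labels_x", "labels_y"])[df1.columns])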
Below is the input data
df1
A B C D E F G
Messi Forward Argentina 1 NaN 5 6
Ronaldo Defender Portugal NaN 4 NaN 3
Messi Midfield Argentina NaN 5 NaN 6
Ronaldo Forward Portugal 3 Nan 2 3
Mbappe Forward France 1 3 2 5
Below is the intended output
df
A B C D E F G
Messi Forward,Midfield Argentina 1 5 5 6
Ronaldo Forward,Defender Portugal 3 4 2 3
Mbappe Forward France 1 3 2 5
My try:
df.groupby(['A','C'])['B'].agg(','.join).reset_index()
df.fillna(method='ffill')
Do we have a better way to do this?
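For reference, a minimal construction of the input (a sketch, assuming the NaN cells are genuine missing values):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['Messi', 'Ronaldo', 'Messi', 'Ronaldo', 'Mbappe'],
                   'B': ['Forward', 'Defender', 'Midfield', 'Forward', 'Forward'],
                   'C': ['Argentina', 'Portugal', 'Argentina', 'Portugal', 'France'],
                   'D': [1, np.nan, np.nan, 3, 1],
                   'E': [np.nan, 4, 5, np.nan, 3],
                   'F': [5, np.nan, np.nan, 2, 2],
                   'G': [6, 3, 6, 3, 5]})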
You can aggregate the first non-missing value per group for all columns other than A and C, and aggregate B by joining:
d = dict.fromkeys(df.columns.difference(['A','C']), 'first')
d['B'] = ','.join
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d)
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1.0 5.0 5.0 6
1 Ronaldo Portugal Defender,Forward 3.0 4.0 2.0 3
2 Mbappe France Forward 1.0 3.0 2.0 5
If nullable dtypes are preferred, so the numeric columns print without the trailing .0, chain convert_dtypes:
df1 = df.groupby(['A','C'], sort=False, as_index=False).agg(d).convert_dtypes()
print (df1)
A C B D E F G
0 Messi Argentina Forward,Midfield 1 5 5 6
1 Ronaldo Portugal Defender,Forward 3 4 2 3
2 Mbappe France Forward 1 3 2 5
For a generic method without manually defining the columns, you can use the column types to decide whether to aggregate with ', '.join or 'first':
from pandas.api.types import is_string_dtype
out = (df.groupby(['A', 'C'], as_index=False)
         .agg({c: ', '.join if is_string_dtype(df[c]) else 'first'
               for c in df.columns.difference(['A', 'C'])})
      )
Output:
         A          C                  B    D    E    F  G
0   Mbappe     France            Forward  1.0  3.0  2.0  5
1    Messi  Argentina  Forward, Midfield  1.0  5.0  5.0  6
2  Ronaldo   Portugal  Defender, Forward  3.0  4.0  2.0  3
I have the below data frame:
df = pd.DataFrame([['NY','R',1],
['NJ','Y',12],
['FL','B',20],
['CA','B',40],
['AZ','Y',51],
['NY','R',2],
['NJ','Y',18],
['FL','B',30],
['CA','B',20],
['AZ','Y',45],
['NY','Y',3],
['NJ','R',15],
['FL','R',10],
['CA','R',70],
['AZ','B',25],
['NY','B',4],
['NJ','B',17],
['FL','Y',30],
['CA','R',30],
['AZ','B',75],
['FL','R',5],
['FL','Y',25],
['NJ','R',14],
['NJ','B',11],
['NY','B',5],
['NY','Y',7]],
columns = ['State', 'ID','data'])
State ID data
0 NY R 1
1 NJ Y 12
2 FL B 20
3 CA B 40
4 AZ Y 51
5 NY R 2
6 NJ Y 18
7 FL B 30
8 CA B 20
9 AZ Y 45
10 NY Y 3
11 NJ R 15
12 FL R 10
13 CA R 70
14 AZ B 25
15 NY B 4
16 NJ B 17
17 FL Y 30
18 CA R 30
19 AZ B 75
20 FL R 5
21 FL Y 25
22 NJ R 14
23 NJ B 11
24 NY B 5
25 NY Y 7
What I want to do: re-create a new data frame such that it only contains the smallest number for each ID of each State. For example, for State NY and ID R there are two values, 1 and 2, so the new data frame will only take the value 1 for State NY and ID R. The new data frame should preferably look like this:
State dataR dataB dataY
0 NY 1.0 4 3.0
1 NJ 14.0 11 12.0
2 FL 5.0 20 25.0
3 CA 30.0 20 NaN
4 AZ NaN 25 45.0
Note that States AZ and CA do not have any value (NaN) for columns dataR and dataY, respectively, because they have no such values in the original data frame. Please also note that the columns in the result become dataR, dataB and dataY; I name them like this so the result can easily be read later on against the actual data.
AND: I also want to be flexible, so that I can seek the minimum value of data for each State over combined IDs, for example R&Y together and B on its own, in which case the new data frame will look like:
State dataRY dataB
0 NY 1 4
1 NJ 12 11
2 FL 5 20
3 CA 30 20
4 AZ 45 25
I tried using for loops as below:
colours = [['R'],['B'],['Y']]

def rearranging(df):
    df_result = []
    for c in colours:
        df_colours = df[df['ID'].isin(c)]
        df_colours_result = []
        for state in np.unique(df['State'].values):
            df1 = df_colours[df_colours['State'] == state]
            df2 = df1.nsmallest(1, 'data')
            df_colours_result.append(df2)
        first_loop_result = pd.concat(df_colours_result, ignore_index=True, sort=False)
        df_result.append(first_loop_result)
    final_result = pd.concat(df_result, axis=1)
    return final_result
The variable colours is there because I want to be flexible and be able to change its values if the data source changes later in time.
And the result of the above for loop is:
State ID data State ID data State ID data
0 CA R 30.0 AZ B 25 AZ Y 45.0
1 FL R 5.0 CA B 20 FL Y 25.0
2 NJ R 14.0 FL B 20 NJ Y 12.0
3 NY R 1.0 NJ B 11 NY Y 3.0
4 NaN NaN NaN NY B 4 NaN NaN NaN
I don't like my result because it is hard to read and I need to re-arrange and rename the columns again. Is there any way to get the result I actually aim for above using for loops? Vectorization is also welcome.
Please also note (once again) that I want to be flexible on the ID column. This is why I include, for example, the case where I need to see the smallest value of data for each State for ID R&Y combined and ID B. In my attempt I simply alter the code as below, with the loop staying the same:
colours = [['R','Y'],['B']]
And the result is:
State ID data State ID data
0 AZ Y 45 AZ B 25
1 CA R 30 CA B 20
2 FL R 5 FL B 20
3 NJ Y 12 NJ B 11
4 NY R 1 NY B 4
Note: in the comparison, if there is a NaN it is simply ignored (and not treated as zero).
Once again the result is not the same as what I aim for and this table is not informative enough.
IIUC, use groupby() on State and ID and get the min of the data column, then unstack and add_prefix if required:
df.groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_')
ID data_R data_Y data_B
State
NY 1.0 3.0 4.0
NJ 14.0 12.0 11.0
FL 5.0 25.0 20.0
CA 30.0 NaN 20.0
AZ NaN 45.0 25.0
EDIT: As requested by the OP, if you want to merge Y and R together, just replace them and do the same:
(df.assign(ID=df['ID'].replace(['Y','R'],'YR'))
.groupby(['State','ID'],sort=False)['data'].min().unstack().add_prefix('data_'))
ID data_YR data_B
State
NY 1 4
NJ 12 11
FL 5 20
CA 30 20
AZ 45 25
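To keep the OP's flexibility around the colours list, the same replace-then-pivot idea can be wrapped in a small helper; a sketch (the helper name and the group labels are illustrative, not from the answer above):
def min_by_id_groups(df, groups):
    # Map every ID to a label for its group, e.g. [['R', 'Y'], ['B']] -> {'R': 'RY', 'Y': 'RY', 'B': 'B'}
    mapping = {i: ''.join(g) for g in groups for i in g}
    return (df.assign(ID=df['ID'].map(mapping))
              .groupby(['State', 'ID'], sort=False)['data'].min()
              .unstack()
              .add_prefix('data_'))

print(min_by_id_groups(df, [['R'], ['B'], ['Y']]))   # per-colour minima
print(min_by_id_groups(df, [['R', 'Y'], ['B']]))     # R and Y combined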
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'WA', 'CO', 'AZ'] * 3,
                   'id': [1, 2, 3, 4, 5, 6] * 2,
                   'sales': [np.random.randint(100000, 999999) for _ in range(12)]})
This is the output of df:
id sales state
0 1 847754 CA
1 2 362532 WA
2 3 615849 CO
3 4 376480 AZ
4 5 381286 CA
5 6 411001 WA
6 1 946795 CO
7 2 857435 AZ
8 3 928087 CA
9 4 675593 WA
10 5 371339 CO
11 6 440285 AZ
I am not able to compute the cumulative percentage within each group in descending order of sales. I want the output like this:
id sales state cumsum run_pct
0 2 857435 AZ 857435 0.5121460996296738
1 6 440285 AZ 1297720 0.7751284195436626
2 4 376480 AZ 1674200 1.0
3 3 928087 CA 928087 0.43024216932985404
4 1 847754 CA 1775841 0.8232436013271356
5 5 381286 CA 2157127 1.0
6 1 946795 CO 946795 0.48955704367618535
7 3 615849 CO 1562644 0.807992624547372
8 5 371339 CO 1933983 1.0
9 4 675593 WA 675593 0.46620721731581655
10 6 411001 WA 1086594 0.7498271371847582
11 2 362532 WA 1449126 1.0
One possible solution is to first sort the data, then calculate the cumsum, and finally the percentages.
Sorting with ascending states and descending sales:
df = df.sort_values(['state', 'sales'], ascending=[True, False])
Calculating the cumsum:
df['cumsum'] = df.groupby('state')['sales'].cumsum()
and the percentages:
df['run_pct'] = df.groupby('state')['sales'].apply(lambda x: (x/x.sum()).cumsum())
This will give:
id sales state cumsum run_pct
0 4 846079 AZ 846079 0.608566
1 2 312708 AZ 1158787 0.833491
2 6 231495 AZ 1390282 1.000000
3 3 790291 CA 790291 0.506795
4 1 554631 CA 1344922 0.862467
5 5 214467 CA 1559389 1.000000
6 1 983878 CO 983878 0.388139
7 5 779497 CO 1763375 0.695650
8 3 771486 CO 2534861 1.000000
9 6 794407 WA 794407 0.420899
10 2 587843 WA 1382250 0.732355
11 4 505155 WA 1887405 1.000000
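Equivalently, on the same sorted frame, run_pct can be derived from the cumsum column with a groupby transform, which avoids the apply; a small sketch:
# Cumulative sales divided by each state's total sales.
df['run_pct'] = df['cumsum'] / df.groupby('state')['sales'].transform('sum')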
I am trying to classify one dataframe based on a dataframe of standards.
The standard is df1, and I want to classify df2 based on df1.
df1:
PAUCode SubClass
1 RA
2 RB
3 CZ
df2:
PAUCode SubClass
2 non
2 non
2 non
3 non
1 non
2 non
3 non
I want to get df2 like below:
expected result:
PAUCode SubClass
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
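For reproduction, a minimal construction of the two frames (a sketch):
import pandas as pd

df1 = pd.DataFrame({'PAUCode': [1, 2, 3],
                    'SubClass': ['RA', 'RB', 'CZ']})
df2 = pd.DataFrame({'PAUCode': [2, 2, 2, 3, 1, 2, 3],
                    'SubClass': ['non'] * 7})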
Option 1
fillna
df2 = df2.replace('non', np.nan)
df2.set_index('PAUCode').SubClass\
   .fillna(df1.set_index('PAUCode').SubClass)
PAUCode
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Name: SubClass, dtype: object
Option 2
map
df2.PAUCode.map(df1.set_index('PAUCode').SubClass)
0 RB
1 RB
2 RB
3 CZ
4 RA
5 RB
6 CZ
Name: PAUCode, dtype: object
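To write the mapped values back into df2 (a small usage sketch following the option above):
df2['SubClass'] = df2['PAUCode'].map(df1.set_index('PAUCode')['SubClass'])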
Option 3
merge
df2[['PAUCode']].merge(df1, on='PAUCode')
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 2 RB
4 3 CZ
5 3 CZ
6 1 RA
Note here the order of the data changes, but the answer remains the same.
Let us use reindex:
df1.set_index('PAUCode').reindex(df2.PAUCode).reset_index()
Out[9]:
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 3 CZ
4 1 RA
5 2 RB
6 3 CZ
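One caveat worth noting: if df2 ever contains a PAUCode that is missing from df1, map and reindex leave NaN for it, while the default inner merge drops the row entirely. A sketch of keeping the original placeholder in that case (the 'non' fallback is illustrative):
lookup = df1.set_index('PAUCode')['SubClass']
df2['SubClass'] = df2['PAUCode'].map(lookup).fillna('non')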