I am trying to classify data based on a reference dataframe.
The reference is df1, and I want to classify df2 based on df1.
df1:
PAUCode SubClass
1 RA
2 RB
3 CZ
df2:
PAUCode SubClass
2 non
2 non
2 non
3 non
1 non
2 non
3 non
I want df2 to end up like this:
expected result:
PAUCode SubClass
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
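For reproducibility, the two frames above can be constructed like this (a minimal sketch; the values are taken from the tables shown):

```python
import pandas as pd

# Reference table: one SubClass per PAUCode
df1 = pd.DataFrame({'PAUCode': [1, 2, 3],
                    'SubClass': ['RA', 'RB', 'CZ']})

# Data to classify: SubClass holds a 'non' placeholder to be filled
df2 = pd.DataFrame({'PAUCode': [2, 2, 2, 3, 1, 2, 3],
                    'SubClass': ['non'] * 7})
```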
Option 1
fillna
import numpy as np

df2 = df2.replace('non', np.nan)
df2.set_index('PAUCode').SubClass\
   .fillna(df1.set_index('PAUCode').SubClass)
PAUCode
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Name: SubClass, dtype: object
Option 2
map
df2.PAUCode.map(df1.set_index('PAUCode').SubClass)
0 RB
1 RB
2 RB
3 CZ
4 RA
5 RB
6 CZ
Name: PAUCode, dtype: object
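Because map preserves df2's row order and length, the result can be assigned straight back into the frame. A sketch assuming the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'PAUCode': [1, 2, 3], 'SubClass': ['RA', 'RB', 'CZ']})
df2 = pd.DataFrame({'PAUCode': [2, 2, 2, 3, 1, 2, 3], 'SubClass': ['non'] * 7})

# Build the lookup Series once, then map and assign in place;
# any PAUCode missing from df1 would become NaN rather than raise.
lookup = df1.set_index('PAUCode')['SubClass']
df2['SubClass'] = df2['PAUCode'].map(lookup)
```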
Option 3
merge
df2[['PAUCode']].merge(df1, on='PAUCode')
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 2 RB
4 3 CZ
5 3 CZ
6 1 RA
Note here the order of the data changes, but the answer remains the same.
Let us use reindex to preserve the original order:
df1.set_index('PAUCode').reindex(df2.PAUCode).reset_index()
Out[9]:
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 3 CZ
4 1 RA
5 2 RB
6 3 CZ
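If you prefer merge but want df2's original row order kept, a left merge on df2 does it, since a left merge preserves the left frame's row order. A sketch with the same sample frames:

```python
import pandas as pd

df1 = pd.DataFrame({'PAUCode': [1, 2, 3], 'SubClass': ['RA', 'RB', 'CZ']})
df2 = pd.DataFrame({'PAUCode': [2, 2, 2, 3, 1, 2, 3], 'SubClass': ['non'] * 7})

# how='left' keeps df2's ordering, so the result lines up row-for-row
# with df2 instead of being regrouped by PAUCode.
out = df2[['PAUCode']].merge(df1, on='PAUCode', how='left')
```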
Related
I have two dataframes with identical columns. However, the 'labels' column can have different labels. All labels are comma-separated strings. I want to take the union of the labels in order to go from this:
df1:
id1 id2 labels language
0 1 1 1 en
1 2 3 en
2 3 4 4 en
3 4 5 en
4 5 6 en
df2:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 5,7 en
3 4 5 en
4 5 6 3 en
to this:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 3 en
I've tried this:
df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['id1', 'id2', 'language'], how='outer')
result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'labels_y'], axis=1)
but I get a weird df with stray commas in some places, e.g. the ,3:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 ,3 en
How can I properly fix the commas? Any help is appreciated!
Here is a possible solution with pandas.merge:
out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().str.split(",")
                                 .explode().drop_duplicates()
                                 .groupby(level=0).agg(",".join))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]
)
Output:
print(out)
id1 id2 labels language
0 1 1 1,2 en
1 2 3 NaN en
2 3 4 4,5,7 en
3 4 5 NaN en
4 5 6 3 en
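An alternative is to union the two label strings row by row, which sidesteps the stray-comma problem by discarding empty fragments explicitly. A sketch (note: it is not the answer above; rows that are empty on both sides come out as empty strings rather than NaN, and labels are joined in sorted order):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [1, 2, 3, 4, 5],
                    'id2': [1, 3, 4, 5, 6],
                    'labels': ['1', '', '4', '', ''],
                    'language': ['en'] * 5})
df2 = pd.DataFrame({'id1': [1, 2, 3, 4, 5],
                    'id2': [1, 3, 4, 5, 6],
                    'labels': ['1,2', '', '5,7', '', '3'],
                    'language': ['en'] * 5})

def union_labels(a, b):
    # Split both strings, drop empty fragments, join the sorted union.
    items = set(str(a).split(',')) | set(str(b).split(','))
    items.discard('')
    return ','.join(sorted(items))

out = df1.merge(df2, on=['id1', 'id2', 'language'])
out['labels'] = [union_labels(a, b)
                 for a, b in zip(out['labels_x'], out['labels_y'])]
out = out.drop(columns=['labels_x', 'labels_y'])[df1.columns]
```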
I have the following dataframe:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1 CZ
1 nan CZ
2 2 SK
3 4 AE
4 nan DK
5 nan CZ
6 nan DK
7 nan ES
For all blank values in the "UNIQUE_IDENTIFIER" column, I would like to create a value that takes the "COUNTRY_CODE" and adds an incremental number (with a space between the number and the country code), starting from 1 for each different country code. The final dataframe would be this:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1 CZ
1 CZ 1 CZ
2 2 SK
3 4 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
What would be the best way to do it?
Use GroupBy.cumcount only on the rows where UNIQUE_IDENTIFIER is missing, grouped by COUNTRY_CODE, and join it to the COUNTRY_CODE values with a space separator:
m = df.UNIQUE_IDENTIFIER.isna()
s = df[m].groupby('COUNTRY_CODE').cumcount().add(1).astype(str)
df.loc[m, 'UNIQUE_IDENTIFIER'] = df.loc[m, 'COUNTRY_CODE'] + ' ' + s
print (df)
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
Or use Series.fillna to replace the missing values:
s = df[df.UNIQUE_IDENTIFIER.isna()].groupby('COUNTRY_CODE').cumcount().add(1).astype(str)
df['UNIQUE_IDENTIFIER'] = df['UNIQUE_IDENTIFIER'].fillna(df['COUNTRY_CODE'] + ' ' + s)
print (df)
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
Details:
print (df[m].groupby('COUNTRY_CODE').cumcount().add(1).astype(str))
1 1
4 1
5 2
6 2
7 1
dtype: object
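Putting the first approach together as a runnable sketch (the blank cells are modeled with np.nan, and the column is cast to object so string labels can coexist with the numeric ids):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'UNIQUE_IDENTIFIER': [1, np.nan, 2, 4, np.nan, np.nan, np.nan, np.nan],
    'COUNTRY_CODE': ['CZ', 'CZ', 'SK', 'AE', 'DK', 'CZ', 'DK', 'ES'],
})

# Cast to object so the column can hold both numbers and strings.
df['UNIQUE_IDENTIFIER'] = df['UNIQUE_IDENTIFIER'].astype(object)

# 1-based per-country counter over the missing rows only,
# then build "<COUNTRY_CODE> <n>" labels for those rows.
m = df['UNIQUE_IDENTIFIER'].isna()
s = df[m].groupby('COUNTRY_CODE').cumcount().add(1).astype(str)
df.loc[m, 'UNIQUE_IDENTIFIER'] = df.loc[m, 'COUNTRY_CODE'] + ' ' + s
```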
You can set up an incremental count with GroupBy.cumcount, then add 1 and convert to string, and use it either with fillna (option #1) or to replace the values with boolean indexing (option #2):
s = df['COUNTRY_CODE'].where(df['UNIQUE_IDENTIFIER'].isna(), '')
df['UNIQUE_IDENTIFIER'] = (df['UNIQUE_IDENTIFIER']
.fillna(s+' '+s.groupby(s).cumcount()
.add(1).astype(str))
)
or:
m = df['UNIQUE_IDENTIFIER'].isna()
s = df['COUNTRY_CODE'].where(m, '')
df.loc[m, 'UNIQUE_IDENTIFIER'] = s+' '+s.groupby(s).cumcount().add(1).astype(str)
output:
UNIQUE_IDENTIFIER COUNTRY_CODE
0 1.0 CZ
1 CZ 1 CZ
2 2.0 SK
3 4.0 AE
4 DK 1 DK
5 CZ 2 CZ
6 DK 2 DK
7 ES 1 ES
I have a dataframe and I want to create a new column based on a condition on a different column: create a new column "ans" that starts at 1 and increments based on the column "ix". If a value in "ix" is the same as the next one, keep "ans" the same; if it is different, increment "ans".
index ix
1 pa
2 pa
3 pa
4 pe
5 fc
6 pb
7 pb
8 df
should result in:
index ix ans
1 pa 1
2 pa 1
3 pa 1
4 pe 2
5 fc 3
6 pb 4
7 pb 4
8 df 5
In [47]: df['ans'] = (df['ix'] != df['ix'].shift(1)).cumsum()
In [48]: df
Out[48]:
index ix ans
0 1 pa 1
1 2 pa 1
2 3 pa 1
3 4 pe 2
4 5 fc 3
5 6 pb 4
6 7 pb 4
7 8 df 5
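The trick here: comparing each value with its predecessor flags the start of every new run of equal values, and a cumulative sum over those flags numbers the runs from 1. A sketch of the intermediate step, assuming the sample frame above:

```python
import pandas as pd

df = pd.DataFrame({'ix': ['pa', 'pa', 'pa', 'pe', 'fc', 'pb', 'pb', 'df']})

# True wherever a new run of equal values begins (the first row
# compares against NaN, so it is always True).
starts = df['ix'] != df['ix'].shift()

# Cumulative sum over the booleans turns run starts into 1-based run ids.
df['ans'] = starts.cumsum()
```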
Looking for assistance to group by elements of a column in a Pandas df.
Original df:
Country Feature Number
0 US A 1
1 DE A 2
2 FR A 3
3 US B 0
4 DE B 5
5 FR B 7
6 US C 9
7 DE C 0
8 FR C 1
Desired df:
Country A B C
0 US 1 0 9
1 DE 2 5 0
2 FR 3 7 1
Not sure if groupby is the best choice or if I should create a dictionary. Thanks in advance for your help!
You could use pivot_table for that:
In [39]: df.pivot_table(index='Country', columns='Feature')
Out[39]:
Number
Feature A B C
Country
DE 2 5 0
FR 3 7 1
US 1 0 9
If you want your index to be 0, 1, 2, you can use reset_index.
EDIT
If your Number column actually contains strings rather than numbers, you can convert it with astype or with pd.to_numeric:
df.Number = df.Number.astype(float)
or:
df.Number = pd.to_numeric(df.Number)
Note: pd.to_numeric is available only for pandas >= 0.17.0
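To get exactly the desired flat layout (Country as an ordinary column and A/B/C as plain column names), you can pass values= and then reset the index; a sketch with the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['US', 'DE', 'FR'] * 3,
                   'Feature': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Number': [1, 2, 3, 0, 5, 7, 9, 0, 1]})

# values='Number' avoids the extra column level, and rename_axis
# drops the leftover 'Feature' label from the columns.
out = (df.pivot_table(index='Country', columns='Feature', values='Number')
         .reset_index()
         .rename_axis(None, axis=1))
```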
I have this simple multiindex dataframe df obtained after performing some groupby.size() operations:
U G C
1 1 en 0.600000
2 en 0.400000
2 1 es 0.333333
3 es 0.500000
I would like to keep only the rows having the maximum value of the last column within each U index group. So far I have tried grouping:
mask = df.groupby(level=[0]).max()
which returns:
U
1 0.6
2 0.5
but I need to keep the whole structure of the dataframe:
U G C
1 1 en
2 3 es
How can I recover the full multiindex structure of the dataframe?
For your df:
data
U G C
1 1 en 0.600000
2 en 0.400000
2 1 es 0.333333
3 es 0.500000
You can use
df[df['data'] == df.groupby(level=[0])['data'].transform(max)]
which returns
data
U G C
1 1 en 0.6
2 3 es 0.5