Create union of two columns in pandas - python

I have two dataframes with identical columns. However, the 'labels' column can have different labels. All labels are comma-separated strings. I want to make a union on the labels in order to go from this:
df1:
   id1  id2 labels language
0    1    1      1       en
1    2    3             en
2    3    4      4       en
3    4    5             en
4    5    6             en
df2:
   id1  id2 labels language
0    1    1    1,2       en
1    2    3             en
2    3    4    5,7       en
3    4    5             en
4    5    6      3       en
to this:
   id1  id2 labels language
0    1    1    1,2       en
1    2    3             en
2    3    4  4,5,7       en
3    4    5             en
4    5    6      3       en
I've tried this:
df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['id1', 'id2', 'language'], how='outer')
result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'labels_y'], axis=1)
but I get a weird df with odd commas in some places, e.g. the ,3 in the last row:
   id1  id2 labels language
0    1    1    1,2       en
1    2    3             en
2    3    4  4,5,7       en
3    4    5             en
4    5    6     ,3       en
How can I properly fix the commas? Any help is appreciated!

Here is a possible solution with pandas.merge:
out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().str.split(",")
                                 .explode()
                                 .groupby(level=0)
                                 .agg(lambda s: ",".join(s.unique())))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]
)
Output:
print(out)

   id1  id2 labels language
0    1    1    1,2       en
1    2    3    NaN       en
2    3    4  4,5,7       en
3    4    5    NaN       en
4    5    6      3       en
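As for the stray commas in the original attempt: ''.split(',') returns [''], so if the empty cells are empty strings, the empty string survives the set union and ','.join then emits a separator for it. Below is a minimal, self-contained sketch of the merge/stack/explode approach, assuming the empty label cells are read in as NaN:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"id1": [1, 2, 3, 4, 5], "id2": [1, 3, 4, 5, 6],
                    "labels": ["1", np.nan, "4", np.nan, np.nan],
                    "language": ["en"] * 5})
df2 = pd.DataFrame({"id1": [1, 2, 3, 4, 5], "id2": [1, 3, 4, 5, 6],
                    "labels": ["1,2", np.nan, "5,7", np.nan, "3"],
                    "language": ["en"] * 5})

out = (
    df1.merge(df2, on=["id1", "id2", "language"])
       # stack drops the NaN cells, split/explode yields one label per row,
       # and the groupby joins the unique labels back per original row
       .assign(labels=lambda x: x.filter(like="label")
                                 .stack().str.split(",")
                                 .explode()
                                 .groupby(level=0)
                                 .agg(lambda s: ",".join(s.unique())))
       .drop(columns=["labels_x", "labels_y"])
       [df1.columns]
)
print(out)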

Related

Sum two columns in a grouped data frame using shift()

I have a data frame df where I would like to create a new column ID which is a diagonal combination of two other columns, ID1 & ID2.
This is the data frame:
import pandas as pd
df = pd.DataFrame({'Employee': [5, 5, 5, 20, 20],
                   'Department': [4, 4, 4, 6, 6],
                   'ID1': ['AB', 'CD', 'EF', 'XY', 'AA'],
                   'ID2': ['CD', 'EF', 'GH', 'AA', 'ZW']})
This is what the initial data frame looks like:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
If I group df by Employee & Department:
df2 = df.groupby(["Employee", "Department"])
there would be only two kinds of groups: those containing two rows and those containing three rows.
The column ID would be the concatenation of the current row's ID1 and the next row's ID2; for the last row of each group, ID would take the value of the previous row's ID.
Expected output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
I thought about using shift():
df2["ID"] = df["ID1"] + df["ID2"].shift(-1)
But I could not quite figure it out. Any ideas?
(df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
almost your code, but we first groupby and then shift up. Lastly forward fill for those last rows per group.
In [24]: df
Out[24]:
Employee Department ID1 ID2
0 5 4 AB CD
1 5 4 CD EF
2 5 4 EF GH
3 20 6 XY AA
4 20 6 AA ZW
In [25]: df["ID"] = (df["ID1"] + df.groupby(["Employee", "Department"])["ID2"].shift(-1)).ffill()
In [26]: df
Out[26]:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
You can groupby.shift, concatenate, and ffill:
df['ID'] = (df['ID1'] + df.groupby(['Employee', 'Department'])['ID2'].shift(-1)).ffill()
output:
output:
Employee Department ID1 ID2 ID
0 5 4 AB CD ABEF
1 5 4 CD EF CDGH
2 5 4 EF GH CDGH
3 20 6 XY AA XYZW
4 20 6 AA ZW XYZW
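For intuition, here is a small sketch of why the groupby matters before the shift: a plain shift(-1) would pull ID2 across the group boundary, so row 2 would become 'EFAA', borrowing 'AA' from Employee 20.
import pandas as pd

df = pd.DataFrame({'Employee': [5, 5, 5, 20, 20],
                   'Department': [4, 4, 4, 6, 6],
                   'ID1': ['AB', 'CD', 'EF', 'XY', 'AA'],
                   'ID2': ['CD', 'EF', 'GH', 'AA', 'ZW']})

# without groupby, shift(-1) leaks values across group boundaries
print(df['ID1'] + df['ID2'].shift(-1))  # row 2 becomes 'EFAA'
# with groupby, the shift stays inside each (Employee, Department) group
print(df['ID1'] + df.groupby(['Employee', 'Department'])['ID2'].shift(-1))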

How to find the next row that has a value in a column in a pandas dataframe?

I have a dataframe such as:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Every group should have the numbers 7, 8 and 9. In the example above, group 1 does not have all three numbers; the number 9 is missing. In that case, I would like to find the closest row with a 9 in the label column, add it to the dataframe, and change its date to the group's date.
So the desired result would be:
id info date group label
1 aa 02/05 1 7
2 ba 02/05 1 8
6 ii 02/05 1 9
3 cp 09/05 2 7
4 dd 09/05 2 8
5 ii 09/05 2 9
Welcome to SO. It's good if you include what you have tried so far, so keep that in mind. Anyhow, for this question, break down your thought process into pandas syntax. The first step would be to check which group is missing which of the labels {7, 8, 9}:
dfs = df.groupby(['group', 'date']).agg({'label': set}).reset_index().sort_values('group')
dfs['label'] = dfs['label'].apply(lambda x: {7, 8, 9}.difference(x)).explode()  # this is the missing label
dfs
Which will give you:
  group   date label
0     1  02/05     9
1     2  09/05   NaN
Now merge it with the original on label to get info filled in:
final_df = pd.concat([df, dfs.merge(df[['label', 'info']], on='label', suffixes=['','_grouped'])])
final_df
   id info   date  group  label
    1   aa  02/05      1      7
    2   ba  02/05      1      8
    3   cp  09/05      2      7
    4   dd  09/05      2      8
    5   ii  09/05      2      9
  NaN   ii  02/05      1      9
And prettify:
(final_df.reset_index(drop=True)
         .reset_index()
         .assign(id=lambda x: x['index'] + 1)
         .drop(columns=['index'])
         .sort_values(['group', 'id']))
   id info   date  group  label
    1   aa  02/05      1      7
    2   ba  02/05      1      8
    6   ii  02/05      1      9
    3   cp  09/05      2      7
    4   dd  09/05      2      8
    5   ii  09/05      2      9
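Pieced together, here is a minimal end-to-end sketch of the same idea (checking against the full label set {7, 8, 9}; the astype(int) cast is only there so both merge keys share a dtype):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'info': ['aa', 'ba', 'cp', 'dd', 'ii'],
                   'date': ['02/05', '02/05', '09/05', '09/05', '09/05'],
                   'group': [1, 1, 2, 2, 2],
                   'label': [7, 8, 7, 8, 9]})

# one row per (group, date) holding each label that group is missing
dfs = df.groupby(['group', 'date']).agg({'label': set}).reset_index()
dfs['label'] = dfs['label'].apply(lambda x: {7, 8, 9}.difference(x))
dfs = dfs.explode('label').dropna(subset=['label'])
dfs['label'] = dfs['label'].astype(int)

# borrow 'info' from an existing row that carries the missing label
fills = dfs.merge(df[['label', 'info']], on='label')

final_df = pd.concat([df, fills], ignore_index=True)
final_df['id'] = final_df.index + 1
print(final_df.sort_values(['group', 'id']))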

Count level 1 size per level 0 in multi index and add new column

What is a pythonic way of counting the level 1 size per level 0 in a MultiIndex and creating a new column (named counts)? I can achieve this in the following way but would like to gain an understanding of any simpler approaches:
Code
import pandas as pd

df = pd.DataFrame({'STNAME': ['AL'] * 3 + ['MI'] * 4,
                   'CTYNAME': list('abcdefg'),
                   'COL': range(7)}).set_index(['STNAME', 'CTYNAME'])
print(df)
                COL
STNAME CTYNAME
AL     a          0
       b          1
       c          2
MI     d          3
       e          4
       f          5
       g          6
df1 = df.groupby(level=0).size().reset_index(name='count')
counts = df.merge(df1,left_on="STNAME",right_on="STNAME")["count"].values
df["counts"] = counts
This is the desired output:
                COL  counts
STNAME CTYNAME
AL     a          0       3
       b          1       3
       c          2       3
MI     d          3       4
       e          4       4
       f          5       4
       g          6       4
You can use groupby.transform with size here instead of merging:
output = df.assign(counts=df.groupby(level=0)['COL'].transform('size'))
print(output)

                COL  counts
STNAME CTYNAME
AL     a          0       3
       b          1       3
       c          2       3
MI     d          3       4
       e          4       4
       f          5       4
       g          6       4
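An equivalent route, assuming the first index level is named STNAME, counts the rows per state once and then broadcasts the result back through Index.map:
# count rows per state, then broadcast onto every row of that state
sizes = df.index.get_level_values('STNAME').value_counts()
df['counts'] = df.index.get_level_values('STNAME').map(sizes)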

How to classify one column's values by another dataframe?

I am trying to classify one dataframe based on a dataframe of standards.
The standard is df1, and I want to classify df2 based on df1.
df1:
PAUCode SubClass
1 RA
2 RB
3 CZ
df2:
PAUCode SubClass
2 non
2 non
2 non
3 non
1 non
2 non
3 non
I want to get the df2 like as below:
expected result:
PAUCode SubClass
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Option 1
fillna
import numpy as np

df2 = df2.replace('non', np.nan)
df2.set_index('PAUCode').SubClass.fillna(df1.set_index('PAUCode').SubClass)
PAUCode
2    RB
2    RB
2    RB
3    CZ
1    RA
2    RB
3    CZ
Name: SubClass, dtype: object
Option 2
map
df2.PAUCode.map(df1.set_index('PAUCode').SubClass)
0    RB
1    RB
2    RB
3    CZ
4    RA
5    RB
6    CZ
Name: PAUCode, dtype: object
Option 3
merge
df2[['PAUCode']].merge(df1, on='PAUCode')
  PAUCode SubClass
0       2       RB
1       2       RB
2       2       RB
3       2       RB
4       3       CZ
5       3       CZ
6       1       RA
Note here the order of the data changes, but the answer remains the same.
Let us use reindex:
df1.set_index('PAUCode').reindex(df2.PAUCode).reset_index()
Out[9]:
  PAUCode SubClass
0       2       RB
1       2       RB
2       2       RB
3       3       CZ
4       1       RA
5       2       RB
6       3       CZ
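To write the Option 2 result back into df2 (a sketch assuming PAUCode is unique in df1; map keeps df2's row order, unlike the merge in Option 3):
df2['SubClass'] = df2['PAUCode'].map(df1.set_index('PAUCode')['SubClass'])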

Pandas: Group By Elements of a Column

Looking for assistance to group by elements of a column in a Pandas df.
Original df:
Country Feature Number
0 US A 1
1 DE A 2
2 FR A 3
3 US B 0
4 DE B 5
5 FR B 7
6 US C 9
7 DE C 0
8 FR C 1
Desired df:
Country A B C
0 US 1 0 9
1 DE 2 5 0
2 FR 3 7 1
Not sure if groupby is the best choice or if I should create a dictionary. Thanks in advance for your help!
You could use pivot_table for that:
In [39]: df.pivot_table(index='Country', columns='Feature')
Out[39]:
         Number
Feature       A  B  C
Country
DE            2  5  0
FR            3  7  1
US            1  0  9
If you want your index to be 0, 1, 2, you could use reset_index.
EDIT
If your Number column actually holds not numbers but strings, you could convert that column with astype or with pd.to_numeric:
df.Number = df.Number.astype(float)
or:
df.Number = pd.to_numeric(df.Number)
Note: pd.to_numeric is available only for pandas >= 0.17.0
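If each (Country, Feature) pair occurs only once, a sketch with pivot (which does no aggregation, unlike pivot_table's default mean) plus reset_index gives the wide layout asked for, with rows coming out sorted by Country:
out = df.pivot(index='Country', columns='Feature', values='Number').reset_index()
out.columns.name = None  # drop the leftover 'Feature' axis label
print(out)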
