Merge multiple tables and join the same column with comma split - python

I have about 15 CSV files with the same set of unique IDs, and in each file col1 contains different text. How can I join them together into a new table that contains all the information from those 15 files? I tried pd.merge, then building a new comma-separated col1 and deleting the duplicate columns, but that leaves columns named col1_x, col1_y, etc. Is there a better way to implement this?
My input is,
df1:
ID col1 location gender
1 Airplane NY F
2 Bus CA M
3 NaN FL M
4 Bus WA F
df2:
ID col1 location gender
1 Apple NY F
2 Peach CA M
3 Melon FL M
4 Banana WA F
df3:
ID col1 location gender
1 NaN NY F
2 Football CA M
3 Boxing FL M
4 Running WA F
Expected output is,
ID col1 location gender
1 Airplane,Apple NY F
2 Bus,Peach,Football CA M
3 Melon,Boxing FL M
4 Bus,Banana,Running WA F

You could use concat + groupby:
import pandas as pd

merged = pd.concat([df1, df2, df3], sort=False)
result = merged.dropna().groupby(['location', 'gender'], as_index=False).agg({'col1': ','.join}).reset_index(drop=True)
print(result)
Output
location gender col1
0 CA M Bus,Peach,Football
1 FL M Melon,Boxing
2 NY F Airplane,Apple
3 WA F Bus,Banana,Running
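If the ID column should be kept (as in the expected output above), a small variation of the same idea should work; this is just a sketch that groups on ID as well and only drops rows where col1 is missing:
result = (merged.dropna(subset=['col1'])
                .groupby(['ID', 'location', 'gender'], as_index=False)
                .agg({'col1': ','.join}))
print(result)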

For your data, you can do:
# reshape each frame to long format, drop missing values, then join the
# unique strings per ID and column before pivoting back to wide format
(pd.concat(df.melt(id_vars='ID').dropna() for df in [df1, df2, df3])
   .groupby(['ID', 'variable'])['value'].apply(lambda x: ','.join(x.unique()))
   .unstack()
)
Output:
variable col1 gender location
ID
1 Airplane,Apple F NY
2 Bus,Peach,Football M CA
3 Melon,Boxing M FL
4 Bus,Banana,Running F WA
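To scale the second approach to all 15 files, you could first read them in a loop. A minimal sketch, assuming the files sit in a data/ folder (hypothetical path) and all share the ID/col1/location/gender layout shown above:
import glob
import pandas as pd

# read every CSV in the (hypothetical) data/ folder
frames = [pd.read_csv(path) for path in glob.glob('data/*.csv')]

# same melt/groupby/unstack idea as above, applied to all files at once
result = (pd.concat(df.melt(id_vars='ID').dropna() for df in frames)
            .groupby(['ID', 'variable'])['value']
            .apply(lambda x: ','.join(x.unique()))
            .unstack()
            .reset_index())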

Related

Pandas filter without ~ and not in operator

I have two dataframes, as below:
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the below
a) Check whether the ID and Name from df1 are present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. Don't use the ~ or not in operator, because my df2 has millions of rows and that approach would produce irrelevant results.
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives the matching IDs and names between df1 and df2.
But I want to find the IDs and names that are present in df1 but missing from df2. I cannot use the ~ operator because it would return all the rows from df2 that don't have a match in df1, and in the real world my df2 has millions of rows. I only want to flag the missing df1 IDs and names in a Status column.
I expect my output to be like as below
ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
The expected output corresponds to matching by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID', 'Name', 'Sub'],
              right_on=['ID', 'Name', 'class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
A solution that tests membership with isin instead:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Note that if you match by only 2 columns, the output is different:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No

Python Pandas extract duplicates from a CSV and aggregate the values of a column

I have a CSV file consisting of 4 columns: A, B, C, D. I would like to:
find all duplicates that have the same values in columns A, B, C
for these, take the values of D and create a single row without duplicates, where column D is the union of the D values of all duplicates
Example CSV input:
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
Expected result:
John,Yes,123,street 1 street 2
Tom,No,345,street 1 street 2 street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Full code:
import pandas as pd
from io import StringIO
df = """A,B,C,D
John,Yes,123,street 1
John,Yes,123,street 2
Tom,No,345,street 1
Tom,No,345,street 2
Tom,No,345,street 3
Jason,Yes,567,street 1
Thomas,No,123,street 1
Jess,No,999,street 1"""
df = pd.read_csv(StringIO(df))
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index()
Output:
        A    B    C                           D
0   Jason  Yes  567                    street 1
1    Jess   No  999                    street 1
2    John  Yes  123           street 1 street 2
3  Thomas   No  123                    street 1
4     Tom   No  345  street 1 street 2 street 3
Explanation:
df.groupby(['A','B','C'])['D'].apply(' '.join).reset_index() is the same as
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g.values)).reset_index()
which is the same as these alternatives:
# alternative #1
df.groupby(['A','B','C'])['D'].apply(lambda g: ' '.join(g)).reset_index()
# alternative #2
df.groupby(['A','B','C']).apply(lambda g: ' '.join(g['D'])).reset_index()
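If the data really lives in a headerless CSV like the example input, a sketch of the full round trip could look like this (input.csv and output.csv are hypothetical file names):
import pandas as pd

# the example input has no header row, so supply the column names
df = pd.read_csv('input.csv', header=None, names=['A', 'B', 'C', 'D'])

# sort=False keeps the groups in their order of first appearance, matching the expected result
result = df.groupby(['A', 'B', 'C'], sort=False)['D'].apply(' '.join).reset_index()
result.to_csv('output.csv', index=False, header=False)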

dataframe not dropping duplicates

I have two dataframes:
df:
id Name Number Stat
1 co 4
2 ma 98
3 sa 0
df1:
id Name Number Stat
1 co 4
2 ma 98 5%
I want to merge both dataframes into one (dfnew) as follows:
id Name Number Stat
1 co 4
2 ma 98 5%
3 sa 0
I used
dfnew = pd.concat([df, df2])
dfnew = dfnew.drop_duplicates(keep='last')
I am not getting the result I want: the dataframes are concatenated, but the duplicates are not deleted. I need help please.
It seems you need to check only the first 3 columns for duplicates:
dfnew = pd.concat([df, df2]).drop_duplicates(subset=['id','Name','Number'], keep='last')
print (dfnew)
id Name Number Stat
2 3 sa 0 NaN
0 1 co 4 NaN
1 2 ma 98 5%
Alternatively, try the pd.merge function with an inner/outer join based on the requirement.
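A minimal sketch of that merge-based alternative, assuming the empty Stat cells are NaN and keeping the df/df2 names used above:
# outer join on the key columns, then prefer the Stat value from df2 where it exists
dfnew = df.merge(df2, on=['id', 'Name', 'Number'], how='outer', suffixes=('', '_new'))
dfnew['Stat'] = dfnew['Stat_new'].fillna(dfnew['Stat'])
dfnew = dfnew.drop(columns='Stat_new')
print(dfnew)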

DataFrame condition on multiple values python

I need to drop some rows from a dataframe with Python, based on multiple values.
Code Names Country
1 a France
2 b France
3 c USA
4 d Canada
5 e TOTO
6 f TITI
7 g Corona
I need to have this
Code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
I tried this:
df.drop(df[('f','b','c') in df['names']].index)
But it doesn't work: KeyError: False
It works for only one key, like this: df.drop(df['f' in df['names']].index)
Do you have any idea?
To remove rows with certain values:
indexNames = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(indexNames, inplace=True)
print(df)
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
Based on your example, I think this may be what you are looking for.
new_df = df.loc[~df.Names.isin(['f','b','c'])].copy()
new_df
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
In pandas, we can use the .drop() function to drop columns and rows. For dropping specific rows, we use axis=0.
So your required output can be achieved with the following line of code:
df4.drop([1,2,5], axis=0)
The output will be :
code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
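For completeness, the same filter can also be written with DataFrame.query, which some find more readable; this is a sketch of an alternative, not what any of the answers above used:
new_df = df.query("Names not in ['f', 'b', 'c']")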

pandas groupby on multiple columns

I have a data set which contains state codes and their statuses.
code status
1 AZ a
2 CA b
3 KS c
4 MO c
5 NY d
6 AZ d
7 MO a
8 MO b
9 MN b
10 NV a
11 NV e
12 MO f
13 NY a
14 NY a
15 NY b
I want to filter this data set down to codes with status a and count how many each code has. Example output will be:
code status
1 AZ a
2 MO a
3 NY a
AZ = 1, MO = 1, NY = 2
I used df.groupby("code").loc[df.status == 'a'] but didn't have any luck.
Any help appreciated!
Let's filter the dataframe for status a first, then group by code and count.
df[df.status == 'a'].groupby('code').size()
Output:
code
AZ 1
MO 1
NV 1
NY 2
dtype: int64
I've recreated your dataset
data = [["AZ","CA", "KS","MO","NY","AZ","MO","MO","MN","NV","NV","MO","NY","NY" ,"NY"],
["a","b","c","c","d","d","a","b","b","a","e","f","a","a","b"]]
df = pd.DataFrame(data)
df = df.T
df.columns = ["code","status" ]
df[df["status"] == "a"].groupby(["code", "status"]).size()
gives
code status
AZ a 1
MO a 1
NV a 1
NY a 2
dtype: int64
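A sketch of an equivalent one-liner using value_counts, under the same assumption that only rows with status 'a' should be counted:
df.loc[df['status'] == 'a', 'code'].value_counts()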
