I need to drop some rows from a DataFrame in Python, based on multiple values.
Code Names Country
1 a France
2 b France
3 c USA
4 d Canada
5 e TOTO
6 f TITI
7 g Corona
I need to have this
Code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
I tried this:
df.drop(df[('f', 'b', 'c') in df['Names']].index)
But it doesn't work: KeyError: False.
It only works for a single key, like this: df.drop(df['f' in df['Names']].index)
Do you have any idea?
To remove rows matching certain values:
indexNames = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(indexNames, inplace=True)
print(df)
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
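For reference, the original attempt fails because the in operator applied to a Series checks membership in the index, not the values, and yields a single Python bool, so the frame ends up being indexed with a scalar. A quick sketch against the sample frame:
# "in" tests the Series index and returns one plain bool:
('f', 'b', 'c') in df['Names']   # -> False
# df[False] then looks up a column literally named False -> KeyError: False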
Based on your example, I think this may be what you are looking for.
new_df = df.loc[~df.Names.isin(['f','b','c'])].copy()
new_df
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
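The .copy() decouples new_df from df, which avoids a SettingWithCopyWarning if you modify new_df later. For reference, the same filter can also be written with query (a sketch):
new_df = df.query("Names not in ['f', 'b', 'c']")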
In pandas, we can use the .drop() function to drop columns and rows.
For dropping specific rows, we pass their index labels with axis=0 (which is also the default).
So your required output can be achieved by the following line of code:
df4.drop([1,2,5], axis=0)
The output will be:
code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
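As a small extension, in case the row positions aren't known up front, the labels can be derived from the data instead of hard-coded (a sketch assuming the default RangeIndex shown above):
# Look up the row labels instead of hard-coding [1, 2, 5]
labels = df4[df4['Names'].isin(['f', 'b', 'c'])].index
df4.drop(labels)  # axis=0 is the default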
I have two dataframes, as below:
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the below:
a) Check whether the ID and Name from df1 are present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. I don't want to use the ~ or not in operator, because my df2 has millions of rows and it would produce irrelevant results.
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives matching ids and names between df1 and df2.
But I want to find the IDs and names that are present in df1 but missing from df2. I cannot use the ~ operator, because it would return all the rows from df2 that don't have a match in df1, and in the real world my df2 has millions of rows. I only want to flag the missing df1 IDs and names in a Status column.
I expect my output to be like as below
ID,Name,Sub,Country,Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
Your expected output corresponds to matching by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
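One caveat that is not part of the original answer: if df2 contains duplicate key combinations, the left merge returns more rows than df1 and the mask no longer lines up. Deduplicating df2's keys first guards against that (a sketch using the same column names):
# Hypothetical safeguard: collapse duplicate key rows in df2 before merging
df2_unique = df2.drop_duplicates(subset=['ID','Name','class'])
m = df1.merge(df2_unique,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')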
A solution that tests membership with isin (it avoids materializing the merge, which helps when df2 has millions of rows):
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Note that matching by only 2 columns gives a different output:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
I have two .tsv files that look like:
ID prop name size
A x rob 2
B y sally 3
C z debby 5
D w meg 6
and
ID lst_name area
A sanches 4
D smith 7
C roberts 8
I have them loaded into pandas DataFrames and would like to merge them so I get a new dataFrame:
ID-name prop name size lst_name area
A x rob 2 sanches 4
B y sally 3
C z debby 5 roberts 8
D w meg 6 smith 7
I have been trying to accomplish this with pd.merge() but am having issues with the following:
df = pd.read_csv("a.tsv", sep='\t')
df1 = pd.read_csv("b.tsv", sep='\t')
result = pd.merge(df, df1, how='inner', on=["ID", "ID-name"])
Is it possible to accomplish a merge like this with pandas?
What you need is a left join (or an outer join, depending on your case), since in this sample you also want to see the record for B even though it has no match in df1.
result = pd.merge(df, df1, how="left", on="ID")
  ID prop   name  size lst_name  area
0  A    x    rob     2  sanches   4.0
1  B    y  sally     3      NaN   NaN
2  C    z  debby     5  roberts   8.0
3  D    w    meg     6    smith   7.0
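If records that exist only in the second file should be kept as well, an outer join covers both directions (a sketch with the same frames; in this sample it yields the same rows, since every ID in b.tsv also appears in a.tsv):
# Hypothetical variant: keep unmatched rows from either side
result = pd.merge(df, df1, how="outer", on="ID")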
Here's one way to do it using join
df1 = pd.DataFrame({'ID':['A','B','C','D'],'prop':['x','y','z','w'],'name':['rob','sally','debby','meg'],'size':[2,3,5,6]})
df2 = pd.DataFrame({'ID':['A','D','C'],'lst_name':['sanches','smith','roberts'],'area':[4,7,8]})
df1.set_index('ID').join(df2.set_index('ID')).reset_index()
>>>
ID prop name size lst_name area
0 A x rob 2 sanches 4.0
1 B y sally 3 NaN NaN
2 C z debby 5 roberts 8.0
3 D w meg 6 smith 7.0
I've been struggling to sort every column of my df independently; however, my code seems to sort only by the first column ('Name'), carrying the rest of each row along with it, as shown here:
Index Name Age Education Country
0 W 2 BS C
1 V 1 PhD F
2 R 9 MA A
3 A 8 MA A
4 D 7 PhD B
5 C 4 BS C
df.sort_values(by=['Name', 'Age', 'Education', 'Country'],ascending=[True,True, True, True])
Here's what I'm hoping to get:
Index Name Age Education Country
0 A 1 BS A
1 C 2 BS A
2 D 4 MA B
3 R 7 MA C
4 V 8 PhD C
5 W 9 PhD F
Instead, I'm getting the following:
Index Name Age Education Country
3 A 8 MA A
5 C 4 BS C
4 D 7 PhD B
2 R 9 MA A
1 V 1 PhD F
0 W 2 BS C
Could you please shed some light on this issue? Many thanks in advance.
Cheers,
R.
Your code sorts the rows by name, breaking ties by age, then education, then country.
To get what you want, you can sort each column individually. For example:
for col in df.columns:
    df[col] = sorted(df[col])
But are you sure that's what you want to do? A DataFrame is designed so that each row corresponds to a single entry, e.g. a person, and the columns correspond to attributes like 'name' and 'age'. So you usually don't want to sort name and age separately, mismatching people's names and ages.
You can use np.sort along the 0th axis:
import numpy as np

df[:] = np.sort(df.values, axis=0)
df
Index Name Age Education Country
0 0 A 1 BS A
1 1 C 2 BS A
2 2 D 4 MA B
3 3 R 7 MA C
4 4 V 8 PhD C
5 5 W 9 PhD F
Of course, you should beware that sorting columns independently breaks the correspondence between your columns and can render your data meaningless.
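For reference, the same column-independent sort can stay within pandas (a sketch; the same caveat applies):
# Sort each column on its own; to_numpy() drops the per-column index
# so the sorted values are assigned back positionally
df_sorted = df.apply(lambda col: col.sort_values().to_numpy())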
I have a data set which contains state code and its status.
code status
1 AZ a
2 CA b
3 KS c
4 MO c
5 NY d
6 AZ d
7 MO a
8 MO b
9 MN b
10 NV a
11 NV e
12 MO f
13 NY a
14 NY a
15 NY b
I want to filter this data set down to the codes that have status 'a' and count how many each code has. Example output would be:
code status
1 AZ a
2 MO a
3 NY a
AZ = 1, MO = 1, NY = 2
I used df.groupby("code").loc[df.status == 'a'] but didn't have any luck.
Any help appreciated!
Let's filter the dataframe for 'a' first, then group by code and count.
df[df.status == 'a'].groupby('code').size()
Output:
code
AZ 1
MO 1
NV 1
NY 2
dtype: int64
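If you want the result as a regular DataFrame rather than a Series, the count can be promoted to a column (a sketch; 'count' is just an illustrative name):
out = df[df.status == 'a'].groupby('code').size().reset_index(name='count')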
I've recreated your dataset
data = [["AZ","CA","KS","MO","NY","AZ","MO","MO","MN","NV","NV","MO","NY","NY","NY"],
        ["a","b","c","c","d","d","a","b","b","a","e","f","a","a","b"]]
df = pd.DataFrame(data)
df = df.T
df.columns = ["code","status" ]
df[df["status"] == "a"].groupby(["code", "status"]).size()
gives
code status
AZ a 1
MO a 1
NV a 1
NY a 2
dtype: int64
I have a dataframe df:
PID AID Ethnicity
1 A Asian
1 B Asian
1 C Arab
1 D African
2 A Asian
2 D African
2 E Caucasian
2 F African
2 B Asian
I want to generate a frame that tells me for each PID how many AIDs it has, and how many Ethnic groups:
So for the above the resulting newdf would be:
PID numAID numEthnicities
1 4 3
2 5 3
I know how to find numAID:
newdf = (df[['PID', 'AID']].groupby('PID', as_index=False)
         .count().rename(columns={'AID': 'numAID'}))
I'm not sure how to add the third column to the dataframe.
This will work:
df.groupby('PID').agg({'AID':'count','Ethnicity':pd.Series.nunique}).add_prefix('num')
numAID numEthnicity
PID
1 4 3
2 5 3
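On pandas 0.25 or later, named aggregation gives the exact column names from the question in one step (a sketch):
newdf = df.groupby('PID', as_index=False).agg(
    numAID=('AID', 'count'),
    numEthnicities=('Ethnicity', 'nunique'))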
Since you have already computed newdf, you could use the join function:
df = df.set_index('PID')
newdf = newdf.set_index('PID')
result = df.join(newdf, lsuffix='df', rsuffix='newdf')
You can add the third column like this; note that count() would count duplicate ethnicities, so nunique is needed for the number of distinct groups:
newdf['numEthnicities'] = df.groupby('PID')['Ethnicity'].nunique().values