How to remove rows from a dataframe based on another - python

I have been trying my best to compare two dataframes in a specific manner, but without success. I hope the experts here can help with a solution.
Below is my problem description:
I have two dataframes.
Data frame #1 looks like this.
df1:
pid name age
121 John 36
132 Mary 26
132 Jim 46
145 Kim 50
Dataframe #2 looks like this:
df2:
pid name age
121 John 32
132 Tom 28
132 Susan 40
155 Kim 50
I want to compare both dataframes in such a way that rows in df2 whose pid does not appear in df1 are deleted.
My new dataframe #2 should look like this:
df2:
pid name age
121 John 32
132 Tom 28
132 Susan 40
Highly appreciate your help on this.

You could use isin as in
df2[df2.pid.isin(df1.pid)]
which will return only the rows of df2 whose pid is in df1.
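For reference, here is a self-contained sketch of that answer using the data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({"pid": [121, 132, 132, 145],
                    "name": ["John", "Mary", "Jim", "Kim"],
                    "age": [36, 26, 46, 50]})
df2 = pd.DataFrame({"pid": [121, 132, 132, 155],
                    "name": ["John", "Tom", "Susan", "Kim"],
                    "age": [32, 28, 40, 50]})

# Keep only the rows of df2 whose pid also appears in df1 (drops pid 155)
filtered = df2[df2.pid.isin(df1.pid)]
print(filtered)
```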

Related

How to prioritize specific item data when dropping from a data frame in Python

Hi, I have a question about dataframes in Python.
There is a dataframe as below, and I want to remove some duplicate data.
1. If all the conditions are the same, remove the row above (Jack's case).
2. If all conditions except the name and quarter are the same, remove David's row.
The first is possible, but I don't know how to do the second.
Thank you.
drop_df = df.drop_duplicates(subset=['Name'], keep='last')
(input data)
Name   quarter  math  physics
Jack   1Q       90    100
Jack   2Q       90    100
Kevin  1Q       45    20
David  1Q       15    60
Adam   1Q       15    60
David  2Q       40    75
Adam   2Q       40    75
(wanted data)
Name   quarter  math  physics
Jack   2Q       90    100
Kevin  1Q       45    20
Adam   1Q       15    60
Adam   2Q       40    75
You mentioned a pair of drop criteria:
1. certain conditions C hold
2. the same conditions C hold and the quarter matches
So (2) is more specific than (1): the rows matched by (2) are a subset of those matched by (1).
Begin by dropping rows using (2), then drop the relevant surviving rows using (1).
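A minimal sketch of that two-step approach with the data from the question (note that step 1 relies on Adam's row appearing after David's, so keep='last' keeps Adam):

```python
import pandas as pd

df = pd.DataFrame({
    "Name":    ["Jack", "Jack", "Kevin", "David", "Adam", "David", "Adam"],
    "quarter": ["1Q", "2Q", "1Q", "1Q", "1Q", "2Q", "2Q"],
    "math":    [90, 90, 45, 15, 15, 40, 40],
    "physics": [100, 100, 20, 60, 60, 75, 75],
})

# Step 1: same quarter, math and physics -> keep the later row (drops David)
step1 = df.drop_duplicates(subset=["quarter", "math", "physics"], keep="last")

# Step 2: same name, math and physics -> keep the later quarter (drops Jack 1Q)
result = step1.drop_duplicates(subset=["Name", "math", "physics"], keep="last")
print(result)
```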

Creating multiple dataframes using loop and filtering in pyspark

I have a df which looks like this:
CustomerID  CustomerName  StoreName
101         Mike          ABC
102         Sarah         ABC
103         Alice         ABC
104         Michael       PQR
105         Abhi          PQR
106         Bill          XYZ
107         Roody         XYZ
Now I want to separate out the 3 stores into 3 separate dfs.
For this I created a list of store names:
store_list = df.select("StoreName").distinct().rdd.flatMap(lambda x:x).collect()
Now I want to iterate through this list and filter the different stores into different dfs:
for i in store_list:
df_{i} = df.where(col("storeName") == i)
The code has syntax errors obviously, but that's the approach I am thinking of. I want to avoid pandas, as the datasets are huge.
Can anyone help me with this?
Thanks

Is it possible to do full text search in pandas dataframe

Currently, I'm using pandas DataFrame.filter to filter the records of the dataset. If I give one word, I get all the records matching that word. But if I give two words that are present in the dataset, yet not in the same record, I get an empty set. Is there any way, in pandas or another Python module, to search for multiple words that are not in one record?
With a Python list comprehension, we can build a full-text search by mapping, while pandas DataFrame.filter uses indexing. Is there any difference between mapping and indexing? If yes, what is it, and which gives better performance?
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
1 Male 19 15 39
2 Male 21 15 81
3 Female 20 16 6
4 Female 23 16 77
5 Female 31 17 40
pokemon[pokemon['CustomerID'].isin(['200','5'])]
Output:
CustomerID Genre Age AnnualIncome (k$) SpendingScore (1-100)
5 Female 31 17 40
200 Male 30 137 83
Name Qty.
0 Apple 3
1 Orange 4
2 Cake 5
Considering the above dataframe, if you want to find quantities of Apples and Oranges, you can do it like this:
result = df[df['Name'].isin(['Apple','Orange'])]
print (result)
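For the multi-word case the question actually asks about (matching rows that contain any of several words, not all in one record), a sketch using str.contains with a regex alternation:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Apple", "Orange", "Cake"], "Qty.": [3, 4, 5]})

words = ["Apple", "Cake"]  # words that need not occur in the same record
pattern = "|".join(words)  # regex alternation: match any one of the words

result = df[df["Name"].str.contains(pattern, case=False, regex=True)]
print(result)
```

Unlike isin, which requires an exact cell match, str.contains also matches substrings, so it behaves more like a simple full-text search over one column.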

Appending two dataframes with multiindex

I have two dataframes, each with a multiindex. The multiindex levels share names, but are in a different order. When I append or concat, I would expect pandas to line up the indices just like it aligns index-less columns before appending. Is there a function or an argument I can pass to append or concat to get this to work in the way I desire (and that I think ought to be standard)?
import pandas as pd
df1 = pd.DataFrame(data = {'Name':['Bob','Ann','Sally'], 'Acct':['Savings','Savings','Checking'], 'Value':[101,102,103]})
df1 = df1.set_index(['Name','Acct'])
print(df1)
df2 = pd.DataFrame(data = {'Acct':['Savings','Savings','Checking'], 'Name':['Bob','Ann','Sally'], 'Value':[201,202,203]})
df2 = df2.set_index(['Acct','Name'])
print(df2)
print(df1.append(df2))
print(pd.concat([df1,df2]))
                Value
Name  Acct
Bob   Savings     101
Ann   Savings     102
Sally Checking    103
                Value
Acct     Name
Savings  Bob      201
         Ann      202
Checking Sally    203
                Value
Name     Acct
Bob      Savings   101
Ann      Savings   102
Sally    Checking  103
Savings  Bob       201
         Ann       202
Checking Sally     203
                Value
Name     Acct
Bob      Savings   101
Ann      Savings   102
Sally    Checking  103
Savings  Bob       201
         Ann       202
Checking Sally     203
As you can see, after appending or concatenating, my combined index appears to show that, for example, "Sally" is an account, not a name. I'm aware that if I put the index levels in the same order when setting index, I'll get what I want, and that I could reset the index on the frames to align them, but I'm hoping there's a more intuitive way to get the indices to align on name, not on position.
Somewhat of a workaround: you can reset_index on both data sets, concat them, then set_index:
print(pd.concat([
df1.reset_index(),
df2.reset_index()
], sort=False).set_index([
'Name',
'Acct'
]))
                Value
Name  Acct
Bob   Savings     101
Ann   Savings     102
Sally Checking    103
Bob   Savings     201
Ann   Savings     202
Sally Checking    203
Though I'm not sure why you would want to have multiple rows with the same index...
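Another option, sketched below, is to reorder df2's index levels to match df1 by name before concatenating, using DataFrame.reorder_levels; this avoids rebuilding the index entirely:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Bob", "Ann", "Sally"],
                    "Acct": ["Savings", "Savings", "Checking"],
                    "Value": [101, 102, 103]}).set_index(["Name", "Acct"])
df2 = pd.DataFrame({"Acct": ["Savings", "Savings", "Checking"],
                    "Name": ["Bob", "Ann", "Sally"],
                    "Value": [201, 202, 203]}).set_index(["Acct", "Name"])

# Put df2's levels in df1's order by level NAME, then concatenate
combined = pd.concat([df1, df2.reorder_levels(df1.index.names)])
print(combined)
```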

comparing datasets for matching elements in a column

I recently attended an introduction to Python/pandas and data sets, and I'm now trying to put some of what I learned to use. I have trawled through various answers and tried various solutions with no luck.
Basically, I wish to compare the DF1 name column with the DF2 name column and add the scores together when I get a match. Example below:
DF1
name score
fred 20
harry 30
joe 24
jim 14
DF2
name score
harry 25
joe 52
fred 61
jim 23
DF3
name score
fred 81
harry 55
jim 37
joe 76
You could use set_index for both dataframes, then add them and reset_index:
df3 = (df1.set_index('name') + df2.set_index('name')).reset_index()
In [77]: df3
Out[77]:
name score
0 fred 81
1 harry 55
2 jim 37
3 joe 76
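If a name can be missing from one of the frames, plain addition would produce NaN for it; a sketch using Series.add with fill_value, assuming a missing name should count as a score of 0:

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["fred", "harry", "joe", "jim"],
                    "score": [20, 30, 24, 14]})
df2 = pd.DataFrame({"name": ["harry", "joe", "fred", "jim"],
                    "score": [25, 52, 61, 23]})

# Align on name; a name absent from one frame contributes 0 instead of NaN
s1 = df1.set_index("name")["score"]
s2 = df2.set_index("name")["score"]
df3 = s1.add(s2, fill_value=0).reset_index()
print(df3)
```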
