Pandas isin() returning all false - python

I'm using pandas 1.1.3, the latest available with Anaconda.
I have two DataFrames, imported from a .txt and a .xlsx file. They have a column called "ID" which is an int64 (verified with df.info()) on both DataFrames.
df1:
ID Name
0 1234564567 Last, First
1 1234564569 Last, First
...
df2:
ID Amount
0 1234564567 59.99
1 5678995545 19.99
I want to check whether all of the IDs in df1 are also in df2. For this I create a series:
foo = df1["ID"].isin(df2["ID"])
And I get that all values are False, even though I checked manually and the values do match.
0 False
1 False
2 False
3 False
4 False
...
I'm not sure if I'm missing something, if there is something wrong with the environment, or if it is a known bug.

You must be doing something wrong. Try to reproduce the error with a toy example, as I did here. The below works for me.
Reproducing and sharing a minimal example not only helps you pin down the error yourself but also allows us to help.
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'nick'], 'ID':[1234564567, 1234564569]}
data2 = {'Name':['Tom', 'nick'], 'ID':[1234564567, 5678995545]}
# Create DataFrame
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
Name: ID, dtype: bool
EDIT: with Paul's data I don't get any error. See the importance of providing examples?
import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df["ID"].isin(df2["ID"])
0 True
1 False
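If the real data still comes back all False, the usual culprit (just a guess, since the original files aren't shown) is that the two "ID" columns don't actually end up with the same dtype, or carry stray whitespace after the .txt/.xlsx import. A minimal cleanup sketch, assuming that is the cause:
print(df1["ID"].dtype, df2["ID"].dtype)  # confirm both columns really are int64
# normalize both columns the same way before comparing
df1["ID"] = pd.to_numeric(df1["ID"].astype(str).str.strip(), errors="coerce")
df2["ID"] = pd.to_numeric(df2["ID"].astype(str).str.strip(), errors="coerce")
foo = df1["ID"].isin(df2["ID"])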

import pandas as pd
data = {'ID':['1234564567', '1234564569'],'Name':['Last, First', 'Last, First']}
data2 = {'ID':['1234564567', '5678995545'],'Amount': [59.99, 19.99]}
df1 = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
Now that we have that set up, we get to the meat:
df1["ID"].apply(lambda x: df2['ID'].isin([x]))
Which shows
0 1
0 True False
1 False False
which shows that ID 0 in df1 matches ID 0 of df2, while ID 1 of df1 matches nothing in df2.
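Note that if the goal is just a yes/no answer to "are all of the IDs in df1 also in df2", the boolean Series from the question can be collapsed directly:
df1["ID"].isin(df2["ID"]).all()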

iterate in all Dataframe rows and perform startswith()

In my df, the user value starts with 'ff', and it could be in the access column or any other column rather than the user column.
I want to create a new column in this df called "UserID": whenever a value in any of the columns starts with 'ff', copy that value into the new "UserID" column.
I have been using this method, which works fine, but I have to repeat this line for every column:
hist.loc[hist.User.str.startswith("ff",na=False),'UserId']=hist['User'].str[2:]
Is there another method I can use to cover all of the rows and columns at once?
Thanks
If you are cool with picking only the first occurrence:
df['UserID'] = df.apply(lambda x: x[x.str.startswith('ff')].iloc[0] if x.str.startswith('ff').any() else None, axis=1)
NumPy + Pandas solution below.
In case of ambiguity (several ff-strings in a row) the leftmost occurrence is taken. In case of absence (no ff-string in a row) a NaN value is used.
import pandas as pd, numpy as np
df = pd.DataFrame({
    'user': ['fftest', 'fwadmin', 'fshelpdesk3', 'no', 'ffone'],
    'access': ['fwadmin', 'ffuser2', 'fwadmin', 'user', 'fftwo'],
    'station': ['fshelpdesk', 'fshelpdesk2', 'ffuser3', 'here', 'three'],
})
sv = df.values.astype(str)
# row/column indices of every cell starting with 'ff', reversed so that
# when a row has several matches the leftmost one ends up being assigned
ix = np.argwhere(np.char.startswith(sv, 'ff'))[::-1].T
df.loc[ix[0], 'UserID'] = pd.Series(sv[(ix[0], ix[1])]).str[2:].values
print(df)
Output:
user access station UserID
0 fftest fwadmin fshelpdesk test
1 fwadmin ffuser2 fshelpdesk2 user2
2 fshelpdesk3 fwadmin ffuser3 user3
3 no user here NaN
4 ffone fftwo three one
Hey here is my attempt at solving the problem, hope it helps.
d = df[df.apply(lambda x: x.str.startswith('ff'))]
df['user_id'] = d['user'].fillna(d['access'].fillna(d['station']))
Result
user access station user_id
0 fftest fwadmin fshelpdesk fftest
1 fwadmin ffuser2 fshelpdesk2 ffuser2
2 fshelpdesk3 fwadmin ffuser3 ffuser3
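A related sketch (my own combination of the masking idea above with the prefix-stripping from the question, so treat it as untested against the real data): build the mask once, pull the leftmost matching cell per row with a backward fill, then strip the 'ff' prefix.
mask = df.apply(lambda col: col.str.startswith('ff', na=False))
# keep only cells starting with 'ff', move the leftmost one into the first
# column via a backward fill along each row, then drop the 'ff' prefix
df['UserID'] = df.where(mask).bfill(axis=1).iloc[:, 0].str[2:]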

Return a matching value from 2 dataframes (1 dataframe with single value in cell, 1 with a list in a cell) into 1 dataframe

I have 2 dataframes:
df1
ID Type
2456-AA Coolant
2457-AA Elec
df2
ID Task
[2456-AA, 5656-BB] Check AC
[2456-AA, 2457-AA] Check Equip.
I'm trying to return the matched IDs' 'Type' from df1 to df2, with the result looking something like this:
df2
ID Task Type
[2456-AA, 5656-BB] Check AC [Coolant]
[2456-AA, 2457-AA] Check Equip. [Coolant , Elec]
I tried the following for loop. I understand it isn't the fastest, but I'm struggling to work out a faster alternative:
def type_identifier(type):
    df = df1.copy()
    device_type = []
    for value in df1.ID:
        for x in type:
            if x == value:
                device_type.append(df1.Type.tolist())
            else:
                None
    return device_type
df2['test'] = df2['ID'].apply(lambda x: type_identifier(x))
Could somebody help me out, and also refer me to material that could help me better approach problems like these?
Thank you,
Use pandas' to_dict to convert df1 to a dictionary, so we can efficiently translate ID to Type.
Then apply a lambda that, for each list of IDs in df2, converts it to the right types, and assign the result to the test column as you wished.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID':['2456-AA', '2457-AA'],
'Type':['Coolant', 'Elec']})
df2 = pd.DataFrame({'ID':[['2456-AA', '5656-BB'], ['2456-AA', '2457-AA']],
'Task':['Check AC', 'Check Equip.']})
# Use to dict to convert df1 ids to types
id_to_type = df1.set_index('ID').to_dict()['Type']
# {'2456-AA': 'Coolant', '2457-AA': 'Elec'}
print(id_to_type)
# Apply a lambda that, for each list of IDs in `df2`, converts it to the right types
df2['test'] = df2['ID'].apply(lambda x: [id_to_type[t] for t in x if t in id_to_type])
print(df2)
Output:
ID Task test
0 [2456-AA, 5656-BB] Check AC [Coolant]
1 [2456-AA, 2457-AA] Check Equip. [Coolant, Elec]
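If the ID lists get long, an alternative sketch (assuming pandas >= 0.25 for explode, and reusing the id_to_type mapping from above) is to explode the lists, map them, and regroup:
exploded = df2.explode('ID')
exploded['Type'] = exploded['ID'].map(id_to_type)
# group back by the original row index and rebuild the lists, dropping unknown IDs
df2['Type'] = exploded.groupby(level=0)['Type'].agg(lambda s: [t for t in s if pd.notna(t)])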

Comparing multiple columns of same CSV file and returning result to another CSV file using Python

I have a CSV file with 7 columns
This is my csv file
Application,Expected Value,ADER,UGOM,PRD
APP,CVD2,CVD2,CVD2,CVD2
APP1,"VCF7,hg6","VCF7,hg6","VCF8,hg6","VCF7,hg6"
APP1,"VDF9,pova8","VDF9,pova8","VDF10,pova10","VDF9,pova11"
APP2,gf8,gf8,gf8,gf8
APP3,pf8,pf8,gf8,pf8
APP4,"vd3,mn7","vd3,mn7","vd3,mn7","vd3,mn7"
So here I want to compare the Expected Value column with the columns after it (that is, ADER, UGOM, PRD).
Here is my code in Python:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('file1.csv', index_col='Expected Value')
df2 = pd.read_csv('file1.csv', index_col='ADER')
df3 = pd.DataFrame(columns=['status'], index=df1.index)
df3['status'] = (df1['Expected Value'] == df2['ADER']).replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
This is not creating any output.csv file, nor does it generate any output. Can anyone help?
So I edited the code, based on a comment by @Vlado:
import pandas as pd
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('first.csv')
df3 = pd.DataFrame(columns=['Application','Expected Value','ADER','status of AdER'], index=df1.index)
df3['Application'] = df1['Application']
df3['Expected Value'] = df1['Expected Value']
df3['ADER'] = df1['ADER']
df3['status'] = (df1['Expected Value'] == df1['ADER'])
df3['status'].replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')
Now it works for one column, ADER, but my headers after Expected Value are dynamic and may change: sometimes there is one column after Expected Value, sometimes N columns, and the header names may also change. Can someone help with how to do that?
The piece of code given below generates the desired output. It compares the Expected Value column with the rest of the columns after it.
import pandas as pd
df = pd.read_csv("input.csv")
expected_value_index = df.columns.get_loc("Expected Value")
for col_index in range(expected_value_index+1, len(df.columns)):
    column = df.columns[expected_value_index]+" & "+ df.columns[col_index]
    df[column] = df.loc[:,"Expected Value"] == df.iloc[:,col_index]
    df[column].replace([True, False], ["Matching", "No Matching"], inplace=True)
df.to_csv("output.csv", index=None)
I haven't tried replicating your code as of yet but here are a few suggestions:
You do not need to read the df two times.
df1 = pd.read_csv('FinalResult1.csv')
is sufficient.
Then, you can proceed with
df1['status'] = (df1['exp'] == df1['ader'])
df1['status'] = df1['status'].replace([True, False], ['Matching', 'Not Matching'])
Alternatively, you could do this row by row by using the pandas apply method.
If that doesn't work a reasonable first step would be to print your dataframe out to see what is happening.
Try this code:
import numpy as np

k = list(df1.columns).index('Expected Value') + 1
# get the integer index for the column after 'Expected Value'
df3 = df1.iloc[:, :k]
# copy the first k columns
df3 = pd.concat([df3, (df1.iloc[:, k:] == np.repeat(
    df1['Expected Value'].to_frame().values, df1.shape[1] - k, axis=1))], axis=1)
# slice df1 with iloc, which works just like slicing lists in Python
# np.repeat repeats 'Expected Value' across as many columns as needed (df1.shape[1]-k=3)
# .to_frame: slicing a column from a df returns a 1D Series, so we turn it back into a 2D df
# .values returns the underlying numpy array without any index/column names
# ...without .values pandas would try to find 3 columns named 'Expected Value' in df1
# concatenate the previous df3 with this calculation
print(df3)
Output
Application FileName ConfigVariable Expected Value ADER UGOM PRD
0 APP1 FileName1 ConfigVariable1 CVD2 True True True
1 APP1 FileName2 ConfigVariable2 VCF7,hg6 True False True
2 APP1 FileName3 ConfigVariable3 VDF9,pova8 True False False
3 APP2 FileName4 ConfigVariable4 gf8 True True True
4 APP3 FileName5 ConfigVariable5 pf8 True False True
5 APP4 FileName6 ConfigVariable vd3,mn7 True True True
Of course you can do a loop if for some reason you need a special calculation on some column
for colname in df1.columns[k:]:
    df3[colname] = df1[colname] == df1['Expected Value']
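For comparison, a shorter sketch of the same broadcast comparison, using DataFrame.eq with axis=0 instead of np.repeat (reusing df1 and k from above):
# compare every column after 'Expected Value' against that column in one call
comparison = df1.iloc[:, k:].eq(df1['Expected Value'], axis=0)
df3 = pd.concat([df1.iloc[:, :k], comparison.replace({True: 'Matching', False: 'Not Matching'})], axis=1)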

How to drop rows by condition on string value in pandas dataframe?

Consider a Pandas Dataframe like:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com', 'http://www.url2.com','http://www.url3.com','http://www.url1.com']))
>>> df
Giving:
url
0 http://url1.com
1 http://www.url1.com
2 http://www.url2.com
3 http://www.url3.com
4 http://www.url1.com
I want to remove all rows containing url1.com and url2.com, to obtain a dataframe like:
url
0 http://www.url3.com
I do this:
domainToCheck = ('url1.com', 'url2.com')
goodUrl = df['url'].apply(lambda x : any(domain in x for domain in domainToCheck))
But this gives me no result.
Any idea how to solve the above problem?
Edit: Solution
import pandas as pd
import tldextract
df = pd.DataFrame(dict(url=['http://url1.com', 'http://www.url1.com','http://www.url2.com','http://www.url3.com','http://www.url1.com']))
domainToCheck = ['url1', 'url2']
s = df.url.map(lambda x : tldextract.extract(x).domain).isin(domainToCheck)
df = df[~s].reset_index(drop=True)
If we are checking domains, we should match the domain exactly rather than use a string-contains check, since a subdomain may contain the same keyword as the domain.
import tldextract
s=df.url.map(lambda x : tldextract.extract(x).domain).isin(['url1','url2'])
Out[594]:
0 True
1 True
2 True
3 False
4 True
Name: url, dtype: bool
df=df[~s]
Use Series.str.contains to create a boolean mask m, and then filter the dataframe df using this mask:
m = df['url'].str.contains('|'.join(domainToCheck))
df = df[~m].reset_index(drop=True)
Result:
url
0 http://www.url3.com
you can use pd.Series.str.contains here.
df[~df.url.str.contains('|'.join(domainToCheck))]
url
3 http://www.url3.com
If you want to reset the index, use this:
df[~df.url.str.contains('|'.join(domainToCheck))].reset_index(drop=True)
url
0 http://www.url3.com
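One caveat with the str.contains approach (a small hardening sketch, not part of the original answers): the domain strings contain '.', which is a regex metacharacter, so escaping them avoids accidental matches:
import re
pattern = '|'.join(map(re.escape, domainToCheck))
df = df[~df['url'].str.contains(pattern)].reset_index(drop=True)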

How to generate random 20 digit uid(Unique Id) in python

How do I generate a random 20-digit UID (unique ID) in Python? I want to generate a UID for each row in my data frame. It should be exactly 20 digits and should be unique.
I am using uuid4(), but it generates a 32-character UID; would it be okay to slice it with [:21]? I don't want the ID to repeat in the future.
Any suggestions would be appreciated!
I'm definitely no expert in Python or pandas, but I puzzled the following together; you might find something useful.
First I tried NumPy, but I hit its upper limit (np.random.randint is capped at the int64 maximum, so the IDs are at most 19 digits before zero-padding):
import pandas as pd
import numpy as np
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
df.ID = np.random.randint(0, 9223372036854775807, len(df.index), np.int64)
df.ID = df.ID.map('{:020d}'.format)
print(df)
Results:
Name ID
0 Tom 03486834039218164118
1 Jack 04374010880686283851
2 Steve 05353371839474377629
3 Ricky 01988404799025990141
So then I tried a custom function and applied that:
import pandas as pd
import random
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'], 'ID':[0,0,0,0]}
df = pd.DataFrame(data)
def UniqueID():
    UID = '{:020d}'.format(random.randint(0, 99999999999999999999))
    # re-roll until the value is not already present in the ID column
    while UID in df.ID.unique():
        UID = '{:020d}'.format(random.randint(0, 99999999999999999999))
    return UID

df.ID = df.apply(lambda row: UniqueID(), axis = 1)
print(df)
Returns:
Name ID
0 Tom 46160813285603309146
1 Jack 88701982214887715400
2 Steve 50846419997696757412
3 Ricky 00786618836449823720
I think uuid4() in Python works; just slice it accordingly.
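If predictability or collision risk from slicing uuid4() is a concern, a sketch using the standard library's secrets module (the example data here is my own, not from the question):
import secrets
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky']})
seen = set()

def unique_20_digit_id():
    # secrets.randbelow gives cryptographically strong randomness;
    # re-roll on the (extremely unlikely) chance of a collision
    while True:
        uid = '{:020d}'.format(secrets.randbelow(10**20))
        if uid not in seen:
            seen.add(uid)
            return uid

df['ID'] = [unique_20_digit_id() for _ in range(len(df))]
print(df)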
