pandas contains exact string from a list [duplicate] - python

This question already has answers here:
Filter pandas DataFrame by substring criteria
(17 answers)
Closed 2 years ago.
I have 2 dataframes, df1 and df2.
I would like to get all rows in df1 that contain an exact string match for a value in columnB of df2.
This is df1:
df1={"columnA":['apple,cherry','pineapple,lemon','banana, pear','cherry, pear, lemon']}
df1=pd.DataFrame(df1)
This is df2:
df2={"columnB":['apple','cherry']}
df2=pd.DataFrame(df2)
The code below outputs an incorrect result:
df1[df1['columnA'].str.contains('|'.join(df2['columnB'].values))]
Pineapple is not supposed to appear, as it is not an exact match.
How can I get a result like this:

Without actual reproducible code it's harder to help you, but I think this should work:
words = [rf'\b{string}\b' for string in df2.columnB]
df1[df1['columnA'].str.contains('|'.join(words))]
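For reference, here is a self-contained sketch of the same word-boundary idea on the question's sample data. The `re.escape` call is an addition that is not in the answer above; it only matters if the search words may contain regex metacharacters, and the name `pattern` is purely illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame(
    {"columnA": ["apple,cherry", "pineapple,lemon", "banana, pear", "cherry, pear, lemon"]}
)
df2 = pd.DataFrame({"columnB": ["apple", "cherry"]})

# Build one alternation wrapped in word boundaries: \b(?:apple|cherry)\b
# re.escape (an assumption, not part of the original answer) protects against
# characters such as "." or "+" in the search words.
pattern = r"\b(?:" + "|".join(re.escape(word) for word in df2["columnB"]) + r")\b"

# "pineapple" has no word boundary before "apple", so that row is excluded.
result = df1[df1["columnA"].str.contains(pattern)]
print(result)
```

The non-capturing group `(?:...)` is used so that `str.contains` does not warn about match groups.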

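df1={"columnA":['apple,cherry','pineapple,lemon','banana, pear','cherry, pear, lemon']}
df1=pd.DataFrame(df1)
df2={"columnB":['apple','cherry']}
df2=pd.DataFrame(df2)
A longer way of doing it, but correct and simpler:
# Collect the index of each df1 row where a comma-separated item equals a df2 word
list1=[]
for i in range(0,len(df1)):
    # Split into items and strip spaces so 'apple' does not match 'pineapple'
    items=[s.strip() for s in df1["columnA"][i].split(",")]
    for j in range(0,len(df2)):
        if df2["columnB"][j] in items:
            list1.append(i)
            break
df=df1.loc[list1]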
Answer
ColumnA
0 apple,cherry
3 cherry, pear, lemon

You were very close, but you need to apply the regex word-boundary anchor (\b), written in raw strings so that Python does not read \b as a backspace character:
df1[df1['columnA'].str.contains(r"\b(?:" + '|'.join(df2['columnB'].values) + r")\b")]
This will match complete words only.

Related

Removing special characters from column headers [duplicate]

This question already has answers here:
How to flatten a hierarchical index in columns
(19 answers)
Closed 1 year ago.
I used to_flat_index() to flatten columns and ended up with column names like ('Method', 'sum'). I am trying to remove the special characters from these, but when I try to remove them, all the column names change to nan.
function attempted:
df_pred.columns = df_pred.columns.str.replace("[(,),']", '')
Expected outcome: MethodSum
It seems your columns are multi-indexed, because you used to_flat_index.
>>> df
        bar                 baz                 foo                 qux
        one       two       one       two       one       two       one       two
0  0.713825  0.015553  0.036683  0.388443  0.729509  0.699883  0.125998  0.407517
1  0.820843  0.259039  0.217209  0.021479  0.845530  0.112166  0.219814  0.527205
2  0.734660  0.931206  0.651559  0.337565  0.422514  0.873403  0.979258  0.269594
3  0.314323  0.857317  0.222574  0.811631  0.313495  0.315072  0.354784  0.394564
4  0.672068  0.658103  0.402914  0.430545  0.879331  0.015605  0.086048  0.918678
Try:
>>> df.columns.to_flat_index().map(''.join)
Index(['barone', 'bartwo', 'bazone', 'baztwo',
'fooone', 'footwo', 'quxone', 'quxtwo'],
dtype='object')
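To get the capitalised form the question asked for (MethodSum), one option is to join each tuple yourself. This is only a sketch: the frame below is made up, and the ('Method', 'mean') column is a hypothetical second column added for illustration.

```python
import pandas as pd

# Hypothetical frame with two-level columns, as produced by an aggregation
df_pred = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([("Method", "sum"), ("Method", "mean")]),
)

# After to_flat_index() each column label is a tuple, not a string,
# so the labels are joined element by element instead of using .str.replace
flat = df_pred.columns.to_flat_index()
df_pred.columns = [first + second.capitalize() for first, second in flat]

print(list(df_pred.columns))
```

The labels are tuples rather than strings, which is presumably why the `.str.replace` attempt produced nan instead of cleaned names.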

How to filter pandas dataframe based on length of a list in a column? [duplicate]

This question already has answers here:
How to filter a pandas dataframe based on the length of a entry
(2 answers)
Closed 1 year ago.
I have a pandas DataFrame like this:
id subjects
1 [math, history]
2 [English, Dutch, Physics]
3 [Music]
How to filter this dataframe based on the length of the column subjects?
So for example, if I only want to have rows where len(subjects) >= 2?
I tried using
df[len(df["subjects"]) >= 2]
But this gives
KeyError: True
Also, using loc does not help, that gives me the same error.
Thanks in advance!
Use the string accessor to work with lists:
df[df['subjects'].str.len() >= 2]
Output:
id subjects
0 1 [math, history]
1 2 [English, Dutch, Physics]
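A runnable sketch of the above on the question's sample data. The `map(len)` variant is an equivalent alternative added here for comparison; it is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "subjects": [["math", "history"], ["English", "Dutch", "Physics"], ["Music"]],
    }
)

# .str.len() returns the length of each element, which also works for lists
by_str_len = df[df["subjects"].str.len() >= 2]

# Equivalent alternative: apply the built-in len to every element
by_map_len = df[df["subjects"].map(len) >= 2]

print(by_str_len)
```

The original attempt failed because `len(df["subjects"])` is the number of rows in the column (a single integer), so the comparison yields one boolean rather than a per-row mask.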

How to split a column in Pandas when the first number appears [duplicate]

This question already has an answer here:
Python pandas splitting text and numbers in dataframe
(1 answer)
Closed 2 years ago.
I have a dataframe that looks like:
Name
John5346
Alex7789
Jackie1123
John Smith4567
A.J Doe349
I am hoping to get:
Name No
John 5346
Alex 7789
Jackie 1123
John Smith 4567
A.J Doe 349
Have tried something like:
df["No"]= df["Name"].str.split(r'[0-9]')
but no such luck. Any ideas? Thanks very much.
EDIT
Updated to include names that have a space or full stop in them.
Try:
df[["Name", "sep", "No"]] = df["Name"].str.split(r"(\d)", n=1, expand=True)
df["No"] = df["sep"] + df["No"]
df.drop("sep", inplace=True, axis=1)
The essence here is:
- to split while keeping the separator, put the separator in a capturing group (parentheses): (\d)
- to ensure there is at most one split, pass n=1
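As an alternative sketch (not the answer's method), `str.extract` with two named groups can produce both columns in one step. It assumes every value is a run of non-digit characters followed by digits to the end of the string, which holds for the sample data shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John5346", "Alex7789", "Jackie1123", "John Smith4567", "A.J Doe349"]}
)

# \D+ matches the leading non-digit part (letters, spaces, full stops),
# \d+ matches the trailing number; named groups become column names
parts = df["Name"].str.extract(r"^(?P<Name>\D+)(?P<No>\d+)$")

df["Name"] = parts["Name"]
df["No"] = parts["No"]

print(df)
```

Rows that do not fit the assumed pattern would come back as NaN in both columns, so it is worth checking for missing values afterwards.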

extract semicolon separated value from pandas df column [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 2 years ago.
I need to extract a specific value from pandas df column. The data looks like this:
row my_column
1 artid=delish.recipe.45064;artid=delish_recipe_45064;avb=83.3;role=4;data=list;prf=i
2 ab=px_d_1200;ab=2;ab=t_d_o_1000;artid=delish.recipe.23;artid=delish;role=1;pdf=true
3 dat=_o_1000;artid=delish.recipe.23;ar;role=56;passing=true;points001
The data is not consistent, but it is separated by semicolons, and I need to extract role=x.
I split the data on the semicolons and can loop through the values to fetch the roles, but I was wondering if there is a more elegant way to solve it.
Desired output:
row my_column
1 role=4
2 role=1
3 role=56
Thank you.
You can use str.extract and pass the required pattern within parentheses.
df['my_column'] = df['my_column'].str.extract(r'(role=\d+)', expand=False)
row my_column
0 1 role=4
1 2 role=1
2 3 role=56
This should work:
def get_role(x):
    # Split on semicolons and keep the first field that starts with 'role'
    l=x.split(sep=';')
    t=[i for i in l if i[:4]=='role'][0]
    return t
df['my_column']=df['my_column'].map(get_role)

Checking one value of a dataframe in another [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have 2 DataFrames as shown below:
df1:
OSIED geometry
257005 POLYGON ((311852.712 178933.993, 312106.023 17...
017049 POLYGON ((272943.107 137755.159, 272647.627 13...
017032 POLYGON ((276637.425 146141.397, 276601.509 14.
df2:
small_area Median_BER
2570059001 212.9
017049002 212.9
217112003 212.9
I need to search for df1.col1 in df2.col2 using "contains" logic and if it matches, get all the columns from both dataframes:
osied geometry small_area ber
257005 POLYGON ((311852.71 2570059001 212.9
I am new to Python. Which function does this? The isin function isn't useful here.
Updated:
Try this:
if any(df1.col1.isin(df2.col1)):
    result = pd.concat([df1, df2], axis=1)
I think what you are probably looking for is some kind of merge. You can do:
df2.merge(df1, left_on='col2', right_on='col1', how='inner')
or change the 'how' argument based on what you're looking for.
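A plain merge matches keys exactly, so the "contains" logic needs a derived key first. The sketch below assumes, based only on the sample rows, that the first six characters of small_area are the OSIED code and that both columns are stored as strings. The geometry values are placeholders and `df2_keyed` is an illustrative name.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "OSIED": ["257005", "017049", "017032"],
        "geometry": ["POLYGON A", "POLYGON B", "POLYGON C"],  # placeholder strings
    }
)
df2 = pd.DataFrame(
    {
        "small_area": ["2570059001", "017049002", "217112003"],
        "Median_BER": [212.9, 212.9, 212.9],
    }
)

# Assumption: small_area starts with the 6-character OSIED code
df2_keyed = df2.assign(OSIED=df2["small_area"].str[:6])

# An inner merge keeps only rows whose derived key exists in both frames
merged = df1.merge(df2_keyed, on="OSIED", how="inner")

print(merged)
```

If the code can appear anywhere inside small_area rather than only at the start, a prefix key is not enough and a different matching strategy would be needed.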
