pandas str.contains match exact substring not working with regex boudry - python

I have two dataframes, and trying to find out a way to match the exact substring from one dataframe to another dataframe.
First DataFrame:
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl', 'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'],
'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
print(dataframe)
Second DataFrame
test_data = {'code name': ['PB', 'PB', 'PB'],
'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Approach
for k, l, m in zip(test_dataframe.iloc[:, 0], test_dataframe.iloc[:, 1], test_dataframe.iloc[:, 2]):
dataframe['Site'] = np.select([dataframe['Place Name'].str.contains(r'\b{}~{}\b'.format(k, m), regex=False)], [l],
default=dataframe['Site'])
The current output is as below, though I am expecting to match the exact substring, which is not working with the code above.
Current Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic-mechanics
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev
Expected Output:
Place Name Site
TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev

Data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Map the test_datframe code and Actual into dictionary as key and value respectively
keys=test_dataframe['code'].values.tolist()
dicto=dict(zip(test_dataframe.code, test_dataframe.Actual))
dicto
Join the keys separated by | to enable search of either phrases
k = '|'.join(r"{}".format(x) for x in dicto.keys())
k
Extract string from datframe meeting any of the phrases in k and map them to to the dictionary
dataframe['Site'] = dataframe['Place Name'].str.extract('('+ k + ')', expand=False).map(dicto)
dataframe
Output

Not the most elegant solution, but this does the trick.
Set up data
import pandas as pd
import numpy as np
random_data = {'Place Name':['TS~HOT_MD~h_PB~progra_VV~gogl',
'FM~uiosv_PB~emo_SZ~1x1_TG~bhv'], 'Site':['DV360', 'Adikteev']}
dataframe = pd.DataFrame(random_data)
test_data = {'code name': ['PB', 'PB', 'PB'], 'Actual':['programmatic me', 'emoteev', 'programmatic-mechanics'],
'code':['progra', 'emo', 'prog']}
test_dataframe = pd.DataFrame(test_data)
Solution
Create a column in test_dataframe with the substring to match:
test_dataframe['match_str'] = test_dataframe['code name'] + '~' + test_dataframe.code
print(test_dataframe)
code name Actual code match_str
0 PB programmatic me progra PB~progra
1 PB emoteev emo PB~emo
2 PB programmatic-mechanics prog PB~prog
Define a function to apply to test_dataframe:
def match_string(row, dataframe):
ind = row.name
try:
if row[-1] in dataframe.loc[ind, 'Place Name']:
return row[1]
else:
return dataframe.loc[ind, 'Site']
except KeyError:
# More rows in test_dataframe than there are in dataframe
pass
# Apply match_string and assign back to dataframe
dataframe['Site'] = test_dataframe.apply(match_string, args=(dataframe,), axis=1)
Output:
Place Name Site
0 TS~HOT_MD~h_PB~progra_VV~gogl programmatic me
1 FM~uiosv_PB~emo_SZ~1x1_TG~bhv emoteev

Related

Count occurrence of column values in other dataframe column

I have two dataframes and I want to count the occurrence of "classifier" in "fullname". My problem is that my script counts a word like "carrepair" only for one classifier and I would like to have a count for both classifiers. I would also like to add one random coordinate that matches the classifier.
First dataframe:
Second dataframe:
Result so far:
Desired Result:
My script so far:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
pat = '({})'.format('|'.join(clas['classifier'].unique()))
fl['fullname'] = fl['fullname'].str.extract(pat, expand = False)
clas['count_of_classifier'] = clas['classifier'].map(fl['fullname'].value_counts())
print(clas)
Thanks!
You could try this:
import pandas as pd
fl = pd.read_excel (r'fullname.xlsx')
clas= pd.read_excel (r'classifier.xlsx')
fl.fullname= fl.fullname.str.lower()
clas.classifier = clas.classifier.str.lower()
# Add a new column to 'fl' containing either 'repair' or 'car'
for value in clas["classifier"].values:
fl.loc[fl["fullname"].str.contains(value, case=False), value] = value
# Count values and create a new dataframe
new_clas = pd.DataFrame(
{
"classifier": [col for col in clas["classifier"].values],
"count": [fl[col].count() for col in clas["classifier"].values],
}
)
# Merge 'fl' and 'new_clas'
new_clas = pd.merge(
left=new_clas, right=fl, how="left", left_on="classifier", right_on="fullname"
).reset_index(drop=True)
# Keep only expected columns
new_clas = new_clas.reindex(columns=["classifier", "count", "coordinate"])
print(new_clas)
# Outputs
classifier count coordinate
repair 3 52.520008, 13.404954
car 3 54.520008, 15.404954

How to insert output of a python for loop into an excel column?

I'm trying to use a for loop to populate data in a destination column in an excel spreadsheet. The destination column gets made but then the information from the for loop doesn't print into the excel file.
import pandas as pd
aasb_scores = pd.read_excel ('/Users/nnamdiokoli/Library/Containers/com.microsoft.Excel/Data/Desktop/AASB Scoring PivotTable Example.xlsx',
index=False)
aasb_scores['Average'] = (aasb_scores['Q1'] +
aasb_scores['Q2']+ aasb_scores['Q3'] +
aasb_scores['Q4'] + aasb_scores['Q5'])/5.00
aasb_scores.head(10)
def finalround():
for i in aasb_scores['Average']:
if i >= 3:
print('Final Round')
else:
print('cut')
aasb_scores['Moving on?'] = finalround()
aasb_scores.to_excel('/Users/nnamdiokoli/Library/Containers/com.microsoft.Excel/Data/Desktop/AASB Scoring PivotTable Example.xlsx',
index=False)
print() is used only to display on screen, not to put in variable or any other place.
You should use return "Final Round" and return "cut".
But with pandas you should rather use its functions instead of for-loop - ie. apply().
import pandas as pd
import random
df = pd.DataFrame({'Average': [random.randint(0,5) for _ in range(10)]})
def finalround(value):
if value >= 3:
return 'Final Round'
else:
return 'cut'
df['Moving on?'] = df['Average'].apply(finalround)
print(df)
or shorter with lambda
import pandas as pd
import random
df = pd.DataFrame({'Average': [random.randint(0,5) for _ in range(10)]})
df['Moving on?'] = df['Average'].apply(lambda x: 'Final Round' if x>=3 else 'cut')
print(df)
Eventually you can create column 'Moving on?' with default value 'cut' and later filter rows in which you want to set 'Final Round'
import pandas as pd
import random
df = pd.DataFrame({'Average': [random.randint(0,5) for _ in range(10)]})
df['Moving on?'] = 'cut'
df['Moving on?'][ df['Average'] >= 3 ] = 'Final Round'
print(df)

check data based on other data csv using pandas

i have two data csv
The first :
word,centroid
she,1
great,0
good,3
mother,2
father,2
After,4
before,4
.....
The second:
sentences,label
good mother,1
great father,1
I want to check each sentence based on the cluster results
so if the sentences is good mother good on the centroid 3 then array will be [0,0,0,1,0] and word mother on the centroid 2 then array will be [0,0,1,1,0]...
I have complicated and wrong code ... can anyone help me
this is my code:
import pandas as pd
import re
array=[]
data = pd.read_csv('data/data_komentar.csv',encoding = "ISO-8859-1")
df = pd.read_csv('data/hasil_cluster.csv',encoding = "ISO-8859-1")
for index,row in data.iterrows():
kalimat=row[0]
words=re.sub(r'([^\s\w]|_)', '', str(kalimat))
words= re.sub(r'[0-9]+', '', words)
for word in words.split():
kata=word.lower()
df = df[df.eq(kata)]
if df.empty:
print("empty")
else:
print(kata)
if df['centroid;'] is 0:
array=array+[1,0,0,0,0]
if df['centroid'] is 1:
array=array+[0,1,0,0,0]
if df['centroid'] is 2:
array=array+[0,0,1,0,0]
if df['centroid;'] is 3:
array=array+[0,0,0,1,0]
if df['centroid;'] is 4:
array=array+[0,0,0,0,1]
print(array)
You can use apply() on the sentences column of the DataFrame:
import numpy as np
MAX_CENTROIDS = 5
def get_centroids(row):
centroids = np.zeros(MAX_CENTROIDS, dtype=int)
for word in row.split(' '):
if word in df1['word'].values:
centroids[df1[df1['word']==word]['centroid'].values]+=1
return centroids
df2['centroid'] = df2['sentences'].apply(get_centroids)
Result df2:
df1 is the DataFrame with your words and centroids, df2 with the sententes. You have to specify the maximal number of centroids in MAX_CENTROIDS (=length of the centroid list).
Edit
To read the datasample you provided:
# Maybe remove encoding on your system
df1 = pd.read_csv('hasil_cluster.csv', sep=',', encoding='iso-8859-1')
# Drop Values without a centroid:
df1.dropna(inplace=True)
# Remove ; from every centroid value and convert the column to integers
df1['centroid'] = df1['centroid;'].apply(lambda x:str(x).replace(';', '')).astype(int)
# Remove unused colum
df1.drop('centroid;', inplace=True, axis=1)

How to dynamically match rows from two pandas dataframes

I have a large dataframe of urls and a smaller 2nd dataframe that contains columns of strings which I want to use to merge the two dataframes together. Data from the 2nd df will be used to populate the larger 1st df.
The matching strings can contain * wildcards (and more then one) but the order of the grouping still matters; so "path/*path2" would match with "exsample.com/eg_path/extrapath2.html but not exsample.com/eg_path2/path/test.html. How can I use the strings in the 2nd dataframe to merge the two dataframes together. There can be more then one matching string in the 2nd dataframe.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
what_I_am_after = pd.DataFrame(result)
Not very robust but gives the correct answer for my example.
import pandas as pd
urls = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7]}
metadata = {'group':['group1','group2'],
'matching_string_1':['google','wikipedia*Python_'],
'matching_string_2':['stackoverflow*questions*56318782','']}
result = {'url':['https://stackoverflow.com/questions/56318782/','https://www.google.com/','https://en.wikipedia.org/wiki/Python_(programming_language)','https://stackoverflow.com/questions/'],
'hits':[1000,500,300,7],
'group':['group2','group1','group1','']}
df1 = pd.DataFrame(urls)
df2 = pd.DataFrame(metadata)
results = pd.DataFrame(columns=['url','hits','group'])
for index,row in df2.iterrows():
for x in row[1:]:
group = x.split('*')
rx = "".join([str(x)+".*" if len(x) > 0 else '' for x in group])
if rx == "":
continue
filter = df1['url'].str.contains(rx,na=False, regex=True)
if filter.any():
temp = df1[filter]
temp['group'] = row[0]
results = results.append(temp)
d3 = df1.merge(results,how='outer',on=['url','hits'])

String matching between 2 dataframe

Learning Python here, and any help on this is much appreciated.
My problem scenario is, there are 2 dataframes A and B contains a column(Name and Flag) list of Names.
ExDF = pd.DataFrame({'Name' : ['Smith','John, Alex','Peter Lin','Carl Marx','Abhraham Moray','Calvin Klein'], 'Flag':['False','False','False','False','False','False']})
SnDF = pd.DataFrame({'Name' : ['Adam K ','John Smith','Peter Lin','Carl Josh','Abhraham Moray','Tim Klein'], 'Flag':['False','False','False','False','False','False']})
The initial value of Flag is False.
Point 1: I need to flip the names in both dataframe ie. Adam Smith to Smith Adam and save the flip names in another new column in the both dataframes.
- This part is done.
Point 2: Then both the Original name and flip names of A dataframe should get check in B dataframe original names and flip names. If it found the the flag column in both the dataframe should get update by True.
I wrote the code but it checks one on one row to both dataframe like A[0] to B[0], A[1] to B[1], but i need to check A[0] record to all the records of B dataframe.
Pls help me on this!!
The code which tried is below:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
ExDF_swap = ExDF["Swap"] = ExDF["Name"].apply(lambda x: " ".join(reversed(x.split())))
SnDF_swap = SnDF["Swap"] = SnDF["Name"].apply(lambda x: " ".join(reversed(x.split())))
ExDF_swap = pd.DataFrame(ExDF_swap)
SnDF_swap = pd.DataFrame(SnDF_swap)
vect = CountVectorizer()
X = vect.fit_transform(ExDF_swap.Name)
Y = vect.transform(SnDF_swap.Name)
res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
pd.DataFrame(Y.toarray(), columns=vect.get_feature_names())
ExDF["Flag"] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
SnDF["Flag"] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))
You could try isin() - of pandas:
import pandas as pd
ExDF = pd.DataFrame({'Name' : ['Smith','John, Alex','Peter Lin','Carl Marx','Abhraham Moray','Calvin Klein'], 'Flag':['False','False','False','False','False','False']})
SnDF = pd.DataFrame({'Name' : ['Adam K ','John Smith','Peter Lin','Carl Josh','Abhraham Moray','Tim Klein'], 'Flag':['False','False','False','False','False','False']})
print(ExDF)
print(SnDF)
ExDF["Swap"] = ExDF["Name"].apply(lambda x: " ".join(reversed(x.split())))
SnDF["Swap"] = SnDF["Name"].apply(lambda x: " ".join(reversed(x.split())))
print(ExDF)
print(SnDF)
ExDF['Flag'] = ExDF.Name.isin(SnDF.Name)
SnDF['Flag'] = SnDF.Name.isin(ExDF.Name)
print(ExDF)
print(SnDF)

Categories