How to search for multiple substrings using text.find - python

I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.
I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.
The function I wrote doesn't work: it produces a column of all zeros. It works fine if I just search for one substring at a time. Here it is:
def requires_bachelor(text):
    if text.find('bachelor|ba/bs|bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})
Thanks so much to anyone who is willing to help!

Here's my approach. You were pretty close, but you need to check for each of the items individually: if any of the available "bachelor tags" is found, return True. Note that apply already gives you a boolean column, so instead of map({True: 1, False: 0}) you can use astype(int), which is a bit nicer. Good luck!
import pandas as pd

df_jobs = pd.DataFrame({"name": ["bob", "sally"], "description": ["bachelor", "ms"]})

def requires_bachelor(text):
    # str.find returns -1 when the substring is absent
    return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs', 'bs/ba'])

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).astype(int)

The | in the search string does not work like the or operator; str.find treats the whole thing as one literal string. You should split it into three calls like this:
if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
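A quick demonstration of the difference (the sample text here is made up): str.find searches for the pipe characters literally, so the combined pattern is never found.

```python
text = 'requires a bachelor degree'

# The whole pipe-separated string is searched as literal text
print(text.find('bachelor|ba/bs|bs/ba'))  # -1 (not found)
print(text.find('bachelor'))              # 11
```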

You could try doing:
bachelors = ["bachelor", "ba/bs", "bs/ba"]
if any(bachelor in text for bachelor in bachelors):
    return True

Instead of writing a custom function that requires .apply (which will be quite slow), you can use str.contains for this. Also, you don't need map to turn booleans into 1 and 0; try using astype(int) instead.
df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor',
                                        'job bs/ba', 'job ba']})
df_jobs['bachelor'] = df_jobs.description.str.contains(
    'bachelor|ba/bs|bs/ba', regex=True).astype(int)
print(df_jobs)
description bachelor
0 job ba/bs 1
1 job bachelor 1
2 job bs/ba 1
3 job ba 0
# note that the pattern does not look for match on simply "ba"!
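Building on the str.contains approach above: if you also want to avoid matches inside longer words such as "bachelorette", a word-boundary variant of the same pattern could look like this (sample data made up here):

```python
import pandas as pd

df = pd.DataFrame({'description': ['job bachelor', 'job bachelorette', 'job ba/bs']})
# (?:...) keeps the group non-capturing so str.contains does not warn
df['bachelor'] = df['description'].str.contains(
    r'\b(?:bachelor|ba/bs|bs/ba)\b', regex=True).astype(int)
print(df['bachelor'].tolist())  # [1, 0, 1]
```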

So, you are checking for the single string bachelor|ba/bs|bs/ba in the text, which is unlikely to ever occur...
What I suggest is checking for each possible value in the if, joined with or, as follows:
def requires_bachelor(text):
    if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})

It can all be done in one line in pandas:
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|bs|ba')
Note that this pattern also matches 'bs' or 'ba' inside longer words, which is looser than the three exact substrings in the question.


PySpark / Python Slicing and Indexing Issue

Can someone let me know how to pull out certain values from a Python output?
I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}'
This should be relatively easy; however, I'm having trouble defining the slicing/indexing configuration.
The following successfully gives me 'ocweeklyreports':
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I will get any value after 'ocweeklycur.'.
I'm not sure what output you're dealing with or how robust you need this to be, but if it's just a string you can do something like the following (a quick and dirty solution).
text = '{"hiveView":"ocweeklycur.ocweeklyreports"}'
indexStart = text.index('.') + 1  # index just past the '.', where collection starts
finalResponse = text[indexStart:-2]  # [-2] drops the trailing '"}'
print(finalResponse)  # prints ocweeklyreports
Again, not the most elegant solution, but hopefully it helps or at least offers a starting point. Another, more robust solution would be to use regex, but I'm not that skilled with regex at the moment.
You could do almost all of it using regex.
See if this helps:
import re

def search_word(di):
    st = di["config"]["hiveView"]
    # The dot is escaped so it matches a literal '.'
    p = re.compile(r'^ocweeklycur\.(?P<word>\w+)')
    m = p.search(st)
    return m.group('word')

if __name__ == "__main__":
    d = {'config': {"hiveView": "ocweeklycur.ocweeklyreports"}}
    print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"
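If the 'config' value is actually a JSON string (the quoting in the question suggests it may be), parsing it first avoids counting characters by hand; a sketch assuming the structure shown above:

```python
import json

config = '{"hiveView": "ocweeklycur.ocweeklyreports"}'
hive_view = json.loads(config)['hiveView']
# Take everything after the first '.', wherever it falls
value = hive_view.split('.', 1)[1]
print(value)  # ocweeklyreports
```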

Using fuzzywuzzy

The dataset I have is manually filled addresses.
The city I want to look for is 'İstanbul'. It has a Turkish character, and I'm running into some encoding issues as well. For example, lower()'ing the İ in İstanbul won't give me a character I can pick up with a regular 'i' in a regex pattern.
Therefore, among other reasons, I changed my approach to fuzzy string searching. I want to give reference strings to my fuzzy lookup algorithm: '/ist' and 'İstanbul'. These are the reference values to be looked up in my address column.
Example of rows with phrases I want to catch:
...İSYTANBUL...
...isanbul...
...Istanbul...
...İ/STANBUL...
...,STANBUL/ÜSKÜDAR...
isatanbul
iatanbul
İSTRANBUL
isytanbul
/isanbul
These are full addresses, so I found partial_ratio to work better than ratio.
My goal is to use fuzzywuzzy.partial_ratio at the row level with the string 'istanbul' or '/ist', and to use the score partial_ratio returns to get a True or False for that row's preferred column (referenced as 'istanbul mu' in the code).
The code I've developed is below, but it stalls at about 25k rows every time I run it, and it's abysmally slow. Do you think there's a more efficient way to accomplish the task?
def fuzzy(string, df, columnname):
    fullrange = len(df[columnname])
    for i in range(fullrange):
        if fuzz.partial_ratio(string, df[columnname][i]) > 70:
            df.loc[df.index == i, 'istanbul mu'] = True
        else:
            df.loc[df.index == i, 'istanbul mu'] = False
As a faster alternative to your own answer, you can replace FuzzyWuzzy with RapidFuzz, which has a faster implementation of fuzz.partial_ratio:
from rapidfuzz import fuzz

def applyfuzzy(row):
    # score_cutoff lets RapidFuzz exit early; it returns 0 for scores below 70
    return fuzz.partial_ratio('the string', row['address_column'], score_cutoff=70) >= 70

df['column'] = df.apply(applyfuzzy, axis=1)
This approach is doing a lot better. Using .process may yield better results, but for reference:
def applyfuzzy(row):
    if fuzz.partial_ratio('the string', row['column holding the address to be queried']) > 65:
        return True
    else:
        return False

df['preferredcolumn'] = df.apply(applyfuzzy, axis=1)
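On the İ issue raised in the question: Python lowercases 'İ' to 'i' plus a combining dot, which is why a plain 'i' regex misses it. Normalizing and dropping combining marks before any matching, fuzzy or otherwise, sidesteps this; a minimal stdlib sketch (the helper name is made up):

```python
import unicodedata

def ascii_fold(s):
    # 'İ'.lower() yields 'i' + U+0307 (combining dot above); NFD keeps
    # marks decomposed so they can be filtered out
    s = unicodedata.normalize('NFD', s.lower())
    return ''.join(ch for ch in s if not unicodedata.combining(ch))

print(ascii_fold('İSTANBUL'))  # istanbul
print(ascii_fold('ÜSKÜDAR'))   # uskudar
```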

How to split strings into new dataframe rows depending on keywords

I want to split a row into a new row whenever an adverb is present. However, if multiple adverbs occur in a row, then I only want to split into a new row after the last adverb.
A sample of my dataframe looks like this:
0 but well that's alright
1 otherwise however we'll have to
2 okay sure
3 what?
With adverbs = ['but', 'well', 'otherwise', 'however'], I want the resulting df to look like this:
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?
I have a partial solution; maybe it can help.
You could use the TextBlob package.
Using this API, you can assign each word a tag. A list of possible tags is available here.
The issue is that the tagging isn't perfect, and your definition of adverb might not match theirs (for instance, but is a coordinating conjunction in the API, and well is, for some reason, tagged as a verb). But it still works for the most part:
The splitting could be done this way:
from textblob import TextBlob

def adv_split(s):
    annotations = TextBlob(s).tags
    # Extract adverbs (CC for coordinating conjunctions, RB for adverbs)
    adv_words = [word for word, tag in annotations
                 if tag.startswith('CC') or tag.startswith('RB')]
    # We have at least one adverb
    if len(adv_words) > 0:
        # Split just after the last one
        adv_pos = s.index(adv_words[-1]) + len(adv_words[-1])
        return [s[:adv_pos], s[adv_pos:]]
    else:
        return s
Then, you can use pandas' apply() and the explode() method (pandas >= 0.25) to split your dataframe:
import pandas as pd

data = pd.Series(["but well that's alright",
                  "otherwise however we'll have to",
                  "okay sure",
                  "what?"])
data.apply(adv_split).explode()
You get:
0 but
0 well that's alright
1 otherwise however
1 we'll have to
2 okay sure
3 what?
It's not exactly right since well's tag is wrong, but you have the idea.
import pandas as pd
adverbs = ['but', 'well', 'otherwise', 'however']
df = pd.DataFrame(["but well that's alright", "otherwise however we'll have to", "okay sure", "what?"])
df = df[0].str.split().explode().to_frame()
df[1] = df[0].str.contains('|'.join(adverbs))
df = df.groupby([df.index, 1], sort=False).agg(' '.join).reset_index(drop=True)
print(df)
0
0 but well
1 that's alright
2 otherwise however
3 we'll have to
4 okay sure
5 what?

Apply operation and a division operation in the same step using Python

I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS']
}

def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag, value in x.items():
            if tag in pos_family[flag]:
                cnt += value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" on the first run to get just the noun count, then add it back and run again, it works fine; but if I run it for the first time I get the error below.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas Series, but you need a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the Series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation separately.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
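A self-contained illustration of that fix; the PoS_Count dictionaries here are made up to stand in for the nltk output:

```python
import pandas as pd

pos_family = {'noun': ['NN', 'NNS', 'NNP', 'NNPS']}

def check_pos_tag(x, flag):
    # Sum the counts of tags belonging to the requested family
    return sum(v for tag, v in x.items() if tag in pos_family[flag])

df2 = pd.DataFrame({
    'PoS_Count': [{'NN': 2, 'VB': 1}, {'NNS': 1, 'JJ': 3}],
    'word_count': [4, 8],
})
# apply returns one int per row; dividing that Series by word_count
# is then a plain element-wise operation
df2['noun_prop'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
print(df2['noun_prop'].tolist())  # [0.5, 0.125]
```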

Converting Python statement into string

I have a situation where I have to generate an if condition based on items configured in my configuration JSON file.
Once the if statement is converted into a string, I want to run it with eval.
This is what I am trying for that.
Here is what flags and set_value look like.
My json
"EMAIL_CONDITION": {
    "TOXIC_THRESOLD": "50",
    "TOXIC_PLATFORM_TODAY": "0",
    "TOXIC_PRS_TODAY": "0",
    "Explanation": "select any or all: 1 for TOXIC_THRESOLD, 2 for TOXIC_PLATFORM, 3 for TOXIC_PRS",
    "CONDITION_TYPE": ["1", "2"]
},
"email_list": {
    "cc": ["abc#def.t"],
    "to": ["abc#def.net"]
}
The Python
CONDITION_TYPE is a variable whose value can be all, any, or a list such as 1,2 or 2,3 or 1,3.
1 stands for the toxic index, 2 for the platform, and 3 for toxic PRs.
The idea is that any number of parameters can be added going forward, so I wanted to make the if condition generic enough to take any number of conditions and avoid too many if/else branches. I have already handled all and any; they are straightforward. Since the number of conditions is variable, only the else branch is shown:
flags['1'] = toxic_index
flags['2'] = toxic_platform
flags['3'] = toxic_prs

set_value['1'] = toxic_index_condition
set_value['2'] = toxic_platform_condition
set_value['3'] = toxic_pr_condition

else:
    condition_string = 'if '
    for val, has_more in lookahead(conditions):
        if has_more:
            condition_string = str(condition_string + str(flags[val] >= set_value[val]) + str('and'))
        else:
            condition_string = str(condition_string + str(flags[val] >= set_value[val]) + str(':'))
    print str(condition_string)
I do understand that most of these are variables, so I am getting output like:
if False and False:
Instead of False, I want the real condition text (essentially what condition_string + str(flags[val] >= set_value[val]) was meant to produce), so that I can send mail based on it.
I am not able to do that, as I keep getting False and False.
Please suggest the best solution for this.
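For what it's worth, one way to get the literal condition text instead of the evaluated booleans is to format the values into the string rather than concatenating each comparison's result; a sketch with made-up flag values:

```python
flags = {'1': 60, '2': 1}
set_value = {'1': 50, '2': 0}
conditions = ['1', '2']

# Render each comparison as text; eval() can evaluate the joined result later
parts = ['%s >= %s' % (flags[val], set_value[val]) for val in conditions]
condition_string = 'if ' + ' and '.join(parts) + ':'
print(condition_string)  # if 60 >= 50 and 1 >= 0:
```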
