How to extract every string between two substrings in a paragraph? - python

After web scraping, I get the following:
[<p>xxx</p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, <p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>, <p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>]
From the list, the "xxxxx" entries are useless values. I can see the pattern that the results I want sit between two substrings: Substring1 = "<p>1" / "<p>2" / "<p>3"; Substring2 = "</p>, <p>aaa".
Assume this pattern repeats hundreds of times. How do I get the result in Python? Many thanks!
My target result is :
apple
orange
banana
I have tried split() and slicing with [sub1:sub2], but neither works.

From what I infer from your question (assuming the words you're looking for follow a pattern of the form <p>number.), a regex would do the job:
import re
print(re.findall(r'<p>\d+\.([^<]+)', html_string))
# ['apple', 'orange', 'banana']
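For instance, a minimal runnable sketch (html_string here is a hypothetical string joining the scraped <p> tags from the question):
import re

# Hypothetical input: the scraped <p> tags joined into one string.
html_string = ("<p>xxx</p><p>1.apple</p><p>aaa</p><p>xxxxx</p>"
               "<p>2.orange</p><p>aaa</p><p>3.banana</p><p>aaa</p>")

# \d+\. skips the numbering; ([^<]+) captures everything up to the closing tag.
print(re.findall(r'<p>\d+\.([^<]+)', html_string))
# ['apple', 'orange', 'banana']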

Related

Python matching various keywords from dictionary issues

I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery' '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
For example, keywords like "Drug Delivery" that contain a space are not recognized and therefore not categorized.
I was not able to make the pattern case-insensitive, so words like MEDICINE are not recognized. I tried to add (?i) to the pattern but it doesn't work.
The categorized keywords go into a pandas df, but they are printed inside []. I tried looping over the script again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)
websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get empty [] entries that I'd like to avoid.
Can you help me with these points?
Thanks!
Let me give you some hints on the problems that can readily be spotted:
Why don't keywords like "Drug Delivery" that contain a space match? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you also want to match a space. However, if you want to support other types of whitespace (e.g. \t, \n), you need to change this regex pattern further.
Why isn't the match case-insensitive? Your code fragment any(x in re.findall(pattern, text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest converting everything to the same case before checking, for example all lower case: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here we added .lower() to both text and x.
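For instance, a quick check of this case-folding fix against the question's own text (a runnable sketch):
import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
pattern = r'[a-zA-Z0-9]+'

# Case-sensitive: 'medicine' matches neither 'Medicine' nor 'MEDICINE', so nothing is found.
print([cat for cat in sector if any(x in re.findall(pattern, text) for x in sector[cat])])
# []

# Lower-casing both sides first recovers the single-word keywords.
print([cat for cat in sector if any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat])])
# ['med tech']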
With the above 2 changes, it should allow you to capture some categorized keywords.
Actually, for this particular case, you may not need regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
Can you try a different approach other than regex?
I would suggest difflib when you have two similar matching words.
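For instance, a tiny sketch with difflib.get_close_matches (the sample near-miss tokens and the 0.7 cutoff are assumptions):
import difflib

keywords = ['Drug Delivery', '3D printing', 'medicine']
# Hypothetical near-miss tokens that exact matching would reject.
for word in ['medicines', 'drug deliver', 'printing']:
    # Returns up to n matches whose similarity ratio is >= cutoff.
    print(word, '->', difflib.get_close_matches(word, keywords, n=1, cutoff=0.7))
# medicines -> ['medicine']
# drug deliver -> ['Drug Delivery']
# printing -> ['3D printing']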
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']

How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the respective words are stored in a set. (Actually, the set is very large, but I tried to simplify the code.) I managed to extract single fruit words with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
"pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
if word in fruits:
extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted (or rather, just "apple" is extracted from it, even though I don't want this occurrence of "apple" because it's part of the compound). I know this is because I used split() on prep_text. Nor is the hyphenated combination "blueberry-orange" extracted, which I also want to get. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!
The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple, pine apple, and also variants with multiple spaces, newlines, or tabs between pine and apple.
One disadvantage is, of course, that it matches things like orange-orange, and there's repeated text in the pattern. You can construct the regex programmatically to fix that, though, as sketched below.
It's just a start, but it may be good enough for your use case, and you can grow it to add more capabilities.
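For instance, a sketch of building that regex programmatically from the question's fruits set (using \s* between the words of a multi-word fruit so "pineapple" matches too; the helper name fruit_alt is my own):
import re

fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}

def fruit_alt(fruit):
    # Join the words of a multi-word fruit with \s* so "pine apple",
    # "pineapple" and "pine\napple" all match.
    return r"\s*".join(map(re.escape, fruit.split()))

# Longest alternatives first, so "pine apple" wins over plain "apple".
alts = "|".join(fruit_alt(f) for f in sorted(fruits, key=len, reverse=True))
pattern = rf"\b(?:{alts})(?:-(?:{alts}))?\b"

text = "an apple, a pine apple and a blueberry-orange cake"
print(re.findall(pattern, text, re.IGNORECASE))
# ['apple', 'pine apple', 'blueberry-orange']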

Is there a possibility in pySpark to search for a string of two separate words?

I'm looking for a way in Python Spark to search for a string made of two separate words, for example: "iPhone x" or "Samsung s10" ...
I want to supply a text file and a composite string such as "iPhone x", and then get the result.
All I can find on the internet is single-word counts.
IIUC:
In Spark 2.0, if you were going to read it from a file, for example a .csv file:
df = spark.read.format("csv").option("header", "true").load("pathtoyourcsvfile.csv")
then you can filter it using regex like this:
pattern = "\s+(word1|word2)\s+"
filtered = df.filter(df['<thedesiredcolumnhere>'].rlike(pattern))
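For the question's composite string, the filter could look like this (the column name "text" is a hypothetical placeholder):
# (?i) makes the match case-insensitive; \s+ tolerates variable spacing
# between the two words.
pattern = r"(?i)\biphone\s+x\b"
filtered = df.filter(df["text"].rlike(pattern))
filtered.show()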
You can try to write your own UDF combined with wordsegment to segment your words, and you can add new words to the dictionary to help the library segment new words such as "Iphone x".
For example:
>>> from wordsegment import load, clean, segment
>>> load()  # loads the word-frequency data that segment() relies on
>>> clean('She said, "Python rocks!"')
'shesaidpythonrocks'
>>> segment('She said, "Python rocks!"')
['she', 'said', 'python', 'rocks']
If you don't want to use a library, you can also see Word segmentation using dynamic programming, sketched below.
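A minimal sketch of that dynamic-programming idea, assuming a small hypothetical vocabulary (a real one would come from a word-frequency list):
# Hypothetical vocabulary; a real one would come from a frequency list.
vocab = {"iphone", "x", "samsung", "s10", "search", "a", "string"}

def segment_dp(text, vocab):
    """Split text into vocabulary words, or return None if impossible."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] is a segmentation of text[:i], or None
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i] in vocab:
                best[i] = best[j] + [text[j:i]]
                break
    return best[n]

print(segment_dp("iphonex", vocab))
# ['iphone', 'x']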
This is the answer:
import re

# give a file
rdd = sc.textFile("/root/PycharmProjects/Spark/file")
# give a composite string
string_ = "Iphone x"
# filter by lines containing the string
new_rdd = rdd.filter(lambda line: string_ in line)
# collect these lines
rt = str(new_rdd.collect())
# apply regex to find all occurrences of the string and count them
count = len(re.findall(string_, rt))

python to search separated words

I am trying to extract separated multi-word matches from a Python list, using two different lists as query strings. My sentences list is:
lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ','and I take the regions down here','The poorest are down']
lst_verb = ['take','go','wake']
lst_prep = ['down','up','in']
import re

output = []
item = 'down'
p = re.compile(r'(?:\w+\s+){1,20}' + item)
for i in lst:
    output.append(p.findall(i))
for item in output:
    print(item)
With this I am able to extract words from the list. However, I only want to extract separated multi-words, i.e. it should only extract from the sentence "and I take the regions down here".
Furthermore, I want to use the words from lst_verb and lst_prep as query strings.
for example
re.findall(r \lst_verb+'*.\b'+ \lst_prep)
Thank you for your answer.
You can use a regex like
(?is)^(?=.*\b(take)\b)(?=.*?\b(go)\b)(?=.*\b(where)\b)(?=.*\b(wake)\b).*
to match multiple words, as in your example.
Use functions to create the regex string from the verbs and prepositions, as sketched below.
Hope this helps.
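For instance, a sketch that builds the regex from lst_verb and lst_prep (the pattern shape, a verb followed later in the sentence by a preposition, is my assumption about the goal):
import re

lst = ['we have the terrible HIV epidemic that takes down the life expectancy of the African ',
       'and I take the regions down here',
       'The poorest are down']
lst_verb = ['take', 'go', 'wake']
lst_prep = ['down', 'up', 'in']

# One alternation per word list; \b keeps "take" from matching inside "takes".
verbs = "|".join(map(re.escape, lst_verb))
preps = "|".join(map(re.escape, lst_prep))
pattern = re.compile(rf"\b(?:{verbs})\b.*?\b(?:{preps})\b", re.IGNORECASE)

for sentence in lst:
    match = pattern.search(sentence)
    if match:
        print(match.group())
# take the regions down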

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search for some words (also multi-token) in a description (string).
To do that I'm using a regex like this:
import re

result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the first 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the word that I looking for. So after I matched it with my regex I need the 2 words (if exists) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and horrible, this are the words that I need.
ATTTENTION
The description cab be very long and the pattern "here is" can appear multiple times?
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
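For instance, a quick sketch of that pattern on the example sentence:
import re

description = "Parking here is horrible, this shop sucks."
pattern = r'((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})'
# Each match yields a (before, after) tuple of captured context.
print(re.findall(pattern, description, re.IGNORECASE))
# [('Parking ', 'horrible, this ')]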
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
This way you will always have 4 groups (whose captures might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
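A quick sketch of those four groups on the example sentence (stripping the captured whitespace):
import re

pattern = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pattern, "Parking here is horrible, this shop sucks.")
print([g.strip() for g in m.groups()])
# ['Parking', '', 'horrible,', 'this']  (group 2 empty: only one word before)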
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
line = "Parking here is horrible, here is great here is mediocre here is here is "
print line
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
while line:
before, match, line = line.partition(pattern)
if match:
if not output:
before = before.split()[-2:]
else:
before = ' '.join([pattern, before]).split()[-2:]
after = line.split()[:2]
output.append((before, after))
print output
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
