I am trying to use a regex to exclude disambiguation pages when scraping Wikipedia. I looked around for tips about using the negative lookahead, but I cannot seem to make it work. I think I am missing something fundamental about its use, but as of now I am totally clueless. Could someone please point me in the right direction? (I don't want to use if 'disambiguation' in y; I am trying to grasp the workings of the negative lookahead.) Thank you.
Here is the code:
import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
    regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
    for x in list_links:
        y = re.findall(regex1, x)
        print(y)

findString(list_links)
You can use one of these regexes, depending on your need. I have also made some changes to the function definition to follow PEP 8.
def remove_disambiguation_link(list_of_links):
    regex = r"(.*)\((disambiguation)\)"
    # regex = r"(/wiki/)(.*)\((disambiguation)\)"
    # return [link for link in list_of_links if not re.search(regex, link)]
    return list(filter(lambda link: not re.search(regex, link), list_of_links))

list_links = remove_disambiguation_link(list_links)
print(list_links)
[
"/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
"/wiki/Taiwanese_tea",
"/wiki/Tung-ting_tea",
"/wiki/Nantou_County",
"/wiki/Taiwan",
"/wiki/Dongfang_Meiren",
"/wiki/Alishan_National_Scenic_Area",
"/wiki/Chiayi_County",
"/wiki/Dayuling",
"/wiki/Baozhong_tea",
"/wiki/Pinglin_Township",
]
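Since the question is specifically about the negative lookahead, here is a minimal sketch of one that performs the same filtering (this pattern is my illustration, not part of the answer above):

import re

# After matching '/wiki/', the lookahead (?!.*disambiguation) asserts that
# 'disambiguation' does not occur anywhere in the rest of the link.
regex = r'^/wiki/(?!.*disambiguation)'

kept = [link for link in list_links if re.search(regex, link)]
print(kept)  # every link except '/wiki/Oolong_(disambiguation)'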
For your case, the simplest solution would be not to use a regex for that at all; just do something like:
import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
    regex1 = r'/wiki/(.+)'  # whatever pattern you actually need for the kept links
    for x in string:
        if 'disambiguation' in x:
            continue  # skip disambiguation pages
        y = re.findall(regex1, x)
        print(y)

findString(list_links)
You do not need to use a regex. You can iterate through list_links and check whether the string you are looking for, 'disambiguation', is in each item of list_links.
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
              '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
              '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
              '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
              '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

to_find = 'disambiguation'

def findString(list_links):
    # iterate over a copy: removing items from a list while iterating over it skips elements
    for link in list_links[:]:
        if to_find in link:
            # get index of match
            match_index = list_links.index(link)
            # remove match from list
            list_links.pop(match_index)
    # print new list without 'disambiguation' items
    print(list_links)

findString(list_links)
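As a side note, a list comprehension avoids the remove-while-iterating pitfall entirely; a one-line equivalent of the loop above:

list_links = [link for link in list_links if to_find not in link]
print(list_links)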
I'm trying to check for certain strings in a sentence using an if condition, but I'm not getting the expected output. I've also tried with a regex pattern, but that is not helping me either. Can anyone help me with this? The value of my string changes every time, so I'm not sure if this is the problem.
r="Carnival: monitor service-eu Beta cloudwatch_module"
if "Carnival: monitor service-eu Beta" in r:
test_string="EU"
test_string1="eu"
elif "Carnival: monitor service-na Beta" in r:
test_string="NA"
test_string1="na"
elif "Carnival: monitor service-fe Beta" in r:
test_string="FE"
test_string1="fe"
else:
print("None found")
With regex, I tried something like the following, but it is not working either.
re_pattern = r'\b(?:service-eu|Beta|monitor|Carnival)\b'
new_ = re.findall(re_pattern, r)
new1_ = new_[2]
If it isn't necessary to use this short_description function, I suggest you use the find function:
if r.find("Carnival: monitor service-eu Beta") != -1:
    test_string = "EU"
    test_string1 = "eu"
elif r.find("Carnival: monitor service-na Beta") != -1:
    test_string = "NA"
    test_string1 = "na"
elif r.find("Carnival: monitor service-fe Beta") != -1:
    test_string = "FE"
    test_string1 = "fe"
else:
    print("None found")
I'm not 100% sure I understand your problem, but hopefully this helps:
import re

def get_string(r):
    return re.findall(r"service-[a-z]{2}", r)[0][-2:]

get_string("Carnival: monitor service-na Beta")  # returns 'na'
get_string("Carnival: monitor service-fe Beta")  # returns 'fe'
Here, [a-z]{2} matches exactly two lowercase letters.
You can use a pattern with a capture group and .upper() for the group value.
\bCarnival: monitor service-([a-z]{2}) Beta\b
Example
import re

pattern = r"\bCarnival: monitor service-([a-z]{2}) Beta\b"
r = "Carnival: monitor service-eu Beta cloudwatch_module"

m = re.search(pattern, r)
if m:
    test_string1 = m.group(1)
    test_string = test_string1.upper()
    print(test_string)
    print(test_string1)
Output
EU
eu
I have a complex text where I am categorizing different keywords stored in a dictionary:
text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
This can successfully find my keywords and categorize them, with some limitations:
pattern = r'[a-zA-Z0-9]+'
[cat for cat in sector if any(x in re.findall(pattern,text) for x in sector[cat])]
The limitations that I cannot solve are:
1. Keywords like "Drug Delivery" that contain a space are not recognized, and therefore not categorized.
2. I was not able to make the pattern case-insensitive, so words like MEDICINE are not recognized. I tried adding (?i) to the pattern, but it doesn't work.
3. The categorized keywords go into a pandas df, but they are printed inside []. I tried looping the script again to take them out, but they are still there.
Data to pandas df:
ind_list = []
for site in url_list:
    ind = [cat for cat in indication if any(x in re.findall(pattern, soup_string) for x in indication[cat])]
    ind_list.append(ind)

websites['Indication'] = ind_list
Current output:
Website Sector Sub-sector Therapeutical Area Focus URL status
0 url3.com [med tech] [] [] [] []
1 www.url1.com [med tech, services] [] [oncology, gastroenterology] [] []
2 www.url2.com [med tech, services] [] [orthopedy] [] []
In the output I get [] that I'd like to avoid.
Can you help me with these points?
Thanks!
Here are some hints on the problems that can readily be spotted:
Why don't keywords like "Drug Delivery" that contain a space match? Because the regex pattern r'[a-zA-Z0-9]+' does not match a space. You can change it to r'[a-zA-Z0-9 ]+' (a space added after the 9) if you also want to match spaces. However, if you want to support other kinds of whitespace (e.g. \t, \n), you need to adjust the pattern further.
Why doesn't case-insensitive matching work? Your code fragment any(x in re.findall(pattern, text) for x in sector[cat]) requires x to have the same upper/lower case BOTH in the result of re.findall and in sector[cat]. This constraint cannot even be bypassed by setting flags=re.I in the re.findall() call. I suggest converting everything to the same case before checking, for example lowercasing both sides: any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat]). Here we added .lower() to both text and x.
With the above 2 changes, you should be able to capture some categorized keywords.
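A minimal sketch putting both changes together, using the text and sector dictionary from the question:

import re

text = 'data-ls-static="1">Making Bio Implants, Drug Delivery and 3D Printing in Medicine,MEDICINE</h3>'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}

pattern = r'[a-zA-Z0-9 ]+'  # space added to the character class
print([cat for cat in sector
       if any(x.lower() in re.findall(pattern, text.lower()) for x in sector[cat])])
# ['med tech']  -- matched via the 'medicine' token; note that multi-word keywords
# still only match when they form a whole token, which is why the substring
# check below is simpler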
Actually, for this particular case, you may not need to use regular expressions and re.findall at all. You may just check e.g. sector[cat][i].lower() in text.lower(). That is, change the list comprehension as follows:
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Edit
Test Run with 2-word phrase:
text = 'drug delivery'
sector = {"med tech": ['Drug Delivery', '3D printing', 'medicine', 'medical technology', 'bio cell']}
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Successfully got the categorizing keyword even with dictionary values of different upper/lower cases
['med tech']
text = 'Drug Store fast delivery'
[cat for cat in sector if any(x in text.lower() for x in [y.lower() for y in sector[cat]])]
Output: # Correctly doesn't match with extra words in between
[]
You can try a different approach other than regex: I would suggest difflib when you have two similar words to match.
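No code is given here, but a minimal sketch of what this could look like with difflib.get_close_matches (the cutoff threshold is an assumption you would tune):

import difflib

keywords = ['drug delivery', '3d printing', 'medicine', 'medical technology', 'bio cell']
# find the keyword closest to a possibly misspelled word;
# cutoff is the minimum similarity ratio accepted
print(difflib.get_close_matches('medecine', keywords, n=1, cutoff=0.6))
# ['medicine']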
findall is pretty wasteful here since you are repeatedly breaking up the string for each keyword.
If you want to test whether the keyword is in the string:
[cat for cat in sector if any(re.search(word, text, re.I) for word in sector[cat])]
# Output: ['med tech']
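One caveat worth adding (my note, not part of the answer): the keywords are used as regex patterns here, so if any of them could contain regex metacharacters, escaping them first is safer:

[cat for cat in sector
 if any(re.search(re.escape(word), text, re.I) for word in sector[cat])]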
Python - identify certain keywords in a user's input, to then lead to an answer. For example, the user inputs "There is no display on my phone"; the keywords 'display' and 'phone' would link to a set of solutions.
I just need help finding a general idea of how to identify the keywords and then lead to a set of solutions. I would appreciate any help.
Use the NLTK library and import its stopwords.
Write code so that if a word in your text is a stopword, you remove it; you will get the filtered output.
Also, make a negative-list file containing all the words, apart from the stopwords, that you want to remove, and extend the stopwords with these words before running the code above; this should give you much cleaner output.
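A minimal sketch of that idea (it assumes the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords')):

from nltk.corpus import stopwords

negative_list = []  # any extra words you want removed, beyond the stopwords
stop_words = set(stopwords.words('english')) | set(negative_list)

text = "There is no display on my phone"
keywords = [w for w in text.lower().split() if w not in stop_words]
print(keywords)  # ['display', 'phone']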
A simple way, if you don't want to use any external libraries, would be the following:
def bool_to_int(flags):
    # pack a list of 0/1 flags into a single integer code (a bitmask)
    num = 0
    for k, v in enumerate(flags):
        if v == 1:
            num += 2 ** k
    return num

def take_action(code):
    if code == 1:
        pass  # do this
    elif code == 2:
        pass  # do this
    ...

keywords = ['display', 'phone', ...]  # extend with your own keywords

data = "There is no display on my phone"  # the user's input
list_of_words = data.split(" ")

flags = [0] * len(keywords)
for word in list_of_words:
    if word in keywords:
        idx = keywords.index(word)
        flags[idx] = 1

code = bool_to_int(flags)
take_action(code)
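To make the encoding concrete (a worked example, assuming only the two keywords shown): with keywords = ['display', 'phone'], the input "There is no display on my phone" sets both flags, giving code = 2**0 + 2**1 = 3, so take_action(3) would be where you handle the display-plus-phone case.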
I am pulling information down from an RSS feed. Due to further analysis I need to do, I don't particularly want to use the likes of Beautiful Soup or feedparser; the explanation is kind of out of scope for this question.
The output is wrapping the text in [' and ']. For example:
Title:
['The Morning Download: Apple Stumbles but Mobile Soars']
Published:
['Tue, 28 Jan 2014 13:09:04 GMT']
Why is this output like this? How do I stop this?
import re

# 'opener' is assumed to be defined earlier (e.g. a urllib2 opener)
try:
    # This is the RSS feed that is being scraped
    page = 'http://finance.yahoo.com/rss/headline?s=aapl'
    yahooFeed = opener.open(page).read()

    items = re.findall(r'<item>(.*?)</item>', yahooFeed)
    for item in items:
        # Prints the title
        title = re.findall(r'<title>(.*?)</title>', item)
        print "Title:"
        print title

        # Prints the date / time published
        print "Published:"
        datetime = re.findall(r'<pubDate>(.*?)</pubDate>', item)
        print datetime
        print "\n"
except Exception, e:
    print str(e)
I am grateful for any criticism, advice, and best-practice information. I'm a Java / Perl programmer still getting used to Python, so any great resources you know of are greatly appreciated.
Use re.search instead of re.findall; re.findall always returns a list of all matches.
datetime = re.search(r'<pubDate>(.*?)</pubDate>', item).group(1)
Note that the difference between re.findall and re.search is that the former returns a list (Python's array data structure) of all matches, while re.search only returns the first match found.
In case of no match, re.search returns None, so to handle that as well:
match = re.search(r'<pubDate>(.*?)</pubDate>', item)
if match is not None:
    datetime = match.group(1)
I have a small problem: extracting the words which are in bold:
Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc
Let me clarify my question:
I'm trying to extract some information from web pages, and each time I find this kind of sentence, I'm only interested in the part that is in bold. Here are the addresses of the web pages:
(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
This is what I have tried:
re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')
Any ideas?
You can use a positive lookahead to check whether Rouge, Blanc or Rosé comes after the word we are looking for:
>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
... print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
...
Médoc
Margaux
Pessac-Léognan
It seems like it's always the second-to-last term in the comma-separated list? You can split and select the second-to-last; for example:
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]
Otherwise, if you want regex alone, I'd suggest this:
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)
And strip the surrounding spaces if necessary.
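For instance, a small sketch combining the two lines above with a strip for the surrounding space:

>>> import re
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> re.search(r'([^,]+),[^,]+$', myStr).group(1).strip()
'Pessac-Léognan'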