Can you help me to find the right Regular expression to extract (Margaux or Saint-Julien) in each time of this 2 pages:
in page 1: Margaux, Rouge
in page 2: 2ème Vin, Saint-Julien, Rouge
my code :
item ["appelation"] = res.select('.//div[#class="pro_col_right"]/div[#class="pro_blk_trans"]/div[#class="pro_blk_trans_titre"]/text()').re(r'\s*\w+\-\w+\-\w+|\w+\-\w+|\[^Rouge,Blanc]')
My regular expression couldn't find Margaux but it extracts Saint-Julien !!
Not sure why you need this but suppose s is your html file then this regex will find what you look for..
import re
m = re.search(r"\<div\ class=\"pro_blk_trans_titre\"\>(.*)\</div\>", s)
print(m.group(1).strip().encode("utf8"))
# page1: b'Margaux, Rouge'
# page2: b'2\xc3\xa8me Vin, Saint-Julien, Rouge'
Related
I am trying to use a regex to exclude disambiguation pages when scraping wikipedia. I looked around for tips about using the negative lookahead and
I cannot seem to make it work. I think I am missing something fundamental
about its use but as of now I am totally clueless. Could someone please
point me in the right direction? (I don't want to use
if 'disambiguation' in y
, I am trying to grasp
the workings of the negative lookahead.) Thank you.
Here is the code:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
def findString(string):
regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
for x in list_links:
y = re.findall(regex1, x)
print(y)
findString(list_links)```
You can use one of the regex, based on your need. Also, I have added some changes to the function definition to respect PEP.
def remove_disambiguation_link(list_of_links):
regex = "(.*)\((!?disambiguation)\)"
# regex = "(/wiki/)(.*)\((!?disambiguation)\)"
# return [links for links in list_of_links if not re.search(regex, links)]
return list(filter(lambda link: not re.search(regex, link), list_of_links))
list_links = remove_disambiguation_link(list_links)
print(list_links)
[
"/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
"/wiki/Taiwanese_tea",
"/wiki/Tung-ting_tea",
"/wiki/Nantou_County",
"/wiki/Taiwan",
"/wiki/Dongfang_Meiren",
"/wiki/Alishan_National_Scenic_Area",
"/wiki/Chiayi_County",
"/wiki/Dayuling",
"/wiki/Baozhong_tea",
"/wiki/Pinglin_Township",
]
For your case the simplest solution would just be not using regex for that...
just do something like:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
def findString(string):
regex1 = r'(/wiki/)(_\($)'
for x in string:
if 'disambiguation' in x:
continue # skip
y = re.findall(regex1, x)
print(y)
findString(list_links)
You do not need to use regex. You can iterate through list_links and check if the string you are looking for, 'disambiguation` is in each item in list_links.
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
to_find = 'disambiguation'
def findString(list_links):
for link in list_links:
if to_find in link:
# get indice of match
match_index = list_links.index(link)
# remove match from list
list_links.pop(match_index)
# print new list without 'disambiguation' items
print(list_links)
findString(list_links)
I have an XML document where I wish to extract certain text contained in specific tags such as-
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, if I want to extract all the text contained in tags by using the following Regular Expression code in Python 3-
# Python 3 code using RE-
file = open("some_xml_file.xml", "r")
xml_doc = file.read()
file.close()
title_text = re.findall(r'<title>.+</title>', xml_doc)
if title_text:
print("\nMatches found!\n")
for title in title_text:
print(title)
else:
print("\nNo matches found!\n\n")
It gives me the text within the XML tags ALONG with the tags. An example of a single output would be-
<title>Four-minute warning</title>
My question is, how should I frame the pattern within the re.findall() or re.search() methods so that and tags are skipped and all I get is the text between them.
Thanks for your help!
Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print " ".join(y[:y.index('1080p')-1])
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Regex example (try the unit tests to see the regex in action)
Explanation:
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub.See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming, there is always a four-digit-year, or a four-digit-resolution notation within the movie's file name, a simple solution replaces the not-wanted parts as this:
"(?:\.|\d{4,4}.+$)"
by a blank, strip()'ing them afterwards ...
For example:
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar
I trying to create a regex to extract telephone, streetAddress, Pages values (9440717256,H.No. 3-11-62, RTC Colony..) from the html page in python. These three fields are optional I tried this regex, but output is inconsistent
telephone\S+>(.+)</em>.*(?:streetAddress\S+(.+)</span>)?.*(?:pages\S+>(.+)</a></span>)?
sample string
<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profi`enter code here`le-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>
Can anyone help me building the regex please ?
Considering that your input is not valid HTML and that it may be subject to change, you can use a HTML parser like BeautifulSoup. But if your input changes, these simple selectors will have to be adapted.
from bs4 import BeautifulSoup
h = """<em phone="**telephone**">9440717256</em></div></div></li><li class="row"><i class="icon-sm icon-address"></i><div class="profile-details"><strong>Address</strong><div class="profi`enter code here`le-child"><address itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress" class="data-item"><span itemprop="**streetAddress**">H.No. 3-11-62, RTC Colony</span>, <span>Vastu Colony, </span><span class="text-black" itemprop="addressLocality">Lal Bahadur Nagar</span>"""
soup = BeautifulSoup(h)
Edit: Since you now tell us that you want the text of the elements that have the specified attribute value, you can use a function as filter.
def find_phone(tag):
return tag.has_attr("phone") and tag.get("phone") == "**telephone**"
def find_streetAddress(tag):
return tag.has_attr("itemprop") and tag.get("itemprop") == "**streetAddress**"
def find_pages(tag):
return tag.has_attr("title") and tag.get("title") == "**Pages**"
print(soup.find(find_phone).string)
print(soup.find(find_streetAddress).string)
print(soup.find(find_pages).string)
Output:
9440717256
H.No. 3-11-62, RTC Colony
Lal Bahadur Nagar
Regex is safe to use in case you know the HTML provider, what the code inside looks like.
Then, just use alternations and named capture groups.
telephone[^>]*>(?P<Telephone>[^<]+)|streetAddress[^>]*>(?P<Address>[^<]+)|Pages[^>]*>(?P<Pages>[^<]+)
See demo
In case > is not serialized, you can use this regex (more universal one, edit: now, verbose):
telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)
Sample demo on IDEONE
Pasting regex code part:
p = re.compile(ur'''telephone[^<]*> # Looking for telephone
(?P<Telephone>[^<]+) # Capture telephone (all text up to the next tag)
|
streetAddress[^<]*> # Looking for streetAddress
(?P<Address>[^<]+) # Capture address (all text up to the next tag)
|
Pages[^<]*> # Looking for Pages
(?P<Pages>[^<]+) # Capture Pages (all text up to the next tag)''', re.IGNORECASE | re.VERBOSE)
test_str = "YOUR STRING"
print filter(None, [x.group("Telephone") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Address") for x in re.finditer(p, test_str)])
print filter(None, [x.group("Pages") for x in re.finditer(p, test_str)])
Output (doubled results are the result of my duplicating the input string with different node order):
[u'9440717256', u'9440717256']
[u'H.No. 3-11-62, RTC Colony', u'H.No. 3-11-62, RTC Colony']
[u'Lal Bahadur Nagar', u'Lal Bahadur Nagar']
I have a small problem to extract the words which are in bold:
Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc
I have to clarify more my question :
I'm trying to extract some information from web pages, so each time i found a kind of sentence but me i'm interesting in which is in bold. I give you the adress of the tree wab pages :
(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')
Any ideas?
You can use positive look ahead to see if Rouge or Blanc or Rosé is after the word we are looking for:
>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
... print re.search(ur'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s, re.UNICODE).group(0)
...
Médoc
Margaux
Pessac-Léognan
Seems like it's always the second to last term in the comma separated list? You can split and select the second to last, example:
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]
Otherwise, if you want regex alone... I'll suggest this:
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)
And trim if necessary for spaces.