the right regex expression in python - python

I have a small problem extracting the words that are shown in bold (Médoc, Margaux and Pessac-Léognan) from these strings:
Médoc, Rouge
2ème Vin, Margaux, Rosé
2ème vin, Pessac-Léognan, Blanc
To clarify my question:
I'm trying to extract some information from web pages. Each page contains a sentence of this kind, but I'm only interested in the part shown in bold. Here are the addresses of the web pages:
(http://www.nicolas.com/page.php/fr/18_409_9829_tourprignacgrandereserve.htm)
(http://www.nicolas.com/page.php/fr/18_409_8236_relaisdedurfortvivens.htm)
re(r'\s*\w+-\w+-\w+|\w+-\w+|\w+[^Rouge,Blanc,Rosé]')
Any ideas?

You can use a positive lookahead to check whether Rouge, Blanc or Rosé follows the word we are looking for:
>>> import re
>>> l = [u"Médoc, Rouge", u"2ème Vin, Margaux, Rosé", u"2ème vin, Pessac-Léognan, Blanc"]
>>> for s in l:
...     print(re.search(r'([\w-]+)(?=\W+(Rouge|Blanc|Rosé))', s).group(0))
...
Médoc
Margaux
Pessac-Léognan

It seems the term you want is always the second to last in the comma-separated list. You can split the string and select the second-to-last element, for example:
>>> myStr = '2ème vin, Pessac-Léognan, Blanc'
>>> res = myStr.split(', ')[-2]
Otherwise, if you want a regex-only solution, I'd suggest this:
>>> res = re.search(r'([^,]+),[^,]+$', myStr).group(1)
Then strip surrounding whitespace if necessary (here group(1) keeps the space that follows the comma).
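The split-based approach above can be wrapped in a small helper that also strips whitespace. This is only a sketch: the function name `appellation` is hypothetical, and it assumes the colour is always the last comma-separated field.

```python
def appellation(label):
    """Return the second-to-last comma-separated field, or None if there is only one field."""
    fields = [field.strip() for field in label.split(",")]
    if len(fields) < 2:
        return None
    return fields[-2]

for label in ["Médoc, Rouge", "2ème Vin, Margaux, Rosé", "2ème vin, Pessac-Léognan, Blanc"]:
    print(appellation(label))
```

Splitting on "," and stripping each field avoids depending on exactly one space after each comma.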

Related

How to slice a string input at a certain unknown index

A string is given as input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem is that the question is always surrounded by unneeded text.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")
The above input examples are generated by the pytesseract OCR library, which scans an image and converts it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the find('?', 1) string method to get the index of the end of the question (assuming for now that the first question mark always ends the question and is not part of the unwanted input). But I can't figure out how to get the index of the first letter of the question. I tried looping in reverse to find the first \n before the question mark, but the question is not always preceded by \n.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0  # TODO
    question = '\n '.join(q[index_first_letter_of_q:index_end_q])
One way to find the index of the question's first word is to search for the first word that has an actual meaning (I suppose you're interested in English words). You could do that with pyenchant:
#!/usr/bin/env python
import enchant
# English dictionary used to decide whether a token is a real word
GLOSSARY = enchant.Dict("en_US")
def isWord(word):
    return GLOSSARY.check(word)
sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
for sentence in sentences:
    # Report the first whitespace-separated token that the dictionary accepts
    for i, w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
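Combining that first-word test with the "first question mark ends the question" assumption from the question gives a complete slice. The sketch below is hedged: `extract_question`, `is_known` and the tiny `KNOWN_WORDS` set are hypothetical stand-ins so that it runs without pyenchant; in practice `enchant.Dict("en_US").check` could be passed as `is_word` instead.

```python
import re

# Hypothetical stand-in vocabulary so the sketch runs without pyenchant.
KNOWN_WORDS = {"what", "is", "your", "name", "given", "skills"}

def is_known(word):
    return word.lower() in KNOWN_WORDS

def extract_question(text, is_word=is_known):
    """Slice text from the first recognised word up to and including the first '?'."""
    end = text.find("?")
    if end == -1:
        return None
    for match in re.finditer(r"\S+", text[:end + 1]):
        # Drop surrounding punctuation before the dictionary lookup.
        token = match.group().strip("?.,!")
        if token and is_word(token):
            return text[match.start():end + 1]
    return None

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```

A garbage token that happens to be a real word would still fool this approach, so it is a heuristic rather than a guarantee.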
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still produce some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
...          "What is your\nlastname and email?\ndasf?lkjas",
...          "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
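For use in a script rather than at the prompt, the same pattern can be wrapped so that input without a match does not raise an AttributeError. This is a small sketch; the name `first_question` is hypothetical.

```python
import re

# Word starting with an upper-case then a lower-case letter, then everything up to the next "?"
QUESTION_RE = re.compile(r"\b[A-Z][a-z][^?]+\?")

def first_question(text):
    """Return the first question-like span in text, or None if there is none."""
    match = QUESTION_RE.search(text)
    return match.group() if match else None

print(first_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```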

How to extract a substring beginning with a specific substring and ends with another specific substring?

I have a string like this:
"1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
And I want to extract
Salzburg (SZG), Sarajevo (SJJ), Bourgas (BOJ), Varna (VAR)
import re
sentence = "1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)"
regs = re.findall(r'[A-Za-z]+\s\([A-Z]{3}\)', sentence)  # [A-Za-z] rather than [A-z], which also matches some punctuation
print(regs)
This follows from my logic in the comments.
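To make the behaviour of that pattern concrete, here is a runnable sketch that joins the matches into one string (the variable name `airports` is an assumption). Note that the pattern matches every "Name (XXX)" pair, so it also returns Vienna (VIE), which the desired output in the question leaves out.

```python
import re

sentence = ("1111Austria9999Salzburg (SZG)Vienna (VIE)1111Bosnia-Herzegovina9999"
            "Sarajevo (SJJ)1111Bulgaria9999Bourgas (BOJ)Varna (VAR)")

# [A-Za-z]+ : the city name; \s : one whitespace character;
# \([A-Z]{3}\) : a three-letter upper-case code in literal parentheses
airports = re.findall(r'[A-Za-z]+\s\([A-Z]{3}\)', sentence)
print(", ".join(airports))
```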

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> and </temp>); the other words between them need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything between the two <temp> tags (because of the .*?):
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, on may also appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.
re.sub('(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1)+" <replace> "+x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened <replace> February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
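One possible way to make the replacement stricter is to match each <temp>...</temp> block first and then substitute only inside it. The following is a sketch under that assumption, not necessarily the solution referred to in the comments; the function name `replace_inside_temp` and its parameters are hypothetical.

```python
import re

def replace_inside_temp(text, word="on", replacement="{replace}"):
    """Replace whole-word occurrences of word, but only inside <temp>...</temp> blocks."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")

    def fix_block(match):
        opening, body, closing = match.groups()
        # A function replacement avoids any special handling of the replacement text.
        return opening + word_re.sub(lambda _: replacement, body) + closing

    return re.sub(r"(<temp>)(.*?)(</temp>)", fix_block, text, flags=re.DOTALL)

string = "<temp>The sale happened on February 22nd</temp> and shipped later on"
print(replace_inside_temp(string))
```

Because the outer pattern stops at the first </temp>, an occurrence of the word outside a block is never touched, and every occurrence inside a block is replaced rather than only the first.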
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True).decode())  # tostring() returns bytes by default
Output:
<temp>The sale happened {replace} February 22nd</temp>
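If installing lxml is not an option, the same idea can be sketched with the standard library's xml.etree.ElementTree. This assumes, as in the example, that the text to change is the direct text of the <temp> element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<temp>The sale happened on February 22nd</temp>")
# Only the element's own text is modified; the tags themselves are untouched.
root.text = root.text.replace(" on ", " {replace} ")
result = ET.tostring(root, encoding="unicode")  # encoding="unicode" returns str, not bytes
print(result)
```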

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from the file name it was downloaded with.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p') - 1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
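As a hedged sketch of how that lookahead pattern might be applied, the helper below matches the title part and then turns the dots into spaces. The name `movie_title` is hypothetical, and file names without a year and resolution simply return None.

```python
import re

# Everything before a separator that is followed by a four-digit year
# and, later on, a resolution such as 720p or 1080p.
TITLE_RE = re.compile(r".*(?=.\d{4}.*\d{3,}p)")

def movie_title(filename):
    """Return the title with dots replaced by spaces, or None if the layout is not recognised."""
    match = TITLE_RE.match(filename)
    if match is None:
        return None
    return match.group().replace(".", " ")

print(movie_title("Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"))
print(movie_title("Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"))
```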
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by:
"(?:\.|\d{4,4}.+$)"
with a blank, then strip() the result.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar

find the appropriate regular expression

Can you help me find the right regular expression to extract the appellation (Margaux or Saint-Julien) from the text on each of these two pages:
in page 1: Margaux, Rouge
in page 2: 2ème Vin, Saint-Julien, Rouge
my code :
item ["appelation"] = res.select('.//div[@class="pro_col_right"]/div[@class="pro_blk_trans"]/div[@class="pro_blk_trans_titre"]/text()').re(r'\s*\w+\-\w+\-\w+|\w+\-\w+|\[^Rouge,Blanc]')
My regular expression extracts Saint-Julien but fails to find Margaux!
Not sure why you need this, but suppose s is the content of your HTML file; then this regex will find what you are looking for:
import re
m = re.search(r'<div class="pro_blk_trans_titre">(.*?)</div>', s)  # non-greedy, so the match stops at the first </div>
print(m.group(1).strip().encode("utf8"))
# page1: b'Margaux, Rouge'
# page2: b'2\xc3\xa8me Vin, Saint-Julien, Rouge'
