My regex is not working properly - python

My regex is not working properly. Below is the text before and after applying the regex. I'm using re.search(r'(?ms).*?{{(Infobox film.*?)}}', text). As you will see, the result stops after | country = Assam, {{IND and nothing beyond that point is captured. Can you please help? Thanks.
Before regex:
{{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
}}
After regex:
{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND
Why does the regex stop at this point, at country = Assam, {{IND?
Edit: expected result
Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free

Your regex is catching everything between the first {{ and the first }}, which is in the "country" entry of the infobox. If you want everything between the first {{ and the last }}, then you want to make the .* inside the braces greedy by removing the ?:
re.search(r'(?ms).*?{{(Infobox film.*)}}', text)
Note that this will find the last }} in the input (e.g. if there's another template far below the end of the infobox, it will find the end of that), so this may not be what you want. When you have nested things like this, regex is not always the best way to search.
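For nested templates, one alternative to a regex is to count braces. The following is a minimal sketch, not part of the original answer: the function name extract_template and the sample text are made up for illustration, and it assumes braces only ever appear as {{ and }} pairs (triple braces such as {{{param}}} are not handled).

```python
def extract_template(text, name):
    """Return the body of the first {{name ...}} template, honoring nested {{ }}.

    Hypothetical helper for illustration; returns None if no balanced
    template with that name is found.
    """
    start = text.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1          # entering a (possibly nested) template
            i += 2
        elif pair == "}}":
            depth -= 1          # leaving a template
            i += 2
            if depth == 0:      # the outermost template just closed
                return text[start + 2:i - 2]
        else:
            i += 1
    return None                 # braces never balanced


sample = """{{Infobox film
| name = Papori
| country = Assam, {{IND}}
| budget =
}}
See also {{Other template}}."""

body = extract_template(sample, "Infobox film")
print(body)
```

Unlike the greedy regex, this stops at the }} that closes the infobox itself, so a later template in the same text is not swallowed.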

Related

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'
Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-'])
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
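If installing Biopython is not an option, a rough alternative is the standard library's difflib. This is a sketch under assumptions, not the method used above: difflib.SequenceMatcher finds matching blocks rather than computing a scored global alignment, and the helper name align_words is made up for illustration. It returns a list of pairs instead of a dictionary so that repeated words do not overwrite each other.

```python
from difflib import SequenceMatcher


def align_words(words1, words2):
    """Pair each word of words1 with the matching span of words2 (or None)."""
    matcher = SequenceMatcher(None, words1, words2, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs are paired word by word.
            pairs.extend(zip(words1[i1:i2], words2[j1:j2]))
        elif tag == "replace":
            # A differing run maps every word to the whole replacement span.
            target = " ".join(words2[j1:j2])
            pairs.extend((word, target) for word in words1[i1:i2])
        elif tag == "delete":
            # Words with no counterpart in words2.
            pairs.extend((word, None) for word in words1[i1:i2])
        # "insert" runs exist only in words2, so there is nothing to pair.
    return pairs


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

pairs = align_words(string1.split(), string2.split())
for word, target in pairs:
    print(word, '->', target)
```

Exact string equality is used for matching here; a fuzzy comparison would need a different approach, such as the custom match functions mentioned above.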
Previous answers offer biology-based alignment methods; there are NLP-based alignment methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular, it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT: A A A C C C T T A G
+ + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments separately instead of computing the score only once.
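For reference, here is a minimal sketch of the textbook dynamic-programming form of Needleman-Wunsch, applied to lists of words rather than DNA letters. The scoring values (match=1, mismatch=-1, gap=-1), the function name, and the "<GAP>" marker are assumptions chosen for illustration, not settings from this answer.

```python
GAP = "<GAP>"


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, list of aligned pairs)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], GAP))
            i -= 1
        else:
            pairs.append((GAP, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


best_score, aligned = needleman_wunsch("The reason that I went".split(),
                                       "reason I went".split())
print(best_score)
for left, right in aligned:
    print(left, '|', right)
```

When several alignments tie for the best score, the traceback returns only one of them; this version prefers a diagonal step, then a gap in the second sequence.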

How do I print a specific part of a YAML string

My YAML database:
left:
  - title: Active Indicative
    fill: "#cb202c"
    groups:
      - "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
My Python code:
import io
import yaml
with open("C:/Users/colin/Desktop/LBot/latin3_2.yaml", 'r', encoding="utf8") as f:
    doc = yaml.safe_load(f)
txt = doc["left"][1]["groups"][1]
print(txt)
Currently my output is Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt] but I would like the output to be ō, is, it, or imus. Is this possible in PyYaml and if so how would I implement it? Thanks in advance.
I don't have a PyYaml solution, but if you already have the string from the YAML file, you can use Python's regex module to extract the text inside the [ ].
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
parts = txt.split(" | ")
print(parts)
# ['Present', 'dūc[ō]', 'dūc[is]', 'dūc[it]', 'dūc[imus]', 'dūc[itis]', 'dūc[unt]']
pattern = re.compile("\\[(.*?)\\]")
output = []
for part in parts:
    match = pattern.search(part)
    if match:
        # group(0) is the matched part, ex. [ō]
        # group(1) is the text inside the (.*?), ex. ō
        output.append(match.group(1))
    else:
        output.append(part)
print(" | ".join(output))
# Present | ō | is | it | imus | itis | unt
The code first splits the text into individual parts, then loops through each part searching for the pattern [x]. If it finds it, it extracts the text inside the brackets from the match object and stores it in a list. If the part does not match the pattern (ex. 'Present'), it just adds it as is.
At the end, all the extracted strings are joined together to rebuild the string without the brackets.
EDIT based on comment:
If you just need one of the strings inside the [ ], you can use the same regex pattern but use the findall method instead on the entire txt, which will return a list of matching strings in the same order that they were found.
import re
txt = "Present | dūc[ō] | dūc[is] | dūc[it] | dūc[imus] | dūc[itis] | dūc[unt]"
pattern = re.compile("\\[(.*?)\\]")
matches = pattern.findall(txt)
print(matches)
# ['ō', 'is', 'it', 'imus', 'itis', 'unt']
Then it's just a matter of using some variable to select an item from the list:
selected_idx = 1 # 0-based indexing so this means the 2nd item
print(matches[selected_idx])
# is

Regular expression Variant

I want to extract the length of a dress from a pandas DataFrame. A row of that DataFrame looks like this:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
As you can see, the length is contained between About and shoulder, but in some cases shoulder is replaced by waist, hem, etc. Below is my Python script that finds the length, but it fails when, say, there is a comma after About, since I am slicing the matched string.
import re
def regexfinder(string_var):
    res=''
    x=re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
    tohave=int(x[1:3])
    if tohave >=16 and tohave<=36:
        res="Mini"
        return res
    if tohave>36 and tohave<40:
        res="Above the Knee"
        return res
    if tohave >=40 and tohave<=46:
        res="Knee length"
        return res
    if tohave>46 and tohave<49:
        res="Mid/Tea length"
        return res
    if tohave >=49 and tohave<=59:
        res="Long/Maxi length"
        return res
    if tohave>59:
        res="Floor Length"
        return res
Your regex (?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips]) uses a character class for the words shoulder,waist,hem,bust,neck,bust,top,hips.
I think you want to put them in a non-capturing group using alternation (|).
Try it like this, using an optional comma ,?:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))
The size is in the first capturing group.
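As a small sketch of how that pattern could be applied (the helper name dress_length_inches and the shortened sample strings are made up for illustration):

```python
import re

# Same pattern as above: optional comma after "About", digits captured in group 1,
# and a lookahead requiring one of the body-part words later in the string.
LENGTH_RE = re.compile(
    r"(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))"
)


def dress_length_inches(description):
    """Return the length after 'About' as an int, or None if it is not found."""
    match = LENGTH_RE.search(description)
    return int(match.group(1)) if match else None


plain = dress_length_inches('Long sleeves | About 23" from shoulder to hem | Dry clean')
with_comma = dress_length_inches('Crewneck | About, 31" from shoulder to hem | Hand wash')
missing = dress_length_inches('Crewneck | Hand wash | Imported')
print(plain, with_comma, missing)
```

Returning None when nothing matches avoids the AttributeError that calling .group() on a failed re.search would raise.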
import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
    for i in stringVal.split("|"):  # Split string by "|"
        if i.strip().startswith(strtoCheck):  # Check if the part starts with "About"
            val = i.strip()
            return re.findall(r"\d+", val)[0]  # Extract the first run of digits (returned as a string)

print(getSize(s, "About"))
print(getSize(q, "About"))
Output:
23
31

Split occurrences of time from a large string

In my task I want to fetch only the time and store it in a variable. The time may occur more than once in my string, and it may be "AM" or "PM".
I only want to store these values from my string:
"4:19:27" and "7:00:05". The time may occur more than twice.
str = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
My code is:
str = '''TEXT VIEW : 16908310=android.widget.TextView#405ee2f0=Troubles | 2131034163=android.widget.TextView#405ef6d0=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#40630608=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#40631068=OK | 2131034162=android.widget.TextView#40632078=Sep 12, 2017 4:19:27 AM | VIEW : -1=android.widget.LinearLayout#405ed390 | -1=android.widget.FrameLayout#405edd48 | 16908310=android.widget.TextView#405ee2f0 | 16908290=android.widget.FrameLayout#405eefa8 | -1=android.widget.LinearLayout#405ef468 | 2131034163=android.widget.TextView#405ef6d0 | -1=android.widget.ScrollView#405effc8 | 2131034164=android.widget.TableLayout#405f0cd0 | 2131034158=android.widget.TableRow#4062f7a8 | 2131034159=android.widget.ImageView#4062fcd0 | 2131034160=android.widget.TextView#40630608 | 2131034161=android.widget.RadioButton#40631068 | 2131034162=android.widget.TextView#40632078 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ecb98 | BUTTONS : 2131034161=android.widget.RadioButton#40631068 |'''
if " AM " or " PM " in str:
    Time = str.split(" AM " or " PM ")[0].rsplit(None, 1)[-1]
    print Time
Note that you shouldn't name a variable after a built-in such as str, because doing so shadows the built-in type. You could use a regular expression, like this:
import re
my_string = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
pattern = r'\d{1,2}:\d{2}:\d{2}\s[AP]M'
date_list = re.findall(pattern, my_string)
print(date_list)
# outputs ['4:19:27 AM', '7:00:05 PM']
Explanation of the pattern:
\d{1,2} matches one or two digits
: matches ":"
\d{2} matches exactly two digits
: matches ":"
\d{2} matches exactly two digits
\s matches a space
[AP] matches either an A or a P, only one
M, the last M
Use regex with this expression: ([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM). This pattern will give you two groups: one for the numbers of the time and one for the AM or PM information. This is much better than splitting the string manually. You can test it here, and get used to using regex.
All in all you can use it like this in python:
import re
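p = re.compile(r'([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM)')
# theString is the input string from the question.
# findall returns one (numbers, status) tuple per match; match() would only
# test the start of the string and its result cannot be iterated like this.
for numbers, status in p.findall(theString):
    # prints the digits of the time, e.g. 4:19:27
    print(numbers)
    # prints AM or PM
    print(status)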
It's not a good idea to use str as a variable name because that's a builtin, so assuming your string is in s, here is an interactive demonstration of what I think you want.
>>> import re
>>> re.findall('[=][^|=]+[AP]M [|]', s)
['=Sep 12, 2017 4:19:27 AM |', '=Sep 12, 2017 7:00:05 PM |']
>>> [r.split() for r in re.findall('[=][^|=]+[AP]M [|]', s)]
[['=Sep', '12,', '2017', '4:19:27', 'AM', '|'], ['=Sep', '12,', '2017', '7:00:05', 'PM', '|']]
>>> [r.split()[3] for r in re.findall('[=][^|=]+[AP]M [|]', s)]
['4:19:27', '7:00:05']
>>>
Regular expressions are your friend here. For example:
import re
inputstring = '''...'''
timematch = re.compile(r'\d{1,2}:\d{1,2}:\d{1,2} [AP]M')
print(timematch.findall(inputstring))
The regular expression in question matches any occurrence of XX:XX:XX AM and XX:XX:XX PM, and takes into account time noted as 4:00:00 AM as well as 04:00:00 AM.
It would be easy to use regex:
You can use this regex:
\d+:\d+:\d+
or r'\d{1,2}:\d{1,2}:\d{1,2}'
Code: https://repl.it/Kyqe/0

How can I substitute an expression {{ text }} with re.sub() when 'text' may include further {{ text }} blocks?

I'm trying to parse raw wikipedia article content, e.g. the article on Sweden, using re.sub(). However, I am running into problems trying to substitute blocks of {{some text}}, because they can contain further blocks of {{some text}}.
Here's an abbreviated example from the above article:
{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}
The curly braces within curly braces recursion could theoretically be arbitrarily nested to any number of levels.
If I match the greedy block of {{.+}}, everything is matched from {{Infobox to eo}}, including the text I do not want matched.
If I match the ungreedy block of {{.+?}}, the part from {{Infobox to icon=no}} is matched, as is {{Link GA|eo}}. But then I'm left with the string | common_name [...] not want parsed.
I also tried variants of \{\{.+(\{\{.+\}\})*.+\}\} and \{\{[^\{]+(\{\{[^\{]+\}\})*[^\{]+\}\}, in the hopes of matching only sub-blocks within the larger block, but to no avail.
I'd list all of what I've tried, but I honestly can't remember half and I doubt it'd be of much use anyway. It always comes back to the same problem: that for the double curly end braces }} to match, there needs to have been the same number of {{ occurrences beforehand.
Is this even solvable using regular expressions, or do I need another solution?
Have you considered mwparserfromhell?
import mwparserfromhell
s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""
wikicode = mwparserfromhell.parse(s)
print(wikicode.filter_templates()[0])
Prints:
{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
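If pulling in a parser is not possible and the goal is only to remove the {{ }} blocks, one regex-based workaround is to delete the innermost templates repeatedly until none are left. This is a hedged sketch, not part of the answer above: the helper name strip_templates is made up, and it assumes single braces do not occur inside template text.

```python
import re

# Matches a template that contains no further braces, i.e. an innermost one.
INNERMOST = re.compile(r"\{\{[^{}]*\}\}")


def strip_templates(text):
    """Remove all {{...}} blocks, however deeply nested, one layer per pass."""
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:  # nothing left to remove
            return text


s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""

remaining = strip_templates(s)
print(remaining.strip())
```

Each pass removes one nesting level, so the loop runs roughly as many times as the deepest nesting in the input.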
