Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'

Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-']
                                      )
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.
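If you would rather avoid the Biopython dependency, a rough standard-library sketch is possible with difflib.SequenceMatcher. Note that this is a different technique from pairwise2: it looks for longest matching blocks instead of scoring a global alignment. The helper name align_words is made up for this illustration, and mapping every source word in a changed run to the whole target run is an assumption that happens to suit the example from the question.

```python
from difflib import SequenceMatcher


def align_words(source, target):
    """Map each word of `source` to the word(s) of `target` it lines up with.

    Hypothetical helper for illustration; not part of any library.
    """
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Identical runs: pair the words one to one.
            pairs.extend(zip(src[i1:i2], tgt[j1:j2]))
        elif tag in ("replace", "delete"):
            # Changed run: map every source word to the whole target run
            # (None when the run was deleted outright).
            joined = " ".join(tgt[j1:j2]) or None
            pairs.extend((word, joined) for word in src[i1:i2])
        # "insert" runs have no source word, so they are skipped.
    return pairs


string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

alignment = dict(align_words(string1, string2))
print(alignment['dot'])
print(alignment['live'])
```

Because the strings are split on whitespace only, trailing punctuation stays attached to its word, so the last key is 'twitch.' rather than 'twitch'. A dict also keeps only the last mapping for a repeated word; use the list of pairs directly if that matters.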

The previous answers offer biology-based alignment methods, but there are NLP-based alignment methods as well. The most standard is the Levenshtein edit distance. There are a few variants, and this problem is generally considered closely related to text similarity measures (also known as fuzzy matching). In particular, it is possible to mix alignment at the word level and the character level, as well as different measures (e.g. SoftTFIDF, see this answer).
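As a minimal sketch of the Levenshtein idea, the function below computes the edit distance between two sequences. Because it only compares items for equality, the same code works at the character level (strings) and at the word level (lists of words). The function name is chosen for this example only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Character level: one inserted space separates the two spellings.
print(levenshtein("livestreaming", "live streaming"))
# Word level: one substitution plus one deletion.
print(levenshtein("live streaming".split(), ["livestreaming"]))
```

The contrast between the two calls illustrates why mixing levels can help: the word-level distance treats "livestreaming" as unrelated to "live", while the character-level distance sees them as nearly identical.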

The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT:   A A A C C C T T A G
       + + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments separately instead of computing the score only once.
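For reference, here is a compact word-level sketch of Needleman-Wunsch using the usual dynamic-programming table. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions picked for illustration, not tuned settings; None plays the role of the GAP marker.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two token lists; None marks a gap."""

    def sub(x, y):
        # Score for placing token x opposite token y.
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Walk back from the bottom-right corner to recover one best alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


total, pairs = needleman_wunsch("The reason that I went".split(),
                                "reason I went".split())
for left, right in pairs:
    print(left, "<->", right if right is not None else "<GAP>")
```

To tolerate misspellings such as "wuz" for "was", the equality test in sub could be swapped for a fuzzy similarity score; that would be a further assumption beyond this sketch.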

Related

Regex Pattern in Python for special characters

I asked a similar question here a few days ago and the answers were a great help. My new challenge is to extend the regex pattern to look for specific formats. I thought I had solved it by building and testing a pattern on regex101, but when I applied it in Python I received the error 'pattern contains no capture groups'. Below is a test df, followed by the code provided via Stack Overflow that worked with digits only.
df = pd.DataFrame([["{1} | | Had a Greeter welcome clients {1.0} | | Take measures to ensure a safe and organized distribution {1.000} | | Protected confidentiality of clients (on social media, pictures, in conversation, own congregation members receiving assistance, etc.)",
"{1.00} | | Chairs for clients to sit in while waiting {1.0000} | | Take measures to ensure a safe and organized distribution"],
["{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support {1} | | Mental Health Services/Counseling",
"{1}| | Clothing Assistance {1 } | | Healthcare {1} | | Mental Health Services/Counseling {1} | | Spiritual Support {1} | | Job Skills Training"]
] , columns = ['CF1', 'CF2'])
Here is the code that worked for digits only. I replaced the pattern with my new regex pattern and it did not work.
Original code: (df.stack().str.extractall('(\d+)')[0] .groupby(level=[0,1]).sum().unstack())
New code (failed to recognize pattern): (df.stack().str.extractall(r'(?<=\{)[\d+\.\ ]+(?=\})')[0].astype(int) .groupby(level=[0,1]).sum().unstack())
In the test df, I only want to capture the numbers between "{}". There is a mixture of decimals and spaces following the number I want to capture and sum. The new pattern did not work when applied, so any help would be great!

Your (?<=\{)[\d+\.\ ]+(?=\}) regex contains no capturing groups while Series.str.extractall requires at least one capturing group to output a value.
You need to use
(df.stack().str.extractall(r'\{\s*(\d+(?:\.\d+)?)\s*}')[0].astype(float) .groupby(level=[0,1]).sum().unstack())
Output:
CF1 CF2
0 3.0 2.0
1 3.0 5.0
The \{\s*(\d+(?:\.\d+)?)\s*} regex matches
\{ - a { char
\s* - zero or more whitespaces
(\d+(?:\.\d+)?) - Group 1 (note this group captured value will be the output of the extractall method, it requires at least one capturing group): one or more digits, and then an optional occurrence of a . and one or more digits
\s* - zero or more whitespaces
} - a } char.
See the regex demo.
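The role of the capturing group can be checked outside pandas with the standard re module. The sketch below reuses the regex from this answer on one cell of the test df; the helper name sum_braced_numbers is made up for the example.

```python
import re

# Same pattern as above: the parentheses decide what gets returned.
BRACED_NUMBER = re.compile(r"\{\s*(\d+(?:\.\d+)?)\s*\}")


def sum_braced_numbers(text):
    """Sum every number found between curly braces in `text`."""
    # findall returns only the contents of capture group 1 for each match.
    return sum(float(value) for value in BRACED_NUMBER.findall(text))


cell = ("{1 } | Financial literacy/budgeting {1 } | | Monetary/Bill Support "
        "{1} | | Mental Health Services/Counseling")
print(BRACED_NUMBER.findall(cell))
print(sum_braced_numbers(cell))
```

Without the \s* parts, the two "{1 }" entries would be skipped, which is why the whitespace handling matters for this data.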

You can use '\{([\d.]+)\}':
(df.stack().str.extractall(r'\{([\d.]+)\}')[0]
.astype(float).groupby(level=[0,1]).sum().unstack())
output:
CF1 CF2
0 3.0 2.0
1 1.0 4.0
as int only:
(df.stack().str.extractall(r'\{(\d+)(?:\.\d+)?\}')[0]
.astype(int).groupby(level=[0,1]).sum().unstack())
output:
CF1 CF2
0 3 2
1 1 4

NLTK CFG No spaces between non terminals

I want to define a CFG txt file to read into NLTK using nltk.CFG.fromstring(). Thing is, when I define rules, I want to make rules that don't output spaces between non terminals. For example, say I have this grammar:
X -> TENS ONES
TENS -> '二十' | '三十' | '四十' | '五十' | '六十' | '七十' | '八十' | '九十'
ONES -> '一' | '二' | '三' | '四' | '五' | '六' | '七' | '八' | '九'
If I want the word "二十一", I cannot generate it because TENS ONES will insert a space and produce "二十 一". If I instead write the rule as X -> TENSONES, TENSONES is treated as one non-terminal, not two, and thus there is no parse. Is there a way I can use two non-terminals in a production without needing a space between them?

Is there a way to add a new line after every ']' in Python?

I originally had a string containing BBCode that I wanted to format better so that it is readable.
I had something like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG][b]\\u25ba osu! Mapping Theory[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]Linear Momentum[\\/url] | [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]Linear Momentum 2[\\/url] | [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]Angular Momentum and Circular Flow[\\/url] | [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]Active and Passive Mapping[\\/url]\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]Slider Flow[\\/url] | [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]Stream Flow[\\/url] | [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]Slider Mechanics[\\/url] | [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]Aesthetics by Symmetry[\\/url] | [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]Aesthetics by Complexity[\\/url] | [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]Defining Flow[\\/url]\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]Flow and Aesthetics[\\/url] | [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]Angle Emphasis[\\/url] | [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]Strain[\\/url] | [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]Pressure[\\/url] | [url=https:\\/\\/youtu.be\\/jm43HilQhYk]Tension[\\/url] | [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]Song Choice[\\/url] | [url=https:\\/\\/youtu.be\\/BNjVu8xq4os]Song Length[\\/url]\\n\\n[url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG6t5MCwGnq87iYZnE5G7aZL][b]\\u25ba osu! 
Rambling[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/-Beeh7dKyTk]Storyboards[\\/url] | [url=https:\\/\\/youtu.be\\/i6zzHMzwIzU]Why[\\/url]\\n\\n[url=https:\\/\\/youtu.be\\/_sBP7ttRQog]0 BPM[\\/url] | [url=https:\\/\\/youtu.be\\/UgtR6WnuTT8]ppv2 Pt.1[\\/url] | [url=https:\\/\\/youtu.be\\/Bx14u5tltyE]ppv2 Pt.2[\\/url] | [url=https:\\/\\/youtu.be\\/-095yuSLE4Y]Super high star rating[\\/url][\\/notice][url=https:\\/\\/amo.s-ul.eu\\/oApvJHWA][b]Skin v3.4[\\/b][\\/url]\\n[size=85]Personal edit of [url=https:\\/\\/osu.ppy.sh\\/forum\\/t\\/481314]Re:m skin by Choilicious[\\/url][\\/size]\\n\\n[img]http:\\/\\/puu.sh\\/qqv6C\\/0aaca52f51.jpg[\\/img][url=https:\\/\\/osu.ppy.sh\\/u\\/Satellite][img]http:\\/\\/puu.sh\\/qqv6K\\/94681bed3f.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Sellenite][img]http:\\/\\/puu.sh\\/qqv6T\\/c943ed1703.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Morinaga][img]http:\\/\\/puu.sh\\/qqv70\\/cfbdb2a242.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/-Mo-][img]http:\\/\\/puu.sh\\/qqv77\\/ca489f2d00.jpg[\\/img][\\/url]\\n[notice]I don\'t really do nomination stuff often anymore. \\nHowever, please do show me your map if it\'s any of the following:[box=][b]Bounty[\\/b]\\n[size=50]High priority modding for these artists\\/songs (maybe a GD, just ask).\\nPreferably non-cut versions and songs that have no ranked maps yet.[\\/size]\\n\\nYuuhei Satellite\\nYuuhei Catharsis\\nShoujo Fractal\\nHoneyWorks (non-vocaloid)\\nTrySail\\nClariS\\n\\nClariS - CLICK (Asterisk DnB Remix), [size=85]either version.[\\/size]\\nfhana - Outside of Melancholy, [size=85]a version that isn\'t cut pls[\\/size]\\nAny cover of \\u7832\\u96f7\\u6483\\u6226\\u3001\\u59cb\\u3081![\\/box]I also do storyboard checks for any map.\\n\\nPMs are open for anything. Ask me anything. 
\\nAsk me what my favourite colour is if you really want even.[\\/notice][box=Guests][b]Ranked[\\/b]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1575100][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Terasareru kurai no Shiawase [Lunatic][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1794557][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Arehateta Chijou no Uta [Collab Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1592915][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] Tanaka Hirokazu - C-TYPE [TetriS-TYPE] [S-TYPE][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1490130][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] TrySail - adrenaline!!! [Insane][\\/url] [size=85]Slightly ruined version.[\\/size]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1401096][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Shunkan Everlasting [Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/795269][img]
Basically unreadable currently.
I tried making it look like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG]
[b]
\\u25ba osu! Mapping Theory[\\/b]
[\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]
Linear Momentum[\\/url]
| [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]
Linear Momentum 2[\\/url]
| [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]
Angular Momentum and Circular Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]
Active and Passive Mapping[\\/url]
\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]
Slider Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]
Stream Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]
Slider Mechanics[\\/url]
| [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]
Aesthetics by Symmetry[\\/url]
| [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]
Aesthetics by Complexity[\\/url]
| [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]
Defining Flow[\\/url]
\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]
Flow and Aesthetics[\\/url]
| [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]
Angle Emphasis[\\/url]
| [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]
Strain[\\/url]
| [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]
Pressure[\\/url]
| [url=https:\\/\\/youtu.be\\/jm43HilQhYk]
Tension[\\/url]
| [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]
Song Choice[\\/url]
|
Where there's a new line after every ']'
I've tried tweaking re.sub("[\(\[].*?[\)\]]", "", str(x)) to fit what I need but it just deletes everything inside of them. (I have no idea how regex works)
How can I go about this?

There's no need for a regular expression, just use the simple str.replace() function.
x = x.replace(']', ']\n')
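A quick check of this on a shortened, made-up BBCode snippet (not the full string from the question):

```python
bbcode = "[centre][notice][b]osu! Mapping Theory[/b]"

# Every closing bracket is followed by a newline, so each tag ends its line.
formatted = bbcode.replace("]", "]\n")
print(formatted)
```

Note that the result also ends with a newline after the final ']'; call formatted.rstrip("\n") if that is unwanted.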

It really depends on exactly what you want your output to look like.
I interpreted your output as wanting a newline after each url= tag, which would require the following regex:
output = re.sub(r"(\[url.*?\])", r"\1\n", input)
The parentheses () form a capture group, which is then referenced in the replacement as \1 since it's the first unnamed capture group.
You can change the regex as you wish, but keep the part you want to preserve inside the capture group.
If you want to experiment with regex you can use https://regexr.com/ which is an amazing resource when fiddling around with regex.
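To see the capture-group version in action, here is a small sketch on a shortened sample with placeholder URLs (the variable names are chosen for this example):

```python
import re

text = ("[url=https://youtu.be/abc]Linear Momentum[/url] | "
        "[url=https://youtu.be/def]Slider Flow[/url]")

# \1 re-inserts the matched opening tag, then a newline is appended.
# Closing tags such as [/url] do not match because of the slash after '['.
output = re.sub(r"(\[url.*?\])", r"\1\n", text)
print(output)
```

Only the opening url tags gain a newline here, unlike the str.replace approach, which breaks the line after every ']'.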

Regular expression Variant

I want to extract the length of a dress from a pandas DataFrame. A row of that DataFrame looks like this:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
As you can see, the length is contained between About and shoulder, but in some cases shoulder is replaced by waist, hem, etc. Below is my Python script that finds the length, but it fails when, say, there is a comma after About, since I am slicing the matched string.
import re
def regexfinder(string_var):
    res=''
    x=re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
    tohave=int(x[1:3])
    if tohave >=16 and tohave<=36:
        res="Mini"
        return res
    if tohave>36 and tohave<40:
        res="Above the Knee"
        return res
    if tohave >=40 and tohave<=46:
        res="Knee length"
        return res
    if tohave>46 and tohave<49:
        res="Mid/Tea length"
        return res
    if tohave >=49 and tohave<=59:
        res="Long/Maxi length"
        return res
    if tohave>59:
        res="Floor Length"
        return res

Your regex (?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips]) uses a character class for the words shoulder,waist,hem,bust,neck,bust,top,hips, so it matches any single one of those characters rather than a whole word.
I think you want to put them in a non-capturing group using alternation (|).
Try it like this, using an optional comma ,?:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))
The size is in the first capturing group.
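Putting the pieces together, one possible rewrite of the question's function captures the digits directly instead of slicing the match. This is a sketch under assumptions: the simplified pattern requires the inch mark (") right after the number and drops the body-part lookahead, and the name dress_length_category is invented here. The thresholds are the ones from the question.

```python
import re

# Optional comma after "About", then the number followed by an inch mark.
LENGTH_RE = re.compile(r'About,?\s*(\d+)"')


def dress_length_category(description):
    """Return the length category for a description, or None if not found."""
    found = LENGTH_RE.search(description)
    if found is None:
        return None
    inches = int(found.group(1))
    if inches < 16:
        return None  # below the smallest range used in the question
    if inches <= 36:
        return "Mini"
    if inches < 40:
        return "Above the Knee"
    if inches <= 46:
        return "Knee length"
    if inches < 49:
        return "Mid/Tea length"
    if inches <= 59:
        return "Long/Maxi length"
    return "Floor Length"


print(dress_length_category('Long sleeves | About 23" from shoulder to hem'))
print(dress_length_category('Self-tie closure | About, 31" from shoulder to hem'))
```

Returning None when nothing matches avoids the AttributeError that calling .group(0) on a failed re.search would raise.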

import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
    for i in stringVal.split("|"):  # Split string by "|"
        if i.strip().startswith(strtoCheck):  # Check if the segment starts with "About"
            val = i.strip()
            return re.findall(r"\d+", val)[0]  # Extract the first run of digits (as a string)
print(getSize(s, "About"))
print(getSize(q, "About"))
Output:
23
31

Split occurrences of time from a large string

In my task I want to fetch only the time and store it in a variable. In my string the time may occur more than once, and it may be "AM" or "PM".
I only want to store these values from my string:
"4:19:27" and "7:00:05". The time may occur more than twice.
str = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
My code is
str = '''TEXT VIEW : 16908310=android.widget.TextView#405ee2f0=Troubles | 2131034163=android.widget.TextView#405ef6d0=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#40630608=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#40631068=OK | 2131034162=android.widget.TextView#40632078=Sep 12, 2017 4:19:27 AM | VIEW : -1=android.widget.LinearLayout#405ed390 | -1=android.widget.FrameLayout#405edd48 | 16908310=android.widget.TextView#405ee2f0 | 16908290=android.widget.FrameLayout#405eefa8 | -1=android.widget.LinearLayout#405ef468 | 2131034163=android.widget.TextView#405ef6d0 | -1=android.widget.ScrollView#405effc8 | 2131034164=android.widget.TableLayout#405f0cd0 | 2131034158=android.widget.TableRow#4062f7a8 | 2131034159=android.widget.ImageView#4062fcd0 | 2131034160=android.widget.TextView#40630608 | 2131034161=android.widget.RadioButton#40631068 | 2131034162=android.widget.TextView#40632078 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ecb98 | BUTTONS : 2131034161=android.widget.RadioButton#40631068 |'''
if " AM " or " PM " in str:
    Time = str.split(" AM " or " PM ")[0].rsplit(None, 1)[-1]
    print Time

Note that you shouldn't name a variable after a built-in such as str. You could use a regular expression, like this:
import re
my_string = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
pattern = r'\d{1,2}:\d{2}:\d{2}\s[AP]M'
date_list = re.findall(pattern, my_string)
print(date_list)
# outputs ['4:19:27 AM', '7:00:05 PM']
Explanation of the pattern:
\d{1,2} matches one or two digits
: matches ":"
\d{2} matches exactly two digits
: matches ":"
\d{2} matches exactly two digits
\s matches a whitespace character
[AP] matches either an A or a P, only one
M matches the final M
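As a side note on the question's original attempt, the condition `" AM " or " PM " in str` does not test both markers. The sketch below shows what Python actually evaluates and how a lookahead returns only the time, without the AM/PM suffix. The sample text is a shortened stand-in for the full string, and the variable names are chosen for this example.

```python
import re

text = ("TextView#4082ac98=Sep 12, 2017 4:19:27 AM | "
        "TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW")

# Parsed as `" AM " or (" PM " in text)`: a non-empty string is truthy,
# so the expression is always the string " AM ", whatever text contains.
always_am = " AM " or " PM " in text

# Testing each marker explicitly gives the intended check.
has_time = any(marker in text for marker in (" AM ", " PM "))

# The lookahead requires AM/PM after the time but leaves it out of the match.
times = re.findall(r"\d{1,2}:\d{2}:\d{2}(?=\s[AP]M)", text)
print(times)
```

The same precedence issue affects `str.split(" AM " or " PM ")`, which always splits on " AM " only.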

Use regex with this expression: ([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM). This pattern will give you two groups: one for the numbers of the time and one for the AM or PM information. This is much better than splitting the string manually. You can test it here, and get used to using regex.
All in all you can use it like this in python:
import re
p = re.compile('([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM)')
for (numbers, status) in p.findall(theString):
    # prints the numbers like 04:02:55
    print(numbers)
    # prints the AM or PM
    print(status)

It's not a good idea to use str as a variable name because that's a built-in. Assuming your string is in s, here is an interactive demonstration of what I think you want.
>>> import re
>>> re.findall('[=][^|=]+[AP]M [|]', s)
['=Sep 12, 2017 4:19:27 AM |', '=Sep 12, 2017 7:00:05 PM |']
>>> [r.split() for r in re.findall('[=][^|=]+[AP]M [|]', s)]
[['=Sep', '12,', '2017', '4:19:27', 'AM', '|'], ['=Sep', '12,', '2017', '7:00:05', 'PM', '|']]
>>> [r.split()[3] for r in re.findall('[=][^|=]+[AP]M [|]', s)]
['4:19:27', '7:00:05']
>>>

Regular expressions are your friend here. For example:
import re
inputstring = '''...'''
timematch = re.compile(r'\d{1,2}:\d{1,2}:\d{1,2} [AP]M')
print(timematch.findall(inputstring))
The regular expression in question matches any occurrence of XX:XX:XX AM and XX:XX:XX PM, and takes into account time noted as 4:00:00 AM as well as 04:00:00 AM.

It would be easy to use regex:
You can use this regex:
\d+:\d+:\d+
or r'\d{1,2}:\d{1,2}:\d{1,2}'
Code: https://repl.it/Kyqe/0
