Regular expression Variant

I want to extract the length of a dress from a pandas DataFrame. A row of that DataFrame looks like this:
A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4
As you can see, the length appears between "About" and "shoulder", but in some cases "shoulder" is replaced by "waist", "hem", etc. Below is my Python script that finds the length, but it fails when, say, there is a comma after "About", because I am slicing the matched string.
import re
def regexfinder(string_var):
    res=''
    x=re.search(r"(?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips])", string_var).group(0)
    tohave=int(x[1:3])
    if tohave >=16 and tohave<=36:
        res="Mini"
        return res
    if tohave>36 and tohave<40:
        res="Above the Knee"
        return res
    if tohave >=40 and tohave<=46:
        res="Knee length"
        return res
    if tohave>46 and tohave<49:
        res="Mid/Tea length"
        return res
    if tohave >=49 and tohave<=59:
        res="Long/Maxi length"
        return res
    if tohave>59:
        res="Floor Length"
        return res

Your regex (?<=About).*?(?=[shoulder,waist,hem,bust,neck,bust,top,hips]) uses a character class, so the lookahead matches any single one of the listed characters (a letter or a comma) rather than one of the words shoulder, waist, hem, bust, neck, top or hips.
I think you want to put them in a non-capturing group using alternation with |.
Try it like this, using an optional comma ,?:
(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))
The size is in the first capturing group.
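As a minimal sketch of how the alternation-based pattern could be used from Python (the names LENGTH_RE and extract_length are hypothetical, and the two sample strings are shortened versions of the descriptions on this page):

```python
import re

# Pattern with an optional comma after "About" and a non-capturing
# group of alternatives inside the lookahead.
LENGTH_RE = re.compile(
    r"(?<=About),? (\d+)(?=.*?(?:shoulder|waist|hem|bust|neck|top|hips))"
)

def extract_length(description):
    """Return the length in inches as an int, or None if nothing matches."""
    match = LENGTH_RE.search(description)
    return int(match.group(1)) if match else None

print(extract_length('Long sleeves | About 23" from shoulder to hem | Dry clean'))
print(extract_length('Crewneck | About, 31" from shoulder to hem | Hand wash'))
```

Returning None instead of calling .group() directly avoids an AttributeError on rows that have no length at all.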

import re
s = """A-line dress with darting at front and back | Surplice neckline | Long sleeves | About 23" from shoulder to hem | Triacetate/polyester | Dry clean | Imported | Model shown is 5'10" (177cm) wearing a size 4"""
q = """'Velvet dress featuring mesh front, back and sleeves | Crewneck | Long bell sleeves | Self-tie closure at back cutout | About, 31" from shoulder to hem | Viscose/nylon | Hand wash | Imported | Model shown is 5\'10" (177cm) wearing a size Small.'1"""
def getSize(stringVal, strtoCheck):
    for i in stringVal.split("|"):  # Split string by "|"
        if i.strip().startswith(strtoCheck):  # Check if the segment starts with "About"
            val = i.strip()
            return re.findall(r"\d+", val)[0]  # Extract the first run of digits (as a string)
print(getSize(s, "About"))
print(getSize(q, "About"))
Output:
23
31

Related

Find the most likely word alignment between two strings in Python

I have 2 similar strings. How can I find the most likely word alignment between these two strings in Python?
Example of input:
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
Desired output:
alignment['my'] = 'my'
alignment['channel'] = 'channel'
alignment['is'] = 'is'
alignment['youtube'] = 'youtube.com/example'
alignment['dot'] = 'youtube.com/example'
alignment['com'] = 'youtube.com/example'
alignment['slash'] = 'youtube.com/example'
alignment['example'] = 'youtube.com/example'
alignment['and'] = 'and'
alignment['then'] = 'then'
alignment['I'] = 'I'
alignment['also'] = 'also'
alignment['do'] = 'do'
alignment['live'] = 'livestreaming'
alignment['streaming'] = 'livestreaming'
alignment['on'] = 'on'
alignment['twitch'] = 'twitch'

Alignment is tricky. spaCy can do it (see Aligning tokenization) but AFAIK it assumes that the two underlying strings are identical which is not the case here.
I used Bio.pairwise2 a few years back for a similar problem. I don't quite remember the exact settings, but here's what the default setup would give you:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'
alignments = pairwise2.align.globalxx(string1.split(),
                                      string2.split(),
                                      gap_char=['-'])
The resulting alignments - pretty close already:
>>> format_alignment(*alignments[0])
my channel is youtube dot com slash example - and then I also do live streaming - on twitch.
| | | | | | | | | |
my channel is - - - - - youtube.com/example and then I also do - - livestreaming on twitch.
Score=10
You can provide your own matching functions, which would make fuzzywuzzy an interesting addition.

Previous answers offer biology-based alignment methods, but there are NLP-based alignment methods as well. The most standard would be the Levenshtein edit distance. There are a few variants, and generally this problem is considered closely related to the question of text similarity measures (aka fuzzy matching, etc.). In particular, it's possible to mix alignment at the level of words and characters, as well as different measures (e.g. SoftTFIDF, see this answer).
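For reference, here is a minimal sketch of the plain Levenshtein distance mentioned above, written so that it works on any two sequences (characters of a string or lists of words). The function name levenshtein is an assumption for illustration, and unit costs for insertion, deletion and substitution are assumed:

```python
def levenshtein(a, b):
    """Edit distance between two sequences with unit costs."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or free match
            ))
        prev = cur
    return prev[-1]

string1 = 'my channel is youtube dot com slash example and then I also do live streaming on twitch.'
string2 = 'my channel is youtube.com/example and then I also do livestreaming on twitch.'

# Character level and word level use the same function.
print(levenshtein("kitten", "sitting"))
print(levenshtein(string1.split(), string2.split()))
```

This only returns the distance; recovering the alignment itself requires keeping the full table and tracing back through it.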

The Needleman-Wunsch Algorithm
Biologists sometimes try to align the DNA of two different plants or animals to see how much of their genome they share in common.
MOUSE: A A T C C G C T A G
RAT:   A A A C C C T T A G
       + + - + + - - + + +
Above "+" means that pieces of DNA match.
Above "-" means that pieces of DNA mis-match.
You can use the full ASCII character set (128 characters) instead of the letters ATCG that biologists use.
I recommend using the Needleman-Wunsch algorithm.
Needleman-Wunsch is not the fastest algorithm in the world.
However, Needleman-Wunsch is easy to understand.
In cases where one string of English text is completely missing a word present in the other text, Needleman-Wunsch will match the word to a special "GAP" character.
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| The | reason | that | I | went | to | the | store | was | to | buy | some | food |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
| <GAP> | reason | <GAP> | I | went | 2 | te | store | wuz | 2 | buy | <GAP> | fud |
+-------+--------+-------+---+------+----+-----+-------+-----+----+-----+-------+------+
The special GAP characters are fine.
However, what is inefficient about Needleman-Wunsch is that the people who wrote the algorithm believed that the order of the gap characters was important. The following are computed as two separate cases:
ALIGNMENT ONE
+---+-------+-------+---+---+
| A | 1 | <GAP> | R | A |
+---+-------+-------+---+---+
| A | <GAP> | B | R | A |
+---+-------+-------+---+---+
ALIGNMENT TWO
+---+-------+-------+---+---+
| A | <GAP> | 1 | R | A |
+---+-------+-------+---+---+
| A | B | <GAP> | R | A |
+---+-------+-------+---+---+
However, if you have two or more gaps in a row, then the order of the gaps should not matter.
The Needleman-Wunsch algorithm calculates the same thing many times over because whoever wrote the algorithm thought that order mattered a little more than it really does.
The two alignments above have the same score.
Also, both alignments have more or less the same meaning in the "real world" (outside of the computer).
However, the Needleman-Wunsch algorithm will compute the scores of the two example alignments separately instead of computing the score only one time.
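The gap-matching behaviour described in this answer can be sketched as a small word-level Needleman-Wunsch implementation. This is only an illustration: the function name needleman_wunsch, the "<GAP>" token and the scores (match +1, mismatch -1, gap -1) are assumptions, not values taken from the answer.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1, gap_token="<GAP>"):
    """Globally align two sequences; return the two padded sequences."""
    n, m = len(a), len(b)

    def pair_score(x, y):
        return match if x == y else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # a[i-1] aligned to a gap
                score[i][j - 1] + gap,   # b[j-1] aligned to a gap
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(gap_token)
            i -= 1
        else:
            out_a.append(gap_token); out_b.append(b[j - 1])
            j -= 1
    return out_a[::-1], out_b[::-1]

top, bottom = needleman_wunsch(
    "The reason that I went to the store".split(),
    "reason I went to the store".split(),
)
for x, y in zip(top, bottom):
    print(f"{x:8} {y}")
```

With exact word equality as the match test, misspellings such as "wuz" count as mismatches; a fuzzy similarity function could be swapped in for pair_score.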

Is there a way to add a new line after every ']' in Python?

I have a string containing BBCode that I want to format better so it is readable.
I had something like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG][b]\\u25ba osu! Mapping Theory[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]Linear Momentum[\\/url] | [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]Linear Momentum 2[\\/url] | [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]Angular Momentum and Circular Flow[\\/url] | [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]Active and Passive Mapping[\\/url]\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]Slider Flow[\\/url] | [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]Stream Flow[\\/url] | [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]Slider Mechanics[\\/url] | [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]Aesthetics by Symmetry[\\/url] | [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]Aesthetics by Complexity[\\/url] | [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]Defining Flow[\\/url]\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]Flow and Aesthetics[\\/url] | [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]Angle Emphasis[\\/url] | [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]Strain[\\/url] | [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]Pressure[\\/url] | [url=https:\\/\\/youtu.be\\/jm43HilQhYk]Tension[\\/url] | [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]Song Choice[\\/url] | [url=https:\\/\\/youtu.be\\/BNjVu8xq4os]Song Length[\\/url]\\n\\n[url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG6t5MCwGnq87iYZnE5G7aZL][b]\\u25ba osu! 
Rambling[\\/b][\\/url]\\n[url=https:\\/\\/youtu.be\\/-Beeh7dKyTk]Storyboards[\\/url] | [url=https:\\/\\/youtu.be\\/i6zzHMzwIzU]Why[\\/url]\\n\\n[url=https:\\/\\/youtu.be\\/_sBP7ttRQog]0 BPM[\\/url] | [url=https:\\/\\/youtu.be\\/UgtR6WnuTT8]ppv2 Pt.1[\\/url] | [url=https:\\/\\/youtu.be\\/Bx14u5tltyE]ppv2 Pt.2[\\/url] | [url=https:\\/\\/youtu.be\\/-095yuSLE4Y]Super high star rating[\\/url][\\/notice][url=https:\\/\\/amo.s-ul.eu\\/oApvJHWA][b]Skin v3.4[\\/b][\\/url]\\n[size=85]Personal edit of [url=https:\\/\\/osu.ppy.sh\\/forum\\/t\\/481314]Re:m skin by Choilicious[\\/url][\\/size]\\n\\n[img]http:\\/\\/puu.sh\\/qqv6C\\/0aaca52f51.jpg[\\/img][url=https:\\/\\/osu.ppy.sh\\/u\\/Satellite][img]http:\\/\\/puu.sh\\/qqv6K\\/94681bed3f.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Sellenite][img]http:\\/\\/puu.sh\\/qqv6T\\/c943ed1703.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/Morinaga][img]http:\\/\\/puu.sh\\/qqv70\\/cfbdb2a242.jpg[\\/img][\\/url][url=https:\\/\\/osu.ppy.sh\\/u\\/-Mo-][img]http:\\/\\/puu.sh\\/qqv77\\/ca489f2d00.jpg[\\/img][\\/url]\\n[notice]I don\'t really do nomination stuff often anymore. \\nHowever, please do show me your map if it\'s any of the following:[box=][b]Bounty[\\/b]\\n[size=50]High priority modding for these artists\\/songs (maybe a GD, just ask).\\nPreferably non-cut versions and songs that have no ranked maps yet.[\\/size]\\n\\nYuuhei Satellite\\nYuuhei Catharsis\\nShoujo Fractal\\nHoneyWorks (non-vocaloid)\\nTrySail\\nClariS\\n\\nClariS - CLICK (Asterisk DnB Remix), [size=85]either version.[\\/size]\\nfhana - Outside of Melancholy, [size=85]a version that isn\'t cut pls[\\/size]\\nAny cover of \\u7832\\u96f7\\u6483\\u6226\\u3001\\u59cb\\u3081![\\/box]I also do storyboard checks for any map.\\n\\nPMs are open for anything. Ask me anything. 
\\nAsk me what my favourite colour is if you really want even.[\\/notice][box=Guests][b]Ranked[\\/b]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1575100][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Terasareru kurai no Shiawase [Lunatic][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1794557][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Arehateta Chijou no Uta [Collab Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1592915][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] Tanaka Hirokazu - C-TYPE [TetriS-TYPE] [S-TYPE][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1490130][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] TrySail - adrenaline!!! [Insane][\\/url] [size=85]Slightly ruined version.[\\/size]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/1401096][img]http:\\/\\/s.ppy.sh\\/images\\/insane.png[\\/img] senya - Shunkan Everlasting [Insane][\\/url]\\n[url=https:\\/\\/osu.ppy.sh\\/b\\/795269][img]
Basically unreadable currently.
I tried making it look like
['"9-5[centre][notice][url=https:\\/\\/www.youtube.com\\/playlist?list=PL3OTylWB5pG7s7JowIUEYBiPkKR0GRRRG]
[b]
\\u25ba osu! Mapping Theory[\\/b]
[\\/url]\\n[url=https:\\/\\/youtu.be\\/0uGeZzyobSY]
Linear Momentum[\\/url]
| [url=https:\\/\\/youtu.be\\/zOzi8Q655vs]
Linear Momentum 2[\\/url]
| [url=https:\\/\\/youtu.be\\/Rm5l0UDJLcQ]
Angular Momentum and Circular Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/hRc3Xm0wI7s]
Active and Passive Mapping[\\/url]
\\n[url=https:\\/\\/youtu.be\\/OgNhsZpKRYc]
Slider Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/e05hOKXfWOk]
Stream Flow[\\/url]
| [url=https:\\/\\/youtu.be\\/zYAujNMPVbY]
Slider Mechanics[\\/url]
| [url=https:\\/\\/youtu.be\\/ZOtkAQ3MoNE]
Aesthetics by Symmetry[\\/url]
| [url=https:\\/\\/youtu.be\\/WnLG31LaQx0]
Aesthetics by Complexity[\\/url]
| [url=https:\\/\\/youtu.be\\/i323hh7-CAQ]
Defining Flow[\\/url]
\\n[url=https:\\/\\/youtu.be\\/hNnF5NLoOwU]
Flow and Aesthetics[\\/url]
| [url=https:\\/\\/youtu.be\\/tq8fu_-__8M]
Angle Emphasis[\\/url]
| [url=https:\\/\\/youtu.be\\/6ilBsa_dV8k]
Strain[\\/url]
| [url=https:\\/\\/youtu.be\\/KKDnLsIyRp0]
Pressure[\\/url]
| [url=https:\\/\\/youtu.be\\/jm43HilQhYk]
Tension[\\/url]
| [url=https:\\/\\/youtu.be\\/-_Mh0NbpHXo]
Song Choice[\\/url]
|
Where there's a new line after every ']'
I've tried tweaking re.sub("[\(\[].*?[\)\]]", "", str(x)) to fit what I need but it just deletes everything inside of them. (I have no idea how regex works)
How can I go about this?

There's no need for a regular expression, just use the simple str.replace() function.
x = x.replace(']', ']\n')
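As a small check on a made-up BBCode snippet (the sample text is an assumption, not the string from the question), str.replace() and the equivalent re.sub() call give the same result:

```python
import re

text = "[b]Title[/b][url=https://example.com]Link[/url]"

# Plain string replacement: add a newline after every closing bracket.
with_replace = text.replace("]", "]\n")

# Equivalent regex version; the bracket must be escaped in the pattern.
with_regex = re.sub(r"\]", "]\n", text)

print(with_replace)
```

Since no pattern matching is needed here, the str.replace() version is the simpler choice.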

It really depends on exactly what you want your output to look like.
I interpreted your output as wanting a newline after each url= tag, which would require the following regex:
output = re.sub(r"(\[url.*?\])", r"\1\n", input)
The parentheses () form a capture group, which is then referenced in the replacement as \1 since it's the first unnamed capture group.
You can change the regex as you like, but keep the part you want to preserve inside the capture group.
If you want to experiment with regex you can use https://regexr.com/ which is an amazing resource when fiddling around with regex.
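A short sketch of that capture-group substitution on a made-up two-link string (the sample text and the variable name bbcode are assumptions; a name other than input is used to avoid shadowing the builtin):

```python
import re

bbcode = "[url=https://youtu.be/abc]First[/url] | [url=https://youtu.be/def]Second[/url]"

# \1 re-inserts whatever the capture group matched, followed by a newline.
# The lazy .*? stops at the first closing bracket, so only the opening
# [url=...] tag is captured; [/url] does not start with "[url" and is skipped.
output = re.sub(r"(\[url.*?\])", r"\1\n", bbcode)
print(output)
```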

Split occurrences of time from a large string

In my task I want to fetch only the times and store them in a variable. A time may occur more than once in my string, and each one may be "AM" or "PM".
I only want to store these values from my string:
"4:19:27" and "7:00:05". The time may occur more than twice.
str = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
My code is:
str = '''TEXT VIEW : 16908310=android.widget.TextView#405ee2f0=Troubles | 2131034163=android.widget.TextView#405ef6d0=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#40630608=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#40631068=OK | 2131034162=android.widget.TextView#40632078=Sep 12, 2017 4:19:27 AM | VIEW : -1=android.widget.LinearLayout#405ed390 | -1=android.widget.FrameLayout#405edd48 | 16908310=android.widget.TextView#405ee2f0 | 16908290=android.widget.FrameLayout#405eefa8 | -1=android.widget.LinearLayout#405ef468 | 2131034163=android.widget.TextView#405ef6d0 | -1=android.widget.ScrollView#405effc8 | 2131034164=android.widget.TableLayout#405f0cd0 | 2131034158=android.widget.TableRow#4062f7a8 | 2131034159=android.widget.ImageView#4062fcd0 | 2131034160=android.widget.TextView#40630608 | 2131034161=android.widget.RadioButton#40631068 | 2131034162=android.widget.TextView#40632078 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ecb98 | BUTTONS : 2131034161=android.widget.RadioButton#40631068 |'''
if " AM " or " PM " in str:
    Time = str.split(" AM " or " PM ")[0].rsplit(None, 1)[-1]
    print(Time)

Note that you shouldn't name a variable after a built-in such as str. You could use a regular expression, like this:
import re
my_string = """ 16908310=android.widget.TextView#405ed820=Troubles | 2131034163=android.widget.TextView#405eec00=Some situations can be acknowledged using the 'OK' button, if present. A green check-mark after the description indicates that the situation has been acknowledged. Some situations have further detail available by pressing on the text or icon of the trouble message. | 2131034160=android.widget.TextView#407e5380=Zone Perl Thermostatyfu Communication Failure | 2131034161=android.widget.RadioButton#4081b4f8=OK | 2131034162=android.widget.TextView#4082ac98=Sep 12, 2017 4:19:27 AM | 2131034160=android.widget.TextView#40831690=Zone Door Tampered | 2131034161=android.widget.RadioButton#4085bb78=OK | 2131034162=android.widget.TextView#407520c8=Sep 12, 2017 7:00:05 PM | VIEW : -1=android.widget.LinearLayout#405ec8c0 | -1=android.widget.FrameLayout#405ed278 | 16908310=android.widget.TextView#405ed820 | 16908290=android.widget.FrameLayout#405ee4d8 | -1=android.widget.LinearLayout#405ee998 | 2131034163=android.widget.TextView#405eec00 | -1=android.widget.ScrollView#405ef4f8 | 2131034164=android.widget.TableLayout#405f0200 | 2131034158=android.widget.TableRow#406616d8 | 2131034159=android.widget.ImageView#4066cec8 | 2131034160=android.widget.TextView#407e5380 | 2131034161=android.widget.RadioButton#4081b4f8 | 2131034162=android.widget.TextView#4082ac98 | 2131034158=android.widget.TableRow#4075e3c8 | 2131034159=android.widget.ImageView#4079bc80 | 2131034160=android.widget.TextView#40831690 | 2131034161=android.widget.RadioButton#4085bb78 | 2131034162=android.widget.TextView#407520c8 | -1=com.android.internal.policy.impl.PhoneWindow$DecorView#405ec0c8 | BUTTONS : 2131034161=android.widget.RadioButton#4081b4f8 | 2131034161=android.widget.RadioButton#4085bb78 | """
pattern = r'\d{1,2}:\d{2}:\d{2}\s[AP]M'
date_list = re.findall(pattern, my_string)
print(date_list)
# outputs ['4:19:27 AM', '7:00:05 PM']
Explanation of the pattern:
\d{1,2} matches one or two digits
: matches ":"
\d{2} matches exactly two digits
: matches ":"
\d{2} matches exactly two digits
\s matches a single whitespace character (here, the space)
[AP] matches either an A or a P, only one
M matches the final M
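The same explanation can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace and allows comments inside the pattern. This is a sketch on a shortened sample string; the names TIME_RE and sample are assumptions:

```python
import re

TIME_RE = re.compile(r"""
    \d{1,2}   # hours: one or two digits
    :         # literal colon
    \d{2}     # minutes: exactly two digits
    :         # literal colon
    \d{2}     # seconds: exactly two digits
    \s        # one whitespace character
    [AP]M     # AM or PM
""", re.VERBOSE)

sample = ("TextView#4082ac98=Sep 12, 2017 4:19:27 AM | "
          "TextView#407520c8=Sep 12, 2017 7:00:05 PM | ")
times = TIME_RE.findall(sample)
print(times)
```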

Use regex with this expression: ([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM). This pattern will give you two groups: one for the numbers of the time and one for the AM or PM information. This is much better than splitting the string manually. Testing it in an online regex tester is a good way to get used to regex.
All in all you can use it like this in Python:
import re
p = re.compile(r'([0-9]{1,2}:[0-9]{2}:[0-9]{2}) (AM|PM)')
# theString is assumed to hold the text to search;
# findall returns one (numbers, status) tuple per match
for numbers, status in p.findall(theString):
    # prints the numbers, e.g. 4:19:27
    print(numbers)
    # prints the AM or PM
    print(status)

It's not a good idea to use str as a variable name, because that's a builtin.
So, assuming your string is in s, here is an interactive demonstration of
what I think you want.
>>> import re
>>> re.findall('[=][^|=]+[AP]M [|]', s)
['=Sep 12, 2017 4:19:27 AM |', '=Sep 12, 2017 7:00:05 PM |']
>>> [r.split() for r in re.findall('[=][^|=]+[AP]M [|]', s)]
[['=Sep', '12,', '2017', '4:19:27', 'AM', '|'], ['=Sep', '12,', '2017', '7:00:05', 'PM', '|']]
>>> [r.split()[3] for r in re.findall('[=][^|=]+[AP]M [|]', s)]
['4:19:27', '7:00:05']
>>>

Regular expressions are your friend here. For example:
import re
inputstring = '''...'''
timematch = re.compile(r'\d{1,2}:\d{1,2}:\d{1,2} [AP]M')
print(timematch.findall(inputstring))
The regular expression in question matches any occurrence of XX:XX:XX AM and XX:XX:XX PM, and takes into account time noted as 4:00:00 AM as well as 04:00:00 AM.

It would be easy to use regex:
You can use this regex:
\d+:\d+:\d+
or r'\d{1,2}:\d{1,2}:\d{1,2}'
Code: https://repl.it/Kyqe/0

How to capture element in columns with spaces around?

Example
300 january 10 20
300 februari 120,30 10
300 march 20,30 10
300,10 april 20,30 10
300 may 420,10 10,46
I want to reorder columns.
The first thing I do is separate the columns in the text using a separator, e.g.
(?<=\S)(\s{2,})(?=\S) or
(?<=\S)(\s{1,})(?=\S)
Then I want to put the columns in a list like this:
|300 | |january | | 10 | |20 |
|300 | |februari| |120,30| |10 |
|300 | |march | |20,30 | |10 |
|300,10| | april | | 20,30| |10 |
|300 | |may | |420,10| |10,46|
expected output:
mylist = [['300 ','january ',' 10 ','20 '],
['300 ','februari','120,30','10 '],
['300 ','march ','20,30 ','10 '],
['300,10',' april ',' 20,30','10 '],
['300 ','may ','420,10','10,46']]
I have no idea how to capture the spaces.
I tried this to capture the spaces after use of the separator:
# find the max length of an element in the last column
length_temp = [[len(x) for x in row] for row in mylist]
maxlengthcolumn = max(l[len(mylist[0])-1] for l in length_temp)
# add spaces to elements
for b in range(0,len(mylist)):
    if length_temp[b][len(mylist[-1])-1] < maxlengthcolumn:
        mylist[b][len(mylist[-1])-1] = mylist[b][len(mylist[-1])-1] + ' '*(maxlengthcolumn-length_temp[b][len(mylist[-1])-1])
but this removes the spaces before the elements in a column.
How can I capture the elements in a list as in my example above?

Assuming that you're working with strings, you can use ord() to obtain the ASCII values and split your string where alphabetic and numeric characters begin and end.
To break it down:
Read each line of the text one at a time (from what I've read, it looks like your original text could be a .txt file). To import it you can use file I/O methods (more about that here and here).
Pass each line as a string and convert it to ASCII values using ord(); store these values in a separate variable.
Set up logic to see where words/numbers begin. You should be looking for a pattern of an alphabetic or numeric character, followed by zero or more alphanumerics, followed by spaces; after that series of spaces you should find another alphabetic or numeric character. Store the location of each beginning (a beginning is the first character in the string, or the first alphanumeric after a series of spaces).
Index the line of text you're currently working with and pull out the desired strings.
This might be unclear, so see the pseudocode below:
strings_start = [5, 12, 22]  # this would be where the words/numbers begin in the string that holds a line of your text
# we'll assume you have some variable, line, which holds the current line of the text you're parsing in a loop
string_list = []
for i in range(len(strings_start)):
    if i < len(strings_start) - 1:  # not the last start index, so slice up to the next start
        string_list.append(line[strings_start[i]:strings_start[i + 1]])
    else:
        string_list.append(line[strings_start[i]:])  # last element runs to the end of the line
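The steps above can be sketched as a runnable function. This version uses str.isspace() rather than ord() to detect where each word or number begins, which is a swapped-in simplification; the names token_starts and split_keep_spaces and the sample line are assumptions. Each element keeps its trailing spaces, while leading whitespace before the first element of a line is dropped.

```python
def token_starts(line):
    """Indexes where a non-space run begins (start of line or after a space)."""
    return [
        idx for idx, ch in enumerate(line)
        if not ch.isspace() and (idx == 0 or line[idx - 1].isspace())
    ]

def split_keep_spaces(line):
    """Slice the line from each start to the next, keeping trailing spaces."""
    starts = token_starts(line)
    ends = starts[1:] + [len(line)]
    return [line[s:e] for s, e in zip(starts, ends)]

row = split_keep_spaces("300    march     20,30  10")
print(row)
```

Applying split_keep_spaces to every line gives a list of lists in which concatenating a row's elements reproduces the original line (minus any leading whitespace).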

My regex is not working properly

My regex is not working properly. I'm showing the text before and after applying the regex. I'm using re.search(r'(?ms).*?{{(Infobox film.*?)}}', text). As you will see, my regex does not return anything after | country = Assam, {{IND; it gets stuck at this point. Can you please help me? Thanks.
Before regex:
{{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free
}}
After regex:
{Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND
Why does the regex get stuck at this point, at country = Assam, {{IND?
Edit: expected result
Infobox film
| name = Papori
| released = 1986
| runtime = 144 minutes
| country = Assam, {{IND}}
| language = [[Assamese language|Assamese]]
| budget =
| followed by = free

Your regex is catching everything between the first {{ and the first }}, which is in the "country" entry of the infobox. If you want everything between the first {{ and the last }}, then you want to make the .* inside the braces greedy by removing the ?:
re.search(r'(?ms).*?{{(Infobox film.*)}}', text)
Note that this will find the last }} in the input (e.g. if there's another template far below the end of the infobox, it will find the end of that), so this may not be what you want. When you have nested structures like this, regex is not always the best way to search.
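One non-regex alternative, shown here only as a sketch, is to count the nesting depth of {{ and }} and stop when the template that opened the infobox closes. The function name extract_template and the shortened sample text are assumptions for illustration:

```python
def extract_template(text, name="Infobox film"):
    """Return the body of the first {{name ...}} template, honouring nesting."""
    start = text.find("{{" + name)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Strip the outer {{ and }} from the returned body.
                return text[start + 2:i - 2]
        else:
            i += 1
    return None  # unbalanced braces

text = (
    "{{Infobox film\n"
    "| name = Papori\n"
    "| country = Assam, {{IND}}\n"
    "| budget =\n"
    "}}\n"
    "{{Another template}}"
)
print(extract_template(text))
```

Unlike the greedy regex, this stops at the }} that balances the opening {{ even when other templates follow the infobox.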
