Regular expression for multiple occurances in python - python

I need to parse lines having multiple language codes as below
008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>
008800002 being a id
Bruxelles-Nord$Br�ussel Nord$ being name1
deu being language one
$Brussel Noord$ being name two
nld being language two.
SO, the idea is name and language can appear N number of times. I need to collect them all.
the language in <> is 3 characters in length (fixed)
and all names end with $ sign.
I tried this one but it is not giving expected output.
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)
(?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)
I have no idea how to get repeated elements.
It takes
Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.

Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.
import re
line = u"008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>"
m = re.search("(\d+)\s+(.*)", line, re.UNICODE)
id = m.group(1)
names = {}
for m in re.finditer("(.*?)<(.*?)>", m.group(2), re.UNICODE):
names[m.group(2)] = m.group(1)
print id, names

\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>
Try this.Just grab the captures.see demo.
http://regex101.com/r/hS3dT7/4

Related

Extract values from String using Python

I am getting a string
name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"
I need to get the values Mathew, Thomas, PR123T, male.
Also if the String doesnt have a value for zipcode, it should not assign any value to string.
I am newbie to python. Please help
You need to use the .split() function that is available on every string. First you need to split by comma ,, then you need to split by = and select the 1th element.
Once this is done, you need to .join() the elements on a comma , again.
def split_my_fields(input_string):
if not 'zipcode=""' in input_string:
output = ', '.join(e.split('=')[1].replace('"','') for e in input_string.split(','))
print(f'Output is {output}')
return output
else:
print('Zipcode is empty.')
split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output:
>>> split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output is Mathew, Thomas, PR123T, male
'Mathew, Thomas, PR123T, male'
In fact, my dear friend, you can use parse
>>from parse import *
>>parse("name={},lastname={},zipcode={},gender={}","name='Mathew',lastname='Thomas',zipcode='PR123T',gender='male'")
<Result ("'Mathew'", "'Thomas'", "'PR123T'", "'male'") {}>
You can use named groups and create dictionary with keys corresponding to the group names:
import re
text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
expr = re.compile(r'^(name="(\s+)?(?P<name>.*?)(\s+)?")?,?(lastname="(\s+)?(?P<lastname>.*?)(\s+)?")?,?(zipcode="(\s+)?(?P<zipcode>.*?)(\s+)?")?,?(gender="(\s+)?(?P<gender>.*?)(\s+)?")?$')
match = expr.search(text).groupdict()
print(match['name']) # Matthew
print(match['lastname']) # Thomas
print(match['zipcode']) # R123T
print(match['gender']) # male
The pattern will catch all non-whitespace characters between parentheses and strip whitespaces around it. For empty zipcode value it will return an empty string (the same applies for other named groups). It will also handle missing key-value pairs as long as the order in which keys are appearing will stay the same (e.g. text = 'name="Mathew",lastname="Thomas",gender="male"').

What's a better way to process inconsistently structured strings?

I have an output string like this:
read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec
And I want to just extract one of the numerical values for computation, say iops. I'm processing it like this:
if 'read ' in key:
my_read_iops = value.split(",")[2].split("=")[1]
result['test_details']['read'] = my_read_iops
But there are slight inconsistencies with some of the strings I'm reading in and my code is getting super complicated and verbose. So instead of manually counting the number of commas vs "=" chars, what's a better way to handle this?
You can use regular expression \s* to handle inconsistent spacing, it matches zero or more whitespaces:
import re
s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
for m in re.finditer(r'\s*(?P<name>\w*)\s*=\s*(?P<value>[\w/]*)\s*', s):
print(m.group('name'), m.group('value'))
# io 131220KB
# bw 14016KB/s
# iops 3504
# runt 9362msec
Using group name, you can construct pattern string from a list of column names and do it like:
names = ['io', 'bw', 'iops', 'runt']
name_val_pat = r'\s*{name}\s*=\s*(?P<{group_name}>[\w/]*)\s*'
pattern = ','.join([name_val_pat.format(name=name, group_name=name) for name in names])
# '\s*io\s*=\s*(?P<io>[\w/]*)\s*,\s*bw\s*=\s*(?P<bw>[\w/]*)\s*,\s*iops\s*=\s*(?P<iops>[\w/]*)\s*,\s*runt\s*=\s*(?P<runt>[\w/]*)\s*'
match = re.search(pattern, s)
data_dict = {name: match.group(name) for name in names}
print(data_dict)
# {'io': '131220KB', 'bw': '14016KB/s', 'runt': '9362msec', 'iops': '3504'}
In this way, you only need to change names and keep the order correct.
If I were you,I'd use regex(regular expression) as first choice.
import re
s= "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
re.search(r"iops=(\d+)",s).group(1)
By this python code, I find the string pattern that starts 'iops=' and continues number expression at least 1 digit.I extract the target string(3504) by using round bracket.
you can find more information about regex from
https://docs.python.org/3.6/library/re.html#module-re
regex is powerful language for complex pattern matching with simple syntax.
from re import match
string = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
iops = match(r'.+(iops=)([0-9]+)', string).group(2)
iops
'3504'

Python - Extract text from string

What are the most efficient ways to extract text from a string? Are there some available functions or regex expressions, or some other way?
For example, my string is below and I want to extract the IDs as well
as the ScreenNames, separately.
[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]
Thank you!
Edit: These are the text strings that I want to pull. I want them to be in a list.
Target_IDs = 1234567890, 233323490, 4459284
Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName
import re
str = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
print 'Target IDs = ' + ','.join( re.findall(r'ID=(\d+)', str) )
print 'Target ScreenNames = ' + ','.join( re.findall(r' ScreenName=(\w+)', str) )
Output :
Target IDs = 1234567890,233323490,4459284
Target ScreenNames = RandomNameHere,AnotherRandomName,YetAnotherName
It depends. Assuming that all your text comes in the form of
TagName = TagValue1, TagValue2, ...
You need just two calls to split.
tag, value_string = string.split('=')
values = value_string.split(',')
Remove the excess space (probably a couple of rstrip()/lstrip() calls will suffice) and you are done. Or you can take regex. They are slightly more powerful, but in this case I think it's a matter of personal taste.
If you want more complex syntax with nonterminals, terminals and all that, you'll need lex/yacc, which will require some background in parsers. A rather interesting thing to play with, but not something you'll want to use for storing program options and such.
The regex I'd use would be:
(?:ID=|ScreenName=)+(\d+|[\w\d]+)
However, this assumes that ID is only digits (\d) and usernames are only letters or numbers ([\w\d]).
This regex (when combined with re.findall) would return a list of matches that could be iterated through and sorted in some fashion like so:
import re
s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"
pattern = re.compile(r'(?:ID=|ScreenName=)+(\d+|[\w\d]+)');
ids = []
names = []
for p in re.findall(pattern, s):
if p.isnumeric():
ids.append(p)
else:
names.append(p)
print(ids, names)

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
Will match all non-whitespace characters that end in a digit
Will match AED=20180823
r'\S*[a-zA-Z]$'
Matches all non-whitespace characters that end in a letter.
will match AED=AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurences you expect will be safer, the (a|b) is match a or match b
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in regex. Pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parenthesis to group results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.

Matching alternative regexps in Python

I'm using Python to parse a file in search for e-mail addresses, but I can't figure out what the syntax for alternative regexps should be. Here's the code:
addresses = []
pattern = '(\w+)#(\w+\.com)|(\w+)#(it.\w+\.com)'
for line in file:
matches = re.findall(pattern,line)
for m in matches:
address = '%s#%s' % m
addresses.append(address)
So I want to find addresses that look like john#company.com or john#it.company.com, but the above code doesn't work because either the first two groups are empty or the last two groups are empty. What is the correct solution? I need to use groups to store the user name (before #) and server name (after #) separately.
EDIT: Matching email adresses is only an example. What I'm trying to find out is how to match different regexps that have only one thing in common - they match two groups.
(\w+)#((?:it\.)?\w+\.com)
You want to capture the part after the # whether it's example.com or it.example.com, so you put both options inside the same capture group. But since they share a similar format, you can condense (it\.\w+\.com|\w+\.com) to just ((it\.)?\w+\.com)
The (?: ) makes that parens a non-capturing group, so it won't take part in your matched groups. There will be one match for the first (\w+), and one match for the whole ((?:it\.)?\w+\.com) after the #. That's two matches total, plus the default group-0 match for the full string.
EDIT: To answer your new question, all you have to do is follow the grouping I used, but stop before you condense it.
If your test cases are:
1) example#abcdef
2) example#123456
You could write your regex as such: (\w+)#([a-zA-Z]+|\d+), which would always have the part before the # in group 1, and the part after in group 2. Notice that there are only two pairs of parens, and the |("or") operator appears inside of the second parens group.
I once found here a well written email regex, it was build for extracting a wide range of valid email adresses from a generic string, so it should also be able to do what you're looking for.
Example:
>>> email_regex = re.compile("""((([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*")\.)*([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*"))#((([a-zA-Z0-9]([a-zA-Z0-9]*(\-[a-zA-Z0-9]*)*)?\.)*[a-zA-Z]+|\[((0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\]|\[[Ii][Pp][vV]6(:[0-9a-fA-F]{0,4}){6}\]))""")
>>>
>>> m = email_regex.search('john#it.company.com')
>>> m.group(0)
'john#it.company.com'
>>> m.group(1)
'john'
>>> m.group(7)
'it.company.com'
>>>
>>> n = email_regex.search('john#company.com')
>>> n.group(0)
'john#company.com'
>>> n.group(1)
'john'
>>> n.group(7)
'company.com'

Categories