Extracting 2 strings from regular expression Python - python

I am trying to extract city, state and/or zip code from a string using a regular expression. The regex I am using (from here get city, state or zip from a string in python) is ([^\d]+)?(\d{5})? and when I tested it on http://regex101.com/ it accurately selects the two strings I want to match.
However I'm not sure how to separate these two strings in Python. Here is what I have tried:
import re
string = "binghamton ny 13905"
reg = re.compile('([^\d]+)?(\d{5})?')
match = reg.match(string)
return match.group()
This simply returns the entire string. Is there a way to pull each match individually?
I have also tried separating the regular expression into two distinct regular expressions (one for city, state and one for zip code) however the zip code regex either returns an empty string or None. All help is appreciated, thanks.

Probably the easiest way is to name the two capturing groups:
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
and then access the groupdict:
>>> match = reg.match("binghamton ny 13905")
>>> match.groupdict()
{'city': 'binghamton ny ', 'zip': '13905'}
This gives you easy access to the two pieces of information by name, rather than index.

I agree with jonrsharpe:
import re
string = "binghamton ny 13905"
reg = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')
result = reg.match(string)
Additionally, you can access the groups by name like this:
result.group('city')
result.group('zip')
See the Python re reference page.

r = re.search(r"([^\d]+)?(\d{5})?", string)
r.groups()
('binghamton ny ', '13905')
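The answers above can be combined into a small helper. This is only a sketch: the function name parse_location and the choice to strip trailing whitespace from the city/state part are assumptions, not something the original posters specified.

```python
import re

# Same pattern as in the answers above, with named groups.
LOCATION_RE = re.compile(r'(?P<city>[^\d]+)?(?P<zip>\d{5})?')

def parse_location(text):
    """Return (city_state, zip_code); either part may be None."""
    # Both groups are optional, so match() always returns a match object.
    match = LOCATION_RE.match(text)
    city = match.group('city')
    zip_code = match.group('zip')
    if city is not None:
        city = city.strip()  # drop the space left before the zip code
    return city, zip_code

print(parse_location("binghamton ny 13905"))
```

Because each group is optional, a string holding only a city/state or only a zip code yields None for the missing part.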

Related

Extract values from String using Python

I am getting a string:
name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"
I need to get the values Mathew, Thomas, PR123T, male.
Also, if the string doesn't have a value for zipcode, it should not assign any value to it.
I am a newbie to Python. Please help.
You can use the .split() method that is available on every string. First split by comma, then split each piece by = and select the element at index 1.
Once this is done, .join() the elements with a comma again.
def split_my_fields(input_string):
    if 'zipcode=""' not in input_string:
        output = ', '.join(e.split('=')[1].replace('"', '') for e in input_string.split(','))
        print(f'Output is {output}')
        return output
    else:
        print('Zipcode is empty.')
split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output:
>>> split_my_fields(r'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"')
Output is Mathew, Thomas, PR123T, male
'Mathew, Thomas, PR123T, male'
In fact, you can also use the third-party parse library:
>>> from parse import *
>>> parse("name={},lastname={},zipcode={},gender={}","name='Mathew',lastname='Thomas',zipcode='PR123T',gender='male'")
<Result ("'Mathew'", "'Thomas'", "'PR123T'", "'male'") {}>
You can use named groups and create dictionary with keys corresponding to the group names:
import re
text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
expr = re.compile(r'^(name="(\s+)?(?P<name>.*?)(\s+)?")?,?(lastname="(\s+)?(?P<lastname>.*?)(\s+)?")?,?(zipcode="(\s+)?(?P<zipcode>.*?)(\s+)?")?,?(gender="(\s+)?(?P<gender>.*?)(\s+)?")?$')
match = expr.search(text).groupdict()
print(match['name']) # Mathew
print(match['lastname']) # Thomas
print(match['zipcode']) # PR123T
print(match['gender']) # male
The pattern captures whatever appears between the double quotes of each key and strips the whitespace around it. For an empty zipcode value it returns an empty string (the same applies to the other named groups). It also handles missing key-value pairs, whose groups come back as None, as long as the keys keep appearing in the same order (e.g. text = 'name="Mathew",lastname="Thomas",gender="male"').
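If the order of the keys is not guaranteed, a shorter pattern applied with re.findall may be easier to maintain. The following is a sketch only: the helper name parse_fields is hypothetical, and it assumes values never contain a double quote.

```python
import re

# One key="value" pair; the value may be empty.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_fields(text):
    """Return a dict of key -> value, skipping pairs whose value is empty."""
    return {key: value for key, value in PAIR_RE.findall(text) if value}

text = 'name="Mathew",lastname="Thomas",zipcode="PR123T",gender="male"'
print(parse_fields(text))
```

Dropping empty values is one way to read the requirement that an empty zipcode should not be assigned; a missing key simply does not appear in the dict.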

Regular Expression in Python 3

I am new here and have just started using regular expressions in my Python code. I have a string that contains 6 commas. One of the commas falls between two quotation marks. I want to get rid of the quotation marks and that last comma.
The input:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
I want this output:
string = 'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The output of my code:
string = 'Fruits,Pear,**CherryApple**,Orange,Cherry'
Here is my code in Python:
if re.search('"', string):
    matches = re.findall(r'\"(.+?)\"', string)
    matches1 = re.sub(",", "", matches[0])
    string = re.sub(matches[0], matches1, string)
    string = re.sub('"', '', string)
My problem is, I want to give a condition that the code only works for the last bit ("Cherry,") but unfortunately it affects other words in the middle (Cherry,Apple), which has the same text as the one between the quotation marks! That results in reducing the number of commas (from 6 to 4) as it merges two fields (Cherry,Apple) and I want to be left with 5 commas.
fullString = '2000-04-24 12:32:00.000,22186CBD0FDEAB049C60513341BA721B,0DDEB5,COMP,Cherry Corp.,DE,100,0.57,100,31213C678CC483768E1282A9D8CB524C,365.00000,business,acquisitions-mergers,acquisition-bid,interest,acquiree,fact,,,,,,,,,,,,,acquisition-interest-acquiree,Cherry Corp. Gets Buyout Offer From Chairman President,FULL-ARTICLE,B5569E,Dow Jones Newswires,0.04,-0.18,0,0,1,0,0,0,0,1,1,5,RPA,DJ,DN20000424000597,"Cherry Corp. Gets Buyout Offer From Chairman President,"\n'
Many Thanks in advance
For your task you don't need regular expressions, just use replace:
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
new_string = string.replace('"', '').strip(',')
The best way would be to use the newer regex module where (*SKIP)(*FAIL) is supported:
import regex as re
string = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
# parts
rx = re.compile(r'"[^"]+"(*SKIP)(*FAIL)|,')
def cleanse(match):
    rxi = re.compile(r'[",]+')
    return rxi.sub('', match)
parts = [cleanse(match) for match in rx.split(string)]
print(parts)
# ['Fruits', 'Pear', 'Cherry', 'Apple', 'Orange', 'Cherry']
Here you match anything between double quotes and throw it away afterwards, thus only commas outside quotes are used for the split operation. The rest is a list comprehension with a cleaning function.
See a demo on regex101.com.
Why not simply use this:
>>> ans_string = string.replace('"', '')[0:-1]
Output
>>> ans_string
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
For the sake of simplicity and algorithmic complexity.
You might consider using the csv module to do this.
Example:
>>> import csv
>>> s = 'Fruits,Pear,Cherry,Apple,Orange,"Cherry,"'
>>> ','.join([e.replace(',', '') for row in csv.reader([s]) for e in row])
'Fruits,Pear,Cherry,Apple,Orange,Cherry'
The csv module will strip the quotes but keep the commas on each quoted field. Then you can just remove that comma that was kept.
This will take care of any modifications desired (remove , for example) on a field by field basis. The fields with quotes and commas could be any field in the string.
If your content is in a csv file, you would do something like this (in pseudocode; in Python 3 the file is opened in text mode with newline=''):
with open(file, newline='') as csv_fo:
    # modify(field) stands for what you want to do to each field...
    for row in csv.reader(csv_fo):
        new_row = [modify(field) for field in row]
        # now do what you need with that row
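A runnable variant of the pseudocode above might look like the following sketch. The function name clean_rows is hypothetical, an in-memory list of lines stands in for the file, and it assumes the only unwanted commas are trailing ones inside a field.

```python
import csv
import io

def clean_rows(lines):
    """Parse CSV lines, drop trailing commas inside fields, and re-serialise."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for row in csv.reader(lines):
        # csv.reader has already removed the quotes; only the comma remains.
        writer.writerow([field.rstrip(',') for field in row])
    return out.getvalue()

print(clean_rows(['Fruits,Pear,Cherry,Apple,Orange,"Cherry,"']))
```

Using rstrip(',') instead of replace(',', '') leaves commas in the middle of a quoted field alone, and csv.writer quotes such fields again when writing them out.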

Regular expression for multiple occurrences in python

I need to parse lines containing multiple language codes, as below:
008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>
008800002 is the id,
Bruxelles-Nord$Br�ussel Nord$ is name one,
deu is language one,
$Brussel Noord$ is name two,
nld is language two.
So the idea is that a name and a language can appear any number of times, and I need to collect them all.
The language in <> is always 3 characters long,
and all names end with a $ sign.
I tried this, but it does not give the expected output:
x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)(?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)
I have no idea how to get repeated elements.
It takes
Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as language.
Do it in two steps. First separate ID from name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and stuff them into a dict.
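import re
line = u"008800002 Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>"
# Step 1: separate the ID from the name/language section
m = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
stop_id = m.group(1)
names = {}
# Step 2: map each language code to the name text that precedes it
for pair in re.finditer(r"(.*?)<(.*?)>", m.group(2), re.UNICODE):
    names[pair.group(2)] = pair.group(1)
print(stop_id, names)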
\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>
Try this. Just grab the captures. See the demo:
http://regex101.com/r/hS3dT7/4
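The two-step idea can also be sketched with re.findall, splitting each name section on the $ sign. The function name parse_stop is hypothetical, the sample line uses an ASCII spelling in place of the garbled character, and the sketch assumes every language code is exactly three word characters, as the question states.

```python
import re

def parse_stop(line):
    """Return (stop_id, [(language, [names...]), ...]) for one input line."""
    # Step 1: the ID is everything before the first run of whitespace.
    stop_id, rest = line.split(None, 1)
    pairs = []
    # Step 2: each repetition is "names$...$<lang>".
    for names, lang in re.findall(r'(.*?)<(\w{3})>', rest):
        pairs.append((lang, [n for n in names.split('$') if n]))
    return stop_id, pairs

line = "008800002 Bruxelles-Nord$Bruessel Nord$<deu>$Brussel Noord$<nld>"
print(parse_stop(line))
```

Returning a list of pairs rather than a dict keeps repeated language codes, should the same code appear more than once on a line.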

Extracting number from unicode string with regex

I have the following dictionary which contains some product data:
dictionary = {'price': [u'3\xa0590 EUR'],
              'name': [u'Product name with unicode chars']}
All values are in unicode. As you can see I'm using lists as dictionary values because sometimes I need to concatenate the information from several different sources.
I'm looking for a way to extract the digits from the price value without the non-breaking space (\xa0) and currency at the end (EUR) by using a regex.
In this case I would like to see the following as a result:
3590
Can you please suggest a solution?
[SOLUTION]
Adding the solution here because the comments field wrapped my code unexpectedly:
I used the .sub() method from Python's re module, which replaces every match of a pattern. Here is the final code that gives me the expected result:
p = re.compile(u'\xa0| EUR')
result = p.sub('', dictionary['price'][0])
Not sure about python, but here's a regex:
p = /\D/g;
s.replace(p, '');
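The same non-digit pattern can be applied in Python with re.sub. This is a sketch under the assumption that the price never contains a decimal separator, since \D would remove that too; the helper name digits_only is hypothetical.

```python
import re

def digits_only(text):
    """Remove every character that is not a digit."""
    return re.sub(r'\D', '', text)

price = u'3\xa0590 EUR'
print(digits_only(price))  # 3590
```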

How to extract longest of overlapping groups?

How can I extract the longest of several groups that start the same way?
For example, from a given string, I want to extract the longest match to either CS or CSI.
I tried "(CS|CSI).*" and it returns CS rather than CSI even if CSI is available.
If I do "(CSI|CS).*" then I do get CSI if it's a match, so I guess the solution is to always place the shorter of the overlapping groups after the longer one.
Is there a clearer way to express this with regular expressions? Somehow it feels confusing that the result depends on the order in which you list the groups.
No, that's just how it works, at least in Perl-derived regex flavors like Python, JavaScript, .NET, etc.
http://www.regular-expressions.info/alternation.html
As Alan says, the patterns will be matched in the order you specified them.
If you want to match on the longest of overlapping literal strings, you need the longest one to appear first. But you can organize your strings longest-to-shortest automatically, if you like:
>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
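The sorting trick above can be wrapped in a helper that also escapes each literal. A minimal sketch, assuming the alternatives are plain strings rather than sub-patterns; the function name longest_first_pattern is hypothetical.

```python
import re

def longest_first_pattern(words):
    """Compile an alternation that tries longer literals before shorter ones."""
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' from acting as operators.
    return re.compile('|'.join(re.escape(word) for word in ordered))

rx = longest_first_pattern(['CS', 'CSI'])
print(rx.search('xxCSIyy').group())  # CSI
print(rx.search('xxCSyy').group())   # CS
```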
I'm intrigued to know the right way of doing this, but if it helps, you can always build up your regex like this:
import re
string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"
# alternation of every prefix of string_to_match, longest first
re_to_use = "(" + "|".join([string_to_match[0:i] for i in range(len(string_to_match),0,-1)]) + ")"
re_result = re.search(re_to_use,string_to_look_in)
print(string_to_look_in[re_result.start():re_result.end()])
Similar functionality is present in the Vim editor ("sequence of optionally matched atoms"), where e.g. col\%[umn] matches col in color, colum in columbus and the full column.
I am not aware of similar functionality in Python's re, but you can use nested non-capturing groups, each followed by the ? quantifier, for that:
>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print(rex.findall(w))
...
['col']
['colum']
['column']
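The nested pattern does not have to be written by hand. The sketch below generates it from a required prefix and an optional tail; the function name optional_tail is hypothetical and the approach is just one way to mimic Vim's \%[] in Python.

```python
import re

def optional_tail(prefix, tail):
    """Build prefix followed by nested optional groups, one per tail character."""
    pattern = ''
    # Wrap from the last character outwards: (?:n)? -> (?:m(?:n)?)? -> ...
    for ch in reversed(tail):
        pattern = '(?:' + re.escape(ch) + pattern + ')?'
    return re.escape(prefix) + pattern

rex = re.compile(optional_tail('col', 'umn'))
for word in ['color', 'columbus', 'column']:
    print(rex.findall(word))
```

For the prefix col and the tail umn this builds the same pattern string as the hand-written one above.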
