New to Python and regular expressions, I have been trying to find a way to parse a sentence so that I can take parts of it and assign them to their own variables.
An example sentence is: Laura Compton, a Stock Broker from Los Angeles, California
My objective is to have: name = "Laura Compton" (this one is the easy one, I can target the anchor link no problem), position = "Stock Broker", city = Los Angeles, state = California
All of the sentences I need to iterate over follow the same pattern: the name is always in the anchor tag, and the position always follows the comma after the closing anchor; sometimes it uses "a" or "an", which I would like to strip off. The city and state always follow the word "from".
You can use named groups within patterns to capture substrings, which makes referring to them easier and the code doing so slightly more readable:
import re

data = ['Laura Compton, a Stock Broker from Los Angeles, California',
        'Miles Miller, a Soccer Player from Seattle, Washington']
pattern = (r'^(?P<name>[^,]+), an? (?P<position>.+) from '
           r'(?P<city>[^,]+), +(?P<state>.+)')
FIELDS = 'name', 'position', 'city', 'state'

for sentence in data:
    matches = re.search(pattern, sentence)
    name, position, city, state = matches.group(*FIELDS)
    print(', '.join([name, position, city, state]))
Output produced from sample data:
Laura Compton, Stock Broker, Los Angeles, California
Miles Miller, Soccer Player, Seattle, Washington
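As a small variation (not required for the above), Match.groupdict() returns all the named groups at once as a dict, which can be handy if you want to work with the fields by name:
for sentence in data:
    d = re.search(pattern, sentence).groupdict()
    print(', '.join([d['name'], d['position'], d['city'], d['state']]))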
A.M. Kuchling wrote a good tutorial titled Regular Expression HOWTO that you ought to check out.
You can try this:
import re
s = "Laura Compton, a Stock Broker from Los Angeles, California"
new_s = re.findall(r'^[a-zA-Z\s]+|(?<=a\s)[a-zA-Z\s]+(?=from)|(?<=an\s)[a-zA-Z\s]+(?=from)|(?<=from\s)[a-zA-Z\s]+(?=,)|(?<=,\s)[a-zA-Z\s]+$', s)
headers = ['name', 'title', 'city', 'state']
data = {a:b for a, b in zip(headers, new_s)}
Output:
{'city': 'Los Angeles', 'state': 'California', 'name': 'Laura Compton', 'title': 'Stock Broker '}
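Note the trailing space in 'Stock Broker '; if that matters, one small tweak (purely illustrative) is to strip each captured piece before building the dict:
new_s = [x.strip() for x in new_s]
data = {a: b for a, b in zip(headers, new_s)}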
For example, I have a set of sentences like this:
New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country.
Lets take a bus to get to point b from point a.
And another sentence like this:
is cool in the south of that country
The output should be: The weather is cool in the south of that country.
If I have an input like "of United States The weather is cool", the output should be:
D.C. is the capital of United States The weather is cool in the south of that country.
So far I have tried difflib and got the overlap, but that doesn't quite solve the problem in all cases.
You could build a dictionary of starting expressions and ending expressions from the sentences. Then find a prefix and suffix for the sentence to extend in these dictionaries. In both cases you would need to build/check one key for each substring of words starting from the beginning and from the end:
sentences="""New York is in New York State
D.C. is the capital of United States
The weather is cool in the south of that country
Lets take a bus to get to point b from point a""".split("\n")
ends = { tuple(sWords[i:]):sWords[:i] for s in sentences
for sWords in [s.split()] for i in range(len(sWords)) }
starts = { tuple(sWords[:i]):sWords[i:] for s in sentences
for sWords in [s.split()] for i in range(1,len(sWords)+1) }
def extendSentence(sentence):
sWords = sentence.split(" ")
prefix = next( (ends[p] for i in range(1,len(sWords)+1)
for p in [tuple(sWords[:i])] if p in ends),
[])
suffix = next( (starts[p] for i in range(len(sWords))
for p in [tuple(sWords[i:])] if p in starts),
[])
return " ".join(prefix + [sentence] + suffix)
Output:
print(extendSentence("of United States The weather is cool"))
# D.C. is the capital of United States The weather is cool in the south of that country
print(extendSentence("is cool in the south of that country"))
# The weather is cool in the south of that country
Note that I had to remove the periods at the end of the sentences because they prevent matching. You will need to clean these up in the dictionary-building step.
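A minimal sketch of that cleanup, assuming the only punctuation to worry about is a trailing period (clean is just an illustrative helper):
def clean(s):
    # drop a trailing period before splitting into words
    return s.strip().rstrip(".").split()

ends = { tuple(w[i:]): w[:i] for s in sentences
         for w in [clean(s)] for i in range(len(w)) }
starts = { tuple(w[:i]): w[i:] for s in sentences
           for w in [clean(s)] for i in range(1, len(w) + 1) }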
I have a Python scraping script to get info about some upcoming concerts. The text pattern is the same every time, no matter how many concerts appear, which means that each line always refers to a certain piece of information, as in this example (please note that there are no spaces between concerts; my data is exactly in this format):
01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City etc.
I need to assign each line to a variable, so 4 variables in total:
time = all the 1st lines
name = all the 2nd lines
location = all the 3rd lines
city = all the 4th lines
Then I loop through all the lines to catch the information corresponding to each variable, such as getting all the dates from the 1st lines, all the names from the 2nd lines, etc.
So far I haven't found any solution, and I barely know anything about regex.
I hope you see the idea; don't hesitate if you have any questions. Thanks!
No need to use regex:
string = '''01/01/99 9PM
Iron Maiden
Madison Square Garden
New York City
01/01/99 9.30PM
The Doors
Staples Center
Los Angeles
01/02/99 8.45PM
Dr Dre & Snoop Dogg
Staples Center
Los Angeles
01/02/99 9PM
Diana Ross
City Hall
New York City
'''
lines = string.strip().split('\n')  # strip() avoids a trailing empty entry from the final newline
dates = lines[0::4]
bands = lines[1::4]
places = lines[2::4]
cities = lines[3::4]
This will give you a list of dates/bands/places/cities, which will be easier to work with.
If you want to turn them back into a string, you could do:
'; '.join(dates) #Do the same for all 4 variables
Which gives:
'01/01/99 9PM; 01/01/99 9.30PM; 01/02/99 8.45PM; 01/02/99 9PM'
You could replace '; ' with ' ' if you only want them space-separated, or with whatever separator you like.
I would personally use namedtuples. Note that I put your data in a file called input.txt.
from collections import namedtuple

Entry = namedtuple("Entry", "time name location city")

with open('input.txt') as f:
    lines = [line.strip() for line in f]

objects = [Entry(*lines[i:i+4]) for i in range(0, len(lines), 4)]

print(*objects, sep='\n')
for obj in objects:
    print(obj.name)
Output:
Entry(time='01/01/99 9PM', name='Iron Maiden', location='Madison Square Garden', city='New York City')
Entry(time='01/01/99 9.30PM', name='The Doors', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 8.45PM', name='Dr Dre & Snoop Dogg', location='Staples Center', city='Los Angeles')
Entry(time='01/02/99 9PM', name='Diana Ross', location='City Hall', city='New York City')
Iron Maiden
The Doors
Dr Dre & Snoop Dogg
Diana Ross
This calls for slicing:
times = lines[0::4]
names = lines[1::4]
locations = lines[2::4]
cities = lines[3::4]
And now we can zip those lists into tuples:
events = zip(times, names, locations, cities)
With your sample data, this gives us
>>> list(events)
[('01/01/99 9PM', 'Iron Maiden', 'Madison Square Garden', 'New York City'), ('01/01/99 9.30PM', 'The Doors', 'Staples Center', 'Los Angeles'), ('01/02/99 8.45PM', 'Dr Dre & Snoop Dogg', 'Staples Center', 'Los Angeles'), ('01/02/99 9PM', 'Diana Ross', 'City Hall', 'New York City')]
You can now process these tuples into any data structure that suits your use case best.
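For instance, if a list of dicts is more convenient, you could pair each tuple with a tuple of field names (the key names below are just an illustration):
keys = ('time', 'name', 'location', 'city')
event_dicts = [dict(zip(keys, event)) for event in zip(times, names, locations, cities)]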
I'm working on several hundred documents, and I'm writing a function that will find specific words and their values and return a list of dictionaries.
I'm looking for a specific piece of information ('city' and the number that refers to it). However, in some documents I have one city, and in others I might have twenty or even one hundred, so I need something very generic.
A text example (the parentheses are messed up like this):
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
or
text2 = 'About medium-sized cities such as City: Eger (population was: 32,352). However etc etc'
Using regex, I found the string that I'm looking for:
p = regex.compile(r'(?<=City).(.*?)(?=However)')
m = p.findall(text)
This returns the matched text as a one-element list:
[' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
Now, this is where I'm stuck and I don't know how to proceed. Should I use regex.findall or regex.finditer?
Considering that the number of cities varies across documents, I would like to get a list of dictionaries back. If I run it on text2, I would get:
d = [{'cities': 'Eger', 'population': '32,352'}]
If I run it on text, I would get dictionaries like:
d = [{'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
I really appreciate any help, guys!
You may use re.finditer with a regex that has named capturing groups (named after your keys) on the matched text, and x.groupdict() to get a dictionary for each result:
import re
text = 'The territory of modern Hungary was for centuries inhabited by a succession of peoples, including Celts, Romans, Germanic tribes, Huns, West Slavs and the Avars. The foundations of the Hungarian state was established in the late ninth century AD by the Hungarian grand prince Árpád following the conquest of the Carpathian Basin. According to previous census City: Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). However etc etc'
p = re.compile(r'City:\s*(.*?)However')
p2 = re.compile(r'(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)')
m = p.search(text)
if m:
    print([x.groupdict() for x in p2.finditer(m.group(1))])
    # => [{'population': '1,590,316', 'city': 'Budapest'}, {'population': '115,399', 'city': 'Debrecen'}, {'population': '104,867', 'city': 'Szeged'}, {'population': '109,841', 'city': 'Miskolc'}]
The second p2 regex is
(?P<city>\w+)\s*\([^()\d]*(?P<population>\d[\d,]*)
Here,
(?P<city>\w+) - Group "city": 1+ word chars
\s*\( - 0+ whitespaces and (
[^()\d]* - any 0+ chars other than ( and ) and digits
(?P<population>\d[\d,]*) - Group "population": a digit followed with 0+ digits or/and commas
You might try to run the p2 regex on the whole original string, but it may overmatch.
A very good answer by @Wiktor. Since I spent some time on this, I am posting my answer as well.
d = [' Budapest (population was: 1,590,316)Debrecen (population was: 115,399)Szeged (population was: 104,867)Miskolc (population was: 109,841). ']
oo = []
import re

for i in d[0].split(")"):
    jj = re.search("[0-9,]+", i)   # the population figure, if this chunk contains one
    kk, *xx = i.split()            # the first word of the chunk is the city name
    if jj:
        oo.append({"cities": kk, "population": jj.group()})

print(oo)
#Result--> [{'cities': 'Budapest', 'population': '1,590,316'}, {'cities': 'Debrecen', 'population': '115,399'}, {'cities': 'Szeged', 'population': '104,867'}, {'cities': 'Miskolc', 'population': '109,841'}]
I need to parse the text of a text file into two categories:
University
Location (for example: Lahore, Peshawar, Jamshoro, Faisalabad)
but the text file contains the following text:
"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
Code:
for l in content:
    rep = l.replace('"', '')
    if ',' in rep:
        uni = rep.split(',')[0]
        loc = rep.split(',')[-1].strip()
    else:
        loc = rep.split(' ')[-1].strip()
        uni = rep.split(' ').index(loc)
It returns the following output, where 3 and 5 are the index values before the cities:
3 represents Government College University
5 represents Imperial College of Business Studies
Uni: Imperial College of Business Studies Loc: Lahore
Uni: 3 Loc: Faisalabad
Uni: 5 Loc: Lahore
Uni: University of Peshawar Loc: Peshawar
Uni: University of Sindh Loc: Jamshoro
I want the program to return the string value instead of the index values 3 and 5.
In the case where there is no comma, the lines
loc = rep.split(' ')[-1].strip()
uni = rep.split(' ').index(loc)
first find the location as the last word of the string, then tell you at what index that word occurs in the split list. What you want is everything but the last word in the string, which you can get as
uni = ' '.join(rep.split()[:-1])
It might be better just to replace the ',' with '' to begin with, so that there is only one case to deal with. Also, my inclination is to split the string only once:
words = rep.split() # the default is to split at whitespace
loc = words[-1]
uni = ' '.join(words[:-1])
So, I would write the loop like this:
for l in content:
    rep = l.strip('"').replace(',', '')
    words = rep.split()
    loc = words[-1]
    uni = ' '.join(words[:-1])
    print(uni, loc)
This prints (under Python 2, where print(uni, loc) displays a tuple):
('Imperial College of Business Studies', 'Lahore')
('Government College University', 'Faisalabad')
('Imperial College of Business Studies', 'Lahore')
('University of Peshawar', 'Peshawar')
('University of Sindh', 'Jamshoro')
which I take it is what you want.
Just cycle through and take the last element as the location:
content = [
    "Government College University Faisalabad",
    "Imperial College of Business Studies Lahore",
    "University of Peshawar, Peshawar",
    "University of Sindh, Jamshoro"]
locs = [l.split()[-1] for l in content]
print locs
['Faisalabad', 'Lahore', 'Peshawar', 'Jamshoro']
Using regular expression in Python 3.4, how would I extract the city names from the following text below?
replacement windows in seattle wa
basement remodeling houston texas
siding contractor new york ny
windows in elk grove village
Sometimes the city name is preceded by \sin\s, sometimes it isn't. Sometimes it is preceded by a general word like 'windows', 'remodeling', ... anything. Sometimes there is no full state name or state abbreviation at the end.
Is there a single regular expression that can capture these above conditions?
Here's what I've tried so far, but it only captures 'seattle'.
import re

l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor new york ny',
     'windows in elk grove village'
     ]
for i in l:
    m = re.search(r'(?<=\sin\s)(.+)(?=\s(wa|texas|ny))', i)
    m.group(1)
What you are after is not possible with regular expressions. Regular expressions need string patterns to work. In your case, it would seem that the pattern either does not exist or can take a myriad of forms.
What you could do instead is use a search-efficient data structure (such as a set of known city names) and split your string into words. You would then go through each word, or run of words, and check whether it is in that data structure.
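A minimal sketch of that idea, assuming you already have a collection of known city names to check against (the entries in known_cities below are just placeholders):
known_cities = {"seattle", "houston", "new york", "elk grove village"}

def find_city(phrase, cities=known_cities):
    words = phrase.lower().split()
    # try longer runs of words first so "elk grove village" beats a shorter partial hit
    for size in range(len(words), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in cities:
                return candidate
    return None

print(find_city("replacement windows in seattle wa"))  # seattle
print(find_city("windows in elk grove village"))       # elk grove village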
import re

l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor newyork ny',
     'windows in elk grove village']
p = re.compile(r"(\w+)\s(?:(wa | texas | ny | village))", re.VERBOSE)
for words in l:
    print(p.search(words).expand(r"\g<1> <-- the code is --> \g<2>"))