So Parsing a large text file with 56,900 book titles with authors and a etext no.
Trying to find the authors. By parsing the file.
The file is a like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
Author name usually starts after "by" or when there is no "by" in line then author name starts after a comma ","...However the "," can be a part of the title if the line has a by.
So, I parsed it for by first then for comma.
Here is what I tried:
def search_by_author():
fhand = open('GUTINDEX.ALL')
print("Search by Author:")
for line in fhand:
if not line.startswith(" [") and not line.startswith("TITLE"):
if not line.startswith("~"):
words = line.rstrip()
words = line.lstrip()
words = words[:-6]
if ", by" in words:
words = words[words.find(', by'):]
words = words[5:]
print (words)
else:
words = words[words.find(', '):]
words = words[5:]
if "," in words:
words = words[words.find(', '):]
if words.startswith(','):
words =words[words.find(','):]
print (words)
else:
print (words)
else:
print (words)
if " by" in words:
words = words[words.find('by')]
print(words)
search_by_author()
However it can't seem to find the author name for lines like
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
As per your file, info about a book can be spread across multiple lines. There is a blank line after each book info. I used that to gather all info about a book and then parse it to get the author info.
import re
def search_by_author():
fhand = open('GUTINDEX.ALL')
book_info = ''
for line in fhand:
line = line.rstrip()
if (line.startswith('TITLE') or line.startswith('~')):
continue
if (len(line) == 0):
# remove info in square bracket from book_info
book_info = re.sub(r'\[.*$', '', book_info)
if ('by ' in book_info):
tokens = book_info.split('by ')
else:
tokens = book_info.split(',')
if (len(tokens) > 1):
authors = tokens[-1].strip()
print(authors)
book_info = ''
else:
# remove ETEXT NO. from line
line = re.sub(r'\d+$', '', line)
book_info += ' ' + line.rstrip()
search_by_author()
Output:
Robert Lloyd Praeger
Sabine Baring-Gould
mennessä Charles T. Russell
mennessä Charles T. Russell
Horatio Alger, Jr.
Al Avery and George Rutherford Montgomery
Lillian Garis
Boris Sidis
par Francis Picabia
Frederick Strauss
di Luigi Amabile
Fletcher Pratt
di Teodoro Pascal
Various
Various
mennessä Sven Elvestad
L. S. D
Joel Chandler Harris
mennessä Jack London
T. G. Wakeling
Helena Petrovna Blavatsky
Related
I'm working on a movie data base, getting responses from imdb. I'm getting the response in a xml format, but it has no tags, just the information mixed. How can I get each of the data in there?
Here's how the respone shows up:
<?xml version="1.0" encoding="UTF-8"?><root response="True">
<movie title="Batman" year="1989" rated="PG-13" released="23 Jun 1989" runtime="126 min" genre="Action, Adventure" director="Tim Burton" writer="Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)" actors="Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl" plot="Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne." language="English, French, Spanish" country="USA, UK" awards="Won 1 Oscar. Another 8 wins & 26 nominations." poster="https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg" metascore="69" imdbRating="7.6" imdbVotes="302,842" imdbID="tt0096895" type="movie" />
</root>
Here is my answer to your question
xmlRaw="""< ?xml
version = "1.0"
encoding = "UTF-8"? > < root
response = "True" >
< movie
title = "Batman"
year = "1989"
rated = "PG-13"
released = "23 Jun 1989"
runtime = "126 min"
genre = "Action, Adventure"
director = "Tim Burton"
writer = "Bob Kane (Batman characters), Sam Hamm (story), Sam Hamm (screenplay), Warren Skaaren (screenplay)"
actors = "Michael Keaton, Jack Nicholson, Kim Basinger, Robert Wuhl"
plot = "Gotham City. Crime boss Carl Grissom (Jack Palance) effectively runs the town but there's a new crime fighter in town - Batman (Michael Keaton). Grissom's right-hand man is Jack Napier (Jack Nicholson), a brutal man who is not entirely sane... After falling out between the two Grissom has Napier set up with the Police and Napier falls to his apparent death in a vat of chemicals. However, he soon reappears as The Joker and starts a reign of terror in Gotham City. Meanwhile, reporter Vicki Vale (Kim Basinger) is in the city to do an article on Batman. She soon starts a relationship with Batman's everyday persona, billionaire Bruce Wayne."
language = "English, French, Spanish"
country = "USA, UK"
awards = "Won 1 Oscar. Another 8 wins & 26 nominations."
poster = "https://m.media-amazon.com/images/M/MV5BMTYwNjAyODIyMF5BMl5BanBnXkFtZTYwNDMwMDk2._V1_SX300.jpg"
metascore = "69"
imdbRating = "7.6"
imdbVotes = "302,842"
imdbID = "tt0096895"
type = "movie" / >
< / root >"""
def getValue(xml, value):
textString = xmlRaw.split('\n')
for line in textString:
if value in line:
returnData = line
return returnData
print (getValue(xmlRaw, 'title'))
print (getValue(xmlRaw, 'year'))
print (getValue(xmlRaw, 'rated'))
print (getValue(xmlRaw, 'released'))
#add more as you need the data
I am very new to python. I have this very large xml file and I want to extract some data from it. Here is an excerpt:
<program>
<id>38e072a7-8fc9-4f9a-8eac-3957905c0002</id>
<programID>3853</programID>
<orchestra>New York Philharmonic</orchestra>
<season>1842-43</season>
<concertInfo>
<eventType>Subscription Season</eventType>
<Location>Manhattan, NY</Location>
<Venue>Apollo Rooms</Venue>
<Date>1842-12-07T05:00:00Z</Date>
<Time>8:00PM</Time>
</concertInfo>
<worksInfo>
<work ID="52446*">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
<conductorName>Hill, Ureli Corelli</conductorName>
</work>
<work ID="8834*4">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="3642*">
<composerName>Hummel, Johann</composerName>
<workTitle>QUINTET, PIANO, D MINOR, OP. 74</workTitle>
<soloists>
<soloist>
<soloistName>Scharfenberg, William</soloistName>
<soloistInstrument>Piano</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Hill, Ureli Corelli</soloistName>
<soloistInstrument>Violin</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Derwort, G. H.</soloistName>
<soloistInstrument>Viola</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Boucher, Alfred</soloistName>
<soloistInstrument>Cello</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
<soloist>
<soloistName>Rosier, F. W.</soloistName>
<soloistInstrument>Contrabass</soloistInstrument>
<soloistRoles>A</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="0*">
<interval>Intermission</interval>
</work>
<work ID="8834*3">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>Overture</movement>
<conductorName>Etienne, Denis G.</conductorName>
</work>
<work ID="8835*1">
<composerName>Rossini, Gioachino</composerName>
<workTitle>ARMIDA</workTitle>
<movement>Duet</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8837*6">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>FIDELIO, OP. 72</workTitle>
<movement>"In Des Lebens Fruhlingstagen...O spur ich nicht linde," Florestan (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Horn, Charles Edward</soloistName>
<soloistInstrument>Tenor</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="8336*4">
<composerName>Mozart, Wolfgang Amadeus</composerName>
<workTitle>ABDUCTION FROM THE SERAGLIO,THE, K.384</workTitle>
<movement>"Ach Ich liebte," Konstanze (aria)</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
<work ID="5543*">
<composerName>Kalliwoda, Johann W.</composerName>
<workTitle>OVERTURE NO. 1, D MINOR, OP. 38</workTitle>
<conductorName>Timm, Henry C.</conductorName>
</work>
</worksInfo>
</program>
<program>
What I would like to do is extract the following pieces of information: programID, orchestra, season, eventType, work ID, soloistName, solositInstrument, soloistRole
Here is the code I am using:
import csv
import xml.etree.cElementTree as ET
tree = ET.iterparse('complete.xml.txt')
#root = tree.getroot()
for program in root.iter('program'):
ID = program.findtext('id')
programID = program.findtext('programID')
orchestra = program.findtext('orchestra')
season = program.findtext('season')
for concert in program.findall('concertInfo'):
event = concert.findtext('eventType')
for worksInfo in program.findall('worksInfo'):
for work in worksInfo.iter('work'):
workid = work.get('ID')
for soloists in work.iter('soloists'):
for soloist in soloists.iter('soloist'):
soloname = soloist.findtext('soloistName')
soloinstrument = `soloist.findtext('soloistInstrument')`
solorole = soloist.findtext('soloistRoles')
#print(soloname, soloinstrument, solorole)
#print(workid)
#print(event)
#print(programID , " , " , orchestra , " , " , season)
with open("nyphil.txt","a") as nyphil:
nyphilwriter = csv.writer(nyphil)
nyphilwriter.writerow([programID, orchestra, season, event, workid, `soloname.encode('utf-8'), soloinstrument, solorole])
nyphil.close()
When I run this code I only get the last soloistName and soloistInstrumet. The outcome that I have in mind is sort of like a repeated observations for each program. So I'd have something like:
13918, New York Philharmonic, 1842-43, Subscription Season, 52446*, Otto, Antoinette, Soprano, S
13918,...., 3642*, Scharfenberg, William , Piano, A
13918,...., 3642*, Hill, Ureli Corelli , Violin, A
and so on until the last work ID:
13918,...., 8336*4 , Otto, Antoinette, Soprano, S
What I am getting is only the last work:
13918, New York Philharmonic, 1842-43, Subscription Season, 8336*, Otto, Antoinette, Soprano, S
In the file there are over 15,000 programs like the example I posted. I want to parse all of them and extract the information I mentioned above. I am not entirely sure how to go about doing this, I've scoured the internet for a way to do this, but everything I tried just doesn't work!!
Your problem here is that you misunderstand the way loops work. Specifically, the values only change while you're in the loop:
for x in range(10):
pass
print(x) # prints 9
vs
for x in range(10):
print(x)
Those are two different things. You're doing the former. What you need to do is something like this:
with open('nyphil.txt', 'w') as f:
nyphilwriter = csv.writer(f)
for program in root.iter('program'):
id_ = program.findtext('id')
program_id = program.findtext('programID')
orchestra = program.findtext('orchestra')
season = program.findtext('season')
for concert in program.findall('concertInfo'):
event = concert.findtext('eventType')
for info in program.findall('worksInfo'):
for work in info.iter('work'):
work_id = work.get('ID')
for soloists in work.iter('soloists'):
for soloist in soloists.iter('soloist'):
# Change this line to whatever you want to write out
nyphilwriter.writerow([id, program_id, orchestra, season, event, work_id, soloist.findtext('soloistName')])
The 13918 does not appear in your data. Leaving that aside, here's what I wrote, which appears to process your data successfully.
from lxml import etree
tree = etree.parse('test.xml')
programs = tree.xpath('.//program')
for program in programs:
programID, orchestra, season = [program.xpath(_)[0].text for _ in ['programID', 'orchestra', 'season']]
print (programID, orchestra, season)
works = program.xpath('worksInfo/work')
for work in works:
workID = work.attrib['ID']
soloistItems = work.xpath('soloists/soloist')
for soloistItem in soloistItems:
print (workID, soloistItem.find('soloistName').text, soloistItem.find('soloistInstrument').text, soloistItem.find('soloistRoles').text)
The script produces the following output.
3853 New York Philharmonic 1842-43
8834*4 Otto, Antoinette Soprano S
3642* Scharfenberg, William Piano A
3642* Hill, Ureli Corelli Violin A
3642* Derwort, G. H. Viola A
3642* Boucher, Alfred Cello A
3642* Rosier, F. W. Contrabass A
8835*1 Otto, Antoinette Soprano S
8835*1 Horn, Charles Edward Tenor S
8837*6 Horn, Charles Edward Tenor S
8336*4 Otto, Antoinette Soprano S
One other thing to note: I put a tag at the beginning of your XML and a at the end since the real data would contain multiple elements.
In Python, is there a function that classifies and orders an array of objects by an attribute?
Example:
class Book:
"""A Book class"""
def __init__(self,name,author,year):
self.name = name
self.author = author
self.year = year
hp1 = Book("Harry Potter and the Philosopher's stone","J.k Rowling",1997)
hp2 = Book("Harry Potter and the chamber of secretse","J.k Rowling",1998)
hp3 = Book("Harry Potter and the Prisioner of Azkaban","J.k Rowling",1999)
#asoiaf stands for A Song of Ice and Fire
asoiaf1 = Book("A Game of Thrones","George R.R Martin",1996)
asoiaf2 = Book("A Clash of Kings","George R.R Martin",1998)
hg1 = Book("The Hunger Games","Suzanne Collins",2008)
hg2 = Book("Catching Fire","Suzanne Collins",2009)
hg3 = Book("Mockingjaye","Suzanne Collins",2010);
books = [hp3,asoiaf1,hp1,hg1,hg2,hp2,asoiaf2,hg3]
#disordered on purpose
organized_by_autor = magic_organize_function(books,"author")
Does the magic_organize_function exist? Otherwise, what would it be?
Sorting by author is one way:
sorted_by_author = sorted(books, key=lambda x: x.author)
for book in sorted_by_author:
print(book.author)
Output:
George R.R Martin
George R.R Martin
J.k Rowling
J.k Rowling
J.k Rowling
Suzanne Collins
Suzanne Collins
Suzanne Collins
You can also nicely group by author using itertools.groupby:
from itertools import groupby
organized_by_author = groupby(sorted_by_author, key=lambda x: x.author)
for author, book_from_author in organized_by_author:
print(author)
for book in book_from_author:
print(' ', book.name)
Output:
George R.R Martin
A Game of Thrones
A Clash of Kings
J.k Rowling
Harry Potter and the Prisioner of Azkaban
Harry Potter and the Philosopher's stone
Harry Potter and the chamber of secretse
Suzanne Collins
The Hunger Games
Catching Fire
Mockingjaye
Note: You need to feed groupby the sorted sorted_by_author.
I need to turn this file content into a dictionary, so that every key in the dict is a name of a movie and every value is the name of the actors that plays in it inside a set.
Example of file content:
Brad Pitt, Sleepers, Troy, Meet Joe Black, Oceans Eleven, Seven, Mr & Mrs Smith
Tom Hanks, You have got mail, Apollo 13, Sleepless in Seattle, Catch Me If You Can
Meg Ryan, You have got mail, Sleepless in Seattle
Diane Kruger, Troy, National Treasure
Dustin Hoffman, Sleepers, The Lost City
Anthony Hopkins, Hannibal, The Edge, Meet Joe Black, Proof
This should get you started:
line = "a, b, c, d"
result = {}
names = line.split(", ")
actor = names[0]
movies = names[1:]
result[actor] = movies
Try the following:
res_dict = {}
with open('my_file.txt', 'r') as f:
for line in f:
my_list = [item.strip() for item in line.split(',')]
res_dict[my_list[0]] = my_list[1:] # To make it a set, use: set(my_list[1:])
Explanation:
split() is used to split each line to form a list using , separator
strip() is used to remove spaces around each element of the previous list
When you use with statement, you do not need to close your file explicitly.
[item.strip() for item in line.split(',')] is called a list comprehension.
Output:
>>> res_dict
{'Diane Kruger': ['Troy', 'National Treasure'], 'Brad Pitt': ['Sleepers', 'Troy', 'Meet Joe Black', 'Oceans Eleven', 'Seven', 'Mr & Mrs Smith'], 'Meg Ryan': ['You have got mail', 'Sleepless in Seattle'], 'Tom Hanks': ['You have got mail', 'Apollo 13', 'Sleepless in Seattle', 'Catch Me If You Can'], 'Dustin Hoffman': ['Sleepers', 'The Lost City'], 'Anthony Hopkins': ['Hannibal', 'The Edge', 'Meet Joe Black', 'Proof']}
I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:
Some Examples:
If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff,
Salt Lake City, Park City, Denver, Estes Park, Colorado Springs,
Pueblo, or Albuquerque, the week will...
If you have vacation or wedding plans for Miami, Jacksonville, Macon,
Charlotte, or Charleston, expect a couple systems...
If you have vacation or wedding plans in Pittsburgh, Philadelphia,
Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or
Dover, expect the week...
The strings start with a common prefix "If you have vacation or wedding plans in" and the last city has "or" before it. The list of cities is of variable length.
I've tried this:
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')
>>>
I think I'm pretty close, but obviously it's not working. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
Alternative solution here (probably just for sharing and educational purposes).
If you were to solve it with nltk, it would be called a Named Entity Recognition problem. Using the snippet based on nltk.chunk.ne_chunk_sents(), provided here:
import nltk
def extract_entity_names(t):
entity_names = []
if hasattr(t, 'label') and t.label:
if t.label() == 'NE':
entity_names.append(' '.join([child[0] for child in t]))
else:
for child in t:
entity_names.extend(extract_entity_names(child))
return entity_names
sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
entity_names.extend(extract_entity_names(tree))
print entity_names
It prints exactly the desired result:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv, please change to suite your situation). After parsing each line:
Discard the last cell, it is not a city name
Remove 'If ...' from the first cell
Remove or 'or ' from the last cell (used to be next-to-last)
Here is the code:
import csv
def cleanup(row):
new_row = row[:-1]
new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
new_row[-1] = new_row[-1].replace('or ', '')
return new_row
if __name__ == '__main__':
with open('data.csv') as f:
reader = csv.reader(f, skipinitialspace=True)
for row in reader:
row = cleanup(row)
print row
Output:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
import re
s = "If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems"
p = re.compile(r"If you have vacation or wedding plans (in|for) ((\w+, )+)or (\w+)")
m = p.match(s)
print m.group(2) # output: Miami, Jacksonville, Macon, Charlotte,
cities = m.group(2).split(", ") # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', '']
cities[-1] = m.group(4) # add the city after or
print cities # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
the city can be matched by pattern (\w+, ) and or (\w+)
and split cities by pattern ,
btw, as the pattern is used to many data, it is preferred to work with the compiled object
PS: the word comes after plan can be for or in, according to examples you provide
How about this
>>> text = 'If you have vacation or wedding plans for Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will'
>>> match = re.search(r'^If you have vacation or wedding plans (in?|for?) ([\w+ ,]+)',text).groups()[1].split(", ")
Output
>>> match
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'or Albuquerque', 'the week will']