Related
I have a column that has a lot of doctor specialties. I want to clean it up and created a function below:
def specialty(x):
if x.str.contains('Urolog'):
return 'Urology'
elif x.str.contains('Nurse'):
return 'Nurse Practioner'
elif x.str.contains('Oncology'):
return 'Oncology'
elif x.str.contains('Physician'):
return 'Physician Assistant'
elif x.str.contains('Family Medicine'):
return 'Family Medicine'
elif x.str.contains('Anesthes'):
return 'Anesthesiology'
else:
return 'Other'
df['desc_clean'] = df['desc'].apply(specialty)
However I get an error TypeError: 'function' object is not subscriptable
There are too many values to use a manual mapping so i wanted to use str.contains. Is there a way to do this better?
EDIT: Sample DF
{'person_id': {39063: 33081476009,
50538: 33033519093,
56075: 33170508793,
36593: 33061707789,
51656: 33047685345,
95512: 33022026049,
40286: 33038034707,
3887: 33076466195,
40161: 33052807819,
52905: 33190526939,
35418: 33008425164,
35934: 33015737122,
3389: 33055125864,
136: 33139641318,
105460: 33113871389,
52568: 33075745388,
24725: 33052090907,
34838: 33205449839,
31908: 33183672635,
36115: 33006692696},
'final_desc': {39063: 'None',
50538: 'Urology',
56075: 'Anesthesiology',
36593: 'None',
51656: 'Urology',
95512: 'None',
40286: 'Anesthesiology',
3887: 'Specialist',
40161: 'None',
52905: 'Anesthesiology',
35418: 'Urology',
35934: 'None',
3389: 'Ophthalmology',
136: 'Rheumatology',
105460: 'None',
52568: 'Urology',
24725: 'Family Medicine',
34838: 'None',
31908: 'Nurse Practitioner',
36115: 'None'}}
To do this, we can define a mapping between matches, then iterate through them and set the column's value, keeping track of columns we've changed. At the end, any columns we never matched get set to 'Other'.
mapping = {'Urolog': 'Urology',
'Nurse': 'Nurse Practioner',
'Oncology': 'Oncology',
'Physician': 'Physician Assistant',
'Family Medicine': 'Family Medicine',
'Anesthes': 'Anesthesiology'}
def specialty(column):
column = column.copy()
matches = pd.Series(False, index=column.index)
for k,v in mapping.items():
match = column.str.contains(k)
column[match] = v
matches[match] = True
column[~matches] = 'Other'
return column
specialty(df['final_desc'])
39063 Other
50538 Urology
56075 Anesthesiology
36593 Other
51656 Urology
95512 Other
40286 Anesthesiology
3887 Other
40161 Other
52905 Anesthesiology
35418 Urology
35934 Other
3389 Other
136 Other
105460 Other
52568 Urology
24725 Family Medicine
34838 Other
31908 Nurse Practioner
36115 Other
Name: final_desc, dtype: object
x received by specialty function is string itself. So no x.str and since it is string you can use 'in' to check as below. Modified some data to see the result
Tip: You should use a dictionary or list rather than using the elif chain.
Code:
import pandas as pd
import numpy as np
def specialty(x):
print(x)
if x in 'Urolog':
return 'Urology'
elif x in 'Nurse':
return 'Nurse Practioner'
elif x in 'Oncology':
return 'Oncology'
elif x in 'Physician':
return 'Physician Assistant'
elif x in 'Family Medicine':
return 'Family Medicine'
elif x in 'Anesthes':
return 'Anesthesiology'
else:
return 'Other'
df = pd.DataFrame(data={'person_id': {39063: 33081476009, 50538: 33033519093, 56075: 33170508793, 36593: 33061707789, 51656: 33047685345, 95512: 33022026049, 40286: 33038034707, 3887: 33076466195, 40161: 33052807819, 52905: 33190526939, 35418: 33008425164, 35934: 33015737122, 3389: 33055125864, 136: 33139641318, 105460: 33113871389, 52568: 33075745388, 24725: 33052090907, 34838: 33205449839, 31908: 33183672635, 36115: 33006692696},
'final_desc': {39063: 'None', 50538: 'Urolog', 56075: 'Anesthes', 36593: 'None', 51656: 'Urology', 95512: 'None', 40286: 'Anesthes', 3887: 'Specialist', 40161: 'None', 52905: 'Anesthesiology', 35418: 'Urology', 35934: 'None', 3389: 'Ophthalmology', 136: 'Rheumatology', 105460: 'None', 52568: 'Urology', 24725: 'Family Medicine', 34838: 'None', 31908: 'Nurse', 36115: 'None'}})
df['desc_clean'] = df['final_desc'].apply(specialty)
print(df)
Output:
person_id final_desc desc_clean
39063 33081476009 None Other
50538 33033519093 Urolog Urology
56075 33170508793 Anesthes Anesthesiology
36593 33061707789 None Other
51656 33047685345 Urology Other
95512 33022026049 None Other
40286 33038034707 Anesthes Anesthesiology
3887 33076466195 Specialist Other
40161 33052807819 None Other
52905 33190526939 Anesthesiology Other
35418 33008425164 Urology Other
35934 33015737122 None Other
3389 33055125864 Ophthalmology Other
136 33139641318 Rheumatology Other
105460 33113871389 None Other
52568 33075745388 Urology Other
24725 33052090907 Family Medicine Family Medicine
34838 33205449839 None Other
31908 33183672635 Nurse Nurse Practioner
36115 33006692696 None Other
You can use a library like fuzzywuzzy for fuzzy string matching. The benefit of this approach is it is more flexible than some rule set, as demonstrated below.
This solution generates the max score of substrings and candidate categories, returning the one that matches best. If it's below the threshold it will return the default value ("None"):
from fuzzywuzzy import fuzz
CATEGORIES = [
'Urology',
'Nurse Practioner',
'Oncology',
'Physician Assistant',
'Family Medicine',
'Anesthesiology',
'Specialist',
]
def best_match(
text,
categories=CATEGORIES,
default="None",
threshold=65
):
matches = {fuzz.partial_ratio(cat, text): cat for cat in categories}
best_score = max(matches)
best_match = matches[best_score]
if best_score >= threshold:
return best_match
else:
return default
df["final_desc"] = df.desc.apply(best_match)
result:
person_id final_desc desc
52568 33075745388 Urology urologist
36593 33061707789 Nurse Practioner nruse practition
136 33139641318 Specialist oncology specialist
50538 33033519093 Physician Assistant physicians assistant
3389 33055125864 Family Medicine fam. medicine
51656 33047685345 Anesthesiology anesthesiology
35418 33008425164 Anesthesiology anesthesiologist
52905 33190526939 Nurse Practioner Nurses practitioner
36115 33006692696 Specialist Occupational specialist
31908 33183672635 Oncology Oncologist
You can iterate directly using the index :
ix = df[df.desc.str.contains('Urolog')].index
df.loc[ix, 'desc_clean'] = "Urology"
So iterating would be something like :
dict_specialties = {"Urolog":"Urology",}
for key, val in dict_specialties.items():
ix = df[df.desc.str.contains(key)].index
df.loc[ix, 'desc_clean'] = val
defined few functions (not induced body of the functions to simplify question here)
def policyname(i):
retrun policyname
def policytype(i):
retrun policytype
def active(i):
retrun active
def backupselection(i):
retrun backupselection
defined a list -
clients = ['winwebint16', 'winwebtpie03', 'winwebtpie04', 'winwtsdt08', 'winwtsmwg03', 'winwtsqnr03', 'winwtswrl37', 'winwtswrl60', 'winwtswrl62', 'winwtswrl63', 'winwtswrl75', 'winwtszsim03', 'winwww0016','winsbk0100', 'winsbk0100a0', 'winsbk0100a1', 'winsbk0101', 'iinf065', 'iinf130', 'iinf185', 'iinf2126', 'inbf005', 'inis001', 'ipdataisbic01', 'ipdataisbic02', 'ipdataispre01', 'ipdataispre02', 'iproip02', 'isis002', 'isyn002', 'isyn006', 'isyn011', 'isyn012','isyn014', 'isyn038', 'isyn039', 'isyn040', 'mu2ssql1001', 'mu2ssql1003', 'macrsz0001', 'macrsz0005']
defined a class -
class client():
def __init__(self,policyname,policytype,active,backupselection):
self.policyname = policyname
self.policytype = policytype
self.active = active
self.backupselection = backupselection
For each item in clients list create class objects by passing parameters from functions.
Is below code is correct ?
for i in clients:
i = client(policyname(i),policytype(i),active(i),backupselection(i))
with above code, will i be able to access specific class objects like ?
print(winwebint16.policyname)
print(winwebint16.policytype)
print(winwebint16.active)
print(winwebint16.backupselection)
Classnames should be capital letter but that's just convention. Further you have a problem with the for loop
for i in clients:
i = client(policyname(i),policytype(i),active(i),backupselection(i))
The created client is not visible outside of the for loop scope, so you might wanna add them to some kind of list or dict
client_list: dict = {}
for i in clients:
client_list[i] = client(policyname(i),policytype(i),active(i),backupselection(i))
you should then be able to print like
print(client_list['winwebint16'].policyname)
This really worked for me :
clients = ['669165933', '963881480', '341417157', '514321792', '115456712', '547995746', '135425221', '871543967', '770463311', '616607081', '814711606', '939825713']
policynames = ['Tuvalu', 'Grenada', 'Russia', 'Sao Tome and Principe', 'Rwanda', 'Solomon Islands', 'Angola', 'Burkina Faso', 'Republic of the Congo', 'Senegal', 'Kyrgyzstan', 'Cape Verde']
policytypes = ['Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Online', 'Offline']
actives = ['Baby Food', 'Cereal', 'Office Supplies', 'Fruits', 'Office Supplies', 'Baby Food', 'Household', 'Vegetables', 'Personal Care', 'Cereal', 'Vegetables', 'Clothes']
backupselections = ['H', 'C', 'L', 'C', 'L', 'C', 'M', 'H', 'M', 'H', 'H', 'H']
def policyname(i):
return policynames[i]
def policytype(i):
return policytypes[i]
def active(i):
return actives[i]
def backupselection(i):
return backupselections[i]
class client():
def __init__(self,policyname,policytype,active,backupselection):
self.policyname = policyname
self.policytype = policytype
self.active = active
self.backupselection = backupselection
for i in range(0,len(clients)):
i = client(policyname(i),policytype(i),active(i),backupselection(i))
print(i.policyname,i.policytype,i.active,i.backupselection)
Tuvalu Offline Baby Food H
Grenada Online Cereal C
Russia Offline Office Supplies L
Sao Tome and Principe Online Fruits C
Rwanda Offline Office Supplies L
Solomon Islands Online Baby Food C
Angola Offline Household M
Burkina Faso Online Vegetables H
Republic of the Congo Offline Personal Care M
Senegal Online Cereal H
Kyrgyzstan Online Vegetables H
Cape Verde Offline Clothes H
But still I can't print
print(669165933.policyname)
File "<ipython-input-20-72da397c032b>", line 1
print(669165933.policyname)
^
SyntaxError: invalid syntax
I have some data scraped from a website in a string as shown below:
myDatastr =
'United States3.43M+57,9421M138K+282Brazil1.88M+20,2861.21M72,833+733India907K+28,701571K23,727+500Russia734K+6,537504K11,439+104Peru330K+3,797221K12,054+184Chile318K287K7,024Mexico304K+4,685189K35,491+485United Kingdom290K+65044,830+21South Africa288K138K4,172Iran260K+2,349223K13,032+203Spain256K+2,045150K28,406+3Pakistan254K+2,753171K5,320+69Italy243K+169195K34,967+13Saudi Arabia235K+2,852170K2,704+20Turkey214K+1,008196K5,382+19Germany200K+159185K9,138+1Bangladesh187K+3,09998,3172,391+39France172K78,59730,029Colombia154K+3,83265,8095,455+148Canada108K71,8418,790Qatar104K101K149Argentina103K+3,0991,903+58Mainland China85,568+3Egypt83,00124,9753,935Iraq79,73546,9883,250Indonesia76,981+1,28236,6893,656+50Sweden75,826+315,536+0Ecuador68,459+5895,9005,063+16Belarus65,114+18255,492468+4Belgium62,707Kazakhstan59,899+1,64634,190375+0Oman58,179+1,31837,257259+9Philippines57,006+74720,3711,599+65Kuwait55,50845,356393United Arab Emirates55,19845,513334Ukraine54,13326,5031,398Netherlands51,308+1016,156+0Bolivia49,25015,2941,866Panama47,17323,919932Portugal46,81831,0651,662Singapore46,283+32242,54126+0Dominican Republic45,50622,441903Israel40,632+1,33619,395365+1Poland38,190+29926,0481,576+5Afghanistan34,45521,2541,012Nigeria33,513+59513,671744+4Bahrain33,47628,425104Romania32,94821,6921,901Switzerland32,94629,6001,686Armenia32,151+18219,865573+8Guatemala29,7424,3211,244Honduras28,579+489789+15Ireland25,638+1023,3491,746+0Ghana24,98821,067139Azerbaijan24,570+52015,640313+8Japan21,868Algeria19,689+49414,0191,018+7Moldova19,439+17412,793649+2Austria18,94817,000708Serbia18,360Nepal16,945+1443,65238+0Morocco15,93612,934255Cameroon15,173+25711,928359+0Uzbekistan13,591+4878,03063+3South Korea13,512+3312,282289+0Czechia13,2388,373353Denmark13,147609Côte d'Ivoire12,766Kyrgyzstan11,538+488149+15Kenya10,2942,946197Sudan10,250+0+0Australia9,980+1837,769108+0El Salvador9,9785,755267Venezuela9,707+2422,67193+4Norway8,984+08,138253+0Malaysia8,725+78,520122Senegal8,198+1215,514150+3North Macedonia8,1974,326385Democratic Republic of the Congo8,075+623,620190+0Costa Rica8,0362,30431Ethiopia7,7662,430128Bulgaria7,525Finland7,2956,800329Palestine7,037Bosnia and Herzegovina6,981+4823,179226+5Haiti6,727+732,924139+4Tajikistan6,551+46+0French Guiana6,170+22129+3Guinea6,141+974,86237+0Gabon5,942+03,00446+0Mauritania5,275+149+3Kosovo5,118+1872,370108+6Djibouti4,972+4+0Luxembourg4,956+314,183111+0Madagascar4,867+289+1Central African Republic4,288+0+0Hungary4,247+133,073595+0Greece3,826+311,374193+0Croatia3,775+532,514119+0Albania3,5712,01495Thailand3,220+33,09058+0Equatorial Guinea3,071+084251+0Somalia3,059+81,30693+1Paraguay2,9801,29325Nicaragua2,846+01,75091+0Maldives2,762+672,29013+0Mayotte2,711+0+0Sri Lanka2,646+1061,98111+0Malawi2,43074739Cuba2,428+62,26887+0Mali2,411Lebanon2,334+166+0South Sudan2,148+9+0Republic of the Congo2,103+0+0Estonia2,0141,89569Slovakia1,90228Iceland1,90010Zambia1,895+01,34842+0Lithuania1,869Guinea-Bissau1,842+077326+0Slovenia1,841+14+0Cape Verde1,698+75+0Sierra Leone1,642+71,17563+0New Zealand1,545+022+0Hong Kong1,522+521,2178+1Yemen1,498+33424+7Libya1,433Benin1,378+0+0Eswatini1,351Rwanda1,337+38+1Tunisia1,263Montenegro1,221Jordan1,179+3+0Latvia1,173+01,01930+0Mozambique1,157+22+0Niger1,099+097868+0Burkina Faso1,033+13Uganda1,025Cyprus1,021+7+0Liberia1,010+12+4Georgia995+9Uruguay98731Zimbabwe985+3+0Chad880+6+1Namibia861+72281+0Andorra85580352Suriname780+3952618Jamaica759+110+0São Tomé and Príncipe727+228414+0Togo720San Marino716+3+0Malta674+06589+0Réunion593+16+0Tanzania509+0+0Angola506Taiwan4514387Syria417+2319+3Botswana399+85381+0Vietnam372+2Mauritius342+033010+0Isle of Man336+031224+0Myanmar (Burma)331+1+0Jersey329+431+0Comoros317+32967+0Guyana30015517Burundi269+82071+0Martinique255+0+0Guernsey25223813Guernsey252+0+0Lesotho245+49333+1Eritrea232+0Mongolia230+3Cayman Islands201+01941+0Guadeloupe190+0+0Faroe Islands188+01880Gibraltar180+0Cambodia1651330Bermuda150+01379+0Brunei141+01383+0Trinidad and Tobago133+01178+0Northern Cyprus1131044The Bahamas111+3+0Monaco1094Aruba105+0993Barbados103+5+0Seychelles100+0110Turkmenistan10000Liechtenstein85+0+0Bhutan84760Sint Maarten78+06315+0Antigua and Barbuda74+0573+0Turks and Caicos Islands72122The Gambia64+0343+0French Polynesia62+0600Macao46450Saint Martin44+0+0Belize37+0202+0Saint Vincent and the Grenadines35+0296Fiji26+0180Curaçao25+0241+0Timor-Leste24+0240Grenada23+0230Saint Lucia22+0190New Caledonia21+0210Laos19+0190Åland Islands19Dominica18+0180Saint Kitts and Nevis17150Falkland Islands (Islas Malvinas)13+0130Greenland13+0130Montserrat12101Vatican City12+0120Papua New Guinea1180British Virgin Islands8+071+0Caribbean Netherlands7+0Saint Barthélemy6+0Anguilla3+030Saint Pierre and Miquelon2+010Western Sahara'
I want to get the data like:
[United States,3.43M,+57,942,1M,138K,+282]
[Brazil,1.88M,+20,286,1.21M,72,833,+733]
Tried different things but did not work.
Taking a subset of your string, splitting it into a multi-line string for readability.
First capture the countries to a list, and use this list to obtain all values between countries.
Finally output to a list of lists as desired:
import re
text = """
United States3.43M+57,9421M138K+282Brazil1.88M+20,2861.21M72,833+733
India907K+28,701571K23,727+500Russia734K+6,537504K11,439+104
Peru330K+3,797221K12,054+184Chile318K287K7,024Mexico304K+4,685189K35,491+485
United Kingdom290K+65044,830+21South Africa288K138K4,172
Iran260K+2,349223K13,032+203Spain256K+2,045150K28,406+3
Pakistan254K+2,753171K5,320+69Italy243K+169195K34,967+13
Saudi Arabia235K+2,852170K2,704+20Turkey214K+1,008196K5,382+19
Germany200K+159185K9,138+1
"""
pattern = r"([a-z]{3,}(?: [a-z]{2,})?)"
regex = re.compile(pattern, re.I)
countries = [''.join([i for i in r if not i.isdigit()]).rstrip('K') for r in re.findall(regex, text)]
out = []
for idx, country in enumerate(countries):
if idx < len(countries) -1:
pattern = fr'{country}(.*){countries[idx+1]}'
regex = re.compile(pattern, re.I | re.DOTALL)
result = re.search(regex, text).group(1).strip()
result = result.replace('+', ',+').split(',')
tmp = [country]
for i in result:
tmp.append(i)
out.append(tmp)
else:
result = text.split(countries[-1])[-1].strip()
result = result.replace('+', ',+').split(',')
tmp = [countries[-1]]
for i in result:
tmp.append(i)
out.append(tmp)
for country in out:
print(country)
Returns:
['United States', '3.43M', '+57', '9421M138K', '+282']
['Brazil', '1.88M', '+20', '2861.21M72', '833', '+733']
['India', '907K', '+28', '701571K23', '727', '+500']
['Russia', '734K', '+6', '537504K11', '439', '+104']
['Peru', '330K', '+3', '797221K12', '054', '+184']
['Chile', '318K287K7', '024']
['Mexico', '304K', '+4', '685189K35', '491', '+485']
['United Kingdom', '290K', '+65044', '830', '+21']
['South Africa', '288K138K4', '172']
['Iran', '260K', '+2', '349223K13', '032', '+203']
['Spain', '256K', '+2', '045150K28', '406', '+3']
['Pakistan', '254K', '+2', '753171K5', '320', '+69']
['Italy', '243K', '+169195K34', '967', '+13']
['Saudi Arabia', '235K', '+2', '852170K2', '704', '+20']
['Turkey', '214K', '+1', '008196K5', '382', '+19']
['Germany', '200K', '+159185K9', '138', '+1']
How to convert continent name from country name using pycountry. I have a list of country like this
country = ['India', 'Australia', ....]
And I want to get continent name from it like.
continent = ['Asia', 'Australia', ....]
All the tools you need are provided in pycountry-convert.
Here is an example of how you can create your own function to make a direct conversion from country to continent name, using pycountry's tools:
import pycountry_convert as pc
def country_to_continent(country_name):
country_alpha2 = pc.country_name_to_country_alpha2(country_name)
country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
return country_continent_name
# Example
country_name = 'Germany'
print(country_to_continent(country_name))
Out[1]: Europe
Hope it helps!
Remember you can access function descriptions using '?? function_name'
?? pc.country_name_to_country_alpha2
Looking at the documentation, would something like this do the trick?:
country_alpha2_to_continent_code()
It converts country code(eg: NO, SE, ES) to continent name.
If first you need to acquire the country code you could use:
country_name_to_country_alpha2(cn_name, cn_name_format="default")
to get the country code from country name.
Full example:
import pycountry_convert as pc
country_code = pc.country_name_to_country_alpha2("China", cn_name_format="default")
print(country_code)
continent_name = pc.country_alpha2_to_continent_code(country_code)
print(continent_name)
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2
continents = {
'NA': 'North America',
'SA': 'South America',
'AS': 'Asia',
'OC': 'Australia',
'AF': 'Africa',
'EU': 'Europe'
}
countries = ['India', 'Australia']
[continents[country_alpha2_to_continent_code(country_name_to_country_alpha2(country))] for country in countries]
Don't know what to do with Antarctica continent ¯_(ツ)_/¯
I have a list of weather forecasts that start with a similar prefix that I'd like to remove. I'd also like to capture the city names:
Some Examples:
If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff,
Salt Lake City, Park City, Denver, Estes Park, Colorado Springs,
Pueblo, or Albuquerque, the week will...
If you have vacation or wedding plans for Miami, Jacksonville, Macon,
Charlotte, or Charleston, expect a couple systems...
If you have vacation or wedding plans in Pittsburgh, Philadelphia,
Atlantic City, Newark, Baltimore, D.C., Richmond, Charleston, or
Dover, expect the week...
The strings start with a common prefix "If you have vacation or wedding plans in" and the last city has "or" before it. The list of cities is of variable length.
I've tried this:
>>> text = 'If you have vacation or wedding plans in NYC, Boston, Manchester, Concord, Providence, or Portland'
>>> re.search(r'^If you have vacation or wedding plans in ((\b\w+\b), ?)+ or (\w+)', text).groups()
('Providence,', 'Providence', 'Portland')
>>>
I think I'm pretty close, but obviously it's not working. I've never tried to do something with a variable number of captured items; any guidance would be greatly appreciated.
Alternative solution here (probably just for sharing and educational purposes).
If you were to solve it with nltk, it would be called a Named Entity Recognition problem. Using the snippet based on nltk.chunk.ne_chunk_sents(), provided here:
import nltk
def extract_entity_names(t):
entity_names = []
if hasattr(t, 'label') and t.label:
if t.label() == 'NE':
entity_names.append(' '.join([child[0] for child in t]))
else:
for child in t:
entity_names.extend(extract_entity_names(child))
return entity_names
sample = "If you have vacation or wedding plans in Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will..."
sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
entity_names.extend(extract_entity_names(tree))
print entity_names
It prints exactly the desired result:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
Here is my approach: use the csv module to parse the lines (I assume they are in a text file named data.csv, please change to suite your situation). After parsing each line:
Discard the last cell, it is not a city name
Remove 'If ...' from the first cell
Remove or 'or ' from the last cell (used to be next-to-last)
Here is the code:
import csv
def cleanup(row):
new_row = row[:-1]
new_row[0] = new_row[0].replace('If you have vacation or wedding plans in ', '')
new_row[0] = new_row[0].replace('If you have vacation or wedding plans for ', '')
new_row[-1] = new_row[-1].replace('or ', '')
return new_row
if __name__ == '__main__':
with open('data.csv') as f:
reader = csv.reader(f, skipinitialspace=True)
for row in reader:
row = cleanup(row)
print row
Output:
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'Albuquerque']
['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
['Pittsburgh', 'Philadelphia', 'Atlantic City', 'Newark', 'Baltimore', 'D.C.', 'Richmond', 'Charleston', 'Dover']
import re
s = "If you have vacation or wedding plans for Miami, Jacksonville, Macon, Charlotte, or Charleston, expect a couple systems"
p = re.compile(r"If you have vacation or wedding plans (in|for) ((\w+, )+)or (\w+)")
m = p.match(s)
print m.group(2) # output: Miami, Jacksonville, Macon, Charlotte,
cities = m.group(2).split(", ") # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', '']
cities[-1] = m.group(4) # add the city after or
print cities # cities = ['Miami', 'Jacksonville', 'Macon', 'Charlotte', 'Charleston']
the city can be matched by pattern (\w+, ) and or (\w+)
and split cities by pattern ,
btw, as the pattern is used to many data, it is preferred to work with the compiled object
PS: the word comes after plan can be for or in, according to examples you provide
How about this
>>> text = 'If you have vacation or wedding plans for Phoenix, Tucson, Flagstaff, Salt Lake City, Park City, Denver, Estes Park, Colorado Springs, Pueblo, or Albuquerque, the week will'
>>> match = re.search(r'^If you have vacation or wedding plans (in?|for?) ([\w+ ,]+)',text).groups()[1].split(", ")
Output
>>> match
['Phoenix', 'Tucson', 'Flagstaff', 'Salt Lake City', 'Park City', 'Denver', 'Estes Park', 'Colorado Springs', 'Pueblo', 'or Albuquerque', 'the week will']