How to format strings properly in Python?

I have some data scraped from a website in a string as shown below:
myDatastr =
'United States3.43M+57,9421M138K+282Brazil1.88M+20,2861.21M72,833+733India907K+28,701571K23,727+500Russia734K+6,537504K11,439+104Peru330K+3,797221K12,054+184Chile318K287K7,024Mexico304K+4,685189K35,491+485United Kingdom290K+65044,830+21South Africa288K138K4,172Iran260K+2,349223K13,032+203Spain256K+2,045150K28,406+3Pakistan254K+2,753171K5,320+69Italy243K+169195K34,967+13Saudi Arabia235K+2,852170K2,704+20Turkey214K+1,008196K5,382+19Germany200K+159185K9,138+1Bangladesh187K+3,09998,3172,391+39France172K78,59730,029Colombia154K+3,83265,8095,455+148Canada108K71,8418,790Qatar104K101K149Argentina103K+3,0991,903+58Mainland China85,568+3Egypt83,00124,9753,935Iraq79,73546,9883,250Indonesia76,981+1,28236,6893,656+50Sweden75,826+315,536+0Ecuador68,459+5895,9005,063+16Belarus65,114+18255,492468+4Belgium62,707Kazakhstan59,899+1,64634,190375+0Oman58,179+1,31837,257259+9Philippines57,006+74720,3711,599+65Kuwait55,50845,356393United Arab Emirates55,19845,513334Ukraine54,13326,5031,398Netherlands51,308+1016,156+0Bolivia49,25015,2941,866Panama47,17323,919932Portugal46,81831,0651,662Singapore46,283+32242,54126+0Dominican Republic45,50622,441903Israel40,632+1,33619,395365+1Poland38,190+29926,0481,576+5Afghanistan34,45521,2541,012Nigeria33,513+59513,671744+4Bahrain33,47628,425104Romania32,94821,6921,901Switzerland32,94629,6001,686Armenia32,151+18219,865573+8Guatemala29,7424,3211,244Honduras28,579+489789+15Ireland25,638+1023,3491,746+0Ghana24,98821,067139Azerbaijan24,570+52015,640313+8Japan21,868Algeria19,689+49414,0191,018+7Moldova19,439+17412,793649+2Austria18,94817,000708Serbia18,360Nepal16,945+1443,65238+0Morocco15,93612,934255Cameroon15,173+25711,928359+0Uzbekistan13,591+4878,03063+3South Korea13,512+3312,282289+0Czechia13,2388,373353Denmark13,147609Côte d'Ivoire12,766Kyrgyzstan11,538+488149+15Kenya10,2942,946197Sudan10,250+0+0Australia9,980+1837,769108+0El Salvador9,9785,755267Venezuela9,707+2422,67193+4Norway8,984+08,138253+0Malaysia8,725+78,520122Senegal8,198+1215,514150+3North 
Macedonia8,1974,326385Democratic Republic of the Congo8,075+623,620190+0Costa Rica8,0362,30431Ethiopia7,7662,430128Bulgaria7,525Finland7,2956,800329Palestine7,037Bosnia and Herzegovina6,981+4823,179226+5Haiti6,727+732,924139+4Tajikistan6,551+46+0French Guiana6,170+22129+3Guinea6,141+974,86237+0Gabon5,942+03,00446+0Mauritania5,275+149+3Kosovo5,118+1872,370108+6Djibouti4,972+4+0Luxembourg4,956+314,183111+0Madagascar4,867+289+1Central African Republic4,288+0+0Hungary4,247+133,073595+0Greece3,826+311,374193+0Croatia3,775+532,514119+0Albania3,5712,01495Thailand3,220+33,09058+0Equatorial Guinea3,071+084251+0Somalia3,059+81,30693+1Paraguay2,9801,29325Nicaragua2,846+01,75091+0Maldives2,762+672,29013+0Mayotte2,711+0+0Sri Lanka2,646+1061,98111+0Malawi2,43074739Cuba2,428+62,26887+0Mali2,411Lebanon2,334+166+0South Sudan2,148+9+0Republic of the Congo2,103+0+0Estonia2,0141,89569Slovakia1,90228Iceland1,90010Zambia1,895+01,34842+0Lithuania1,869Guinea-Bissau1,842+077326+0Slovenia1,841+14+0Cape Verde1,698+75+0Sierra Leone1,642+71,17563+0New Zealand1,545+022+0Hong Kong1,522+521,2178+1Yemen1,498+33424+7Libya1,433Benin1,378+0+0Eswatini1,351Rwanda1,337+38+1Tunisia1,263Montenegro1,221Jordan1,179+3+0Latvia1,173+01,01930+0Mozambique1,157+22+0Niger1,099+097868+0Burkina Faso1,033+13Uganda1,025Cyprus1,021+7+0Liberia1,010+12+4Georgia995+9Uruguay98731Zimbabwe985+3+0Chad880+6+1Namibia861+72281+0Andorra85580352Suriname780+3952618Jamaica759+110+0São Tomé and Príncipe727+228414+0Togo720San Marino716+3+0Malta674+06589+0Réunion593+16+0Tanzania509+0+0Angola506Taiwan4514387Syria417+2319+3Botswana399+85381+0Vietnam372+2Mauritius342+033010+0Isle of Man336+031224+0Myanmar (Burma)331+1+0Jersey329+431+0Comoros317+32967+0Guyana30015517Burundi269+82071+0Martinique255+0+0Guernsey25223813Guernsey252+0+0Lesotho245+49333+1Eritrea232+0Mongolia230+3Cayman Islands201+01941+0Guadeloupe190+0+0Faroe Islands188+01880Gibraltar180+0Cambodia1651330Bermuda150+01379+0Brunei141+01383+0Trinidad and Tobago133+01178+0Northern 
Cyprus1131044The Bahamas111+3+0Monaco1094Aruba105+0993Barbados103+5+0Seychelles100+0110Turkmenistan10000Liechtenstein85+0+0Bhutan84760Sint Maarten78+06315+0Antigua and Barbuda74+0573+0Turks and Caicos Islands72122The Gambia64+0343+0French Polynesia62+0600Macao46450Saint Martin44+0+0Belize37+0202+0Saint Vincent and the Grenadines35+0296Fiji26+0180Curaçao25+0241+0Timor-Leste24+0240Grenada23+0230Saint Lucia22+0190New Caledonia21+0210Laos19+0190Åland Islands19Dominica18+0180Saint Kitts and Nevis17150Falkland Islands (Islas Malvinas)13+0130Greenland13+0130Montserrat12101Vatican City12+0120Papua New Guinea1180British Virgin Islands8+071+0Caribbean Netherlands7+0Saint Barthélemy6+0Anguilla3+030Saint Pierre and Miquelon2+010Western Sahara'
I want to get the data like:
[United States,3.43M,+57,942,1M,138K,+282]
[Brazil,1.88M,+20,286,1.21M,72,833,+733]
I tried different things, but none of them worked.

The example below takes a subset of your string, split across several lines for readability.
First capture the country names in a list, then use that list to extract the values between consecutive countries.
Finally, collect the results in a list of lists:
import re
text = """
United States3.43M+57,9421M138K+282Brazil1.88M+20,2861.21M72,833+733
India907K+28,701571K23,727+500Russia734K+6,537504K11,439+104
Peru330K+3,797221K12,054+184Chile318K287K7,024Mexico304K+4,685189K35,491+485
United Kingdom290K+65044,830+21South Africa288K138K4,172
Iran260K+2,349223K13,032+203Spain256K+2,045150K28,406+3
Pakistan254K+2,753171K5,320+69Italy243K+169195K34,967+13
Saudi Arabia235K+2,852170K2,704+20Turkey214K+1,008196K5,382+19
Germany200K+159185K9,138+1
"""
pattern = r"([a-z]{3,}(?: [a-z]{2,})?)"
regex = re.compile(pattern, re.I)
# Capture the country names, dropping any digits and a trailing 'K'
countries = [''.join([i for i in r if not i.isdigit()]).rstrip('K') for r in re.findall(regex, text)]
out = []
for idx, country in enumerate(countries):
    if idx < len(countries) - 1:
        # Everything between this country and the next one
        pattern = fr'{country}(.*){countries[idx+1]}'
        regex = re.compile(pattern, re.I | re.DOTALL)
        result = re.search(regex, text).group(1).strip()
        result = result.replace('+', ',+').split(',')
        tmp = [country]
        for i in result:
            tmp.append(i)
        out.append(tmp)
    else:
        # Last country: take everything after its name
        result = text.split(countries[-1])[-1].strip()
        result = result.replace('+', ',+').split(',')
        tmp = [countries[-1]]
        for i in result:
            tmp.append(i)
        out.append(tmp)
for country in out:
    print(country)
Returns:
['United States', '3.43M', '+57', '9421M138K', '+282']
['Brazil', '1.88M', '+20', '2861.21M72', '833', '+733']
['India', '907K', '+28', '701571K23', '727', '+500']
['Russia', '734K', '+6', '537504K11', '439', '+104']
['Peru', '330K', '+3', '797221K12', '054', '+184']
['Chile', '318K287K7', '024']
['Mexico', '304K', '+4', '685189K35', '491', '+485']
['United Kingdom', '290K', '+65044', '830', '+21']
['South Africa', '288K138K4', '172']
['Iran', '260K', '+2', '349223K13', '032', '+203']
['Spain', '256K', '+2', '045150K28', '406', '+3']
['Pakistan', '254K', '+2', '753171K5', '320', '+69']
['Italy', '243K', '+169195K34', '967', '+13']
['Saudi Arabia', '235K', '+2', '852170K2', '704', '+20']
['Turkey', '214K', '+1', '008196K5', '382', '+19']
['Germany', '200K', '+159185K9', '138', '+1']
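Note that splitting on commas also cuts the thousands separators, so values such as +57,942 end up spread across fields. A minimal sketch of an alternative is to tokenize each record with a regular expression, assuming every value is either an abbreviated number (3.43M, 138K), a comma-grouped number with three digits per group, or a plain run of digits. The names VALUE, RECORD and parse_rows are illustrative and not part of the answer above:

```python
import re

# One value: optional "+", then an abbreviated number (3.43M, 138K),
# a comma-grouped number (57,942) or a plain run of digits.
# The lookahead keeps the "M" of e.g. "Mexico" out of the preceding number.
VALUE = r"\+?(?:\d+(?:\.\d+)?[MK](?![a-z])|\d{1,3}(?:,\d{3})+|\d+)"
# One record: a name (no digits, no "+") followed by one or more values.
RECORD = re.compile(rf"([^\d+]+?)((?:{VALUE})+)")

def parse_rows(text):
    """Return one [name, value, value, ...] list per record found in text."""
    rows = []
    for name, values in RECORD.findall(text):
        rows.append([name.strip()] + re.findall(VALUE, values))
    return rows

sample = "United States3.43M+57,9421M138K+282Brazil1.88M+20,2861.21M72,833+733"
for row in parse_rows(sample):
    print(row)
# ['United States', '3.43M', '+57,942', '1M', '138K', '+282']
# ['Brazil', '1.88M', '+20,286', '1.21M', '72,833', '+733']
```

The flattened string is ambiguous wherever a plain number directly precedes a comma-grouped one (for example +65044,830 in the United Kingdom row, which should read +650 and 44,830), so this sketch misparses such rows. Scraping each table cell separately avoids the problem.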

Related

How does the ISRI Stemmer give better stem words than Lancaster or Snowball Stemmer

I have this sample text which I want to tokenize and then reduce to stem words:
sample_text = "'I am a student from the University of Alabama. \
I was born in Ontario, Canada and I am a huge fan of the United States. \
I am going to get a degree in Philosophy to improve\
my chances of becoming a Philosophy professor. \
I have been working towards this goal for 4 years. \
I am currently enrolled in a PhD program. \
It is very difficult, but I am confident that it will be a good decision'"
Using the Lancaster stemmer I get the following result:
import re
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
sentences = sent_tokenize(sample_text)
lancaster = LancasterStemmer()
for i in range(len(sentences)):
    # Replace punctuation with spaces, then tokenize
    sentences[i] = re.sub('[^A-Za-z0-9]', ' ', sentences[i])
    sentences[i] = word_tokenize(sentences[i])
    # Drop English stop words, then stem what is left
    stopwds = [word.lower() for word in stopwords.words('english')]
    sentences[i] = [word.lower() for word in sentences[i] if word.lower() not in stopwds]
    sentences[i] = [lancaster.stem(word) for word in sentences[i]]
    print(sentences[i])
Output of Lancaster Stemmer:
['stud', 'univers', 'alabam']
['born', 'ontario', 'canad', 'hug', 'fan', 'unit', 'stat']
['going', 'get', 'degr', 'philosoph', 'improvemy', 'chant', 'becom', 'philosoph', 'profess']
['work', 'toward', 'goal', '4', 'year']
['cur', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decid']
Output with Snowball stemmer -
['student', 'univers', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'unit', 'state']
['go', 'get', 'degre', 'philosophi', 'improvemi', 'chanc', 'becom', 'philosophi', 'professor']
['work', 'toward', 'goal', '4', 'year']
['current', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decis']
Output of Porter Stemmer
['student', 'univers', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'unit', 'state']
['go', 'get', 'degre', 'philosophi', 'improvemi', 'chanc', 'becom', 'philosophi', 'professor']
['work', 'toward', 'goal', '4', 'year']
['current', 'enrol', 'phd', 'program']
['difficult', 'confid', 'good', 'decis']
Whereas the ISRI stemmer gives me almost the same results as if the words had been lemmatized:
sentences = sent_tokenize(sample_text)
from nltk.stem import ISRIStemmer
isri = ISRIStemmer()
for i in range(len(sentences)):
    sentences[i] = re.sub('[^A-Za-z0-9]', ' ', sentences[i])
    sentences[i] = word_tokenize(sentences[i])
    stopwds = [word.lower() for word in stopwords.words()]
    sentences[i] = [word.lower() for word in sentences[i] if word.lower() not in stopwds]
    sentences[i] = [isri.stem(word) for word in sentences[i]]
    print(sentences[i])
Output :
['student', 'university', 'alabama']
['born', 'ontario', 'canada', 'huge', 'fan', 'united', 'states']
['going', 'get', 'degree', 'philosophy', 'improvemy', 'chances', 'becoming', 'philosophy', 'professor']
['working', 'towards', 'goal', '4', 'years']
['currently', 'enrolled', 'phd', 'program']
['difficult', 'confident', 'good', 'decision']
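Can someone explain how the ISRI stemmer gives almost lemmatized words?
This stemmer was developed specifically for Arabic, so it leaves English words essentially unchanged. That is why the output looks lemmatized when you use it on English: apart from lowercasing and stop-word removal, the words are the ones you passed in.
Here is a link to the NLTK source: https://www.nltk.org/_modules/nltk/stem/isri.html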

how to for each item in a list create class objects by passing parameters from functions

I defined a few functions (bodies omitted to keep the question short):
def policyname(i):
    return policyname
def policytype(i):
    return policytype
def active(i):
    return active
def backupselection(i):
    return backupselection
I defined a list:
clients = ['winwebint16', 'winwebtpie03', 'winwebtpie04', 'winwtsdt08', 'winwtsmwg03', 'winwtsqnr03', 'winwtswrl37', 'winwtswrl60', 'winwtswrl62', 'winwtswrl63', 'winwtswrl75', 'winwtszsim03', 'winwww0016','winsbk0100', 'winsbk0100a0', 'winsbk0100a1', 'winsbk0101', 'iinf065', 'iinf130', 'iinf185', 'iinf2126', 'inbf005', 'inis001', 'ipdataisbic01', 'ipdataisbic02', 'ipdataispre01', 'ipdataispre02', 'iproip02', 'isis002', 'isyn002', 'isyn006', 'isyn011', 'isyn012','isyn014', 'isyn038', 'isyn039', 'isyn040', 'mu2ssql1001', 'mu2ssql1003', 'macrsz0001', 'macrsz0005']
I defined a class:
class client():
    def __init__(self,policyname,policytype,active,backupselection):
        self.policyname = policyname
        self.policytype = policytype
        self.active = active
        self.backupselection = backupselection
For each item in the clients list, I want to create a class object by passing parameters from the functions.
Is the code below correct?
for i in clients:
    i = client(policyname(i),policytype(i),active(i),backupselection(i))
With the above code, will I be able to access specific objects like this?
print(winwebint16.policyname)
print(winwebint16.policytype)
print(winwebint16.active)
print(winwebint16.backupselection)
Class names are capitalized by convention, but that is only a convention. More importantly, there is a problem with the for loop:
for i in clients:
    i = client(policyname(i),policytype(i),active(i),backupselection(i))
Each iteration only rebinds the loop variable i, so no variable named winwebint16 is ever created, and after the loop only the last object is still reachable. Store the objects in a list or dict instead:
client_list: dict = {}
for i in clients:
    client_list[i] = client(policyname(i),policytype(i),active(i),backupselection(i))
You should then be able to print like this:
print(client_list['winwebint16'].policyname)
This worked for me:
clients = ['669165933', '963881480', '341417157', '514321792', '115456712', '547995746', '135425221', '871543967', '770463311', '616607081', '814711606', '939825713']
policynames = ['Tuvalu', 'Grenada', 'Russia', 'Sao Tome and Principe', 'Rwanda', 'Solomon Islands', 'Angola', 'Burkina Faso', 'Republic of the Congo', 'Senegal', 'Kyrgyzstan', 'Cape Verde']
policytypes = ['Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Online', 'Offline']
actives = ['Baby Food', 'Cereal', 'Office Supplies', 'Fruits', 'Office Supplies', 'Baby Food', 'Household', 'Vegetables', 'Personal Care', 'Cereal', 'Vegetables', 'Clothes']
backupselections = ['H', 'C', 'L', 'C', 'L', 'C', 'M', 'H', 'M', 'H', 'H', 'H']
def policyname(i):
    return policynames[i]
def policytype(i):
    return policytypes[i]
def active(i):
    return actives[i]
def backupselection(i):
    return backupselections[i]
class client():
    def __init__(self,policyname,policytype,active,backupselection):
        self.policyname = policyname
        self.policytype = policytype
        self.active = active
        self.backupselection = backupselection
for i in range(0,len(clients)):
    i = client(policyname(i),policytype(i),active(i),backupselection(i))
    print(i.policyname,i.policytype,i.active,i.backupselection)
Tuvalu Offline Baby Food H
Grenada Online Cereal C
Russia Offline Office Supplies L
Sao Tome and Principe Online Fruits C
Rwanda Offline Office Supplies L
Solomon Islands Online Baby Food C
Angola Offline Household M
Burkina Faso Online Vegetables H
Republic of the Congo Offline Personal Care M
Senegal Online Cereal H
Kyrgyzstan Online Vegetables H
Cape Verde Offline Clothes H
But I still can't print using the client ID:
print(669165933.policyname)
File "<ipython-input-20-72da397c032b>", line 1
print(669165933.policyname)
^
SyntaxError: invalid syntax
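The SyntaxError arises because 669165933.policyname is not a variable lookup: an identifier cannot start with a digit, so Python reads 669165933. as a numeric literal. The client IDs are data rather than variable names, so one option, in the spirit of the dict suggestion above, is to key the objects by ID. This is a sketch only: the Client dataclass and the client_by_id name are illustrative, and the lists are truncated to three entries.

```python
from dataclasses import dataclass

@dataclass
class Client:
    policyname: str
    policytype: str
    active: str
    backupselection: str

# First three entries of the lists from the post above
clients = ['669165933', '963881480', '341417157']
policynames = ['Tuvalu', 'Grenada', 'Russia']
policytypes = ['Offline', 'Online', 'Offline']
actives = ['Baby Food', 'Cereal', 'Office Supplies']
backupselections = ['H', 'C', 'L']

# zip walks the parallel lists together; the client ID becomes the dict key
client_by_id = {
    cid: Client(*fields)
    for cid, *fields in zip(clients, policynames, policytypes, actives, backupselections)
}

print(client_by_id['669165933'].policyname)   # Tuvalu
print(client_by_id['963881480'].policytype)   # Online
```

Looking an object up by its ID string replaces the attempt to use the ID as a variable name.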

Get continent name from country using pycountry

How can I get the continent name from a country name using pycountry? I have a list of countries like this:
country = ['India', 'Australia', ....]
And I want to get the continent names from it, like this:
continent = ['Asia', 'Australia', ....]
All the tools you need are provided in pycountry-convert.
Here is an example of how you can write your own function to convert a country name directly to a continent name, using pycountry-convert's tools:
import pycountry_convert as pc
def country_to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name
# Example
country_name = 'Germany'
print(country_to_continent(country_name))
Out[1]: Europe
Hope it helps!
Remember that in IPython you can view a function's description and source using '??function_name':
??pc.country_name_to_country_alpha2
Looking at the documentation, would something like this do the trick?
country_alpha2_to_continent_code()
It converts a country code (e.g. NO, SE, ES) to a continent code.
If you first need to get the country code, you could use:
country_name_to_country_alpha2(cn_name, cn_name_format="default")
to get the country code from the country name.
Full example:
import pycountry_convert as pc
country_code = pc.country_name_to_country_alpha2("China", cn_name_format="default")
print(country_code)
continent_code = pc.country_alpha2_to_continent_code(country_code)
print(continent_code)
from pycountry_convert import country_alpha2_to_continent_code, country_name_to_country_alpha2
continents = {
    'NA': 'North America',
    'SA': 'South America',
    'AS': 'Asia',
    'OC': 'Australia',
    'AF': 'Africa',
    'EU': 'Europe'
}
countries = ['India', 'Australia']
[continents[country_alpha2_to_continent_code(country_name_to_country_alpha2(country))] for country in countries]
Don't know what to do with Antarctica continent ¯_(ツ)_/¯

More efficient way to clean dataframe than loc

My code looks like:
import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
c_df.loc[c_df['Country'] == 'Korea, Rep.'] = 'South Korea'
c_df.loc[c_df['Country'] == 'United States of America20'] = 'United States'
c_df.loc[c_df['Country'] == 'United Kingdom of Great Britain and Northern Ireland'] = 'United Kingdom'
c_df.loc[c_df['Country'] == 'China, Hong Kong Special Administrative Region'] = 'Hong Kong'
c_df.loc[c_df['Country'] == 'Venezuela (Bolivarian Republic of)'] = 'Venezuela'
c_df.loc[c_df['Country'] == 'Bolivia (Plurinational State of)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Switzerland17'] = 'Switzerland'
c_df.loc[c_df['Country'] == 'Australia1'] = 'Australia'
c_df.loc[c_df['Country'] == 'China2'] = 'China'
c_df.loc[c_df['Country'] == 'Falkland Islands (Malvinas)'] = 'Bolivia'
c_df.loc[c_df['Country'] == 'Greenland7'] = 'Greenland'
c_df.loc[c_df['Country'] == 'Iran (Islamic Republic of)'] = 'Iran'
c_df.loc[c_df['Country'] == 'Italy9'] = 'Italy'
c_df.loc[c_df['Country'] == 'Japan10'] = 'Japan'
c_df.loc[c_df['Country'] == 'Kuwait11'] = 'Kuwait'
c_df.loc[c_df['Country'] == 'Micronesia (Federal States of)'] = 'Micronesia'
c_df.loc[c_df['Country'] == 'Netherlands12'] = 'Netherlands'
c_df.loc[c_df['Country'] == 'Portugal13'] = 'Portugal'
c_df.loc[c_df['Country'] == 'Saudi Arabia14'] = 'Saudi Arabia'
c_df.loc[c_df['Country'] == 'Serbia15'] = 'Serbia'
c_df.loc[c_df['Country'] == 'Sint Maarteen (Dutch part)'] = 'Sint Marteen'
c_df.loc[c_df['Country'] == 'Spain16'] = 'Spain'
c_df.loc[c_df['Country'] == 'Ukraine18'] = 'Ukraine'
c_df.loc[c_df['Country'] == 'Denmark5'] = 'Denmark'
c_df.loc[c_df['Country'] == 'France6'] = 'France'
c_df.loc[c_df['Country'] == 'Indonesia8'] = 'Indonesia'
I feel like there must be an easier way to change the values of the countries with parentheses and numbers in their names. What pandas method can I use to look within the column for names with numbers or parentheses? isin?
You can start by getting rid of numbers and text in parentheses. After that, for everything else that requires non-trivial replacement, declare a map and apply it using pd.Series.replace.
mapper = {'Korea, Rep.' : 'South Korea', 'Falkland Islands' : 'Bolivia', ...}
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['Country'] = (
    df['Country'].str.replace(r'\d+|\s*\(.*\)', '', regex=True).str.strip().replace(mapper)
)
Simple enough, done.
Details
\d+    # one or more digits
|      # regex OR pipe
\s*    # zero or more whitespace characters
\(     # literal opening parenthesis
.*     # match anything
\)     # literal closing parenthesis
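To see what the pattern does without loading the spreadsheet, here is a small sketch using only the standard re module on a few of the names from the question. The clean_country helper and the one-entry mapper are illustrative assumptions, not part of the answer:

```python
import re

# Same pattern as above: digits, or optional whitespace plus a parenthesized group
pattern = re.compile(r'\d+|\s*\(.*\)')
# Hypothetical one-entry subset of the replacement map
mapper = {'Korea, Rep.': 'South Korea'}

def clean_country(name):
    """Strip footnote digits and parenthesized text, then apply the map."""
    stripped = pattern.sub('', name).strip()
    return mapper.get(stripped, stripped)

names = ['Switzerland17', 'Bolivia (Plurinational State of)', 'Korea, Rep.', 'France6']
print([clean_country(n) for n in names])
# ['Switzerland', 'Bolivia', 'South Korea', 'France']
```

Names the pattern cannot fix on its own, such as 'Korea, Rep.', are the ones that still need an explicit entry in the map.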
Using a dictionary and then Series.replace:
dict_to_replace = {'Korea, Rep.':'South Korea',
                   'United States of America20':'United States',
                   'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
                   ...}
c_df['Country'] = c_df['Country'].replace(dict_to_replace)

Failing to append to dictionary. Python

I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
#I hope the '\n' is sufficient or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"
def format(data):
    dic = {}
    for line in data.splitlines():
        #print('Line:', line)
        if ':' in line:
            info = line.split(': ', 1)[1].rstrip() #does not work with files
            #print('Info: ', info)
            if ' Name:' in info: #middle name problems! /maiden name
                dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
                dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
            elif 'DOB' in info: #overhang
                dic['DD'] = info.split('/', 2)[0].rstrip()
                dic['MM'] = info.split('/', 2)[1].rstrip()
                dic['YY'] = info.split('/', 2)[2].rstrip()
            elif 'Address' in info:
                dic['STREET'] = info.split(', ', 2)[0].rstrip()
                dic['CITY'] = info.split(', ', 2)[1].rstrip()
                dic['ZIP'] = info.split(', ', 2)[2].rstrip()
    return dic
if __name__ == '__main__':
    x = format(example)
    for v, k in x.iteritems():
        print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address all are part of the line before the :.
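A short illustration of the point, using the first line of the sample input (the variable names follow the question's code, and label is added here for illustration):

```python
line = "Name: Bugs Bunny"

# What the question's code inspects: only the text after the colon
info = line.split(': ', 1)[1].rstrip()
print(info)                # Bugs Bunny
print(' Name:' in info)    # False, so no branch ever runs for this line

# The label the conditions are looking for sits before the colon
label = line.split(': ', 1)[0]
print(label == 'Name')     # True
```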
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
    dic = {}
    for line in data.splitlines():
        if ':' not in line:
            continue
        name, _, value = line.partition(':')
        name = name.strip()
        if name == 'Name':
            dic['F_NAME'], dic['L_NAME'] = value.split(None, 1) # strips whitespace for us
        elif name == 'DOB':
            dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
        elif name == 'Address':
            dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
    return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
 'DD': '01',
 'F_NAME': 'Bugs',
 'L_NAME': 'Bunny',
 'MM': '04',
 'STREET': '111 Jokes Drive',
 'YY': '1900',
 'ZIP': 'CA 11111, United States'}
