I have a dataset to clean and organize. Here is the link to the dataset:
https://github.com/irJERAD/Intro-to-Data-Science-in-Python/blob/master/MyNotebooks/university_towns.txt
I am trying to turn this dataset into a dictionary of the form {State: Town}, for example {'Alabama': 'Auburn', 'Alabama': 'Florence', ..., 'Wyoming': 'Laramie'}.
Here is my code:
import re
univ_towns = open('university_towns.txt',encoding='utf-8').readlines()
state_list = []
d={}
for name in univ_towns:
    if "[ed" in name:
        statename = re.sub('\[edit]\n$', '', name)
        state_list.append(statename)
        len_state = len(state_list)
    elif "(" in name:
        sep = ' ('
        townname = name.split(sep, 1)[0]
        if "," in townname:
            sep = ','
            townname = townname.split(sep, 1)[0]
        d[state_list[len_state-1]] = townname
d
However, the output of my code keeps only the last town for each state in the dictionary. I am sure there is something not right with the loop logic, but I can't figure out what is wrong. Here is the output of my code:
{'Alabama': 'Tuskegee',
'Alaska': 'Fairbanks',
'Arizona': 'Tucson',
'Arkansas': 'Searcy',
'California': 'Whittier',
'Colorado': 'Pueblo',
'Connecticut': 'Willimantic',
'Delaware': 'Newark',
'Florida': 'Tampa',
'Georgia': 'Young Harris',
'Hawaii': 'Manoa',
'Idaho': 'Rexburg',
'Illinois': 'Peoria',
'Indiana': 'West Lafayette',
'Iowa': 'Waverly',
'Kansas': 'Pittsburg',
'Kentucky': 'Wilmore',
'Louisiana': 'Thibodaux',
'Maine': 'Waterville',
'Maryland': 'Westminster',
'Massachusetts': 'Framingham',
'Michigan': 'Ypsilanti',
'Minnesota': 'Winona',
'Mississippi': 'Starkville',
'Missouri': 'Warrensburg',
'Montana': 'Missoula',
'Nebraska': 'Wayne',
'Nevada': 'Reno',
'New Hampshire': 'Rindge',
'New Jersey': 'West Long Branch',
'New Mexico': 'Silver City',
'New York': 'West Point',
'North Carolina': 'Winston-Salem',
'North Dakota': 'Grand Forks',
'Ohio': 'Wilberforce',
'Oklahoma': 'Weatherford',
'Oregon': 'Newberg',
'Pennsylvania': 'Williamsport',
'Rhode Island': 'Providence',
'South Carolina': 'Spartanburg',
'South Dakota': 'Vermillion',
'Tennessee': 'Sewanee',
'Texas': 'Waco',
'Utah': 'Ephraim',
'Vermont': 'Northfield',
'Virginia': 'Chesapeake',
'Washington': 'University District',
'West Virginia': 'West Liberty',
'Wisconsin': 'Whitewater',
'Wyoming': 'Laramie'}
Try using defaultdict:
from collections import defaultdict
d = defaultdict(list)
for name in univ_towns:
    if "[ed" in name:
        # Raw string so the backslashes reach the regex engine unchanged
        statename = re.sub(r'\[edit\]\n$', '', name)
        state_list.append(statename)
        len_state = len(state_list)
    elif "(" in name:
        sep = ' ('
        townname = name.split(sep, 1)[0]
        if "," in townname:
            sep = ','
            townname = townname.split(sep, 1)[0]
        d[state_list[len_state-1]].append(townname)
As you can see, the only major difference is at the end, where append is used instead of =. Assigning with = overwrites the value stored for the state on every iteration, so only the last town survives; appending to a list keeps all of them, which is what you seem to want, unless I'm misunderstanding.
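To see the grouping on its own, here is a minimal sketch that runs without the data file. The sample lines are made up to resemble university_towns.txt, and the names sample_lines, towns_by_state and current_state are illustrative, not taken from the question.

```python
from collections import defaultdict

# Hypothetical sample in the same shape as university_towns.txt
sample_lines = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "Florence (University of North Alabama)\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)[5]\n",
]

towns_by_state = defaultdict(list)
current_state = None
for line in sample_lines:
    if "[edit]" in line:
        # A state header: remember it for the town lines that follow
        current_state = line.split("[edit]", 1)[0].strip()
    elif current_state is not None and line.strip():
        # A town line: keep the text before " (" and before any comma
        town = line.split(" (", 1)[0].split(",", 1)[0].strip()
        towns_by_state[current_state].append(town)

print(dict(towns_by_state))
# {'Alabama': ['Auburn', 'Florence'], 'Wyoming': ['Laramie']}
```

Tracking the current state in a single variable also removes the need for state_list and len_state.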
I'm dealing with user location information from tweets, and I want to get a standardized location tag from this user-entered data. If the location is within the USA, it should return the name of the state; otherwise it should return the country name.
Basically something like:
text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]
output = text.standardize()
output
["New York", "California", "China"]
It should also have some tolerance for users' typos. Is there a recommended library? Any thoughts on this would be really appreciated!
Here's what I would do, and actually did recently in a project with tweets: take a list of the possible states inside the US, then create a function that checks whether a given string contains the name of any state. If so, return the state name. Otherwise, return the last part of the string after a comma.
text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]
states = ['Alaska', 'Alabama', 'Arkansas', 'American Samoa', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts', 'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri', 'Northern Mariana Islands', 'Mississippi', 'Montana', 'National', 'North Carolina', 'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Virgin Islands', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia', 'Wyoming']
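def standardize(text):
    # Return the first state whose name appears in the text
    for state in states:
        if state in text:
            return state
    # No state found: fall back to whatever follows the last comma
    return text.split(", ")[-1]
text_2 = [standardize(i) for i in text]
print(text_2)
# Prints ['New York', 'California', 'China']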
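The substring check above does not cover the typo tolerance the question asks for, and it returns 'Virginia' for a West Virginia location because 'Virginia' is tested first. One possible refinement, sketched here with the standard library's difflib, tries longer names first and then fuzzy-matches the last comma-separated part. The function name standardize_location, the shortened STATES list and the 0.8 cutoff are assumptions for illustration, not part of the answer.

```python
import difflib

# Shortened, hypothetical list; the full list of states would go here
STATES = ["California", "New York", "Texas", "Virginia", "West Virginia"]

def standardize_location(raw, states=STATES, cutoff=0.8):
    """Return a state name if one can be recognised, else the last part of raw."""
    lowered = raw.lower()
    # Longest names first, so "West Virginia" is found before "Virginia"
    for state in sorted(states, key=len, reverse=True):
        if state.lower() in lowered:
            return state
    # No exact hit: fuzzy-match whatever follows the last comma
    last = raw.split(",")[-1].strip()
    close = difflib.get_close_matches(last.title(), states, n=1, cutoff=cutoff)
    return close[0] if close else last

print(standardize_location("Santa Monica, Californa"))    # misspelled state
print(standardize_location("Charleston, West Virginia"))
print(standardize_location("ShanDong, China"))
```

The cutoff trades recall for precision: a lower value accepts worse misspellings but risks mapping a foreign place name onto a state.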
I'm trying to clean a dataset that has the states either as abbreviations or fully spelled out. I need to make them all into abbreviations.
Any cheats to do this?
This is what I've come up with, but I'm still not getting the right output. What am I missing?
states = []
for c in by_state['order state']:
    if len(c)==2:
        states = c.upper()
    else:
        map(abbr.get,c)
Here is an approach.
import re
"""Table to Map States to Abbreviations Courtesy https://gist.github.com/Quenty/74156dcc4e21d341ce52da14a701c40c"""
statename_to_abbr = {
# Other
'District of Columbia': 'DC',
# States
'Alabama': 'AL',
'Montana': 'MT',
'Alaska': 'AK',
'Nebraska': 'NE',
'Arizona': 'AZ',
'Nevada': 'NV',
'Arkansas': 'AR',
'New Hampshire': 'NH',
'California': 'CA',
'New Jersey': 'NJ',
'Colorado': 'CO',
'New Mexico': 'NM',
'Connecticut': 'CT',
'New York': 'NY',
'Delaware': 'DE',
'North Carolina': 'NC',
'Florida': 'FL',
'North Dakota': 'ND',
'Georgia': 'GA',
'Ohio': 'OH',
'Hawaii': 'HI',
'Oklahoma': 'OK',
'Idaho': 'ID',
'Oregon': 'OR',
'Illinois': 'IL',
'Pennsylvania': 'PA',
'Indiana': 'IN',
'Rhode Island': 'RI',
'Iowa': 'IA',
'South Carolina': 'SC',
'Kansas': 'KS',
'South Dakota': 'SD',
'Kentucky': 'KY',
'Tennessee': 'TN',
'Louisiana': 'LA',
'Texas': 'TX',
'Maine': 'ME',
'Utah': 'UT',
'Maryland': 'MD',
'Vermont': 'VT',
'Massachusetts': 'MA',
'Virginia': 'VA',
'Michigan': 'MI',
'Washington': 'WA',
'Minnesota': 'MN',
'West Virginia': 'WV',
'Mississippi': 'MS',
'Wisconsin': 'WI',
'Missouri': 'MO',
'Wyoming': 'WY',
}
def multiple_replace(lookup, text):
    """Replace every key of the lookup table found in text with its value, ignoring case (modification from https://stackoverflow.com/questions/15175142/how-can-i-do-multiple-substitutions-using-regex-in-python)"""
    # Lower-cased keys let any match (new york, NEW YORK, District of Columbia) be looked up directly
    lowered = {name.lower(): abbr for name, abbr in lookup.items()}
    # The re.IGNORECASE flag provides case insensitivity (i.e. matches New York, new york, NEW YORK, etc.)
    regex = re.compile(r'\b(' + '|'.join(map(re.escape, lookup)) + r')\b', re.IGNORECASE)
    # For each match, look up the corresponding value in the dictionary and perform the substitution
    return regex.sub(lambda mo: lowered[mo.group(0).lower()], text)
if __name__ == "__main__":
    text = """United States Census Regions are:
Region 1: Northeast
Division 1: New England (Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont)
Division 2: Mid-Atlantic (New Jersey, New York, and Pennsylvania)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (Illinois, Indiana, Michigan, Ohio, and Wisconsin)
Division 4: West North Central (Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, and South Dakota)"""
    print(multiple_replace(statename_to_abbr, text))
Output Example
United States Census Regions are:
Region 1: Northeast
Division 1: New England (CT, ME, MA, NH, RI, and VT)
Division 2: Mid-Atlantic (NJ, NY, and PA)
Region 2: Midwest (Prior to June 1984, the Midwest Region was designated as the North Central Region.)[7]
Division 3: East North Central (IL, IN, MI, OH, and WI)
Division 4: West North Central (IA, KS, MN, MO, NE, ND, and SD)
Thanks for the help. I've finally found all the answers for my use case, so here is what I needed in case anyone else needs it.
#Creates new dataframe with two columns,removes all the NaN values
by_state = sales[['order state','total']].dropna()
#Map a dictionary of abbreviations to the dataframe
by_state['order state'] = by_state['order state'].map(abbr).fillna(by_state['order state'])
#Map values that were not capitalized correctly
by_state['order state'] = by_state['order state'].apply(lambda x:x.title()).map(abbr).fillna(by_state['order state'])
#Convert all abbreviations to uppercase
by_state['order state'] = by_state['order state'].apply(lambda x:x.upper())
#Remove a period after an abbreviation
by_state['order state'] = by_state['order state'].apply(lambda x:x.split('.')[0])
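The same rules can be sketched for a single value in plain Python, which may make the order of the steps easier to follow. The helper name to_abbreviation is hypothetical, and abbr here is a shortened stand-in for the full name-to-abbreviation dictionary used above.

```python
# Shortened, hypothetical stand-in for the full name-to-abbreviation dictionary
abbr = {"New York": "NY", "California": "CA", "Texas": "TX"}

def to_abbreviation(value, lookup=abbr):
    """Return the two-letter code for a state name or abbreviation."""
    # Drop surrounding whitespace and a trailing period such as "Tx."
    cleaned = value.strip().rstrip(".")
    if len(cleaned) == 2:
        # Already an abbreviation: only the case needs normalising
        return cleaned.upper()
    # Full name: title-case it so "new york" matches the key "New York";
    # values missing from the lookup are returned unchanged
    return lookup.get(cleaned.title(), cleaned)

print([to_abbreviation(s) for s in ["ny", "New York", "california", "Tx.", "Ontario"]])
# ['NY', 'NY', 'CA', 'TX', 'Ontario']
```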
I have a CSV file containing a column "State" which holds US state names in full, like "New Jersey", "California", etc.
I want to modify this column so that it contains abbreviations instead of the full names, like "NJ", "CA"...
To do this, I already have a dictionary that maps each state name to its abbreviation:
us_state_abbrev = {
'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO',
'Connecticut': 'CT', 'Delaware': 'DE', 'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS', 'Kentucky': 'KY', 'Louisiana': 'LA',
'Maine': 'ME', 'Maryland': 'MD', 'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH', 'New Jersey': 'NJ',
'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT', 'Virginia': 'VA', 'Washington': 'WA',
'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'}
How do I loop through the column in my CSV file AND the dictionary and replace the full state name with the abbreviation?
Here's the code I wrote but it doesn't work:
with open(emp_file, 'r', errors='ignore') as fileHandle:
    reader = csv.reader(fileHandle)
    for row in reader:
        for state, abbrev in us_state_abbrev.items():
            if row[4] == state:
                row[4] = abbrev
What am I doing wrong here? Please help.
import pandas as pd
df = pd.read_csv(emp_file)
Then, assuming you know which column you want to edit:
df['State'] = df['State'].map(us_state_abbrev).fillna(df['State'])
Note: the trailing fillna(df['State']) keeps the original value for any State entry that is not present in your dictionary.
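If pandas is not an option, the csv module can do the same job. One likely reason the loop in the question appears to do nothing is that it only changes row in memory and never writes the rows anywhere. The sketch below uses in-memory io.StringIO objects as hypothetical stand-ins for the input and output files, and a column named State as described in the question.

```python
import csv
import io

us_state_abbrev = {"New Jersey": "NJ", "California": "CA"}  # shortened for illustration

# Hypothetical file contents; in practice these would be opened files
source = io.StringIO("Name,State\nAnn,New Jersey\nBob,California\nCy,Guam\n")
target = io.StringIO()

reader = csv.DictReader(source)
writer = csv.DictWriter(target, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    # dict.get() replaces the inner loop; unknown names are kept as they are
    row["State"] = us_state_abbrev.get(row["State"], row["State"])
    writer.writerow(row)

print(target.getvalue())
```

Looking the name up with dict.get() also avoids scanning the whole dictionary for every row.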
I'm getting a really confusing "EOL while scanning string literal" error when trying to run my code. The bit it's pointing at isn't on line 15, and removing line 15 doesn't help. There are no EOLs or reserved characters in the dictionary (checked in a text editor).
What on earth have I done?
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\myScripts\Quiz\quiz.py", line 15
'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New
^
SyntaxError: EOL while scanning string literal
#! python3
# this code generates a random quiz for each member of a class
import random
capitals = {'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix','Arkansas': 'Little Rock', 'California': 'Sacramento', 'Colorado': 'Denver','Connecticut': 'Hartford', 'Delaware': 'Dover', 'Florida': 'Tallahassee','Georgia': 'Atlanta', 'Hawaii': 'Honolulu', 'Idaho': 'Boise', 'Illinois':'Springfield', 'Indiana': 'Indianapolis', 'Iowa': 'Des Moines', 'Kansas':'Topeka', 'Kentucky': 'Frankfort', 'Louisiana': 'Baton Rouge', 'Maine':'Augusta', 'Maryland': 'Annapolis', 'Massachusetts': 'Boston', 'Michigan':'Lansing', 'Minnesota': 'Saint Paul', 'Mississippi': 'Jackson', 'Missouri':'Jefferson City', 'Montana': 'Helena', 'Nebraska': 'Lincoln', 'Nevada':'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New Mexico': 'Santa Fe', 'New York': 'Albany', 'North Carolina': 'Raleigh','North Dakota': 'Bismarck', 'Ohio': 'Columbus', 'Oklahoma': 'Oklahoma City','Oregon': 'Salem', 'Pennsylvania': 'Harrisburg', 'Rhode Island': 'Providence','South Carolina': 'Columbia', 'South Dakota': 'Pierre', 'Tennessee':'Nashville', 'Texas': 'Austin', 'Utah': 'Salt Lake City', 'Vermont':'Montpelier', 'Virginia': 'Richmond', 'Washington': 'Olympia', 'West Virginia': 'Charleston', 'Wisconsin': 'Madison', 'Wyoming': 'Cheyenne'}
# creates 35 quizzes
for quizNum in range(35):
    # creates the files
    quizFile = open('statesquiz%s.txt' % (quizNum + 1), 'w')
    answerFile = open('statesquiz%s_answers.txt' % (quizNum + 1), 'w')
    # writes some stuff in quiz files
    quizFile.write('Name: \n\nDate: \n\n')
    quizFile.write((' ' * 20) + 'State Capitals Quiz (Form %s)' % (quizNum + 1))
    quizFile.write('\n\n')
    # shuffles states
    states = list(capitals.keys())
    random.shuffle(states)
    # loop through states and make a question for each
    for questionNum in range(50):
        # gets right and decoy answers
        correctAnswer = capitals[states[questionNum]]
        wrongAnswers = list(capitals.values())
        del wrongAnswers[wrongAnswers.index(correctAnswer)]
        wrongAnswers = random.sample(wrongAnswers, 3)
        answers = wrongAnswers + [correctAnswer]
        random.shuffle(answers)
        # writes more stuff in quiz file
        quizFile.write('%s. What is the capital of %s?\n' % (questionNum + 1, states[questionNum]))
        for i in range(4):
            quizFile.write(' %s. %s\n' % ('ABCD'[i], answers[i]))
        quizFile.write('\n')
        # writes stuff in answer file
        answerFile.write('%s. %s\n' % (questionNum + 1, 'ABCD'[answers.index(correctAnswer)]))
    quizFile.close()
    answerFile.close()
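The traceback shows the offending line ending at 'New, which suggests (as a guess, since the file itself is not shown) that the editor inserted a hard line break inside the 'New Mexico' string. A quoted string cannot continue onto the next line, whereas a dict literal may be wrapped freely between items. The sketch below checks both cases with compile(); the two source snippets and the helper compiles are made up for illustration, and the exact error message depends on the Python version.

```python
# Two hypothetical snippets: the first breaks a string literal across lines,
# the second breaks the dict literal between items instead
broken = "capitals = {'New Jersey': 'Trenton', 'New\nMexico': 'Santa Fe'}"
wrapped = "capitals = {'New Jersey': 'Trenton',\n            'New Mexico': 'Santa Fe'}"

def compiles(source):
    """Return True if source is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError as error:
        # The wording differs between Python versions
        print("SyntaxError:", error.msg)
        return False

print(compiles(broken))
print(compiles(wrapped))
```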
I'm trying to improve this code, which asks the user for a state's capital when given a state, but I've noticed that sometimes it will repeat a state and ask it twice.
I tried using random.sample instead, but I got the error "TypeError: unhashable type: 'list'". Here is the code that works (but repeats), with the random.sample line commented out:
capitals_dict = {
'Alabama': 'Montgomery',
'Alaska': 'Juneau',
'Arizona': 'Phoenix',
'Arkansas': 'Little Rock',
'California': 'Sacramento',
'Colorado': 'Denver',
'Connecticut': 'Hartford',
'Delaware': 'Dover',
'Florida': 'Tallahassee',
'Georgia': 'Atlanta',
'Hawaii': 'Honolulu',
'Idaho': 'Boise',
'Illinois': 'Springfield',
'Indiana': 'Indianapolis',
'Iowa': 'Des Moines',
'Kansas': 'Topeka',
'Kentucky': 'Frankfort',
'Louisiana': 'Baton Rouge',
'Maine': 'Augusta',
'Maryland': 'Annapolis',
'Massachusetts': 'Boston',
'Michigan': 'Lansing',
'Minnesota': 'St. Paul',
'Mississippi': 'Jackson',
'Missouri': 'Jefferson City',
'Montana': 'Helena',
'Nebraska': 'Lincoln',
'Nevada': 'Carson City',
'New Hampshire': 'Concord',
'New Jersey': 'Trenton',
'New Mexico': 'Santa Fe',
'New York': 'Albany',
'North Carolina': 'Raleigh',
'North Dakota': 'Bismarck',
'Ohio': 'Columbus',
'Oklahoma': 'Oklahoma City',
'Oregon': 'Salem',
'Pennsylvania': 'Harrisburg',
'Rhode Island': 'Providence',
'South Carolina': 'Columbia',
'South Dakota': 'Pierre',
'Tennessee': 'Nashville',
'Texas': 'Austin',
'Utah': 'Salt Lake City',
'Vermont': 'Montpelier',
'Virginia': 'Richmond',
'Washington': 'Olympia',
'West Virginia': 'Charleston',
'Wisconsin': 'Madison',
'Wyoming': 'Cheyenne',
}
import random
states = list(capitals_dict.keys())
for i in [1, 2, 3, 4, 5]:
    state = random.choice(states)
    #state = random.sample(states, 5)
    capital = capitals_dict[state]
    capital_guess = input('What is the capital of ' + state + '?')
    if capital_guess == capital:
        print('Correct! Nice job!')
    else:
        print('Incorrect. The Capital of ' + state + ' is ' + capital + '.')
print('All done.')
I also tried just using the dictionary name capitals_dict like this:
random.sample(capitals_dict, 5)
but I got a different error then found out that I can't use dictionaries like that.
You can create a list of all keys in the dictionary by passing the dictionary to the list() function first, then sample from that list:
sample = random.sample(list(capitals_dict), 5)
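On Python versions before 3.11 you could also pass in the dict.keys() dictionary view:
sample = random.sample(capitals_dict.keys(), 5)
but internally random.sample() just converted that to a sequence too (a tuple()), so using list() was the more efficient choice even then. Since Python 3.11 the population must be a sequence, so passing the view raises a TypeError and list() is required.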
The exception you encountered actually tells you this:
>>> random.sample(capitals_dict, 5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.4/random.py", line 311, in sample
    raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
TypeError: Population must be a sequence or set. For dicts, use list(d).
#                                                ^^^^^^^^^^^^^^^^^^^^^^^
Demo:
>>> import random
>>> capitals_dict = {
... 'Alabama': 'Montgomery',
... 'Alaska': 'Juneau',
... 'Arizona': 'Phoenix',
... 'Arkansas': 'Little Rock',
... 'California': 'Sacramento',
... # ... elided ...
... }
>>>
>>> random.sample(list(capitals_dict), 5)
['Maryland', 'Mississippi', 'Wisconsin', 'Texas', 'West Virginia']
To incorporate that into your code:
import random
for state in random.sample(list(capitals_dict), 5):
    capital = capitals_dict[state]
    capital_guess = input('What is the capital of {}?'.format(state))
    if capital_guess == capital:
        print('Correct! Nice job!')
    else:
        print('Incorrect. The Capital of {} is {}.'.format(state, capital))
I also replaced your string concatenations with str.format() calls to put values into string templates instead.
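The difference between the two functions is easy to check without any user input. This is a small sketch with a shortened, made-up list of states: random.choice() draws independently on every call, while random.sample() draws without replacement.

```python
import random

# Shortened list for illustration
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"]

# random.choice() draws independently each time, so a state can come up again
with_repeats = [random.choice(states) for _ in range(5)]

# random.sample() draws without replacement, so the five states are distinct
without_repeats = random.sample(states, 5)

print(with_repeats)
print(without_repeats)
```

The unhashable-list error in the question most likely came from using the list returned by random.sample(states, 5) as a single dictionary key; iterating over that list, as above, avoids it.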
Try doing it this way, which just samples the state names:
import random
num_queries = 5
# list() is needed because random.sample() requires a sequence on Python 3.11+
for state in random.sample(list(capitals_dict.keys()), num_queries):
    capital = capitals_dict[state]
    capital_guess = input('What is the capital of ' + state + '?')
    if capital_guess == capital:
        print('Correct! Nice job!')
    else:
        print('Incorrect. The Capital of ' + state + ' is ' + capital + '.')
print('All done.')
You could also use:
for state in random.sample(list(capitals_dict), num_queries):
because list(dictionary) implicitly returns a list of the dictionary's keys, but I prefer making what's going on explicit.
If anyone reading this wants a decent US state capitals quizzer, I updated the code to track the user's score. It asks about all 50 states in a random order, and it lets you skip a question or quit at any time.
capitals_dict = {
'Alabama': 'Montgomery',
'Alaska': 'Juneau',
'Arizona': 'Phoenix',
'Arkansas': 'Little Rock',
'California': 'Sacramento',
'Colorado': 'Denver',
'Connecticut': 'Hartford',
'Delaware': 'Dover',
'Florida': 'Tallahassee',
'Georgia': 'Atlanta',
'Hawaii': 'Honolulu',
'Idaho': 'Boise',
'Illinois': 'Springfield',
'Indiana': 'Indianapolis',
'Iowa': 'Des Moines',
'Kansas': 'Topeka',
'Kentucky': 'Frankfort',
'Louisiana': 'Baton Rouge',
'Maine': 'Augusta',
'Maryland': 'Annapolis',
'Massachusetts': 'Boston',
'Michigan': 'Lansing',
'Minnesota': 'St. Paul',
'Mississippi': 'Jackson',
'Missouri': 'Jefferson City',
'Montana': 'Helena',
'Nebraska': 'Lincoln',
'Nevada': 'Carson City',
'New Hampshire': 'Concord',
'New Jersey': 'Trenton',
'New Mexico': 'Santa Fe',
'New York': 'Albany',
'North Carolina': 'Raleigh',
'North Dakota': 'Bismarck',
'Ohio': 'Columbus',
'Oklahoma': 'Oklahoma City',
'Oregon': 'Salem',
'Pennsylvania': 'Harrisburg',
'Rhode Island': 'Providence',
'South Carolina': 'Columbia',
'South Dakota': 'Pierre',
'Tennessee': 'Nashville',
'Texas': 'Austin',
'Utah': 'Salt Lake City',
'Vermont': 'Montpelier',
'Virginia': 'Richmond',
'Washington': 'Olympia',
'West Virginia': 'Charleston',
'Wisconsin': 'Madison',
'Wyoming': 'Cheyenne',
}
import random
counterQuestions = 0 # Represents the number of questions asked to the user
counterCorrect = 0 # Represents the number of correct answers
print('Enter the name of the State Capital with proper spelling. Enter "skip" to skip or "quit" to quit')
for state in random.sample(list(capitals_dict), 50):
    capital = capitals_dict[state]
    capital_guess = input('What is the capital of {}? '.format(state))
    if capital_guess == 'skip':
        #print('The Capital of {} is {}.'.format(state, capital)) #study mode - use comment feature to turn this on/off.
        counterQuestions = counterQuestions + 1
        continue
    elif capital_guess == 'quit':
        break
    elif capital_guess == capital:
        print('Correct! Nice job!')
        counterCorrect = counterCorrect + 1
        counterQuestions = counterQuestions + 1
    else:
        print('Incorrect. The Capital of {} is {}.'.format(state, capital))
        counterQuestions = counterQuestions + 1
# Avoid dividing by zero if the user quits before answering anything
score = (counterCorrect / counterQuestions) * 100 if counterQuestions else 0
counterIncorrect = counterQuestions - counterCorrect
print('All done. Your score is ' + str(score) + '% correct, or ' + str(counterCorrect) + ' out of ' + str(counterQuestions) + ' (' + str(counterIncorrect) + ' incorrect)')
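Because the loop reads from input(), it is awkward to check the skip, quit and scoring rules automatically. One way around that, sketched here under the assumption that the answers can be supplied as a list, is to move the counting into a function. The name run_quiz and the three-question sample are hypothetical.

```python
def run_quiz(questions, answers):
    """Score answers against (state, capital) pairs; returns (correct, asked)."""
    correct = asked = 0
    for (state, capital), guess in zip(questions, answers):
        if guess == 'quit':
            break
        asked += 1  # skipped and wrong answers still count as asked
        if guess == capital:
            correct += 1
    return correct, asked

questions = [('Alabama', 'Montgomery'), ('Alaska', 'Juneau'), ('Texas', 'Austin')]
correct, asked = run_quiz(questions, ['Montgomery', 'skip', 'quit'])
# Guard against dividing by zero when the user quits straight away
score = correct / asked * 100 if asked else 0
print(correct, asked, score)
# 1 2 50.0
```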