Standardizing the location information of tweet data - python

I'm dealing with the user location information from tweets, and I want to get a standardized location tag from this user-entered data. If the location is within the USA, it should return the name of the state; otherwise it should return the country name.
Basically something like:
text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]
output = text.standardize()
output
["New York", "California", "China"]
It should also have some tolerance for users' typos. Is there any recommended library? Any thoughts on this would be really appreciated!

Here's what I would do, and what I actually did recently in a project with tweets: take a list of the possible states inside the US, then create a function that checks whether a given string contains the name of any state. If so, return the state name. Otherwise, return the last part of the string after a comma.
text = ["New York, NY, USA", "Santa Monica, California", "ShanDong, China"]
states = ['Alaska', 'Alabama', 'Arkansas', 'American Samoa', 'Arizona', 'California', 'Colorado', 'Connecticut', 'District of Columbia', 'Delaware', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Iowa', 'Idaho', 'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts', 'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri', 'Northern Mariana Islands', 'Mississippi', 'Montana', 'National', 'North Carolina', 'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Virgin Islands', 'Vermont', 'Washington', 'Wisconsin', 'West Virginia', 'Wyoming']
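def standardize(text):
    # Return the first state whose name appears in the text.
    for state in states:
        if state in text:
            return state
    # No state found: fall back to the last comma-separated part.
    return text.split(", ")[-1]
text_2 = [standardize(i) for i in text]
print(text_2)
# Prints ['New York', 'California', 'China']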
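The question also asks for tolerance to typos, which plain substring matching does not provide. One possible approach, sketched here with the standard library's difflib.get_close_matches, is to compare each comma-separated part against the state names. The shortened STATES list, the helper name standardize_fuzzy, and the 0.8 cutoff are assumptions made for illustration, not part of the answer above.

```python
import difflib

# Shortened list for illustration; use the full list of states in practice.
STATES = ["California", "New York", "Texas", "Washington"]

def standardize_fuzzy(text, cutoff=0.8):
    """Return the closest state name found in any comma-separated part,
    otherwise the last part (assumed to be the country)."""
    parts = [p.strip() for p in text.split(",") if p.strip()]
    for part in parts:
        # get_close_matches returns the best candidates above the cutoff.
        matches = difflib.get_close_matches(part.title(), STATES, n=1, cutoff=cutoff)
        if matches:
            return matches[0]
    return parts[-1] if parts else text

samples = ["New York, NY, USA", "Santa Monica, Californa", "ShanDong, China"]
print([standardize_fuzzy(s) for s in samples])
```

With this cutoff, the misspelled "Californa" still maps to "California". A lower cutoff tolerates more typos but risks false matches, so it is worth tuning on real data.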


Only getting the last data point of a web page when using my web scraping code

This is probably a newbie problem, but I cannot solve it. I found a couple of different web scraping examples in YouTube tutorials, but each of them gives me only the last data point, not a list of all of them as I want. This is my code (using Jupyter Notebook):
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.scrapethissite.com/pages/simple/').text
soup = BeautifulSoup(html_text, 'lxml')
countrys= soup.find_all('div',class_='col-md-4 country')
for country in countrys:
    country_name = country.find('h3',class_='country-name').text.strip()
    capital = country.find('span',class_='country-capital').text
    population = country.find('span',class_='country-population').text
    data = [country_name, capital, population]
print(data)
Result:
['Zimbabwe', 'Harare', '11651858']
So only the last value of the data (the country list) comes out of the code. How can I get a list of all the data?
You have to create the data variable as a list outside the loop and append each record to it:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.scrapethissite.com/pages/simple/').text
soup = BeautifulSoup(html_text, 'lxml')
countrys= soup.find_all('div',class_='col-md-4 country')
data = [] # <- HERE
for country in countrys:
    country_name = country.find('h3',class_='country-name').text.strip()
    capital = country.find('span',class_='country-capital').text
    population = country.find('span',class_='country-population').text
    data.append([country_name, capital, population]) # <- HERE
print(data)
Output:
[['Andorra', 'Andorra la Vella', '84000'],
['United Arab Emirates', 'Abu Dhabi', '4975593'],
['Afghanistan', 'Kabul', '29121286'],
['Antigua and Barbuda', "St. John's", '86754'],
['Anguilla', 'The Valley', '13254'],
['Albania', 'Tirana', '2986952'],
['Armenia', 'Yerevan', '2968000'],
['Angola', 'Luanda', '13068161'],
['Antarctica', 'None', '0'],
['Argentina', 'Buenos Aires', '41343201'],
['American Samoa', 'Pago Pago', '57881'],
['Austria', 'Vienna', '8205000'],
['Australia', 'Canberra', '21515754'],
['Aruba', 'Oranjestad', '71566'],
['Åland', 'Mariehamn', '26711'],
['Azerbaijan', 'Baku', '8303512'],
['Bosnia and Herzegovina', 'Sarajevo', '4590000'],
['Barbados', 'Bridgetown', '285653'],
['Bangladesh', 'Dhaka', '156118464'],
['Belgium', 'Brussels', '10403000'],
['Burkina Faso', 'Ouagadougou', '16241811'],
['Bulgaria', 'Sofia', '7148785'],
['Bahrain', 'Manama', '738004'],
['Burundi', 'Bujumbura', '9863117'],
['Benin', 'Porto-Novo', '9056010'],
['Saint Barthélemy', 'Gustavia', '8450'],
['Bermuda', 'Hamilton', '65365'],
['Brunei', 'Bandar Seri Begawan', '395027'],
['Bolivia', 'Sucre', '9947418'],
['Bonaire', 'Kralendijk', '18012'],
['Brazil', 'Brasília', '201103330'],
['Bahamas', 'Nassau', '301790'],
['Bhutan', 'Thimphu', '699847'],
['Bouvet Island', 'None', '0'],
['Botswana', 'Gaborone', '2029307'],
['Belarus', 'Minsk', '9685000'],
['Belize', 'Belmopan', '314522'],
['Canada', 'Ottawa', '33679000'],
['Cocos [Keeling] Islands', 'West Island', '628'],
['Democratic Republic of the Congo', 'Kinshasa', '70916439'],
['Central African Republic', 'Bangui', '4844927'],
['Republic of the Congo', 'Brazzaville', '3039126'],
['Switzerland', 'Bern', '7581000'],
['Ivory Coast', 'Yamoussoukro', '21058798'],
['Cook Islands', 'Avarua', '21388'],
['Chile', 'Santiago', '16746491'],
['Cameroon', 'Yaoundé', '19294149'],
['China', 'Beijing', '1330044000'],
['Colombia', 'Bogotá', '47790000'],
['Costa Rica', 'San José', '4516220'],
['Cuba', 'Havana', '11423000'],
['Cape Verde', 'Praia', '508659'],
['Curacao', 'Willemstad', '141766'],
['Christmas Island', 'Flying Fish Cove', '1500'],
['Cyprus', 'Nicosia', '1102677'],
['Czech Republic', 'Prague', '10476000'],
['Germany', 'Berlin', '81802257'],
['Djibouti', 'Djibouti', '740528'],
['Denmark', 'Copenhagen', '5484000'],
['Dominica', 'Roseau', '72813'],
['Dominican Republic', 'Santo Domingo', '9823821'],
['Algeria', 'Algiers', '34586184'],
['Ecuador', 'Quito', '14790608'],
['Estonia', 'Tallinn', '1291170'],
['Egypt', 'Cairo', '80471869'],
['Western Sahara', 'Laâyoune / El Aaiún', '273008'],
['Eritrea', 'Asmara', '5792984'],
['Spain', 'Madrid', '46505963'],
['Ethiopia', 'Addis Ababa', '88013491'],
['Finland', 'Helsinki', '5244000'],
['Fiji', 'Suva', '875983'],
['Falkland Islands', 'Stanley', '2638'],
['Micronesia', 'Palikir', '107708'],
['Faroe Islands', 'Tórshavn', '48228'],
['France', 'Paris', '64768389'],
['Gabon', 'Libreville', '1545255'],
['United Kingdom', 'London', '62348447'],
['Grenada', "St. George's", '107818'],
['Georgia', 'Tbilisi', '4630000'],
['French Guiana', 'Cayenne', '195506'],
['Guernsey', 'St Peter Port', '65228'],
['Ghana', 'Accra', '24339838'],
['Gibraltar', 'Gibraltar', '27884'],
['Greenland', 'Nuuk', '56375'],
['Gambia', 'Bathurst', '1593256'],
['Guinea', 'Conakry', '10324025'],
['Guadeloupe', 'Basse-Terre', '443000'],
['Equatorial Guinea', 'Malabo', '1014999'],
['Greece', 'Athens', '11000000'],
['South Georgia and the South Sandwich Islands', 'Grytviken', '30'],
['Guatemala', 'Guatemala City', '13550440'],
['Guam', 'Hagåtña', '159358'],
['Guinea-Bissau', 'Bissau', '1565126'],
['Guyana', 'Georgetown', '748486'],
['Hong Kong', 'Hong Kong', '6898686'],
['Heard Island and McDonald Islands', 'None', '0'],
['Honduras', 'Tegucigalpa', '7989415'],
['Croatia', 'Zagreb', '4491000'],
['Haiti', 'Port-au-Prince', '9648924'],
['Hungary', 'Budapest', '9982000'],
['Indonesia', 'Jakarta', '242968342'],
['Ireland', 'Dublin', '4622917'],
['Israel', 'None', '7353985'],
['Isle of Man', 'Douglas', '75049'],
['India', 'New Delhi', '1173108018'],
['British Indian Ocean Territory', 'None', '4000'],
['Iraq', 'Baghdad', '29671605'],
['Iran', 'Tehran', '76923300'],
['Iceland', 'Reykjavik', '308910'],
['Italy', 'Rome', '60340328'],
['Jersey', 'Saint Helier', '90812'],
['Jamaica', 'Kingston', '2847232'],
['Jordan', 'Amman', '6407085'],
['Japan', 'Tokyo', '127288000'],
['Kenya', 'Nairobi', '40046566'],
['Kyrgyzstan', 'Bishkek', '5776500'],
['Cambodia', 'Phnom Penh', '14453680'],
['Kiribati', 'Tarawa', '92533'],
['Comoros', 'Moroni', '773407'],
['Saint Kitts and Nevis', 'Basseterre', '51134'],
['North Korea', 'Pyongyang', '22912177'],
['South Korea', 'Seoul', '48422644'],
['Kuwait', 'Kuwait City', '2789132'],
['Cayman Islands', 'George Town', '44270'],
['Kazakhstan', 'Astana', '15340000'],
['Laos', 'Vientiane', '6368162'],
['Lebanon', 'Beirut', '4125247'],
['Saint Lucia', 'Castries', '160922'],
['Liechtenstein', 'Vaduz', '35000'],
['Sri Lanka', 'Colombo', '21513990'],
['Liberia', 'Monrovia', '3685076'],
['Lesotho', 'Maseru', '1919552'],
['Lithuania', 'Vilnius', '2944459'],
['Luxembourg', 'Luxembourg', '497538'],
['Latvia', 'Riga', '2217969'],
['Libya', 'Tripoli', '6461454'],
['Morocco', 'Rabat', '31627428'],
['Monaco', 'Monaco', '32965'],
['Moldova', 'Chişinău', '4324000'],
['Montenegro', 'Podgorica', '666730'],
['Saint Martin', 'Marigot', '35925'],
['Madagascar', 'Antananarivo', '21281844'],
['Marshall Islands', 'Majuro', '65859'],
['Macedonia', 'Skopje', '2062294'],
['Mali', 'Bamako', '13796354'],
['Myanmar [Burma]', 'Naypyitaw', '53414374'],
['Mongolia', 'Ulan Bator', '3086918'],
['Macao', 'Macao', '449198'],
['Northern Mariana Islands', 'Saipan', '53883'],
['Martinique', 'Fort-de-France', '432900'],
['Mauritania', 'Nouakchott', '3205060'],
['Montserrat', 'Plymouth', '9341'],
['Malta', 'Valletta', '403000'],
['Mauritius', 'Port Louis', '1294104'],
['Maldives', 'Malé', '395650'],
['Malawi', 'Lilongwe', '15447500'],
['Mexico', 'Mexico City', '112468855'],
['Malaysia', 'Kuala Lumpur', '28274729'],
['Mozambique', 'Maputo', '22061451'],
['Namibia', 'Windhoek', '2128471'],
['New Caledonia', 'Noumea', '216494'],
['Niger', 'Niamey', '15878271'],
['Norfolk Island', 'Kingston', '1828'],
['Nigeria', 'Abuja', '154000000'],
['Nicaragua', 'Managua', '5995928'],
['Netherlands', 'Amsterdam', '16645000'],
['Norway', 'Oslo', '5009150'],
['Nepal', 'Kathmandu', '28951852'],
['Nauru', 'Yaren', '10065'],
['Niue', 'Alofi', '2166'],
['New Zealand', 'Wellington', '4252277'],
['Oman', 'Muscat', '2967717'],
['Panama', 'Panama City', '3410676'],
['Peru', 'Lima', '29907003'],
['French Polynesia', 'Papeete', '270485'],
['Papua New Guinea', 'Port Moresby', '6064515'],
['Philippines', 'Manila', '99900177'],
['Pakistan', 'Islamabad', '184404791'],
['Poland', 'Warsaw', '38500000'],
['Saint Pierre and Miquelon', 'Saint-Pierre', '7012'],
['Pitcairn Islands', 'Adamstown', '46'],
['Puerto Rico', 'San Juan', '3916632'],
['Palestine', 'None', '3800000'],
['Portugal', 'Lisbon', '10676000'],
['Palau', 'Melekeok', '19907'],
['Paraguay', 'Asunción', '6375830'],
['Qatar', 'Doha', '840926'],
['Réunion', 'Saint-Denis', '776948'],
['Romania', 'Bucharest', '21959278'],
['Serbia', 'Belgrade', '7344847'],
['Russia', 'Moscow', '140702000'],
['Rwanda', 'Kigali', '11055976'],
['Saudi Arabia', 'Riyadh', '25731776'],
['Solomon Islands', 'Honiara', '559198'],
['Seychelles', 'Victoria', '88340'],
['Sudan', 'Khartoum', '35000000'],
['Sweden', 'Stockholm', '9828655'],
['Singapore', 'Singapore', '4701069'],
['Saint Helena', 'Jamestown', '7460'],
['Slovenia', 'Ljubljana', '2007000'],
['Svalbard and Jan Mayen', 'Longyearbyen', '2550'],
['Slovakia', 'Bratislava', '5455000'],
['Sierra Leone', 'Freetown', '5245695'],
['San Marino', 'San Marino', '31477'],
['Senegal', 'Dakar', '12323252'],
['Somalia', 'Mogadishu', '10112453'],
['Suriname', 'Paramaribo', '492829'],
['South Sudan', 'Juba', '8260490'],
['São Tomé and Príncipe', 'São Tomé', '175808'],
['El Salvador', 'San Salvador', '6052064'],
['Sint Maarten', 'Philipsburg', '37429'],
['Syria', 'Damascus', '22198110'],
['Swaziland', 'Mbabane', '1354051'],
['Turks and Caicos Islands', 'Cockburn Town', '20556'],
['Chad', "N'Djamena", '10543464'],
['French Southern Territories', 'Port-aux-Français', '140'],
['Togo', 'Lomé', '6587239'],
['Thailand', 'Bangkok', '67089500'],
['Tajikistan', 'Dushanbe', '7487489'],
['Tokelau', 'None', '1466'],
['East Timor', 'Dili', '1154625'],
['Turkmenistan', 'Ashgabat', '4940916'],
['Tunisia', 'Tunis', '10589025'],
['Tonga', "Nuku'alofa", '122580'],
['Turkey', 'Ankara', '77804122'],
['Trinidad and Tobago', 'Port of Spain', '1228691'],
['Tuvalu', 'Funafuti', '10472'],
['Taiwan', 'Taipei', '22894384'],
['Tanzania', 'Dodoma', '41892895'],
['Ukraine', 'Kiev', '45415596'],
['Uganda', 'Kampala', '33398682'],
['U.S. Minor Outlying Islands', 'None', '0'],
['United States', 'Washington', '310232863'],
['Uruguay', 'Montevideo', '3477000'],
['Uzbekistan', 'Tashkent', '27865738'],
['Vatican City', 'Vatican City', '921'],
['Saint Vincent and the Grenadines', 'Kingstown', '104217'],
['Venezuela', 'Caracas', '27223228'],
['British Virgin Islands', 'Road Town', '21730'],
['U.S. Virgin Islands', 'Charlotte Amalie', '108708'],
['Vietnam', 'Hanoi', '89571130'],
['Vanuatu', 'Port Vila', '221552'],
['Wallis and Futuna', 'Mata-Utu', '16025'],
['Samoa', 'Apia', '192001'],
['Kosovo', 'Pristina', '1800000'],
['Yemen', 'Sanaa', '23495361'],
['Mayotte', 'Mamoudzou', '159042'],
['South Africa', 'Pretoria', '49000000'],
['Zambia', 'Lusaka', '13460305'],
['Zimbabwe', 'Harare', '11651858']]
You're redefining the variable data on every loop. You need to define a variable before the loop to store all the data:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.scrapethissite.com/pages/simple/').text
soup = BeautifulSoup(html_text, 'lxml')
countrys= soup.find_all('div',class_='col-md-4 country')
data = []
for country in countrys:
    country_name = country.find('h3',class_='country-name').text.strip()
    capital = country.find('span',class_='country-capital').text
    population = country.find('span',class_='country-population').text
    data.append([country_name, capital, population])
print(data)
Or better yet, you can use dictionaries, which will make accessing the data easier:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://www.scrapethissite.com/pages/simple/').text
soup = BeautifulSoup(html_text, 'lxml')
countrys= soup.find_all('div',class_='col-md-4 country')
data = {}
for country in countrys:
    country_name = country.find('h3',class_='country-name').text.strip()
    capital = country.find('span',class_='country-capital').text
    population = country.find('span',class_='country-population').text
    data[country_name] = {'capital': capital, 'population': population}
print(data)
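As a brief illustration of why the dictionary form is easier to work with, the sketch below uses a hand-written two-entry sample in the same shape as data (values copied from the output shown earlier) instead of fetching the page.

```python
# Hand-written sample in the same shape as the scraped `data` dictionary.
data = {
    'Andorra': {'capital': 'Andorra la Vella', 'population': '84000'},
    'Zimbabwe': {'capital': 'Harare', 'population': '11651858'},
}

# Look up one country directly by name.
print(data['Zimbabwe']['capital'])

# The scraped values are strings, so convert them before doing arithmetic.
total_population = sum(int(info['population']) for info in data.values())
print(total_population)
```

With the list-of-lists version, the same lookup would need a loop or an index into the list.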

How can I use string interpolation while iterating over a dictionary in python 3?

I have a dictionary full of states and their abbreviations mapped to their actual names. And I want to iterate over them, because I want to make a task easier (don't want to write this out for each state). So far I have a dictionary like this
state_dict = {
'AK': 'ALASKA',
'AL': 'ALABAMA',
'AR': 'ARKANSAS',
'AS': 'AMERICAN SAMOA',
'AZ': 'ARIZONA ',
'CA': 'CALIFORNIA ',
'CO': 'COLORADO ',
'CT': 'CONNECTICUT',
'DC': 'DISTRICT OF COLUMBIA',
'DE': 'DELAWARE',
'FL': 'FLORIDA',
'FM': 'FEDERATED STATES OF MICRONESIA',
'GA': 'GEORGIA',
'GU': 'GUAM ',
'HI': 'HAWAII',
'IA': 'IOWA',
'ID': 'IDAHO',
'IL': 'ILLINOIS',
'IN': 'INDIANA',
'KS': 'KANSAS',
'KY': 'KENTUCKY',
'LA': 'LOUISIANA',
'MA': 'MASSACHUSETTS',
'MD': 'MARYLAND',
'ME': 'MAINE',
'MH': 'MARSHALL ISLANDS',
'MI': 'MICHIGAN',
'MN': 'MINNESOTA',
'MO': 'MISSOURI',
'MP': 'NORTHERN MARIANA ISLANDS',
'MS': 'MISSISSIPPI',
'MT': 'MONTANA',
'NC': 'NORTH CAROLINA',
'ND': 'NORTH DAKOTA',
'NE': 'NEBRASKA',
'NH': 'NEW HAMPSHIRE',
'NJ': 'NEW JERSEY',
'NM': 'NEW MEXICO',
'NV': 'NEVADA',
'NY': 'NEW YORK',
'OH': 'OHIO',
'OK': 'OKLAHOMA',
'OR': 'OREGON',
'PA': 'PENNSYLVANIA',
'PR': 'PUERTO RICO',
'RI': 'RHODE ISLAND',
'SC': 'SOUTH CAROLINA',
'SD': 'SOUTH DAKOTA',
'TN': 'TENNESSEE',
'TX': 'TEXAS',
'UT': 'UTAH',
'VA': 'VIRGINIA ',
'VI': 'VIRGIN ISLANDS',
'VT': 'VERMONT',
'WA': 'WASHINGTON',
'WI': 'WISCONSIN',
'WV': 'WEST VIRGINIA',
'WY': 'WYOMING'
}
for k, v in state_dict.items():
    print("""if (c_state_code.equals("{k}"))
{
out_state_code = "{v}";
}""").format(k, v)
But I'm getting 'NoneType' object has no attribute 'format', and I even tried **attrs in the .format call but got the same error.
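You're calling format() on the result of print(), which returns None. It should be called on the format string, so it needs to be inside the argument to print(). The literal braces also have to be doubled so that format() does not treat them as fields, and because the fields are named, k and v must be passed as keyword arguments.
for k, v in state_dict.items():
    print("""if (c_state_code.equals("{k}"))
{{
out_state_code = "{v}";
}}""".format(k=k, v=v))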
If you're using Python 3.6 or later, you can make it even easier with an f-string.
for k, v in state_dict.items():
    print(f"""if (c_state_code.equals("{k}"))
{{
out_state_code = "{v}";
}}""")
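To illustrate the brace handling on its own: in str.format and in f-strings, doubled braces produce literal braces, while single braces mark replacement fields. A minimal sketch, using a made-up one-entry dictionary and a one-line template rather than the full state_dict:

```python
state_dict = {'NY': 'NEW YORK'}  # one-entry sample for illustration

# {{ and }} become literal braces; {k} and {v} are named fields.
template = 'if (c_state_code.equals("{k}")) {{ out_state_code = "{v}"; }}'

for k, v in state_dict.items():
    line = template.format(k=k, v=v)  # named fields need keyword arguments
    print(line)
```

Passing k and v positionally to a template with named fields would raise a KeyError, because format() looks up named fields only among its keyword arguments.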
If you want to use this code, I think @Barmar's answer is pretty good. However, it looks like you are trying to copy and paste a million different if statements to convert a state's abbreviation into the state name. In that case, I would use the dictionary directly (or even store it in a JSON file!):
state_dict = {...}
out_state_code = state_dict[c_state_code]
or
import json
with open("states.json", "r") as states_file:
    state_dict = json.load(states_file)
out_state_code = state_dict[c_state_code]
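One thing to watch with a direct lookup is that an unknown code raises KeyError. A small sketch, assuming you would rather fall back to a default value; the helper name lookup_state, the 'UNKNOWN' fallback, and the shortened dictionary are assumptions for illustration:

```python
state_dict = {'AK': 'ALASKA', 'NY': 'NEW YORK'}  # shortened sample

def lookup_state(c_state_code, default='UNKNOWN'):
    # strip() and upper() tolerate inputs such as ' ny'.
    return state_dict.get(c_state_code.strip().upper(), default)

print(lookup_state('NY'))
print(lookup_state(' ak'))
print(lookup_state('ZZ'))
```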

How to fix "'dict' object has no attribute 'key'" [closed]

I got this error, even though I have the key attribute there.
Traceback (most recent call last):
  File "C:/Users/Acer/AppData/Local/Programs/Python/Python37-32/randomQuizGenerator.py", line 33, in <module>
    states = list(capitals.key())
AttributeError: 'dict' object has no attribute 'key'
I am new to Python, by the way, and I have been following the tutorial step by step.
My dict is here:
capitals = {'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix', 'Arkansas': 'Little Rock', 'California': 'Sacramento', 'Colorado': 'Denver',
'Connecticut': 'Hartford', 'Delaware': 'Dover', 'Florida': 'Tallahassee',
'Georgia': 'Atlanta', 'Hawaii': 'Honolulu', 'Idaho': 'Boise', 'Illinois':
'Springfield', 'Indiana': 'Indianapolis', 'Iowa': 'Des Moines', 'Kansas':
'Topeka', 'Kentucky': 'Frankfort', 'Louisiana': 'Baton Rouge', 'Maine':
'Augusta', 'Maryland': 'Annapolis', 'Massachusetts': 'Boston', 'Michigan':
'Lansing', 'Minnesota': 'Saint Paul', 'Mississippi': 'Jackson', 'Missouri':
'Jefferson City', 'Montana': 'Helena', 'Nebraska': 'Lincoln', 'Nevada':
'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New Mexico': 'Santa Fe', 'New York': 'Albany', 'North Carolina': 'Raleigh',
'North Dakota': 'Bismarck', 'Ohio': 'Columbus', 'Oklahoma': 'Oklahoma City',
'Oregon': 'Salem', 'Pennsylvania': 'Harrisburg', 'Rhode Island': 'Providence',
'South Carolina': 'Columbia', 'South Dakota': 'Pierre', 'Tennessee':
'Nashville', 'Texas': 'Austin', 'Utah': 'Salt Lake City', 'Vermont':
'Montpelier', 'Virginia': 'Richmond', 'Washington': 'Olympia', 'West Virginia': 'Charleston', 'Wisconsin': 'Madison', 'Wyoming': 'Cheyenne'}
I have the key attribute in my dictionary, so I don't know which part is actually wrong.
I expected a result like this:
State Capitals Quiz (Form 1)
1. What is the capital of West Virginia?
A. Hartford
B. Santa Fe
C. Harrisburg
D. Charleston
2. What is the capital of Colorado?
A. Raleigh
B. Harrisburg
C. Denver
D. Lincoln
The method is keys, not key.
states = list(capitals.keys())
states
Now, to get the corresponding values for these states:
[capitals[state] for state in states]
If you are trying to get a list of all of the states, or keys, within the dictionary, you can do the following using a list comprehension:
states = [k for k in capitals.keys()]
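For reference, a short sketch of equivalent ways to get the keys, using a three-entry sample of the dictionary. Iterating over a dict yields its keys, so list(capitals) gives the same result as the forms shown in the answers.

```python
capitals = {'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix'}

# Three equivalent ways to collect the keys.
states = list(capitals.keys())
same_via_list = list(capitals)
same_via_comprehension = [k for k in capitals]

# items() gives key/value pairs together, without a second lookup.
for state, capital in capitals.items():
    print(f'What is the capital of {state}? {capital}')
```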

Confused by EOL while scanning string literal error

I'm getting a really confusing "EOL while scanning string literal" error when trying to run my code. The bit it's pointing at isn't on line 15, and removing line 15 doesn't help. There are no EOLs or reserved characters in the dictionary (checked in a text editor).
What on earth have I done?
File "C:\Users\User\AppData\Local\Programs\Python\Python37-32\myScripts\Quiz\quiz.py", line 15
'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New
^
SyntaxError: EOL while scanning string literal
#! python3
# this code generates a random quiz for each member of a class
import random
capitals = {'Alabama': 'Montgomery', 'Alaska': 'Juneau', 'Arizona': 'Phoenix','Arkansas': 'Little Rock', 'California': 'Sacramento', 'Colorado': 'Denver','Connecticut': 'Hartford', 'Delaware': 'Dover', 'Florida': 'Tallahassee','Georgia': 'Atlanta', 'Hawaii': 'Honolulu', 'Idaho': 'Boise', 'Illinois':'Springfield', 'Indiana': 'Indianapolis', 'Iowa': 'Des Moines', 'Kansas':'Topeka', 'Kentucky': 'Frankfort', 'Louisiana': 'Baton Rouge', 'Maine':'Augusta', 'Maryland': 'Annapolis', 'Massachusetts': 'Boston', 'Michigan':'Lansing', 'Minnesota': 'Saint Paul', 'Mississippi': 'Jackson', 'Missouri':'Jefferson City', 'Montana': 'Helena', 'Nebraska': 'Lincoln', 'Nevada':'Carson City', 'New Hampshire': 'Concord', 'New Jersey': 'Trenton', 'New Mexico': 'Santa Fe', 'New York': 'Albany', 'North Carolina': 'Raleigh','North Dakota': 'Bismarck', 'Ohio': 'Columbus', 'Oklahoma': 'Oklahoma City','Oregon': 'Salem', 'Pennsylvania': 'Harrisburg', 'Rhode Island': 'Providence','South Carolina': 'Columbia', 'South Dakota': 'Pierre', 'Tennessee':'Nashville', 'Texas': 'Austin', 'Utah': 'Salt Lake City', 'Vermont':'Montpelier', 'Virginia': 'Richmond', 'Washington': 'Olympia', 'West Virginia': 'Charleston', 'Wisconsin': 'Madison', 'Wyoming': 'Cheyenne'}
# creates 35 quizzes
for quizNum in range(35):
    # creates the files
    quizFile = open('statesquiz%s.txt' % (quizNum + 1),'w')
    answerFile = open('statesquiz%s_answers.txt' % (quizNum + 1),'w')
    # writes some stuff in quiz files
    quizFile.write('Name: \n\nDate: \n\n')
    quizFile.write((' ' * 20) + 'State Capitals Quiz (Form %s)' % (quizNum + 1))
    quizFile.write('\n\n')
    # shuffles states
    states = list(capitals.keys())
    random.shuffle(states)
    # loop through states and make a question for each
    for questionNum in range(50):
        # gets right and decoy answers
        correctAnswer = capitals[states[questionNum]]
        wrongAnswers = list(capitals.values())
        del wrongAnswers[wrongAnswers.index(correctAnswer)]
        wrongAnswers = random.sample(wrongAnswers, 3)
        answers = wrongAnswers + [correctAnswer]
        random.shuffle(answers)
        # writes more stuff in quiz file
        quizFile.write('%s. What is the capital of %s?\n' % (questionNum + 1, states[questionNum]))
        for i in range(4):
            quizFile.write(' %s. %s\n' % ('ABCD'[i], answers[i]))
        quizFile.write('\n')
        # writes stuff in answer file
        answerFile.write('%s. %s\n' % (questionNum + 1, 'ABCD'[answers.index(correctAnswer)]))
    quizFile.close()
    answerFile.close()
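The traceback shows the reported line ending in 'New, that is, a quoted string still open at the end of a physical line. A single-quoted string cannot span lines, which is the usual cause of this error. Assuming the file on disk has the dictionary wrapped in the middle of a string (the listing above shows it on one line, so this is a guess), the sketch below reproduces the problem with a two-entry sample and shows the fix of breaking lines only between items.

```python
# A string literal split across physical lines does not compile.
broken = "capitals = {'New Jersey': 'Trenton', 'New\nMexico': 'Santa Fe'}"
try:
    compile(broken, '<example>', 'exec')
    failed = False
except SyntaxError as err:
    failed = True
    print('SyntaxError:', err.msg)

# Breaking the line between items, inside the braces, is fine.
capitals = {'New Jersey': 'Trenton',
            'New Mexico': 'Santa Fe'}
print(capitals['New Mexico'])
```

The exact wording of the error message depends on the Python version, but in each case it is a SyntaxError raised before any of the script runs.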

Append cleaned data to dictionary using if and loop technique

I have a dataset to clean and organize. Here is the link to the data set:
https://github.com/irJERAD/Intro-to-Data-Science-in-Python/blob/master/MyNotebooks/university_towns.txt
So what I am trying to do is clean this data set into a dictionary with the format {State: Town}, for example {'Alabama': 'Auburn', 'Alabama': 'Florence', ..., 'Wyoming': 'Laramie'}
Here is my code:
import re
univ_towns = open('university_towns.txt',encoding='utf-8').readlines()
state_list = []
d={}
for name in univ_towns:
    if "[ed" in name:
        statename = re.sub(r'\[edit]\n$', '', name)
        state_list.append(statename)
        len_state = len(state_list)
    elif "(" in name:
        sep = ' ('
        townname = name.split(sep, 1)[0]
        if "," in townname:
            sep = ','
            townname = townname.split(sep, 1)[0]
        d[state_list[len_state-1]] = townname
d
However, my output has only the last town for each state in the dictionary. I am sure there is something not right with the loop logic, but I can't figure out what is wrong. Here is the output of my code:
{'Alabama': 'Tuskegee',
'Alaska': 'Fairbanks',
'Arizona': 'Tucson',
'Arkansas': 'Searcy',
'California': 'Whittier',
'Colorado': 'Pueblo',
'Connecticut': 'Willimantic',
'Delaware': 'Newark',
'Florida': 'Tampa',
'Georgia': 'Young Harris',
'Hawaii': 'Manoa',
'Idaho': 'Rexburg',
'Illinois': 'Peoria',
'Indiana': 'West Lafayette',
'Iowa': 'Waverly',
'Kansas': 'Pittsburg',
'Kentucky': 'Wilmore',
'Louisiana': 'Thibodaux',
'Maine': 'Waterville',
'Maryland': 'Westminster',
'Massachusetts': 'Framingham',
'Michigan': 'Ypsilanti',
'Minnesota': 'Winona',
'Mississippi': 'Starkville',
'Missouri': 'Warrensburg',
'Montana': 'Missoula',
'Nebraska': 'Wayne',
'Nevada': 'Reno',
'New Hampshire': 'Rindge',
'New Jersey': 'West Long Branch',
'New Mexico': 'Silver City',
'New York': 'West Point',
'North Carolina': 'Winston-Salem',
'North Dakota': 'Grand Forks',
'Ohio': 'Wilberforce',
'Oklahoma': 'Weatherford',
'Oregon': 'Newberg',
'Pennsylvania': 'Williamsport',
'Rhode Island': 'Providence',
'South Carolina': 'Spartanburg',
'South Dakota': 'Vermillion',
'Tennessee': 'Sewanee',
'Texas': 'Waco',
'Utah': 'Ephraim',
'Vermont': 'Northfield',
'Virginia': 'Chesapeake',
'Washington': 'University District',
'West Virginia': 'West Liberty',
'Wisconsin': 'Whitewater',
'Wyoming': 'Laramie'}
Try using defaultdict:
import re
from collections import defaultdict
# univ_towns is the list of lines read from the file in the question
state_list = []
d = defaultdict(list)  # a missing key starts out as an empty list
for name in univ_towns:
    if "[ed" in name:
        statename = re.sub(r'\[edit]\n$', '', name)
        state_list.append(statename)
        len_state = len(state_list)
    elif "(" in name:
        sep = ' ('
        townname = name.split(sep, 1)[0]
        if "," in townname:
            sep = ','
            townname = townname.split(sep, 1)[0]
        d[state_list[len_state-1]].append(townname)
As you can see, the only major difference is at the end, where you use append instead of =. The way you had it before overwrites the value each time, so it keeps only one town per state (the last one assigned) rather than all towns, which is what you seem to want, unless I'm misunderstanding.
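To see the difference on a small scale, here is a self-contained sketch using a few hand-written lines in the same format as university_towns.txt. The sample lines and the names towns_by_state and current_state are assumptions modelled on the file, not read from it.

```python
from collections import defaultdict

# Hand-written sample lines in the same layout as the data file.
sample_lines = [
    'Alabama[edit]\n',
    'Auburn (Auburn University)[1]\n',
    'Florence (University of North Alabama)\n',
    'Wyoming[edit]\n',
    'Laramie (University of Wyoming)[5]\n',
]

towns_by_state = defaultdict(list)
current_state = None
for line in sample_lines:
    if '[edit]' in line:
        # A state header: remember it for the town lines that follow.
        current_state = line.split('[edit]')[0]
    elif '(' in line and current_state is not None:
        # A town line: keep the part before ' (' and append, not assign.
        towns_by_state[current_state].append(line.split(' (', 1)[0])

print(dict(towns_by_state))
```

Tracking the current state in a single variable also avoids indexing back into state_list on every town line.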
