Grouping by almost similar strings - python

I have a dataset with city names and counts of crimes. The data is dirty: a city name such as 'new york', for example, is written as 'newyork', 'new york us', 'new york city', 'manhattan new york', etc. How can I group all these cities together and sum their crimes?
I tried the 'difflib' package in Python, which matches strings and gives you a score, but it doesn't work well. I also tried the geocode package in Python; it has limits on the number of times you can access the API, and it doesn't work well either. Any suggestions?

Maybe this might help:
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
Another way: if a string contains 'new' and 'york', then label it 'new york city'.
Another way: create a dictionary of all the possible fuzzy variants that occur and label each of them manually, then use that labelling to replace each fuzzy variant with its label.
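For the fuzzywuzzy idea, a minimal sketch (the canonical list and the threshold of 80 are assumptions you would need to fill in and tune for your data):

from fuzzywuzzy import process  # pip install fuzzywuzzy

canonical = ['new york city', 'los angeles', 'chicago']  # assumed list of canonical city names

def normalize_city(raw_name, threshold=80):
    match, score = process.extractOne(raw_name, canonical)
    return match if score >= threshold else raw_name

print(normalize_city('new york us'))  # likely 'new york city'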

Another approach is to go through each entry, strip the white space, and see if it contains a base city name. For example 'newyork', 'new york us', 'new york city', 'manhattan new york', when stripped of white space, would be 'newyork', 'newyorkus', 'newyorkcity', 'manhattannewyork', which all contain the word 'newyork'.
There are two approaches with this method: you can go through and replace all the 'new york' strings with ones that have no white space and are just 'newyork' (see the sketch after the example below), or you can just check them on the fly.
I wrote down an example below, but since I don't know how your data is formatted, I'm not sure how helpful it is.
crime_count = 0
for (key, val) in dataset:
    # strip the spaces so 'new york', 'new york city', etc. all contain 'newyork'
    if 'newyork' in key.replace(" ", ""):
        crime_count = crime_count + val
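The first approach (normalizing the keys up front) might look something like this; it is only a sketch, and it again assumes dataset yields (city, count) pairs:

from collections import defaultdict

normalized_counts = defaultdict(int)
for key, val in dataset:
    stripped = key.replace(" ", "")
    # collapse every variant that contains 'newyork' onto a single key
    city = 'newyork' if 'newyork' in stripped else stripped
    normalized_counts[city] += val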


Python - lists of lists and user input (user selects a position from a bigger list, then gets a print of an f-string with elements of the chosen sub-list)

This is my first question here ever; I am at the very beginning of my programming course. I really tried to find a solution myself and then look for something similar on the internet, but it seems I don't even know how to name the (extremely basic) problem properly, so a Google search gives me no results.
I have a task to do. Part of that task is: I have a list of lists, which is a list of the richest people in the world and additional data about them. It has 101 positions and looks like this:
[['Rank', 'Name', 'Total Net Worth', '$ Last Change', '$ YTD Change', 'Country', 'Industry'], ['1', 'Jeff Bezos', '$188B', '+$1.68B', '-$2.31B\xa0', 'United States', 'Technology'],
['2', 'Elon Musk', '$170B', '-$2.89B', '+$773M\xa0', 'United States',
and so on. My goal is to ask the user for an int input (a number between 1 and 100) and then display the data of the person who holds that position in the ranking, like this:
position = int(input("Hello, this is Top 100 Richest People List. Press number "\
f"from 1 to 100 to see what billionaire is on that position in the ranking."))
#stuff happens#
print(f"The billionaire with a position number ... is ... . Total Net Worth "\
f"amounts to ... $. Last change amounts to ...$. YTD change is ... . "\
f"The country of origin is ... and the industry is ... .")
I have no idea how to map the user input to a selection from the list, and then print the content of just that sub-list in the f-string. So I kindly ask for some help, hints, or at least a tutorial on "lists of lists + user input" which would be useful.
Thank you
Is this what you're looking for?
l = [['Rank', 'Name', 'Total Net Worth', '$ Last Change', '$ YTD Change', 'Country', 'Industry'], ['1', 'Jeff Bezos', '$188B', '+$1.68B', '-$2.31B\xa0', 'United States', 'Technology'],
['2', 'Elon Musk', '$170B', '-$2.89B', '+$773M\xa0', 'United States', ...
position = int(input("Hello, this is Top 100 Richest People List. Press number "
                     "from 1 to 100 to see what billionaire is on that position in the ranking."))
try:
    found_person = l[position]
    print(f"The billionaire with a position number {position} is {found_person[1]}. Total Net Worth "
          f"amounts to {found_person[2]} $. Last change amounts to {found_person[3]}$. YTD change is {found_person[4]}. "
          f"The country of origin is {found_person[5]} and the industry is {found_person[6]}.")
except IndexError:
    print('Cannot find any rich people with given position-index!')
Breaking down the code:
l is where we initialize the list (as you've mentioned in your question description).
position is the index of the person in the list, i.e. the index of a sublist in the initial list.
We have a try-except block in order to handle an IndexError. This error can appear if the index given by the user does not match any sublist in the initial list.
found_person is the found sublist which relates to the person.
In the print after found_person, we print the final information about the person by simply addressing the index of every piece of information (every element in the found sublist). We print them out by including these elements (i.e. found_person[1], found_person[2] and so on) in curly brackets { } inside the string and putting f in front of it (just like in the question description).
Just to advise you: this is only an example of what you're looking for. I'm trying to make it simple and clear.
persons = [['Name', 'Age', 'hair Color'],
['Alice', '25', 'blonde'],
['Bob', '33', 'black'],
['Ann', '18', 'purple']]
position = int(input("Hello, this is list of 3 persons. "\
f"Press number to get information about it:"))
Name = persons[position][0]   # get data from the 1st column
Age = persons[position][1]    # get data from the 2nd column
Color = persons[position][2]  # get data from the 3rd column
print(f"The person with a position number {position} is {Name}. They are {Age} years old."
      f" They have a {Color} hair color.")

How to match city names split by space?

I'm trying to figure out, given two different formats of address strings, how to determine whether or not the city name is actually a split (multi-word) name. Since I'm working in Python, I split the string and save s[0] for the street number, s[-1] for the zip code and so on, but how do I figure out whether the city name may be a split word such as New York or San Jose?
E.g.: 123 Main Street St. Louisville OH 43071 [City name is a single word]
E.g.: 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I'm making two assumptions here:
1) That the number code before the town name is always numeric
2) That no town name contains a number
# find the index of the last purely numeric token in the split address
index = list(filter(lambda x: x[1].isnumeric(), enumerate(x.split())))[-1][0]
# join every token that comes after that numeric element
" ".join(x.split()[index + 1:])
So what is happening: We try to identify the last part of the split that is purely numeric, and then get the index of that element. Then we join all elements after that numeric element.
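A different sketch, under extra assumptions of my own (not stated in the question): the street part always ends in a known suffix such as 'St.' or 'Al.', and the last two tokens are the state code and the zip. Then the city is everything between that suffix and the state code:

STREET_SUFFIXES = {"St.", "Al.", "Ave.", "Rd."}  # assumed list, extend as needed

def extract_city(address):
    tokens = address.split()
    middle = tokens[1:-2]  # drop the street number, state code and zip
    # the city is everything after the last street-suffix token
    for i in range(len(middle) - 1, -1, -1):
        if middle[i] in STREET_SUFFIXES:
            return " ".join(middle[i + 1:])
    return " ".join(middle)  # fallback: no known suffix found

print(extract_city("45 Holy Grail Al. Niagara Town ZP 32908"))  # 'Niagara Town'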

String replace - avoid repeat

I am working on merging a few datasets regarding over 200 countries in the world. In cleaning the data I need to convert some three-letter codes for each country into the countries' full names.
The three-letter codes and country full names come from a separate CSV file, which shows a slightly different set of countries.
My question is: Is there a better way to write this?
str.replace("USA", "United States of America")
str.replace("CAN", "Canada")
str.replace("BHM", "Bahamas")
str.replace("CUB", "Cuba")
str.replace("HAI", "Haiti")
str.replace("DOM", "Dominican Republic")
str.replace("JAM", "Jamaica")
and so on. It goes on for another 200 rows. Thank you!
Since the number of substitutions is high, I would instead iterate over the words in the string and replace them based upon a dictionary lookup.
mapofcodes = {'USA': 'United States of America', ....}
finalstr = ""
for word in mystring.split():
    finalstr += mapofcodes.get(word, word) + " "
Try reading the CSV file into a dictionary or a 2D array; you can then access whichever one you want.
That is, if I understand your question correctly.
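For example, a rough sketch assuming the CSV has 'code' and 'name' columns (the filename and column names here are placeholders, so adjust them to whatever your file actually uses):

import csv

code_to_name = {}
with open('country_codes.csv', newline='') as f:  # placeholder filename
    for row in csv.DictReader(f):
        code_to_name[row['code']] = row['name']
# then reuse the dictionary-lookup approach from the answer above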
Here's a regular expressions solution:
import re

COUNTRIES = {'USA': 'United States of America', 'CAN': 'Canada'}

def repl(m):
    country_code = m.group(1)
    return COUNTRIES.get(country_code, country_code)

p = re.compile(r'([A-Z]{3})')
my_string = p.sub(repl, my_string)
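For example (note that the pattern matches any run of three capital letters, so it can produce false positives on other all-caps text):

>>> my_string = "USA and CAN share a border"
>>> p.sub(repl, my_string)
'United States of America and Canada share a border'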

Finding multiple common starting strings

I have a list of strings in which one or more subsets of the strings have a common starting string. I would like a function that takes as input the original list of strings and returns a list of all the common starting strings. In my particular case I also know that each common prefix must end in a given delimiter. Below is an example of the type of input data I am talking about:
Population of metro area / Portland
Population of city / Portland
Population of metro area / San Francisco
Population of city / San Francisco
Population of metro area / Seattle
Population of city / Seattle
Here the delimiter is / and the common starting strings are Population of metro area and Population of city. Perhaps the delimiter won't ultimately matter but I've put it in to emphasize that I don't want just one result coming back, namely the universal common starting string Population of; nor do I want the common substrings Population of metro area / S and Population of city / S.
The ultimate use for this algorithm will be to group the strings by their common prefixes. For instance, the list above can be restructured into a hierarchy that eliminates redundant information, like so:
Population of metro area
Portland
San Francisco
Seattle
Population of city
Portland
San Francisco
Seattle
I'm using Python but pseudo-code in any language would be fine.
EDIT
As noted by Tom Anderson, the original problem as given can easily be reduced to simply splitting the strings and using a hash to group by the prefix. I had originally thought the problem might be more complicated because sometimes in practice I encounter prefixes with embedded delimiters, but I realize this could also be solved by simply doing a right split that is limited to splitting only one time.
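For instance, a right split limited to a single split keeps any embedded delimiters inside the prefix:

>>> "Population / of metro area / Portland".rsplit(" / ", 1)
['Population / of metro area', 'Portland']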
Isn't this just looping over the strings, splitting them on the delimiter, then grouping the second halves by the first halves? Like so:
def groupByPrefix(strings):
    stringsByPrefix = {}
    for string in strings:
        prefix, suffix = map(str.strip, string.split("/", 1))
        group = stringsByPrefix.setdefault(prefix, [])
        group.append(suffix)
    return stringsByPrefix
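For example, with two of the lines from the question:

>>> groupByPrefix(["Population of metro area / Portland",
...                "Population of city / Portland"])
{'Population of metro area': ['Portland'], 'Population of city': ['Portland']}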
In general, if you're looking for string prefixes, the solution would be to put the strings into a trie. Any branch node with multiple children is a maximal common prefix. But your need is more restricted than that.
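For what it's worth, a rough word-level sketch of that trie idea might look like the following; note that it also reports the broader 'Population of' prefix, which is exactly why the simpler split-and-group approach above is a better fit here:

def build_trie(strings):
    # word-level trie: each node maps the next word to a child node
    trie = {}
    for s in strings:
        node = trie
        for word in s.split():
            node = node.setdefault(word, {})
    return trie

def branch_prefixes(node, prefix=()):
    # every node with more than one child marks a common prefix
    found = []
    for word, child in node.items():
        path = prefix + (word,)
        if len(child) > 1:
            found.append(" ".join(path))
        found.extend(branch_prefixes(child, path))
    return found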
import collections

d = collections.defaultdict(list)
for place, name in ((i.strip() for i in line.split('/'))
                    for line in text.splitlines()):
    d[place].append(name)
so d will be a dict like:
{'Population of city':
['Portland',
'San Francisco',
'Seattle'],
'Population of metro area':
['Portland',
'San Francisco',
'Seattle']}
You can replace (i.strip() for i in line.split('/')) with line.split(' / ') if you know there's no extra whitespace around your text.
Using csv.reader and itertools.groupby, treat the '/' as the delimiter and group by the first column:
from csv import reader
from itertools import groupby

for key, group in groupby(sorted(reader(inp, delimiter='/')), key=lambda x: x[0]):
    print(key)
    for line in group:
        print("\t", line[1])
This isn't very general, but may do what you need:
def commons(strings):
    return set(s.split(' / ')[0] for s in strings)
And to avoid going back over the data for the grouping:
def group(strings):
    groups = {}
    for s in strings:
        prefix, remainder = s.split(' / ', 1)
        groups.setdefault(prefix, []).append(remainder)
    return groups

How can I organize each scraped item into a csv row?

What is the best way to organize scraped data into a csv? More specifically, each item is in this form:
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.
There's really no shortcut on this. Line 1 is easy: just assign it to url. Line 3 can probably be split on , without any ill effects, but line 2 will have to be manually parsed. What do you know about word1-wordN? Are you sure "practice" will never be a "word"? Are you sure the words are only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits so you're left with a list of words, and split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no). If there are a variable number of words, you don't want to inline them in your tuple like that, because you won't know how to get them out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
More specifically? You're on your own there.
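For illustration, here is a rough sketch of that parsing under strong assumptions about the exact format in the question (a quoted bio on line 2 with ' - ' separators, the literal word 'practice' before the city, and a 'JD, school, year' third line); the words list is joined with '; ' so it fits into a single csv column:

import csv

def parse_item(item):
    url, bio, education = item
    bio = bio.strip('"').rstrip('.')
    name_part, rank, rest = bio.split(' - ', 2)
    first, middle, last = name_part.split()        # assumes exactly three name tokens
    practice_part, city = rest.rsplit(' practice ', 1)
    words = [w.strip() for w in practice_part.split(',')]
    _, school, year = [p.strip() for p in education.split(',', 2)]
    return (url, first, middle, last, rank, words, city, school, year)

item = ('http://www.examplefirm.com/jang',
        '"Joe E. Ang - partner - privatization mergers, media & technology practice New York."',
        'JD, University of Chicago Law School, 1985')
row = parse_item(item)
with open('firms.csv', 'w', newline='') as f:  # placeholder filename
    csv.writer(f).writerow(row[:5] + ('; '.join(row[5]),) + row[6:])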
