How can I organize each scraped item into a csv row? - python

What is the best way to organize scraped data into a csv? More specifically each item is in this form
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.

There's really no shortcut for this. Line 1 is easy: just assign it to url. Line 3 can probably be split on commas without any ill effects, but line 2 will have to be parsed manually. What do you know about word1-wordN? Are you sure "practice" will never be one of the words? Are you sure each entry is only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits, so you're left with a list of words, split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no.) If there are a variable number of words, you don't want to inline them in your tuple like that, because you don't know how to get them out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
More specifically? You're on your own there.
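As a starting point, here is a minimal sketch of that parsing, assuming every item looks like the single example in the question. The regular expression, the `parse_item` name and the choice to join the practice areas with "; " into one cell are all assumptions made for illustration; real data will almost certainly need extra cases.

```python
import csv
import io
import re

# Assumed layout of line 2 (taken from the single example in the question):
#   "First [Middle] Last - level - area, area & area practice City."
LINE2 = re.compile(
    r'^"?(?P<first>\S+)\s+(?:(?P<middle>\S+)\s+)?(?P<last>\S+)'
    r'\s+-\s+(?P<level>.+?)\s+-\s+(?P<areas>.+)\s+practice\s+(?P<city>.+?)\.?"?$'
)

def parse_item(url, line2, line3):
    """Turn one scraped three-line item into a flat list of CSV fields."""
    match = LINE2.match(line2.strip())
    if match is None:
        raise ValueError("unexpected format: {!r}".format(line2))
    # Split practice areas on commas only, so 'media & technology' stays whole.
    areas = [a.strip() for a in match.group("areas").split(",") if a.strip()]
    # Line 3: degree first, year last, everything in between is the school.
    degree, rest = line3.split(",", 1)
    school, year = rest.rsplit(",", 1)
    return [
        url.strip(),
        match.group("first"),
        (match.group("middle") or "").rstrip(","),
        match.group("last"),
        match.group("level"),
        "; ".join(areas),  # variable-length list kept in a single cell
        match.group("city"),
        school.strip(),
        year.strip(),
    ]

row = parse_item(
    "http://www.examplefirm.com/jang",
    '"Joe E. Ang - partner - privatization mergers, media & technology practice New York."',
    "JD, University of Chicago Law School, 1985",
)

buffer = io.StringIO()  # stands in for a real file opened with newline=""
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

Keeping the practice areas in one cell means every row has the same number of columns, which makes the later import simpler; the cell can be split again on "; " when loading.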

Related

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi,w in enumerate(ws):
    # Skip if the word is not a person
    if w not in involved:
        continue
    # Check next x words for any involved person
    x = 14
    for i in range(wi+1,wi+x):
        # Avoid list index error
        if i >= l:
            break
        # Skip if the word is not a person
        if ws[i] not in involved:
            continue
        # Print related
        print(ws[wi],ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.
You could first preprocess your text so that all the names in the text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names so that you know which name corresponds to which id. This would allow you to keep your current algorithm as is.
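A rough sketch of that preprocessing step could look like the following. The `PERSON{n}ID` id format and both function names are made up for this example, and it assumes such ids never occur in the real text. Whitespace is collapsed first because a name such as 'Lady Burlesdon' may be split across a line break in the source text.

```python
import re
import string

def replace_names_with_ids(text, names):
    """Replace each name with a one-word id; return the new text and an id-to-name map."""
    text = " ".join(text.split())  # collapse line breaks so split names still match
    id_to_name = {}
    # Longest names first, so a long name is not broken up by a shorter one it contains
    for n, name in enumerate(sorted(names, key=len, reverse=True)):
        token = "PERSON{}ID".format(n)  # hypothetical id format, assumed absent from the text
        id_to_name[token] = name
        text = re.sub(r"\b" + re.escape(name) + r"\b", token, text)
    return text, id_to_name

def related_pairs(text, names, window=14):
    """Same word-window check as the naive loop, run on the id-substituted text."""
    new_text, id_to_name = replace_names_with_ids(text, names)
    # Strip punctuation stuck to a word so 'PERSON0ID,' still counts as an id
    words = [w.strip(string.punctuation) for w in new_text.split()]
    pairs = []
    for i, w in enumerate(words):
        if w not in id_to_name:
            continue
        for other in words[i + 1:i + window]:
            if other in id_to_name:
                pairs.append((id_to_name[w], id_to_name[other]))
    return pairs

sample = "Robert met Lady Burlesdon and later Rudolf the Third arrived"
print(related_pairs(sample, ["Robert", "Lady Burlesdon", "Rudolf the Third"]))
```

Note that distances are now counted with each multi-word name as one word, which is usually what you want for a "fewer than 14 words apart" rule.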

How to match city names split by space?

Given two different kinds of address strings, how can I determine whether the city name spans more than one word? I'm working in Python: I split the string and save s[0] as the street number, s[-1] as the zip code and so on, but how do I figure out whether the city name is made of several words, such as New York or San Jose?
E.g.: 123 Main Street St. Louisville OH 43071 [City name is a single word]
E.g.: 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I am making two assumptions here:
1) That the code right before the town name is always numeric
2) That no town name contains a purely numeric word
# x is the address string
index = list(filter(lambda pair: pair[1].isnumeric(), enumerate(x.split())))[-1][0]
" ".join(x.split()[index+1:])
So what is happening: We try to identify the last part of the split that is purely numeric, and then get the index of that element. Then we join all elements after that numeric element.
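Wrapped in a function, the same idea can be sketched as below. The function name and the first sample address are hypothetical; the second address is taken from the question.

```python
def text_after_last_number(address):
    """Join every token that follows the last purely numeric token."""
    tokens = address.split()
    numeric_positions = [i for i, tok in enumerate(tokens) if tok.isnumeric()]
    if not numeric_positions:
        return ""  # no numeric token at all
    return " ".join(tokens[numeric_positions[-1] + 1:])

# Hypothetical layout where the number sits right before the town name
print(text_after_last_number("Main Street 123 Niagara Town"))
# Layout from the question, where the zip code is the last numeric token
print(repr(text_after_last_number("45 Holy Grail Al. Niagara Town ZP 32908")))
```

The first call prints the town name. The second prints an empty string, because nothing follows the zip code, so the approach only fits layouts that satisfy assumption 1.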

The fastest way to remove items that matches a substring from list - Python

What is the fastest way to remove the items in a list that contain any of the substrings in a set?
For example,
the_list = ['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements) and I'd like to remove every element that contains any of the strings in the set, for example,
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What will be the fastest way? Is looping through the fastest?
The Aho-Corasick algorithm was designed for exactly this task. Its running time is linear, roughly O(n+m) plus the number of matches reported, where n is the total length of the strings to find and m is the total length of the text being searched. Nested loops, by contrast, do work proportional to the number of strings to find times the length of the text.
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
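To make the idea concrete, here is a small pure-Python sketch of an Aho-Corasick automaton used as a filter. It is not the implementation referred to above, all function names are invented for the example, and for millions of strings a compiled library would likely be faster. It assumes the patterns are non-empty strings.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie plus failure links for a collection of patterns."""
    goto = [{}]         # goto[state] maps a character to the next state; 0 is the root
    fail = [0]          # failure link of each state
    terminal = [False]  # True if some pattern ends at this state or at a suffix of it
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                terminal.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        terminal[state] = True
    # Breadth-first pass: compute failure links level by level
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            terminal[nxt] = terminal[nxt] or terminal[fail[nxt]]
    return goto, fail, terminal

def contains_any(line, automaton):
    """Return True as soon as any pattern occurs in line (single pass over line)."""
    goto, fail, terminal = automaton
    state = 0
    for ch in line:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if terminal[state]:
            return True
    return False

def remove_matching(lines, patterns):
    """Keep only the lines that contain none of the patterns."""
    automaton = build_automaton(patterns)
    return [line for line in lines if not contains_any(line, automaton)]

words = {"Donald Trump", "Trump Organization", "Donald J. Trump", "D.J. Trump", "dump", "dd"}
print(remove_matching(["casinos", "renaming the company The Trump Organization."], words))
```

The automaton is built once and every line is then scanned in a single left-to-right pass, regardless of how many patterns there are.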
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If they are not already in memory, a generator expression is more memory-efficient, because it yields the kept lines one at a time instead of building a new list:
new = (line for line in the_list if not any(item in line for item in set_of_words))

Finding multiple common starting strings

I have a list of strings in which one or more subsets of the strings have a common starting string. I would like a function that takes as input the original list of strings and returns a list of all the common starting strings. In my particular case I also know that each common prefix must end in a given delimiter. Below is an example of the type of input data I am talking about (ignore any color highlighting):
Population of metro area / Portland
Population of city / Portland
Population of metro area / San Francisco
Population of city / San Francisco
Population of metro area / Seattle
Population of city / Seattle
Here the delimiter is / and the common starting strings are Population of metro area and Population of city. Perhaps the delimiter won't ultimately matter but I've put it in to emphasize that I don't want just one result coming back, namely the universal common starting string Population of; nor do I want the common substrings Population of metro area / S and Population of city / S.
The ultimate use for this algorithm will be to group the strings by their common prefixes. For instance, the list above can be restructured into a hierarchy that eliminates redundant information, like so:
Population of metro area
    Portland
    San Francisco
    Seattle
Population of city
    Portland
    San Francisco
    Seattle
I'm using Python but pseudo-code in any language would be fine.
EDIT
As noted by Tom Anderson, the original problem as given can easily be reduced to simply splitting the strings and using a hash to group by the prefix. I had originally thought the problem might be more complicated because sometimes in practice I encounter prefixes with embedded delimiters, but I realize this could also be solved by simply doing a right split that is limited to splitting only one time.
Isn't this just looping over the strings, splitting them on the delimiter, then grouping the second halves by the first halves? Like so:
def groupByPrefix(strings):
    stringsByPrefix = {}
    for string in strings:
        prefix, suffix = map(str.strip, string.split("/", 1))
        group = stringsByPrefix.setdefault(prefix, [])
        group.append(suffix)
    return stringsByPrefix
In general, if you're looking for string prefixes, the solution would be to whop the strings into a trie. Any branch node with multiple children is a maximal common prefix. But your need is more restricted than that.
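For illustration only, a character-level trie can be sketched with nested dicts as below. The function name and the end-of-string marker are choices made for this example. On the sample data it also reports prefixes such as 'Population of metro area / S', which is exactly why the delimiter-based grouping is the better fit here.

```python
END = ""  # marks the end of a string; can never collide with a real character

def branch_prefixes(strings):
    """Return every non-empty prefix at which the trie branches."""
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault(END, {})  # so a string that is a prefix of another still branches
    found = []
    stack = [("", trie)]
    while stack:
        prefix, node = stack.pop()
        if prefix and len(node) > 1:
            found.append(prefix)
        for ch, child in node.items():
            stack.append((prefix + ch, child))
    return sorted(found)

data = [
    "Population of metro area / Portland",
    "Population of city / Portland",
    "Population of metro area / San Francisco",
    "Population of city / San Francisco",
    "Population of metro area / Seattle",
    "Population of city / Seattle",
]
for prefix in branch_prefixes(data):
    print(repr(prefix))
```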
import collections

# text holds the input, one entry per line
d = collections.defaultdict(list)
for place, name in ((i.strip() for i in line.split('/'))
                    for line in text.splitlines()):
    d[place].append(name)
so d will be a dict like:
{'Population of city':
['Portland',
'San Francisco',
'Seattle'],
'Population of metro area':
['Portland',
'San Francisco',
'Seattle']}
You can replace (i.strip() for i in line.split('/')) with line.split(' / ') if you know there's no extra whitespace around your text.
Using csv.reader and itertools.groupby, treat the '/' as the delimiter and group by the first column:
from csv import reader
from itertools import groupby

# inp is an open file (or any iterable of lines)
for key, group in groupby(sorted(reader(inp, delimiter='/')), key=lambda x: x[0]):
    print(key)
    for line in group:
        print("\t", line[1])
This isn't very general, but may do what you need:
def commons(strings):
    return set(s.split(' / ')[0] for s in strings)
And to avoid going back over the data for the grouping:
def group(strings):
    groups = {}
    for s in strings:
        prefix, remainder = s.split(' / ', 1)
        groups.setdefault(prefix, []).append(remainder)
    return groups
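For prefixes that themselves contain the delimiter, the right split mentioned in the question's edit can be sketched like this. The function name, the sample strings and the decision to skip strings without a delimiter are assumptions made for the example.

```python
def group_by_last_delimiter(strings, delimiter="/"):
    """Group suffixes by prefix, splitting only at the last delimiter."""
    groups = {}
    for s in strings:
        parts = s.rsplit(delimiter, 1)  # split once, starting from the right
        if len(parts) != 2:
            continue  # no delimiter in this string: skipped (assumed policy)
        prefix, suffix = (p.strip() for p in parts)
        groups.setdefault(prefix, []).append(suffix)
    return groups

sample = [
    "Births / deaths ratio / Portland",
    "Births / deaths ratio / Seattle",
    "Population of city / Portland",
]
print(group_by_last_delimiter(sample))
```

Because only the last delimiter is used for the split, 'Births / deaths ratio' stays intact as a single prefix.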

How to intelligently parse last name

Assuming western naming convention of FirstName MiddleName(s) LastName,
What would be the best way to correctly parse out the last name from a full name?
For example:
John Smith --> 'Smith'
John Maxwell Smith --> 'Smith'
John Smith Jr --> 'Smith Jr'
John van Damme --> 'van Damme'
John Smith, IV --> 'Smith, IV'
John Mark Del La Hoya --> 'Del La Hoya'
...and the countless other permutations from this.
Probably the best answer here is not to try. Names are individual and idiosyncratic and, even limiting yourself to the Western tradition, you can never be sure that you'll have thought of all the edge cases. A friend of mine legally changed his name to be a single word, and he's had a hell of a time dealing with various institutions whose procedures can't deal with this. You're in a unique position of being the one creating the software that implements a procedure, and so you have an opportunity to design something that isn't going to annoy the crap out of people with unconventional names. Think about why you need to be parsing out the last name to begin with, and see if there's something else you could do.
That being said, as a purely technical matter the best way would probably be to trim off specifically the strings " Jr", ", Jr", ", Jr.", "III", ", III", etc. from the end of the string containing the name, and then take everything from the last space in the string to the (new, after having removed Jr, etc.) end. This wouldn't get, say, "Del La Hoya" from your example, but you can't even really count on a human to get that. I'm making an educated guess that John Mark Del La Hoya's last name is "Del La Hoya" and not "Mark Del La Hoya" because I'm a native English speaker and I have some intuition about what Spanish last names look like. If the name were, say, "Gauthip Yeidze Ka Illunyepsi", I would have absolutely no idea whether to count that Ka as part of the last name, because I have no idea what language that's from.
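A minimal sketch of that trimming approach follows. The suffix list and the function name are assumptions for the example and are far from complete. Note that it returns 'Smith' rather than the 'Smith Jr' asked for in the question, and that it cannot recover multi-word surnames.

```python
# Assumed, deliberately short list of suffixes to trim; extend as needed
SUFFIXES = ("Jr", "Jr.", "Sr", "Sr.", "II", "III", "IV")

def naive_last_name(full_name):
    """Drop trailing suffixes, then return the last remaining word."""
    tokens = full_name.replace(",", " ").split()
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return tokens[-1] if tokens else ""

for name in ["John Smith", "John Smith Jr", "John Smith, IV",
             "John van Damme", "John Mark Del La Hoya"]:
    print("{} --> {}".format(name, naive_last_name(name)))
```

The last two names come back as 'Damme' and 'Hoya', which shows the limitation described above.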
Came across a lib called "nameparser" at
https://pypi.python.org/pypi/nameparser
It handles four out of six cases above:
#!/usr/bin/env python
from nameparser import HumanName

def get_lname(somename):
    name = HumanName(somename)
    return name.last

people_names = [
    ('John Smith', 'Smith'),
    ('John Maxwell Smith', 'Smith'),
    # ('John Smith Jr', 'Smith Jr'),
    ('John van Damme', 'van Damme'),
    # ('John Smith, IV', 'Smith, IV'),
    ('John Mark Del La Hoya', 'Del La Hoya')
]
for name, target in people_names:
    print('{} --> {} <-- {}'.format(name, get_lname(name), target))
    assert get_lname(name) == target
I'm seconding Tnekutippa here, but you should check out named entity recognition. It might help automate some of the process. This is however, as noted, quite difficult. I'm not quite sure if the Stanford NER can extract first and last names out of the box, but a machine learning approach could prove very useful for this task. The Stanford NER could be a nice starting point, or you could try to make your own classifiers and training corpora.
