How do I just look at a specific row with Pandas? - python

I'm fairly new to Python Pandas. So sorry for a very easy question. But I'm trying to target all restaurants that are on the street Commonwealth.
Wouldn't this be the way to target it:
commonwealth = df[df['Street'] == 'Commonwealth AV']
Restaurant Number Street
a 700 Commonwealth AV
b 300 Faneuil Hall ST
c 440 Commonwealth AV
However, I am not getting any returns?
In addition, right before I targeted 'Street,' I actually separated the address into 'Street' and 'Number.' I am not sure if that changes anything.

As others have noted, you've got some whitespace in your strings. The clue is that the street names show up offset from the column name. Try
df['Street'] = df['Street'].str.strip()
to remove the leading and trailing whitespace. Then
df[df['Street'] == 'Commonwealth AV']
or
df.loc[df['Street'] == 'Commonwealth AV', :]
should get you the rows where the street is Commonwealth AV.
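A minimal sketch of the whole round trip, assuming the stray whitespace is a single leading space left over from splitting the address (the question does not show the raw values, so the sample frame below is made up to match the printed table):

```python
import pandas as pd

# Hypothetical data: the leading space in "Street" is an assumption.
df = pd.DataFrame({
    "Restaurant": ["a", "b", "c"],
    "Number": [700, 300, 440],
    "Street": [" Commonwealth AV", " Faneuil Hall ST", " Commonwealth AV"],
})

# Before stripping, the exact comparison matches nothing.
rows_before = len(df[df["Street"] == "Commonwealth AV"])

# Strip leading/trailing whitespace, then filter again.
df["Street"] = df["Street"].str.strip()
commonwealth = df[df["Street"] == "Commonwealth AV"]
rows_after = len(commonwealth)
```

Printing df['Street'].tolist() is a quick way to check for this, since the list repr shows the quotes and any padding inside them.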

Geograpy - two sentences with same province name return different results

Can anyone help explain why geograpy3 detects the province name in only one of these two texts, even though they carry the same information?
a = geograpy.get_geoPlace_context(text='I live in Gauteng South Africa')
a.other --> ['Gauteng']
b = geograpy.get_geoPlace_context(text='Gauteng South Africa')
b.other --> [] --> is it wrong?
Thanks all.
Geograpy needs to know which part of the sentence is the location. The "in" in the first example indicates that the location comes right after it. This sentence gives the same result: 'In Gauteng South Africa'.
For the second sentence, Geograpy cannot tell whether the location is 'Gauteng' or 'South Africa', so it returns nothing. Passing only 'Gauteng' gives the right answer.

How to match city names split by space?

Given two different types of address strings, how can I determine whether a city name is made up of more than one word? Since I'm working in Python, I split the string and save s[0] for the street number, s[-1] for the zip code, and so on, but how do I figure out whether the city name is split across words, such as New York or San Jose?
E.g.: 123 Main Street St. Louisville OH 43071 [City name is a single word]
E.g.: 45 Holy Grail Al. Niagara Town ZP 32908 [City name 'Niagara Town' is two words]
Forgive the noob question.
Thank you,
I'm making two assumptions here (x is the address string):
1) That the code before the town name is always purely numeric
2) That no town name contains a purely numeric word
tokens = x.split()
index = [i for i, token in enumerate(tokens) if token.isnumeric()][-1]
" ".join(tokens[index+1:])
So what is happening: we find the last token of the split that is purely numeric and take its index. Then we join all the tokens after that numeric token.
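For addresses laid out like the two examples in the question, where the ZIP code comes last and the state code sits just before it, a variant of the same idea can be sketched. It rests on assumptions the question does not confirm: the last two tokens are always state and ZIP, and the street-type abbreviation ("St.", "Al.") is the last token ending in a period, so a city such as "St. Louis" would break it. The name city_from_address is made up for illustration.

```python
def city_from_address(address):
    """Return the city from '<num> <street> <abbr.> <city> <state> <zip>'."""
    tokens = address.split()
    body = tokens[:-2]  # assumption: the last two tokens are state and ZIP
    # Assumption: the street-type abbreviation is the last token ending in "."
    dotted = [i for i, token in enumerate(body) if token.endswith(".")]
    if not dotted:
        return None  # no abbreviation found, so the city cannot be located
    return " ".join(body[dotted[-1] + 1:])


single = city_from_address("123 Main Street St. Louisville OH 43071")
double = city_from_address("45 Holy Grail Al. Niagara Town ZP 32908")
```

Whatever lies between the abbreviation and the state code is taken as the city, so one-word and two-word names need no separate handling.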

parsing string - regex help in python

Hi, I have this string in Python:
'Every Wednesday and Friday, this market is perfect for lunch! Nestled in the Minna St. tunnel (at 5th St.), this location is great for escaping the fog or rain. Check out live music every Friday.\r\n\r\nLocation: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\nVendors:\r\nKasa Indian\r\nFiveten Burger\r\nHiyaaa\r\nThe Rib Whip\r\nMayo & Mustard\r\n\r\n\r\nCATERING NEEDS? Have OtG cater your next event! Get started by visiting offthegridsf.com/catering.'
I need to extract the following:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
I tried to do this by using:
val = desc.split("\r\n")
and then val[2] gives the location, val[3] gives the time and val[6:11] gives the vendors. But I am sure there is a nicer, more efficient way to do this.
Any help will be highly appreciated.
If your input is always going to be formatted in exactly this way, using str.split() is preferable. If you want something slightly more resilient, here's a regex approach, using re.VERBOSE and re.DOTALL:
import re
desc_match = re.search(r'''(?sx)
    (?P<loc>Location:.+?)[\n\r]
    (?P<time>Time:.+?)[\n\r]
    (?P<vends>Vendors:.+?)(?:\n\r?){2}''', desc)
if desc_match:
    for gname in ['loc', 'time', 'vends']:
        print(desc_match.group(gname))
Given your definition of desc, this prints out:
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard
Efficiency really doesn't matter here because the time is going to be negligible either way (don't optimize unless there is a bottleneck.) And again, this is only "nicer" if it works more often than your solution using str.split() - that is, if there are any possible input strings for which your solution does not produce the correct result.
If you only want the values, just move the prefixes outside of the group definitions (a group is defined by (?P<group_name>...)):
r'''(?sx)
    Location: \s* (?P<loc>.+?) [\n\r]
    Time: \s* (?P<time>.+?) [\n\r]
    Vendors: \s* (?P<vends>.+?) (?:\n\r?){2}'''
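As a runnable sketch of the values-only idea, the version below swaps the line-break classes for (?:\r?\n)+ so that each captured value stops before its "\r\n" line ending. That substitution, the shortened sample text, and the helper name extract_fields are illustrative choices rather than part of the answer above.

```python
import re

# Shortened stand-in for the question's string, same "\r\n" layout.
desc = (
    "Check out live music every Friday.\r\n\r\n"
    "Location: 5th St. # Minna St.\r\nTime: 11:00am-2:00pm\r\n\r\n"
    "Vendors:\r\nKasa Indian\r\nFiveten Burger\r\n\r\n\r\n"
    "CATERING NEEDS?"
)

PATTERN = re.compile(r'''(?sx)
    Location: \s* (?P<loc>.+?)   (?:\r?\n)+    # value, then one or more line breaks
    Time:     \s* (?P<time>.+?)  (?:\r?\n)+
    Vendors:  \s* (?P<vends>.+?) (?:\r?\n){2}  # vendor list ends at a blank line
''')


def extract_fields(text):
    """Return a dict with loc, time, and a list of vendors, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    fields["vends"] = fields["vends"].splitlines()
    return fields


fields = extract_fields(desc)
```

Returning None when nothing matches keeps the caller's check as simple as the if desc_match: test in the original snippet.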
Another option is to split on the blank lines, with s holding the description string:
NLNL = "\r\n\r\n"
parts = s.split(NLNL)
result = NLNL.join(parts[1:3])
print(result)
which gives
Location: 5th St. # Minna St.
Time: 11:00am-2:00pm
Vendors:
Kasa Indian
Fiveten Burger
Hiyaaa
The Rib Whip
Mayo & Mustard

Finding multiple common starting strings

I have a list of strings in which one or more subsets of the strings have a common starting string. I would like a function that takes as input the original list of strings and returns a list of all the common starting strings. In my particular case I also know that each common prefix must end in a given delimiter. Below is an example of the type of input data I am talking about (ignore any color highlighting):
Population of metro area / Portland
Population of city / Portland
Population of metro area / San Francisco
Population of city / San Francisco
Population of metro area / Seattle
Population of city / Seattle
Here the delimiter is / and the common starting strings are Population of metro area and Population of city. Perhaps the delimiter won't ultimately matter but I've put it in to emphasize that I don't want just one result coming back, namely the universal common starting string Population of; nor do I want the common substrings Population of metro area / S and Population of city / S.
The ultimate use for this algorithm will be to group the strings by their common prefixes. For instance, the list above can be restructured into a hierarchy that eliminates redundant information, like so:
Population of metro area
    Portland
    San Francisco
    Seattle
Population of city
    Portland
    San Francisco
    Seattle
I'm using Python but pseudo-code in any language would be fine.
EDIT
As noted by Tom Anderson, the original problem as given can easily be reduced to simply splitting the strings and using a hash to group by the prefix. I had originally thought the problem might be more complicated because sometimes in practice I encounter prefixes with embedded delimiters, but I realize this could also be solved by simply doing a right split that is limited to splitting only one time.
Isn't this just looping over the strings, splitting them on the delimiter, then grouping the second halves by the first halves? Like so:
def groupByPrefix(strings):
    stringsByPrefix = {}
    for string in strings:
        prefix, suffix = map(str.strip, string.split("/", 1))
        group = stringsByPrefix.setdefault(prefix, [])
        group.append(suffix)
    return stringsByPrefix
In general, if you're looking for string prefixes, the solution would be to whop the strings into a trie. Any branch node with multiple children is a maximal common prefix. But your need is more restricted than that.
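To make the trie remark concrete, here is a rough sketch using nested dicts as a character trie. The function name branching_prefixes, the None end-of-string marker, and the choice to skip the empty prefix are illustrative assumptions.

```python
def branching_prefixes(strings):
    """Return every non-empty prefix whose trie node has two or more children."""
    trie = {}
    for s in strings:
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = {}  # end-of-string marker, counted as a child

    found = []
    stack = [("", trie)]
    while stack:
        prefix, node = stack.pop()
        if prefix and len(node) > 1:
            found.append(prefix)
        for ch, child in node.items():
            if ch is not None:
                stack.append((prefix + ch, child))
    return sorted(found)


small = branching_prefixes(["ab/x", "ab/y", "cd/x", "cd/y"])
```

Run on the six strings from the question, this also reports 'Population of ' and the two prefixes ending in '/ S', which is exactly what the question wants to exclude, so splitting on the delimiter is the better fit here.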
import collections
d = collections.defaultdict(list)
for place, name in ((i.strip() for i in line.split('/'))
                    for line in text.splitlines()):
    d[place].append(name)
so d will be a dict like:
{'Population of city':
    ['Portland',
     'San Francisco',
     'Seattle'],
 'Population of metro area':
    ['Portland',
     'San Francisco',
     'Seattle']}
You can replace (i.strip() for i in line.split('/')) by line.split(' / ') if you know there's no extra whitespace around your text.
Using csv.reader and itertools.groupby, treat the '/' as the delimiter and group by the first column:
from csv import reader
from itertools import groupby
# inp is any iterable of lines, such as an open file
for key, group in groupby(sorted(reader(inp, delimiter='/')), key=lambda x: x[0]):
    print(key)
    for line in group:
        print("\t", line[1])
This isn't very general, but may do what you need:
def commons(strings):
    return set(s.split(' / ')[0] for s in strings)
And to avoid going back over the data for the grouping:
def group(strings):
    groups = {}
    for s in strings:
        prefix, remainder = s.split(' / ', 1)
        groups.setdefault(prefix, []).append(remainder)
    return groups
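The right split mentioned in the question's edit, for prefixes that themselves contain the delimiter, could look roughly like this. The name group_by_prefix and the default delimiter ' / ' are assumptions, and a string without the delimiter would raise ValueError on unpacking.

```python
from collections import defaultdict


def group_by_prefix(strings, delimiter=" / "):
    """Group suffixes by prefix, splitting only at the last delimiter."""
    groups = defaultdict(list)
    for s in strings:
        # rsplit with maxsplit=1 keeps any earlier delimiters in the prefix
        prefix, suffix = s.rsplit(delimiter, 1)
        groups[prefix].append(suffix)
    return dict(groups)


grouped = group_by_prefix([
    "Population of city / Portland",
    "Population of city / Seattle",
    "Region / West / Seattle",
])
```

The only change from the split-based answers is the direction of the split, so 'Region / West' stays together as one prefix.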

How can I organize each scraped item into a csv row?

What is the best way to organize scraped data into a csv? More specifically each item is in this form
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.
There's really no shortcut on this. Line 1 is easy. Just assign it to url. Line 3 can probably be split on , without any ill effects, but line 2 will have to be manually parsed. What do you know about word1-wordN? Are you sure "practice" will never be a "word"? Are you sure the words are only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits, so you're left with a list of words, split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no.) If there are a variable number of words, you don't want to inline them in your tuple like that, because you don't know how to get them out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
More specifically? You're on your own there.
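For what it's worth, here is a sketch that handles the one example item and nothing more. It assumes every item matches that example exactly: three ' - ' separated parts, a three-token name, the literal word ' practice ' before the city, and a three-field education line. The names parse_item and row are made up, and the practice areas are kept as a single CSV cell rather than a list.

```python
import csv
import io


def parse_item(url, profile, education):
    """Turn the three lines of one scraped item into a flat row."""
    # Drop the surrounding quotes and the final period, then split in three.
    name, level, rest = profile.strip('"').rstrip(".").split(" - ", 2)
    first, middle, last = name.split()           # assumes exactly three tokens
    areas, city = rest.rsplit(" practice ", 1)   # assumes "practice" precedes the city
    _degree, school, year = [part.strip() for part in education.split(",")]
    return [url, first, middle, last, level, areas, city, school, year]


row = parse_item(
    "http://www.examplefirm.com/jang",
    '"Joe E. Ang - partner - privatization mergers, media & technology practice New York."',
    "JD, University of Chicago Law School, 1985",
)

# csv.writer quotes the cell that contains a comma, so it stays one column.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_text = buffer.getvalue()
```

Any item that breaks one of those assumptions raises ValueError on unpacking, which at least makes the odd cases easy to find and handle by hand.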
