I am very new to Python but seem to be getting along. I am writing a web crawler in Python.
I've got the crawler working using the Beautiful Soup library and want to find the best library for parsing or splitting an address into its constituent parts.
Here is a sample of the text to be parsed.
['\r\n\t \t\t \t25 Stockwood Road', <br/>, 'Asheville, NC 28803', <br/>, '\t (828) 505-1638\t \t']
I understand it's a list, and I can figure out how to remove the control characters.
Since I'm so new, I'd like recommendations on what libraries are being used for this - Python version, OS and prerequisites.
I'd like to figure out the code for myself, but if you're inclined to offer a sample, I wouldn't argue. :)
You can try the Python library usaddress (there's also a web interface for trying it out).
It parses addresses probabilistically, and is much more robust than regex-based parsers when dealing with messy addresses.
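A minimal sketch of what that might look like, assuming usaddress is installed (pip install usaddress) and feeding it the address from your sample:

import usaddress

# tag() groups the tokens into labeled address components;
# parse() would instead return per-token (token, label) pairs.
tagged, address_type = usaddress.tag('25 Stockwood Road Asheville, NC 28803')
print(address_type)   # 'Street Address'
print(tagged)
# roughly: OrderedDict([('AddressNumber', '25'), ('StreetName', 'Stockwood'),
#                       ('StreetNamePostType', 'Road'), ('PlaceName', 'Asheville'),
#                       ('StateName', 'NC'), ('ZipCode', '28803')])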
List comprehensions are pretty sleek for something like this. Also look into str.strip(). It won't remove the HTML elements like <br/>, but the tabs, newlines and spaces will be cleaned up.
out = [x.strip() for x in lst]
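If the <br/> Tag objects in that list get in the way, a small sketch (assuming the list came from a Beautiful Soup .contents call on markup like your sample) is to filter on NavigableString first:

from bs4 import BeautifulSoup, NavigableString

# Rebuild a list like the one in the question (the markup here is a guess at the source page).
soup = BeautifulSoup('<div>\r\n\t \t\t \t25 Stockwood Road<br/>Asheville, NC 28803<br/>\t (828) 505-1638\t \t</div>', 'html.parser')
lst = soup.div.contents

# Keep only the text nodes and strip the whitespace; the <br/> tags are dropped.
out = [x.strip() for x in lst if isinstance(x, NavigableString)]
print(out)   # ['25 Stockwood Road', 'Asheville, NC 28803', '(828) 505-1638']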
I am using Python 3.4, Windows 10, and Visual Studio 2015. I am trying to make a program that scrapes phone numbers from websites formatted like this one.
I am using Beautiful Soup 4 and am trying to get the number of beds from the table. I have tried soup.select('.td'), but it only returns an empty list, and I am not sure what else to try.
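For what it's worth, in a CSS selector .td means an element with class="td"; to grab the cells themselves you select the bare tag name. A minimal sketch against a made-up table (the real page's markup isn't shown here):

from bs4 import BeautifulSoup

# A made-up listing table standing in for the page in the question.
html = '<table><tr><th>Beds</th><td>3</td></tr><tr><th>Baths</th><td>2</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('.td'))   # [] -- looks for class="td", which nothing has
print(soup.select('td'))    # [<td>3</td>, <td>2</td>] -- selects the <td> tags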
Why not grab the entire page HTML as a string and then use a regular expression to parse it? Is that not where Python excels?
In case you are afraid of regex, here is a beginner-friendly tutorial:
https://regexone.com/
The syntax for Python might be slightly different:
https://docs.python.org/2/library/re.html
And I seriously hope you are not scraping phone numbers for nefarious purposes. I don't want a phone call from you :-).
Here is another Stack Overflow answer which gives a good starting regex:
https://stackoverflow.com/a/123666/5129424
Here's a regex for a 7- or 10-digit number, with extensions allowed; the delimiters can be spaces, dashes, or periods:
^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
Just because you "might mess it up" doesn't mean you shouldn't try it and test it. Regardless of what you do, you are either at the mercy of the structure of the page, which may change, or the format of the phone numbers, which may also change. There is no perfect solution.
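A sketch of how that pattern could be put to work, assuming requests and Beautiful Soup are used to flatten the page to text first (the URL is a placeholder). The ^ and $ anchors are dropped so the pattern can scan free text rather than validate a whole string:

import re
import requests
from bs4 import BeautifulSoup

# The pattern above, minus the ^...$ anchors, split across lines for readability.
PHONE_RE = re.compile(
    r'(?:(?:\+?1\s*(?:[.-]\s*)?)?'
    r'(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)'
    r'|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?'
    r'([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})'
    r'(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?')

html = requests.get('http://example.com/listing').text   # placeholder URL
text = BeautifulSoup(html, 'html.parser').get_text()

for match in PHONE_RE.finditer(text):
    print(match.group(0))   # the whole phone number as it appears in the text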
I'd like to scrape the HTML below so that I can get [u'Hero', u'Adventurer', u'King', u'Wizard', u"Marceline's Henchman", u'Nice Knight']. I've tried a variety of different things with XPath, and I've also explored Beautiful Soup, but I feel there are too many extra rules I'd have to squeeze into XPath to get the output that I want. For example, I don't want anything from inside the parentheses in my output, and I'd like the text in the a tags outside of parentheses (like "Marceline") to form one element together with the other text between the same br tags (like "'s Henchman").

I was wondering if there is some kind of alternative way of scraping that doesn't look at the raw HTML and instead looks at the actual web browser display of it, because that's arranged really conveniently, like a list. Is there anything out there along the lines of what I'm thinking? If there's nothing out there, I'm thinking about just parsing through this purely in Python, but I'd first like to see what tools you've used to deal with moderately complicated scraping. Thanks!
This HTML snippet is part of a larger document; I got it by printing this:
occupation = data.xpath("tr[td/b[contains(.,'Occupation')]]/td[position()>1]").extract()[0]
print occupation
<td> Hero
<br>Adventurer
<br>King (formerly in "<a href="/wiki/The_Silent_King" title="The Silent King">The Silent King</a>")<br>Wizard (formerly in "Wizard")
<br><a href="/wiki/Marceline" title="Marceline">Marceline</a>'s Henchman (formerly in "Henchman" )
<br>Nice Knight (formerly in "Loyalty to the King")
</td>
P.S. I suppose the things in parentheses can easily be removed later in Python; it's getting a list where everything is separated by the br tags that's a little confusing for me.
There are 3 ways to manipulate the xhtml data:
xpath - the best way to parse and extract from the xhtml doc itself: scrapy response.xpath('..'). You seem to know that already (well, there's also the .css('...') selector, but IMHO there is usually no reason to use it).
regular expression - the best way to match and manipulate free text: scrapy response.xpath('..').re('<regex goes here>')
regular python - can be avoided in most cases
AFAIK no other magic. A sketch combining the first two with a little plain Python follows.
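This one uses lxml.html as a stand-in for the Scrapy selector, with the <td> snippet from the question as input; every <br> is treated as a line break and the parenthesized parts are stripped with a regex:

import re
from lxml import html

td_html = '''<td> Hero
<br>Adventurer
<br>King (formerly in "<a href="/wiki/The_Silent_King" title="The Silent King">The Silent King</a>")<br>Wizard (formerly in "Wizard")
<br><a href="/wiki/Marceline" title="Marceline">Marceline</a>'s Henchman (formerly in "Henchman" )
<br>Nice Knight (formerly in "Loyalty to the King")
</td>'''

# Turn every <br> into a newline, then let text_content() flatten the remaining tags,
# so "Marceline" and "'s Henchman" end up on the same line.
normalized = re.sub(r'(?i)<br\s*/?>', '\n', td_html)
td = html.fromstring('<table><tr>%s</tr></table>' % normalized).xpath('//td')[0]

occupations = []
for line in td.text_content().splitlines():
    line = re.sub(r'\s*\(.*?\)\s*', '', line).strip()   # drop the "(formerly ...)" parts
    if line:
        occupations.append(line)

print(occupations)
# ['Hero', 'Adventurer', 'King', 'Wizard', "Marceline's Henchman", 'Nice Knight']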
So, I've been doing some research for a while now and I couldn't find anything about detecting a URL in a string. The problem is that most results are about detecting whether a string IS a URL, not whether it contains one. The 2 results that look best to me are
Regex to find urls in string in Python
and
Detecting a (naughty or nice) URL or link in a text string
but the first requires http://, which is not something spammers would use (:P), and the second one isn't regex - and with my limited knowledge I don't know how to translate either of them. Something I have considered doing is using something dull like
spamlist = [".com", ".co.uk", "etc"]
for word in string.split():
    if word in spamlist:
        Do().stuff()
But that would honestly do more harm than good, and I am 100% sure there is a better way using regex or anything else!
So if anyone knows anything that could help, I'd be very grateful! I've only been doing Python for 1-2 months, and not very intensively during this period, but I feel like I'm making great progress and this one thing is really all that's in the way.
EDIT: Sorry for not specifying earlier, I am looking to use this locally, not website (apache) based or anything similar. More trying to clean out any links from files I've got hanging around.
As I said in the comments,
the solution in "Detecting a (naughty or nice) URL or link in a text string" is a regex, and you should probably make it a raw string or escape the backslashes in it when using it in Python
You really shouldn't reinvent the square wheel here, especially since spam filtering is an arms-race kind of domain.
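If a rough local cleanup is all that's needed, here is a deliberately simple sketch; the pattern is only illustrative and nowhere near what a real spam filter uses:

import re

# Deliberately naive: an optional scheme or www., something domain-shaped, an optional path.
URL_RE = re.compile(
    r'(?:https?://|www\.)?'                                # optional scheme or leading www.
    r'[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,6}'    # domain ending in a TLD-ish label
    r'(?:/[^\s]*)?',                                       # optional path
    re.IGNORECASE)

text = "Buy cheap stuff at example.com or http://spam.co.uk/deals today!"
print(URL_RE.findall(text))   # ['example.com', 'http://spam.co.uk/deals']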
You can see the description here: http://www.mdh.org/sites/www/healthapp/jobs/View.aspx?id=10
MDH Human Resources
525 E. Grant St.
Macomb, IL 61455
T: 309-836-1577
F: 309-836-1677
The page has this address and I want to extract City and State using regex. In this case it's Macomb and IL.
At first I used the following regex, but it did not work when the description contained more than one similar pattern.
(\w+),\s+(\w{2})\s+\d+
How can I write a regex that first extracts these address lines and then picks out the line with this pattern?
^([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}$
which I think should be strict enough to keep the noise away. The downside is that it could also miss some of what you want; in that case, you may want to iterate through the page with a less strict regular expression like yours. Either way, you can't achieve perfection with regex.
This works in JavaScript; adjust the syntax as needed for Python.
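A sketch of how that might look in Python, assuming the page is fetched with requests and flattened to text with Beautiful Soup first (the URL is the one from the question and may have changed since):

import re
import requests
from bs4 import BeautifulSoup

url = 'http://www.mdh.org/sites/www/healthapp/jobs/View.aspx?id=10'
text = BeautifulSoup(requests.get(url).text, 'html.parser').get_text()

# Apply the pattern line by line so only whole lines shaped like "City, ST 12345" match.
CITY_STATE_RE = re.compile(r'^([A-Z][A-Za-z\s]*),\s+([A-Z]{2})\s+\d{5}$')

for line in text.splitlines():
    m = CITY_STATE_RE.match(line.strip())
    if m:
        print(m.group(1), m.group(2))   # e.g. Macomb IL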
I'm trying to get a list of craigslist states and their associated URLs. Don't worry, I have no intention of spamming; if you're wondering what this is for, see the * below.
What I'm trying to extract begins on the line after 'us states' and is the next 50 <li> elements. I read through html.parser's docs, and it seemed too low level for this - more aimed at building a DOM parser or doing syntax highlighting/formatting in an IDE than at searching - which makes me think my best bet is regular expressions. I would like to keep myself contained to what's in the standard library, just for the sake of learning. I'm not asking for help writing a regular expression (I'll figure that out on my own), just making sure there isn't a better way to do this before spending the time on that.
*This is my first program beyond simple Python scripts. I'm making a C++ program to manage my posts and remind me when they've expired in case I want to repost them, and a Python script to download a list of all of the US states and cities/areas in order to populate a combobox in the GUI. I really don't need it, but I'm aiming to make this 'production ready'/feature complete, both as a learning exercise and to create a portfolio to possibly get a job. I don't know if I'll make the program publicly available or not; there's obvious potential for misuse, and it's probably against their ToS anyway.
There is xml.etree, an XML parser, available in the Python standard library itself. You should not use regex for parsing XML. Go to the particular node where you find the information and extract the links from that.
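A minimal sketch of the ElementTree route on a made-up, well-formed snippet; note that ElementTree only copes with well-formed markup, so real craigslist HTML may need the standard library's html.parser instead:

import xml.etree.ElementTree as ET

# A tiny stand-in for the list that follows the 'us states' heading.
snippet = '''<ul>
  <li><a href="https://auburn.craigslist.org">auburn</a></li>
  <li><a href="https://bham.craigslist.org">birmingham</a></li>
</ul>'''

root = ET.fromstring(snippet)
for a in root.iter('a'):             # walk every <a> under the list
    print(a.text, a.get('href'))     # auburn https://auburn.craigslist.org ...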
Use lxml.html. It's the best Python HTML parser, and it supports XPath!
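A sketch of the lxml.html route against the same kind of made-up fragment; the real page's markup should be checked in a browser and the XPath adjusted to match:

from lxml import html

# A made-up fragment standing in for the craigslist sites page.
page = '''<div>
<h4>us states</h4>
<ul>
  <li><a href="https://auburn.craigslist.org">auburn</a></li>
  <li><a href="https://bham.craigslist.org">birmingham</a></li>
</ul>
</div>'''
doc = html.fromstring(page)

# Every <li><a> inside the list that follows the 'us states' heading.
for a in doc.xpath('//h4[contains(text(), "us states")]/following-sibling::ul//li/a'):
    print(a.text, a.get('href'))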