Trying to create a simple Python web crawler

I have decided to learn Python 2.7 for data analysis and have been watching many YouTube tutorials to get a good understanding of the basics.
I am at the stage where I want to create simple web crawlers, for educational purposes only, to learn different techniques and just get used to the coding.
I am following a tutorial for a web crawler, but I am not sure of a few things. This is what I have so far:
import requests
from bs4 import BeautifulSoup

url = 'http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts'
r = requests.get(url)
plain_text = r.text
soup = BeautifulSoup(plain_text, 'html.parser')
# each legislative alert sits in a <div class="ec_statements"> block
statements = soup.findAll('div', 'ec_statements')
for link in statements:
    print(link.contents)
I can't seem to separate out the href links or display the text and date information.
I want it to look like this:
Name of Article
Link to Article
Date of Article
Could someone help with some information on what steps are needed and why, please?
Much appreciated!

A little code to help you. In bs4 every node is connected: your loop already gives you a "link" node (actually a div), and you can reach its child tag a with link.a.
A node then has two kinds of values: attributes, accessed like a['href'], and text content, accessed via a.text.
for link in statements:
    print(link.a['href'])
PS:
This is the link variable:
<div id="legalert_title"><a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a></div>
This is link.a:
<a href="/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act">Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"</a>
This is link.a['href']:
/Legislation-and-Politics/Legislative-Alerts/Letter-to-Representatives-opposing-the-Fairness-in-Class-Action-Litigation-and-Furthering-Asbestos-Claim-Transparency-Act
This is link.a.text:
Letter to Representatives opposing the "Fairness in Class Action Litigation and Furthering Asbestos Claim Transparency Act"
All HTML works this way; it may be worth learning a little HTML alongside this.
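Putting it together, here is a minimal sketch of the Name / Link / Date layout the question asks for. Only legalert_title is confirmed above; the 'legalert_date' id below is an assumption about the page markup, so inspect the HTML and adjust:

import requests
from bs4 import BeautifulSoup

base = 'http://www.aflcio.org'
page = requests.get(base + '/Legislation-and-Politics/Legislative-Alerts')
soup = BeautifulSoup(page.text, 'html.parser')

for statement in soup.findAll('div', 'ec_statements'):
    if statement.a is None:
        continue                          # skip blocks without a link
    print(statement.a.text)               # Name of Article
    print(base + statement.a['href'])     # Link to Article (the href is site-relative)
    # 'legalert_date' is a guessed id for the date element; confirm it in the page source
    date = statement.find('div', id='legalert_date')
    if date:
        print(date.text)                  # Date of Article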

Related

Is there a way to scrape URLs without scraping links?

Basically, I'm trying to download some images from a site which has been "down for maintenance" for over a year. Attempting to visit any page on the website redirects to the forums. However, visiting image URLs directly will still take you to the specified image.
There's a particular post that I know contains more images than I've been able to brute force by guessing possible file names. I thought, rather than typing every combination of characters possible ad infinitum, I could program a web scraper to do it for me.
However, I'm brand new to Python, and I've been unable to overcome the JavaScript redirects. While I've been able to use requests and BeautifulSoup to scrape hrefs from the page it redirects to, without circumventing the JS I cannot pull anything from the news article that actually has the links.
import requests
from bs4 import BeautifulSoup
url = 'https://www.he-man.org/news_article.php?id=6136'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
I've added allow_redirects=False to every request, to no avail. I've tried searching for 'jpg' instead of 'href'. I'm currently attempting to install Selenium, though it's fighting me, and I expect I'll just be getting 3xx errors anyway.
The images from the news article all begin with the same 62 characters (https://www.he-man.org/assets/images/home_news/justinedantzer_), so I've also thought maybe there's a way to just infinite-monkeys-on-a-keyboard scrape the rest of it? Or by the type of file (.jpg)? I'm open to suggestions here; I really have no idea what direction to come at this from now that my first six plans have failed. I've seen a few mentions of scraping spiders, but at this point I've sunk a lot of time into dead ends. Are they worth looking into? What would you recommend?
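For what it's worth, one way to brute-force a known prefix without a browser is to probe candidate URLs directly and keep the ones that return 200. This is only a sketch under the question's own assumptions: the prefix and .jpg extension come from the question, but the three-letter suffix generator is purely illustrative (and note this fires thousands of requests at the server):

import itertools
import string
import requests

PREFIX = 'https://www.he-man.org/assets/images/home_news/justinedantzer_'

# candidate suffixes -- illustrative only; real filenames may follow another pattern
candidates = (''.join(chars) for chars in itertools.product(string.ascii_lowercase, repeat=3))

for suffix in candidates:
    url = PREFIX + suffix + '.jpg'
    # HEAD avoids downloading the image body just to test whether it exists
    resp = requests.head(url, allow_redirects=False)
    if resp.status_code == 200:
        print('found:', url)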

Web Scraping Using Python for NLP project

I have to scrape text data from this website. I have read some blog posts on web scraping, but the major challenge I've found is parsing HTML. I am entirely new to this field. Can I get some help on how to scrape text data (where possible) and turn it into a CSV? Is this possible at all without knowledge of HTML? Can I expect a good demonstration of Python code solving my problem, so I can then try it on my own for other websites?
TIA
The tools you can use in Python to scrape and parse HTML data are the requests module and the Beautiful Soup library.
Parsing HTML files into, for example, CSV files is entirely possible; it just requires some effort to learn the tools. In my view there's no better way to learn this than by trying it out yourself.
As for "do you need to know HTML to parse HTML files?": well, yes you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
I appreciate my answer is not really what you were looking for; however, as I said, I think there's no better way to learn than to try things out yourself. If you're then stuck on anything in particular you can ask on SO for specific help :)
I didn't check the HTML of the website, but you can use BeautifulSoup for parsing the HTML and pandas for converting the data into a CSV.
Sample code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://yourwebsite.com')   # placeholder URL
soup = BeautifulSoup(res.content, 'html.parser')

# suppose I want all 'li' tags and the links inside them
lis = soup.find_all("li")
links = []
for li in lis:
    a_tag = li.find("a")
    if a_tag:                                   # skip list items without a link
        links.append(a_tag.get("href"))
And you can find lots of tutorials on pandas online.
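To finish the CSV step with pandas, a minimal sketch (the column name is illustrative):

import pandas as pd

# write the collected hrefs to a one-column CSV
pd.DataFrame({'link': links}).to_csv('links.csv', index=False)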

How to download all the comments from a news article using Python?

I have to admit that I don't know much HTML. I am trying to extract all the comments from an article in the online news using Python. I tried using BeautifulSoup, but it seems the comments are not in the HTML source code, though they do show up in inspect-element. For instance, you can check here: http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is here, and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this info is not there in the source-code.
Any suggestion, how to move forward?
I have not attempted things like this, but my guess is that if you want to get it directly from the page source you'll need something like Selenium to actually render the page, since the page is dynamic.
Alternatively, if you're only interested in the comments, you may use dailymail.co.uk's API to acquire them.
Note the items in the query string: "max=1000", "&order", etc. You may also need to use the "offset" variable alongside "max" to collect all the comments if the API caps the maximum "max" value.
I do not know where the API is documented; you can discover it by viewing the network requests your browser makes while you browse the page.
You can get comment data for that page in JSON format from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout. Every article has a numeric ID like this in its URL; swap that number in for each new story whose comments you want.
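As a sketch of paging through that endpoint with requests: the query parameters come from the URL above, but the 'payload'/'page' keys used to unpack the JSON are guesses, so print(payload.keys()) to map the real structure first:

import requests

API = 'http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519'

offset = 0
page_size = 500
while True:
    resp = requests.get(API, params={'max': page_size, 'order': 'desc', 'offset': offset})
    payload = resp.json()
    # key names below are assumptions about the JSON layout
    comments = payload.get('payload', {}).get('page', [])
    if not comments:
        break
    for comment in comments:
        print(comment)
    offset += page_size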
Thank you FredMan. I did not know about this API. It seems we only need to give the article ID and we can get the comments for the article. This was the solution I was looking for.

Python Web Scraping with Beautiful Soup - Having Trouble

I'm using BeautifulSoup to try to pull either the top links or simply the top headlines from different topics on the CNN homepage. I seem to be missing something here and would appreciate some assistance. I have managed to come up with a few web scrapers before, but it's always through a lot of resistance and is quite the uphill battle.
What it looks like to me is that the links I need are ultimately stored somewhere like this:
<article class="cd cd--card cd--article cd--idx-1 cd--extra-small cd--has-siblings cd--media__image" data-vr-contentbox="/2015/10/02/travel/samantha-brown-travel-channel-feat/index.html" data-eq-pts="xsmall: 0, small: 300, medium: 460, large: 780, full16x9: 1100" data-eq-state="small">
I can grab that link after data-vr-contentbox and append it to the end of www.cnn.com and it brings me to the page I need. My problem is in actually grabbing that link. I've tried various forms to grab them. My current iteration is as follows:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.cnn.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")   # explicit parser avoids the bs4 warning
for link in soup.findAll("article"):
    test = link.get("data-vr-contentbox")
    print(test)
My issue here is that it only seems to grab a small number of things that I actually need. I'm only seeing two articles from politics, none from travel, etc. I would appreciate some assistance in resolving this issue. I'm looking to grab all of the links under each topic. Right now I'm just looking at politics or travel as a base to get started.
Particularly, I want to be able to specify the topic (tech, travel, politics, etc.) and grab those headlines. Whether I could simply grab the links and use those to get the headline from the respective page, or simply grab the headlines from here... I seem unable to do either. It would be nice to be able to view everything in a single topic at once, but finding out how to narrow this down isn't proving very simple.
An example article is "IOS 9's Wi-Fi Assist feature costly", which can be found within <div> tags.
I want to be able to find ALL articles under, say, the Tech heading on the homepage and isolate those tags to grab the headline. The tags for this headline look like this:
<div class="strip-rec-link-title ob-tcolor">IOS 9's Wi-Fi Assist feature costly</div>
Yet I don't know how to do BOTH of these things. I can't even seem to grab the headline, despite it being within <div> tags, when I try this:
for link in soup.findAll("div"):
    print("")
    print(link)
I feel like I have a fundamental misunderstanding somewhere, although I've managed to do some scrapers before.
My guess is that the cnn.com website has a bunch of JavaScript which renders a lot of the content in the browser, so BeautifulSoup never sees it. I opened cnn.com and looked at the source in Safari and there were 197 instances of data-vr-contentbox. However, when I ran it through BeautifulSoup and dumped it out, there were only 13 instances of data-vr-contentbox.
There are a bunch of posts out there about handling this. You can start with the method used in this question: Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
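As an alternative to the PyQt4 approach in that link, a Selenium sketch (Selenium is a substitute I'm naming here, not what the linked question uses; it assumes chromedriver is installed):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()          # requires chromedriver on your PATH
driver.get('http://www.cnn.com/')
html = driver.page_source            # the HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
links = [a.get('data-vr-contentbox')
         for a in soup.find_all('article')
         if a.get('data-vr-contentbox')]
print(len(links), 'articles found')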

How do you grab a headline from a blog/article like Techmeme?

I'm creating a type of news aggregator and I would like to create a program (Python) that correctly detects the headline and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have included the BeautifulSoup and Requests modules. Any other open source modules I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning - look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags using a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme currently are labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work and testing and periodic revalidation to be sure the webmaster hasn't switched things up on you.
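A minimal sketch of that approach, using the two tags named above (these selectors reflect Techmeme's markup at the time of the answer and will need revalidating):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://techmeme.com/').text, 'html.parser')

# level 1 headlines live in <div class="ii">, level 2 in <strong class="L1">
for tag in soup.select('div.ii') + soup.select('strong.L1'):
    print(tag.get_text(strip=True))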
After playing around a bit I find that this works best:
Use the BeautifulSoup and Requests modules:
r = requests.get('http://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
if soup.findAll('title'):
    title = soup.find('title')
    print(title.renderContents())
What results is title text that should be cleaned up a bit using regular expressions.
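For instance, a minimal cleanup sketch (collapsing whitespace is just one plausible normalization):

import re

# collapse runs of whitespace left over from the markup
clean = re.sub(r'\s+', ' ', title.text).strip()
print(clean)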
Maybe it could be much easier to parse their RSS/Atom feeds. Google easily turns up these libraries: http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those feeds are pure XML, so you could also use the built-in urllib and XML (DOM or SAX) libraries.
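For the pure-XML route, a sketch using only the standard library (the feed URL is a placeholder):

import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = 'http://example.com/rss.xml'   # placeholder; substitute a real feed URL

with urllib.request.urlopen(FEED_URL) as resp:
    tree = ET.parse(resp)

# RSS 2.0 nests each headline under channel/item/title
for item in tree.iterfind('./channel/item'):
    print(item.findtext('title'))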
