Page external links count in Python

I need the following functions in Python:
-check the external link count on a site's pages.
-check whether a given link is present on a given page.
Does anybody know good solutions/libraries for this task?
I think I should use BeautifulSoup here; maybe some other library could help as well?

You should be able to use the urllib2 module to fetch the page, use BeautifulSoup to parse it and extract the links, store them in a list, and then match against that list to check for a particular link. There are a number of questions on BeautifulSoup on SO itself.
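For instance, here is a minimal sketch of both checks, assuming Python 3 (urllib.request in place of urllib2) and taking "external" to mean a link whose host differs from the page's own host:

from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_links(page_url):
    """Fetch page_url and return every <a href> target as an absolute URL."""
    html = urlopen(page_url).read()
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]


def count_external_links(page_url):
    """Count links whose host differs from the page's own host."""
    own_host = urlparse(page_url).netloc
    return sum(1 for link in get_links(page_url)
               if urlparse(link).netloc not in ("", own_host))


def link_is_present(page_url, target_url):
    """Check whether target_url appears among the page's links."""
    return target_url in get_links(page_url)

For example, count_external_links("http://example.com") covers the first check and link_is_present(page, link) covers the second.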

Related

How can you iterate through the list of websites in the search engine in Python?

When you search for something in your browser, it returns a list of websites related to your search by default, but I was wondering if there is a way to store/print/iterate over the list of URLs shown on that results page.
I haven't tried anything because I don't even know which Python library I should use.
Which library should I use for this purpose?
I hope this is a valid question.
Beautiful Soup
Requests
Selenium
Pick your poison.
Read the docs.
???
Profit!
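As a rough illustration of the requests + Beautiful Soup route, here is a sketch that collects the result links from a search page. The URL is a hypothetical placeholder, and the assumption that results appear as plain <a href> anchors is mine; real search engines change their markup, block scrapers, or push you toward Selenium or an official API instead.

import requests
from bs4 import BeautifulSoup

# Hypothetical search URL; substitute whatever results page you actually query.
search_url = "https://search.example/results?q=python+web+scraping"

response = requests.get(search_url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

# Assumption: result links are ordinary absolute <a href="..."> anchors.
result_urls = [a["href"] for a in soup.find_all("a", href=True)
               if a["href"].startswith("http")]

for url in result_urls:
    print(url)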

How to download all the comments from a news article using Python?

I have to admit that I don't know much HTML. I am trying to extract all the comments from an online news article using Python. I tried using Python's BeautifulSoup, but it seems the comments are not in the HTML source code, only in the inspect-element view. For instance, you can check here: http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is below and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this information is not in the source code.
Any suggestions on how to move forward?
I have not attempted things like this, but my guess is that if you want to get it directly from the page source you'll need something like Selenium to actually render the page, since the page is dynamic.
Alternatively, if you're only interested in the comments, you may use dailymail.co.uk's API to acquire them.
Note the items in the query string: "max=1000", "&order", etc. You may also need to use the "offset" variable alongside "max" to retrieve all the comments if the API caps the maximum "max" value.
I do not know where the API is documented; you can discover it by viewing the network requests your browser makes while you browse the webpage.
You can get the comment data for that page in JSON format from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout. It appears that every article has a numeric id like "5101863" in its URL; you can swap that number in for each new story whose comments you want.
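A hedged sketch of that approach, using the readcomments URL from above (the exact JSON layout is an assumption here; print the raw response first and adjust the keys to whatever the API really returns):

import json
from urllib.request import urlopen

article_id = "5100519"  # the number from the article's URL
api_url = ("http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/"
           + article_id + "?max=1000&order=desc&rcCache=shout")

with urlopen(api_url) as response:
    data = json.loads(response.read().decode("utf-8"))

# Inspect the structure before assuming where the comment bodies live.
print(json.dumps(data, indent=2)[:2000])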
Thank you FredMan. I did not know about this API. It seems we only need to supply the article id and we can get the comments from the article. This was the solution I was looking for.

BeautifulSoup object not matching a website's html markup in chrome's DeveloperTools

I am trying to crawl this link using Python's BeautifulSoup and urllib2 libraries. One problem that I am running into is that the soup object does not match the webpage's HTML shown in Google Chrome's DeveloperTools. I checked multiple times and I am certain that I am passing the correct address. The reason I know they are different is that I printed the entire soup object into Sublime Text 2 and compared it against what is shown in Chrome's DeveloperTools. I also searched for really specific tags in the soup object. After debugging for hours, I am out of ideas. Does anyone know why this is happening? Is there some sort of redirection going on?
JavaScript runs in the website and changes its DOM. A URL library (such as urllib2) only downloads the HTML and does not execute included/linked JavaScript. That's why you see a difference.
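If you do need the JavaScript-modified DOM, a minimal Selenium sketch looks roughly like this (assuming a recent Selenium with Chrome installed; driver setup details vary):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes a Chrome driver is available
try:
    driver.get("http://example.com/page-built-by-javascript")  # hypothetical URL
    # page_source reflects the DOM after scripts have run
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.find_all("a", href=True)[:10])
finally:
    driver.quit()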

Extract site that HTML document came from

I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site they came from. What function can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there a specific tag that I should be looking for in the document? I do not need the full URL; I just need to know the name of the website.
You can only do that if the URL is mentioned somewhere in the source...
First find out where the URL is, if it is mentioned at all. If it's there, it will probably be in the base tag. Sometimes websites have nice headers with a link to their landing page, which could be used if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the pages.
If the way the URL is mentioned is similar in all the pages then your job is easy: use re, BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
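A sketch of that idea with BeautifulSoup, checking a few places where the original URL is commonly recorded (the base tag, a canonical link, an og:url meta tag); whether any of them is actually present depends entirely on how the pages were saved:

from urllib.parse import urlparse

from bs4 import BeautifulSoup


def guess_source_site(html_path):
    """Return the host name recorded in a saved HTML file, or None."""
    with open(html_path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f, "html.parser")

    candidates = []
    if soup.base and soup.base.get("href"):
        candidates.append(soup.base["href"])
    canonical = soup.find("link", rel="canonical")
    if canonical and canonical.get("href"):
        candidates.append(canonical["href"])
    og_url = soup.find("meta", property="og:url")
    if og_url and og_url.get("content"):
        candidates.append(og_url["content"])

    for url in candidates:
        host = urlparse(url).netloc
        if host:
            return host
    return None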

How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?

How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?
Parse the HTML with an HTML parser, find all the <a> tags (e.g. using Beautiful Soup's findAll() method), and check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
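For example, a rough sketch of both routes (the regex here is at the deliberately simple end of that spectrum and will miss plenty of edge cases):

import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

html = urlopen("http://example.com").read().decode("utf-8", errors="ignore")

# Route 1: parse the HTML and take the href of every <a> tag.
soup = BeautifulSoup(html, "html.parser")
hyperlinks = [a["href"] for a in soup.find_all("a", href=True)]

# Route 2: a crude regex that catches http(s) URLs anywhere in the page,
# even outside hyperlinks.
all_urls = re.findall(r"https?://[^\s\"'<>]+", html)

print(len(hyperlinks), len(all_urls))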
You don't do it with urllib2 alone. What you are looking for is parsing the URLs in a web page.
You fetch the first page using urllib2, read its contents, and then pass them through a parser like BeautifulSoup, or, as the other poster explained, you can use a regex to search the contents of the page too.
You could simply download the raw HTML with urllib2 and then search through it. There might be easier ways, but you could do this:
1: Download the source code.
2: Use string methods to split it into a list.
3: Check the first 7 characters of each piece.
4: If the first 7 characters are http://, write that piece to a variable.
Why do you need separate variables though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another url?
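A literal sketch of those steps, following that suggestion and collecting everything that starts with http:// into a list rather than separate variables (splitting on quote characters is an assumption about how the URLs are delimited in the markup):

from urllib.request import urlopen

# 1: download the source code
html = urlopen("http://example.com").read().decode("utf-8", errors="ignore")

# 2: split it into pieces wherever a quote character appears
pieces = html.replace("'", '"').split('"')

# 3 and 4: keep every piece that starts with http:// (or https://)
urls = []
for piece in pieces:
    if piece.startswith(("http://", "https://")):
        urls.append(piece)  # list.append(URL_YOU_JUST_FOUND)

print(urls)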
