How to download all the comments from a news article using Python?

I have to admit that I don't know much HTML. I am trying to extract all the comments from an online news article using Python. I tried BeautifulSoup, but it seems the comments are not in the HTML source code; they only show up in the inspect-element view. For instance, you can check here: http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is below, and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this info is not in the page source.
Any suggestions on how to move forward?

I have not attempted things like this before, but my guess is that if you want the comments directly from the page source you'll need something like Selenium to actually drive the page, since the page is dynamic.
Alternatively, if you're only interested in the comments, you can use dailymail.co.uk's API to fetch them.
Note the items in the query string: "max=1000", "&order", etc. You may also need to use an "offset" parameter alongside "max" to page through all the comments if the API caps the maximum "max" value.
I do not know where the API is documented; you can discover it by watching the network requests your browser makes while you browse the page.
You can get the comment data for that page in JSON format from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout. Every article has a number like "5101863" in its URL, and you can swap that number in for each new story whose comments you want.
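As a rough sketch (requests is used here instead of urllib, and the "offset" parameter plus the structure of the returned JSON are assumptions; inspect the actual payload in your browser to confirm the field names), fetching that endpoint and paging through it might look like this:
import requests

ARTICLE_ID = "5100519"  # the number from the article URL
BASE = "http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/"

def fetch_comment_pages(article_id, page_size=500):
    # Page through the readcomments endpoint and collect the raw JSON payloads.
    # NOTE: the 'offset' parameter and the payload structure below are assumptions.
    offset = 0
    pages = []
    while True:
        resp = requests.get(
            BASE + article_id,
            params={"max": page_size, "offset": offset,
                    "order": "desc", "rcCache": "shout"},
        )
        resp.raise_for_status()
        data = resp.json()
        pages.append(data)
        # Stop once a page comes back smaller than requested (assumed convention).
        comments = data.get("payload", {}).get("page", [])
        if len(comments) < page_size:
            break
        offset += page_size
    return pages

pages = fetch_comment_pages(ARTICLE_ID)
print("fetched", len(pages), "page(s) of comment data")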

Thank you FredMan. I did not know about this API. It seems we only need to supply the article id and we can get the comments for that article. This was the solution I was looking for.

Related

Why can't I find anything in BeautifulSoup documentation about .text or content method?

At the moment I am following a Python course on Udemy and I am learning the concept of web scraping. The way this is done is as follows:
import requests
import bs4
url = requests.get("http://example.com/")
soup = bs4.BeautifulSoup(url.text, "lxml")
Now, I cannot find anything about the .text attribute of BeautifulSoup in the documentation. I only know about it because it is clearly explained in the course I am following.
Is this usual? I am asking more from a general point of view, for when I search for relevant information in documentation in the future.
You have to use the .text attribute because url in your example is a requests Response object, not the page itself; on its own it essentially only tells you the status code of your request, which BeautifulSoup cannot parse. .text holds the decoded HTML body of the response, and it belongs to requests rather than to BeautifulSoup, which is why you won't find it in the BeautifulSoup documentation.
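A minimal sketch of the difference, reusing the example.com URL from the question:
import requests
import bs4

response = requests.get("http://example.com/")

print(response.status_code)   # e.g. 200 -- metadata about the request itself
print(type(response.text))    # str -- the decoded HTML body of the page

# BeautifulSoup needs the HTML string, not the Response object.
soup = bs4.BeautifulSoup(response.text, "lxml")
print(soup.title)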

How to scrape text of specific cell from website using BeautifulSoup

I've been trying to scrape text from a website for the past hour and have made no progress, simply because I have very little knowledge of how to actually use BSoup.
import requests
from bs4 import BeautifulSoup

def select_ticker():
    url = "https://www.barchart.com/stocks/performance/gap/gap-up?screener=nasdaq"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html)
    find = soup.findAll('td, {"data-ng-if:"row.blankRow"}')
    print(find)
I'm going to this website and trying to get the first symbol from the table. Right now that symbol is BFBG.
I know this should be extremely easy for someone who actually knows what they're doing with BSoup, but I don't understand how to search for things, and this website doesn't make it easy to search either.
I appreciate your time and thanks for the help!
Actually, you cannot scrape the first symbol from the HTML GET request, because the table is filled in dynamically. You need to fetch the JSON instead.
import urllib3
import json

# This is the JSON endpoint the page itself calls to fill the table
# (found via the browser's network tab, as described below).
http = urllib3.PoolManager()
r = http.request('GET', 'https://core-api.barchart.com/v1/quotes/get?lists=stocks.gaps.up.nasdaq&orderDir=desc&fields=symbol,symbolName,lastPrice,priceChange,gapUp,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions&orderBy=gapUp&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1')
print(json.loads(r.data)['data'][0]['symbol'])
And there you have the first symbol.
The JSON also contains every other piece of information you might want to scrape.
Here is how you can usually find these JSON endpoints:
open the browser console, go to the Network tab, select XHR, and reload the page. If a lot of resources are fetched, you can also filter by the domain name. :)
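For reference, here is a sketch of the same call made with the requests library, with the query string from the URL above expressed as a params dict (the parameter names and values are taken verbatim from that URL):
import requests

url = "https://core-api.barchart.com/v1/quotes/get"
params = {
    "lists": "stocks.gaps.up.nasdaq",
    "orderDir": "desc",
    "fields": "symbol,symbolName,lastPrice,priceChange,gapUp,highPrice,"
              "lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions",
    "orderBy": "gapUp",
    "meta": "field.shortName,field.type,field.description",
    "hasOptions": "true",
    "page": 1,
    "limit": 100,
    "raw": 1,
}

resp = requests.get(url, params=params)
resp.raise_for_status()
data = resp.json()
print(data["data"][0]["symbol"])  # first symbol in the gap-up list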
However, this syntax is wrong:
soup.findAll('td, {"data-ng-if:"row.blankRow"}')
you need to pass the tag name and a dictionary to the find_all method, according to the BS4 docs:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
soup.find_all('td', {'data-ng-if':'row.blankRow'})
Hope this helps

Web scraping for divs inserted by scripts

Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to fetch a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (by their class name) using BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
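Once Selenium has rendered the page, you can hand the HTML to BeautifulSoup as usual. Here is a sketch for the H&M page from the question (the 'product-list-item' class comes from the question itself; verify it against the live markup):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://www.hm.com/sg/products/ladies")
html = driver.page_source  # HTML after the page's scripts have run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
for div in soup.find_all("div", class_="product-list-item"):
    # read and follow the links under each product div
    for a in div.find_all("a", href=True):
        print(a["href"])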
You can also get all the info by changing the URL of the JSON request the page makes; that link can be found in Chrome dev tools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
First, let me explain a little about how that page is loaded in a browser. When you request http://www.hm.com/sg/products/ladies, the literal content is returned in the very first phase (which is what you got from your urllib2 request). The browser then starts to read and parse that content, which basically tells it where to find everything else it needs to render the whole page: CSS to control the layout, additional JavaScript/URLs/pages to populate certain areas, and so on. The browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not in the original URL's response, so you need to find out which URL is used to populate those areas and go after that URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all the traffic when that page loads (I would recommend Fiddler).
As you can see from the captured traffic, a lot happens when you open that page in a browser (and that is only part of the whole page-loading process). By educated guess, the info you need should be in one of the three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup: the built-in json module can do the job.
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great library for this kind of job.
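As a minimal sketch of that last step with requests, where the URL is a placeholder you fill in with one of the api.hm.com requests captured in your traffic tool (the exact endpoint and the shape of its JSON are not shown here, so treat both as assumptions):
import json
import requests

# Paste in one of the api.hm.com URLs captured with Fiddler or the Network tab.
captured_url = "PASTE_THE_CAPTURED_API_URL_HERE"

# Some APIs only answer requests that look like they come from the page itself.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://www.hm.com/sg/products/ladies",
}

resp = requests.get(captured_url, headers=headers)
resp.raise_for_status()

data = resp.json()
# Inspect the structure first; the field names depend on the actual endpoint.
print(json.dumps(data, indent=2)[:1000])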
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')
product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()

Using BeautifulSoup to parse facebook

So I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to successfully scrape LinkedIn, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:
for urls in my_urls:
    try:
        page = urllib2.urlopen(urls)
        soup = BeautifulSoup(page)
        info = soup.find_all("div", class_="fsl fwb fcb")
        info2 = info.findall('a')
The part that's frustrating me is that I can get the title element out, and I can even get pretty far down the document, but I can't reach the part I actually need.
This line successfully grabs the pageTitle:
info = soup.find_all("title", attrs={"id": "pageTitle"})
This line gets pretty far down the list of elements, but can't go any farther:
info = soup.find_all(id="pagelet_timeline_main_column")
Here's a sample page that I'm trying to parse; I want the current city from it:
https://www.facebook.com/100004210542493
and here's a quick screenshot of what the part I want looks like:
http://prntscr.com/1t8xx6
I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!
EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.
Try looking at the content returned by curl or wget. What you see in the browser is what has been rendered after the JavaScript has executed.
wget https://www.facebook.com/100004210542493
You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling the raw content).
Another related issue might be that Beautiful Soup cannot find a CSS class if the element has other classes, too.
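A small sketch of that multi-class pitfall, using made-up HTML purely to illustrate the matching behaviour described in the BeautifulSoup docs:
from bs4 import BeautifulSoup

html = '<div class="fsl fwb fcb">name</div><div class="fsl">other</div>'
soup = BeautifulSoup(html, "html.parser")

# Exact string match against the full class attribute: only the first div matches.
print(soup.find_all("div", attrs={"class": "fsl fwb fcb"}))

# class_ with a single class matches any element that carries that class.
print(soup.find_all("div", class_="fsl"))

# A CSS selector matches on all three classes regardless of their order.
print(soup.select("div.fsl.fwb.fcb"))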

Extract site that HTML document came from

I have a folder full of HTML documents that are saved copies of webpages, but I need to know what site each one came from. What can I use to extract the website name from the documents? I did not find anything in the BeautifulSoup module. Is there a specific tag I should be looking for in the document? I do not need to know the full URL, I just need the name of the website.
You can only do that if the URL is mentioned somewhere in the source.
First find out where the URL is, if it is mentioned at all. If it is there, it will probably be in the base tag. Sometimes websites have nice headers with a link to their landing page, which you could use if all you want is the domain. Or it could be in a comment somewhere, depending on how you saved the pages.
If the URL is mentioned in a similar way in all the pages, then your job is easy: use re, or BeautifulSoup, or lxml with XPath to grab the info you need. There are other tools available, but any of those will do.
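A rough sketch along those lines, checking the base tag first and then a couple of other common places (the fallbacks beyond the base tag, such as the canonical link and header links, are assumptions about how the pages were saved, so adjust them to what you actually find in your files):
import os
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def guess_site(path):
    # Return the domain a saved HTML file appears to come from, or None.
    with open(path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # 1. <base href="..."> -- the most direct hint when present.
    base = soup.find("base", href=True)
    if base:
        return urlparse(base["href"]).netloc

    # 2. <link rel="canonical" href="..."> -- often points at the original page.
    canonical = soup.find("link", rel="canonical", href=True)
    if canonical:
        return urlparse(canonical["href"]).netloc

    # 3. Any absolute link, e.g. a header/logo link back to the landing page.
    for a in soup.find_all("a", href=True):
        domain = urlparse(a["href"]).netloc
        if domain:
            return domain
    return None

folder = "saved_pages"  # hypothetical folder of saved HTML copies
for name in os.listdir(folder):
    if name.endswith(".html"):
        print(name, "->", guess_site(os.path.join(folder, name)))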
