My question is similar to the one asked here:
https://stackoverflow.com/questions/14599485/news-website-comment-analysis
I am trying to extract comments from any news article. For example, I have a news URL here:
http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/
I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when they come from a different URL embedded within the news page?
This is what I have done so far, although it is not much:
import urllib2
from bs4 import BeautifulSoup

opener = urllib2.build_opener()
url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'
urlContent = opener.open(url).read()
soup = BeautifulSoup(urlContent)
title = soup.title.text
print title
body = soup.findAll('body')
outfile = open("brain.txt", "w+")
for i in body:
    i = i.text.encode('ascii', 'ignore')
    outfile.write(i + '\n')
Any help in what I need to do or how to go about it will be much appreciated.
It's inside an iframe. Check for a frame with id="dsq2".
The iframe has a src attribute which is a link to the actual site that hosts the comments.
So in Beautiful Soup: css_soup.select("#dsq2"), and get the URL from the src attribute. It will lead you to a page that contains only the comments.
To get the actual comments, after you fetch the page from src, you can use this CSS selector: .post-message p
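Putting those steps together, here is a minimal sketch (assuming the Disqus iframe with id "dsq2" and the .post-message p selector are still present; the page may have changed since this was written):

import requests
from bs4 import BeautifulSoup

# Fetch the article, find the Disqus iframe, then fetch the page it points
# to and pull out the comment paragraphs.
article_url = 'http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/'
soup = BeautifulSoup(requests.get(article_url).text, 'html.parser')
iframe = soup.select_one('#dsq2')
comments_soup = BeautifulSoup(requests.get(iframe['src']).text, 'html.parser')
for p in comments_soup.select('.post-message p'):
    print(p.get_text())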
And if you want to load more comments, clicking the "more comments" button seems to send this request:
http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
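You could call that endpoint directly; it returns JSON. A quick sketch (the thread id, cursor, and api_key are copied verbatim from the captured request and may expire):

import requests

api_url = ('http://disqus.com/api/3.0/threads/listPostsThreaded'
           '?limit=50&thread=1660715220&forum=cnn&order=popular'
           '&cursor=2%3A0%3A0'
           '&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F')
data = requests.get(api_url).json()
print(data.keys())  # inspect the payload structure before picking fields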
I'm a complete beginner with web scraping and programming in Python. The answer might be somewhere on the forum, but I'm so new that I don't really know what to look for. So I hope you can help me:
Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.
I'm trying to scrape a specific link from a website, so that later on I can create a loop that extracts all the other links. But I can't seem to extract any link even though they are visible in the HTML code.
Here is the website (Danish)
Here is the link I'm trying to extract
The link I'm trying to extract is located in this HTML code:
<a class="nav-action-arrow-underlined" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke"\>Læs mere\</a\>
Here is the Python code that I've tried so far:
url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a") len(a_tags)
#there is 34
I've then tried going through all "a" tags from 0-33 without finding the link.
If I print a_tags[26], I get this code:
<a aria-current="page" class="nav-action is-current" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/">Afgørelser fra Styrelsen for Patientklager</a>
which is somewhere at the top of the website. But the next tag, a_tags[27], is at the bottom of the site:
<a class="footer-linkedin" href="https://www.linkedin.com/company/styrelsen-for-patientklager/" rel="noopener" target="_blank" title="https://www.linkedin.com/company/styrelsen-for-patientklager/"><span class="sr-only">Linkedin profil</span></a>
Can anyone help me by telling me how to access the specific part of the HTML code that contains the link?
When I find out how to pull out the link, my plan is to write the following:
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"htps://stpk.dk{path}"
print(full_url)
You will not find what you are looking for, because requests does not render websites the way a browser does - but no worry, there is an alternative.
The content is dynamically loaded via an API, so you should call that directly and you will get JSON that contains the displayed information.
To find such information, take a closer look into the developer tools of your browser and check the tab for XHR requests. It may take a minute to read and follow the topic:
https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)
Simply iterate over the items, extract the url value, and prepend the base_url.
Check and manipulate the following parameters to your needs:
containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query:
year:
category:
legalTheme:
specialty:
profession:
treatmentPlace:
critiqueType:
take: 200
skip: 0
Example
import requests
url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'
for e in requests.get(url).json()['items']:
    print(base_url + e['url'])
Output
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...
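If you want to vary those parameters without hand-editing the query string, here is a small variant using requests' params argument (same endpoint as above; it assumes the empty parameters can simply be omitted):

import requests

base_url = 'https://stpk.dk'
api = 'https://stpk.dk/api/verdicts/settlements/'
params = {
    'containerKey': 'a76f4a50-6106-4128-bc09-a1da7695902b',
    'take': 200,  # page size
    'skip': 0,    # offset for paging
}
for item in requests.get(api, params=params).json()['items']:
    print(base_url + item['url'])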
I want to copy the text from this website (https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2) to use it later in a Python script.
How can I do this? (It doesn't really work with requests...)
If you google Python web scraping you will find a lot of information!
Basically, you start by executing
response = requests.get(url)
which provides you with the HTML content of the webpage. Now you can use BeautifulSoup to navigate through the content to get what you need.
First we need to create a soup:
soup = BeautifulSoup(response.text, "lxml")
in which we can now find the content. If, for example, we want to find all the URLs in the webpage, we can use:
soup.find_all('a')
Here is a complete example that prints all the URLs of a webpage:
import requests
from bs4 import BeautifulSoup
url = "https://google.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for link in soup.find_all('a'):
    print(link)
Here is the BeautifulSoup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
As the information Johann was looking for isn't static but dynamically loaded, I'm making a second answer to explain how I got the info.
When visiting the webpage https://www.reclamgymnasium.de/mobil/plankl.html?Klasse=9.2
open the developer tools of your browser (in my case Firefox; I open them by pressing F12).
When the developer tools are open, click the "Network" tab, which will be empty at this point.
Reload the page by clicking the reload arrow or by pressing F5.
Now we can see requests being loaded in the "Network" tab.
As we are looking for data loaded after the page content, look for "xml" or "json" responses in the "Type" column.
Right-click the response that has either of those types and click "Open page in new tab".
If multiple responses match, test all of them until you find the information you are looking for.
In this case we found https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml?_=1623933794858
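From there you can request the file directly. A minimal sketch, assuming the trailing "_" cache-buster parameter can be dropped; the element names depend on the actual document, so inspect the tree before selecting specific tags:

import requests
from bs4 import BeautifulSoup

# The XML file found via the network tab.
xml_url = 'https://www.reclamgymnasium.de/mobil/mobdaten/PlanKl20210618.xml'
response = requests.get(xml_url)
# Parse as XML (requires lxml) and print the start of the tree to see
# which elements are available.
soup = BeautifulSoup(response.content, 'xml')
print(soup.prettify()[:1000])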
I'm working on a group project where we're trying to rank website designs based on the number of colours.
I used a regex to parse a style.css file that I had already downloaded and got the colour counting working, but I'm struggling with the URL-scraping part. I want to be able to access the CSS code straight from whatever URL the user inputs.
I'm pretty new to programming, so I'd appreciate any help, because I've been looking at multiple solutions but I don't really understand them or how to adapt them to my needs.
Here is a simple example program that will find all the in-page style data for a page, as well as find all linked stylesheets, and print everything out. This should get you started, but you'll have to hook it up to your colour-counting system.
import urllib.request as req
from bs4 import BeautifulSoup

url = input('enter a full website address: ')
html = req.urlopen(url)  # request the initial page
soup = BeautifulSoup(html, 'html.parser')

for styles in soup.select('style'):  # get in-page style tags
    print('in page style:')
    print(styles.string)

for link in soup.find_all('link', type='text/css'):  # get links to external style sheets
    address = link['href']  # the address of the stylesheet
    if address.startswith('/'):  # relative link
        address = url + address
    css = req.urlopen(address).read()  # download the stylesheet from the address
    print('linked stylesheet')
    print(css)
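To hook this into the colour counting, a hypothetical regex pass over the collected CSS might look like the following; it only matches 3- and 6-digit hex codes, and your own regex for rgb() or named colours may differ:

import re
from collections import Counter

def count_colours(css_text):
    # Find 3- or 6-digit hex colour codes and tally them case-insensitively.
    hexes = re.findall(r'#(?:[0-9a-fA-F]{3}){1,2}\b', css_text)
    return Counter(h.lower() for h in hexes)

print(count_colours('body { color: #fff; background: #1a2b3c; } a { color: #FFF; }'))
# Counter({'#fff': 2, '#1a2b3c': 1})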
I want to build a small tool to help a family member download podcasts off a site.
In order to get the links to the files, I first need to filter them out (with bs4 + Python 3).
The files are on this website (Estonian): Download Page. "Laadi alla" = "Download"
So far my code is as follows (most of it is from examples on Stack Overflow):
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken HTML and bs4 / the parser is not able to find anything else.
I've tried different parsers, with no change in the result.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
Please follow the steps below:
Request the API link, not the HTML page link
Check the response; it's JSON, so extract the fields you need
Help your family, all the time :)
Solution
import requests, json

myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
all_mp3
all_mp3 is what you need: a dictionary with download URLs as keys and mp3 names as the values.
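Since the goal is downloading the podcasts, a possible next step, continuing from the all_mp3 dictionary built above (deriving filenames from the episode headers is just one choice):

import requests

for download_url, header in all_mp3.items():
    filename = header.replace('/', '-') + '.mp3'  # make the header filesystem-safe
    with open(filename, 'wb') as f:
        f.write(requests.get(download_url).content)
    print('saved', filename)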
I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html

connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted using JavaScript after the page is loaded in the browser.
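If you do need those JavaScript-inserted links, one common workaround (a sketch, not part of the answers above) is to let a real browser render the page, for example with Selenium, and then parse the rendered HTML:

from selenium import webdriver
import bs4

# Let a real browser execute the page's JavaScript, then hand the
# rendered HTML to BeautifulSoup. Requires a matching chromedriver.
driver = webdriver.Chrome()
driver.get('https://yeezysupply.com/pages/all')
html = driver.page_source
driver.quit()

soup = bs4.BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))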