A general way to scrape link titles from any site in Python? - python

Is there a "general" way to scrape link titles from any website in Python? For example, if I use the following code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "https://news.google.com"
html = urlopen(site)
soup = BeautifulSoup(html.read(), 'lxml')
titles = soup.findAll('span', attrs={'class': 'titletext'})
for title in titles:
    print(title.contents)
I am able to extract nearly every headline title from news.google.com. However, if I use the same code on www.yahoo.com, it does not work because the HTML is formatted differently.
Is there a more general way to do this so that it works for most sites?

No, each site is different, and if you make a scraper more general, it will pull in extra data that is not limited to headline titles.
For instance, the following would get every headline title from Google and would probably get them from Yahoo as well.
titles = soup.find_all('a')
for title in titles:
    print(title.get_text())
However, it would also get you all of the headers and other links, which would muddy up your results (there are approximately 150 links on that Google page that aren't headlines).
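One rough way to trim that noise, purely as an illustration, is to keep only anchors whose visible text is long enough to look like a headline; the 40-character cutoff below is an arbitrary assumption, not a rule that works everywhere:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
soup = BeautifulSoup(urlopen(site).read(), 'lxml')

# Keep only anchors whose visible text is long enough to look like a headline.
# The 40-character threshold is an arbitrary guess, not a reliable rule.
for link in soup.find_all('a'):
    text = link.get_text(strip=True)
    if len(text) > 40:
        print(text)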

No, and that's why we need CSS selectors and XPath. But if there is only a small number of pages, there is a convenient way to do it:
site = "https://news.google.com"
if 'google' in site:
filters = {'name':'span', "class" : 'titletext' }
elif 'yahoo' in site:
filters = {'name':'blala', "class" : 'blala' }
titles = soup.findAll(**filters)
for title in titles:
print(title.contents)
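For completeness, here is the same idea as a self-contained sketch; the yahoo values are still the placeholders from the snippet above and would have to be replaced with the real tag and class used on that site:
from urllib.request import urlopen
from bs4 import BeautifulSoup

site = "https://news.google.com"
soup = BeautifulSoup(urlopen(site).read(), 'lxml')

if 'google' in site:
    filters = {'name': 'span', 'class': 'titletext'}
elif 'yahoo' in site:
    # placeholder values -- inspect yahoo.com and fill in the real tag/class
    filters = {'name': 'blala', 'class': 'blala'}

# findAll unpacks the per-site filters as keyword arguments
titles = soup.findAll(**filters)
for title in titles:
    print(title.contents)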

Related

Extracting book title and author from website - Python 3

I am trying to create two Python lists - one of book titles, and one of the books' authors, from a publisher's "coming soon" website.
I have tried a similar approach on other publishers' sites with success, but on this site it does not seem to be working. I am new to parsing HTML, so I am obviously missing something; I just can't figure out what. The find_all function just returns an empty list, so my titles and authors lists are empty too.
For reference, this is what the html shows when I click "inspect" in my browser for the first title and author, respectively. I've looked through the BS4 documentation and still can't figure out what I'm doing wrong here.
<h3 class="sp__the-title">Flame</h3>
<p class="sp__the-author">Donna Grant</p>
Thanks for your help!
import requests
from bs4 import BeautifulSoup

page = 'https://us.macmillan.com/search?collection=coming-soon'
page_response = requests.get(page)
soup = BeautifulSoup(page_response.content, "html.parser")

titles = []
for tag in soup.find_all("h3", {"class": "sp__the-title"}):
    print(tag.text)
    titles.append(tag.text)

authors = []
for tag in soup.find_all("p", {"class": "sp__the-author"}):
    print(tag.text)
    authors.append(tag.text)
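One quick sanity check, not part of the original post, is to see whether those class names even appear in the HTML that requests downloads; if they don't, the titles are probably rendered by JavaScript and BeautifulSoup alone won't see them:
import requests

page = 'https://us.macmillan.com/search?collection=coming-soon'
page_response = requests.get(page)

# If these print False, the titles and authors are filled in client-side and
# won't be present in the raw HTML, so find_all can only return empty lists.
print('sp__the-title' in page_response.text)
print('sp__the-author' in page_response.text)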

Python BeautifulSoup4 - Scrape Section/Table Header and Values from Multiple Sections/Tables

I'm trying to scrape links with contextual information from the following page: https://www.reddit.com/r/anime/wiki/discussion_archive/2018. I'm able to get the links just fine with BS4 in Python, but ideally each link would also have the year, season, title, and episode associated with it. The desired output would look like this:
I've started with the code below, but I don't know how to loop through it to capture things in sections for each season/title:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')
data_table = soup.find('div', class_='md wiki')
Is this something that's doable with BS4? Thanks for your help!
EDIT
criteria = {'class': 'md wiki'}  # so it can be reused later
data_soup = soup.find('div', criteria)
titles = data_soup.find_all('strong')
tables = data_soup.find_all('table')
Try the following:
titles = soup.find('div', {'class':'md wiki'}).find_all('strong')
data_tables = soup.find('div', {'class':'md wiki'}).find_all('table')
It's better to put the second argument of find into a dict; find_all will return all elements that match your search.
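If each strong season heading is followed by its own table, which is an assumption about the wiki page's layout, one way to pair them up is to zip the two lists; a minimal sketch:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
link = 'https://www.reddit.com/r/anime/wiki/discussion_archive/2018'
request_2018 = session.get(link, headers={'User-agent': 'Chrome'})
soup = BeautifulSoup(request_2018.content, 'lxml')

criteria = {'class': 'md wiki'}
data_soup = soup.find('div', criteria)

# Assumes headings and tables appear in the same order and in equal numbers.
for title, table in zip(data_soup.find_all('strong'), data_soup.find_all('table')):
    print(title.get_text())
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
        print(cells)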

How to get the Wikidata item's Q-number of a wikipedia page by BS4?

You can find the "Wikidata item" link under Tools in the left sidebar of a Wikipedia page. If you hover over it, you can see the link address as below, with the Q-number at the end.
https://www.wikidata.org/wiki/Special:EntityPage/Q15112
How can I extract the Q-number?
from bs4 import BeautifulSoup
import requests
getUrl = 'https://en.wikipedia.org/wiki/Ariyalur_district'
url = getUrl
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')

# extracting page title
firstHeading = soup.find('h1', {'class': 'firstHeading'})
print(firstHeading.text + '~')
Up to this point, my code works. I tried to get the Q-number with the code below, but I can't. Kindly guide me.
QNumber = soup.find('li','t-wikibase')
print(QNumber)
How can I get the Q-number?
You'll need to explicitly specify the attribute you're looking for, which is id in this case:
In [1601]: QNumber = soup.find('li', {'id' : 't-wikibase'})
In [1604]: QNumber.a['href']
Out[1604]: 'https://www.wikidata.org/wiki/Special:EntityPage/Q15112'
If you just want the number at the end of this link, you can do this:
In [1605]: QNumber.a['href'].rsplit('/')[-1]
Out[1605]: 'Q15112'
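Putting the same steps together as a plain script outside the interactive session:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/Ariyalur_district'
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')

# the "Wikidata item" entry in the sidebar is the <li> with id="t-wikibase"
QNumber = soup.find('li', {'id': 't-wikibase'})
wikidata_url = QNumber.a['href']
print(wikidata_url.rsplit('/')[-1])  # e.g. Q15112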

Scraping with Python. Can't get wanted data

I am trying to scrape a website, but I encountered a problem: the HTML I get from Python differs from what I see in my browser's inspector. This happens with http://edition.cnn.com/election/results/states/arizona/house/01, where I tried to scrape election results. I used the script below to check the HTML of the webpage and noticed that they are different. There are none of the classes I need, like section-wrapper.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome dev tools; there are many links there, so check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> open in a new tab
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, 'lxml')

# you can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})

for item in g_data:
    print(item)
# for item in h_data:
#     print(item)
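Since the data comes from that JSON endpoint rather than the rendered HTML, you can also request it directly; the structure of the payload isn't shown here, so this sketch only inspects the top level:
import requests

json_url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
data = requests.get(json_url).json()

# The payload layout is an assumption to verify -- start by looking at
# what the top level contains before digging for the county results.
if isinstance(data, dict):
    print(list(data.keys()))
else:
    print(type(data), len(data))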

Extract URL for website background-image using BeautifulSoup/Python

I'm trying to extract the url for a "background-image" on a Soundcloud page (ex. https://soundcloud.com/ohwondermusic/drive). I'm not quite sure why I'm finding this so difficult compared to extracting urls from other webpages that I've found guides for online.
From the example webpage I linked, I would like this url: https://i1.sndcdn.com/artworks-000125017075-di2n0i-t500x500.jpg that can be found by right clicking the album artwork and choosing 'inspect element' when in the Chrome browser.
I would like some way to do this consistently for other Soundcloud pages too (i.e., get the URL that would be found by inspecting the album artwork, the URL that ends in 500x500.jpg).
Does anyone know how to do this?
Edit: I've tried various bits of code to solve this, along the lines of:
def pull2(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    readOnly = soup.body.find_all('div', attrs={'class': 'image readOnly customImage'})
    print readOnly.attrs['style']
or
def test(url):
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    imgs = soup.findAll("div", {"class": "thumb-pic"})
    for img in imgs:
        print img.a['href'].split("imgurl=")[1]
It looks like you should just be able to grab the style from the correct span on each page with something like:
soup.find("span", class_="sc-artwork")['style']
Then, write a regex to extract the url from that or split it on "url"
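A minimal sketch of that last step, assuming the style attribute holds something like background-image: url("...t500x500.jpg"):
import re

# In practice the string below would come from:
#   soup.find("span", class_="sc-artwork")['style']
# the value here is only an assumed example of what that attribute holds.
style = 'background-image: url("https://i1.sndcdn.com/artworks-000125017075-di2n0i-t500x500.jpg")'

match = re.search(r'url\("?(.*?)"?\)', style)
if match:
    print(match.group(1))  # the t500x500.jpg artwork URL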
