soup.find_all works but soup.select doesn't work - python

I'm playing around with parsing an HTML page using CSS selectors:
import requests
import webbrowser
from bs4 import BeautifulSoup
page = requests.get('http://www.marketwatch.com', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
I'm having trouble selecting a list tag with a class when using the select method. However, I have no trouble when using the find_all method
soup.find_all('ul', class_= "latestNews j-scrollElement")
This returns the output I desire, but for some reason I can't do the same using CSS selectors. I want to know what I'm doing wrong.
Here is my attempt:
soup.select("ul .latestNews j-scrollElement")
which is returning an empty list.
I can't figure out what I'm doing wrong with the select method.
Thank you.

From the documentation:
If you want to search for tags that match two or more CSS classes, you
should use a CSS selector:
css_soup.select("p.strikeout.body")
In your case, you'd call it like this:
In [1588]: soup.select("ul.latestNews.j-scrollElement")
Out[1588]:
[<ul class="latestNews j-scrollElement" data-track-code="MW_Header_Latest News|MW_Header_Latest News_Facebook|MW_Header_Latest News_Twitter" data-track-query=".latestNews__headline a|a.icon--facebook|a.icon--twitter">
.
.
.
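For what it's worth, the reason the original attempt returns an empty list is that "ul .latestNews j-scrollElement" is parsed as a descendant selector: a tag named j-scrollElement inside a .latestNews element inside a ul. Chaining the classes onto the tag name with dots instead matches a single ul carrying both classes. A minimal sketch against made-up HTML (illustrative, not the live MarketWatch markup):
from bs4 import BeautifulSoup
# Illustrative snippet standing in for the MarketWatch page.
html = '<ul class="latestNews j-scrollElement"><li>headline</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select("ul .latestNews j-scrollElement"))  # [] -- descendant selector, nothing nested matches
print(soup.select("ul.latestNews.j-scrollElement"))   # [<ul class="latestNews j-scrollElement">...</ul>]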

Related

Can't figure out why soup.find_all() returns an empty list

I'm a novice in Python and am practicing web scraping using BeautifulSoup.
I've checked some similar questions such as this one, this one, and this one. However, I'm still stuck on my problem.
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_largest_recorded_music_markets").read()
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find_all('table',{"class":"wikitable plainrowheaders sortable jquery-tablesorter"})
First, I don't think the web page I'm looking at involves the JavaScript issue mentioned in those similar questions. I intend to extract the data in those tables, but when I executed print(tbody), I found it was an empty list. Can someone have a look and give me some hints?
Thank you.
You must remove the jquery-tablesorter part. That class is added dynamically by JavaScript after the page loads, so it is not present in the HTML you download, and including it in the search means nothing matches.
This should work:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_largest_recorded_music_markets").read()
soup = BeautifulSoup(html, 'html.parser')
tbody = soup.find('table', {"class": "wikitable plainrowheaders sortable"})
print(tbody)
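As a follow-up, here is a minimal sketch of pulling the cell text out of the table once it has been found, iterating rows with find_all('tr') and cells with find_all(['th', 'td']); it assumes the table is still present on the Wikipedia page, and the variable names are only illustrative:
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen("https://en.wikipedia.org/wiki/List_of_largest_recorded_music_markets").read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {"class": "wikitable plainrowheaders sortable"})
# Walk the rows; header cells are <th>, data cells are <td>.
for row in table.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
    print(cells)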

getting a specific image from a website link with beautifulSoup

I'm trying to get some specific images from a website using Beautiful Soup:
from bs4 import BeautifulSoup
import urllib.request
import random
url=urllib.request.urlopen("https://www.arandomlink.com")
content = url.read() #We read it
soup = BeautifulSoup(content) #We create a BeautifulSoup object
#WE OBTAIN THE URL OF THE IMAGES RELATED TO THE ITEM
images = soup.findAll("img", {'class': "arandonclass"})
print(images['src'])
The problem is that it doesn't work: it can't find any image, even though I verified that there are images with that class on the page.
What did I do wrong?
I'm using python 3 and BS4
Beautiful Soup Documentation
Searching by CSS class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")

BeautifulSoup (bs4) does not find all tags

I'm using Python 3.5 and bs4
The following code will not retrieve all the tables from the specified website. The page has 14 tables but the return value of the code is 2. I have no idea what's going on. I manually inspected the HTML and can't find a reason as to why it's not working. There doesn't seem to be anything special about each table.
import bs4
import requests
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
htmlPage = requests.get(link)
soup = bs4.BeautifulSoup(htmlPage.content, 'html.parser')
all_tables = soup.findAll('table')
print(len(all_tables))
What's going on?
EDIT: I should clarify. If I inspect the soup variable, it contains all of the tables that I expected to see. How am I not able to extract those tables from soup with the findAll method?
This page is rendered by JavaScript; if you disable JavaScript in your browser, you will notice that it only has two tables.
I recommend using Selenium for this situation.
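A minimal sketch of the Selenium route, assuming a Chrome driver is installed and on the PATH; it loads the page in a real browser so the JavaScript runs, then hands the rendered HTML to BeautifulSoup:
import bs4
from selenium import webdriver
link = "http://www.pro-football-reference.com/players/B/BradTo00.htm"
driver = webdriver.Chrome()   # assumes chromedriver is available
driver.get(link)
html = driver.page_source     # HTML after the scripts have run
driver.quit()
soup = bs4.BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('table')))  # should now include the dynamically built tables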

Using BeautifulSoup to select div blocks within HTML

I am trying to parse several div blocks using Beautiful Soup using some html from a website. However, I cannot work out which function should be used to select these div blocks. I have tried the following:
import urllib2
from bs4 import BeautifulSoup
def getData():
    html = urllib2.urlopen("http://www.racingpost.com/horses2/results/home.sd?r_date=2013-09-22", timeout=10).read().decode('UTF-8')
    soup = BeautifulSoup(html)
    print(soup.title)
    print(soup.find_all('<div class="crBlock ">'))
getData()
I want to be able to select everything between <div class="crBlock "> and its correct end </div>. (Obviously there are other div tags but I want to select the block all the way down to the one that represents the end of this section of html.)
The correct use would be:
soup.find_all('div', class_="crBlock ")
By default, Beautiful Soup will return the entire tag, including its contents. You can then do whatever you want with it once you store it in a variable. If you are only looking for one div, you can also use find() instead. For instance:
div = soup.find('div', class_="crBlock ")
print(div.find_all(text='foobar'))
Check out the documentation page for more info on all the filters you can use.
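To make that concrete, a small sketch of looping over every matching div and dumping its text; the HTML snippet is made up (with the trailing space dropped from the class name for simplicity), and get_text() is just one of the things you could do with each block:
from bs4 import BeautifulSoup
# Made-up snippet standing in for the racingpost markup.
html = '<div class="crBlock"><h2>Race 1</h2><p>Result text</p></div><div class="other">skipped</div>'
soup = BeautifulSoup(html, 'html.parser')
# Each match is a Tag covering the whole <div class="crBlock"> ... </div> block.
for block in soup.find_all('div', class_="crBlock"):
    print(block.get_text(" ", strip=True))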

Removing span tags in python

I'm a newbie having trouble removing span tags after using BeautifulSoup to grab the HTML from a page. I tried using del links['span'], but it returned the same results. A few attempts at using getText() failed as well. Clearly I'm doing something wrong that should be very easy. Help?
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content)
for links in soup.find_all("span", text=re.compile(".com")):
    del links['class']
    print(links)
Use the .unwrap() method to remove tags, preserving their contents:
for links in soup.find_all("span", text=re.compile(".com")):
    links.unwrap()
print(soup)
Depending on what you are trying to do, you could either use unwrap() to remove the tag (in effect replacing the element with its contents) or decompose() to remove the element along with its contents.
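A small sketch of the difference on a throwaway snippet (not the python.org page): unwrap() keeps the tag's contents in the tree, while decompose() deletes the element together with everything inside it:
from bs4 import BeautifulSoup
html = '<p>Visit <span class="link">example.com</span> today</p>'
# unwrap(): the <span> tag disappears but its text stays.
soup = BeautifulSoup(html, 'html.parser')
soup.find("span").unwrap()
print(soup)  # <p>Visit example.com today</p>
# decompose(): the <span> and its text are both removed.
soup = BeautifulSoup(html, 'html.parser')
soup.find("span").decompose()
print(soup)  # <p>Visit  today</p>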
