I was trying to parse an HTML document to find links using Beautiful Soup and found a weird behavior. The page is http://people.csail.mit.edu/gjtucker/ . Here's my code:
from bs4 import BeautifulSoup
import requests
user_agent = {'User-agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
url = 'http://people.csail.mit.edu/gjtucker/'
t = requests.get(url, headers=user_agent).text
soup = BeautifulSoup(t, 'html.parser')
for link in soup.findAll('a'):
    print(link['href'])
This prints two links: http://www.amazon.jobs/team/speech-amazon and https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en, whereas there are clearly many more links on the page.
Can anyone reproduce this? Is there a specific reason this happens with this URL? A few other URLs worked just fine.
The HTML of the page is not well-formed; you should use a more lenient parser, such as html5lib:
soup = BeautifulSoup(t, 'html5lib')
for link in soup.find_all('a'):
    print(link['href'])
Prints:
http://www.amazon.jobs/team/speech-amazon
https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en
http://www.linkedin.com/pub/george-tucker/6/608/3ba
...
http://www.hsph.harvard.edu/alkes-price/
...
http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html
http://www.biomedcentral.com/1471-2105/14/299
pdfs/journal.pone.0029095.pdf
pdfs/es201187u.pdf
pdfs/sigtrans.pdf
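If you want to verify the difference yourself, here is a minimal sketch (assuming html5lib is installed, e.g. via pip install html5lib) that parses the same markup with both parsers and compares how many anchors each one recovers:

from bs4 import BeautifulSoup
import requests

# Fetch the page once, then compare how many <a> tags each parser finds.
t = requests.get('http://people.csail.mit.edu/gjtucker/').text
for parser in ('html.parser', 'html5lib'):
    soup = BeautifulSoup(t, parser)
    print(parser, len(soup.find_all('a')))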
import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = "https://ascscotties.com"
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')
links = soup.find_all('a', href=re.compile("roster"))
for link in links:
    print(link.get("href"))
The output:
https://ascscotties.com/roster.aspx?path=wbball
https://ascscotties.com/roster.aspx?path=wcross
https://ascscotties.com/roster.aspx?path=wsoc
https://ascscotties.com/roster.aspx?path=softball
https://ascscotties.com/roster.aspx?path=wten
https://ascscotties.com/roster.aspx?path=wvball
The code does not work for https://owlsports.com/, even though both websites run on the Sidearm platform. Also, the landing page of https://owlsports.com/ does not have any of the roster links.
Is there a reason you can't drop the href parameter from your find_all query?
If getting all the URLs from a page is your requirement, the documentation shows that a simple
links = soup.find_all('a')
for link in links:
    print(link.get("href"))
would do.
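If you still only want the roster links, here is a minimal sketch (untested against the Sidearm sites; the 'roster' substring filter is an assumption) that collects every anchor and filters in Python, resolving relative hrefs along the way:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://owlsports.com/"
headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')

# Resolve each href against the base URL, then keep only roster links.
for a in soup.find_all('a', href=True):
    href = urljoin(url, a['href'])
    if 'roster' in href:
        print(href)

Note that if the landing page genuinely has no roster links, as you describe, no selector will find them there; you would need to start from a page that actually links to the rosters.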
Please let me know how your case is different, so we can solve this together. Thank you.
I've been bouncing around a ton of similar questions, but nothing that seems to fix the issue... I've set this up (with help) to scrape the HREF tags from a different URL.
I'm now trying to take the HREF links in the "Result" column from this URL.
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup
profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in that case is that you do not send a user-agent: some sites, regardless of whether it is a good idea, use this to decide that you are a bot and then serve no content, or only some of it.
So at a minimum, provide that information when making your request:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you only want the team links, you should narrow it down; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL to the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests
profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']
for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)
print(profiles)
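As a side note, urllib.parse.urljoin is a more robust alternative to hard-coding the prefix, since it resolves both absolute and relative hrefs; a minimal variant of the same extraction:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://stats.ncaa.org'
url = base + '/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(req.text, 'html.parser')

# urljoin resolves each href against the base URL, whatever its form.
profiles = [urljoin(base, a.get('href')) for a in soup.select('table a[href^="/team/"]')]
print(profiles)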
I am trying to get the restaurant name and address of each restaurant from this platform:
https://customers.dlivery.live/en/list
So far I tried with BeautifulSoup
import requests
from bs4 import BeautifulSoup
import json
url = 'https://customers.dlivery.live/en/list'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) '\
'Chrome/75.0.3770.80 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
soup
I noticed that soup does not contain the data about the restaurants.
How can I do this?
If you inspect the page, you will notice that the names are wrapped in the card_heading class and the addresses in the card_distance class.
soup = BeautifulSoup(response.text, 'html.parser')
restaurantAddress = soup.find_all(class_='card_distance')
for address in restaurantAddress:
    print(address.text)
and
soup = BeautifulSoup(response.text, 'html.parser')
restaurantNames = soup.find_all(class_='card_heading')
for name in restaurantNames:
    print(name.text)
Not sure if this exact code will work, but this is pretty close to what you are looking for.
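If the names and addresses come back in matching order (an assumption worth verifying against the page), you can pair them up in one pass; a sketch:

import requests
from bs4 import BeautifulSoup

url = 'https://customers.dlivery.live/en/list'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Assumes one card_distance per card_heading, in the same order on the page.
names = soup.find_all(class_='card_heading')
addresses = soup.find_all(class_='card_distance')
for name, address in zip(names, addresses):
    print(name.text.strip(), '-', address.text.strip())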
So I am using BeautifulSoup4 with Python and I am trying to get an element with a div class. But this element is nested under many divs, and when I try to use find with BeautifulSoup, it just returns None. The element I'm trying to get is shown with class "WhatIWant" in the screenshot. Here is the screenshot of the website HTML:
Screenshot
And this is the code I use for getting that element
import requests
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "lxml")
element = soup.find_all("div", {"class": "WhatIWant"})
import requests
from bs4 import BeautifulSoup
url = 'https://www.leagueofgraphs.com/summoner/tr/AvaIanche'
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('div', {'class':'leagueTier'}).text.strip())
output:
Platinum I
Maybe the page you request does not load that element with a simple request; some pages build their content with JavaScript, and you can't scrape that with bs4 alone, so it may be better to use Selenium.
Test it and post the result here in a comment; it would also help to post the actual URL.
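A minimal Selenium sketch (assumes chromedriver is available on your PATH and reuses the leagueTier class from the answer above; the fixed sleep is a crude stand-in for a proper wait):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.leagueofgraphs.com/summoner/tr/AvaIanche')
time.sleep(3)  # crude wait for JavaScript-rendered content; tune as needed

# Hand the rendered DOM to BeautifulSoup and query it as usual.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.find('div', {'class': 'leagueTier'}))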
Good day!
I am currently making a web scraper for Alibaba website.
My problem is that the returned source code does not show some parts that I am interested in. The data is there when I checked the source code using the browser, but I can't retrieve it when using BeautifulSoup.
Any tips?
from urllib2 import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)
I am interested in the highlighted part shown in the image from Chrome's Developer Tools. But when I write the result to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!
You need to provide the User-Agent header at least.
Example using requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup
def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except:
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get('href'))
Prints http://www.alibaba.com/catalogs/products/CID144/2.
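If you need to walk all result pages, here is a hedged sketch that follows the a.next link until it disappears (assuming every page keeps the same markup; untested against the live site):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"

# Follow the "next" link page by page; stop when it is no longer present.
while url:
    soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
    print(url)
    next_link = soup.select_one("a.next")
    url = next_link.get("href") if next_link else None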