Error in finding urls from the website using beautifulsoup - python

import requests
from bs4 import BeautifulSoup
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:16.0) Gecko/20100101 Firefox/16.0'}
url = "https://ascscotties.com"
reqs = requests.get(url, headers=headers)
soup = BeautifulSoup(reqs.text, 'html.parser')
links = soup.find_all('a', href=re.compile("roster"))
for link in links:
    print(link.get("href"))
The output:
https://ascscotties.com/roster.aspx?path=wbball
https://ascscotties.com/roster.aspx?path=wcross
https://ascscotties.com/roster.aspx?path=wsoc
https://ascscotties.com/roster.aspx?path=softball
https://ascscotties.com/roster.aspx?path=wten
https://ascscotties.com/roster.aspx?path=wvball
The code does not work for https://owlsports.com/, even though both websites run on the Sidearm platform. Also, the landing page of https://owlsports.com/ does not contain any of the roster links.

Would removing the href parameter from your find_all query be an option?
If getting all the URLs from a website is your requirement, the documentation shows that a simple
links = soup.find_all('a')
for link in links:
    print(link.get("href"))
would do.
Please let me know how your case is different, so we can solve this together. Thank you.
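If the roster links on owlsports.com turn out to be relative paths (a common pattern on Sidearm sites, assumed here), the regex filter still matches but the printed hrefs are not usable URLs on their own. A minimal sketch, using hypothetical href values, of filtering and normalizing them against the base URL with only the standard library:

```python
from urllib.parse import urljoin

base = "https://owlsports.com/"
# hypothetical hrefs as they might come back from link.get("href");
# <a> tags without an href yield None
hrefs = [
    "/sports/baseball/roster",
    "https://owlsports.com/sports/softball/roster",
    "/news",
    None,
]

# keep only roster links and resolve relative paths against the base URL
roster_links = [urljoin(base, h) for h in hrefs if h and "roster" in h]
print(roster_links)
```

urljoin leaves already-absolute URLs untouched, so mixed relative/absolute hrefs come out uniform.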

Related

Scraping HREF Links contained within a Table

I've been bouncing around a ton of similar questions, but nothing seems to fix the issue... I set this up (with help) to scrape the HREF tags from a different URL.
I'm trying to now take the HREF links in the "Result" column from this URL.
here
The script doesn't seem to be working like it did for other sites.
The table is an HTML element, but no matter how I tweak my script, I can't retrieve anything except a blank result.
Could someone explain to me why this is the case? I'm watching many YouTube videos trying to understand, but this just doesn't make sense to me.
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100'
]
for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('href')
        profiles.append(profile)
print(profiles)
The following code works:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
for x in soup.select('a'):
    print(x.get('href'))
The main issue in that case is that you fail to send a user-agent. Some sites, regardless of whether it is a good idea, use this as the basis to decide that you are a bot and serve no content, or only some of it.
So at a minimum, provide that information while making your request:
req = requests.get(url,headers={'User-Agent': 'Mozilla/5.0'})
Also take a closer look at your selection. Assuming you want only the team links, you should adjust it; I used CSS selectors:
for profile in soup.select('table a[href^="/team/"]'):
You also need to concatenate the base URL onto the extracted values:
profile = 'https://stats.ncaa.org'+profile.get('href')
Example
from bs4 import BeautifulSoup
import requests
profiles = []
urls = ['https://stats.ncaa.org/player/game_by_game?game_sport_year_ctl_id=15881&id=15881&org_id=6&stats_player_seq=-100']
for url in urls:
    req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(req.text, 'html.parser')
    for profile in soup.select('table a[href^="/team/"]'):
        profile = 'https://stats.ncaa.org' + profile.get('href')
        profiles.append(profile)
print(profiles)
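As an aside, urllib.parse.urljoin from the standard library is a slightly more robust alternative to string concatenation for building the absolute URLs, since it also leaves hrefs that are already absolute untouched. A small sketch with hypothetical /team/ paths:

```python
from urllib.parse import urljoin

base = 'https://stats.ncaa.org'
# hypothetical href values as they might come back from profile.get('href'):
# one relative, one already absolute
hrefs = ['/team/6/15881', 'https://stats.ncaa.org/team/127/15881']

# urljoin resolves relative paths and passes absolute URLs through unchanged
profiles = [urljoin(base, h) for h in hrefs]
print(profiles)
```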

Web Scraping: How do I get 'href' links and scrape table from them

I am trying to scrape a table from a link. For that I need to scrape the 'href' links from the page and then scrape the table from each of them. I tried the following code but couldn't find them:
from bs4 import BeautifulSoup
import requests
url = 'http://www.stats.gov.cn/was5/web/search?channelid=288041&andsen=%E6%B5%81%E9%80%9A%E9%A2%86%E5%9F%9F%E9%87%8D%E8%A6%81%E7%94%9F%E4%BA%A7%E8%B5%84%E6%96%99%E5%B8%82%E5%9C%BA%E4%BB%B7%E6%A0%BC%E5%8F%98%E5%8A%A8%E6%83%85%E5%86%B5'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#table = soup.find("table")
#print(table)
# links = []
# for href in soup.find_all(class_='searchresulttitle'):
#     print(href)
#     links.append(href.find('a').get('href'))
# print(links)
link = soup.find(attrs={"class": "searchresulttitle"})
print(link)
So please guide me on how to find the hrefs and scrape the table from them.
The URLs are stored in the HTML as variables inside Javascript. BeautifulSoup can be used to grab all the <script> elements and then a regular expression can be used to extract the value for urlstr.
Assuming Python 3.6 or newer is being used, a dictionary can be used to create a unique, ordered list of the URLs displayed:
from bs4 import BeautifulSoup
import requests
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'http://www.stats.gov.cn/was5/web/search?channelid=288041&andsen=%E6%B5%81%E9%80%9A%E9%A2%86%E5%9F%9F%E9%87%8D%E8%A6%81%E7%94%9F%E4%BA%A7%E8%B5%84%E6%96%99%E5%B8%82%E5%9C%BA%E4%BB%B7%E6%A0%BC%E5%8F%98%E5%8A%A8%E6%83%85%E5%86%B5'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
urls = {} # Use a dictionary to create unique, ordered URLs (Assuming Python >=3.6)
for script in soup.find_all('script'):
    for m in re.findall(r"var urlstr = '(.*?)';", script.text):
        urls[m] = None
urls = list(urls.keys())
print(urls)
This would display URLs starting as:
['http://www.stats.gov.cn/tjsj/zxfb/201811/t20181105_1631364.html',
'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181024_1629464.html',
'http://www.stats.gov.cn/tjsj/zxfb/201810/t20181015_1627579.html',
...]
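The urls[m] = None trick works because dict keys both de-duplicate and preserve insertion order (an implementation detail in CPython 3.6, guaranteed by the language from 3.7 on); dict.fromkeys is a common shorthand for the same idea:

```python
# dict keys drop repeats while keeping the order of first appearance
found = ['a.html', 'b.html', 'a.html', 'c.html', 'b.html']
unique = list(dict.fromkeys(found))
print(unique)  # ['a.html', 'b.html', 'c.html']
```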

beautiful soup parser can't find links

I was trying to parse an HTML document to find links using Beautiful Soup and found a weird behavior. The page is http://people.csail.mit.edu/gjtucker/ . Here's my code:
from bs4 import BeautifulSoup
import requests
user_agent = {'User-agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
url = 'http://people.csail.mit.edu/gjtucker/'
t = requests.get(url, headers=user_agent).text
soup = BeautifulSoup(t, 'html.parser')
for link in soup.findAll('a'):
    print link['href']
This prints two links: http://www.amazon.jobs/team/speech-amazon and https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en, whereas clearly there are many more links in the page.
Can anyone reproduce this? Is there a specific reason for this happening with this URL? A few outher urls worked just fine.
The HTML of the page is not well-formed; you should use a more lenient parser, like html5lib:
soup = BeautifulSoup(t, 'html5lib')
for link in soup.find_all('a'):
    print(link['href'])
Prints:
http://www.amazon.jobs/team/speech-amazon
https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en
http://www.linkedin.com/pub/george-tucker/6/608/3ba
...
http://www.hsph.harvard.edu/alkes-price/
...
http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html
http://www.biomedcentral.com/1471-2105/14/299
pdfs/journal.pone.0029095.pdf
pdfs/es201187u.pdf
pdfs/sigtrans.pdf

Webscraping Using BeautifulSoup: Retrieving source code of a website

Good day!
I am currently making a web scraper for Alibaba website.
My problem is that the returned source code does not show some parts that I am interested in. The data is there when I checked the source code using the browser, but I can't retrieve it when using BeautifulSoup.
Any tips?
from urllib2 import urlopen
from bs4 import BeautifulSoup

def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)
I am interested in the highlighted part as shown in the image, using the Developer Tools of Chrome. But when I tried writing the result to a text file, some parts, including the highlighted one, are nowhere to be found. Any tips? TIA!
You need to provide the User-Agent header at least.
Example using requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup
def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except:
        return None
    return BeautifulSoup(html, "lxml")

url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get('href'))
Prints http://www.alibaba.com/catalogs/products/CID144/2.
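If you need more than the first page, the same a.next selector can drive a simple pagination loop. The sketch below replaces the network-and-parse step with a hypothetical fetch_next function over canned data, so only the control flow is shown; in practice fetch_next would request the page and return soup.select_one("a.next").get('href') (or None when the link is absent).

```python
# canned "current page -> next page" mapping standing in for real requests
next_map = {
    "http://www.alibaba.com/catalogs/products/CID144":   "http://www.alibaba.com/catalogs/products/CID144/2",
    "http://www.alibaba.com/catalogs/products/CID144/2": "http://www.alibaba.com/catalogs/products/CID144/3",
    "http://www.alibaba.com/catalogs/products/CID144/3": None,  # last page: no a.next
}

def fetch_next(url):
    # hypothetical stand-in for: fetch url, parse, select a.next
    return next_map[url]

pages = []
url = "http://www.alibaba.com/catalogs/products/CID144"
while url is not None:
    pages.append(url)       # scrape the current page here
    url = fetch_next(url)   # stop when no next link is found
print(len(pages))  # 3
```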

Python, trouble getting embedded video url

OK, I have been scratching my head on this for way too long. I am trying to retrieve the URL of an embedded video on a web page using the Beautiful Soup and requests modules in Python 2.7.6. I inspect the HTML in Chrome and can see the URL to the video, but when I get the page using requests and parse it with Beautiful Soup, I can't find the "video" node. From looking at the source, it looks like the video window is a nested HTML document. I have searched all over and can't find out why I can't retrieve it. If anyone could point me in the right direction, I would greatly appreciate it. Thanks.
here is the url to one of the videos:
http://www.growingagreenerworld.com/episode125/
The problem is that there is an iframe with the video tag inside it, which is loaded asynchronously in the browser.
Good news is that you can simulate that behavior by making an additional request to the iframe URL passing the current page URL as a Referer.
Implementation:
import re
from bs4 import BeautifulSoup
import requests
url = 'http://www.growingagreenerworld.com/episode125/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
with requests.Session() as session:
    session.headers = headers
    response = session.get(url)
    soup = BeautifulSoup(response.content)

    # follow the iframe url
    response = session.get('http:' + soup.iframe['src'], headers={'Referer': url})
    soup = BeautifulSoup(response.content)

    # extract the video URL from the script tag
    print re.search(r'"url":"(.*?)"', soup.script.text).group(1)
Prints:
http://pdl.vimeocdn.com/43109/378/290982236.mp4?token2=1424891659_69f846779e96814be83194ac3fc8fbae&aksessionid=678424d1f375137f
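One small caveat: the 'http:' + src concatenation only works because the iframe src happens to be protocol-relative. urllib.parse.urljoin handles that case (and plain relative paths) uniformly against the page URL; the src value below is hypothetical:

```python
from urllib.parse import urljoin

page_url = 'http://www.growingagreenerworld.com/episode125/'
iframe_src = '//player.vimeo.com/video/12345'  # hypothetical protocol-relative src

# urljoin fills in the scheme from the page URL for //host/path values
print(urljoin(page_url, iframe_src))  # http://player.vimeo.com/video/12345
```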
