Webscraping using BeautifulSoup (Jupyter Notebook) - python

Good afternoon,
I am fairly new to web scraping. I am trying to scrape a dataset from an open data portal, just to figure out how scraping a website works.
I am trying to scrape a dataset from data.toerismevlaanderen.be.
This is the dataset I want: https://data.toerismevlaanderen.be/tourist/reca/beer_bars
I always end up with an HTTP error: HTTP Error 404: Not Found
This is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://data.toerismevlaanderen.be/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.findAll('a')
one_a_tag = soup.findAll('a')[35]
link = one_a_tag['href']
download_url = 'https://data.toerismevlaanderen.be/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/tourist/reca/beer_bars_')+1:])
time.sleep(1)
What am I doing wrong?

The issue is the following:
link = one_a_tag['href']
print(link)
This returns a link: https://data.toerismevlaanderen.be/
Then you are adding this link to download_url by doing:
download_url = 'https://data.toerismevlaanderen.be/'+ link
Therefore, if you print(download_url), you get:
https://data.toerismevlaanderen.be/https://data.toerismevlaanderen.be/
which is not a valid URL.
UPDATE BASED ON COMMENTS
The issue is that there is no tourist/activities/breweries link anywhere in the page you scrape.
If you write:
for link in soup.findAll('a'):
    print(link.get('href'))
you will see all the a href tags. None of them contains tourist/activities/breweries.
But if you just want the URL data.toerismevlaanderen.be/tourist/activities/breweries, you can build it directly:
download_url = link + "tourist/activities/breweries"

There is an API for this, so I would use that instead, e.g.:
import requests
r = requests.get('https://opendata.visitflanders.org/tourist/reca/beer_bars.json?page=1&page_size=500&limit=1').json()

You get many absolute links in return, so prepending the original URL to them for new requests therefore won't work. Simply requesting the 'link' you grabbed will work instead.
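A robust way to handle a mix of relative and absolute hrefs is urllib.parse.urljoin, which resolves relative links against a base URL and passes absolute links through unchanged; a minimal sketch:

```python
from urllib.parse import urljoin

base = 'https://data.toerismevlaanderen.be/'

# An absolute href passes through unchanged
print(urljoin(base, 'https://data.toerismevlaanderen.be/tourist/reca/beer_bars'))
# A relative href is resolved against the base
print(urljoin(base, 'tourist/reca/beer_bars'))
# Both print: https://data.toerismevlaanderen.be/tourist/reca/beer_bars
```

This way every scraped href can go through urljoin, instead of deciding per link whether to prepend the site root.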

Related

How to scrape JavaScript page with Python

I'm trying to scrape patentsview.org, but I'm having an issue. When I try to scrape this page, it doesn't work well. The site uses JavaScript to get data from its database. I tried to get the data using the requests-html package, but I didn't quite understand it.
Here's what I tried:
# Import
import re
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
# Set requests
r = session.get('https://datatool.patentsview.org/#search/assignee&asn=1|Samsung')
r.html.render()
# Set BS and print
soup = BeautifulSoup(r.html.html, "lxml")
tags = soup.find_all("div", class_='summary')
print(tags)
This code gives me this result:
# Result
[<div class="summary"></div>]
But I want the div with its actual content, as it appears in the browser.
That is the right div, but I can't see its content with my code. How can I get the div's content?
Use the browser dev tools (Chrome: F12 → Network → XHR) and look for the HTTP GET that returns the data (as JSON) you are looking for:
HTTP GET https://webapi.patentsview.org/api/assignees/query?q={%22_and%22:[{%22_or%22:[{%22_and%22:[{%22_contains%22:{%22assignee_first_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_last_name%22:%22Samsung%22}}]},{%22_and%22:[{%22_contains%22:{%22assignee_organization%22:%22Samsung%22}}]}]}]}&f=[%22assignee_id%22,%22assignee_first_name%22,%22assignee_last_name%22,%22assignee_organization%22,%22assignee_lastknown_country%22,%22assignee_lastknown_state%22,%22assignee_lastknown_city%22,%22assignee_lastknown_location_id%22,%22assignee_total_num_patents%22,%22assignee_first_seen_date%22,%22assignee_last_seen_date%22,%22patent_id%22]&o={%22per_page%22:50,%22matched_subentities_only%22:true,%22sort_by_subentity_counts%22:%22patent_id%22,%22page%22:1}&s=[{%22patent_id%22:%22desc%22},{%22assignee_total_num_patents%22:%22desc%22},{%22assignee_organization%22:%22asc%22},{%22assignee_last_name%22:%22asc%22},{%22assignee_first_name%22:%22asc%22}]
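If you want to call that endpoint from Python, it is easier to build the q/f/o parameters as dictionaries and let the standard library do the URL encoding, rather than hand-writing the %22-escaped string. A sketch based on the URL above, with the query simplified and the field list shortened for brevity (the endpoint and field names come from the captured URL and may change over time):

```python
import json
from urllib.parse import urlencode

# Match "Samsung" against any of the three assignee name fields
query = {"_or": [
    {"_contains": {"assignee_first_name": "Samsung"}},
    {"_contains": {"assignee_last_name": "Samsung"}},
    {"_contains": {"assignee_organization": "Samsung"}},
]}
fields = ["assignee_id", "assignee_organization",
          "assignee_total_num_patents", "patent_id"]
options = {"per_page": 50, "matched_subentities_only": True, "page": 1}

# The API expects each parameter as a JSON string
params = {"q": json.dumps(query), "f": json.dumps(fields), "o": json.dumps(options)}
url = 'https://webapi.patentsview.org/api/assignees/query?' + urlencode(params)
print(url)
# import requests; data = requests.get(url).json()  # send the request for real
```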

Scraping First post from phpbb3 forum by Python

I have a link like this:
http://www.arabcomics.net/phpbb3/viewtopic.php?f=98&t=71718
The first post in this phpBB3 forum topic contains links. How do I get the links from the first post? I tried this, but it's not working:
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.arabcomics.net/phpbb3/viewtopic.php?f=98&t=71718'
response= requests.get(url)
soup = bs(response.text, 'html5lib')
itemstr= soup.findAll('div',{'class':'postbody'})
for link in itemstr.findAll('a'):
    links = link.get('href')
    print(links)
Big oof my man, just use regex for this. No need for bs, and the regex will keep working even if they remake the site.
import re
myurlregex=re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))\" class=\"postlink\"''')
url = re.findall(myurlregex,response.text)[0]
Also, as a coder, regex is one of the skills you will always need.
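If you'd rather stay with BeautifulSoup, the original code is one fix away: findAll returns a list of matching divs, so you have to take the first postbody div before looking for anchors inside it. A sketch against a small inline snippet (the class names mimic phpBB3 markup; the live forum's HTML may differ):

```python
from bs4 import BeautifulSoup

# Hypothetical phpBB3-style markup standing in for the real topic page
html = '''
<div class="postbody">
  <a class="postlink" href="http://example.com/part1.rar">Part 1</a>
  <a class="postlink" href="http://example.com/part2.rar">Part 2</a>
</div>
<div class="postbody"><a href="http://example.com/reply">a reply</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
first_post = soup.find('div', {'class': 'postbody'})  # first post only
for link in first_post.find_all('a'):
    print(link.get('href'))
# http://example.com/part1.rar
# http://example.com/part2.rar
```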

Scraping File Paths from GitHub Repo yields 400 Response, but viewing in browser works fine

I’m trying to scrape all the file paths from links like this: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.
The link above contains a data-url attribute in the HTML (table, id='tree-finder-results', class='tree-browser css-truncate'), which is used to make a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd
which displays this dictionary:
{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}
when you view it in a browser like Chrome. However, a GET request yields a <[400] Response>.
Here is the code I used:
username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + ('/{}/{}/find/master'.format(username, repo))
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)
Not sure what is wrong with it. My theory is that the first link creates a cookie that allows the second link to show the file paths of the repo we are targeting. I'm just unsure how to achieve this via code. Would really appreciate some pointers!
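Before reaching for a full browser, one thing worth trying is a single requests.Session for both requests, so any cookies set by the first response are sent with the second, plus headers marking the second call as an AJAX request. Whether GitHub's tree-list endpoint accepts this is an assumption, so treat it as a sketch:

```python
import requests

session = requests.Session()  # cookies from each response are reused automatically
session.headers.update({
    'User-Agent': 'Mozilla/5.0',           # some endpoints reject the default UA
    'X-Requested-With': 'XMLHttpRequest',  # mark the call as AJAX
})

# First request sets cookies; the second reuses them:
# page = session.get('https://github.com/themichaelusa/Trinitum/find/master')
# paths = session.get(file_links_url).json()  # file_links_url built as above
```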
The links containing the .py files are generated dynamically, so to catch them you need to use selenium. Give it a go; I think this is what you expected.
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))
Partial results:
https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all major product categories. If you hover the mouse over any category, it expands into its subcategories. I want those links from each major category.
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than I expected (I turned off JavaScript, looked at the page source, and the output came from that source).
I just want to find all the sub-links for each major category. Any suggestions would be appreciated.
This is happening just because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to use html.parser to parse the page:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as selenium. It provides a web driver for you, and that is the advantage of this library; for myself, I couldn't handle JavaScript with bs4.
Categories Menu is the URL you are looking for. Many websites generate content dynamically using XHR (XMLHttpRequest).
To examine the components of a website, get familiar with the Firebug add-on in Firefox or the Developer Tools (built in) in Chrome. You can inspect a website's XHR calls under the Network tab of those tools.
Use a web scraping tool such as scrapy or mechanize.
In mechanize, to get all the links on the snapdeal homepage:
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.name
    print link.url
I have been looking for a way to scrape links from webpages that are only rendered in an actual browser, but I wanted the results to come from a headless browser.
I was able to achieve this using PhantomJS, selenium and Beautiful Soup:
#!/usr/bin/python
import bs4
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
links = [a.attrs.get('href') for a in soup.find_all('a')]
for paths in links:
    print paths
driver.close()
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2
url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
# to open up HTTPS URLs
gcontext = ssl.SSLContext()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.
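On current Python 3, the same idea is usually written with requests plus Beautiful Soup. A minimal sketch against an inline HTML snippet so it runs without network access; on a real page you would replace the snippet with requests.get(url).text:

```python
from bs4 import BeautifulSoup

# Stand-in for: html = requests.get('https://stackoverflow.com').text
html = '''
<html><body>
  <a href="/questions">Questions</a>
  <a href="https://stackoverflow.com/tags">Tags</a>
  <a>anchor without an href</a>
</body></html>
'''

page = BeautifulSoup(html, 'html.parser')
# a.get('href') returns None for anchors without the attribute, so filter those out
links = [a.get('href') for a in page.find_all('a') if a.get('href')]
print(links)
# ['/questions', 'https://stackoverflow.com/tags']
```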

Pass over URLs scraping

I am trying to do some web scraping, and I wrote a simple script that aims to print all the URLs present in the webpage. I don't know why it skips many URLs, printing the list from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain to me what is happening?
I am using Python 3.7.1 on Windows 10, with Visual Studio Code.
Often, hrefs provide only part of a URL, not the complete one. No worries:
open the page in a new tab/browser, find the missing part of the URL, and prepend it to the href as a string.
In this case, that must be 'http://www.bda-ieo.it/test/'.
Here is your code.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result:
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
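Rather than hardcoding the 'http://www.bda-ieo.it/test/' prefix, a safer pattern is urllib.parse.urljoin with the page's own URL: relative hrefs get resolved against it and absolute hrefs pass through untouched. A sketch against an inline snippet mimicking the page's links:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

page_url = 'http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25'

# Stand-in for the downloaded page content
html = '''
<a href="Alphabetical.aspx?Lan=Ita&amp;FL=A">A</a>
<a href="http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&amp;foodid=347_1">food</a>
'''

soup = BeautifulSoup(html, 'html.parser')
for links in soup.select('a'):
    print(urljoin(page_url, links['href']))
# http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
# http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
```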
