Scraping all links using Python BeautifulSoup/lxml - python

http://www.snapdeal.com/
I was trying to scrape all the links from this site, but when I do I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all the major product categories. If you hover the mouse over any category it expands into its sub-categories. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than what I expected (I turned off JavaScript, looked at the page source, and the output matched that source).
I just want to find all the sub-links from each major category. Any suggestions will be appreciated.

This is happening just because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to use html.parser to parse the URL.
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.

I think you should try another library such as Selenium; it provides a web driver for you, and that is the advantage of this library. For myself, I couldn't handle JavaScript-generated content with bs4.
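A minimal sketch of that approach, assuming Firefox and geckodriver are installed; it simply dumps every link from the JavaScript-rendered DOM:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://www.snapdeal.com/')
# page_source holds the DOM after JavaScript has run, unlike the raw urllib2 response
soup = BeautifulSoup(driver.page_source, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
driver.quit()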

The Categories Menu URL is the one you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
In order to examine the components of a website, get familiar with the Firebug add-on in Firefox or the Developer Tools (built in) in Chrome. You can check the XHRs a website makes under the Network tab in the aforementioned tools.
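Once you have spotted that XHR in the Network tab, you can often call the endpoint directly instead of rendering the whole page. A rough sketch of the idea; the endpoint below is a placeholder, not Snapdeal's real URL:
import requests
from bs4 import BeautifulSoup

# placeholder: substitute the request URL you see under the Network tab
xhr_url = 'http://www.snapdeal.com/path/to/categories/xhr'
response = requests.get(xhr_url)
# if the endpoint returns an HTML fragment, parse it like any other page
fragment = BeautifulSoup(response.text, 'html.parser')
for link in fragment.find_all('a'):
    print(link.get('href'))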

Use a web scraping tool such as Scrapy or mechanize.
In mechanize, to get all the links on the Snapdeal homepage:
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.name
    print link.url
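If you only want the category links rather than every anchor on the page, mechanize can also filter links for you; a small sketch, where the url_regex pattern is just an illustration to adapt:
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
# the pattern is an example; adjust it to match the category URLs you are after
for link in br.links(url_regex='products'):
    print link.url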

I have been looking for a way to scrape links from webpages that are only rendered in an actual browser, but I wanted the scrape to run with a headless browser.
I was able to achieve this using PhantomJS, Selenium and Beautiful Soup:
#!/usr/bin/python
import bs4
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')
links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path
driver.close()
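Note that PhantomJS is no longer maintained and recent Selenium releases have dropped support for it; roughly the same headless setup works with headless Chrome. A sketch, assuming chromedriver is available on your PATH:
from selenium import webdriver
import bs4

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://www.snapdeal.com/')
soup = bs4.BeautifulSoup(driver.page_source, 'html.parser')
for path in [a.get('href') for a in soup.find_all('a')]:
    print(path)
driver.quit()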

The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# to open up HTTPS URLs
gcontext = ssl.SSLContext()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
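An alternative sketch that runs unchanged on Python 2 and Python 3 is to let the requests library (pip install requests) deal with HTTPS instead of building an SSL context by hand:
import requests
from bs4 import BeautifulSoup

page = BeautifulSoup(requests.get('https://stackoverflow.com').text, 'html.parser')
for link in page.find_all('a'):
    print(link.get('href'))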
Other Languages
For other languages, please see this answer.

Related

Using Python to receive specific URLs from a webpage (Multireddit lists)

I'm trying to run some statistical analysis on topic-based multireddits. Rather than collecting each individual subreddit by hand, I have found websites that collect these subreddits (Example, Example 2).
Unfortunately, these sites do not offer a way to download the list of subreddits as plain text that could be used in a dictionary. Is there a specific method I could use to scrape these sites so that I only get back the URL of each hyperlink on the page?
Thanks!
Edit: Here's my current code, which runs, but returns every URL.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://snoopsnoo.com/subreddits/travel/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
links = []
for link in soup.find_all('a'):
    reddit = link.get('href')
    links.append(reddit)
df = pd.DataFrame(links, columns=['string_values'])
df.to_csv('travel.csv')
Yes, there is such a method. If you are using Python, a widely used library is BeautifulSoup. This library parses the HTML directly, so no webdriver or browser running in the background is needed, unlike with Selenium. You can install it with: pip install bs4.
For your first example site:
import urllib.request
from bs4 import BeautifulSoup

# Load the url
url = "https://snoopsnoo.com/subreddits/travel/"
html = urllib.request.urlopen(url).read()
# Create the parser object
soup = BeautifulSoup(html)
# Find all panel headings
panels = soup.find_all(class_="panel-heading big")
# Find the <a>-elements and extract the link
links = [elem.find('a')['href'] for elem in panels]
print(links)
Here I checked the contents of the page to locate the panel elements by class, and then extracted the <a> elements and their href attributes.
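If the extracted hrefs turn out to be relative paths, they can be made absolute with urljoin before writing them out, for example:
from urllib.parse import urljoin

# links is the list built above, url is the page that was fetched
absolute_links = [urljoin(url, href) for href in links]
print(absolute_links)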
This code will grab all of the titles.
from selenium import webdriver

firefox_options = webdriver.FirefoxOptions()
#firefox_options.add_argument('--headless')
driver = webdriver.Firefox(executable_path='geckodriver.exe', firefox_options=firefox_options)
driver.get("https://snoopsnoo.com/subreddits/travel/")
for i in range(3):
    wds = driver.find_elements_by_class_name('title')
    for wd in wds:
        print(wd.text)
    driver.find_element_by_xpath('/html/body/div/div[2]/div[1]/ul/li/a').click()
    print('next page')
driver.close()
Change the 3 in for i in range(3): to however many pages you want. Uncomment firefox_options.add_argument('--headless') to use headless mode.

Scraping File Paths from GitHub Repo yields 400 Response, but viewing in browser works fine

I’m trying to scrape all the file paths from links like this: https://github.com/themichaelusa/Trinitum/find/master, without using the GitHub API at all.
The link above contains a data-url attribute in the HTML (on a table with id='tree-finder-results' and class='tree-browser css-truncate'), which is used to build a URL like this: https://github.com/themichaelusa/Trinitum/tree-list/45a2ca7145369bee6c31a54c30fca8d3f0aae6cd
which displays this dictionary:
{"paths":["Examples/advanced_example.py","Examples/basic_example.py","LICENSE","README.md","Trinitum/AsyncManager.py","Trinitum/Constants.py","Trinitum/DatabaseManager.py","Trinitum/Diagnostics.py","Trinitum/Order.py","Trinitum/Pipeline.py","Trinitum/Position.py","Trinitum/RSU.py","Trinitum/Strategy.py","Trinitum/TradingInstance.py","Trinitum/Trinitum.py","Trinitum/Utilities.py","Trinitum/__init__.py","setup.cfg","setup.py"]}
when you view it in a browser like Chrome. However, a GET request yields a <[400] Response>.
Here is the code I used:
import requests
from bs4 import BeautifulSoup

username, repo = 'themichaelusa', 'Trinitum'
ghURL = 'https://github.com'
url = ghURL + ('/{}/{}/find/master'.format(username, repo))
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")
repoContent = soup.find('div', class_='tree-finder clearfix')
fileLinksURL = ghURL + str(repoContent.find('table').attrs['data-url'])
filePaths = requests.get(fileLinksURL)
print(filePaths)
Not sure what is wrong with it. My theory is that the first link creates a cookie that allows the second link to show the file paths of the repo we are targeting. I'm just unsure how to achieve this via code. Would really appreciate some pointers!
Give it a go. The links containing .py files are generated dynamically, so to catch them you need to use selenium. I think this is what you expected.
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://github.com/themichaelusa/Trinitum/find/master'
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
for link in soup.select('#tree-finder-results .js-tree-finder-path'):
    print(urljoin(url, link['href']))
Partial results:
https://github.com/themichaelusa/Trinitum/blob/master
https://github.com/themichaelusa/Trinitum/blob/master/Examples/advanced_example.py
https://github.com/themichaelusa/Trinitum/blob/master/Examples/basic_example.py
https://github.com/themichaelusa/Trinitum/blob/master/LICENSE
https://github.com/themichaelusa/Trinitum/blob/master/README.md
https://github.com/themichaelusa/Trinitum/blob/master/Trinitum/AsyncManager.py
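If you would rather avoid a browser entirely, the cookie theory from the question can be tested with a requests.Session that loads the find/master page first and then requests the data-url with the headers a browser's XHR would send. Whether GitHub accepts this is an assumption to verify, not something confirmed here:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
find_url = 'https://github.com/themichaelusa/Trinitum/find/master'
soup = BeautifulSoup(session.get(find_url).text, 'lxml')
data_url = 'https://github.com' + soup.find('div', class_='tree-finder clearfix').find('table')['data-url']
# X-Requested-With mimics the browser's AJAX call; needing it is an assumption
paths = session.get(data_url, headers={'X-Requested-With': 'XMLHttpRequest'})
print(paths.status_code, paths.text)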

How to scrape dynamic webpages by Python

[What I'm trying to do]
Scrape the webpage below for used car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1
[Issue]
I want to scrape all of the result pages. At the URL above, only the first 30 items are shown. Those can be scraped by the code below, which I wrote. Links to the other pages are displayed like 1 2 3..., but the link addresses seem to be generated by JavaScript. I googled for useful information but couldn't find any.
from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
soup = BeautifulSoup(html, "lxml")

total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    # price of car itself
    print(soup.find(class_='price1').string)
    # price of car including tax
    print(soup.find(class_='price2').string)
    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)
[What I'd like to know]
How to scrape all of the pages. I prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me another one.
[My environment]
Windows 8.1
Python 3.5
PyDev (Eclipse)
BeautifulSoup4
Any guidance would be appreciated. Thank you.
You can use Selenium, as in the sample below:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://example.com')
element = driver.find_element_by_class_name("yourClassName") #or find by text or etc
element.click()
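For the pagination itself, the usual pattern is to click the pager element, wait for the new results, and re-parse driver.page_source on every pass. A rough sketch; 'heading_inner' comes from the question's own code, while the pager class name here is only a guess to replace after inspecting the page:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
for _ in range(3):  # number of result pages to walk
    soup = BeautifulSoup(driver.page_source, "lxml")
    for heading_inner in soup.find_all(class_="heading_inner"):
        print(heading_inner.find('h4').find('a').get('href'))
    # 'pager_next' is a placeholder class name; inspect the real pager to find the right locator
    driver.find_element_by_class_name('pager_next').click()
    time.sleep(5)  # crude wait for the next page of results to load
driver.quit()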
The Python module splinter may be a good starting point. It calls an external browser (such as Firefox) and accesses the browser's DOM rather than dealing with the HTML only.
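A minimal sketch of the splinter approach, assuming Firefox is installed (splinter drives it through Selenium under the hood):
from splinter import Browser
from bs4 import BeautifulSoup

with Browser('firefox') as browser:
    browser.visit("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")
    # browser.html is the DOM after JavaScript has run
    soup = BeautifulSoup(browser.html, "lxml")
    print(len(soup.find_all(class_="heading_inner")))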

Getting Different Results For Web Scraping

I was trying to do web scraping and was using the following code:
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)
for tag_li in soup.findAll('li', attrs={"data-section": "Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext
I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?
READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING
If you are using Firebug or Inspect Element in Chrome, you might see some content that will not be seen if you are using mechanize or urllib2.
For example, when you view the source code of the page you sent (right click, View Source in Chrome) and search for the data-section tag, you won't see any tags with Chennai. I am not 100% sure, but I would say those contents need to be populated by JavaScript etc., which requires the functionality of a browser.
If I were you, I would use Selenium to open up the page and then get the page source from there; the HTML collected that way will be more like what you see in a browser.
Cited here
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here, sleep til page fully loaded.
time.sleep(10)
soup = BeautifulSoup(driver.page_source)
print len(soup.findAll(...))
# or you can work directly in selenium
...
driver.close()
And the output for me is 8

Pass over URLs scraping

I am trying to do some web scraping, and I wrote a simple script that aims to print all the URLs present in the webpage. I don't know why it skips over many URLs and prints the list starting from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain to me what is happening?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code
Often, hrefs provide only part of a URL, not the complete one. No worries.
Open one in a new tab or browser, find the missing part of the URL, and add it to the href as a string.
In this case, that must be 'http://www.bda-ieo.it/test/'.
Here is your code.
from urllib import request
from bs4 import BeautifulSoup

source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result.
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
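A slightly more robust variant is to let urljoin resolve each href against the page URL instead of hard-coding the prefix, for example:
from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"
soup = BeautifulSoup(request.urlopen(base), "html.parser")
for links in soup.select('a[href]'):
    print(urljoin(base, links['href']))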
