Web scraping with Python (beginner) - python

I'm doing the first example of the web scraping tutorial from the book "Automate the Boring Stuff with Python". The project consists of typing a search term on the command line and having my computer automatically open a browser with all the top search results in new tabs.
It mentions that I need to locate the
<h3 class="r">
element in the page source, which contains the links to each search result.
The r class is used only for search result links.
But the problem is that I can't find it anywhere, even using Chrome DevTools. Any help as to where it is would be greatly appreciated.
Note: Just for reference, this is the complete program as it appears in the book.
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4

print('Googling...')  # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text)

# Open a browser tab for each result.
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

This will work for you:
>>> import requests
>>> from lxml import html
>>> r = requests.get("https://www.google.co.uk/search?q=how+to+do+web+scraping&num=10")
>>> source = html.fromstring(r.text.encode('utf-8'))
>>> links = source.xpath('//h3[@class="r"]//a/@href')
>>> for link in links:
...     print link.replace("/url?q=","").split("&sa=")[0]
Output:
http://newcoder.io/scrape/intro/
https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/
http://docs.python-guide.org/en/latest/scenarios/scrape/
http://webscraper.io/
https://blog.hartleybrody.com/web-scraping/
https://first-web-scraper.readthedocs.io/
https://www.youtube.com/watch%3Fv%3DE7wB__M9fdw
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
http://analystcave.com/web-scraping-tutorial/
https://en.wikipedia.org/wiki/Web_scraping
Note: I am using Python 2.7.x; for Python 3.x you just have to use the print function, like this: print(link.replace("/url?q=","").split("&sa=")[0])
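As a side note, requests can also URL-encode the query for you via its params argument, instead of building the query string by hand (a minor design choice; it issues the same request):
>>> r = requests.get("https://www.google.co.uk/search", params={"q": "how to do web scraping", "num": 10})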

Related

Open first video by searching in youtube, using Python

I tried this and don't know how to open the first video. This code opens the search results in the browser.
import webbrowser
import time

def findYT(search):
    words = search.split()
    link = "http://www.youtube.com/results?search_query="
    for i in words:
        link += i + "+"
    time.sleep(1)
    webbrowser.open_new(link[:-1])
This successfully searches for the video, but how do I open the first result?
The most common approach would be to use two very popular libraries: requests and BeautifulSoup. requests to get the page, and BeautifulSoup to parse it.
import requests
from bs4 import BeautifulSoup
import webbrowser

def findYT(search):
    words = search.split()
    search_link = "http://www.youtube.com/results?search_query=" + '+'.join(words)
    search_result = requests.get(search_link).text
    soup = BeautifulSoup(search_result, 'html.parser')
    videos = soup.select(".yt-uix-tile-link")
    if not videos:
        raise KeyError("No video found")
    link = "https://www.youtube.com" + videos[0]["href"]
    webbrowser.open_new(link)
Note that PEP 8 recommends against uppercase letters in Python variable and function names.
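For example, a PEP 8-style spelling of the same function name (purely cosmetic, assuming you rename the call sites as well) would be:
def find_yt(search):  # snake_case instead of findYT
    ...               # body unchanged from findYT above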
To do that you have to web scrape. Python can't see what is on your screen. You have to scrape the YouTube results page you are searching and then you can open the first <a> that comes up, for example (<a> is the link tag in HTML).
Things you need for that:
BeautifulSoup or Selenium, for example
requests
That should be all you need to do what you want.
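For example, here is a minimal Selenium sketch of that idea; the a#video-title selector is an assumption about YouTube's current markup, so inspect the results page in DevTools and adjust it if it no longer matches:
from selenium import webdriver
from selenium.webdriver.common.by import By

def open_first_video(search):
    driver = webdriver.Firefox()
    driver.get("https://www.youtube.com/results?search_query=" + "+".join(search.split()))
    driver.implicitly_wait(10)  # results are rendered by JavaScript, so poll until they appear
    # 'a#video-title' is assumed markup; verify it in DevTools first
    first = driver.find_element(By.CSS_SELECTOR, "a#video-title")
    first.click()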

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib")

Hi, I am new to programming. I am currently reading Automate the Boring Stuff and I came across this error while web scraping: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib").
It's a simple web scraping program that opens the first 5 website links for the word that I typed at my input prompt.
my code:
import webbrowser, requests, bs4

ans = input()
print('Googling...')  # display text while downloading the Google page
res = requests.get('http://google.com/search?q=' + ans)
res.raise_for_status()

# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text)
linkElems = soup.select('.r a')

# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))
When I write html.parser on this line: soup = bs4.BeautifulSoup(res.text, "html.parser") the error is no longer there, but the program doesn't open my web tab. So I guess that's not the right way to resolve this.
Thanks for helping!
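Specifying the parser explicitly is the correct fix for the warning itself; the tabs not opening is a separate problem (Google often serves different markup to scripts than to a real browser, so '.r a' may match nothing). A quick diagnostic, as a sketch:
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # explicit parser: warning gone
print(len(soup.select('.r a')))  # if this prints 0, the selector (not the parser) is the problem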

Why doesn't soup.select('.r a') find anything in my google search?

I'm trying out the lucky.py project in this book, https://automatetheboringstuff.com/chapter11/. The program runs fine, but I can't get beautifulsoup to select the correct links.
What I've tried:
I've tried soup.select('div') and it chooses all the links from the top.
Tried soup.select('span div') and it selects all the sublinks on each search result.
Looked up lots of other questions, but none of them seems to answer why soup.select('.r a') doesn't work or how to fix it.
When I enter print(linkElems) in the code, it shows me an empty list.
This is my code:
#! /usr/bin/env python3
import requests, sys, webbrowser, bs4

print('Googling...')  # display text while downloading the Google page
res = requests.get('https://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    webbrowser.open('https://google.com' + linkElems[i].get('href'))
I'm expecting it to open the first 5 links of the google search in new tabs, but nothing comes up because the selector isn't working properly.
It seems like the r class (.r) marks the tag for one link.
If class r has only one a tag, multiple links can't be opened.
So you might need to select a higher-level tag instead, such as the div tag with id='search', i.e. div#search.
The returned object will then contain all the a tags, because div#search sits above all of them in the DOM.
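A minimal sketch of that suggestion, assuming Google still wraps its results in a div with id="search" (verify this in DevTools before relying on it):
import requests, bs4

res = requests.get('https://google.com/search?q=python')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# select every link under the results container instead of relying on '.r'
linkElems = soup.select('div#search a')
print(len(linkElems))  # non-zero if div#search is present in the response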

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site, and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all major product categories. If you hover the mouse over any category, it expands. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than what I expected (I turned off JavaScript, looked at the page source, and the output matched that source).
I just want to find all the sub-links from each major category. Any suggestions would be appreciated.
This is happening just because you are letting BeautifulSoup choose its own best parser, and you might not have installed lxml.
The best option is to use html.parser to parse the page.
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library, such as Selenium; it provides a web driver for you, and that is the advantage of this library. For myself, I couldn't handle JavaScript with bs4.
"Categories Menu" is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
In order to examine the components of a website, get familiar with the Firebug add-on in Firefox or the Developer Tools (built in) in Chrome. You can check the XHR requests a website makes under the Network tab in the aforementioned tools.
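Once you have found the XHR request in the Network tab, you can often replay it directly with requests instead of loading the whole page; the URL below is a placeholder, not Snapdeal's real endpoint:
import requests

# hypothetical endpoint copied from the Network tab; replace it with the real one
xhr_url = 'http://www.snapdeal.com/some/xhr/endpoint'
r = requests.get(xhr_url)
r.raise_for_status()
print(r.text[:500])  # the response is often JSON or an HTML fragment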
Use a web scraping tool such as scrapy or mechanize.
In mechanize, to get all the links on the Snapdeal homepage:
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.name
    print link.url
I have been looking into a way to scrape links from webpages that are only rendered in an actual browser, but I wanted the results to be gathered using a headless browser.
I was able to achieve this using PhantomJS, Selenium and Beautiful Soup:
#!/usr/bin/python
import bs4
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content)
links = [a.attrs.get('href') for a in soup.find_all('a')]
for path in links:
    print path
driver.close()
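PhantomJS has since been deprecated and recent Selenium releases no longer support it; the same headless approach can be sketched with headless Chrome instead (this assumes a chromedriver matching your Chrome install is on your PATH, and a current Selenium on Python 3):
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('http://www.snapdeal.com/')
print(len(driver.find_elements(By.CSS_SELECTOR, 'a')))  # count links rendered by the browser
driver.close()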
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2

url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl

# to open up HTTPS URLs
gcontext = ssl.SSLContext()

# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.

Getting Different Results For Web Scraping

I was trying to do web scraping and was using the following code :
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()

link_dictionary = {}
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section": "Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print link_dictionary[link.string]
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print articletext
I was unable to get any printed values using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"}, I was able to get the desired output. Can someone help me?
READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING
If you are using Firebug or Inspect Element in Chrome, you might see some content that will not be there if you fetch the page with Mechanize or urllib2.
For example, when you view the source code of the page sent to you (right click > View Page Source in Chrome) and search for the data-section attribute, you won't see any tags with "Chennai". I am not 100% sure, but I would say those contents are populated by JavaScript, etc., which requires the functionality of a browser.
If I were you, I would use Selenium to open the page and then get the page source from there; the HTML collected that way will be closer to what you see in a browser.
Cited here
from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here; sleep until the page is fully loaded.
time.sleep(10)
soup = BeautifulSoup(driver.page_source)
print len(soup.findAll(...))
# or you can work directly in selenium
...
driver.close()
And the output for me is 8
