I want to write a Python script where I enter a question and get an answer back from the Google Custom Search API, Bing, or any other search API (any one will do). I tried the Google Custom Search API, but it only gave me this script:
<script>
  (function() {
    var cx = 'someurl';
    var gcse = document.createElement('script');
    gcse.type = 'text/javascript';
    gcse.async = true;
    gcse.src = 'someurl' + cx;
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(gcse, s);
  })();
</script>
Since I am not using any HTML page and just need the answer in the Python console, how do I do this? Is there some other method besides an API call?
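If you do want to stay with an API call, the Custom Search JSON API can be queried directly over HTTPS from Python, with no HTML page involved. Here is a minimal sketch, assuming you have already created a custom search engine (the cx ID) and an API key; both values below are placeholders:
import requests

API_KEY = "YOUR_API_KEY"   # placeholder: your Google API key
CX = "YOUR_CX_ID"          # placeholder: your custom search engine ID

def search(query, num=10):
    """Return (title, link) pairs from the Custom Search JSON API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": CX, "q": query, "num": num},
    )
    resp.raise_for_status()
    return [(item["title"], item["link"]) for item in resp.json().get("items", [])]

for title, link in search("what is the capital of France"):
    print(title, "->", link)
Note that this API has a limited free daily quota, which is why many people fall back to scraping, as described below.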
To do Google searches there are many modules, like google or Google-Search-API. But if you do many searches and send many requests, Google will block you and you'll get a 503 error. There used to be other APIs to fall back on, like Bing and Yahoo, but neither of them is free anymore. The only free API that still does internet searches is the FAROO API.
There is still one way to do a Google search: Selenium WebDriver. Selenium imitates real browser usage and can drive Firefox, Chrome, Edge or Safari (it actually opens the browser and performs your search), which is annoying because you usually don't want to see a browser window. The solution for that is PhantomJS, a headless browser. Download it from here, extract it, and see how to use it in the example below (I wrote a simple class which you can use; you just need to change the path to PhantomJS):
import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:
    def __init__(self, path, initiate=True, implicit_wait_time=10, explicit_wait_time=2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
        self.driver = webdriver.PhantomJS(self.path)
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time=None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num * per_page, lang)
        return url

    def scrape(self):
        # xpath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]")  # searches for all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time=None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results


path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe'  # SET YOUR PATH TO phantomjs
br = Browser(path)
results = br.search('Python')
for r in results:
    print(r)
br.end()
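Note that newer Selenium releases have removed PhantomJS support, so webdriver.PhantomJS may not exist in your version. A possible substitute, shown here as a sketch rather than as part of the original answer, is headless Chrome: only start() changes, and chromedriver is assumed to be on your PATH.
    def start(self):
        # headless Chrome instead of PhantomJS (assumes chromedriver is on PATH)
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        self.driver = webdriver.Chrome(options=options)
        self.driver.implicitly_wait(self.implicit_wait_time)
        return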
As a fun experiment, I decided to scrape data from Google Shopping. It works perfectly on my local machine, but on my server it doesn't. Here is the code:
# Web driver file
import re
import time
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = False
driver = webdriver.Chrome(options=options, executable_path="/Users/kevin/Documents/projects/deal_hunt/scraper_scripts/chromedriver")


def get_items(url, category):
    driver.get(url)
    results = []
    content = driver.page_source
    soup = BeautifulSoup(content, features="lxml")

    # first, click on all the images of the products that are on sale; that's the only way
    # to generate the class that will let us fetch the data we need
    for element in soup.find_all(attrs="i0X6df"):
        # not all items are on sale; those that are have a label, and we only choose those
        sale_label = element.find('span', {'class': 'Ib8pOd'})
        if sale_label is None:
            pass
        else:
            # take the id of the image from the page and dynamically click on it;
            # otherwise selenium keeps clicking on the first picture
            parent_div = element.find('div', {'class': 'ArOc1c'})
            image_tag = parent_div.find('img')
            image_to_click = driver.find_element_by_id(image_tag['id'])
            driver.execute_script("arguments[0].click();", image_to_click)
            time.sleep(5)

    items = driver.find_elements_by_class_name('_-oQ')
    for item in items:
        image_tag = item.find_element_by_class_name('sh-div__current').get_attribute('src')
        description = item.find_element_by_class_name('sh-t__title').get_attribute('text')
        link = item.find_element_by_class_name('sh-t__title').get_attribute('href')
        store = item.find_element_by_css_selector('._-oA > span').get_attribute('textContent')
        price = item.find_elements_by_class_name('_-pX')[0].get_attribute('textContent')
        old_price = item.find_elements_by_class_name('_-pX')[1].get_attribute('textContent')

        # we only take numbers, because the web page returns a series of weird characters
        # and the price is found at the end of the string
        price_array = price.split(',')
        price = ''.join(re.findall(r'\d+', price_array[0])) + '.' + price_array[1]
        old_price_array = old_price.split(',')
        old_price = ''.join(re.findall(r'\d+', old_price_array[0])) + '.' + old_price_array[1]

        # remove rand sign
        price = price.replace("R ", "")
        # replace the comma with the dot
        price = price.replace(",", ".")

        # get the url of the product out of the google url
        url_to_parse = link
        parsed_url = urlparse(url_to_parse)
        product_url = parse_qs(parsed_url.query)['q'][0]

        results.append({
            'image': image_tag,
            'description': description,
            'store': store,
            'link': product_url,
            'price': float(price),
            'old_price': float(old_price)
        })

    # if we successfully scraped data, print it; otherwise skip
    if len(results) > 0:
        print(results)
        print("Command has been perfectly executed")
    else:
        print("There is nothing to add")
When I run python3 main.py locally, it reports that the command has been perfectly executed, but on my Ubuntu server the same command immediately returns "There is nothing to add".
First, verify that everything needed is installed on your server, including matching Selenium and Python versions, and check the driver path on the server, because that chromedriver may not be running there.
As an additional recommendation, add checkpoints in the code to see whether it fails to bring back any data from the start, or whether it loses the data somewhere later. At a glance, I don't see anything in the code that would generate the error.
As suggested in another answer, you should debug your code and ensure that the requests are identical.
Alternatively, you could try running the spider in containers to avoid any OS particularities. A more scalable option would be to use a cloud-based scraping environment like estela; although I have not tested it, you could also try Scrapy with Selenium.
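One likely OS difference worth checking: a server has no display, so running Chrome with options.headless = False may fail to start or render a different page than it does locally. Below is a minimal checkpoint sketch, assuming chromedriver is installed on the server; the extra Chrome flags and the i0X6df selector check are assumptions to verify, not guaranteed fixes.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True                           # no display on the server
options.add_argument("--no-sandbox")              # often needed when running as root or in a container
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
driver.get("https://shopping.google.com/")        # placeholder: the URL you pass to get_items

# checkpoints: dump what the server actually received before parsing
print("[checkpoint] page length:", len(driver.page_source))
print("[checkpoint] sale elements found:", len(driver.find_elements_by_class_name("i0X6df")))
driver.quit()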
I am using PyCharm to capture some data from the web and push it into an in-memory SQLite database table. I have debugged the code and it works fine; in the debugger I can see the data being fetched and pushed into the db[table] location.
Python code is as below -
import requests
import dataset
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


def begin():
    db = dataset.connect('sqlite:///quotes.db')
    authors_seen = set()
    base_url = 'http://quotes.toscrape.com/'

    def clean_url(url):
        # Clean '/author/Steve-Martin' to 'Steve-Martin'
        # Use urljoin to make an absolute URL
        url = urljoin(base_url, url)
        # Use urlparse to get out the path part
        path = urlparse(url).path
        # Now split the path by '/' and get the second part
        # E.g. '/author/Steve-Martin' -> ['', 'author', 'Steve-Martin']
        return path.split('/')[2]

    def scrape_quotes(html_soup):
        for quote in html_soup.select('div.quote'):
            quote_text = quote.find(class_='text').get_text(strip=True)
            quote_author_url = clean_url(quote.find(class_='author').find_next_sibling('a').get('href'))
            quote_tag_urls = [clean_url(a.get('href')) for a in quote.find_all('a', class_='tag')]
            authors_seen.add(quote_author_url)
            # Store this quote and its tags
            quote_id = db['quotes'].insert({'text': quote_text, 'author': quote_author_url})
            db['quotes_tags'].insert_many([{'quote_id': quote_id, 'tag_id': tag} for tag in quote_tag_urls])

    def scrape_author(html_soup, author_id):
        author_name = html_soup.find(class_='author-title').get_text(strip=True)
        author_born_date = html_soup.find(class_='author-born-date').get_text(strip=True)
        author_born_loc = html_soup.find(class_='author-born-location').get_text(strip=True)
        author_desc = html_soup.find(class_='author-description').get_text(strip=True)
        db['authors'].insert({'author_id': author_id, 'name': author_name,
                              'born_date': author_born_date, 'born_location': author_born_loc,
                              'description': author_desc})

    # Start by scraping all the quote pages
    print('*****Beginning scraping process - quotes first.*****')
    url = base_url
    while True:
        print('Now scraping page:', url)
        r = requests.get(url)
        html_soup = BeautifulSoup(r.text, 'html.parser')
        # Scrape the quotes
        scrape_quotes(html_soup)
        # Is there a next page?
        next_a = html_soup.select('li.next > a')
        if not next_a or not next_a[0].get('href'):
            break
        url = urljoin(url, next_a[0].get('href'))

    # Now fetch the author information
    print('*****Scraping authors data.*****')
    for author_id in authors_seen:
        url = urljoin(base_url, '/author/' + author_id)
        print('Now scraping author:', url)
        r = requests.get(url)
        html_soup = BeautifulSoup(r.text, 'html.parser')
        # Scrape the author information
        scrape_author(html_soup, author_id)

    db.commit()
    db.close()
What I am struggling with is the PyCharm IDE connection. As shown in the figure below, I can see the quotes.sqlite database, but it has only one table listed, sqlite_master. Under server objects there are collations, modules and routines, which are part of the infrastructure provided by SQLite.
Also, when I view the db object (Python's handle to SQLite) in the debugger, I can see the relevant table, as shown in the picture below.
Any ideas why PyCharm refuses to show the relevant table/collection in the IDE?
I wrote Python code to search for an image on Google with some Google dork keywords. Here is the code:
def showD(self):
    self.text, ok = QInputDialog.getText(self, 'Write A Keyword', 'Example:"twitter.com"')
    if ok == True:
        self.google()

def google(self):
    filePath = self.imagePath
    domain = self.text
    searchUrl = 'http://www.google.com/searchbyimage/upload'
    multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': '', 'q': f'site:{domain}'}
    response = requests.post(searchUrl, files=multipart, allow_redirects=False)
    fetchUrl = response.headers['Location']
    webbrowser.open(fetchUrl)


App = QApplication(sys.argv)
window = Window()
sys.exit(App.exec())
I just couldn't figure out how to display the URLs of the search results in my program. I tried this code:
import requests
from bs4 import BeautifulSoup
import re

query = "twitter"
search = query.replace(' ', '+')
results = 15
url = (f"https://www.google.com/search?q={search}&num={results}")

requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")

for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            # print(title[0].getText())
            print("------")
But it only works for a normal Google search keyword and fails when I try to adapt it to the results of a Google image search. It doesn't display any results.
Currently there is no simple way to scrape Google's "Search by image" using plain HTTPS requests. Before responding to this type of request, they presumably check whether the user is real using several sophisticated techniques. Even your working example does not work for long; it gets banned by Google after 20-100 requests.
All public Python solutions that really scrape Google image search use Selenium and imitate real user behaviour, so you can go this way yourself. The Python Selenium bindings are not hard to get used to, except maybe the setup process.
The best of them, to my taste, is hardikvasa/google-images-download (7.8K stars on GitHub). Unfortunately, this library does not accept an image path or binary image data as input; it only has a similar_images parameter, which expects a URL. Nevertheless, you can try to use it with an http://localhost:1234/... URL (you can easily set one up this way).
You can check all these questions and see that all the solutions use Selenium for this task.
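As a rough sketch of that localhost trick, assuming google-images-download is installed and query.jpg sits in the directory where you start a throwaway file server; the file name and port are placeholders, and the exact options in the arguments dict are worth verifying against the library's docs:
# in the folder containing query.jpg, serve it locally first:
#   python -m http.server 1234

from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
arguments = {
    "similar_images": "http://localhost:1234/query.jpg",  # placeholder file name and port
    "limit": 10,
    "print_urls": True,   # print the matching result URLs to the console
    "no_download": True,  # collect URLs only, do not save the images
}
paths = response.download(arguments)
print(paths)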
I'm scraping a website that depends heavily on JavaScript. The main page from which I need to extract the URLs to be parsed depends on JavaScript, so I have to modify start_requests.
I'm looking for a way to connect start_requests with the LinkExtractor and with process_match:
class MatchSpider(CrawlSpider):
    name = "match"
    allowed_domains = ["whoscored"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "match-report")]//@href'), callback='parse_item'),
    )

    def start_requests(self):
        url = 'https://www.whoscored.com/Regions/252/Tournaments/2/Seasons/6335/Stages/13796/Fixtures/England-Premier-League-2016-2017'
        browser = Browser(browser='Chrome')
        browser.get(url)
        # should return a request with the html body from the Selenium driver so that the LinkExtractor rule can be applied

    def process_match(self, response):
        match_item = MatchItem()
        regex = re.compile("matchCentreData = \{.*?\};", re.S)
        match = re.search(regex, response.text).group()
        match = match.replace('matchCentreData =', '').replace(';', '')
        match_item['match'] = json.loads(match)
        match_item['url'] = response.url
        match_item['project'] = self.settings.get('BOT_NAME')
        match_item['spider'] = self.name
        match_item['server'] = socket.gethostname()
        match_item['date'] = datetime.datetime.now()
        yield match_item
A wrapper I'm using around Selenium:
class Browser:
    """
    selenium on steroids. allows you to create different types of browsers plus
    adds methods for safer calls
    """
    def __init__(self, browser='Firefox'):
        """
        type: silent or not
        browser: chrome or firefox
        """
        self.browser = browser
        self._start()

    def _start(self):
        '''
        starts browser
        '''
        if self.browser == 'Chrome':
            chrome_options = webdriver.ChromeOptions()
            prefs = {"profile.managed_default_content_settings.images": 2}
            chrome_options.add_extension('./libcommon/adblockpluschrome-1.10.0.1526.crx')
            chrome_options.add_experimental_option("prefs", prefs)
            chrome_options.add_argument("user-agent={0}".format(random.choice(USER_AGENTS)))
            self.driver_ = webdriver.Chrome(executable_path='./libcommon/chromedriver', chrome_options=chrome_options)
        elif self.browser == 'Firefox':
            profile = webdriver.FirefoxProfile()
            profile.set_preference("general.useragent.override", random.choice(USER_AGENTS))
            profile.add_extension('./libcommon/adblock_plus-2.7.1-sm+tb+an+fx.xpi')
            profile.set_preference('permissions.default.image', 2)
            profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
            profile.set_preference("webdriver.load.strategy", "unstable")
            self.driver_ = webdriver.Firefox(profile)
        elif self.browser == 'PhantomJS':
            self.driver_ = webdriver.PhantomJS()
            self.driver_.set_window_size(1120, 550)

    def close(self):
        self.driver_.close()

    def return_when(self, condition, locator):
        """
        returns browser execution when condition is met
        """
        for _ in range(5):
            with suppress(Exception):
                wait = WebDriverWait(self.driver_, timeout=100, poll_frequency=0.1)
                wait.until(condition(locator))
                self.driver_.execute_script("return window.stop")
                return True
        return False

    def __getattr__(self, name):
        """
        ruby-like method missing: delegate methods not implemented here to the attribute that
        holds the selenium browser
        """
        def _missing(*args, **kwargs):
            return getattr(self.driver_, name)(*args, **kwargs)
        return _missing
There are two problems I see after looking into this. Forgive any ignorance on my part; it's been a while since I was last in the Python/Scrapy world.
First: How do we get the HTML from Selenium?
According to the Selenium docs, the driver should have a page_source attribute containing the contents of the page.
browser = Browser(browser='Chrome')
browser.get(url)
html = browser.driver_.page_source
browser.close()
You may want to make this a function in your browser class to avoid accessing browser.driver_ from MatchSpider.
# class Browser
def page_source(self):
    return self.driver_.page_source
# end class
browser.get(url)
html = browser.page_source()
Second: How do we override Scrapy's internal web requests?
It looks like Scrapy tries to decouple the behind-the-scenes web requests from the what-am-I-trying-to-parse functionality of each spider you write. start_requests() should "return an iterable with the first Requests to crawl", and make_requests_from_url(url) (which is called if you don't override start_requests()) takes "a URL and returns a Request object". When internally processing a Spider, Scrapy creates a plethora of Request objects that are executed asynchronously, and each subsequent Response is sent to parse(response); the Spider itself never does the processing from Request to Response.
Long story short, this means you would need to create downloader middleware for Scrapy to use Selenium. Then you can remove your overridden start_requests() method and add a start_urls attribute. Specifically, your SeleniumDownloaderMiddleware should override the process_request(request, spider) method to use the above Selenium code.
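A sketch of what that middleware could look like, assuming the Browser wrapper above is importable from your project; the class and module names here (SeleniumDownloaderMiddleware, myproject.middlewares, myproject.browser) are illustrative, not anything Scrapy provides:
# middlewares.py (illustrative)
from scrapy.http import HtmlResponse

from myproject.browser import Browser  # assumed location of the wrapper from the question


class SeleniumDownloaderMiddleware:
    def __init__(self):
        self.browser = Browser(browser='Chrome')

    def process_request(self, request, spider):
        # render the page with Selenium so the JavaScript runs, then hand the
        # resulting HTML back to Scrapy as if it came from the downloader
        self.browser.get(request.url)
        body = self.browser.driver_.page_source
        return HtmlResponse(url=request.url, body=body, encoding='utf-8', request=request)


# settings.py (illustrative)
# DOWNLOADER_MIDDLEWARES = {
#     'myproject.middlewares.SeleniumDownloaderMiddleware': 543,
# }
With the middleware registered, MatchSpider can drop start_requests(), keep a plain start_urls list, and let the LinkExtractor rule and the process_match callback run against the rendered HTML.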
I am trying to create a website downloader using python. I have the code for:
Finding all URLs from a page
Downloading a given URL
What I have to do is recursively download a page, and if there are any other links on that page, download them too. I tried combining the two functions above, but the recursion doesn't work.
The codes are given below:
1)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

if __name__ == "__main__":
    import urllib
    wanted_url = raw_input("Enter the URL: ")
    usock = urllib.urlopen(wanted_url)
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        download(url)
2) where the download(url) function is defined as follows:
def download(url):
    import urllib
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'w')
    localFile.write(webFile.read())
    webFile.close()
    localFile.close()

a = raw_input("Enter the URL")
download(a)
print "Done"
Kindly help me figure out how to combine these two pieces of code to "recursively" download the new links found on a webpage that is being downloaded.
You may want to look into the Scrapy library.
It would make a task like this pretty trivial, and allow you to download multiple pages concurrently.
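A minimal sketch of what that could look like in Scrapy; the spider name, start URL, depth limit and file-naming scheme are all placeholders, and you would run it with scrapy runspider:
from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SiteDownloader(CrawlSpider):
    name = 'site_downloader'
    allowed_domains = ['quotes.toscrape.com']    # placeholder: site to mirror
    start_urls = ['http://quotes.toscrape.com/']
    custom_settings = {'DEPTH_LIMIT': 3}         # stop the recursion after 3 levels

    rules = (
        # follow every link on the allowed domain and save each fetched page
        Rule(LinkExtractor(), callback='save_page', follow=True),
    )

    def save_page(self, response):
        # derive a flat file name from the URL path
        path = urlparse(response.url).path.strip('/') or 'index'
        filename = path.replace('/', '_') + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.logger.info('Saved %s', filename)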
done_url = []

def download(url):
    if url in done_url:
        return
    ...download url code...
    done_url.append(url)
    urls = some_function_to_fetch_urls_from_this_page()
    for url in urls:
        download(url)
This is very rough code. For example, you will need to check whether the URL is within the domain you want to crawl. However, you asked for recursion.
Be mindful of the recursion depth.
There are just so many things wrong with my solution. :P
You must try some crawling library like Scrapy or something.
Generally, the idea is this:
def get_links_recursive(document, current_depth, max_depth):
    links = document.get_links()
    for link in links:
        downloaded = link.download()
        if current_depth < max_depth:
            get_links_recursive(downloaded, current_depth + 1, max_depth)
Call get_links_recursive(document, 0, 3) to get things started.
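If you prefer to stay with the standard library instead of Scrapy, here is a sketch of the same idea in modern Python, using html.parser and urllib.request as rough replacements for SGMLParser and the old urllib; the depth limit and same-domain check are assumptions added for safety, not part of the original code:
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class URLLister(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href')


def download(url, current_depth=0, max_depth=3, done=None):
    done = set() if done is None else done
    if url in done:
        return
    done.add(url)

    # save the page under a file name derived from its path
    html = urlopen(url).read()
    filename = urlparse(url).path.strip('/').replace('/', '_') or 'index'
    with open(filename + '.html', 'wb') as f:
        f.write(html)

    if current_depth >= max_depth:
        return

    # extract links and recurse, staying on the same site
    parser = URLLister()
    parser.feed(html.decode('utf-8', errors='replace'))
    for link in parser.urls:
        absolute = urljoin(url, link)
        if urlparse(absolute).netloc == urlparse(url).netloc:
            download(absolute, current_depth + 1, max_depth, done)


download('http://quotes.toscrape.com/')   # placeholder start URL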