I want to extract some data from Amazon (the link is in the code below).
Here is my code:
import urllib2

url = "http://www.amazon.com/s/ref=sr_nr_n_11?rh=n%3A283155%2Cn%3A%2144258011%2Cn%3A2205237011%2Cp_n_feature_browse-bin%3A2656020011%2Cn%3A173507&bbn=2205237011&sort=titlerank&ie=UTF8&qid=1393984161&rnid=1000"
webpage = urllib2.urlopen(url).read()  # fetch the raw HTML

# save it to a local file for inspection
doc = open("test.html", "w")
doc.write(webpage)
doc.close()
When I open test.html, its content is different from what the website shows in a browser.
The page relies on JavaScript execution.
urllib2.urlopen(url).read() simply reads the raw URL content without running any scripts, so the two differ.
To get the same content, you need a library that can execute JavaScript.
For example, the following code uses Selenium:
from selenium import webdriver

url = 'http://www.amazon.com/s/ref=sr_nr_n_11?...161&rnid=1000'

driver = webdriver.Firefox()
driver.get(url)
with open('test.html', 'w') as f:
    f.write(driver.page_source.encode('utf-8'))
driver.quit()
To complete falsetru's answer:
Another solution is to use python-ghost. It is based on Qt, but it is much heavier to install, so I would advise Selenium as well.
Using Firefox will open a browser window when the script runs. To keep it out of your way, use PhantomJS:
apt-get install nodejs # you get npm, the Node Package Manager
npm install -g phantomjs # install globally
[…]
driver = webdriver.PhantomJS()
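Putting that together with the Selenium snippet from the answer above, a minimal sketch of the full flow with PhantomJS (the URL is shortened here exactly as in that example):

from selenium import webdriver

url = 'http://www.amazon.com/s/ref=sr_nr_n_11?...161&rnid=1000'

driver = webdriver.PhantomJS()  # headless, no browser window opens
driver.get(url)
with open('test.html', 'w') as f:
    f.write(driver.page_source.encode('utf-8'))
driver.quit()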
Related
I am trying to download this zip file. I used Selenium and requests, but neither of them works and I don't know why.
Thank you for your advice.
from selenium import webdriver
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
driver = webdriver.Chrome('drivers\chromedriver.exe')
driver.get(url)
requests.get(url)
requests.get() downloads the response body into memory. It needs to be written to a file explicitly using open().
Example:
import requests
url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
filename = 'c:/users/user/downloads/csv.zip'
filebody = requests.get(url)
open(filename, 'wb').write(filebody.content)
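For large files it may be preferable not to hold the whole body in memory. A sketch using requests' streaming mode, with the same URL and filename as above:

import requests

url = 'http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip'
filename = 'c:/users/user/downloads/csv.zip'

# stream the response and write it to disk in chunks
with requests.get(url, stream=True) as response:
    with open(filename, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)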
First of all, you don't need requests to download a file (in this case, at least). Since I don't know which errors you are getting, I would suggest double-checking the path to your chromedriver.exe; you should also escape the backslashes:
driver = webdriver.Chrome('drivers\\chromedriver.exe')
I tried your code (after entering the location of chromedriver on my computer) and it worked; I was able to download the file.
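If you do want Chrome itself to handle the download into a known folder, a sketch using ChromeOptions (the download path is a placeholder, and depending on your Selenium version the keyword may be chrome_options instead of options):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': 'c:\\users\\user\\downloads',  # placeholder path
})

driver = webdriver.Chrome('drivers\\chromedriver.exe', options=options)
driver.get('http://vdp.cuzk.cz/vymenny_format/csv/20200131_OB_ADR_csv.zip')  # Chrome saves the zip to the folder above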
I am taking a trial website as a case to learn how to upload files using Python Selenium when the upload window is not part of the HTML but a system-level dialog. This has already been solved in Java (Stack Overflow link below). If it is not possible via Python, I intend to switch to Java for this task.
BUT,
dear fellow Python lovers, why shouldn't it be possible using Python's Selenium WebDriver? Hence this quest.
Solved in Java for URL: http://www.zamzar.com/
Solution (and Java code) on Stack Overflow: How to handle windows file upload using Selenium WebDriver?
Here is my Python code, which should be self-explanatory and includes the Chrome WebDriver download links.
The task (uploading a file) I am attempting, in brief:
Website: https://www.wordtopdf.com/
Note_1: I don't need this tool for any real work, as there are far better packages for Word-to-PDF conversion. This is just for learning and polishing my Python Selenium code.
Note_2: You will have to enter two paths into my code below after downloading and unzipping the Chrome driver (links below in the comments). The two paths are: [a] the path of a(ny) Word file and [b] the path of the unzipped Chrome driver.
My Code:
import time
import numpy as np
from selenium import webdriver

UNZIPPED_DRIVER_PATH = 'C:/Users/....' # You need to specify this on your computer
driver = webdriver.Chrome(executable_path = UNZIPPED_DRIVER_PATH)
# Driver download links below (check which version of chrome you are using if you don't know it beforehand):
# Chrome Driver 74 Download: https://chromedriver.storage.googleapis.com/index.html?path=74.0.3729.6/
# Chrome Driver 73 Download: https://chromedriver.storage.googleapis.com/index.html?path=73.0.3683.68/
New_Trial_URL = 'https://www.wordtopdf.com/'
driver.get(New_Trial_URL)
time.sleep(np.random.uniform(4.5, 5.5, size = 1)) # Time to load the page in peace
Find_upload = driver.find_element_by_xpath('//*[@id="file-uploader"]')
WORD_FILE_PATH = 'C:/Users/..../some_word_file.docx' # You need to specify this on your computer
Find_upload.send_keys(WORD_FILE_PATH) # Not working, no action happens here
Based on something very similar in Java (How to handle windows file upload using Selenium WebDriver?), this should work like a charm. But voilà... total failure, and thus a chance to learn something new.
I have also tried:
Click_Alert = Find_upload.click()
Click_Alert(driver).send_keys(WORD_FILE_PATH)
That did not work either. 'Alert' should be a built-in class according to these two links (https://seleniumhq.github.io/selenium/docs/api/py/webdriver/selenium.webdriver.common.alert.html & Selenium-Python: interact with system modal dialogs).
But the 'Alert' class from those links doesn't seem to exist in my Python setup, even after executing
from selenium import webdriver
To all readers: I hope this doesn't take much of your time, and that we all get to learn something out of it.
Cheers
You are selecting '//*[@id="file-uploader"]', which is an <a> tag,
but there is a hidden <input type="file"> (behind the <a>) which you have to use instead:
import selenium.webdriver
your_file = "/home/you/file.doc"
your_email = "you#example.com"
url = 'https://www.wordtopdf.com/'
driver = selenium.webdriver.Firefox()
driver.get(url)
file_input = driver.find_element_by_xpath('//input[@type="file"]')
file_input.send_keys(your_file)

email_input = driver.find_element_by_xpath('//input[@name="email"]')
email_input.send_keys(your_email)
driver.find_element_by_id('convert_now').click()
Tested with Firefox 66 / Linux Mint 19.1 / Python 3.7 / Selenium 3.141.0
EDIT: The same method for uploading on zamzar.com
This was a situation I saw for the first time (so it took me longer to create a solution): the page has an <input type="file"> hidden under the button, but it doesn't use it to upload the file. It dynamically creates a second <input type="file"> which it uses for the upload (or maybe even for many files; I didn't test that).
import selenium.webdriver
from selenium.webdriver.support.ui import Select
import time
your_file = "/home/furas/Obrazy/37884728_1975437959135477_1313839270464585728_n.jpg"
#your_file = "/home/you/file.jpg"
output_format = 'png'
url = 'https://www.zamzar.com/'
driver = selenium.webdriver.Firefox()
driver.get(url)
#--- file ---

# it has to wait because the page has to create the second `input[@type="file"]`
file_input = driver.find_elements_by_xpath('//input[@type="file"]')

while len(file_input) < 2:
    print('len(file_input):', len(file_input))
    time.sleep(0.5)
    file_input = driver.find_elements_by_xpath('//input[@type="file"]')

file_input[1].send_keys(your_file)
#--- format ---
select_input = driver.find_element_by_id('convert-format')
select = Select(select_input)
select.select_by_visible_text(output_format)
#--- convert ---
driver.find_element_by_id('convert-button').click()
#--- download ---
time.sleep(5)
driver.find_elements_by_xpath('//td[@class="status last"]/a')[0].click()
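As an alternative to the polling loop above, an explicit wait could be used for the second file input; a sketch, assuming the same driver and locator:

from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds until the page has created the second input[@type="file"]
WebDriverWait(driver, 10).until(
    lambda d: len(d.find_elements_by_xpath('//input[@type="file"]')) >= 2
)
file_inputs = driver.find_elements_by_xpath('//input[@type="file"]')
file_inputs[1].send_keys(your_file)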
I want to scrape Bing's search results. Basically, I am using Selenium: the idea is to click 'Next' automatically and scrape the URLs of the search results on each page. I got it running with the Chrome browser on my Ubuntu machine:
from selenium import webdriver
import os
import time

class bingURL(object):
    def __init__(self):
        self.driver = webdriver.Chrome(os.path.expanduser('./chromedriver'))

    def get_urls(self, url):
        driver = self.driver
        driver.get(url)
        elems = driver.find_elements_by_xpath("//a[@href]")
        href = []
        for elem in elems:
            link = elem.get_attribute("href")
            try:
                if 'bing.com' not in link and 'http' in link and 'microsoft.com' not in link and 'smashboards.com' not in link:
                    href.append(link)
            except:
                pass
        return list(set(href))

    def search_urls(self, keyword, pagenum):
        driver = self.driver
        searchurl = self.lookup(keyword)  # url of the first page of bing search results
        driver.get(searchurl)
        results = self.get_urls(searchurl)
        for i in range(pagenum):
            driver.find_elements_by_class_name("sb_pagN")[0].click()  # click 'Next' on the bing search results
            time.sleep(5)  # wait for the page to load
            current_url = driver.current_url
            #print(current_url)
            #print(self.get_urls(current_url))
            results[0:0] = self.get_urls(current_url)
        driver.quit()
        return results

    def lookup(self, query):
        return "https://www.bing.com/search?q=" + query

if __name__ == "__main__":
    g = bingURL()
    result = g.search_urls('Stackoverflow is good', 10)
It works perfectly: when I run the code, it launches a Chrome browser, and I can see it go to the next page automatically and collect URLs for 10 pages of search results.
However, my goal is to run this code on AWS. There the original code failed with the error 'Chrome failed to start'. After googling, it seems I need to use a headless browser such as PhantomJS on AWS. So I installed PhantomJS and changed the def __init__(self): to:
def __init__(self):
    self.driver = webdriver.PhantomJS()
However, it cannot click 'Next' anymore and cannot scrape URLs using the old code. The error message is:
File ".../SEARCH_BING_MODULE.py", line 70, in search_urls
driver.find_elements_by_class_name("sb_pagN")[0].click()
IndexError: list index out of range
It looks like changing the browser completely changes the rules. How should I modify the original code to make it work again, or how can I scrape Bing search result URLs using Selenium + PhantomJS?
Thanks for your help!
Yes, you can perform all of these operations with a headless browser. Don't use HTMLUnit, as it has many configuration issues.
PhantomJS was another approach for a headless browser, but PhantomJS is buggy these days because it is poorly maintained.
You can use chromedriver itself for headless jobs.
You just need to pass one option to chromedriver, as below:
chromeOptions.addArguments("--headless");
The full code (in Java) will look like this:
System.setProperty("webdriver.chrome.driver","D:\\Workspace\\JmeterWebdriverProject\\src\\lib\\chromedriver.exe");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--start-maximized");
WebDriver driver = new ChromeDriver(chromeOptions);
driver.get("https://www.google.co.in/");
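Since the question uses Python, here is a rough Python equivalent of the headless Chrome setup above (a sketch; the chromedriver path is a placeholder, adjust it for your system):

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--start-maximized')

driver = webdriver.Chrome('./chromedriver', chrome_options=chrome_options)  # placeholder driver path
driver.get('https://www.bing.com/')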
Hope it will help you :)
I am trying to get the video URL from links on this page. The video link can be seen at https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html (open it in Chrome).
For that I wrote the Chrome WebDriver related code below:
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from pyvirtualdisplay import Display

chromedriver = '/usr/local/bin/chromedriver'
os.environ['webdriver.chrome.driver'] = chromedriver

display = Display(visible=0, size=(800, 600))
display.start()

driver = webdriver.Chrome(chromedriver)
driver.get('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html')

try:
    element = WebDriverWait(driver, 20).until(lambda driver: driver.find_elements_by_class_name('yvp-main'))
    self.yahoo_video_trend = []
    for s in driver.find_elements_by_class_name('yvp-main'):
        print "Processing link - ", item['link']
        trend = item
        print item['description']
        trend['video_link'] = s.find_element_by_tag_name('video').get_attribute('src')
        print
        print s.find_element_by_tag_name('video').get_attribute('src')
        self.yahoo_video_trend.append(trend)
except:
    return
This works fine on my local system, but when I run it on my Azure server it does not give any result at s.find_element_by_tag_name('video').get_attribute('src').
I have installed Chrome on my Azure server.
Update:
Please note that I have already tried requests and BeautifulSoup, but since Yahoo loads the HTML content dynamically from JSON, I could not get the video URL with them.
And yes, the Azure server is a plain Linux system with command-line access, not a hosted application.
I tried to reproduce your issue using your code. However, I found there was no tag named video on that page ('https://in.news.yahoo.com/video/jaguar-fighter-aircraft-crashes-near-084300217.html'), testing with both IE and Chrome.
I used the developer tools to check the HTML code, as in this picture:
It seems that this page uses the Flash player to play the video, not an HTML5 video control.
For this reason, I suggest you check whether your code uses the right tag name.
If you have any concerns, please feel free to let me know.
We tried to reproduce the error on our side. I was not able to get the Chrome driver to work, but I did try the Firefox driver and it worked fine: it was able to load the page and get the link via the URL.
Can you change your code to print the exception and send it to us, to see where the script is failing?
Change your code from:
except:
    return
to:
try:
    # ... the scraping code ...
except Exception, e:
    print str(e)
Send us the exception, so we can take a look.
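(The snippet above uses Python 2 syntax. In Python 3, and printing the full traceback rather than just the message, a sketch would be:)

import traceback

try:
    pass  # ... the scraping code from the question ...
except Exception:
    print(traceback.format_exc())  # full traceback instead of silently returning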
I want to use PhantomJS in Python. I googled this problem but couldn't find proper solutions.
I find that os.popen() may be a good choice, but I couldn't pass some arguments to it.
Using subprocess.Popen() may be a proper solution for now, but I want to know whether there's a better one.
Is there a way to use PhantomJS in Python?
The easiest way to use PhantomJS in Python is via Selenium. The simplest installation method is:
Install NodeJS
Using Node's package manager, install PhantomJS: npm -g install phantomjs-prebuilt
Install Selenium (in your virtualenv, if you are using one)
After installation, you can use PhantomJS as simply as:
from selenium import webdriver
driver = webdriver.PhantomJS() # or add to your PATH
driver.set_window_size(1024, 768) # optional
driver.get('https://google.com/')
driver.save_screenshot('screen.png') # save a screenshot to disk
sbtn = driver.find_element_by_css_selector('button.gbqfba')
sbtn.click()
If your system path environment variable isn't set correctly, you'll need to specify the exact path as an argument to webdriver.PhantomJS(). Replace this:
driver = webdriver.PhantomJS() # or add to your PATH
... with the following:
driver = webdriver.PhantomJS(executable_path='/usr/local/lib/node_modules/phantomjs/lib/phantom/bin/phantomjs')
References:
http://selenium-python.readthedocs.io/
How do I set a proxy for phantomjs/ghostdriver in python webdriver?
https://dzone.com/articles/python-testing-phantomjs
PhantomJS recently dropped Python support altogether. However, PhantomJS now embeds Ghost Driver.
A new project has since stepped up to fill the void: ghost.py. You probably want to use that instead:
from ghost import Ghost

ghost = Ghost()

with ghost.start() as session:
    page, extra_resources = ghost.open("http://jeanphi.me")
    assert page.http_status == 200 and 'jeanphix' in ghost.content
Now since the GhostDriver comes bundled with the PhantomJS, it has become even more convenient to use it through Selenium.
I tried the Node installation of PhantomJS, as suggested by Pykler, but in practice I found it to be slower than the standalone installation of PhantomJS. I guess the standalone installation didn't provide these features earlier, but as of v1.9 it very much does.
Install PhantomJS (http://phantomjs.org/download.html) (If you are on Linux, following instructions will help https://stackoverflow.com/a/14267295/382630)
Install Selenium using pip.
Now you can use it like this:
import selenium.webdriver
driver = selenium.webdriver.PhantomJS()
driver.get('http://google.com')
# do some processing
driver.quit()
Here's how I test javascript using PhantomJS and Django:
mobile/test_no_js_errors.js:
var page = require('webpage').create(),
    system = require('system'),
    url = system.args[1],
    status_code;

page.onError = function (msg, trace) {
    console.log(msg);
    trace.forEach(function(item) {
        console.log(' ', item.file, ':', item.line);
    });
};

page.onResourceReceived = function(resource) {
    if (resource.url == url) {
        status_code = resource.status;
    }
};

page.open(url, function (status) {
    if (status == "fail" || status_code != 200) {
        console.log("Error: " + status_code + " for url: " + url);
        phantom.exit(1);
    }
    phantom.exit(0);
});
mobile/tests.py:
import subprocess
from django.test import LiveServerTestCase

class MobileTest(LiveServerTestCase):
    def test_mobile_js(self):
        args = ["phantomjs", "mobile/test_no_js_errors.js", self.live_server_url]
        result = subprocess.check_output(args)
        self.assertEqual(result, "")  # No result means no error
Run tests:
manage.py test mobile
The answer by @Pykler is great, but the Node requirement is outdated. The comments on that answer suggest the simpler approach, which I've put here to save others time:
Install PhantomJS
As @Vivin-Paliath points out, it's a standalone project, not part of Node.
Mac:
brew install phantomjs
Ubuntu:
sudo apt-get install phantomjs
etc
Set up a virtualenv (if you haven't already):
virtualenv mypy # doesn't have to be "mypy". Can be anything.
. mypy/bin/activate
If your machine has both Python 2 and 3, you may need to run virtualenv-3.6 mypy or similar.
Install selenium:
pip install selenium
Try a simple test, like this one borrowed from the docs:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.PhantomJS()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
This is what I do, with Python 3.3. I was processing huge lists of sites, so failing on a timeout was vital for the job to run through the entire list.
import subprocess

command = "phantomjs --ignore-ssl-errors=true " + <your js file for phantom>
process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)

# make sure phantomjs has time to download/process the page
# but if we get nothing after 30 sec, just move on
try:
    output, errors = process.communicate(timeout=30)
except Exception as e:
    print("\t\tException: %s" % e)
    process.kill()

# output will be weird, decode to utf-8 to save heartache
phantom_output = ''
for out_line in output.splitlines():
    phantom_output += out_line.decode('utf-8')
If using Anaconda, install with:
conda install PhantomJS
In your script:
from selenium import webdriver

driver = webdriver.PhantomJS()
It works perfectly.
In case you are using Buildout, you can easily automate the installation processes that Pykler describes using the gp.recipe.node recipe.
[nodejs]
recipe = gp.recipe.node
version = 0.10.32
npms = phantomjs
scripts = phantomjs
That part installs Node.js as a binary (at least on my system) and then uses npm to install PhantomJS. Finally, it creates an entry point, bin/phantomjs, with which you can call the PhantomJS webdriver. (To install Selenium, you need to specify it in your egg requirements or in the Buildout configuration.)
driver = webdriver.PhantomJS('bin/phantomjs')