I want to crawl some images for my machine learning practice and found google-images-download to be very useful; the code works out of the box.
However, at the moment it allows no more than 100 images, which is the limit of the Google Images page (it only loads 100 images per page).
The documentation says that if you install it with pip install google_images_download (which is what I did), selenium is installed along with it, and by using chromedriver you can download more than that limit.
However, every time I run the code with python gimages.py:
from google_images_download import google_images_download
response = google_images_download.googleimagesdownload()
arguments = {"keywords":"number plates","limit":200,"print_urls":True}
paths = response.download(arguments)
print(paths)
I get this error:
Looks like we cannot locate the path the 'chromedriver' (use the
'--chromedriver' argument to specify the path to the executable.) or
google chrome browser is not installed on your machine (exception:
expected str, bytes or os.PathLike object, not NoneType)
When I checked my installation, selenium was already installed.
Reading further, it says I can download chromedriver, put it inside the same folder, and call python gimages.py --chromedriver "chromedriver", but I still get the same error.
How can I resolve this?
I am using conda with Python 3.6, running the terminal from conda. The code is already working; it is just the chromedriver part that is not.
You need to specify the path; "chromedriver" on its own is not a path.
You might need to give the explicit path to the executable, "/path/to/chromedriver" (the error message asks for the executable, not its folder).
In your case: python gimages.py --chromedriver "/path/to/chromedriver"
Hope this helps you!
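If the command-line flag still fails, you can also pass the driver location through the arguments dict; a minimal sketch, assuming the library's documented "chromedriver" key and a placeholder path:
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
# "chromedriver" should point at the executable itself, not its folder
arguments = {"keywords": "number plates",
             "limit": 200,
             "print_urls": True,
             "chromedriver": "/full/path/to/chromedriver"}  # placeholder path
paths = response.download(arguments)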
I am trying to web scrape using requests-html, but it returns an error saying there is a missing file, even though I ran pip install requests-html and it said all requirements were fulfilled. How do I get around this?
from requests_html import HTMLSession
import time

url = 'https://soundcloud.com/jujubucks'
s = HTMLSession()
r = s.get(url)
r.html.render()  # renders the JavaScript; this is the step that launches chromium
songs = r.html.xpath('//*[@id="content"]/div/div[4]/div[1]/div/div[2]/div/div[2]', first=True)
print(songs)
This produces an sxstrace error:
OSError: [WinError 14001] The application has failed to start because its side-by-side
configuration is incorrect. Please see the application event log or use the command-line
sxstrace.exe tool for more detail
Apparently this is the missing file according to the event log, but I don't know where to get it:
Activation context generation failed for "C:\Users\houst\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe". Dependent Assembly 71.0.3542.0,language="*",type="win32",version="71.0.3542.0" could not be found. Please use sxstrace.exe for detailed diagnosis.
I came here with the same question, but the only answer didn't apply to me. My win10 x64 PC has 5 versions of Python: 4 installed via Anaconda and Python 3.10 installed via the Microsoft Store. I was debugging the process in VS Code using the MS Store version, with requests-html pip-installed for that version of Python only.
The VS Code stack trace showed that subprocess.py failed to launch a subprocess.
Windows event viewer showed a failed attempt to launch chrome.exe in:
C:\Users\username\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32
Windows search showed that chrome.exe - which was downloaded and extracted automatically the first time an attempt was made to call response.html.render() - was actually located at:
C:\Users\username\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32
As a workaround, and although I've no idea why the issue occurred, I moved the chrome-win32 directory to the expected location and found that Chrome ran the JavaScript on the page and returned HTML correctly.
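A minimal sketch of that workaround in Python, assuming the two paths from the event log above (username is a placeholder for your own account name):
import shutil

# where the MS Store Python actually extracted chromium
src = r"C:\Users\username\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32"
# where pyppeteer expects to find it
dst = r"C:\Users\username\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32"
shutil.copytree(src, dst)  # copy the extracted chromium to the expected location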
requests_html depends on pyppeteer, but it seems your pyppeteer has not installed chromium completely. Try installing chromium manually: just activate the environment containing pyppeteer and run pyppeteer-install.exe.
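If the console script is not on your PATH, the same download can be triggered from Python; a sketch using pyppeteer's downloader module, which ships with the package:
from pyppeteer import chromium_downloader

# fetches the chromium revision that pyppeteer expects
chromium_downloader.download_chromium()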
I'm using the google_images_download library to download the top 20 images for a keyword. It worked perfectly when I used it in the last few days. The code is as follows.
from google_images_download import google_images_download

response = google_images_download.googleimagesdownload()
keyword = "cats"  # placeholder; keyword was defined elsewhere in the original script
arguments = {"keywords": keyword, "limit": 10, "print_urls": True}
paths = response.download(arguments)
Now it gives the following error.
Evaluating...
Starting Download...
Unfortunately all 10 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!
Errors: 0
How can I solve this error?
There have been some changes on Google's end (in how they respond to the request) which cause this issue. Joeclinton1 on GitHub has made some modifications to the original repo that provide a temporary fix.
You can find the updated repo here: https://github.com/Joeclinton1/google-images-download.git. The fix is in the patch-1 branch, if I'm not mistaken.
First uninstall the current version of google_images_download.
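Assuming it was installed with pip, that would be:
pip uninstall google_images_download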
Then manually install Joeclinton1's repo by:
git clone https://github.com/Joeclinton1/google-images-download.git
cd google-images-download && sudo python setup.py install  # no need for 'sudo' in a Windows Anaconda environment
Or, to install it with pip:
pip install git+https://github.com/Joeclinton1/google-images-download.git
This should solve the problem. Note that this repo currently only supports up to 100 images.
I faced the same issue with google-images-download, which used to work perfectly earlier!
I have an alternative to suggest that should solve the problem.
Solution: Instead of using google-images-download for Python, use bing-image-downloader, which downloads from the Bing search engine.
Steps:
Step 1:
Install the library by using: pip install bing-image-downloader
Step 2:
from bing_image_downloader import downloader

query_string = "cats"  # placeholder; put your image topic here
downloader.download(query_string, limit=100, output_dir='dataset',
                    adult_filter_off=True, force_replace=False, timeout=60)
That's it! All you need to do is put your image topic in query_string.
Note:
Parameters that you can further tweak:
query_string : String to be searched.
limit : (optional, default is 100) Number of images to download.
output_dir : (optional, default is 'dataset') Name of output dir.
adult_filter_off : (optional, default is True) Enable or disable adult filtering.
force_replace : (optional, default is False) Delete folder if present and start a fresh download.
timeout : (optional, default is 60) timeout for connection in seconds.
Further Reference: https://pypi.org/project/bing-image-downloader/
If you want to download fewer than 100 images per query string, google-images-download will work better than bing-image-downloader. It handles errors better and, frankly, Google Images gives noticeably better results than the Bing equivalent.
However, if you're trying to download more than 100 images, google-images-download will give you a lot of headaches. As mentioned in this answer, Google changed their end, and because of this the repo is having a lot of failures (more info on the status of the situation here).
So, if you want to download thousands of images, use bing-image-downloader:
Install the package from pip:
pip install bing-image-downloader
Run query.
NOTE: The documentation seems to be incorrect, as importing the package as from bing_image_downloader import downloader returns a "No module found" error (as mentioned in this answer). Import it and use it like this:
from bing_image_downloader.downloader import download
query_string = 'muscle cars'
download(query_string, limit=1000, output_dir='dataset', adult_filter_off=True, force_replace=False, timeout=60, verbose=True)
Another easy way to download any number of images:
pip install simple_image_download
from simple_image_download import simple_image_download as simp

response = simp.simple_image_download
response().download(a, b)  # a: subject string, b: number of images
where a is a string of the subject you want to download and b is the number of images you want to download.
I'm using Selenium with Python (3.5) to programmatically explore a site. One step of this exploration includes scrolling to the bottom of a given page, and I have chosen to do so with jQuery as follows, where driver is the webdriver object and scrollloadtime is the amount of time I want the scrolling to take:
driver.execute_script("$('html, body').animate({scrollTop: $(document).height() - $(window).height()}, %s);" % scrollloadtime)
This is where things get weird. When I run this code in a test environment (a VM running Kali Linux), I have no issues with it -- I've never once had a problem with this line in that environment.
However, when I attempt to run the exact same code with the exact same package versions (listed below) on the exact same webpage inside a Docker container running Debian Stretch, I get the following error:
Message: TypeError: $(...).animate is not a function
I'd like to figure out why this is happening rather than just find a workaround. It's driving me insane!
I'm certainly no jQuery expert, but from the research I've done on the above error, it normally occurs when an old or minified jQuery build is in use (the slim builds, for example, omit effects such as animate). What I can't figure out is how that ties into Selenium or even Python itself.
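One quick way to test that theory is to ask each environment which jQuery the page actually exposes; a minimal diagnostic sketch, reusing the driver object from the question:
# run this in both environments and compare the results;
# jQuery.fn.jquery is the standard way to read the loaded version
version = driver.execute_script(
    "return window.jQuery ? jQuery.fn.jquery : 'no jQuery on page';")
print(version)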
Things I have tried, to no avail:
Installed the jquery-related packages that exist in my test environment but did not exist within the Docker image (i.e. all libjs-jquery* packages) on the Docker image.
Attempted to inject jQuery into the page before running the script, which triggered the site's DDoS protection. (Additionally, this shouldn't be necessary, because the jQuery script worked without any injection in the test environment.)
Attempted to exchange the initial $('html, body') for a defined variable (var x = document.getElementsByTagName('html')[0]; x.animate(...)), though I will be the first to admit that I might not have done so correctly.
Versions:
Python 3.5
Selenium (Python) 3.141.0
Geckodriver 0.24.0
Firefox ESR 68.1.0
Debian Stretch and Kali Linux
Any assistance or troubleshooting guidance would be greatly appreciated. Let me know if I can provide any additional information.
I am working on a one-file Python project.
I integrated the Google Cloud API for real-time speech streaming and recognition.
It works well with the python aaa.py command.
Now I need a Windows build file (.exe), so I used the pyinstaller program and got an aaa.exe file successfully.
But I got this error while running speech streaming through the Google Cloud API:
[Errno 2] No such file or directory:
'D:\AI\ai\dist\AAA\google\cloud\gapic\speech\v1\speech_client_config.json'
So I copied the speech_client_config.json file into the needed path, after which I got the error below:
Exception in 'grpc._cython.cygrpc.ssl_roots_override_callback' ignored
E0511 01:13:14.320000000 3108 src/core/lib/security/security_connector/security_connector.cc:1170]
assertion failed: pem_root_certs != nullptr
I cannot find a solution to get a working version with the google-cloud API.
I am using Python version 2.7.14.
I need your friendly help.
Thanks.
I had the same problem. If you are willing to distribute roots.pem with your executable (just search for the file -- it should be buried deep within the installation directory of grpcio), I had luck fixing this by setting the GRPC_DEFAULT_SSL_ROOTS_FILE_PATH environment variable to the full path of that roots.pem file.
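For instance, a minimal sketch that sets the variable before grpc is initialized (the roots.pem path is a placeholder for wherever your grpcio install keeps it):
import os

# must be set before grpc creates its first secure channel
os.environ["GRPC_DEFAULT_SSL_ROOTS_FILE_PATH"] = r"C:\path\to\roots.pem"  # placeholder path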
Update 2021
To anyone who is experiencing this issue: I got it working thanks to these amazing people. See the full conversation on this GitHub issue.
Step 1
Credits to @cbenhagen & @rising-stark on this GitHub link.
A PyInstaller hook called hook-grpc.py would do the trick.
Create a python file named hook-grpc.py with this code:
from PyInstaller.utils.hooks import collect_data_files

# bundle grpc's data files (including roots.pem) into the frozen app
datas = collect_data_files('grpc')
Step 2
Put the hook-grpc.py file in the \site-packages\PyInstaller\hooks directory of the Python environment you are running on. So basically you can find it at
C:\Users\yourusername\AppData\Local\Programs\Python\Python37\Lib\site-packages\PyInstaller\hooks
Note: just change yourusername and Python37 to your respective username and the Python version you are using. For Anaconda users it might be different; check this site to find the path of the Anaconda Python environment you are using.
Step 3
Once you've done that, you can convert your .py Python program to a .exe with pyinstaller and it should work.
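For the aaa.py script from the question, one plausible invocation would simply be (adjust flags to your project):
pyinstaller aaa.py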
This looks to me like an SSL credentials mistake. I think you are not being allowed into GC. Check this code snippet and this documentation.
I have been struggling to figure out why I keep getting errors trying to use Selenium. I'm using a local install of Anaconda3 on my /home/user Unix drive at the company I work for. I already pip-installed selenium, seemingly without issue, but when I try the following:
from selenium import webdriver
driver = webdriver.Firefox()
it fails with the following message:
WebDriverException: Message: Expected browser binary location, but unable to find binary in default location, no 'moz:firefoxOptions.binary' capability provided, and no binary flag set on the command line
I've tried downloading the most recent chromedriver and using that; I've tried installing another geckodriver; I've tried all kinds of things. But nothing is working. I'm happy to provide any amount of additional information; I just want to get this off the ground at some point...
Thank you!
from selenium import webdriver

# point executable_path at your geckodriver location
path = r'C:\yourgeckodriverpath\geckodriver.exe'
driver = webdriver.Firefox(executable_path=path)
Alright, through a combination of the responses to this question, I have figured out what (I think) went wrong. I was using a Linux Anaconda install on my company's servers, which (I believe) meant my Python had no access to a browser driver. The solution was, sadly, to install Anaconda locally, manually download/unzip/install selenium and geckodriver, and then make sure I pass the whole "executable_path=path" parameter to the Firefox method. This didn't work for Chrome for some reason, which I'll assume has something to do with the unchangeable security specifications on my work machine. If any part of this doesn't sound right, feel free to chime in and shed more light on the issue. Thanks!
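Since the original error complains about the browser binary rather than the driver, a hedged sketch that pins down both locations explicitly may also help (both paths are placeholders; the API shown matches the Selenium 3.141.0 version from this thread):
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.binary_location = "/path/to/firefox"  # placeholder: the Firefox binary itself

# placeholder: the geckodriver executable
driver = webdriver.Firefox(executable_path="/path/to/geckodriver", options=options)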