Running a Selenium webscraper in AWS Lambda using Docker

I am trying to create a simple Python Lambda app with the SAM CLI that gets the number of followers for a particular handle. I have looked at all the tutorials and blog posts I could find and still have not been able to make it work.
The build and local test work fine with sam build and sam local invoke; however, after deployment to AWS Lambda the function throws the following error.
Any ideas how to solve this?
{
"errorMessage": "Message: unknown error: Chrome failed to start: crashed.\n (chrome not reachable)\n (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)\n",
"errorType": "WebDriverException",
"stackTrace": [
" File \"/var/task/app.py\", line 112, in main\n data = Twitter(StockInfo_list)\n",
" File \"/var/task/app.py\", line 36, in Twitter\n driver = webdriver.Chrome(executable_path=chromedriver_path, options=chrome_optionsdata)\n",
" File \"/var/task/selenium/webdriver/chrome/webdriver.py\", line 76, in __init__\n RemoteWebDriver.__init__(\n",
" File \"/var/task/selenium/webdriver/remote/webdriver.py\", line 157, in __init__\n self.start_session(capabilities, browser_profile)\n",
" File \"/var/task/selenium/webdriver/remote/webdriver.py\", line 252, in start_session\n response = self.execute(Command.NEW_SESSION, parameters)\n",
" File \"/var/task/selenium/webdriver/remote/webdriver.py\", line 321, in execute\n self.error_handler.check_response(response)\n",
" File \"/var/task/selenium/webdriver/remote/errorhandler.py\", line 242, in check_response\n raise exception_class(message, screen, stacktrace)\n"
]
}
I'm using the following as my Dockerfile
FROM public.ecr.aws/lambda/python:3.8
# Update repository and install unzip
RUN yum update -y
RUN yum install unzip -y
# Download and install Google Chrome
RUN curl https://intoli.com/install-google-chrome.sh | bash
# Download and install ChromeDriver
RUN CHROME_DRIVER_VERSION=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE` && \
wget -O /tmp/chromedriver.zip https://chromedriver.storage.googleapis.com/$CHROME_DRIVER_VERSION/chromedriver_linux64.zip && \
unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/
RUN echo $(chromedriver --version)
# Upgrade PIP
RUN /var/lang/bin/python3.8 -m pip install --upgrade pip
# Install requirements (including selenium)
COPY app.py requirements.txt ./
RUN python3.8 -m pip install -r requirements.txt -t .
# Command can be overwritten by providing a different command in the template directly.
CMD ["app.main"]
My application looks like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
import os
import time, datetime
import pandas as pd
import csv
# import mysql.connector
from datetime import date, datetime as dt1
def Twitter(twitter_stock_id):
    twitterlist = []
    stockids = []
    twitterlist = twitter_stock_id["TwitterUrl"].str.lower().tolist()
    stockids = twitter_stock_id["stockid"].str.lower().tolist()
    chrome_optionsdata = webdriver.ChromeOptions()
    prefs = {"profile.default_content_setting_values.notifications": 2}
    chrome_optionsdata.add_experimental_option("prefs", prefs)
    chrome_optionsdata.add_argument("--headless")
    chrome_optionsdata.add_argument("--no-sandbox")
    chrome_optionsdata.add_argument("--disable-dev-shm-usage")
    chrome_optionsdata.add_argument("--disable-gpu")
    chrome_optionsdata.add_argument("--disable-gpu-sandbox")
    chromedriver_path = "/usr/local/bin/chromedriver"
    driver = webdriver.Chrome(executable_path=chromedriver_path, options=chrome_optionsdata)
    lfollowers = []
    for url in twitterlist:
        tempurl = driver.get(url)
        time.sleep(10)
        followers = driver.find_elements_by_class_name("r-qvutc0")
        for fol in followers:
            if "Followers" in fol.text:
                tempstr = fol.text.split(" ")
                lfollowers.append(tempstr[0])
        time.sleep(5)
    lmaindata = []
    for fld in lfollowers:
        if fld != "Followers":
            lmaindata.append(fld)
    print("Followers" + str(lmaindata))
    driver.quit()
    return f"Followers: {lmaindata}"
import json
def main(event, context):
    StockInfo_list = pd.DataFrame([{"TwitterUrl": "https://twitter.com/costco", "stockid": "COST"}])
    data = Twitter(StockInfo_list)
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "hello", "data": data}),
    }

There were three key reasons why this script didn't work:
Lambda only allows writes to the /tmp/ folder
The executables were not in a location on PATH
Chromium's dependencies were missing
To fix this,
I adapted a shell script (copied into /tmp/ during the build) that downloads a specific, mutually compatible pair of Chromium and ChromeDriver versions and installs them under /opt/.
#!/usr/bin/bash
declare -A chrome_versions
# Enter the list of browsers to be downloaded
### Using Chromium as documented here - https://www.chromium.org/getting-involved/download-chromium
chrome_versions=( ['89.0.4389.47']='843831' )
chrome_drivers=( "89.0.4389.23" )
#firefox_versions=( "86.0" "87.0b3" )
#gecko_drivers=( "0.29.0" )
# Download Chrome
for br in "${!chrome_versions[@]}"
do
echo "Downloading Chrome version $br"
mkdir -p "/opt/chrome/stable"
curl -Lo "/opt/chrome/stable/chrome-linux.zip" \
"https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F${chrome_versions[$br]}%2Fchrome-linux.zip?alt=media"
unzip -q "/opt/chrome/stable/chrome-linux.zip" -d "/opt/chrome/stable/"
mv /opt/chrome/stable/chrome-linux/* /opt/chrome/stable/
rm -rf /opt/chrome/stable/chrome-linux "/opt/chrome/stable/chrome-linux.zip"
done
# Download Chromedriver
for dr in "${chrome_drivers[@]}"
do
echo "Downloading Chromedriver version $dr"
mkdir -p "/opt/chromedriver/stable/"
curl -Lo "/opt/chromedriver/stable//chromedriver_linux64.zip" \
"https://chromedriver.storage.googleapis.com/$dr/chromedriver_linux64.zip"
unzip -q "/opt/chromedriver/stable//chromedriver_linux64.zip" -d "/opt/chromedriver/stable/"
chmod +x "/opt/chromedriver/stable/chromedriver"
rm -rf "/opt/chromedriver/stable/chromedriver_linux64.zip"
done
echo "Chrome & Chromedriver installed"
Changed the Dockerfile to the following
FROM public.ecr.aws/lambda/python:3.8 as base
# Hack to install chromium dependencies
RUN yum install -y -q unzip
RUN yum install -y https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
# Install Chromium
COPY install-browser.sh /tmp/
RUN /usr/bin/bash /tmp/install-browser.sh
#FROM public.ecr.aws/lambda/python:3.8
# Install Python dependencies for function
COPY requirements.txt /tmp/
RUN pip install --upgrade pip -q
RUN pip install -r /tmp/requirements.txt -q
COPY app.py ./
CMD [ "app.handler" ]
Finally, the missing dependencies are also resolved in this Dockerfile: installing the Chrome RPM directly pulls in the 122 packages that Chrome needs to run.
I've put this in a GitHub repository and explained the steps in a blog post.
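The three failure modes listed above can be checked before the browser is launched. A minimal sketch using only the standard library; the function name preflight is hypothetical, and which paths to pass depends on where your image installs the binaries:

```python
import os
import tempfile


def preflight(binaries, scratch_dir="/tmp"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for path in binaries:
        if not os.path.isfile(path):
            problems.append(f"missing: {path}")
        elif not os.access(path, os.X_OK):
            problems.append(f"not executable: {path}")
    # Lambda only lets the function write under /tmp, so confirm that works.
    try:
        with tempfile.TemporaryFile(dir=scratch_dir):
            pass
    except OSError as exc:
        problems.append(f"cannot write to {scratch_dir}: {exc}")
    return problems
```

Calling it at the top of the handler with, for example, "/opt/chromedriver/stable/chromedriver" (the install location used by the script above) and logging the returned list makes a deployment problem visible in CloudWatch before Chrome crashes with a less specific message.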

Related

How to run Chrome Headless in Docker Container with Selenium?

I am trying to run a simple test file that opens google.com in Chrome inside an openjdk Docker container and returns "Test Completed Successfully" when done. However, I keep receiving the same error saying that the 'Service' object has no attribute 'process'. This is the error I keep receiving:
Traceback (most recent call last):
File "/NewJersey/test.py", line 60, in <module>
print(main())
^^^^^^
File "/NewJersey/test.py", line 42, in main
driver = webdriver.Chrome(service = service, options=chrome_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
super().__init__(
File "/usr/local/lib/python3.11/dist-packages/selenium/webdriver/chromium/webdriver.py", line 103, in __init__
self.service.start()
File "/usr/local/lib/python3.11/dist-packages/selenium/webdriver/common/service.py", line 106, in start
self.assert_process_still_running()
File "/usr/local/lib/python3.11/dist-packages/selenium/webdriver/common/service.py", line 117, in assert_process_still_running
return_code = self.process.poll()
^^^^^^^^^^^^
AttributeError: 'Service' object has no attribute 'process'
This is the code I am running:
#General Imports
from logging import error
import os
import sys
import time
import os.path
import random
#Selenium Imports (Chrome)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
#ChromeDriver Import
from webdriver_manager.chrome import ChromeDriverManager
def main():
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-gpu")
    service = ChromeService("/chromedriver")
    driver = webdriver.Chrome(service = service, options=chrome_options)
    try:
        completion_msg = reroute(driver)
        print(completion_msg)
        driver.close()
        return "Test Completed Successfully"
    except error as Error:
        return Error
def reroute(driver):
    driver.get("https://www.google.com")
    return "Success"
if __name__ == "__main__":
    print(main())
This is my Dockerfile:
# syntax=docker/dockerfile:1
FROM openjdk:11
ENV PATH = "${PATH}:/chromedriver/chromedriver.exe"
RUN apt-get update && apt-get install -y \
software-properties-common \
unzip \
curl \
xvfb \
wget \
bzip2 \
snapd
# Chrome
RUN apt-get update && \
apt-get install -y gnupg wget curl unzip --no-install-recommends && \
wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - && \
echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list && \
apt-get update -y && \
apt-get install -y google-chrome-stable && \
CHROMEVER=$(google-chrome --product-version | grep -o "[^\.]*\.[^\.]*\.[^\.]*") && \
DRIVERVER=$(curl -s "https://chromedriver.storage.googleapis.com/LATEST_RELEASE_$CHROMEVER") && \
wget -q --continue -P /chromedriver "http://chromedriver.storage.googleapis.com/$DRIVERVER/chromedriver_linux64.zip" && \
unzip /chromedriver/chromedriver* -d /chromedriver
# Python
RUN apt-get update && apt-get install -y \
python2.7 \
python-setuptools \
python3-pip
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . .
CMD python3 test.py
When I first started the project I tried Firefox, but due to certain limitations I switched to Chrome.
From my research, the common suggestions were to pass the chromedriver path to the Service object and to add the chromedriver path to PATH in the Docker container. I have already done both, as shown above, and I still get exactly the same error.
I haven't been able to find any other solutions, so I would greatly appreciate any help!

In case anyone else stumbles across this with a similar problem, this is how I solved it.
I simply removed the Service object entirely. For whatever reason, the Service object was either misconfigured or not needed at all once I had added the ChromeDriver path to the system PATH in the Dockerfile. The code snippet now reads:
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)
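Without an explicit driver path, the driver has to be discoverable, typically through PATH. A small sketch for checking what a PATH lookup would actually resolve inside the container; resolve_driver is a hypothetical helper built on the standard library's shutil.which:

```python
import os
import shutil


def resolve_driver(name="chromedriver", path=None):
    """Return the full path a PATH lookup would find for `name`, or None.

    `path` defaults to the current PATH; pass a string to try other values.
    """
    return shutil.which(name, path=path)
```

Note that PATH entries are directories, not files. With the unzip target from the Dockerfile above, the entry to add would be /chromedriver; an entry that names the binary itself (or a Windows-style chromedriver.exe) makes resolve_driver() return None.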

Running undetected_chromedriver on AWS Lambda: Message: 'e39098076c0be4f2_chromedriver' executable needs to be in PATH

I'm trying to deploy a Python automation script on AWS Lambda using a Docker image. I'm fairly sure I did everything right regarding PATH and the installation, but when I run it on AWS I get this odd error message:
"errorMessage": "Message: 'e39098076c0be4f2_chromedriver' executable needs to be in PATH."
What is weird is that there is always a random string in front of chromedriver. The usual error message would be:
"errorMessage": "Message: 'chromedriver' executable needs to be in PATH."
The complete log for the error I get:
{
"errorMessage": "Message: 'e39098076c0be4f2_chromedriver' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home\n",
"errorType": "WebDriverException",
"requestId": "107a17da-65d4-4bec-a5dc-1e69c1b63fe7",
"stackTrace": [
" File \"/home/app/app.py\", line 60, in lambda_handler\n driver = uc.Chrome(executable_path=r\"/tmp/chromedriver\")\n",
" File \"/home/app/undetected_chromedriver/__init__.py\", line 409, in __init__\n super(Chrome, self).__init__(\n",
" File \"/home/app/selenium/webdriver/chrome/webdriver.py\", line 69, in __init__\n super().__init__(DesiredCapabilities.CHROME['browserName'], \"goog\",\n",
" File \"/home/app/selenium/webdriver/chromium/webdriver.py\", line 89, in __init__\n self.service.start()\n",
" File \"/home/app/selenium/webdriver/common/service.py\", line 81, in start\n raise WebDriverException(\n"
]
}
I have the executable on PATH, but its name is just chromedriver.
My Dockerfile:
# Define global args
ARG FUNCTION_DIR="/home/app/"
ARG RUNTIME_VERSION="3.10"
ARG DISTRO_VERSION="3.16"
# Stage 1 - bundle base image + runtime
# Grab a fresh copy of the image and install GCC
FROM python:${RUNTIME_VERSION}-alpine${DISTRO_VERSION} AS python-alpine
# Install GCC (Alpine uses musl but we compile and link dependencies with GCC)
RUN apk add --no-cache \
libstdc++
# Stage 2 - build function and dependencies
FROM python-alpine AS build-image
# Install aws-lambda-cpp build dependencies
RUN apk add --no-cache \
build-base \
libtool \
autoconf \
automake \
libexecinfo-dev \
make \
cmake \
libcurl \
curl \
gcc \
g++
# Include global args in this stage of the build
ARG FUNCTION_DIR
ARG RUNTIME_VERSION
# Create function directory
RUN mkdir -p ${FUNCTION_DIR}
# Copy required files
COPY patcher.py ${FUNCTION_DIR}
COPY app.py ${FUNCTION_DIR}
COPY requirements.txt .
COPY edit_excutable.py ${FUNCTION_DIR}
# Optional – Install the function's dependencies
RUN python${RUNTIME_VERSION} -m pip install --upgrade pip
RUN python${RUNTIME_VERSION} -m pip install -r requirements.txt --target ${FUNCTION_DIR}
# Fix undetected_chromedriver to use in lambda
RUN cd ${FUNCTION_DIR} && cp -f patcher.py ${FUNCTION_DIR}/undetected_chromedriver
# Install Lambda Runtime Interface Client for Python
RUN python${RUNTIME_VERSION} -m pip install awslambdaric --target ${FUNCTION_DIR}
# Stage 3 - final runtime image
# Grab a fresh copy of the Python image
FROM python-alpine
# Include global arg in this stage of the build
ARG FUNCTION_DIR
# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}
# Copy in the built dependencies
COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR}
COPY edit_excutable.py /usr/bin/
RUN apk add chromium-chromedriver
RUN wget https://chromedriver.storage.googleapis.com/107.0.5304.62/chromedriver_linux64.zip
#RUN python${RUNTIME_VERSION} edit_excutable.py
RUN cp /usr/bin/chromedriver ${FUNCTION_DIR}
# (Optional) Add Lambda Runtime Interface Emulator and use a script in the ENTRYPOINT for simpler local runs
ADD https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie /usr/bin/aws-lambda-rie
COPY entry.sh /
RUN chmod 755 /usr/bin/aws-lambda-rie /entry.sh
ENTRYPOINT [ "/entry.sh" ]
CMD [ "app.lambda_handler" ]
And here is the snippet of my code that generates the error:
import os
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver.v2 as uc
import subprocess
import shutil
import time
BIN_DIR = "/tmp/bin"
CURR_BIN_DIR = os.getcwd()
def lambda_handler(event, context):
    if not os.path.exists(BIN_DIR):
        print("Creating bin folder")
        os.makedirs(BIN_DIR)
    os.environ["PATH"] += os.pathsep + BIN_DIR
    os.environ["PATH"] += os.pathsep + CURR_BIN_DIR
    print (os.environ)
    chrome_options = uc.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--window-size=1024x768')
    chrome_options.add_argument('--user-data-dir=/tmp/user-data')
    chrome_options.add_argument('--hide-scrollbars')
    chrome_options.add_argument('--enable-logging')
    chrome_options.add_argument('--log-level=0')
    chrome_options.add_argument('--single-process')
    chrome_options.add_argument('--data-path=/tmp/data-path')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--homedir=/tmp')
    chrome_options.add_argument('--disk-cache-dir=/tmp/cache-dir')
    chrome_options.binary_location = "/tmp/chromedriver"
    options = {'request_storage_base_dir': '/tmp' }
    os.system("cp ./chromedriver /tmp/chromedriver")
    os.chmod("/tmp/chromedriver", 0o777)
    driver = uc.Chrome(executable_path=r"/tmp/chromedriver", chrome_options=chrome_options)
I tried different versions of the runtime, chromedriver, and the distro.

I solved the problem in the code by using the driver_executable_path argument instead of executable_path.
I also changed the Dockerfile to install the Chromium browser first and then chromedriver:
RUN apk add chromium
RUN apk add chromium-chromedriver
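The handler in the question copies chromedriver into /tmp with os.system before starting the browser. A sketch of the same staging step using the standard library; stage_driver is a hypothetical helper, and the trailing comment shows where the driver_executable_path argument from the answer would be used:

```python
import os
import shutil
import stat


def stage_driver(src, dst_dir="/tmp/bin"):
    """Copy the driver binary into a writable directory and mark it executable."""
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    shutil.copy2(src, dst)
    mode = os.stat(dst).st_mode
    os.chmod(dst, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return dst


# In the handler (requires undetected_chromedriver, which is not imported here):
#   driver_path = stage_driver("/usr/bin/chromedriver")
#   driver = uc.Chrome(driver_executable_path=driver_path, options=chrome_options)
```

Unlike os.system("cp ..."), shutil.copy2 raises an exception if the source file is missing, so a wrong path fails at the copy step instead of surfacing later as a PATH error.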

Configure selenium webdriver in docker container

I have the following code:
fox = webdriver.Firefox(executable_path=GeckoDriverManager().install())
try:
    fox.get(url+search+id)
    image = BytesIO(fox.find_element_by_tag_name('table').screenshot_as_png)
    image.name = id + '.png'
except:
    fox.close()
fox.close()
return image
It works perfectly on Windows 10 in my local environment, but this code is just one part of a project, and I use Docker to deploy it. The Dockerfile:
FROM python:slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD [ "python", "./main.py" ]
All I get after running the Python code above is:
====== WebDriver manager ======
Current firefox version is 91.5
Get LATEST geckodriver version for 91.5 firefox
Driver [/root/.wdm/drivers/geckodriver/linux64/v0.30.0/geckodriver] found in cache

I made it work with this configuration:
opts = FirefoxOptions()
opts.add_argument("--headless")
fox = webdriver.Firefox(executable_path=GeckoDriverManager().install(), options=opts)
I also added this line to the Dockerfile:
RUN apt-get update && apt-get install firefox-esr -y
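Separately from the headless fix, the try/except in the question calls fox.close() twice on failure and then returns an image that was never assigned. One possible shape for the cleanup, sketched with a stand-in driver class so it runs without a browser; with_driver and FakeDriver are hypothetical names, and quit() is used because it ends the whole session rather than closing a single window:

```python
def with_driver(make_driver, action):
    """Create a driver, run `action(driver)`, and always shut the driver down."""
    driver = make_driver()
    try:
        return action(driver)
    finally:
        # Runs on success and on failure, so the browser is never leaked.
        driver.quit()


class FakeDriver:
    """Stand-in for webdriver.Firefox, used only to demonstrate the pattern."""

    def __init__(self):
        self.closed = False

    def get(self, url):
        self.url = url

    def quit(self):
        self.closed = True


driver = FakeDriver()
result = with_driver(lambda: driver, lambda d: d.get("https://example.com") or "done")
print(result, driver.closed)
```

With Selenium, make_driver would be something like lambda: webdriver.Firefox(options=opts), and action would take the screenshot and return the BytesIO object.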

How do I write my Dockerfile to include chromedriver?

I am new to Dockerfiles as well as Selenium. I am working on web scraping with Selenium and taking a screenshot, and I am trying to dockerize it. Similar questions have been answered before, but those answers did not solve my error. FYI, I am using a Windows laptop.
The screenshot code works on my local machine, but the Docker image gives me errors.
I am trying to use chromedriver version 89.0.4389.82.
This is my UPDATED Dockerfile:
FROM python:3.6
RUN pip install --upgrade pip && pip install pytest && pip install pytest-mock && pip install pytest-smtp && pip install mock && \
pip install schedule && pip install selenium && pip install Selenium-Screenshot && pip install python-dateutil
# For running code
COPY src/screenshotcode.py /
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list
RUN apt-get update -y
RUN apt-get install -y google-chrome-stable
RUN apt-get install libxi6 libgconf-2-4 -y
ENV CHROMEDRIVER_VERSION 2.19
ENV CHROMEDRIVER_DIR /chromedriver
RUN mkdir -p $CHROMEDRIVER_DIR
# Download and install Chromedriver
RUN wget -q --continue -P $CHROMEDRIVER_DIR "http://chromedriver.storage.googleapis.com/$CHROMEDRIVER_VERSION/chromedriver_linux64.zip"
RUN unzip $CHROMEDRIVER_DIR/chromedriver* -d $CHROMEDRIVER_DIR
# Put Chromedriver into the PATH
ENV PATH $CHROMEDRIVER_DIR:$PATH
CMD [ "python", "screenshotcode.py" ]
My screenshot code,
import time
from Screenshot import Screenshot_Clipping
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from email_it import email_it
from environmental_variables import environmental_variables
from error_alert_email import error_alert_email
from selenium import webdriver
def screenshot():
    ob=Screenshot_Clipping.Screenshot()
    chrome_options = Options()
    chrome_options.add_argument('--start-maximized')
    chrome_options.add_argument('--start-fullscreen')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(executable_path = r"C:\Users\me\Documents\Projects\chromedriver.exe")
    print('taking screenshot...')
    img_url=ob.full_Screenshot(driver, path = path, image_name = label)
    print('closing driver...')
    driver.close()
screenshot()
EDIT: I get the following error
PS C:\Users\me\Documents\Projects\> docker run screenshot
File "scheduler.py", line 16, in <module>
from screenshot import screenshot
File "/screenshotcode.py", line 72, in <module>
screenshot()
File "/screenshotcode.py", line 32, in screenshot
driver = webdriver.Chrome()
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
desired_capabilities=desired_capabilities)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
self.start_session(capabilities, browser_profile)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally
(Driver info: chromedriver=2.19.346067 (6abd8652f8bc7a1d825962003ac88ec6a37a82f1),platform=Linux 5.4.72-microsoft-standard-WSL2 x86_64)

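Your code points at the chromedriver on your Windows machine:
driver = webdriver.Chrome(executable_path = r"C:\Users\me\Documents\Projects\chromedriver.exe")
but inside the container that path does not exist. The Dockerfile shown unpacks the driver into /chromedriver (CHROMEDRIVER_DIR),
so you need to change your code to point at the container path and pass the options object; as written, chrome_options is never used, so --headless and --no-sandbox are not applied:
driver = webdriver.Chrome(executable_path = "/chromedriver/chromedriver", options = chrome_options)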
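To avoid hard-coding a path that exists on only one machine, the script could derive the location from the environment. A sketch, assuming the CHROMEDRIVER_DIR variable that the Dockerfile in the question exports; chromedriver_path is a hypothetical helper:

```python
import os


def chromedriver_path(environ=None):
    """Return the driver location inside the container, else rely on PATH."""
    environ = os.environ if environ is None else environ
    directory = environ.get("CHROMEDRIVER_DIR")
    if directory:
        # Set by `ENV CHROMEDRIVER_DIR /chromedriver` in the Dockerfile.
        return os.path.join(directory, "chromedriver")
    # Outside the container, fall back to a bare name resolved via PATH.
    return "chromedriver"
```

The same script then runs unchanged on the laptop (where the variable is unset and the driver is found on PATH) and in the container.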

Pass arguments to scrapy spider through docker run

I have a Scrapy + Selenium spider packaged in a Docker container. I want to run that container while passing some arguments to the spider. However, for some reason I receive a strange error message. I searched extensively and tried many different options before submitting this question.
Dockerfile
FROM python:2.7
# install google chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add -
RUN sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list'
RUN apt-get -y update
RUN apt-get install -y google-chrome-stable
# install chromedriver
RUN apt-get install -yqq unzip
RUN wget -O /tmp/chromedriver.zip http://chromedriver.storage.googleapis.com/`curl -sS chromedriver.storage.googleapis.com/LATEST_RELEASE`/chromedriver_linux64.zip
RUN unzip /tmp/chromedriver.zip chromedriver -d /usr/local/bin/
# install xvfb
RUN apt-get install -yqq xvfb
# install pyvirtualdisplay
RUN pip install pyvirtualdisplay
# set display port and dbus env to avoid hanging
ENV DISPLAY=:99
ENV DBUS_SESSION_BUS_ADDRESS=/dev/null
#install scrapy
RUN pip install --upgrade pip && \
pip install --upgrade \
setuptools \
wheel && \
pip install --upgrade scrapy
# install selenium
RUN pip install selenium==3.8.0
# install xlrd
RUN pip install xlrd
# install bs4
RUN pip install beautifulsoup4
ADD . /tralala/
WORKDIR tralala/
CMD scrapy crawl personel_spider_mpc -a chunksNo=$chunksNo -a chunkI=$chunkI
I suspect the problem is in the CMD part.
Spider init part:
class Crawler(scrapy.Spider):
    name = "personel_spider_mpc"
    allowed_domains = ['tralala.de',]
    def __init__(self, vdisplay = True, **kwargs):
        super(Crawler, self).__init__(**kwargs)
        self.chunkI = chunkI
        self.chunksNo = chunksNo
How I run the container:
docker run --env chunksNo='10' --env chunkI='1' ostapp/tralala
I tried both with and without quotation marks.
The error message:
2018-04-04 16:42:32 [twisted] CRITICAL:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 98, in crawl
six.reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 79, in crawl
self.spider = self._create_spider(*args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/scrapy/crawler.py", line 102, in _create_spider
return self.spidercls.from_crawler(self, *args, **kwargs)
File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 51, in from_crawler
spider = cls(*args, **kwargs)
File "/tralala/tralala/spiders/tralala_spider_mpc.py", line 673, in __init__
self.chunkI = chunkI
NameError: global name 'chunkI' is not defined

Your arguments are stored in kwargs, which is just a dictionary: each key is an argument name and each value is that argument's value. It does not define local names for you, which is why you get the NameError.
For more details, see this answer
In your specific case, try self.chunkI = kwargs['chunkI'] and self.chunksNo = kwargs['chunksNo']
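An alternative to indexing kwargs is to name the arguments in the signature. A sketch, assuming that each -a name=value pair reaches __init__ as a string keyword argument; the class below is a plain stand-in rather than a real scrapy.Spider so that it runs without Scrapy, and in the real spider super().__init__(**kwargs) would still be called:

```python
class Crawler:
    """Stand-in for the scrapy.Spider subclass, so the sketch runs without Scrapy."""

    name = "personel_spider_mpc"

    def __init__(self, chunksNo=None, chunkI=None, **kwargs):
        # Values passed with `-a name=value` arrive as strings.
        if chunksNo is None or chunkI is None:
            raise ValueError("pass -a chunksNo=... and -a chunkI=...")
        self.chunksNo = int(chunksNo)
        self.chunkI = int(chunkI)


spider = Crawler(chunksNo="10", chunkI="1")
print(spider.chunksNo, spider.chunkI)
```

Naming the parameters also gives a clear error when the container is started without the two --env values, instead of a failure deeper in the crawl.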
