I'm building a web scraper to automate the process of downloading tweet data using Selenium and the headless Chrome browser.
I've written a function which logs into Twitter, navigates to the analytics page and downloads the CSV file. Is there any way to use the pandas.read_csv function to read the CSV from the source directly, without downloading it as an intermediate step? I'm pushing the data to a SQL database and eventually want to schedule this on AWS Lambda, so it would be good if I could eliminate the need to create new files.
Code as follows (twt is how I've instantiated TwitterBrowser() in the if __name__ == "__main__": block):
class TwitterBrowser:
    def __init__(self):
        global LOGIN, PASSWORD, browser
        chrome_options = Options()
        chrome_options.add_argument("--incognito")
        chrome_driver = os.getcwd() + "\\chromedriver.exe"
        browser = webdriver.Chrome(chrome_options=chrome_options, executable_path=chrome_driver)
        parser = ConfigParser()
        parser.read("apikeys.ini")
        LOGIN = parser.get('TWITTER', 'USERNAME')
        PASSWORD = parser.get('TWITTER', 'PASSWORD')

    def get_url(self, url, sec):
        load_page = browser.get(url)
        try:
            WebDriverWait(browser, timeout=sec)
        except TimeoutException:
            print('TIMED OUT!')
        return load_page

    def login(self):
        twt.get_url('https://twitter.com/login', 5)
        browser.find_element_by_xpath('//*[@id="page-container"]/div/div[1]/form/fieldset/div[1]/input').send_keys(LOGIN)
        browser.find_element_by_xpath('//*[@id="page-container"]/div/div[1]/form/fieldset/div[2]/input').send_keys(PASSWORD)
        WebDriverWait(browser, 5)
        browser.find_element_by_xpath('//*[@id="page-container"]/div/div[1]/form/div[2]/button').click()

    def tweet_analytics(self):
        twt.get_url('https://analytics.twitter.com/user/' + LOGIN + '/tweets', 5)
        WebDriverWait(browser, 5)
        browser.find_element_by_xpath('/html/body/div[2]/div/div[2]/div').click()
        WebDriverWait(browser, 5)
        browser.find_element_by_xpath('/html/body/div[5]/div[4]/ul/li[1]').click()
        WebDriverWait(browser, 5)
        browser.find_element_by_xpath('//*[@id="export"]/button/span[2]').click()
        WebDriverWait(browser, 10)
Pandas can read a CSV directly from a URL, as stated here. So I'd get the raw CSV link and read it directly. I'm not sure, though, whether Twitter analytics hosts the raw CSV on their server (like a raw CSV example) or generates the download link and the CSV on the fly, in which case you'd be stuck; the latter is probably the case, as I don't see them hosting unnecessary CSVs.
In case you have to download it, you can then read it from your local download directory with pandas and remove the file afterwards.
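A minimal sketch of both paths, assuming hypothetical file locations (pandas accepts a URL anywhere it accepts a path, but only when the server returns the raw CSV rather than generating it behind a login):
import os
import pandas as pd

# Path 1: the server exposes the raw CSV at a stable URL (placeholder URL).
df = pd.read_csv('https://example.com/tweet_activity.csv')

# Path 2: the browser had to download it first; read the local copy,
# then remove it so no intermediate file lingers (assumed download path).
local_path = os.path.join(os.getcwd(), 'tweet_activity_metrics.csv')
df = pd.read_csv(local_path)
os.remove(local_path)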
Hello everyone. I've got a program that navigates to a webpage and clicks a link to download the PDF document I need, but I want to know if there's a way for Python to pick up the downloaded file's name and upload it to my Google Drive. I don't want to manually type the upload file name, as it will change every time I click a different download link. For example, the current file is invoice_sample-1234, but the next download would be invoice_sample-5678.
How do I cut out the process of typing each invoice name?
Thank you for any help.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
driver.get("myurl.com")
window_before = driver.window_handles[0]
driver.find_element(By.ID, "Invoice_Links").click()
window_after = driver.window_handles[1]
driver.switch_to.window(window_after)
wait = WebDriverWait(driver, 10)
download_button = wait.until(EC.visibility_of_element_located((By.ID, "Download Doc"))).click()

def upload_Drive():
    upload_file_list = ['invoice_sample-1234.pdf']
    for upload_file in upload_file_list:
        gfile = drive.CreateFile({'parents': [{'id': 'Folder'}]})
        gfile.SetContentFile(upload_file)
        gfile.Upload()  # Upload the file.
        print('file Uploaded')

upload_Drive()
I think you can use a random number generator or a timestamp to name the file. The file name is just a string, so in Python it could look something like:
filename = str(int(time.time())) + ".pdf"
I think this should work.
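The catch is that the browser has already saved the file under the site's name, so you still have to find it before you can rename or upload it. A minimal sketch, assuming the downloads land in a known directory (the path here is hypothetical):
import glob
import os
import time

download_dir = os.path.expanduser('~/Downloads')  # assumed download location

# Grab the most recently created PDF in the download directory,
# whatever the site happened to name it.
pdfs = glob.glob(os.path.join(download_dir, '*.pdf'))
latest = max(pdfs, key=os.path.getctime)

# Rename it with a timestamp so each run produces a unique, predictable name.
new_name = os.path.join(download_dir, f'invoice_{int(time.time())}.pdf')
os.rename(latest, new_name)

upload_file_list = [new_name]  # feed this into the existing upload_Drive() loop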
I am trying to write a script to automate job applications on LinkedIn using Selenium and Python.
The steps are simple:
open the LinkedIn page, enter the id and password, and log in;
open https://linkedin.com/jobs, enter the search keyword and location, and click Search (directly opening links like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia gets stuck loading, probably due to the lack of some POST information from the previous page).
The click opens the job search page, but this doesn't seem to update the driver, as it still searches on the previous page.
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
import yaml
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "https://linkedin.com/"
driver.get(url)
content = driver.page_source
stream = open("details.yaml", 'r')
details = yaml.safe_load(stream)
def login():
    username = driver.find_element_by_id("session_key")
    password = driver.find_element_by_id("session_password")
    username.send_keys(details["login_details"]["id"])
    password.send_keys(details["login_details"]["password"])
    driver.find_element_by_class_name("sign-in-form__submit-button").click()

def get_experience():
    return "1%C22"

login()
jobs_url = f'https://www.linkedin.com/jobs/'
driver.get(jobs_url)
keyword = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]")
location = driver.find_element_by_xpath("//input[starts-with(@id, 'jobs-search-box-location-id-ember')]")
keyword.send_keys("python")
location.send_keys("Australia")
driver.find_element_by_xpath("//button[normalize-space()='Search']").click()
WebDriverWait(driver, 10)
# content = driver.page_source
# soup = BeautifulSoup(content)
# with open("a.html", 'w') as a:
# a.write(str(soup))
print(driver.current_url)
driver.current_url returns https://linkedin.com/jobs/ instead of https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia as it should. I have tried printing the page content to a file; it is indeed from the previous jobs page and not from the search page. I have also tried to find elements from the search page, like the experience filter and the Easy Apply button, but the search results in a not-found error.
I am not sure why this isn't working.
Any ideas? Thanks in advance.
UPDATE
It works if I directly open something like https://www.linkedin.com/jobs/search/?f_AL=True&f_E=2&keywords=python&location=Australia, but not https://www.linkedin.com/jobs/search/?f_AL=True&f_E=1%2C2&keywords=python&location=Australia.
The difference between these links is that one of them takes only one value for the experience level while the other takes two values. This means it's probably not a POST-values issue.
You are getting and printing the current URL immediately after clicking the search button, before the page has changed with the response received from the server.
This is why it outputs https://linkedin.com/jobs/ instead of something like https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=python&location=Australia.
WebDriverWait(driver, 10) or wait = WebDriverWait(driver, 20) will not cause any kind of delay the way time.sleep(10) does.
wait = WebDriverWait(driver, 20) only instantiates a wait object, an instance of the WebDriverWait class; nothing actually waits until you call its until() method with an expected condition.
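A minimal sketch of the fix: block until the browser has actually navigated to the results page before reading the URL (url_contains is one of Selenium's built-in expected conditions):
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_xpath("//button[normalize-space()='Search']").click()

# Block for up to 10 seconds until the URL reflects the search results page;
# raises TimeoutException if it never does.
WebDriverWait(driver, 10).until(EC.url_contains("/jobs/search"))
print(driver.current_url)  # now reflects the results page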
I wrote a script to find the download link through a series of clicks: first on the settings gear icon, then on the "Export data" tab, and finally on the "click here to download data" link.
However, when I click on the final link it does not download the data to my specified default directory.
Ideally I would like to download the data directly into a variable, but I couldn't even figure out why the general download wasn't working.
I have tried getting the href from the download link and opening a new tab with that URL, but it still gives me nothing.
URL = 'https://edap.epa.gov/public/single/?appid=73b2b6a5-70c6-4820-b3fa-186ac094f10d&sheet=1e76b65b-dd6c-41fd-9143-ba44874e1f9d'
DELAY = 10
def init_driver(url):
    options = webdriver.chrome.options.Options()
    path = '/Users/X/Applications/chromedriver'
    options.add_argument("--headless")
    options.add_argument("download.default_directory=Users/X/Python/data_scraper/epa_data")
    driver = webdriver.Chrome(chrome_options=options, executable_path=path)
    driver.implicitly_wait(20)
    driver.get(url)
    return driver

def find_settings(web_driver):
    # find the settings gear
    # time.sleep(10)
    try:
        driver_wait = WebDriverWait(web_driver, 10)
        ng_scope = driver_wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "ng-scope")))
        settings = web_driver.find_element_by_css_selector("span.cl-icon.cl-icon--cogwheel.cl-icon-right-align")
        print(settings)
        settings.click()
        # export_data = web_driver.find_elements_by_css_selector("span.lui-list__text.ng-binding")
        # print(web_driver.page_source)
    except Exception as e:
        print(e)
        print(web_driver.page_source)

def get_settings_list(web_driver):
    # find the export button and download data
    menu_item_list = {}
    find_settings(web_driver)
    # print(web_driver.page_source)
    try:
        time.sleep(8)
        print("got menu_items")
        menu_items = web_driver.find_elements_by_css_selector("span.lui-list__text.ng-binding")
        for i in menu_items:
            print(i.text)
            menu_item_list[i.text] = i
    except Exception as e:
        print(e)
    return menu_item_list

def get_export_data(web_driver):
    menu_items = get_settings_list(web_driver)
    print(menu_items)
    export_data = menu_items['Export data']
    export_data.click()
    web_driver.execute_script("window.open();")
    print(driver.window_handles)
    main_window = driver.window_handles[0]
    temp_window = driver.window_handles[1]
    driver.switch_to_window(main_window)
    time.sleep(8)
    download_data = driver.find_element_by_xpath("//a[contains(text(), 'Click here to download your data file.')]")
    download_href = download_data.get_attribute('href')
    print(download_href)
    download_data.click()
    driver.switch_to_window(temp_window)
    driver.get("https://edap.epa.gov" + download_href)
    print(driver.page_source)
driver = init_driver(URL)
#get_settings_list(driver)
get_export_data(driver)
I would like this code to emulate the manual actions of clicking the settings gear icon, then Export data, then the download link, which downloads the data as a CSV (ideally I want to skip the file and put it into a pandas DataFrame, but that's an issue for another time).
For security reasons, Chrome will not allow downloads while running headless. Here's a link to some more information and a possible workaround.
Unless you need to use Chrome, Firefox will allow downloads while headless - albeit with some tweaking.
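If you have to stay with Chrome, the workaround usually cited is to explicitly re-enable downloads through the DevTools protocol once the driver is up. A sketch, assuming a recent Selenium where execute_cdp_cmd is available (the download path is a placeholder):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Headless Chrome blocks downloads by default; this CDP command
# re-enables them and routes files to a directory of your choice.
driver.execute_cdp_cmd('Page.setDownloadBehavior', {
    'behavior': 'allow',
    'downloadPath': '/Users/X/Python/data_scraper/epa_data',  # placeholder
})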
I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (when you scroll down you see Download: txt, xls and xml).
My goal is to create a scraper that will go to the linked page, click on the txt link for example, and save the downloaded file.
The main problems that I am not sure how to solve:
The file doesn't have a real link that I can call to download it; the link is created with JS based on the filters and file type.
When I use the requests library for Python and call the link with all headers, it just redirects me to https://www.ceps.cz/en/all-data.
Approaches tried:
Using a scraper such as ParseHub to download the link didn't work as intended, but this scraper was the closest to what I wanted to get.
Used the requests library to connect to the link using the headers that the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data.
If you could propose some solution for this task, thank you in advance. :-)
You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In what follows below, I'll save the txt data to my desktop:
from selenium import webdriver
download_dir = '/Users/doug/Desktop/'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : download_dir}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')
container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
You can do it like so:
import requests

txt_format = 'txt'
xls_format = 'xls'  # open in binary mode
xml_format = 'xml'  # open in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
I am working with Python and Selenium. I want to download a file via a click event using Selenium. I wrote the following code.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.close()
I want to download both files from the links named "Export Data" on the given URL. How can I achieve this, given that it works with a click event only?
Find the links using find_element(s)_by_*, then call the click method.
from selenium import webdriver
# To prevent download dialog
profile = webdriver.FirefoxProfile()
profile.set_preference('browser.download.folderList', 2) # custom location
profile.set_preference('browser.download.manager.showWhenStarting', False)
profile.set_preference('browser.download.dir', '/tmp')
profile.set_preference('browser.helperApps.neverAsk.saveToDisk', 'text/csv')
browser = webdriver.Firefox(profile)
browser.get("http://www.drugcite.com/?q=ACTIMMUNE")
browser.find_element_by_id('exportpt').click()
browser.find_element_by_id('exporthlgt').click()
Added profile manipulation code to prevent download dialog.
I'll admit this solution is a little more "hacky" than the Firefox Profile saveToDisk alternative, but it works across both Chrome and Firefox, and doesn't rely on a browser-specific feature which could change at any time. And if nothing else, maybe this will give someone a little different perspective on how to solve future challenges.
Prerequisites: Ensure you have selenium and pyvirtualdisplay installed...
Python 2: sudo pip install selenium pyvirtualdisplay
Python 3: sudo pip3 install selenium pyvirtualdisplay
The Magic
import pyvirtualdisplay
import selenium
import selenium.webdriver
import time
import base64
import json
root_url = 'https://www.google.com'
download_url = 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_272x92dp.png'
print('Opening virtual display')
display = pyvirtualdisplay.Display(visible=0, size=(1280, 1024,))
display.start()
print('\tDone')
print('Opening web browser')
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome() # Alternately, give Chrome a try
print('\tDone')
print('Retrieving initial web page')
driver.get(root_url)
print('\tDone')
print('Injecting retrieval code into web page')
driver.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(download_url),
})
print('Looping until file is retrieved')
downloaded_file = None
while downloaded_file is None:
    # Returns the retrieved file, base64 encoded (perfect for downloading binary data)
    downloaded_file = driver.execute_script('return (window.file_contents !== null ? window.file_contents.split(\',\')[1] : null);')
    print(downloaded_file)
    if not downloaded_file:
        print('\tNot downloaded, waiting...')
        time.sleep(0.5)
print('\tDone')
print('Writing file to disk')
fp = open('google-logo.png', 'wb')
fp.write(base64.b64decode(downloaded_file))
fp.close()
print('\tDone')
driver.close() # close web browser, or it'll persist after python exits.
display.popen.kill() # close virtual display, or it'll persist after python exits.
Explanation
We first load a URL on the domain we're targeting for the file download. This allows us to perform an AJAX request on that domain without running into cross-site scripting issues.
Next, we inject some JavaScript into the DOM which fires off an AJAX request. Once the AJAX request returns a response, we load it into a FileReader object. From there we can extract the base64-encoded content of the file by calling readAsDataURL(). We then take the base64-encoded content and append it to window, a globally accessible variable.
Finally, because the AJAX request is asynchronous, we enter a Python while loop waiting for the content to be appended to window. Once it's appended, we decode the base64 content retrieved from the window and save it to a file.
This solution should work across all modern browsers supported by Selenium, whether the file is text or binary and whatever its MIME type.
Alternate Approach
While I haven't tested this, Selenium does afford you the ability to wait until an element is present in the DOM. Rather than looping until a globally accessible variable is populated, you could create an element with a particular ID in the DOM once the download completes, and use the presence of that element as the trigger to retrieve the downloaded file, as sketched below.
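A minimal sketch of that variant, assuming the same injected XHR as above but signalling completion by appending a marker element (the element ID is made up for illustration):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# In the injected JS, after reader.onloadend stores the result, also run:
#   var marker = document.createElement('div');
#   marker.id = 'download-complete-marker';
#   document.body.appendChild(marker);

# Then, instead of a sleep loop, let Selenium do the waiting:
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'download-complete-marker'))
)
downloaded_file = driver.execute_script(
    "return window.file_contents.split(',')[1];")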
In Chrome, what I do is download the files by clicking on the links, then open the chrome://downloads page and retrieve the downloaded files list from the shadow DOM like this:
docs = document
.querySelector('downloads-manager')
.shadowRoot.querySelector('#downloads-list')
.getElementsByTagName('downloads-item')
This solution is restricted to Chrome, but the data also contains information like the file path and download date. (Note: this snippet is JavaScript meant to run inside the browser, not Python syntax.)
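From Python you'd navigate to chrome://downloads and run that snippet through execute_script. A sketch with a caveat: the shadow-DOM layout of chrome://downloads is a browser internal that has changed between Chrome versions, and the inner '#file-link' selector in particular is an assumption:
driver.get('chrome://downloads')

# Collect the file paths the downloads page lists; the selectors mirror
# the JS above and may need adjusting for your Chrome version.
paths = driver.execute_script("""
    var items = document
        .querySelector('downloads-manager')
        .shadowRoot.querySelector('#downloads-list')
        .getElementsByTagName('downloads-item');
    return Array.from(items).map(function(item) {
        return item.shadowRoot.querySelector('#file-link').text;
    });
""")
print(paths)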
Here is the full working code. You can use web scraping to enter the username, password and other fields. To get the names of the fields appearing on the webpage, use Inspect Element. An element (username, password or click button) can be located through its class or name.
from selenium import webdriver

# Using Chrome to access the web
options = webdriver.ChromeOptions()
# Set the download path via Chrome preferences (an add_argument string
# like "download.default_directory=C:/Test" is silently ignored)
prefs = {'download.default_directory': 'C:/Test'}
options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(options=options)

# Open the website
try:
    driver.get('xxxx')  # Your website address
    password_box = driver.find_element_by_name('password')
    password_box.send_keys('xxxx')  # Password
    download_button = driver.find_element_by_class_name('link_w_pass')
    download_button.click()
    driver.quit()
except:
    driver.quit()
    print("Faulty URL")