I'm pretty new to Python and just completed the 'Automate the Boring Stuff with Python' course. I have a script that works well for going to a site, grabbing all the data I need, and printing it to my console. I'm having a problem, though, with how to actually save/export that data to a file. For now I'd like to be able to export it to a .txt or a .csv file. Any help is appreciated, as I can't find a straightforward answer on the web. I just need that last step to complete my project, thanks!
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Chrome()
def getTen():
    # Opens the browser and navigates to the page
    browser.get('http://taxsales.lgbs.com/map?lat=29.437693458470175&lon=-98.4618145&zoom=9&offset=0&ordering=sale_date,street_name,address_full,uid&sale_type=SALE,RESALE,STRUCK%20OFF,FUTURE%20SALE&county=BEXAR%20COUNTY&state=TX&in_bbox=-99.516502,28.71637426279382,-97.407127,30.153924134433552')
    # Waits until the page is loaded, then clicks the accept button on the popup window
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[2]/div/div/div[2]/button[1]"))).click()
    # Loops 10 times, changing the listing number to correspond with i
    for i in range(1, 11):
        clickable = "/html/body/ui-view/div[2]/main/aside/div[2]/property-listing[" + str(i) + "]/article/div/a"
        # Waits until the page is loaded, then clicks the view more details button on the listing
        WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, clickable))).click()
        # Waits until the page is loaded, then pulls all the info from the top section of the page
        info = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "/html/body/ui-view/div/main/div/div/div[2]/property-detail-info/dl[1]")))
        # Prints info to the console
        print(info.text)
        # Goes back a page and repeats the process for the next listing
        browser.back()

getTen()
If you are trying to save info.text, you can just open a local file and write to it. For example:
with open('output.txt', 'w') as f:
    f.write(info.text)
More info on reading and writing to files can be found here: Reading and Writing Files in Python
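If you would rather have a .csv, one approach is to open the file once before the loop and write one row per listing with the standard csv module. A minimal sketch adapted from the getTen() loop above (the two-column layout is an assumption; info.text is a multi-line blob, so it is flattened into a single cell here):

import csv

def get_ten_to_csv():
    # Open the file once, before the loop, so all ten rows end up in one CSV
    with open('output.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['listing', 'details'])  # header row
        for i in range(1, 11):
            clickable = "/html/body/ui-view/div[2]/main/aside/div[2]/property-listing[" + str(i) + "]/article/div/a"
            WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, clickable))).click()
            info = WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, "/html/body/ui-view/div/main/div/div/div[2]/property-detail-info/dl[1]")))
            # Flatten the multi-line text into one cell; split it further if you want one column per field
            writer.writerow([i, info.text.replace('\n', ' | ')])
            browser.back()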
You can simply take the text you print to the console and use the code below to save it to a .txt file.
Here output.txt is the file name you want to save to, and element.text is the element text you want to save. Note that mode 'w' overwrites the file every time it is opened; use 'a' instead if you are writing inside a loop.
with open('output.txt', 'w') as f:
    f.write(element.text)
I am trying to download the daily report from the NSE India website using Selenium and Python.
My approach to download the daily report:
The website loads with no data.
After some time, the page is loaded with the report information.
Once the page is loaded with the report data, "table[@id='etfTable']" appears.
An explicit wait is added in the code to wait till "table[@id='etfTable']" loads.
Code for explicit wait
element=WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
Extract the onclick event using xpath
downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
Trigger the click to download the file
Full code
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options =webdriver.ChromeOptions();
prefs={"download.default_directory":"/Volumes/Project/WebScraper/downloadData"};
options.binary_location=r'/Applications/Google Chrome 2.app/Contents/MacOS/Google Chrome'
chrome_driver_binary =r'/usr/local/Caskroom/chromedriver/94.0.4606.61/chromedriver'
options.add_experimental_option("prefs",prefs)
driver =webdriver.Chrome(chrome_driver_binary,options=options)
try:
    #driver.implicity_wait(10)
    driver.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
    element =WebDriverWait(driver,50).until(EC.visibility_of_element_located(By.xpath,"//table[@id='etfTable']"))
    downloadcsv= driver.find_element_by_xpath("//div[@id='esw-etf']/div[2]/div/div[3]/div/ul/li/a")
    print(downloadcsv)
    downloadcsv.click()
    time.sleep(5)
    driver.close()
except:
    print("Invalid URL")
The issue I am facing:
The page keeps on loading through Selenium, but when the site is opened without Selenium the daily report loads fine.
(Screenshots: normal load vs. loading via Selenium)
I am not able to download the daily report.
There are some syntax errors in the program, like semicolons on a few lines, and the tuple brackets are missing when finding the element with WebDriverWait.
Try it like below and confirm.
You can use JavaScript to click on that element; a JavaScript click also works when a normal .click() is intercepted or the element is not considered clickable.
driver.get("https://www.nseindia.com/market-data/exchange-traded-funds-etf")
element = WebDriverWait(driver,50).until(EC.visibility_of_element_located((By.XPATH,"//table[@id='etfTable']/tbody/tr[2]")))
downloadcsv = driver.find_element_by_xpath("//img[@title='csv']/parent::a")
print(downloadcsv)
driver.execute_script("arguments[0].click();",downloadcsv)
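For completeness, a self-contained sketch of the corrected script (the ChromeDriver and download paths are copied from the question, so treat them as placeholders for your own setup):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

options = webdriver.ChromeOptions()
# Download directory and driver path from the question; adjust for your machine
prefs = {"download.default_directory": "/Volumes/Project/WebScraper/downloadData"}
options.add_experimental_option("prefs", prefs)
chrome_driver_binary = '/usr/local/Caskroom/chromedriver/94.0.4606.61/chromedriver'
driver = webdriver.Chrome(chrome_driver_binary, options=options)

driver.get("https://www.nseindia.com/market-data/exchange-traded-funds-etf")
# Wait for an actual data row, not just the empty table shell
WebDriverWait(driver, 50).until(
    EC.visibility_of_element_located((By.XPATH, "//table[@id='etfTable']/tbody/tr[2]")))
downloadcsv = driver.find_element_by_xpath("//img[@title='csv']/parent::a")
# A JavaScript click works even when a normal .click() is intercepted
driver.execute_script("arguments[0].click();", downloadcsv)
time.sleep(5)  # crude wait for the download to finish
driver.quit()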
It's not an issue with your code, it's an issue with the website. When I checked, most of the time it did not allow me to click on the CSV file. Instead of downloading the CSV file, you can scrape the table:
from time import sleep

from bs4 import BeautifulSoup

# For direct access to the page, deleting cookies is very important; otherwise the site denies access
browser.delete_all_cookies()
browser.get('https://www.nseindia.com/market-data/exchange-traded-funds-etf')
sleep(5)
soup = BeautifulSoup(browser.page_source, 'html.parser')
# scrape the table from the soup
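To finish that last step, here is a minimal sketch that walks the etfTable rows and writes them to a CSV file (the exact cell layout is an assumption; inspect the table to confirm which columns you need):

import csv

table = soup.find('table', id='etfTable')
with open('etf_report.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Header cells, if the table provides them
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    if headers:
        writer.writerow(headers)
    # One CSV row per table row; skip rows without data cells
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:
            writer.writerow(cells)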
I'm new at Python and Selenium. I'm trying to do something (which I'm sure I'm going about in a very round-about way) and any help is greatly appreciated.
The page I'm trying to parse has different cards that need to be clicked on; I need to go to each card and, from there, grab the name (h1) and the URL. I haven't gotten very far, and this is what I have so far.
I go through the first page, grab all the URLs, and add them to a list. Then I want to go through the list, go to each URL (opening a new tab), and from there grab the h1 and URL. It doesn't seem like I'm even able to grab the h1; it opens a new tab, then hangs, then opens the same tab.
Thank you in advance!
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise//')  # main URL

title_links = driver.find_elements_by_css_selector('ul.n4 a')
urls = []  # list of URLs
# main = driver.find_elements_by_id('enterprise-list')
for item in title_links:
    urls.append(item.get_attribute('href'))
# print(urls)

for url in urls:
    driver.execute_script("window.open('');")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(url)
    print(driver.find_element_by_css_selector('div.info h1'))
Well, there are a few issues here:
You should be much more specific with your selector for grabbing URLs. Right now you get multiple copies of the same URL, which is why it opens the same pages again.
You should give the site enough time to load before trying to grab objects; that may be why it's hanging, and it's always good to be on the safe side before grabbing objects.
You have to shift focus back to the original page to continue iterating over the list.
You don't need to inject JS just to open an empty tab and then load the URL with a separate Python call; one window.open with the URL is cleaner.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://zdb.pedaily.cn/enterprise/')  # main URL

# Be much more specific or you'll get multiple returns of the same link
urls = driver.find_elements(By.CSS_SELECTOR, 'ul.n4 li div.img a')

for url in urls:
    # Get href to print
    print(url.get_attribute('href'))
    # Inject JS to open a new tab directly on the link (an anchor element stringifies to its href)
    driver.execute_script("window.open(arguments[0])", url)
    # Switch focus to the newest tab
    driver.switch_to.window(driver.window_handles[-1])
    # Make sure what we want has time to load and exists before trying to grab it
    WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.info h1')))
    # Grab it and print its contents
    print(driver.find_element(By.CSS_SELECTOR, 'div.info h1').text)
    # Uncomment the next line to do one tab at a time. Will reduce speed but not use so much RAM.
    # driver.close()
    # Focus back on the first window
    driver.switch_to.window(driver.window_handles[0])

# Close the browser
driver.quit()
I've been learning how to use Selenium to parse data and I've been doing alright with that process. So I'm trying something different: I found data to parse, but the page provides an export button, which sounds like a quicker solution, so I thought I'd have a stab at it. But I'm not quite understanding why it's not working:
browser = webdriver.Chrome()
url = 'https://www.rotowire.com/football/injury-report.php'
browser.get(url)
button = browser.find_elements_by_xpath('//*[@id="injury-report"]/div[2]/div[2]/button[2]')
button.click()
browser.close()
I just want to click on the export CSV button on the page.
Also, I haven't looked yet, but my next step would be to specify where to save the CSV file it exports. Right now it defaults to the Downloads folder. Is there a way to specify a location without changing the default? Also, is there a way to specify a file name?
Try the code below to click the required button:
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Chrome()
url = 'https://www.rotowire.com/football/injury-report.php'
browser.get(url)
button = wait(browser, 10).until(EC.element_to_be_clickable((By.CLASS_NAME, "is-csv")))
button.click()
browser.close()
Also check how to save the file to a specific folder; a sketch follows below.
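A minimal sketch of pointing Chrome's downloads at a folder of your choice through profile preferences. The folder path and the exported file name here are assumptions; as far as I know, Chrome does not let Selenium choose the file name, so renaming the file after the download finishes is a common workaround:

import os
import time

from selenium import webdriver

download_dir = r"C:\scrapes\rotowire"  # example path, pick your own
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": download_dir})
browser = webdriver.Chrome(options=options)

# ... navigate and click the export button as shown above ...

time.sleep(5)  # crude; in real code, poll the folder until the file appears
exported = os.path.join(download_dir, "injury-report.csv")  # assumed name; check what the site actually saves
if os.path.exists(exported):
    os.rename(exported, os.path.join(download_dir, "my-injuries.csv"))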
I'm currently working on a research project in which we are trying to collect saved image files from Brazil's Hemeroteca database. I've done web scraping on PHP pages before using C/C++ with HTML forms, but as this is a shared script, I need to switch to python such that everyone in the group can use this tool.
The page which I'm trying to scrape is: http://bndigital.bn.gov.br/hemeroteca-digital/
There are three forms which populate, the first being the newspaper/journal. Upon selecting this, the available times populate, and the final field is the search term. I've inspected the HTML page here and the three IDs of these are respectively: 'PeriodicoCmb1_Input', 'PeriodoCmb1_Input', and 'PesquisaTxt1'.
Some Google searches on this topic led me to the Selenium package, and I've put together this sample code to attempt to read the page:
import webbrowser
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
print("Begin...")
browser = webdriver.Chrome()
url = "http://bndigital.bn.gov.br/hemeroteca-digital/"
browser.get(url)
print("Waiting to load page... (Delay 3 seconds)")
time.sleep(3)
print("Searching for elements")
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
print(journal)
print("Set fields, delay 3 seconds between input")
search_journal = "Relatorios dos Presidentes dos Estados Brasileiros (BA)"
search_timeRange = "1890 - 1899"
search_text = "Milho"
journal.send_keys(search_journal)
time.sleep(3)
timeRange.send_keys(search_timeRange)
time.sleep(3)
searchTerm.send_keys(search_text)
print("Perform search")
submitButton = browser.find_element_by_id("PesquisarBtn1_input")
submitButton.click()
The script runs to the print(journal) statement, where an error is thrown saying the element cannot be found.
Can anyone take a quick sweep of the page in question and make sure I've got the general premise of this script in line correctly, or point me towards some examples to get me running on this problem?
Thanks!
The DOM elements you are trying to find are located in an iframe, so before using the find_element_by_id API you should switch to the iframe context.
Here is the code for switching to the iframe context:
# add your code
frame_ref = browser.find_elements_by_tag_name("iframe")[0]
browser.switch_to.frame(frame_ref)
journal = browser.find_element_by_id("PeriodicoCmb1_Input")
timeRange = browser.find_element_by_id("PeriodoCmb1_Input")
searchTerm = browser.find_element_by_id("PesquisaTxt1")
# add your code
Here is a link describing switching to the iframe context.
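One more thing worth knowing: when you are done inside the iframe (for example, to interact with elements rendered in the parent page), switch back out with the standard call:

# Return to the top-level document after working inside the iframe
browser.switch_to.default_content()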
I have a project where I am uploading photos via ng-file-upload and I need to automate the upload process with Selenium WebDriver in Python.
my HTML element looks like this:
<element
ngf-pattern="image/*" accept="image/*"
ngf-max-size="10MB" ngf-max-height="1000" ngf-select="addPhoto($index, $file)">
...</element>
Uploading the file definitely works when doing it manually. But I cannot find a way to automate it using Selenium in Python. I have tried finding the element, then sending the keys of the image's absolute path, but that just puts the absolute path in the browser's search field (as if I had typed Command+F on a Mac).
Note that there is no
<input type="file"/>
with this method of uploading a file.
Any ideas how to do this in Python using Selenium? Thanks!
There has to be a hidden file input which is implicitly responsible for the file upload. For instance, the angular-file-upload DEMO page has one hidden at the bottom of the page.
Here is a working example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://angular-file-upload.appspot.com/")
# wait for the upload box to be visible
wait = WebDriverWait(driver, 10)
element = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[ngf-select]")))
# get a model
model = element.get_attribute("ng-model")
# find a file input by model
file_input = driver.find_element_by_css_selector('input[ngf-select][ng-model="%s"]' % model)
# upload an image
file_input.send_keys("/absolute/path/to/dr-evil-and-minion-laughing.png")
Result: the image is uploaded and appears on the demo page (screenshot omitted).
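One caveat in case send_keys fails on your page: some drivers refuse to interact with a file input that is styled display: none. A commonly used workaround (the path here is a placeholder for your own file) is to make the input visible first:

# Make the hidden file input interactable before sending the path
driver.execute_script("arguments[0].style.display = 'block';", file_input)
file_input.send_keys("/absolute/path/to/your-image.png")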