Download data using selenium - python

I am a research analyst trying to collate data and perform analysis. I need data from this page, specifically the categories from Abrasives to Vanaspati Oils (you'll find them on the left side). I run into problems like this all the time, and I figured that Selenium should be able to handle it, but I am stuck on how to download the data into Excel. I need one Excel sheet for each category.
My exact technical question is: how do I download the table data? I did a little background research (from here) and understood that the data can be extracted if the table has a class name. I see that the table has class="tbldata14 bdrtpg", so I used that in my code.
I got this error
InvalidSelectorException: Message: The given selector tbldata14 bdrtpg
is either invalid or does not result in a WebElement.
How can I download this table data? Point me to any references that I can read and solve this problem.
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.moneycontrol.com/stocks/marketinfo/netprofit/bse/index.html")
elem=driver.find_element_by_class_name("tbldata14 bdrtpg")
Thanks in advance. Also, please suggest any simpler alternative (I tried copy and paste; it is too tedious!).

Fetching the data you're interested in can be done as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "http://www.moneycontrol.com/stocks/marketinfo/netprofit/bse/index.html"
# Get table cells that contain an anchor or text
xpath = "//table[@class='tbldata14 bdrtpg']//tr//td[child::a|text()]"
driver = webdriver.Firefox()
driver.get(url)
data = driver.find_elements(By.XPATH, xpath)
# Group the output so that each row contains 5 elements
rows = [data[x:x + 5] for x in range(0, len(data), 5)]
for r in rows:
    print("Company {}, Last Price {}, Change {}, % Change {}, Net Profit {}"
          .format(r[0].text, r[1].text, r[2].text, r[3].text, r[4].text))
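As for the InvalidSelectorException in the question: a class-name locator accepts a single class name, and "tbldata14 bdrtpg" is two classes separated by a space. One way around it is a CSS selector that chains both classes. The helper below is a minimal sketch of that conversion; `css_for_classes` is a hypothetical name introduced only for illustration.

```python
def css_for_classes(tag, class_attr):
    """Build a CSS selector matching `tag` elements that carry every class in `class_attr`."""
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    # "tbldata14 bdrtpg" becomes ".tbldata14.bdrtpg"
    return tag + "".join("." + name for name in classes)


selector = css_for_classes("table", "tbldata14 bdrtpg")
print(selector)  # table.tbldata14.bdrtpg
```

The resulting string can then be passed to `driver.find_element(By.CSS_SELECTOR, selector)`.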
Writing the data to an Excel file is explained here:
Python - Write to Excel Spreadsheet
Python, appending printed output to excel file
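For a quick start, here is a minimal sketch that groups the cell texts into rows and writes one file per category. It assumes you have already collected the `.text` of each cell into a flat list; the sample values and the names `group_rows`, `write_category` and `HEADER` are made up for illustration. It writes CSV using only the standard library, which Excel opens directly; if you need a real .xlsx file, the links above cover that.

```python
import csv

# Column names taken from the print statement in the answer above.
HEADER = ["Company", "Last Price", "Change", "% Change", "Net Profit"]


def group_rows(cells, width=5):
    """Split a flat list of cell texts into rows of `width`, dropping a trailing partial row."""
    usable = len(cells) - len(cells) % width
    return [cells[i:i + width] for i in range(0, usable, width)]


def write_category(path, cells):
    """Write the grouped rows to a CSV file and return the number of data rows."""
    rows = group_rows(cells)
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        writer.writerows(rows)
    return len(rows)


# Made-up sample values standing in for the scraped cell texts.
sample_cells = [
    "Example Co A", "100.00", "1.50", "1.52", "12.30",
    "Example Co B", "250.10", "-2.00", "-0.79", "45.60",
]
print(group_rows(sample_cells))
```

Calling `write_category("abrasives.csv", sample_cells)` once per category gives one file per category.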

Related

WebScraping: Pandas to_excel Not Displaying full DataFrame

I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I've already had to make several adjustments to the code in this tutorial, but I digress. I'm scraping off of http://books.toscrape.com/ and, when I try to export a Dataframe of just the book categories into Excel, I get a couple of issues. Note that, when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.
First, the row holding the data appears not to exist even though it is displayed, so I have to tab across each column to get past the data shown in Excel's default window size.
Second, it only displays the first 9 results (the first being "Books," and the other 8 being the first 8 categories).
Image of desired scrape section
Here is my code:
s = Service('C:/Users/.../.../chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://books.toscrape.com/'
browser.get(url)
results = []
content = browser.page_source
soup = BeautifulSoup(content)
# changes from the tutorial due to recommendations
# from StackOverflow, based on similar questions
# from error messages popping up when using original
# formatting; tutorial is outdated.
for li in soup.find_all(attrs={'class': 'side_categories'}):
    name = element.find('li')
    if name not in results:
        results.append(name.text)
df = pd.DataFrame({'Categories': results})
df.to_excel('categories.xlsx', index=False)
# per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html
# encoding is Deprecated and apparently wasn't
# needed for any excel writer other than xlwt,
# which is no longer maintained.
Images of results:
View before tabbing further in the columns
End of the displayed results
What can I do to fix this?
Edit: Many apologies, I didn't realize I had copied an older, incorrect version of my code blocks. Should be correct now.
The code in the question will not create a DataFrame as posted: the loop variable is li, but the loop body calls element.find('li'), and element is never defined. Beyond that, you should select your elements more specifically, for example with CSS selectors:
for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
results = []
soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content, 'html.parser')
for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))
pd.DataFrame({'Categories': results})
Output
   Categories
0  Books
1  Travel
2  Mystery
3  Historical Fiction
4  Sequential Art
5  Classics
6  Philosophy
...
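The `not in results` check works fine for a list this short. As a side note, the same clean-up can be written without the explicit membership test, since a dict keeps its keys in insertion order. This is only a sketch on made-up sample strings; `unique_in_order` is a hypothetical helper name.

```python
def unique_in_order(names):
    """Strip whitespace, drop empty strings and duplicates, keep first-seen order."""
    cleaned = (name.strip() for name in names)
    # dict.fromkeys keeps the first occurrence of each key, in order
    return list(dict.fromkeys(name for name in cleaned if name))


sample = ["Books", "  Travel ", "Mystery", "Travel", "", "Books"]
print(unique_in_order(sample))  # ['Books', 'Travel', 'Mystery']
```

The returned list can go straight into `pd.DataFrame({'Categories': ...})` as in the example above.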

How can I scrape JSON data from a HTML page source?

I'm trying to pull some data from an online music database. In particular, I want to pull this data, which you can find with CTRL+F: "isrc":"GB-FFM-19-08534".
view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534
I'm using Python and Selenium and have tried to locate the data via things like tag, xpath and id, but nothing seems to be working.
I haven't seen this x:y format before and some searching makes me think it's in a JSON format.
Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).
My code so far ...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os
# Access AudioNetwork and search for tracks.
path = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.audionetwork.com/track/searchkeyword")
search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)
music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")
music_link.click()
I know I need to make better waits / probably other issues with the code, but any ideas on how to grab that ISRC number?
You want to extract the entire script as JSON data (which can be read as a dictionary in python) and search for the "isrc" parameter.
The following code uses selenium in order to extract the script content inside the page, parse it as json and print the "isrc" value to the terminal.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')
content_as_dict = json.loads(content)
print(content_as_dict['props']['pageProps']['track']['isrc'])
driver.close()
driver.quit()
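The positional XPath above depends on the JSON script being the first script in the body. If that order ever differs between pages, a more tolerant variant is to scan every script in the page source (for example `driver.page_source`) and keep the ones that parse as JSON. The sketch below does that with the standard library only; the sample HTML is made up, and `ScriptCollector` and `find_json_scripts` are hypothetical names.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def find_json_scripts(page_source):
    """Return the parsed content of each script body that is a JSON object."""
    collector = ScriptCollector()
    collector.feed(page_source)
    collector.close()
    parsed = []
    for body in collector.scripts:
        try:
            value = json.loads(body)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript, not JSON
        if isinstance(value, dict):
            parsed.append(value)
    return parsed


# Made-up page source that mirrors the key path used in the answer above.
SAMPLE = (
    "<html><body>"
    "<script>window.dataLayer = [];</script>"
    '<script type="application/json">'
    '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'
    "</script>"
    "</body></html>"
)
blobs = find_json_scripts(SAMPLE)
print(blobs[0]["props"]["pageProps"]["track"]["isrc"])  # GB-FFM-19-08534
```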
Yes, this is JSON. It's actually JSON wrapped inside an HTML script tag. It's essentially a set of "key": "value" pairs, so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.
Python has a library for parsing JSON, I think you might want this - https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.
If you wanted to find the value of isrc, you could do:
import json
... # your code here
jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]
Replace someValueHere with the JSON string that you're parsing and that should help. isrc is nested, though, so it's not quite that simple. A Python dict has no dotted-path lookup, so jsonString["track.isrc"] won't work. The path you're looking for is props.pageProps.track.isrc, and you index one layer at a time:
jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
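If you'd rather not assign a variable per layer, the lookups can be folded into a small helper. This is a sketch, not a built-in feature: `dig` is a hypothetical name, and the sample JSON is made up to mirror the path described above.

```python
import json
from functools import reduce


def dig(data, dotted_path):
    """Follow a dotted key path through nested dicts; raises KeyError if a key is missing."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), data)


sample = json.loads('{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}')
print(dig(sample, "props.pageProps.track.isrc"))  # GB-FFM-19-08534
```

Plain chained indexing, `data["props"]["pageProps"]["track"]["isrc"]`, does the same thing when the path is fixed.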

How to export a list to web page text box using selenium webdriver in python?

My task is to get data from Google Sheets and put those values into a web page text box using Python.
My Google Sheet has 450 rows of comma-separated values.
I need to put all 450 rows into the web page text box using Selenium's send_keys().
##getting data from google sheets.
scope = ['https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('json',scope)
client = gspread.authorize(credentials)
workbook = client.open_by_url("https://docs.googlesheet") # using google sheet here
sheet1 = workbook.worksheet("Sheet1")
##converted sheet1 data into a dataframe called dera.
dera = gd.get_as_dataframe(sheet1, evaluate_formulas=True, skiprows=0, has_header=True)
##from dera dataframe reading 'names' column and removing null values.
del = dera[['names']].dropna()
##Converted my dataframe into list- I have read it will be easy to put list(z) values in send keys
z = del['names'].values.tolist()
Selenium code:
driver = webdriver.Chrome(executable_path="/Users/naveenbabudadla/Documents/automation/chromedriver")
driver.get("https://google.com/") # using google.com as example
driver.find_element_by_xpath("//div[text()='Maximum 5000 names'] /..//textarea").send_keys(z) ## got stuck here.
time.sleep(2)
I am not able to pass "z" to Selenium's send_keys correctly.
Can someone help me with this?
Let's set aside everything before z = del['names'].values.tolist() and assume it produces the list you expect. One thing to change regardless: del is a reserved keyword in Python and cannot be used as a variable name, so rename that variable.
It seems that z is a list; if it is a list of strings, it looks like ['value1', 'value2']. If you want to send it to the browser as is, convert it to a string with str(z). If you want to send a single value from the list, send z[0]. These are all guesses on my part, though; you should post the exact stack trace of your error.
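If the text box expects one name per line rather than the list's printed form, joining the items into a single string first is another option. This is a guess at what the page wants: the newline separator, the sample values and the helper name `as_textarea_value` are all assumptions.

```python
def as_textarea_value(values, separator="\n"):
    """Join list items into one string, skipping None and blank entries."""
    parts = [str(v).strip() for v in values if v is not None]
    return separator.join(p for p in parts if p)


# Made-up stand-in for the list read from the sheet.
z = ["alice,bob", "carol", None, "  ", "dave"]
payload = as_textarea_value(z)
print(repr(payload))  # 'alice,bob\ncarol\ndave'
```

The resulting string is then sent in one call, e.g. `textarea.send_keys(payload)`, where `textarea` is the element found by your XPath.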

Python Selenium Web Scrape embedded excel in XPATH to pandas frame convert logic need

My requirement: after logging in to a website using the Python Selenium webdriver, there is an embedded CSV file at a particular XPath.
I could download the CSV file to a local folder using the code below.
content = driver.find_element_by_xpath('//*[@id=":n"]/div').click()
I want to read this CSV in Python and convert it to a pandas DataFrame directly. I tried a few methods and none of them work. How can this CSV file be converted to a DataFrame directly from the XPath, for internal data processing? A Selenium-only method that converts it to a DataFrame without downloading the file is also fine.
Error code 1:
content1 = driver.find_element_by_xpath('//*[@id=":l"]/div').click()
content1 = pd.read_csv('content1.text')
Error code 2:
content1 = pd.read_csv('driver.find_element_by_xpath('//*[@id=":l"]/div').click()')
Note: I do not want to download the file and locate it in order to convert it with pd.read_csv().
The method should use the Selenium webdriver; I do not want to use requests or soup to convert a table to the DataFrame.
Also, if the embedded file is Excel (xls or xlsx), how do I make a DataFrame from it?
Your support is very much appreciated. thank you!
UPDATE: I tried the code below as well; it still prints only "All/Selected to CSV File", and nothing is converted to a pandas DataFrame.
Any expert advice to resolve this issue?
content = driver.find_element_by_xpath('//*[@id=":l"]/div')
content = content.text.split('\n')
content = pd.DataFrame(content)
print(content)
This prints "All/Selected to CSV File". Note that the XPath given selects the CSV file.
Update 2:
I found the exact URL from which the CSV file is downloaded and tried some code, but it still fails. How can I at least read the CSV from this URL?
Note: for security reasons the URL is altered; the one given here is only for reference.
url='https://el.hm.com/cg-in/export?EK=45002&FORMAT=2&nocache=1533623732089&TID=507686'
c=driver.get(url)
c=pd.read_csv(io.StringIO(c.decode('utf-8')))
Error: AttributeError: 'NoneType' object has no attribute 'decode'
After you click() on the element you need to wait for the download to succeed and then get the downloaded file name. This answer has code snippets for Chrome and Firefox drivers. Your code would look like this for Chrome driver:
import time
import pandas as pd
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time() + waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except Exception:
            # the downloads page may not be ready yet; retry until the timeout
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
# Initialize the driver and open the web page
# from which you wish to download the file
# Click to download and wait for it to finish
content = driver.find_element_by_xpath('//*[@id=":n"]/div').click()
# Load the csv dataframe. Replace read_csv with read_excel if it's an excel file
df = pd.read_csv(getDownLoadedFileName(60))
To load an Excel file as a DataFrame you can use pd.read_excel().
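A different technique from the chrome://downloads approach above, if you know which folder the browser downloads into, is to poll that folder until a new, finished file shows up. The sketch below assumes in-progress downloads carry a temporary suffix such as .crdownload (Chrome) or .part (Firefox); `wait_for_download` and its parameters are hypothetical names.

```python
import os
import time

# Suffixes browsers commonly use for downloads that are still in progress.
TEMP_SUFFIXES = (".crdownload", ".part", ".tmp")


def wait_for_download(directory, known_files, timeout=60.0, poll=0.5):
    """Return the path of a finished file that was not in `known_files`, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        for name in sorted(os.listdir(directory)):
            if name in known_files or name.endswith(TEMP_SUFFIXES):
                continue
            return os.path.join(directory, name)
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

In use, you would record `before = set(os.listdir(download_dir))`, click the element, then call `wait_for_download(download_dir, before)` and hand the returned path to `pd.read_csv()`, checking for None first.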

Use Python script to extract data from excel sheet and input into website

Hey, I am a Python scripting novice. I was wondering if someone could help me understand how I could use Python, or any convenient scripting language, to cherry-pick specific data values (just a few arbitrary columns of an Excel sheet) and input that data into a web browser like Chrome. Just a general idea of how this should work would be extremely helpful, and if there is an API available, that would be great to know.
Any advice is appreciated. Thanks.
Okay. Let's get started.
Getting values out of an excel document
This page is a great place to start as far as reading excel values goes.
Adapted from the site, with parameter names updated for current pandas:
import pandas as pd
table = pd.read_excel('sales.xlsx',
                      sheet_name='Month 1',
                      header=0,
                      index_col=0,
                      usecols="A, C, G")
print(table)
Inputting values into browser
Inputting values into the browser can be done quite easily with Selenium, a library for controlling browsers from code that has Python bindings.
Also directly from the site:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element(By.NAME, "q")
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
assert "No results found." not in driver.page_source
driver.close()
How to start your project
Now that you have the building blocks, you can put them together.
Example steps:
1. Open the file, open the browser, and do any other initialization you need.
2. Read values from the Excel document.
3. Send the values to the browser.
4. Repeat steps 2 and 3 for each remaining row.
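The steps above can be sketched as a small loop. This is only an outline under assumptions: rows are plain dicts (for example from `table.to_dict("records")`), the column names are made up, and `send` stands for whatever function types the values into the page. `pick_columns` and `submit_all` are hypothetical names.

```python
def pick_columns(rows, wanted):
    """Keep only the `wanted` keys from each row dict, skipping rows where any is blank."""
    picked = []
    for row in rows:
        values = [row.get(col) for col in wanted]
        if all(v not in (None, "") for v in values):
            picked.append(values)
    return picked


def submit_all(rows, wanted, send):
    """Call send(values) once per usable row and return how many rows were sent."""
    usable = pick_columns(rows, wanted)
    for values in usable:
        send(values)
    return len(usable)


# Made-up rows standing in for the spreadsheet contents.
rows = [
    {"name": "Widget", "qty": 3, "note": "x"},
    {"name": "", "qty": 1, "note": "y"},
    {"name": "Gadget", "qty": 7, "note": "z"},
]
sent = []
count = submit_all(rows, ["name", "qty"], sent.append)
print(count, sent)  # 2 [['Widget', 3], ['Gadget', 7]]
```

Here `sent.append` just records what would be sent; in a real script `send` would locate the form fields and call `send_keys` on them.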
