How can I scrape JSON data from an HTML page source? - python

I'm trying to pull some data from an online music database. In particular, I want to pull this data, which you can find with CTRL+F -- "isrc":"GB-FFM-19-08534".
view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534
I'm using Python and Selenium and have tried to locate the data via things like tag, XPath and ID, but nothing seems to be working.
I haven't seen this x:y format before, and some searching makes me think it's JSON.
Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).
My code so far ...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
# Access AudioNetwork and search for tracks.
path = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes aren't treated as escapes
driver = webdriver.Chrome(path)
driver.get("https://www.audionetwork.com/track/searchkeyword")
search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)
music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")
music_link.click()
I know I need better waits, and there are probably other issues with the code, but any ideas on how to grab that ISRC number?

You want to extract the entire script as JSON data (which can be read as a dictionary in Python) and look up the "isrc" key.
The following code uses Selenium to extract the script content from the page, parse it as JSON, and print the "isrc" value to the terminal.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")
# The first <script> in the <body> holds the page data as JSON
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')
content_as_dict = json.loads(content)
print(content_as_dict['props']['pageProps']['track']['isrc'])
driver.quit()  # quit() ends the whole session; a separate close() first is redundant
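If the site is built with Next.js - which the props.pageProps structure suggests, though that's an assumption worth checking in the page source - the same JSON normally sits in a script tag with id __NEXT_DATA__. Selecting by that id is less brittle than an absolute XPath, and if the site serves the same HTML to plain HTTP requests you don't even need Selenium; a minimal sketch:
import json
import requests
from bs4 import BeautifulSoup

url = "https://www.audionetwork.com/browse/m/track/purple-beat_1008534"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# "__NEXT_DATA__" is the conventional Next.js id -- verify it exists on this page
script = soup.find("script", id="__NEXT_DATA__")
data = json.loads(script.string)
print(data["props"]["pageProps"]["track"]["isrc"])
Because the lookup is by key path rather than by value, the same code works for any track page, whatever its isrc value.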

Yes, this is JSON format. It's actually JSON wrapped inside an HTML script tag. It's essentially a "key": "value" pair - so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.
Python has a library for parsing JSON; I think you might want this - https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.
If you wanted to find the value of isrc, you could do:
import json
... # your code here
jsonString = json.loads(someValueHere)  # json.loads parses the JSON text into a Python dict
isrcValue = jsonString["isrc"]
Replace someValueHere with the JSON string that you're parsing and that should help. I think isrc is nested though, so it might not be quite that simple. You can't do jsonString["track.isrc"] in Python - dict keys don't work like dotted paths - and the path you're looking for is props.pageProps.track.isrc, so you have to index one layer at a time (or chain the lookups, as shown after the snippet below):
jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
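The layer-by-layer version above also works as a single chained expression; using .get() with a default makes it tolerant of a missing level:
jsonString = json.loads(someValueHere)

# Chained indexing reaches the nested value directly
isrcValue = jsonString["props"]["pageProps"]["track"]["isrc"]

# Or, defensively, if any level might be absent:
isrcValue = (
    jsonString.get("props", {})
              .get("pageProps", {})
              .get("track", {})
              .get("isrc")
)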

Related

Python - Downloading PDFs from ASPX page

I am trying to download the PDF grant reports from a state's department of education public database here: https://mdoe.state.mi.us/cms/grantauditorreport.aspx
I'd like to produce a Python script that downloads all reports as PDFs for a particular range of dates. There is no option in the page's interface to download all recipients (schools) at once, so I'm hoping a Python script could loop through all the available selections and download each report individually.
I am very new to Python and have tried some of the approaches suggested here for similar questions, but I have been unsuccessful, so I do not have any starter code. If someone could give me a start on this, I would greatly appreciate it.
Thanks!
I would recommend Selenium; it can be used for web scraping in Python.
You will need to install Selenium using the instructions provided in the above link. You will also need to install pyautogui (pip install pyautogui should work).
Note that there were many issues getting Selenium to work in Internet Explorer; if you have problems, check out here and here. Because of these issues I had to add a number of capabilities and define the location of the IE driver when initializing the Selenium webdriver; you will need to change these to match your system. I had originally hoped to use the Chrome or Firefox browsers, but the website being scraped only generated the report in Internet Explorer. As has been noted on other Stack Exchange boards, Selenium executes commands much more slowly in Internet Explorer.
Here is code that works with my system and selenium versions:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
import time
url = "https://mdoe.state.mi.us/cms/grantauditorreport.aspx"
# LOAD WEBPAGE
capabilities = DesiredCapabilities.INTERNETEXPLORER
capabilities['ignoreProtectedModeSettings'] = True
capabilities['ignoreZoomSetting'] = True
capabilities.setdefault("nativeEvents", False)
driver = webdriver.Ie(capabilities=capabilities,executable_path="C:\\Users\\amcmu\\Downloads\\IEDriverServer_Win32_2.42.0\\IEDriverServer.exe") #include path to IEdriver
driver.get(url)
# SEARCH
search_key = 'public school'
search = driver.find_element_by_name("ctl00$cphMain$txtAgency")
search.send_keys(search_key) # this is the search term
driver.find_element_by_css_selector('.ButtonStandard').click()
# DROPDOWN LIST
list_item = 'Alpena Public Schools'
select = Select(driver.find_element_by_id('ctl00_cphMain_ddlAgency'))
# select by visible text
select.select_by_visible_text(list_item)
# If you want to iterate through every item in the list use select_by_value
#select.select_by_value('1')
# DATES
start='4/2/2018'
end='4/9/2021'
start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
driver.execute_script("arguments[0].value = ''", start_date)
start_date.send_keys(start)
end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
driver.execute_script("arguments[0].value = ''", end_date)
end_date.send_keys(end)
# PRESS ENTER TO GENERATE PDF
driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
time.sleep(30) # wait while the server generates the file
print('I hope I waited long enough to generate the file.')
# SAVE FILE (pyautogui drives IE's native download bar, which Selenium can't reach)
import pyautogui
pyautogui.hotkey('alt','n')  # focus the notification bar
pyautogui.press('tab')       # move to the Save button
pyautogui.press('enter')     # confirm the download
time.sleep(3)
driver.quit()
Now when the file is being generated you need to wait while their server does its thing; this seemed to take a while (on the order of 20 s). I added the time.sleep(30) to give it 30 seconds to generate the file. You will need to play with this value, or figure out a way to detect when the file has been generated.
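If you'd rather not guess at the delay, one alternative is to poll the download folder until a new PDF appears. A minimal sketch, assuming the report lands in your Downloads directory (the path and timeout are placeholders to adjust):
import glob
import os
import time

def wait_for_new_pdf(folder, before, timeout=120):
    """Poll `folder` until a .pdf not present in `before` appears."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = set(glob.glob(os.path.join(folder, "*.pdf")))
        new = current - before
        if new:
            return new.pop()
        time.sleep(1)
    raise RuntimeError("no new PDF appeared within %d seconds" % timeout)

downloads = r"C:\Users\amcmu\Downloads"  # placeholder: your download directory
before = set(glob.glob(os.path.join(downloads, "*.pdf")))
# ... trigger the report generation and the pyautogui save here ...
pdf_path = wait_for_new_pdf(downloads, before)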
I am not sure how you want to iterate through the schools (i.e. do you have a list of schools with their exact names?). If you don't have such a list, you might want to loop over the dropdown entries, filling in the steps from the script above:
select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
options = select.options
for index in range(len(options)):  # adjust the range if the first entry is a placeholder like "Select..."
    # DROPDOWN LIST (re-locate the Select if the page reloads and the element goes stale)
    select.select_by_index(index)
    # DATES (same as above)
    start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
    driver.execute_script("arguments[0].value = ''", start_date)
    start_date.send_keys(start)
    end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
    driver.execute_script("arguments[0].value = ''", end_date)
    end_date.send_keys(end)
    # PRESS ENTER TO GENERATE PDF
    driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
    time.sleep(30)  # wait while the server generates the file
    # SAVE FILE
    pyautogui.hotkey('alt', 'n')
    pyautogui.press('tab')
    pyautogui.press('enter')
    time.sleep(3)

HTML hidden elements

I'm trying to code a little "GPS", and I couldn't use the Google API because of the daily request limit.
I decided to use the site "viamichelin", which provides the distance between two addresses. I created a little code to generate all the URLs I needed, like this:
import pandas
import numpy as np

# Raw strings so the backslashes in the Windows paths aren't treated as escapes
df = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix = df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))
compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print agences[compteagences]
        print clients[compteclients]
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I have all the URLs needed for my project.
However, I would like to extract the number of kilometres, and there's a little problem: the information I need is not in the page source, so I can't extract it with Python... The site is presented like this:
[screenshot: Michelin itinerary results page]
When I click on "Inspect" I can find the information I need (on the left), but I can't find it in the page source (on the right)... Can someone help me?
[screenshot: the browser inspector next to the page source]
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
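A minimal sketch of that idea, on the assumption that the routing response is JSONP wrapped in a _scriptLoaded(...) call (the endpoint below is a placeholder - copy the real URL from the network tab of your browser's inspector):
import json
import requests

# Placeholder: the real routing URL comes from the browser's network inspector
routing_url = "https://fr.viamichelin.be/...routing-request-url..."

raw = requests.get(routing_url, timeout=30).text

# Strip the JSONP padding: keep what sits between the first '(' and the last ')'
payload = raw[raw.index('(') + 1:raw.rindex(')')]
data = json.loads(payload)

# Inspect `data` to locate the distance field, then read it like any dict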

Web scraping using Python and Selenium, don't know how to get dynamic data

I'm trying to grab values from a table, but they're not in the HTML source. However, they do appear when I inspect the page in the browser. I'm guessing they're dynamically generated, but how do I capture them with Selenium, or some other way in Python?
You can do it like this:
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Chrome()
driver.get('https://www.predictit.org/Contract/7422/Will-Trump-veto-Russian-sanctions-bill-by-August-31#prices')
time.sleep(2)
tables = pd.read_html(driver.page_source) # returns list of dataframes
print(len(tables))
print(tables[2]) # this is table with YES
print(tables[3]) # this is the table with NO
This code fetches only the tables, but you will need to do some cleaning. You can read the docs for pandas.DataFrame.
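If the fixed time.sleep(2) turns out to be flaky, an explicit wait for the table to be present is sturdier. A minimal sketch (waiting on the generic table tag is an assumption; adapt the locator to the actual page):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://www.predictit.org/Contract/7422/Will-Trump-veto-Russian-sanctions-bill-by-August-31#prices')

# Block for up to 10 seconds until at least one <table> is in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)
tables = pd.read_html(driver.page_source)
driver.quit()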

Download data using selenium

I am a research analyst trying to collate data and perform analysis. I need data from this page. I need the data for Abrasives to Vanaspati Oils (you'll find it on the left side). I always encounter problems like this; I figured out that Selenium would be able to handle such stuff, but I am stuck on how to download this data into Excel. I need one Excel sheet for each category.
My exact technical question is how to download the table data. I did a little background research and understood that the data can be extracted if the table has a class name, from here. I see that the table has class="tbldata14 bdrtpg", so I used it in my code.
I got this error
InvalidSelectorException: Message: The given selector tbldata14 bdrtpg
is either invalid or does not result in a WebElement.
How can I download this table data? Point me to any references that I can read and solve this problem.
My code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.moneycontrol.com/stocks/marketinfo/netprofit/bse/index.html")
elem=driver.find_element_by_class_name("tbldata14 bdrtpg")
Thanks in advance. Also, please suggest if there is another simple way [I tried copy-paste; it is too tedious!]
Fetching the data you're interested in can be achieved as follows. (Your selector failed because find_element_by_class_name accepts a single class name only; "tbldata14 bdrtpg" is two classes, which you can match with an XPath or CSS selector instead.)
from selenium import webdriver
url = "http://www.moneycontrol.com/stocks/marketinfo/netprofit/bse/index.html"
# Get table cells where the cell contains an anchor or text
xpath = "//table[@class='tbldata14 bdrtpg']//tr//td[child::a|text()]"
driver = webdriver.Firefox()
driver.get(url)
data = driver.find_elements_by_xpath(xpath)
# Group the output where each row contains 5 elements
rows=[data[x:x+5] for x in xrange(0, len(data), 5)]
for r in rows:
print "Company {}, Last Price {}, Change {}, % Change {}, Net Profit {}" \
.format(r[0].text, r[1].text, r[2].text, r[3].text, r[4].text)
Writing the data to an Excel file is explained here:
Python - Write to Excel Spreadsheet
Python, appending printed output to excel file
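For completeness, a minimal sketch of dumping the scraped rows into an Excel sheet with pandas (the column names follow the print statement above; the file and sheet names are hypothetical, and you would repeat this per category to get one sheet each):
import pandas as pd

# `rows` is the grouped list built in the snippet above
records = [[cell.text for cell in r] for r in rows]
columns = ["Company", "Last Price", "Change", "% Change", "Net Profit"]

df = pd.DataFrame(records, columns=columns)
df.to_excel("net_profit_bse.xlsx", sheet_name="Abrasives", index=False)  # hypothetical names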

Extracting data from the Web

One really newbie question.
I'm working on a small Python script for my home use that will collect data on a specific air ticket.
I want to extract the data from Skyscanner (using BeautifulSoup and urllib). Example:
http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html
I'm interested in all the data stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png
Since these values are not in the HTML source, can I still extract them?
I believe the problem is that these values are rendered by JavaScript code which your browser runs and urllib doesn't - you should use a library that can execute JavaScript code.
I just googled "crawler python javascript" and got some Stack Overflow questions and answers recommending the use of selenium or webkit. You can use those libraries through scrapy. Here are two snippets:
Rendered/interactive javascript with gtk/webkit/jswebkit
Rendered Javascript Crawler With Scrapy and Selenium RC
I have been working on this exact same issue. I was introduced to BeautifulSoup and later learned about Scrapy. BeautifulSoup is very easy to use, especially if you're new at this. Scrapy apparently has more "features", but I believe you can accomplish your needs with BeautifulSoup.
I had the same issues about not being able to gain access to a website that loaded information through Javascript and thankfully Selenium was the savior.
A great introduction to Selenium can be found here.
Install: pip install selenium
Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(self, domain) with the hyperlink you are trying to parse, it will return the source code of the fully loaded page, which you can then feed into BeautifulSoup to find the information you're looking for!
Ex:
airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'
scraper = SeleniumWebScraper()
soup = BeautifulSoup(scraper.retrieve_source_code(airfare_url))
Now you can parse soup like you normally would with Beautifulsoup.
I hope that helps you!
from selenium import webdriver

class SeleniumWebScraper():
    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        self.driver.close()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            # Reopen the browser if it was closed earlier
            self.driver = webdriver.Firefox()
            self.is_browser_closed = 0
        # The driver.get method will navigate to a page given by the URL.
        # WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        # before returning control to your test or script.
        # It's worth noting that if your page uses a lot of AJAX on load then
        # WebDriver may not know when it has completely loaded.
        self.driver.get(domain)
        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
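For example, end-to-end usage of the class might look like the following (the span tag and class name in find_all are placeholders - inspect the page for the real markup around the price):
from bs4 import BeautifulSoup

scraper = SeleniumWebScraper()
html = scraper.retrieve_source_code(airfare_url)
scraper.close()

soup = BeautifulSoup(html, "html.parser")
# "price" is a placeholder class name -- check the page's actual markup
for tag in soup.find_all("span", class_="price"):
    print(tag.get_text(strip=True))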
You don't even need BeautifulSoup to extract data.
Just do this, and the response is converted to a dictionary, which is very easy to handle:
import json
data = json.loads(response_text)  # response_text: the text of the main response content
You can now print any key-value pair from the dictionary.
Give it a try. It is super easy.
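A minimal sketch of that approach, assuming you have found the JSON endpoint the page calls (the URL below is made up - copy the real XHR address from your browser's network tab):
import json
import requests

# Placeholder endpoint: use the real XHR URL from the network inspector
api_url = "https://www.skyscanner.net/example/json/endpoint"

response = requests.get(api_url, timeout=30)
data = json.loads(response.text)  # or simply response.json()

# Dig into the dict for the fields you care about, e.g. the price
print(data.keys())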
