Extracting data from Web - python

One really newbie question.
I'm working on a small python script for my home use, that will collect data of a specific air ticket.
I want to extract the data from skyscanner (using BeautifulSoup and urllib). Example:
http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html
And I'm interested in all the data that is stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png
Since those values are not present in the HTML source, can I still extract them?

I believe the problem is that these values are rendered by JavaScript code which your browser runs and urllib doesn't, so you should use a library that can execute JavaScript.
I just googled crawler python javascript and got some Stack Overflow questions and answers which recommend the use of Selenium or WebKit. You can use those libraries through Scrapy. Here are two snippets:
Rendered/interactive javascript with gtk/webkit/jswebkit
Rendered Javascript Crawler With Scrapy and Selenium RC

I have been working on this same exact issue. I was introduced to BeautifulSoup and later learned about Scrapy. BeautifulSoup is very easy to use, especially if you're new at this. Scrapy apparently has more "features", but I believe you can accomplish what you need with BeautifulSoup.
I had the same issue of not being able to access a website that loaded its information through JavaScript, and thankfully Selenium was the savior.
A great introduction to Selenium can be found here.
Install: pip install selenium
Below is a simple class I put together. You can save it as a .py file and import it into your project. If you call the method retrieve_source_code(self, domain) with the hyperlink you are trying to parse, it will return the source code of the fully loaded page, which you can then feed into BeautifulSoup to find the information you're looking for!
Ex:
airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'
scraper = SeleniumWebScraper()
soup = BeautifulSoup(scraper.retrieve_source_code(airfare_url))
Now you can parse soup like you normally would with BeautifulSoup.
I hope that helps you!
from selenium import webdriver


class SeleniumWebScraper():
    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        self.driver.close()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            self.driver = webdriver.Firefox()
        # The driver.get method will navigate to a page given by the URL.
        # WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        # before returning control to your test or script.
        # It's worth noting that if your page uses a lot of AJAX on load then
        # WebDriver may not know when it has completely loaded.
        self.driver.get(domain)
        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
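As a rough illustration of the last step, here is a minimal sketch of pulling prices out of the returned source. The CSS class name below is purely hypothetical; you would need to inspect the actual Skyscanner markup with your browser's dev tools to find the right selector.
from bs4 import BeautifulSoup

scraper = SeleniumWebScraper()
source = scraper.retrieve_source_code(airfare_url)
scraper.close()

soup = BeautifulSoup(source, 'html.parser')

# 'fare-price' is a placeholder class name; replace it with whatever the
# real page uses for its price elements.
for price_tag in soup.find_all(class_='fare-price'):
    print(price_tag.get_text(strip=True))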

You don't even need BeautifulSoup to extract data.
Just do this and your response is converted to a dictionary, which is very easy to handle:
text = json.loads("your text of the main response content")
You can now print any key/value pair from the dictionary.
Give it a try. It is super easy.
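For instance, here is a minimal sketch of that idea. The URL below is a placeholder for an endpoint that actually returns JSON; if the response is HTML you would still need a parser.
import json
import requests

# placeholder URL - substitute an endpoint that really returns JSON
response = requests.get('https://example.com/api/flights')
data = json.loads(response.text)  # or equivalently: data = response.json()

# data is now an ordinary Python dict
print(data.keys())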

Related

How can I scrape JSON data from an HTML page source?

I'm trying to pull some data from an online music database. In particular, I want to pull this data that you can find with CTRL+F -- "isrc":"GB-FFM-19-0853."
view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534
I'm using Python and Selenium and have tried to locate the data via things like tag, xpath and id, but nothing seems to be working.
I haven't seen this x:y format before and some searching makes me think it's in a JSON format.
Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).
My code so far ...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os
# Access AudioNetwork and search for tracks.
path = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(path)
driver.get("https://www.audionetwork.com/track/searchkeyword")
search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)
music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")
music_link.click()
I know I need better waits, and there are probably other issues with the code, but any ideas on how to grab that ISRC number?
You want to extract the entire script as JSON data (which can be read as a dictionary in Python) and search for the "isrc" parameter.
The following code uses Selenium to extract the script content from the page, parse it as JSON, and print the "isrc" value to the terminal.
from selenium import webdriver
from selenium.webdriver.common.by import By
import json
driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')
content_as_dict = json.loads(content)
print(content_as_dict['props']['pageProps']['track']['isrc'])
driver.close()
driver.quit()
Yes, this is JSON format. It's actually JSON wrapped inside of an HTML script tag. It's essentially a "key": "value" pair - so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.
Python has a library for parsing JSON; I think you might want this - https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.
If you wanted to find the value of isrc, you could do:
import json
... # your code here
jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]
Replace someValueHere with the JSON string you're parsing, and that should help. I think isrc is nested, though, so it might not be quite that simple. I don't think you can just do jsonString["track.isrc"] in Python, but I'm not sure... the path you're looking for is props.pageProps.track.isrc. You may have to assign a variable per layer...
jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
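In fact, Python lets you chain the lookups in a single expression instead of assigning a variable per layer:
isrcValue = jsonString["props"]["pageProps"]["track"]["isrc"]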

Python - Downloading PDFs from ASPX page

I am trying to download the PDF grant reports from a state's department of education public database here: https://mdoe.state.mi.us/cms/grantauditorreport.aspx
I'd like to produce a Python script to go up and download all reports as PDFs for a particular range of dates. There is no option in the page's interface to just download all recipients (schools), so I'm hoping a Python script could loop through all the available selections and download each report individually.
I am very new to Python and have tried some of the resources here for people asking similar things, but I have been unsuccessful. So, I do not have any starter code. If someone could give me a start on this, I would greatly appreciate it.
Thanks!
I would recommend Selenium; it can be used for web scraping in Python.
You will need to install Selenium using the instructions provided in the above link. You will also need to install pyautogui (pip install should work).
Note that there were many issues getting Selenium to work in Internet Explorer; if you have problems, check out here and here. Because of these issues I had to add a number of capabilities and define the location of the IEdriver when initializing the Selenium webdriver; you will need to change these to match your system. I had originally hoped to use the Chrome or Firefox browsers, but the website being scraped only generated the report in Internet Explorer. As has been noted on other Stack Exchange boards, Selenium executes commands much more slowly in Internet Explorer.
Here is code that works with my system and selenium versions:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
import time
url = "https://mdoe.state.mi.us/cms/grantauditorreport.aspx"
# LOAD WEBPAGE
capabilities = DesiredCapabilities.INTERNETEXPLORER
capabilities['ignoreProtectedModeSettings'] = True
capabilities['ignoreZoomSetting'] = True
capabilities.setdefault("nativeEvents", False)
driver = webdriver.Ie(capabilities=capabilities,executable_path="C:\\Users\\amcmu\\Downloads\\IEDriverServer_Win32_2.42.0\\IEDriverServer.exe") #include path to IEdriver
driver.get(url)
# SEARCH
search_key = 'public school'
search = driver.find_element_by_name("ctl00$cphMain$txtAgency")
search.send_keys(search_key) # this is the search term
driver.find_element_by_css_selector('.ButtonStandard').click()
# DROPDOWN LIST
list_item = 'Alpena Public Schools'
select = Select(driver.find_element_by_id('ctl00_cphMain_ddlAgency'))
# select by visible text
select.select_by_visible_text(list_item)
# If you want to iterate through every item in the list use select_by_value
#select.select_by_value('1')
# DATES
start='4/2/2018'
end='4/9/2021'
start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
driver.execute_script("arguments[0].value = ''", start_date)
start_date.send_keys(start)
end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
driver.execute_script("arguments[0].value = ''", end_date)
end_date.send_keys(end)
# PRESS ENTER TO GENERATE PDF
driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
time.sleep(30) # wait while the server generates the file
print('I hope I waited long enough to generate the file.')
# SAVE FILE
import pyautogui
pyautogui.hotkey('alt','n')
pyautogui.press('tab')
pyautogui.press('enter')
time.sleep(3)
driver.quit()
quit()
Now, while the file is being generated, you need to wait for their server to do its thing; this seemed to take a while (on the order of 20 s). I added the time.sleep(30) to give it 30 seconds to generate the file. You will need to play with this value, or figure out a way to detect when the file has been generated.
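If you would rather not rely on a fixed sleep, one generic option is to poll a folder until the saved PDF shows up. This is only a minimal sketch, and it assumes the report ends up as a .pdf file in a folder you know, which depends on how and where you save it.
import glob
import os
import time

def wait_for_pdf(download_dir, timeout=120):
    # Poll download_dir until a .pdf appears, then return the newest one.
    deadline = time.time() + timeout
    while time.time() < deadline:
        pdfs = glob.glob(os.path.join(download_dir, '*.pdf'))
        if pdfs:
            return max(pdfs, key=os.path.getmtime)
        time.sleep(1)
    raise TimeoutError('No PDF appeared within {} seconds'.format(timeout))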
I am not sure how you want to iterate through the schools (i.e. whether you have a list of schools with their exact names). If you don't have the list of schools, you might want to use something like this pseudocode (a fleshed-out sketch follows below):
select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
options = select.options
for index in range(len(options)):
    # DROPDOWN LIST
    select.select_by_index(index)
    # DATES
    do stuff
    # PRESS ENTER TO GENERATE PDF
    do stuff
    # SAVE FILE
    do stuff
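One way to flesh that pseudocode out, reusing the selectors and pyautogui steps from the code above (the sleep durations are guesses you will need to tune):
select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
for index in range(len(select.options)):
    # DROPDOWN LIST - re-locate the element on each pass in case the page reloaded
    select = Select(driver.find_element_by_name("ctl00$cphMain$ddlAgency"))
    select.select_by_index(index)
    # DATES
    start_date = driver.find_element_by_name("ctl00$cphMain$txtBeginDate")
    driver.execute_script("arguments[0].value = ''", start_date)
    start_date.send_keys(start)
    end_date = driver.find_element_by_name("ctl00$cphMain$txtEndDate")
    driver.execute_script("arguments[0].value = ''", end_date)
    end_date.send_keys(end)
    # PRESS ENTER TO GENERATE PDF
    driver.find_element_by_name('ctl00$cphMain$btnSearch').click()
    time.sleep(30)  # crude wait for the report; tune or replace with polling
    # SAVE FILE
    pyautogui.hotkey('alt', 'n')
    pyautogui.press('tab')
    pyautogui.press('enter')
    time.sleep(3)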

I am trying to fill out a web form with Python using Data from Excel

I have an Excel file with thousands of rows of data. I am trying to find materials that could help me automate the process of pulling data from Excel and filling it in on a website I use. I am only able to find videos and instructions that open a new browser. Can I do it with a browser already open and a page already loaded? Anything on VBA or Python would help, thanks.
Your question is pretty broad, but in general terms what I think you're looking for is essentially Python, Pandas and Selenium.
You could install Selenium, use Pandas to read your csv file, and loop over it so it inputs the data for you. It might take a while, as it imitates human interaction, so to speak, but it should help you.
Edit - Handling Selenium and Pseudo-Code
1. Building blocks
For this pseudo-walkthrough, here's what we are going to use:
BeautifulSoup: Python library used to obtain HTML and XML information
Pandas: Python library for reading and manipulating data. This is what we're using to handle the Excel file.
Selenium: a web browser automation tool that will handle the grunt work for us.
Google Chrome: needs no introduction
2. Installations
2.1. Python Libraries
Run the following code to install the necessary libraries:
!pip install bs4
!pip install pandas
!pip install selenium
2.2. Selenium Driver
Now that you have the libraries installed, let's handle the last part of the setup. We will download a Google Chrome driver, which will be the basis of operation for the Selenium library.
Follow these steps:
Open Google Chrome and click the three dots on the upper right corner
Go to Help > About Google Chrome
Check what your Chrome version is
Access http://chromedriver.chromium.org/downloads and download the driver corresponding to your version
Make sure you save the driver to a folder you'll use in the future
3. Getting stuff done
So, I haven't exactly run this code just yet, so sorry for any syntax mistakes I might've overlooked. However, this is what it's going to look like:
from bs4 import BeautifulSoup
from requests import get
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
# using Pandas to read the csv file
source_information = pd.read_csv('my_file.csv',header=None,skiprows=[0])
# setting the URL for BeautifulSoup to operate in
url = "yourwebform.com"
my_web_form = get(url).content
soup = BeautifulSoup(my_web_form, 'html.parser')
# Setting parameters for selenium to work
path = r'C:/Users/itsme/Desktop/Util/chromedriver.exe' #make sure you insert the path to your driver here!
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
driver = webdriver.Chrome(path, chrome_options=options)
driver.get(url)
# creating a procedure to fill the form
def fulfill_form(username, user_email):
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_customer_name = driver.find_element_by_name('username')
    input_customer_email = driver.find_element_by_name('email')
    submit = driver.find_element_by_name('Submit')
    # input the values and hold a bit for the next action
    input_customer_name.send_keys(username)
    time.sleep(1)
    input_customer_email.send_keys(user_email)
    time.sleep(1)
    submit.click()
    time.sleep(7)

# creating a list to hold any entries should they result in error
failed_attempts = []

# creating a loop to do the procedure and append failed cases to the list
# (assumes the csv holds the name in the first column and the email in the second)
for _, row in source_information.iterrows():
    customer_name, customer_email = row[0], row[1]
    try:
        fulfill_form(customer_name, customer_email)
    except Exception:
        failed_attempts.append(customer_name)

if len(failed_attempts) > 0:
    print("{} cases have failed".format(len(failed_attempts)))

print("Procedure concluded")
Let me know if you run into any trouble and we'll try and fix it together!
I used Rodrigo Castilho's solution but had to place all the code that creates and opens the driver inside the function. With the driver outside the function, it would iterate through the first row of data and then fail at the second, because there was no longer a connection to the driver. By placing the connection code inside the function, the browser is closed and reopened for each row. I'm happy to explain any other steps to anyone having similar problems.
def fulfill_form(first, last, email, phone, zip):
    # Setting parameters for selenium to work
    path = r'chromedriver'  # make sure you insert the path to your driver here!
    options = webdriver.ChromeOptions()
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(path, options=options)
    driver.get(url)
    # use Chrome Dev Tools to find the names or IDs for the fields in the form
    input_first = driver.find_element_by_id('form-first_name')
    input_last = driver.find_element_by_id('form-last_name')
    input_email = driver.find_element_by_id('form-email')
    input_phone = driver.find_element_by_id('form-phone')
    input_zip = driver.find_element_by_id('form-zip_code')
    submit = driver.find_element_by_name('commit')
    # input the values and hold a bit for the next action
    driver.delete_all_cookies()
    input_first.send_keys(first)
    time.sleep(1)
    input_last.send_keys(last)
    time.sleep(1)
    input_email.send_keys(email)
    time.sleep(1)
    input_phone.send_keys(phone)
    time.sleep(1)
    input_zip.send_keys(zip)
    time.sleep(1)
    submit.click()
    time.sleep(1)
    driver.close()
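As a follow-up on the design choice: if you do want to keep one browser session for all rows, another option is to create the driver once outside, pass it into the function as a parameter, and not call driver.close() inside it. A rough sketch, using the field IDs from the code above and assuming url is defined as before:
def fulfill_form(driver, first, last, email, phone, zip_code):
    # the caller owns the driver, so the browser stays open between rows
    driver.get(url)
    driver.find_element_by_id('form-first_name').send_keys(first)
    driver.find_element_by_id('form-last_name').send_keys(last)
    driver.find_element_by_id('form-email').send_keys(email)
    driver.find_element_by_id('form-phone').send_keys(phone)
    driver.find_element_by_id('form-zip_code').send_keys(zip_code)
    driver.find_element_by_name('commit').click()
    time.sleep(1)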

HTML hidden elements

I'm actually trying to code a little "GPS", and I couldn't use the Google API because of the daily restriction.
I decided to use the site ViaMichelin, which provides the distance between two addresses. I wrote a little piece of code to build all the URLs I needed, like this:
import pandas
import numpy as np

df = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel('C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')
matrix = df.as_matrix(columns=None)
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.as_matrix(columns=None)
agences = np.squeeze(np.asarray(matrix2))
compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print agences[compteagences]
        print clients[compteclients]
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print url
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I now have all the URLs needed for my project.
However, I would like to extract the number of kilometres for each itinerary, and there's a little problem: the information I need is not in the source code, so I can't extract it with Python... The site is presented like this:
[screenshot: Michelin]
When I click on "Inspect" I can find the information I need (on the left), but I can't find it in the source code (on the right)... Can someone provide me some help?
[screenshot: Itinerary]
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print soup
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
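A rough sketch of that idea, assuming the response really does start with _scriptLoaded( and ends with a closing parenthesis. The routing URL is abbreviated here as a placeholder; use the full one from your browser's network inspector.
import json
import requests

routing_url = 'https://fr.viamichelin.be/...'  # placeholder for the long routing URL

raw = requests.get(routing_url).text

# strip the JavaScript wrapper: keep what sits between the first '(' and the last ')'
payload = raw[raw.index('(') + 1 : raw.rindex(')')]

data = json.loads(payload)
print(data)  # explore the dict to find the distance field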

Web scraping with Anaconda and Python 3.6.5

I'm not a programmer, but I'm trying to teach myself Python so that I can pull data off various sites for projects that I'm working on. I'm using "Automate the Boring Stuff" and I'm having trouble getting the examples to work with one of the pages I'm trying to pull data from.
I'm using Anaconda as my prompt with Python 3.6.5. Here's what I've done:
Step 1: create the beautiful soup object
import requests, bs4
res = requests.get('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
res.raise_for_status()
weatherTest = bs4.BeautifulSoup(res.text)
type(weatherTest)
This works, and returns the result
<class 'bs4.BeautifulSoup'>
I've made the assumption that the "noStarchSoup" that was in the original text (in place of weatherTest here) is a name the author gave to the object that I can rename to something more relevant to me. If that's not accurate, please let me know.
Step 2: pull an element out of the html
Here's where I get stuck. The author had just mentioned how to pull a page down into a file (which I would prefer not to do, since I want to use the bs4 object), but then uses that file as his source for the HTML data. The exampleFile was his downloaded file.
import bs4
exampleFile = open('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
I've tried using weatherTest in place of exampleFile, I've tried running the whole thing with the original object name (noStarchSoup), I've even tried it with exampleFile, even though I haven't downloaded the file.
What I get is
"OSError: [Errno 22] Invalid argument:
'https://www.almanac.com/weather/history/zipcode/02111/2017-05-15'
The next step is to tell it what element to pull but I'm trying to fix this error first and kind of spinning my wheels here.
Couldn't resist here!
I found this page during my search but this answer didn't quite help... try this code :)
Step 1: download Anaconda 3.0+
Step 2: (function)
# Import Libraries
import bs4
import logging
import requests
import pandas as pd

logger = logging.getLogger(__name__)

def import_high_short_tickers(market_type):
    if market_type == 'NASDAQ':
        page = requests.get('https://www.highshortinterest.com/nasdaq/')
    elif market_type == 'NYSE':
        page = requests.get('https://www.highshortinterest.com/nyse/')
    else:
        logger.error("Invalid market_type: " + market_type)
        return None
    # Parse the HTML Page
    soup = bs4.BeautifulSoup(page.content, 'html.parser')
    # Grab only table elements
    all_soup = soup.find_all('table')
    # Get what you want from table elements!
    for element in all_soup:
        listing = str(element)
        if 'https://finance.yahoo.com/' in listing:
            # Stuff the results in a pandas data frame (if you're not using these you should)
            data = pd.read_html(listing)
            return data
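A quick usage sketch, if you want to try it out:
data = import_high_short_tickers('NYSE')
print(data)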
Yes, yes, it's very crude, but don't hate!
Cheers!
