I'm not a programmer, but I'm trying to teach myself Python so that I can pull data off various sites for projects that I'm working on. I'm using "Automate the Boring Stuff" and I'm having trouble getting the examples to work with one of the pages I'm trying to pull data from.
I'm using the Anaconda prompt with Python 3.6.5. Here's what I've done:
Step 1: create the beautiful soup object
import requests, bs4
res = requests.get('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
res.raise_for_status()
weatherTest = bs4.BeautifulSoup(res.text)
type(weatherTest)
This works, and returns the result
<class 'bs4.BeautifulSoup'>
I've made the assumption that the "noStarchSoup" that was in the original text (in place of weatherTest here) is a name the author gave to the object that I can rename to something more relevant to me. If that's not accurate, please let me know.
Step 2: pull an element out of the html
Here's where I get stuck. The author had just shown how to save a page to a file (which I'd prefer not to do; I want to use the bs4 object), but then he uses that file as his source for the HTML data. The exampleFile was his downloaded file.
import bs4
exampleFile = open('https://www.almanac.com/weather/history/zipcode/02111/2017-05-15')
I've tried using weatherTest in place of exampleFile; I've tried running the whole thing with the original object name (noStarchSoup); I've even tried it with exampleFile, even though I haven't downloaded the file.
What I get is
OSError: [Errno 22] Invalid argument: 'https://www.almanac.com/weather/history/zipcode/02111/2017-05-15'
The next step is to tell it what element to pull, but I'm trying to fix this error first and I'm kind of spinning my wheels here.
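For what it's worth, open() only accepts local file paths, never URLs, which is exactly what Errno 22 is complaining about here. Since the page was already fetched with requests in step 1, the soup object can be queried directly with no intermediate file. A minimal offline sketch (the HTML string and the temp selector are stand-ins, not the almanac.com markup):

```python
import bs4

# Stand-in for res.text from the earlier requests.get() call,
# so this example runs without network access.
html = '<html><head><title>Weather History</title></head><body><p class="temp">63 F</p></body></html>'

soup = bs4.BeautifulSoup(html, 'html.parser')  # same role as weatherTest above

# Pull elements straight out of the soup object:
print(soup.title.get_text())            # -> Weather History
print(soup.select('p.temp')[0].get_text())  # -> 63 F
```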
Couldn't resist here!
I found this page during my search but this answer didn't quite help... try this code :)
Step 1: download Anaconda 3.0+
Step 2: (function)
# Import libraries
import logging
import bs4
import pandas as pd
import requests

logger = logging.getLogger(__name__)

def import_high_short_tickers(market_type):
    if market_type == 'NASDAQ':
        page = requests.get('https://www.highshortinterest.com/nasdaq/')
    elif market_type == 'NYSE':
        page = requests.get('https://www.highshortinterest.com/nyse/')
    else:
        logger.error("Invalid market_type: " + market_type)
        return None
    # Parse the HTML page
    soup = bs4.BeautifulSoup(page.content, 'html.parser')
    # Grab only table elements
    all_soup = soup.find_all('table')
    # Get what you want from table elements!
    for element in all_soup:
        listing = str(element)
        if 'https://finance.yahoo.com/' in listing:
            # Stuff the results in a pandas DataFrame (if you're not using these you should)
            data = pd.read_html(listing)
            return data
Yes, yes, it's very crude, but don't hate!
Cheers!
I am brand new to coding, and was given a web scraping tutorial (found here) to help build my skills as I learn. I've already had to make several adjustments to the code in this tutorial, but I digress. I'm scraping from http://books.toscrape.com/ and, when I try to export a DataFrame of just the book categories to Excel, I get a couple of issues. Note that when exporting to csv (and then opening the file in Notepad), these issues are not present. I am working in a Jupyter Notebook in Azure Data Studio.
First, the row with all the data appears not to exist even though it is displayed, so I have to tab over to each column to get past the data shown in Excel's default window size.
Second, it only displays the first 9 results (the first being "Books," and the other 8 being the first 8 categories).
Image of desired scrape section
Here is my code:
s = Service('C:/Users/.../.../chromedriver.exe')
browser = webdriver.Chrome(service=s)
url = 'http://books.toscrape.com/'
browser.get(url)
results = []
content = browser.page_source
soup = BeautifulSoup(content)
# changes from the tutorial due to recommendations
# from StackOverflow, based on similar questions
# from error messages popping up when using original
# formatting; tutorial is outdated.
for li in soup.find_all(attrs={'class': 'side_categories'}):
    name = element.find('li')
    if name not in results:
        results.append(name.text)
df = pd.DataFrame({'Categories': results})
df.to_excel('categories.xlsx', index=False)
# per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.formats.style.Styler.to_excel.html
# encoding is Deprecated and apparently wasn't
# needed for any excel writer other than xlwt,
# which is no longer maintained.
Images of results:
View before tabbing further in the columns
End of the displayed results
What can I do to fix this?
Edit: Many apologies, I didn't realize I have copied an older, incorrect version of my code blocks. Should be correct now.
The code in question will not create any DataFrame as posted (the loop body references an undefined `element`, which raises a NameError). In any case, you should select your elements more specifically, for example with CSS selectors:
for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

results = []
soup = BeautifulSoup(requests.get('http://books.toscrape.com/').content, 'html.parser')

for a in soup.select('ul.nav-list a'):
    if a.get_text(strip=True) not in results:
        results.append(a.get_text(strip=True))

pd.DataFrame({'Categories': results})
Output

           Categories
0               Books
1              Travel
2             Mystery
3  Historical Fiction
4      Sequential Art
5            Classics
6          Philosophy
...
I have a nice URL structure and I want to iterate through the URLs and download all the images from them. I am trying to use BeautifulSoup, along with requests, to get the job done.
Here is the URL - https://sixmorevodka.com/#&gid=0&pid={i}, and I want i to iterate from, say, 1 to 100 for this example.
from bs4 import BeautifulSoup as soup
import requests, contextlib, re, os

@contextlib.contextmanager
def get_images(url: str):
    d = soup(requests.get(url).text, 'html.parser')
    yield [[i.find('img')['src'], re.findall(r'(?<=\.)\w+$', i.find('img')['alt'])[0]] for i in d.find_all('a') if re.findall(r'/image/\d+', i['href'])]

n = 100  # end value
for i in range(n):
    with get_images(f'https://sixmorevodka.com/#&gid=0&pid={i}') as links:
        print(links)
        for c, [link, ext] in enumerate(links, 1):
            with open(f'ART/image{i}{c}.{ext}', 'wb') as f:
                f.write(requests.get(f'https://sixmorevodka.com{link}').content)
I think I either messed something up in the yield line or in the very last write line. Can someone help me out, please? I am using Python 3.7.
In looking at the structure of that webpage, your gid parameter is invalid. To see for yourself, open a new tab and navigate to https://sixmorevodka.com/#&gid=0&pid=22.
You'll notice that none of the portfolio images are displayed. gid can be a value 1-5, denoting the grid to which an image belongs.
Regardless, your current scraping methodology is inefficient, and puts undue traffic on the website. Instead, you only need to make this request once, and extract the urls actually containing the images using the ilb portfolio__grid__item class selector.
Then, you can iterate and download those urls, which are directly the source of the images.
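A sketch of that single-request approach. The class selector comes from the answer above, but the site's live markup may have changed since, so treat the selector and attribute names as assumptions:

```python
from bs4 import BeautifulSoup

def extract_image_urls(html):
    """Collect the src of every image inside anchors carrying the
    ilb portfolio__grid__item classes (selector assumed from the page)."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for a in soup.select('a.ilb.portfolio__grid__item'):
        img = a.find('img')
        if img and img.get('src'):
            urls.append(img['src'])
    return urls

# One request for the whole gallery instead of 100 fragment URLs:
# import requests
# html = requests.get('https://sixmorevodka.com/').text
# for n, src in enumerate(extract_image_urls(html), 1):
#     with open(f'ART/image{n}.jpg', 'wb') as f:
#         f.write(requests.get(src).content)
```

Keeping the parsing in a pure function also makes it easy to test against a saved copy of the page before hitting the live site.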
I'm trying to code a little "GPS", and I couldn't use the Google API because of the daily quota restriction.
I decided to use the site "viamichelin", which provides the distance between two addresses. I created a little piece of code to fetch all the URLs I needed, like this:
import pandas
import numpy as np

df = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Clients')
df2 = pandas.read_excel(r'C:\Users\Bibi\Downloads\memoire\memoire.xlsx', sheet_name='Agences')

matrix = df.to_numpy()
clients = np.squeeze(np.asarray(matrix))
matrix2 = df2.to_numpy()
agences = np.squeeze(np.asarray(matrix2))

compteagences = 0
comptetotal = 0
for j in agences:
    compteclients = 0
    for i in clients:
        print(agences[compteagences])
        print(clients[compteclients])
        url = 'https://fr.viamichelin.be/web/Itineraires?departure=' + agences[compteagences] + '&arrival=' + clients[compteclients] + '&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption='
        print(url)
        compteclients += 1
        comptetotal += 1
    compteagences += 1
All my data is in Excel, which is why I used the pandas library. I now have all the URLs needed for my project.
However, I would like to extract the number of kilometers for each route, and there's a little problem: the information I need is not in the page's source code, so I can't extract it with Python... The site looks like this:
Michelin
When I click on "Inspect" I can find the information I need (on the left), but I can't find it in the source code (on the right)... Can someone help me?
Itinerary
I have already tried this, without success:
import os
import csv
import requests
from bs4 import BeautifulSoup

requete = requests.get("https://fr.viamichelin.be/web/Itineraires?departure=Rue%20Lebeau%2C%20Liege%2C%20Belgique&departureId=34MTE1Mmc2NzQwMDM0NHoxMDU1ZW44d2NOVEF1TmpNek5ERT1jTlM0MU5qazJPQT09Y05UQXVOak16TkRFPWNOUzQxTnpBM01nPT1jTlRBdU5qTXpOREU9Y05TNDFOekEzTWc9PTBhUnVlIExlYmVhdQ==&arrival=Rue%20Rys%20De%20Mosbeux%2C%20Trooz%2C%20Belgique&arrivalId=34MTE1MnJ5ZmQwMDMzb3YxMDU1ZDFvbGNOVEF1TlRVNU5UUT1jTlM0M01qa3lOZz09Y05UQXVOVFl4TlE9PWNOUzQzTXpFNU5nPT1jTlRBdU5UVTVOVFE9Y05TNDNNamt5Tmc9PTBqUnVlIEZvbmQgZGVzIEhhbGxlcw==&index=0&vehicle=0&type=0&distance=km&currency=EUR&highway=false&toll=false&vignette=false&orc=false&crossing=true&caravan=false&shouldUseTraffic=false&withBreaks=false&break_frequency=7200&coffee_duration=1200&lunch_duration=3600&diner_duration=3600&night_duration=32400&car=hatchback&fuel=petrol&fuelCost=1.393&allowance=0&corridor=&departureDate=&arrivalDate=&fuelConsumption=")
page = requete.content
soup = BeautifulSoup(page, "html.parser")
print(soup)
Looking at the inspector for the page, the actual routing is done via a JavaScript invocation to this rather long URL.
The data you need seems to be in that response, starting from _scriptLoaded(. (Since it's a JavaScript object literal, you can use Python's built-in JSON library to load the data into a dict.)
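Unwrapping such a response might look like this. The `_scriptLoaded(...)` wrapper name is taken from the answer above, but the payload shape shown here is a hypothetical stand-in, not the real viamichelin response:

```python
import json

def unwrap_jsonp(body, callback='_scriptLoaded'):
    """Strip a JavaScript callback wrapper like _scriptLoaded({...});
    and parse the inner object literal as JSON."""
    start = body.index(callback + '(') + len(callback) + 1
    end = body.rindex(')')
    return json.loads(body[start:end])

# Hypothetical response body for illustration only:
body = '_scriptLoaded({"itinerary": {"distance": "12.3 km"}});'
data = unwrap_jsonp(body)
print(data['itinerary']['distance'])  # -> 12.3 km
```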
I am a beginner at data scraping. I want to extract all the marathon event names from Wikipedia.
For this I have written a small piece of code:
from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_marathon_races')
tree = html.fromstring(page.text)
events = tree.xpath('//td/a[@class="new"]/text()')
print(events)
The problem is that when I execute this code, an empty list comes out as the output. What is the problem with my code? I would be grateful if anyone could help me correct it or find the mistakes.
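One thing worth checking: in MediaWiki markup, class="new" marks "red links" to articles that do not exist yet, so //td/a[@class="new"] can legitimately match nothing even when the page parses fine. Comparing the filtered and unfiltered XPath on a tiny snippet (a stand-in for the Wikipedia page) shows the difference:

```python
from lxml import html

# Minimal stand-in for a Wikipedia table cell with an ordinary (blue) link.
snippet = '<table><tr><td><a href="/wiki/Boston_Marathon">Boston Marathon</a></td></tr></table>'
tree = html.fromstring(snippet)

# class="new" only matches red links, so this comes back empty:
print(tree.xpath('//td/a[@class="new"]/text()'))  # -> []
# Dropping the class filter picks up the normal links:
print(tree.xpath('//td/a/text()'))                # -> ['Boston Marathon']
```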
One really newbie question.
I'm working on a small Python script for home use that will collect data on a specific air ticket.
I want to extract the data from Skyscanner (using BeautifulSoup and urllib). Example:
http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html
And I'm interested in all the data stored in this kind of element, especially the price: http://shrani.si/f/1w/An/1caIzEzT/capture.png
Since those values are not located in the HTML, can I still extract them?
I believe the problem is that these values are rendered by JavaScript code that your browser runs and urllib doesn't, so you should use a library that can execute JavaScript.
I just googled "crawler python javascript" and got some Stack Overflow questions and answers that recommend using Selenium or WebKit. You can use those libraries through Scrapy. Here are two snippets:
Rendered/interactive javascript with gtk/webkit/jswebkit
Rendered Javascript Crawler With Scrapy and Selenium RC
I have been working on this exact same issue. I was introduced to BeautifulSoup and later learned about Scrapy. BeautifulSoup is very easy to use, especially if you're new to this. Scrapy apparently has more "features", but I believe you can accomplish what you need with BeautifulSoup.
I had the same issue of not being able to access a website that loads information through JavaScript, and thankfully Selenium was the savior.
A great introduction to Selenium can be found here.
Install: pip install selenium
Below is a simple class I put together. You can save it as a .py file and import it into your project. Call the method retrieve_source_code(self, domain) with the hyperlink you are trying to parse, and it will return the source code of the fully loaded page, which you can then feed into BeautifulSoup to find the information you're looking for!
Ex:
airfare_url = 'http://www.skyscanner.net/flights/lond/rome/120922/120929/airfares-from-london-to-rome-in-september-2012.html'
scraper = SeleniumWebScraper()
soup = BeautifulSoup(scraper.retrieve_source_code(airfare_url))
Now you can parse soup like you normally would with Beautifulsoup.
I hope that helps you!
from selenium import webdriver

class SeleniumWebScraper():
    def __init__(self):
        self.source_code = ''
        self.is_page_loaded = 0
        self.driver = webdriver.Firefox()
        self.is_browser_closed = 0
        # To ensure the page has fully loaded we will 'implicitly' wait
        self.driver.implicitly_wait(10)  # Seconds

    def close(self):
        self.driver.close()
        self.clear_source_code()
        self.is_page_loaded = 0
        self.is_browser_closed = 1

    def clear_source_code(self):
        self.source_code = ''
        self.is_page_loaded = 0

    def retrieve_source_code(self, domain):
        if self.is_browser_closed:
            self.driver = webdriver.Firefox()
        # The driver.get method will navigate to a page given by the URL.
        # WebDriver will wait until the page has fully loaded (that is, the "onload" event has fired)
        # before returning control to your test or script.
        # It's worth noting that if your page uses a lot of AJAX on load then
        # WebDriver may not know when it has completely loaded.
        self.driver.get(domain)
        self.is_page_loaded = 1
        self.source_code = self.driver.page_source
        return self.source_code
You don't even need BeautifulSoup to extract the data.
Just do this, and the response is converted to a dictionary, which is very easy to handle:
text = json.loads(response.text)
You can now print any key-value pair from the dictionary.
Give it a try. It is super easy.
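A minimal sketch of that idea, assuming the endpoint returns JSON (the response text here is a literal stand-in for response.text from a real request):

```python
import json

# Stand-in for response.text from an endpoint that returns JSON.
response_text = '{"price": 129.99, "currency": "GBP"}'

data = json.loads(response_text)  # now a plain dict
print(data['price'])              # -> 129.99
print(data['currency'])           # -> GBP
```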