BeautifulSoup findall returning empty list - python

So I'm very new to Python and I'm trying to get data from a table on iso-ne.com/isoexpress/ using bs4 and urllib. Here's what I have so far:
from bs4 import BeautifulSoup
from urllib import urlopen

website = 'http://www.iso-ne.com/isoexpress/'
html = urlopen(website).read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'chart'})
rows = table.find_all('tr')
for tr in rows:
    col = tr.find_all('td')
    for td in col:
        text = td.find_all(class_='lmp-list-energy')
        print text
When I run this, I get 6 empty brackets:
[]
[]
[]
[]
[]
[]
The data I am trying to get is the Five Minute Real Time LMP price for the state of New Hampshire on the ISO-NE website.

The data is filled in by JavaScript, which BeautifulSoup does not interpret: you get the raw, empty container.
What I would do (but check the site's legal terms and conditions first...): look at the requests made to the backend (e.g. using the Network tab in Chrome's developer tools)
=> you'll find a call to http://iso-ne.com/ws/wsclient. Grab the parameters that your client is sending (cookies...) and replay the request (or fine-tune the parameters through trial and error).
Good luck (I did manage to replay the request for data with curl, so it should be doable in Python).
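A minimal sketch of replaying that backend call with requests; the payload below is a placeholder, not the real API, so copy the actual form data from the captured request in the Network tab:

import requests

session = requests.Session()
# load the page first so the session picks up whatever cookies the backend expects
session.get('http://www.iso-ne.com/isoexpress/')
response = session.post(
    'http://iso-ne.com/ws/wsclient',
    data={'param': 'value'},  # placeholder: use the parameters captured in your browser
)
print(response.status_code)
print(response.text)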

Related

How to solve a doubling problem when scraping with BeautifulSoup

I have a strange problem with my script which extracts some dates from a webpage.
Here is the script :
# import libraries
import json
import re
import requests
from bs4 import BeautifulSoup

# request the website and download the HTML contents
url = 'https://www.coteur.com/cotes-basket.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

s = soup.find("table", id="mediaTable").find_all('script', type='application/ld+json')
date_hour = [json.loads(re.search(r'>(.+)<', str(j), re.S).group(1))["startDate"] for j in s]
date_hour = [sub.replace("T", " ") for sub in date_hour]
print(len(date_hour))
print(date_hour)
This code is functional: it returns the startDate element inside each script tag.
But one date is doubled: on the webpage I count 24 basketball events, but my list has length 25. The webpage shows 3 events which start at 00:00, while my script extracts 4 dates with 00:00.
Maybe you have an idea why the site does not display these extra entries?
It does not display rows where there are no odds. A script runs on the page and removes from view the rows that have no odds. I think it is currently the script identified by script:nth-child(25), which starts with $(document).on('click'...; it has a test on odds.length, and if that is 0 the row is removed.
You can test this by disabling JavaScript and reloading the page. You will get the same result as your Python request (where the JS doesn't run): the row is present. Re-enable JS and the row will disappear.
You can check whether a given match has odds by going to Rencontres (the main table) for that match > Cotes (also see Pronostics). If you do this with JS disabled you can follow the Rencontres links for all matches and see whether there are odds. Under Pronostics there should be odds-based calculations that aren't both 0.
As for "How to solve a doubling problem when scraping with BeautifulSoup?": there is no way, from the response you get with requests, to distinguish the row(s) that will be missing on the rendered page. I am not sure you can even make additional requests to check the odds info, as it is missing for all matches without JS. You would likely need to switch to Selenium/browser automation, in which case you wouldn't really need BeautifulSoup at all.
There is a small outside chance you might find an API or another site that pulls the same odds, which you could cross-reference.
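A minimal Selenium sketch, assuming chromedriver is available; the row selector ('#mediaTable tr[itemtype]') is a guess and should be checked against the live page:

from selenium import webdriver

d = webdriver.Chrome(r'C:\path\to\chromedriver.exe')  # hypothetical driver path
d.get('https://www.coteur.com/cotes-basket.php')
# after the page's JS has run, only rows with odds remain in the rendered table
rows = d.find_elements_by_css_selector('#mediaTable tr[itemtype]')
print(len(rows))  # should match the 24 events visible in the browser
d.quit()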

Data scraper: the contents of the div tag is empty (??)

I am scraping a website to get a number. The number changes dynamically every split second, and upon inspection in the browser it is shown. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me as I am quite new to Python and data scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper comes back empty.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors, I am just getting [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed it is easy with Selenium. I suspect there is a JS script running a counter that supplies the number, which is why you can't find it with your method (as mentioned in the comments):
from selenium import webdriver
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()
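If you don't want a browser window popping up, Chrome can also run headless; a small sketch (use whatever driver path applies on your machine):

from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument('--headless')  # render the page without opening a window
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe', chrome_options=opts)
d.get('https://deuda-publica-espana.com/')
print(d.find_element_by_id('contador_PDEH').text)
d.quit()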

Scraping Web data with Python

Sorry if this is not the place for this question, but I'm not sure where else to ask.
I'm trying to scrape data from rotogrinders.com and I'm running into some challenges.
In particular, I want to be able to scrape previous NHL game data using urls of this format (obviously you can change the date for other day's data):
https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016
However, when I get to the page, I notice that the data is broken up into pages, and I'm unsure how to get my script to retrieve the data that's presented after clicking the "All" button at the bottom of the page.
Is there a way to do this in Python? Perhaps some library that allows button clicks? Or is there some way to get the data without actually clicking the button, by being clever about the URL/request?
Actually, things are not that complicated in this case. When you click "All", no network requests are issued; all the data is already there, inside a script tag in the HTML, and you just need to extract it.
Working code using requests (to download the page content), BeautifulSoup (to parse HTML and locate the desired script element), re (to extract the desired "player" array from the script) and json (to load the array string into a Python list):
import json
import re
import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

data = pattern.search(script.text).group(1)
data = json.loads(data)

# printing player names for demonstration purposes
for player in data:
    print(player["player"])
Prints:
Jeff Skinner
Jordan Staal
...
William Carrier
A.J. Greer

Using urllib with Python 3

I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user.
However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.
# import urllib so we can get the HTML source
from urllib.request import urlopen

# save the HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")

# make the HTML readable and convert it to a string
content = str(content.read())

# select the part of the string containing the prayer time table
table = content[24885:24935]
print(table)  # print to test what is being selected
I'm not sure what's going on here.
You should really be using something like Beautiful Soup. Something along the lines of the following should help. Looking at the source of that URL, there is no id/class on the table, which makes it a little trickier to find.
from bs4 import BeautifulSoup
import requests

url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for table in soup.find_all('table'):
    # here you can find the table you want and deal with the results
    print(table)
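Once you've identified which of those tables holds the prayer times, the rows can be pulled out of it; a sketch (the table index is an assumption, inspect the output above to confirm):

# assuming the prayer-time table turns out to be, e.g., the first one
target = soup.find_all('table')[0]
for row in target.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(cells)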
You shouldn't be looking for the part you want by grabbing specific indexes of the string: websites are often dynamic, and the content won't sit at the exact same offsets each time.
What you want to do is search for the part you want. Say the table starts with the keyword class="prayer_table": you could locate it with str.find().
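A minimal sketch of that approach, assuming the class name really is prayer_table (check the page source):

# find the slice by searching for a marker string instead of hard-coded offsets
start = content.find('class="prayer_table"')
end = content.find('</table>', start)
table = content[start:end]
print(table)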
Better yet, extract the tables from the webpage instead of relying on str.find(). The code below is adapted from a question on extracting tables from a webpage:
from lxml import etree
from urllib.request import urlopen

web = urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)

## get all 'tr' nodes of the target table
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside the first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## get the text from the remaining 'tr' rows
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
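From there, header and td_content can be stitched together, for example:

# pair each data row with the header for readability
for row in td_content:
    print(dict(zip(header, row)))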

Isolating data from a dynamic table with BeautifulSoup

I'm trying to extract data from a table(1), which has a couple of filter options. I'm using BeautifulSoup and got to this page with Requests. An extract of the code:
from bs4 import BeautifulSoup

tt = Contact_page.content  # webpage with the table
soup = BeautifulSoup(tt, 'html.parser')
R_tables = soup.find('div', {'class': 'responsive-table'})
Using find_all("tr") and find_all("th") returns empty sets. Using R_tables.findChildren only goes down to "formrow", which then has no children. From formrow down to my tr/th tags, I can't reach anything through BS4.
R_tables results in table 3. The XPath for this element is
//*[@id="kronos_body"]/div[3]/div[2]/div[3]/script/text()
How can I get the row information for my data? soup.find("r") and soup.find("f") also return empty sets.
Pardon me in advance if this post is sloppy, this is my first. I'll link my most similar thread in a comment; I can't link more than 2 times.
EDIT 1: Apparently BS doesn't recognize any JavaScript apart from variables (correct me if I'm wrong, I'm still relatively new). Are there any other modules that can help me out? Ghost and Selenium were suggested to me, but I won't be using Selenium.
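Since the XPath above points at a script element, the table data may be embedded as JS/JSON inside that script, as in the rotogrinders answer earlier on this page. A sketch of pulling it out with re and json; the variable-assignment pattern and the JSON shape are guesses, so inspect script.text first:

import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(Contact_page.content, 'html.parser')
script = soup.find('div', {'class': 'responsive-table'}).find_next('script')
# hypothetical pattern: adjust after looking at the actual script contents
match = re.search(r'var\s+\w+\s*=\s*(\[.*?\]);', script.text, re.S)
if match:
    data = json.loads(match.group(1))
    print(data)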
