Scraping a website only provides partial or random data in the CSV - Python

I am trying to extract a list of golf course names and addresses from the Garmin website using the script below.
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print name
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
After running the script, I don't end up with the full data that I need; when executed it produces what looks like a random subset rather than a complete set of data. I need to extract information from 893 pages and get a list of at least 18,000 courses, but after running this script I only get 122. How do I fix this script so that it produces a CSV with the complete set of golf courses from the Garmin website? I have already corrected the page numbers to reflect the Garmin site's paging, which advances in steps of 20.

Just taking a guess here, but try checking r.status_code and confirming that it's 200 for every page you request? It's possible that you're not actually getting the whole website back.
Stab in the dark.
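For example, a minimal sketch of that check, assuming the same URL pattern as in the question (the 'result' substring check is just a rough heuristic, not something confirmed against the site):

import requests

for i in range(893):
    url = ("http://sites.garmin.com/clsearch/courses"
           "?browse=1&country=US&lang=en&per_page={}".format(i * 20))
    r = requests.get(url)
    if r.status_code != 200:
        print("offset {} returned status {}".format(i * 20, r.status_code))
    elif "result" not in r.text:
        print("offset {} contains no result divs".format(i * 20))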

Related

list index out of range - beautiful soup

New to Python. Below is the code I am using to pull a zip file from a website, but I am getting the error "list index out of range". I was given this code by someone else who wrote it, but I had to change the URL, and now I am getting the error. When I print(list_of_documents) it is blank.
Can someone help me with this? The URL requires access, so you won't be able to run this code directly. I am trying to understand how Beautiful Soup is used here and how I can get the list to populate correctly.
import datetime
import requests
import csv
from zipfile import ZipFile as zf
import os
import pandas as pd
import time
from bs4 import BeautifulSoup
import pyodbc
import re
# set download location (a raw string can't end with a backslash)
downloads_folder = r"C:\Scripts"
##### Creating outage dataframe
# Get list of download links
res = requests.get('https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD')
ercot_soup = BeautifulSoup(res.text, "lxml")
list_of_documents = ercot_soup.findAll('td', attrs={'class': 'labelOptional_ind'})
list_of_links = ercot_soup.select('a')
## create the url for the download
loc = str(list_of_links[0])[9:len(str(list_of_links[0]))-9]
link = 'http://www.ercot.com' + loc
link = link.replace('amp;', '')
# Define file name and set download path
file_name = str(list_of_documents[0])[30:len(str(list_of_documents[0]))-5]
file_path = downloads_folder + '/' + file_name
You can't expect code tailored to scrape one website to work for a different link! You should always inspect and explore your target site, especially the parts you need to scrape, so you know the tag names [like td and a here] and identifying attributes [like name, id, class, etc.] of the elements you need to extract data from.
With this site, the info you want from the reportTable is generated after the page loads, via JavaScript, so it won't show up in the requests response. You could either try something like Selenium, or you could try retrieving the data from the source itself.
If you inspect the site and look at the Network tab, you'll find the request that actually retrieves the data for the table, and if you inspect the table's HTML you'll find the scripts that generate that data just above it.
In the suggested solution below, the getReqUrl scrapes your link to get the url for requesting the reports (and also the template of the url for downloading the documents).
def getReqUrl(scrapeUrl):
    res = requests.get(scrapeUrl)
    ercot_soup = BeautifulSoup(res.text, "html.parser")
    script = [l.split('"') for l in [
        s for s in ercot_soup.select('script')
        if 'reportListUrl' in s.text
        and 'reportTypeID' in s.text
    ][0].text.split('\n') if l.count('"') == 2]
    rtID = [l[1] for l in script if 'reportTypeID' in l[0]][0]
    rlUrl = [l[1] for l in script if 'reportListUrl' in l[0]][0]
    rdUrl = [l[1] for l in script if 'reportDownloadUrl' in l[0]][0]
    return f'{rlUrl}{rtID}&_={int(time.time())}', rdUrl
(I couldn't figure out how to scrape the last query parameter [the &_=... part] from the site exactly, but {int(time.time())} seems to get close enough - the results are the same even then, and even when that last bit is omitted entirely, so it's entirely optional.)
The url returned can be used to request the documents:
import json

url = 'https://www.ercot.com/mp/data-products/data-product-details?id=NP3-233-CD'
reqUrl, ddUrl = getReqUrl(url)
reqRes = requests.get(reqUrl).text
rsJson = json.loads(reqRes)
for doc in rsJson['ListDocsByRptTypeRes']['DocumentList']:
    d = doc['Document']
    downloadLink = ddUrl + d['DocID']
    # print(f"{d['FriendlyName']} {d['PublishDate']} {downloadLink}")
    print(f"Download '{d['ConstructedName']}' at\n\t {downloadLink}")
print(len(rsJson['ListDocsByRptTypeRes']['DocumentList']))
The printed results will show each document's constructed name followed by its download link, and the final line prints the total number of documents in the list.
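Since the original goal was to pull a zip file, a hedged sketch of downloading the first document via its constructed link might look like this (it reuses rsJson and ddUrl from above; saving to the current folder under the ConstructedName is an assumption):

# Hypothetical download of the first document in the list, reusing the JSON above.
first = rsJson['ListDocsByRptTypeRes']['DocumentList'][0]['Document']
resp = requests.get(ddUrl + first['DocID'])
with open(first['ConstructedName'], 'wb') as f:  # file name taken from the ConstructedName field
    f.write(resp.content)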

Recursive crawling with BeautifulSoup really slow

I'm building a crawler that downloads all .pdf files of a given website and its subpages. For this, I've used built-in functionality around the simplified recursive function below, which retrieves all links of a given URL.
However, the crawl becomes quite slow the longer it runs on a given website (it may take 2 minutes or longer per URL).
I can't quite figure out what's causing this and would really appreciate suggestions on what needs to be changed in order to increase the speed.
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"https://www.srs-stahl.de/{page_url}").text
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
It is not that easy to figure out what actively slows down your crawling - it may be the way you crawl, the server of the website, ...
In your code, you request a URL, grab the links and immediately call the function on itself within the first iteration, so you only ever track URLs that have already been requested.
You may want to work with "queues" to keep the process more transparent.
One advantage is that if the script aborts, you still have this information stored and can use it to resume from the URLs you have already collected but not yet visited - quite the opposite of your recursion, which may have to start over from an earlier point to ensure it gets all URLs.
Another point is that you request the PDF files without using the response in any way. Wouldn't it make more sense to either download and save them directly, or to skip the request and keep the links in a separate "queue" for post-processing? (A sketch of the download step follows the example below.)
Collected information in comparison - Based on iterations
Code in question:
pages --> 24
Example code (without delay):
urlsVisited --> 24
urlsToVisit --> 87
urlsToDownload --> 67
Example
Just to demonstrate - feel free to create defs, classes and structure to your needs. Note that some delay was added, but you can skip it if you like. The "queues" used to demonstrate the process are lists, but they should be files, a database, ... so your data is stored safely.
import requests, time
from bs4 import BeautifulSoup

baseUrl = 'https://www.srs-stahl.de'
urlsToDownload = []
urlsToVisit = ["https://www.srs-stahl.de/"]
urlsVisited = []

def crawl(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('a[href^="/"]'):
        url = f"{baseUrl}{a['href']}"
        if '.pdf' in url and url not in urlsToDownload:
            urlsToDownload.append(url)
        else:
            if url not in urlsToVisit and url not in urlsVisited:
                urlsToVisit.append(url)

while urlsToVisit:
    url = urlsToVisit.pop(0)
    try:
        crawl(url)
    except Exception as e:
        print(f'Failed to crawl: {url} -> error {e}')
    finally:
        urlsVisited.append(url)
        time.sleep(2)
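Following up on the point about saving the PDFs directly, a minimal sketch of a post-processing step over the collected urlsToDownload queue could look like this (the "pdfs" folder name is an assumption):

import os

# Hypothetical download step: save every collected PDF link to a local folder.
download_dir = "pdfs"  # assumed target folder
os.makedirs(download_dir, exist_ok=True)

for pdf_url in urlsToDownload:
    file_name = pdf_url.rsplit('/', 1)[-1]
    response = requests.get(pdf_url)
    if response.status_code == 200:
        with open(os.path.join(download_dir, file_name), 'wb') as f:
            f.write(response.content)
    time.sleep(2)  # stay polite, same delay as the crawl loop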

Web scraping using beautiful soup Python

I am trying to web scrape some data from the website - https://boardgamegeek.com/browse/boardgame/page/1
After I have obtained the names of the games and their scores, I would also like to open each of these pages and find out how many players are needed for each game. But when I go into each of the games, the URL has a unique number.
For example: when I click on the first game, Gloomhaven, it opens the page https://boardgamegeek.com/boardgame/174430/gloomhaven (the unique number is 174430).
random_no = r.randint(1000, 300000)
url2 = "https://boardgamegeek.com/boardgame/" + str(random_no) + "/" + name[0]
page2 = requests.get(url2)
if page2.status_code == 200:
    print("this is it!")
    break
So I generated a random number and plugged it into the URL and read the response. However, even a wrong number gives a 200 response, but it does not open the correct page.
What is this unique number? How can I get information about it? Or can I use an alternative to get the information I need?
Thanks in advance.
Try this
import requests
import bs4

s = bs4.BeautifulSoup(requests.get(
    url='https://boardgamegeek.com/browse/boardgame/page/1',
).content, 'html.parser').find('table', {'id': 'collectionitems'})

urls = ['https://boardgamegeek.com' + x['href'] for x in s.find_all('a', {'class': 'primary'})]
print(urls)
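The unique number is just BoardGameGeek's internal game id, and it is already embedded in each of those primary hrefs, so you don't need to guess random numbers. A small sketch of pulling it out (assuming the /boardgame/<id>/<name> path shape from the example above):

# Hypothetical follow-up: extract the numeric game id from each scraped URL.
for game_url in urls:
    parts = game_url.split('/')  # e.g. ['https:', '', 'boardgamegeek.com', 'boardgame', '174430', 'gloomhaven']
    game_id, game_name = parts[4], parts[5]
    print(game_id, game_name, game_url)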

I am having a hard time creating a program that finds tor nodes

I am trying to create a web scraping program that goes to a specific website, collects the Tor nodes and then compares them to a list that I have. If an IP address matches, it's a Tor node; if not, it isn't.
I am having a hard time getting the "text" out of the inspect-element view of the website (screenshot: https://i.stack.imgur.com/16zWw.png).
Any help is appreciated; I'm stuck right now and don't know how to get that "text" to show up in my program. Thanks in advance.
Here is the code to my program so far:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.dan.me.uk/tornodes')
soup = BeautifulSoup(page.content, 'html.parser')
search = soup.find(id='content')
#137.74.19.201 is practice tor node
items = search.find_all(class_='article box')
Why bother with BeautifulSoup?! The asker states clearly that there are markers in the page ... just take the whole page as a string, split by those markers and go from there, for example:
import requests

page = requests.get('https://www.dan.me.uk/tornodes')

# page.text contains the source code of the page
if "<!--__BEGIN_TOR_NODE_LIST__-->" not in page.text:
    print("list not ready")
else:
    list_text = page.text.split("<!--__BEGIN_TOR_NODE_LIST__-->")[1]  # take everything after this
    list_text = list_text.split("<!--__END_TOR_NODE_LIST__-->")[0]    # take everything before this
    line_list = [line.strip() for line in list_text.split("<br>")]
    for line in line_list:
        line_ip = line.split("|")[0]
        # now do what you want with it
        if line_ip in my_known_ip_list:  # my_known_ip_list is the list you already have
            print("This is good %s" % line_ip)
import urllib.request  # the lib that handles the url stuff

target_url = 'https://www.dan.me.uk/torlist/'
my_ips = ['1.161.11.204', '1.161.11.205']
confirmed_ips = []
for line in urllib.request.urlopen(target_url):
    ip = line.decode().strip()  # each line comes back as bytes with a trailing newline
    if ip in my_ips:
        print(ip)
        confirmed_ips.append(ip)
print(confirmed_ips)
# ATTENTION:
# Umm... You can only fetch the data every 30 minutes - sorry. It's pointless any faster as I only update every 30 minutes anyway.
# If you keep trying to download this list too often, you may get blocked from accessing it completely.
# (this is due to some people trying to download this list every minute!)
Since there is this 30-minute limitation (otherwise you will receive an ERROR 403), you can read the lines once, save them to a file, and then compare your list against the downloaded list, as sketched below.
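A minimal sketch of that cache-and-compare approach, assuming a local file name of torlist.txt:

import os
import urllib.request

cache_file = 'torlist.txt'  # assumed local cache file
target_url = 'https://www.dan.me.uk/torlist/'
my_ips = ['1.161.11.204', '1.161.11.205']

# Download the list once and cache it, so repeated runs don't hit the 30-minute limit.
if not os.path.exists(cache_file):
    with urllib.request.urlopen(target_url) as response, open(cache_file, 'wb') as f:
        f.write(response.read())

with open(cache_file) as f:
    tor_ips = {line.strip() for line in f}

confirmed_ips = [ip for ip in my_ips if ip in tor_ips]
print(confirmed_ips)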

How to web scrape with requests and BS4 when the result is generated by a script?

I am trying to get some data from this website:
http://www.espn.com.br/futebol/resultados/_/liga/BRA.1/data/20181018
When I inspect the page in my browser I can see all the values I need in the HTML. I want to fetch the game results and the players' names (for each date, in this example 2018-10-18).
On days with no games the website shows:
"Sem jogos nesta data" ("No games on this date"), which is easy to find when inspecting in the browser.
But when using
url = 'http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018'
page = requests.get(url, "lxml")
The output is basically the page source, but in it I can't find the phrase "Sem jogos nesta data".
How can I fetch the HTML containing the script results? Is it possible with requests? urllib?
Looks like the data you are looking for comes from their backend API. I would use the selenium package for Python instead of requests.
Here is an example:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")
value = driver.find_elements(By.XPATH, '//*[@id="events"]/div')
driver.close()
I didn't check the code, but it should work.
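If the elements come back empty because the JavaScript hasn't finished rendering, a hedged variant that waits for the script-generated content before reading its text might look like this (the id="events" XPath comes from the answer above; the 10-second timeout is an assumption):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.espn.com.br/futebol/resultados/_/liga/todos/data/20181018")

# Wait (up to an assumed 10 seconds) until the script-generated divs exist.
elements = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@id="events"]/div'))
)
for element in elements:
    print(element.text)  # rendered text, e.g. "Sem jogos nesta data" on empty days

driver.quit()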
