Getting the code to this point has been a step-by-step process. The goal was to visit a list of URLs and scrape specific data from each one, which has been accomplished with the script below:
import requests
from bs4 import BeautifulSoup as bs
import json
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
urls = ['https://www.nba.com/game/bkn-vs-phi-0022100993',
        'https://www.nba.com/game/was-vs-lac-0022100992']

for url in urls:
    r = requests.get(url, headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    print('Title:', json_obj['props']['pageProps']['story']['header']['headline'])
    print('Date:', json_obj['props']['pageProps']['story']['date'])
    print('Content:', json_obj['props']['pageProps']['story']['content'])
I had an idea I hoped to implement, and I feel I'm very close, but I'm not sure why it isn't running. Basically, rather than having a static list of URLs, I wanted to use a Google Sheet as the source of URLs. Meaning, a column on the first tab will hold the URL list that needs to be scraped.
From there, when run, the script will pull the URLs from the first tab, scrape the data, and then push the results to the second tab.
I've been able to print the URLs in the terminal, so I'm reaching the source and requesting all records successfully.
I thought I'd then be able to loop through those links in the same way (new code):
from unittest import skip
import requests
from bs4 import BeautifulSoup as bs
import json
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_key('1NFrhsJT7T0zm3dRaP5J8OY0FryBHy5W_wEEGvwBg58I')
worksheet = sh.sheet1
freshurls = gc.open("NBA Stories").get_worksheet(1)

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

urls = freshurls.get_all_records()

for url in urls:
    try:
        r = requests.get(url, headers=headers)
        soup = bs(r.text, 'html.parser')
        page_obj = soup.select_one('script#__NEXT_DATA__')
        json_obj = json.loads(page_obj.text)
        title = json_obj['props']['pageProps']['story']['header']['headline']
        date = json_obj['props']['pageProps']['story']['date']
        content = str(json_obj['props']['pageProps']['story']['content'])
        AddData = [url, title, date, content]
        worksheet.append_row(AddData)
    except:
        skip
Even if I switch the ending actions (building AddData and appending rows) to just printing the results, I'm not seeing anything.
It seems like I'm missing a step? Is there something I could do differently here to leverage those URLs right from the sheet, instead of having to paste them into the script every time?
SUGGESTION
You can try using the batch_get method in a separate script file to get the URL data from a sheet tab, and then import that URL data into your scraping script's loop. This reduces complexity and improves the readability of your script. For more context, see the sample script below.
In my understanding, here is your goal:
Put a list of URLs on a specific sheet tab in a spreadsheet file.
Get the URL data from that sheet tab in Python.
Loop through it in your Python script and scrape the data per URL.
Append each scraped result to a second sheet tab.
Sample Script
The getURLsFromSpreadsheet.py file
import gspread

gc = gspread.service_account(filename='creds.json')

# Open a spreadsheet by ID
sh = gc.open_by_key('1XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')

# Get the sheets
wk = sh.worksheet("Sheet1")
appendWk = sh.worksheet("Sheet2")

# E.g. the URLs are listed on Sheet1 in column A
urls = wk.batch_get(('A2:A',))[0]
The scrapeScript.py file
from getURLsFromSpreadsheet import *
import requests
from bs4 import BeautifulSoup as bs
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

for url in urls:
    r = requests.get(url[0], headers=headers)
    soup = bs(r.text, 'html.parser')
    page_obj = soup.select_one('script#__NEXT_DATA__')
    json_obj = json.loads(page_obj.text)
    samplelist = [[str(json_obj['props']['pageProps']['story']['header']['headline']),
                   str(json_obj['props']['pageProps']['story']['date']),
                   str(json_obj['props']['pageProps']['story']['content'])[2:-1]
                   ]]
    appendWk.append_rows(samplelist)
Demonstration
Sample spreadsheet file. URLs are listed on Column A
The Sheet 2 tab after running the scrapeScript.py file
Reference
GSpread Samples
Python – Call function from another file
According to the gspread documentation, get_all_records() returns a list of dictionaries. Under this condition, when for url in urls: is run, each url is a dict like {"header1": "value1", ...} rather than a URL string. I think this might be the reason for your issue.
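As a quick illustration (a sketch only: the "URL" header and the sheet contents in the comments are assumptions, not taken from your actual spreadsheet), the two retrieval methods return differently shaped data:
records = freshurls.get_all_records()
print(records[0])     # {'URL': 'https://www.nba.com/game/...'} -> a dict, not a URL string
values = freshurls.get_all_values()
print(values[0])      # ['URL']                                 -> the header row
print(values[1][0])   # 'https://www.nba.com/game/...'          -> the plain URL string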
Unfortunately, I can't tell from your question which column holds the URLs. Assuming column "A" has the URLs you want to use, how about the following modification?
From:
urls = freshurls.get_all_records()
for url in urls:
To:
column = 1 # This means column "A".
urls = freshurls.get_all_values()
for e in urls[1:]:
    url = e[column - 1]
    # print(url) # You can check the URL.
In this modification, the values are retrieved from column "A" using get_all_values, and the first header row is skipped. get_all_values returns the values as a two-dimensional array.
If your actual situation uses a different column than column "A", please modify the script accordingly.
Reference:
get_all_values(**kwargs)
Related
I'm new to web scraping. I was trying to make a script that gets data from a balance sheet (here is the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I look at the source code in my browser, I'm able to find the tag and the correct value, but once I write a script with bs4, I don't get anything.
I'm trying to get information from the balance sheet: Products, Services, Cost of sales, and the data contained in the first table. (I'm sorry, but I can't post the image; anyway, it's the first table you see when scrolling down.)
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data,"lxml")
names = soup_data.find_all("td")
for name in names:
    print(name)
Thanks for your time.
Try this URL instead: the direct document URL without the ix?doc= viewer wrapper, as used in the code below. Also include the headers to get the data.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.
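As a rough sketch of how you might then pull rows out of the financial statement table using the same soup_data object (the table index and cell layout here are assumptions; inspect the page to find the right table):
tables = soup_data.find_all("table")
# Which table holds Products / Services / Cost of sales is an assumption;
# adjust the index after inspecting the filing.
for row in tables[0].find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if any(cells):
        print(cells)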
I want my embed message to look like this, but mine only returns one link.
Here's my code:
import requests
from bs4 import BeautifulSoup
from discord_webhook import DiscordWebhook, DiscordEmbed

url = 'https://www.solebox.com/Footwear/Basketball/Lebron-X-JE-Icon-QS-variant.html'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

for tag in soup.find_all('a', class_="selectSize"):
    # There are multiple 'id's, resulting in more than one link
    aid = tag.get('id')
    # There are also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links = "https://www.solebox.com/{0}".format(aid)

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.set_author(name='Brand')
embed.set_thumbnail(url="Image")
embed.set_footer(text='Footer')
embed.set_timestamp()
embed.add_embed_field(name='Sizes', value='US{0}'.format(size))
embed.add_embed_field(name='Links', value='[Links]({0})'.format(product_links))
webhook.add_embed(embed)
webhook.execute()
This will most likely get you the results you want. type(product_links) is a string, meaning that every iteration of your for loop just rewrites the product_links variable with a new string. If you declare a list before the loop and append each product_links value to that list, it will most likely result in what you wanted.
Note: I had to use a different URL from that site; the one specified in the question was no longer available. I also had to use a different header, as the one the asker put up continuously fed me a 403 error.
Additional note: the URLs returned via your code logic lead nowhere. I feel you'll need to work that one through, since I don't know exactly what you're trying to do; however, I feel this answers the question of why you were only getting one link.
import requests
from bs4 import BeautifulSoup

url = 'https://www.solebox.com/Footwear/Basketball/Air-Force-1-07-PRM-variant-2.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")

product_links = []  # Create our list of product links

for tag in soup.find_all('a', class_="selectSize"):
    # There are multiple 'id's, resulting in more than one link
    aid = tag.get('id')
    # There are also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links.append("https://www.solebox.com/{0}".format(aid))
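A minimal sketch of how the collected list could then feed a single embed, continuing from the code above (the field layout is an assumption on my part; note that Discord caps an embed field value at 1024 characters, so a long list may need splitting):
from discord_webhook import DiscordWebhook, DiscordEmbed

webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
# One field holding every collected link, one per line (layout is an assumption)
embed.add_embed_field(name='Links', value='\n'.join(product_links))
webhook.add_embed(embed)
webhook.execute()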
I'm trying to extract specific text from many URLs that are being returned.
I'm using Python 2.7 with requests and BeautifulSoup.
The reason is that I need to find the latest URL, which can be identified by the highest number, e.g. "DF_7", with 7 being the highest in the URLs below. This URL will then be downloaded.
Note: each day new files are added, which is why I need to check for the one with the highest number.
Once I find the highest number in the list of URLs, I then need to join
"https://service.rl360.com/scripts/customer.cgi/SC/servicing/" to the URL with the highest number.
The final product should look like this:
https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Reference=DF_7&SortField=ExpiryDays&SortOrder=Ascending
The URLs look like this, just with the DF_ number incrementing each time.
Is this the right approach? If so, how do I go about doing this?
Thanks
import base
import requests
import zipfile, StringIO, re
from lxml import html
from bs4 import BeautifulSoup
from base import os
from django.conf import settings

# Fill in your details here to be posted to the login form.
payload = {
    'USERNAME': 'xxxxxx',
    'PASSWORD': 'xxxxxx',
    'option': 'login'
}
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}

# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    p = s.post('https://service.rl360.com/scripts/customer.cgi?option=login', data=payload)
    # An authorised request.
    r = s.get('https://service.rl360.com/scripts/customer.cgi/SC/servicing/downloads.php?Folder=DataDownloads&SortField=ExpiryDays&SortOrder=Ascending', stream=True)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    table = soup.find('table')
    links = table.find_all('a')
    print links
You can go straight to the last link with the class "tabletd" and print its href value like this:
href = soup.find_all("a", {'class':'tabletd'})[-1]['href']
base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
print (base + href)
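If you would rather not rely on the table rows already being sorted, here is a sketch that picks the highest DF_<n> explicitly (the Reference=DF_<n> query-string pattern is taken from the example URL in your question; adjust the regex if the real hrefs differ):
import re

base = "https://service.rl360.com/scripts/customer.cgi/SC/servicing/"
links = soup.find_all("a", {"class": "tabletd"})

def df_number(link):
    # Pull the number out of e.g. "downloads.php?Reference=DF_7&..."
    match = re.search(r"Reference=DF_(\d+)", link.get("href", ""))
    return int(match.group(1)) if match else -1

latest = max(links, key=df_number)
print (base + latest["href"])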
import requests
a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')
b = requests.get(a)
print(b.text)
print(b.url)
If you copy the printed URL and paste it into a browser, the site opens with no problem, but if I do requests.get, I get some token errors. Is there anything I can do?
Via requests.get I get a response back, but not the data I see when browsing manually. It says: <html><head><TITLE>TESS -- Error</TITLE></head><body>
First of all, make sure you follow the website's Terms of Use and usage policies.
This is a little bit more complicated than it may seem. You need to maintain a certain state throughout the web-scraping session, and you'll need an HTML parser, like BeautifulSoup, along the way:
from urllib.parse import parse_qs, urljoin

import requests
from bs4 import BeautifulSoup

SEARCH_TERM = 'coca-cola'

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}

    # get the current search state
    response = session.get("https://tmsearch.uspto.gov/")
    soup = BeautifulSoup(response.content, "html.parser")
    link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]
    session.get(urljoin(response.url, link))
    state = parse_qs(link)['state'][0]

    # perform a search
    response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
        'f': 'toc',
        'state': state,
        'p_search': 'search',
        'p_s_All': '',
        'p_s_ALL': SEARCH_TERM + '[COMB]',
        'a_default': 'search',
        'a_search': 'Submit'
    })

    # print search results
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find("font", color="blue").get_text())

    table = soup.find("th", text="Serial Number").find_parent("table")
    for row in table('tr')[1:]:
        print(row('td')[1].get_text())
It prints all the serial number values from the first search results page, for demonstration purposes.
I have been using the following code to parse the web page at https://www.blogforacure.com/members.php. The code is expected to return the links of all the members on the given page.
from bs4 import BeautifulSoup
import urllib
r = urllib.urlopen('https://www.blogforacure.com/members.php').read()
soup = BeautifulSoup(r,'lxml')
headers = soup.find_all('h3')
print(len(headers))
for header in headers:
    a = header.find('a')
    print(a.attrs['href'])
But I get only the first 10 links from the above page. Even when printing the prettified output I see only the first 10 links.
The results are dynamically loaded by making AJAX requests to the https://www.blogforacure.com/site/ajax/scrollergetentries.php endpoint.
Simulate them in your code with requests maintaining a web-scraping session:
from bs4 import BeautifulSoup
import requests

url = "https://www.blogforacure.com/site/ajax/scrollergetentries.php"

with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}
    session.get("https://www.blogforacure.com/members.php")

    page = 0
    members = []
    while True:
        # get page
        response = session.post(url, data={
            "p": str(page),
            "id": "#scrollbox1"
        })
        html = response.json()['html']

        # parse html
        soup = BeautifulSoup(html, "html.parser")
        page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
        print(page, page_members)
        members.extend(page_members)

        page += 1
It prints the current page number and the list of members per page, accumulating member names into the members list. I'm not posting what it prints since it contains names.
Note that I've intentionally left the loop endless; please figure out the exit condition, perhaps when response.json() throws an error.
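For example, the loop could be made to stop like this (a sketch only: whether the endpoint returns invalid JSON or simply an empty page once the members run out is an assumption, so check which actually happens):
while True:
    response = session.post(url, data={
        "p": str(page),
        "id": "#scrollbox1"
    })
    try:
        html = response.json()['html']
    except ValueError:
        break  # the endpoint stopped returning JSON
    soup = BeautifulSoup(html, "html.parser")
    page_members = [member.get_text() for member in soup.select(".memberentry h3 a")]
    if not page_members:
        break  # no more members on this page
    members.extend(page_members)
    page += 1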