Background
I am attempting to scrape this page: basically, get the name of each product, its price, and its image. I was expecting to see the divs that contain the products in the soup, but I did not. So what I did was open up the URL in my Chrome browser, and upon doing Inspect Element in the Network tab I found that the GET call it makes goes directly to this page to fetch all the product-related information. If you open that URL you will see basically a JSON object, and there is an HTML string in there with the divs for the products and prices. The question for me is: how would I parse this?
Attempted Solution
I thought one obvious way is to convert the soup into JSON, and in order to do that the soup needs to be a string, which is exactly what I did. The issue now is that my json_data variable basically holds a string, so when I attempt to do something like json_data['Results'] it gives me an error saying I can only pass ints. I am unsure how to proceed further.
I would love suggestions and any pointers if I am doing something wrong.
Following is my code:
from bs4 import BeautifulSoup
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
import requests
import json
import sys
sys.stdout = open('output.html', 'wt')
page_to_scrape = 'https://shop.guess.com/en/catalog/browse/men/tanks-t-shirts/view-all/?filter=true&page=1'
software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
page = requests.get(page_to_scrape, headers={'User-Agent': user_agent_rotator.get_random_user_agent()})
soup = BeautifulSoup(page.content, "html.parser")
json_data = json.dumps(str(soup))
print(json_data)
The error is that json_data is a string and not a dict, since json.dumps(str(soup)) returns a string. Because json_data is a string, we cannot do json_data['Results']; to access an element of a string we need to pass an integer index, hence the error.
EDIT
To get Results from the response, the code is shown below:
json_data = json.loads(str(soup.text))
print(json_data['Results'])
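Since Results itself is just an HTML string, you can feed it back into BeautifulSoup and pull the product details out of that fragment. A minimal sketch, assuming the response really has a Results key holding HTML as described above (the exact selectors for name, price and image depend on the markup, so inspect the fragment first):
results_soup = BeautifulSoup(json_data['Results'], 'html.parser')
# print a bit of the fragment to see the real class names before selecting anything
print(results_soup.prettify()[:500])
# image sources are easy to pull without knowing the class names
print([img.get('src') for img in results_soup.find_all('img')])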
Let me know if this helps!!
Related
I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using ClinicalTrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})
and then iterate over each result to get the text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
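From there, a small usage sketch (reusing the variable names above) to collect the actual values:
nct_ids = [field.get_text() for field in m1_nctid]
official_titles = [field.get_text() for field in m1_officialtitle]
print(nct_ids)
print(official_titles)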
I have written the following code to obtain the HTML of some pages, according to an id which I can pass in a URL. I would like to then save each HTML page as a .txt file in a desired path. This is the code I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html
id = ['11111','22222']
for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the HTML pages is quite simple, this loop is not working properly. I am getting the following message: TypeError: write() argument must be str, not None. This means that the function is somehow failing to return a string to be saved as a text file.
I should mention that in the original data I have around 9k ids, so you can also let me know whether, instead of several .txt files, you would recommend one big CSV to store all the results. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # html = print(soup)  <-- print() returns None
    return str(soup)  # <-- convert soup to a string
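With that fix in place, the saving loop from the question works as-is; here is a minimal sketch of the full flow, keeping the placeholder URL and path from the question and using a with block so each file is closed properly:
ids = ['11111', '22222']
for id in ids:
    path = f'D://MyPath//{id}.txt'  # placeholder path from the question
    with open(path, 'w', encoding='utf-8') as f:
        f.write(get_html(id))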
I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending requests to the website to get the next 10 links:
import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *
baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker': 'XOM',
    'countryCode': 'US',
    'docType': '2007',
    'sequence': '6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber': '1830',
    'count': '10',
    'channelName': '',
    'topic': ' ',
    '_': '1479539628362'}
html2 = requests.get(baseUrl, params = parameters2)
html3 = json.loads(html2.text) # array of size 10
In the corresponding HTML, there is an element like:
<li class="loading">Loading more headlines...</li>
that tells you there are more items to be loaded by scrolling down, but I don't know how to use the JSON file to write a loop that gets more links.
My first try was to use BeautifulSoup and write the following code to get the links and ids:
url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})
and then check whether there are more links to scrape and, if so, get the next JSON file:
loadingMore = pressReleaseBox.find('li', attrs={'class': 'loading'})
while loadingMore != None:
    # get the links from the json file and load more links
I don't know how to implement the commented part. Do you have any idea about it?
I am not obliged to use BeautifulSoup, and any other working library will be fine.
Here is how you can load more JSON files (see the sketch after these steps):
Get the last JSON file and extract the value of the key UniqueId in the last item.
If the value looks like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499:
extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence,
extract 8499 as messageNumber,
and let docId be empty.
If the value looks like 1222712881:
let sequence be empty,
let messageNumber be empty,
and extract 1222712881 as docId.
Put the parameters sequence, messageNumber and docId into your parameters2.
Use requests.get(baseUrl, params=parameters2) to get your next JSON file.
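A rough sketch of those steps, assuming the previous response was already parsed into html3 as in the question and that the key is really called UniqueId (adjust the key names if the actual payload differs):
last_item = html3[-1]              # last headline in the previous batch
unique_id = last_item['UniqueId']  # key named in the steps above; verify against the payload
if ':' in unique_id:
    sequence, message_number = unique_id.split(':', 1)
    doc_id = ''
else:
    sequence, message_number = '', ''
    doc_id = unique_id
parameters2.update({'sequence': sequence, 'messageNumber': message_number, 'docId': doc_id})
next_batch = requests.get(baseUrl, params=parameters2).json()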
I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints the HTML but with a b' at the beginning, indicating it is a bytes object rather than a decoded str. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will have the s variable hold the entire text of the page. You can then use a regular expression to parse out the links and the link target. Here is an example which will pull the link targets without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using regex to get both the link text and the link itself in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup
def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # only keep <a> tags that actually have an href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)
I am trying to get a set of URLs (which are webpages) from The New York Times, but I get a different answer; I am sure that I gave the correct class, though it extracts different classes. My ny_url.txt has "http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis; http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis/since1851/allresults/2/"
Here is my code:
import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
text_file = open('ny_url.txt', 'r')
for line in text_file:
    print line
    soup = BeautifulSoup(opener.open(line))
    links = soup.find_all('div', attrs={'class': 'element2'})
    for href in links:
        print href
Well, it's not that simple.
The data you are looking for is not in the page source downloaded by urllib2.
Try printing opener.open(line).read() and you will find the data to be missing.
This is because the site is making another GET request to http://query.nytimes.com/svc/cse/v2pp/sitesearch.json?query=isis&page=1
where, within the URL, your query parameters are passed as query=isis and page=1.
The data fetched is in JSON format; try opening the URL above in the browser manually and you will find your data there.
So a pure pythonic way would be to call this URL and parse the JSON to get what you want.
No rocket science needed - just parse the dict using the proper keys.
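A minimal sketch of that pure-Python route, assuming the JSON endpoint above (print the payload once to see which keys hold the result URLs before indexing into it):
import requests

search_url = 'http://query.nytimes.com/svc/cse/v2pp/sitesearch.json'
params = {'query': 'isis', 'page': 1}
data = requests.get(search_url, params=params).json()
# inspect the structure first; the exact keys depend on the payload
print(data)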
OR
An easier way would be to use a webdriver like Selenium: navigate to the page and parse the page source using BeautifulSoup. That should easily fetch the entire content.
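If you go the Selenium route, a rough sketch looks like this (it assumes a local chromedriver on your PATH and reuses the div class from the question):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
driver.get('http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis')
soup = BeautifulSoup(driver.page_source, 'lxml')
links = soup.find_all('div', attrs={'class': 'element2'})  # class from the question
driver.quit()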
Hope that helps. Let me know if you need more insights.