Python BS4 Not Retrieving Results

Using the code below, I am able to fetch soup without an issue. My goal is ultimately to fetch the title within the soup object, but I'm having trouble figuring out how to do it. Besides the below, I've also tried various iterations of soup['results'], soup.results, soup.get_text().results, etc., and I'm not sure how to get to it. I can, of course, do soup.get_text() and then run some kind of search for the string "title", but I feel like there has to be a built-in method for this.
(55)get_title()
     54     ipdb.set_trace()
---> 55     title = soup.html.head.title.string
     56     title = re.sub(r'[^\x00-\x7F]+',' ', title)

ipdb> type(soup)
<class 'bs4.BeautifulSoup'>
ipdb> soup.title
ipdb> print soup.title
None
ipdb> soup
{"status":"OK","copyright":"Copyright (c) 2018 The New York Times Company. All Rights Reserved.","section":"home","last_updated":"2018-01-07T06:19:00-05:00","num_results":42,"results":[{"section":"Briefing","subsection":"","title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing", ....
Code
from __future__ import division
import regex as re
import string
import urllib2
from bs4 import BeautifulSoup
from cookielib import CookieJar
import ipdb

PARSER_TYPE = 'html.parser'

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)  # This loads fine
    ipdb.set_trace()
    title = soup.html.head.title.string  # This is sad
    title = re.sub(r'[^\x00-\x7F]+',' ', title)
    return title

Take a look at what p.read() returns. You will find that it is not HTML; it is a JSON string. You can't use an HTML parser to successfully parse JSON, but you can use a JSON parser such as the one provided in the json package.
import json
p = opener.open(url)
response = json.loads(p.read())
Following this, response will reference a dictionary. You can then use dictionary access methods to extract a particular piece of data:
title = response['results'][0]['title']
Note here that response['results'] is itself a list, so you need to get the first element of that list (at least for the example that you've shown). response['results'][0] then gives a second nested dictionary that contains the data that you want; look that up with the title key.
Since the results are contained in a list you might need to iterate over that list to process each result:
for result in response['results']:
    print(result['title'])
If some results do not have title keys you can use dict.get() to perform the lookup without raising an exception:
for result in response['results']:
    print(result.get('title'))
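Putting it together, here is a minimal sketch of get_title() rewritten to parse the JSON instead, keeping the question's Python 2 setup and returning the title of the first result:
import json
import urllib2
from cookielib import CookieJar

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    # Parse the body as JSON rather than feeding it to an HTML parser
    response = json.loads(p.read())
    # Title of the first result; .get() avoids a KeyError if a result has no title
    return response['results'][0].get('title')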

Related

Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags

I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I could just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and the official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get its text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
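From there, extracting the values works the same way as in the first answer, for example:
nct_ids = [field.text for field in m1_nctid]
official_titles = [field.text for field in m1_officialtitle]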

beautifulsoup with bscscan api in python

I am trying to use BeautifulSoup with the BscScan API, but I don't know how to separate the data it returns. Could someone guide me?
from bs4 import BeautifulSoup
import requests
import pandas as pd
url_base = 'https://api.bscscan.com/api?module=stats&action=tokensupply&contractaddress='
contract = '0x6053b8FC837Dc98C54F7692606d632AC5e760488'
url_fin = '&apikey=YourApiKeyToken'
url = url_base+contract+url_fin
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
totalsupply = soup.find('p').text
print(totalsupply)
The first part of the solution is almost identical to what you already have. Since you aren't using pandas, I've removed it. I've also changed the parser from lxml to html.
from bs4 import BeautifulSoup
import requests
url_base = 'https://api.bscscan.com/api?module=stats&action=tokensupply&contractaddress='
contract = '0x6053b8FC837Dc98C54F7692606d632AC5e760488'
url_fin = '&apikey=YourApiKeyToken'
url = url_base+contract+url_fin
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html')
By now, if you print soup you will see something like this:
{"status":"1","message":"OK","result":"1289436"}
You may think that's a Python dictionary, but that's only its representation (__repr__ or __str__). You still can't extract the keys and values as you would with a normal dictionary: soup is an instance of bs4.BeautifulSoup. So parse that text as JSON and then save each of the three items as its own variable:
from operator import itemgetter
import json
d = json.loads(soup.get_text())
status, message, result = itemgetter('status', 'message', 'result')(d)
Now you will have status, message, and result each as a variable.
Note that if you already had totalsupply as a valid dictionary, you could simply skip the json.loads() step above:
from operator import itemgetter
status, message, result = itemgetter('status', 'message', 'result')(totalsupply)
# alternatively if you have guarantees on the order:
status, message, result = totalsupply.values()
I would say that, until fairly recently, dicts in Python were unordered, so unpacking d.values() may actually give you the wrong behavior: regular dicts only preserve insertion order as an implementation detail in CPython 3.6, and as a language guarantee from Python 3.7. If this isn't something you understand yet, I would suggest sticking to the itemgetter solution.
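As a side note, since the endpoint returns JSON, you could skip BeautifulSoup entirely. A minimal sketch using requests' built-in JSON decoding, reusing the URL pieces from the question:
import requests

url = url_base + contract + url_fin  # same URL as in the question
d = requests.get(url).json()         # decode the JSON body directly
status, message, result = d['status'], d['message'], d['result']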

BeautifulSoup: save each iteration of loop's resulting HTML

I have written the following code to obtain the html of some pages, according to some id which I can input in a URL. I would like to then save each html as a .txt file in a desired path. This is the code that I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html

id = ['11111','22222']
for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the HTML pages is quite simple, this loop is not working properly. I am getting the following message: TypeError: write() argument must be str, not None, which means the function is somehow failing to return a string to be saved as a text file.
I would like to say that in the original data I have around 9k ids, so you can also let me know if instead of several .txt files you would recommend a big csv to store all the results. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    #html=print(soup) <-- print() returns None
    return str(soup)  # <--- convert soup to string
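For the saving loop itself, a with block closes each file even if a write fails. A small sketch of the same loop, renaming the loop variable so it no longer shadows the list (the encoding argument is a precaution, not something the question requires):
ids = ['11111', '22222']
for page_id in ids:
    path = f'D://MyPath//{page_id}.txt'
    # with closes the file automatically, even if write() raises
    with open(path, 'w', encoding='utf-8') as f:
        f.write(get_html(page_id))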

Scraping JSON object with beautiful soup

Background
I am attempting to scrape this page: basically, get the name of each product, its price, and its image. I was expecting to see the divs that contain the products in the soup, but I did not. So I opened the URL in my Chrome browser, and upon doing inspect element in the Network tab I found the GET call it makes directly to this page to get all the product-related information. If you open that URL you will see basically a JSON object, and there is an HTML string in there with the divs for the products and prices. The question for me is how I would parse this.
Attempted Solution
I thought one obvious way is to convert the soup into JSON, and to do that the soup needs to be a string, so that's exactly what I did. The issue now is that my json_data variable basically holds a string. So when I attempt to do something like json_data['Results'], it gives me an error saying I can only pass ints. I am unsure how to proceed further.
I would love suggestions and any pointers if i am doing something wrong.
Following is my code:
from bs4 import BeautifulSoup
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
import requests
import json
import sys
sys.stdout = open('output.html', 'wt')
page_to_scrape = 'https://shop.guess.com/en/catalog/browse/men/tanks-t-shirts/view-all/?filter=true&page=1'
software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=100)
page = requests.get(page_to_scrape, headers={'User-Agent': user_agent_rotator.get_random_user_agent()})
soup = BeautifulSoup(page.content, "html.parser")
json_data = json.dumps(str(soup))
print(json_data)
The problem is that json_data is a string, not a dict, since json.dumps(str(soup)) returns a string. Because json_data is a string, we cannot do json_data['Results']; to access an element of a string we need to pass an index, hence the error.
EDIT
To get Results from the response, the code is shown below:
json_data = json.loads(soup.text)
print(json_data['Results'])
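Since Results holds an HTML fragment, you can hand that string back to BeautifulSoup to pick out the products. A sketch; the div class name here is hypothetical and needs to be checked against the actual markup:
inner = BeautifulSoup(json_data['Results'], 'html.parser')
for product in inner.find_all('div', class_='product'):  # hypothetical class name
    print(product.get_text(strip=True))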
Let me know if this helps!!

Loading more links in a page after sending json requests in Python

I am parsing this URL to get links from one of the boxes with infinite scroll. Here is my code for sending the requests to the website to get the next 10 links:
import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *
baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
    'ticker':'XOM',
    'countryCode':'US',
    'docType':'2007',
    'sequence':'6e09aca3-7207-446e-bb8a-db1a4ea6545c',
    'messageNumber':'1830',
    'count':'10',
    'channelName':'',
    'topic':' ',
    '_':'1479539628362'}
html2 = requests.get(baseUrl, params = parameters2)
html3 = json.loads(html2.text) # array of size 10
In the corresponding HTML, there is an element like:
<li class="loading">Loading more headlines...</li>
that tells me there are more items to be loaded by scrolling down, but I don't know how to use the json file to write a loop that gets more links.
My first try was to use BeautifulSoup and write the following code to get the links and ids:
url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})
and then check whether there are more links to scrape and get the next json file:
loadingMore = pressReleaseBox.find('li',attrs={'class':'loading'})
while loadingMore != None:
    # get the links from json file and load more links
I don't know how to implement the commented part. Do you have any idea about it?
I am not obliged to use BeautifulSoup, and any other working library will be fine.
Here is how you can load more json files (a sketch implementing these steps follows the list):
1. Get the last json file and extract the value of the key UniqueId in its last item.
2. If the value looks like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499, extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence, extract 8499 as messageNumber, and let docId be empty.
3. If the value looks like 1222712881, let sequence and messageNumber be empty and extract 1222712881 as docId.
4. Put the parameters sequence, messageNumber, and docId into your parameters2.
5. Use requests.get(baseUrl, params=parameters2) to get your next json file.
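A minimal sketch of that loop, assuming the endpoint returns a JSON list and that each item carries a UniqueId key as described above (both are assumptions about the response shape):
import requests

params = dict(parameters2)  # start from the parameters shown in the question
headlines = []
while True:
    items = requests.get(baseUrl, params=params).json()
    if not items:
        break  # nothing more to load
    headlines.extend(items)
    unique_id = items[-1]['UniqueId']  # assumed key, per the steps above
    if ':' in unique_id:
        # value looks like sequence:messageNumber
        sequence, message_number = unique_id.split(':', 1)
        params.update(sequence=sequence, messageNumber=message_number, docId='')
    else:
        # value is a bare docId
        params.update(sequence='', messageNumber='', docId=unique_id)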
