I am trying to use beautifulsoup with the bscscan api but I don't know how to separate the data that someone gives me, could someone guide me?
from bs4 import BeautifulSoup
import requests
import pandas as pd
url_base = 'https://api.bscscan.com/api?module=stats&action=tokensupply&contractaddress='
contract = '0x6053b8FC837Dc98C54F7692606d632AC5e760488'
url_fin = '&apikey=YourApiKeyToken'
url = url_base+contract+url_fin
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
totalsupply = soup.find('p').text
print(totalsupply)
The first part of the solution is almost identical to what you already have. Since you didn't need pandas, I've removed it. I've also changed the parser from lxml to html.
from bs4 import BeautifulSoup
import requests
url_base = 'https://api.bscscan.com/api?module=stats&action=tokensupply&contractaddress='
contract = '0x6053b8FC837Dc98C54F7692606d632AC5e760488'
url_fin = '&apikey=YourApiKeyToken'
url = url_base+contract+url_fin
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html')
By now, if you print soup you will see something like this:
{"status":"1","message":"OK","result":"1289436"}
You may think that's a Python dictionary, but that's only the string representation (__repr__ or __str__). You still can't extract the keys and values as you would with a normal dictionary, because soup is an instance of bs4.BeautifulSoup. So parse the text as JSON and then save each of the three items as its own variable:
from operator import itemgetter
import json
d = json.loads(soup.get_text())
status, message, result = itemgetter('status', 'message', 'result')(d)
Now you will have status, message, and result each as a variable.
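For example, with the response shown above:
print(status)   # 1
print(message)  # OK
print(result)   # 1289436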
What's noteworthy here is that if you already have totalsupply as a valid dictionary, you can simply skip the json.loads() step above:
from operator import itemgetter
status, message, result = itemgetter('status', 'message', 'result')(totalsupply)
# alternatively if you have guarantees on the order:
status, message, result = totalsupply.values()
One caveat: until fairly recently, dicts in Python were unordered, so unpacking d.values() may actually give you the values in the wrong order. Plain dicts only preserve insertion order as of Python 3.7 (it was an implementation detail in 3.6; before that you needed collections.OrderedDict). If this isn't something you can rely on, I would suggest sticking to the itemgetter solution.
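A quick illustration of why the ordered unpacking works on modern Python (assuming Python 3.7+):
d = {'status': '1', 'message': 'OK', 'result': '1289436'}
print(list(d.values()))  # ['1', 'OK', '1289436'] -- insertion order is guaranteed here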
I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.find_all("field", {"name": "NCTId"})
m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})
and then iterate each result to get text, for ex:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase (the lxml parser normalizes HTML tag names to lowercase), and pass name as an attribute inside attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
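If you then want to pair each NCTId with its title, a minimal sketch (this assumes the two lists line up one-to-one, which they do in this response):
for nct_id, title in zip(m1_nctid, m1_officialtitle):
    print(nct_id.text, '-', title.text)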
I want to webscrape a few urls. This is what I do:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
url_2021_int = ["https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html","https://www.ecb.europa.eu/press/inter/date/2020/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2019/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2018/html/index_include.en.html", "https://www.ecb.europa.eu/press/inter/date/2017/html/index_include.en.html"]
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text)
    titles_int = soup_int.select(".title a")
    titles_int = [data.text for data in titles_int]
However, I get data only for the last url (2017).
What am I doing wrong?
Thanks!
When you use req_int = requests.get(url) in the loop, the req_int variable is re-written each time.
If you want to store the requests.get(url) results in a list variable you can use
req_ints = [requests.get(url) for url in url_2021_int]
However, it seems logical to process the data in the same loop:
all_titles = []
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text, "html.parser")
    # extend instead of assigning, so earlier years are not overwritten
    all_titles.extend(data.text for data in soup_int.select(".title a"))
Note that you should specify "html.parser" as the second argument to the BeautifulSoup call, since the documents you are parsing are HTML documents.
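If you would rather keep the titles grouped by URL, a small variant (titles_by_url is just a name I've picked here):
titles_by_url = {}
for url in url_2021_int:
    req_int = requests.get(url)
    soup_int = BeautifulSoup(req_int.text, "html.parser")
    titles_by_url[url] = [data.text for data in soup_int.select(".title a")]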
from urllib.request import urlopen
from bs4 import BeautifulSoup
apikey='*****d2deb67f650f022ae13d07*****'
first='http://api.ipstack.com/'
ip='134.201.250.155'
third='?access_key='
print(first+ip+third+apikey)
html = urlopen(first+ip+third+apikey)
soup=BeautifulSoup(html,"html.parser")
print(soup)
I had to hide the first and last 5 digits of my API key. Anyway, this gives:
{"ip":"134.201.250.155","type":"ipv4","continent_code":"NA","continent_name":"North America","country_code":"US","country_name":"United States","region_code":"CA","region_name":"California","city":"La Jolla","zip":"92037","latitude":32.8455,"longitude":-117.2521,"location":{"geoname_id":5363943,"capital":"Washington D.C.","languages":[{"code":"en","name":"English","native":"English"}],"country_flag":"http:\/\/assets.ipstack.com\/flags\/us.svg","country_flag_emoji":"\ud83c\uddfa\ud83c\uddf8","country_flag_emoji_unicode":"U+1F1FA U+1F1F8","calling_code":"1","is_eu":false}}
This gives me a soup object. What do I need to add to get the country_name, geoname_id, and ip into a list so I can write them later to a .json file?
This seems like a JSON response, so you need to parse it with the json library:
import json
parsed_json = json.loads(str(soup))
geoname_id = parsed_json['location']['geoname_id']
country_name = parsed_json['country_name']
ip = parsed_json['ip']
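Since you want to write them to a .json file later, a minimal sketch (output.json is just an example filename):
with open('output.json', 'w') as f:
    json.dump({'ip': ip, 'country_name': country_name, 'geoname_id': geoname_id}, f)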
A better solution when dealing with REST APIs that return JSON responses would be:
import requests
apikey = '*****d2deb67f650f022ae13d07*****'
first = 'http://api.ipstack.com/'
ip = '134.201.250.155'
query_string = {'access_key': apikey}
# no need for the '?access_key=' piece: params appends the query string for you
res = requests.get(first + ip, params=query_string)
res.raise_for_status()
ip = res.json()['ip']
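The other fields you asked about are available from the same parsed response (see the JSON shown above):
data = res.json()
country_name = data['country_name']
geoname_id = data['location']['geoname_id']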
One caveat here: the response body is JSON, not HTML, so the soup object has no tags to navigate. Pull the text back out and parse it with the json library:
import json
soup = BeautifulSoup(html, "html.parser")
data = json.loads(soup.get_text())
print(data['ip'])
# 134.201.250.155
Let me know if you need further help!
Using the below code, I am able to fetch "soup" without an issue. My goal is to ultimately fetch the title within the soup object, but I'm having trouble figuring out how to do it. In addition to what's below, I've also tried various iterations of soup['results'], soup.results, soup.get_text().results, etc., and I'm not sure how to get to it. I could, of course, call soup.get_text() and run some kind of search for the string "title", but I feel like there has to be a built-in method for this.
55)get_title()
54 ipdb.set_trace()
---> 55 title = soup.html.head.title.string
56 title = re.sub(r'[^\x00-\x7F]+',' ', title)
ipdb> type(soup)
<class 'bs4.BeautifulSoup'>
ipdb> soup.title
ipdb> print soup.title
None
ipdb> soup
{"status":"OK","copyright":"Copyright (c) 2018 The New York Times Company. All Rights Reserved.","section":"home","last_updated":"2018-01-07T06:19:00-05:00","num_results":42,"results":[{"section":"Briefing","subsection":"",**"title":"Trump, Palestinians, Golden Globes: Your Weekend Briefing"**, ....
Code
from __future__ import division
import regex as re
import string
import urllib2
from bs4 import BeautifulSoup
from cookielib import CookieJar
import ipdb

PARSER_TYPE = 'html.parser'

def get_title(url):
    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    p = opener.open(url)
    soup = BeautifulSoup(p.read(), PARSER_TYPE)  # This loads fine
    ipdb.set_trace()
    title = soup.html.head.title.string  # This is sad
    title = re.sub(r'[^\x00-\x7F]+', ' ', title)
    return title
Take a look at what p.read() returns. You will find that it is not HTML; it is a JSON string. You can't use an HTML parser to successfully parse JSON; however, you can use a JSON parser such as the one provided in the json package.
import json
p = opener.open(url)
response = json.loads(p.read())
Following this, response will reference a dictionary. You can then use dictionary access methods to extract a particular piece of data:
title = response['results'][0]['title']
Note here that response['results'] is itself a list so you need to get the first element of that list (at least for the example that you've shown). response['results'][0] then gives a second nested dictionary that contains the data that you want. Look that up with the title key.
Since the results are contained in a list you might need to iterate over that list to process each result:
for result in response['results']:
print(result['title'])
If some results do not have title keys you can use dict.get() to perform the lookup without raising an exception:
for result in response['results']:
print(result.get('title'))
I wanted to play around with python to learn it, so I'm taking on a little project, but a part of it requires me to search for a name on this list:
https://bughunter.withgoogle.com/characterlist/1
(the number one is to be incremented by one every time to search for the name)
So I will be HTML scraping it, I'm new to python and would appreciate if someone could give me an example of how to make this work.
import json
import requests
from bs4 import BeautifulSoup
URL = 'https://bughunter.withgoogle.com'
def get_page_html(page_num):
    r = requests.get('{}/characterlist/{}'.format(URL, page_num))
    r.raise_for_status()
    return r.text

def get_page_profiles(page_html):
    page_profiles = {}
    soup = BeautifulSoup(page_html, 'html.parser')  # specify a parser explicitly so bs4 does not have to guess
    for table_cell in soup.find_all('td'):
        profile_name = table_cell.find_next('h2').text
        profile_url = table_cell.find_next('a')['href']
        page_profiles[profile_name] = '{}{}'.format(URL, profile_url)
    return page_profiles

if __name__ == '__main__':
    all_profiles = {}
    for page_number in range(1, 81):
        current_page_html = get_page_html(page_number)
        current_page_profiles = get_page_profiles(current_page_html)
        all_profiles.update(current_page_profiles)
    with open('google_hall_of_fame_profiles.json', 'w') as f:
        json.dump(all_profiles, f, indent=2)
Your question wasn't clear about how you wanted the data structured after scraping, so I just saved the profiles in a dict (with the key/value pairs as {profile_name: profile_url}) and then dumped the results to a JSON file.
Let me know if anything is unclear!
Try this. You will need to install bs4 first (Python 3). It will get all of the names of the people on the website page:
from bs4 import BeautifulSoup as soup
import urllib.request

text = urllib.request.urlopen('https://bughunter.withgoogle.com/characterlist/1').read()
text = soup(text, 'html.parser')
print(text.find_all(class_='item-list')[0].get_text())
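Since you mentioned incrementing the number in the URL each time, you could wrap the same fetch in a loop; a quick sketch (the 1-80 page range mirrors the other answer and is an assumption, adjust as needed):
for page_num in range(1, 81):
    html = urllib.request.urlopen('https://bughunter.withgoogle.com/characterlist/{}'.format(page_num)).read()
    page = soup(html, 'html.parser')
    print(page.find_all(class_='item-list')[0].get_text())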