How do I parse the following SEC data from a .html website in Python?
I'm trying to parse the HTML from the following webpage: https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html
The .txt version of the page contains the three top-level elements that I need to extract and parse into data frames:
Header: <SEC-HEADER>
Primary Document: <edgarSubmission ...>
Information Table: <informationTable ...>
I can see some of the information with the following code, but I don't know how to find the equivalent text elements in the HTML and extract them. How can I proceed?
import requests
from bs4 import BeautifulSoup

url = "https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001-index.html"
# SEC rejects requests that don't send a User-Agent header
request = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(request.text, 'lxml')
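One way to proceed, sketched below under two assumptions that are not stated in the question: EDGAR's complete submission file usually lives at the index URL with "-index.html" replaced by ".txt", and each XML document inside it is wrapped in an <XML>...</XML> block. A minimal sketch, not a definitive parser:
import requests
from bs4 import BeautifulSoup

# Assumed URL pattern: drop "-index.html" to get the full .txt submission
txt_url = "https://www.sec.gov/Archives/edgar/data/1831187/0001831187-23-000001.txt"
raw = requests.get(txt_url, headers={"User-Agent": "Mozilla/5.0"}).text

# Header: the plain text between <SEC-HEADER> and </SEC-HEADER>
header = raw.split("<SEC-HEADER>")[1].split("</SEC-HEADER>")[0]

# Assumption: the primary document and the information table are the first
# and second <XML>...</XML> blocks in the file, in that order
xml_blocks = [b.split("</XML>")[0].strip() for b in raw.split("<XML>")[1:]]
submission = BeautifulSoup(xml_blocks[0], "lxml-xml").find("edgarSubmission")
info_table = BeautifulSoup(xml_blocks[1], "lxml-xml").find("informationTable")
Each of the three pieces could then be flattened into a pandas DataFrame.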
I'm new to working with XML and BeautifulSoup, and I am trying to get a dataset of clinical trials using ClinicalTrials.gov's new API, which converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I could convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get its text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase, and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
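To get from those tag lists to plain values, a short usage sketch (my addition, assuming the ids and titles come back in matching order):
nct_ids = [tag.get_text(strip=True) for tag in m1_nctid]
titles = [tag.get_text(strip=True) for tag in m1_officialtitle]
# Pair each trial's NCTId with its official title
for nct_id, title in zip(nct_ids, titles):
    print(nct_id, title)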
I am trying to scrape the data from this link.
I have tried it this way:
from bs4 import BeautifulSoup
import urllib.request
import csv  # intended for the later CSV/Excel export

# specify the url
urlpage = 'https://www.ikh.se/sysNet/getProductsJSON/getProductsJSONDB.aspx?' \
          'sua=1&lang=2&navid=19277994'
# query the website and return the response to the variable 'page'
page = urllib.request.urlopen(urlpage)
# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')
tag = soup.find('div', attrs={'class': 'dnsCell'})
text = ''.join(tag.stripped_strings)
print(text)
I got the HTML DOM, but the product-list DOM is missing. I suspect the product list is rendered from a JSON array requested from this link, but I am not sure how the product-list DOM is loaded. Am I right or wrong?
I want to scrape all the product details from this site and export them to Excel.
The requests library does not execute JavaScript (and neither does urllib, used above). If you want to download the completely rendered website, use the Selenium library: https://selenium-python.readthedocs.io/
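A minimal sketch of that Selenium approach (my addition; it assumes Selenium 4 with a local Chrome install, and it reuses the URL and the dnsCell class from the question, both of which may need adjusting):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless=new")  # render the page without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://www.ikh.se/sysNet/getProductsJSON/getProductsJSONDB.aspx?sua=1&lang=2&navid=19277994")
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Same extraction as the question, now against the rendered DOM
for cell in soup.find_all("div", class_="dnsCell"):
    print(" ".join(cell.stripped_strings))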
I am trying to use Python to extract the search results from this link into a JSON file, but normal request methods do not seem to work in this case. How can I extract all the results?
url = "https://apps.usp.org/app/worldwide/medQualityDatabase/reportResults.html?country=Ethiopia%2BGhana%2BKenya%2BMozambique%2BNigeria%2BCambodia%2BLao+PDR%2BPhilippines%2BThailand%2BViet+Nam%2BBolivia%2BColombia%2BEcuador%2BGuatemala%2BGuyana%2BPeru&period=2017%2B2016%2B2015%2B2014%2B2013%2B2012%2B2011%2B2010%2B2009%2B2008%2B2007%2B2006%2B2005%2B2004%2B2003&conclusion=Both&testType=Both&counterfeit=Both&recordstart=50"
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)
results_page = BeautifulSoup(r.content, 'lxml')
Why am I not getting the full source code of the page?
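A small debugging sketch I've added (not part of the original post) to see what the server actually returned:
print(r.status_code)       # did the request succeed at all?
print(len(r.text))         # how much markup came back?
print(results_page.title)  # which page did we actually get?
If the results table is filled in by JavaScript after the page loads, it will never appear in r.text, and a browser-driving tool such as Selenium (mentioned in the previous answer) is one way around that.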
How can I convert a Wikimedia dump XML file into text in Python? Is there a package for this?
Not sure which dump you have; from the post, it sounds like you are trying to read the web content, extract the elements, and write them to a file using Python.
It may be better to scrape the website using requests and bs4 objects:
# Getting data from a website - scrape
import requests, bs4

# Get the HTML from the Wikipedia page
url = "https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors"
req = requests.get(url)

# Create a bs4 object
soup = bs4.BeautifulSoup(req.text, "html5lib")
element = soup.select('.mwe-math-element')
print(element)
# You can save the required content to a file by iterating over the element list
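Following up on that last comment, a small sketch of the save step (my addition; the filename is arbitrary):
# Write the text of each matched element to a file, one element per line
with open("math_elements.txt", "w", encoding="utf-8") as f:
    for el in element:
        f.write(el.get_text(strip=True) + "\n")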
I was trying to parse a Yahoo Finance webpage using BeautifulSoup. I am using Python 2.7 and bs4 4.3.2. My final objective is to extract, in Python, all the tabulated data from http://finance.yahoo.com/q/ae?s=PXT.TO. As a start, the following code cannot find any table at that URL. What am I missing?
from bs4 import BeautifulSoup
from urllib2 import urlopen
url = "http://finance.yahoo.com/q/ae?s=PXT.TO"
soup = BeautifulSoup(urlopen(url).read())
table = soup.findAll("table")
print table