Extracting a specific href from table - python

I'm trying to extract the "10-K" url and append it into a list from the following site:
https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm
So basically I'm trying to extract the first document link in the filing table, i.e. the row whose type is 10-K rather than the other entries listed with it.
I'm trying to build a loop that runs this over multiple similar index links, but I'd like to resolve this issue first.
Any ideas?

Hope this answers your requirement.
import requests
from bs4 import BeautifulSoup

URL = "https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

rows = soup.findAll("td")
href_list = []
for ele in rows:
    a_Tag = ele.findChildren("a")
    if a_Tag:
        href_list.append(a_Tag)
print(href_list)

I'm not sure I understand your question, but if I got it right this can help you:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm")
s = BeautifulSoup(page.content, "html.parser")
print(s.find("table").findChild("a")["href"])
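
If you want the 10-K row specifically (and eventually a loop over several similar index pages), here is a minimal sketch; the User-Agent header value and the assumption that the document type appears as a plain cell in each table row are mine, not from the answers above:

import requests
from bs4 import BeautifulSoup

# Hypothetical list of filing-index pages to loop over later.
index_urls = [
    "https://www.sec.gov/Archives/edgar/data/320193/000091205701544436/0000912057-01-544436-index.htm",
]

ten_k_links = []
for url in index_urls:
    # SEC generally expects a descriptive User-Agent; adjust the value to your own.
    page = requests.get(url, headers={"User-Agent": "name contact@example.com"})
    soup = BeautifulSoup(page.content, "html.parser")
    for row in soup.find("table").find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        link = row.find("a")
        # Keep only the row whose Type cell reads "10-K".
        if link and "10-K" in cells:
            ten_k_links.append("https://www.sec.gov" + link["href"])
print(ten_k_links)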


How do I grab the first link from the output

from bs4 import BeautifulSoup
import requests

url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
for link in soup.find_all('a'):
    print(str(link.get('href')))
This is the output:
https://www.tutorialspoint.com/index.htm
https://www.tutorialspoint.com/codingground.htm
https://www.tutorialspoint.com/about/about_careers.htm
I need to know how to grab just the first link:
https://www.tutorialspoint.com/index.htm
Just index the list.
links = soup.find_all('a')[0].get('href')
Output:
https://www.tutorialspoint.com/index.htm
You can use find instead; it only gets the first element:
link = soup.find('a').get('href')

beautifulsoup returns none when using find() for an element

I'm trying to scrape this site to retrieve the year of each paper that's been published. I've managed to get the titles to work, but when it comes to scraping the years it returns None.
I've broken it down, and the None results appear once it enters the for loop, but I can't figure out why this happens when it worked for the titles.
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
When I print paperResults, it shows the contents of the section selected on the line above, so the selection itself looks right.
Any suggestions on how to retrieve the years would be greatly appreciated.
Change this
for singlepaper in paperResults:
    paperyear = singlepaper.find(class_="datePublished")
    print(paperyear)
To this
for singlepaper in paperResults:
    paperyear = singlepaper.find('span', itemprop="datePublished")
    print(paperyear.string)
You were looking for a class when you needed to target the span: if you print paperResults you will see that datePublished is an itemprop on a span element.
Try this:
import requests
from bs4 import BeautifulSoup

URL = "https://dblp.org/search?q=web%20scraping"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(class_="publ-list")
paperResults = results.find_all(class_="data tts-content")
for singlepaper in paperResults:
    paperyear = singlepaper.find(attrs={"itemprop": "datePublished"})
    print(paperyear)
It worked for me.

How to get the results of multiple iterations in one dataframe when crawling with beautifulsoup?

Wondering if you could help please. Python newbie here.
I am crawling multiple URLs in one script, but it's returning three dataframes, one for each iteration.
What I am looking to achieve is one single dataframe that contains the URLs from each iteration.
What I have so far is the following:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_links(url):
    links = []
    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text)
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    data = (link in links)
    df = pd.DataFrame(links)
    display(df)

get_links('https://www.example.com/')
get_links('https://www.example2.com/')
get_links('https://www.example3.com/')
Thank you for your help
You should let the function return the dataframe instead of only displaying it, and then combine the outputs with pandas.concat (see the sketch below):
https://pandas.pydata.org/docs/reference/api/pandas.concat.html
df = pd.concat([df1, df2, df3])
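
Put together, that suggestion could look like this; it reuses the question's own function, and the "url" column name is just an example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_links(url):
    links = []
    website = requests.get(url)
    soup = BeautifulSoup(website.text, "html.parser")
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    # Return the frame instead of displaying it.
    return pd.DataFrame({"url": links})

# One dataframe holding the links from every crawl.
df = pd.concat([
    get_links('https://www.example.com/'),
    get_links('https://www.example2.com/'),
    get_links('https://www.example3.com/'),
], ignore_index=True)
print(df)
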
from bs4 import BeautifulSoup, SoupStrainer
import requests

# Reason behind using `SoupStrainer` is to increase performance by parsing only `a` tags.
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#soupstrainer

urls = ['Link1', 'Link2', 'Link3']

def get_links():
    for url in urls:
        r = requests.get(url)
        links = BeautifulSoup(r.text, 'lxml', parse_only=SoupStrainer('a'))
        for link in links:
            print(link.get('href', 'N/A'))

get_links()

Exporting data from HTML to Excel

I just started programming.
My task is to extract data from an HTML page into Excel, using Python 3.7.
My problem is that the website contains more URLs, and behind those URLs are yet more URLs; I need the data behind the third URL.
My first problem is: how can I tell the program to pick only specific links from a ul, rather than every ul on the page?
from bs4 import BeautifulSoup
import urllib.request
import requests
import re

page = urllib.request.urlopen("file").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

for link in soup.find_all("a", href=re.compile("katalog_")):
    links = link.get("href")
    if "katalog" in links:
        for link in soup.find_all("a", href=re.compile("alle_")):
            links = link.get("href")
            print(soup.get_text())
There are many ways; one is to use "find_all" and be specific about the tags, like "a", just as you did. If that's the only option, then use a regular expression on your output. You can refer to this thread: Python BeautifulSoup Extract specific URLs. Also, please show us either the link or the HTML structure of the links you want to extract; we would like to see the differences between the URLs.
Updated answer based on understanding:
from bs4 import BeautifulSoup
import urllib.request
import requests

page = urllib.request.urlopen("https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ITGrundschutzKompendium/itgrundschutzKompendium_node.html").read()
soup = BeautifulSoup(page, "html.parser")

for firstlink in soup.find_all("a", {"class": "RichTextIntLink NavNode"}):
    firstlinks = firstlink.get("href")
    if "bausteine" in firstlinks:
        bausteinelinks = "https://www.bsi.bund.de/" + str(firstlinks.split(';')[0])
        response = urllib.request.urlopen(bausteinelinks).read()
        soup = BeautifulSoup(response, 'html.parser')
        secondlink = "https://www.bsi.bund.de/" + str(((soup.find("a", {"class": "RichTextIntLink Basepage"})["href"]).split(';'))[0])
        res = urllib.request.urlopen(secondlink).read()
        soup = BeautifulSoup(res, 'html.parser')
        listoftext = soup.find_all("div", {"id": "content"})
        for text in listoftext:
            print(text.text)
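
The question also asks for the data to end up in Excel; one hedged way to finish, assuming the texts printed above are collected into a list of rows first (the column names and file name here are placeholders, and pandas needs an Excel backend such as openpyxl installed):

import pandas as pd

# Hypothetical: fill 'records' inside the loop above instead of printing,
# e.g. records.append({"url": secondlink, "text": text.text})
records = []

df = pd.DataFrame(records, columns=["url", "text"])
df.to_excel("kompendium.xlsx", index=False)  # placeholder file name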

Drop part of a soup

I am learning how to use beautifulsoup. I managed to parse the html and now I want to extract a list of links from the page. The problem is that I am only interested in some links and the only way I can think of is to take all the links after a certain word appears. Can I drop part of the soup before I start extracting? Thank you.
This is what I have:
# import libraries
import urllib2
from bs4 import BeautifulSoup
import pandas as pd
import os
import re
# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
# query the website and return the html to the variable page
page = urllib2.urlopen(quote_page)
# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')
print(soup)
#transform to pandas dataframe
pages1 = soup.find_all('li', )
print(pages1)
pages2 = pd.DataFrame({
    "papers": pages1,
})
print(pages2)
And I need to drop the upper half of the links in pages2. The only way to differentiate the ones I want from the rest is a line that appears in the html: "<h2 class="colored">Journal Articles</h2>"
EDIT: I just noticed that I can also separate them by the beginning of the link. I only want the ones that start with "/article/".
You can also do this with a CSS selector:
# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'lxml')
#print(BeautifulSoup.prettify(soup))

css_selector = 'a[href^="/article"]'
href_tag_list = soup.select(css_selector)
print("Href list size:", len(href_tag_list))  # check that you found data; add an if/else if needed

href_link_list = []  # urljoin will probably be needed at some point
for href_tag in href_tag_list:
    href_link_list.append(href_tag['href'])
    print("href:", href_tag['href'])
I used this reference web page, which was provided by another Stack Overflow user:
Web Link
NB: You will have to strip the leading "/article/" from the hrefs in the list.
There can be various ways to get all the hrefs starting with "/article/". One of the simplest would be:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
import os
import re
import ssl

# specify the url
quote_page = 'https://econpapers.repec.org/RAS/pab7.htm'
gcontext = ssl.SSLContext()

# query the website and return the html to the variable page
page = urllib.request.urlopen(quote_page, context=gcontext)

# parse the html using beautiful soup and store in variable soup
soup = BeautifulSoup(page, 'html.parser')
#print(soup)

# Anchor tags starting with "/article/"
anchor_tags = soup.find_all('a', href=re.compile("/article/"))
for link in anchor_tags:
    print(link.get('href'))
This answer would be helpful as well. Also, go through the quick start guide of BeautifulSoup; it has very good, elaborate examples.
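
If you literally want to drop the part of the soup before the "Journal Articles" heading instead of filtering by the href prefix, here is a minimal sketch with find_all_next(), assuming the heading text matches exactly:

import urllib.request
from bs4 import BeautifulSoup

page = urllib.request.urlopen('https://econpapers.repec.org/RAS/pab7.htm')
soup = BeautifulSoup(page, 'html.parser')

# find_all_next() returns only elements that come after the heading in document order.
heading = soup.find('h2', class_='colored', string='Journal Articles')
article_links = []
if heading is not None:
    for a in heading.find_all_next('a', href=True):
        if a['href'].startswith('/article/'):
            article_links.append(a['href'])
print(article_links)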
