How to scrape <span > and next <p>? - python

I am trying to scrape some information from a webpage using Selenium. From <span id='text'> I want to extract the id value (text), and from the same div I want to extract the <p> element.
Here is what I have tried:
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
# Find all <a> elements on the page that have a name attribute
links = soup.find_all('a', attrs={'name': True})
print(links)
linq = []
for link in links:
    #print(link['name'])
    linq.append(link['name'])
information = soup.find_all('p')  # find all <p> elements on the page
# This is how I did it
with open('osha.txt', 'w') as f:
    for i in range(len(linq)):
        f.write(linq[i])
        f.write('\n')
        # Text of the i-th <p>; pairing by index assumes links and paragraphs line up
        f.write(information[i].get_text())
        f.write('\n')
        f.write('-' * 50)
        f.write('\n')
Below is the HTML code.
What I want to save in a separate text file is this information:
1926.451(a)
Capacity
<div class="field--item">
  <div class="paragraph paragraph--type--regulations-standard-number paragraph--view-mode--token">
    <span id="1926.451(a)">
      <a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
        1926.451(a)
      </a>
    </span>
    <div class="field field--name-field-standard-paragraph-body-p">
      <p>"Capacity"</p>
    </div>
  </div>
</div>

Some of the <a> tags and paragraphs may be missing on the page, so use a try/except block to handle that.
Use a CSS selector to get the parent node, then get the respective child nodes.
Use a DataFrame to store the values and export them to a CSV file.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Send an HTTP request to the website and retrieve the HTML code of the webpage
response = requests.get('https://www.osha.gov/laws-regs/regulations/standardnumber/1926/1926.451#1926.451(a)(6)')
html = response.text
code = []
para = []
# Parse the HTML code using Beautiful Soup to extract the desired information
soup = BeautifulSoup(html, 'html.parser')
for item in soup.select(".field.field--name-field-reg-standard-number .field--item"):
    # find() returns None when the tag is missing, so .text raises AttributeError
    try:
        code.append(item.find("a").text.strip())
    except AttributeError:
        # No <a> in this item: fall back to the text of the <span>
        code.append(item.find("span").text.strip())
    try:
        para.append(item.find("p").text.strip())
    except AttributeError:
        # No <p> in this item: store a placeholder so both lists stay aligned
        para.append("Nan")
df = pd.DataFrame({"code": code, "paragraph": para})
print(df)
df.to_csv("path/to/filename")
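The parent-then-child approach described above can be tried offline on a small piece of markup. The following is a minimal sketch, not the page's real structure: `SAMPLE_HTML` is a trimmed, hypothetical copy of the snippet from the question plus one made-up item without a <p>, and `extract_pairs` is an illustrative helper name. It checks for missing children explicitly instead of catching exceptions.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: first item mirrors the question's snippet,
# second item is invented to show the "missing <p>" case.
SAMPLE_HTML = """
<div class="field--item">
  <div class="paragraph">
    <span id="1926.451(a)">
      <a href="/laws-regs/interlinking/standards/1926.451(a)" name="1926.451(a)">
        1926.451(a)
      </a>
    </span>
    <div class="field field--name-field-standard-paragraph-body-p">
      <p>"Capacity"</p>
    </div>
  </div>
</div>
<div class="field--item">
  <div class="paragraph">
    <span id="1926.451(a)(1)">1926.451(a)(1)</span>
  </div>
</div>
"""


def extract_pairs(html):
    """Return (span id, paragraph text) for every .field--item block."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for item in soup.select(".field--item"):
        span = item.find("span", id=True)  # only spans that carry an id
        if span is None:
            continue
        para = item.find("p")
        # Use an empty string when the block has no paragraph
        text = para.get_text(strip=True) if para else ""
        pairs.append((span["id"], text))
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
for code, text in pairs:
    print(code)
    print(text)
    print("-" * 50)
```

Because each code and its paragraph are read from the same parent block, they cannot drift out of step the way two independent `find_all` lists can.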

Related

How to scrape aria-label text in python?

I want to scrape the list of player names from a website, but the names are in aria-label attributes, and I don't know how to scrape text from labels.
Here is the link:
https://athletics.baruch.cuny.edu/sports/mens-swimming-and-diving/roster
For example, the HTML contains the following. How do I scrape the text from the labels?
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
You can use the .get() method in BeautifulSoup. First select your element into elem (or any other variable) using a selector or find/find_all. Then try:
print(elem.get('aria-label'))
Below is code that will help you extract the name from the <a> tag:
from bs4 import BeautifulSoup

with open("<path-to-html-file>") as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # parse the HTML
tags = soup.find_all('a')  # get all the <a> tags
for tag in tags:
    print(tag.get('aria-label'))  # get the required text (None if the attribute is absent)
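As a small self-contained illustration of reading `aria-label`, the sketch below runs on an inline copy of the snippet from the question rather than the live roster page. It assumes every player label ends with " - View Full Bio", as in that one example; `SUFFIX` and `player_names` are names made up for this sketch.

```python
from bs4 import BeautifulSoup

# Inline sample based on the question's snippet, plus one link without a label
SAMPLE_HTML = """
<div class="sidearm-roster-player-image column">
  <a href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555"
     aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
    <img class="lazyload" alt="GREGORY BECKER">
  </a>
</div>
<a href="/other">No label here</a>
"""

SUFFIX = " - View Full Bio"  # assumed label suffix, taken from the example


def player_names(html):
    """Collect names from <a> tags whose aria-label ends with SUFFIX."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    # The attribute selector skips links that have no aria-label at all
    for tag in soup.select("a[aria-label]"):
        label = tag["aria-label"]
        if label.endswith(SUFFIX):
            names.append(label[: -len(SUFFIX)])
    return names


names = player_names(SAMPLE_HTML)
print(names)
```

Selecting `a[aria-label]` avoids the `None` values that `tag.get('aria-label')` returns for links without the attribute.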

How do I extract the contents of HTML tags that only have <p>

I just got into web scraping and I'm using beautifulsoup to perform web scraping but I only want to extract contents just with the "p" tags. So I want to ignore tags if there are additional class/style/etc...
Example:
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
So far I can only extract all the "p" tags with this code
from bs4 import BeautifulSoup as BS
import requests
URL = input("Enter url to scrape: ")
content = requests.get(URL)
soup = BS(content.text, 'html.parser')
content_p = soup.find_all('p')
print(content_p)
You can pass a function to find_all that matches only <p> tags with no attributes:
soup.find_all(lambda tag: tag.name == 'p' and not tag.attrs)
Reference: https://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#The%20basic%20find%20method:%20findAll(name,%20attrs,%20recursive,%20text,%20limit,%20**kwargs)
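A short runnable sketch of the function-filter idea, using the two paragraphs from the question and one extra attribute-less <b> tag (added here as an assumption) to show why the filter checks the tag name explicitly:

```python
from bs4 import BeautifulSoup

html = """
<p>what I want to extract</p>
<p class="copy">what I do not want to extract from HTML page</p>
<b>bold text without attributes</b>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep a tag only if it is a <p> and its attribute dict is empty
plain = soup.find_all(lambda tag: tag.name == "p" and not tag.attrs)
texts = [p.get_text() for p in plain]
print(texts)
```

`tag.attrs` is a dictionary, so `not tag.attrs` is true only when the tag has no class, style, id, or any other attribute.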

Retrieving Imgur Image Link via Web Scraping Python

I am trying to retrieve the link for an image using imgur.com. It seems that the picture (if .jpg or .png) is usually stored within (div class="image post-image") on their website, like:
<div class='image post-image'>
<img alt="" src="//i.imgur.com/QSGvOm3.jpg" original-title="" style="max-width: 100%; min-height: 666px;">
</div>
so here is my code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://imgur.com/gallery/0PTPt'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
info = soup.find_all('div', {'class':'post-image'})
with open('imgur-html.txt', 'w') as file:
    file.write(str(info))
Instead of being able to get everything within these tags, this is my output:
<div class="post-image" style="min-height: 666px">
</div>
What do I need to do in order to access this further so I can get the image link? Or is this simply something where I need to only use the API? Thanks for any help.
It appears that the child img is added dynamically, so it is not present in the HTML that the request returns. You can extract the full link from the element whose rel attribute is image_src:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://imgur.com/gallery/0PTPt')
soup = bs(r.content, 'lxml')
print(soup.select_one('[rel=image_src]')['href'])
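To see the attribute selector work without a network request, here is a minimal sketch on hypothetical markup. The <link> element and its href are assumptions modelled on the image URL in the question, not a copy of what imgur actually serves. It also guards against the element being absent, since indexing the result of `select_one` fails when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the image URL lives in a <link> in the head,
# while the post-image div is empty in the static HTML.
html = """
<html><head>
<link rel="image_src" href="https://i.imgur.com/QSGvOm3.jpg">
</head><body>
<div class="post-image" style="min-height: 666px"></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one('link[rel="image_src"]')
# select_one returns None when no element matches
image_url = link["href"] if link else None
print(image_url)
```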

article scraping with beautifulsoup: scraping <p> tags inside <div > tags with ids

I wrote a Python script to pull out particular paragraphs, but I end up getting all the text on the page. I want to scrape the paragraphs inside a div whose id varies from page to page, e.g.
<div id="content-body-123123">
How can I identify this particular tag and pull out only the paragraphs inside it?
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
for tag in soup.find_all('p'):
    print(tag.text + '\n')
Try this. The change of id number should not affect your result:
from bs4 import BeautifulSoup
import requests
url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)
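The `[id^='content-body-']` selector matches any element whose id starts with that prefix, whatever number follows. A small offline sketch, using made-up markup with one matching div and one unrelated div:

```python
from bs4 import BeautifulSoup

# Made-up sample: only the second div has an id with the wanted prefix
html = """
<div id="sidebar"><p>unrelated</p></div>
<div id="content-body-123123">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# ^= is the CSS "attribute value starts with" operator
paragraphs = [p.get_text() for p in soup.select("div[id^='content-body-'] p")]
print(paragraphs)
```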

Using beautifulsoup in python to get link names and "selecting" links instead of limiting?

I've got the following code trying to return data from some html, however I am unable to return what I require...
from bs4 import BeautifulSoup

def getData():
    with open('C:/html.html', 'rb') as htmlfile:
        html = htmlfile.read()
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('h3')
        for link in links:
            print(link)

getData()
This returns a list like the following:
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (YES)
</a>
</h3>
<h3>
<a href="http://www.mywebsite.com/titles" title="Click for details(x)">
TITLE STUFF HERE (MAYBE)
</a>
</h3>
I want to be able to return just the titles: TITLE STUFF HERE (YES) and TITLE STUFF HERE (MAYBE).
Another thing I want to be able to do is use something like
soup.find_all("a", limit=2), but instead of limiting the results to the first two, I want it to return ONLY the second link, so a "select" feature rather than a limit. (Does such a feature exist?)
To print just the titles, iterate over the <a> tags and keep those whose parent is an <h3>:
from bs4 import BeautifulSoup

def getData():
    with open('C:/html.html', 'rb') as htmlfile:
        html = htmlfile.read()
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_="blocks")
    for item in items:
        links = item.find_all('a')
        for link in links:
            # Keep only links that sit directly inside an <h3>
            if link.parent.name == 'h3':
                print(link.text)

getData()
You can also find all the links from the very beginning and check both that the parent is an h3 and that the parent's parent is a div with class blocks.
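For the second part of the question, `find_all` and `select` both return ordinary Python lists, so picking only the second link is a matter of indexing rather than a special option. A minimal sketch on an inline copy of the HTML from the question (wrapped in a `div class="blocks"`, which is an assumption about the surrounding markup):

```python
from bs4 import BeautifulSoup

html = """
<div class="blocks">
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
      TITLE STUFF HERE (YES)
    </a>
  </h3>
  <h3>
    <a href="http://www.mywebsite.com/titles" title="Click for details(x)">
      TITLE STUFF HERE (MAYBE)
    </a>
  </h3>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h3 > a" keeps only links that are direct children of an <h3>
links = soup.select("div.blocks h3 > a")
titles = [a.get_text(strip=True) for a in links]

# Index 1 is the second match; guard against pages with fewer links
second = titles[1] if len(titles) > 1 else None

print(titles)
print(second)
```

The CSS selector expresses the "parent is h3, grandparent is div.blocks" check in one step, and `get_text(strip=True)` removes the surrounding whitespace from each title.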
