Python: scrape a part of source code and save it as html


Here is my situation: I need to save a web page's source code as an HTML file. The page has lots of sections I don't need; I only want to save the source code of the article itself.
Code:
from urllib.request import urlopen
page = urlopen('http://www.abcde.com')
page_content = page.read()
with open('page_content.html', 'wb') as f:
    f.write(page_content)
This saves the whole page source, but how can I save only the part I want?
To explain, the part I want is this element:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
.
.
.
</div>
I need to save the source code of this tag, including the tag itself and everything inside it, rather than extracting just the text inside the tags.
The result I want to save looks like this:
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<div class="col-md-12 col-xs-12" style="padding-left:10px;">
<h1 itemprop="name" class="page_article_title" title="Apple" id="mask">Apple</h1>
</div>
<!--Article Start-->
<section class="page_article_div" id="print">
<article itemprop="text" class="page_article_content">
<p>
<img alt="Apple" src="http://www.abcde.com/383741719.jpg" style="width: 300px; height: 200px;" /></p>
<p>
The apple tree (Malus pumila, commonly and erroneously called Malus domestica) is a deciduous tree in the rose family best known for its sweet, pomaceous fruit, the apple.</p>
<p>
It is cultivated worldwide as a fruit tree, and is the most widely grown species in the genus Malus.</p>
<p>
<strong><span style="color: #884499;">Appe is red</span></strong></p>
<ol>
<li>
Germanic paganism</li>
<li>
Greek mythology</li>
</ol>
<p style="text-align: right;">
【Jane】</p>
<p style="text-align: right;">
Credit : Wiki</p>
</article>
<div style="text-align:right;font-size:1.2em;"><a class="authorlink" href="http://www.abcde.com/web/online;url=http://61.66.117.1234/name=2017">2017</a></div>
<br />
<div style="text-align:right;font-size:1.2em;">【Thank you!】</div>
</section>
<!--Article End-->
</div>

My own solution:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.abcde.com')
page_content = page.read()
soup = BeautifulSoup(page_content, "lxml")
# Collect the outer HTML of every matching div (usually just one).
fragments = []
for tag in soup.select('div[itemtype="http://schema.org/MedicalWebPage"]'):
    fragments.append(str(tag))
# Join with newlines so no stray commas end up in the saved HTML.
html_out = '\n'.join(fragments)
with open('C:/html/try.html', 'w', encoding='UTF-8') as f:
    f.write(html_out)
I am a beginner, so I tried to keep it as simple as possible. This is my answer, and it works well so far :)
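A minimal sketch of the same idea that runs without network access. The PAGE string, the extract_fragment helper, and the output file name are all made up for illustration; only the selector comes from the question. It uses select_one, which returns the first match or None, and str(tag), which gives the tag together with everything inside it. Note that BeautifulSoup re-serialises the markup, so details such as attribute quoting may differ slightly from the raw source.

```python
from pathlib import Path
import tempfile

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
PAGE = """
<html><body>
<nav>site menu</nav>
<div itemscope itemtype="http://schema.org/MedicalWebPage">
<h1 itemprop="name">Apple</h1>
<article itemprop="text"><p>Apple is red</p></article>
</div>
<footer>site footer</footer>
</body></html>
"""


def extract_fragment(html, selector):
    """Return the outer HTML of the first element matching selector, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(selector)
    return str(tag) if tag is not None else None


fragment = extract_fragment(PAGE, 'div[itemtype="http://schema.org/MedicalWebPage"]')

# Written to a temporary directory here; swap in any path you like.
out_path = Path(tempfile.mkdtemp()) / "article.html"
out_path.write_text(fragment, encoding="utf-8")
print(fragment)
```

Returning None instead of raising lets the caller decide what to do when the page layout changes and the selector no longer matches.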

You can find the tag by one of its properties, such as its class, tag name, or id, and then save it in whatever format you want, as in the example below.
soup = BeautifulSoup(yoursavedfile.read(), 'html.parser')
tag_for_me = soup.find_all(class_='class_name_of_your_tag')
print(tag_for_me)
tag_for_me will hold the markup you need.

You can use Beautiful Soup to get any HTML source you need.
import requests
from bs4 import BeautifulSoup
target_class = "gb4"
target_text = "Web History"
r = requests.get("https://google.com")
soup = BeautifulSoup(r.text, "lxml")
for elem in soup.find_all(attrs={"class":target_class}):
    if elem.text == target_text:
        print(elem)
Output:
<a class="gb4" href="http://www.google.com/history/optout?hl=en">Web History</a>

Use BeautifulSoup to locate the element you want to insert into and to fetch the HTML you want to insert. Then call insert() to add the new tag, and write the result back over the original file.
from bs4 import BeautifulSoup
import requests
# Use BeautifulSoup to find the place you want to insert into.
# div_tag is the extracted div.
soup = BeautifulSoup("Your content here", 'lxml')
div_tag = soup.find('div', attrs={'itemtype': 'http://schema.org/MedicalWebPage'})
# e.g. div_tag = <div itemscope itemtype="http://schema.org/MedicalWebPage"></div>
res = requests.get('url to get content from')
soup1 = BeautifulSoup(res.text,'lxml')
insert_data = soup1.find('your div/data to insert')
# This inserts the tag into div_tag at child position 3.
div_tag.insert(3, insert_data)
# div_tag now contains your desired output. Write it back to the original page_content.html.
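To make the insert() step concrete, here is a small sketch on made-up markup. The ids container and new are hypothetical. insert(position, tag) places the tag at that index among the children of the element it is called on, and a tag taken from another soup is moved rather than copied.

```python
from bs4 import BeautifulSoup

# Hypothetical target document and donor fragment.
target = BeautifulSoup(
    '<div id="container"><p>first</p><p>last</p></div>', "html.parser"
)
donor = BeautifulSoup('<section id="new">inserted</section>', "html.parser")

container = target.find("div", id="container")
new_tag = donor.find("section")

# Position 1 puts the new tag between the two existing paragraphs.
container.insert(1, new_tag)
print(container)
```

After the call, str(container) can be written to a file in the same way as any other fragment.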

Related

BeautifulSoup - Getting all the child from tag instead of the first

I am creating a script that collects data from a website, but I am having trouble collecting only specific information. The HTML that is causing me problems is the following:
<div class="Content">
<article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</a>
<br/>
</blockquote>
</article>
</div>
<div class="Content">
<article>
<blockquote class="messageText 1234">
<a class="IDENTIFIER" href="WEBSITE">
</a>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<div class="messageTextEndMarker">
</div>
</blockquote>
</article>
</div>
And I am trying to create a process that prints only the part "I WANT THIS". I have the following script:
import requests
from bs4 import BeautifulSoup
url = ''
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.find_all('div', class_='panels'):
    for b in a.find_all('form', class_='section'):
        for c in b.find_all('div', class_='message'):
            for d in c.find_all('div', class_='primaryContent'):
                for d in d.find_all('div', class_='messageContent'):
                    for e in d.content.find_all('blockquote', class_='messageText 1234')[0]:
                        print(e.string)
My idea with the code was to extract only the part from the first blockquote element, however, I am getting all the text from the blockquotes:
I WANT THIS
NO WANT THIS
NO WANT THIS
NO WANT THIS
How can I achieve this?

Why not use select_one to isolate first block then stripped_strings to separate out text strings?
from bs4 import BeautifulSoup as bs
html = ''' your html'''
soup = bs(html, 'lxml')
print([s for s in soup.select_one('.Content .messageText').stripped_strings])
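Here is a runnable version of that suggestion, as a sketch. The HTML is a trimmed copy of the markup from the question, and html.parser is used only so that no extra parser needs to be installed. select_one stops at the first matching blockquote, and stripped_strings yields each non-empty text node with surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

html = """
<div class="Content"><article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</blockquote>
</article></div>
<div class="Content"><article>
<blockquote class="messageText 1234">
NO WANT THIS
<br/>
NO WANT THIS
</blockquote>
</article></div>
"""

soup = BeautifulSoup(html, "html.parser")
first_block = soup.select_one(".Content .messageText")
wanted = list(first_block.stripped_strings)
print(wanted)  # ['I WANT THIS', 'I WANT THIS 2']
```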

BeautifulSoup : how to show the inside of a div that won't show?

I am new to BeautifulSoup and have run into an issue I do not understand. The question may already have been answered, but none of the answers I found help in this case.
I need to access the inside of a div to retrieve the glossary entries of a website, but the inside of that div does not seem to show up at all with BeautifulSoup. Could you help me?
This is the HTML on the website:
<!DOCTYPE html>
<html lang="en-US" style="margin-top: 0px !important;">
<head>...</head>
<body>
<header>...</header>
<section id="glossary" class="search-off">
<dl class="title">
<dt>Glossary</dt>
</dl>
<div class="content">
<aside id="glossary-aside">
<div></div>
<ul></ul>
</aside>
<div id="glossary-list" class="list">
<dl data-id="2103">...</dl>
<dl data-id="1105">
<dt>ABV (Alcohol by volume)</dt>
<dd>
<p style="margin-bottom: 0cm; text-align: justify;"><span style="font-family: Arial Cyr,sans-serif;"><span style="font-size: x-small;"><span style="font-size: small;"><span style="font-size: medium;">Alcohol by volume (ABV) is the measure of an alcoholic beverage’s alcohol content. Wines may have alcohol content from 4% ABV to 18% ABV; however, wines’ typical alcohol content ranges from 12.5% to 14.5% ABV. You can find a particular wine’s alcohol content by checking the label.</span></span></span></span><span style="font-size: medium;"> </span></p>
</dd>
</dl>
<dl data-id="1106">...</dl>
<dl data-id="1213">...</dl>
<dl data-id="2490">...</dl>
<dl data-id="11705">...</dl>
<dl data-id="1782">...</dl>
</div>
<div id="glossary-single" class="list">...</div>
</div>
<div class="s_content">
<div id="glossary-s_list" class="list"></div>
</div>
</section>
<footer></footer>
</body>
</html>
I need to access the different <dl> tags inside <div id="glossary-list" class="list">.
My code currently looks like this:
import requests
from bs4 import BeautifulSoup

url_winevibe = requests.get("http://winevibe.com/glossary")
soup = BeautifulSoup(url_winevibe.text, "lxml")
ct = soup.find("div", {"id": "glossary-list"}).findAll("dl")
I have tried various things, including walking the descendants and children, but all I get is an empty list.
If I try ct = soup.find("div", {"id":"glossary-list"}) and print it, I get <div class="list" id="glossary-list"></div>. It seems the inside of the div is somehow blocked, but I am not sure.
Does anyone have an idea of how to access this?

First solution: the URL below comes from checking where the data actually loads from. The page fills the glossary in with JavaScript, through an XHR request to a different URL:
import requests
import json
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
hoks = json.loads(r)
for item in hoks:
    print(item['key'])
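The code above decodes twice: once with .json() and once with json.loads. That only makes sense if the endpoint returns a JSON string that itself contains JSON text, which is an assumption here since the response is not shown. A standard-library sketch of that situation, with a made-up payload:

```python
import json

# Hypothetical payload: a JSON array that was serialised, then serialised again.
inner = json.dumps([{"key": "ABV (Alcohol by volume)", "id": 1105}])
body = json.dumps(inner)      # what the server would send in this scenario

first = json.loads(body)      # same effect as response.json(): still a str
entries = json.loads(first)   # the second decode gives the list of dicts

print(type(first).__name__)   # str
for entry in entries:
    print(entry["key"], entry["id"])
```

If the first decode already returns a list or dict, the second json.loads call is unnecessary and would raise a TypeError.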
Second Solution:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Firefox()
url = 'http://winevibe.com/glossary/'
browser.get(url)
time.sleep(20) # wait 20 seconds for the site to load.
html = browser.page_source
soup = BeautifulSoup(html, features='html.parser')
for item in soup.findAll('div', attrs={'id': 'glossary-list'}):
    for dt in item.findAll('dt'):
        print(dt.text)
You can call browser.close() afterwards to close the browser.
Here is the final code, which covers everything the asker requested in chat:
import requests
import json
r = requests.get('http://winevibe.com/wp-json/glossary/key/?l=en').json()
data = json.loads(r)
result = ([(item['key'], item['id']) for item in data])
text = []
for item in result:
    try:
        r = requests.get(
            f"http://winevibe.com/wp-json/glossary/text/?id={item[1]}").json()
        data = json.loads(r)
        print(f"Getting Text For: {item[0]}")
        text.append(data[0]['text'])
    except KeyboardInterrupt:
        print('Good Bye')
        break
with open('result.txt', 'w+') as f:
    for a, b in zip(result, text):
        lines = ', '.join([a[0], b.replace('\n', '')]) + '\n'
        f.write(lines)

beautifulsoup - extracting link, text, and title within child div

The layout is as follows:
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
install app
</div>
</div>
I'm trying to grab the TITLE and then the APP_URL. Ideally, when I print the result as HTML, the TITLE should become a hyperlink to the APP_URL.
My code is below, but it doesn't yield the desired results. I believe I need to add another command within the loop to grab the title. The problem is: how do I make sure the TITLE and APP_URL I grab belong together? There are at least 15 apps with <div class="App">, and of course I want all 15 results.
IMPORTANT: for the href links, I need the one from the class called "signed button".
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[1]
    print(a.text.strip(), '=>', a.attrs['href'])

Use CSS selectors:
from bs4 import BeautifulSoup
html = """
<div class="App">
<div class="content">
<div class="title">Application Name #1</div>
<div class="image" style="background-image: url(https://img_url)">
</div>
<a href="http://app_url" class="signed button">install app</a>
</div>
</div>"""
soup = BeautifulSoup(html, 'html5lib')
for div in soup.select('div.App'):
    title = div.select_one('div.title')
    link = div.select_one('a')
    print("Click here: <a href='{}'>{}</a>".format(link["href"], title.text))
Which yields
Click here: <a href='http://app_url'>Application Name #1</a>
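A sketch of how the same approach could handle several App blocks while keeping each title with its own link. The markup and the app_url_1 / app_url_2 addresses are invented for illustration. Selecting inside each div.App keeps the pair together, a.signed.button matches an anchor carrying both classes, and blocks missing either part are skipped instead of raising.

```python
from bs4 import BeautifulSoup

html = """
<div class="App"><div class="content">
<div class="title">Application Name #1</div>
<a class="signed button" href="http://app_url_1">install app</a>
</div></div>
<div class="App"><div class="content">
<div class="title">Application Name #2</div>
<a class="signed button" href="http://app_url_2">install app</a>
</div></div>
<div class="App"><div class="content">
<div class="title">No link here</div>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")
links = []
for app in soup.select("div.App"):
    title = app.select_one("div.title")
    link = app.select_one("a.signed.button")
    if title is None or link is None:
        continue  # skip incomplete blocks
    links.append(
        '<a href="{}">{}</a>'.format(link["href"], title.get_text(strip=True))
    )

print("\n".join(links))
```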

Maybe something like this will work?
soup = BeautifulSoup(example)
for div in soup.findAll('div', {'class': 'App'}):
    a = div.findAll('a')[0]
    print(div.findAll('div', {'class': 'title'})[0].text, '=>', a.attrs['href'])

Scraping the links from a specific url

This is my first question, so please forgive me if I have explained anything wrong.
I am trying to scrape URLs from a specific website in Python and write the links to a CSV. The problem is that when I parse the website with BeautifulSoup, I can't extract the URLs: all I get is <div id="dvScores" style="min-height: 400px;">\n</div>, with nothing under that branch. But when I open the browser console, copy the table that holds the links, and paste it into a text editor, I get 600 pages of HTML. What I want is to write a for loop that shows the links. The structure of the HTML is below:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
#shadow-root (open)
<head>...</head>
<body>
<div id="body">
<div id="wrapper">
#multiple divs but i don't need them
<div id="live-master"> #what I need is under this div
<span id="contextual">
#multiple divs but i don't need them
<div id="live-score-master"> #what I need is under this div
<div ng-app="live-menu" id="live-score-rightcoll">
#multiple divs but i don't need them
<div id="left-score-lefttemp" style="padding-top: 35px;">
<div id="dvScores">
<table cellspacing=0 ...>
<colgroup>...</colgroup>
<tbody>
<tr class="row line-bg1"> #this changes to bg2 or bg3
<td class="row">
<span class="row">
<a href="www.example.com" target="_blank" class="td_row">
#I need to extract this link
</span>
</td>
#Multiple td's
</tr>
#multiple tr class="row line-bg1" or "row line-bg2"
.
.
.
</tbody>
</table>
</div>
</div>
</div>
</div>
</span>
</div>
</div>
</body>
</html>
What am I doing wrong? I need to automate this in Python rather than pasting the HTML into a text file and extracting the links with a regex.
My Python code is below:
import requests
from bs4 import BeautifulSoup
r=requests.get("http://example.com/example")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("span",id="contextual")
span=all[0].find_all("tbody")

If you are trying to scrape URLs, you should collect the href attributes:
urls = soup.find_all('a', href=True)
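A small sketch of that call on made-up table rows (the example.com URLs are placeholders). href=True keeps only anchors that actually have an href attribute. This only helps when the links are present in the HTML that requests receives; it cannot see rows that the browser adds later with JavaScript.

```python
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr class="row line-bg1"><td><a href="http://www.example.com/1" class="td_row">one</a></td></tr>
<tr class="row line-bg2"><td><a class="td_row">no href</a></td></tr>
<tr class="row line-bg1"><td><a href="http://www.example.com/2" class="td_row">two</a></td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns Tag objects; read the attribute to get the URL strings.
urls = [a["href"] for a in soup.find_all("a", href=True)]
print(urls)
```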

This site uses JavaScript to populate its content, so you can't get the URLs with BeautifulSoup alone. If you inspect the network tab in your browser, you can spot the link used below. It contains all the data you need, so you can simply parse it and extract the values you want.
import requests
req = requests.get('http://goapi.mackolik.com/livedata?group=0').json()
for el in req['m'][4:100]:
    index = el[0]
    team_1 = el[2].replace(' ', '-')
    team_2 = el[4].replace(' ', '-')
    print('http://www.mackolik.com/Mac/{}/{}-{}'.format(index, team_1, team_2))

It seems the HTML is generated dynamically by JavaScript, so you would need something that mimics a browser to crawl it. With requests you can at least reuse one session across calls. Note that a session keeps cookies and connection settings between requests but does not run JavaScript:
session = requests.session()
data = session.get("http://website.com").content  # usage example
After this you can do the parsing, additional scraping, etc.

Having problems understanding BeautifulSoup filtering

Could someone please explain how filtering works in Beautiful Soup? I've got the HTML below and am trying to filter specific data from it, but I can't seem to access it. I've tried various approaches, from gathering all the class="g" divs to grabbing just the items of interest in that specific div, but I only get None or nothing printed.
Each page has a <div class="srg"> div containing multiple <div class="g"> divs, and the data I want is inside <div class="g">. Each of these has multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
    site = item.find('cite')
    comment = item.find('span', {'class': 'st'})
print(site)
print(comment)
I have also attempted stepping into the initial div and finding all:
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
print(site)
print(comment)
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but still nothing was printed. So I took another look at the soup, and it looks different. I was previously looking at response.text from requests. I can only think that BeautifulSoup modifies response.text, or that I somehow got the sample completely wrong the first time (not sure how). Below is the new sample, based on what I see when I print the soup, followed by my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">‎
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried Alecxe's approach without success so far. Am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))

First get the div with class name srg, then find all the div elements with class name g inside it, and read the text of the site and comment from each. Below is the code that works for me-
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
    site = data.find('cite', {'class': '_Rm'})
    comment = data.find('span', {'class': 'st'})
    if site:  # Check that site is not None
        if site.text.strip() not in sites:
            sites.append(site.text.strip())
    if comment:  # Check that comment is not None
        if comment.text.strip() not in comments:
            comments.append(comment.text.strip())
print(sites)
print(comments)
Output-
['http://www.url.com.stuff/here']
['http://www.url.com. Some info on url etc etc']
EDIT--
Why your code does not work
First attempt-
result = main.find('div', {'class': 'g'}) grabs only the first matching element, and that first element has no div with class name s, so the rest of the code fails.
Second attempt-
You print site and comment outside the scope of the for loop, so print inside the loop instead.
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
    site = result.find('cite')
    comment = result.find('span', {'class': 'st'})
    print(site.text)  # Grab text
    print(comment.text)
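The first point can be reproduced on a reduced, made-up sample. find returns only the first match, so when that first div.g has no div.s inside it, the next find gives None and any further call on it fails. Looping over find_all and skipping incomplete blocks avoids that.

```python
from bs4 import BeautifulSoup

html = """
<div class="srg">
  <div class="g"><p>no result block here</p></div>
  <div class="g"><div class="s"><cite>http://www.url.com.stuff/here</cite>
    <span class="st">Some info</span></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", {"class": "srg"})

first_g = main.find("div", {"class": "g"})      # only the first match
missing = first_g.find("div", {"class": "s"})   # nothing there
print(missing)  # None

pairs = []
for g in main.find_all("div", {"class": "g"}):
    cite = g.find("cite")
    comment = g.find("span", {"class": "st"})
    if cite is None or comment is None:
        continue  # this block has no result data
    pairs.append((cite.get_text(strip=True), comment.get_text(strip=True)))
print(pairs)
```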

You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you get the div with class="s" with no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
    span = cite.find_next("span", class_="st")
    print(cite.get_text(strip=True))
    print(span.get_text(strip=True))
Also, make sure you are using BeautifulSoup version 4:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup
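The switch from find_next_sibling to find_next matters for the updated markup, and a reduced sample shows why (the HTML below is a trimmed, tidied version of the asker's second sample). There the span is a sibling of div.kv, not of the cite itself, so find_next_sibling finds nothing, while find_next searches everything that comes later in the document.

```python
from bs4 import BeautifulSoup

html = """
<li class="g"><div class="s">
<div class="kv"><cite>www.url.com/index.html</cite></div>
<span class="st">Details about URI</span>
</div></li>
"""

soup = BeautifulSoup(html, "html.parser")
cite = soup.select_one("li.g div.s div.kv cite")

sibling = cite.find_next_sibling("span", class_="st")   # same parent only
following = cite.find_next("span", class_="st")         # anywhere after cite

print(sibling)                           # None
print(following.get_text(strip=True))    # Details about URI
```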
