Python code to consolidate CSS into HTML

I'm looking for Python code that can take an HTML page and insert the CSS style definitions from any linked stylesheets directly into it, so that the externally referenced CSS file(s) are no longer needed.
I need this to turn existing pages from a web site into single files that can be sent as email attachments. Thanks for any help.

Sven's answer helped me, but it didn't work out of the box. The following did it for me:
import bs4  # BeautifulSoup 3 has been replaced by bs4

soup = bs4.BeautifulSoup(open("index.html").read(), "html.parser")
stylesheets = soup.find_all("link", {"rel": "stylesheet"})
for s in stylesheets:
    t = soup.new_tag("style")
    c = bs4.element.NavigableString(open(s["href"]).read())
    t.insert(0, c)
    t["type"] = "text/css"
    s.replace_with(t)
open("output.html", "w").write(str(soup))

You will have to code this yourself, but BeautifulSoup will get you a long way. Assuming all your files are local, you can do something like:
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open("index.html").read())
stylesheets = soup.findAll("link", {"rel": "stylesheet"})
for s in stylesheets:
    s.replaceWith('<style type="text/css" media="screen">' +
                  open(s["href"]).read() +
                  '</style>')
open("output.html", "w").write(str(soup))
If the files are not local, you can use Python's urllib or urllib2 to retrieve them.
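For remote files, a minimal sketch of the same idea (Python 3; the page URL is a placeholder, and relative hrefs are resolved with urljoin):
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

page_url = "http://example.com/index.html"  # placeholder URL
soup = BeautifulSoup(urlopen(page_url).read(), "html.parser")

for link in soup.find_all("link", {"rel": "stylesheet"}):
    css_url = urljoin(page_url, link["href"])  # resolve relative hrefs
    style = soup.new_tag("style", type="text/css")
    style.string = urlopen(css_url).read().decode("utf-8")
    link.replace_with(style)

with open("output.html", "w", encoding="utf-8") as f:
    f.write(str(soup))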

You can use pynliner, which inlines the CSS rules directly into each element's style attribute. Example adapted from its documentation:
from pynliner import Pynliner

html = "html string"
css = "css string"
p = Pynliner().from_string(html).with_cssString(css)
output = p.run()  # returns the HTML with the CSS inlined
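For the email-attachment use case in the question, a short sketch under the assumption that the page and its stylesheet are local files (the file names are placeholders):
from pynliner import Pynliner

with open("index.html") as f:
    html = f.read()
with open("style.css") as f:
    css = f.read()

# Inline the stylesheet rules so the result is a single self-contained file
inlined = Pynliner().from_string(html).with_cssString(css).run()

with open("email.html", "w") as f:
    f.write(inlined)
Inlined style attributes are generally what email clients expect, which makes this approach a good fit for attachments.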


HTML scraping a website with duplicated div class names

I am currently working on scraping HTML from Baka-Updates (mangaupdates.com).
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the information in [sCat] as the column names and the information in [sContent] as the stored values.
Is there a way to scrape this kind of website?
Thanks.
Sample: https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558')
tree = html.fromstring(page.content)
# Get the names of the columns... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I couldn't find anything that works.
@Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558')
tree = html.fromstring(page.content)
# print(page.content)
# Get the names of the columns... I hope
sCat = tree.xpath('//div[@class="sCat"]')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))

json_dict = {}
for i in range(len(sCat)):
    # print(''.join(sCat[i].itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
I got the expected dictionary as output (screenshot omitted).
Hope it helps.
You can use XPath expressions to build an absolute path to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
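Since the goal is CSV or JSON, here is a small follow-on sketch that pairs the two lists and writes both formats (the output file names are placeholders):
import csv
import json

# Pair each label with its value, assuming the two lists line up in document order
data = dict(zip(sCat, sContent))

with open("series.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open("series.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(data))
    writer.writeheader()
    writer.writerow(data)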
What are you using to scrape?
If you are using BeautifulSoup, you can search for all content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ keyword argument.
Something like:
import bs4

soup = bs4.BeautifulSoup(html_source, 'html.parser')  # html_source is the page HTML
soup.find_all('div', class_='sCat')
# do the rest of your logic work here
Edit: I was typing on my mobile from a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I'm not too familiar with it, as I've only used raw lxml for one project. I think you can chain two search methods to distill to something equivalent; see the sketch below.
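For what it's worth, a rough bs4 sketch of that idea, assuming the page pairs each sCat div with an sContent div in document order:
import bs4
import requests

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = bs4.BeautifulSoup(r.content, 'html.parser')

# Collect the label and value divs separately, then zip them into one dict
cats = [div.get_text(strip=True) for div in soup.find_all('div', class_='sCat')]
contents = [div.get_text(strip=True) for div in soup.find_all('div', class_='sContent')]
print(dict(zip(cats, contents)))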

Using BeautifulSoup4 with Google Translate

I am currently going through the Web Scraping section of AutomateTheBoringStuff and trying to write a script that extracts translated words from Google Translate using BeautifulSoup4.
I inspected the html content of a page where 'Explanation' is the translated word:
<span id="result_box" class="short_text" lang="en">
<span class>Explanation</span>
</span>
Using BeautifulSoup4, I tried different selectors but nothing would return the translated word. Here are a few examples I tried, but they return no results at all:
soup.select('span[id="result_box"] > span')
soup.select('span span')
I even copied the selector directly from the Developer Tools, which gave me #result_box > span. This again returns no results.
Can someone explain to me how to use BeautifulSoup4 for my purpose? This is my first time using BeautifulSoup4, but I think I am using it more or less correctly, because the selector
soup.select('span[id="result_box"]')
gets me the outer span element:
[<span class="short_text" id="result_box"></span>]
(Not sure why the lang="en" part is missing, but I am fairly certain I have located the correct element regardless.)
Here is the complete code:
import bs4, requests
url = 'https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
translation = soup.select('#result_box span')
print(translation)
EDIT: If I save the Google Translate page as an offline html file and then make a soup object out of that html file, there would be no problem locating the element.
import bs4
file = open("Google Translate.html")
soup = bs4.BeautifulSoup(file, "html.parser")
translation = soup.select('#result_box span')
print(translation)
The result_box element is the correct one, but your code only works when you save what you see in your browser, because the saved copy includes the dynamically generated content. Using requests you get only the source itself, without any dynamically generated content. The translation is generated by an AJAX call to the URL below:
"https://translate.google.ca/translate_a/single?client=t&sl=zh-CN&tl=en&hl=en&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&source=bh&ssel=0&tsel=0&kc=1&tk=902911.786207&q=%E6%B2%BB%E5%85%B7"
For your requests it returns:
[[["Fixture","治具",,,0],[,,,"Zhì jù"]],,"zh-CN",,,[["治 具",1,[["Fixture",999,true,false],["Fixtures",0,true,false],["Jig",0,true,false],["Jigs",0,true,false],["Governance",0,true,false]],[[0,2]],"治具",0,1]],1,,[["ja"],,[1],["ja"]]]
So you will either have to mimic that request, passing all the necessary parameters, or use something that supports dynamic content, like Selenium.
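A minimal Selenium sketch of the second option (assuming a Firefox driver is installed; Google Translate's markup changes often, so the result_box id may no longer exist):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://translate.google.ca/#zh-CN/en/%E6%B2%BB%E5%85%B7')

# Wait for the dynamically generated translation to appear
result = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'result_box'))
)
print(result.text)
driver.quit()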
Simply try this:
translation = soup.select('#result_box span')[0].text
print(translation)
You can try this different approach:
if filename.endswith(extension_file):
    with open(os.path.join(files_from_folder, filename), encoding='utf-8') as html:
        soup = BeautifulSoup('<pre>' + html.read() + '</pre>', 'html.parser')
        for title in soup.findAll('title'):
            recursively_translate(title)  # defined in the complete code linked below
For the complete code, please see here:
https://neculaifantanaru.com/en/python-code-text-google-translate-website-translation-beautifulsoup-library.html
or HERE:
https://neculaifantanaru.com/en/example-google-translate-api-key-python-code-beautifulsoup.html

Python: Getting text from HTML using BeautifulSoup

I am trying to extract the ranking text number from a Kaggle user profile (example: the no. 1 ranked user).
I am using the following code:
import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    soup = BeautifulSoup(plainText)
    for item_name in soup.findAll('h4', {'data-bind': "text: rankingText"}):
        print(item_name.string)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
The result is None. The problem is that soup.findAll('h4',{'data-bind':"text: rankingText"}) outputs:
[<h4 data-bind="text: rankingText"></h4>]
but when inspecting the HTML of the page it looks like this:
<h4 data-bind="text: rankingText">1st</h4>
It's clear that the text is missing. How can I get around that?
Edit:
Printing the soup variable in the terminal, I can see that this value exists, so there should be a way to access it through soup.
Edit 2: I tried, without success, the most-voted answer from this Stack Overflow question. There could be a solution around there.
If you aren't going to try browser automation through Selenium as @Ali suggested, you will have to parse the JavaScript containing the desired information. You can do this in different ways. Here is working code that locates the script by a regular expression pattern, extracts the profile object, loads it with json into a Python dictionary, and prints out the desired ranking:
import re
import json

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.kaggle.com/titericz")
soup = BeautifulSoup(response.content, "html.parser")

# Find the <script> element whose text contains the "profile: {...}" object
pattern = re.compile(r"profile: ({.*}),", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

profile_text = pattern.search(script.text).group(1)
profile = json.loads(profile_text)
print(profile["ranking"], profile["rankingText"])
Prints:
1 1st
The data is data-bound using JavaScript, as the "data-bind" attribute suggests.
However, if you download the page with e.g. wget, you'll see that the rankingText value is actually there inside this script element on initial load:
<script type="text/javascript">
profile: {
...
"ranking": 96,
"rankingText": "96th",
"highestRanking": 3,
"highestRankingText": "3rd",
...
So you could use that instead.
I have solved your problem using a regex on the plain text:
import re
import requests

def get_single_item_data(item_url):
    sourceCode = requests.get(item_url)
    plainText = sourceCode.text
    # soup = BeautifulSoup(plainText, "html.parser")
    pattern = re.compile(r'ranking": [0-9]+')
    name = pattern.search(plainText)
    ranking = name.group().split()[1]
    print(ranking)

item_url = 'https://www.kaggle.com/titericz'
get_single_item_data(item_url)
This returns only the rank number, but I think it will help you, since from what I see rankingText just adds 'st', 'nd', 'th', etc. to the right of the number.
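If you do need the ordinal text, a small helper can rebuild it from the number using standard English ordinal rules:
def ordinal(n):
    # 11, 12, 13 take "th" despite ending in 1, 2, 3
    if 11 <= n % 100 <= 13:
        return "%dth" % n
    suffixes = {1: "st", 2: "nd", 3: "rd"}
    return "%d%s" % (n, suffixes.get(n % 10, "th"))

print(ordinal(1))   # 1st
print(ordinal(96))  # 96th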
This could be because of dynamic data filling.
Some JavaScript code fills in this tag after the page loads, so if you fetch the HTML using requests it is not filled in yet:
<h4 data-bind="text: rankingText"></h4>
Please take a look at the Selenium web driver. Using it you can fetch the complete page and run the JavaScript as normal.
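A hedged Selenium sketch for this page (the data-bind selector comes from the question and the Kaggle markup may have changed since):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('https://www.kaggle.com/titericz')

# Wait until the data-bound h4 has been filled in by the page's JavaScript
h4 = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'h4[data-bind="text: rankingText"]'))
)
print(h4.text)
driver.quit()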

How to detect the language of webpage content using Python

I have to test a bunch of URLs to see whether those webpages have the respective translated content or not. Is there any way to return the language of the content of a webpage using Python? For example, if the page is in Chinese, it should return "Chinese".
I checked the langdetect module, but was not able to get the results I desire. These URLs are in web XML format, and the content is shown under <releasehigh>.
Here is a simple example demonstrating the use of BeautifulSoup to extract the HTML body text and langdetect for the language detection:
from bs4 import BeautifulSoup
from langdetect import detect

with open("foo.html", "rb") as f:
    soup = BeautifulSoup(f, "lxml")

[s.decompose() for s in soup("script")]  # remove <script> elements
body_text = soup.body.get_text()
print(detect(body_text))
You can extract a chunk of content and then use some Python language detection library like langdetect or guess-language.
Maybe you have a header like this one:
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If that's the case, you can see from lang="fr" that this is a French web page. If not, guessing the language of a text is not trivial.
You can use BeautifulSoup to extract the language from HTML source code.
<html class="no-js" lang="cs">
Extract the lang field from source code:
from bs4 import BeautifulSoup
import requests

url = "http://example.com"  # placeholder: the page to check
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html["lang"])
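Putting the two answers together, a small sketch that prefers the declared lang attribute and falls back to langdetect when it is missing:
from bs4 import BeautifulSoup
from langdetect import detect
import requests

def page_language(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    if soup.html and soup.html.get("lang"):
        return soup.html["lang"]          # declared language, e.g. "fr"
    for s in soup(["script", "style"]):   # drop non-visible text first
        s.decompose()
    return detect(soup.get_text())        # guess from the visible text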

Extracting comments from news articles

My question is similar to the one asked here:
https://stackoverflow.com/questions/14599485/news-website-comment-analysis
I am trying to extract comments from any news article. E.g. I have a news URL here:
http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/
I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when they come from a different URL embedded within the news web page?
This is what I have done till now, although it is not much:
import urllib2
from bs4 import BeautifulSoup

opener = urllib2.build_opener()
url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'
urlContent = opener.open(url).read()
soup = BeautifulSoup(urlContent)
title = soup.title.text
print title
body = soup.findAll('body')
outfile = open("brain.txt", "w+")
for i in body:
    i = i.text.encode('ascii', 'ignore')
    outfile.write(i + '\n')
Any help in what I need to do or how to go about it will be much appreciated.
It's inside an iframe. Check for a frame with id="dsq2".
The iframe has a src attribute, which is a link to the actual site that hosts the comments.
So in Beautiful Soup: use css_soup.select("#dsq2") and get the URL from the src attribute. It will lead you to a page that has only comments.
To get the actual comments, after you fetch the page from src, you can use this CSS selector: .post-message p
And if you want to load more comments, clicking the "more comments" button seems to send this:
http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
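A sketch of those steps (the dsq2 id and the .post-message p selector come from this answer and may no longer exist on the live site):
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html')
soup = BeautifulSoup(page.content, 'html.parser')

iframe = soup.select_one('#dsq2')              # the Disqus comments iframe
comments_page = requests.get(iframe['src'])    # follow the iframe to the comments-only page
comments_soup = BeautifulSoup(comments_page.content, 'html.parser')

for p in comments_soup.select('.post-message p'):
    print(p.get_text(strip=True))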
