parse json in html page with python [duplicate] - python
I am downloading HTML pages that have data defined in them in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Edit:
Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?
BeautifulSoup is an HTML parser; you also need a JavaScript parser here. By the way, some JavaScript object literals are not valid JSON (though in your example the literal happens to be a valid JSON object too).
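For instance (a minimal illustration with a made-up literal): a JavaScript object literal that uses unquoted keys or single-quoted strings is valid JavaScript but is rejected by Python's json module, while the literal from the question parses fine:
import json
# valid JavaScript, but not valid JSON: unquoted key, single quotes
js_literal = "{activity: {type: 'read'}}"
try:
    json.loads(js_literal)
except json.JSONDecodeError as err:
    print("not valid JSON:", err)
# the literal from the question happens to be valid JSON as well
print(json.loads('{"activity":{"type":"read"}}'))  # {'activity': {'type': 'read'}}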
In simple cases you could:
extract the <script> element's text using an HTML parser
assume that the window.blog... assignment is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
assume that the string is valid JSON and parse it using the json module
Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect then the code fails.
To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
if (isinstance(node, ast.Assign) and
node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like the other ast nodes (see the sketch below). That would allow supporting arbitrary JavaScript code (anything that slimit can parse).
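A rough sketch of such a recursive conversion (only a sketch: the attribute names used here, Object.properties, Array.items, String.value and Number.value, are assumptions about slimit's AST and may need adjusting):
def js_to_python(node):
    # convert a slimit AST literal node into a plain Python value (sketch)
    if isinstance(node, ast.Object):
        # each property is an Assign node: key on the left, value on the right
        return {js_to_python(prop.left): js_to_python(prop.right)
                for prop in node.properties}
    if isinstance(node, ast.Array):
        return [js_to_python(item) for item in node.items]
    if isinstance(node, ast.String):
        return node.value[1:-1]  # strip the surrounding quotes
    if isinstance(node, ast.Number):
        return float(node.value)
    return node.to_ecma()  # fallback: keep the raw source text
data = js_to_python(obj)
assert data['activity']['type'] == 'read'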
Something like this may work:
import re
HTML = """
<html>
<head>
...
<script type= "text/javascript">
window.blog.data = {"activity":
{"type":"read"}
};
...
</script>
</head>
<body>
...
</body>
</html>
"""
JSON = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)
matches = JSON.search(HTML)
print(matches.group(1))
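The captured group is still just a string; assuming it is valid JSON, as it is in the example above, it can then be loaded with the json module:
import json
data = json.loads(matches.group(1))
print(data["activity"]["type"])  # read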
I had a similar issue and ended up using selenium with phantomjs. It's a little hacky and I couldn't quite figure out the correct wait-until method, but the implicit wait seems to work fine for me so far.
from selenium import webdriver
import json
import re
url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*</script>', driver.page_source).group(0)
# split on the first equals sign, then drop the trailing script tag and semicolon
json_text = script_text.split('=', 1)[1].replace('</script>', '').strip().rstrip(';').strip()
# only keep the first json object if more javascript follows it
if '};' in json_text:
    json_text = json_text.split('};')[0] + '}'
data = json.loads(json_text)
driver.quit()
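Note that PhantomJS support has been removed from recent Selenium releases, so on a current setup the same idea can be expressed with headless Chrome instead (a sketch, assuming a working chromedriver installation):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get(url)
# ... then the same page_source / regex handling as above ...
driver.quit()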
A fast and easy way is a regex of the form ('exactly the start here' + (.*?) + 'exactly the end here'), that's all!
import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
then simply:
re.search('{"activity":{"type":"(.*?)"', html).group(1)
or, for the full json:
jsondata = re.search(r'window\.blog\.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])
#output {'type': 'read'}
Related
To use Regex or not, to grab a json value from HTML
So I have been getting some mixed answers here: whether to use regex or not. What I am trying to do is grab a specific value (the JSON of spConfig) from the HTML, which is:
<script type="text/x-magento-init"> { "#product_addtocart_form": { "configurable": { "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]}, "gallerySwitchStrategy": "replace" } } } </script>
And here is the problem: when scraping the HTML there are multiple <script type="text/x-magento-init"> blocks but only one spConfig, and I have two questions. Should I grab the spConfig value using a regex, so I can later use json.loads(spConfigValue), or not? If not, what method should I use to scrape the json value? If I am supposed to use a regex, I have been trying to grab it with \"spConfig\"\: (.*?) but it is not capturing the json value for me. What am I doing wrong?
In this case, with bs4 4.7.1+, :contains is your friend. You say there is only a single match for that, so you can do the following:
from bs4 import BeautifulSoup as bs
import json
html = '''<html> <head> <script type="text/x-magento-init"> { "#product_addtocart_form": { "configurable": { "spConfig": {"attributes":{"93":{"id":"93","code":"color","label":"Color","options":[{"id":"8243","label":"Helloworld","products":["97460","97459"]}],"position":"0"},"148":{"id":"148","code":"codish","label":"Codish","options":[{"id":"4707","label":"12.5","products":[]},{"id":"2724","label":"13","products":[]},{"id":"4708","label":"13.5","products":[]}],"position":"1"}},"template":"EUR <%- data.price %>","optionPrices":{"97459":{"oldPrice":{"amount":121},"basePrice":{"amount":121},"finalPrice":{"amount":121},"tierPrices":[]}},"prices":{"oldPrice":{"amount":"121"},"basePrice":{"amount":"121"},"finalPrice":{"amount":"121"}},"productId":"97468","chooseText":"Choose an Option...","images":[],"index":[]}, "gallerySwitchStrategy": "replace" } } } </script> </head> <body></body> </html>'''
soup = bs(html, 'html.parser')
data = json.loads(soup.select_one('script:contains(spConfig)').text)
The spConfig is then data['#product_addtocart_form']['configurable']['spConfig'].
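Note that newer soupsieve releases deprecate the :contains alias; if you see a deprecation warning, the supported spelling is :-soup-contains, e.g.:
data = json.loads(soup.select_one('script:-soup-contains("spConfig")').text)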
No, don't ever use regex for HTML. Use HTML parsers like BeautifulSoup instead!
So basically: for json use a json parser, right? 🤔 And for yaml use a yaml parser 🤔 so for HTML, do use an HTML parser. See the example below from the standard library; it will make your life shine:
from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data :", data)
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
https://docs.python.org/3/library/html.parser.html
Extracting json from html
I have html files with blocks like this:
<script type="text/javascript">
var json1 = {
    // ...
}
</script>
Using the names of the vars, e.g. "json1", what is a straightforward way to extract the json? Could a regex do it, or do I need something like Beautiful Soup?
Yes, you need both regex and Beautiful Soup:
import json
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4
html = ...  # your html markup as a string
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile('json1'))
json_text = re.search(r'^\s*var\s+json1\s*=\s*({.*?})\s*;?\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
print(data)
I found something simple that worked in my case: get the position of "var json1 = " with html.find(), then search forward from that position for the end of the object, and use the two indexes to slice the json out of the string.
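A minimal sketch of that find-and-slice approach (the input string is illustrative, and it assumes the object literal contains no '};' of its own):
import json
html = '<script>var json1 = {"a": {"b": 1}};</script>'  # illustrative input
start = html.find('var json1 = ')
start = html.find('{', start)     # first brace of the object literal
end = html.find('};', start)      # assumes no '};' occurs inside the object
data = json.loads(html[start:end + 1])
print(data)  # {'a': {'b': 1}}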
Python HTML parsing: removing excess HTML from get request output
I want to make a simple python script to automate the process of pulling .mov files from an IP camera's SD card. The model of IP camera supports HTTP requests, which return HTML containing the .mov file info. My python script so far:
from bs4 import BeautifulSoup
import requests
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
OUTPUT: NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov
I want to return only the MOV file, i.e. remove "NAME2041=Record_continiously/2018-06-02/8/". I'm new to HTML parsing with python, so I'm a bit confused about the functionality. Is the returned HTML considered a string? If so, I understand that it will be immutable and I will have to create a new string instead of "stripping away" part of the preexisting string. I have tried page.replace("NAME2041=Record_continiously/2018-06-02/8/","") but I receive an attribute error. Is anyone aware of a method that could accomplish this? Here is a sample of the HTML I am working with:
<html> <head></head> <body> 000 Success NUM=2039 NAME0=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-17-38_60.mov SIZE0=15736218 NAME1=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-16-37_60.mov SIZE1=15683077 NAME2=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-15-36_60.mov SIZE2=15676882 NAME3=Record_Continuously/2018-06-04/10/MP_2018-06-04_12-14-35_60.mov SIZE3=15731539 </body> </html>
Use str.split with negative indexing. Ex:
page = "NAME2041=Record_continiously/2018-06-02/8/MP_2018-06-03_00-33-15_60.mov"
print(page.split("/")[-1])
Output:
MP_2018-06-03_00-33-15_60.mov
As you asked for an explanation of your code, here it is:
# import statements
from bs4 import BeautifulSoup
import requests
# requests.get returns a response object
page = requests.get("http://192.168.1.99/form/getStorageFileList?type=3")
# page.content is the string content of the response; you pass it to the
# BeautifulSoup class along with a parser, here 'html.parser'
soup = BeautifulSoup(page.content, 'html.parser')
soup is the BeautifulSoup object, and .prettify() is the method used to pretty-print the content. With string slicing you may get wrong results depending on the length of the content, so it's better to split your content as suggested by @Rakesh; that's the best approach in your case.
HTML Parsing issue with BeautifulSoup Library
I am working with the BS library for HTML parsing. My task is to remove everything between the head tags, so if I have <head> A lot of Crap! </head> then the result should be <head></head>. This is the code for it:
raw_html = "entire_web_document_as_string"
soup = BeautifulSoup(raw_html)
head = soup.head
head.unwrap()
print(head)
And this works fine. But I want these changes to take place in the raw_html string that contains the entire html document. How do I reflect these commands in the original string and not only in the head string? Can you share a code snippet for doing it?
You're basically asking how to export a string of HTML from BS's soup object. You can do it this way:
# Python 2.7
modified_raw_html = unicode(soup)
# Python 3
modified_raw_html = str(soup)
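A short usage sketch tying this back to the question (here Tag.clear() is used to empty the head, which is a slightly different call than the unwrap() in the question):
from bs4 import BeautifulSoup
raw_html = "<html><head><title>A lot of Crap!</title></head><body>ok</body></html>"
soup = BeautifulSoup(raw_html, 'html.parser')
soup.head.clear()               # remove everything between the head tags
modified_raw_html = str(soup)   # export the whole document back to a string
print(modified_raw_html)        # <html><head></head><body>ok</body></html>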
Python: find on string return none
I'm working on a project to parse an HTML page. It is for an internal website within a company, but I have changed the example so you can try it. I get the source code of an HTML page and search for a certain markup. Then I want to extract a substring of this markup, but it doesn't work: Python returns None. Below is my code, with Python's output in the comments:
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
response = urllib2.urlopen("http://www.resto.be/restaurant/liege/4000-liege/8219-le-bar-a-gouts/")
page_source = response.read()
soup = BeautifulSoup(page_source)
name = soup.find_all("meta", attrs={"itemprop":"name"})
print(name[0])  # <meta content="LE BAR A GOUTS" itemprop="name"/>
print(name[0].find("<meta"))  # None
You don't have a string, you have a Tag object. Printing the tag gives a nice HTML representation, but it is not a string object. As such, you are calling the BeautifulSoup Tag.find() method, which returns None when there are no child tags with the tag name <meta, and indeed there are none here. If you want the content attribute, use item access:
print name[0]['content']