I have html files with blocks like this:
<script type="text/javascript">
var json1 = {
// ...
}
</script>
Using the names of the vars - e.g. "json1" - what is a straightforward way to extract the json? Could a regex do it, or do I need something like Beautiful Soup?
Yes, you can use both together: BeautifulSoup to find the <script> tag, and a regex to pull out the object literal:
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
html = ...  # your HTML text
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile('json1'))
json_text = re.search(r'^\s*var\s+json1\s*=\s*({.*?})\s*;?\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
print(data)
I found something simple that worked in my case: get the position of "var json1 = ", then call html.find("", startOfJson1) to locate the end, and use the two indexes to slice the JSON from the string.
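The slicing idea can be sketched like this (a sketch only; it uses json.JSONDecoder.raw_decode, which parses one JSON value and reports where it ends, so there is no need to hunt for the closing brace by hand):

```python
import json

html = '<script type="text/javascript">\nvar json1 = {"a": 1, "b": [2, 3]}\n</script>'

# Find where the object literal starts, right after the marker
marker = "var json1 = "
start = html.find(marker) + len(marker)

# raw_decode parses exactly one JSON value and returns (value, end_index),
# so the trailing </script> is simply ignored
data, end = json.JSONDecoder().raw_decode(html[start:])
print(data)  # {'a': 1, 'b': [2, 3]}
```

This avoids brittle index arithmetic as long as the object literal is valid JSON.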
Related
I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate each result to get text, for ex:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the BeautifulSoup documentation.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
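To pair each ID with its title, you can zip the two result lists. Here is a sketch against a trimmed-down stand-in for the API XML (the real response has many more fields, and the exact tag structure here is an assumption based on the full_studies format described above):

```python
from bs4 import BeautifulSoup

xml = """
<StudyFieldsList>
  <StudyFields Rank="1">
    <Field Name="NCTId">NCT04000000</Field>
    <Field Name="OfficialTitle">A Trial of Something</Field>
  </StudyFields>
  <StudyFields Rank="2">
    <Field Name="NCTId">NCT04000001</Field>
    <Field Name="OfficialTitle">Another Trial</Field>
  </StudyFields>
</StudyFieldsList>
"""

# html.parser also lowercases tag and attribute names, so "Field" becomes
# "field" and "Name" becomes "name" (attribute *values* keep their case)
soup = BeautifulSoup(xml, "html.parser")
ids = [f.text for f in soup.find_all("field", attrs={"name": "NCTId"})]
titles = [f.text for f in soup.find_all("field", attrs={"name": "OfficialTitle"})]
for nct_id, title in zip(ids, titles):
    print(nct_id, "-", title)
```

Zipping keeps the pairing correct as long as every study has both fields.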
I have something like this:
(async () => {
await import("https://s-gr.cdngr.pl/assets/gratka/v0.40.7/dist/js/Map.js");
Map.init('#item-map', {
gratkaMapsUrl: 'https://map.api.gratka.it',
assetsUrl: 'https://s-gr.cdngr.pl/assets/gratka/v0.40.7/dist/',
locationApiHost: 'https://locations.api.gratka.it/locations/v1',
apiUrl: 'https://gratka.api.gratka.it/gratka/v2',
eventType: 'click',
statisticsType: 'show_map',
locationParams: {"lokalizacja_ulica":"aleja Marsz. J\u00f3zefa Pi\u0142sudskiego","lokalizacja_szerokosc-geograficzna-y":52.231069627971,"lokalizacja_region":"mazowieckie","lokalizacja_powiat":"Warszawa","lokalizacja_miejscowosc":"Warszawa","lokalizacja_kraj":"Polska","lokalizacja_gmina":"Warszawa","lokalizacja_dlugosc-geograficzna-x":21.2497334550424},
offersId: [18702037]
});
})();
I'm looking for a method to extract these params: "lokalizacja_ulica", "lokalizacja_szerokosc-geograficzna-y", and "lokalizacja_dlugosc-geograficzna-x". Any ideas? I'm a Python newbie.
As far as I know, you cannot extract information from JavaScript with bs4. You can use a regex, though:
from bs4 import BeautifulSoup
import json
import re
soup = BeautifulSoup(html_text, 'html.parser')  # html_text holds your HTML
script = soup.find('script').string
match = re.search(r'(?<=locationParams: ).+(?=,\n)', script, re.M).group(0)
data = json.loads(match)
The (?<=locationParams: ).+(?=,\n) pattern looks for anything that has "locationParams: " before it and a comma followed by a newline character after it. Then you can pass that string into json.loads(), which turns it into a Python dictionary.
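Putting it together on a trimmed-down stand-in for the script (the values below are made up), you can then pick out just the three keys you care about:

```python
import json
import re

# A shortened stand-in for the real <script> contents
script = '''Map.init('#item-map', {
    eventType: 'click',
    locationParams: {"lokalizacja_ulica": "aleja Testowa", "lokalizacja_szerokosc-geograficzna-y": 52.231, "lokalizacja_dlugosc-geograficzna-x": 21.249},
    offersId: [18702037]
});'''

# Grab the object literal sitting between "locationParams: " and ",\n"
match = re.search(r'(?<=locationParams: ).+(?=,\n)', script, re.M)
data = json.loads(match.group(0))

wanted = ["lokalizacja_ulica",
          "lokalizacja_szerokosc-geograficzna-y",
          "lokalizacja_dlugosc-geograficzna-x"]
print({k: data[k] for k in wanted})
```

A dict comprehension over the parsed object keeps only the parameters you asked about.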
I have managed to get the script tag using BeautifulSoup. Then I turned it into a JSON object. The information that I want is within data['x'], but it is stuck between b tags.
Example :
<b>infoiwant</b><br>NA<br>infoinwant</br>columniwant: 123','<b>infoiwant</b><br>NA<br>columniwant: 123'</br>columniwant: 123
How would I go about getting the info out of these b elements?
Before converting to JSON, can you use the BeautifulSoup get_text() method? Maybe something like
soup.find('b').get_text()
One way to extract the data from the <script> tag is to use the re module:
import re
from bs4 import BeautifulSoup
html_text = """
<script>
var data['x'] = '<b>infoiwant</b><br>NA<br>infoinwant</br>columniwant: 123';
</script>
"""
html_data = re.search(r"data\['x'\] = '(.*?)';", html_text).group(1)
soup = BeautifulSoup(html_data, "html.parser")
print(soup.find("b").get_text(strip=True))
Prints:
infoiwant
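If there are several b elements (with stray text between them), you can collect all of them with find_all() and pull the numeric value out of the surrounding text with a regex. A sketch on a fragment shaped like the example above (the tag structure is assumed from that sample):

```python
import re
from bs4 import BeautifulSoup

fragment = "<b>infoiwant</b><br>NA<br>otherinfo columniwant: 123"
soup = BeautifulSoup(fragment, "html.parser")

# Every <b> element's text
bolds = [b.get_text(strip=True) for b in soup.find_all("b")]
print(bolds)  # ['infoiwant']

# The "columniwant" number lives in plain text, so use a regex on get_text()
value = int(re.search(r"columniwant:\s*(\d+)", soup.get_text()).group(1))
print(value)  # 123
```

Working on soup.get_text() sidesteps the malformed </br> markup entirely.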
I am downloading HTML pages that have data defined in them in the following way:
... <script type="text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Edit:
Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?
BeautifulSoup is an HTML parser; you also need a JavaScript parser here. By the way, some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
In simple cases you could:
extract the <script> tag's text using an HTML parser
assume that window.blog... is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
assume that the string is valid JSON and parse it using the json module
Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect then the code fails.
To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
if (isinstance(node, ast.Assign) and
node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like other AST nodes. That would allow supporting arbitrary JavaScript code (anything slimit can parse).
Something like this may work:
import re
HTML = """
<html>
<head>
...
<script type= "text/javascript">
window.blog.data = {"activity":
{"type":"read"}
};
...
</script>
</head>
<body>
...
</body>
</html>
"""
JSON = re.compile(r'window\.blog\.data = ({.*?});', re.DOTALL)
matches = JSON.search(HTML)
print(matches.group(1))
I had a similar issue and ended up using selenium with phantomjs. It's a little hacky and I couldn't quite figure out the correct wait until method, but the implicit wait seems to work fine so far for me.
from selenium import webdriver
import json
import re
url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*<\/script>', driver.page_source).group(0)
# take everything after the first '=' and keep only the first object:
# cut at the first '};' and restore the closing brace
# (note: rstrip('</script>') would strip any of those characters, not the tag)
json_text = script_text.split('=', 1)[1].split('};')[0].strip() + '}'
data = json.loads(json_text)
driver.quit()
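The string cleanup can be exercised without a browser. Here is a sketch of the split-on-'=' idea running against a static stand-in for driver.page_source:

```python
import json
import re

# A static stand-in for driver.page_source
page_source = '<html><script> window.blog.data = {"activity":{"type":"read"}}; </script></html>'

# Grab from the assignment up to the closing script tag
script_text = re.search(r'window\.blog\.data\s*=.*?</script>', page_source).group(0)

# Split on the first '=' and drop the trailing tag and semicolon
json_text = script_text.split('=', 1)[1].split('</script>')[0].strip().rstrip(';')
data = json.loads(json_text)
print(data['activity']['type'])  # read
```

Testing the parsing logic on a fixture like this is much faster than round-tripping through selenium each time.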
A fast and easy way is a regex of the form ('exactly the start here (.*?) exactly the end here'). That's all!
import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
then simply
re.search('{"activity":{"type":"(.*?)"', html).group(1)
or, for the full JSON,
jsondata = re.search(r'window\.blog\.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])
#output {'type': 'read'}
I'm writing a python script which will extract the script locations after parsing from a webpage.
Lets say there are two scenarios :
<script type="text/javascript" src="http://example.com/something.js"></script>
and
<script>some JS</script>
I'm able to get the JS in the second scenario, that is, when the JS is written within the tags.
But is there any way I could get the value of src in the first scenario (i.e., extract all the src values of script tags, such as http://example.com/something.js)?
Here's my code
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup
r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for n in soup.find_all('script'):
    print(n)
Output : Some JS
Output Needed : http://example.com/something.js
This will get all the src values where they are present; any <script> tag without one is skipped:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
sources = soup.findAll('script', {"src": True})
for source in sources:
    print(source['src'])
I am getting following two src values as result
http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js
I guess this is what you want. Hope this is useful.
Get 'src' from the script node:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for n in soup.find_all('script'):
    print("src:", n.get('src'))
This should work. Filter to find all the script tags, then determine whether each has a src attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise, we assume the JavaScript is contained within the tag:
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup
# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script> <script>some JS</script>'
soup = BeautifulSoup(html, 'html.parser')
# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text
    print(javascript)
The output of this is:
http://example.com/something.js
some JS