I have something like this:
(async () => {
    await import("https://s-gr.cdngr.pl/assets/gratka/v0.40.7/dist/js/Map.js");
    Map.init('#item-map', {
        gratkaMapsUrl: 'https://map.api.gratka.it',
        assetsUrl: 'https://s-gr.cdngr.pl/assets/gratka/v0.40.7/dist/',
        locationApiHost: 'https://locations.api.gratka.it/locations/v1',
        apiUrl: 'https://gratka.api.gratka.it/gratka/v2',
        eventType: 'click',
        statisticsType: 'show_map',
        locationParams: {"lokalizacja_ulica":"aleja Marsz. J\u00f3zefa Pi\u0142sudskiego","lokalizacja_szerokosc-geograficzna-y":52.231069627971,"lokalizacja_region":"mazowieckie","lokalizacja_powiat":"Warszawa","lokalizacja_miejscowosc":"Warszawa","lokalizacja_kraj":"Polska","lokalizacja_gmina":"Warszawa","lokalizacja_dlugosc-geograficzna-x":21.2497334550424},
        offersId: [18702037]
    });
})();
I'm looking for a way to extract these params: "lokalizacja_ulica", "lokalizacja_szerokosc-geograficzna-y", and "lokalizacja_dlugosc-geograficzna-x". Any ideas? I'm a Python newbie :<
You cannot extract information from JavaScript with bs4, AFAIK. You can use a regex, though.
from bs4 import BeautifulSoup
import json
import re

soup = BeautifulSoup(html_text, 'html.parser')  # html_text is your page source
script = soup.find('script').string
# grab everything between "locationParams: " and the comma-newline that ends the line
match = re.search(r'(?<=locationParams: ).+(?=,\n)', script, re.M).group(0)
data = json.loads(match)
The (?<=locationParams: ).+(?=,\n) pattern matches anything that has "locationParams: " before it and a comma followed by a newline character after it. Then you can pass that string to json.loads(), which turns it into a Python dictionary.
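From there, the three params you asked about are plain dictionary lookups (continuing from the data variable above):

street = data["lokalizacja_ulica"]
lat = data["lokalizacja_szerokosc-geograficzna-y"]
lon = data["lokalizacja_dlugosc-geograficzna-x"]
print(street, lat, lon)
# aleja Marsz. Józefa Piłsudskiego 52.231069627971 21.2497334550424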
Related
I have html files with blocks like this:
<script type="text/javascript">
var json1 = {
// ...
}
</script>
Using the names of the vars - e.g. "json1" - what is a straightforward way to extract the JSON? Could a regex do it, or do I need something like Beautiful Soup?
Yes, you need both a regex and BeautifulSoup:
import json
import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

html = ...  # your html output
soup = BeautifulSoup(html, 'html.parser')
# find the <script> whose text mentions json1
script = soup.find('script', text=re.compile('json1'))
# allow for the leading 'var' and an optional trailing semicolon
json_text = re.search(r'^\s*var\s+json1\s*=\s*({.*?})\s*;?\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
print(data)  # data is the parsed object itself, not wrapped in a 'json1' key
I found something simple that worked in my case: get the position of "var json1 = " with str.find(), then find the closing brace from that offset, and use the two indexes to slice the JSON out of the string.
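A minimal sketch of that slicing idea (the extract_js_object helper is hypothetical; it balances braces so nested objects don't cut the slice short, though it will still break if a string literal contains a brace):

import json

def extract_js_object(html, marker):
    # find the assignment, then scan forward, balancing braces
    start = html.find('{', html.find(marker))
    depth = 0
    for i in range(start, len(html)):
        if html[i] == '{':
            depth += 1
        elif html[i] == '}':
            depth -= 1
            if depth == 0:
                return html[start:i + 1]
    return None

data = json.loads(extract_js_object(html, 'var json1 = '))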
I am trying to extract data from a website https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited using Scrapy and Beautiful Soup. However, both scrapers return empty when I use the class 'list-nw'.
I tried different parsers with BS, but got the same result. On closer look, I noticed the view source has the data I need, so I get the page content as text, which contains the data (rather than going through the class).
How do I extract the entire array for the key "LstrationaleDetails" inside the variable var Model (line 793 of the view source) using regex?
I tried several regexes but was unable to. Is regex the only option, or can I use Scrapy or BS? I'm also confused about how I will store it after extracting; if it were JSON I could deserialize it. I was thinking of something along the lines of split and eval.
I tried this for BS.
import urllib.request

page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html5lib')
print(soup)
Thanks for the help.
Attributable to @t.m.adam
You can use the following regex to extract the object from the source HTML. Use the DOTALL flag so '.' also matches newlines. A User-Agent header is required.
import requests
import re
import json

url = 'https://www.icra.in/Rationale/Index?CompanyName=20%20Microns%20Limited'
headers = {
    'User-Agent': 'Mozilla/5.0'
}
r = requests.get(url, headers=headers)
# capture everything between 'var Model =' and the ';' before the next statement
data = re.search(r'var Model =(.*?);\s+Ratinoal', r.text, flags=re.DOTALL).group(1)
result = json.loads(data)

for item in result['LstrationaleDetails']:
    print(item)
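As for storing it: once json.loads() has produced a Python object, you can serialize it straight back to disk with json.dump (a minimal sketch; the rationale.json filename is just an example):

with open('rationale.json', 'w') as f:
    json.dump(result['LstrationaleDetails'], f, indent=2)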
FTR I have written quite a few scrapers successfully in both frameworks but I'm stumped. Here is a screenshot of the data I'm trying to scrape (you can also go to the actual link in the get request):
I attempt to target the div.section_content:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, 'html.parser')
soup.findAll("div", {"class": "section_content"})
Printing the last line shows some other divs, but not the one with the pitching data.
However, I can see it's in the text, so it's not a javascript triggered loading problem (the phrase "Pitching" only comes up in that table):
>>> "Pitching" in soup.text
True
Here is an abbreviated version of one of the golang attempts:
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("www.baseball-reference.com"),
    )
    c.OnHTML("div.table_wrapper", func(e *colly.HTMLElement) {
        fmt.Println(e.ChildText("div.section_content"))
    })
    c.Visit("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml")
}
It looks to me like the HTML is actually commented out, so that's why BeautifulSoup can't find it. Either remove the comment markers from the HTML string before you parse it or use BeautifulSoup to extract the comments and parse the return value.
For example:
from bs4 import BeautifulSoup, Comment

for element in soup(text=lambda text: isinstance(text, Comment)):
    comment = element.extract()
    comment_soup = BeautifulSoup(comment, 'html.parser')
    # work with comment_soup
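Putting that together for the box score above (a minimal sketch; it re-parses every comment and collects any div.section_content found inside):

import requests
from bs4 import BeautifulSoup, Comment

html = requests.get("https://www.baseball-reference.com/boxes/ARI/ARI201803300.shtml").text
soup = BeautifulSoup(html, 'html.parser')

sections = []
for element in soup(text=lambda text: isinstance(text, Comment)):
    comment_soup = BeautifulSoup(element.extract(), 'html.parser')
    sections.extend(comment_soup.find_all("div", {"class": "section_content"}))

print(len(sections))  # now includes the commented-out divs, pitching table among them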
I need a Python script that takes an email, finds a link in it, and then uses that link to send a request to the server so the verification link inside it activates an account. How would I use Python to take the
https://www.boomlings.com/database/accounts/activate.php?uid=8722046actcode=xLCReGjLdkWmINt1GY9e
out of
{'Sender': 'Geometry Dash', 'Subject': 'Please activate your account.', 'body': b'<style type="text/css">\n#google_translate_element{\n float: right;\n padding:0 0 10px 10px;\n}\n/* twitter do\xc4\x9frulama linki fix */\n.bulletproof-btn-1 a {\n font-size: 20px!important;\n color: #fff!important;\n padding: 20px!important;\n line-height: 33px!important;\n text-decoration: none!important;\n}\n</style>\n<div id="google_translate_element"></div><script type="text/javascript">\nfunction googleTranslateElementInit() {\n new google.translate.TranslateElement({pageLanguage: \'en\', layout: google.translate.TranslateElement.InlineLayout.SIMPLE, autoDisplay: false, multilanguagePage: true}, \'google_translate_element\');\n}\n</script><script type="text/javascript" src="//translate.google.com/translate_a/element.js?cb=googleTranslateElementInit"></script>\n\r\n\r\n<html>\r\n<head>\r\n\t<title></title>\r\n</head>\r\n<body>\r\n<p>Thank you for registering a Geometry Dash account</p>\r\n\r\n<p>Your account information:<br />\r\nUsername: SUKAFUTCUCK</p>\r\n\r\n<p>Please click the link below to activate your account:<br />\r\nClick\r\nHere</p>\r\n\r\n<p>Please contact support#robtopgames.com if you have any questions or\r\nneed assistance.</p>\r\n\r\n<p>If you did not send an account request using this email, then you\r\ncan safely disregard this message and nothing will happen.</p>\r\n\r\n<p>Regards,<br />\r\nRobTop Games</p>\r\n</body>\r\n</html>\r\n\r\n\r\n'}
The link will be different in each email, so I need something that can handle this pattern:
https://www.boomlings.com/database/accounts/activate.php?uid=*actcode=*
where the * means a string of any length can go there, because each email will have a different activate.php code.
You can use regex for that with something like:
import re

c = re.search("<a href=\".*?(?=\")", yourDict["body"].decode("utf-8"))
print(c.group())  # note: the match still includes the leading '<a href="'
but it is much better if you use a package like parsel, because then you extract from the HTML with XPath instead of a regex. Check this:
EDIT
I used the regular expression because it is the shortest and fastest way and needs no extra package, but if your response changes drastically I recommend parsel. Example:
from parsel import Selector
sel = Selector(text=yourDict["body"].decode("utf-8"))
url = sel.xpath('//a[@target="_blank"]/@href').extract_first()
Assuming that dict from your description is now in a variable named d (it was just a bit long to put in here):
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(d['body'], 'lxml')
>>> link = soup.find('a', target='_blank')
>>> link['href']
'http://www.boomlings.com/database/accounts/activate.php?uid=8722046&actcode=xlCReGjLdkWmINt1GY9e'
BeautifulSoup docs
The email could be in HTML or text format.
If it's in HTML format then use libraries like bs4, pyquery etc.
If it's text, then use a regex to search for the URL, such as the following:
regex = ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Refer: http://www.ietf.org/rfc/rfc3986.txt
Use the re module to search the string (note that because the pattern contains groups, re.findall returns a tuple of groups for each match):

import re

regex = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
urls = re.findall(regex, text)
print(urls)
Or use the pyquery module:

from pyquery import PyQuery as pq

q = pq(text)
a_list = q("a")
urls = [a.get('href') for a in a_list]  # iterating yields lxml elements
print(urls)
EDIT:
Instead of the generic URL regex we can use a specific one, for example https?:\/\/www\.boomlings\.com\/database\/accounts\/activate\.php\?uid=.*&actcode=.*
https://ideone.com/NFj90L
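Applied to the email body from the question, that might look like this (a sketch; it assumes the body bytes decode as UTF-8, and the trailing wildcard is narrowed so the match stops at the end of the URL):

import re

body = d['body'].decode('utf-8')  # d is the dict holding the email
pattern = r"https?://www\.boomlings\.com/database/accounts/activate\.php\?uid=[^\s\"'<]+"
match = re.search(pattern, body)
if match:
    print(match.group())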
I am downloading HTML pages that have data defined in them in the following way:
... <script type= "text/javascript"> window.blog.data = {"activity":{"type":"read"}}; </script> ...
I would like to extract the JSON object defined in 'window.blog.data'.
Is there a simpler way than parsing it manually? (I am looking into Beautiful Soup but can't seem to find a method that will return the exact object without parsing.)
Thanks
Edit:
Would it be possible and more correct to do this with a python headless browser (e.g., Ghost.py)?
BeautifulSoup is an HTML parser; you also need a JavaScript parser here. By the way, some JavaScript object literals are not valid JSON (though in your example the literal is also a valid JSON object).
In simple cases you could:
extract the <script>'s text using an HTML parser
assume that window.blog... is on a single line, or that there is no ';' inside the object, and extract the JavaScript object literal using simple string manipulation or a regex
assume that the string is valid JSON and parse it using the json module
Example:
#!/usr/bin/env python
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
import json
import re
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile(r'window\.blog\.data'))
json_text = re.search(r'^\s*window\.blog\.data\s*=\s*({.*?})\s*;\s*$',
                      script.string, flags=re.DOTALL | re.MULTILINE).group(1)
data = json.loads(json_text)
assert data['activity']['type'] == 'read'
If the assumptions are incorrect, the code fails. To relax the second assumption, a JavaScript parser could be used instead of a regex, e.g., slimit (suggested by @approximatenumber):
from slimit import ast # $ pip install slimit
from slimit.parser import Parser as JavascriptParser
from slimit.visitors import nodevisitor
soup = BeautifulSoup(html, 'html.parser')
tree = JavascriptParser().parse(soup.script.string)
obj = next(node.right for node in nodevisitor.visit(tree)
if (isinstance(node, ast.Assign) and
node.left.to_ecma() == 'window.blog.data'))
# HACK: easy way to parse the javascript object literal
data = json.loads(obj.to_ecma()) # NOTE: json format may be slightly different
assert data['activity']['type'] == 'read'
There is no need to treat the object literal (obj) as a JSON object. To get the necessary info, obj can be visited recursively like the other AST nodes; that would allow supporting arbitrary JavaScript code (anything slimit can parse).
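A rough sketch of that recursive visit (the node attribute names properties, items, and value are assumptions about slimit's ast module, so treat this as illustrative):

def to_python(node):
    # convert a slimit AST node into the equivalent Python value
    if isinstance(node, ast.Object):  # {...}
        return {to_python(prop.left): to_python(prop.right)
                for prop in node.properties}
    if isinstance(node, ast.Array):  # [...]
        return [to_python(item) for item in node.items]
    if isinstance(node, (ast.String, ast.Identifier)):
        return node.value.strip('\'"')  # String.value keeps its quotes (assumed)
    if isinstance(node, ast.Number):
        return float(node.value)
    raise ValueError('unsupported node: %r' % node)

data = to_python(obj)
assert data['activity']['type'] == 'read'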
Something like this may work:
import re
HTML = """
<html>
<head>
...
<script type= "text/javascript">
window.blog.data = {"activity":
{"type":"read"}
};
...
</script>
</head>
<body>
...
</body>
</html>
"""
JSON = re.compile(r'window\.blog\.data\s*=\s*({.*?});', re.DOTALL)
matches = JSON.search(HTML)
print(matches.group(1))
I had a similar issue and ended up using Selenium with PhantomJS. It's a little hacky and I couldn't quite figure out the correct explicit-wait condition, but the implicit wait seems to work fine for me so far.
from selenium import webdriver
import json
import re
url = "http..."
driver = webdriver.PhantomJS(service_args=['--load-images=no'])
driver.set_window_size(1120, 550)
driver.get(url)
driver.implicitly_wait(1)
script_text = re.search(r'window\.blog\.data\s*=.*</script>', driver.page_source).group(0)
# split the text on the first equals sign, then trim the trailing script tag and semicolon
# (rstrip removes a *set* of characters, not a literal string; it works here
# only because '}' is not in that set)
json_text = script_text.split('=', 1)[1].rstrip('</script>').strip().rstrip(';').strip()
# only care about the first piece of json
json_text = json_text.split("};")[0] + "}"
data = json.loads(json_text)
driver.quit()
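For what it's worth, an explicit wait would look something like this (a sketch; the locator to wait on is a guess, since I never pinned down the right condition):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one <script> element to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'script')))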
A fast and easy way is to put exactly the start, then (.*?), then the end in the pattern: ('here put exactly the start (.*?) and the end here'). That's all!
import re
import json
html = """<!doctype html>
<title>extract javascript object as json</title>
<script>
// ..
window.blog.data = {"activity":{"type":"read"}};
// ..
</script>
<p>some other html here
"""
Then simply:
re.search('{"activity":{"type":"(.*?)"', html).group(1)
or for full json
jsondata = re.search('window.blog.data = (.*?);', html).group(1)
jsondata = json.loads(jsondata)
print(jsondata["activity"])
#output {'type': 'read'}