Get data from a <script> var with BeautifulSoup - python

Url = https://letterboxd.com/film/dogville/
I want to get movie name and release year with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://letterboxd.com/film/dogville/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
soup.find_all("script")[10]
Output:
<script>
var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };
</script>
I managed to get the <script> block, but I don't know how to get name and releaseYear.
How can I get them?

The problem is that bs4 is not a JavaScript parser. Once you reach its limits you need something else: a JavaScript parser. A weaker solution is to use the json module from the standard library to convert the dictionary-like string into a Python dictionary.
Once you have the string containing the JS code, you can either use a regex to extract the dictionary-like string or go the other way around and trim it with plain string operations.
Here is the other way around:
...
script_text = str(soup.find('script', string=True).string) # or whatever selects the right script
# here the template
script_text = ' var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'
script_text = script_text.strip()[:-1]
# substring starting after the 1st {
script_text = script_text[script_text.find('{')+1:]
script_text=script_text.replace(':', '=')
# find index of the closing }
i_close = len(script_text) - script_text[::-1].find('}')
#
script_text_d = 'dict(' + script_text[:i_close-1] + ')'
# evaluate the string
script_text_d = eval(script_text_d)
print(script_text_d)
print(script_text_d['name'])
Output
{'id': 51565, 'name': 'Dogville', 'gwiId': 39220, 'releaseYear': '2003', 'posterURL': '/film/dogville/image-150/', 'path': '/film/dogville/'}
Dogville
Remarks:
I chose to build the dictionary with the built-in dict function to avoid extra work. Note that eval executes whatever is in the string, so only use it on text you trust.
For json.loads I guess you need to put it in the form {}, but then you need to double-quote all the key-like strings.
Alternatively, use a JavaScript parser.
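The json.loads route mentioned in the remarks can be sketched as follows. This is only a minimal sketch, assuming the object literal is flat and that no string value contains something that looks like `, word:`; the regex that quotes the bare keys is my own illustration, not part of the answer above.

```python
import json
import re

# The script text from the question, used here as an offline sample.
script_text = 'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'

# Grab everything from the first "{" to the last "}".
obj = re.search(r"\{.*\}", script_text, re.DOTALL).group(0)

# Quote bare keys: an identifier that follows "{" or "," and precedes ":".
json_text = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', obj)

film = json.loads(json_text)
print(film["name"], film["releaseYear"])
```

Unlike the eval approach, this never executes the scraped text, but it is still fragile compared with a real JavaScript parser.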

Related

how to use python to parse a html that is in txt format?

I am trying to parse a txt, example as below link.
The txt, however, is in the form of html. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is just using a simple Regular Expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
The part you are looking for is inside a large <SEC-HEADER> tag.
You can get the whole section by using soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, although it is a bit of a dirty job:
(view it in replit : https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text
company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1 :
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break
print(company_name)
There may be some library able to parse this data better than this code
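The manual parsing above can be generalised to collect every "KEY: value" line of the header into a dictionary. The sketch below is an assumption-laden illustration: `parse_header` is a hypothetical helper, and the sample text is made up to resemble the header (only the COMPANY CONFORMED NAME line comes from the question; EXAMPLE FIELD is a placeholder).

```python
def parse_header(text):
    """Collect 'KEY: value' lines into a dict, skipping lines with no value."""
    fields = {}
    for line in text.splitlines():
        # partition splits at the first colon only, so colons in the value survive.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

# Hypothetical sample standing in for soup.find('sec-header').text
sample = (
    "<SEC-HEADER>\n"
    "\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\tEXAMPLE FIELD:\t\tplaceholder value\n"
    "</SEC-HEADER>\n"
)

fields = parse_header(sample)
print(fields["COMPANY CONFORMED NAME"])
```

Splitting at the first colon only is a small design choice that keeps values such as timestamps intact, which `line.split(':')[1]` would truncate.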

Scrape Text After Specific Text and Before Specific Text

<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000" so the desired result is 73.
What I used to do is something like this:
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")
    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code. It worked before the website changed its code.
But now there is more than one ("script", {"type": "text/javascript"}) in the entire page I want to scrape, so I don't know how to find the right ("script", {"type": "text/javascript"}).
I also don't know how to get the specific text between the two markers.
I have googled it all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique in the entire page: there is only one 'stock': and only one .0000"
So I think they could mark the location of the text I want to scrape.
Please help, thank you for your kindness.
I also apologize for my lack of English, and I am actually unfamiliar with programming. I'm just trying to learn from Google, but I can't find the answer. Thank you for your understanding.
the url = view-source:sophieparis.com/blacky-bag.html
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the script text that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing to double quotes to get it into a valid JSON format, and then simply reading that in using json.loads().
import requests
from bs4 import BeautifulSoup
import json
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('productObject = ')[-1].strip()
        jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
        jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
Output:
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
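If only the stock number is needed, the markers described in the question can be targeted directly with a regular expression. This is a minimal sketch run against a fragment of the script from the question rather than the live page; it assumes the value always appears as a double-quoted number after 'stock':.

```python
import re

# Fragment of the script shown in the question, used as an offline sample.
script_text = """
'visibility': '4',
'instock': "1",
'stock': "73.0000"
"""

# Capture the digits between 'stock': " and the decimal point.
# The leading quote keeps the pattern from matching 'instock'.
match = re.search(r"""'stock':\s*"(\d+)""", script_text)
stock = int(match.group(1)) if match else None
print(stock)
```

On the real page the same pattern could be applied to `req2.text`, assuming the markup still looks like the sample.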
I would write a regex that targets the javascript dictionary variable that houses the values of interest. You can apply this direct to response.text with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one which has 'var ' preceding it. You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
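To see what the negative lookbehind does without hitting the site, here is a small offline sketch. The `page_text` sample is hypothetical, modelled on the description above (an empty `var productObject = {}` followed by the populated assignment), and it swaps hjson for the standard library's `ast.literal_eval`, which happens to accept this particular literal because it is also valid Python syntax.

```python
import ast
import re

# Hypothetical page fragment modelled on the answer's description.
page_text = """
var productObject = {};
productObject = {
'sku': 'T3246B5',
'stock': "73.0000"
};
"""

# (?<!var\s) rejects the occurrence that is directly preceded by "var ".
pattern = re.compile(r"(?<!var\s)productObject = ([\s\S]*?})")
literal = pattern.findall(page_text)[0]

# literal_eval only evaluates Python literals, so it does not run arbitrary code.
data = ast.literal_eval(literal)
print(data["stock"].split(".")[0])
```

The lazy `[\s\S]*?}` stops at the first closing brace, so this only works while the object has no nested braces.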
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.

Extract data from <script> with beautifulsoup

I'm trying to scrape some data with Python and Beautifulsoup. I know how to get the text from the script tag. The data between [ ] is valid json.
<script>
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ },
"u":{ },
"d":{ },
"a":{ }
}
]
</script>
I've read this response and it almost does what I want:
Extract content of <Script with BeautifulSoup
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
import json
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
raw_data = soup.find("script")
I would then ideally do:
json_dict = json.loads(raw_data)
And access the data through the dictionary. But this is not working because of
"<script> dataLayer ="
preceding the valid json and the script tag at the end. I've tried trimming the raw_data as a string, like this:
raw_data[20:]
But this didn't work because the soup object is not a string.
How can I get the raw_data variable to contain ONLY the text between the block quotes [ ]?
EDIT: this seems to work. It avoids regex and solves the problem of the trailing chars as well. Thanks for your suggestions.
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# get the script tag data and convert soup into a string
data = str(soup.find("script"))
# cut the <script> tag and some other things from the beginning and end to get valid JSON
cut = data[27:-13]
# load the data as a json dictionary
jsoned = json.loads(cut)
Use .text to get the content inside the <script> tag, then remove the "dataLayer =" prefix:
raw_data = soup.find("script")
raw_data = raw_data.text.replace('dataLayer =', '')
json_dict = json.loads(raw_data)
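A slightly more tolerant variant is to slice between the first "[" and the last "]" instead of relying on the exact prefix or on fixed offsets. The sketch below works on a string standing in for the script's text (what `soup.find("script").text` would return in the question), so it runs without a network request.

```python
import json

# Stand-in for the text inside the <script> tag from the question.
script_text = """
dataLayer =
[
  {
    "p": {"t": "text1", "lng": "text2", "vurl": "text3"},
    "c": {}, "u": {}, "d": {}, "a": {}
  }
]
"""

# Keep only the span from the first "[" to the last "]".
start = script_text.index("[")
end = script_text.rindex("]")
data_layer = json.loads(script_text[start:end + 1])
print(data_layer[0]["p"]["t"])
```

This also copes with a trailing semicolon or extra whitespace after the closing bracket, which the fixed slice `data[27:-13]` would not.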
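>>> import re
>>> re.search(r"\[.*\]", soup.find("script").string, re.DOTALL).group(0)
You would do that with a regex. Note that a pattern passed directly to find_all() is matched against tag names, not against the script's contents, so it has to be applied to the script text.
You will have to create a regex pattern that only takes the text between [ ].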
here a link of common regex usage within beautifulsoup
here the regex to extract from between square brackets

Parsing brackets in HTML with Python

I am trying to parse some information that's in a var meta window, and I am just a little confused about how to grab just the value for the "id".
My code is below
url = input("\n\nEnter URL: ")
print(Fore.MAGENTA + "\nSetting link . . .")
def printID():
    print("")
    session = requests.session()
    response = session.get(url)
    soup = bs(response.text, 'html.parser')
    form = soup.find('script', {'id' : 'ProductJson-product-template'})
    scripts = soup.findAll('id')
    #get the id
    '''
    for scripts in form:
        data = soup.find_all()
        print data
    '''
    print(form)
printID()
And the output of this prints
<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
Again, I just want to print just the value of the ID ("463448473639").
You can retrieve all the attributes using the following syntax:
form.attrs
If you are looking for a specific one, it behaves like a dictionary:
form['id']
The full code is below:
from bs4 import BeautifulSoup
html_doc="""<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find("script").attrs)
print(soup.find("script")['id'])
However, if you want to get the value of id from the inner text {"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"},
one way to do it is as below.
import ast
innerText = soup.find("script").getText()
print(innerText)
print(ast.literal_eval(innerText.strip()).get("id"))
It looks like you are going to want to do something like:
import json
product_id = json.loads(form.get_text())['id']
I haven't tested that, but if you want to get what is in between the script tags I think that is the way you will do it. See the get_text documentation.
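The json.loads idea can be checked offline. In this minimal sketch, `script_body` is a stand-in for what `form.get_text()` would return for the tag printed in the question, including the surrounding newlines.

```python
import json

# Stand-in for form.get_text(): the JSON body of the script tag, with newlines.
script_body = '\n{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}\n'

# json.loads ignores leading and trailing whitespace.
product = json.loads(script_body)
print(product["id"])
```

Because the tag is declared as type="application/json", json.loads is the natural fit here; ast.literal_eval would fail as soon as the data contained true, false or null.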

Regex/Python: Find everything before one symbol, if it's after another symbol

If the string contains a long dash ("―"), I want to return the full string up to the first comma (",") that comes after the dash. How would I do this using Python with Regex?
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd
request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'lxml')
# for loop
s = soup.find_all("div", class_="quoteText")[0].text
s = " ".join(s.split())
s[:s.index(",")]
s
Raw Output:
“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare, City of Ashes //<![CDATA[ function submitShelfLink(unique_id, book_id, shelf_id, shelf_name, submit_form, exclusive) { var checkbox_id = \'shelf_name_\' + unique_id + \'_\' + shelf_id; var element = document.getElementById(checkbox_id) var checked = element.checked if (checked && exclusive) { // can\'t uncheck a radio by clicking it! return } if(document.getElementById("savingMessage")){ Element.show(\'savingMessage\') } var element_id = \'shelfInDropdownName_\' + unique_id + \'_\' + shelf_id; Element.upda
Desired Output:
“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare
Here's one solution:
import re
s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'
x = re.findall(r'.*―.*?(?=,)', s)
print(x)
['adflakjd, fkljlkjdf ― Cassandra Clare']
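Applied to text shaped like the question's raw output, the same idea can be wrapped in a small function that returns None when there is no dash. This is a sketch under assumptions: `quote_and_author` is a hypothetical helper, the sample string is a shortened version of the raw output, and the pattern is a variant of the one above that stops at the first comma after the first dash.

```python
import re

def quote_and_author(text):
    """Return everything up to the first comma after the first long dash, or None."""
    # .*? lazily consumes the quote (commas included) up to the dash,
    # then [^,]* takes the author name up to the next comma.
    match = re.match(r"(.*?―[^,]*)", text, re.DOTALL)
    return match.group(1).strip() if match else None

# Shortened sample modelled on the raw output in the question.
s = '“That does it," said Jace. ― Cassandra Clare, City of Ashes //<![CDATA[ junk'
print(quote_and_author(s))
```

If the author's name itself could contain a comma, this would cut it short, so it is only a starting point.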
I'm not sure I understand the question properly, but I think you mean:
import re
example_string = "part to return,example__text"
result = None
if example_string.count('__') > 0:
    try:
        # group(1) is the captured text without the trailing comma
        result = re.search(r'(.*?),', example_string).group(1)
    except AttributeError:
        result = None
print(result)
This prints 'part to return'
If you mean the part of the string between the '__' and the ',', I would use:
example_string = "lala__part to return, lala"
try:
    result = re.search(r'__(.*?),', example_string).group(1)
except AttributeError:
    result = None
print(result)
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests
request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')
# for loop
s = soup.find_all("div", class_="quoteText")[0]
text = ''
text += "".join([t.strip() for t in s.contents if type(t) == NavigableString])
for book_or_author_tag in s.find_all("a", class_ = "authorOrTitle"):
    text += "\n" + book_or_author_tag.text.strip()
print(text)
The quote you want is split across the initial quoteText div, but calling text on it returns all that CDATA junk you're trying to remove with the regex.
By looping over every child of that div and checking whether it's a navigable string type, we can extract only the actual text data you want. Then we tack on the author and book, and hopefully your regex becomes a lot simpler.
