I'm trying to scrape some data with Python and BeautifulSoup. I know how to get the text from the script tag. The data between [ ] is valid JSON.
<script>
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ },
"u":{ },
"d":{ },
"a":{ }
}
]
</script>
I've read this response and it almost does what I want:
Extract content of <Script with BeautifulSoup
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
import json
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
raw_data = soup.find("script")
I would then ideally do:
json_dict = json.loads(raw_data)
And access the data through the dictionary. But this is not working because of
"<script> dataLayer ="
preceding the valid json and the script tag at the end. I've tried trimming the raw_data as a string, like this:
raw_data[20:]
But this didn't work because the soup object is not a string.
How can I get the raw_data variable to contain ONLY the text between the square brackets [ ]?
EDIT: this seems to work. It avoids regex and solves the problem of the trailing chars as well. Thanks for your suggestions.
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# get the script tag data and convert soup into a string
data = str(soup.find("script"))
# cut the <script> tag and some other things from the beginning and end to get valid JSON
cut = data[27:-13]
# load the data as a json dictionary
jsoned = json.loads(cut)
Use .text to get the content inside the <script> tag, then remove the leading dataLayer = :
raw_data = soup.find("script")
raw_data = raw_data.text.replace('dataLayer =', '')
json_dict = json.loads(raw_data)
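A minimal sketch of a variant that does not depend on the exact spelling of the dataLayer = prefix: slice from the first [ to the last ]. The script_text literal is an assumption standing in for soup.find("script").string, modelled on the sample in the question.

```python
import json

# Stand-in for soup.find("script").string (assumed; mirrors the question's sample).
script_text = """
dataLayer =
[
  {
    "p": {"t": "text1", "lng": "text2", "vurl": "text3"},
    "c": {}, "u": {}, "d": {}, "a": {}
  }
]
"""

# Keep everything from the first "[" to the last "]" inclusive.
start = script_text.index("[")
end = script_text.rindex("]") + 1
data_layer = json.loads(script_text[start:end])

print(data_layer[0]["p"]["t"])  # text1
```

This only holds if the script contains a single top-level array; with several arrays in one script the slice would span all of them and json.loads would fail.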
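>>> import re
>>> re.search(r"\[.*\]", soup.find("script").text, re.DOTALL).group(0)
You would do that with a regex. Passing a compiled pattern to find_all() matches tag names, not text, so the pattern has to be applied to the script's text.
You will have to write a pattern that only takes the text between [ and ].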
here a link of common regex usage within beautifulsoup
here the regex to extract from between square brackets
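A small sketch of the regex route, with one caveat worth knowing: a non-greedy pattern such as \[(.*?)\] stops at the first ], which cuts the JSON short as soon as it contains a nested array. The script_text value is invented for illustration and stands in for the text of the script tag.

```python
import json
import re

# Assumed script text with a nested array, to show the greedy/non-greedy difference.
script_text = 'dataLayer = [ {"p": {"t": "text1"}, "tags": ["a", "b"]} ]'

# Greedy match with DOTALL: from the first "[" to the last "]", across newlines.
greedy = re.search(r"\[.*\]", script_text, re.DOTALL)
payload = json.loads(greedy.group(0)) if greedy else None

# Non-greedy match: ends at the first "]" and is not valid JSON here.
non_greedy = re.search(r"\[.*?\]", script_text, re.DOTALL)

print(payload)
print(non_greedy.group(0))
```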
So far I managed to make this:
from bs4 import BeautifulSoup
import requests
def function():
    url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    script = soup.find_all('script')
    print(script[1])
output:
<script>
//<![CDATA[
var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},{"image":"/system/releases/000/036/945/2.png","name":"2"},{"image":"/system/releases/000/036/945/3.png","name":"3"},{"image":"/system/releases/000/036/945/4.png","name":"4"},{"image":"/system/releases/000/036/945/5.png","name":"5"},{"image":"/system/releases/000/036/945/6.png","name":"6"},{"image":"/system/releases/000/036/945/7.png","name":"7"},{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];
//]]>
</script>
I'm trying to extract values of "image" as strings
example: "/system/releases/000/036/945/7.png"
How can I do it?
You can use a regular expression to extract the variable "pages":
import re, json, requests
url = 'https://dynasty-scans.com/chapters/liar_satsuki_can_see_death_ch28_6#6'
r = requests.get(url)
# extract the data
match = re.search(r'var pages = (\[.*?\]);', r.text).group(1)
# parse it into json
match_json = json.loads(match)
# iterate through it to get the links
images = [img['image'] for img in match_json]
output:
['/system/releases/000/036/945/1.png',
'/system/releases/000/036/945/2.png',
'/system/releases/000/036/945/3.png',
'/system/releases/000/036/945/4.png',
'/system/releases/000/036/945/5.png',
'/system/releases/000/036/945/6.png',
'/system/releases/000/036/945/7.png',
'/system/releases/000/036/945/credits.png']
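If the regex ever proves fragile (for example when a value contains "];"), one alternative is to let the JSON decoder find the end of the array itself. This is only a sketch: the js string is a shortened copy of the script shown in the question, and json.JSONDecoder().raw_decode is the standard-library call that parses one JSON value starting at a given index and ignores whatever follows.

```python
import json

# Shortened stand-in for the script text from the question.
js = (
    'var pages = [{"image":"/system/releases/000/036/945/1.png","name":"1"},'
    '{"image":"/system/releases/000/036/945/credits.png","name":"credits"}];\n'
    '//]]>'
)

marker = "var pages = "
start = js.index(marker) + len(marker)

# Parse exactly one JSON value beginning at `start`; `end` is where it stopped.
pages, end = json.JSONDecoder().raw_decode(js, start)
images = [page["image"] for page in pages]

print(images)
```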
Url = https://letterboxd.com/film/dogville/
I want to get movie name and release year with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://letterboxd.com/film/dogville/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
soup.find_all("script")[10]
Output:
<script>
var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };
</script>
I managed to get the <script> block but I don't know how to get name and releaseYear.
How can I get them?
The problem is that bs4 is not a JavaScript parser. Once you reach its limits you need something else, ideally a JavaScript parser. A weaker solution is to use the json module from the standard library to convert the dictionary-like string into a Python dictionary.
Once you have the string containing the JS code, you can either extract the dictionary-like part with a regex or go the other way around and trim the string by hand.
Here is the other way around:
...
script_text = str(soup.find('script', string=True).string)  # or however you obtain the script text
# sample value, hard-coded here for illustration
script_text = ' var filmData = { id: 51565, name: "Dogville", gwiId: 39220, releaseYear: "2003", posterURL: "/film/dogville/image-150/", path: "/film/dogville/" };'
script_text = script_text.strip()[:-1]
# substring starting after the 1st {
script_text = script_text[script_text.find('{')+1:]
script_text=script_text.replace(':', '=')
# find index of the closing }
i_close = len(script_text) - script_text[::-1].find('}')
# wrap the key=value pairs in a dict(...) call
script_text_d = 'dict(' + script_text[:i_close-1] + ')'
# evaluate the string
script_text_d = eval(script_text_d)
print(script_text_d)
print(script_text_d['name'])
Output
{'id': 51565, 'name': 'Dogville', 'gwiId': 39220, 'releaseYear': '2003', 'posterURL': '/film/dogville/image-150/', 'path': '/film/dogville/'}
Dogville
Remarks:
I chose to build the dictionary with the built-in dict function to avoid extra work.
For json.loads, I guess you need to put the text in the {} form, but then you need to double-quote all the key-like strings.
Alternatively, use a JavaScript parser.
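The json.loads remark can be sketched as follows: quote the bare keys with a regex, then parse the result as JSON, which avoids eval entirely. This is an illustration under a stated assumption, namely that no string value contains something that looks like ", word:" or "{ word:"; the script_text value is the sample from the question.

```python
import json
import re

script_text = (
    'var filmData = { id: 51565, name: "Dogville", gwiId: 39220, '
    'releaseYear: "2003", posterURL: "/film/dogville/image-150/", '
    'path: "/film/dogville/" };'
)

# Take the object literal between the first "{" and the last "}".
obj = script_text[script_text.index("{"): script_text.rindex("}") + 1]

# Wrap bare identifiers that follow "{" or "," and precede ":" in double quotes.
# Assumption: string values never contain such a sequence themselves.
as_json = re.sub(r'([{,]\s*)([A-Za-z_]\w*)\s*:', r'\1"\2":', obj)

film = json.loads(as_json)
print(film["name"], film["releaseYear"])  # Dogville 2003
```

For anything more irregular than this (comments, trailing commas, single-quoted strings), a real JavaScript parser is the safer choice.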
I am trying to parse a txt file; an example is at the link below.
The txt, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is just using a simple Regular Expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
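A slightly more defensive sketch of the same regex idea, wrapped in a function that returns None instead of raising AttributeError when the field is missing. The sample text is an assumption: it imitates the tab-indented layout of the header and is not copied from the filing.

```python
import re

# Assumed excerpt imitating the header layout (tabs after the colon).
sample = (
    "FILER:\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\t\tOTHER FIELD:\t\tplaceholder\n"
)

NAME_RE = re.compile(
    r"^[ \t]*COMPANY CONFORMED NAME:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def company_name(text):
    """Return the company name from the header text, or None if absent."""
    match = NAME_RE.search(text)
    return match.group(1) if match else None


print(company_name(sample))
```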
The part you are looking for is inside a huge <SEC-HEADER> tag.
You can get the whole section by using soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, though it is a bit of a dirty job:
(view it on Replit: https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text
company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1 :
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break
print(company_name)
There may be some library able to parse this data better than this code
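In the absence of such a library, the manual loop can be generalised a little. The sketch below collects every "KEY: value" line into a dict and uses str.partition so that values containing a colon are not truncated. The parse_header name and the sample lines (including the EXAMPLE URL field) are invented for illustration.

```python
def parse_header(header_text):
    """Return a dict of the 'KEY: value' lines found in header_text."""
    fields = {}
    for line in header_text.splitlines():
        # partition splits on the first colon only, so URLs survive intact.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            # Keep the first occurrence if a key appears more than once.
            fields.setdefault(key.strip(), value.strip())
    return fields


# Assumed header excerpt; in the real script this would be
# soup.find('sec-header').text.
sample_header = (
    "FILER:\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\t\tEXAMPLE URL:\thttps://example.com/a\n"
)

fields = parse_header(sample_header)
print(fields["COMPANY CONFORMED NAME"])
```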
<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000", so the desired result is 73.
What I used to do is something like this:
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")
    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code. It worked before the website changed its code.
But now there is more than one ("script", {"type": "text/javascript"}) in the page I want to scrape, so I don't know how to find the right ("script", {"type": "text/javascript"}).
I also don't know how to get the specific text that sits between two known pieces of text.
I have googled it all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique in the entire page: there is only one 'stock': and only one .0000"
So I think they could mark the location of the text I want to scrape.
Please help, thank you for your kindness.
I also apologize for my lack of English, and I am actually unfamiliar with programming. I'm just trying to learn from Google, but I can't find the answer. Thank you for your understanding.
the url = view-source:sophieparis.com/blacky-bag.html
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the script text that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get valid JSON, which you can then read with json.loads().
import requests
from bs4 import BeautifulSoup
import json
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('productObject = ')[-1].strip()
        jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
        jsonData = json.loads(jsonStr.replace("'",'"'))
        print (jsonData['stock'].split('.')[0])
Output:
print (jsonData['stock'].split('.')[0])
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
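One weakness of replacing every ' with " is that it breaks as soon as a value contains an apostrophe. As a hedged alternative, ast.literal_eval from the standard library can read an object literal that mixes single and double quotes, as long as it only contains strings and numbers (JavaScript true, false or null would make it fail). The script_text below is assumed: it combines the fields from the question with the productObject = wrapper mentioned in the answer.

```python
import ast

# Assumed script text: fields from the question inside a productObject assignment.
script_text = """
productObject = {
    'sku': 'T3246B5',
    'Name': 'TAS BLACKY',
    'Price': '111930',
    'instock': "1",
    'stock': "73.0000"
};
"""

# Isolate the literal between the first "{" and the last "}".
obj = script_text[script_text.index("{"): script_text.rindex("}") + 1]

# literal_eval only evaluates literals, so it cannot run arbitrary code.
product = ast.literal_eval(obj)
stock = int(float(product["stock"]))

print(stock)  # 73
```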
I would write a regex that targets the javascript dictionary variable that houses the values of interest. You can apply this direct to response.text with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one which has 'var ' preceding it. You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
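To see what the negative lookbehind does without hitting the site, here is a small offline sketch using only the standard library. The page string is an assumption built from the description above: an empty var productObject = {} followed by the populated assignment.

```python
import re

# Assumed page excerpt: an empty declaration, then the populated object.
page = (
    "var productObject = {};\n"
    "productObject = {\n"
    "    'sku': 'T3246B5',\n"
    "    'stock': \"73.0000\"\n"
    "};\n"
)

# (?<!var\s) rejects any "productObject" that is directly preceded by "var ".
pattern = re.compile(r"(?<!var\s)productObject = ([\s\S]*?})")
objects = pattern.findall(page)

# Pull the integer part of the stock value straight out of the matched text.
stock = re.search(r"'stock':\s*\"(\d+)", objects[0]).group(1)

print(len(objects), stock)  # 1 73
```

The lazy [\s\S]*?} ends at the first closing brace, so this pattern assumes the object has no nested braces.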
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.
I am trying to parse through an html page using beautiful soup. Specifically, I am looking at this very large array called "g_rgTopCurators" that can be summarized below:
g_rgTopCurators =
[{\"curator_description\":\"Awesome and sometimes overlooked indie games
curated by the orlygift.com team\",
\"last_curation_date\":1538400354,
\"discussion_url\":null,
\"rgTagLineLocalizations\":[],
\"broadcasters\":[],
\"broadcasters_info_available\":1,
\"bFollowed\":null,
\"m_rgAppRecommendations\":
[{ \"appid\":495600,
\"clanid\":9254464,
\"link_url\":\"https:\\\/\\\/www.orlygift.com\\\/games\\\/asteroid-fight\",
\"link_text\":\"\",
\"blurb\":\"Overall, we found Asteroid Fight to be a cool space game. If you want to manage a base and also handle asteroids, this is the right game for you. It\\u2019s definitely fun, unique and it has its own twist.\",
\"time_recommended\":1538400354,
\"comment_count\":0,
\"upvote_count\":0,
\"accountid_creator\":10142231,
\"recommendation_state\":0,
\"received_compensation\":0,
\"received_for_free\":1},
{other app with same params as above},
{other app},
{other app}
],
\"m_rgCreatedApps\":[],
\"m_strCreatorVanityURL\":\"\",
\"m_nCreatorPartnerID\":0,
\"clanID\":\"9254464\",
\"name\":\"Orlygift\",
\"communityLink\":\"https:\\\/\\\/steamcommunity.com\\\/groups\\\/orlygift\",
\"strAvatarHash\":\"839146c7ccac8ee3646059e3af616cb7691e1440\",
\"link\":\"https:\\\/\\\/store.steampowered.com\\\/curator\\\/9254464-Orlygift\\\/\",
\"youtube\":null,
\"facebook_page\":null,
\"twitch\":null,
\"twitter\":null,
\"total_reviews\":50,
\"total_followers\":38665,
\"total_recommended\":50,
\"total_not_recommended\":0,
\"total_informative\":0
},
{another curator},
{another curator}
];
I am trying to figure out how to properly use soup.select() to get every \"name\" for every curator in this large array.
soup = bs4.BeautifulSoup(data["results_html"], "html.parser")
curators = soup.select(" ??? ")
As the response is JSON containing HTML, which in turn contains a script element containing more JSON, my first approach was this:
import requests
import json
from bs4 import BeautifulSoup
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
loaded_response = response.json() # Get the JSON response containing the HTML containing the required JSON.
results_html = loaded_response['results_html'] # Get the HTML from the JSON
soup = BeautifulSoup(results_html, 'html.parser')
text = soup.find_all('script')[1].text # Get the script element from the HTML.
# Get the JSON in the HTML script element
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn: # Iterate through JSON
    print (i['name'])
Outputs:
Cynical Brit Gaming
PC Gamer
Just Good PC Games
...
WGN Chat
Bloody Disgusting Official
Orlygift
There is a quicker way of doing it: get the response as bytes, decode it with unicode_escape, then go straight to the desired JSON with string manipulation:
import requests
import json
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
text = response.content.decode("unicode_escape") # response body as bytes decode and escape
# find the JSON
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn: # Iterate through JSON
    print (i['name'])
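The "JSON inside HTML inside JSON" layering can be tried offline with a synthetic payload. In this sketch the body string is built by hand to imitate the shape of the response (a results_html key holding a script) and is an assumption, not a captured response; the two curator names come from the output listed above. The outer json.loads already undoes the escaping, and raw_decode then reads the array without needing the position of the next variable.

```python
import json

# Synthetic stand-in for the HTML carried in the response (assumed shape).
inner_html = (
    '<script>var g_rgTopCurators = '
    '[{"name":"Orlygift","total_reviews":50},{"name":"PC Gamer"}];\n'
    'var fnCreateCapsule = function() {};</script>'
)
body = json.dumps({"results_html": inner_html})  # what response.text would hold

# Layer 1: outer JSON, which also unescapes the embedded HTML.
results_html = json.loads(body)["results_html"]

# Layer 2: the array literal that follows the variable name in the script.
marker = "var g_rgTopCurators = "
start = results_html.index(marker) + len(marker)
curators, _ = json.JSONDecoder().raw_decode(results_html, start)

names = [curator["name"] for curator in curators]
print(names)
```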