I am trying to parse an HTML page using Beautiful Soup. Specifically, I am looking at this very large array called "g_rgTopCurators", which can be summarized below:
g_rgTopCurators =
[{\"curator_description\":\"Awesome and sometimes overlooked indie games
curated by the orlygift.com team\",
\"last_curation_date\":1538400354,
\"discussion_url\":null,
\"rgTagLineLocalizations\":[],
\"broadcasters\":[],
\"broadcasters_info_available\":1,
\"bFollowed\":null,
\"m_rgAppRecommendations\":
[{ \"appid\":495600,
\"clanid\":9254464,
\"link_url\":\"https:\\\/\\\/www.orlygift.com\\\/games\\\/asteroid-fight\",
\"link_text\":\"\",
\"blurb\":\"Overall, we found Asteroid Fight to be a cool space game. If you want to manage a base and also handle asteroids, this is the right game for you. It\\u2019s definitely fun, unique and it has its own twist.\",
\"time_recommended\":1538400354,
\"comment_count\":0,
\"upvote_count\":0,
\"accountid_creator\":10142231,
\"recommendation_state\":0,
\"received_compensation\":0,
\"received_for_free\":1},
{other app with same params as above},
{other app},
{other app}
],
\"m_rgCreatedApps\":[],
\"m_strCreatorVanityURL\":\"\",
\"m_nCreatorPartnerID\":0,
\"clanID\":\"9254464\",
\"name\":\"Orlygift\",
\"communityLink\":\"https:\\\/\\\/steamcommunity.com\\\/groups\\\/orlygift\",
\"strAvatarHash\":\"839146c7ccac8ee3646059e3af616cb7691e1440\",
\"link\":\"https:\\\/\\\/store.steampowered.com\\\/curator\\\/9254464-Orlygift\\\/\",
\"youtube\":null,
\"facebook_page\":null,
\"twitch\":null,
\"twitter\":null,
\"total_reviews\":50,
\"total_followers\":38665,
\"total_recommended\":50,
\"total_not_recommended\":0,
\"total_informative\":0
},
{another curator},
{another curator}
];
I am trying to figure out how to properly use soup.select() to get every "name" for every curator in this large array.
soup = bs4.BeautifulSoup(data["results_html"], "html.parser")
curators = soup.select(" ??? ")
As the response is JSON containing HTML which contains a script element containing more JSON my first approach was this:
import requests
import json
from bs4 import BeautifulSoup
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
loaded_response = response.json() # Get the JSON response containing the HTML containing the required JSON.
results_html = loaded_response['results_html'] # Get the HTML from the JSON
soup = BeautifulSoup(results_html, 'html.parser')
text = soup.find_all('script')[1].text # Get the script element from the HTML.
# Get the JSON in the HTML script element
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn: # Iterate through JSON
print (i['name'])
Outputs:
Cynical Brit Gaming
PC Gamer
Just Good PC Games
...
WGN Chat
Bloody Disgusting Official
Orlygift
There is a quicker way of doing it: get the response body as bytes, decode and unescape it, then go straight to the desired JSON with string manipulation:
import requests
import json
url="https://store.steampowered.com/curators/ajaxgetcurators/render?start=0&count=50"
response = requests.get(url, headers = {"Accept": "application/json"})
text = response.content.decode("unicode_escape") # decode and unescape the response body bytes
# find the JSON
jn = json.loads(text[text.index("var g_rgTopCurators = ")+ len("var g_rgTopCurators = "):text.index("var fnCreateCapsule")].strip().rstrip(';'))
for i in jn: # Iterate through JSON
print (i['name'])
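The slice-and-strip extraction used in both snippets above can be factored into a small helper. This is only a sketch; the sample `script` string below is illustrative, standing in for the real script text:

```python
import json

def extract_js_var(text, var_name, end_marker):
    """Pull the JSON assigned to `var_name` out of a script body (sketch)."""
    start = text.index(var_name) + len(var_name)
    end = text.index(end_marker)
    return json.loads(text[start:end].strip().rstrip(';'))

# Usage mirrors the answers above:
script = 'var g_rgTopCurators = [{"name": "Orlygift"}]; var fnCreateCapsule = 0;'
data = extract_js_var(script, "var g_rgTopCurators = ", "var fnCreateCapsule")
print(data[0]["name"])  # Orlygift
```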
The website link:
https://collegedunia.com/management/human-resources-management-colleges
The code:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content
soup = BeautifulSoup(c,"html.parser")
all = soup.find_all("div",{"class":"jsx-765939686 col-4 mb-4 automate_client_img_snippet"})
l = []
for divParent in all:
item = divParent.find("div",{"class":"jsx-765939686 listing-block text-uppercase bg-white position-relative"})
d = {}
d["Name"] = item.find("div",{"class":"jsx-765939686 top-block position-relative overflow-hidden"}).find("div",{"class":"jsx-765939686 clg-name-address"}).find("h3").text
d["Rating"] = item.find("div",{"class":"jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
d["Location"] = item.find("div",{"class":"jsx-765939686 clg-head d-flex"}).find("span").find("span",{"class":"mr-1"}).text
l.append(d)
import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
The page keeps adding colleges as you scroll down. I don't know if I can get all the data, but is there a way to at least increase the number of responses I get? There are a total of 2506 entries, as can be seen on the website.
Looking at your question, we can see in the network requests that the data is fetched via an AJAX request, which uses base64-encoded params. You can follow the code below to get the data and parse it into your desired format.
Code:
import json
import pandas
import requests
import base64

collegedata = []
count = 0
while True:
    # Build the request payload and base64-encode it, as the site does.
    datadict = {"url": "management/human-resources-management-colleges",
                "stream": "13", "sub_stream_id": "607", "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    if response["hasNext"]:
        for i in response["colleges"]:
            d = {}
            d["Name"] = i["college_name"]
            d["Rating"] = i["rating"]
            d["Location"] = i["college_city"] + ", " + i["state"]
            collegedata.append(d)
            print(d)
    else:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)
Let me know if you have any questions :)
When you analyse the website via the Network tab in Chrome, you can see that it makes XHR calls in the background.
The endpoint to which it sends an HTTP GET request is as follows:
https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=
When you send a GET via the requests module, you get a JSON response back.
import requests
url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
res = requests.get(url)
print(res.json())
But you need all the data, not only page 1. The data sent in the request is base64-encoded, i.e. if you decode the data parameter of the GET request, you can see the following:
{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
Now, change the page number, sub_stream_id, stream, etc. accordingly and get the complete data from the website.
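The encode/decode round trip can be sketched like this; the payload keys are taken from the decoded example above, and `build_data_param` is a hypothetical helper name:

```python
import base64
import json

def build_data_param(page):
    """Rebuild the base64 'data' query parameter for a given page (sketch)."""
    payload = {
        "url": "management/human-resources-management-colleges",
        "stream": "13",
        "sub_stream_id": "607",
        "page": page,
    }
    return base64.urlsafe_b64encode(json.dumps(payload).encode()).decode()

# Decoding round-trips back to the original JSON payload.
decoded = json.loads(base64.urlsafe_b64decode(build_data_param(3)))
print(decoded["page"])  # 3
```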
I am trying to parse a txt file, linked below.
The txt, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file; my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure, so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is just using a simple regular expression, like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile(r"COMPANY CONFORMED NAME:[\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
The part you are looking for is inside a huge <SEC-HEADER> tag.
You can get the whole section by using soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, but it's a dirty job:
(view it in replit : https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

header = soup.find('sec-header').text

company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1:
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break

print(company_name)
There may be some library able to parse this data better than this code
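Alternatively, the whole key: value header can be read into a dict with the same colon-split idea. A minimal sketch; the sample `header` lines below are illustrative, mimicking the SEC-HEADER format:

```python
# Parse "KEY: value" lines of an SEC-HEADER block into a dict (sketch).
header = """COMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp
CENTRAL INDEX KEY:\t\t\t0001754170"""

fields = {}
for line in header.split('\n'):
    key, sep, value = line.partition(':')  # split at the first colon only
    if sep:
        fields[key.strip()] = value.strip()

print(fields['COMPANY CONFORMED NAME'])  # Monocle Acquisition Corp
```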
I have a link which represents an exact graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I simply can't understand whether it is XML or an SVG graph, and how to scrape the data. I think I need to use bs4 and requests, but I don't know the way to do that.
Could anyone help?
You will load HTML like this:
import requests
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
data = resp.text
Then you will create a BeautifulSoup object from this HTML.
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, features="html.parser")
After this, how to parse out what you want is usually quite subjective, and candidate code may vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all "rect"s and check if "onmouseover" exists in that rect.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
    if rect.has_attr("onmouseover"):
        text = rect["onmouseover"]
        x_start_index = text.index("'") + 1
        y_finish_index = text[x_start_index:].index("'") + x_start_index
        yx = text[x_start_index:y_finish_index].split()
        print(text[x_start_index:y_finish_index])
        yx_points.append(yx)
I scraped the onmouseover= part and got those 02.2015 155,1 values.
This is how yx_points looks now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
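If you then need numeric values, the string pairs can be converted, taking the comma decimal separator into account. A minimal sketch, using a few of the pairs above as sample data:

```python
# Convert [['12.2009', '100,0'], ...] pairs into (month, year, value) tuples.
yx_points = [['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7']]

parsed = []
for date_str, value_str in yx_points:
    month, year = (int(part) for part in date_str.split('.'))
    value = float(value_str.replace(',', '.'))  # comma is the decimal separator
    parsed.append((month, year, value))

print(parsed[0])  # (12, 2009, 100.0)
```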
from bs4 import BeautifulSoup
import requests
import re

# First get all the text from the url.
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text

# Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")

final = []
for each in texts:
    names = each.get('onmouseover')
    try:
        q = re.findall(r"'(.*?)'", names)
        final.append(q[0])
    except Exception as e:
        print(e)
# The details are appended to the final variable
<script type="text/javascript">
'sku': 'T3246B5',
'Name': 'TAS BLACKY',
'Price': '111930',
'categories': 'Tas,Wanita,Sling Bags,Di bawah Rp 200.000',
'brand': '',
'visibility': '4',
'instock': "1",
'stock': "73.0000"
</script>
I want to scrape the text between 'stock': " and .0000", so the desirable result is 73.
What I know how to do is something like this:
for url2 in urls2:
    req2 = Request(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
    html2 = uReq(req2).read()
    page_soup2 = soup(html2, "html.parser")

    # Grab text
    stock = page_soup2.findAll("p", {"class": "stock"})
    stocks = stock[0].text
I used something like this in my previous code; it worked before the site changed their code.
But now there is more than one ("script", {"type": "text/javascript"}) on the page I want to scrape, so I don't know how to find the right ("script", {"type": "text/javascript"}).
I also don't know how to get the specific text between the markers before and after it.
I have googled all day but can't find the solution. Please help.
I found that the strings 'stock': " and .0000" are unique on the entire page: only one 'stock': and only one .0000".
So I think they could mark the location of the text I want to scrape.
Please help, and thank you for your kindness.
I also apologize for my lack of English; I am unfamiliar with programming and am just trying to learn from Google. Thank you for your understanding.
the url = view-source:sophieparis.com/blacky-bag.html
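Since the markers 'stock': " and .0000" are unique on the page (as noted in the question), one minimal approach is plain string slicing on the raw page source. A sketch; `page_text` below is a hypothetical stand-in for the downloaded HTML (e.g. req2.text):

```python
# Hypothetical stand-in for the downloaded page source (req2.text).
page_text = '... \'instock\': "1", \'stock\': "73.0000" ...'

start_marker = '\'stock\': "'                      # unique opening marker
start = page_text.index(start_marker) + len(start_marker)
end = page_text.index('.0000"', start)            # unique closing marker
stock = page_text[start:end]
print(stock)  # 73
```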
Since you are sure 'stock' only shows up in the script tag you want, you can pull out the text of the script that contains 'stock'. Once you have that, it's a matter of trimming off the excess and changing the single quotes to double quotes to get it into valid JSON format, then simply reading that in with json.loads().
import requests
from bs4 import BeautifulSoup
import json

url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")

scripts = page_soup2.find_all('script')
for script in scripts:
    if 'stock' in script.text:
        jsonStr = script.text

jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}', 1)[0].strip() + '}'

jsonData = json.loads(jsonStr.replace("'", '"'))
print(jsonData['stock'].split('.')[0])
Output:
71
You could also do this without the loop and just grab the script that has the string stock in it using 1 line:
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
Full code would look something like:
import requests
from bs4 import BeautifulSoup
import json
import re
url2 = 'https://www.sophieparis.com/blacky-bag.html'
req2 = requests.get(url2, headers={'User-Agent': 'Chrome/39.0.2171.95'})
page_soup2 = BeautifulSoup(req2.text, "html.parser")
jsonStr = page_soup2.find('script', text=re.compile(r'stock')).text
jsonStr = jsonStr.split('productObject = ')[-1].strip()
jsonStr = jsonStr.rsplit('}',1)[0].strip() + '}'
jsonData = json.loads(jsonStr.replace("'",'"'))
print (jsonData['stock'].split('.')[0])
I would write a regex that targets the JavaScript dictionary variable that houses the values of interest. You can apply this directly to response.text, with no need for bs4.
The dictionary variable is called productObject, and you want the non-empty dictionary, which is the second occurrence of productObject = {..}, i.e. not the one preceded by var . You can use a negative lookbehind to specify this requirement.
Use hjson to handle property names enclosed in single quotes.
import requests, re, hjson
r = requests.get('https://www.sophieparis.com/blacky-bag.html')
p = re.compile(r'(?<!var\s)productObject = ([\s\S]*?})')
data = hjson.loads(p.findall(r.text)[0])
print(data)
If you want to provide me with the webpage you wish to scrape the data from, I'll see if I can fix the code to pull the information.
I'm trying to scrape some data with Python and BeautifulSoup. I know how to get the text from the script tag. The data between [ ] is valid JSON.
<script>
dataLayer =
[
{
"p":{
"t":"text1",
"lng":"text2",
"vurl":"text3"
},
"c":{ },
"u":{ },
"d":{ },
"a":{ }
}
]
</script>
I've read this response and it almost does what I want:
Extract content of <Script with BeautifulSoup
Here is my code:
import urllib.request
from bs4 import BeautifulSoup
import json
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
raw_data = soup.find("script")
I would then ideally do:
json_dict = json.loads(raw_data)
And access the data through the dictionary. But this is not working because of
"<script> dataLayer ="
preceding the valid json and the script tag at the end. I've tried trimming the raw_data as a string, like this:
raw_data[20:]
But this didn't work because the soup object is not a string.
How can I get the raw_data variable to contain ONLY the text between the block quotes [ ]?
EDIT: this seems to work. It avoids regex and solves the problem of the trailing chars as well. Thanks for your suggestions.
url = "www.example.com"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# get the script tag data and convert soup into a string
data = str(soup.find("script"))
# cut the <script> tag and some other things from the beginning and end to get valid JSON
cut = data[27:-13]
# load the data as a json dictionary
jsoned = json.loads(cut)
Use .text to get the content inside the <script> tag, then replace dataLayer =:
raw_data = soup.find("script")
raw_data = raw_data.text.replace('dataLayer =', '')
json_dict = json.loads(raw_data)
You would do that with a regex: build a pattern that only takes the text between [ and ]. Note that passing a compiled regex to soup.find_all() matches tag names, not tag contents, so run the regex on the script text instead:
>>> import re
>>> re.search(r"\[.*\]", soup.find("script").text, re.DOTALL).group(0)
See the BeautifulSoup documentation for common regex usage, and any regex reference for extracting text from between square brackets.
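A self-contained sketch of this regex approach, using the dataLayer snippet from the question as a stand-in for the fetched script text:

```python
import json
import re

# Stand-in for the <script> text pulled out with soup.find("script").text.
script_text = '''
dataLayer =
[
  {
    "p": {"t": "text1", "lng": "text2", "vurl": "text3"},
    "c": {}, "u": {}, "d": {}, "a": {}
  }
]
'''

# Grab everything between the first [ and the last ], inclusive.
match = re.search(r"\[.*\]", script_text, re.DOTALL)
data = json.loads(match.group(0))
print(data[0]["p"]["t"])  # text1
```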