I followed a tutorial and successfully fetched the AJAX data from the jobs page on Apple.com.
The tutorial: http://toddhayton.com/2015/03/11/scraping-ajax-pages-with-python/
Here is the complete code, which works:
import json
import requests
from bs4 import BeautifulSoup

class AppleJobsScraper(object):
    def __init__(self):
        self.search_request = {
            "jobType":"0",
            "sortBy":"req_open_dt",
            "sortOrder":"1",
            "filters":{
                "locations":{
                    "location":[{
                        "type":"0",
                        "code":"USA"
                    }]
                }
            },
            "pageNumber":"0"
        }

    def scrape(self):
        jobs = self.scrape_jobs()
        for job in jobs:
            #print(job)
            pass

    def scrape_jobs(self, max_pages=3):
        jobs = []
        pageno = 0
        self.search_request['pageNumber'] = pageno
        while pageno < max_pages:
            payload = {
                'searchRequestJson': json.dumps(self.search_request),
                'clientOffset': '-300'
            }
            r = requests.post(
                url='https://jobs.apple.com/us/search/search-result',
                data=payload,
                headers={
                    'X-Requested-With': 'XMLHttpRequest'
                }
            )
            s = BeautifulSoup(r.text)
            if not s.requisition:
                break
            for r in s.findAll('requisition'):
                job = {}
                job['jobid'] = r.jobid.text
                job['title'] = r.postingtitle and \
                    r.postingtitle.text or r.retailpostingtitle.text
                job['location'] = r.location.text
                jobs.append(job)
            # Next page
            pageno += 1
            self.search_request['pageNumber'] = pageno
        return jobs

if __name__ == '__main__':
    scraper = AppleJobsScraper()
    scraper.scrape()
I understand almost all of the code from the tutorial, except for one small section:
for r in s.findAll('requisition'):
    job = {}
    job['jobid'] = r.jobid.text
    job['title'] = r.postingtitle and \
        r.postingtitle.text or r.retailpostingtitle.text
    job['location'] = r.location.text
    jobs.append(job)
I am curious about what s.findAll('requisition') means.
I tried printing the page text with print(s.get_text()) to see what the source of the page looks like.
But what I get does not look like website code, which confuses me even more.
So I want to know why the code can use s.findAll('requisition') to get the data.
And how does the code know it can use job['jobid'], job['title'], and job['location'] to get the data it wants?
I really appreciate your help!
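That code uses Beautiful Soup, which is an HTML parser. It provides the findAll method, which finds all the "requisition" tags in the document. The method returns a list containing all the matches; each list element is a Tag object, and as you can read in the docs, a Tag object corresponds to a tag in the original document. Accessing an attribute such as r.jobid returns the first jobid tag nested inside that Tag, and .text gives its text content. The keys 'jobid', 'title', and 'location' are just dictionary keys chosen by the author of the scraper.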
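To make this concrete, here is a minimal sketch of the same idea on a made-up response. It uses the standard library's xml.etree.ElementTree as a stand-in for Beautiful Soup so that it runs without any third-party package. The sample markup, its values, and the helper name parse_jobs are assumptions for illustration, not Apple's real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the response the scraper parses (not real data).
SAMPLE = """
<result>
  <requisition>
    <jobid>111</jobid>
    <postingtitle>Software Engineer</postingtitle>
    <location>Santa Clara Valley</location>
  </requisition>
  <requisition>
    <jobid>222</jobid>
    <retailpostingtitle>Specialist</retailpostingtitle>
    <location>Various</location>
  </requisition>
</result>
"""

def parse_jobs(xml_text):
    """Return one dict per <requisition> element."""
    root = ET.fromstring(xml_text.strip())
    jobs = []
    for req in root.findall("requisition"):  # every <requisition> child of the root
        # Fall back to the retail title when there is no regular posting title.
        title = req.findtext("postingtitle") or req.findtext("retailpostingtitle")
        jobs.append({
            "jobid": req.findtext("jobid"),
            "title": title,
            "location": req.findtext("location"),
        })
    return jobs

jobs = parse_jobs(SAMPLE)
for job in jobs:
    print(job)
```

The dictionary keys are free to choose; only the tag names (requisition, jobid, and so on) have to match what the server actually returns.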
I am interested in scraping the PDFs for any of the speakers on this page. How might I go about this? https://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker=Amy+Khor
The website has changed since I last scraped it. The code I used previously looked like this:
import requests
from bs4 import BeautifulSoup

url = 'http://www.nas.gov.sg/archivesonline/speeches/search-result?search-type=advanced&speaker='
search_term = 'Amy+Khor'
data = {
    'keywords': search_term,
    'search-type': 'basic',
    'keywords-type': 'all',
    'page-num': 1
}
soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
cnt = 1
while True:
    print()
    print('Page no. {}'.format(cnt))
    print('-' * 80)
    for a in soup.select('a[href$=".pdf"]'):
        print(a['href'])
    if soup.select_one('span.next-10'):
        data['page-num'] += 10
        cnt += 1
        soup = BeautifulSoup(requests.post(url, data=data).text, 'lxml')
    else:
        break
The code above no longer works.
To get all the PDF links from the pages, you can use the following example:
import requests
from bs4 import BeautifulSoup

url = "https://www.nas.gov.sg/archivesonline/speeches/search-result"
params = {
    "search-type": "advanced",
    "speaker": "Amy Khor",
    "page-num": "1",
}
for params["page-num"] in range(1, 3):  # <--- increase number of pages here
    soup = BeautifulSoup(
        requests.get(url, params=params).content, "html.parser"
    )
    for a in soup.select('a[href$="pdf"]'):
        print("https:" + a["href"])
    print("-" * 80)
Prints:
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MINDEF_20171123001_2.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf
https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20160229002.pdf
...and so on.
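The answer prepends "https:" to each href, which suggests the links on the page are protocol-relative (they start with "//"). That is an inference from the code, not something verified against the live page. If the site ever mixes link styles, urllib.parse.urljoin from the standard library resolves all of them against the page URL. The helper name absolute_link below is made up for this sketch.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.nas.gov.sg/archivesonline/speeches/search-result"

def absolute_link(href, base=PAGE_URL):
    """Resolve an href found on the page against the page URL."""
    return urljoin(base, href)

# Assumed protocol-relative href (starts with "//"): the scheme is taken from the base.
print(absolute_link("//www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20151126001.pdf"))

# An href that is already absolute is returned unchanged.
print(absolute_link("https://www.nas.gov.sg/archivesonline/data/pdfdoc/MSE_20160229002.pdf"))

# A host-relative href (starts with "/") keeps the scheme and host of the base.
print(absolute_link("/archivesonline/data/pdfdoc/MSE_20160229002.pdf"))
```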
Here's how I'd do it if I were to start from scratch.
Google Search is actually pretty powerful, and I feel like this query gets your pdfs:
"Amy Khor" site:https://www.nas.gov.sg/archivesonline/data/pdfdoc filetype:pdf
Then I'd use either BeautifulSoup or, even better, something like googlesearch-python to fetch the results and process them into whatever format you need.
The website link:
https://collegedunia.com/management/human-resources-management-colleges
The code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://collegedunia.com/management/human-resources-management-colleges")
c = r.content
soup = BeautifulSoup(c,"html.parser")
all = soup.find_all("div",{"class":"jsx-765939686 col-4 mb-4 automate_client_img_snippet"})
l = []
for divParent in all:
    item = divParent.find("div",{"class":"jsx-765939686 listing-block text-uppercase bg-white position-relative"})
    d = {}
    d["Name"] = item.find("div",{"class":"jsx-765939686 top-block position-relative overflow-hidden"}).find("div",{"class":"jsx-765939686 clg-name-address"}).find("h3").text
    d["Rating"] = item.find("div",{"class":"jsx-765939686 bottom-block w-100 position-relative"}).find("ul").find_all("li")[-1].find("a").find("span").text
    d["Location"] = item.find("div",{"class":"jsx-765939686 clg-head d-flex"}).find("span").find("span",{"class":"mr-1"}).text
    l.append(d)

import pandas
df = pandas.DataFrame(l)
df.to_excel("Output.xlsx")
The page keeps adding colleges as you scroll down. I don't know if I can get all the data, but is there a way to at least increase the number of results I get? There are 2506 entries in total, as shown on the website.
Looking at the network requests, you can see that the data is fetched through an AJAX request that uses base64-encoded parameters. You can use the code below to get the data and parse it into your desired format.
Code:
import base64
import json

import pandas
import requests

collegedata = []
count = 0
while True:
    datadict = {"url": "management/human-resources-management-colleges", "stream": "13", "sub_stream_id": "607",
                "page": count}
    data = base64.urlsafe_b64encode(json.dumps(datadict).encode()).decode()
    params = {
        "data": data
    }
    response = requests.get('https://collegedunia.com/web-api/listing', params=params).json()
    # Collect this page's colleges before checking hasNext,
    # so the entries on the last page are not dropped.
    for i in response.get("colleges", []):
        d = {}
        d["Name"] = i["college_name"]
        d["Rating"] = i["rating"]
        d["Location"] = i["college_city"] + ", " + i["state"]
        collegedata.append(d)
        print(d)
    if not response["hasNext"]:
        break
    count += 1

df = pandas.DataFrame(collegedata)
df.to_excel("Output.xlsx", index=False)
Let me know if you have any questions :)
When you analyse the website via the Network tab in Chrome, you can see that it makes XHR calls in the background.
The endpoint to which it sends an HTTP GET request is:
https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30=
When you send a GET request to it with the requests module, you get a JSON response back.
import requests
url = "https://collegedunia.com/web-api/listing?data=eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFtIjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
res = requests.get(url)
print(res.json())
But you need all the data, not just a single page. The data sent in the request is base64 encoded, i.e. if you decode the data parameter of the GET request, you get the following:
{"url":"management/human-resources-management-colleges","stream":"13","sub_stream_id":"607","page":3}
Now change the page number, sub_stream_id, stream, etc. accordingly to get the complete data from the website.
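As a minimal sketch of that step, the snippet below decodes the data parameter from the URL above, changes the page number, and encodes it again using only the standard library. The helper names decode_params and encode_params are made up for this example, and the choice of page 4 is arbitrary. Whether the server accepts JSON with different whitespace is not something the thread confirms, so the sketch keeps the compact form seen in the original request.

```python
import base64
import json

# The data parameter copied from the request URL shown above.
ENCODED = (
    "eyJ1cmwiOiJtYW5hZ2VtZW50L2h1bWFuLXJlc291cmNlcy1tYW5hZ2VtZW50LWNvbGxlZ2VzIiwic3RyZWFt"
    "IjoiMTMiLCJzdWJfc3RyZWFtX2lkIjoiNjA3IiwicGFnZSI6M30="
)

def decode_params(data):
    """Base64-decode the data parameter and parse the JSON inside it."""
    return json.loads(base64.urlsafe_b64decode(data))

def encode_params(params):
    """Serialize params as compact JSON (no spaces) and base64-encode it."""
    compact = json.dumps(params, separators=(",", ":"))
    return base64.urlsafe_b64encode(compact.encode()).decode()

params = decode_params(ENCODED)
print(params)

# Ask for a different page by changing one field and re-encoding.
params["page"] = 4
print(encode_params(params))
```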
When I apply filters on the website before scraping, I end up at the following URL: https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882
However, when I use that URL in my script to retrieve the href of every advertisement, I get the results from https://www.marktplaats.nl/l/auto-s/p/2, which completely ignores my two filters (the #f:10898,10882 part).
Can you please tell me what I am doing wrong?
import requests
import bs4
import pandas as pd

frames = []
for pagenumber in range (0,2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    add_url='/#f:10898,10882'
    txt = requests.get(url + str(pagenumber)+add_url)
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.findAll('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
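The part of the URL after # is a fragment. It is never sent to the server, so requests receives the unfiltered page; the filters are presumably applied by JavaScript in the browser. I would suggest that you use their API instead, which seems to be open.
If you open the link below, you will see the same listings you are searching for, with the appropriate filters applied and no need to parse HTML (use a JSON formatter, since the raw response looks like a wall of text). You can also modify the search easily by changing the query parameters of the request.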
https://www.marktplaats.nl/lrp/api/search?attributesById[]=10898&attributesById[]=10882&l1CategoryId=91&limit=30&offset=0
In code it would look something like this:
import requests

def getcars():
    url = 'https://www.marktplaats.nl/lrp/api/search'
    querystring = {
        # A list value is sent as a repeated key; writing the same key twice
        # in a dict literal would silently keep only the last value.
        'attributesById[]': [10898, 10882],
        'l1CategoryId': 91,
        'limit': 30,
        'offset': 0
    }
    headers = {
    }
    response = requests.get(url, headers=headers, params=querystring)
    x = response.json()
    return x

cars = getcars()
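Two details are easy to check offline with the standard library: the fragment is a separate part of the URL that never reaches the server, and a repeated query key has to be built from a list, because a dict cannot hold the same key twice. The sketch below uses urllib.parse only, so it makes no network request. requests encodes a list value in params as repeated keys in much the same way.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

browser_url = "https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882"
parts = urlsplit(browser_url)
print(parts.path)      # part of what the server receives
print(parts.fragment)  # stays in the browser, never sent in the HTTP request

# A dict literal with a duplicated key keeps only the last value.
duplicated = {"attributesById[]": 10898, "attributesById[]": 10882}
print(duplicated)

# Using a list (with doseq=True) produces one key=value pair per element.
filters = {
    "attributesById[]": [10898, 10882],
    "l1CategoryId": 91,
    "limit": 30,
    "offset": 0,
}
query = urlencode(filters, doseq=True)
print(query)
```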
Hi, I wrote a web scraping program. It gets the ASN numbers correctly, but after all the data is scraped it fails with an "Array Out of Bounds" error.
I am using PyCharm and the latest Python version. My code is below.
There is already a similar question on Stack Overflow (Web Scraping List Index Out Of Range) with exactly the same error, but I am not able to put the pieces together and make it work for my list.
The error seems to be at current_country = link.split('/')[2]
Any help is appreciated. Thank you.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries')):
        pages.append(link.get('href'))
    return pages

def scrape_pages(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        print(current_country)
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(current_asn)
                """
                name = columns[1].string
                routes_v4 = columns[3].string
                routes_v6 = columns[5].string
                mappings[current_asn] = {'Country': current_country,
                'Name': name,
                'Routes v4': routes_v4,
                'Routes v6': routes_v6}
                return mappings """

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = scrape_pages(country_links)
print(asn_mappings)
The last href on https://ipinfo.io/countries that contains the string "/countries" is just "/countries" itself:
<li><a href="/countries">Global ASNs</a></li>
Splitting this link produces the list ["", "countries"], which has no third element. To fix this, check the list length before reading the third element:
...
current_country = link.split('/')
if len(current_country) < 3:
    continue
current_country = current_country[2]
...
Another solution is to exclude the last href by changing the regexp to:
...
for link in page.find_all(href=re.compile('/countries/')):
...
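The behaviour is easy to reproduce without touching the network. The sketch below wraps the length check in a small helper; the name country_code and the sample links are made up for illustration.

```python
def country_code(link):
    """Return the third path piece of a link like '/countries/us', or None."""
    parts = link.split("/")
    # '/countries' splits into ['', 'countries'], so index 2 does not exist.
    if len(parts) > 2 and parts[2]:
        return parts[2]
    return None

print("/countries".split("/"))
print("/countries/us".split("/"))

for link in ["/countries/us", "/countries/de", "/countries"]:
    code = country_code(link)
    if code is None:
        continue  # skip links that do not point at a country page
    print(code)
```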
I am using Beautiful Soup to parse some JSON out of an HTML file.
Basically I am using it to get all the employee profiles out of a LinkedIn search result.
However, it does not work for companies that have more than 10 employees.
Here is my code:
import requests, json
from bs4 import BeautifulSoup

s = requests.session()

def get_csrf_tokens():
    url = "https://www.linkedin.com/"
    req = s.get(url).text
    csrf_token = req.split('name="csrfToken" value=')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]
    return csrf_token, login_csrf_token

def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()
    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }
    req = s.post(url, data=data)
    print "success"

login(USERNAME, PASSWORD)

def get_all_json(company_link):
    r=s.get(company_link)
    html= r.content
    soup=BeautifulSoup(html)
    html_file= open("html_file.html", 'w')
    html_file.write(html)
    html_file.close()
    Json_stuff=soup.find('code', id="voltron_srp_main-content")
    print Json_stuff
    return remove_tags(Json_stuff)

def remove_tags(p):
    p=str(p)
    return p[62: -10]

def list_of_employes():
    jsons=get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print jsons
    loaded_json=json.loads(jsons.replace(r'\u002d', '-'))
    employes=loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes

def get_employee_link(employes):
    profiles=[]
    for employee in employes:
        print employee['person']['link_nprofile_view_3']
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles , len(profiles)

print get_employee_link(list_of_employes())
It does not work for the link used in the code above; however, it does work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796
EDIT:
I am pretty sure that the error is in the get_all_json() function. If you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.
This is because the results are paginated. You need to iterate over all the pages listed in the JSON data at:
data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
pages is a list; for company 2409087 it is:
[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'},
{u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'},
{u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]
This is basically a list of URLs that you need to fetch to get the rest of the data.
Here's what you need to do (omitting the login code):
def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)
results = get_results(json_code)
pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)
print len(results)
It prints 25 for https://www.linkedin.com/vsearch/p?f_CC=2409087, which is exactly as many as you see in the browser.
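The same pattern can be tried offline. The sketch below is written for Python 3 and replaces the logged-in session with a made-up in-memory lookup (FAKE_PAGES and fetch_page are stand-ins, and the example.invalid URLs are placeholders). The JSON is flattened to a single "search" level to keep the example short. The pagination keys (pages, isCurrentPage, pageURL) are the ones shown in the answer.

```python
def make_page(results, pages):
    """Build a fake, simplified page payload (hypothetical structure)."""
    return {"search": {"results": results,
                       "baseData": {"resultPagination": {"pages": pages}}}}

def page_list(current):
    """Pagination entries for three pages, marking one as current."""
    return [{"isCurrentPage": n == current, "pageNum": n,
             "pageURL": "http://example.invalid/search?page_num=%d" % n}
            for n in (1, 2, 3)]

# Made-up data: 10 + 10 + 5 results spread over three pages.
FAKE_PAGES = {
    "http://example.invalid/search?page_num=1": make_page(list(range(0, 10)), page_list(1)),
    "http://example.invalid/search?page_num=2": make_page(list(range(10, 20)), page_list(2)),
    "http://example.invalid/search?page_num=3": make_page(list(range(20, 25)), page_list(3)),
}

def fetch_page(url):
    """Stand-in for downloading and parsing one result page."""
    return FAKE_PAGES[url]

def collect_results(first_page, fetch):
    """Gather results from the first page and every other listed page."""
    search = first_page["search"]
    results = list(search["results"])
    for page in search["baseData"]["resultPagination"]["pages"]:
        if page["isCurrentPage"]:
            continue  # already have the results of the page we started on
        results.extend(fetch(page["pageURL"])["search"]["results"])
    return results

first = fetch_page("http://example.invalid/search?page_num=1")
all_results = collect_results(first, fetch_page)
print(len(all_results))
```

Skipping the entry marked isCurrentPage, rather than slicing with pages[1:], avoids assuming that the current page is always first in the list.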
It turns out it was a problem with the default BeautifulSoup parser.
I changed it to html5lib as follows.
Install it from the console:
pip install html5lib
Then pass the parser name when first creating the soup object:
soup = BeautifulSoup(html, 'html5lib')
This is documented in the BeautifulSoup docs, in the section on installing a parser.