Beautiful Soup parses in some cases but not in others. Why? - python
I am using Beautiful Soup to parse some JSON out of an HTML file.
Basically I am using it to get all employee profiles out of a LinkedIn search result.
However, for some reason it does not work with companies that have more than 10 employees.
Here is my code:
import requests, json
from bs4 import BeautifulSoup

s = requests.session()

def get_csrf_tokens():
    # pull both CSRF tokens out of the LinkedIn homepage markup
    url = "https://www.linkedin.com/"
    req = s.get(url).text
    csrf_token = req.split('name="csrfToken" value="')[1].split('" id="')[0]
    login_csrf_token = req.split('name="loginCsrfParam" value="')[1].split('" id="')[0]
    return csrf_token, login_csrf_token

def login(username, password):
    url = "https://www.linkedin.com/uas/login-submit"
    csrfToken, loginCsrfParam = get_csrf_tokens()
    data = {
        'session_key': username,
        'session_password': password,
        'csrfToken': csrfToken,
        'loginCsrfParam': loginCsrfParam
    }
    req = s.post(url, data=data)
    print "success"

login(USERNAME, PASSWORD)

def get_all_json(company_link):
    # fetch the search page and pull the JSON payload out of the <code> element
    r = s.get(company_link)
    html = r.content
    soup = BeautifulSoup(html)
    html_file = open("html_file.html", 'w')
    html_file.write(html)
    html_file.close()
    Json_stuff = soup.find('code', id="voltron_srp_main-content")
    print Json_stuff
    return remove_tags(Json_stuff)

def remove_tags(p):
    # strip the surrounding <code> tag by slicing the string representation
    p = str(p)
    return p[62: -10]

def list_of_employes():
    jsons = get_all_json('https://www.linkedin.com/vsearch/p?f_CC=2409087')
    print jsons
    loaded_json = json.loads(jsons.replace(r'\u002d', '-'))
    employes = loaded_json['content']['page']['voltron_unified_search_json']['search']['results']
    return employes

def get_employee_link(employes):
    profiles = []
    for employee in employes:
        print employee['person']['link_nprofile_view_3']
        profiles.append(employee['person']['link_nprofile_view_3'])
    return profiles, len(profiles)

print get_employee_link(list_of_employes())
It will not work for the link that is in place; however, it will work for this company search: https://www.linkedin.com/vsearch/p?f_CC=3003796
EDIT:
I am pretty sure that this is an error with the get_all_json() function. If
you take a look, it does not correctly fetch the JSON for companies with more than 10 employees.
This is because the results are paginated. You need to iterate over all of the pages defined inside the JSON data at:
data['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
pages is a list, for the company 2409087 it is:
[{u'isCurrentPage': True, u'pageNum': 1, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=1'},
{u'isCurrentPage': False, u'pageNum': 2, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=2', u'page_number_i18n': u'Page 2'},
{u'isCurrentPage': False, u'pageNum': 3, u'pageURL': u'http://www.linkedin.com/vsearch/p?f_CC=2409087&page_num=3', u'page_number_i18n': u'Page 3'}]
This is basically a list of URLs you need to iterate over to fetch the data.
Here's what you need to do (omitting the code for login):
def get_results(json_code):
    return json_code['content']['page']['voltron_unified_search_json']['search']['results']

url = "https://www.linkedin.com/vsearch/p?f_CC=2409087"
soup = BeautifulSoup(s.get(url).text)
code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
json_code = json.loads(code)

results = get_results(json_code)

pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
for page in pages[1:]:
    soup = BeautifulSoup(s.get(page['pageURL']).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results += get_results(json_code)

print len(results)
It prints 25 for https://www.linkedin.com/vsearch/p?f_CC=2409087 - exactly as many results as you see in the browser.
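To tie this back to the original helpers, here is a minimal sketch (assuming the logged-in session s, the imports, get_results() from above, and the get_employee_link() function from the question are already defined; list_of_employes_all_pages is just a hypothetical name) that collects every page and feeds the results into the existing function:

def list_of_employes_all_pages(company_link):
    # first page
    soup = BeautifulSoup(s.get(company_link).text)
    code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
    json_code = json.loads(code)
    results = get_results(json_code)
    # remaining pages listed in resultPagination
    pages = json_code['content']['page']['voltron_unified_search_json']['search']['baseData']['resultPagination']['pages']
    for page in pages[1:]:
        soup = BeautifulSoup(s.get(page['pageURL']).text)
        code = soup.find('code', id="voltron_srp_main-content").contents[0].replace(r'\u002d', '-')
        results += get_results(json.loads(code))
    return results

print get_employee_link(list_of_employes_all_pages('https://www.linkedin.com/vsearch/p?f_CC=2409087'))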
Turns out it was a problem with the default BeautifulSoup parser.
I changed it to html5lib by doing this:
Install it from the console:
pip install html5lib
Then change the parser you choose when first creating the soup object:
soup = BeautifulSoup(html, 'html5lib')
This is documented in the BeautifulSoup docs here
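For reference, a minimal sketch of what the fixed fetch function from the question could look like with the parser swapped (this assumes html5lib is installed and the rest of the script is unchanged):

def get_all_json(company_link):
    r = s.get(company_link)
    # html5lib instead of the default parser; the default one did not
    # return the full embedded JSON for the larger companies
    soup = BeautifulSoup(r.content, 'html5lib')
    Json_stuff = soup.find('code', id="voltron_srp_main-content")
    return remove_tags(Json_stuff)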
Related
Visible and search URLs for webscraping
When I apply filters on the website before web scraping, it takes me to the following URL - https://www.marktplaats.nl/l/auto-s/p/2/#f:10898,10882. However, when I use it in my script to retrieve the href for each advertisement, it yields results from this URL instead - https://www.marktplaats.nl/l/auto-s/p/2, completely neglecting my two filters (namely #f:10898,10882). Can you please advise me what my problem is?

import requests
import bs4
import pandas as pd

frames = []
for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    add_url = '/#f:10898,10882'
    txt = requests.get(url + str(pagenumber) + add_url)
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')
    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')
    for car in soup_table.findAll('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        soup1 = bs4.BeautifulSoup(sub_soup.text, 'html.parser')
I would suggest that you use their API instead, which seems to be open. If you open the link below you will see all the same listings you are searching for, with the appropriate filters applied and no HTML to parse (try to find something to format the JSON, since it will look like just a bunch of text). You can also modify it easily in requests just by changing the query parameters.

https://www.marktplaats.nl/lrp/api/search?attributesById[]=10898&attributesById[]=10882&l1CategoryId=91&limit=30&offset=0

In code it would look something like this:

import requests

def getcars():
    url = 'https://www.marktplaats.nl/lrp/api/search'
    # a Python dict cannot hold the same key twice, so pass both
    # attribute ids as a list and requests will repeat the parameter
    querystring = {
        'attributesById[]': [10898, 10882],
        'l1CategoryId': 91,
        'limit': 30,
        'offset': 0
    }
    headers = {}
    response = requests.get(url, headers=headers, params=querystring)
    x = response.json()
    return x

cars = getcars()
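If more than one page of listings is needed, the same endpoint already exposes limit and offset parameters in the URL above, so paging can presumably be done by stepping the offset. A hedged sketch (the offset semantics and the key holding the listings in the JSON are assumptions - inspect an actual response to confirm):

def get_all_cars(pages=3, page_size=30):
    all_items = []
    for page in range(pages):
        querystring = {
            'attributesById[]': [10898, 10882],
            'l1CategoryId': 91,
            'limit': page_size,
            'offset': page * page_size,  # assumption: offset counts items, not pages
        }
        response = requests.get('https://www.marktplaats.nl/lrp/api/search', params=querystring)
        data = response.json()
        # 'listings' is a guess at the key that holds the results
        all_items.extend(data.get('listings', []))
    return all_items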
Problems with data retrieving using Python web scraping
I wrote some simple code for scraping data from a web page. I specify everything like the object class with its tag, but my program does not scrape the data. One more thing: there is an email address that I also want to scrape, but I do not know how to refer to its id or class. Could you please guide me on how to fix this issue? Thanks! Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
    return soup

def get_detail_data(soup):
    try:
        title = soup.find('hi', class_="page-header", id=False).text
    except:
        title = 'empty'
    print(title)
    try:
        email = soup.find('', class_="", id=False).text
    except:
        email = 'empty'
    print(email)

def main():
    url = "https://www.igrc.org/clergydetail/2747164"
    #get_page(url)
    get_detail_data(get_page(url))

if __name__ == '__main__':
    main()
As noticed, the value of the email is not in plain text. The HTML is loaded via JS in a script tag:

<script type="text/javascript">document.write(String.fromCharCode(60,97,32,104,114,101,102,61,34,35,34,32,115,116,121,108,101,61,34,117,110,105,99,111,100,101,45,98,105,100,105,58,98,105,100,105,45,111,118,101,114,114,105,100,101,59,100,105,114,101,99,116,105,111,110,58,114,116,108,59,34,32,111,110,99,108,105,99,107,61,34,116,104,105,115,46,104,114,101,102,61,83,116,114,105,110,103,46,102,114,111,109,67,104,97,114,67,111,100,101,40,49,48,57,44,57,55,44,49,48,53,44,49,48,56,44,49,49,54,44,49,49,49,44,53,56,44,49,49,52,44,49,49,49,44,57,56,44,54,52,44,49,48,57,44,49,48,49,44,49,49,54,44,49,48,52,44,49,49,49,44,49,48,48,44,49,48,53,44,49,49,53,44,49,49,54,44,52,54,44,57,57,44,57,57,41,59,34,62,38,35,57,57,59,38,35,57,57,59,38,35,52,54,59,38,35,49,49,54,59,38,35,49,49,53,59,38,35,49,48,53,59,38,35,49,48,48,59,38,35,49,49,49,59,38,35,49,48,52,59,38,35,49,49,54,59,38,35,49,48,49,59,38,35,49,48,57,59,38,35,54,52,59,38,35,57,56,59,38,35,49,49,49,59,38,35,49,49,52,59,60,47,97,62));</script>

which contains all of the character codes (ASCII codes). When decoded it gives:

cc.tsidohtem#bor

which needs to be decoded too. We just need the mailto, which is present in onclick (the content of the mailto is unchanged, whereas the text of the a tag is reversed, using direction: rtl as noticed by Hugo):

mailto:john#doe.inc

The following Python code extracts the mail:

import requests
from bs4 import BeautifulSoup
import re

r = requests.get("https://www.igrc.org/clergydetail/2747164")
soup = BeautifulSoup(r.text, 'html.parser')

titleContainer = soup.find(class_="page-header")
title = titleContainer.text.strip() if titleContainer else "empty"

emailScript = titleContainer.findNext("script").text

def parse(data):
    res = re.search(r'\(([\d+,]*)\)', data, re.IGNORECASE)
    return "".join([chr(int(i)) for i in res.group(1).split(",")])

emailData1 = parse(emailScript)
email = parse(emailData1)

print(title)
print(email.split(":")[1])

One could reproduce this encoding the other way around using the following code:

def encode(data):
    return ",".join([str(ord(i)) for i in data])

mail = "john#doe.inc"
encodedMailTo = encode("mailto:" + mail)
encodedHtmlEmail = "".join(["&#" + str(ord(i)) + ";" for i in mail])

htmlContainer = f'{encodedHtmlEmail}'
encodedHtmlContainer = encode(htmlContainer)

scriptContainer = f'<script type="text/javascript">document.write(String.fromCharCode({encodedHtmlContainer}));</script>'

print(scriptContainer)
Problem sending data through post request in Python
I am trying to input a decision start date and end date into two input boxes on the Gosport Council website by sending a POST request. Whenever I print out the text received after I send the request, it gives me the info shown on the input page, not the loaded results page.

import requests

payload = {
    "applicationDecisionStart": "1/8/2018",
    "applicationDecisionEnd": "1/10/2018",
}

with requests.Session() as session:
    r = session.get("https://publicaccess.gosport.gov.uk/online-applications/search.do?action=advanced", timeout=10, data=payload)
    print(r.text)

When I execute it, I want it to print out the HTML with the href links, for example:

<a href="/online-applications/applicationDetails.do?keyVal=PEA12JHO07E00&activeTab=summary">

But my code won't show anything like this.
The request the site actually issues is a POST, not the GET which you are doing. Ignoring empty fields, the POST looks as follows:

from bs4 import BeautifulSoup as bs
import requests

payload = {
    'caseAddressType': 'Application',
    'date(applicationDecisionStart)': '1/8/2018',
    'date(applicationDecisionEnd)': '1/10/2018',
    'searchType': 'Application'
}

with requests.Session() as s:
    r = s.post('https://publicaccess.gosport.gov.uk/online-applications/advancedSearchResults.do?action=firstPage', data=payload)
    soup = bs(r.content, 'lxml')
    info = [(item.text.strip(), item['href']) for item in soup.select('#searchresults a')]
    print(info)
    ## later pages
    #https://publicaccess.gosport.gov.uk/online-applications/pagedSearchResults.do?action=page&searchCriteria.page=2

Loop over pages:

from bs4 import BeautifulSoup as bs
import requests

payload = {
    'caseAddressType': 'Application',
    'date(applicationDecisionStart)': '1/8/2018',
    'date(applicationDecisionEnd)': '1/10/2018',
    'searchType': 'Application'
}

with requests.Session() as s:
    r = s.post('https://publicaccess.gosport.gov.uk/online-applications/advancedSearchResults.do?action=firstPage', data=payload)
    soup = bs(r.content, 'lxml')
    info = [(item.text.strip(), item['href']) for item in soup.select('#searchresults a')]
    print(info)
    pages = int(soup.select('span + a.page')[-1].text)
    for page in range(2, pages + 1):
        r = s.get('https://publicaccess.gosport.gov.uk/online-applications/pagedSearchResults.do?action=page&searchCriteria.page={}'.format(page))
        soup = bs(r.content, 'lxml')
        info = [(item.text.strip(), item['href']) for item in soup.select('#searchresults a')]
        print(info)
The URL and data are incorrect. Use Chrome to analyse the request: press F12 to open Developer Tools, switch to the "Network" tab, then submit your page and analyse the first request initiated by Chrome. What you need from it:

Headers - General - Request URL
Headers - Request Headers
Headers - Data (the form fields being posted)

You also need a package to parse the HTML, such as bs4.
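As a generic illustration of that workflow (the URL, header and form-field names below are placeholders, not the real values for this site - copy whatever you actually see in the Network tab):

import requests
from bs4 import BeautifulSoup

# values copied from DevTools -> Network -> the first request after submitting the form
url = 'https://example.com/search/results'                # Headers - General - Request URL (placeholder)
headers = {'User-Agent': 'Mozilla/5.0'}                    # Headers - Request Headers (only the ones that matter)
data = {'startDate': '1/8/2018', 'endDate': '1/10/2018'}   # Headers - Data / form fields (placeholders)

r = requests.post(url, headers=headers, data=data)
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title)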
Scraping and parsing multi-page (aspx) table
I'm trying to scrape information on greyhound races. For example, I want to scrape http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena. This page shows all results for the dog Hardwick Serena, but they are split over several pages. Inspecting the page, the 'next page' button shows: <input type="submit" name="ctl00$ctl00$mainContent$cmscontent$DogRaceCard$lvDogRaceCard$ctl00$ctl03$ctl01$ctl12" value=" " title="Next Page" class="rgPageNext">. I was hoping for an HTML link that I could use for the next iteration of the scrape, but no luck. Further inspection, by looking at the network traffic, shows that the browser sends a horribly long (hashed?) string for __VIEWSTATE, among others - likely to protect the database? I'm looking for a way to scrape all pages of one dog, either by iterating over all pages or by increasing the page length to show 100+ lines on page 1. The underlying database is .aspx. I'm using Python 3.5 and BeautifulSoup. Current code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena'

with requests.session() as s:
    s.headers['user-agent'] = 'Mozilla/5.0'
    r = s.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    target = 'ctl00$ctl00$mainContent$cmscontent$DogRaceCard$btnFilter_input'
    data = { tag['name']: tag['value'] for tag in soup.select('input[name^=ctl00]') if tag.get('value') }
    state = { tag['name']: tag['value'] for tag in soup.select('input[name^=__]') }
    data.update(state)

    numberpages = int(str(soup.find('div', 'rgWrap rgInfoPart')).split(' ')[-2].split('>')[1].split('<')[0])
    # for page in range(last_page + 1):
    for page in range(numberpages):
        data['__EVENTTARGET'] = target.format(page)
        #data['__VIEWSTATE'] = target.format(page)
        print(10)
        r = s.post(url, data=data)
        soup = BeautifulSoup(r.content, 'html5lib')

        tables = soup.findChildren('table')
        my_table = tables[9]
        rows = my_table.findChildren(['th', 'tr'])

        tabel = [[]]
        for i in range(len(rows)):
            cells = rows[i].findChildren('td')
            tabel.append([])
            for j in range(len(cells)):
                value = cells[j].string
                tabel[i].append(value)

        table = []
        for i in range(len(tabel)):
            if len(tabel[i]) == 16:
                del tabel[i][-2:]
                table.append(tabel[i])
In this case, for each page requested a POST request is issued with the form url-encoded parameters __EVENTTARGET and __VIEWSTATE:

__VIEWSTATE can easily be extracted from an input tag.
__EVENTTARGET is different for each page, and the value is passed to a javascript function (__doPostBack) in each page link's href, so you can extract it with a regex:

<a href="javascript:__doPostBack('ctl00$ctl00$mainContent$cmscontent$DogRaceCard$lvDogRaceCard$ctl00$ctl03$ctl01$ctl07','')">
    <span>2</span>
</a>

The python script:

from bs4 import BeautifulSoup
import requests
import re

# extract data from page
def extract_data(soup):
    tables = soup.find_all("div", {"class": "race-card"})[0].find_all("tbody")

    item_list = [
        (
            t[0].text.strip(),   #date
            t[1].text.strip(),   #dist
            t[2].text.strip(),   #TP
            t[3].text.strip(),   #StmHCP
            t[4].text.strip(),   #Fin
            t[5].text.strip(),   #By
            t[6].text.strip(),   #WinnerOr2nd
            t[7].text.strip(),   #Venue
            t[8].text.strip(),   #Remarks
            t[9].text.strip(),   #WinTime
            t[10].text.strip(),  #Going
            t[11].text.strip(),  #SP
            t[12].text.strip(),  #Class
            t[13].text.strip()   #CalcTm
        )
        for t in (t.find_all('td') for t in tables[1].find_all('tr'))
        if t
    ]
    print(item_list)

session = requests.Session()

url = 'http://www.gbgb.org.uk/RaceCard.aspx?dogName=Hardwick%20Serena'
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# get view state value
view_state = soup.find_all("input", {"id": "__VIEWSTATE"})[0]["value"]

# get all event target values
event_target = soup.find_all("div", {"class": "rgNumPart"})[0]
event_target_list = [
    re.search(r"__doPostBack\('(.*)',", t["href"]).group(1)
    for t in event_target.find_all('a')
]

# extract data for the 1st page
extract_data(soup)

# extract data for each page except the first
for link in event_target_list[1:]:
    print("get page {0}".format(link))
    post_data = {
        '__EVENTTARGET': link,
        '__VIEWSTATE': view_state
    }
    response = session.post(url, data=post_data)
    soup = BeautifulSoup(response.content, "html.parser")
    extract_data(soup)
Getting data from AJAX in Python
I followed the website below and successfully got the AJAX data from the jobs page of Apple.com. The teaching website: http://toddhayton.com/2015/03/11/scraping-ajax-pages-with-python/ Here is the complete code, which is correct:

import json
import requests
from bs4 import BeautifulSoup

class AppleJobsScraper(object):
    def __init__(self):
        self.search_request = {
            "jobType": "0",
            "sortBy": "req_open_dt",
            "sortOrder": "1",
            "filters": {
                "locations": {
                    "location": [{
                        "type": "0",
                        "code": "USA"
                    }]
                }
            },
            "pageNumber": "0"
        }

    def scrape(self):
        jobs = self.scrape_jobs()
        for job in jobs:
            #print(job)
            pass

    def scrape_jobs(self, max_pages=3):
        jobs = []
        pageno = 0
        self.search_request['pageNumber'] = pageno

        while pageno < max_pages:
            payload = {
                'searchRequestJson': json.dumps(self.search_request),
                'clientOffset': '-300'
            }

            r = requests.post(
                url='https://jobs.apple.com/us/search/search-result',
                data=payload,
                headers={
                    'X-Requested-With': 'XMLHttpRequest'
                }
            )

            s = BeautifulSoup(r.text)
            if not s.requisition:
                break

            for r in s.findAll('requisition'):
                job = {}
                job['jobid'] = r.jobid.text
                job['title'] = r.postingtitle and \
                    r.postingtitle.text or r.retailpostingtitle.text
                job['location'] = r.location.text
                jobs.append(job)

            # Next page
            pageno += 1
            self.search_request['pageNumber'] = pageno

        return jobs

if __name__ == '__main__':
    scraper = AppleJobsScraper()
    scraper.scrape()

I almost understand the code provided by the website, except for one little section:

for r in s.findAll('requisition'):
    job = {}
    job['jobid'] = r.jobid.text
    job['title'] = r.postingtitle and \
        r.postingtitle.text or r.retailpostingtitle.text
    job['location'] = r.location.text
    jobs.append(job)

I am curious about what s.findAll('requisition') means. Of course, I can print out the response text with print(s.get_text()) to see what the source looks like, but what I get does not look like website code, which makes me more confused. So I want to know why the code can use s.findAll('requisition') to get the data, and how the code knows it can use job['jobid'], job['title'], job['location'] to get the data it wants. I really appreciate your help!
In that code you are using Beautiful Soup, which is an HTML parser. It provides the findAll method, which finds all the "requisition" tags in the document. This method returns a list containing all the matches; each list element is a Tag object, and as you can read in the docs, a Tag object corresponds to a tag in the original document.
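To illustrate with a small, self-contained example (made-up XML, not the real Apple response): findAll returns Tag objects, and the child tags of each Tag can then be reached as attributes, which is why r.jobid.text and r.location.text work:

from bs4 import BeautifulSoup

# made-up snippet mimicking the structure of the AJAX response
xml = """
<result>
  <requisition><jobid>1</jobid><postingtitle>Engineer</postingtitle><location>USA</location></requisition>
  <requisition><jobid>2</jobid><postingtitle>Designer</postingtitle><location>USA</location></requisition>
</result>
"""

s = BeautifulSoup(xml, 'html.parser')
for r in s.findAll('requisition'):   # a list of Tag objects, one per <requisition>
    print(r.jobid.text, r.postingtitle.text, r.location.text)
# prints:
# 1 Engineer USA
# 2 Designer USA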