I successfully put together a script to retrieve search results from Sales Navigator in LinkedIn. The following is the script, using Python, Selenium, and bs4.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
browser.get(url1)
time.sleep(15)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')
search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
time.sleep(5)
browser.quit()
Irrespective of the number of results, only 10 results were ever returned. Upon further investigation of the page source, I noticed the following:
The first 10 results are represented at one level of the DOM, and the rest sit under a div whose class is named "deferred area". Although the dt class name is the same for all the search results (result-lockup__name), because of this change in levels I am not able to retrieve the remaining results.
What would be the right way to retrieve all results in such a case?
EDIT 1
An example of how the tag levels are nested within the li:
And an example of the HTML of a result that is not being retrieved:
EDIT 2
The page source as requested
https://pastebin.com/D11YpHGQ
A lot of sites don't display all search results on page load; instead, they only render results when needed, e.g. as the visitor keeps scrolling to indicate they want to see more.
We can use JavaScript to scroll to the bottom of the page for us with window.scrollTo(0,document.body.scrollHeight) (you may want to loop this if you expect hundreds of results; see the sketch after the code below), forcing all results onto the page, after which we can grab the HTML.
The code below should do the trick.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

browser = webdriver.Firefox(executable_path=r'D:\geckodriver\geckodriver.exe')
url1 = "https://www.linkedin.com/sales/search/company?companySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
browser.get(url1)
time.sleep(15)
# scroll to the bottom so the deferred results are rendered
browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
time.sleep(15)
parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')
search_results = soup.select('dt.result-lockup__name a')
print(len(search_results))
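If you expect many pages' worth of deferred results, here is a minimal sketch of the looped-scrolling variant mentioned above; it keeps scrolling until the page height stops growing (the 5-second pause is an arbitrary guess, so tune it to the site's load time):
last_height = browser.execute_script('return document.body.scrollHeight')
while True:
    browser.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(5)  # arbitrary pause so the deferred results can load
    new_height = browser.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break  # no new content appeared, so we are at the bottom
    last_height = new_height

parsed = browser.find_element_by_tag_name('html').get_attribute('innerHTML')
soup = BeautifulSoup(parsed, 'html.parser')
print(len(soup.select('dt.result-lockup__name a')))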
Related
I'm trying to scrape information from a series of pages from like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above the classes are the same, but on one page the target is the only element with that class, while on the other it is the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None
Any thoughts?
url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)
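If the target block is not always the first element with that class (as noted in the question), a hedged variant that scans every matching block for the "See Assembly Version" label might look like this; the class name is reused from the snippet above:
import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s240"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

assembly_version = None
# scan every details block rather than relying on the target being the first match
for block in soup.find_all(class_="c-block c-bill-section c-bill--details"):
    if "See Assembly Version" in block.get_text():
        link = block.find("a")
        if link:
            assembly_version = link.text.strip()
            break

print(assembly_version)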
So I am trying to extract the text in the grand-final section (the winning team name).
https://i.stack.imgur.com/4QPqI.png
My problem is that the text I'm looking to extract isn't found by soup; it only finds up to (class="sgg2h1cC DEPRECATED_bootstrap_container undefined native-scroll dragscroll"), but as you can see here:
https://i.imgur.com/Brmv6ba.png there is more.
Here is my code. Can someone explain how I would get the info I'm looking for? I'm also pretty new to web scraping.
import requests
from bs4 import BeautifulSoup

URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id="app_feature_canvas")
a = results.find_all('div', class_="regionWrapper-APP_TOURNAMENT_PAGE-FeatureCanvas")
print()
for b in a:
    c = b.find('div', class_="page-section page-section-grey")
    print(c)
What you see in your inspector is not the same as what you get when you use requests. Instead of using the dev console, view the page source.
Those parts of the page are generated by JavaScript and thus will not appear when you request the page via requests.
URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)
print(page.text) # notice this is nothing like what you see in the inspector
To get JavaScript execution, consider using Selenium instead of requests.
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(URL)
html = driver.page_source  # DOM with JavaScript execution complete
soup = BeautifulSoup(html, 'html.parser')
# ... go from here
Alternatively, there may be enough information in the page source to get what you're looking for. Notice there's a lot of JSON in the page source with various info that, presumably, may be used by the JS to populate those elements.
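For example, here is a rough sketch of digging that JSON out of the raw source with a regular expression. The window.__SSR_DATA__ variable name below is purely hypothetical; check the actual page source for the real assignment and adjust the pattern accordingly:
import json
import re
import requests

URL = 'https://smash.gg/tournament/revolve-oceania-2v2-finale/event/revolve-oceania-2v2-finale-event/brackets/841267/1343704'
page = requests.get(URL)

# hypothetical pattern: a <script> tag assigning a JSON blob to a global variable
match = re.search(r'window\.__SSR_DATA__\s*=\s*(\{.*?\});', page.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))  # explore the structure from here
else:
    print("No embedded JSON found under that name; inspect the page source.")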
Alternatively still, you can also copy/paste from the DOM browser in your inspector. (right-click the html element and click "copy outer html")
import pyperclip
from bs4 import BeautifulSoup

html = pyperclip.paste()  # put contents of the clipboard into a variable
soup = BeautifulSoup(html, 'html.parser')
results = soup.find(id="app_feature_canvas")
a = results.find_all('div', class_="regionWrapper-APP_TOURNAMENT_PAGE-FeatureCanvas")
print()
for b in a:
    c = b.find('div', class_="page-section page-section-grey")
    print(c)
And this works :-)
I work for a group that's looking to pull automatic reports of port statuses in Texas. To do this, I'm trying to web scrape (which I'm not too familiar with) the Coast Guard Homeport site. I've managed to use Selenium to pull all the information using the XPath of the page's table of ports; however, there is one port (Victoria) on 'page 2' of the table that the script is not able to see. The XPath does not change if I tab between the pages, so I'm not sure how to locate it. Any help would be much appreciated!
Edit: The page uses JavaScript elements.
https://homeport.uscg.mil/port-directory/corpus-christi
import time
import pandas as pd
from selenium import webdriver

url = 'https://homeport.uscg.mil/port-directory/corpus-christi'
xpath = "/html/body/form/div[12]/div[2]/div[2]/div[2]/div[3]/div[1]/div[4]/div/div/div/div/div/div[1]/div/div[2]/div[1]/div/div/div[2]/div/div[2]/div/table"
portsList = ['CORPUS CHRISTI','ORANGE','BEAUMONT','VICTORIA','CALHOUN','HARLINGEN','PALACIOS','PORT ISABEL','PORT LAVACA','PORT MANSFIELD']
df = pd.DataFrame(index=portsList, columns=['status','comments','dateupdated'])
driver = webdriver.Chrome(executable_path = r"C:\Users\M3ECHJJJ\Documents\chromedriver.exe")
urlpage = url+page
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)
results = driver.find_elements_by_xpath(xpath)
ports_split = results[0].text.split('\n')
i = 0
for port in ports_split:
    if port.upper() in portsList:
        print(port)
        df.xs(port.upper())['status'],df.xs(port.upper())['comments'],df.xs(port.upper())['dateupdated'] = parsePara(ports_split[i+1])
    i = i+1
driver.quit()
First, be very careful writing automation that acts against government websites (or any website for that matter) and be sure you're allowed to do so. You may also find that many sites offer the information you're looking for in a structured format, like through an API or data download.
While selenium may provide you with a great tool for automating the browser, its capabilities for locating elements and parsing HTML may leave much to be desired. In cases like this, I might reach for using BeautifulSoup as a complementary tool to use alongside browser automation.
BeautifulSoup will support all the same locators as selenium, but also provides additional capabilities, including the ability to define your own criteria for locating elements.
For example, you can define a function to use very specific rules for locating elements (tags). The function should return True for tags that match your interests.
from bs4 import BeautifulSoup

def important_table(tag):
    """Given a particular tag, return True if it's what you're looking for"""
    return bool(
        # match <table> elements
        tag.name == 'table' and
        # check for expected text
        any(port_name in tag.text for port_name in portsList) and
        # check element attributes ("classname" is a placeholder for a real class)
        "classname" in tag.get('class', []) and
        # has at least 3 rows
        len(tag.find_all('tr')) > 3
        # and so on
    )
This is just an example, but you can write this function however you like to fit your need.
Then you could apply this like so
...
driver.get(urlpage)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
time.sleep(15)
html = driver.page_source  # get the DOM content as a string
soup = BeautifulSoup(html, 'html.parser')
table = soup.find(important_table)
for row in table.find_all('tr'):
    print(row.text)
I am trying to web scrape and am currently stuck on how I should continue with the code. I am trying to create code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop that changes the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time

all_reviews = ''

def get_description(pullman):
    url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
    # get webpage data from url
    response = requests.get(url)
    # sleep for 2 seconds
    time.sleep(2)
    # get html document from web page data
    html_doc = response.text
    # parser
    soup = BeautifulSoup(html_doc, "lxml")
    page_title = soup.title.text
    # get a tag content based on class
    p_tag = soup.find_all('p', class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
    # print the text within the tag
    return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, it's also going to work much more nicely if you visit the website once, parse it into a BeautifulSoup object, and then pass that soup object to your functions - visit once, parse as many times as you want. You also won't be blacklisted by websites as often this way. An example structure is below.
import requests
import time
from bs4 import BeautifulSoup

url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
# sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")

get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class: lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within <a href> tags, so just write a for loop to iterate over the links; a rough sketch follows below.
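A rough sketch of both steps, assuming soup is the object built in the snippet above and reusing the class names quoted here (Yelp's obfuscated class names change frequently, so treat them as placeholders and re-check them in the inspector):
# each review lives in an <li> with the template class quoted above
review_items = soup.find_all('li', class_='lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT')
for item in review_items:
    p_tag = item.find('p')
    if p_tag:
        print(p_tag.text)

# the pagination links sit inside the container class quoted above
pagination = soup.find('div', class_='lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j')
if pagination:
    for link in pagination.find_all('a', href=True):
        print(link['href'])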
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox], switch to the Network tab, then click the Next button, and you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website, by sending their headers to them, otherwise you get different data that doesn't look like comments.
You can see them in Chrome's Network tab, under the selected request's headers.
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise, but this at least gives you a fighting chance.
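Here is a self-contained sketch of that approach. The header values are made up for illustration; paste your own copied headers, and note the added .strip() calls so requests doesn't receive values with leading whitespace:
import requests

# paste your own copied request headers here (made-up values shown for illustration)
raw_headers = """user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
accept: application/json
accept-language: en-US,en;q=0.9"""

headers = dict(
    [h.partition(':')[0].strip(), h.partition(':')[2].strip()]
    for h in raw_headers.strip().split('\n')
)

url = "https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40"
response = requests.get(url, headers=headers)
print(response.status_code)

# if the headers convinced the server, the body should be the JSON review feed shown above
data = response.json()  # will raise if the server sent back HTML instead
print(data["reviews"][0]["comment"]["text"])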
ESPN Website View
I'd like to pull live auction/draft data from ESPN into a Python script that adjusts player valuations / probability of being picked. The table on the page, though, doesn't have td/tr tags; it just has a lot of divs with classes. When trying different variations of find/find_all for many of the classes I see in Chrome's inspector, I never seem to get any results back.
import requests, bs4
url = "https://fantasy.espn.com/football/draft?leagueId=93589772&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find("main", {"class": "jsx-2236042501 draftContainer"})
print (table)
These draft links only last so long, so unfortunately it won't be live for much longer.
The contents of the table are loaded with JavaScript. You must use browser automation such as Selenium to extract the DOM after JavaScript has loaded the page contents.
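A minimal Selenium sketch along those lines, reusing the selector from the question; the fixed sleep is a crude stand-in for a proper wait, and the jsx class name may well have changed since the question was asked:
import time
import bs4
from selenium import webdriver

url = "https://fantasy.espn.com/football/draft?leagueId=93589772&seasonId=2019&teamId=17&memberId={19AD42D6-8125-489D-B045-1E535CFC02E4}"

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get(url)
time.sleep(10)  # crude wait for the JavaScript-rendered table to appear

soup = bs4.BeautifulSoup(driver.page_source, 'lxml')
table = soup.find("main", {"class": "jsx-2236042501 draftContainer"})
print(table)
driver.quit()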