I am trying to write a simple scraper for Sales Navigator in LinkedIn, and this is the link I am trying to scrape. It shows account search results for a specific set of filter options.
The goal is to retrieve every company name from the search results. Upon inspecting the link elements carrying the company names (e.g. Facile.it, AGT International), I see the following markup, showing the dt class name:
<dt class="result-lockup__name">
    <a id="ember208" href="/sales/company/2429831?_ntb=zreYu57eQo%2BSZiFskdWJqg%3D%3D" class="ember-view"> Facile.it
    </a>
</dt>
I basically want to retrieve those names and open the URL in each href.
Note that all the company name links share the same dt class, result-lockup__name. The following script is my attempt to collect the list of company names displayed in the search results along with their elements.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

def scrape_accounts(url):
    url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    search_results = []
    search_results = driver.find_elements_by_class_name("result-lockup__name")
    print(search_results)

if __name__ == "__main__":
    scrape_accounts("lol")
However, the result prints an empty list. I am trying to learn how to scrape different parts of a web page and different elements, so I am not sure if I got this right. What would be the correct approach?
I'm afraid I can't get to the page that you're after, but I notice that you're importing Beautiful Soup without using it.
Try:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

def scrape_accounts(url = url):
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
    soup = BeautifulSoup(html, 'html.parser')
    search_results = soup.select('dt.result-lockup__name a')
    for link in search_results:
        print(link.text.strip(), link['href'])
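If PhantomJS is not available (it is no longer maintained), a headless Chrome session works the same way. Below is a minimal sketch, assuming the same page and the same result-lockup__name markup, and that chromedriver is on your PATH; the explicit wait is there because Sales Navigator renders results with JavaScript, so the names may not exist yet when driver.get() returns. Also keep in mind that Sales Navigator sits behind a login, so the driver session has to be authenticated before the search page will return any results.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

def scrape_accounts(url):
    options = Options()
    options.add_argument('--headless')          # run Chrome without a window
    driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        # Wait until at least one company-name link is present in the DOM
        WebDriverWait(driver, 15).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'dt.result-lockup__name a'))
        )
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for link in soup.select('dt.result-lockup__name a'):
            print(link.text.strip(), link.get('href'))
    finally:
        driver.quit()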
Hi, this is my first time attempting to web-scrape in Python using Beautiful Soup. The problem I am having is that I am trying to scrape data from a table on a website, but the table rows do not have ids. Say I was able to get the id of the element above the tr in the table; is there any way to scrape the data under that element?
This is what I am trying to scrape.
I am able to grab the id="boat" in the first tr, but I am trying to access the tr underneath it. The problem is that it has a class of "bottomline", and that class name is used in multiple tr's which all have different values; I also can't use the div with the class name "tooltip" because that name is used in multiple divs as well.
So ultimately my question is: is there a way to scrape the data in the tr that is under id="boat"?
Thanks for any help in advance!
Beautiful Soup builds a tree for you. You are not required to have any identifying information about an element in order to find it, as long as you know the structure of the tree... which you do.
In your example, you already have the <strong> element with the ID you were looking for. If you look at the HTML, you see it is a child of a <td>, which is itself a child of a <tr>. BS4 allows you to move up the tree by iterating parents of an element:
name = soup.find(id='boat')
print(name)

for parent_row in name.parents:
    if parent_row.name == 'tr':
        break
At this point the variable parent_row will be set to the <tr> containing your <strong>.
Next, you can see that the data you are looking for is in the next <tr> after parent_row, which in BS4 terminology is a sibling of parent_row. You can iterate siblings similarly:
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break
And at this point you have the row you need, and you can get the content:
content = list(sibling_row.stripped_strings)
print(content)
Putting it all together using the code in your later post:
import requests
from bs4 import BeautifulSoup

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

name = soup.find(id='boat')
print(name)

for parent_row in name.parents:
    if parent_row.name == 'tr':
        break

for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break

content = list(sibling_row.stripped_strings)
print(content)
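For what it's worth, BS4 also has find_parent() and find_next_sibling() helpers that do the same walk in a single call each, so the two loops above can be collapsed. A minimal sketch, assuming the same id="boat" anchor:

name = soup.find(id='boat')
parent_row = name.find_parent('tr')               # nearest enclosing <tr>
sibling_row = parent_row.find_next_sibling('tr')  # the following <tr> holding the data
print(list(sibling_row.stripped_strings))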
If you are scraping from a table, maybe pd.read_html() from the Pandas module can help here. I cannot reproduce your example because you have not offered any reproducible code, but you could try the following:
import requests
import pandas as pd
# Make a request and try to get the table using pandas
r = requests.get("your_url")
df = pd.read_html(r.content)[0]
If pandas is able to capture a dataframe from the response, then you should be able to access all the data in the table as if you were working with a normal dataframe. This has worked for me many times for this kind of task.
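pd.read_html() can also filter on text, which helps when a page has several tables: the match parameter keeps only tables containing a given string. A small sketch, assuming the crafting table you want actually contains the text "Boat" (a hypothetical filter string; read_html raises if nothing matches):

import requests
import pandas as pd

r = requests.get("https://www.minecraftcraftingguide.net")
# Keep only tables whose text matches "Boat"
dfs = pd.read_html(r.content, match="Boat")
print(dfs[0])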
This is what my code looks like:
from ask_sdk_core.dispatch_components import AbstractRequestHandler, AbstractExceptionHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_model.ui import SimpleCard
from ask_sdk_core import utils as ask_utils  # needed for ask_utils.is_intent_name below
import feedparser
import requests
from bs4 import BeautifulSoup
import webbrowser

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

name = soup.find(id='boat')
print(name)

class MinecraftHelperIntentHandler(AbstractRequestHandler):
    """Handler for minecraft helper intent"""
    def can_handle(self, handler_input):
        return ask_utils.is_intent_name("MinecraftHelperIntent")(handler_input)

    def handle(self, handler_input):
        slots = handler_input.request_envelope.request.intent.slots
        # interactionModel.languageModel.intents[].slots[].multipleValues.enabled
        item = slots['Item'].value
        itemStr = str(item)
        imgStart = 'https://www.minecraftcraftingguide.net/img/crafting/'
        imgMid = item
        imgEnd = '-crafting.png'
        imgLink = imgStart + imgMid + imgEnd
        print(imgLink)
        speak_output = f'To craft that you will need {item} here is a link {imgLink}'
        return (
            handler_input.response_builder
            .speak(speak_output)
            .set_card(SimpleCard('test', 'card_text'))  # might need to link account for it to work
            .response
        )
I had the same issue recently... but I was using Selenium instead of Beautiful Soup. In my case, to fix the issue I had to:
first identify a table attribute to use as a reference, then
follow the table tree on the web page that I was trying to scrape, and after that
put everything into an XPath expression like the code below:
from pyotp import *
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
get_the_value_from_td = driver.find_element_by_xpath('//table[@width="517"]/tbody/tr[8]/td[8]').text
This link was very helpful to me: https://www.guru99.com/selenium-webtable.html
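In the same spirit, you can walk every row of the table instead of hard-coding tr[8]/td[8]. A minimal sketch, assuming the same width="517" table and the older find_elements_by_xpath API used above:

# Grab all rows of the table and print each row's cell texts
rows = driver.find_elements_by_xpath('//table[@width="517"]/tbody/tr')
for row in rows:
    cells = row.find_elements_by_xpath('./td')
    print([cell.text for cell in cells])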
Beginner question... I'm attempting to scrape data from a table, but I can't seem to locate it. I've tried using both the class and the id to identify it, but my result is 0. The code and output are below.
# Import necessary packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url="https://fbref.com/en/comps/9/stats/Premier-League-Stats"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
#print(soup.prettify()) # print the parsed data of html
gdp = soup.find_all("table", attrs={"id": "stats_standard"})
print("Number of tables on site: ",len(gdp))
Output - 'Number of tables on site: 0'
I suggest you use Selenium for this kind of scraping; its performance is very reliable.
This code will work for you:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
gdp = bs.find_all('table', {'id': 'stats_standard'})
driver.quit()
print("Number of tables on site: ",len(gdp))
Output
Number of tables on site: 1
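As a follow-up, once the table has been found you can hand it straight to pandas instead of parsing cells by hand. A small sketch, assuming gdp[0] is the stats_standard table located by the Selenium code above:

import pandas as pd

# pd.read_html accepts an HTML string; str(gdp[0]) is the table markup found above
df = pd.read_html(str(gdp[0]))[0]
print(df.head())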
Can you find the table(s) without using attrs={"id": "stats_standard"}?
I have checked, and indeed I cannot find any table whose id is stats_standard in the fetched HTML (there is one with id stats_standard_sh, for example), so I suspect you might be using the wrong id.
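A quick way to check which ids are actually present in the HTML that requests receives (the browser may render extra tables via JavaScript that a plain GET will not contain) is to list them; a small sketch, reusing the soup object from the question's code:

# Print the id of every <table> in the fetched HTML to see what is really there
for t in soup.find_all("table"):
    print(t.get("id"))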
Hello, I'm getting the error AttributeError: 'str' object has no attribute 'find_element', even though my code seems to be correct. I looked around a bit on Stack Overflow but didn't find a solution to my specific problem.
I want to get the value of the fifth <p> set after the <br> tag in my HTML code; please correct me if you see any mistake.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time
options = Options()
# Creating our dictionary
all_services = pd.DataFrame(columns=['Motif', 'Description'])
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
driver.get("https://www.mairie.net/national/acte-naissance.htm#plus")
# We store our 'Motif' & 'Description' for the first link 'acte-naissance'
service = driver.find_element(By.CLASS_NAME, "section-title").text
desc = service.find_element(By.XPATH, "//*[@class='section-group']/p[5]/following::br")
print(desc)
all_services = all_services.append({'Motif': service, 'Description': desc}, ignore_index=True)
# Get all elements in class 'list-images'
list_of_services = driver.find_elements_by_class_name("list-images")
all_services.to_excel('Services.xlsx', index=False)
I was able to get the text you want using this combination:
service = driver.find_element_by_class_name("section-title").text
txt = driver.find_element_by_xpath("//*[@class='section-group']/p[5]").text
desc = txt.split("\n")[1]
I also did it with requests and bs4, which is usually a bit quicker than a headless browser (with the same scraper-navigability caveats):
import requests
from bs4 import BeautifulSoup as Bs
r = requests.get("https://www.mairie.net/national/acte-naissance.htm#plus")
html = Bs(r.text, "lxml")
section = html.find("h1", {"class": "section-title"}).get_text()
div = html.find("div", {"class": "section-group"})
p = div.find_all("p")[4]
p.strong.extract()
txt = p.get_text().split("\t")
desc = (txt[1] + txt[2]).replace("\n", " ")[:-1]
Try removing .text from
service = driver.find_element(By.CLASS_NAME, "section-title").text
With .text, service is a plain string rather than a WebElement, which is why service.find_element(...) on the next line raises AttributeError: 'str' object has no attribute 'find_element'.
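In other words, keep the element and only take .text where you need the string. A minimal sketch, reusing the XPath from the question (untested against the live page):

service_el = driver.find_element(By.CLASS_NAME, "section-title")  # WebElement
service = service_el.text                                         # string for the dataframe
desc_el = driver.find_element(By.XPATH, "//*[@class='section-group']/p[5]")
desc = desc_el.text
print(service, desc)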
I have a website to scrape and I am using Selenium to do it. When I finished writing the code, I noticed that I was not getting any output when I printed the table contents. I viewed the page source and found that the table was not in it; that is why, even when I use the XPath of the table taken from inspect element, I can't get any output. Does someone know how I could get the response/data, or just print the table from the JavaScript response? Thanks.
Here is my current code
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--incognito')
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path, options=options)
driver.implicitly_wait(3)

url = "https://reversewhois.domaintools.com/?refine#q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

# These lines select the desired search parameters from the combo box; you can
# disregard them since I was putting the whole url with params
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()

table = driver.find_elements_by_xpath('//*[@id="refine-preview-content"]/table/tbody/tr/td')
for i in table:
    print(i)  # no output
I just want to scrape all the domain names, like the one in the first result (0 _ _ .sg).
You can try the code below. After all the search options have been selected and the search button clicked, there is a short sleep as a crude wait to make sure we get the full page source. Then we use read_html from pandas, which searches for any tables present in the HTML and returns a list of dataframes; we take the required df from there.
from selenium import webdriver
import time
from selenium.webdriver.chrome.options import Options
import pandas as pd

options = Options()
options.add_argument('--incognito')
chrome_path = r"C:/Users/prakh/Documents/PythonScripts/chromedriver.exe"
driver = webdriver.Chrome(chrome_path, options=options)
driver.implicitly_wait(3)

url = "https://reversewhois.domaintools.com/?refine#q=%5B%5B%5B%22whois%22%2C%222%22%2C%22VerifiedID%40SG-Mandatory%22%5D%5D%5D"
driver.get(url)
#html = driver.page_source
#soup = BeautifulSoup(html,'lxml')

# These lines select the desired search parameters from the combo box
input = driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[3]/input')
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[1]/div').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[5]/div[1]/div/div[3]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[2]/div/div[1]').click()
driver.find_element_by_xpath('//*[@id="q0"]/div[2]/div/div[1]/div[6]/div[1]/div/div[1]').click()
input.send_keys("VerifiedID@SG-Mandatory")
driver.find_element_by_xpath('//*[@id="search-button-container"]/button').click()

time.sleep(5)
html = driver.page_source
tables = pd.read_html(html)
df = tables[-1]
print(df)
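If you only want the domain names out of that dataframe, something like the following should work (a sketch, assuming the domains sit in the first column of the last table on the page):

# First column of the result table as a plain Python list
domains = df.iloc[:, 0].tolist()
print(domains)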
If you are open to other approaches, does the following give the expected results? It mimics the XHR request the page makes (though I have trimmed it down to the essential elements only) to retrieve the lookup results. It is faster than using a browser.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://reversewhois.domaintools.com/?ajax=mReverseWhois&call=ajaxUpdateRefinePreview&q=[[[%22whois%22,%222%22,%22VerifiedID@SG-Mandatory%22]]]&sf=true', headers=headers)
table = pd.read_html(r.json()['results'])
print(table)
Source Code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from bs4 import BeautifulSoup
path = "C:\\Python27\\chromedriver\\chromedriver"
driver = webdriver.Chrome(executable_path=path)
# Open Chrome
driver.get("http://www.thehindu.com/")
# 10 Second Delay
time.sleep(10)
elem = driver.find_element_by_id("searchString")
# Enter Keyword
elem.send_keys("unilever")
elem.send_keys(Keys.RETURN)
time.sleep(10)
# Problem Here
page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
print(soup)
Above is the code.
I want to scrape data from "http://www.thehindu.com/". The script searches for the word "unilever" in the search box and redirects to the results page.
Link for Search Page
Now my question is: how can I get the source code of the searched page?
Basically I want news related to "Unilever".
You can get text inside <body>:
body = driver.find_element_by_tag_name("body")
bodyText = body.get_attribute("innerText")
Then you can find your keyword in bodyText.
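For example, a minimal check on the text you just pulled (a sketch; the exact matching you need depends on what the results page renders):

# Case-insensitive check for the search term in the visible page text
if "unilever" in bodyText.lower():
    print("Found 'unilever' on the results page")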