Web Scraping in Python without "ids" in table - python

Hi, this is my first time attempting to web scrape in Python using Beautiful Soup. The problem I am having is that I am trying to scrape data from a table on a website, but the tables do not have ids. If I can get the id of the element above the tr in the table, is there any way to scrape the data under that element?
This is what I am trying to scrape.
I am able to grab the id="boat" in the first tr, but I am trying to access the tr underneath it. The problem is that it has a class of "bottomline", and the class name "bottomline" is used in multiple tr's which all have different values. I also can't access the div with the class name "tooltip", because that name is also used in multiple divs.
So ultimately my question is: is there a way to scrape the data in the tr that is under id="boat"?
Thanks for any help in advance!

Beautiful Soup builds a tree for you. You are not required to have any identifying information about an element in order to find it, as long as you know the structure of the tree... which you do.
In your example, you already have the <strong> element with the ID you were looking for. If you look at the HTML, you see it is a child of a <td>, which is itself a child of a <tr>. BS4 allows you to move up the tree by iterating parents of an element:
name = soup.find(id='boat')
print(name)
for parent_row in name.parents:
    if parent_row.name == 'tr':
        break
At this point the variable parent_row will be set to the <tr> containing your <strong>.
Next, you can see that the data you are looking for is in the next <tr> after parent_row, which in BS4 terminology is a sibling of parent_row. You can iterate siblings similarly:
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break
And at this point you have the row you need, and you can get the content:
content = list(sibling_row.stripped_strings)
print(content)
Putting it all together using the code in your later post:
import requests
from bs4 import BeautifulSoup

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

# Find the element with the known ID, then walk up to its enclosing row
name = soup.find(id='boat')
print(name)
for parent_row in name.parents:
    if parent_row.name == 'tr':
        break

# Walk forward to the next row, which holds the data
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break

content = list(sibling_row.stripped_strings)
print(content)
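As an aside, BS4 also provides find_parent and find_next_sibling helpers that collapse those two loops into single calls; a minimal sketch of the same lookup:

# Equivalent shortcut: find_parent walks up to the enclosing <tr>,
# find_next_sibling walks forward to the next <tr> at the same level
parent_row = name.find_parent('tr')
sibling_row = parent_row.find_next_sibling('tr')
content = list(sibling_row.stripped_strings)
print(content)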

If you are scraping a table, pd.read_html() from the pandas library may help here. I cannot reproduce your example because you have not offered any reproducible code, but you could try the following:
import requests
import pandas as pd
# Make a request and try to get the table using pandas
r = requests.get("your_url")
df = pd.read_html(r.content)[0]
If pandas is able to capture a dataframe from the response, then you should be able to access all the data in the table as if you were working with a normal pandas dataframe. This has worked for me many times when performing this kind of task.
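For example, once pd.read_html() succeeds, the result is a list of DataFrames and normal pandas indexing applies (the "Name" column below is a hypothetical placeholder):

# pd.read_html returns one DataFrame per <table> it finds in the HTML
tables = pd.read_html(r.content)
df = tables[0]
print(df.head())     # first few rows of the table
print(df.columns)    # inspect the parsed column names
print(df["Name"])    # access a single column ("Name" is hypothetical)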

This is what my code looks like:
from ask_sdk_core.dispatch_components import AbstractRequestHandler, AbstractExceptionHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_model.ui import SimpleCard
import ask_sdk_core.utils as ask_utils  # needed for ask_utils.is_intent_name below
import feedparser
import requests
from bs4 import BeautifulSoup
import webbrowser

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
name = soup.find(id='boat')
print(name)

class MinecraftHelperIntentHandler(AbstractRequestHandler):
    """Handler for minecraft helper intent"""

    def can_handle(self, handler_input):
        return ask_utils.is_intent_name("MinecraftHelperIntent")(handler_input)

    def handle(self, handler_input):
        slots = handler_input.request_envelope.request.intent.slots
        # interactionModel.languageModel.intents[].slots[].multipleValues.enabled
        item = slots['Item'].value
        itemStr = str(item)
        imgStart = 'https://www.minecraftcraftingguide.net/img/crafting/'
        imgMid = item
        imgEnd = '-crafting.png'
        imgLink = imgStart + imgMid + imgEnd
        print(imgLink)
        speak_output = f'To craft that you will need {item} here is a link {imgLink}'
        return (
            handler_input.response_builder
            .speak(speak_output)
            .set_card(SimpleCard('test', 'card_text'))  # might need to link account for it to work
            .response
        )

I had the same issue recently, but I was using Selenium instead of Beautiful Soup. In my case, to fix the issue I had to:
first identify a table parameter to use as a reference, then
follow the table tree on the web page I was trying to scrape, and after that
put everything in an XPath expression like the code below:
from pyotp import *
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
get_the_value_from_td = driver.find_element_by_xpath('//table[@width="517"]/tbody/tr[8]/td[8]').text
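Since WebDriverWait and expected_conditions are already imported above, here is a hedged sketch of waiting for the cell to load before reading it (reusing the same XPath):

# Wait up to 10 seconds for the cell to be present before reading it
wait = WebDriverWait(driver, 10)
cell = wait.until(EC.presence_of_element_located(
    (By.XPATH, '//table[@width="517"]/tbody/tr[8]/td[8]')))
print(cell.text)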
This link was very helpful to me: https://www.guru99.com/selenium-webtable.html

Related

Error: TypeError: must be str, not NoneType while scraping a list of links from a website using BeautifulSoup

I want to scrape this website: https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production
There are two sets of links, SI units and Oil Field units.
I have tried to scrape the list of links from SI units and created a function called get_gas_links:
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs, SoupStrainer
import re

url = "https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production"
first_page = requests.get(url)
soup = bs(first_page.content)

def pasrse_page(link):
    print(link)
    df = pd.read_html(link, skiprows=1, headers=1)
    return df

def get_gas_links():
    glinks = []
    gas_links = soup.find_all("a", href=re.compile("si.htm"))
    for i in gas_links:
        glinks.append("https://ens.dk/" + i.get("herf"))
    return glinks

get_gas_links()
The main motive is to scrape 3 tables from every link; however, before scraping the tables I am trying to scrape the list of links,
but it shows the error: TypeError: must be str, not NoneType
You are using the wrong regex, and in the wrong way. That's why soup cannot find any links that fulfill the criteria.
You can check the following source and validate the extracted_link however you want.
def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # you can validate the extracted link however you want
        glinks.append("https://ens.dk/" + extracted_link)
    return glinks
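If you still want to restrict the results to the SI-unit pages, one possible validation step is a simple regex check on the extracted href (a sketch; the si.htm pattern is taken from the question):

import re

def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # keep only links whose href points at an SI-unit page
        if re.search(r"si\.htm", extracted_link):
            glinks.append("https://ens.dk/" + extracted_link)
    return glinks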

How to clean this webscrape up

I want to scrape a table from a webpage, but there are two tables with the same tag.
The table I am interested in is "Event Timeline."
My problem is my code prints my desired table as a whole, and does not separate by column/row.
Ideally I would want this to be broken up per field.
Is there a way to clean this scrape up?
from selenium import webdriver
import pandas as pd
import time

driver = webdriver.Chrome()
val = []
driver.get('https://www.aan.com/MSA/Public/Events/Details/13419')
page_source = driver.page_source
element2 = driver.find_element_by_tag_name('tbody').text.strip()
print(element2)
Selenium's purpose is more web automation, so I will answer your question using the web scraping package BeautifulSoup instead.
This answer obtains the page's HTML using your code, but a more efficient solution would be the Requests package.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome()
val = []
# Suggest using the Requests package to obtain the HTML source code
driver.get('https://www.aan.com/MSA/Public/Events/Details/13419')
page_source = driver.page_source
# element2 = driver.find_element_by_tag_name('tbody')

# Declare a BeautifulSoup object
soup = BeautifulSoup(driver.page_source, 'html.parser')
tbody = soup.find("tbody")   # Find the first tbody
rows = tbody.find_all("tr")  # Find all the rows
for row in rows:
    rowVal = []                    # Create an array to store the values
    tds = row.find_all("td")       # Find all the cells in the row
    for td in tds:
        rowVal.append(td.get_text().strip())  # Obtain the text of the cell
    print(rowVal)                  # Print it, or do anything else
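Since pandas is already imported, a possible follow-up is to collect the rows into val instead of printing them and build a DataFrame (column names would come from the table's header row, which this sketch leaves out):

for row in rows:
    rowVal = [td.get_text().strip() for td in row.find_all("td")]
    val.append(rowVal)

# Build a DataFrame from the collected rows; pass explicit column
# names once you know the table's header
df = pd.DataFrame(val)
print(df)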

Can't identify a table when scraping

Beginner question: I'm attempting to scrape data from a table, but I can't seem to locate it. I've tried using the class and the id to identify it, but my result is 0. The code and output are below.
# Import necessary packages
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
# Site URL
url="https://fbref.com/en/comps/9/stats/Premier-League-Stats"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
#print(soup.prettify()) # print the parsed data of html
gdp = soup.find_all("table", attrs={"id": "stats_standard"})
print("Number of tables on site: ",len(gdp))
Output - 'Number of tables on site: 0'
I suggest you use Selenium for such scraping; its performance is very reliable.
This code will work for you:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
gdp = bs.find_all('table', {'id': 'stats_standard'})
driver.quit()
print("Number of tables on site: ",len(gdp))
Output
Number of tables on site: 1
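If the end goal is a DataFrame rather than a tag, the matched table can be handed straight to pandas (a sketch, assuming pandas is installed):

import pandas as pd

# pd.read_html parses the table's HTML into a list of DataFrames
df = pd.read_html(str(gdp[0]))[0]
print(df.head())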
Can you find the table(s) without using attrs={"id": "stats_standard"}?
I have checked and indeed I cannot find any table whose ID is stats_standard (but there is one with ID stats_standard_sh, for example). So I guess you might be using the wrong ID.
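One quick way to check which table IDs actually exist in the fetched HTML is to list them all, using the soup object already built in the question:

# Print the id attribute of every table the parser can see
for table in soup.find_all("table"):
    print(table.get("id"))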

Indexing multiple tables in BeautifulSoup

This is the page I want to parse: https://fbref.com/en/comps/9/gca/Premier-League-Stats
It has 2 tables. I am trying to get information from the second table, but it keeps displaying the first table every time I run this code.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://fbref.com/en/comps/9/gca/Premier-League-Stats').text
soup = BeautifulSoup(source, 'lxml')
stattable = soup.find('table', class_= 'min_width sortable stats_table min_width shade_zero')[1]
print(stattable)
min_width sortable stats_table min_width shade_zero is the class of the 'second' table.
It does not give me an error nor does it return anything. It's null.
Since the second table is dynamically generated, why not combine selenium, BeautifulSoup, and pandas to get what you want?
For example:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.headless = False
driver = webdriver.Chrome(options=options)
driver.get("https://fbref.com/en/comps/9/gca/Premier-League-Stats")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, "html.parser").find("div", {"id": "div_stats_gca"})
driver.close()
df = pd.read_html(str(soup), skiprows=[0, 1])
df = pd.concat(df)
df.to_csv("data.csv", index=False)
This spits out a .csv file that, well, looks like that table you want. :)
The HTML you see when you inspect an element is generated using JavaScript. However, the same classes are not available in the raw HTML that you get from your script.
I disabled JavaScript for this site and saw that the table is not visible.
You can try something like Selenium. There is good information in this question.
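You can confirm this yourself by counting the tables in the raw response, with no JavaScript execution involved (a minimal sketch using requests):

import requests
from bs4 import BeautifulSoup

raw = requests.get("https://fbref.com/en/comps/9/gca/Premier-League-Stats").text
soup = BeautifulSoup(raw, "lxml")
# Any table built by JavaScript will be missing from this count
print(len(soup.find_all("table")))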

Print elements using dt class name selenium python

I am trying to write a simple scraper for Sales Navigator in LinkedIn, and this is the link I am trying to scrape. It has search results for specific filter options selected for account results.
The goal I am trying to achieve is to retrieve every company name among the search results. Upon inspecting the link elements carrying the company name (e.g. Facile.it, AGT International), I see the following HTML, showing the dt class name:
<dt class="result-lockup__name">
    <a id="ember208" href="/sales/company/2429831?_ntb=zreYu57eQo%2BSZiFskdWJqg%3D%3D" class="ember-view">
        Facile.it
    </a>
</dt>
I basically want to retrieve those names and open the url represented in href.
It can be noted that all the company name links have the same dt class, result-lockup__name. The following script is an attempt to collect the list of all company names displayed in the search results, along with their elements.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

def scrape_accounts(url):
    url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    search_results = []
    search_results = driver.find_elements_by_class_name("result-lockup__name")
    print(search_results)

if __name__ == "__main__":
    scrape_accounts("lol")
However, the result prints an empty list. I am trying to learn how to scrape different parts of a web page and different elements, and thus I am not sure if I got this correct. What would be the right way?
I'm afraid I can't get to the page that you're after, but I notice that you're importing Beautiful Soup but not using it.
Try:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

url = "https://www.linkedin.com/sales/search/companycompanySize=E&geoIncluded=emea%3A0%2Ceurope%3A0&industryIncluded=6&keywords=AI&page=1&searchSessionId=zreYu57eQo%2BSZiFskdWJqg%3D%3D"

def scrape_accounts(url=url):
    driver = webdriver.PhantomJS(executable_path='C:\\phantomjs\\bin\\phantomjs.exe')
    #driver = webdriver.Firefox()
    #driver.implicitly_wait(30)
    driver.get(url)
    html = driver.find_element_by_tag_name('html').get_attribute('innerHTML')
    soup = BeautifulSoup(html, 'html.parser')
    search_results = soup.select('dt.result-lockup__name a')
    for link in search_results:
        print(link.text.strip(), link['href'])
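If you then want to open or store the absolute URLs the question asks about, one possible follow-up is to replace the print loop inside scrape_accounts with something like this (assuming the relative hrefs resolve against linkedin.com):

    base = "https://www.linkedin.com"  # assumed prefix for the relative hrefs
    for link in search_results:
        full_url = base + link['href']
        print(link.text.strip(), full_url)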
