I am trying to scrape used car listing prices and names, excluding listings posted by a dealership. I would like to put this in a DataFrame using pandas, but I can only do so once I can get the right information. Here is the code.
from bs4 import BeautifulSoup as bs4
import requests
import csv
import pandas as pd
import numpy as np
pages_to_scrape=2
pages=[]
prices=[]
names=[]
for i in range(1, pages_to_scrape + 1):
    url = 'https://www.kijiji.ca/b-cars-trucks/ottawa/used/page-{}/c174l1700185a49'.format(i)
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    for k in soup.findAll('div', class_='price'):
        if k.find(class_='dealer-logo'):
            continue
        else:
            price = k.getText()
            prices.append(price.strip())
My code up to here works as intended, since 'dealer-logo' is a child of 'price'. However, I am having trouble making this work for the names, as the 'title' class is within 'info-container', where 'price' is also found.
As such, abc = soup.find('a', {'class': 'title'}) returns only the first element on the page, when I want to iterate through every listing that does not have 'dealer-logo' in it. findAll obviously wouldn't work, as it would give every element, and findNext gives me a NoneType.
for l in soup.findAll('div', class_='info-container'):
    if l.findAll(class_='dealer-logo'):
        continue
    else:
        abc = soup.find('a', {'class': 'title'})
        name = abc.getText()
        names.append(name.strip())

print(names)
print(prices)
Below is the HTML I am scraping. I want to ignore all instances where 'dealer-logo' is present, get the price and title for each listing, and add them to a list.
With bs4 4.7.1+ you can use :not and :has to filter out the logo items. Select the parent nodes, then pull the two target child items in a comprehension, group them as tuples, and convert to a DataFrame with pandas.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
r = requests.get('https://www.kijiji.ca/b-cars-trucks/ottawa/used/c174l1700185a49')
soup = bs(r.content, 'lxml')
df = pd.DataFrame((i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
                  for i in soup.select('.info-container:not(:has(.dealer-logo))')
                  if 'wanted' not in i.select_one('a.title').text.lower())
df
N.B.
It seems that, at times, one gets slightly more results than you see on the page.
I think you can likely also filter out the wanted ads in the CSS, rather than with the if above:
df = pd.DataFrame((i.select_one('a.title').text.strip(), i.select_one('.price').text.strip())
                  for i in soup.select('div:not(div.regular-ad) > .info-container:not(:has(.dealer-logo))'))
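If you also want the asker's multi-page loop and named columns, a minimal sketch combining the two might look like this (the column names and the page count are my assumptions; the selectors are the same ones used above):

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

pages_to_scrape = 2  # assumption: same page count as in the question
rows = []
for i in range(1, pages_to_scrape + 1):
    url = 'https://www.kijiji.ca/b-cars-trucks/ottawa/used/page-{}/c174l1700185a49'.format(i)
    soup = bs(requests.get(url).content, 'lxml')
    # Same selector as above: skip any listing whose container holds a dealer logo
    for item in soup.select('.info-container:not(:has(.dealer-logo))'):
        title = item.select_one('a.title').text.strip()
        if 'wanted' in title.lower():
            continue
        rows.append((title, item.select_one('.price').text.strip()))

df = pd.DataFrame(rows, columns=['name', 'price'])  # column names are my choice
print(df)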
I want to scrape this website: https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production
There are two sets of links, SI units and Oil Field units.
I have tried to scrape the list of links from the SI units and created a function called get_gas_links
import io
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs, SoupStrainer
import re

url = "https://ens.dk/en/our-services/oil-and-gas-related-data/monthly-and-yearly-production"
first_page = requests.get(url)
soup = bs(first_page.content)

def pasrse_page(link):
    print(link)
    df = pd.read_html(link, skiprows=1, headers=1)
    return df

def get_gas_links():
    glinks = []
    gas_links = soup.find_all("a", href=re.compile("si.htm"))
    for i in gas_links:
        glinks.append("https://ens.dk/" + i.get("herf"))
    return glinks

get_gas_links()
My main motive is to scrape 3 tables from every link; however, before scraping the tables, I am trying to scrape the list of links,
but it shows an error: TypeError: must be str, not NoneType
You are using the wrong regex in the wrong way; that's why soup cannot find any links that fulfill the criteria. There is also a typo in i.get("herf"): no herf attribute exists, so get returns None, and concatenating None to a string is what raises TypeError: must be str, not NoneType.
You can check the following source and validate extracted_link however you want.
def get_gas_links():
    glinks = []
    gas_links = soup.find('table').find_all('a')
    for i in gas_links:
        extracted_link = i['href']
        # you can validate the extracted link however you want
        glinks.append("https://ens.dk/" + extracted_link)
    return glinks
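Once the links resolve correctly, a minimal sketch of the table step might look like this (it assumes pd.read_html finds the three tables of interest first on each linked page, which you would need to verify):

import pandas as pd

def parse_page(link):
    # pd.read_html returns a list of DataFrames, one per <table> on the page
    tables = pd.read_html(link)
    return tables[:3]  # assumption: the three tables of interest come first

all_tables = [parse_page(link) for link in get_gas_links()]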
Hi, this is my first time attempting to web scrape in Python using BeautifulSoup. The problem I am having is that I am trying to scrape data from a table on a website, but the tables do not have IDs. Say I was able to get the ID of the element above the tr in the table: is there any way to scrape the data under that element?
I am able to grab the id="boat" in the first tr, but I am trying to access the tr underneath it. The problem is that it has a class of "bottomline", and that class name is used in multiple tr's which all have different values. I also can't access the div with the class name "tooltip", because that name is used in multiple divs.
So ultimately my question is: is there a way to scrape the data in the tr that is under id="boat"?
Thanks for any help in advance!
Beautiful Soup builds a tree for you. You are not required to have any identifying information about an element in order to find it, as long as you know the structure of the tree... which you do.
In your example, you already have the <strong> element with the ID you were looking for. If you look at the HTML, you see it is a child of a <td>, which is itself a child of a <tr>. BS4 allows you to move up the tree by iterating parents of an element:
name = soup.find(id='boat')
print(name)

for parent_row in name.parents:
    if parent_row.name == 'tr':
        break
At this point the variable parent_row will be set to the <tr> containing your <strong>.
Next, you can see that the data you are looking for is in the next <tr> after parent, which in BS4 terminology is a sibling of parent_row. You can iterate siblings similarly:
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break
And at this point you have the row you need, and you can get the content:
content = list(sibling_row.stripped_strings)
print(content)
Putting it all together using the code in your later post:
import requests
from bs4 import BeautifulSoup

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

name = soup.find(id='boat')
print(name)

for parent_row in name.parents:
    if parent_row.name == 'tr':
        break

for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break

content = list(sibling_row.stripped_strings)
print(content)
If you are scraping from a table, maybe pd.read_html() from the Pandas module can help here. I cannot reproduce your example because you have not offered any reproducible code, but you could try the following:
import requests
import pandas as pd
# Make a request and try to get the table using pandas
r = requests.get("your_url")
df = pd.read_html(r.content)[0]
If pandas is able to capture a DataFrame from the response, then you should be able to access all the data in the table as if you were using pandas over a normal DataFrame. This has worked for me many times when performing this kind of task.
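If the page holds several tables, pd.read_html also accepts a match argument that keeps only tables containing a given string. A hedged sketch (the "Boat" text is a guess at something appearing in the target table):

import requests
import pandas as pd

r = requests.get("https://www.minecraftcraftingguide.net")
# match filters the returned tables by their text content; "Boat" is an assumption
df = pd.read_html(r.text, match="Boat")[0]
print(df.head())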
This is what my code looks like:
from ask_sdk_core.dispatch_components import AbstractRequestHandler, AbstractExceptionHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_model.ui import SimpleCard
import ask_sdk_core.utils as ask_utils  # needed for ask_utils below
import feedparser
import requests
from bs4 import BeautifulSoup
import webbrowser

URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

name = soup.find(id='boat')
print(name)

class MinecraftHelperIntentHandler(AbstractRequestHandler):
    """Handler for minecraft helper intent"""

    def can_handle(self, handler_input):
        return ask_utils.is_intent_name("MinecraftHelperIntent")(handler_input)

    def handle(self, handler_input):
        slots = handler_input.request_envelope.request.intent.slots
        # interactionModel.languageModel.intents[].slots[].multipleValues.enabled
        item = slots['Item'].value
        itemStr = item.str();
        imgStart = 'https://www.minecraftcraftingguide.net/img/crafting/'
        imgMid = item
        imgEnd = '-crafting.png'
        imgLink = imgStart + imgMid + imgEnd
        print(imgLink)
        speak_output = f'To craft that you will need {item} here is a link {imgLink}'
        return (
            handler_input.response_builder
            .speak(speak_output)
            .set_card(SimpleCard('test', 'card_text'))  # might need to link account for it to work
            .response
        )
I had the same issue recently, but I was using Selenium instead of BeautifulSoup. In my case, to fix the issue I had to:
first identify a table parameter to use as a reference, then
follow the table tree on the web page I was trying to scrape, and after that
put everything in an XPath expression like the code below:
from pyotp import *
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
get_the_value_from_td = driver.find_element_by_xpath('//table[@width="517"]/tbody/tr[8]/td[8]').text
This link was very helpful to me: https://www.guru99.com/selenium-webtable.html
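Note that find_element_by_xpath was removed in Selenium 4. A sketch of the equivalent with the current API (the XPath is copied from the snippet above and assumes the same table layout; the driver setup and URL are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # placeholder: any configured WebDriver
driver.get("https://example.com")  # placeholder URL
get_the_value_from_td = driver.find_element(
    By.XPATH, '//table[@width="517"]/tbody/tr[8]/td[8]').text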
The linked page below has two classes of the same name with data in them. I'm trying to mine the player names from these and assign the positions where they placed in the tournament. The find function in BeautifulSoup only allows me to grab the first instance of the class.
I've tried a few different ways to iterate past the first instance of the class, but nothing has worked. Having two instances of Table2__tbody seems to be the problem: how do I get past the first one and mine the data from the second?
import requests
from bs4 import BeautifulSoup

url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = requests.get(url_page)
soup = BeautifulSoup(page.text, 'html.parser')
name_list = soup.find(class_='Table2__tbody')
name_list_items = name_list.find_all('a')
name_list is only capturing the data from the first instance of Table2__tbody. What I need is only the data from the second instance.
I think that you are not quite targeting the right class. 'Table2__tbody' points only to the first table, the playoff-hole results. The class you are looking for is actually 'tl Table2__td'.
So when you run the following code (Python 3, BS4):
from bs4 import BeautifulSoup
from urllib import request

url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = request.urlopen(url_page)
soup = BeautifulSoup(page, 'html.parser')
name_list = soup.find_all(class_='tl Table2__td')
name_list_items = []
for i in name_list:
    name_list_items.append(i.get_text())
you get a list with the player's position at the even indexes and the name at the odd indexes. Some simple data manipulation can arrange that however you need, as in the sketch below.
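For example, a minimal sketch of that manipulation, pairing each even-index position with the odd-index name that follows it:

# Positions sit at even indexes, names at odd indexes
placements = list(zip(name_list_items[::2], name_list_items[1::2]))
for position, player in placements:
    print(position, player)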
One way to select the proper table is using CSS selectors.
table:has(a.leaderboard_player_name) will select the <table> that contains an <a> with class leaderboard_player_name, which is our player list:
import requests
from bs4 import BeautifulSoup
url_page = "https://www.espn.com/golf/leaderboard/_/tournamentId/401056502"
page = requests.get(url_page)
soup = BeautifulSoup(page.text, 'html.parser')
table_with_namelist = soup.select_one('table:has(a.leaderboard_player_name)')
for a in table_with_namelist.select('.leaderboard_player_name'):
    print(a.text)
Prints:
Xander Schauffele
Tony Finau
Justin Rose
Andrew Putnam
Kiradech Aphibarnrat
Keegan Bradley
...etc.
I'm currently working on a web scraper that will allow me to pull stats from a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. This is an example of that:
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself, here is the link:
https://www.pro-football-reference.com/players/B/BradTo00.htm
I've tried a few different methods; either it won't work at all, or I am able to start a for loop and begin putting things into arrays, but you will notice that not everything in the table is the same var type.
Sorry for the formatting and the grammar.
Here is what I have so far. I'm sure it's not the best-looking code; it's mainly code I've tried on my own with a few things mixed in from searching on Google. Ignore the random imports; I was trying different things.
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
# request html
page = requests.get(url)
# Parse html using BeautifulSoup, you can use a different parser like lxml if present
soup = BeautifulSoup(page.content, 'lxml')
# find searches the given tag (div) with given class attribute and returns the first match it finds
headers = [c.get_text() for c in soup.find(class_='table_container').find_all('td')[0:31]]

data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]

tags = soup.find(data='pos')
# stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get attributes by name.
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get table
table = soup.find(class_="table_outer_container")

# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get cell value
    print(thh.get_text())
    # Get data-stat value
    print(thh.get('data-stat'))

# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get id
    print(trb.get('id'))
    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get cell value
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))

# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get cell value
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get cell value
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))
You can of course save the data to a CSV or even JSON instead of printing it.
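For instance, a minimal sketch that collects the body rows into a pandas DataFrame keyed by data-stat and writes a CSV, reusing the tbody from the snippet above (the output filename is arbitrary):

import pandas as pd

rows = []
for trb in tbody.find_all('tr'):
    # One dict per row, keyed by each cell's data-stat attribute
    rows.append({cell.get('data-stat'): cell.get_text()
                 for cell in trb.find_all(['th', 'td'])})

df = pd.DataFrame(rows)
df.to_csv('stats.csv', index=False)  # arbitrary filename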
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")

# Extract all tables
tables = soup.find_all('table')

# Extract the text from each cell in each table
for table in tables:
    cells = table.find_all('td')
    for c in cells:
        print(c.text)
Hope this helps!
I'm trying to extract information from a single web page that contains multiple similarly structured records. The information is contained within div tags with different classes (I'm interested in username, main text and date). Here is the code I use:
import bs4 as bs
import urllib
import pandas as pd

href = 'https://example.ru/'
sause = urllib.urlopen(href).read()
soup = bs.BeautifulSoup(sause, 'lxml')

user = pd.Series(soup.find_all('div', class_='Username'))
main_text = pd.Series(soup.find_all('div', class_='MainText'))
date = pd.Series(soup.find_all('div', class_='Date'))

result = pd.concat([user, main_text, date], axis=1)
The problem is that I receive the information with all the tags, while I want only the text. Surprisingly, the .text attribute doesn't work with the find_all method, so now I'm completely out of ideas.
Thank you for any help!
A list comprehension is the way to go. To get all the text within MainText, for example, try:
[elem.text for elem in soup.find_all('div', class_='MainText')]
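Applying the same pattern to all three fields, a minimal sketch of the full DataFrame build, reusing the soup from your question (the column labels are my choice):

import pandas as pd

user = [e.text for e in soup.find_all('div', class_='Username')]
main_text = [e.text for e in soup.find_all('div', class_='MainText')]
date = [e.text for e in soup.find_all('div', class_='Date')]

# Column labels are arbitrary; the three lists must be the same length
result = pd.DataFrame({'user': user, 'main_text': main_text, 'date': date})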