I am trying to grab the house price along with the address, and hopefully other relevant data (bedrooms?). I have got the following so far. Using Google's element inspector I can see an element containing the data, but if I search for it I don't get the address.
Any thoughts?
import requests
from bs4 import BeautifulSoup

query = 'http://www.realestate.com.au/buy/with-2-bedrooms-in-epping%2c+nsw+2121/list-1?maxBeds=2&source=refinements'
resp = requests.get(query)
soup = BeautifulSoup(resp.text, "html.parser")
ads = soup.findAll("div", {"id": "searchResultsTbl"})
If you need to get the address, use this:
import requests
from bs4 import BeautifulSoup

query = 'http://www.realestate.com.au/buy/with-2-bedrooms-in-epping%2c+nsw+2121/list-1?maxBeds=2&source=refinements'
resp = requests.get(query)
soup = BeautifulSoup(resp.text, "html.parser")
ads = soup.find("div", {"class": "vcard"})
print(ads.h2.a.text)
Output:
61 Mobbs Lane, Epping, NSW 2121
For all addresses use this:
soup = BeautifulSoup(resp.text, "html.parser")
ads = soup.findAll("div", {"class": "vcard"})
for ad in ads:
    print(ad.h2.a.text)
Output:
61 Mobbs Lane, Epping, NSW 2121
29/3-5 Kandy Avenue, Epping, NSW 2121
5/30 Cambridge Street, Epping, NSW 2121
...
101/239-243 Carlingford Rd, Carlingford, NSW...
65-69 Adderton Road, Telopea, NSW 2117
And for rooms you can use something like this:
rooms = soup.findAll("li", {"class": "first"})
for room in rooms:
    if room.span:
        print(room.span.text)
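If you also want the price and the bedroom count next to each address, here is a minimal sketch that walks every result card in one pass. The "priceText" class used for the price is an assumption (it does not appear anywhere above), so check the real class name with the element inspector and adjust:
import requests
from bs4 import BeautifulSoup

query = 'http://www.realestate.com.au/buy/with-2-bedrooms-in-epping%2c+nsw+2121/list-1?maxBeds=2&source=refinements'
resp = requests.get(query)
soup = BeautifulSoup(resp.text, "html.parser")

for ad in soup.findAll("div", {"class": "vcard"}):
    address = ad.h2.a.text if ad.h2 and ad.h2.a else ""
    # "priceText" is a hypothetical class name -- verify it in the page source
    price_tag = ad.find_next("p", {"class": "priceText"})
    price = price_tag.text.strip() if price_tag else ""
    # bedrooms, reusing the "first" <li> from the answer above
    room_tag = ad.find_next("li", {"class": "first"})
    rooms = room_tag.span.text if room_tag and room_tag.span else ""
    print(address, "|", price, "|", rooms)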
Related
I am trying to scrape the company name, postcode, phone number and web address from:
https://www.matki.co.uk/matki-dealers/ I am finding it difficult because the information is only shown after clicking a region on the page. If anyone could help it would be much appreciated. I'm very new to both Python and especially to scraping!
!pip install beautifulsoup4
!pip install urllib3
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.matki.co.uk/matki-dealers/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
I guess this is what you wanted to do (you can then write the result to a file or a database, or parse it further and use it directly):
import requests
from bs4 import BeautifulSoup

URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)

# parse the HTML
soup = BeautifulSoup(page.content, "html.parser")

# extract the results
results = soup.find(class_="dealer-region")
company_elements = results.find_all("article")

# loop through the results and extract the wanted information
for company_element in company_elements:
    # some cleanup before printing the info:
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    # the result ...
    print(company_info)
Output:
ESP Bathrooms & Interiors, Queens Retail Park, Queens Street, Preston, PR1 4HZ, 01772 200400, www.espbathrooms.co.uk
Paul Scarr & Son Ltd, Supreme Centre, Haws Hill, Lancaster Road A6, Carnforth, LA5 9DG, 01524 733788,
Stonebridge Interiors, 19 Main Street, Ponteland, NE20 9NH, 01661 520251, www.stonebridgeinteriors.com
Bathe Distinctive Bathrooms, 55 Pottery Road, Wigan, WN3 5AA, www.bathe-showroom.co.uk
Draw A Bath Ltd, 68 Telegraph Road, Heswall, Wirral, CH60 7SG, 0151 342 7100, www.drawabath.co.uk
Acaelia Home Design, Unit 4 Fence Avenue Industrial Estate, Macclesfield, Cheshire, SK10 1LT, 01625 464955, www.acaeliahomedesign.co.uk
...
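If you need the postcode, phone number and web address in separate CSV columns rather than one joined string, here is a rough sketch that writes each dealer to a row. Splitting the joined text on commas is a heuristic of mine, not part of the site's markup: the number of address parts varies per dealer, so only the first field (the company name) is reliable without further parsing:
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.matki.co.uk/matki-dealers/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

with open("dealers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Company", "Details"])
    for company_element in soup.find(class_="dealer-region").find_all("article"):
        text = company_element.getText(separator=", ").replace("Find out more »", "")
        parts = [p.strip() for p in text.split(",") if p.strip()]
        # first part is the company name; keep the rest (address, postcode,
        # phone, website) joined because their count varies per dealer
        writer.writerow([parts[0], ", ".join(parts[1:])])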
I want to crawl maritime news from the Fleetmon.com news pages, including the detail pages, and save it to a text file. I tried BeautifulSoup in Python but it does not work properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

baseurl = 'https://www.fleetmon.com/maritime-news/'
headers = {'User-Agent': 'Mozilla/5.0'}

newslinks = []  # put all items in this array
for x in range(1):  # set page range
    response = requests.get(
        f'https://www.fleetmon.com/maritime-news/?page={x}')  # url of next page
    soup = BeautifulSoup(response.content, 'html.parser')
    newslist = soup.find_all('article')
    # loop to get all hrefs from each article
    for item in newslist:
        for link in item.find_all('a', href=True):
            newslinks.append(link['href'])
newslinks = list(set(newslinks))
print(newslinks)

# news detail pages
newsdata = []
for link in newslinks:
    print(link)
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    shipName = soup.find('div', {'class': 'uk-article-story'}).text.strip()
    fieldsets = soup.find_all('article')
    row = {'Ship Name': shipName}
    for fieldset in fieldsets:
        dts = fieldset.find_all('h1')
        for dt in dts:
            row.update({dt.text.strip(): dt.find_next('p').text.strip()})
    newsdata.append(row)

# text or csv
df = pd.DataFrame(newsdata)
df.to_csv(r'C:\Users\Usuario\Desktop\news.csv', index=False, header=True)
print(df)
Help me improve my code to get all the data in text form.
Also, is it possible to crawl the data and save it to CSV like this:
Column 1: News_title: value
Column 2: category: accidents
Column 3: publish_date_time: June 28, 2022 at 13:31
Column 4: news: full news here
Go to the detail page (here I use req2 to request the detail page). I've made the pagination using a for loop and the range function, so you can increase or decrease the number of pages in no time.
P.S.: If you click on any title link you can see the detail page, and all the required data items are scraped from the detail pages.
import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
url = 'https://www.fleetmon.com/maritime-news/?page={page}'

data = []
for page in range(1, 11):
    req = requests.get(url.format(page=page), headers=headers)
    soup = BeautifulSoup(req.text, 'lxml')
    for link in soup.select('.news-headline h2 a'):
        link = 'https://www.fleetmon.com' + link.get('href')
        req2 = requests.get(link, headers=headers)
        soup2 = BeautifulSoup(req2.text, 'lxml')

        title = soup2.find('h1', class_="uk-article-title margin-t-0").text
        cat = soup2.select_one('p.uk-article-meta span a strong').text
        date = soup2.select_one('[class="uk-text-nowrap"]:nth-child(3)').text
        details = soup2.select_one('.uk-article-story ').get_text(strip=True)

        data.append({
            'title': title,
            'category': cat,
            'date': date,
            'details_news': details
        })

df = pd.DataFrame(data)  # .to_csv('news.csv', index=False)
print(df)
Output:
0 Cruise ship NORWEGIAN SUN hit iceberg, damaged... ... Cruise ship NORWEGIAN SUN hit an iceberg size ...
1 Yang Ming and HMM Were Accused of Collusion to... ... YM WARRANTY by ship spotter phduck2kYM WARRANT...
2 Fire in bulk carrier cargo hold, Florida ... At around 2350 LT Jun 26 firefighters responde...
3 Chlorine gas tank fell on Chinese cargo ship, ... ... Tank with 25 tons of chlorine gas fell onto ca...
4 Heavy vehicle fell onto cargo deck during offl... ... Heavy machinery vehicle (probably mobile crane...
.. ... ...
...
195 Yara Plans 15 Ammonia Bunkering Terminals in S... ... VIKING ENERGY by ship spotter PattayaVIKING EN...
196 World’s Largest Electric Cruise Ship Sets Sail... ... ©Wuxi Saisiyi Electric Technology,©Wuxi Saisiy...
197 The Supply Chain Crisis Brewing at Israeli Ports ... Port Haifa in FleetMon ExplorerPort Haifa in F...
198 CDC Drops Its “Cruise Ship Travel Health Notic... ... AIDADIVA by ship spotter Becks93AIDADIVA by sh...
199 Scorpio Tankers Take the Path of Shipboard Car... ... CORONA UTILITY by ship spotter canonbenqCORONA...
[200 rows x 4 columns]
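To actually save the result in the two formats asked about (CSV and plain text), a short follow-up that reuses the df and data built above; the file names are arbitrary:
# CSV: one row per article, one column per field
df.to_csv('news.csv', index=False)

# plain text: one block per article
with open('news.txt', 'w', encoding='utf-8') as f:
    for row in data:
        f.write(f"Title: {row['title']}\n"
                f"Category: {row['category']}\n"
                f"Date: {row['date']}\n"
                f"News: {row['details_news']}\n\n")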
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My Python code is this:
import requests
from bs4 import BeautifulSoup as bs4

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
Guys, this code gets the products from the page, but not all of them. I suspect the content is dynamic, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to have to change my code a lot because it's almost done. How do I do that?
DO NOT USE requests.get if you are making repeated requests to the same HOST; use a requests.Session instead, so the underlying connection is reused across requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 for the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
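To collect the product links themselves instead of printing the raw tag list, a small variation on the same loop; it assumes every matching "Comprar" button carries an href, as in your original code:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

produtos = []
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    for link in soup.find_all("a", class_="btn", text="Comprar"):
        href = link.get("href")
        if href and href not in produtos:
            produtos.append(href)

print(len(produtos), "product links collected")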
Link: https://www.yelp.com/search?cflt=restaurants&find_loc=San+Francisco%2C+CA
I am scraping restaurant names from yelp.com, but it also prints the page numbers below. How can I target only the names using BeautifulSoup? I am sharing a couple of screenshots. How can I target the name attribute shown in the inspect-element screenshot?
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yelp.com/search?cflt=restaurants&find_loc=San+Francisco%2C+CA"
yelp_r = requests.get(url)
yelp_soup = bs(yelp_r.text, "html.parser")
# print(yelp_soup.prettify())

for name in yelp_soup.find_all("a", {"class": "lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--inherit__373c0__1VFlE"}):
    print(name.text)
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yelp.com/search?cflt=restaurants&find_loc=San+Francisco%2C+CA"
yelp_r = requests.get(url)
yelp_soup = bs(yelp_r.text, "html.parser")

ul = yelp_soup.find('ul', {'class': 'lemon--ul__373c0__1_cxs undefined list__373c0__2G8oH'})
for li in ul.find_all('li', {'class': 'lemon--li__373c0__1r9wz border-color--default__373c0__3-ifU'}):
    for a_tag in li.find_all("a", {'class': "lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--inherit__373c0__1VFlE"}):
        print(a_tag.text)         # get the text
        print(a_tag.get('name'))  # get the name property of the a tag
Output:
Boo Koo
Boo Koo
Fog Harbor Fish House
Fog Harbor Fish House
... some results removed
Gary Danko
Gary Danko
um.ma
um.ma
Note: I didn't investigate whether the class name is dynamic or whether it changes.
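Since those lemon--...__373c0__... class names look machine-generated and may change, a more defensive sketch is to select anchors whose href points at a business page. The /biz/ prefix is my assumption about Yelp's URL scheme, so verify it in the page source before relying on it:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yelp.com/search?cflt=restaurants&find_loc=San+Francisco%2C+CA"
yelp_soup = bs(requests.get(url).text, "html.parser")

names = []
for a_tag in yelp_soup.select('a[href^="/biz/"]'):  # assumed /biz/ prefix for business pages
    text = a_tag.get_text(strip=True)
    if text and text not in names:  # skip empty anchors (photos) and duplicates
        names.append(text)
print(names)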
I would check the contents of each tag and keep it only if the keyword 'Page' is not in it.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.yelp.com/search?cflt=restaurants&find_loc=San+Francisco%2C+CA"
yelp_r = requests.get(url)
yelp_soup = bs(yelp_r.text, "html.parser")
# print(yelp_soup.prettify())

for name in yelp_soup.find_all("a", {"class": "lemon--a__373c0__IEZFH link__373c0__1G70M link-color--inherit__373c0__3dzpk link-size--inherit__373c0__1VFlE"}):
    if 'Page:' not in str(name.contents[0]):
        print(name.contents[0])
Result:
San Francisco, CA
Restaurants
Boo Koo
...
Palm House
Anchor Oyster Bar
I am trying to get the headlines that sit inside a particular class. The headlines are wrapped in h2 tags, and each headline comes after the date span tag.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")

mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
mytags = mydivs.findNext('h2')
for tag in mytags:
    print(tag.text.strip())
You must iterate through mydivs to use findNext().
mydivs is a list of elements; findNext() only applies to a single element, so you have to iterate through the matched tags and run findNext() on each of them.
Just add this line
for div in mydivs:
and put it before
mytags = div.findNext('h2')
Here is the full code for your working program:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")

mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
Try replacing the last 3 lines with:
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
soup.findAll() returns a ResultSet (a list of tags), so you cannot call findNext() on it. However, you can iterate over the tags and call find_next() on each tag separately:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")

mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
    print(tag.find_next('h2').get_text(strip=True))
Prints:
BREAKING: Another federal lawmaker dies in Dubai hospital
Cross-Over Night: Enugu Govt bans burning of tyres on roads
Dadiyata: DSS breaks silence as Nigerian govt critic remains missing
CAC: Nigerian govt appoints new Acting Registrar-General
What Buhari told me – Dabiri-Erewa
What soldiers should expect in 2020 – Buratai
Only earthquake can erase Amosun’s legacies in Ogun – Akinlade
Civil War: Militia leader sentenced to 20yrs in prison
2020: Prophet Omale releases prophecies on Buhari, Aisha, Kyari, govs, coup plot
BREAKING: EFCC arrests Shehu Sani
Armed Forces Day: Yobe Governor Buni, donates N40 million for emblem appeal fund
Zamfara govt bans illegal gathering in the state
Agbenu Kacholalo: Colours of culture at Idoma International Carnival 2019 [PHOTOS]
Men of God are too fearful, weak to challenge government activities
2020: Peter Obi sends message to Nigerians
TETFUND: EFCC, ICPC asked to probe agency over alleged corruption
Two inmates regain freedom from Uyo prison
Buhari meets President of AfDB, Adeshina at Aso Rock
New Kogi CP resumes office, promises crime free state
Nothing stops you from paying N30,000 minimum wage to workers – APC challenges Makinde
EDIT: This script will scrape headlines from several pages:
import requests
from bs4 import BeautifulSoup

url = 'https://dailypost.ng/hot-news/page/{}/'

for page in range(1, 5):  # <-- change how many pages you want
    print('Page no.{}'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
    for tag in mydivs:
        print(tag.find_next('h2').get_text(strip=True))
    print('-' * 80)
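Since the span being matched holds the article date (mvp-cd-date), you can also keep each date next to its headline and write the pairs to a CSV. A minimal sketch, assuming the span text really is the publication date:
import csv
import requests
from bs4 import BeautifulSoup

url = 'https://dailypost.ng/hot-news/page/{}/'

with open('headlines.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'headline'])
    for page in range(1, 5):
        soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
        for tag in soup.findAll("span", {"class": "mvp-cd-date left relative"}):
            writer.writerow([tag.get_text(strip=True),
                             tag.find_next('h2').get_text(strip=True)])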