I need to scrape the content (just the titles) from a website. I did it for one page, but I need to do it for all the pages on the website.
Currently, I am doing as follows:
import bs4, requests
import pandas as pd
import re

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
website = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"
r = requests.get(website, headers=headers)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
title = soup.find_all('h2')
I know that, when I move to the next page, the url changes as follows:
website/page/2/
website/page/3/
...
website/page/49/
...
I tried to build a recursive function using next_page_url = base_url + next_page_partial but it does not move to the next page.
if soup.find("span", text=re.compile("Next")):
page = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/".format(page_num)
page_num +=10 # I should scrape all the pages so maybe this number should be changed as I do not know at the beginning how many pages there are for that section
print(page_num)
else:
break
I followed this question (and answer): Moving to next page for scraping using BeautifulSoup
Please let me know if you need more info. Many thanks
Updated code:
import bs4, requests
import pandas as pd
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page_num = 1
website = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"
while True:
    r = requests.get(website, headers=headers)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    title = soup.find_all('h2')
    if soup.find("span", text=re.compile("Next")):
        page = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{page_num}".format(page_num)
        page_num += 10
    else:
        break
If you use f"url/{page_num}" then remove format(page_num).
You can use anything you want below:
page = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{page_num}"
or
page = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{}".format(page_num)
Good luck!
The final answer will be this:
import bs4, requests
import pandas as pd
import re
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page_num = 2  # page 1 is the base URL itself, so the first "next" page is /page/2
website = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"
titles = []
while True:
    r = requests.get(website, headers=headers)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    titles.extend(soup.find_all('h2'))  # accumulate across pages instead of overwriting
    if soup.find("span", text=re.compile("Next")):
        website = f"https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina/page/{page_num}"
        page_num += 1
    else:
        break
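As a more robust sketch, you can follow the href of the "Next" link itself instead of rebuilding the URL from a counter, so you never need to know the page count in advance. The assumption here is that the "Next" span sits inside an <a> tag with an absolute href; check the site's markup before relying on it:

import re
import bs4, requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
url = "https://catania.liveuniversity.it/notizie-catania-cronaca/cronacacatenesesicilina"
titles = []
while url:
    r = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    titles.extend(h2.get_text(strip=True) for h2 in soup.find_all('h2'))
    next_span = soup.find("span", text=re.compile("Next"))
    # Assumption: the span is wrapped in an <a href="..."> pointing at the next page.
    next_link = next_span.find_parent("a") if next_span else None
    url = next_link.get("href") if next_link else None
print(len(titles), "titles scraped")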
Related
The goal is to get Python / BeautifulSoup to scrape Yahoo Finance for the first/last name of a public company's owner:
from bs4 import BeautifulSoup
import requests
url = 'https://finance.yahoo.com/quote/GTVI/profile?p=GTVI'
page = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
})
soup = BeautifulSoup(page.text, 'html.parser')
price = soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"})
print(soup.select_one("td > span").text)
The single call above works perfectly, but I can't get it to loop and print multiple times while keeping the browser's user agent masked. Here is my attempt at it (keep in mind I'm new to Python). Haaalp :)
from bs4 import BeautifulSoup
import requests
url = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA',
'https://finance.yahoo.com/quote/CYDX/profile?p=CYDX',
'https://finance.yahoo.com/quote/TTHG/profile?p=TTHG']
names = []
for link in url:
    w = 1
    reqs2 = requests.get(link)
    page = requests.get(url, headers={
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    })
    soup = BeautifulSoup(page.text, 'html.parser')
    for x in soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"})
        names.append(x.text)
print(names)(soup.select_one("td > span").text)
Check your indentation to get your code running, and also check your requests (you request url instead of link). Since the expected result in your question is not that clear, this is just a hint at how to fix it and get a result.
Example
from bs4 import BeautifulSoup
import requests
url = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA',
'https://finance.yahoo.com/quote/CYDX/profile?p=CYDX',
'https://finance.yahoo.com/quote/TTHG/profile?p=TTHG']
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"}
names = []
for link in url:
    page = requests.get(link, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    for x in soup.find_all("tr", {"class": "C($primaryColor) BdB Bdc($seperatorColor) H(36px)"}):
        names.append(x.text)
    print(soup.select_one("td > span").text)
print(names)
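As a small follow-up sketch: if you are requesting several profile pages, a requests.Session lets you set the User-Agent once and reuse the underlying connection. The selector is the same one used above and is an assumption about Yahoo's current markup:

from bs4 import BeautifulSoup
import requests

urls = ['https://finance.yahoo.com/quote/GTVI/profile?p=GTVI',
        'https://finance.yahoo.com/quote/RAFA/profile?p=RAFA']
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"})
names = {}
for link in urls:
    soup = BeautifulSoup(session.get(link).text, 'html.parser')
    cell = soup.select_one("td > span")        # first name cell on the profile page
    names[link] = cell.text if cell else None  # guard against markup changes
print(names)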
So I am trying to scrape the CPI report from the Indian government's website.
Here is the website: https://fcainfoweb.nic.in/pmsver2/reports/report_menu_web.aspx
I am using this approach:
When we load this website, it asks us to select multiple options. After selecting the options and hitting the "Get Data" button, we are redirected to the report page.
Here, I copied my cookie and session details, which I used in the Python script below to retrieve the information. That is working fine.
Now I want to fully automate this task, which will require:
Price report -> Daily prices
date selection
getting the data in code
But the issue is that the web pages are redirected and even the options on the selectors change, so how do I scrape this?
I have the script below, where I've passed the prefetched cookie and session as parameters and am able to get the data.
import requests
#from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
from pprint import pprint

# menu page   = https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx
# report link = https://fcainfoweb.nic.in/reports/Report_daily1_Web_New.aspx
head = {'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Cookie": "ASP.NET_SessionId=n3npgkgb2wpy3sup45ze024y; BNI_persistence=XIlVKPHMyFvRq0HtLj7pmqXxmRx7y7byO_ia3T0PrBLraaAiDz2RxPPPWpXCo2y2SGMfsbBJx4Pe4wWpm_C-OA=="}
u = 'https://fcainfoweb.nic.in/Reports/Report_Menu_Web.aspx'
res = requests.get(u, headers=head)
print(res.headers)
print(res.text)
print(res.cookies)
with open('resp.html', 'w') as f:
    f.write(res.text)
soup = BeautifulSoup(res.text, 'lxml')
tab = soup.find_all('table')
cnt = 1
htab = pd.read_html(res.text)[1]  # the second table on the page holds the report data
fn = "data_{0}.xlsx".format(cnt)
htab.to_excel(fn)
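A copied cookie stops working once the session expires, so for full automation it helps to know that ASP.NET WebForms pages like this one carry their UI state in hidden __VIEWSTATE / __EVENTVALIDATION fields and react to the selectors via POST backs. A rough sketch of that flow is below; the option field names are placeholders I made up, so read the real names out of the form's HTML first:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps ASP.NET_SessionId and BNI_persistence cookies for us
head = {'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
base = 'https://fcainfoweb.nic.in/Reports/Report_Menu_Web.aspx'

soup = BeautifulSoup(session.get(base, headers=head).text, 'lxml')
# Collect every hidden WebForms state field (__VIEWSTATE, __EVENTVALIDATION, ...).
payload = {tag['name']: tag.get('value', '') for tag in soup.select('input[type=hidden]')}

# Placeholder field names: inspect the form to find the real select names
# for "Price report", "Daily Prices" and the date box, then fill them in here.
payload['ddl_report_type'] = 'Price report'    # assumption, not the real field name
payload['ddl_report_option'] = 'Daily Prices'  # assumption, not the real field name

resp = session.post(base, data=payload, headers=head)
print(resp.status_code)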
def req(url) -> BeautifulSoup:
    print(url)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup

def get_novels(page, sort, order, status):
    books = []
    novel_list = req(f"https://www.novelupdates.com/novelslisting/?sort={sort}&order={order}&status={status}&pg={page}")
    novels = novel_list.find_all(class_="search_main_box_nu")...
The above code gets the actual content of the page on Python 3.10.2, but on 3.9.12 it gets a bot-verification page instead.
Why is that, and how do I fix it? Please help.
I want to remove a target tr block with its text. When I run the code I get almost perfect output, but there is a problem: it also scrapes <tr><td>Domain</td><td>Last Resolved Date</td></tr>, and I don't want that header line in my output. How can I remove it? Code below.
Got a fix.
Old Code
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://viewdns.info/reverseip/?host=github.com&t=1"
text = requests.get(url, headers=headers).text
soup = BeautifulSoup(text, 'html.parser')
table = soup.find('table', attrs={'border':'1'})
domain = table.findAll('td', attrs={'align':None})
for line in domain:
    print(line.text)
Fixed
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://viewdns.info/reverseip/?host=github.com&t=1"
text = requests.get(url, headers=headers).text
soup = BeautifulSoup(text, 'html.parser')
table = soup.find('table', attrs={'border':'1'})
domain = table.findAll('td', attrs={'align':None})[2:]
for line in domain:
    print(line.text)
Try this code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
url = "https://viewdns.info/reverseip/?host=github.com&t=1"
text = requests.get(url, headers=headers).text
soup = BeautifulSoup(text, 'html.parser')
table = soup.find('table', attrs={'border':'1'})
domain = table.findAll('td', attrs={'align':None})[2:]
for line in domain:
    print(line.text)
Filter out the first two td elements (the Domain and Last Resolved Date header cells) in your domain variable:
domain = table.findAll('td', attrs={'align':None})[2:]
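An alternative sketch that doesn't depend on counting cells: drop the header <tr> itself and walk the remaining rows. This reuses the table variable from the snippet above and assumes the header is always the table's first row:

for row in table.findAll('tr')[1:]:  # [1:] skips the Domain / Last Resolved Date header row
    for cell in row.findAll('td', attrs={'align': None}):
        print(cell.text)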
I've written a page monitor to fetch the latest product link from Nike.com, but I only want it to return a link if it's from a product that has just been uploaded to the site. I haven't been able to find anything that addresses this. This is the page monitor, written in Python. Any help with returning only new links would be appreciated.
import requests
from bs4 import BeautifulSoup
import time
import json
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}

def item_finder():
    source = requests.get('https://www.nike.com/launch/', headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    card = soup.find('figure', class_='item ncss-col-sm-12 ncss-col-md-6 ncss-col-lg-4 va-sm-t pb2-sm pb4-md prl0-sm prl2-md ')
    card_data = "https://nike.com" + card.a.get('href')
    print(card_data)

item_finder()
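To report only links that have just appeared, one simple approach is to remember which links you have already seen and print the difference on each poll. A minimal sketch under those assumptions (the class string is copied from the code above, the 30-second interval is arbitrary, and the first poll treats everything as new unless you prime the seen set first):

import time
import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
seen = set()

def new_links():
    source = requests.get('https://www.nike.com/launch/', headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    links = {"https://nike.com" + fig.a.get('href')
             for fig in soup.find_all('figure', class_='item ncss-col-sm-12 ncss-col-md-6 ncss-col-lg-4 va-sm-t pb2-sm pb4-md prl0-sm prl2-md ')
             if fig.a}
    fresh = links - seen   # only links we have not reported before
    seen.update(links)
    return fresh

while True:
    for link in new_links():
        print(link)
    time.sleep(30)         # arbitrary polling interval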