Hello, I want to scrape a webpage. I posted my code below; the line I marked is the important one. It doesn't work: there is no error, but also no output. I need to concatenate two strings, and that is where the problem is.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text() <---this line
html_content_course=requests.get(url_course).text
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
When I change the line I marked to
url_course=url_course_main+'AKM'
it works.
Also, soup.find_all('option')[1].get_text() is equal to 'AKM'.
Can you guess where the mistake is?
Instead of the requests module, try Python's standard urllib.request. It seems that the requests module has a problem opening the page:
import urllib.request
from bs4 import BeautifulSoup
url='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html_content, "lxml")
url_course_main='http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
url_course=url_course_main+soup.find_all('option')[1].get_text()
html_content_course=urllib.request.urlopen(url_course).read()
soup_course=BeautifulSoup(html_content_course,'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text(strip=True))
Prints:
2019-2020 Yaz Dönemi AKM Kodlu Derslerin Ders Programı
...
The problem is that get_text() gives 'AKM ' with a space at the end, and requests sends the URL with this space - so the server can't find the file 'AKM ' with the space.
I wrapped the value in >< using '>{}<'.format(param) to show this space - >AKM < - because without the >< markers it looks fine.
The code needs get_text(strip=True) or get_text().strip() to remove this space.
import requests
from bs4 import BeautifulSoup
url = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, 'lxml')
url_course_main = 'http://www.sis.itu.edu.tr/tr/ders_programlari/LSprogramlar/prg.php?fb='
param = soup.find_all('option')[1].get_text()
print('>{}<'.format(param)) # I use `> <` to show spaces
param = soup.find_all('option')[1].get_text(strip=True)
print('>{}<'.format(param)) # I use `> <` to show spaces
url_course = url_course_main + param
html_content_course = requests.get(url_course).text
soup_course = BeautifulSoup(html_content_course, 'lxml')
for j in soup_course.find_all('td'):
    print(j.get_text())
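As a quick aside, repr() is another simple way to make hidden whitespace visible while debugging this kind of issue (a small sketch, reusing soup from the script above):
param = soup.find_all('option')[1].get_text()
print(repr(param))          # shows 'AKM ' - the trailing space is visible
print(repr(param.strip()))  # shows 'AKM' - after stripping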
Related
I'm trying to extract player position from many players' webpages (here's an example for Malcolm Brogdon). I'm able to extract Malcolm Brogdon's position using the following code:
player_id = 'malcolm-brogdon-1'
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
import numpy as np
url = "https://www.sports-reference.com/cbb/players/{}.html".format(player_id)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
pos = page_soup.p.find("strong").next_sibling.strip()
pos
However, I want to be able to do this in a more dynamic way (that is, to locate "Position:" and then find what comes after it). There are other players for which the webpage is structured slightly differently, and my current code wouldn't return the position (e.g. Cat Barber).
I've tried doing something like page_soup.find("strong", text="Position:") but that doesn't seem to work.
You can select the element that contains the text "Position:" and then the next text sibling:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = soup.select_one('strong:contains("Position")').find_next_sibling(text=True).strip()
print(pos)
Prints:
Guard
EDIT: Another version:
import requests
from bs4 import BeautifulSoup
url = "https://www.sports-reference.com/cbb/players/anthony-cat-barber-1.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
pos = (
    soup.find("strong", text=lambda t: t and "Position" in t)  # `t and` guards against tags whose .string is None
    .find_next_sibling(text=True)
    .strip()
)
print(pos)
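Since the goal is to run this over many players' pages, a small defensive wrapper can help when a page has no "Position:" label at all, in which case find() returns None and the chained calls raise AttributeError. A sketch with a hypothetical helper name:
def get_position(soup):
    # Returns the position text, or None when the "Position:" label is missing on the page.
    label = soup.find("strong", text=lambda t: t and "Position" in t)
    if label is None:
        return None
    sibling = label.find_next_sibling(text=True)
    return sibling.strip() if sibling else None

print(get_position(soup))  # e.g. "Guard"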
I recently got interested in web scraping in Python and tried it on some simple examples, but I don't know how to handle other languages that don't use ASCII characters, for example when searching for a specific string in the HTML file or writing those strings to a file.
from urllib.parse import urljoin
import requests
import bs4
website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
soup1 = bs4.BeautifulSoup(requests.get(book_url).text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
Looking at the website at book_url, each row has different text, and the text is in Persian.
Let's say I need the last row to be considered.
The text is "صدای کل کتاب".
How can I search for this string in the <li>, <div>, and <a> tags?
You need to set the encoding on the requests response to UTF-8. It looks like the requests module was not using the decoding you wanted. As mentioned in this SO post, you can tell requests what encoding to expect.
from urllib.parse import urljoin
import requests
import bs4
website = 'http://book.iranseda.ir'
book_url = 'http://book.iranseda.ir/DetailsAlbum/?VALID=TRUE&g=209103'
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
match1 = soup1.find_all('a', class_='download-mp3')
for m in match1:
    m = m['href'].replace('q=10', 'q=9')
    url = urljoin(website, m)
    print(url)
    print()
The only change here is
req = requests.get(book_url)
req.encoding = 'UTF-8'
soup1 = bs4.BeautifulSoup(req.text, 'lxml')
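For the second part of the question: once the text is decoded correctly, the Persian string can be searched for like any other string. A minimal sketch, reusing soup1 from above (note that matching on get_text() will also report parent tags that contain the label):
target = 'صدای کل کتاب'
# search <li>, <div> and <a> tags for the Persian label, as asked in the question
for tag in soup1.find_all(['li', 'div', 'a']):
    if target in tag.get_text():
        print(tag.name, tag.get_text(strip=True))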
I have the following code:
from bs4 import BeautifulSoup
import requests
import csv
url = "https://coingecko.com/en"
base_url = "https://coingecko.com"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
names = [div.a.span.text for div in soup.find_all("div",attrs={"class":"coin-content center"})]
Link = [base_url+div.a["href"] for div in soup.find_all("div",attrs={"class":"coin-content center"})]
for link in Link:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content,"html.parser")
    indent = inner_soup.find("div",attrs={"class":"py-2"})
    content = indent.div.next_siblings
    Allcontent = [sibling for sibling in content if sibling.string is not None]
    print(Allcontent)
I have successfully entered the inner page and grabbed all of the information for the coins listed on the first page. But there are further pages: 1, 2, 3, 4, 5, 6, 7, 8, 9, etc. How can I go to all the next pages and do the same as before?
Further, the output of my code contains a lot of \n and spaces. How can I fix that?
You need to generate the URL of each page, request them one by one, and parse each with bs4:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.coingecko.com/en')
soup = BeautifulSoup(req.content, 'html.parser')
last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']
lp = last_page.split('=')[-1]
count = 0
for i in range(int(lp)):
    count += 1
    url = 'https://www.coingecko.com/en?page=' + str(count)
    print(url)
    requests.get(url)  # request each page one by one till the last page
    # parse your fields here using bs4
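For reference, a minimal sketch of what that parsing step inside the loop could look like, reusing the "coin-content center" selector from the question itself (the class name may no longer match the live site):
page_soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for div in page_soup.find_all('div', attrs={'class': 'coin-content center'}):
    print(div.a.span.text.strip())  # coin name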
Your script looks a bit messy as written. Try .select() to make it concise and less prone to breaking. Although I could not find any further use of names in your script, I kept it as it is. Here is how you can get all the available links while traversing multiple pages.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "https://coingecko.com/en"
while True:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    names = [item.text for item in soup.select("span.d-lg-block")]
    for link in [urljoin(url, item["href"]) for item in soup.select(".coin-content a")]:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.text, "lxml")
        desc = [item.get_text(strip=True) for item in inner_soup.select(".py-2 p") if item.text]
        print(desc)
    try:
        url = urljoin(url, soup.select_one(".pagination a[rel='next']")['href'])
    except TypeError:
        break
Btw, the whitespace has also been taken care of by using .get_text(strip=True).
I'm scraping from the following page: https://www.pro-football-reference.com/boxscores/201809060phi.htm
I have this code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
tables = soup.findAll("div",{"class":"table_outer_container"})
print (len(tables))
Each table on the page sits inside a "div" with class "table_outer_container", but my print statement only returns 1. Am I wrong in believing that my findAll statement will assign all of those elements to the variable "tables"?
It's because most of the tables are within HTML comments, and your script won't grab them unless you strip the comment markers <!-- and --> from the response. Try the following; it should give you 20 tables from that page.
import requests
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
r = requests.get(url).text
res = r.replace("<!--","").replace("-->","")
soup = BeautifulSoup(res, 'lxml')
tables = soup.findAll("div",{"class":"table_outer_container"})
print (len(tables))
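An alternative to the string replacement, sketched under the same assumptions, is to let bs4 pull out the Comment nodes and re-parse each one; it should find the same tables:
import requests
from bs4 import BeautifulSoup, Comment
url = 'https://www.pro-football-reference.com/boxscores/201809060phi.htm'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# tables that are visible right away
tables = soup.findAll("div", {"class": "table_outer_container"})
# tables hidden inside HTML comments: re-parse each comment as its own document
for comment in soup.find_all(text=lambda t: isinstance(t, Comment)):
    tables.extend(BeautifulSoup(comment, 'lxml').findAll("div", {"class": "table_outer_container"}))
print(len(tables))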
I'm using the lxml.html module:
from lxml import html
page = html.parse('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution')
# print(page.content)
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
print(unis.__len__())
with open('workfile.txt', 'w') as f:
    for uni in unis:
        f.write(uni + '\n')
The website (http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z) is full of universities.
The problem is that it only parses up to the letter 'H' (244 universities).
I can't understand why, since as far as I can see it parses all of the HTML to the end.
I also checked that 244 is not some limit of a list or anything else in Python 3.
That HTML page simply isn't valid HTML; it's totally broken. But the following will do what you want, using the BeautifulSoup parser.
from lxml.html.soupparser import parse
import urllib.request
url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution'
page = parse(urllib.request.urlopen(url))
unis = page.xpath('//tr/td[@valign="top" and @style="width: 50%;padding-right:15px"]/h3/text()')
See http://lxml.de/lxmlhtml.html#really-broken-pages for more info.
For web scraping I recommend using BeautifulSoup 4.
With bs4 this is easily done:
from bs4 import BeautifulSoup
import urllib.request
universities = []
result = urllib.request.urlopen('http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?sort=institution#Z')
soup = BeautifulSoup(result.read(),'html.parser')
table = soup.find_all(lambda tag: tag.name=='table')
for t in table:
    rows = t.find_all(lambda tag: tag.name=='tr')
    for r in rows:
        # there are also the A-Z headers -> check length
        # there are also empty headers -> check isspace()
        headers = r.find_all(lambda tag: tag.name=='h3' and tag.text.isspace()==False and len(tag.text.strip()) > 2)
        for h in headers:
            universities.append(h.text)
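To tie this back to the workfile.txt from the question, the collected names can then be written out like this (a short sketch; the explicit encoding is just a precaution in case any names contain non-ASCII characters):
with open('workfile.txt', 'w', encoding='utf-8') as f:
    for uni in universities:
        f.write(uni.strip() + '\n')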