Extract href value inside a div - beautifulsoup

I am trying to print the titles of all the anime listed at https://gogoanime.pe/anime-movies.html?aph=&page= with the following code, adapted from Bucky's tutorial:
import requests
from bs4 import BeautifulSoup
def animmov(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://gogoanime.pe/anime-movies.html?aph=&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        sopas = BeautifulSoup(plain_text, features="html.parser")
        for link in sopas.find_all('div', attrs={'class':'img'}):
            href = link.get('href')
            print(href)
        page += 1
When I execute the code, it prints only None.
I have also tried reading a related question, but I couldn't follow it. How can I extract all the href values inside the div?

The href isn't an attribute of the div tag but of an a tag inside the div, which is why link.get('href') returns None.
You have to use href = link.find('a').get('href')
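To see the difference without hitting the live site, here is a minimal sketch that runs against an inline HTML snippet. The markup is made up for illustration and is only an assumption about how such a listing might look; the third div is there to show why it can be worth checking for a missing a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link per div, plus one div without a link.
html = """
<div class="img"><a href="/category/movie-one" title="Movie One"></a></div>
<div class="img"><a href="/category/movie-two" title="Movie Two"></a></div>
<div class="img">no link here</div>
"""

soup = BeautifulSoup(html, "html.parser")

div_hrefs = []   # what the question's code collects
link_hrefs = []  # what the suggested fix collects
for div in soup.find_all("div", attrs={"class": "img"}):
    div_hrefs.append(div.get("href"))  # the div itself has no href attribute
    a_tag = div.find("a")              # None when the div holds no link
    if a_tag is not None:
        link_hrefs.append(a_tag.get("href"))

print(div_hrefs)
print(link_hrefs)
```

The None check is optional on a page where every div is known to contain a link, but without it link.find('a').get('href') raises AttributeError on the first div that has none.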

Related

Scraping multiple pages in Python

I am trying to scrape a page that includes 12 links. I need to open each of these links and scrape all of the titles behind it. Each link leads to a listing that is split across multiple pages, but my code only scrapes the first page of each of the 12 links.
With the code below, I can collect the URLs of all 12 links on the main page.
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")
all_urls = []
for link in links[1:]:
    link_address = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
    all_urls.append(link_address)
Then I loop over all of them:
for i in range(0, 12):
    url = all_urls[i]
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
The titles can be extracted with the lines below:
title_news = []
news_div = soup.find_all('div', class_='article')
for container in news_div:
    title = container.h5.a.text
    title_news.append(title)
The output only includes the titles from the first page of each of the 12 links, but I need the code to go through every page under each of them.
The line below gives me the link to the next page when it is run inside an appropriate loop (it reads the pagination section and looks for the next-page URL):
page = soup.find('ul', {'class' : 'pagination'}).select('li', {'class': "page-link"})[2].find('a')['href']
How should I use this page variable in my code to walk through all the pages under each of the 12 links and read every title, not only those on the first page?

You can use this code to get all titles from all the pages:
import requests
from bs4 import BeautifulSoup
base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"
soup = BeautifulSoup(
    requests.get(base_url + "index.html").content, "html.parser"
)
title_news = []
for a in soup.select("#all a"):
    next_link = a["href"]
    print("Getting", base_url + next_link)
    while True:
        soup = BeautifulSoup(
            requests.get(base_url + next_link).content, "html.parser"
        )
        for title in soup.select("h5 a"):
            title_news.append(title.text)
        next_link = soup.select_one('a[aria-label="Next"]')["href"]
        if next_link == "#":
            break
print("Length of title_news:", len(title_news))
Prints:
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
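The same follow-the-Next-link idea can be tried offline. The sketch below is not the code from the answer: it swaps requests.get for a hypothetical fetch function backed by an in-memory dictionary, and the example.test URLs and page contents are invented. It also assumes the last page marks its Next link with "#", as in the answer, and additionally stops if the link is missing or a page repeats.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical in-memory "site" (URL -> HTML) standing in for requests.get().
BASE = "http://example.test/"
PAGES = {
    BASE + "jan-001.html": '<h5><a href="a1.html">First</a></h5>'
                           '<a aria-label="Next" href="jan-002.html">Next</a>',
    BASE + "jan-002.html": '<h5><a href="a2.html">Second</a></h5>'
                           '<a aria-label="Next" href="#">Next</a>',
}

def fetch(url):
    """Return the HTML for url from the in-memory site."""
    return PAGES[url]

def collect_titles(start_url, fetch):
    """Follow 'Next' links from start_url and gather every h5 link text."""
    titles = []
    seen = set()  # guards against pagination that loops back on itself
    url = start_url
    while url not in seen:
        seen.add(url)
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles.extend(a.get_text(strip=True) for a in soup.select("h5 a"))
        next_a = soup.select_one('a[aria-label="Next"]')
        # Stop when there is no Next link or it points nowhere.
        if next_a is None or next_a.get("href") in (None, "#"):
            break
        url = urljoin(url, next_a["href"])
    return titles

titles = collect_titles(BASE + "jan-001.html", fetch)
print(titles)
```

Passing fetch in as a parameter keeps the loop testable; for the real site it could be replaced by a small function that calls requests.get(url).text.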

How do I get href links from href using python/pandas

I already have a set of href links. I need to request each of those links and collect the hrefs found on the pages they point to. My code only gets the first level of hrefs; how can I follow each one and collect the links on the page behind it?
I tried:
from bs4 import BeautifulSoup
import requests
url = 'https://www.iea.org/oilmarketreport/reports/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#soup.prettify()
#table = soup.find("table")
#print(table)
links = []
for href in soup.find_all(class_='omrlist'):
    #print(href)
    links.append(href.find('a').get('href'))
print(links)

Here is how to loop to get the report URLs:
import requests
from bs4 import BeautifulSoup
root_url = 'https://www.iea.org'
def getLinks(url):
    all_links = []
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find_all(class_='omrlist'):
        all_links.append(root_url + href.find('a').get('href'))  # add prefix 'http://....'
    return all_links
yearLinks = getLinks(root_url + '/oilmarketreport/reports/')
# get report URL
reportLinks = []
for url in yearLinks:
    links = getLinks(url)
    reportLinks.extend(links)
print(reportLinks)
for url in reportLinks:
    if '.pdf' in url:
        url = url.replace('../../..', '')
        # do download pdf file
        ...
    else:
        # do extract pdf url from html and download it
        ...
...
Now you can loop over reportLinks to get the PDF URLs.
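The answer builds absolute URLs by prefixing root_url and stripping '../../..' by hand. A possible alternative, shown here only as a sketch, is urllib.parse.urljoin from the standard library, which resolves root-relative and '../' style hrefs against the page they were found on. The helper name resolve_links, the page URL, and the hrefs below are hypothetical examples, not paths taken from the real site.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Turn hrefs found on page_url into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]

# Hypothetical page URL and hrefs, for illustration only.
page_url = "https://www.iea.org/oilmarketreport/reports/2018/"
hrefs = [
    "/oilmarketreport/reports/2018/0118/",        # root-relative
    "../../../media/omrreports/2018-01-19.pdf",   # climbs three levels
    "https://www.iea.org/already/absolute.html",  # already absolute
]

absolute = resolve_links(page_url, hrefs)
for link in absolute:
    print(link)
```

Because urljoin needs the URL of the page the href came from, getLinks would have to pass its url argument along instead of the fixed root_url.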

No output in console python

from bs4 import BeautifulSoup
import requests
def imdb_spider():
    url = 'http://www.imdb.com/chart/top'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'secondaryInfo' }):
        href = link.get('href')
        print(href)
imdb_spider()
I'm trying to get the links of all top-rated movies from IMDb. I'm using PyCharm. The code runs for more than 30 minutes, but nothing is printed in my console.

You're correct that there's an element with class secondaryInfo for every movie title, but that's not the a element. If you want to find that, you have to use a different selector. For example, the following selector will do the trick instead of using soup.findAll().
soup.select('td.titleColumn a')
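As a small offline illustration of that selector, the sketch below parses an inline table. The markup is an assumption modeled on the class names mentioned in this thread (titleColumn and secondaryInfo) and may not match what the live page serves today.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modeled on the class names discussed above.
html = """
<table>
  <tr><td class="titleColumn">1. <a href="/title/tt0000001/">First Film</a>
      <span class="secondaryInfo">(1994)</span></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/tt0000002/">Second Film</a>
      <span class="secondaryInfo">(1972)</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The question's search: a tags with class secondaryInfo. The class sits
# on the span, so nothing matches and the loop body never runs.
no_match = soup.find_all("a", {"class": "secondaryInfo"})

# The suggested selector: every a tag nested inside td.titleColumn.
hrefs = [a["href"] for a in soup.select("td.titleColumn a")]

print(no_match)
print(hrefs)
```

An empty result list is also why the original script printed nothing: with no matches, the print call inside the loop is never reached.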

The problem is that the class secondaryInfo belongs to a <span> element, not to an <a> element, so searching for a tags with that class matches nothing.
So try this:
from bs4 import BeautifulSoup
import requests
def imdb_spider():
    url = 'http://www.imdb.com/chart/top'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for td in soup.findAll('td', {'class': 'titleColumn'}):
        href = td.find('a').get('href')
        print(href)
imdb_spider()

BeautifulSoup is not getting all data, only some

import requests
from bs4 import BeautifulSoup
def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orangecounty.craigslist.org/' + link.get('href')
            title = link.string
            print(title)
            #print(href)
            get_single_item_data(href)
        page += 1
def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id':'postingbody'}):
        print(item_name.string)
trade_spider(1)
I am trying to crawl craigslist (for practice), http://orangecounty.craigslist.org/search/foa?s=0 in particular. I have it right now set to print the title of the entry and the description of the entry. The issue is that although the title correctly prints for every object listed, the description is listed as "None" for most of them, even though there is clearly a description. Any help would be appreciated. Thanks.

You are almost there. Just change item_name.string to item_name.text

Instead of getting the .string, get the text of the posting body (worked for me):
item_name.get_text(strip=True)
As a side note, your script fetches pages one at a time and blocks on each request; you may speed things up dramatically by switching to the Scrapy web-scraping framework.
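The reason .string comes back as None for most postings is that it only returns a value when the tag has a single child; as soon as the body contains other tags, such as line breaks, there is no single string to return. A minimal sketch with made-up markup (the ids and text are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical posting bodies: one with plain text only, one with a <br>.
html = """
<section id="simple">Only text here</section>
<section id="mixed">First line<br>Second line</section>
"""

soup = BeautifulSoup(html, "html.parser")
simple = soup.find("section", {"id": "simple"})
mixed = soup.find("section", {"id": "mixed"})

simple_string = simple.string   # single text child, so the text is returned
mixed_string = mixed.string     # several children, so this is None
mixed_text = mixed.get_text(" ", strip=True)  # joins all text pieces

print(simple_string)
print(mixed_string)
print(mixed_text)
```

The first argument to get_text is the separator placed between the text pieces; leaving it out concatenates them directly.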

Python Web Crawler not printing any result

I am trying to create a simple Web Crawler in Python, and when I'm running it it's showing no errors but it's also not printing any results as intended.
I've put my current code below, could anyone please point me in the direction of the problem?
import requests
from bs4 import BeautifulSoup
def stepashka_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://online.stepashka.com/filmy/#/page/" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for resoult in soup.findAll("a", {"class": "video-title"}):
            href = resoult.get(href)
            print(href)
        page += 1
stepashka_spider(1)

The class "video-title" is on a div tag, not on an a tag, so the original search matches nothing. You also need to pass the string "href" rather than the undefined name href:
def stepashka_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://online.stepashka.com/filmy/#/page/" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for resoult in soup.findAll("div", {"class": "video-title"}):
            a_tag = resoult.a
            print(a_tag["href"])
        page += 1
stepashka_spider(1)
Output:
http://online.stepashka.com/filmy/komedii/37878-klub-grust.html
http://online.stepashka.com/filmy/dramy/37875-kadr.html
http://online.stepashka.com/filmy/multfilmy/37874-betmen-protiv-robina.html
http://online.stepashka.com/filmy/fantastika/37263-hrustalnye-cherepa.html
http://online.stepashka.com/filmy/dramy/34369-bozhiy-syn.html
http://online.stepashka.com/filmy/trillery/37873-horoshee-ubiystvo.html
http://online.stepashka.com/filmy/trillery/34983-zateryannaya-reka.html
http://online.stepashka.com/filmy/priklucheniya/37871-totem-volka.html
http://online.stepashka.com/filmy/fantastika/35224-zheleznaya-shvatka.html
http://online.stepashka.com/filmy/dramy/37870-bercy.html
You are also using the wrong URL format. In addition, we can use range instead of a while loop with a manual counter:
def stepashka_spider(max_pages):
    for page in range(1, max_pages + 1):
        url = "http://online.stepashka.com/filmy/page/{}/".format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        print("Movies for page {}".format(page))
        for resoult in soup.findAll("div", {"class": "video-title"}):
            a_tag = resoult.a
            print(a_tag["href"])
        print()
Output:
Movies for page 1
http://online.stepashka.com/filmy/dramy/37895-raskop.html
http://online.stepashka.com/filmy/semejnyj/36275-domik-v-serdce.html
http://online.stepashka.com/filmy/dramy/35371-enni.html
http://online.stepashka.com/filmy/trillery/37729-igra-na-vyzhivanie.html
http://online.stepashka.com/filmy/trillery/37893-vosstavshie-mertvecy.html
http://online.stepashka.com/filmy/semejnyj/30104-sedmoy-syn-seventh-son-2013-treyler.html
http://online.stepashka.com/filmy/dramy/37892-sekret-schastya.html
http://online.stepashka.com/filmy/uzhasy/37891-davayte-poohotimsya.html
http://online.stepashka.com/filmy/multfilmy/3404-specagent-archer-archer-archer-2010-2013.html
http://online.stepashka.com/filmy/trillery/37334-posledniy-reys.html
Movies for page 2
http://online.stepashka.com/filmy/komedii/37890-top-5.html
http://online.stepashka.com/filmy/komedii/37889-igra-v-doktora.html
http://online.stepashka.com/filmy/dramy/36651-vrozhdennyy-porok.html
http://online.stepashka.com/filmy/komedii/37786-superforsazh.html
http://online.stepashka.com/filmy/fantastika/35003-voshozhdenie-yupiter.html
http://online.stepashka.com/filmy/sport/37888-ufc-on-fox-15-machida-vs-rockhold.html
http://online.stepashka.com/filmy/semejnyj/37558-prizrak.html
http://online.stepashka.com/filmy/boeviki/36865-mordekay.html
http://online.stepashka.com/filmy/dramy/37884-stanovlenie-legendy.html
http://online.stepashka.com/filmy/trillery/37883-tainstvo.html
Movies for page 3
http://online.stepashka.com/filmy/dramy/37551-nochnoy-beglec.html
http://online.stepashka.com/filmy/dramy/37763-mech-drakona.html
http://online.stepashka.com/filmy/trillery/36471-paren-po-sosedstvu.html
http://online.stepashka.com/filmy/dramy/36652-amerikanskiy-snayper.html
http://online.stepashka.com/filmy/dramy/37555-feniks.html
http://online.stepashka.com/filmy/semejnyj/35156-gnezdo-drakona-vosstanie-chernogo-drakona.html
http://online.stepashka.com/filmy/kriminal/37882-ch-b.html
http://online.stepashka.com/filmy/priklucheniya/37881-admiral-bitva-za-men-ryan.html
http://online.stepashka.com/filmy/trillery/37880-malyshka.html
http://online.stepashka.com/filmy/trillery/36417-poteryannyy-ray.html
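One way to see why the original URL format could not page through the results: everything after # is a fragment, which is handled on the client side and is not sent to the server as part of an HTTP request. The sketch below uses only the standard library to split the two URL styles from this thread; it makes no network calls.

```python
from urllib.parse import urldefrag, urlsplit

hash_urls = [
    "http://online.stepashka.com/filmy/#/page/{}".format(page)
    for page in range(1, 4)
]
path_urls = [
    "http://online.stepashka.com/filmy/page/{}/".format(page)
    for page in range(1, 4)
]

# Dropping the fragment shows what the server is actually asked for.
requested_from_hash = {urldefrag(url).url for url in hash_urls}
requested_from_path = {urldefrag(url).url for url in path_urls}

print(requested_from_hash)        # a single URL, whatever the page number
print(len(requested_from_path))   # one distinct URL per page
print(urlsplit(hash_urls[1]).fragment)
print(urlsplit(path_urls[1]).path)
```

With the # style, the page number lives only in the fragment, so every iteration of the loop asks the server for the same listing.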

We Keep Coding
