BeautifulSoup is not getting all data, only some

import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class':'hdrlnk'}):
            href = 'http://orangecounty.craigslist.org/' + link.get('href')
            title = link.string
            print title
            #print href
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id':'postingbody'}):
        print item_name.string

trade_spider(1)
I am trying to crawl craigslist (for practice), http://orangecounty.craigslist.org/search/foa?s=0 in particular. I have it right now set to print the title of the entry and the description of the entry. The issue is that although the title correctly prints for every object listed, the description is listed as "None" for most of them, even though there is clearly a description. Any help would be appreciated. Thanks.

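You are almost there. Just change item_name.string to item_name.text. The .string attribute returns None when a tag has more than one child, which is probably the case for the posting bodies that print "None"; .text returns all the text inside the tag.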

Instead of getting the .string, get the text of the posting body (worked for me):
    item_name.get_text(strip=True)
As a side note, your script is blocking: it waits for each request to finish before starting the next one. You may speed things up dramatically by switching to the Scrapy web-scraping framework.
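To see why .string comes back as None, here is a minimal sketch on a made-up snippet. The HTML below is an assumption for illustration, not the real Craigslist markup. In Beautiful Soup, .string only returns a value when the tag has a single child, so a body that mixes text with <br> tags gives None, while get_text() still joins all the text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second section mixes text and a <br> tag.
html = """
<section id="simple">Only one text node</section>
<section id="postingbody">First line<br>Second line</section>
"""

soup = BeautifulSoup(html, "html.parser")
simple = soup.find("section", {"id": "simple"})
body = soup.find("section", {"id": "postingbody"})

simple_string = simple.string                # single child, so the text is returned
body_string = body.string                    # several children, so this is None
body_text = body.get_text(" ", strip=True)   # joins every text node with a space

print(simple_string)
print(body_string)
print(body_text)
```

The separator argument (" " here) is optional; without it the text nodes are concatenated directly.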

Related

How do I make this web crawler print only the titles of the songs?

import requests
from bs4 import BeautifulSoup

url = 'https://www.officialcharts.com/charts/singles-chart'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))

def chart_spider(max_pages):
    page = 1
    while page >= max_pages:
        url = "https://www.officialcharts.com/charts/singles-chart"
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {"class": "title"}):
            href = "BAD HABITS" + link.title(href)
            print(href)
        page += 1

chart_spider(1)
Wondering how to make this print just the titles of the songs instead of the entire page. I want it to go through the top 100 charts and print all the titles for now. Thanks

Here is a possible solution that modifies your code as little as possible:
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

URL = 'https://www.officialcharts.com/charts/singles-chart'

def chart_spider():
    source_code = requests.get(URL)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    # Each song title sits in a <div class="title">, not in an <a class="title">.
    for title in soup.find_all('div', {"class": "title"}):
        print(title.contents[1].string)

chart_spider()
The result is a list of all the titles found on the page, one per line.
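As a rough illustration of why contents[1] works there, the sketch below assumes each title div holds a whitespace text node, then the link, then another whitespace node. That markup is a guess for illustration, not copied from the site. It also shows get_text(strip=True), which does not depend on the position of the child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a newline, then the link, then another newline.
html = '<div class="title">\n<a href="/songs/one">BAD HABITS</a>\n</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="title")

children = div.contents                # whitespace node, <a> tag, whitespace node
by_index = children[1].string          # relies on the <a> being the second child
by_text = div.get_text(strip=True)     # independent of the whitespace nodes

print(len(children), by_index, by_text)
```

If the site ever changes the whitespace around the link, the index-based lookup would break while the get_text() version should keep working.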

If all you want is the title of each song in the top 100, this code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.officialcharts.com/charts/singles-chart/'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
titles = [i.text.replace('\n', '') for i in soup.find_all('div', class_="title")]
does what you are looking for: titles ends up as a list of the song titles.

You can do it like this.
The song title is inside a <div> tag with the class name title.
Select all those <div> tags with .find_all(). This gives you a list of all matching <div> tags.
Iterate over the list and print the text of each div.
from bs4 import BeautifulSoup
import requests

url = 'https://www.officialcharts.com/charts/singles-chart/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
d = soup.find_all('div', class_='title')
for i in d:
    print(i.text.strip())
Sample Output:
BAD HABITS
STAY
REMEMBER
BLACK MAGIC
VISITING HOURS
HAPPIER THAN EVER
INDUSTRY BABY
WASTED
.
.
.

Extract href value inside a div - beautifulsoup

I am trying to print the titles of all the anime listed at https://gogoanime.pe/anime-movies.html?aph=&page= with the following code from Bucky's tutorial:
def animmov(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://gogoanime.pe/anime-movies.html?aph=&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        sopas = BeautifulSoup(plain_text, features="html.parser")
        for link in sopas.find_all('div', attrs={'class':'img'}):
            href = link.get('href')
            print(href)
        page += 1
When I execute the code, it prints None for every item.
I have also tried to read the question here, but I can't follow it. How can I extract all the href values inside the div?

The href isn't an attribute of the div tag, but of an a tag within the div.
You have to use href = link.find('a').get('href')
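A minimal sketch of that fix on inline HTML follows. The markup and the paths in it are made up for illustration. It also guards against a div that contains no link, since find('a') returns None in that case and calling .get() on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: the link sits inside the div.
html = """
<div class="img"><a href="/category/movie-one" title="Movie One">poster</a></div>
<div class="img"><a href="/category/movie-two" title="Movie Two">poster</a></div>
<div class="img">no link here</div>
"""

soup = BeautifulSoup(html, "html.parser")
hrefs = []
for div in soup.find_all("div", attrs={"class": "img"}):
    a_tag = div.find("a")      # the <a> nested inside the <div>
    if a_tag is None:          # skip divs that contain no link
        continue
    hrefs.append(a_tag.get("href"))

print(hrefs)
```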

Scraping multiple pages in Python

I am trying to scrape a page that includes 12 links. I need to open each of these links and scrape all of their titles. Each link leads to a listing that is split across multiple pages, but my code only scrapes the first page of each of the 12 links.
With the code below, I can collect the URLs of all 12 links on the main page.
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")
all_urls = []
for link in links[1:]:
    link_address = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
    all_urls.append(link_address)
Then I loop over all of them.
for i in range(0, 12):
    url = all_urls[i]
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
The titles can be extracted with the lines below:
title_news = []
news_div = soup.find_all('div', class_='article')
for container in news_div:
    title = container.h5.a.text
    title_news.append(title)
The output of this code only includes the titles from the first page behind each of these 12 links, while I need my code to go through all the pages behind each of them.
The line below gives me the link to the next page when it is placed in an appropriate loop (it reads the pagination section and looks for the next-page URL):
page = soup.find('ul', {'class' : 'pagination'}).select('li', {'class': "page-link"})[2].find('a')['href']
How should I use this page variable in my code to go through all the pages behind each of these 12 links and read all the titles, not only the first-page titles?

You can use this code to get all titles from all the pages:
import requests
from bs4 import BeautifulSoup

base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"
soup = BeautifulSoup(
    requests.get(base_url + "index.html").content, "html.parser"
)
title_news = []
for a in soup.select("#all a"):
    next_link = a["href"]
    print("Getting", base_url + next_link)
    while True:
        soup = BeautifulSoup(
            requests.get(base_url + next_link).content, "html.parser"
        )
        for title in soup.select("h5 a"):
            title_news.append(title.text)
        # Follow the "Next" link until it points to "#", which marks the last page.
        next_link = soup.select_one('a[aria-label="Next"]')["href"]
        if next_link == "#":
            break
print("Length of title_news:", len(title_news))
Prints:
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
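The same "follow the Next link until it runs out" loop can be tried offline. The sketch below is only an illustration: BASE, PAGES and fetch() are hypothetical stand-ins for the real site and for requests.get(url).text, and the page markup is invented. It uses urljoin from the standard library to resolve the relative href against the current page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE = "http://example.com/news/"   # placeholder base URL, not the real site

# Hypothetical pages standing in for HTTP responses.
PAGES = {
    BASE + "jan-001.html": (
        '<h5><a href="#">A</a></h5>'
        '<a aria-label="Next" href="jan-002.html">Next</a>'
    ),
    BASE + "jan-002.html": (
        '<h5><a href="#">B</a></h5><h5><a href="#">C</a></h5>'
        '<a aria-label="Next" href="#">Next</a>'
    ),
}

def fetch(url):
    """Stand-in for requests.get(url).text in this sketch."""
    return PAGES[url]

def collect_titles(start_url):
    titles = []
    url = start_url
    while url:
        soup = BeautifulSoup(fetch(url), "html.parser")
        titles.extend(a.get_text(strip=True) for a in soup.select("h5 a"))
        next_a = soup.select_one('a[aria-label="Next"]')
        # Stop when there is no Next link or it points nowhere ("#").
        if next_a is None or next_a.get("href") in (None, "#"):
            url = None
        else:
            url = urljoin(url, next_a["href"])
    return titles

titles = collect_titles(BASE + "jan-001.html")
print(titles)
```

Checking for a missing Next link as well as for "#" keeps the loop from raising a TypeError on a page that has no pagination at all.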

No output in console python

from bs4 import BeautifulSoup
import requests

def imdb_spider():
    url = 'http://www.imdb.com/chart/top'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'secondaryInfo' }):
        href = link.get('href')
        print(href)

imdb_spider()
I'm trying to get the links of all top-rated movies from IMDb. I'm using PyCharm. The code runs for more than 30 minutes, but nothing is printed in my console.

You're correct that there's an element with class secondaryInfo for every movie title, but that's not the a element. If you want to find that, you have to use a different selector. For example, the following selector will do the trick instead of using soup.findAll().
soup.select('td.titleColumn a')
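A small offline sketch of the difference between the two selectors follows. The table row below is invented to match the description in this answer (a td with class titleColumn holding the link, plus a span with class secondaryInfo); it is not copied from the live page, whose markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical rows modelled on the description above.
html = """
<table>
<tr><td class="titleColumn">1. <a href="/title/tt0000001/">First Film</a>
<span class="secondaryInfo">(1994)</span></td></tr>
<tr><td class="titleColumn">2. <a href="/title/tt0000002/">Second Film</a>
<span class="secondaryInfo">(1972)</span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The question's selector: no <a> carries the class, so nothing matches.
no_matches = soup.find_all("a", {"class": "secondaryInfo"})

# The suggested selector: every <a> inside a td.titleColumn cell.
links = [a["href"] for a in soup.select("td.titleColumn a")]

print(len(no_matches))
print(links)
```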

The problem is that the class secondaryInfo belongs to a <span> element, not to an <a> element, so your findAll() call matches nothing.
So try this:
from bs4 import BeautifulSoup
import requests

def imdb_spider():
    url = 'http://www.imdb.com/chart/top'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    # The movie link is the <a> inside each <td class="titleColumn">.
    for td in soup.findAll('td', {'class': 'titleColumn'}):
        href = td.find('a').get('href')
        print(href)

imdb_spider()

Python Web Crawler not printing any result

I am trying to create a simple Web Crawler in Python, and when I'm running it it's showing no errors but it's also not printing any results as intended.
I've put my current code below, could anyone please point me in the direction of the problem?
import requests
from bs4 import BeautifulSoup

def stepashka_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://online.stepashka.com/filmy/#/page/" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for resoult in soup.findAll("a", {"class": "video-title"}):
            href = resoult.get(href)
            print(href)
        page += 1

stepashka_spider(1)

The class "video-title" is on a div tag, not on an a tag. You also need to pass the string "href" rather than the undefined name href:
import requests
from bs4 import BeautifulSoup

def stepashka_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "http://online.stepashka.com/filmy/#/page/" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for resoult in soup.findAll("div", {"class": "video-title"}):
            a_tag = resoult.a          # the link nested inside the div
            print(a_tag["href"])
        page += 1

stepashka_spider(1)
Output:
http://online.stepashka.com/filmy/komedii/37878-klub-grust.html
http://online.stepashka.com/filmy/dramy/37875-kadr.html
http://online.stepashka.com/filmy/multfilmy/37874-betmen-protiv-robina.html
http://online.stepashka.com/filmy/fantastika/37263-hrustalnye-cherepa.html
http://online.stepashka.com/filmy/dramy/34369-bozhiy-syn.html
http://online.stepashka.com/filmy/trillery/37873-horoshee-ubiystvo.html
http://online.stepashka.com/filmy/trillery/34983-zateryannaya-reka.html
http://online.stepashka.com/filmy/priklucheniya/37871-totem-volka.html
http://online.stepashka.com/filmy/fantastika/35224-zheleznaya-shvatka.html
http://online.stepashka.com/filmy/dramy/37870-bercy.html
You are also using the wrong URL format. In addition, we can use range instead of a while loop with a manual counter:
import requests
from bs4 import BeautifulSoup

def stepashka_spider(max_pages):
    for page in range(1, max_pages + 1):
        url = "http://online.stepashka.com/filmy/page/{}/".format(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        print("Movies for page {}".format(page))
        for resoult in soup.findAll("div", {"class": "video-title"}):
            a_tag = resoult.a
            print(a_tag["href"])
        print()

stepashka_spider(3)
Output:
Movies for page 1
http://online.stepashka.com/filmy/dramy/37895-raskop.html
http://online.stepashka.com/filmy/semejnyj/36275-domik-v-serdce.html
http://online.stepashka.com/filmy/dramy/35371-enni.html
http://online.stepashka.com/filmy/trillery/37729-igra-na-vyzhivanie.html
http://online.stepashka.com/filmy/trillery/37893-vosstavshie-mertvecy.html
http://online.stepashka.com/filmy/semejnyj/30104-sedmoy-syn-seventh-son-2013-treyler.html
http://online.stepashka.com/filmy/dramy/37892-sekret-schastya.html
http://online.stepashka.com/filmy/uzhasy/37891-davayte-poohotimsya.html
http://online.stepashka.com/filmy/multfilmy/3404-specagent-archer-archer-archer-2010-2013.html
http://online.stepashka.com/filmy/trillery/37334-posledniy-reys.html
Movies for page 2
http://online.stepashka.com/filmy/komedii/37890-top-5.html
http://online.stepashka.com/filmy/komedii/37889-igra-v-doktora.html
http://online.stepashka.com/filmy/dramy/36651-vrozhdennyy-porok.html
http://online.stepashka.com/filmy/komedii/37786-superforsazh.html
http://online.stepashka.com/filmy/fantastika/35003-voshozhdenie-yupiter.html
http://online.stepashka.com/filmy/sport/37888-ufc-on-fox-15-machida-vs-rockhold.html
http://online.stepashka.com/filmy/semejnyj/37558-prizrak.html
http://online.stepashka.com/filmy/boeviki/36865-mordekay.html
http://online.stepashka.com/filmy/dramy/37884-stanovlenie-legendy.html
http://online.stepashka.com/filmy/trillery/37883-tainstvo.html
Movies for page 3
http://online.stepashka.com/filmy/dramy/37551-nochnoy-beglec.html
http://online.stepashka.com/filmy/dramy/37763-mech-drakona.html
http://online.stepashka.com/filmy/trillery/36471-paren-po-sosedstvu.html
http://online.stepashka.com/filmy/dramy/36652-amerikanskiy-snayper.html
http://online.stepashka.com/filmy/dramy/37555-feniks.html
http://online.stepashka.com/filmy/semejnyj/35156-gnezdo-drakona-vosstanie-chernogo-drakona.html
http://online.stepashka.com/filmy/kriminal/37882-ch-b.html
http://online.stepashka.com/filmy/priklucheniya/37881-admiral-bitva-za-men-ryan.html
http://online.stepashka.com/filmy/trillery/37880-malyshka.html
http://online.stepashka.com/filmy/trillery/36417-poteryannyy-ray.html
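A likely reason the original URL format did not page through the results is that everything after "#" is a URL fragment, which is not sent to the server in an HTTP request. The sketch below uses only the standard library to split the two URL formats from this thread; it makes no network calls and says nothing about how the site itself handles the fragment in a browser.

```python
from urllib.parse import urlsplit

question_url = "http://online.stepashka.com/filmy/#/page/2"
answer_url = "http://online.stepashka.com/filmy/page/2/"

q = urlsplit(question_url)
a = urlsplit(answer_url)

# Everything after '#' is the fragment; it is not part of the request path.
print(repr(q.path), repr(q.fragment))
print(repr(a.path), repr(a.fragment))
```

With the fragment-style URL the path stays "/filmy/" whatever the page number is, so every request would ask the server for the same first page.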
