How to avoid getting broken words while webcrawling

How to avoid getting broken words while webcrawling - python

I'm trying to web crawl movie titles from this website: https://www.the-numbers.com/market/2019/top-grossing-movies
And keep getting broken word like "John Wick: Chapter 3 â€” ".
this is the picture:
This is the code:
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
print(movie_list[i].text)
And these are the outputs:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 â€” Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
PokÃ©mon: Detective Pikachu
Once Upon a Timeâ€¦in Hollywo…
I want to know why I keep getting these broken words and how to fix this!

Due to this page is server-render, you could request those page separately when the title getting broken.(Also don't forget to get the title by regex, because the title of its page contain the publication date.)
Try code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")
movie_list = html.select("#page_filling_chart table tr > td > b > a") # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
raw.encoding = 'utf-8'
html = BeautifulSoup(raw.text, "html.parser")
print(html.select_one("#main > div > h1").text)
That's gave me:
Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....

You need to handle the strings like this, the solution code is:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url,
headers={'User-Agent':'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")
movie_list = html.select("#page_filling_chart table tr > td > b > a") #"#page_filling_chart > table > tbody > tr > td > b"
import unicodedata
for i in range(len(movie_list)):
movie_name = movie_list[i].text
print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())
The output is like this:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
.
.
Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...

Related

Inconsistent results while web scraping using beautiful soup

I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Let's say we have a webpage with 42 ads, the code works just fine for only 19 ads then it returns:
Traceback (most recent call last):
File "main.py", line 53, in <module>
title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
If you started the code to process ads starting from a different ad number, let's say 5, it will also process the first 19 ads then raises the same error!
Here is a minimum code to show the issue I am having. Please note that this code will print the HTML for a functioning ad and also for the one with the error. What is printed is so different.
Run the code then change the value of i to see the results.
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
import traceback
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop throw ads
i = 1 # change to start from a different ad (don't put zero)
for container in containers[i:]:
print(f'Ad No.: {i}\n')
i += 1
# Get the link for this specific ad
ad_link_container = container.find('div', {'class': 'title'})
ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
print(ad_link)
single_ad = uReq(ad_link)
# parses html into a soup data structure to traverse html
page_soup2 = soup(single_ad.read(), "html.parser")
single_ad.close()
# Title
real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
# Print one functioning ad html
if print_functioning_ad:
print_functioning_ad = False
print(page_soup2)
print('real state title type', type(real_state_title))
try:
title = real_state_title.div.h1.text.strip()
print(title)
except Exception:
print(traceback.format_exc())
print(page_soup2)
break
print('____________________________________________________________')
Edit 1:
In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am not only getting the title but also every other info about the ad. So I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason, this happens ONLY for 19 ads regardless which one I started with. This is driving my nuts!

To get all pages from the URL you can use next example:
import requests
from bs4 import BeautifulSoup
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
page = 1
while True:
print("Page {}...".format(page))
print("-" * 80)
soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
for i, a in enumerate(soup.select("a.title"), 1):
print(i, a.get_text(strip=True))
next_url = soup.select_one('a[title="Next"]')
if not next_url:
break
print()
page += 1
page_url = "https://www.kijiji.ca" + next_url["href"]
Prints:
Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ

I think I figured out the problem here. I seems like you can't make a lot of requests in a short period of time, so I added a try: except: statement where a time sleep of 80 second is issued when this error occurs, this fixed my problem!
You may want to change the sleep time period to a different value depends on the website you are trying to scrape from.
Here is the modified code:
from bs4 import BeautifulSoup as soup # HTML data structure
from urllib.request import urlopen as uReq # Web client
import traceback
import time
page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop throw ads
i = 1 # change to start from a different ad (don't put zero)
for container in containers[i:]:
print(f'Ad No.: {i}\n')
i = i + 1
# Get the link for this specific ad
ad_link_container = container.find('div', {'class': 'title'})
ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
print(ad_link)
single_ad = uReq(ad_link)
# parses html into a soup data structure to traverse html
page_soup2 = soup(single_ad.read(), "html.parser")
single_ad.close()
# Title
real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
try:
title = real_state_title.div.h1.text.strip()
print(title)
except AttributeError:
print(traceback.format_exc())
i = i - 1
t = 80
print(f'----------------------------Sleep for {t} seconds!')
time.sleep(t)
continue
print('____________________________________________________________')

Extracting title from link in Python (Beautiful soup)

I am new to Python and I'm looking to extract the title from a link. So far I have the following but have hit a dead end:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
tonight = book_list[0]
for book in book_list:
price = book.find(class_="price_color").get_text()
title = book.find('a')
print (price)
print (title.contents[0])

To extract title from links, you can use title attribute.
Fore example:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.select('h3 > a'):
print(a['title'])
Prints:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas

you can use it:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
tonight = book_list[0]
for book in book_list:
price = book.find(class_="price_color").get_text()
title = book.select_one('a img')['alt']
print (title)
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red...

By just modifying your existing code you can use the alt text which contains the book titles in your example.
print (title.contents[0].attrs["alt"])

get review rating bubble from tripadvisor

how to get review from tripadvisor
this my code using beautifulsoup
review_data = data.find_all('div', attrs={'class':'reviews-tab'})
for review in review_data:
namareview = review.findNext('a', attrs={'class':'ui_header_link social-member-event-MemberEventOnObjectBlock__member--35-jC'})[0].text.strip()
ratingreview =
tittlereview = data.find_all('a', attrs={'class':'location-review-review-list-parts-ReviewTitle__reviewTitleText--2tFRT'})[0].text.strip()
print (namareview)
and how to get value rating review from bubble rating
<span class="ui_bubble_rating bubble_30"></span>
this my code now
from bs4 import BeautifulSoup
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
url = "https://www.tripadvisor.com/Attraction_Review-g297722-d6611509-Reviews-Pahawang_Island-Bandar_Lampung_Lampung_Sumatra.html"
response = requests.get(url)
data = BeautifulSoup(response.text, "html.parser")
print(data.title.text)
nama = data.find_all('h1', attrs={'class':'ui_header h1'})[0].text.strip()
print (nama)
category = data.find_all('div', attrs={'class':'attractions-attraction-review-header-AttractionLinks__detail--2-xvX'})[0].text.strip()
print (category)
location= data.find_all('div', attrs={'class':'attractions-contact-card-ContactCard__contactRow--3Ih6v'})[0].text.strip()
print (location)
review_data = data.find_all('div', attrs={'class':'reviews-tab'})
for review in review_data:
namareview = review.findNext('a', attrs={'class':'ui_header_link social-member-event-MemberEventOnObjectBlock__member--35-jC'})[0].text.strip()
bubblereview=
tittlereview = data.find_all('a', attrs={'class':'location-review-review-list-parts-ReviewTitle__reviewTitleText--2tFRT'})[0].text.strip()
print (namareview,bubblereview,tittlereview)
thistime my full code

Tripadvisor is a tricky site to scrape. But not impossible. Not sure what you are after, but you can work/look at parsing the json within the script tags:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import json
url = "https://www.tripadvisor.co.uk/Attraction_Review-g297722-d6611509-Reviews-Pahawang_Island-Bandar_Lampung_Lampung_Sumatra.html"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'}
response = requests.get(url, headers=headers)
data = BeautifulSoup(response.text, "html.parser")
print(data.title.text)
nama = data.find_all('h1', attrs={'class':'ui_header h1'})[0].text.strip()
print (nama)
category = data.find_all('div', attrs={'class':'attractions-attraction-review-header-AttractionLinks__detail--2-xvX'})[0].text.strip()
print (category)
location= data.find_all('div', attrs={'class':'attractions-contact-card-ContactCard__contactRow--3Ih6v'})[0].text.strip()
print (location)
# Get Total count of reviews
data = BeautifulSoup(response.text, "html.parser")
reviewDataIDs = []
scripts = data.find_all('script')
for script in scripts:
if 'window.__WEB_CONTEXT__=' in script.text:
jsonStr = script.text
jsonStr = jsonStr.split('window.__WEB_CONTEXT__={pageManifest:')[-1]
iterateJson = True
while iterateJson == True:
try:
jsonData = json.loads(jsonStr + '}')
iterateJson = False
except:
jsonStr = jsonStr.rsplit('}',1)[0]
raiseError = True
for k, v in jsonData['urqlCache'].items():
try:
totalCount = jsonData['urqlCache'][k]['data']['locations'][0]['reviewListPage']['totalCount']
raiseError = False
reviewDataIDs.append(k)
break
except:
pass
def getJsonData(reviewCount, reviewDataIDs, continueLoop):
while continueLoop == True:
url = "https://www.tripadvisor.co.uk/Attraction_Review-g297722-d6611509-Reviews-or%s-Pahawang_Island-Bandar_Lampung_Lampung_Sumatra.html#REVIEWS" %reviewCount
response = requests.get(url, headers=headers)
data = BeautifulSoup(response.text, "html.parser")
scripts = data.find_all('script')
for script in scripts:
if 'window.__WEB_CONTEXT__=' in script.text:
jsonStr = script.text
jsonStr = jsonStr.split('window.__WEB_CONTEXT__={pageManifest:')[-1]
iterateJson = True
while iterateJson == True:
try:
jsonData = json.loads(jsonStr + '}')
iterateJson = False
except:
jsonStr = jsonStr.rsplit('}',1)[0]
raiseError = True
for k, v in jsonData['urqlCache'].items():
try:
reviewData = jsonData['urqlCache'][k]['data']['locations'][0]['reviewListPage']['reviews']
raiseError = False
if k not in reviewDataIDs:#
continueLoop = False
reviewDataIDs.append(k)
break
except:
pass
if raiseError == True:
raise ValueError ('Data could not be found.')
if continueLoop == False:
return reviewData, reviewDataIDs
# Get Reviews
for reviewCount in list(range(0,totalCount,5)):
reviewData, reviewDataIDs = getJsonData(reviewCount, reviewDataIDs, continueLoop=True)
for each in reviewData:
rating = each['rating']
title = each['title']
text = each['text']
user = each['username']
print ('Name: %s\nTitle: %s\nRating: %s\nReview: %s\n' %(user, title, rating, text) + '-'*70 + '\n')
Output:
Name: Hamdan O
Title: Great for snorkelling and beach fun
Rating: 4
Review: Get a boat from Ketapang Jetty. There were 4 piers with lots of boats to choose from. Choose from traditional wooden boats which are cheaper but slow paced or higher priced fast fiberglass speed boats. We haggled for a fast speed boat to take us snorkelling and island hopping for half a day at 700K Rupiah. We got it from Pak Yayat at Pier 2. Pahawang is excellent for snorkelling. Just off shore the island the residents built platforms with small food/drink booths. They moored the boats there as bases for snorkelling. You can hop from one platform to another. Fantastic ideas to preserve the corals but unfortunately the inexperienced snorkellers ravaged through some of the patches closer to te beach. Great for an overnight trip as well at some of the local folks' homestays on the island.
----------------------------------------------------------------------
Name: PaulusKK
Title: he Trip is just So So
Rating: 3
Review: the boat trip to Pahawang island to me is a bit unsafe, it was a small wooden boat, and the journey was bumpy with high waves, and the island itself almost have no attraction, and the lunch provided there was not good, I only enjoy the fresh coconut water.
----------------------------------------------------------------------
Name: damarwianggo
Title: Pahawang is awesome
Rating: 5
Review: It was a story that Pahawang Island is great place to visit. Then, when I had a chance to accompany students from SMAK IPEKA Palembang to visit Pahawang Island in Lampung, Pahawang is truly exciting. Our one-day-trip to Pahawang was really extraordinary. Moreover, all the students were really excited to join all activities during the trip. The guide helped us to enjoy the trip.
----------------------------------------------------------------------
Name: deddy p
Title: Awesome
Rating: 5
Review: One word i can tell about Pahawang..... Superb. Clean water, beautiful corals. Hope you can help to take care this beautiful environment. Keep it clean.....stay away from plastic.
----------------------------------------------------------------------
Name: kristi0308
Title: Clean beach
Rating: 3
Review: I felt like in pulau pari seribu island for the view
The corals are dead but i saw lots of babies baracudas and a huge purple jellyfish and still got so many pretty little fish
Water are clean and people are not careless about environment as it was very clean when i swam in the island
Thanks to my boat man i paid him only 400k just for a day trip by myself
Paid boat parking every time i move like around 15-20k
And snorkel gear for 30k
----------------------------------------------------------------------

The value of the bubble rating is represented as the number at the end of the class name. Each bubble has a value of 10, so ui_bubble_rating bubble_30 is a rating with 3 out of 5 bubbles filled. Likewise ui_bubble_rating bubble_45 will have 4.5 out of 5 bubbles filled. You can find all these instances with a regex since the number changes.
bubblereview = data.find_all('span', {'class': re.compile('ui_bubble_rating bubble_\d*')})
The resulting list:
[<span class="ui_bubble_rating bubble_45"></span>,
<span class="ui_bubble_rating bubble_45"></span>,
<span class="ui_bubble_rating bubble_40"></span>,
<span class="ui_bubble_rating bubble_30"></span>,
<span class="ui_bubble_rating bubble_50"></span>,
<span class="ui_bubble_rating bubble_50"></span>,
<span class="ui_bubble_rating bubble_30"></span>,
<span class="ui_bubble_rating bubble_40"></span>,
<span class="ui_bubble_rating bubble_35"></span>,
<span class="ui_bubble_rating bubble_40"></span>,
<span class="ui_bubble_rating bubble_40"></span>,
<span class="ui_bubble_rating bubble_45"></span>,
<span class="ui_bubble_rating bubble_40"></span>]
You can filter out the ratings like this:
ratings = re.findall('\d+', ''.join(map(str, bubblereview)))
# ['45', '45', '40', '30', '50', '50', '30', '40', '35', '40', '40', '45', '40']

Try this loop:
for review in data.select("div[class*='SingleReview']"):
title= review.select_one(":scope a > span > span").get_text()
buble_tag = review.select_one(":scope span[class*='bubble']")
raiting = buble_tag["class"][-1].split("_")[-1]
print(f"({raiting}){title}")

This also works and for the only 5 reviews needed per page...
bubblereview = soup.find_all('div', {'class': re.compile('nf9vGX55')})

how do i get the next tag

I am trying to get the headlines that are in between a class. the headlines are wrapped around the h2 tag. headlines come after the tag.
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
mytags = mydivs.findNext('h2')
for tag in mytags:
print(tag.text.strip())

You must iterate through mydivs to use findNext()
mydivs is a list of web elements. findNextonly applies to a single web element. You must iterate through the divs and run findNext on each of them.
Just add this line
for div in mydivs:
and put it before
mytags = div.findNext('h2')
Here is the full code for your working program:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for div in mydivs:
mytags = div.findNext('h2')
for tag in mytags:
print(tag.strip())

Try replacing the last 3 lines with:
for div in mydivs:
mytags = div.findNext('h2')
for tag in mytags:
print(tag.strip())

soup.findAll() returns a list (or None), so you cannot call findNext() on it. However, you can iterate the tags and call find_next() on each tag separately:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
print(tag.find_next('h2').get_text(strip=True))
Prints:
BREAKING: Another federal lawmaker dies in Dubai hospital
Cross-Over Night: Enugu Govt bans burning of tyres on roads
Dadiyata: DSS breaks silence as Nigerian govt critic remains missing
CAC: Nigerian govt appoints new Acting Registrar-General
What Buhari told me – Dabiri-Erewa
What soldiers should expect in 2020 – Buratai
Only earthquake can erase Amosun’s legacies in Ogun – Akinlade
Civil War: Militia leader sentenced to 20yrs in prison
2020: Prophet Omale releases prophecies on Buhari, Aisha, Kyari, govs, coup plot
BREAKING: EFCC arrests Shehu Sani
Armed Forces Day: Yobe Governor Buni, donates N40 million for emblem appeal fund
Zamfara govt bans illegal gathering in the state
Agbenu Kacholalo: Colours of culture at Idoma International Carnival 2019 [PHOTOS]
Men of God are too fearful, weak to challenge government activities
2020: Peter Obi sends message to Nigerians
TETFUND: EFCC, ICPC asked to probe agency over alleged corruption
Two inmates regain freedom from Uyo prison
Buhari meets President of AfDB, Adeshina at Aso Rock
New Kogi CP resumes office, promises crime free state
Nothing stops you from paying N30,000 minimum wage to workers – APC challenges Makinde
EDIT: This script will scrape headlines from several pages:
import requests
from bs4 import BeautifulSoup
url = 'https://dailypost.ng/hot-news/page/{}/'
for page in range(1, 5): # <-- change how many pages do you want
print('Page no.{}'.format(page))
soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
print(tag.find_next('h2').get_text(strip=True))
print('-' * 80)

Simple web scraper formatting, how could I fix this?

I have this code:
import requests
from bs4 import BeautifulSoup
def posts_spider():
url = 'http://www.reddit.com/r/nosleep/new/'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('a', {'class': 'title'}):
href = "http://www.reddit.com" + link.get('href')
title = link.string
print(title)
print(href)
print("\n")
def get_single_item_data():
item_url = 'http://www.reddit.com/r/nosleep/new/'
source_code = requests.get(item_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for rating in soup.findAll('div', {'class': 'score unvoted'}):
print(rating.string)
posts_spider()
get_single_item_data()
The output is:
My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/
Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/
I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/
Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/
I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/
To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/
The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/
Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/
The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/
Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/
An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/
A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/
Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/
What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/
I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/
The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/
Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/
Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/
Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/
Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/
It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/
The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/
You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/
The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/
Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/
•
•
•
•
•
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13
What I want to do is, to place the matching rating for each post right next to it, so I could tell instantly how much rating does that post have, instead of printing the titles and links in 1 "block" and the rating numbers in another "block".
Thanks in advance for the help!

You can do it in one go by iterating over div elements with class="thing" (think about it as iterating over posts). For each div, get the link and rating:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
def posts_spider():
url = 'http://www.reddit.com/r/nosleep/new/'
soup = BeautifulSoup(requests.get(url).content)
for thing in soup.select('div.thing'):
link = thing.find('a', {'class': 'title'})
rating = thing.find('div', {'class': 'score'})
href = urljoin("http://www.reddit.com", link.get('href'))
print(link.string, href, rating.string)
posts_spider()
FYI, div.thing is a CSS Selector that matches all divs with class="thing".

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to avoid getting broken words while webcrawling - python

Related

Inconsistent results while web scraping using beautiful soup

Extracting title from link in Python (Beautiful soup)

get review rating bubble from tripadvisor

how do i get the next tag

Simple web scraper formatting, how could I fix this?

Categories

Resources