Inconsistent results while web scraping using Beautiful Soup - Python

I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Say we have a web page with 42 ads; the code works fine for only 19 ads, then it returns:
Traceback (most recent call last):
  File "main.py", line 53, in <module>
    title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
If I start the code from a different ad number, say 5, it still processes exactly 19 ads and then raises the same error!
Here is a minimal example that shows the issue I am having. Note that this code prints the HTML both for a functioning ad and for the one with the error; what is printed is very different.
Run the code, then change the value of i to see the results.
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# Open the connection and download the HTML page from the URL
uClient = uReq(page_url)
# Parse the HTML into a soup data structure to traverse
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# Find each ad on the Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads on this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)
    # Parse the ad's HTML into a soup data structure
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()
    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    # Print one functioning ad's HTML
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)
    print('real state title type', type(real_state_title))
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        print(traceback.format_exc())
        print(page_soup2)
        break
    print('____________________________________________________________')
Edit 1:
In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am getting not only the title but every other piece of info about the ad, so I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason it works for ONLY 19 ads, regardless of which one I start with. This is driving me nuts!
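To see what is actually coming back when it fails, here is a minimal defensive-parsing sketch (the get_title helper and the 300-character snippet length are my own choices, not from the original code): since find() returns None when the element is missing, you can guard the attribute access and dump the start of the page instead of crashing.

def get_title(ad_soup):
    # Hypothetical guard: find() returns None when the expected markup is missing
    title_container = ad_soup.find('div', {'class': 'realEstateTitle-1440881021'})
    if title_container is None:
        # Unexpected page: print the first characters to see whether it is
        # an error or blocking page rather than a real ad
        print(ad_soup.get_text()[:300])
        return None
    return title_container.div.h1.text.strip()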

To get all pages from the URL, you can use the following example:
import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)

    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")

    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))

    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break

    print()
    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]
Prints:
Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ

I think I figured out the problem here. It seems you can't make many requests in a short period of time, so I added a try/except block that sleeps for 80 seconds and then retries the ad whenever this error occurs, and that fixed my problem!
You may want to change the sleep period to a different value, depending on the website you are trying to scrape.
Here is the modified code:
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback
import time

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# Open the connection and download the HTML page from the URL
uClient = uReq(page_url)
# Parse the HTML into a soup data structure to traverse
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# Find each ad on the Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads on this web page
print(f'Number of ads in this web page is {len(containers)}')

# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    while True:
        single_ad = uReq(ad_link)
        # Parse the ad's HTML into a soup data structure
        page_soup2 = soup(single_ad.read(), "html.parser")
        single_ad.close()
        # Title
        real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
        try:
            title = real_state_title.div.h1.text.strip()
            print(title)
            break  # parsed successfully, move on to the next ad
        except AttributeError:
            print(traceback.format_exc())
            # Most likely rate limited: wait, then retry the same ad
            t = 80
            print(f'----------------------------Sleep for {t} seconds!')
            time.sleep(t)
    print('____________________________________________________________')
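A fixed 80-second sleep works, but an alternative is to back off gradually and give up after a few attempts. A minimal sketch of that idea, where the fetch_ad helper, the delay values, and the retry cap are my own assumptions rather than part of the original code:

import time
from urllib.request import urlopen

def fetch_ad(ad_link, retries=4, base_delay=10):
    # Hypothetical helper: fetch an ad page, doubling the wait after each
    # attempt that doesn't look like a real ad page.
    for attempt in range(retries):
        page = urlopen(ad_link)
        html = page.read()
        page.close()
        if b'realEstateTitle' in html:  # crude check for the expected markup
            return html
        time.sleep(base_delay * (2 ** attempt))  # waits 10s, 20s, 40s, 80s
    return None  # still blocked after all retries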

Related

Scraping data for Company details

I am trying to scrape the company name, postcode, phone number, and web address from:
https://www.matki.co.uk/matki-dealers/ I am finding it difficult because the information is only revealed after clicking a region on the page. Any help would be much appreciated; I am very new to Python and especially to scraping!
!pip install beautifulsoup4
!pip install urllib3
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = "https://www.matki.co.uk/matki-dealers/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
I guess this is what you wanted to do (you can then put the results in a file or a database, or parse them further and use them directly):
import requests
from bs4 import BeautifulSoup

URL = "https://www.matki.co.uk/matki-dealers/"
page = requests.get(URL)

# Parse the HTML
soup = BeautifulSoup(page.content, "html.parser")

# Extract the HTML results
results = soup.find(class_="dealer-region")
company_elements = results.find_all("article")

# Loop through the results and extract the wanted information
for company_element in company_elements:
    # Some cleanup before printing the info
    company_info = company_element.getText(separator=u', ').replace('Find out more »', '')
    # The results...
    print(company_info)
Output:
ESP Bathrooms & Interiors, Queens Retail Park, Queens Street, Preston, PR1 4HZ, 01772 200400, www.espbathrooms.co.uk
Paul Scarr & Son Ltd, Supreme Centre, Haws Hill, Lancaster Road A6, Carnforth, LA5 9DG, 01524 733788,
Stonebridge Interiors, 19 Main Street, Ponteland, NE20 9NH, 01661 520251, www.stonebridgeinteriors.com
Bathe Distinctive Bathrooms, 55 Pottery Road, Wigan, WN3 5AA, www.bathe-showroom.co.uk
Draw A Bath Ltd, 68 Telegraph Road, Heswall, Wirral, CH60 7SG, 0151 342 7100, www.drawabath.co.uk
Acaelia Home Design, Unit 4 Fence Avenue Industrial Estate, Macclesfield, Cheshire, SK10 1LT, 01625 464955, www.acaeliahomedesign.co.uk
...
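As a follow-up to the note above about putting the results in a file, here is a minimal sketch that writes one CSV row per dealer (the dealers.csv filename and the | separator are my own choices):

import csv
import requests
from bs4 import BeautifulSoup

URL = "https://www.matki.co.uk/matki-dealers/"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
company_elements = soup.find(class_="dealer-region").find_all("article")

# Use an unambiguous separator so commas inside fields don't split columns,
# then write one row per dealer
with open("dealers.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for company_element in company_elements:
        text = company_element.getText(separator='|').replace('Find out more »', '')
        writer.writerow(part.strip() for part in text.split('|') if part.strip())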

How to get all products from a BeautifulSoup page

I want to get all the products on this page:
nike.com.br/snkrs#estoque
My Python code is this:
import requests
from bs4 import BeautifulSoup as bs4

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if link["href"] not in produtos:
            request = requests.get(link["href"])
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
Guys, this code gets products from the page, but not all of them. I suspect the content is dynamic. How can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to change my code much because it's almost done. How do I do that?
DO NOT call requests.get repeatedly if you are dealing with the same host. Use a requests.Session instead, so the underlying TCP connection is reused across requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 for the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))

How to avoid getting broken words while web crawling

I'm trying to web crawl movie titles from this website: https://www.the-numbers.com/market/2019/top-grossing-movies
And I keep getting broken words like "John Wick: Chapter 3 — ".
This is the code:
import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
    print(movie_list[i].text)
And these are the outputs:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 — Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
Pokémon: Detective Pikachu
Once Upon a Time…in Hollywo…
I want to know why I keep getting these broken words and how to fix this!
Because this page is server-rendered with the long titles already truncated, you need to request each movie's page separately to get the full title. (Also, don't forget to extract the title with a regex, because the title on each movie's page contains the release year.)
Try the code below:
import requests
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
    raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
    raw.encoding = 'utf-8'
    html = BeautifulSoup(raw.text, "html.parser")
    print(html.select_one("#main > div > h1").text)
That gives me:
Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....
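Following the regex note above, a minimal sketch of stripping the trailing release year from each fetched title (the exact pattern is my own assumption, based on the "(2019)" suffix visible in the output):

import re

def clean_title(title):
    # Strip a trailing "(YYYY)", e.g. "Avengers: Endgame (2019)" -> "Avengers: Endgame"
    return re.sub(r'\s*\(\d{4}\)\s*$', '', title)

print(clean_title("Avengers: Endgame (2019)"))  # Avengers: Endgame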
You need to handle the Unicode characters in the strings; the solution code is:
import requests
import unicodedata
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for i in range(len(movie_list)):
    movie_name = movie_list[i].text
    # Normalize to NFKD, then drop anything that won't encode to ASCII
    print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())
The output is like this:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
...
Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...

Web Scraping - Get to Page 2

How do I get to page two of the data sets? No matter what I do, it only returns page 1.
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myURL = 'https://jobs.collinsaerospace.com/search-jobs/'
uClient = uReq(myURL)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
container = page_soup.findAll("section", {"id": "search-results"}, {"data-current-page": "4"})

for child in container:
    for heading in child.find_all('h2'):
        print(heading.text)
The site actually uses JSON to return the HTML containing all of the entries. The API for this allows a page number to be specified, as well as the number of records to return per page; increasing the latter will further increase the speed.
The returned JSON contains three keys: filter information, the results HTML, and a flag indicating whether jobs were returned. This last entry can be used to signal when you have reached the end of the pages.
You might want to look at the very popular Python requests library, which simplifies generating the correct URLs for you and is also fast.
import requests
from bs4 import BeautifulSoup as soup

params = {
    "CurrentPage": 1,
    "RecordsPerPage": 100,
    "SearchResultsModuleName": "Search Results",
    "SearchFiltersModuleName": "Search Filters",
    "SearchType": 5,
}

myURL = 'https://jobs.collinsaerospace.com/search-jobs/results'

page = 1
more_jobs = True

while more_jobs:
    print(f"\nPage {page}")
    params['CurrentPage'] = page
    req = requests.get(myURL, params=params)
    json = req.json()
    page_soup = soup(json['results'], "html.parser")
    container = page_soup.findAll("section", {"id": "search-results"}, {"data-current-page": "4"})
    for child in container:
        for heading in child.find_all('h2'):
            print(heading.text)
    more_jobs = json['hasJobs']  # Did this return any jobs?
    page += 1
Try the following script to get the results from whatever pages you are interested in. All you need to do is change the range to suit your requirement. I could have used a while loop to exhaust the whole content, but that is not what you asked.
import requests
from bs4 import BeautifulSoup

link = 'https://jobs.collinsaerospace.com/search-jobs/results?'

params = {
    'CurrentPage': '',
    'RecordsPerPage': 15,
    'Distance': 50,
    'SearchResultsModuleName': 'Search Results',
    'SearchFiltersModuleName': 'Search Filters',
    'SearchType': 5
}

for page in range(1, 5):  # Change the range to get the results from whatever pages you want
    params['CurrentPage'] = page
    res = requests.get(link, params=params)
    soup = BeautifulSoup(res.json()['results'], "lxml")
    for name in soup.select("h2"):
        print(name.text)
Try this:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

for letter in range(10):
    myURL = 'https://jobs.collinsaerospace.com/search-jobs/' + str(letter) + ' '
    uClient = uReq(myURL)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")
    container = page_soup.findAll("section", {"id": "search-results"}, {"data-current-page": "4"})
    for child in container:
        for heading in child.find_all('h2'):
            print(heading.text)
Output of the first 3 pages:
0
SYSTEMS / APPLICATIONS ENGINEER
Data Scientist
Sr Engineer, Drafter/Product Definition
Finance and Accounting Intern
Senior Software Engineer - CT3
Intern Manufacturing Engineer
Staff Eng., Reliability Engineering
Software Developer
Configuration Management Specialist
Disassembler I--2nd Shift
Disassembler I--3rd Shift
Manager, Supplier Performance
Manager, Supplier Performance
Assoc Eng, Mfg Engrg-Ops, ME P1
Manager, Supplier Performance
1
Assembly Operator (UK7014) 1 1 1 1
Senior Administrator (DF1040) 1 1 1
Tester 1
Assembler 1
Assembler 1
Finisher 1
Painter 1
Technician 1 Manufacturing/Operations
Assembler 1 - 1st Shift
Supply Chain Analyst 1
Assembler (W7006) 1
Assembler (W7006) 1
Supplier Quality Engineer 1
Supplier Inspection Engineer 1
Assembler 1 - 1st Shift
2
Assembler I-FAA-2
Senior/Business Analyst-2
Operational Technical Support Level 2
Project Engineer - 2 – EMU Program
Line & Surface Plate Inspector Class 2
Software Engineer (LVL 2) - Embedded UAV Controls
Software Engineer (LVL 2 / JAVA) - Air Combat Training
Software Engineer (Level 2) - Mission Simulation & Training
Electrical Engineer (LVL 2) - Mission Systems Design Tools
Quality Inspector II
GET/PGET
GET/PGET
Production Supervisor - 2nd shift
Software Developer
Trainee Operator/ Operator

Simple web scraper formatting, how could I fix this?

I have this code:
import requests
from bs4 import BeautifulSoup

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()
The output is:
My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/
Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/
I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/
Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/
I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/
To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/
The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/
Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/
The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/
Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/
An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/
A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/
Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/
What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/
I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/
The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/
Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/
Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/
Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/
Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/
It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/
The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/
You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/
The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/
Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/
•
•
•
•
•
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13
What I want to do is place the matching rating right next to each post, so I can tell instantly how much rating each post has, instead of printing the titles and links in one "block" and the rating numbers in another "block".
Thanks in advance for the help!
You can do it in one go by iterating over the div elements with class="thing" (think of it as iterating over posts). For each div, get the link and the rating:
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    soup = BeautifulSoup(requests.get(url).content)
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))
        print(link.string, href, rating.string)

posts_spider()
FYI, div.thing is a CSS selector that matches all div elements with class="thing".
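For comparison, the same elements can be matched without a CSS selector; a minimal equivalent using find_all:

# Equivalent to soup.select('div.thing'): divs whose class attribute contains "thing"
for thing in soup.find_all('div', class_='thing'):
    ...  # same processing as above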
