I am trying to extract the star rating of each review in a dataframe for sentiment analysis.
https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218
This is the webpage I am trying to scrape. I am fairly new to web scraping, so I prefer BeautifulSoup as it is easier to understand.
import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = ""
Final = []
for x in range(0, 8):
    if x == 1:
        URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"
    else:
        URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218-page-{}".format(x)
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html.parser')
    reviews = []  # a list to store reviews
    # Use a CSS selector to extract all the review containers
    review_divs = soup.select('div.col-10.review')
    for element in review_divs:
        review = {'Review_Title': element.a.text,
                  'URL': element.a['href'],
                  'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip()}
        reviews.append(review)
    Final.extend(reviews)

df = pd.DataFrame(Final)
I would really appreciate the help.
Thank You
You may add the following entry to your review dictionary to count the highlighted (rated) stars inside the element with class="rating":
'Stars': len(element.find('div', "rating").findAll("i", "rated-star"))
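For reference, a minimal sketch of the inner loop with that entry in place (same selectors and class names as in the question's code):

for element in review_divs:
    review = {
        'Review_Title': element.a.text,
        'URL': element.a['href'],
        'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip(),
        # count the <i> icons marked as rated inside the rating block
        'Stars': len(element.find('div', "rating").findAll("i", "rated-star")),
    }
    reviews.append(review)

With that in place, df['Stars'] can be used directly as the label for sentiment analysis.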
Review_Title ... Stars
0 Why need permission for contact, gallery ... 1
1 Very dull marketing for open account ... 1
2 Worst bank ... 1
3 Good interface & can be easily accessible ... 3
4 Best digital Bank account ... 4
5 Better account for everyone ... 4
6 Feature full Mobile banking ... 5
7 Very good bank ... 4
8 Above average online banking experience ... 3
...
I am scraping a district school website in which all the school sites are built the same way; every URL is identical except for the school name. The code I use only works on one school; when I put in another school name it gives blank output. Can anyone help me figure out where I am going wrong? Here is the working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 2):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data).to_csv('fcps_school.csv', index=False)
print(df)
Here is the other URL I am trying to scrape:
https://aldrines.fcps.edu/staff-directory?keywords=&field_last_name_from=&field_last_name_to=&items_per_page=10&page=
https://aldrines.fcps.edu
I've scraped 10 pages as an example without changing anything in your existing code; it works fine and produces the same output in the CSV file.
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://fairfaxhs.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://fairfaxhs.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark11').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output:
Name Position contact_url
0 Bouchera Abutaa Instructional Assistant https://fairfaxhs.fcps.edu/staff/bouchera-abutaa
1 Margaret Aderton Substitute Teacher - Regular Term https://fairfaxhs.fcps.edu/staff/margaret-aderton
2 Aja Adu-Gyamfi School Counselor, HS https://fairfaxhs.fcps.edu/staff/aja-adu-gyamfi
3 Paul Agyeman Custodian II https://fairfaxhs.fcps.edu/staff/paul-agyeman
4 Jin Ahn Food Services Worker https://fairfaxhs.fcps.edu/staff/jin-ahn
.. ... ... ...
95 Tiffany Haddock School Counselor, HS https://fairfaxhs.fcps.edu/staff/tiffany-haddock
96 Heather Hakes Learning Disabilities Teacher, MS/HS https://fairfaxhs.fcps.edu/staff/heather-hakes
97 Gabrielle Hall History & Social Studies Teacher, HS https://fairfaxhs.fcps.edu/staff/gabrielle-hall
98 Sydney Hamrick English Teacher, HS https://fairfaxhs.fcps.edu/staff/sydney-hamrick
99 Anne-Marie Hanapole Biology Teacher, HS https://fairfaxhs.fcps.edu/staff/anne-marie-ha...
[100 rows x 3 columns]
Update:
Actually, success in web scraping doesn't depend only on coding skill; half of it depends on understanding the website well.
The domains
1. https://fairfaxhs.fcps.edu
and
2. https://aldrines.fcps.edu
aren't the same, and the h1 tag's class value is slightly different; otherwise, both websites' structure is alike.
Working code:
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://aldrines.fcps.edu/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
data = []
for page in range(0, 10):
    soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
    try:
        for u in ['https://aldrines.fcps.edu' + link.a.get('href') for link in
                  soup.table.select('tr td[class="views-field views-field-rendered-item"]')]:
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            d = {
                'Name': soup2.select_one('h1.node__title.fcps-color--dark7').get_text(strip=True),
                'Position': soup2.select_one('h1+div').get_text(strip=True),
                'contact_url': u
            }
            data.append(d)
    except:
        pass

df = pd.DataFrame(data)  # .to_csv('fcps_school.csv', index=False)
print(df)
Output:
Name ... contact_url
0 Jamileh Abu-Ghannam ... https://aldrines.fcps.edu/staff/jamileh-abu-gh...
1 Linda Adgate ... https://aldrines.fcps.edu/staff/linda-adgate
2 Rehab Ahmed ... https://aldrines.fcps.edu/staff/rehab-ahmed
3 Richard Amernick ... https://aldrines.fcps.edu/staff/richard-amernick
4 Laura Arm ... https://aldrines.fcps.edu/staff/laura-arm
.. ... ... ...
95 Melissa Weinhaus ... https://aldrines.fcps.edu/staff/melissa-weinhaus
96 Kathryn Wheeler ... https://aldrines.fcps.edu/staff/kathryn-wheeler
97 Latoya Wilson ... https://aldrines.fcps.edu/staff/latoya-wilson
98 Shane Wolfe ... https://aldrines.fcps.edu/staff/shane-wolfe
99 Michael Woodring ... https://aldrines.fcps.edu/staff/michael-woodring
[100 rows x 3 columns]
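Since the only differences are the subdomain and the colour suffix on the h1 class, a hedged generalisation is to take the base URL as a parameter and match only the stable part of the class. This is a sketch, assuming every school site keeps the node__title class and the same staff-directory table markup:

import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_school(base, pages=10):
    """Scrape the staff directory of one fcps.edu school site (assumes shared markup)."""
    url = base + '/staff-directory?field_last_name_from=&field_last_name_to=&items_per_page=10&keywords=&page={page}'
    data = []
    for page in range(pages):
        soup = BeautifulSoup(requests.get(url.format(page=page)).text, 'lxml')
        for cell in soup.select('table tr td.views-field.views-field-rendered-item'):
            u = base + cell.a.get('href')
            soup2 = BeautifulSoup(requests.get(u).text, 'lxml')
            name = soup2.select_one('h1.node__title')   # ignore the fcps-color--* suffix
            position = soup2.select_one('h1+div')
            if name and position:
                data.append({'Name': name.get_text(strip=True),
                             'Position': position.get_text(strip=True),
                             'contact_url': u})
    return pd.DataFrame(data)

# Usage: the same function should then work for either school, e.g.
# print(scrape_school('https://fairfaxhs.fcps.edu'))
# print(scrape_school('https://aldrines.fcps.edu'))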
I am trying to scrape reviews and their star ratings for various products from Snapdeal. I am accessing each product through its product URL. On the product page, I want to filter the ratings by star count and fetch the rating number as well as the review text. I am using the following code to do so:
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
import time

# driver is assumed to be an already-initialised Selenium WebDriver (setup omitted here)
url_snapdeal = 'https://www.snapdeal.com/'
driver.get(url_snapdeal)
time.sleep(2)
search = driver.find_element_by_id('inputValEnter')
search.clear()
search.send_keys('smartphone')
search.send_keys(Keys.ENTER)
time.sleep(2)

for i in range(0, 3):
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(1)

urls = []
for link in driver.find_elements_by_xpath("//div[@class='product-desc-rating ']/a"):
    urls.append(link.get_attribute('href'))

snap_reviews = []
snap_ratings = []
for url in urls:
    driver.get(url)
    time.sleep(4)
    try:
        for x in range(2, 7):
            driver.find_element_by_xpath("//div[@class='selectarea']").click()
            time.sleep(1)
            driver.find_element_by_xpath(f"//div[@class='options']/ul/li[{x}]").click()
            time.sleep(1)
            for rating in driver.find_elements_by_xpath("//div[@class='user-review']/div[1]"):
                stars = rating.find_elements_by_xpath("i[@class='sd-icon sd-icon-star active']")
                snap_ratings.append(len(stars))
    except NoSuchElementException:
        print('Not found')
The try block is supposed to click on the star-filter dropdown and select 5 stars, collect the star rating and review text, click the dropdown again, select 4 stars and collect the rating and review, and so on.
My code manages to click on the dropdown but is unable to click on the filter options like 5 stars, 4 stars, etc. It throws an ElementNotInteractableException.
Any help or suggestion would be greatly appreciated. Thanks in advance.
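(As an aside, a common remedy for ElementNotInteractableException is to use Selenium's explicit waits so the option is actually clickable before the click. This is only a hedged sketch reusing the XPaths from the question; it is not the approach taken in the answer below:)

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
for x in range(2, 7):
    # open the star-filter dropdown, then wait until the option can receive a click
    wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='selectarea']"))).click()
    option = wait.until(EC.element_to_be_clickable((By.XPATH, f"//div[@class='options']/ul/li[{x}]")))
    option.click()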
You actually can get the ratings directly by product number. So get the product numbers and feed them in (I haven't looked, but it might be possible to get those without Selenium as well). Then you can just filter the dataframe. Here's an example for one product:
import requests
import pandas as pd
import math

productId = 639365186960
url = 'https://www.snapdeal.com/acors/web/getSelfieList/v2'
payload = {
    'productId': productId,
    'offset': 0}

jsonData = requests.get(url, params=payload).json()
total_pages = math.ceil(jsonData['selfieTotal'] / 10)

for page in range(1, total_pages + 1):
    if page == 1:
        ratings = jsonData['selfieList']
    else:
        payload['offset'] = 10 * (page - 1)
        jsonData = requests.get(url, params=payload).json()
        ratings += jsonData['selfieList']

df = pd.DataFrame(ratings)
Output:
df[df['rating'] == 4]
Out[82]:
selfieId ... reducedImage
0 015cd9a6a1e80000dd22850445ac4f71 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
1 015bffb8bcb60000dd22850466cc1c72 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
6 015b4dd88df70000dd228504c9271488 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
7 015b4dd7f9b80000dd228504b8777fc8 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
8 015b1e574cdc0000dd228504d4a694ad ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
9 015b182be5640000dd22850418c0bdd6 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
10 015aa6e8700e0000dd228504a3378958 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
11 015a9df9ab640000dd2285045069dcff ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
14 015a4c7a37040000dd2285045daaa6d3 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
15 015a4b377b8b0000dd228504d3dd159b ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
[10 rows x 9 columns]
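Following the same idea, here is a hedged sketch that wraps the paging into a function and concatenates several products into one DataFrame. The endpoint and payload are the ones used above; the product ID list is a placeholder to be filled with IDs collected from the listing pages:

import math
import requests
import pandas as pd

def fetch_ratings(product_id):
    """Page through the ratings endpoint for one product (same API as above)."""
    url = 'https://www.snapdeal.com/acors/web/getSelfieList/v2'
    payload = {'productId': product_id, 'offset': 0}
    first = requests.get(url, params=payload).json()
    ratings = list(first['selfieList'])
    for page in range(2, math.ceil(first['selfieTotal'] / 10) + 1):
        payload['offset'] = 10 * (page - 1)
        ratings += requests.get(url, params=payload).json()['selfieList']
    return pd.DataFrame(ratings).assign(productId=product_id)

product_ids = [639365186960]   # placeholder list of product IDs
df = pd.concat([fetch_ratings(pid) for pid in product_ids], ignore_index=True)
print(df[df['rating'] == 5])   # filter by star rating, as shown above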
I want to get all the products on this page:
nike.com.br/snkrs#estoque
My Python code is this:
import requests
from bs4 import BeautifulSoup as bs4  # the code below uses "bs4" as the parser alias

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
Guys, this code gets the products from the page, but not all of them. I suspect the content is dynamic, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to have to change my code a lot because it's almost done. How do I do that?
DO NOT call requests.get repeatedly when you are dealing with the same HOST; use a requests.Session instead, so the underlying connection is reused between requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
providing a page number between 1 and 5 for the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup

url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"

for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))
I am having an inconsistent issue that is driving me crazy. I am trying to scrape data about rental units. Let's say we have a webpage with 42 ads: the code works just fine for the first 19 ads and then it returns:
Traceback (most recent call last):
File "main.py", line 53, in <module>
title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'
If I start the code from a different ad number, say 5, it still processes the first 19 ads and then raises the same error!
Here is a minimal example that shows the issue. Please note that this code will print the HTML for a functioning ad and also for the one with the error; what is printed is very different.
Run the code, then change the value of i to see the results.
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)

    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()

    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})

    # Print one functioning ad html
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)

    print('real state title type', type(real_state_title))

    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        print(traceback.format_exc())
        print(page_soup2)
        break

    print('____________________________________________________________')
Edit 1:
In my simple example I want to loop through each ad in the provided link, open it, and get the title. In my actual code I am not only getting the title but also every other piece of information about the ad, so I need to load the data from the link associated with every ad. My code actually does that, but for an unknown reason it works only for 19 ads, regardless of which one I start with. This is driving me nuts!
To get all pages from the URL you can use the next example:
import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)

    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")

    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))

    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break

    print()
    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]
Prints:
Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ
I think I figured out the problem here. It seems you can't make a lot of requests in a short period of time, so I added a try/except block that sleeps for 80 seconds when this error occurs; this fixed my problem!
You may want to change the sleep period to a different value depending on the website you are trying to scrape.
Here is the modified code:
from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback
import time

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"

# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')

print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1

    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)

    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()

    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})

    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except AttributeError:
        print(traceback.format_exc())
        i = i - 1
        t = 80
        print(f'----------------------------Sleep for {t} seconds!')
        time.sleep(t)
        continue

    print('____________________________________________________________')
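The same idea can be packaged as a small helper so the failing ad is retried after the pause instead of skipped. This is only a sketch; the 80-second pause and the retry count are arbitrary values, not something dictated by the site:

import time
from urllib.request import urlopen
from bs4 import BeautifulSoup

def fetch_ad_title(ad_link, retries=3, pause=80):
    """Fetch an ad page and return its title, sleeping and retrying if the
    page comes back without the expected title block (e.g. when throttled)."""
    for attempt in range(retries):
        with urlopen(ad_link) as resp:
            ad_soup = BeautifulSoup(resp.read(), "html.parser")
        title_div = ad_soup.find('div', {'class': 'realEstateTitle-1440881021'})
        if title_div is not None:
            return title_div.div.h1.text.strip()
        print(f'Title missing (attempt {attempt + 1}), sleeping {pause} seconds...')
        time.sleep(pause)
    return None

# hypothetical usage inside the loop above:
# title = fetch_ad_title(ad_link)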
I have this code:
import requests
from bs4 import BeautifulSoup

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()
The output is:
My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/
Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/
I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/
Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/
I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/
To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/
The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/
Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/
The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/
Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/
An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/
A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/
Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/
What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/
I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/
The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/
Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/
Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/
Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/
Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/
It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/
The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/
You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/
The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/
Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/
...
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13
What I want to do is place the matching rating next to each post, so I can tell instantly how much rating each post has, instead of printing the titles and links in one "block" and the rating numbers in another "block".
Thanks in advance for the help!
You can do it in one go by iterating over div elements with class="thing" (think about it as iterating over posts). For each div, get the link and rating:
from urllib.parse import urljoin  # on Python 2 this was "from urlparse import urljoin"

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))
        print(link.string, href, rating.string)

posts_spider()
FYI, div.thing is a CSS selector that matches all div elements with class="thing".
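For reference, the selector form is interchangeable with the find_all style used elsewhere in this thread (a minimal illustration using the soup object from the function above):

things_css = soup.select('div.thing')                 # CSS selector form
things_find = soup.find_all('div', class_='thing')    # equivalent find_all form
print(len(things_css), len(things_find))              # same count either way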