Simple web scraper formatting, how could I fix this? - python

I have this code:
import requests
from bs4 import BeautifulSoup

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")
def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()
The output is:
My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/
Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/
I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/
Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/
I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/
To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/
The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/
Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/
The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/
Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/
An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/
A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/
Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/
What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/
I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/
The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/
Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/
Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/
Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/
Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/
It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/
The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/
You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/
The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/
Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/
•
•
•
•
•
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13
What I want to do is place the matching rating right next to each post, so I can tell at a glance what rating each post has, instead of printing the titles and links in one "block" and the rating numbers in another "block".
Thanks in advance for the help!

You can do it in one go by iterating over div elements with class="thing" (think of it as iterating over posts). For each div, get the link and rating:
from urllib.parse import urljoin  # on Python 2 this was: from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    soup = BeautifulSoup(requests.get(url).content)
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))
        print(link.string, href, rating.string)

posts_spider()
FYI, div.thing is a CSS selector that matches all div elements with class="thing".
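The same match can also be written without a CSS selector via find_all; a minimal equivalent sketch (standalone, fetching the page again):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.reddit.com/r/nosleep/new/').content)
# class_ matches tags whose class attribute contains "thing",
# which is exactly what the div.thing selector does
for thing in soup.find_all('div', class_='thing'):
    print(thing.find('a', {'class': 'title'}).string)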

Related

Extracting title from link in Python (Beautiful soup)

I am new to Python and I'm looking to extract the title from a link. So far I have the following but have hit a dead end:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
tonight = book_list[0]
for book in book_list:
    price = book.find(class_="price_color").get_text()
    title = book.find('a')
    print(price)
    print(title.contents[0])
To extract the title from the links, you can use the title attribute.
For example:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.select('h3 > a'):
    print(a['title'])
Prints:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
You can also use the alt attribute of the book cover image:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
for book in book_list:
    price = book.find(class_="price_color").get_text()
    title = book.select_one('a img')['alt']
    print(title)
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red...
Or, by just modifying your existing code, you can use the alt text of the image inside the link, which contains the book title:
print(title.contents[0].attrs["alt"])
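If the end goal is each title printed next to its price, both lookups can live in one loop; a minimal sketch along the same lines:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
for book in soup.find_all(class_="product_pod"):
    title = book.find('a', title=True)['title']  # first <a> carrying a title attribute
    price = book.find(class_="price_color").get_text()
    print(title, price)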

Is there a way to parse data from multiple pages from a parent webpage?

So I have been going to a website, https://ndclist.com/?s=Solifenacin, to get NDC codes. I need the 10-digit NDC codes, but the search page only shows 8-digit NDC codes.
So I click on an underlined 8-digit NDC code, which opens a second page with the corresponding 10-digit codes.
So I copy and paste those 2 NDC codes into an Excel sheet, and repeat the process for the rest of the codes on the first page. This takes a good bit of time, and I was wondering if there is a library in Python that could collect the 10-digit NDC codes for me, or store them in a list that I could print once I'm finished with all the 8-digit codes on the first page. Would BeautifulSoup work, or is there a better library for this?
EDIT <<<<
I actually need to go another level deep, and I've been trying to figure it out but failing. The last level of webpage is a plain HTML table, and I only need one element of it; it is the page you reach after clicking on the 2nd-level codes.
Here is the code that I have, but it's returning a tr and a None object once I run it:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}... '.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)
print('Trospium')
print(all_data)
Yes, BeautifulSoup is ideal in this case. This script will print all 10-digit codes from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all codes, uncomment to print them:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
EDIT: a version that grabs the description too:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# In all_data you have all codes:
# print(all_data)
Prints:
Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
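Regarding the EDIT: soup3.findAll('tr', limit=7)[1] selects the second tr and then iterates over its children, which are a mix of Tag objects and bare whitespace strings (whose .name is None); that is where the None comes from. If only one cell of that table is needed, index into the rows and cells directly. A hedged sketch (the row and column positions are assumptions, adjust them to the real table):

# soup3 already holds the third-level page, as in the question's code
rows = soup3.find('table').find_all('tr')
cells = rows[1].find_all('td')           # second row (assumed)
value = cells[0].get_text(strip=True)    # first column (assumed)
all_data.append(value)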

How do I get the next tag?

I am trying to get the headlines that sit next to a certain class. The headlines are wrapped in an h2 tag, and each headline comes right after the tag with that class.
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
mytags = mydivs.findNext('h2')
for tag in mytags:
    print(tag.text.strip())
You must iterate through mydivs to use findNext().
mydivs is a list of elements; findNext() only applies to a single element, so you must iterate through the results and run findNext() on each of them.
Just add this line:
for div in mydivs:
and put it before:
mytags = div.findNext('h2')
Here is the full code for your working program:
from bs4 import BeautifulSoup
import requests

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
Try replacing the last 3 lines with:
for div in mydivs:
    mytags = div.findNext('h2')
    for tag in mytags:
        print(tag.strip())
soup.findAll() returns a ResultSet (a list of tags), so you cannot call findNext() on it directly. However, you can iterate over the tags and call find_next() on each tag separately:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.dailypost.ng/hot-news")
soup = BeautifulSoup(r.content, "html.parser")
mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
for tag in mydivs:
    print(tag.find_next('h2').get_text(strip=True))
Prints:
BREAKING: Another federal lawmaker dies in Dubai hospital
Cross-Over Night: Enugu Govt bans burning of tyres on roads
Dadiyata: DSS breaks silence as Nigerian govt critic remains missing
CAC: Nigerian govt appoints new Acting Registrar-General
What Buhari told me – Dabiri-Erewa
What soldiers should expect in 2020 – Buratai
Only earthquake can erase Amosun’s legacies in Ogun – Akinlade
Civil War: Militia leader sentenced to 20yrs in prison
2020: Prophet Omale releases prophecies on Buhari, Aisha, Kyari, govs, coup plot
BREAKING: EFCC arrests Shehu Sani
Armed Forces Day: Yobe Governor Buni, donates N40 million for emblem appeal fund
Zamfara govt bans illegal gathering in the state
Agbenu Kacholalo: Colours of culture at Idoma International Carnival 2019 [PHOTOS]
Men of God are too fearful, weak to challenge government activities
2020: Peter Obi sends message to Nigerians
TETFUND: EFCC, ICPC asked to probe agency over alleged corruption
Two inmates regain freedom from Uyo prison
Buhari meets President of AfDB, Adeshina at Aso Rock
New Kogi CP resumes office, promises crime free state
Nothing stops you from paying N30,000 minimum wage to workers – APC challenges Makinde
EDIT: This script will scrape headlines from several pages:
import requests
from bs4 import BeautifulSoup

url = 'https://dailypost.ng/hot-news/page/{}/'

for page in range(1, 5):  # <-- change to how many pages you want
    print('Page no.{}'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    mydivs = soup.findAll("span", {"class": "mvp-cd-date left relative"})
    for tag in mydivs:
        print(tag.find_next('h2').get_text(strip=True))
    print('-' * 80)
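If you'd rather collect the headlines than print them (to write them to a file, say), the same loop can append to a list; a minimal sketch, with the output filename being an assumption:

import requests
from bs4 import BeautifulSoup

url = 'https://dailypost.ng/hot-news/page/{}/'
headlines = []
for page in range(1, 5):
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    for tag in soup.findAll("span", {"class": "mvp-cd-date left relative"}):
        headlines.append(tag.find_next('h2').get_text(strip=True))

with open('headlines.txt', 'w', encoding='utf-8') as f:  # hypothetical filename
    f.write('\n'.join(headlines))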

Python not finding search terms in strings parsed by BeautifulSoup

In Python 3, when I want to return only strings with the term I am interested in, I can do this:
phrases = ["1. The cat was sleeping",
           "2. The dog jumped over the cat",
           "3. The cat was startled"]

for phrase in phrases:
    if "dog" in phrase:
        print(phrase)
Which of course prints "2. The dog jumped over the cat"
Now what I'm trying to do is make the same concept work with parsed strings in BeautifulSoup. Craigslist, for example, has lots of A Tags, but only the A Tags that also have "hdrlnk" in them are of interest to us. So I:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")
for link in links:
    if "hdrlnk" in link:
        print(link)
Problem is, instead of printing all the A Tags with "hdrlnk" inside, Python prints nothing. And I'm not sure what's going wrong.
"hdrlnk" is a class attribute on the links. As you say you are only interested in these links just find the links based on class like this:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a", {"class": "hdrlnk"})
for link in links:
    print(link)
Outputs:
<a class="result-title hdrlnk" data-id="6293679332" href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">High-Rise 2 Bedroom Heated Pool Indoor Parking Fire Pit Pet Friendly!</a>
<a class="result-title hdrlnk" data-id="6285069993" href="/chc/apa/d/new-beautiful-studio-in/6285069993.html">NEW-Beautiful Studio in Uptown/free heat</a>
<a class="result-title hdrlnk" data-id="6293694090" href="/chc/apa/d/albany-park-2-bed-1-bath/6293694090.html">Albany Park 2 Bed 1 Bath Dishwasher W/D & Heat + Parking Incl Pets ok</a>
<a class="result-title hdrlnk" data-id="6282289498" href="/chc/apa/d/north-center-2-bed-1-bath/6282289498.html">NORTH CENTER: 2 BED 1 BATH HDWD AC UNITS PROVIDE W/D ON SITE PRK INCLU</a>
<a class="result-title hdrlnk" data-id="6266583119" href="/chc/apa/d/beautiful-2bed-1bath-in-the/6266583119.html">Beautiful 2bed/1bath in the heart of Wrigleyville</a>
<a class="result-title hdrlnk" data-id="6286352598" href="/chc/apa/d/newly-rehabbed-2-bedroom-unit/6286352598.html">Newly Rehabbed 2 Bedroom Unit! Section 8 OK! Pets OK! (NHQ)</a>
To get the link href or text use:
print(link["href"])
print(link.text)
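Since the href values in the output above are relative paths, you may want absolute URLs; a small sketch using urljoin from the standard library (the base URL is taken from the question):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "https://chicago.craigslist.org"
soup = BeautifulSoup(requests.get(base + "/search/apa").content, "html.parser")
for link in soup.find_all("a", {"class": "hdrlnk"}):
    # urljoin resolves the leading-slash paths against the site root
    print(urljoin(base, link["href"]), link.get_text(strip=True))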
Try checking the class attribute instead (the hrefs themselves do not contain "hdrlnk"):
for link in links:
    if "hdrlnk" in link.get("class", []):
        print(link)
The in test in your code looks at the tag's children, not its attributes, which is why nothing matches. Search inside the attribute you actually care about; the rest of your code is fine:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")
for link in links:
    if "hdrlnk" in link.get("class", []):
        print(link)
Or, if you want to search inside href or title, use link['href'] and link['title']
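For attributes that may be absent on some anchors, Tag.get() avoids a KeyError; a tiny defensive sketch of the same idea (reusing the links list from the question):

for link in links:
    href = link.get("href", "")    # "" when the attribute is missing
    title = link.get("title", "")
    if "hdrlnk" in href or "hdrlnk" in title:
        print(link)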
To get the required links, you can use selectors within your script to make the scraper robust and concise.
import requests
from bs4 import BeautifulSoup

base_link = "https://chicago.craigslist.org"
res = requests.get("https://chicago.craigslist.org/search/apa").text
soup = BeautifulSoup(res, "lxml")
for link in soup.select(".hdrlnk"):
    print(base_link + link.get("href"))
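One dependency note: the "lxml" parser used here requires the third-party lxml package. If it isn't installed, Python's built-in parser is a drop-in replacement:

soup = BeautifulSoup(res, "html.parser")  # no extra dependency needed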

Beautiful Soup nested div (Adding extra function)

I am trying to extract the company name, address, and zipcode from www.quicktransportsolutions.com. I have written the following code to crawl the site and return the information I need.
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('div', {'class': 'well well-sm'}):
            title = link.string
            print(link)
        page += 1  # advance so the loop terminates

trade_spider(1)
After running the code, I see the information that I want, but I am confused as to how to get it to print without all of the non-pertinent information.
Above the
print(link)
I thought that link.string would pull the company names, but that failed. Any suggestions?
Output:
<div class="well well-sm">
<b>2 OLD BOYS TRUCKING LLC</b><br><u><span itemprop="name"><b>2 OLD BOYS TRUCKING</b></span></u><br> <span itemprop="address" itemscope="" itemtype="http://schema.org/PostalAddress"><span itemprop="streetAddress">227 E 2ND</span>
<br>
<span itemprop="addressLocality">Adrian</span>, <span itemprop="addressRegion">MO</span> <span itemprop="postalCode">64720</span></br></span><br>
Trucks: 2 Drivers: 2<br>
<abbr class="initialism" title="Unique Number to identify Companies operating commercial vehicles to transport passengers or haul cargo in interstate commerce">USDOT</abbr> 2474795 <br><span class="glyphicon glyphicon-phone"></span><b itemprop="telephone"> 417-955-0651</b>
<br><a href="/inspectionreports/2-old-boys-trucking-usdot-2474795.php" itemprop="url" target="_blank" title="Trucking Company 2 OLD BOYS TRUCKING Inspection Reports">
Everyone,
Thanks for the help so far... I'm trying to add an extra function to my little crawler. I have written the following code:
def Crawl_State_Page(max_pages):
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    while i <= len(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
        for link in table.find_all(href=True):
            print(link['href'])
Output:
abbeville.php
adamsville.php
addison.php
adger.php
akron.php
alabaster.php
alberta.php
albertville.php
alexander-city.php
alexandria.php
aliceville.php
alpine.php
... # goes all the way to Z; I cut the output short for spacing.
What I'm trying to accomplish here is to pull all of the hrefs for the city .php pages and write them to a file. But right now I am stuck in an infinite loop where it keeps cycling through the URL. Any tips on how to increment it? My end goal is to create another function that feeds back into my trade_spider with www.site.com/state/city.php and then loops through all 50 states... Something to the effect of:
while i < len(states, cities):
    url = "http://www.quicktransportsolutions.com/carrier" + states + cities[i] +"
And then this would loop into my trade_spider function, pulling all of the information that I needed.
But, before I get to that part, I need a bit of help getting out of my infinite loop. Any suggestions? Or foreseeable issues that I am going to run into?
I tried to create a crawler that would cycle through every link on the page and, if it found content that trade_spider could crawl, write it to a file... However, that was a bit out of my skill set for now. So I'm trying this method.
I would rely on the itemprop attributes of the different tags for each company. They are conveniently set for name, url, address etc:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.quicktransportsolutions.com/carrier/missouri/adrian.php'
        response = requests.get(url)
        soup = BeautifulSoup(response.content)
        for company in soup.find_all('div', {'class': 'well well-sm'}):
            link = company.find('a', itemprop='url').get('href').strip()
            name = company.find('span', itemprop='name').text.strip()
            address = company.find('span', itemprop='address').text.strip()
            print(name, link, address)
            print("----")
        page += 1  # advance so the loop terminates

trade_spider(1)
Prints:
2 OLD BOYS TRUCKING /truckingcompany/missouri/2-old-boys-trucking-usdot-2474795.php 227 E 2ND
Adrian, MO 64720
----
HILLTOP SERVICE & EQUIPMENT /truckingcompany/missouri/hilltop-service-equipment-usdot-1047604.php ROUTE 2 BOX 453
Adrian, MO 64720
----
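For the infinite-loop question in the EDIT above: a plain for loop over the table's links visits each one exactly once, so there is no counter to increment. A minimal sketch that writes the city pages to a file (the table class comes from the question; the filename is an assumption):

import requests
from bs4 import BeautifulSoup

def crawl_state_page():
    url = 'http://www.quicktransportsolutions.com/carrier/alabama/trucking-companies.php'
    soup = BeautifulSoup(requests.get(url).content)
    table = soup.find("table", {"class": "table table-condensed table-striped table-hover table-bordered"})
    with open('cities.txt', 'w') as f:  # hypothetical filename
        for link in table.find_all('a', href=True):
            f.write(link['href'] + '\n')

crawl_state_page()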
