In Python 3, when I want to return only strings with the term I am interested in, I can do this:
phrases = ["1. The cat was sleeping",
           "2. The dog jumped over the cat",
           "3. The cat was startled"]

for phrase in phrases:
    if "dog" in phrase:
        print(phrase)
Which of course prints "2. The dog jumped over the cat"
Now what I'm trying to do is make the same concept work with parsed strings in BeautifulSoup. Craigslist, for example, has lots of A Tags, but only the A Tags that also have "hdrlnk" in them are of interest to us. So I:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")

for link in links:
    if "hdrlnk" in link:
        print(link)
The problem is that instead of printing all the a tags with "hdrlnk" inside, Python prints nothing, and I'm not sure what's going wrong.
"hdrlnk" is a class attribute on the links. Since you say you are only interested in these links, just find them by class, like this:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a", {"class": "hdrlnk"})

for link in links:
    print(link)
Outputs:
<a class="result-title hdrlnk" data-id="6293679332" href="/chc/apa/d/high-rise-2-bedroom-heated/6293679332.html">High-Rise 2 Bedroom Heated Pool Indoor Parking Fire Pit Pet Friendly!</a>
<a class="result-title hdrlnk" data-id="6285069993" href="/chc/apa/d/new-beautiful-studio-in/6285069993.html">NEW-Beautiful Studio in Uptown/free heat</a>
<a class="result-title hdrlnk" data-id="6293694090" href="/chc/apa/d/albany-park-2-bed-1-bath/6293694090.html">Albany Park 2 Bed 1 Bath Dishwasher W/D & Heat + Parking Incl Pets ok</a>
<a class="result-title hdrlnk" data-id="6282289498" href="/chc/apa/d/north-center-2-bed-1-bath/6282289498.html">NORTH CENTER: 2 BED 1 BATH HDWD AC UNITS PROVIDE W/D ON SITE PRK INCLU</a>
<a class="result-title hdrlnk" data-id="6266583119" href="/chc/apa/d/beautiful-2bed-1bath-in-the/6266583119.html">Beautiful 2bed/1bath in the heart of Wrigleyville</a>
<a class="result-title hdrlnk" data-id="6286352598" href="/chc/apa/d/newly-rehabbed-2-bedroom-unit/6286352598.html">Newly Rehabbed 2 Bedroom Unit! Section 8 OK! Pets OK! (NHQ)</a>
To get the link's href or text, use:
print(link["href"])
print(link.text)
Try:
for link in links:
    if "hdrlnk" in link["href"]:
        print(link)
Just search for the term in the link's content; otherwise your code seems fine:
import requests
from bs4 import BeautifulSoup

url = "https://chicago.craigslist.org/search/apa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")

for link in links:
    if "hdrlnk" in link.contents[0]:
        print(link)
Or, if you want to search inside the href or title, use link['href'] and link['title'].
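One caveat with link['href']: anchors without that attribute raise a KeyError, which is common on real listing pages. A small sketch (hypothetical HTML) using .get() with a default instead:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first anchor has no href at all.
html = '<a name="top"></a><a href="/apa/123.html" title="2 bed">Listing</a>'
soup = BeautifulSoup(html, "html.parser")

matches = []
for link in soup.find_all("a"):
    # link["href"] would raise KeyError on the first anchor;
    # .get() returns a default value instead.
    href = link.get("href", "")
    if "apa" in href:
        matches.append(href)
print(matches)  # ['/apa/123.html']
```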
To get the required links, you can use selectors within your script to make the scraper robust and concise.
import requests
from bs4 import BeautifulSoup

base_link = "https://chicago.craigslist.org"
res = requests.get("https://chicago.craigslist.org/search/apa").text
soup = BeautifulSoup(res, "lxml")

for link in soup.select(".hdrlnk"):
    print(base_link + link.get("href"))
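If the site ever serves absolute or oddly formed hrefs, plain string concatenation can produce broken URLs; urllib.parse.urljoin handles those cases. A sketch with a made-up listing href:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_link = "https://chicago.craigslist.org"
# Inline stand-in for the fetched search page (hypothetical listing).
html = '<a class="hdrlnk" href="/chc/apa/d/demo-listing/123.html">Demo listing</a>'
soup = BeautifulSoup(html, "html.parser")

# urljoin resolves relative hrefs against the base and leaves
# already-absolute hrefs untouched, unlike plain concatenation.
urls = [urljoin(base_link, a.get("href")) for a in soup.select(".hdrlnk")]
print(urls)  # ['https://chicago.craigslist.org/chc/apa/d/demo-listing/123.html']
```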
While attempting to scrape this website: https://dining.umich.edu/menus-locations/dining-halls/mosher-jordan/ I have located the food item names by doing the following:
import requests
from bs4 import BeautifulSoup

url = "https://dining.umich.edu/menus-locations/dining-halls/mosher-jordan/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
foodLocation = soup.find_all('div', class_='item-name')

for singleFood in foodLocation:
    food = singleFood.text
    print(food)
The problem is, I only want to print the food inside of the "World Palate Maize" section seen in the Lunch portion of the link. In the HTML, there are multiple divs that all contain the foods within a certain type (World Palate Maize, Hot Cereal, MBakery etc.) I'm having trouble figuring out how to tell the loop to only print inside of a certain section (certain div?). This may require an if statement or condition in the for loop but I am unsure about how to format/what to use as a condition to ensure this loop only prints the content from one section.
One strategy could be to select more specifically by text, e.g. with CSS selectors:
soup.select('h3:-soup-contains("Lunch")+div h4:-soup-contains("World Palate Maize") + ul .item-name')
Example
import requests
from bs4 import BeautifulSoup

url = "https://dining.umich.edu/menus-locations/dining-halls/mosher-jordan/"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
foodLocation = soup.select('h3:-soup-contains("Lunch")+div h4:-soup-contains("World Palate Maize") + ul .item-name')

for singleFood in foodLocation:
    food = singleFood.text
    print(food)
Output
Mojo Grilled Chicken
Italian White Bean Salad
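The `:-soup-contains()` pseudo-class is provided by soupsieve, the selector engine bundled with modern BeautifulSoup (older writeups call it `:contains()`). Here is how the selector behaves on a simplified stand-in for the menu markup (structure assumed, not the live page):

```python
from bs4 import BeautifulSoup

# Simplified, assumed stand-in for the dining page's menu structure.
html = """
<h3>Lunch</h3>
<div>
  <h4>World Palate Maize</h4>
  <ul><li class="item-name">Mojo Grilled Chicken</li></ul>
  <h4>Hot Cereal</h4>
  <ul><li class="item-name">Oatmeal</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
# Only the ul directly after the "World Palate Maize" heading inside
# the div that follows the "Lunch" heading is selected.
items = [el.get_text(strip=True) for el in soup.select(
    'h3:-soup-contains("Lunch")+div h4:-soup-contains("World Palate Maize") + ul .item-name')]
print(items)  # ['Mojo Grilled Chicken']
```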
Seems like "Lunch" is always the second div, so you can probably do:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla'
}

url = "https://dining.umich.edu/menus-locations/dining-halls/mosher-jordan/"
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')

[breakfast, lunch, dinner] = soup.select('div#mdining-items div.courses')
foods = lunch.select('div.item-name')

for food in foods:
    print(food.text)
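A note on the unpacking line: it assumes exactly three div.courses sections and raises a ValueError if the page ever changes, which at least fails loudly rather than silently picking the wrong section. A self-contained sketch with assumed stand-in markup:

```python
from bs4 import BeautifulSoup

# Assumed structure: one div.courses per meal, in page order.
html = """
<div id="mdining-items">
  <div class="courses"><div class="item-name">Eggs</div></div>
  <div class="courses"><div class="item-name">Mojo Grilled Chicken</div></div>
  <div class="courses"><div class="item-name">Pasta</div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unpacking fails loudly if the number of course sections changes.
breakfast, lunch, dinner = soup.select("div#mdining-items div.courses")
lunch_items = [d.text for d in lunch.select("div.item-name")]
print(lunch_items)  # ['Mojo Grilled Chicken']
```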
So here is my situation: Let's say you search on eBay for "Motorola DynaTAC 8000x". The bot that I build is going to scrape all the links of the listings. My goal is now, to make it open those scraped links one by one.
I think something like that would be possible using loops, but I am not sure how to do it. Thanks in advance!
Here is the code of the bot:
import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        print(link)
To get information from the link you can do:
import requests
from bs4 import BeautifulSoup

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=Motorola+DynaTAC+8000x&_sacat=0"
r = requests.get(url)
soup = BeautifulSoup(r.content, features="lxml")
listings = soup.select("li a")

for a in listings:
    link = a["href"]
    if link.startswith("https://www.ebay.com/itm/"):
        s = BeautifulSoup(requests.get(link).content, "lxml")
        price = s.select_one('[itemprop="price"]')
        print(s.h1.text)
        print(price.text if price else "-")
        print(link)
        print("-" * 80)
Prints:
...
Details about MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about MOTOROLA DYNATAC 8100L- BRICK CELL PHONE VINTAGE RETRO RARE MUSEUM 8000X
GBP 555.00
https://www.ebay.com/itm/393245721991?hash=item5b8f458587:g:c7wAAOSw4YdgdvBt
--------------------------------------------------------------------------------
Details about Vintage Pulsar Extra Thick Brick Cell Phone Has Dynatac 8000X Display
US $3,000.00
https://www.ebay.com/itm/163814682288?hash=item26241daeb0:g:sTcAAOSw6QJdUQOX
--------------------------------------------------------------------------------
...
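The filtering step itself can be sketched offline with made-up markup; in the real loop it is also worth pausing briefly (e.g. time.sleep(1)) between per-item requests so you don't hammer the site:

```python
from bs4 import BeautifulSoup

# Offline stand-in for the search results (hypothetical URLs).
html = """
<li><a href="https://www.ebay.com/itm/11111">Listing one</a></li>
<li><a href="https://www.ebay.com/sch/other">Navigation link</a></li>
<li><a href="https://www.ebay.com/itm/22222">Listing two</a></li>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only anchors that point at item pages.
item_links = [a["href"] for a in soup.select("li a")
              if a["href"].startswith("https://www.ebay.com/itm/")]
print(item_links)  # ['https://www.ebay.com/itm/11111', 'https://www.ebay.com/itm/22222']
```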
I have used the below code:
from bs4 import BeautifulSoup
import requests

page = requests.get(
    "https://www.olivemagazine.com/recipes/entertain/best-ever-starter-recipes/")
soup = BeautifulSoup(page.content, 'html.parser')

for i in soup.find_all('h3')[1:-3]:
    print(i)
To get this kind of output:
<h3 class="p1">Summer deli board</h3>
<h3 class="p1">Marinated figs with mozzarella and serrano ham</h3>
<h3>Tomato salad with burrata and warm 'nduja dressing</h3>
<h3 class="p1">Griddled avocados with crab and chorizo</h3>
<h3>Duck, chicken and sour cherry terrine</h3>
<h3>Steak tartare</h3>
<h3>Tomatoes and lardo on toast with basil oil</h3>
From here I would like to extract the link in the anchor tag as well as the display name, e.g. "Summer deli board". I am not sure how to extract these two elements from where I have gotten so far.
You can use a nested loop inside your for loop to get the href and text, and append them to lists:
from bs4 import BeautifulSoup
import requests

page = requests.get(
    "https://www.olivemagazine.com/recipes/entertain/best-ever-starter-recipes/")
soup = BeautifulSoup(page.content, 'html.parser')

link = []
title = []

for i in soup.find_all('h3')[1:-3]:
    a_tags = i.find_all("a")
    for a in a_tags:
        link.append(a.attrs['href'])
        title.append(a.text)
Output:
link:
['https://www.olivemagazine.com/recipes/family/giant-champagne-and-lemon-prawn-vol-au-vents/',
'https://www.olivemagazine.com/recipes/fish-and-seafood/grilled-scallops-with-nduja-butter/',
'https://www.olivemagazine.com/recipes/quick-and-easy/herb-and-chilli-calamari/',.......]
title:
['Giant champagne and lemon prawn vol-au-vents',
'Grilled scallops with ’nduja butter',
'Herb and chilli calamari',....]
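The nested loop can also be collapsed into a single pass: selecting "h3 a" yields each anchor once, so the title and href stay paired instead of living in two parallel lists. A sketch on stand-in markup (hypothetical hrefs):

```python
from bs4 import BeautifulSoup

# Stand-in for the recipe page headings (hypothetical hrefs).
html = """
<h3 class="p1"><a href="/recipes/summer-deli-board/">Summer deli board</a></h3>
<h3><a href="/recipes/steak-tartare/">Steak tartare</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")

# One pass: each anchor yields a (title, href) pair.
pairs = [(a.text, a["href"]) for a in soup.select("h3 a")]
print(pairs)
```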
So I have been going to a website to get NDC codes https://ndclist.com/?s=Solifenacin and I need to get 10 digit NDC codes, but on the current webpage there is only 8 digit NDC codes shown like this picture below
So I click on the underlined NDC code. And get this webpage.
So I copy and paste these 2 NDC codes to an excel sheet, and repeat the process for the rest of the codes on the first webpage I've shown. But this process takes a good bit of time, and was wondering if there was a library in Python that could copy and paste the 10 digit NDC codes for me or store them in a list and then I could print the list once I'm finished with all the 8 digit NDC codes on the first page. Would BeautifulSoup work or is there a better library to achieve this process?
EDIT:
I actually need to go another level deep and I've been trying to figure it out, but I've been failing, apparently the last level of webpage is this dumb html table, and I only need one element of the table. Here is the last webpage after you click on the 2nd level codes.
Here is the code that I have, but it's returning a tr and None object once I run it.
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Trospium'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for b in soup2.select('#product-packages a'):
        link_url2 = b['href']
        print('Processing link {}...'.format(link_url2))
        soup3 = BeautifulSoup(requests.get(link_url2).content, 'html.parser')
        for link in soup3.findAll('tr', limit=7)[1]:
            print(link.name)
            all_data.append(link.name)

print('Trospium')
print(all_data)
Yes, BeautifulSoup is ideal in this case. This script will print all 10-digit codes from the page:
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for link in soup2.select('#product-packages a'):
        print(link.text)
        all_data.append(link.text)

# In all_data you have all codes, uncomment to print them:
# print(all_data)

Prints:

Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56
0093-5263-98
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56
0093-5264-98
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19
Processing link https://ndclist.com/ndc/27241-037...
27241-037-03
27241-037-09
... and so on.
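What the '[data-title="NDC"] a[href]' selector does can be shown on a made-up table cell: it keeps only anchors that carry an href and sit inside an element whose data-title attribute equals "NDC":

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one row of the NDC results table.
html = """
<td data-title="NDC"><a href="https://ndclist.com/ndc/0093-5263">0093-5263</a></td>
<td data-title="Supplier">Teva</td>
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute selector: element with data-title="NDC", then any
# descendant anchor that has an href.
links = [a["href"] for a in soup.select('[data-title="NDC"] a[href]')]
print(links)  # ['https://ndclist.com/ndc/0093-5263']
```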
EDIT: (Version where I get the description too):
import requests
from bs4 import BeautifulSoup

url = 'https://ndclist.com/?s=Solifenacin'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

all_data = []
for a in soup.select('[data-title="NDC"] a[href]'):
    link_url = a['href']
    print('Processing link {}...'.format(link_url))
    soup2 = BeautifulSoup(requests.get(link_url).content, 'html.parser')
    for code, desc in zip(soup2.select('a > h4'), soup2.select('a + p.gi-1x')):
        code = code.get_text(strip=True).split(maxsplit=1)[-1]
        desc = desc.get_text(strip=True).split(maxsplit=2)[-1]
        print(code, desc)
        all_data.append((code, desc))

# in all_data you have all codes:
# print(all_data)

Prints:

Processing link https://ndclist.com/ndc/0093-5263...
0093-5263-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5263-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0093-5264...
0093-5264-56 30 TABLET, FILM COATED in 1 BOTTLE
0093-5264-98 90 TABLET, FILM COATED in 1 BOTTLE
Processing link https://ndclist.com/ndc/0591-3796...
0591-3796-19 90 TABLET, FILM COATED in 1 BOTTLE
...and so on.
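One caveat with this version: zip() pairs the two selections purely by position and silently stops at the shorter list, so if the page ever has a code without a matching description the pairs shift. A sketch of the pairing on assumed stand-in markup:

```python
from bs4 import BeautifulSoup

# Assumed stand-in for the package list markup.
html = """
<a href="#"><h4>NDC 0093-5263-56</h4></a>
<p class="gi-1x">30 TABLET, FILM COATED in 1 BOTTLE</p>
<a href="#"><h4>NDC 0093-5263-98</h4></a>
<p class="gi-1x">90 TABLET, FILM COATED in 1 BOTTLE</p>
"""
soup = BeautifulSoup(html, "html.parser")

# zip pairs the n-th h4 with the n-th p; split(maxsplit=1)[-1]
# drops the leading "NDC" label from the code text.
pairs = [
    (code.get_text(strip=True).split(maxsplit=1)[-1], desc.get_text(strip=True))
    for code, desc in zip(soup.select("a > h4"), soup.select("a + p.gi-1x"))
]
print(pairs)
```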
I have this code:
import requests
from bs4 import BeautifulSoup

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.findAll('a', {'class': 'title'}):
        href = "http://www.reddit.com" + link.get('href')
        title = link.string
        print(title)
        print(href)
        print("\n")

def get_single_item_data():
    item_url = 'http://www.reddit.com/r/nosleep/new/'
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for rating in soup.findAll('div', {'class': 'score unvoted'}):
        print(rating.string)

posts_spider()
get_single_item_data()
The output is:
My light.. I'm seeing and feeling things.. what's happening?
http://www.reddit.com/r/nosleep/comments/2kw0nu/my_light_im_seeing_and_feeling_things_whats/
Why being the first to move in a new Subdivision is not the most brilliant idea...
http://www.reddit.com/r/nosleep/comments/2kw010/why_being_the_first_to_move_in_a_new_subdivision/
I Am Falling.
http://www.reddit.com/r/nosleep/comments/2kvxvt/i_am_falling/
Heidi
http://www.reddit.com/r/nosleep/comments/2kvrnf/heidi/
I remember everything
http://www.reddit.com/r/nosleep/comments/2kvrjs/i_remember_everything/
To Lieutenant Griffin Stone
http://www.reddit.com/r/nosleep/comments/2kvm9p/to_lieutenant_griffin_stone/
The woman in my room
http://www.reddit.com/r/nosleep/comments/2kvir0/the_woman_in_my_room/
Dr. Margin's Guide to New Monsters: The Guest, or, An Update
http://www.reddit.com/r/nosleep/comments/2kvhe5/dr_margins_guide_to_new_monsters_the_guest_or_an/
The Evil Woman (part 5)
http://www.reddit.com/r/nosleep/comments/2kva73/the_evil_woman_part_5/
Blood for the blood god, The first of many.
http://www.reddit.com/r/nosleep/comments/2kv9gx/blood_for_the_blood_god_the_first_of_many/
An introduction to the beginning of my journey
http://www.reddit.com/r/nosleep/comments/2kv8s0/an_introduction_to_the_beginning_of_my_journey/
A hunter..of sorts.
http://www.reddit.com/r/nosleep/comments/2kv8oz/a_hunterof_sorts/
Void Trigger
http://www.reddit.com/r/nosleep/comments/2kv84s/void_trigger/
What really happened to Amelia Earhart
http://www.reddit.com/r/nosleep/comments/2kv80r/what_really_happened_to_amelia_earhart/
I Used To Be Fine Being Alone
http://www.reddit.com/r/nosleep/comments/2kv2ks/i_used_to_be_fine_being_alone/
The Green One
http://www.reddit.com/r/nosleep/comments/2kuzre/the_green_one/
Elevator
http://www.reddit.com/r/nosleep/comments/2kuwxu/elevator/
Scary story told by my 4 year old niece- The Guy With Really Big Scary Claws
http://www.reddit.com/r/nosleep/comments/2kuwjz/scary_story_told_by_my_4_year_old_niece_the_guy/
Cranial Nerve Zero
http://www.reddit.com/r/nosleep/comments/2kuw7c/cranial_nerve_zero/
Mom's Story About a Ghost Uncle
http://www.reddit.com/r/nosleep/comments/2kuvhs/moms_story_about_a_ghost_uncle/
It snowed.
http://www.reddit.com/r/nosleep/comments/2kutp6/it_snowed/
The pocket watch I found at a store
http://www.reddit.com/r/nosleep/comments/2kusru/the_pocket_watch_i_found_at_a_store/
You’re Going To Die When You Are 23
http://www.reddit.com/r/nosleep/comments/2kur3m/youre_going_to_die_when_you_are_23/
The Customer: Part Two
http://www.reddit.com/r/nosleep/comments/2kumac/the_customer_part_two/
Dimenhydrinate
http://www.reddit.com/r/nosleep/comments/2kul8e/dimenhydrinate/
...
12
12
76
4
2
4
6
4
18
2
6
13
5
16
2
2
14
48
1
13
What I want to do is place the matching rating right next to each post, so I can instantly tell how highly rated each post is, instead of printing the titles and links in one "block" and the rating numbers in another "block".
Thanks in advance for the help!
You can do it in one go by iterating over div elements with class="thing" (think of it as iterating over posts). For each div, get the link and rating:
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

def posts_spider():
    url = 'http://www.reddit.com/r/nosleep/new/'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for thing in soup.select('div.thing'):
        link = thing.find('a', {'class': 'title'})
        rating = thing.find('div', {'class': 'score'})
        href = urljoin("http://www.reddit.com", link.get('href'))
        print(link.string, href, rating.string)

posts_spider()
FYI, div.thing is a CSS selector that matches all div elements with class="thing".
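The same grouping idea can be shown self-contained on made-up post markup, collecting tuples instead of printing:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in: each div.thing bundles one post's title link and score.
html = """
<div class="thing">
  <a class="title" href="/r/nosleep/comments/abc/heidi/">Heidi</a>
  <div class="score">4</div>
</div>
<div class="thing">
  <a class="title" href="/r/nosleep/comments/xyz/elevator/">Elevator</a>
  <div class="score">2</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

posts = []
for thing in soup.select("div.thing"):
    # Both lookups are scoped to this post's div, so title and
    # rating can never get out of step with each other.
    link = thing.find("a", {"class": "title"})
    rating = thing.find("div", {"class": "score"})
    posts.append((link.string,
                  urljoin("http://www.reddit.com", link.get("href")),
                  rating.string))
print(posts)
```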