Parsing a Reddit search result with BeautifulSoup and Python

Using Python/BeautifulSoup, I'm trying to get the post title and URL from every result returned on Reddit.
Below is part of my code that retrieves all Reddit search results.
url = 'https://www.reddit.com/search/?q=test'
r = s.get(url, headers=headers_Get)
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('a', attrs={'data-click-id': 'body'})

for result in results:
    print(result.prettify())
    title_post = result.find('h3').text
    url_post = result.find('a')['href']
soup.find_all('a', attrs={'data-click-id':'body'}) appears to return a list of all search results. This is working as I'm expecting / hoping.
By doing print(result), I can validate that it is returning what I need. Below is the result of print(result.prettify()):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
<div class="_2SdHzo12ISmrC8H86TgSCp _1zpZYP8cFNLfLDexPY65Y7" style="--posttitletextcolor:#222222">
<h3 class="_eYtD2XCVieq6emjKBH3m">
<span style="font-weight:normal">Match Thread: 3rd
<em style="font-weight:700">Test
</em>- Australia v India, Day 5
</span>
</h3>
</div>
</a>
title_post = result.find('h3').text extracts the title associated with the comment or post. It is working as expected / hoped.
The problem that I have is with retrieving the address of the post (see href=):
<a class="SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE" data-click-id="body" href="/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/">
The line url_post = result.find('a')['href'] returns an error TypeError: 'NoneType' object is not subscriptable.
If I could use the "result" as a string, then I could just look for href within it. Something like:
loc = result.text.find('href=')
print(result.text[loc:])
Obviously, this won't work:
result.text does not return the HTML code, but just the string "Match Thread: 3rd Test - Australia v India, Day 5"
Question 1:
Is there a way to return only the href="" component?
Question 2:
Is there a way to convert the soup object "result" into plain text while keeping the HTML components? If it was possible, then I'd have an easy workaround.

The href is already in the .attrs of result:
>>> for result in results:
...     print(result.attrs)
...
{'data-click-id': 'body', 'class': ['SQnoC3ObvgnGjWt90zD9Z', '_2INHSNB8V5eaWp4P0rY_mE'], 'href': '/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/'}
...
Your result is already the <a> tag, and .find('a') searches only its descendants (there is no nested <a>), which is why it returns None. So don't call the .find() method; instead, access the href value using the [key] notation (like a dictionary).
In your example:
for result in results:
    url_post = result["href"]
    print(url_post)
Output:
/r/Cricket/comments/kunmyt/match_thread_3rd_test_australia_v_india_day_5/
/r/Cricket/comments/ku008u/match_thread_3rd_test_australia_v_india_day_4/
/r/Cricket/comments/ktcg7n/match_thread_3rd_test_australia_v_india_day_3/
...
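Putting this together with the title extraction that already works, the loop from the question might look like the sketch below (it assumes the requests session s, headers_Get and results are set up exactly as in the original code):
for result in results:
    title_post = result.find('h3').text    # post title, as in the question
    url_post = result["href"]              # relative link taken straight from the tag's attrs
    print(title_post, 'https://www.reddit.com' + url_post)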

You can use PRAW, the Python Reddit API Wrapper, to query their API, which is much easier than parsing the web pages. Reddit's class names are randomly generated, so you cannot rely on them when scraping.
https://praw.readthedocs.io/en/latest/
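A minimal sketch of the PRAW approach, assuming you register a "script" app to obtain credentials (the client_id, client_secret and user_agent values below are placeholders):
import praw

# Placeholder credentials - create a script app at https://www.reddit.com/prefs/apps
reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                     client_secret="YOUR_CLIENT_SECRET",
                     user_agent="search-example by u/yourname")

# Search all of Reddit for "test" and print each submission's title and link
for submission in reddit.subreddit("all").search("test", limit=10):
    print(submission.title, "https://www.reddit.com" + submission.permalink)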

Related

How to scrape last string of <p> tag element?

To start, Python is the first language I am learning.
I am scraping a website for rent prices across my city and I am using BeautifulSoup to get the price data, but I am unable to get the value of this tag.
Here is the tag:
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
Here is my code:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
rent = item.find_all("p")
for price in rent:
print(price.string)
I also tried:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
rent = item.find_all("p")
for price in rent:
items = price.find_all("strong")
for item in items:
print('item.string')
and that works to print out "Monthly Rent:" but I don't understand why I can't get the actual price. The above code shows me that the monthly rent is in the strong tag, which means that the p tag only contains the price which is what I want.
As mentioned by kyrony, there are two children in your <p>. Because you select the <strong>, you will only get one of the texts.
You could use different approaches. With stripped_strings:
list(soup.p.stripped_strings)[-1]
or contents:
soup.p.contents[-1]
or with the recursive argument:
soup.p.find(text=True, recursive=False)
Example
from bs4 import BeautifulSoup

html = '''<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.p.contents[-1])
Technically your content has two children
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
A strong tag
<strong class="hidden show-mobile-inline">Monthly Rent: </strong>
and a string
2,450 +
The .string property in Beautiful Soup returns a value only when a tag has a single child; because your <p> has two children (the <strong> tag and the trailing string), it returns None. In order to get the second string you need to use the stripped_strings generator (or contents, as above).
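Applied to the loop from the question (a sketch, assuming soup and the class name are as posted), that could look like:
for item in soup.find_all("div", {"class": "plan-group rent"}):
    for price in item.find_all("p"):
        # the last stripped string in the <p> is the price, e.g. "2,450 +"
        print(list(price.stripped_strings)[-1])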

How do I use find_all or select more precisely in this case?

When I run the following code after importing stuff:
Fighter1Main = []
for i in range(1, 3):
    url = Request(f"https://www.sherdog.com/events/a-{page}", headers={'User-Agent': 'Mozilla/5.0'})
    response = urlopen(url).read()
    soup = BeautifulSoup(response, "html.parser")
    for test2 in soup.find_all(class_="fighter left_side"):
        test3 = test2.find_all(itemprop="url")
        Fighter1Main.append(test3)
    page = page + 1
I get:
[[<a href="/fighter/Todd-Medina-61" itemprop="url">
<img alt="Todd 'El Tiburon' Medina" itemprop="image" src="/image_crop/200/300/_images/fighter/20140801074225_IMG_5098.JPG" title="Todd 'El Tiburon' Medina">
</img></a>], [<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">
<img alt="Ricco 'Suave' Rodriguez" itemprop="image" src="/image_crop/200/300/_images/fighter/20141225125221_1MG_9472.JPG" title="Ricco 'Suave' Rodriguez">
</img></a>]]
But I was expecting:
<a href="/fighter/Todd-Medina-61" itemprop="url">
<a href="/fighter/Ricco-Rodriguez-8" itemprop="url">
This is the type of webpage in question https://www.sherdog.com/events/a-1
I also tried using css select and got the same result.
for test2 in soup.select('.fighter.left_side [itemprop="url"]'):
    Fighter1Main.append(test2)
I thought I was using it correctly but I'm not sure how else to narrow it down to what I want.
If your issue is that you're getting a list of lists, and you just want a flat list, then you should do it like this:
for test2 in soup.find_all(class_="fighter left_side"):
    Fighter1Main += [t for t in test2.find_all(itemprop="url")]
but since you weren't happy with the output from for test2 in soup.select('.fighter.left_side [itemprop="url"]'): Fighter1Main.append(test2), and from your title, I'm guessing that isn't the problem here.
If you want to filter out any tags that have a nested tag inside them then you can add :not(:has(*)) to your selector
for test2 in soup.select('.fighter.left_side *[itemprop="url"]:not(:has(*))'):
    Fighter1Main.append(test2)
however, you can expect an empty list if you do this because [as far as I can tell] all tags matched to .fighter.left_side *[itemprop="url"] will have an img tag nested within.
If you really want something like your expected output, you'll have to either alter the soup or build it up yourself.
You can either remove everything inside the Tags with itemprop="url" [original soup object will be altered]:
for test2 in soup.select('.fighter.left_side *[itemprop="url"]'):
    test2.clear()
    Fighter1Main.append(test2)
Or you could form new html tags with only the href [if there is any] and itemprop attributes [original soup object will remain unaltered, but you'll be parsing and extracting again for each item]:
soup = BeautifulSoup(response, "html.parser")
Fighter1Main += [BeautifulSoup(
    f'<{n}{h} itemprop="url"></{n}>', "html.parser"
).find(n) for n, h in [(
    t.name, '' if t.get("href") is None else f' href="{t.get("href")}"'
) for t in soup.select('.fighter.left_side *[itemprop="url"]')]]
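If all you ultimately need are the links rather than reconstructed tags, a simpler sketch (not one of the approaches above) is to pull the href values out directly:
fighter_links = [t.get("href")
                 for t in soup.select('.fighter.left_side [itemprop="url"]')]
print(fighter_links)  # e.g. ['/fighter/Todd-Medina-61', '/fighter/Ricco-Rodriguez-8']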

python bs4 extract from attribute inside button class

So I'm trying to get the value of an attribute using BeautifulSoup4.
replay_url_data = matchdatatr[1].findAll("button",{"class":"replay_button_super"})
This is how I get all my data into the object.
Typing replay_url_data into the console returns:
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
What I want is to get the value of data-spectate-link.
I have tried every Google result I found on similar topics, but nothing worked.
replay_url_split = replay_url_data[0].findAll("button",{"class":"data-spectate-link"})
This returns "[]" empty.
replay_url_data[0].find('data-spectate-platform')
This returns the same empty result.
replay_url_data[0].find('button',attrs={'class' : 'data-spectate-link'})
And this one returns the same empty "[]" as above.
After 3 hours of searching on Google nothing has helped me and I'm getting desperate. I'm still new to Python and HTML, so excuse my stupidity.
To get the attribute, use .attrs["data-spectate-link"] or directly ["data-spectate-link"].
Example
from bs4 import BeautifulSoup as BS
text = '<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>'
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"class": "replay_button_super"})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)
value = one_button.attrs["data-spectate-link"]
print(value)
BTW: if you want to search for buttons that have the attribute data-spectate-link, then you have to search with
{"data-spectate-link": True}
not {"class": "data-spectate-link"}
Example
from bs4 import BeautifulSoup as BS
text = '''<button>Other button</button>
<button>Other button</button>
<button>Other button</button>
<button class="replay_button_super" data-client-version="0.0.2.21" data-rel="spectatePopup" data-spectate-encryptionkey="bPPxpLIDmi0hRfU2U8B9Li1VJfTTx6pZ" data-spectate-endpoint="replays.cosmicradiance.com:80" data-spectate-gameid="4339075348" data-spectate-link="/api/spectate/UEJPNkN4MkIwUkZERUJ0MWUyZ3dDTmxGT25kanlUN2V6YnpuZUQ0bVlyMWRReGNDRXprZ1lQVnRnSkNHMG04Y2hUdVhxQm9abHFsQ2VBaTRaYVFPdnc9PQ==" data-spectate-platform="Modigu1" data-width="640"><i class="fa fa-play"></i>Replay</button>
<button>Other button</button>
<button>Other button</button>'''
soup = BS(text, 'html.parser')
all_buttons = soup.findAll("button", {"data-spectate-link": True})
one_button = all_buttons[0]
value = one_button["data-spectate-link"]
print(value)
soup.button['data-spectate-link']
is what you want.
soup.button selects the first button tag inside the soup. Then with ['data-spectate-link'] you can access that attribute of the tag.
See the Beautiful Soup documentation for details.
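For example, with the button HTML from the question parsed into soup (the spectate link shortened to a placeholder here):
from bs4 import BeautifulSoup

html = '<button class="replay_button_super" data-spectate-link="/api/spectate/PLACEHOLDER">Replay</button>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.button['data-spectate-link'])  # /api/spectate/PLACEHOLDER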
data-spectate-link is an attribute. To get the value of data-spectate-link you need to use element['data-spectate-link'].
You can use either findAll() or the CSS selector method select().
replay_url_data = matchdatatr[1].findAll("button", attrs={"class": "replay_button_super", "data-spectate-link": True})
print(replay_url_data[0]['data-spectate-link'])
Or with a CSS selector:
replay_url_data = soup.select("button.replay_button_super[data-spectate-link]")
print(replay_url_data[0]['data-spectate-link'])

BeautifulSoup: find a tag with an attribute that has no value?

I'm trying to get the content of a particular tag which has an attribute but no value. How can I get it? For example:
cont = '<nav></nav> <nav breadcrumbs> aa</nav> <nav></nav>'
From the above I want to extract <nav breadcrumbs> aa</nav>.
So I have tried the following:
bread = contSoup.find("nav",{"breadcrumbs":""})
I have also tried this:
bread = contSoup.find("nav breadcrumbs")
For now I'm using a regex to get this data and I'm able to get the answer, but how can I do it with Beautiful Soup?
You can use attr=True for this case.
from bs4 import BeautifulSoup

cont = '<nav></nav> <nav breadcrumbs> aa</nav> <nav></nav>'
soup = BeautifulSoup(cont, 'lxml')  # works with 'html.parser' too
print(soup.find('nav', breadcrumbs=True))
# which is the same as: print(soup.find('nav', {'breadcrumbs': True}))
Output:
<nav breadcrumbs=""> aa</nav>
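A CSS selector is another option here (not part of the original answer): an attribute selector matches the tag whether or not the attribute has a value.
from bs4 import BeautifulSoup

cont = '<nav></nav> <nav breadcrumbs> aa</nav> <nav></nav>'
soup = BeautifulSoup(cont, 'html.parser')
print(soup.select_one('nav[breadcrumbs]'))  # <nav breadcrumbs=""> aa</nav>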

Extracting li element and assigning it to variable with beautiful soup

Given the following element
<ul class="listing-key-specs ">
<li>2004 (54 reg)</li>
<li>Hatchback</li>
<li>90,274 miles</li>
<li>Manual</li>
<li>1.2L</li>
<li>60 bhp</li>
<li>Petrol</li>
</ul>
How do I extract each li element and assign it to a variable with Beautiful Soup?
Currently, my code looks like this:
detail = car.find('ul', {'class': 'listing-key-specs'}).get_text(strip=True)
and it produces the following output:
2005 (05 reg)Saloon66,038 milesManual1.8L118 bhpPetrol
Please refer to the following question for more context: "None" returned during scraping.
Here is a demo:
from bs4 import BeautifulSoup

html_doc = """
<ul class="listing-key-specs ">
  <li>2004 (54 reg)</li>
  <li>Hatchback</li>
  <li>90,274 miles</li>
  <li>Manual</li>
  <li>1.2L</li>
  <li>60 bhp</li>
  <li>Petrol</li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
lst = [_.get_text(strip=True) for _ in soup.find('ul', {'class': 'listing-key-specs'}).find_all('li')]
print(lst)
Currently, you are calling get_text() on the ul tag, which simply returns all its contents as one string. So
<div>
  <p>Hello </p>
  <p>World </p>
</div>
would become Hello World.
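A quick sketch of that behaviour:
from bs4 import BeautifulSoup

html = "<div><p>Hello </p><p>World </p></div>"
div = BeautifulSoup(html, "html.parser").div
print(div.get_text())  # 'Hello World '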
To extract each matching sub-tag and store them as separate elements, find the ul and then call find_all('li') on it, like this:
tag_list = car.find('ul', class_='listing-key-specs').find_all('li')
my_list = [i.get_text() for i in tag_list]
This will give you a list of the text of all li tags inside the 'listing-key-specs' list (the li elements themselves don't carry that class). Now you're free to assign variables, e.g. carType = my_list[1]
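Since the list comes back in a fixed order for this markup, one way to assign the values to variables (the names below are just illustrative) is tuple unpacking:
year, body, mileage, gearbox, engine, power, fuel = my_list
print(body)  # Hatchback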
