Remove image sources with same class reference when web scraping in Python?

I'm trying to write some code to extract some data from Transfermarkt (the full page URL is in the code below). I'm stuck trying to print the clubs. I've figured out that I need to access the h2 and then the a tag inside it in order to get just the text. The HTML is below:
<div class="table-header" id="to-349"><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018"><img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" /></a><h2><a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">Barnsley FC</a></h2></div>
You can see that if I just try find_all("a", {"class": "vereinprofil_tooltip"}) it doesn't work properly, as it also returns the image link, which has no plain text. But if I could search for h2 first and then run find_all("a", {"class": "vereinprofil_tooltip"}) within the returned h2, it would get me what I want. My code is below.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
Club = Clubs.find("a", {"class": "vereinprofil_tooltip"})
print(Club)
I get this error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I know what the error means but I've been going round in circles trying to find a way of actually doing it properly and getting what I want. Any help is appreciated.

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.co.uk/league-one/transfers/wettbewerb/GB3/
plus/?saison_id=2018&s_w=&leihe=1&intern=0&intern=1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
#Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Clubs = pageSoup.find_all("h2")
print(type(Clubs)) # this can be removed, but I left it to expose how I figured this out
for club in Clubs:
    print(club.text)
Basically: Clubs is a list (technically a ResultSet, but the behavior is very similar), so you need to iterate over it. .text gives just the text; other attributes can be retrieved as well.
Output looks like:
Transfer record 18/19
Barnsley FC
Burton Albion
Sunderland AFC
Shrewsbury Town
Scunthorpe United
Charlton Athletic
Plymouth Argyle
Portsmouth FC
Peterborough United
Southend United
Bradford City
Blackpool FC
Bristol Rovers
Fleetwood Town
Doncaster Rovers
Oxford United
Gillingham FC
AFC Wimbledon
Walsall FC
Rochdale AFC
Accrington Stanley
Luton Town
Wycombe Wanderers
Coventry City
Transfer record 18/19
There are, however, a bunch of blank lines (i.e., .text was '') that you should probably handle as well; see the sketch below.
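Here is a minimal sketch combining both steps from the question: iterate the ResultSet, then call find() on each individual h2 Tag (find() works on a single Tag, just not on the ResultSet itself), skipping the empty ones:
for h2 in pageSoup.find_all("h2"):
    club = h2.find("a", {"class": "vereinprofil_tooltip"})
    if club and club.text.strip():  # skip h2s without a club link or with empty text
        print(club.text.strip())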

My guess is you might mean findAll instead of find_all.
I tried the code below and it works:
content = """<div class="table-header" id="to-349">
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
<img src="https://tmssl.akamaized.net/images/wappen/small/349.png?lm=1574162298" title=" " alt="Barnsley FC" class="" />
</a>
<h2>
<a class="vereinprofil_tooltip" id="349" href="/fc-barnsley/transfers/verein/349/saison_id/2018">
Barnsley FC
</a>
</h2>
</div>"""
soup = BeautifulSoup(content, 'html.parser')
#get main_box
main_box = soup.findAll('a', {'class': 'vereinprofil_tooltip'})
#print(main_box)
for main_text in main_box: # looping thru the list
if main_text.text.strip(): # get the body text
print(main_text.text.strip()) # print it
output is
Barnsley FC
I'll edit this with a reference to the documentation about findAll; I can't remember it off the top of my head.
Edit: I had a look at the documentation, and it turns out find_all = findAll:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
now I feel dumb lol
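For what it's worth, a quick sketch to confirm the alias yourself (in bs4, findAll is kept as the old BS3-era spelling of find_all, so both names resolve to the same method object):
from bs4 import BeautifulSoup

# findAll is just the legacy alias; the assertion passes because both
# names point at the same underlying function
assert BeautifulSoup.findAll is BeautifulSoup.find_all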

Related

Can't find hrefs of interest with BeautifulSoup

I am trying to collect a list of hrefs from the Netflix careers site: https://jobs.netflix.com/search. Each job listing on this site has an anchor and a class: <a class=css-2y5mtm essqqm81>. To be thorough here, the entire anchor is:
<a class="css-2y5mtm essqqm81" role="link" href="/jobs/244837014" aria-label="Manager, Written Communications"\>\
<span tabindex="-1" class="css-1vbg17 essqqm80"\>\<h4 class="css-hl3xbb e1rpdjew0"\>Manager, Written Communications\</h4\>\</span\>\</a\>
Again, the information of interest here is the hrefs of the form href="/jobs/244837014". However, when I perform the standard BS commands to read the HTML:
import urllib.request
from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://jobs.netflix.com/search")
soup = BeautifulSoup(html_page, 'html.parser')
I don't see any of the hrefs that I'm interested in inside of soup.
Running the following loop does not show the hrefs of interest:
for link in soup.findAll('a'):
    print(link.get('href'))
What am I doing wrong?
That information is being fed dynamically in page, via XHR calls. You need to scrape the API endpoint to get jobs info. The following code will give you a dataframe with all jobs currently listed by Netflix:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
from tqdm import tqdm ## if Jupyter: from tqdm.notebook import tqdm
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
headers = {
    'referer': 'https://jobs.netflix.com/search',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
big_df = pd.DataFrame()
s = requests.Session()
s.headers.update(headers)
for x in tqdm(range(1, 20)):
    url = f'https://jobs.netflix.com/api/search?page={x}'
    r = s.get(url)
    df = pd.json_normalize(r.json()['records']['postings'])
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)
print(big_df[['text', 'team', 'external_id', 'updated_at', 'created_at', 'location', 'organization']])
Result:
100% 19/19 [00:29<00:00, 1.42s/it]
text team external_id updated_at created_at location organization
0 Events Manager - SEA [Publicity] 244936062 2022-11-23T07:20:16+00:00 2022-11-23T04:47:29Z Bangkok, Thailand [Marketing and PR]
1 Manager, Written Communications [Publicity] 244837014 2022-11-23T07:20:16+00:00 2022-11-22T17:30:06Z Los Angeles, California [Marketing and Publicity]
2 Manager, Creative Marketing - Korea [Marketing] 244740829 2022-11-23T07:20:16+00:00 2022-11-22T07:39:56Z Seoul, South Korea [Marketing and PR]
3 Administrative Assistant - Philippines [Netflix Technology Services] 244683946 2022-11-23T07:20:16+00:00 2022-11-22T01:26:08Z Manila, Philippines [Corporate Functions]
4 Associate, Studio FP&A - APAC [Finance] 244680097 2022-11-23T07:20:16+00:00 2022-11-22T01:01:17Z Seoul, South Korea [Corporate Functions]
... ... ... ... ... ... ... ...
365 Software Engineer (L4/L5) - Content Engineering [Core Engineering, Studio Technologies] 77239837 2022-11-23T07:20:31+00:00 2021-04-22T07:46:29Z Mexico City, Mexico [Product]
366 Distributed Systems Engineer (L5) - Data Platform [Data Platform] 201740355 2022-11-23T07:20:31+00:00 2021-03-12T22:18:57Z Remote, United States [Product]
367 Senior Research Scientist, Computer Graphics / Computer Vision / Machine Learning [Data Science and Engineering] 227665988 2022-11-23T07:20:31+00:00 2021-02-04T18:54:10Z Los Gatos, California [Product]
368 Counsel, Content - Japan [Legal and Public Policy] 228338138 2022-11-23T07:20:31+00:00 2020-11-12T03:08:04Z Tokyo, Japan [Corporate Functions]
369 Associate, FP&A [Financial Planning and Analysis] 46317422 2022-11-23T07:20:31+00:00 2017-12-26T19:38:32Z Los Angeles, California [Corporate Functions]
370 rows × 7 columns
For each job, the url would be https://jobs.netflix.com/jobs/{external_id}
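A rough sketch of that, assuming you keep the big_df built above (external_id comes back from the API, so it is just string-concatenated onto the base URL):
# add a column with the full posting URL for each job
big_df['url'] = 'https://jobs.netflix.com/jobs/' + big_df['external_id'].astype(str)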

BeautifulSoup can't find list elements given class

I am trying to access the elements in the Ingredients list of the following website: https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/
<div class="col-md-12 ingredient-wrapper">
<ul class="ingred-list ">
<li>
3 large free-range egg yolks
</li>
<li>
40 g Parmesan cheese, plus extra to serve
</li>
<li>
1 x 150 g piece of higher-welfare pancetta
</li>
<li>
200g dried spaghetti
</li>
<li>
1 clove of garlic
</li>
<li>
extra virgin olive oil
</li>
</ul>
</div>
I first tried just using requests and Beautiful Soup, but my code didn't find the list elements. I then tried using Selenium and it still didn't work. My code is below:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.jamieoliver.com/recipes/pasta-recipes/cracker-ravioli/"
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
for ultag in soup.findAll('div', {'class': "col-md-12 ingredient-wrapper"}):
    # for ultag in soup.findAll('ul', {'class': 'ingred_list '}):
    for litag in ultag.findALL('li'):
        print(litag.text)
To get the ingredients list, you can use this example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.jamieoliver.com/recipes/pasta-recipes/gennaro-s-classic-spaghetti-carbonara/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for li in soup.select('.ingred-list li'):
    print(' '.join(li.text.split()))
Prints:
3 large free-range egg yolks
40 g Parmesan cheese , plus extra to serve
1 x 150 g piece of higher-welfare pancetta
200 g dried spaghetti
1 clove of garlic
extra virgin olive oil

How to get unlabeled data by BeautifulSoup Python

<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>
In the HTML above, I want to extract the answer ("It's 4:38") and the timestamp. For the question, I used
for link in soup.find_all('a'): Questions.append(link.text);
but I couldn't do the same with the answers and timestamp. How do I resolve this issue?
You can see that the text you want is a descendant of the div element. You can also see that it immediately follows the first br element that is such a descendant. Then one way to find it is simply to iterate through the descendants of the div looking for that br. When you see that take the next item.
Here's how it will play out.
>>> import bs4
>>> soup = bs4.BeautifulSoup(open('sumin.htm').read(), 'lxml')
>>> div = soup.find('div')
>>> for element in div.descendants:
...     element.name, element
...
(None, '\n Contents\n ')
('a', <a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>)
(None, '\n Google what time is it\n ')
(None, '\n')
('br', <br/>)
(None, "\n It's 4:38.\n ")
('br', <br/>)
(None, '\n 2018. 2. 5. 5:38:41 PM\n ')
Notice that elements such as br have the name property but navigable strings do not (this property is None).
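A short sketch of the walk described above, assuming the snippet is saved locally as sumin.htm as before; it collects the string that immediately follows each br:
import bs4

soup = bs4.BeautifulSoup(open('sumin.htm').read(), 'lxml')
div = soup.find('div')
texts = []
take_next = False
for element in div.descendants:
    if take_next and element.name is None:  # a NavigableString right after a <br>
        texts.append(element.strip())
        take_next = False
    elif element.name == 'br':
        take_next = True
print(texts)  # ["It's 4:38.", '2018. 2. 5. 5:38:41 PM']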
How to get unlabeled data
Actually, it is not unlabeled. The text you want is located inside the <div> tag. To get that text you can check if the text is NavigableString.
If you check the type of each content,
Contents #<class 'bs4.element.NavigableString'>
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a> #<class 'bs4.element.Tag'>
#<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
It's 4:38. #<class 'bs4.element.NavigableString'>
<br/> #<class 'bs4.element.Tag'>
2018. 2. 5. 5:38:41 PM #<class 'bs4.element.NavigableString'>
Code:
>>> from bs4 import BeautifulSoup, NavigableString
>>> html = '''<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
... Contents
... <a href="https://www.google.com/search?q=Google+what+time+is+it">
... Google what time is it
... </a>
... <br/>
... It's 4:38.
... <br/>
... 2018. 2. 5. 5:38:41 PM
... </div>'''
>>> soup = BeautifulSoup(html, 'lxml')
>>> div = soup.find('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')
>>> contents = [x.strip() for x in div.contents if isinstance(x, NavigableString)]
>>> contents
['Contents', '', "It's 4:38.", '2018. 2. 5. 5:38:41 PM']
From this, you can understand what NavigableString is. Now, to get the date and time you can simply join the last 2 elements of the list.
>>> ' '.join(contents[-2:])
"It's 4:38. 2018. 2. 5. 5:38:41 PM"
from bs4 import BeautifulSoup

text = '<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">\n Contents\n \n Google what time is it\n \n <br/>\n It\'s 4:38.\n <br/>\n 2018. 2. 5. 5:38:41 PM\n </div>'
s = BeautifulSoup(text, "lxml")
>>> s.find("br").findNext("br").next
'\n 2018. 2. 5. 5:38:41 PM\n '
>>> s.find("br").next
"\n It's 4:38.\n "
Use select_one() in combination with the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser:
soup.select_one('.YwPhnf').text
# 09:06
Or you can use stripped_strings, but it's not as pretty as using CSS selectors:
html = '''
<div class="content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1">
Contents
<a href="https://www.google.com/search?q=Google+what+time+is+it">
Google what time is it
</a>
<br/>
It's 4:38.
<br/>
2018. 2. 5. 5:38:41 PM
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# returns a generator object
current_time = list(soup.select_one('.content-cell').stripped_strings)[2]
print(current_time)
# It's 4:38.
Code:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "what time it is",  # query
    "gl": "us",              # country to search from
    "hl": "en"               # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
current_time = soup.select_one('.YwPhnf').text
current_date = soup.select_one('.KfQeJ:nth-child(1)').text
print(f'{current_time}\n{current_date}')
# 2:11 AM
# September 11, 2021
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to get the data you want from the structured JSON, rather than figuring out to extract things and maintain the parser over time if something won't work correctly because of some changes in the HTML.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "what time it is",
    "gl": "us",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(results['answer_box']['result'])
# 2:07 AM
Disclaimer, I work for SerpApi.

Data Scraping across <div>'s

I am trying to extract information from a repeating set of rows containing many embedded <div>s. I am trying to write a scraper to get various elements from this page. For some reason, I can't find a way to get to the <div> with the class that contains the information for each row. Further, I am not able to isolate the sections that I will need to extract the information. For reference, here is a sample of one row:
<div id="dTeamEventResults" class="col-md-12 team-event-results"><div>
<div class="row team-event-result team-result">
<div class="col-md-12 main-info">
<div class="row">
<div class="col-md-7 event-name">
<dl>
<dt>Team Number:</dt>
<dd>11733</dd>
<dt>Team:</dt>
<dd> Aqua Duckies</dd>
<dt>Program:</dt>
<dd>FIRST LEGO League Jr.</dd>
</dl>
</div>
The script I have started to build looks like the following:
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
rows = page_soup.findAll("div", {"class":"row team-event-result team-result"})
whenever I run len(rows), it always results in 0. I seem to have hit a wall and am having trouble. Thanks for your help!
The content of this page is generated dynamically, so to catch it you need a browser simulator like Selenium. Here is a script that will fetch your desired content. Give it a shot:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017')
soup = BeautifulSoup(driver.page_source,"lxml")
for items in soup.select('.main-info'):
    docs = ' '.join([' '.join([item.text, ' '.join(val.text.split())]) for item, val in zip(items.select(".event-name dt"), items.select(".event-name dd"))])
    location = ' '.join([' '.join(item.text.split()) for item in items.select(".event-location-type address")])
    print("Event_Info: {}\nEvent_Location: {}\n".format(docs, location))
driver.quit()
The results look something like:
Event_Info: Team Number: 11733 Team: Aqua Duckies Program: FIRST LEGO League Jr.
Event_Location: Sparta, NJ 07871 USA
Event_Info: Team Number: 4281 Team: Bulldogs Program: FIRST Robotics Competition
Event_Location: Somerset, NJ 08873 USA
This seems like an issue of multiple-class tags. I believe this question might help you figure out the solution.
You can search specifically for dt and dd, the tags containing the target data:
from bs4 import BeautifulSoup as soup
from urllib2 import urlopen as uReq
import re
data = str(uReq('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017').read())
s = soup(data, 'lxml')
headers = map(lambda x:x[:-1], [[b.text for b in i.find_all('dt')] for i in s.find_all('dl')][0])
data = [[re.sub('\s{2,}', '', b.text) for b in i.find_all('dd')] for i in s.find_all('dl')]
print(data)
final_data = [dict(zip(headers, i)) for i in data]
print(final_data)
When running this code on your example above, the output is:
[[u'11733', u' Aqua Duckies', u'FIRST LEGO League Jr.']]
[{u'Program': u'FIRST LEGO League Jr.', u'Team Number': u'11733', u'Team': u' Aqua Duckies'}]
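For what it's worth, a rough Python 3 port of the same idea (urllib2 became urllib.request, and map() now returns an iterator, so the header cleanup is written as a list comprehension); the dynamic-content caveat from the Selenium answer still applies if you run it against the live page:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import re

data = uReq('https://www.firstinspires.org/team-event-search#type=teams&sort=name&keyword=NJ&programs=FLLJR,FLL,FTC,FRC&year=2017').read()
s = soup(data, 'lxml')
# take the <dt> labels from the first <dl> and strip the trailing ':'
headers = [dt.text[:-1] for dt in s.find_all('dl')[0].find_all('dt')]
data = [[re.sub(r'\s{2,}', '', dd.text) for dd in dl.find_all('dd')] for dl in s.find_all('dl')]
final_data = [dict(zip(headers, row)) for row in data]
print(final_data)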

Focusing in on specific results while scraping Twitter with Python and Beautiful Soup 4?

This is a follow-up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I'm not using the Twitter API because it doesn't look at the tweets by hashtag this far back. Complete code and output are below, after the examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "#", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
#Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question. However, I think that question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup documentation is helpful, though: http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Use the dictionary-like access to the Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
Or, if you need to get the href values for every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You may use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
Alecxe already explained to use the 'href' key to get the value.
So I'm going to answer the other part of your questions:
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
.contents returns a list of all the children. Since you're finding 'buttons', which have several children you're interested in, you can just get them from the parsed content list, as follows:
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
If you want a rather more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', {'class': 'ProfileTweet-actionCountForPresentation'})[0].string
This returns 4 and 2 respectively.
This works because we take the Tag element out of the ResultSet returned by soup/find_all (using [0]) and then recursively search all of its descendants again using find_all().
Now you can loop across each tweet and extract this information rather easily; a rough sketch follows.
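Note the 'tweet' container class below is an assumption about the old Twitter markup, so substitute whatever per-tweet wrapper the page actually uses:
# 'tweet' is an assumed per-tweet wrapper class, not confirmed by the page
for tweet in soup.find_all('div', {'class': 'tweet'}):
    link = tweet.find('a', {'class': 'tweet-timestamp'})
    count = tweet.find('span', {'class': 'ProfileTweet-actionCountForPresentation'})
    if link:
        print(link['href'], count.string if count else None)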
