I have been trying for hours to sort this out but have been unable to do so.
Here is my script, using Selenium WebDriver in Python, to extract the title, date, and link. I am able to extract the title and link; however, I am stuck on extracting the date. Could someone please help me with this? Your response is much appreciated.
import selenium.webdriver
import pandas as pd

frame = []
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome("C:/Users/[Computer_Name]/Downloads/chromedriver.exe")
driver.get(url)
all_div = driver.find_elements_by_xpath('//div[contains(@class, "ui-accordion-content")]')
for div in all_div:
    all_items = div.find_elements_by_tag_name("a")
    for item in all_items:
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = ''  # <-- this is where I am stuck
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv', index=False, encoding='utf-8-sig')
Here is the HTML I am interested in:
<div id="ui-accordion-1-panel-0" ...>
<div class="views-field views-field-title">
<span class="field-content">
<a href="/publications/media-release/ngadju-corporation-emerges-special-administration-stronger">
Ngadju corporation emerges from special administration stronger
</a>
</span>
</div>
<div class="views-field views-field-field-document-media-release-no">
<div class="field-content"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-07-31T00:00:00+10:00">
31 July 2020
</span> (MR2021-06)</div>
</div>
</div>
...
I'd get all rows first.
from pprint import pprint

import selenium.webdriver

frame = []
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome()
driver.get(url)
divs = driver.find_elements_by_css_selector('div.ui-accordion-content')
for div in divs:
    rows = div.find_elements_by_css_selector('div.views-row')
    for row in rows:
        item = row.find_element_by_tag_name('a')
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = row.find_element_by_css_selector(
            'span.date-display-single').get_attribute('textContent')
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
driver.quit()
pprint(frame)
print(len(frame))
OK, just search for the <span> with the property dc:date, save it in a WebElement dateElement, and take its text with dateElement.text. That's your date as a string.
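As a minimal sketch of that attribute lookup, here it is with BeautifulSoup against the snippet pasted in the question (with Selenium, the equivalent would be an XPath like .//span[@property="dc:date"] relative to the row):

```python
from bs4 import BeautifulSoup

# The date markup from the question, inlined for a self-contained demo
html = """
<div class="views-field views-field-field-document-media-release-no">
  <div class="field-content"><span class="date-display-single" property="dc:date"
    datatype="xsd:dateTime" content="2020-07-31T00:00:00+10:00">
  31 July 2020
  </span> (MR2021-06)</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Locate the <span> by its property="dc:date" attribute
date_span = soup.find("span", attrs={"property": "dc:date"})
print(date_span.get_text(strip=True))  # the human-readable date
print(date_span["content"])            # the machine-readable ISO timestamp
```

The content attribute is often the better value to keep, since it is already an ISO 8601 timestamp.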
I am new to Python and have never worked with HTML, so any help would be appreciated.
I need to extract two numbers, '1062' and '348', from a website (found via the browser's inspect element).
This is my code:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_one('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, I am able to extract the first number (1062), but I am unable to extract the second number (348). Can you please help?
Assuming the pattern is always the same, you can select the elements by text and get each one's next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always two elements, then the simplest way would probably be to unpack the list of selected elements.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a")]
print("Advanced:", adv)
print("Declined:", dec)
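Both answers above rely on the live page nesting an <a> inside each .advDec element. In the snippet as pasted in the question, the label and number are plain text inside the <p>, in which case splitting on the colon is enough — a sketch against that snippet:

```python
from bs4 import BeautifulSoup

# The markup exactly as shown in the question (no <a> inside .advDec)
html = """
<div class="col-sm-6"><p class="advDec">Advanced: 1062</p></div>
<div class="col-sm-6"><p class="advDec">Declined: 348</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep the part after "Label:" from each paragraph
adv, dec = [p.get_text().split(":")[1].strip() for p in soup.select("p.advDec")]
print(adv, dec)
```

Splitting on the colon also avoids the hard-coded [10:] slice, which breaks as soon as the label text changes length.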
I'm looking for some help with scraping with Selenium in Python.
You need a paid account to view this page, so creating a reproducible example won't be possible.
The page I'm trying to scrape:
I'm attempting to scrape the data from the pitch in the top right corner of the image, under 'Spots on Field'.
<div class="player-details-football-map__UEFA player-details-football-map">
<div class="shots">
<div>
<a class="shot episode" style="left: 39.8529%; top: 28.9474%;"></a>
<div class="tooltip" style="left: 39.8529%; top: 28.9474%;">
<div class="tooltip-title">
<div class="tooltip-shoot-type">Shot on target</div>
<div class="tooltip-blow-type">Donyell Malen </div>
<div class="tooltip-shoot-name"></div>
</div>
<div class="tooltip-time">h Viktoria Koln</div>
<div class="tooltip-time">Half 1, 18:22 02/09/20</div>
<div class="tooltip-time">Length: 7.1 m</div>
<div class="tooltip-shoot-xg">Expected goals: 0.17</div>
</div>
</div>
The above is a snippet of just one of the data points I want to scrape.
I've tried using BeautifulSoup
from bs4 import BeautifulSoup
from requests import get
url = 'https://football.instatscout.com/players/294322/shots'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
shots = html_soup.find_all('div', class_ = 'tooltip')
print(type(shots))
print(len(shots))
and nothing was being returned.
So now I've tried using Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Users\James\OneDrive\Desktop\webdriver\chromedriver.exe')
driver.get('https://football.instatscout.com/players/294322/shots')
print("Page Title is : %s" % driver.title)
driver.find_element_by_name('email').send_keys('my username')
driver.find_element_by_name('pass').send_keys('my password')
driver.find_element_by_xpath('//*[contains(concat(" ", @class, " "), concat(" ", "hRAqIl", " "))]').click()
goals = driver.find_element_by_class_name('tooltip')
but I'm getting the error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".tooltip"}
Can someone please help point me in the right direction? I'm basically trying to scrape everything from the above HTML that includes 'tooltip' in the class name.
Thanks
Using css selectors with bs4:
from bs4 import BeautifulSoup as soup
import re  # for extracting offsets

# html holds the rendered page source, e.g. driver.page_source
r = [{**dict(zip(['left', 'top'], re.findall(r'[\d.]+', i.div['style']))),
      'shoot_type': i.select_one('.tooltip-shoot-type').text,
      'name': i.select_one('.tooltip-blow-type').text,
      'team': i.select_one('div:nth-of-type(2).tooltip-time').text,
      'time': i.select_one('div:nth-of-type(3).tooltip-time').text,
      'length': i.select_one('div:nth-of-type(4).tooltip-time').text[8:],
      'expected_goals': i.select_one('.tooltip-shoot-xg').text[16:]}
     for i in soup(html, 'html.parser').select('div.shots > div')]
Output:
[{'left': '39.8529', 'top': '28.9474', 'shoot_type': 'Shot on target', 'name': 'Donyell Malen ', 'team': 'h Viktoria Koln', 'time': 'Half 1, 18:22 02/09/20', 'length': '7.1 m', 'expected_goals': '0.17'}]
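As a sanity check, here is a trimmed, runnable version of the same comprehension run against the snippet from the question, with the page source inlined and only a few of the fields kept:

```python
import re
from bs4 import BeautifulSoup

# The shot markup from the question, inlined and closed
html = """
<div class="shots">
<div>
<a class="shot episode" style="left: 39.8529%; top: 28.9474%;"></a>
<div class="tooltip" style="left: 39.8529%; top: 28.9474%;">
<div class="tooltip-title">
<div class="tooltip-shoot-type">Shot on target</div>
<div class="tooltip-blow-type">Donyell Malen </div>
</div>
<div class="tooltip-time">h Viktoria Koln</div>
<div class="tooltip-time">Half 1, 18:22 02/09/20</div>
<div class="tooltip-time">Length: 7.1 m</div>
<div class="tooltip-shoot-xg">Expected goals: 0.17</div>
</div>
</div>
</div>
"""

# Same structure as the full answer: offsets from the style attribute,
# the remaining fields from CSS selectors relative to each shot div
r = [{**dict(zip(['left', 'top'], re.findall(r'[\d.]+', i.div['style']))),
      'shoot_type': i.select_one('.tooltip-shoot-type').text,
      'length': i.select_one('div:nth-of-type(4).tooltip-time').text[8:]}
     for i in BeautifulSoup(html, 'html.parser').select('div.shots > div')]
print(r)
```

Note that the [8:] slice assumes the "Length: " prefix never changes; splitting on ": " would be a more robust alternative.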
I'm trying to select the second <div> tag with the info class name, but with no success using bs4's find_next. How do you go about selecting the text inside the second div tag that shares a class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried:
from bs4 import BeautifulSoup as soup
import requests

players_url = ['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']
# this is the dict where we store all information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = soup(player_page.content, 'lxml')
    data = dict((k.contents[0].strip(), v.get_text(strip=True))
                for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'),
                                cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    club = {"Club": cont.find('div', attrs={'class': 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class': 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
print(position)
You can try the following:
club_ele = cont.find('div', attrs={'class': 'info'})
club = {"Club": club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class': 'info'})}
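A self-contained check of this against the snippet from the question — the key point is that find_next is called on the Tag returned by the first find, not on the whole soup:

```python
from bs4 import BeautifulSoup

# The first pair of .info divs from the question's snippet
html = """
<div class="info">
  <a href="/clubs/12/Manchester-United/overview">Manchester United<span class="playerClub badge-20 t1"></span></a>
</div>
<div class="info">Defender</div>
"""

cont = BeautifulSoup(html, "html.parser")
club_ele = cont.find('div', attrs={'class': 'info'})               # first .info div
position_ele = club_ele.find_next('div', attrs={'class': 'info'})  # next .info div in document order
print(club_ele.get_text(strip=True), "-", position_ele.get_text(strip=True))
```

find_next walks forward in document order starting after the element's opening tag, so it naturally lands on the following .info div rather than back on the first one.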
I'm parsing a huge file; the HTML below is only a small part of it, and the first div occurs many times. Inside each such div I want to get different attributes from the <a> (I don't mind if I also get the elements inside the <a>).
I'm doing this, but it doesn't work:
from bs4 import BeautifulSoup
import requests
import re

page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
with open('pmu.html', 'a+') as file:
    for div in soup.find_all('div', class_='time_group', attrs={'data-time_group': re.compile("group[1-9]")}):
        event_information = div.find('a', class_='trow--event tc-track-element-events')
        print(re.sub(r'\s+', ' ', event_information.text))
An example of the HTML:
<div class="time_group" data-time_group="group0">
<div class="row">
<div class="col-sm-12">
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
</div>
</div>
</div>
With the for loop I get into the different divs that interest me, but I don't know how to use each div for the next step: I want to do the find inside the elements of that div, not outside it (in the whole soup).
What I expect:
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
Then I just have to find the different tag values in my variable.
I hope my English isn't horrible.
Thank you in advance for your valuable assistance.
EDIT 1:
Let's take an example source code: https://pastebin.com/KZBp9c3y
In this file, when I do for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }): I find the first div, but imagine we have multiple matches in the for loop.
Then I want to find, within this div, the element with tag a and class trow--event...: div.find('a', class_ = 'trow--event tc-track-element-events')
An example of a possible result is:
data-event_id="brescia__pescara"
data-compet_id="italie_-_serie_b"
data-sport_id="football"
score-both :
Anyway, the problem is that I don't know how to do a find from the div where I am. I'm in <div class="time_group" data-time_group="group1"> and I want to get different pieces of information. I want to parse the div from top to bottom.
Concretely:
for div in soup:
    if current_div is:
        do this.....
    else if:
        do this...
How can I get the current_div?
Tell me if you don't understand what I want.
Thank you.
I've found something; it's not exactly what I wanted, but it works:
from bs4 import BeautifulSoup
import requests
import re

page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
soupdiv = soup.find_all('div', class_='time_group', attrs={'data-time_group': re.compile("group[1-9]")})
for div in soupdiv:
    test = div.find("a", {"class": "trow--event tc-track-element-events"})
    print(test.text)
I do the find from the current div in the for loop.
Thank you.
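The scoped lookup can be verified against the snippet from the question: calling find on a Tag searches only that tag's descendants, never the rest of the soup (markup trimmed and href shortened for brevity):

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's group0 block
html = """
<div class="time_group" data-time_group="group0">
  <div class="row"><div class="col-sm-12">
    <a class="trow--event tc-track-element-events" href="/event/522788"
       data-event_id="rsb_berkane__rapide_oued_zem" data-sport_id="football">
      <em class="trow--event--name"><span>RSB Berkane // Rapide Oued Zem</span></em>
    </a>
  </div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="time_group")
# find on the div searches only inside this div, not the whole document
a = div.find("a", {"class": "trow--event tc-track-element-events"})
print(a["data-event_id"], a.get_text(strip=True))
```

The data-* attributes are then read with the usual dictionary-style access on the returned Tag.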
This is a follow-up to my post Using Python to Scrape Nested Divs and Spans in Twitter?.
I'm not using the Twitter API because it doesn't look at tweets by hashtag this far back. Complete code and output are below, after the examples.
I want to scrape specific data from each tweet. name and handle are retrieving exactly what I'm looking for, but I'm having trouble narrowing down the rest of the elements.
As an example:
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
Retrieves this:
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">
<span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
For url, I only need the href value from the first line.
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
How can I narrow down the results to the required data for the url, retweetcount and favcount outputs?
I am planning to have this cycle through all the tweets once I get it working, in case that has an influence on your suggestions.
Complete Code:
from bs4 import BeautifulSoup
import requests
import sys
url = 'https://twitter.com/search?q=%23bangkokbombing%20since%3A2015-08-10%20until%3A2015-09-30&src=typd&lang=en'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
r = requests.get(url, headers=headers)
data = r.text.encode('utf-8')
soup = BeautifulSoup(data, "html.parser")
name = soup('strong', {'class': 'fullname js-action-profile-name show-popup-with-id'})
username = name[0].contents[0]
handle = soup('span', {'class': 'username js-action-profile-name'})
userhandle = handle[0].contents[1].contents[0]
link = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = link[0]
messagetext = soup('p', {'class': 'TweetTextSize js-tweet-text tweet-text'})
message = messagetext[0]
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet'})
retweetcount = retweets[0]
favorites = soup('button', {'class': 'ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite'})
favcount = favorites[0]
print (username, "\n", "#", userhandle, "\n", "\n", url, "\n", "\n", message, "\n", "\n", retweetcount, "\n", "\n", favcount) #extra linebreaks for ease of reading
Complete Output:
Michael Peel
#Mikepeeljourno
<a class="tweet-timestamp js-permalink js-nav js-tooltip" href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015"><span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-long-form="true" data-time="1443518016" data-time-ms="1443518016000">29 Sep 2015</span></a>
<p class="TweetTextSize js-tweet-text tweet-text" data-aria-label-part="0" lang="en"><a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/FT?src=hash"><s>#</s><b>FT</b></a> Case closed: <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Thailand?src=hash"><s>#</s><b>Thailand</b></a> police chief proclaims <a class="twitter-hashtag pretty-link js-nav" data-query-source="hashtag_click" dir="ltr" href="/hashtag/Bangkokbombing?src=hash"><s>#</s><b><strong>Bangkokbombing</strong></b></a> solved ahead of his retirement this week -even as questions over case grow</p>
<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>
<button class="ProfileTweet-actionButtonUndo u-linkClean js-actionButton js-actionFavorite" type="button">
<div class="IconContainer js-tooltip" title="Undo like">
<div class="HeartAnimationContainer">
<div class="HeartAnimation"></div>
</div>
<span class="u-hiddenVisually">Liked</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">2</span>
</span>
</div>
</button>
It was suggested that BeautifulSoup - extracting attribute values may have an answer to this question there. However, I think the question and its answers do not have sufficient context or explanation to be helpful in more complex situations. The link to the relevant part of the Beautiful Soup Documentation is helpful though, http://www.crummy.com/software/BeautifulSoup/documentation.html#The%20attributes%20of%20Tags
Use the dictionary-like access to the Tag's attributes.
For example, to get the href attribute value:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
url = links[0]["href"]
Or, if you need to get the href values for every link found:
links = soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
urls = [link["href"] for link in links]
As a side note, you don't need to specify the complete class value to locate elements. class is a special multi-valued attribute and you can just use one of the classes (if this is enough to narrow down the search for the desired elements). For example, instead of:
soup('a', {'class': 'tweet-timestamp js-permalink js-nav js-tooltip'})
You may use:
soup('a', {'class': 'tweet-timestamp'})
Or, a CSS selector:
soup.select("a.tweet-timestamp")
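Both points together on the timestamp markup from the question, as a runnable sketch:

```python
from bs4 import BeautifulSoup

# The timestamp link from the question, inlined for a self-contained demo
html = ('<a class="tweet-timestamp js-permalink js-nav js-tooltip" '
        'href="/Mikepeeljourno/status/648787700980408320" title="2:13 AM - 29 Sep 2015">'
        '<span class="_timestamp js-short-timestamp" data-time="1443518016">29 Sep 2015</span></a>')

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a.tweet-timestamp")  # one class is enough to match
url = link["href"]                           # dictionary-like attribute access
stamp = link.span["data-time"]               # works on any Tag, nested ones included
print(url, stamp)
```

The data-time value is a Unix timestamp, so it is often more useful than parsing the display text.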
Alecxe already explained how to use the 'href' key to get the value, so I'm going to answer the other part of your question:
Similarly, the retweets and favorites commands return large chunks of html, when all I really need is the numerical value that is displayed for each one.
.contents returns a list of all the children. Since you're finding 'button' elements that have several children you're interested in, you can get the count from the parsed content list:
retweetcount = retweets[0].contents[3].contents[1].contents[1].string
This will return the value 4.
If you want a rather more readable approach, try this:
retweetcount = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
favcount = favorites[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string
This returns 4 and 2, respectively.
This works because we take the ResultSet returned by soup(...)/find_all, get the Tag element (using [0]), and recursively search all its descendants again using find_all().
Now you can loop across each tweet and extract this information rather easily.
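The index chain above can be double-checked against the retweet button from the output — a self-contained sketch, with the fragility of .contents indexing (whitespace text nodes count as children) noted in the comments:

```python
from bs4 import BeautifulSoup

# The retweet button markup from the output above
html = """<button class="ProfileTweet-actionButtonUndo js-actionButton js-actionRetweet" data-modal="ProfileTweet-retweet" type="button">
<div class="IconContainer js-tooltip" title="Undo retweet">
<span class="Icon Icon--retweet"></span>
<span class="u-hiddenVisually">Retweeted</span>
</div>
<div class="IconTextContainer">
<span class="ProfileTweet-actionCount">
<span aria-hidden="true" class="ProfileTweet-actionCountForPresentation">4</span>
</span>
</div>
</button>"""

soup = BeautifulSoup(html, "html.parser")
retweets = soup('button', {'class': 'ProfileTweet-actionButtonUndo'})

# Index-based: contents[3] is the IconTextContainer div only because the
# newlines between tags are also children (contents[0] and [2] are whitespace)
by_index = retweets[0].contents[3].contents[1].contents[1].string

# Search-based: more readable and immune to whitespace changes
by_search = retweets[0].find_all('span', class_='ProfileTweet-actionCountForPresentation')[0].string

print(by_index, by_search)
```

For that reason the find_all form is usually the safer choice when looping over many tweets.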