scraping with python using bs4

I am trying to scrape the page at the URL given in the code below.
I want to build a dictionary from it of the form
{key (name of game) : value (list of links)}
When I try, I can't find the div tag with id=accordion,
and because of this I'm stuck.
My code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
def findLiveGamesInBetman():
    dic = {}
    links = []
    url = 'https://live.batstream.tv/?sport=football&sp=1,2,3,4,5,6,7,8,9,10,20,25&fs=13px&fw=700&tt=none&fc=405115&tc=333333&bc=FFFFFF&bhc=FDFDFD&pd=4px&mr=1px&tm=817503&tmb=FFFFFF&wb=e5e5e5&bsh=0px&rdb=FFFFFF&rdc=C74300&l=https://sport-play.tv/register/&lt=1&lsp=1&lcy=1&lda=1&l2=https://sport-play.tv/register/&l2t=1&l2sp=1&l2co=1&l2cy=1&l2da=1'
    # Fetching the html
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    content = urlopen(request).read()
    parse = BeautifulSoup(content, 'html.parser')
    # Walking down the nested divs one level at a time
    body = parse.find('body')
    div_con = body.find('div', {"class": "container"})
    div_row = div_con.find('div', {"class": "row"})
    div_col = div_row.find('div', {"class": "col-lg-12"})
    div_accor = div_col.find('div', {'id': 'accordion_t'})
    div_ac = div_accor.find('div', {'id': 'accordion'})  # --> this returns empty
Note: the URL in this code shows just the games.
I have searched here for something that might help, but unfortunately I didn't find anything.
How can I fix it?
Thanks
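One way to narrow this down is to check whether the div is missing from the served HTML or present but empty. The sketch below runs on made-up markup that only reuses the id and class names from the question; SAMPLE_HTML and describe_element are hypothetical. If the element turns out to be present but empty, one possible explanation (which this sketch cannot confirm for the site in question) is that the page fills it in with JavaScript after loading.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the served HTML: the nesting mirrors the class and
# id names used in the question, but the markup itself is made up.
SAMPLE_HTML = """
<body>
  <div class="container">
    <div class="row">
      <div class="col-lg-12">
        <div id="accordion_t">
          <div id="accordion"></div>
        </div>
      </div>
    </div>
  </div>
</body>
"""


def describe_element(soup, element_id):
    """Report whether the element with this id is missing, empty, or has child tags."""
    element = soup.find(id=element_id)
    if element is None:
        return "missing"
    # find_all(True) returns every descendant tag; an empty list means the
    # element exists in the served HTML but has no markup inside it.
    if not element.find_all(True):
        return "empty"
    return "has children"


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
accordion_state = describe_element(soup, "accordion")
wrapper_state = describe_element(soup, "accordion_t")
missing_state = describe_element(soup, "no_such_id")
print(accordion_state, wrapper_state, missing_state)
```

Looking the element up directly by id also avoids the chain of find calls, where any intermediate None would raise an AttributeError.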

Related

Pulling p tags from multiple URLs

I've struggled with this for days and I'm not sure what the issue could be. Basically, I'm trying to extract the profile box data (picture below) from each link. Going through the inspector, I thought I could pull the p tags to do so.
I'm new to this and trying to understand, but here's what I have so far:
Code that (somewhat) successfully pulls the info for ONE link:
import requests
from bs4 import BeautifulSoup
# getting html
url = 'https://basketball.realgm.com/player/Darius-Adams/Summary/28720'
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')
container = soup.find('div', attrs={'class': 'main-container'})
playerinfo = container.find_all('p')
print(playerinfo)
I then also have a code that pulls all of the HREF tags from multiple links:
from bs4 import BeautifulSoup
import requests
def get_links(url):
    links = []
    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text, 'html.parser')
    # Collect the href of every <a> tag on the page
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    for link in links:
        print(link)
    print(len(links))
get_links('https://basketball.realgm.com/dleague/players/2022')
get_links('https://basketball.realgm.com/dleague/players/2021')
get_links('https://basketball.realgm.com/dleague/players/2020')
So basically, my goal is to combine these two, and get one code that will pull all of the P tags from multiple URLs. I've been trying to do it, and I'm really not sure at all why this isn't working here:
from bs4 import BeautifulSoup
import requests
def get_profile(url):
    profiles = []
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        profiles.append(profile.get('p'))
    for profile in profiles:
        print(profile)
get_profile('https://basketball.realgm.com/player/Darius-Adams/Summary/28720')
get_profile('https://basketball.realgm.com/player/Marial-Shayok/Summary/26697')
Again, I'm really new to web scraping with Python, but any advice would be greatly appreciated. Ultimately, my end goal is to have a tool that can scrape this data (Player name, Current Team, Born, Birthplace, etc.) cleanly, all at once.
Maybe I'm doing it entirely wrong, but any guidance is welcome!

You need to combine your two scripts and make a request for each player. Try the following approach. This searches for <td> tags that have the data-th="Player" attribute:
import requests
from bs4 import BeautifulSoup
def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")
    for td in soup.find_all('td', {'data-th' : 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")
        # Fetch the player's own page and locate the profile box
        req_player_url = requests.get(f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name" : name, "URL" : player_url}
        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except ValueError:     # not all entries have values
                pass
        data.append(row)
    return data
urls = [
    'https://basketball.realgm.com/dleague/players/2022',
    'https://basketball.realgm.com/dleague/players/2021',
    'https://basketball.realgm.com/dleague/players/2020',
]
for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    for entry in data:
        print(entry)
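The key/value splitting step in the answer above can be tried offline. This is a minimal sketch on invented markup: SAMPLE_HTML and parse_profile_box are hypothetical, only the profile-box class name comes from the answer, and the real page may be laid out differently. It swaps the try/except around split(':', 1) for str.partition, which never raises when the colon is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical profile-box markup; the field values are invented.
SAMPLE_HTML = """
<div class="profile-box">
  <p><strong>Current Team:</strong> Example Team</p>
  <p><strong>Born:</strong> Jan 1, 1990</p>
  <p>No separator here</p>
</div>
"""


def parse_profile_box(html):
    """Turn each 'Key: value' paragraph of the profile box into a dict entry."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="profile-box")
    row = {}
    if box is None:
        # Page without a profile box: return an empty row instead of crashing.
        return row
    for p in box.find_all("p"):
        # partition splits on the first colon only; sep is "" when there is none.
        key, sep, value = p.get_text(strip=True).partition(":")
        if sep:
            row[key.strip()] = value.strip()
    return row


profile = parse_profile_box(SAMPLE_HTML)
print(profile)
```

The None check on the box matters when looping over many players, since a single page without a profile box would otherwise stop the whole run.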

How to get data in Webscraping Python

https://mor.nlm.nih.gov/RxNav/search?searchBy=NDC&searchTerm=51079045120
How can I get the RXCUI number from this website in Python? I am unable to get it with the following:
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
cont = soup.findAll("div", {"id": "titleHolder"})
rx = cont.find("span", id = "rxcuiDecoration")
print(rx.text)

The site renders its content using JavaScript, so you have to use the API:
import requests
r = requests.get('https://rxnav.nlm.nih.gov/REST/ndcstatus.json?caller=RxNav&ndc=51079045120')
print(r.json()['ndcStatus']['rxcui'])
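Independently of how the page is rendered, the snippet in the question calls find on the result of findAll, which is a list-like ResultSet rather than a single tag. A minimal sketch of the difference, on made-up markup that only reuses the two ids from the question (SAMPLE_HTML and the number in it are invented):

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the ids from the question; the number is made up.
SAMPLE_HTML = """
<div id="titleHolder">
  <span id="rxcuiDecoration">12345</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a ResultSet (a list of tags), which has no find method.
holders = soup.find_all("div", {"id": "titleHolder"})
try:
    holders.find("span", id="rxcuiDecoration")
    chained_on_result_set = True
except AttributeError:
    chained_on_result_set = False

# find returns a single Tag (or None), so the second lookup can be chained.
holder = soup.find("div", {"id": "titleHolder"})
rx = holder.find("span", id="rxcuiDecoration") if holder is not None else None
rxcui_text = rx.get_text(strip=True) if rx is not None else None
print(chained_on_result_set, rxcui_text)
```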

Struggling to scrape a telegram public channel with BeautifulSoup

I'm practicing web scraping with BeautifulSoup, but I'm struggling to print a dictionary that includes all the items I've scraped.
The target can be any public Telegram channel (web version). I intend to collect the text message, timestamp, views and image URL (if one is attached to the post) and add them to the dictionary.
I've inspected the code for the four elements, but the one for the image URL has no class or span, so I've ended up scraping it via regex. The other three elements are easily retrievable.
Let's go step by step:
Importing modules
from bs4 import BeautifulSoup
import requests
import re
Function to get the images url from the public channel
def pictures(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    link = str(soup.find_all('a', class_ = 'tgme_widget_message_photo_wrap')) #converted to str in order to be able to apply regex
    image_url = re.findall(r"https://cdn4.*.*.jpg", link)
    return image_url
Soup to get the text message, timestamp and views
url = "https://t.me/s/computer_science_and_programming"
picture_list = pictures(url)  # url has to be defined before this call
channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_ ='tgme_widget_message')
full_message = {}
for content in tgpost:
    full_message['views'] = content.find('span', class_ = 'tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_ = 'time').text
    full_message['text'] = content.find('div', class_ = 'tgme_widget_message_text').text
    print(full_message)
I would really appreciate it if someone could help me. I'm new to Python and I don't know how to do the following:
Check whether the post contains an image and, if so, add it to the dictionary.
Print the dictionary, including image_url as the key and the URL as the value, for each post.
Thank you very much

I think you want something like this.
from bs4 import BeautifulSoup
import requests, re
url = "https://t.me/s/computer_science_and_programming"
channel = requests.get(url).text
soup = BeautifulSoup(channel, 'lxml')
tgpost = soup.find_all('div', class_ ='tgme_widget_message')
full_message = {}
for content in tgpost:
    full_message['views'] = content.find('span', class_ = 'tgme_widget_message_views').text
    full_message['timestamp'] = content.find('time', class_ = 'time').text
    full_message['text'] = content.find('div', class_ = 'tgme_widget_message_text').text
    # Only posts with a photo wrapper get a url_image entry
    if content.find('a', class_ = 'tgme_widget_message_photo_wrap') is not None:
        link = str(content.find('a', class_ = 'tgme_widget_message_photo_wrap'))
        full_message['url_image'] = re.findall(r"https://cdn4.*.*.jpg", link)[0]
    elif 'url_image' in full_message:
        # Drop the image left over from the previous post
        full_message.pop('url_image')
    print(full_message)
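A variation on the code above is to build a fresh dictionary for each post and collect them in a list, so nothing carries over between posts and the results can be reused after the loop. This is only a sketch on invented markup: the class names come from the thread, while SAMPLE_HTML, the style attribute holding the image URL, text_of, collect_posts and IMAGE_URL are assumptions. The regex is a stricter pattern swapped in for the one used in the thread, and html.parser replaces lxml so the sketch needs no extra install.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: class names come from the thread, everything else
# (text, counts, image URL, the style attribute) is invented for the sketch.
SAMPLE_HTML = """
<div class="tgme_widget_message">
  <a class="tgme_widget_message_photo_wrap"
     style="background-image:url('https://cdn4.example.org/file/abc.jpg')"></a>
  <div class="tgme_widget_message_text">First post</div>
  <span class="tgme_widget_message_views">1.2K</span>
  <time class="time">10:15</time>
</div>
<div class="tgme_widget_message">
  <div class="tgme_widget_message_text">Second post</div>
  <span class="tgme_widget_message_views">800</span>
  <time class="time">11:30</time>
</div>
"""

# Stricter than the pattern in the thread: stops at quotes, parentheses and whitespace.
IMAGE_URL = re.compile(r"https://[^'\"()\s]+\.jpg")


def text_of(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def collect_posts(html):
    """Return one dictionary per post; url_image is present only for photo posts."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for content in soup.find_all("div", class_="tgme_widget_message"):
        post = {
            "views": text_of(content, "span", "tgme_widget_message_views"),
            "timestamp": text_of(content, "time", "time"),
            "text": text_of(content, "div", "tgme_widget_message_text"),
        }
        photo = content.find("a", class_="tgme_widget_message_photo_wrap")
        if photo is not None:
            match = IMAGE_URL.search(str(photo))
            if match:
                post["url_image"] = match.group(0)
        posts.append(post)
    return posts


posts = collect_posts(SAMPLE_HTML)
for post in posts:
    print(post)
```

Because each post gets its own dictionary, the elif branch that pops a stale url_image is no longer needed.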

BeautifulSoup generating inconsistent results

I'm using BeautifulSoup to pull data out of Reddit sidebars on a selection of subreddits, but my results are changing pretty much every time I run my script.
Specifically, the results in sidebar_urls change from run to run: sometimes the list is [XYZ.com/abc, XYZ.com/def], other times it is just [XYZ.com/def], and sometimes it is [].
Any ideas why this might be happening using the code below?
sidebar_urls = []
for i in range(0, len(reddit_urls)):
    req = urllib.request.Request(reddit_urls[i], headers=headers)
    resp = urllib.request.urlopen(req)
    soup = BeautifulSoup(resp, 'html.parser')
    links = soup.find_all(href=True)
    for link in links:
        if "XYZ.com" in str(link['href']):
            sidebar_urls.append(link['href'])

It seems you sometimes get a page that does not have a sidebar. This could be because Reddit recognizes you as a robot and returns a default page instead of the one you expect. Consider identifying yourself when requesting the pages, using the User-Agent header:
import requests
from bs4 import BeautifulSoup
reddit_urls = [
    "https://www.reddit.com/r/leagueoflegends/",
    "https://www.reddit.com/r/pokemon/"
]
# Update this to identify yourself
user_agent = "me#example.com"
sidebar_urls = []
for reddit_url in reddit_urls:
    response = requests.get(reddit_url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(response.text, "html.parser")
    # Find the sidebar tag
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        print("Could not find a sidebar in page: {}".format(reddit_url))
        continue
    # Find all links in the sidebar tag
    link_tags = side_tag.find_all("a")
    for link in link_tags:
        link_text = str(link["href"])
        sidebar_urls.append(link_text)
print(sidebar_urls)
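One detail worth guarding against in the loop above: link["href"] raises a KeyError for an anchor that has no href attribute. A minimal offline sketch, assuming made-up markup (SAMPLE_HTML, sidebar_links and the example URLs are hypothetical; only the side class and the XYZ.com placeholder come from the thread), that asks find_all for anchors with an href and keeps the domain filter from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one sidebar with mixed anchors, plus a link outside it.
SAMPLE_HTML = """
<div class="side">
  <a href="https://XYZ.com/abc">abc</a>
  <a name="anchor-without-href">no link</a>
  <a href="https://other.example/page">other</a>
  <a href="https://XYZ.com/def">def</a>
</div>
<div class="content"><a href="https://XYZ.com/not-in-sidebar">x</a></div>
"""


def sidebar_links(html, domain):
    """Return sidebar hrefs containing the domain, or [] when there is no sidebar."""
    soup = BeautifulSoup(html, "html.parser")
    side_tag = soup.find("div", {"class": "side"})
    if side_tag is None:
        return []
    # href=True keeps only anchors that actually carry an href attribute.
    anchors = side_tag.find_all("a", href=True)
    return [a["href"] for a in anchors if domain in a["href"]]


found = sidebar_links(SAMPLE_HTML, "XYZ.com")
not_found = sidebar_links("<p>no sidebar on this page</p>", "XYZ.com")
print(found, not_found)
```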

none returned when trying to get tag value

In this HTML snippet from https://letterboxd.com/shesnicky/list/top-50-favourite-films/, I'm trying to go through all the different li tags and get the value of 'data-target-link', so I can use it to build a new link that takes me to the page for that film. However, every time I try to get the data, it simply returns None or an error along those lines.
<li class="poster-container numbered-list-item" data-owner-rating="10">
    <div class="poster film-poster really-lazy-load" data-image-width="125" data-image-height="187" data-film-slug="/film/donnie-darko/" data-linked="linked" data-menu="menu" data-target-link="/film/donnie-darko/" >
        <img src="https://s3.ltrbxd.com/static/img/empty-poster-125.c6227b2a.png" class="image" width="125" height="187" alt="Donnie Darko"/><span class="frame"><span class="frame-title"></span></span>
    </div>
    <p class="list-number">1</p>
</li>
I'm going to be using the links to grab imgs for a twitter bot, so I tried doing this within my code:
class BotStreamer(tweepy.StreamListener):
    print("Bot Streamer")
    # on_data method of Tweepy’s StreamListener
    # passes data from statuses to the on_status method
    def on_status(self, status):
        print("on status")
        link = 'https://letterboxd.com/shesnicky/list/top-50-favourite-films/'
        page = requests.get(link)
        soup = BS(page.content, 'html.parser')
        movies_ul = soup.find('ul', {'class':'poster-list -p125 -grid film-list'})
        movies = []
        for mov in movies_ul.find('data-film-slug'):
            movies.append(mov)
        rand = randint(0,51)
        newLink = "https://letterboxd.com%s" % (str(movies[rand]))
        newPage = requests.get(newLink)
        code = BS(newPage.content, 'html.parser')
        code_div = code.find('div', {'class':'react-component film-poster film-poster-51910 poster'})
        image = code_div.find('img')
        url = image.get('src')
        username = status.user.screen_name
        status_id = status.id
        tweet_reply(url, username, status_id)
However, I kept getting errors about the list index being out of range, or about not being able to iterate over NoneType. So I made a test program just to see if I could somehow get the data:
import requests
from bs4 import BeautifulSoup as BS
link = 'https://letterboxd.com/shesnicky/list/top-50-favourite-films/'
page = requests.get(link)
soup = BS(page.content, 'html.parser')
movies_ul = soup.find('ul', {'class':'poster-list -p125 -grid film-list'})
more = movies_ul.find('li', {'class':'poster-container numbered-list-item'})
k = more.find('data-target-link')
print(k)
And again, all I get is None. Any help greatly appreciated.

Read the docs: find() expects a tag name as its first argument, not an attribute name.
You can do
soup.find('div', {'data-target-link': True})
or
soup.find(attrs={'data-target-link': True})
Full example
import requests
from bs4 import BeautifulSoup as BS
link = 'https://letterboxd.com/shesnicky/list/top-50-favourite-films/'
page = requests.get(link)
soup = BS(page.content, 'html.parser')
all_items = soup.find_all('div', {'data-target-link': True})
for item in all_items:
    print(item['data-target-link'])
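To go from these attribute values to the random film page the question wants, a sketch along the following lines may help. It runs on a trimmed copy of the markup quoted in the question plus one invented second entry; SAMPLE_HTML, BASE_URL, film_urls and the /film/example-film/ slug are hypothetical names for illustration. It uses random.choice rather than randint(0, 51), so the pick cannot fall outside the list.

```python
import random
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://letterboxd.com"

# Trimmed copy of the markup quoted in the question, plus one invented second entry.
SAMPLE_HTML = """
<ul class="poster-list -p125 -grid film-list">
  <li class="poster-container numbered-list-item">
    <div class="poster film-poster" data-target-link="/film/donnie-darko/"></div>
  </li>
  <li class="poster-container numbered-list-item">
    <div class="poster film-poster" data-target-link="/film/example-film/"></div>
  </li>
</ul>
"""


def film_urls(html):
    """Return an absolute URL for every div that carries data-target-link."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("div", {"data-target-link": True})
    return [urljoin(BASE_URL, item["data-target-link"]) for item in items]


urls = film_urls(SAMPLE_HTML)
# random.choice picks from whatever was found; guard against an empty list.
random_url = random.choice(urls) if urls else None
print(urls)
print(random_url)
```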
