I've created a script in Python to get the link to a player's Twitter account. The problem is that the Twitter link is within an iframe. I can parse it using Selenium; however, I would like to know if there is an alternative way to get the link using the requests module, perhaps from a script tag or something similar.
website link
If you scroll down that page, you can see the Twitter link located in the right-hand area, something like the image below:
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://247sports.com/Player/JT-Tuimoloau-46048440/"
def get_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    twitter = soup.select_one("a.customisable-highlight").get('href')
    print(twitter)

if __name__ == '__main__':
    get_links(link)
I don't know how to actually get the iframe, but maybe there is another way for you to fetch the Twitter name (and create a link to this Twitter account afterwards).
It seems like the information you need is hidden in a div tag with class="tweets-comp". If you extract the value of the attribute data-username, you should end up with the name of the Twitter account:
import requests
from bs4 import BeautifulSoup
link = "https://247sports.com/Player/JT-Tuimoloau-46048440/"
res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"html.parser")
div = soup.find('div', {'class':'tweets-comp'})
print(div['data-username'])
# JT_tuimoloau
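Since you also want a link to the account, you can build the profile URL from the extracted handle (twitter.com/<handle> is the standard profile URL pattern):
# Build the profile link from the extracted handle
twitter_link = "https://twitter.com/" + div['data-username']
print(twitter_link)
# https://twitter.com/JT_tuimoloau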
I tried to scrape my YouTube subscriptions list into a CSV file, but I ran into a problem partway through the code. Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.youtube.com/feed/channels'
source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')
text = soup.find_all('yt-formatted-string', {'id': 'text'})
for i in range(len(text)):
    print(text[i].yt-formatted-string.text)
I am wondering why VS Code didn't recognize 'yt-formatted-string' even though it appears on the HTML page. Also, when I tried another div from the HTML, this code didn't give any output.
Your code is not working because you aren't logged in to your account when sending requests to "https://www.youtube.com/feed/channels". You must log in first to get all of your subscriptions.
You can solve this problem by using Selenium. First, log in to your account with Selenium, and then use either Selenium or BeautifulSoup to extract the subscriptions from that page.
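Here is a minimal sketch of that approach. It assumes you reuse a browser profile that is already logged in to YouTube (automating the Google login form itself is often blocked), and the profile path is a placeholder you must adjust:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

# Reuse a Chrome profile that is already logged in to YouTube.
# The profile path is an assumption - point it at your own profile.
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=/path/to/your/chrome/profile")

driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com/feed/channels")
time.sleep(5)  # crude wait for the dynamic content to render

# Hand the rendered page to BeautifulSoup and pull the channel names
soup = BeautifulSoup(driver.page_source, "lxml")
for name in soup.find_all("yt-formatted-string", {"id": "text"}):
    print(name.text)

driver.quit()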
I'm trying to get the title of a YouTube video given its link, but I'm unable to access the element that holds the title. I'm using bs4 to parse the HTML.
I noticed I'm unable to access any element that is within the 'ytd-app' tag on the YouTube page.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
a = soup.find_all(attrs={"class": "style-scope ytd-video-primary-info-renderer"})
print(a)
So how can I get the video title? Am I doing something wrong, or did YouTube intentionally create a tag like this to prevent web scraping?
The class you are targeting is rendered through JavaScript and all of the content is dynamic, so it is very difficult to find that data using bs4.
What you can do instead is look through the soup manually and find a tag that is present in the static HTML, for example the title tag:
You can also try pytube; see the sketch after the snippet below.
import bs4
import requests
listed_url = "https://www.youtube.com/watch?v=9IfT8KXX_9c&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=31"
listed = requests.get(listed_url)
soup = bs4.BeautifulSoup(listed.text, "html.parser")
soup.find("title").get_text()
I'm trying to run some statistical analysis on topic-based multireddits. Rather than collecting each individual subreddit by hand, I have found websites that collect these subreddits (Example, Example 2).
Unfortunately, these sites do not offer a way to download the list of subreddits as plain text that I could use in a dictionary. Is there a method I could use to scrape these sites and get back only the URL of each attached hyperlink on the page?
Thanks!
Edit: Here's my current code, which runs, but returns every URL.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://snoopsnoo.com/subreddits/travel/"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
links = []
for link in soup.find_all('a'):
    reddit = link.get('href')
    links.append(reddit)
df = pd.DataFrame(links, columns=['string_values'])
df.to_csv('travel.csv')
Yes, there is such a method. If you are using Python, a widely used library is BeautifulSoup. This library parses the HTML directly, so no webdriver or web browser running in the background is needed, as with Selenium. You can install it with pip install bs4.
For your first example site:
import urllib.request
from bs4 import BeautifulSoup
# Load the url
url = "https://snoopsnoo.com/subreddits/travel/"
html = urllib.request.urlopen(url).read()
# Create the parser object
soup = BeautifulSoup(html, "html.parser")
# Find all panel headings
panels = soup.find_all(class_="panel-heading big")
# Find the <a>-elements and extract the link
links = [elem.find('a')['href'] for elem in panels]
print(links)
Here I inspected the contents of the page to locate the panel elements by class, then extracted the <a> elements and their href attributes.
This code will grab all of the titles.
from selenium import webdriver
firefox_options = webdriver.FirefoxOptions()
#firefox_options.add_argument('--headless')
driver = webdriver.Firefox(executable_path='geckodriver.exe', firefox_options=firefox_options)
driver.get("https://snoopsnoo.com/subreddits/travel/")
for i in range(3):
    wds = driver.find_elements_by_class_name('title')
    for wd in wds:
        print(wd.text)
    # note the parentheses - without them the click never fires
    driver.find_element_by_xpath('/html/body/div/div[2]/div[1]/ul/li/a').click()
    print('next page')
driver.close()
Change the 3 in for i in range(3): to however many pages you want. Uncomment firefox_options.add_argument('--headless') to run in headless mode.
Good afternoon,
I am fairly new to web scraping. I am trying to scrape a dataset from an open-data portal, just to figure out how scraping a website works.
I am trying to scrape a dataset from data.toerismevlaanderen.be.
This is the dataset I want: https://data.toerismevlaanderen.be/tourist/reca/beer_bars
I always end up with an HTTP error: HTTP Error 404: Not Found.
This is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://data.toerismevlaanderen.be/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
soup.findAll('a')
one_a_tag = soup.findAll('a')[35]
link = one_a_tag['href']
download_url = 'https://data.toerismevlaanderen.be/'+ link
urllib.request.urlretrieve(download_url,'./'+link[link.find('/tourist/reca/beer_bars_')+1:])
time.sleep(1)
What am I doing wrong?
The issue is the following:
link = one_a_tag['href']
print(link)
This returns a link: https://data.toerismevlaanderen.be/
Then you are adding this link to download_url by doing:
download_url = 'https://data.toerismevlaanderen.be/'+ link
Therefore, if you print(download_url), you get:
https://data.toerismevlaanderen.be/https://data.toerismevlaanderen.be/
This is not a valid URL.
UPDATE BASED ON COMMENTS
The issue is that tourist/activities/breweries does not appear anywhere in the text you scrape.
If you write:
for link in soup.findAll('a'):
    print(link.get('href'))
you see all the a href tags; none of them contains tourist/activities/breweries.
But if you want just the link data.toerismevlaanderen.be/tourist/activities/breweries, you can do:
download_url = link + "tourist/activities/breweries"
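Putting it together, a corrected version of the download step might look like this (a sketch; the local filename is an arbitrary choice):
import urllib.request

link = 'https://data.toerismevlaanderen.be/'
download_url = link + 'tourist/activities/breweries'
# Save the resource under a local filename of your choosing
urllib.request.urlretrieve(download_url, './breweries')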
There is an API for this, so I would use that instead, e.g.:
import requests
r = requests.get('https://opendata.visitflanders.org/tourist/reca/beer_bars.json?page=1&page_size=500&limit=1').json()
You get many absolute links in return, so prefixing them with the original URL for new requests therefore won't work. Simply request the link you grabbed as-is instead.
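As a sketch of walking the response (the 'results' key below is hypothetical - inspect the actual JSON structure first):
import requests

r = requests.get('https://opendata.visitflanders.org/tourist/reca/beer_bars.json?page=1&page_size=500&limit=1').json()
# 'results' is a hypothetical key - check the real response structure.
for record in r.get('results', []):
    # Each record carries absolute links; request them directly
    # instead of prefixing them with the original domain.
    print(record)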
Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different URLs.
I want to get only the HTML tag containing a video embed URL, and then get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video or object, or the exact link to the video?
I need it to put it in my iframe. I will integrate the Python with my PHP, so after getting the video link and outputting it with Python, I will echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you can just change details in the link to embed, so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do this for multiple URLs without having to parse each page, do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it's in, and then the attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do:
soup.find("iframe").get("src")
The iframe because the link is in the iframe tag, and the .get("src") because the link is in the src attribute.
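Putting those pieces together, a small sketch for that page (same Python 2 style as above):
import urllib
from bs4 import BeautifulSoup as BS

url = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#fetch and parse the page
soup = BS(urllib.urlopen(url).read())
#the embed link is the src attribute of the iframe tag
print soup.find("iframe").get("src")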
I'll leave the next one for you to try, since you should learn how to do it yourself if you want to be able to do it in the future :)
Good luck!
You can't parse a URL; BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
Here is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; put them together with the domain and you get absolute paths.
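To join them, something like this works (same Python 2 style as the snippet above; on Python 3, urljoin lives in urllib.parse):
import urlparse

base = 'https://archive.org'
# urljoin resolves each relative link against the domain
absolute_links = [urlparse.urljoin(base, link) for link in links]
print absolute_links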