Get playlist URL count using pytube - python

I'm trying to print how many links there are in the playlist.
Code
from pytube import Playlist
p = Playlist('https://www.youtube.com/playlist?list=PLRmF_eQXS6BnZb0PnIOvpxD6H8F04DfBX')
print(f'Downloading: {p.title}')
print((p.video_urls))
for video in p.videos:
    video.streams.first().download()
Actual output
Downloading: Bristol’s bday playlist
['https://www.youtube.com/watch?v=r7Rn4ryE_w8', 'https://www.youtube.com/watch?v=YhK2NwPIdt4', 'https://www.youtube.com/watch?v=4EQkYVtE-28', 'https://www.youtube.com/watch?v=CpTr4USXjQw', 'https://www.youtube.com/watch?v=dXJHDhKJ9Dw', 'https://www.youtube.com/watch?v=VKC_hzJ3jzg', 'https://www.youtube.com/watch?v=njvA03rMHx4', 'https://www.youtube.com/watch?v=GBvLVesLZmY', 'https://www.youtube.com/watch?v=JucvYrdSIcM']
Issue
But it prints the links, not the count of them.
I want this output:
Downloading: Bristol’s bday playlist
9

You can use the len() function to get the number of elements in a list, which in this case is the number of links in the playlist. Just replace print((p.video_urls)) with print(f'Number of links: {len(p.video_urls)}').
Something like this:
from pytube import Playlist
p = Playlist('https://www.youtube.com/playlist?list=PLRmF_eQXS6BnZb0PnIOvpxD6H8F04DfBX')
print(f'Downloading: {p.title}')
print(f'Number of links: {len(p.video_urls)}')
for video in p.videos:
    video.streams.first().download()
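If you also want per-video progress while downloading, here is a small variation using the same pytube calls as above (a sketch; enumerate just numbers the videos as they are downloaded):
from pytube import Playlist

p = Playlist('https://www.youtube.com/playlist?list=PLRmF_eQXS6BnZb0PnIOvpxD6H8F04DfBX')
total = len(p.video_urls)
print(f'Downloading: {p.title} ({total} links)')
for i, video in enumerate(p.videos, start=1):
    # Print a running count such as 3/9 before each download.
    print(f'{i}/{total}: {video.title}')
    video.streams.first().download()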

Related

While True try/except loop giving different output each time in web scraping - repeats or omits elements while iterating

I am trying to scrape some pages and count occurrences of a word on each page. I have to go through different sets of links to reach the final set of pages, and I used for loops to collect and iterate through the links.
As the website is slow, I put the final iteration inside a while True loop. But each time I run the code, it loops through the final set of links in different ways. For example, it goes through 20 links and then repeats those 20 links again while ignoring another 20 links. The number varies every run, sometimes within each iteration, repeating and omitting a random number of links.
The website is really slow, so unless I use a while True loop the program stops in the middle. Could someone please look through the code and point out what I am doing wrong?
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import pandas as pd
import io
import requests
import time
import csv
d=open('Wyd 20-21.csv','w')
writer=csv.writer(d,lineterminator='\n')
URL = "http://mnregaweb4.nic.in/netnrega/writereaddata/citizen_out/MAT_DTL_1603_MATD_eng2021.html"
soup = bs(requests.get(URL).content, "html.parser")
base_url = "http://mnregaweb4.nic.in/netnrega/"
linksblocks = []
for tag in soup.select("td:nth-of-type(2) a"):
    linksblocks.append(tag["href"])
print(linksblocks)
Allblocks = [base_url+e[6:] for e in linksblocks]
print(Allblocks)  # This is the first set of links. I have to iterate through each one of them to get to the second set of links
links = []
for each in Allblocks:
    soup = bs(requests.get(each).content, "html.parser")
    for tag in soup.select("td:nth-of-type(2) a"):
        links.append(tag["href"])
AllGPs = [base_url+e[6:] for e in links]
print(AllGPs)  # This is the second set of links. I have to iterate through each one of them to get to the final set of links
gp = 0
for each in AllGPs:
    res = requests.get(each)
    soup = bs(res.text, 'html.parser')
    urls = []
    for link in soup.find_all('a'):
        urls.append(link.get('href'))
    inte = urls[1:-1]
    each_bill = [base_url+e[6:] for e in inte]  # This is the final set of links. I have to iterate through each one of them to get to the final pages and look for the occurrence of the word in each of the page.
    q = len(each_bill)
    print("no of bills is: ", q)
    gp += 1
    x = 0
    while True:
        try:
            for each in each_bill:
                r = requests.get(each)
                y = r.text.count('Display Board')
                print(y)
                soup = bs(r.text, 'html.parser')
                table_soup = soup.findAll('table')
                trow = []
                for tr in table_soup[3]:
                    trow.append(tr)
                text = trow[1].text
                b = text[13:]
                print(b)
                writer.writerow((y, b))
                x += 1
                print("Now Wyd ", x, "th bill in", gp, " th GP")
                if x == q:
                    break
            if x == q:
                break
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(5)
            continue
d.close()
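One likely cause of the repeats and omissions described above: when a request fails mid-loop, the except block's continue restarts the whole for each in each_bill loop from the first link while x keeps its old value, so earlier bills get written again and, once x == q, the remaining ones are skipped. A minimal sketch of retrying each request individually instead (same requests/time imports as above; fetch is a hypothetical helper name):
import time
import requests

def fetch(url, retries=5, delay=5):
    # Retry a single slow/failed request instead of restarting the whole loop.
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.RequestException as e:
            print("exception error: ", e)
            time.sleep(delay)
    raise RuntimeError("giving up on " + url)
With that in place the final loop needs no while True at all: just call r = fetch(each) inside for each in each_bill, and every link is visited exactly once.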

Extracting web elements using Pyquery, Requests and Gadget selector

I am able to extract table values from this website with the following code.
from pyquery import PyQuery as pq
import requests
url = "https://finviz.com/screener.ashx"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".screener-link-primary").text()
print(Tickers)
But I am able to extract only the first 20 values. There is a 'next' button at the end of the page which has the link to the next set of values.
How can I extract this link automatically, fetch the new page and extract the new set of values and append to my existing list?
You can iterate through all pages like:
counter = 1
while True:
url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
content = requests.get(url).content
counter += 20
Note that for the first page the r parameter (which I guess stands for the starting entry index) will be 1, for the second 21, for the third 41, and so on. That's why I used a +20 increment for counter.
You should also add a break for when the last page is reached. Usually you check whether there is new data to scrape and, if not, break.
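Putting both pieces together, a minimal sketch of the full loop (the stopping check is an assumption about how the site behaves past the last page):
from pyquery import PyQuery as pq
import requests

tickers = []
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    doc = pq(requests.get(url).content)
    page_tickers = doc(".screener-link-primary").text().split()
    # Stop when the page yields no new tickers (assumption: past the
    # last page the site returns an empty or repeated result set).
    if not page_tickers or page_tickers[0] in tickers:
        break
    tickers.extend(page_tickers)
    counter += 20

print(len(tickers), "tickers collected")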

Google Search Results/ Beginner Python

Just some questions regarding Python 3.
def AllMusic():
    myList1 = ["bob"]
    myList2 = ["dylan"]
    x = myList1[0]
    z = myList2[0]
    y = "-->Then 10 Numbers?"
    print("AllMusic")
    print("http://www.allmusic.com/artist/"+x+"-"+z+"-mn"+y)
This is my code so far.
I want to write a program that prints out the variable y.
When you go to AllMusic.com, the different artists have unique 10-digit numbers.
For example, www.allmusic.com/artist/the-beatles-mn0000754032 and www.allmusic.com/artist/arcade-fire-mn0000185591.
x is the first word of the artist's name and z is the second word. Everything works, but I can't figure out a way to find that 10-digit number and return it for each artist I input into my Python program.
I figured out that when you go to Google and type, for example, "Arcade Fire AllMusic", the first result gives you the URL of the site just under the heading: www.allmusic.com/artist/arcade-fire-mn0000185591
How can I copy that 10-digit code, 0000185591, into my Python program and print it out for me to see?
I wouldn't use Google at all - you can use the search on the site. There are many useful tools to help you do web scraping in python: I'd recommend installing BeautifulSoup. Here's a small script you can experiment with:
from urllib.request import urlopen  # Python 3, since the question targets Python 3
from bs4 import BeautifulSoup

def get_artist_link(artist):
    base = 'http://www.allmusic.com/search/all/'
    # encode spaces
    query = artist.replace(' ', '%20')
    url = base + query
    page = urlopen(url)
    soup = BeautifulSoup(page.read(), "html.parser")
    # each search result is an <li class="artist"> with the artist link inside
    artists = soup.find_all("li", class_="artist")
    for artist in artists:
        print(artist.a.attrs['href'])

if __name__ == '__main__':
    get_artist_link('the beatles')
    get_artist_link('arcade fire')
For me this prints out:
/artist/the-beatles-mn0000754032
/artist/arcade-fire-mn0000185591
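To get just the 10-digit code out of those hrefs, a small regex on the trailing mn segment works (a sketch, assuming the id is always 10 digits as in the examples):
import re

def get_artist_id(href):
    # hrefs look like '/artist/arcade-fire-mn0000185591'
    match = re.search(r'mn(\d{10})$', href)
    return match.group(1) if match else None

print(get_artist_id('/artist/the-beatles-mn0000754032'))   # 0000754032
print(get_artist_id('/artist/arcade-fire-mn0000185591'))   # 0000185591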

Python 2.7: Can't figure out how to parse a tree with BeautifulSoup4

I am trying to parse this site to create 5 lists, one for each day and filled with one string for each announcement. For example
[in] custom_function(page)
[out] [[<MONDAYS ANNOUNCEMENTS>],
[<TUESDAYS ANNOUNCEMENTS>],
[<WEDNESDAYS ANNOUNCEMENTS>],
[<THURSDAYS ANNOUNCEMENTS>],
[<FRIDAYS ANNOUNCEMENTS>]]
But I can't figure out the correct way to do this.
This is what I have so far
from bs4 import BeautifulSoup
import requests
import datetime
url = "http://mam.econoday.com/byweek.asp?day=7&month=4&year=2014&cust=mam&lid=0"
# Get the text of the webpage
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
full_table_1 = soup.find('table', 'eventstable')
I figured out that what I want is in the highlighted tag, but I'm not sure how to get to that exact tag and then parse the times/announcements out into a list. I've tried multiple methods, but it just keeps getting messier.
What do I do?
The idea is to find all td elements with events class, then read div elements inside:
data = []
for day in soup.find_all('td', class_='events'):
    data.append([div.text for div in day.find_all('div', class_='econoevents')])
print data
prints:
[[u'Gallup US Consumer Spending Measure8:30 AM\xa0ET',
u'4-Week Bill Announcement11:00 AM\xa0ET',
u'3-Month Bill Auction11:30 AM\xa0ET',
...
],
...
]
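If you then want to split the time off each announcement, the strings above always end with something like '8:30 AM\xa0ET', so a regex can separate the two parts (a Python 3 sketch, assuming that suffix is consistent):
import re

event = 'Gallup US Consumer Spending Measure8:30 AM\xa0ET'
match = re.match(r'(.*?)(\d{1,2}:\d{2} [AP]M\xa0ET)$', event)
if match:
    print(match.group(1))  # Gallup US Consumer Spending Measure
    print(match.group(2))  # 8:30 AM ET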

Parsing all possible YouTube urls

I am looking for all the features that a YouTube URL can have, for example:
http://www.youtube.com/watch?v=6FWUjJF1ai0&feature=related
So far I have seen feature=relmfu, related, fvst, and fvwrel. Is there a list of these somewhere? Also, my ultimate aim is to extract the video id (6FWUjJF1ai0) from all possible YouTube URLs. How can I do that? It seems to be difficult. Has anyone already done this?
You can use urlparse to get the query string from your url, then you can use parse_qs to get the video id from the query string.
I wrote the code for your assistance... the credit for solving it is purely Frank's, though.
import urlparse as ups
m = ups.urlparse('http://www.youtube.com/watch?v=6FWUjJF1ai0&feature=related')
print ups.parse_qs(m.query)['v']
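On Python 3, the urlparse module moved into urllib.parse, so the same two calls look like this:
from urllib.parse import urlparse, parse_qs

m = urlparse('http://www.youtube.com/watch?v=6FWUjJF1ai0&feature=related')
print(parse_qs(m.query)['v'])  # ['6FWUjJF1ai0']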
From the following answer, https://stackoverflow.com/a/43490746/8534966, I ran 55 different test cases and its regex was able to get 51 matches. See my tests.
So I wrote some if else code to fix it:
# Get YouTube video ID
if "watch%3Fv%3D" in youtube_url:
    # e.g.: https://www.youtube.com/attribution_link?a=8g8kPrPIi-ecwIsS&u=/watch%3Fv%3DyZv2daTWRZU%26feature%3Dem-uploademail
    search_pattern = re.search("watch%3Fv%3D(.*?)%", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
elif "watch?v%3D" in youtube_url:
    # e.g.: http://www.youtube.com/attribution_link?a=JdfC0C9V6ZI&u=%2Fwatch%3Fv%3DEhxJLojIE_o%26feature%3Dshare
    search_pattern = re.search("v%3D(.*?)&format", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
elif "/e/" in youtube_url:
    # e.g.: http://www.youtube.com/e/dQw4w9WgXcQ
    youtube_url += " "
    search_pattern = re.search("/e/(.*?) ", youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
else:
    # All else.
    search_pattern = re.search(r"(?:[?&]vi?=|\/embed\/|\/\d\d?\/|\/vi?\/|https?:\/\/(?:www\.)?youtu\.be\/)([^&\n?#]+)",
                               youtube_url)
    if search_pattern:
        youtube_id = search_pattern.group(1)
You may instead want to consider a wider-spectrum URL parser, as suggested in this Gist. It will parse more than urlparse can.
