I am trying to automate the process of obtaining the number of followers of different Twitter accounts using the page source.
I have the following code for one account
from bs4 import BeautifulSoup
import requests

username = 'justinbieber'
url = 'https://www.twitter.com/' + username
r = requests.get(url)
soup = BeautifulSoup(r.content)
for tag in soup.findAll('a'):
    if tag.has_key('class'):
        if tag['class'] == 'ProfileNav-stat ProfileNav-stat--link u-borderUserColor u-textCenter js-tooltip js-nav u-textUserColor':
            if tag['href'] == '/justinbieber/followers':
                print tag.title
                break
I am not sure where I went wrong. I understand that I could use the Twitter API to obtain the number of followers, but I wish to try obtaining it through this method as well. Any suggestions?
I've modified the code from here
If I were you, I'd pass the class name as an argument to find() instead of find_all(), and I'd first look for the <li> element that contains the anchor you're looking for. It would look something like this:
from bs4 import BeautifulSoup
import requests

username = 'justinbieber'
url = 'https://www.twitter.com/' + username
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
f = soup.find('li', class_="ProfileNav-item--followers")
title = f.find('a')['title']
print(title)
# 81,346,708 Followers
num_followers = int(title.split(' ')[0].replace(',', ''))
print(num_followers)
# 81346708
PS findAll() was renamed to find_all() in bs4
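To see both names in action, here is a minimal sketch on an inline HTML snippet (the markup is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item">a</li><li class="item">b</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() is the bs4 name; findAll() still works as a backwards-compatible alias
items = soup.find_all('li', class_='item')
print([li.text for li in items])  # ['a', 'b']
```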
Related
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen
url = f'https://www.apple.com/kr/search/youtube?src=globalnav'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
links = soup.select(".rf-serp-productname-list")
print(links)
I want to crawl all the links of the apps shown. When I searched for a keyword, I thought links = soup.select(".rf-serp-productname-list") would work, but the links list is empty.
What should I do?
Just check this code, I think it is what you want:
import re
import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
Source:
https://gist.github.com/AO8/f721b6736c8a4805e99e377e72d3edbf
You can change this part:
for link in soup.find_all("a", href=pattern):
    # do something
to check for a keyword, I think.
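A minimal sketch of that keyword check, with a hypothetical keyword and an inline snippet standing in for a real page:

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="/apps/youtube">YouTube</a>'
        '<a href="/apps/maps">Maps</a>'
        '<a href="/about">About</a>')
soup = BeautifulSoup(html, 'html.parser')

keyword = "apps"              # hypothetical keyword to filter on
pattern = re.compile("^(/)")  # same relative-link pattern as above

# keep only links whose href contains the keyword
matches = [link.attrs["href"] for link in soup.find_all("a", href=pattern)
           if keyword in link.attrs["href"]]
print(matches)  # ['/apps/youtube', '/apps/maps']
```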
You are cooking a soup, so first of all taste it and check whether everything you expect is actually contained in it.
The ResultSet of your selection is empty because the structure in the response differs a bit from the one you see in the developer tools.
To get the list of links, select more specifically:
links = [a.get('href') for a in soup.select('a.icon')]
Output:
['https://apps.apple.com/kr/app/youtube/id544007664', 'https://apps.apple.com/kr/app/%EC%BF%A0%ED%8C%A1%ED%94%8C%EB%A0%88%EC%9D%B4/id1536885649', 'https://apps.apple.com/kr/app/youtube-music/id1017492454', 'https://apps.apple.com/kr/app/instagram/id389801252', 'https://apps.apple.com/kr/app/youtube-kids/id936971630', 'https://apps.apple.com/kr/app/youtube-studio/id888530356', 'https://apps.apple.com/kr/app/google-chrome/id535886823', 'https://apps.apple.com/kr/app/tiktok-%ED%8B%B1%ED%86%A1/id1235601864', 'https://apps.apple.com/kr/app/google/id284815942']
I want to find all hyperlinks whose text includes "article" on https://www.geeksforgeeks.org/
For example, at the bottom of this webpage:
Write an Article
Improve an Article
I want to get all their hyperlinks and print them, so I tried:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import os
import re

url = 'https://www.geeksforgeeks.org/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, "html.parser")
links = []
for link in soup.findAll('a', href=True):
    # print(link.get("href"))
    if re.search('/article$', href):
        links.append(link.get("href"))
However, I get [] as the result. How can I solve this?
Here is something you can try:
Note that there are more links containing the text "article" on the page you provided, but this gives you the idea of how to deal with it.
In this case I just checked whether the word "article" is in the text of each tag. You could use a regex search there, but for this example it is overkill.
import requests
from bs4 import BeautifulSoup

url = 'https://www.geeksforgeeks.org/'
res = requests.get(url)
if res.status_code != 200:
    print('request failed')
soup = BeautifulSoup(res.content, "html.parser")
links_with_article = soup.findAll(lambda tag: tag.name == "a" and "article" in tag.text.lower())
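If you do want the regex route, bs4 also accepts a compiled pattern as the string filter (it is applied with re.search, and only matches tags whose content is a single string). A sketch on an inline snippet:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/w">Write an Article</a><a href="/j">Jobs</a>'
soup = BeautifulSoup(html, 'html.parser')

# case-insensitive match on each tag's string content
matched = soup.find_all('a', string=re.compile('article', re.I))
print([a['href'] for a in matched])  # ['/w']
```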
EDIT:
If you know that there is a word in the href, i.e. in the link itself:
soup.select("a[href*=article]")
this will search for the word article in the href attribute of all a elements.
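For instance, on a small inline snippet (markup made up for illustration):

```python
from bs4 import BeautifulSoup

html = ('<a href="/write-article">Write</a>'
        '<a href="/improve-article">Improve</a>'
        '<a href="/jobs">Jobs</a>')
soup = BeautifulSoup(html, 'html.parser')

# [href*=article] matches any <a> whose href contains the substring "article"
hrefs = [a['href'] for a in soup.select('a[href*=article]')]
print(hrefs)  # ['/write-article', '/improve-article']
```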
Edit: get only href:
hrefs = [link.get('href') for link in links_with_article]
I'm a beginner in Python. I'm trying to get the first search result link from Google, which is stored inside a div with class='yuRUbf', using BeautifulSoup. When I run the script, the output is None. What is the error here?
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
productDivs = soup.find("div", {"class": "yuRUbf"})
print(productDivs)
Let's see. Google serves different HTML when no User-agent header is sent, so pass one with the request:
from bs4 import BeautifulSoup
import requests, json
headers = {
    'User-agent': "useragent"
}
html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
output:
https://www.youtube.com/watch?v=YQHsXMglC9A
Since you want the first Google search result, and the class name you are looking for may differ, first find that link manually so it is easy to identify.
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
Using the select method:
I used the CSS selector method, which identifies all matching divs; from the resulting list I took everything from index position 1 onward. Then I used select_one to get the a tag and read its href accordingly.
main_data = soup.select("div.ZINbbc.xpd.O9g5cc.uUPGi")[1:]
main_data[0].select_one("a")['href'].replace("/url?q=", "")
Using find method:
main_data=soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")[1:]
main_data[0].find("a")['href'].replace("/url?q=","")
Output [Same for Both the Case]:
'https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup&sa=U&ved=2ahUKEwjGxv2wytXyAhUprZUCHR8mBNsQFnoECAkQAQ&usg=AOvVaw280R9Wlz2mUKHFYQUOFVv8'
I have the following code:
from bs4 import BeautifulSoup
import requests
import csv
url = "https://coingecko.com/en"
base_url = "https://coingecko.com"
page = requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")
names = [div.a.span.text for div in soup.find_all("div",attrs={"class":"coin-content center"})]
Link = [base_url+div.a["href"] for div in soup.find_all("div",attrs={"class":"coin-content center"})]
for link in Link:
    inner_page = requests.get(link)
    inner_soup = BeautifulSoup(inner_page.content, "html.parser")
    indent = inner_soup.find("div", attrs={"class": "py-2"})
    content = indent.div.next_siblings
    Allcontent = [sibling for sibling in content if sibling.string is not None]
    print(Allcontent)
I have successfully entered the inner page and grabbed all of the coin information for the coins listed on the first page. But there are next pages: 1, 2, 3, 4, 5, 6, 7, 8, 9, etc. How can I go to all the next pages and do the same as before?
Also, the output of my code contains a lot of \n and spaces. How can I fix that?
You need to generate all the page URLs, request them one by one, and parse each with bs4:
from bs4 import BeautifulSoup
import requests
req = requests.get('https://www.coingecko.com/en')
soup = BeautifulSoup(req.content, 'html.parser')
last_page = soup.select('ul.pagination li:nth-of-type(8) > a:nth-of-type(1)')[0]['href']
lp = last_page.split('=')[-1]
count = 0
for i in range(int(lp)):
    count += 1
    url = 'https://www.coingecko.com/en?page=' + str(count)
    print(url)
    requests.get(url)  # request each page one by one till the last page
    # parse your fields here using bs4
The way you have written your script makes it look messy. Try .select() to make it concise and less prone to breakage. Although I could not find any further usage of names in your script, I kept it as it is. Here is how you can get all the available links while traversing multiple pages:
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
url = "https://coingecko.com/en"
while True:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "lxml")
    names = [item.text for item in soup.select("span.d-lg-block")]
    for link in [urljoin(url, item["href"]) for item in soup.select(".coin-content a")]:
        inner_page = requests.get(link)
        inner_soup = BeautifulSoup(inner_page.text, "lxml")
        desc = [item.get_text(strip=True) for item in inner_soup.select(".py-2 p") if item.text]
        print(desc)
    try:
        url = urljoin(url, soup.select_one(".pagination a[rel='next']")['href'])
    except TypeError:
        break
Btw, whitespace is also taken care of by using .get_text(strip=True).
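To see the difference on a small inline example:

```python
from bs4 import BeautifulSoup

html = '<p>\n   Some coin description   \n</p>'
soup = BeautifulSoup(html, 'html.parser')

print(repr(soup.p.text))                  # surrounding newlines and spaces kept
print(repr(soup.p.get_text(strip=True)))  # 'Some coin description'
```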
I am a beginner in WebCrawling, and I have a question regarding crawling multiple urls.
I am using CNBC in my project. I want to extract news titles and urls from its home page, and I also want to crawl the contents of the news articles from each url.
This is what I've got so far:
import requests
from lxml import html
import pandas
url = "http://www.cnbc.com/"
response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
len(headlineNode)
result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    soup = BeautifulSoup(url_node.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
    if (url_node and title and text):
        result_list.append({'URL': url + url_node[0].strip(),
                            'TITLE': title[0].strip(),
                            'TEXT': text[0].strip()})
print(result_list)
len(result_list)
I keep getting an error saying that 'list' object has no attribute 'content'. I want to create a dictionary that contains the title, the URL, and the article content for each headline. Is there an easier way to approach this?
Great start on the script. However, soup = BeautifulSoup(url_node.content) is wrong: url_node is a list. You need to form the full news URL, use requests to get the HTML, and then pass it to BeautifulSoup.
Apart from that, there are a few things I would look at:
I see import issues: BeautifulSoup is not imported.
Add from bs4 import BeautifulSoup to the top. Are you using pandas? If not, remove it.
Some of the news divs on CNBC with the big banner picture will yield a zero-length list when you query url_node = node.xpath('./a/@href'). You need to find the appropriate logic and selectors to get those news URLs as well. I will leave that up to you.
Check this out:
import requests
from lxml import html
import pandas
from bs4 import BeautifulSoup
# Note trailing backslash removed
url = "http://www.cnbc.com"
response = requests.get(url)
doc = html.fromstring(response.text)
headlineNode = doc.xpath('//div[@class="headline"]')
print(len(headlineNode))
result_list = []
for node in headlineNode:
    url_node = node.xpath('./a/@href')
    title = node.xpath('./a/text()')
    # Figure out logic to get that pic banner news URL
    if len(url_node) == 0:
        continue
    else:
        news_html = requests.get(url + url_node[0])
        soup = BeautifulSoup(news_html.content, "html.parser")
        text = [''.join(s.findAll(text=True)) for s in soup.findAll("div", {"class": "group"})]
        if (url_node and title and text):
            result_list.append({'URL': url + url_node[0].strip(),
                                'TITLE': title[0].strip(),
                                'TEXT': text[0].strip()})

print(result_list)
len(result_list)
Bonus debugging tip:
Fire up an ipython3 shell and do %run -d yourfile.py. Look up ipdb and the debugging commands. It's quite helpful to check what your variables are and if you're calling the right methods.
Good luck.