get number of followers of twitter account by scraping twitter - python

I'm trying to get the number of followers of a given Twitter account by scraping Twitter. I tried both BeautifulSoup and XPath, but none of my code works.
Here is some of my test code:
import requests
from bs4 import BeautifulSoup

url = "https://twitter.com/BarackObama"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
div_tag = soup.find_all('main', {"class": "css-1dbjc4n r-1habvwh r-16xksha r-1wbh5a2"})
When I try to inspect the content I scraped with the code below,
import requests
t=requests.get('https://twitter.com/BarackObama')
print(t.content)
the output doesn't include any of the data I need, such as the follower count.
Please help me to do this.

When your code requests the Twitter URL, only the page shell comes back; the follower count and other values are filled in afterwards by JavaScript, so they never appear in the HTML you parse. Instead, use the Twitter Python API, which can return followers via api.GetFollowers().
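For example, here is a minimal sketch using the python-twitter package (the library that exposes api.GetFollowers()); the credentials are placeholders, and if you only need the count, reading followers_count from the user profile avoids paging through every follower:
import twitter  # python-twitter package

# Placeholder credentials -- substitute your own app's keys.
api = twitter.Api(consumer_key="CONSUMER_KEY",
                  consumer_secret="CONSUMER_SECRET",
                  access_token_key="ACCESS_TOKEN_KEY",
                  access_token_secret="ACCESS_TOKEN_SECRET")

# Follower count straight from the user profile.
user = api.GetUser(screen_name="BarackObama")
print(user.followers_count)

# Or list the authenticated account's own followers.
followers = api.GetFollowers()
print(len(followers))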

The relevant API endpoint is followers/ids. Using TwitterAPI you can do the following:
from TwitterAPI import TwitterAPI, TwitterPager

api = TwitterAPI(YOUR_CONSUMER_KEY,
                 YOUR_CONSUMER_SECRET,
                 YOUR_ACCESS_TOKEN_KEY,
                 YOUR_ACCESS_TOKEN_SECRET)

count = 0
r = TwitterPager(api, 'followers/ids')
for item in r.get_iterator():
    count = count + 1
print(count)
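If you only need the total, paging through every follower ID is slow for large accounts. A hedged alternative (reusing the api object above) is the users/show endpoint, which reports followers_count directly:
# Assumes `api` is the TwitterAPI instance created above.
r = api.request('users/show', {'screen_name': 'BarackObama'})
print(r.json()['followers_count'])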

Related

Web Scraping Stock Ticker Price from Yahoo Finance using BeautifulSoup

I'm trying to scrape the gold price ticker from Yahoo! Finance.
from bs4 import BeautifulSoup
import requests, lxml
response = requests.get('https://finance.yahoo.com/quote/GC=F?p=GC=F')
soup = BeautifulSoup(response.text, 'lxml')
gold_price = soup.findAll("div", class_='My(6px) Pos(r) smartphone_Mt(6px)')[2].find_all('p').text
Whenever I run this it returns: list index out of range.
When I do print(len(soup)) it returns 4.
Any ideas?
Thank you.
You can make a direct request to the Yahoo server instead. To locate the query URL, open the Network tab in Dev tools (F12) -> Fetch/XHR -> look for the name spark?symbols= (refresh the page if nothing shows up), find the symbol you need, and inspect the response in the Preview tab that opens on the right.
You can make direct requests to any of these links as long as the request method is GET; POST requests are much more involved.
You only need the json and requests libraries; there is no need for bs4. Note that making a lot of such requests may get your IP blocked or rate-limited, or the server may stop responding once its system decides the traffic looks like a bot rather than a regular user, so you will need to work around that.
Update:
There's possibly a hard limit on how many requests can be made in a given period of time.
Code and example in the online IDE (contains full JSON response):
import requests, json
response = requests.get('https://query1.finance.yahoo.com/v7/finance/spark?symbols=GC%3DF&range=1d&interval=5m&indicators=close&includeTimestamps=false&includePrePost=false&corsDomain=finance.yahoo.com&.tsrc=finance').text
data_1 = json.loads(response)
gold_price = data_1['spark']['result'][0]['response'][0]['meta']['previousClose']
print(gold_price)
# 1830.8
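As noted above, hammering this endpoint can get you rate-limited. A minimal sketch of polite throttling with retries (the attempt count and delay are arbitrary values, not anything Yahoo documents):
import time
import requests

def fetch_with_retries(url, attempts=3, delay=5):
    # Retry a GET a few times, sleeping between attempts.
    for _ in range(attempts):
        response = requests.get(url)
        if response.status_code == 200:
            return response
        time.sleep(delay)
    response.raise_for_status()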
P.S. I have a blog post about scraping the Yahoo! Finance home page, which may also be relevant.

Python Web Scraping Product Price

I'm trying to web scrape the product price on this site: https://www.webhallen.com/se/product/232445-Logitech-C920-HD-Pro-Webcam
I tried using
price = str(soup.find('div', {"class": "add-product-to-cart"}))
and
price = soup.find(id="add-product-to-cart").get_text()
But unfortunately, I had no luck; neither returns the price. The price text is stored in a span element.
The entire site is rendered with JavaScript, so you won't fetch anything useful with bs4. However, there's an API endpoint with all the data you need.
Here's how to get it:
import requests

with requests.Session() as session:
    response = session.get("https://www.webhallen.com/api/product/232445").json()
    print(response["product"]["price"]["price"])
Output:
1190.00

Twitter scraping of older tweets

I am doing a project in which I need to get tweets from Twitter. I used the Twitter API, but it only returns tweets from the last 7-9 days, and I want tweets that are a few months old as well. So I decided to scrape Twitter with BeautifulSoup and later Selenium, but when parsing, it doesn't return the elements; it returns what looks like the view-source of the entire webpage. Please help!!
import requests
from bs4 import BeautifulSoup
f=requests.get("https://twitter.com/search?q=%23......%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query").text
soup = BeautifulSoup(f,'html.parser')
print(soup)
name = soup.find_all('span', class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
print(name)
The output from printing soup... I don't quite know how to describe it, but it's the view-source JavaScript rather than the actual HTML:
{"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},t.t=function(e,n){if(1&n&&(e=t(e)),8&n)return e;if(4&n&&"object"==typeof e&&e&&e.__esModule)return e;var d=Object.create(null);if(t.r(d),Object.defineProperty(d,"default",{enumerable:!0,value:e}),2&n&&"string"!=typeof e)for(var o in e)t.d(d,o,function(n){return e[n]}.bind(null,o));return d},t.n=function(e){var n=e&&e.__esModule?function(){return e.default}:function(){return e};return t.d(n,"a",n),n},t.o=function(e,n){return Object.prototype.hasOwnProperty.call(e,n)},t.p="https://abs.twimg.com/responsive-web/web/",t.oe=function(e){throw e};var i=window.webpackJsonp=window.webpackJsonp||[],c=i.push.bind(i);i.push=n,i=i.slice();for(var l=0;l<i.length;l++)n(i[l]);var u=c;d()}([]),window.__SCRIPTS_LOADED__.runtime=!0;
//# sourceMappingURL=runtime.cc3200a4.js.map
The Selenium output is the same as well:
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

PATH = "C:\\Program Files\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://twitter.com")
email = driver.find_element_by_name('session[username_or_email]')
password = driver.find_element_by_name('session[password]')
email.send_keys('......')
password.send_keys("......")
password.send_keys(Keys.RETURN)
time.sleep(1)
driver.get('https://twitter.com/search?q=%23....%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query')
time.sleep(1)
print(driver.page_source)
GetOldTweets3 lets you extract historical tweets and filter by multiple criteria (e.g. time frame, location, handle, or search query) without needing any API keys.
E.g.
import GetOldTweets3 as got

# Tweet params
search_term = 'china trade war'
start_date = '2017-01-01'
end_date = '2020-01-01'

# Define historical tweets criteria
tweet_criteria = got.manager.TweetCriteria().setUsername('reuters') \
                                            .setQuerySearch(search_term) \
                                            .setSince(start_date) \
                                            .setUntil(end_date)

# Return tweets based on tweet criteria
tweets = got.manager.TweetManager.getTweets(tweet_criteria)
print([tweet.text for tweet in tweets])
Note that you can access further tweet attributes such as hashtags, retweets etc through the tweet variable, for example:
other_tweet_attributes = [[tweet.username, tweet.hashtags] for tweet in tweets]

How can we get response of next loading web page

I am writing a scraper to get the full list of movies available on hungama.com.
I am requesting the URL "http://www.hungama.com/all/hungama-picks-54/4470/" to get the response.
When you go to this URL, it shows 12 movies on screen, but as you scroll down more movies are loaded automatically.
I am parsing the HTML source with the code below:
response.css('div.movie-block-artist.boxshadow.clearfix1>div>div>a::text').extract()
but I only get 12 items, whereas there are many more movies. How can I get all the available movies? Please help.
While scrolling down that page, if you take a good look at the XHR tab in the Network section of dev tools, you can see that it produces URLs with a pagination segment attached, like http://www.hungama.com/all/hungama-picks-54/3632/2/. So, by building the URL as I did below, you can get all the content from that page.
import requests
from scrapy import Selector

page = 1
URL = "http://www.hungama.com/all/hungama-picks-54/3632/"

while True:
    page += 1
    res = requests.get(URL)
    sel = Selector(res)
    container = sel.css(".leftbox")
    if len(container) <= 0:
        break

    for item in container:
        title = item.css("#pajax_a::text").extract_first()
        year = item.css(".subttl::text").extract_first()
        print(title, year)

    next_page = "http://www.hungama.com/all/hungama-picks-54/3632/{}/"
    URL = next_page.format(page)
Btw, the URL you provided above is not working; the one I've used here is currently active. The logic is the same either way.
There seems to be an AJAX request behind the lazy-load feature, with the URL http://www.hungama.com/all/hungama-picks-54/4470/2/?ajax_call=1&_country=IN, which fetches the movies.
In that URL, change the 2 to 3 (http://www.hungama.com/all/hungama-picks-54/4470/3/?ajax_call=1&_country=IN) and so on to get the next batch of movie details.
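A minimal sketch of looping over that AJAX URL, incrementing the page number until nothing comes back; the stop condition and the CSS selector are assumptions, since I haven't inspected the AJAX response markup:
import requests
from bs4 import BeautifulSoup

base = "http://www.hungama.com/all/hungama-picks-54/4470/{}/?ajax_call=1&_country=IN"
page = 1

while True:
    res = requests.get(base.format(page))
    soup = BeautifulSoup(res.text, "html.parser")
    # Assumed selector: the question used div.movie-block-artist for movie blocks.
    blocks = soup.select("div.movie-block-artist a")
    if not blocks:  # stop once a page returns no movie blocks (assumption)
        break
    for a in blocks:
        print(a.get_text(strip=True))
    page += 1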

Scraping website only provides a partial or random data into CSV

I am trying to extract a list of golf course names and addresses from the Garmin website using the script below.
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(893):  # 893
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i*20)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "result"})
    for item in g_data2:
        try:
            name = item.contents[3].find_all("div", {"class": "name"})[0].text
            print name
        except:
            name = ''
        try:
            address = item.contents[3].find_all("div", {"class": "location"})[0].text
        except:
            address = ''
        course = [name, address]
        courses_list.append(course)

with open('PGA_Garmin2.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
After running the script, I don't get the full data I need; each run produces a seemingly random subset rather than a complete set. I need to extract information from 893 pages and end up with a list of at least 18,000 courses, but after running this script I only get 122. How do I fix the script so it captures the complete data set and writes the full CSV of golf courses from the Garmin website? I already adjusted the page numbers to match Garmin's paging, where results advance in steps of 20.
Just taking a guess here, but try checking r.status_code and confirming it's 200 for every request? It's possible you're not actually reaching every page.
Stab in the dark.
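A minimal sketch of that check folded into the existing loop; treating non-200 responses as the cause of the missing rows is an assumption, not a confirmed diagnosis:
import requests

for i in range(893):
    url = "http://sites.garmin.com/clsearch/courses?browse=1&country=US&lang=en&per_page={}".format(i * 20)
    r = requests.get(url)
    if r.status_code != 200:
        # A failed page would silently drop up to 20 courses (assumption).
        print("page {} failed with status {}".format(i, r.status_code))
        continue
    # ...parse r.content exactly as in the original script...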
