Twitter scraping of older tweets - python

I am doing a project in which I need to get tweets from Twitter. I used the Twitter API, but it only returns tweets from the last 7-9 days, and I also want tweets that are a few months old. So I decided to scrape Twitter using BeautifulSoup and later Selenium, but when parsing it does not return the elements; instead it returns the view-source of the entire webpage. Please help!
import requests
from bs4 import BeautifulSoup
f=requests.get("https://twitter.com/search?q=%23......%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query").text
soup = BeautifulSoup(f,'html.parser')
print(soup)
name = soup.find_all('span', class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
print(name)
The output from printing soup (I don't quite know how to describe it, but it is the view-source, not the actual rendered HTML):
{"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},t.t=function(e,n){if(1&n&&(e=t(e)),8&n)return e;if(4&n&&"object"==typeof e&&e&&e.__esModule)return e;var d=Object.create(null);if(t.r(d),Object.defineProperty(d,"default",{enumerable:!0,value:e}),2&n&&"string"!=typeof e)for(var o in e)t.d(d,o,function(n){return e[n]}.bind(null,o));return d},t.n=function(e){var n=e&&e.__esModule?function(){return e.default}:function(){return e};return t.d(n,"a",n),n},t.o=function(e,n){return Object.prototype.hasOwnProperty.call(e,n)},t.p="https://abs.twimg.com/responsive-web/web/",t.oe=function(e){throw e};var i=window.webpackJsonp=window.webpackJsonp||[],c=i.push.bind(i);i.push=n,i=i.slice();for(var l=0;l<i.length;l++)n(i[l]);var u=c;d()}([]),window.__SCRIPTS_LOADED__.runtime=!0;
//# sourceMappingURL=runtime.cc3200a4.js.map
The Selenium output is the same as well:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
PATH = "C:\\Program Files\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.get("https://twitter.com")
email = driver.find_element_by_name('session[username_or_email]')
password = driver.find_element_by_name('session[password]')
email.send_keys('......')
password.send_keys("......")
password.send_keys(Keys.RETURN)
time.sleep(1)
driver.get('https://twitter.com/search?q=%23....%20until%3A2020-02-07%20since%3A2020-01-01&src=typed_query')
time.sleep(1)
print(driver.page_source)

GetOldTweets3 enables you to extract historical tweets and filter on multiple criteria, e.g. time frame, location, handle, or search query, without needing any API keys.
For example:
import GetOldTweets3 as got
# Tweet params
search_term = 'china trade war'
start_date = '2017-01-01'
end_date = '2020-01-01'
# Define historical tweets criteria
tweet_criteria = got.manager.TweetCriteria().setUsername('reuters') \
    .setQuerySearch(search_term) \
    .setSince(start_date) \
    .setUntil(end_date)
# Return tweets based on tweet criteria (getTweets returns a list of tweet objects)
tweets = got.manager.TweetManager.getTweets(tweet_criteria)
print(tweets[0].text)
Note that you can access further tweet attributes, such as hashtags, retweets etc., through the tweet variable, for example:
other_tweet_attributes = [[tweet.username, tweet.hashtags] for tweet in tweets]
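For instance, a minimal sketch (assuming the criteria object above) that walks the returned list and prints a few of the attributes GetOldTweets3 exposes per tweet:
for tweet in tweets:
    # id, permalink, username, text, date, retweets, favorites, mentions,
    # hashtags and geo are the per-tweet attributes documented by GetOldTweets3.
    print(tweet.date, tweet.username)
    print(tweet.text)
    print(tweet.hashtags, tweet.retweets)
    print('-' * 40)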

Related

Beautifulsoup4 find_all not getting the results I need

I'm trying to get data from flashscore.com for a project I'm doing as part of my self-taught Python study:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://www.flashscore.com/")
soup = BeautifulSoup(res.text, "lxml")
games = soup.find_all("div", {'class':['event__match', 'event__match--scheduled', 'event__match--twoLine']})
print(games)
When I run this, it gets me an empty list []
Why?
When an empty list is returned by find_all(), it means the elements you specified could not be found.
Make sure that what you are trying to scrape isn't added dynamically (such as inside an iframe in some cases).
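As a quick sanity check (a hypothetical diagnostic sketch, not part of the answer's original code), you can test whether the class name appears anywhere in the raw HTML that requests receives; if it does not, the rows are injected later by JavaScript:
import requests

res = requests.get("https://www.flashscore.com/")
# If the class never occurs in the raw response, the match rows are
# rendered client-side and plain requests + BeautifulSoup cannot see them.
print("event__match" in res.text)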
The failure is due to the fact that the website uses a set of Ajax technologies, specifically content added dynamically with the JavaScript client-side scripting language. Client-side code is executed in the browser itself, not at the web server, and its success depends on the browser's ability to interpret and execute it correctly. With the BeautifulSoup library alone, the program you wrote only sees the static HTML. JavaScript can be executed, for example, with the help of the Selenium library: https://www.selenium.dev/. Below is the full code for the data that I suppose you are interested in:
# crawler_her_sel.py
import time
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup
import pandas as pd
def firefoxdriver(my_url):
    """
    Prepare a headless Firefox browser for the work.
    """
    options = Options()
    options.add_argument("--headless")
    driver = Firefox(options=options)
    return driver


def scrapingitems(driver, my_list, my_xpath):
    """
    Create appropriate lists of the data for the pandas library.
    """
    try:
        elem_to_scrap = driver.find_element(By.XPATH, my_xpath).text
        my_list.append(elem_to_scrap)
    except:
        elem_to_scrap = ""
        my_list.append(elem_to_scrap)
# Variable with the URL of the website.
my_url = "https://www.flashscore.com/"
# Prepare the headless Firefox browser.
driver = firefoxdriver(my_url)
# Loads the website code as the Selenium object.
driver.get(my_url)
# Prepare the blank dictionary to fill in for pandas.
matches = {}
# Preparation of lists with scraped data.
countries = []
leagues = []
home_teams = []
scores_home = []
scores_away = []
away_teams = []
# Wait for page to fully render
try:
    element = WebDriverWait(driver, 25).until(
        EC.presence_of_element_located((By.CLASS_NAME, "adsclick")))
except TimeoutException:
    print("Loading took too much time! Please rerun the script.")
except Exception as e:
    print(str(e))
else:
    # Loads the website code as the BeautifulSoup object.
    pageSource = driver.page_source
    bsObj = BeautifulSoup(pageSource, "lxml")

    # Determining the number of the football matches with the help of
    # BeautifulSoup.
    games_1 = bsObj.find_all(
        "div", {"class":
                "event__participant event__participant--home"})
    games_2 = bsObj.find_all(
        "div", {"class":
                "event__participant event__participant--home fontBold"})
    games_3 = bsObj.find_all(
        "div", {"class":
                "event__participant event__participant--away"})
    games_4 = bsObj.find_all(
        "div", {"class":
                "event__participant event__participant--away fontBold"})

    # Determining the number of the countries for the given football matches.
    all_countries = driver.find_elements(By.CLASS_NAME, "event__title--type")

    # Determine the number of loop iterations.
    sum_to_iterate = (len(all_countries) + len(games_1) + len(games_2)
                      + len(games_3) + len(games_4))

    for ind in range(1, (sum_to_iterate + 1)):
        # Scraping of the country names.
        xpath_countries = ('//div[@class="sportName soccer"]/div[' + str(ind)
                           + ']/div[2]/div/span[1]')
        scrapingitems(driver, countries, xpath_countries)
        # Scraping of the league names.
        xpath_leagues = ('//div[@class="sportName soccer"]/div[' + str(ind)
                         + ']/div[2]/div/span[2]')
        scrapingitems(driver, leagues, xpath_leagues)
        # Scraping of the home team names.
        xpath_home_teams = ('//div[@class="sportName soccer"]/div[' + str(ind)
                            + ']/div[3]')
        scrapingitems(driver, home_teams, xpath_home_teams)
        # Scraping of the home team scores.
        xpath_scores_home = ('//div[@class="sportName soccer"]/div[' + str(ind)
                             + ']/div[5]')
        scrapingitems(driver, scores_home, xpath_scores_home)
        # Scraping of the away team scores.
        xpath_scores_away = ('//div[@class="sportName soccer"]/div[' + str(ind)
                             + ']/div[6]')
        scrapingitems(driver, scores_away, xpath_scores_away)
        # Scraping of the away team names.
        xpath_away_teams = ('//div[@class="sportName soccer"]/div[' + str(ind)
                            + ']/div[4]')
        scrapingitems(driver, away_teams, xpath_away_teams)

    # Add lists with the scraped data to the dictionary in the correct order.
    matches["Countries"] = countries
    matches["Leagues"] = leagues
    matches["Home_teams"] = home_teams
    matches["Scores_for_home_teams"] = scores_home
    matches["Scores_for_away_teams"] = scores_away
    matches["Away_teams"] = away_teams

    # Creating of the data frame with the help of the pandas package.
    df_res = pd.DataFrame(matches)

    # Saving of the properly formatted data to the csv file. The date
    # and the time of the scraping are hidden in the file name.
    name_of_file = lambda: "flashscore{}.csv".format(time.strftime(
        "%Y%m%d-%H.%M.%S"))
    df_res.to_csv(name_of_file(), encoding="utf-8")
finally:
    driver.quit()
The result of the script is a CSV file which, when loaded into Excel, gives a table of countries, leagues, home teams, scores and away teams.
It is also worth mentioning that you need to download the appropriate driver for your browser: https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/.
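If managing the driver binary by hand is inconvenient, the webdriver-manager package can fetch a matching geckodriver for you (a sketch, assuming pip install webdriver-manager and Selenium 4):
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager

# webdriver-manager downloads geckodriver and returns the path to the binary.
options = Options()
options.add_argument("--headless")
driver = Firefox(service=Service(GeckoDriverManager().install()), options=options)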
In addition, here are links to two other interesting scripts that relate to scraping the https://www.flashscore.com/ portal: How can i scrape a football results from flashscore using python and Scraping stats with Selenium.
I would also like to raise legal issues here. The robots.txt file at https://www.flashscore.com/robots.txt indicates that you can scrape the home page. But the "General Terms of Use" say, quoting: "Without prior authorisation in writing from the Provider, Visitors are not authorised to copy, modify, tamper with, distribute, transmit, display, reproduce, transfer, upload, download or otherwise use or alter any of the content of the App."
This unfortunately introduces ambiguity, and ultimately it is not clear what the owner really allows. I therefore recommend that you do not run this script continuously, and certainly not for commercial purposes, and I ask other visitors to this page to do the same. I wrote this script purely as a scraping-learning exercise and do not intend to use it.
The finished script can be downloaded from my GitHub.

Select an option from a select object when scraping data from a dynamic webpage

I'm doing some web-scraping and I'd like to know how to select data from a dropdown box and scrape it. Here's the page: https://www.cbn.gov.ng/rates/ExchRateByCurrency.asp
As you can see, it's a dynamic web-page and there's the option to show how many entries you'd like.
What I want to do is select the maximum (100), and then scrape the data from the table afterwards. Any ideas how I can go about this? Here's some code you can build on:
from time import sleep
from bs4 import BeautifulSoup as bs
from selenium.webdriver import Firefox

driver = Firefox()
driver.get(source["Exchange Rates by Currency"])  # `source` is my dict of page names to URLs
sleep(30)
html = driver.page_source
html = bs(html, "html.parser")
table = html.find("table", id="exTable")
select_item = html.find("select")
That takes you right to the table and the select element, respectively.
Try the approach below using Python requests - simple, straightforward, reliable, fast, and less code is required. I fetched the API URL from the website itself by inspecting the network section of the Chrome developer tools.
What exactly the script below does:
First it takes the API URL and makes a GET request.
After getting the data, the script parses the JSON using json.loads.
Finally it iterates over the list of exchange rates by currency and prints them, for example: buying rate, central rate, currency, selling rate, rate date.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_cbn_data():
    URL = 'https://www.cbn.gov.ng/rates/outputExchangeRateJSN.asp?_=1605068636834'  # API URL
    response = requests.get(URL, verify=False)  # GET request
    json_result = json.loads(response.text)  # Parse JSON data using json.loads
    extracted_data = json_result['data']  # Extracted data
    for item in extracted_data:  # Iterate over the list of exchange rates by currency
        print('-' * 100)
        print('Buying Rate   : ', item['buyingrate'])
        print('Central Rate  : ', item['centralrate'])
        print('Currency      : ', item['currency'])
        print('Rate Date     : ', item['ratedate'])
        print('Selling Rate  : ', item['sellingrate'])
        print('-' * 100)

scrap_cbn_data()

get number of followers of twitter account by scraping twitter

I tried to get the number of followers of a given Twitter account by scraping Twitter. I tried scraping with BeautifulSoup and XPath, but none of the code is working.
This is some of my sample testing code for it,
import requests
from bs4 import BeautifulSoup

url = "https://twitter.com/BarackObama"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
div_tag = soup.find_all('main', {"class": "css-1dbjc4n r-1habvwh r-16xksha r-1wbh5a2"})
When I try to see what content I scraped by using the code below,
import requests
t=requests.get('https://twitter.com/BarackObama')
print(t.content)
it doesn't include any of the data, like the follower count or anything else.
Please help me to do this.
When your code requests the Twitter URL, it only receives the initial page shell; the important values (including the followers) are loaded afterwards by JavaScript, so they are not in the parsed response. There are Twitter Python API libraries, though, with which you can get followers, e.g. api.GetFollowers().
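For example, with the python-twitter package (a rough sketch; the credentials are placeholders you must replace with your own), the follower count is available directly on the user object:
import twitter  # pip install python-twitter

# Placeholder credentials - substitute your own app keys and tokens.
api = twitter.Api(consumer_key='...',
                  consumer_secret='...',
                  access_token_key='...',
                  access_token_secret='...')

user = api.GetUser(screen_name='BarackObama')
print(user.followers_count)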
The relevant API endpoint is followers/ids. Using TwitterAPI you can do the following:
from TwitterAPI import TwitterAPI, TwitterPager

api = TwitterAPI(YOUR_CONSUMER_KEY,
                 YOUR_CONSUMER_SECRET,
                 YOUR_ACCESS_TOKEN_KEY,
                 YOUR_ACCESS_TOKEN_SECRET)

count = 0
r = TwitterPager(api, 'followers/ids')
for item in r.get_iterator():
    count = count + 1
print(count)
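If you only need the total rather than each follower ID, the users/show endpoint returns the count directly; a short sketch with the same TwitterAPI client (credentials as above):
# users/show returns profile metadata, including followers_count,
# without paging through every follower ID.
r = api.request('users/show', {'screen_name': 'BarackObama'})
print(r.json()['followers_count'])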

Extract individual links from a single youtube playlist link using python

I need a Python script that takes a link to a single YouTube playlist and then gives out a list containing the links to the individual videos in the playlist.
I realize the same question was asked a few years ago, but it was asked for Python 2.x and the code in the answers doesn't work properly. It is very erratic: it works sometimes but gives empty output once in a while (maybe some of the packages used there have been updated, I don't know). I've included one of those snippets below.
If you don't believe it, run this code several times; you'll receive an empty list once in a while, but most of the time it does the job of breaking down a playlist.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.youtube.com/playlist?list=PL3D7BFF1DDBDAAFE5')
page = r.text
soup = bs(page, 'html.parser')
res = soup.find_all('a', {'class': 'pl-video-title-link'})
for l in res:
    print(l.get("href"))
For some playlists the code just doesn't work at all.
Also, if BeautifulSoup can't do the job, any other popular Python library will do.
It seems YouTube sometimes loads different versions of the page. Sometimes the HTML is organized as you expected, using links with the pl-video-title-link class:
<td class="pl-video-title">
<a class="pl-video-title-link yt-uix-tile-link yt-uix-sessionlink spf-link " dir="ltr" href="/watch?v=GtWXOzsD5Fw&list=PL3D7BFF1DDBDAAFE5&index=101&t=0s" data-sessionlink="ei=TJbjXtC8NYri0wWCxarQDQ&feature=plpp_video&ved=CGoQxjQYYyITCNCSmqHD_OkCFQrxtAodgqIK2ij6LA">
Android Application Development Tutorial - 105 - Spinners and ArrayAdapter
</a>
<div class="pl-video-owner">
de <a href="/user/thenewboston" class=" yt-uix-sessionlink spf-link " data-sessionlink="ei=TJbjXtC8NYri0wWCxarQDQ&feature=playlist&ved=CGoQxjQYYyITCNCSmqHD_OkCFQrxtAodgqIK2ij6LA" >thenewboston</a>
</div>
<div class="pl-video-bottom-standalone-badge">
</div>
</td>
Sometimes the data is embedded in a JS variable and loaded dynamically:
window["ytInitialData"] = { .... very big json here .... };
For the second version, you will need to use a regex to parse the JavaScript, unless you want to use tools like Selenium to grab the content after the page loads.
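A rough regex-based sketch for that second version (not the official route; the layout of the embedded JSON is an assumption and can change, so the ids found are de-duplicated):
import re
import requests

url = 'https://www.youtube.com/playlist?list=PL3D7BFF1DDBDAAFE5'
page = requests.get(url).text

# Collect every "videoId" mentioned in the embedded JSON, keeping order
# and dropping duplicates; some ids may belong to recommendations.
video_ids = list(dict.fromkeys(re.findall(r'"videoId":"([\w-]{11})"', page)))
links = [f'https://www.youtube.com/watch?v={vid}&list=PL3D7BFF1DDBDAAFE5'
         for vid in video_ids]
print(len(links))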
The best way is to use the official API, which makes it straightforward to get the playlist items:
Go to the Google Developer Console, search for the YouTube Data API and enable YouTube Data API v3
Click on Create Credentials / YouTube Data API v3 / Public data
Alternatively (for credentials creation), go to Credentials / Create Credentials / API key
Install the Google API client for Python:
pip3 install --upgrade google-api-python-client
Use the API key in the script below. The script fetches the playlist items for the playlist with id PL3D7BFF1DDBDAAFE5, uses pagination to get all of them, and re-creates each link from the videoId and playlistId:
import googleapiclient.discovery
from urllib.parse import parse_qs, urlparse
#extract playlist id from url
url = 'https://www.youtube.com/playlist?list=PL3D7BFF1DDBDAAFE5'
query = parse_qs(urlparse(url).query, keep_blank_values=True)
playlist_id = query["list"][0]
print(f'get all playlist items links from {playlist_id}')
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey = "YOUR_API_KEY")
request = youtube.playlistItems().list(
    part="snippet",
    playlistId=playlist_id,
    maxResults=50
)
playlist_items = []
while request is not None:
    response = request.execute()
    playlist_items += response["items"]
    request = youtube.playlistItems().list_next(request, response)
print(f"total: {len(playlist_items)}")
print([
f'https://www.youtube.com/watch?v={t["snippet"]["resourceId"]["videoId"]}&list={playlist_id}&t=0s'
for t in playlist_items
])
Output:
get all playlist items links from PL3D7BFF1DDBDAAFE5
total: 195
[
'https://www.youtube.com/watch?v=SUOWNXGRc6g&list=PL3D7BFF1DDBDAAFE5&t=0s',
'https://www.youtube.com/watch?v=857zrsYZKGo&list=PL3D7BFF1DDBDAAFE5&t=0s',
'https://www.youtube.com/watch?v=Da1jlmwuW_w&list=PL3D7BFF1DDBDAAFE5&t=0s',
...........
'https://www.youtube.com/watch?v=1j4prh3NAZE&list=PL3D7BFF1DDBDAAFE5&t=0s',
'https://www.youtube.com/watch?v=s9ryE6GwhmA&list=PL3D7BFF1DDBDAAFE5&t=0s'
]
from encodings import utf_8
import os
from bs4 import BeautifulSoup as bs
import requests
import json

data = []
r = requests.get('https://www.youtube.com/playlist?list=PLj_g-vuzpBAuU0YJHkiL98DSi_mwrJDJR')
page = r.text
soup = bs(page, 'html.parser')
b = open("a.html", "w", encoding="utf_8")
b.write(str(soup))
c = open("a.html", "r", encoding="utf_8")
d = c.readlines()
lin = 0
while True:
    try:
        a = d[lin]
    except:
        print("Finished")
        break
    if '"url":"/watch?v=' in a:
        a = a.split('"url":"')
        te = 0
        while True:
            try:
                if "/watch?v=" in a[te]:
                    aa = a[te].split('",')
                    e = 0
                    while True:
                        try:
                            if "/watch?v=" in aa[e]:
                                url = "https://www.youtube.com" + aa[e]
                                # url is added to data; if you want to print all urls, uncomment this line
                                # print(url)
                                data.append(url)
                        except:
                            break
                        e += 1
            except:
                break
            te += 1
    lin += 1
c.close()
b.close()
os.remove("a.html")
print("Given data is in list so you can print url by use this code print(data[0])\n\n")
print(data)

Python Beautifulsoup - Scrape elements from "inspect"

I am trying to scrape some data from stockrow.com using BeautifulSoup.
However, there seem to be some differences between inspect and view source (I'm using Chrome, but I don't see that being a problem for Python).
This is resulting in some trouble, as the source code itself does not show any HTML tags such as h1. They do, however, show up when I use the inspect tool.
The part I am trying to scrape (among other things) - this is shown using the inspect tool:
<h1>Teva Pharmaceutical Industries Ltd<small>(TEVA)</small></h1>
My current code, printing an empty list:
import bs4 as bs
import urllib.request

class Stock:
    stockrow_url = "https://stockrow.com"
    url_suffix = "/financials/{}/annual"

    def __init__(self, ticker: str, stock_url=stockrow_url, url_suffix=url_suffix):
        # Stock ticker
        self.ticker = ticker.upper()
        # URLs for financial statements related to the ticker
        self.stock_url = stock_url + "/{}".format(self.ticker)
        sauce = urllib.request.urlopen(self.stock_url).read()
        soup = bs.BeautifulSoup(sauce, 'html.parser').h1
        print(soup)
        self.income_url = self.stock_url + url_suffix.format("income")
        self.balance_sheet_url = self.stock_url + url_suffix.format("balance")
        self.cash_flow_url = self.stock_url + url_suffix.format("cashflow")

teva = Stock("teva")
print(teva.get_income_statement())
The page is dynamically generated using JavaScript and cannot be handled by BeautifulSoup alone. You can capture the information either with Selenium and the like, or by looking for the underlying API calls.
In this case, you can get background information for TEVA using:
import json
import requests
hdr = {'User-Agent':'Mozilla/5.0'}
url = "https://stockrow.com/api/companies/TEVA.json?ticker=TEVA"
response = requests.get(url, headers=hdr)
info = json.loads(response.text)
info
Similarly, the income statement is hiding here:
url = 'https://stockrow.com/api/companies/TEVA/financials.json?ticker=TEVA&dimension=MRY&section=Income+Statement'
Using the same code as above, but with this other URL, will get you your income statement in JSON format.
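For instance, reusing the headers and JSON parsing from the block above (a sketch; the query parameters are taken from the URL just given):
# Same request pattern, pointed at the income-statement endpoint.
income_url = ('https://stockrow.com/api/companies/TEVA/financials.json'
              '?ticker=TEVA&dimension=MRY&section=Income+Statement')
income_response = requests.get(income_url, headers=hdr)
income_statement = json.loads(income_response.text)
print(income_statement)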
And you can take it from there. Search around - there is a lot of information available on this topic. Good luck.
