I am trying to scrape tweets from one webpage within a certain timeframe.
To do so I am using this link, which searches only within the timeframe I have specified:
https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22
This is my code:
import pandas as pd
import datetime as dt
import urllib.request
from bs4 import BeautifulSoup
url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'
thepage = urllib.request.urlopen(url)
soup = BeautifulSoup(thepage, "html.parser")

i = 1
for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text.encode('UTF-8'))
    print(i)
    i += 1
The above code works when I scrape the actual Twitter page for the subwaystat user.
For this reason I don't understand why it doesn't work for the search page, even though the HTML appears to be the same to me.
I am a total beginner so I'm sorry if this is a dumb question. Thank you!
There is a Twitter Search API (docs: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets), and using an unofficial Python wrapper such as python-twitter (https://github.com/bear/python-twitter) makes it very easy to get tweets.
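A minimal sketch with that wrapper (pip install python-twitter); the placeholder credentials are assumptions you would replace with keys from a Twitter developer account:

import twitter

# Credentials come from https://developer.twitter.com (placeholders here).
api = twitter.Api(consumer_key='CONSUMER_KEY',
                  consumer_secret='CONSUMER_SECRET',
                  access_token_key='ACCESS_TOKEN_KEY',
                  access_token_secret='ACCESS_TOKEN_SECRET')

# GetSearch supports the same since/until window as the web search page.
results = api.GetSearch(term='subwaydstats',
                        since='2016-08-22', until='2018-08-22')
for status in results:
    print(status.text)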
However, if you want to scrape the HTML, it's a lot more difficult. I was doing something similar (scraping an Angular app), and the HTML you actually see on the screen is rendered by front-end JavaScript. Requests and urllib only fetch the initial HTML; they do not run the JavaScript.
You could use selenium, which automates a real browser. Since it behaves as a browser, it actually runs that front-end JavaScript, so you will be able to scrape the rendered page.
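For example, here is a minimal sketch of your scrape driven through selenium. It assumes ChromeDriver is installed; the CSS classes are taken from your snippet and may change whenever Twitter updates its front end:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = ('https://twitter.com/search?f=tweets'
       '&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22')

driver = webdriver.Chrome()   # needs ChromeDriver on your PATH
driver.get(url)
time.sleep(3)                 # crude wait for the front-end JS to render

soup = BeautifulSoup(driver.page_source, 'html.parser')
for i, tweet in enumerate(soup.find_all('div', {'class': 'js-tweet-text-container'}), 1):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text)
    print(i)

driver.quit()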
A great article here explains the different ways you can scrape Twitter: https://medium.com/@dawran6/twitter-scraper-tutorial-with-python-requests-beautifulsoup-and-selenium-part-2-b38d849b07fe
I'm trying to scrape YouTube, but most of the time I just get an empty result back.
In this code snippet I'm trying to get the list of video titles on the page, but when I run it the result is empty; not even one title shows up.
I've searched around, and some results point out that it's because the website loads content dynamically with JavaScript. Is this the case? How do I go about solving this? Is it possible to do without selenium?
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'lxml')
title = soup.findAll('a', attrs={'class': "yt-simple-endpoint style-scope ytd-grid-video-renderer"})
print(title)
Is it possible to do without selenium?
Often services have APIs which allow easier automation than scraping the site, and YouTube is no exception: it has an official API with ready-made client libraries for various languages. For Python there is google-api-python-client. You will need an API key to use it; to get running I suggest following the YouTube Data API quickstart. Note that you can ignore the OAuth 2.0 parts as long as you only need access to public data.
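A minimal sketch, assuming google-api-python-client is installed and you have an API key; the channel ID is an assumption you would look up yourself (e.g. via the channels().list endpoint or the channel page's source):

from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey='YOUR_API_KEY')

# List videos from a channel via the search endpoint.
response = youtube.search().list(
    part='snippet',
    channelId='CHANNEL_ID',   # look this up for the channel you want
    type='video',
    maxResults=25,
).execute()

for item in response['items']:
    print(item['snippet']['title'])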
I totally agree with @Daweo, and that's the right way to scrape a website like YouTube. But if you want to use BeautifulSoup and not get an empty list at the end, your code should be changed as follows:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/user/schafer5/videos').text
soup = BeautifulSoup(source, 'html.parser')
# Keep only the anchors that carry an aria-describedby attribute.
titles = [i.text for i in soup.findAll('a') if i.get('aria-describedby')]
print(titles)
I also suggest that you use the API.
When I try to parse https://www.forbes.com/ for learning purposes and run the code, it only parses one page, I mean, the home page.
How can I parse the entire website, I mean, all the pages of the site?
My attempted codes are given below:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
import pandas as pd

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link on the page.
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# Export the links to a csv file.
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you tell me, please, how I can parse entire websites, not just one page?
You have a couple of alternatives; it depends on what you want to achieve.
Write your own crawler
Similarly to what you are trying to do in your code snippet, fetch a page from the website, identify all the interesting links on that page (using XPath, regular expressions, ...), and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, like not visiting the same link twice and limiting the crawl to your target domain(s) so you don't wander off to other websites.
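A minimal breadth-first sketch of that idea, using requests and BeautifulSoup; a real crawler should also honour robots.txt and rate-limit itself:

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    domain = urlparse(start_url).netloc
    seen = {start_url}          # avoids visiting the same link twice
    queue = deque([start_url])
    while queue and max_pages > 0:
        url = queue.popleft()
        max_pages -= 1
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        print(url)
        for a in BeautifulSoup(html, 'html.parser').find_all('a', href=True):
            link = urljoin(url, a['href'])
            # Stay on the original domain.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)

crawl('http://www.bdjobs.com/')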
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or some large scale scraping, consider using a framework such as scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.
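For comparison, a bare-bones Scrapy spider doing the same crawl (run it with scrapy runspider spider.py; the file name and allowed domain are just examples):

import scrapy

class LinksSpider(scrapy.Spider):
    name = 'links'
    allowed_domains = ['bdjobs.com']     # keeps the crawl on one site
    start_urls = ['http://www.bdjobs.com/']

    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield {'link': response.urljoin(href)}
            # Scrapy de-duplicates requests, so following every link is safe.
            yield response.follow(href, callback=self.parse)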
I have a web scraper that, given a hashtag, returns the tweets with that hashtag. The problem is that when I make a request to Twitter to get those tweets, I only receive about 20 of them. I am using requests to grab the page source, which contains only those 20 tweets.
I believe Twitter renders the tweets only a few at a time, but I wanted to know whether there is a way, without using the Twitter API, to get more than what is initially rendered on the page.
My current code to make the request looks like the following:
import requests
from bs4 import BeautifulSoup
def find_hashtags(hashtag):
    r = requests.get('https://twitter.com/hashtag/' + hashtag + '?src=hash')
    data = r.text
    soup = BeautifulSoup(data, "html5lib")

find_hashtags('cnn')
Does anybody know of a workaround to this?
The issue with using BeautifulSoup is that it is purely for HTML scraping. The first tweets are present in the initial HTML, but the rest are loaded with JavaScript. BeautifulSoup won't be able to access those, and you will need some other library which can handle JavaScript-loaded elements. I would suggest looking into selenium, which can mimic a web user.
What I ended up doing, which worked really well, was using selenium to open the browser and scroll the page down i number of times.
from bs4 import BeautifulSoup
from selenium import webdriver
import time

def find_hashtags(hashtag):
    driver = webdriver.Chrome()
    driver.get('https://twitter.com/hashtag/' + hashtag + '?src=hash')

    # Scroll down repeatedly so the infinite scroll loads more tweets.
    for i in range(100):
        print(i)
        driver.execute_script("window.scrollTo(0, 100000)")
        time.sleep(1.5)  # give the newly loaded tweets time to render

    # The fully rendered page source can now be parsed with BeautifulSoup.
    soup = BeautifulSoup(driver.page_source, "html5lib")
I'm not sure if this is the most efficient way, but it does what I want!
The best way I could find to do this is to use Twitter's search page and scrape the data from that webpage. You can get more search data by modifying the since and until dates in the search query.
Modify the parameters of the URL to produce different search results. For example, appending the parameter q=%23hashtagName to the URL will give you tweets containing the hashtag "hashtagName":
https://twitter.com/search?q=%23hashtagName
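A small sketch that builds such a URL with requests; the q, since and until parameters follow Twitter's search syntax, and the hashtag and dates here are just placeholders:

import requests

# since:/until: narrow the results to a date window.
params = {'f': 'tweets',
          'q': '#hashtagName since:2018-01-01 until:2018-06-01'}
r = requests.get('https://twitter.com/search', params=params)
print(r.url)           # the fully encoded search URL
print(r.status_code)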
I'm trying to scrape Google Patents using the following code.
import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?q=usb'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')
But when I try to inspect the document using
print(soup.prettify())
I cannot get anything other than this: https://pastebin.com/Xu81LdfE.
I checked the request status and it is returning 200. Where am I going wrong?
The results on that page come from a different URL:
https://patents.google.com/xhr/query?url=q%3Dusb&exp=
So instead of using BeautifulSoup, you can request that URL, call r.json(), and find what you want in the dictionary it creates.
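For example (the shape of the returned JSON is undocumented, so inspect the top-level keys before relying on any of them):

import requests

r = requests.get('https://patents.google.com/xhr/query?url=q%3Dusb&exp=')
data = r.json()       # a plain Python dictionary
print(data.keys())    # explore the structure, then drill into the results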
The data is not in the HTML but loaded with JavaScript, so BeautifulSoup cannot scrape it.
Consider using the official APIs; other usage likely violates the Google terms of service, and they will likely block you.
I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP; the new API-NG will use JSON, so I can understand how to access the information that I need.
My question is: using Python, what would be the best way to get information from a website that serves plain HTML? Can I convert it some way, maybe to XML, or what is the best/easiest way?
JSON, XML and basically all of this is new to me, so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices,
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions but looking at the answers and the source of the above page I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two well-used options: urllib2, which is built in, and requests, which is third party but really easy to use.
If you then need to parse your HTML, I would suggest using Beautiful Soup.
Example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
soup = BeautifulSoup(page_source, 'html.parser')
The page_source is just the raw HTML of the page and not much use on its own; the soup object, on the other hand, can be used to access different parts of the page programmatically.
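For example, to list every link on the page (the exact selectors for the horse names on oddschecker would need to be found by inspecting that page's HTML in your browser):

# Print the text and target of every anchor tag on the page.
for a in soup.find_all('a', href=True):
    print(a.text.strip(), '->', a['href'])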