I would like to use scraperwiki and python to build a scraper that will scrape large amounts of information off of different sites. I am wondering if it is possible to point to a single URL and then scrape the data off of each of the links within that site.
For example: A site would contain information about different projects, each within its own individual link. I don't need a list of those links but the actual data contained within them.
The scraper would be looking for the same attributes on each of the links.
Does anyone know how or if I could go about doing this?
Thanks!
Check out BeautifulSoup with urllib2.
http://www.crummy.com/software/BeautifulSoup/
A (very) rough example link scraper would look like this:
from bs4 import BeautifulSoup
import urllib2

url = "http://example.com"  # whatever page you want the links from
c = urllib2.urlopen(url)
contents = c.read()
soup = BeautifulSoup(contents, "html.parser")
links = soup.find_all("a")
Then just write a for loop to do that many times over and you're set!
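For example, a rough sketch of that loop, assuming the project pages are linked from a single index page; "http://example.com/projects" and the "project-title"/"project-details" class names are just placeholders you would swap for the real site's URL and attributes:
from bs4 import BeautifulSoup
import urllib2
import urlparse

start_url = "http://example.com/projects"  # placeholder index page

index = BeautifulSoup(urllib2.urlopen(start_url).read(), "html.parser")

for a in index.find_all("a", href=True):
    link = urlparse.urljoin(start_url, a["href"])  # resolve relative hrefs
    page = BeautifulSoup(urllib2.urlopen(link).read(), "html.parser")
    # placeholder selectors -- replace with the attributes the real project pages share
    title = page.find("h1", class_="project-title")
    details = page.find("div", class_="project-details")
    if title and details:
        print("%s: %s" % (title.get_text().strip(), details.get_text().strip()))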
I was hoping someone could help me figure out how to scrape data from this page. I don't know where to start, as I've never worked with scraping or automating downloads in Python, but I'm just trying to find a way to automate downloading all the files on the linked page (and others like it -- just using this one as an example).
There is no discernible pattern in the file names linked; they appear to be random numbers that reference an ID-file name lookup table elsewhere.
For the URL provided above, you can download the zip files with the following code:
import re
import requests
from bs4 import BeautifulSoup

hostname = "http://mis.ercot.com"
r = requests.get(f'{hostname}/misapp/GetReports.do?reportTypeId=13060&reportTitle=Historical%20DAM%20Load%20Zone%20and%20Hub%20Prices&showHTMLView=&mimicKey')
soup = BeautifulSoup(r.text, 'html.parser')

# every download link on the page points at the mirDownload servlet
regex = re.compile('.*misdownload/servlets/mirDownload.*')
atags = soup.findAll("a", {"href": regex})

for link in atags:
    data = requests.get(f"{hostname}{link['href']}")
    # name each file after its doclookupId
    filename = link["href"].split("doclookupId=")[1][:-1] + ".zip"
    with open(filename, "wb") as savezip:
        savezip.write(data.content)
    print(filename, "Saved")
Let me know if you have any questions :)
When I try to parse https://www.forbes.com/ for learning purposes and run the code, it only parses one page, I mean, the home page.
How can I parse the entire website, I mean, all the pages of a site?
My attempted code is given below:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# collect every absolute link on the page
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# to export to a csv file, we used the code below
import pandas as pd
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you please tell me how I can parse entire websites, not just one page?
You have a couple of alternatives; it depends on what you want to achieve.
Write your own crawler
Similar to what you are trying to do in your code snippet: fetch a page from the website, identify all the interesting links on that page (using XPath, regular expressions, ...) and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, such as not visiting the same links twice and limiting the crawl to your domain(s) so you don't wander off to other websites.
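A minimal sketch of that idea, assuming requests and BeautifulSoup are installed and using the URL from your snippet as the starting point (the extraction step is left as a comment):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = "http://www.bdjobs.com/"
domain = urlparse(start_url).netloc

to_visit = [start_url]
visited = set()

while to_visit:
    url = to_visit.pop()
    if url in visited:
        continue                       # don't visit the same link twice
    visited.add(url)

    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue                       # skip pages that fail to load

    soup = BeautifulSoup(html, "html.parser")
    # ... extract whatever data you need from `soup` here ...

    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == domain:   # stay on the same domain
            to_visit.append(link)

print(len(visited), "pages crawled")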
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or at a larger scale, consider using a framework such as Scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.
I am trying to scrape tweets from one webpage within a certain timeframe.
To do so I am using this link which only searches within the timeframe I have specified:
https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22
This is my code:
import pandas as pd
import datetime as dt
import urllib.request
from bs4 import BeautifulSoup

url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'
thepage = urllib.request.urlopen(url)
soup = BeautifulSoup(thepage, "html.parser")

i = 1
for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text.encode('UTF-8'))
    print(i)
    i += 1
The above code works when I am scraping from within the actual twitter page for the subwaystat user.
For this reason I don't understand why it doesn't work for the search page even though the html appears to be the same to me.
I am a total beginner so I'm sorry if this is a dumb question. Thank you!
There is a Twitter Search API (docs: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets),
and using an unofficial Python wrapper such as https://github.com/bear/python-twitter makes it super easy to get tweets.
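For example, a rough sketch with python-twitter; you need your own API credentials, and I'm assuming GetSearch accepts the since/until date filters shown below, so check the wrapper's docs if it complains:
import twitter  # pip install python-twitter

# fill in your own credentials from https://developer.twitter.com
api = twitter.Api(consumer_key="...",
                  consumer_secret="...",
                  access_token_key="...",
                  access_token_secret="...")

# assumption: GetSearch supports since/until date strings like the web search does
results = api.GetSearch(term="subwaydstats",
                        since="2016-08-22",
                        until="2018-08-22",
                        count=100)

for status in results:
    print(status.created_at, status.text)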
However, if you want to scrape the HTML, then it's a lot more difficult. I was doing something similar, scraping an Angular app: the actual HTML you see on the screen is rendered by front-end JavaScript. Requests and urllib only fetch the basic HTML but do not run the JavaScript.
You could use selenium, which is basically a browser you can automate tasks on. Since it behaves as a browser, it actually runs that front-end JavaScript, meaning you will be able to scrape the webpage.
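A rough sketch of that approach, assuming selenium and chromedriver are installed (the crude time.sleep is just to give the JavaScript time to render; a proper wait would be better):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'

driver = webdriver.Chrome()   # needs chromedriver on your PATH
driver.get(url)
time.sleep(5)                 # wait for the front-end javascript to render

# page_source now holds the rendered HTML, not the bare skeleton urllib returns
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    p = tweet.find('p', {'class': 'TweetTextSize'})
    if p:
        print(p.text)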
A great article here explains the different ways you can scrape Twitter: https://medium.com/@dawran6/twitter-scraper-tutorial-with-python-requests-beautifulsoup-and-selenium-part-2-b38d849b07fe
I am a newbie to Python and web scraping.
I am trying to extract information about the test components of clinical diagnostic tests from this link: https://labtestsonline.org/tests-index
The tests index has a list of names of test components for various clinical tests. Clicking on each of those names takes you to another page containing details about the individual test component. From that page I would like to extract the part which has the common questions,
and finally put together a data frame containing the names of the test components in one column and each question from the common questions as the rest of the columns (as shown below).
Names | how_its_used | when_it_is_ordered | what_does_test_result_mean
So far I have only managed to get the names of the test components.
import requests
from bs4 import BeautifulSoup

url = 'https://labtestsonline.org/tests-index'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.prettify())

l = []  # get the names of the test components from the index
for i in soup.select("a[hreflang*=en]"):
    l.append(i.text)

import pandas as pd
names = pd.DataFrame({'col': l})  # convert the above list to a dataframe
I suggest that you take a look at the open source web scraping library Scrapy. It will help you with many of the concerns that you might run into when scraping websites, such as:
Following the links on each page.
Scraping data from pages that match a particular pattern, e.g. you might only want to scrape data from the detail pages, while from the other pages you just collect links to crawl.
Extracting data with lxml and CSS selectors.
Concurrency, allowing you to crawl multiple pages at the same time which will greatly speed up your scraper.
It's very easy to get going, and there are a lot of resources out there on how to build anything from simple to advanced web scrapers using the Scrapy library.
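As a rough sketch of how that looks for the index-then-detail pattern above, reusing the a[hreflang*=en] selector from your code; the detail-page selectors (h1, .how-used, .when-ordered) are placeholders you would need to adapt after inspecting the real pages:
import scrapy

class TestsSpider(scrapy.Spider):
    name = "tests"
    start_urls = ["https://labtestsonline.org/tests-index"]

    def parse(self, response):
        # follow every index entry to its detail page
        for href in response.css("a[hreflang*=en]::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # placeholder selectors -- inspect the detail pages to find the
        # elements that hold the common questions
        yield {
            "name": response.css("h1::text").get(),
            "how_its_used": response.css(".how-used ::text").getall(),
            "when_it_is_ordered": response.css(".when-ordered ::text").getall(),
        }
Running it with scrapy runspider tests_spider.py -o tests.csv gives you a CSV you can load straight into a pandas data frame like the one described above.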
I'm trying to scrape a few news websites to extract information like the title, content and timestamp. Now, I also want to count the number of times each article was shared on Twitter and Facebook. However, I haven't been able to find a suitable resource to do this effectively. I'm using Python 2.7.4 and BeautifulSoup 4 to extract data and dump it into a .csv file.
Facebook like count query:
Getting the Facebook like/share count for a given URL
For the Twitter share count you can check this:
Is there a way to get the twitter share count for a specific URL?
Since you're trying to only get the likes from the page, I suggest you use the Graph API to get the likes, then convert the response using Beautiful Soup and write it to a file; you can then read the file to get your data.
This is an example of a script I wrote to do the same.
import urllib2
from bs4 import BeautifulSoup

# FQL query that returns the like count for the given URL
x = urllib2.urlopen("https://api.facebook.com/method/fql.query?query=select%20like_count%20from%20link_stat%20where%20url=%27https://www.facebook.com/mitrevels?ref=br_tf%27")

soup = BeautifulSoup(x, "html.parser")
y = soup.get_text()

f = open("write.txt", "w")  # "wr" is not a valid file mode; "w" is enough
f.write(y)
f.close()
This will just give me the likes on the particular page.
All you need to do is change the url part to get the likes on your particular page. The same is available for Twitter; read the documentation to get the results.
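For reference, a hedged sketch of the Graph API route; this assumes the URL node still exposes an engagement field (share/reaction counts) and that you have a valid access token, so check Facebook's current documentation, since these endpoints change over time:
import requests

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"               # app or user token from Facebook
page_url = "https://www.facebook.com/mitrevels"  # the page from the example above

# assumption: the Graph API URL node still exposes share/reaction counts
# via the `engagement` field -- verify against the current docs
resp = requests.get("https://graph.facebook.com/",
                    params={"id": page_url,
                            "fields": "engagement",
                            "access_token": ACCESS_TOKEN})
print(resp.json())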