parse entire website using python BeautifulSoup - python

I am trying to parse https://www.forbes.com/ for learning purposes. When I run the code, it only parses one page, I mean, the home page.
How can I parse the entire website, I mean, all the pages of a site?
My attempted code is given below:
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
import pandas as pd

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link on the page
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# To export to a csv file, we used the code below
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you please tell me how I can parse entire websites, not just one page?

You have a couple of alternatives; it depends on what you want to achieve.
Write your own crawler
Similarly to what you are trying to do in your code snippet, fetch a page from the website, identify all the interesting links in this page (using XPath, regular expressions, ...) and iterate until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or to get some information quickly as a one-off task.
You'll have to be careful about a couple of things, like not visiting the same links twice, limiting the crawl to the right domain(s) to avoid wandering onto other websites, etc.
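To make that concrete, here is a minimal sketch of such a crawler (the start URL is the one from your snippet; it keeps a visited set so no page is fetched twice and checks the domain so it never leaves the site):
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup

start_url = "http://www.bdjobs.com/"
domain = urlparse(start_url).netloc

visited = set()
queue = deque([start_url])

while queue:
    url = queue.popleft()
    if url in visited:
        continue
    visited.add(url)
    try:
        html_page = urlopen(url)
    except Exception:
        continue  # skip pages that fail to load
    soup = BeautifulSoup(html_page, "html.parser")
    for a in soup.find_all('a', href=True):
        link = urljoin(url, a['href'])          # resolve relative links
        if urlparse(link).netloc == domain:     # don't leave the website
            queue.append(link)

print(len(visited), "pages visited")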
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or some large-scale scraping, consider using a framework such as Scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.
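As an illustration, a minimal Scrapy spider (the spider name and start URL below are placeholders) that collects and follows every link on a site could look like this; Scrapy deduplicates visited URLs for you:
import scrapy

class SiteSpider(scrapy.Spider):
    name = "site"                               # placeholder spider name
    start_urls = ["http://www.bdjobs.com/"]     # placeholder start page
    allowed_domains = ["bdjobs.com"]            # keeps the crawl on one site

    def parse(self, response):
        # Record every link found on the page ...
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}
        # ... then follow them; Scrapy skips URLs it has already seen
        yield from response.follow_all(css="a::attr(href)", callback=self.parse)
Saved as spider.py, it can be run with scrapy runspider spider.py -o links.csv, and the framework takes care of queuing, deduplication and throttling.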

Related

Web Scraping Using Python for NLP project

I have to scrape text data from this website. I have read some blogs on web scraping, but the major challenge that I have found is parsing the HTML code. I am entirely new to this field. Can I get some help about how to scrape text data (where possible) and turn it into a CSV? Is this possible at all without knowledge of HTML? Can I expect a good demonstration of Python code solving my problem? Then I will try it on my own for other websites.
TIA
The tools you can use in Python to scrape and parse HTML data are the requests module and the Beautiful Soup library.
Parsing HTML files into, for example, CSV files is entirely possible, it just requires some effort to learn the tools. In my view there's no better way to learn this than by trying it out yourself.
As for "do you need to know HTML to parse HTML files?", well, yes you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
I appreciate my answer is not really what you were looking for, however as I said I think there's no better way to learn than to try things out yourself. If you're then stuck on anything in particular you can ask on SO for specific help :)
I didn't check the HTML of the website, but you can use BeautifulSoup for parsing the HTML and pandas for converting the data into a CSV.
Sample code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://yourwebsite.com')   # placeholder URL
soup = BeautifulSoup(res.content, 'html.parser')

# Suppose I want all 'li' tags and the links inside those 'li' tags
lis = soup.find_all("li")
links = []
for li in lis:
    a_tag = li.find("a")
    if a_tag is not None:           # some 'li' tags may contain no link
        links.append(a_tag.get("href"))
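From there, a short sketch of the pandas step (the output filename is arbitrary):
import pandas as pd

# One column of links, written out as a csv file
df = pd.DataFrame(links, columns=["link"])
df.to_csv("links.csv", index=False)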
And you can find lots of tutorials on pandas online.

Best way to scrape job details from job descriptions

New to web scrapers and I prefer to use Python. Does anyone have any ideas for the easiest way to scrape job descriptions and input them into an Excel file? Which scraper would you use?
It depends; for a dynamic website Selenium is great. Selenium is a tool that automates web actions. Beautiful Soup is another option. Beautiful Soup doesn't automate website actions, it will just scrape website data. In my opinion, Beautiful Soup is easier to learn; one basic introduction will be all you need. As for the Excel file, there are several libraries you could use, that is more of a preference.
However, for your project I would go with Beautiful Soup.
As for the process of learning, YouTube is a great place to find tutorials, there are several for both. It's also really easy to find help with issues with either on here.
To give you a hint as to the general structure of your program, I would suggest something like this:
First Step: open an Excel file, this file will remain open for the whole time
Second Step: the web scraper locates the HTML tag for the job description
Third Step: use a for loop to cycle through each job description within this tag
Fourth Step: for each tag, retrieve the data and send it to an Excel sheet
Fifth Step: once you're done, close the Excel sheet
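A rough sketch of those five steps, assuming openpyxl for the Excel side (the URL and the 'job-description' class name are placeholder assumptions, not from any real site):
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

# Step 1: open an Excel workbook that stays open the whole time
wb = Workbook()
ws = wb.active
ws.append(["job_description"])                      # header row

# Step 2: locate the HTML tag holding the job descriptions
page = requests.get("https://example.com/jobs")     # placeholder URL
soup = BeautifulSoup(page.text, "html.parser")

# Steps 3-4: loop through each job description and send it to the sheet
for div in soup.find_all("div", class_="job-description"):  # assumed class name
    ws.append([div.get_text(strip=True)])

# Step 5: once done, save and close the workbook
wb.save("jobs.xlsx")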
Libraries I personally use: here
This is generally the boilerplate code most people probably use to start web scraping:
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint
from os.path import dirname, join

current_dir = dirname(__file__)
print(current_dir)

code = 0                          # placeholder status flag for later use
url_loop = "https://test.com"     # requests needs the scheme in the URL
r = requests.get(url_loop)

# Text some servers return on failure, handy for error checking later
error = "The page cannot be displayed because an internal server error has occurred."

soup = BeautifulSoup(r.text, 'html.parser')
requests is how you send HTTP requests.
BS4 (Beautiful Soup) is how you parse and extract specific info from the page, such as all h1 tags.
pprint just formats the result nicely.
As for using the collected data in Excel: Here
Good luck!

Issue with scraping tweets using python

I am trying to scrape tweets from one webpage within a certain timeframe.
To do so I am using this link which only searches within the timeframe I have specified:
https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22
This is my code:
import pandas as pd
import datetime as dt
import urllib.request
from bs4 import BeautifulSoup
url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'
thepage = urllib.request.urlopen(url)
soup = BeautifulSoup(thepage, "html.parser")   # parse the fetched page
i = 1
for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text.encode('UTF-8'))
    print(i)
    i += 1
The above code works when I am scraping within the actual Twitter page for the subwaystat user.
For this reason I don't understand why it doesn't work for the search page, even though the HTML appears to be the same to me.
I am a total beginner so I'm sorry if this is a dumb question. Thank you!
There is a Twitter Search API - docs: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
which, using a non-official Python wrapper (https://github.com/bear/python-twitter), makes it super easy to get tweets.
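For illustration, a minimal sketch with that wrapper (the credentials are placeholders you would create at developer.twitter.com; the search parameters mirror the URL above):
import twitter

# Placeholder credentials - substitute your own
api = twitter.Api(consumer_key="...",
                  consumer_secret="...",
                  access_token_key="...",
                  access_token_secret="...")

# Same query as the search URL: a term plus a since/until window
results = api.GetSearch(term="subwaydstats",
                        since="2016-08-22",
                        until="2018-08-22")
for status in results:
    print(status.text)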
However, if you want to scrape the HTML, then it's a lot more difficult. I was doing something similar - scraping an Angular app - and the actual HTML you see on the screen is rendered through front-end JavaScript. requests and urllib only fetch the basic HTML but do not run the JavaScript.
You could use Selenium, which is basically a browser that you can automate tasks on. Since it behaves as a browser, it actually runs that front-end JavaScript, meaning you will be able to scrape the webpage.
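A minimal sketch of that approach, assuming a driver such as chromedriver is installed:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://twitter.com/search?f=tweets&q=subwaydstats%20since%3A2016-08-22%20until%3A2018-08-22'

driver = webdriver.Chrome()     # requires chromedriver on your PATH
driver.get(url)
time.sleep(5)                   # crude wait for the front-end JavaScript to render
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for tweet in soup.find_all('div', {'class': 'js-tweet-text-container'}):
    print(tweet.find('p', {'class': 'TweetTextSize'}).text)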
A great article here explains the different ways you can scrape Twitter: https://medium.com/@dawran6/twitter-scraper-tutorial-with-python-requests-beautifulsoup-and-selenium-part-2-b38d849b07fe

getting information from a webpage for an application using python

I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP, and the new API-NG will use JSON, so I can understand how to access the information that I need.
My question is: using Python, what would be the best way to get information from a website that uses just HTML? Can I convert it some way, maybe to XML, or what is the best/easiest way?
JSON, XML and basically all of this is new to me, so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices:
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions, but looking at the answers and the source of the above page I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two well-used options.
urllib2 - This is built in.
requests - This is third party but really easy to use.
If you then need to parse your HTML, I would suggest using Beautiful Soup.
Example:
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
soup = BeautifulSoup(page_source, 'html.parser')   # specify a parser explicitly
The page_source is just the basic HTML of the page, which on its own is not much use; the soup object, on the other hand, can be used to access different parts of the page automatically.
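For instance, continuing from the snippet above (the tag and class names here are generic illustrations, not specific to any site):
print(soup.title.string)              # text of the <title> tag

# Every link target on the page
for a in soup.find_all('a', href=True):
    print(a['href'])

# First element with a given class (placeholder selector)
cell = soup.find('td', class_='name')
if cell is not None:
    print(cell.get_text(strip=True))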

Scraping data from multiple links within a site

I would like to use ScraperWiki and Python to build a scraper that will scrape large amounts of information from different sites. I am wondering if it is possible to point to a single URL and then scrape the data from each of the links within that site.
For example: a site would contain information about different projects, each within its own individual link. I don't need a list of those links but the actual data contained within them.
The scraper would be looking for the same attributes on each of the links.
Does anyone know how or if I could go about doing this?
Thanks!
Check out BeautifulSoup with urllib2.
http://www.crummy.com/software/BeautifulSoup/
A (very) rough example link scraper would look like this:
from bs4 import BeautifulSoup
import urllib2

url = 'http://www.example.com'        # placeholder URL
c = urllib2.urlopen(url)
contents = c.read()
soup = BeautifulSoup(contents, 'html.parser')
links = soup.find_all('a')
Then just write a for loop to do that many times over and you're set!
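As a rough sketch of that loop, continuing from the snippet above (the h1 tag and 'project-title' class are placeholder assumptions for whatever attributes the project pages actually share):
from urlparse import urljoin   # urllib.parse in Python 3

for link in links:
    href = link.get('href')
    if not href:
        continue
    page = urllib2.urlopen(urljoin(url, href)).read()
    page_soup = BeautifulSoup(page, 'html.parser')
    # Look for the same attribute on every project page
    title = page_soup.find('h1', class_='project-title')  # placeholder selector
    if title is not None:
        print(title.get_text(strip=True))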
