Web Scraping Using Python for NLP project - python

I have to scrape text data from this website. I have read some blogs on web scraping, but the major challenge I have found is parsing the HTML code. I am entirely new to this field. Can I get some help on how to scrape text data (where that is possible) and write it to a CSV? Is this possible at all without knowledge of HTML? Could I get a good demonstration of Python code that solves my problem, so that I can then try it on my own for other websites?
TIA

The tools you can use in Python to scrape and parse html data are the requests module and the Beautiful Soup library.
Parsing HTML files into, for example, CSV files is entirely possible; it just requires some effort to learn the tools. In my view there's no better way to learn this than by trying it out yourself.
As for "do you need to know HTML to parse HTML files?": well, yes you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
I appreciate my answer is not really what you were looking for; however, as I said, I think there's no better way to learn than to try things out yourself. If you then get stuck on anything in particular, you can ask on SO for specific help :)

I didn't check the HTML of the website, but you can use BeautifulSoup for parsing the HTML and pandas for converting the data into CSV.
Sample code:
import requests
from bs4 import BeautifulSoup

# Replace the placeholder with the real page; requests needs the scheme (http/https).
res = requests.get('https://yourwebsite.com')
soup = BeautifulSoup(res.content, 'html.parser')

# Suppose I want all 'li' tags and the links inside them.
lis = soup.find_all("li")
links = []
for li in lis:
    a_tag = li.find("a")
    # Skip list items that contain no link, or a link without an href.
    if a_tag is not None and a_tag.get("href"):
        links.append(a_tag.get("href"))
And you can find lots of tutorials on pandas online.
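If pandas is more than you need for the CSV step, the standard library's csv module can also do it. The following is only a minimal sketch: it assumes the links have already been collected into a list, and the function name `links_to_csv` and the column name `url` are made up for illustration.

```python
import csv
import io


def links_to_csv(links, header="url"):
    """Return CSV text with one link per row under a single header column."""
    buffer = io.StringIO()
    # csv.writer quotes any value containing commas or quotes for us.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([header])
    for link in links:
        writer.writerow([link])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample data standing in for scraped links.
    print(links_to_csv(["https://example.com/a", "https://example.com/b"]))
```

To save the result, write the returned string to a file opened with `newline=""` so that no extra blank lines appear on Windows.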

Related

Parse entire website using Python BeautifulSoup

I am trying to parse https://www.forbes.com/ for learning purposes. When I run the code, it only parses one page, i.e. the home page.
How can I parse the entire website, i.e. all the pages of the site?
My attempted code is given below:
import re
from urllib.request import urlopen

import pandas as pd
from bs4 import BeautifulSoup

html_page = urlopen("http://www.bdjobs.com/")
soup = BeautifulSoup(html_page, "html.parser")

# Collect every absolute link (href starting with "http") on the page.
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^http")}):
    links.append(link.get('href'))

# Export the links to a CSV file.
df = pd.DataFrame(links)
df.to_csv('link.csv')
#print(df)
Can you please tell me how I can parse the entire website, not just one page?
You have a couple of alternatives; which one fits depends on what you want to achieve.
Write your own crawler
Similar to what you are trying to do in your code snippet: fetch a page from the website, identify all the interesting links in this page (using XPath, regular expressions, ...), then repeat for each of those links until you have visited the whole domain.
This is probably most suitable for learning the basics of crawling, or for getting some information quickly as a one-off task.
You'll have to be careful about a couple of things, such as not visiting the same link twice and limiting the domain(s) so you don't wander off to other websites.
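The loop described above can be sketched roughly as follows. This is not a production crawler, just an illustration using only the standard library: the names `LinkCollector`, `fetch_html` and `crawl`, the `max_pages` limit and the 10 second timeout are all assumptions made for this example. It keeps a set of seen URLs so no page is visited twice, and it only follows links on the same host as the start URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch_html(url):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}          # URLs already queued, to avoid repeats
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue            # sketch only: skip pages that fail to load
        visited.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited
```

Passing the `fetch` function in as a parameter makes it possible to try the crawler against a dictionary of fake pages before pointing it at a real site. For real use you would probably also want a pause between requests.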
Use a web scraping framework
If you are looking to perform some serious scraping, for a production application or some large-scale scraping, consider using a framework such as Scrapy.
It solves a lot of common problems for you, and it is a great way to learn advanced techniques of web scraping, by reading the documentation and diving into the code.

Is it possible to scrape webpage without using third-party libraries in python?

I am trying to understand how Beautiful Soup works in Python. I have used Beautiful Soup and lxml in the past, but now I am trying to implement a script that can read data from a given webpage without any third-party libraries. It looks like the xml module doesn't have many options and it throws many errors. Is there any other library with good documentation for reading data from a web page?
I am not using these scripts on any particular websites. I am just trying to read from public pages and news blogs.
Third party libraries exist to make your life easier. Yes, of course you could write a program without them (the authors of the libraries had to). However, why reinvent the wheel?
Your best options are Beautiful Soup and Scrapy. However, if you're having trouble with Beautiful Soup, I wouldn't try Scrapy.
Perhaps you can get by with just the plain text from the website?
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
pagetxt = soup.get_text()
Then you can be done with all external libraries and just work with plain text. However, if you need to do something more complicated, HTML is something you really should use a library for manipulating. There is just too much that can go wrong.
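For completeness, getting plain text out of a page is also possible with the standard library alone, via html.parser. The sketch below is an illustration rather than a replacement for Beautiful Soup: the class name `TextExtractor` and the helper `html_to_text` are made up here, and the only cleanup it attempts is skipping the contents of script and style tags.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text pieces of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Anything beyond this, such as selecting specific elements or coping with badly broken markup, is where a dedicated library starts to pay off.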

getting information from a webpage for an application using python

I am currently trying to create a bot for the Betfair trading site. It involves using the Betfair API, which uses SOAP, and the new API-NG will use JSON, so I can understand how to access the information that I need.
My question is: using Python, what would be the best way to get information from a website that uses just HTML? Can I convert it in some way, maybe to XML, or what is the best/easiest way?
Json, xml and basically all this is new to me so any help will be appreciated.
This is one of the websites I am trying to access to get horse names and prices,
http://www.oddschecker.com/horse-racing-betting/chepstow/14:35/winner
I know there are some similar questions but looking at the answers and the source of the above page I am no nearer to figuring out how to get the info I need.
For getting HTML from a website there are two widely used options:
urllib2: built in on Python 2 (on Python 3 the equivalent is urllib.request).
requests: third party, but really easy to use.
If you then need to parse the HTML, I would suggest using Beautiful Soup.
Example:
import requests
from bs4 import BeautifulSoup
url = 'http://www.example.com'
page_request = requests.get(url)
page_source = page_request.text
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(page_source, 'html.parser')
The page_source is just the raw HTML of the page as a string, which is not much use on its own. The soup object, on the other hand, can be used to access different parts of the page programmatically.

How do you grab a headline from a blog/article like techmeme?

I'm creating a type of news aggregator and I would like to create a program (Python) that correctly detects the headline and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have included BeautifulSoup and Requests module. Any other open source models I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning - look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags using a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme currently are labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work and testing and periodic revalidation to be sure the webmaster hasn't switched things up on you.
After playing around a bit I find that this works best:
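Use the BeautifulSoup and Requests modules:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
# find() returns None when the page has no <title> tag.
title = soup.find('title')
if title is not None:
    print(title.get_text())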
What results is title text that should be cleaned up a bit using regular expressions.
Maybe it would be much easier to parse their RSS/Atom feeds. Google easily delivers these links: http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those feeds are pure XML, so you could use the built-in urllib and XML (DOM or SAX) libraries.
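As a rough illustration of the feed approach, the built-in xml.etree.ElementTree module can pull headlines out of an RSS document. This sketch assumes a plain RSS 2.0 feed whose entries are `<item>` elements with `<title>` and `<link>` children; Atom feeds use XML namespaces and different element names, so they would need different lookups. The function name `rss_headlines` is invented for this example, and the feed text is expected to have been downloaded already.

```python
import xml.etree.ElementTree as ET


def rss_headlines(feed_xml):
    """Return (title, link) pairs for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    headlines = []
    for item in root.iter("item"):
        # findtext() falls back to the default when the child is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        headlines.append((title, link))
    return headlines
```

Because the feed already marks which text is a headline, no guessing about page-specific tags is needed.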

How do I parse a website in python once I know its url?

If I know the URL of a wiki site, how do I use Python to parse its contents?
This is a very broad question, but the first things to reach for are urllib, which will handle the downloading part, and Beautiful Soup, which will do the parsing. Gluing them together and writing the code to actually extract information from the parse tree is up to you.
You might try Scrapy as well.
