A simple spider question - python

I am a newbie trying to achieve this simple task using Scrapy, with no luck so far. I am asking for your advice on how to do this with Scrapy or with any other Python tool. Thank you.
I want to
start from a page that lists bios of attorneys whose last names start with A: initial_url = www.example.com/Attorneys/List.aspx?LastName=A
from the LastName=A page, extract links to the actual bios: /BioLinks/
visit each of the /BioLinks/ pages to extract the school info for each attorney.
I am able to extract the /BioLinks/ and the school information, but I am unable to go from the initial URL to the bio pages.
If you think this is the wrong way to go about this, then how would you achieve this goal?
Many thanks.

Not sure I fully understand what you're asking, but maybe you need to get the absolute URL to each bio and retrieve the source code for that page:
import urllib2
bio_page = urllib2.urlopen(bio_url).read()
Then use regular expressions or another parsing approach to get the attorney's law school.
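Putting the pieces together, here is a minimal sketch of that flow using requests and BeautifulSoup (modern stand-ins for urllib2); the initial URL, the /BioLinks/ href pattern, and the final parsing step are assumptions taken from the question, so adjust them to the real markup:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

initial_url = 'http://www.example.com/Attorneys/List.aspx?LastName=A'
listing = BeautifulSoup(requests.get(initial_url).text, 'html.parser')

# Collect absolute URLs for every link whose href contains /BioLinks/ (assumed pattern).
bio_urls = [urljoin(initial_url, a['href'])
            for a in listing.find_all('a', href=True)
            if '/BioLinks/' in a['href']]

for bio_url in bio_urls:
    bio = BeautifulSoup(requests.get(bio_url).text, 'html.parser')
    # Replace this with whatever element actually holds the school info on the bio page.
    school = bio.find('div', class_='school')  # hypothetical class name
    print(bio_url, school.get_text(strip=True) if school else 'school info not found')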

Related

How to select element by class and get text?

I have been trying to scrape data from this page: https://www.worldometers.info/coronavirus/#news
How can I get the values that are under class="sorting_1"? It is difficult for me; I'm completely new to BeautifulSoup.
Using find:
[t.get_text(strip=True) for t in soup.find_all(attrs={'class': 'sorting_1'})]
or if you're sure that's the only class for the tags you want:
[t.get_text(strip=True) for t in soup.find_all(class_='sorting_1')]
or using select
[t.get_text(strip=True) for t in soup.select('.sorting_1')]
Any of the above should work; and if you're going to be working with BeautifulSoup, you should really familiarize yourself with the documentation and/or go through at least one tutorial.
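For completeness, a minimal setup sketch for the snippets above. One caveat: the sorting_1 class on that page is typically added client-side by the DataTables script, so the HTML returned by a plain requests call may not contain it; if nothing matches, target the underlying table cells instead or use a browser-driven tool.
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.worldometers.info/coronavirus/').text
soup = BeautifulSoup(html, 'html.parser')

# Any of the three variants above; note the class may only exist after
# the page's JavaScript runs, in which case this list comes back empty.
values = [t.get_text(strip=True) for t in soup.select('.sorting_1')]
print(values)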

Web Scraping Using Python for an NLP project

I have to scrape text data from this website. I have read some blogs on web scraping, but the major challenge I have found is parsing the HTML code. I am entirely new to this field. Can I get some help on how to scrape the text data (where possible) and turn it into a CSV? Is this possible at all without knowledge of HTML? Can I expect a good demonstration of Python code solving my problem? Then I will try this on my own for other websites.
TIA
The tools you can use in Python to scrape and parse HTML data are the requests module and the Beautiful Soup library.
Parsing HTML files into, for example, CSV files is entirely possible; it just requires some effort to learn the tools. In my view there's no better way to learn this than by trying it out yourself.
As for "do you need to know HTML to parse HTML files?": well, yes, you do, but the good thing is that HTML is actually quite simple. I suggest you take a look at some tutorials like this one, then inspect the webpage you're interested in and see if you can relate the two.
I appreciate my answer is not really what you were looking for; however, as I said, I think there's no better way to learn than to try things out yourself. If you're then stuck on anything in particular, you can ask on SO for specific help :)
I didn't check the HTML of the website, but you can use BeautifulSoup for parsing the HTML and pandas for converting the data into a CSV.
Sample code:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://yourwebsite.com')
soup = BeautifulSoup(res.content, 'html.parser')

# Suppose I want all 'li' tags and the links inside those 'li' tags.
lis = soup.find_all("li")
links = []
for li in lis:
    a_tag = li.find("a")
    if a_tag:  # skip list items that don't contain a link
        links.append(a_tag.get("href"))
And you can find lots of tutorials on pandas online.
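As a small follow-up sketch for the pandas-to-CSV step mentioned above (the column name 'link' and the output filename are just examples):
import pandas as pd

# `links` would be the list collected by the scraping code above.
links = ['https://example.com/a', 'https://example.com/b']
df = pd.DataFrame({'link': links})
df.to_csv('links.csv', index=False)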

How to search for some specific links(which may be present in a pdf file) in a website and crawl those links for other information?

I have a task to complete. I need to make a web-crawler kind of application. What I need to do is pass a URL to my application. This URL is the website of a government agency. This URL also has links to other individual agencies that are approved by this government agency. I need to go to those links and get some information from each site about that agency. I hope I make myself clear. Now I have to make this application generic, which means I can't hard-code it for just one website (government agency). It should take any URL it is given, check it, get all the links, and proceed. Now on some websites these links are present in PDFs and on some they are present on a page.
I have to use Python for this. I don't know how to approach this. I spent time on this using BeautifulSoup, but that requires lots of parsing. Other options are Scrapy or twill. Honestly, I am new to Python. I don't know which one is better for this task. So can anyone help me in selecting the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
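Since the crux of the question is collecting links from both HTML pages and PDFs, here is a rough sketch of one way to do that. The choice of BeautifulSoup for HTML and the third-party pypdf package for PDFs is mine, not something from the answers, and the PDF part assumes the links appear as plain-text URLs in the document:
import re
from io import BytesIO

import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader  # pip install pypdf

URL_PATTERN = re.compile(r'https?://\S+')

def links_from_html(url):
    # All hrefs in <a> tags on an ordinary web page.
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

def links_from_pdf(url):
    # Pull the text out of each PDF page and regex for URLs written in full.
    reader = PdfReader(BytesIO(requests.get(url).content))
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)
    return URL_PATTERN.findall(text)

def links_from(url):
    # Crude dispatch by extension; a real crawler would check the Content-Type header.
    return links_from_pdf(url) if url.lower().endswith('.pdf') else links_from_html(url)
From there you would visit each returned link and parse out the agency details you need.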

How do you grab a headline from a blog/article like techmeme?

I'm creating a type of news aggregator and I would like to create a program (Python) that correctly detects the headline and displays it. How would I go about doing this? Is this a machine learning problem?
I would appreciate any articles or books that would point me in the right direction.
My past attempts have included BeautifulSoup and Requests module. Any other open source models I should check out?
Thank you,
Fernando
The direct way to scrape a web page requires human learning - look at the page, decide what you think are headlines, find out how they are tagged, and then look for those tags using a parser like BeautifulSoup. For example, the level 1 headlines on Techmeme currently are labeled:
<DIV CLASS="ii">
and the level 2 headlines are:
<STRONG CLASS="L1">
After your program fetches the page and matches the tags you're interested in, see if they identify what you're looking for. If some headlines are missed, add additional tags to your search list. If you get false positives (hits on links that aren't headlines), weeding them out will require extra page-dependent logic. There is no magic to reverse engineering, just grunt work and testing and periodic revalidation to be sure the webmaster hasn't switched things up on you.
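A quick sketch of that matching step with requests and BeautifulSoup; the 'ii' and 'L1' class names are the ones quoted above and may well have changed since, so re-inspect the page before relying on them:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.techmeme.com/').text, 'html.parser')

# Level 1 and level 2 headline containers, per the tags noted above.
level1 = [d.get_text(strip=True) for d in soup.find_all('div', class_='ii')]
level2 = [s.get_text(strip=True) for s in soup.find_all('strong', class_='L1')]
print(level1[:5])
print(level2[:5])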
After playing around a bit I find that this works best:
Use the BeautifulSoup and Requests modules:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')
soup = BeautifulSoup(r.text, 'html.parser')
title = soup.find('title')
if title:
    print(title.get_text())
The result is the title text, which should be cleaned up a bit using regular expressions.
Maybe it could be much easier to parse their RSS/Atom feeds. Google easily delivers these links: http://wiki.python.org/moin/RssLibraries and http://pypi.python.org/pypi/Atomisator/1.3
But those are pure XML, so you could use the built-in urllib and XML (DOM or SAX) libraries.
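For illustration, a tiny sketch with the third-party feedparser package (my choice rather than one of the libraries linked above; the feed URL is a placeholder):
import feedparser  # pip install feedparser

feed = feedparser.parse('https://example.com/rss.xml')
for entry in feed.entries:
    # Each feed entry already carries the headline, no HTML parsing needed.
    print(entry.title, entry.link)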

How to extract all the url's from a website?

I am writing a program in Python to extract all the URLs from a given website. All the URLs from the site, not just from a single page.
As I suppose I am not the first one who wants to do this, I was wondering if there is a ready-made solution or if I have to write the code myself.
It's not gonna be easy, but a decent starting point would be to look into these two libraries:
urllib
BeautifulSoup
I didn't see any ready-made scripts that do this in a quick Google search.
Using the Scrapy framework makes this almost trivial.
The time-consuming part would be learning how to use Scrapy. Their tutorials are great, though, and shouldn't take you that long.
http://doc.scrapy.org/en/latest/intro/tutorial.html
Creating a solution that others can use is one of the joys of being part of a programming community. If a scraper doesn't exist, you can create one that everyone can use to get all the links from a site!
The given answers are what I would have suggested (+1).
But if you really want to do something quick and simple, and you're on a *NIX platform, try this:
lynx -dump YOUR_URL | grep http
Where YOUR_URL is the URL that you want to check. This should get you all the links you want (except for links that are not written out in full, such as relative links).
You first have to download the page's HTML content using a package like urllib or requests.
After that, you can use Beautiful Soup to extract the URLs. In fact, their tutorial shows how to extract all links enclosed in <a> elements as a specific example:
for link in soup.find_all('a'):
print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
If you also want to find links not enclosed in <a> elements, you may have to write something more complex on your own.
EDIT: I also just came across two Scrapy link extractor classes that were created specifically for this task:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
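Putting the Scrapy suggestions together, a minimal spider sketch that walks a whole site and collects every link it finds; the domain and start URL are placeholders, and it can be run with scrapy runspider spider.py -o urls.json:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AllUrlsSpider(CrawlSpider):
    name = 'all_urls'
    allowed_domains = ['example.com']       # placeholder domain
    start_urls = ['https://example.com/']   # placeholder start page

    # Follow every internal link and hand each fetched page to parse_item.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Emit every URL found on the page, resolved to absolute form.
        for href in response.css('a::attr(href)').getall():
            yield {'page': response.url, 'url': response.urljoin(href)}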
