Downloading census data from Bhuvan using Python

I would like to download census data (years 2001 and 2011) from http://bhuvan5.nrsc.gov.in/bhuvan/web/?wicket:bookmarkablePage=:org.geoserver.web.demo.MapPreviewPage in KML/KMZ format for multiple states of India. I am thinking of automating the process with Python, as the data spans a huge number of files. I am a beginner in this kind of programming, so it would be great if anyone could help or guide me with this.

This is an ugly target for a first scraping project, as the page uses a JavaScript-paginated table.
Since you'll need a JavaScript engine, the most beginner-friendly option in Python is Selenium with its Python bindings, which drives a real browser to scrape the page.
Before using it, you should read up on the basics of the HTML DOM and XPath.
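A minimal sketch of the Selenium approach (the XPath locators and the fixed wait are placeholders; you will need to inspect the actual Bhuvan page and adapt them, and a matching browser driver must be installed):

    # Minimal Selenium sketch; the locators below are guesses and must be adapted
    # to the real structure of the Bhuvan MapPreview page.
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()   # or webdriver.Chrome(); needs the matching driver
    driver.get("http://bhuvan5.nrsc.gov.in/bhuvan/web/"
               "?wicket:bookmarkablePage=:org.geoserver.web.demo.MapPreviewPage")
    time.sleep(10)                 # crude wait for the JavaScript table to render

    # Collect links that look like KML downloads on the current page (XPath is a guess)
    links = driver.find_elements(By.XPATH, "//a[contains(@href, 'KML')]")
    for link in links:
        print(link.get_attribute("href"))

    # To walk the paginated table, locate and click its "next" control, for example
    # driver.find_element(By.XPATH, "//a[contains(@class, 'next')]").click(),
    # then repeat the link collection until the last page.

    driver.quit()

Once you have the KML/KMZ URLs, each file can be downloaded with requests or urllib.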

Related

Scraping Contact Information from Several Websites with Python

I want to collect contact information for all county governments. I do not have a list of their websites. I want to do three things with Python: 1) create a list of county government websites, 2) extract names, email addresses, and phone numbers of government officials, and 3) write the URLs and all the contact information to an Excel sheet or CSV.
I am a beginner in Python, and any guidance would be greatly appreciated. Thanks!
For creating tables, you would use a package called pandas.
For extracting information from websites, a package called beautifulsoup4 is commonly used.
For scraping a website (that is, pulling data from anywhere on the web), you should first decide what kind of search you want to start from: a Google search or a specific website. In both cases you need the requests library to fetch a page (or issue a query, like typing into the search bar) and get its HTML back. To parse the HTML you receive, you can use BeautifulSoup. Both libraries have good documentation and you should read it; don't be discouraged, it's easy.
Because there are more than 170 countries in the world, you will need to manage your data; for managing data I recommend pandas. Finally, after processing the data, you can export it to almost any file type with DataFrame.to_excel, DataFrame.to_csv and more.
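As a rough sketch of that pipeline (the URL in the list and the CSS selectors are purely hypothetical; each county site will need its own selectors):

    # requests + BeautifulSoup + pandas sketch; URLs and selectors are placeholders.
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    def text_or_blank(parent, selector):
        """Return the stripped text of the first match, or an empty string."""
        el = parent.select_one(selector)
        return el.get_text(strip=True) if el else ""

    urls = ["https://example-county.gov/contact"]   # hypothetical list of county sites
    rows = []

    for url in urls:
        resp = requests.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select(".contact-card"):   # selector is a guess, per site
            rows.append({
                "url": url,
                "name": text_or_blank(card, ".name"),
                "email": text_or_blank(card, ".email"),
                "phone": text_or_blank(card, ".phone"),
            })

    df = pd.DataFrame(rows)
    df.to_csv("contacts.csv", index=False)
    # df.to_excel("contacts.xlsx", index=False)     # requires openpyxl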

Web scrapping an array with multiple pages

I am trying to scrape this table from Etherscan.io: https://etherscan.io/tokens?ps=100&p=1
Since I am not that familiar with XPath, I have chosen to use the Chrome extension Webscrap.
The problem being the table is contained in several pages.
Is there a way of scraping all the pages at once without mapping each page one by one?
I could also try to do it with Python directly; I know there are some pretty good libraries out there.
Is that doable? Would it take me a long time to learn (I know very little about HTML and XPath)?
If so, what would be the easiest and quickest library to learn?
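For reference, a rough Python sketch of looping over the pages, assuming the ?ps=100&p=N URL pattern holds, the token table is plain HTML that pandas.read_html can parse, and the site accepts requests with a browser-like User-Agent (it may not, in which case Selenium is the fallback):

    # Loop over the paginated URL and stack the tables with pandas.
    from io import StringIO

    import pandas as pd
    import requests

    headers = {"User-Agent": "Mozilla/5.0"}   # bare requests are often rejected
    frames = []

    for page in range(1, 6):                  # adjust the upper bound to the real page count
        url = f"https://etherscan.io/tokens?ps=100&p={page}"
        resp = requests.get(url, headers=headers, timeout=30)
        tables = pd.read_html(StringIO(resp.text))
        frames.append(tables[0])              # assumes the token table is the first on the page

    all_tokens = pd.concat(frames, ignore_index=True)
    all_tokens.to_csv("tokens.csv", index=False)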

How do I extract data from a website for a data table using python?

I'm building tools for a business I am starting, and one of the tools I need is a live reader of scrap prices to feed into my website, as well as to work out overhead/profit. How would I take that information and then put it into a live data table?
I'm a very amateur programmer, but I'd rather do it myself.
You can use Flask and Beautiful Soup. With Flask you will be able to serve the data on your site after obtaining the information by parsing the source page's HTML with Beautiful Soup.
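A bare-bones sketch of that combination; the price source URL and the table selector are placeholders for whatever site actually publishes the scrap prices:

    # Flask + Beautiful Soup sketch; the source URL and selectors are placeholders.
    import requests
    from bs4 import BeautifulSoup
    from flask import Flask, jsonify

    app = Flask(__name__)

    def fetch_prices():
        resp = requests.get("https://example.com/scrap-prices", timeout=30)  # hypothetical source
        soup = BeautifulSoup(resp.text, "html.parser")
        prices = []
        for row in soup.select("table tr"):           # selector is a guess, inspect the real page
            cells = [td.get_text(strip=True) for td in row.select("td")]
            if len(cells) >= 2:
                prices.append({"material": cells[0], "price": cells[1]})
        return prices

    @app.route("/prices")
    def prices():
        # Re-scrapes on every request; cache or schedule this if the source updates slowly.
        return jsonify(fetch_prices())

    if __name__ == "__main__":
        app.run(debug=True)

Your website's data table can then poll the /prices endpoint (for example from JavaScript) to stay live.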

Building comprehensive scraping program/database for real estate websites

I have a project I’m exploring where I want to scrape the real estate broker websites in my country (30-40 websites of listings) and keep the information about each property in a database.
I have experimented a bit with scraping in python using both BeautifulSoup and Scrapy.
What I would ideally like to achieve is a daily-updated database that picks up new properties and removes properties when they are sold.
Any pointers as to how to achieve this?
I am relatively new to programming and open to learning different languages and resources if python isn’t suitable.
Sorry if this forum isn’t intended for this kind of vague question :-)
Build a scraper and schedule a daily run. You can use Scrapy, and the scheduled run will keep the database up to date each day; a skeleton spider is sketched below.
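The start URL, the CSS selectors, and the item fields in this sketch are placeholders for whatever the broker sites actually expose:

    # Skeleton Scrapy spider; URL and selectors are hypothetical.
    import scrapy

    class ListingsSpider(scrapy.Spider):
        name = "listings"
        start_urls = ["https://example-broker.example/listings"]   # hypothetical broker site

        def parse(self, response):
            for listing in response.css(".listing"):               # selector is a guess
                yield {
                    "url": response.urljoin(listing.css("a::attr(href)").get()),
                    "title": listing.css(".title::text").get(),
                    "price": listing.css(".price::text").get(),
                }
            next_page = response.css("a.next::attr(href)").get()   # pagination selector is a guess
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Run it once a day (cron on Linux, Task Scheduler on Windows), for example with scrapy crawl listings -o listings.json, then compare the scraped listing URLs against your database: unseen URLs become new rows, and URLs that stop appearing can be flagged as sold.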

Crawling web for specific file type

As a part of a research, I need to download freely available RDF (Resource Description Framework - *.rdf) files via web, as much as possible. What are the ideal libraries/frameworks available in Python for doing this?
Are there any websites/search engines capable of doing this? I've tried a Google filetype:RDF search. Initially, Google shows you 6,960,000 results. However, as you browse the individual result pages, the results drastically drop to 205. I wrote a script to screen-scrape and download files, but 205 is not enough for my research and I am sure there are more than 205 files on the web. So, I really need a file crawler. I'd like to know whether there are any online or offline tools that can be used for this purpose, or frameworks/sample scripts in Python to achieve this. Any help in this regard is highly appreciated.
Crawling RDF content from the web is no different from crawling any other content. That said, if your question is "what is a good Python web crawler", then you should read this question: Anyone know of a good Python based web crawler that I could use?. If your question is about processing RDF with Python, then there are several options, one being RDFLib.
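As an illustration of both halves, a small sketch that collects .rdf links from a seed page, downloads them, and loads one into RDFLib (the seed URL is purely hypothetical, and a real crawler would also follow links to further pages):

    # Collect .rdf links from a seed page, download them, parse one with RDFLib.
    from urllib.parse import urljoin

    import requests
    import rdflib
    from bs4 import BeautifulSoup

    seed = "https://example.org/rdf-index.html"   # hypothetical page listing RDF files
    resp = requests.get(seed, timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    rdf_urls = [urljoin(seed, a["href"])
                for a in soup.find_all("a", href=True)
                if a["href"].lower().endswith(".rdf")]

    for i, url in enumerate(rdf_urls):
        with open(f"file_{i}.rdf", "wb") as f:
            f.write(requests.get(url, timeout=30).content)

    if rdf_urls:
        g = rdflib.Graph()
        g.parse("file_0.rdf")                     # format is usually inferred from the extension
        print(len(g), "triples")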
Did you notice text along the lines of "Google has hidden similar results, click here to show all results" at the bottom of one page? That might help.
I know that I'm a bit late with this answer, but for future searchers: http://sindice.com/ is a great index of RDF documents.
Teleport Pro: although it probably can't copy from Google itself (too big), it can likely handle proxy sites that return Google results, and I know for a fact I could download 10,000 PDFs within a day if I wanted to. It has file-type specifiers and many options.
Here's one workaround:
1. Get "Download Master" from the Chrome extensions store, or a similar program.
2. Search on Google (or elsewhere) for results, with Google set to 100 results per page.
3. Select "show all files".
4. Enter your file extension, .rdf, and press Enter.
5. Press download.
You can get 100 files per click, which is not bad.
