Scraping Contact Information from Several Websites with Python - python

I want to collect contact information from all county governments. I do not have a list of their websites. I want to do three things with Python: 1) create a list of county government websites, 2) extract names, email addresses, and phone numbers of government officials, and 3) convert the URLs and all the contact information into an Excel sheet or CSV.
I am a beginner in Python, and any guidance would be greatly appreciated. Thanks!

For creating tables, you would use a package called pandas; for extracting info from websites, a package called beautifulsoup4 is commonly used.
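A minimal sketch of how the two fit together, assuming you already have a list of county contact pages (the URL below is a placeholder and the regular expressions are only rough patterns):

    import re
    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    # Hypothetical list of county government contact pages -- replace with your own.
    urls = ["https://www.examplecounty.gov/officials"]

    rows = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        text = soup.get_text(" ", strip=True)

        # Very rough patterns for email addresses and US-style phone numbers.
        emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
        phones = re.findall(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}", text)

        rows.append({"url": url,
                     "emails": "; ".join(sorted(set(emails))),
                     "phones": "; ".join(sorted(set(phones)))})

    pd.DataFrame(rows).to_csv("county_contacts.csv", index=False)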

For scraping a website (any data on the web), you should first define what kind of search you want to start with: do you want to search Google or a specific website? For both, you need the requests library to fetch a site or run a Google query (like typing into the search bar) and get the HTML back. For parsing the data you have fetched, you can use BeautifulSoup. Both libraries have good documentation, and you should read it; don't be discouraged, it's easy.
Because there are more than 170 countries in the world, you should also manage your data; for managing data I recommend pandas. Finally, after processing the data, you can convert it to any file type with pandas.to_excel, pandas.to_csv, and more.
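For example, once the scraped records are in a DataFrame, writing them out is a single call (the columns below are only illustrative):

    import pandas as pd

    # Illustrative records -- in practice these come from your scraping step.
    officials = pd.DataFrame([
        {"county": "Example County", "name": "Jane Doe",
         "email": "jdoe@examplecounty.gov", "phone": "(555) 123-4567"},
    ])

    officials.to_csv("officials.csv", index=False)
    officials.to_excel("officials.xlsx", index=False)  # requires openpyxl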

Related

Navigate through a pdf file to find specific pages and extract tabular data from image with python

I've come across an assignment which requires me to extract tabular data from images in a PDF file into neatly formatted dataframes via Python code. There are several files to be processed, and the relevant pages in all the files may have different page numbers, hence the sequence of steps for this problem (my assumption) is:
Navigate to relevant section of the pdf
Extract images of the tabular data
Extract data from the images, format and convert to dataframes.
Some Google searches led me to libraries for PDF text extraction, table extraction, and more, but only modular solutions.
I would appreciate some help in this regard. What packages should I use? Is my approach correct?
Can I get references to any helpful code snippets for similar problems?
[Image: page structure of the required tables]
This started as a comment. I believe the answer is valid, as it is in no way an endorsement of the service; I don't even use it. I know Azure uses SO as well.
This is the stuff of commercial services. You can try Azure Form Recognizer (with which I am not affiliated):
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer
Here are some Python examples of how to use it:
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python
The AWS equivalent is Textract https://aws.amazon.com/textract
The Google Cloud version is called Form Parser - see https://cloud.google.com/document-ai/docs/processors-list#processor_form-parser
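As a rough sketch of what the Azure Python SDK (azure-ai-formrecognizer) looks like for table extraction -- the endpoint, key, and file name below are placeholders, so check the linked docs for the current API:

    import pandas as pd
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.formrecognizer import DocumentAnalysisClient

    # Placeholders -- use your own Form Recognizer resource and document.
    endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
    key = "<your-key>"

    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    with open("report.pdf", "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
    result = poller.result()

    # Rebuild each detected table as a DataFrame.
    frames = []
    for table in result.tables:
        grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        for cell in table.cells:
            grid[cell.row_index][cell.column_index] = cell.content
        frames.append(pd.DataFrame(grid))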

Trying to extract data from multiple websites at once but can't

I have a dataset of 110,003 churches in the US. The dataset includes the name of the church together with other details. I need to populate a column with the date of opening. This requires Googling each church to find its date of opening and populating each row. I am posting an example dataset with 10,000 cases. Please help!
https://docs.google.com/spreadsheets/d/1B1db58lmP5nK1ZeJPEJTuYnPjGTxfu_74RG1IELLsP0/edit?usp=sharing
I have tried using import.io and Scrapy, but they don't seem to work well for me, so I would appreciate it if you could recommend a better tool.

Scraping Multiple Sites for Similar Information

I need to scrape several different sites for the same information. Basically, I am looking for similar information, but the sites could belong to different vendors and can have different HTML structures. For example, if I am trying to scrape the data related to textbooks from Barnes & Noble and Biblio (these are only two, but there could be many) and get the book name, author, and prices for the books, how would one do that?
https://www.barnesandnoble.com/b/textbooks/mathematics/algebra/_/N-8q9Z18k3
https://www.biblio.com/search.php?stage=1&result_type=works&keyisbn=algebra
Of course, I can parse the two sites independently, but I am looking for a general methodology that can be easily applied to other vendors as well to extract the same information.
In a separate but related question, I would also like to know how Google shows different product information from different sources when you search for a product. For example, if you google "MacBook Pro", at the top of the page you get a carousel of products from different vendors. I assume Google is scraping this information from different sources automatically to suggest to the user what is available.
Have a look at scrapely. It can really be helpful if you don't want to manually parse different HTML structures.
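Roughly how scrapely works: you train it on one example page by giving it the values you want, and it learns the surrounding structure so the same scraper can be reused on similarly structured pages -- one trained scraper per template/vendor, but with no hand-written parsing. The URLs and sample values below are placeholders:

    from scrapely import Scraper

    scraper = Scraper()

    # Train on one book page by supplying example values found on it (placeholders).
    train_url = "https://www.biblio.com/some-algebra-textbook"
    scraper.train(train_url, {"name": "Algebra", "author": "Some Author", "price": "$25.00"})

    # The trained scraper can then be applied to other pages with the same layout.
    print(scraper.scrape("https://www.biblio.com/another-algebra-textbook"))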

Trying to get links of an interactive map (Web scraping .swf)

I need to create a web scraper for this website.
However, I need to get the links for the counties, which are stored in the interactive map.
Unfortunately, for some reason, their search engine doesn't provide all the results that the interactive map does.
My question:
Could anyone tell me how to get all the links for all the counties, without manually accessing them?
Thanks
Technically you can use a decompiler to do this job. There are free (e.g. ActionScript Extractor) and paid (e.g. Sothink SWF Decompiler) tools out there.
You can reference this answer.
Edit:
Most SWF content gets its external records from either a .xml or a .json file.
Without decompiling, and just using the browser's Developer Tools, we can see that an XML file is indeed accessed (maybe it contains what you want):
http://www.allpetservices.co.uk/uk_ir_locator.xml
Put view-source: in front of the link to read it (if there's an error message).
In that XML, you want to extract the contents (the xyz) of each and every <link>xyz</link> tag. This will give you the links of every entry on the map.
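For example, something like this should pull them out (assuming the tags in that file are plain, un-namespaced <link> elements):

    import requests
    import xml.etree.ElementTree as ET

    resp = requests.get("http://www.allpetservices.co.uk/uk_ir_locator.xml", timeout=30)
    root = ET.fromstring(resp.content)

    # Collect the text content of every <link> element, i.e. each map entry's URL.
    links = [el.text.strip() for el in root.iter("link") if el.text]
    print(len(links), links[:5])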
The short answer to your question: there's no way to get the links from the site.
The solution: the structure of the links you are trying to retrieve is very predictable. They all follow the same pattern:
http://www.allpetservices.co.uk/search_map.asp?ccounty={COUNTY_NAME}
So, if you can use another site or data source to get the names of each of the counties, you can formulate each of the links that you need.
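For example, assuming you already have the county names from some other source (the names below are just placeholders):

    from urllib.parse import quote

    # Hypothetical county names obtained from another site or data source.
    counties = ["Bedfordshire", "Greater London", "West Yorkshire"]

    links = ["http://www.allpetservices.co.uk/search_map.asp?ccounty=" + quote(name)
             for name in counties]
    print(links)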

downloading census data from Bhuvan using python

I would like to download census data (from 2001 and 2011) from http://bhuvan5.nrsc.gov.in/bhuvan/web/?wicket:bookmarkablePage=:org.geoserver.web.demo.MapPreviewPage in KML/KMZ format for multiple states of India. I am thinking of automating the process using Python, as the data contains a huge number of files. I am a beginner in this kind of programming. It would be great if anyone could help or guide me with this.
This is an ugly target for a first scraping project, as it has a JavaScript-paginated table.
As you'll need a JavaScript engine, the friendliest Python option is to use Selenium with its Python bindings to scrape the page.
Before using it, you should read up on the basics of the HTML DOM and XPath.
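A minimal Selenium sketch to get started -- the XPath below is hypothetical, so inspect the page with the Developer Tools to find the real table rows and the pagination control:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Assumes a matching ChromeDriver is available (recent Selenium can fetch it itself).
    driver = webdriver.Chrome()
    driver.get("http://bhuvan5.nrsc.gov.in/bhuvan/web/"
               "?wicket:bookmarkablePage=:org.geoserver.web.demo.MapPreviewPage")

    # Hypothetical XPath -- replace with the selector for the paginated layer table.
    rows = driver.find_elements(By.XPATH, "//table//tr")
    for row in rows:
        print(row.text)

    driver.quit()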
