I have a CSV file with a single column of book names and 1,000 rows. I need to crawl the author and published year into the next columns. Can I do this with Scrapy? Is there any documentation you could share with me?
Thanks for now.
Book_Name;Author;Published_Date
don quijote;;
name of the rose;;
oliver twist;;
Edit: I tried to find the data on "https://isbndb.com". I wonder if Scrapy is suitable for this job.
Scrapy is used for web scraping.
In your case you could simply use CSV files and Python.
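For illustration, here is a minimal sketch of that idea: read the semicolon-delimited file with the csv module, look each title up, and write the author and year into the next columns. The lookup endpoint and response fields are assumptions (isbndb.com, for instance, requires an API key and has its own JSON schema), so treat lookup_book as a placeholder.

```python
import csv
import requests

def lookup_book(title):
    # Hypothetical lookup; replace the URL, parameters and field names
    # with the real API you end up using (e.g. isbndb.com with an API key).
    resp = requests.get("https://example.com/api/books", params={"q": title})
    data = resp.json()
    return data.get("author", ""), data.get("year", "")

# Read the semicolon-delimited book list and write a new file with the
# author/published-year columns filled in.
with open("books.csv", newline="", encoding="utf-8") as src, \
     open("books_filled.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter=";")
    writer = csv.writer(dst, delimiter=";")
    writer.writerow(next(reader))  # copy the header row
    for row in reader:
        title = row[0]
        author, year = lookup_book(title)
        writer.writerow([title, author, year])
```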
Related
I want to collect contact information from all county governments. I do not have a list of their websites. I want to do three things with Python: 1) create a list of county government websites, 2) extract names, email addresses, and phone numbers of government officials, and 3) convert the URLs and all the contact information into an Excel sheet or CSV.
I am a beginner in Python, and any guidance would be greatly appreciated. Thanks!
For creating tables, you would use a package called pandas; for extracting info from websites, a package called beautifulsoup4 is commonly used.
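As a rough sketch of how those two fit together (the URL and CSS selectors below are placeholders you would adapt to each real county site):

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Placeholder URL and selectors: adapt to the structure of the real county pages.
url = "https://example-county.gov/officials"
html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.select("table.staff tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) >= 3:
        rows.append({"name": cells[0], "email": cells[1], "phone": cells[2]})

# Collect everything into a table for later export.
contacts = pd.DataFrame(rows)
print(contacts.head())
```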
For scraping a website, you should first define what kind of search you want to run: do you want to search Google or a specific website? Either way, you need the requests library to fetch a site (or to run a Google query, like typing in the search bar) and get the HTML back. For parsing the data you get, you can use BeautifulSoup. Both have good documentation and you should read it; don't be discouraged, it's easy.
Because you will be collecting data for a large number of counties, you should manage your data; for managing data I recommend pandas.
Finally, after processing the data you can convert it to any type of file with pandas.to_excel, pandas.to_csv and more.
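For that last step, once the contacts sit in a pandas DataFrame (the columns here are only an example), exporting is a single call:

```python
import pandas as pd

# Example data only; in practice this DataFrame would hold the scraped contacts.
contacts = pd.DataFrame(
    [{"county": "Example County", "name": "Jane Doe",
      "email": "jane.doe@example.gov", "phone": "555-0100"}]
)

contacts.to_csv("county_contacts.csv", index=False)
# Writing .xlsx requires an engine such as openpyxl (pip install openpyxl).
contacts.to_excel("county_contacts.xlsx", index=False)
```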
Is it possible to download all Wikipedia pages of one category (person)? I have downloaded the dump and I want to filter to get all person pages. I am using Python. Any hints would be very appreciated.
I will open the file in Python using bz2.open but I don't know how to filter the pages.
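For what it's worth, a minimal sketch of the bz2.open approach mentioned above: it streams the compressed XML dump line by line and keeps pages whose wikitext contains a given category tag. The dump filename and the category string are assumptions, and a real solution would likely use a proper XML parser, but it shows the idea.

```python
import bz2

# Stream the compressed dump line by line (it is far too large to load at once).
# The filename and category string are assumptions; adjust them to your dump.
DUMP = "enwiki-latest-pages-articles.xml.bz2"
CATEGORY = "[[Category:Living people]]"

with bz2.open(DUMP, mode="rt", encoding="utf-8") as f:
    page_lines = []
    inside_page = False
    for line in f:
        if "<page>" in line:
            inside_page = True
            page_lines = []
        if inside_page:
            page_lines.append(line)
        if "</page>" in line:
            inside_page = False
            page_text = "".join(page_lines)
            if CATEGORY in page_text:
                # A matching page: write it out or parse it further here.
                print(page_text[:200])
```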
I have a list of thousands of websites and I would like to extract phone numbers and emails if available.
Possibly using Python + Scrapy.
I found this one
https://levelup.gitconnected.com/scraping-websites-for-phone-numbers-and-emails-with-python-5557fcfa1596
but it looks like the package is not available anymore.
Any suggestions?
thanks!
This is a broad question, so I can't answer it here completely.
Basically, you need to follow these steps:
First, scrape the website HTML using BS4 or Scrapy.
Then use some regex to find the emails and phone numbers.
Also check this article: https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/
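A rough sketch of those two steps using requests + BS4 and regex; the patterns are simple examples (real-world phone and email formats vary a lot), and you would loop extract_contacts over your list of websites:

```python
import re
import requests
from bs4 import BeautifulSoup

# Simple example patterns -- real phone/email formats vary, so treat these as a starting point.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(url):
    # Fetch the page, strip it down to visible text, then run the regexes.
    html = requests.get(url, timeout=30).text
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return set(EMAIL_RE.findall(text)), set(PHONE_RE.findall(text))

emails, phones = extract_contacts("https://example.com")
print(emails, phones)
```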
I have a list of 230 crystal structure space groups (strings). I'd like to write a python script to extract files, for each group, from http://rruff.geo.arizona.edu/AMS/amcsd.php.
I'd like the script to iteratively search for all space groups in the "Cell Parameters and Symmetry" search option, and then download one of the files for some structure (say the first one).
An example of my list looks something like spaceGroups = ["A-1","A2","A2/a","A2/m","..."]. The search format for, say, group 1 will look like this: sg=A-1, and the results look like http://rruff.geo.arizona.edu/AMS/result.php.
First I'd like to know if this is even possible, and if so, where to start?
Sure, it's possible. The "clean" way is to create a crawler to make the requests, then download and save the files.
You can use Scrapy (https://docs.scrapy.org/en/latest/) for the crawler and Fiddler (https://www.telerik.com/fiddler) to see what requests you need to recreate inside your spider.
In essence, you will use your list of space groups to generate requests to the form on that page. After each request you will parse the response, collect the IDs/download URLs, and follow subsequent pages (to collect all IDs/download URLs). Finally, you will download the files.
If you don't want to use Scrapy, you can build your own logic with requests (https://requests.readthedocs.io/en/latest/user/quickstart/), but Scrapy would download everything faster and has a lot of features to help you.
Perusing that page, it seems you only need the IDs for each crystal; the actual download URLs are simple.
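To make that concrete, here is a rough Scrapy spider skeleton for the first part: submitting one search per space group and collecting download links from the result pages. The form field name ("sg" comes from the example in the question) and the CSS selector are assumptions you would confirm with Fiddler or the browser dev tools, as suggested above.

```python
import scrapy

class SpaceGroupSpider(scrapy.Spider):
    name = "amcsd_spacegroups"

    # The full list of 230 space groups would go here.
    space_groups = ["A-1", "A2", "A2/a", "A2/m"]

    def start_requests(self):
        for sg in self.space_groups:
            # The form field name and any other required fields are assumptions;
            # inspect the real search form to confirm them.
            yield scrapy.FormRequest(
                url="http://rruff.geo.arizona.edu/AMS/amcsd.php",
                formdata={"sg": sg},
                callback=self.parse_results,
                cb_kwargs={"space_group": sg},
            )

    def parse_results(self, response, space_group):
        # Placeholder selector: grab the first link on the result page.
        first_link = response.css("a::attr(href)").get()
        if first_link:
            yield {"space_group": space_group,
                   "file_url": response.urljoin(first_link)}
```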
I have a dataset of 110,003 churches in the US. The dataset includes the name of the church together with other details. I need to populate a column with the date of opening. This requires Googling each church to find its date of opening and populating each row. I am posting an example dataset with 10,000 cases. Please help!
https://docs.google.com/spreadsheets/d/1B1db58lmP5nK1ZeJPEJTuYnPjGTxfu_74RG1IELLsP0/edit?usp=sharing
I have tried using import.io and Scrapy, but they don't seem to work for me, so I would appreciate it if you could recommend a better tool.