Is Python a suitable tool for automating data scraping? [closed]

I am working on a project that involves a large amount of data. Essentially, a website hosts a large repository of Excel files that can be downloaded. The site has several different lists of filters, and I have several different parameters I am filtering on before collecting the data. Overall, this process requires me to download upwards of 1,000 Excel files and copy and paste them together.
Does Python have the functionality to automate this process? Essentially what I am doing is setting Filter 1 = A, Filter 2 = B, Filter 3 = C, downloading the file, and then repeating with different parameters and pasting the files together. If Python is suitable for this, can anyone point me in the direction of a good tutorial or starting point? If not, what language would be more suitable for someone with little background?
Thanks!

Personally, I would prefer to use Python for this. I would look in particular at pandas, a powerful data-analysis library whose DataFrame object can be used like a headless spreadsheet. I use it for a small number of spreadsheets and it has been very quick. Perhaps take a look at this site for more guidance: https://pythonprogramming.net/data-analysis-python-pandas-tutorial-introduction/
I'm not 100% sure whether your question was only about spreadsheets; my first paragraph was really about working on the files once you have downloaded them. If you're interested in actually fetching the files, or 'scraping' the data, look at the Requests library for the HTTP side of things; this is what you could use if there is a RESTful way of doing things. Or look at Scrapy (https://scrapy.org) for web scraping.
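To give a rough idea of how Requests and pandas fit together, here is a minimal sketch, assuming the site exposes its filters as query-string parameters; the URL and parameter names are hypothetical placeholders, not the real site's API:

    # Hypothetical sketch: the endpoint and filter parameter names are placeholders.
    import io
    import itertools

    import pandas as pd
    import requests

    BASE_URL = "https://example.com/export"  # stand-in for the real download URL

    filter1_values = ["A", "B"]
    filter2_values = ["C", "D"]

    frames = []
    for f1, f2 in itertools.product(filter1_values, filter2_values):
        resp = requests.get(BASE_URL, params={"filter1": f1, "filter2": f2})
        resp.raise_for_status()
        # read_excel accepts a file-like object; reading .xlsx needs openpyxl installed
        frames.append(pd.read_excel(io.BytesIO(resp.content)))

    # "copy and paste together" becomes a single concat
    combined = pd.concat(frames, ignore_index=True)
    combined.to_excel("combined.xlsx", index=False)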
Sorry if I misunderstood in parts.

Related

First time web design, know Python already, any advice which direction to go? [closed]

I am a somewhat experienced Python programmer. I will quickly describe my situation. As a hobby, programming was always nice. I then started working at a company that did a lot of manual Excel processing. One day I mentioned that I could probably automate this with Python.
One thing led to another, and now Python does the Excel work multiple times a day, running from an Intel NUC I deployed as a small server. It has been some work figuring everything out, but the money has been good as well; no complaints.
They are quite happy with me and have lots of different plans.
They want me to design a website where the employees can fill out a form daily and the data can be used elsewhere. I've done some HTML and CSS programming in high school, but I know there needs to be a back end to at least save the data that gets filled in.
I don't know where to start. I know SQL is the dominant language in data processing and PHP in handling the back end, but I already know Python, which can also do back-end operations.
I have two direct questions, but I am also looking for advice on the whole situation. Feel free to point anything out; I will read every comment.
My questions:
Could I run the web server from my Intel NUC, or is this generally seen as bad practice? Also, is it true that I would only need a domain if I run the web server myself?
Is it worth it to learn SQL and PHP, or should I stick to Python?
I have tried looking online but found countless resources. I would like to create a large database with lots of data I can use at any time. I think SQL is good for this, but I am not looking to waste time.
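For reference, a minimal sketch of what a Python-only back end for that daily form could look like, using Flask and the standard-library sqlite3 module; the route and field names are made up for illustration:

    # Illustrative sketch only: Flask and the "entry" field name are assumptions,
    # not requirements of the setup described above.
    import sqlite3

    from flask import Flask, request

    app = Flask(__name__)
    DB = "entries.db"

    with sqlite3.connect(DB) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS entries (filled_at TEXT, entry TEXT)")

    @app.route("/submit", methods=["POST"])
    def submit():
        # each form submission becomes one row in SQLite
        with sqlite3.connect(DB) as conn:
            conn.execute(
                "INSERT INTO entries VALUES (datetime('now'), ?)",
                (request.form["entry"],),
            )
        return "saved", 201

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)  # e.g. served from the NUC on the LAN

Note that SQL still appears here, but only as the query language written inside Python; no PHP is required for this kind of setup.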

How can I implement a word to PDF conversion in python without importing any libraries? [closed]

First-time poster here. I'm trying to convert one or multiple .docx files to PDF, but I can't figure out how to do it without importing any libraries/modules aside from what is available in Python 3.3.
I've read through the package documentation, but nothing stood out as a solution. I also don't know exactly what I am looking for, as I am pretty new to Python. I found plenty of articles and resources that mention how to do it with an imported library, but not without one.
Is it possible to accomplish this without importing a library?
Any advice/resources are welcome.
Code it from scratch. If you're not going to use an external library, that is by definition pretty much your only option.
You'll want to become an expert in the formal specifications for both PDF and MS Word. Given the complexity and history of each of those, I expect a senior developer would want 6-12 months of experience with each to obtain the necessary understanding.
You should also have 6-12 months' experience with Python, since you'll likely need to be familiar with the language in order to define and use all the functions you'll need. But in just a few years of dedication, you should be able to write the necessary code.
MORE REALISTICALLY, import Python libraries for managing PDFs and MS Word. That should only take a week or two.
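For example, a sketch using the third-party docx2pdf package, which is one of several such libraries (not the only choice); it works by driving an installed copy of Microsoft Word, so it is Windows/macOS only:

    # Sketch using the third-party docx2pdf package (requires MS Word installed;
    # Windows/macOS only). One of several possible libraries.
    from docx2pdf import convert

    convert("report.docx", "report.pdf")  # convert a single file
    convert("docs/")                      # convert every .docx in a folder

On Linux, a common near-library-free alternative is shelling out to LibreOffice via subprocess (soffice --headless --convert-to pdf report.docx), though that still depends on an external program rather than pure Python.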

Using pandas within web applications - good or bad? [closed]

Is it okay to use Python's pandas to manipulate tabular data within a Flask/Django web application?
My web app receives blocks of data which we visualise in chart form on the web. We then provide the user with some data-manipulation operations, like sorting the data or deleting a given column. We have our own custom code to perform these operations, and it would be much easier to do with pandas, but I'm not sure whether that is a good idea or not.
It's a good question. pandas could be used in a deployed environment if the dataset isn't too big; if the dataset is really big, I think you could use Spark DataFrames or RDDs, and if the data increases over time, you can think about streaming it with PySpark.
Actually, yes, but don't forget to move your computation into a separate process if it takes too long.
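To make that concrete, a minimal sketch of a Flask endpoint that does the sorting and column deletion with pandas; the route name and JSON payload shape ("rows", "by", "drop") are assumptions for illustration:

    # Hypothetical endpoint: the payload shape ("rows", "by", "drop") is made up.
    import pandas as pd
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/transform", methods=["POST"])
    def transform():
        payload = request.get_json()
        df = pd.DataFrame(payload["rows"])
        df = df.sort_values(payload["by"])             # user-chosen sort column
        df = df.drop(columns=payload.get("drop", []))  # optional columns to delete
        return jsonify(df.to_dict(orient="records"))

    if __name__ == "__main__":
        app.run()

If a transformation ever takes long enough to block the request, hand it off to a worker process (e.g. multiprocessing or a task queue), as the second answer suggests.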

extracting data from several xml-files with python [closed]

I just started learning Python for my new job, so everything is quite difficult for me, even if the task sounds pretty straightforward.
I would like to extract several nodes from multiple XML files, ideally putting the information into an Excel file in the end. Every row should contain the information from one XML file, and the columns should represent the specific nodes I am looking for, like "Zip-code" and "town". Not all XML files contain all nodes, so it would be perfect if, when a "Zip-code" node doesn't exist, the cell is just left blank.
Could someone please point out a few hints on how to start with this, or, if one exists, a special program that is easy to learn and use? My company and I only need to do this once, for about 2,000 files.
Thank you very much =)
For opening the files and getting their contents, you can use Python's built-in functions; see the documentation.
For XML parsing, I always use Beautiful Soup. It's an HTML/XML parser with good documentation that mostly "just works".
For creating the Excel file, you can use XlsxWriter.
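Putting those pieces together, here is a minimal sketch, assuming tag names like those in the question ("Zip-code", "town"); the folder name is a placeholder, and Beautiful Soup's "xml" mode requires lxml to be installed:

    # Sketch only: the folder and tag names mirror the question and are illustrative.
    from pathlib import Path

    import pandas as pd
    from bs4 import BeautifulSoup  # BeautifulSoup(..., "xml") needs lxml installed

    WANTED_TAGS = ("Zip-code", "town")

    rows = []
    for path in Path("xml_files").glob("*.xml"):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "xml")
        row = {}
        for tag in WANTED_TAGS:
            node = soup.find(tag)
            row[tag] = node.get_text() if node else ""  # blank cell if node is missing
        rows.append(row)

    # one row per file, one column per node; written via the XlsxWriter engine
    pd.DataFrame(rows).to_excel("output.xlsx", index=False, engine="xlsxwriter")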

How do large static sites make their content effectively searchable? [closed]

One of the most popular tools for generating static sites is Sphinx, which is largely used in the Python community to document code. It converts .rst files into other formats like HTML, PDF, and others. But how is it possible that static documentation consisting of plain HTML files is searchable without losing performance?
I guess it's done by creating an index (a JSON file, for example) that is loaded via AJAX and interpreted by something like lunr.js. Many major projects in the Python world have huge documentation (like the Python docs themselves). So how is it possible to create such a good search without creating a gigantic index file that needs to be loaded?
You can use Google Custom Search to bring Google's power to your site. It is difficult to customize, yet powerful. There are further references in this related question.
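For the prebuilt-index approach the question guesses at, the build step can be as simple as walking the generated HTML and dumping the text into a JSON file for a client-side library such as lunr.js to load. A rough sketch, where the paths and field names are illustrative:

    # Rough sketch of building a client-side search index; paths are placeholders.
    import json
    from pathlib import Path

    from bs4 import BeautifulSoup

    docs = []
    for page in Path("_build/html").rglob("*.html"):
        soup = BeautifulSoup(page.read_text(encoding="utf-8"), "html.parser")
        docs.append({
            "url": page.as_posix(),
            "title": soup.title.get_text(strip=True) if soup.title else page.name,
            "body": soup.get_text(" ", strip=True),  # full text; real tools store less
        })

    Path("search_index.json").write_text(json.dumps(docs), encoding="utf-8")

In practice, generators keep this file small by storing only stemmed terms mapped to document IDs instead of full page text; that is essentially what Sphinx's searchindex.js does.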
