Each morning we receive 15-20 separate emails from different sites with data attached in Excel format.
One person then cleans the data, collates it and inputs it into a single spreadsheet.
This is a daily and very time-consuming task.
Is there a way to automate this process using Python/SQL?
It depends on how the Excel files are formatted. Are they all the same, or does an actual transformation need to happen to get them into a common format? Are they real .xls(x) files, or rather .csv?
Excel itself should have enough tools to transform the data into the desired format in an automated way, at least if the actions are the same every time.
From what I understand of your question, you don't actually need the data in a database, just combined into a new file? Excel has the option to import data from several different formats under the "Data" menu option.
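If you do go the Python route, something like the following pandas sketch could replace the manual collation (assuming the attachments are saved into one folder and all share the same column layout; the file and folder names here are just placeholders):

import glob
import pandas as pd

# Placeholder folder: wherever the daily attachments get saved
files = glob.glob("incoming/*.xlsx")

# Read each workbook and stack them into one table (same columns assumed)
frames = [pd.read_excel(path) for path in files]
combined = pd.concat(frames, ignore_index=True)

# Write the collated result to a single spreadsheet
combined.to_excel("daily_combined.xlsx", index=False)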
I am using Python to scrape, store and plot the data on an odds website for later reference. Initially I store the data in numerous .csv files (every X minutes), which I then aggregate into larger JSON files (one per day) for easier access.
The problem is that with the increasing number of events per day (>600), the speed at which the JSON files are manipulated has become unacceptable (~35 s just to load a single 95 MB JSON file).
What would be a more efficient set-up in terms of speed? Maybe using SQL alongside Python?
Maybe try another JSON library like orjson instead of the standard one.
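For example, if the per-day file were called day.json (the name is just a placeholder), switching to orjson is roughly a drop-in change; note that it works with bytes rather than str:

import orjson

# orjson reads from and writes to bytes, so open the files in binary mode
with open("day.json", "rb") as f:
    data = orjson.loads(f.read())

# ... manipulate data ...

with open("day.json", "wb") as f:
    f.write(orjson.dumps(data))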
I want to build an application that runs locally, supports real-time data processing, and is built using Python.
The input is provided in real time in the form of Google Spreadsheets (multiple users are entering their data at the same time).
The code also needs to write its real-time output back to the spreadsheet, in the adjacent column.
Please help me with this.
Thanks
You can use the spark-google-spreadsheets library to read and write to Google Sheets from Spark, as described here.
Here's an example of how you can read data from a Google Sheet into a DataFrame:
val df = sqlContext.read.
  format("com.github.potix2.spark.google.spreadsheets").
  load("<spreadsheetId>/worksheet1")
Incremental updates will be tough. You might want to just try doing full refreshes.
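If Spark is more than you need, plain Python with the gspread library is another option. A rough sketch, assuming a Google service account has been set up and the sheet is shared with it; process() stands in for your own logic:

import gspread

def process(value):
    # Placeholder for your real processing logic
    return value.upper()

# Assumes a service-account JSON key; the sheet must be shared with that account
gc = gspread.service_account(filename="service_account.json")
ws = gc.open_by_key("<spreadsheetId>").worksheet("worksheet1")

rows = ws.get_all_values()                    # read what users have entered
for i, row in enumerate(rows[1:], start=2):   # skip the header row
    result = process(row[0])
    ws.update_cell(i, 2, result)              # write into the adjacent column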
I am building a tool which displays and compares data for a non-specialist audience. I have to automate the whole procedure as much as possible.
I am extracting selected data from several large data sets, processing it into a useful format and then displaying it in a variety of ways. The problem I foresee is in updating the model.
I don't really want the user to have to do anything more than download the relevant files from the relevant database, rename them and save them to a set location; the spreadsheet should do the rest. The user will then be able to look at the data in a variety of ways and perform a few different analytical functions, depending on what they are looking at, and output some graphs etc.
Although some database export files won't be that large, other data will be pulled from very large XML or CSV files (500,000 x 50 cells), and several arrays work on the pulled data once it has been cut down to the minimum possible. So it will be necessary to open and update several files in order, so that the data in the user control panel is up to date, and not all at once, so that the user's machine doesn't freeze.
At the moment I am building all of this using only Excel formulas.
My question is how best to do the updating and feeding part. Perhaps some kind of controller program built with Python? I don't know Python, but I have other reasons to learn it.
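To make that concrete, this is roughly what I imagine such a controller doing (only a sketch, since I don't know Python yet; the file names are placeholders and I'm assuming a library like pandas would do the heavy lifting):

import pandas as pd

# Placeholder names: the raw exports the user downloads and renames
RAW_FILES = ["export_a.csv", "export_b.csv"]

def refresh(output_path="model_data.xlsx"):
    frames = []
    for path in RAW_FILES:
        # Read each export one at a time so the machine doesn't freeze
        frames.append(pd.read_csv(path))
    combined = pd.concat(frames, ignore_index=True)
    # Write the combined data to the workbook the Excel front end reads from
    combined.to_excel(output_path, index=False)

if __name__ == "__main__":
    refresh()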
Any advice would be very welcome.
Thanks
I am working on a personal project (using Python 3) that will retrieve weather information for any city in the United States. My program prompts the user to enter as many city-state combinations as they wish, and then it retrieves the weather information and creates a weather summary for each city entered. Behind the scenes, I'm essentially taking the State entered by the user, opening a .txt file corresponding to that State, and then getting a weather code that is associated with the city entered, which I then use in a URL request to find weather information for the city. Since I have a .txt file for every state, I have 50 .txt files, each with a large number of city-weather code combinations.
Would it be faster to keep my algorithm the way that it currently is, or would it be faster to keep all of this data in a dictionary? This is how I was thinking about storing the data in a dictionary:
info = {'Virginia': {'City1': 'ID1', 'City2': 'ID2'}, 'North Carolina': {'City3': 'ID3'}}
I'd be happy to provide some of my code or elaborate if necessary.
Thanks!
If you have a large data file, you will spend a long time sifting through it and typing the values into the .py file. For a small file I would use a dictionary; for a large one, stick with the .txt files.
Other possible solutions are:
sqlite
pickle
shelve
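For example, shelve keeps the dictionary-style lookups but only loads what you ask for. A minimal sketch, using the example keys and values from the question:

import shelve

# One-off build step: move the parsed state -> {city: code} mappings into a shelf
with shelve.open("weather_codes") as db:
    db["Virginia"] = {"City1": "ID1", "City2": "ID2"}
    db["North Carolina"] = {"City3": "ID3"}

# At request time, only the requested state gets unpickled
with shelve.open("weather_codes") as db:
    code = db["Virginia"]["City1"]
    print(code)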
Other Resources
Basic data storage with Python
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/pickle.html
https://docs.python.org/3/library/shelve.html
It almost certainly would be much faster to preload the data from the files if you're using the same Python process for many user requests. If the process handles just one request and exits, this approach would be slower and use more memory. For some number of requests between "one" and "many", the two would be about equal in speed.
For a situation like this I would probably use sqlite, for which Python has built-in support. It would be much faster than scanning text files, without the time and memory overhead of loading the full dictionary.
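A minimal sqlite3 sketch (the table and file names are just illustrative; you'd load the rows once by parsing your existing .txt files):

import sqlite3

conn = sqlite3.connect("weather_codes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS codes ("
    "state TEXT, city TEXT, code TEXT, PRIMARY KEY (state, city))"
)

# One-off load, e.g. parsed from the existing .txt files (values here are illustrative)
conn.execute("INSERT OR REPLACE INTO codes VALUES (?, ?, ?)",
             ("Virginia", "City1", "ID1"))
conn.commit()

# Indexed lookup per user request
row = conn.execute("SELECT code FROM codes WHERE state = ? AND city = ?",
                   ("Virginia", "City1")).fetchone()
print(row[0] if row else "not found")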
It is probably not a very good idea to keep a large number of text files, because access slows down with large or numerous directories. If you have large data records, you might instead choose an intermediate solution: index a single data file and load just the index into a dictionary.
I have a very large (>2 million rows) CSV file that is generated and viewed in an internal web service. The problem is that when users of this system want to export the CSV to run custom queries, they open these files in Excel. Excel formats the numbers as best it can, but there are requests to have the data in xlsx format with filters and so on.
The question boils down to: using Python 2.7, how can I read a large CSV file (>2 million rows) into Excel (or multiple Excel files) and control the formatting (dates, numbers, autofilters, etc.)?
I am open to Python and internal Excel solutions.
Without more information about the data types in the CSV, or your exact issue with Excel handling those types, it's hard to give an exact answer.
However, I'd recommend looking at the XlsxWriter module (https://xlsxwriter.readthedocs.org/), which can be used from Python to create xlsx files. I haven't used it, but it seems to have more than enough features for what you need, especially if you need to split the data between multiple files or workbooks.
It looks like you can pre-create the filters and have full control over the formatting. Note that a single xlsx worksheet tops out at 1,048,576 rows, so >2 million rows will have to be split across worksheets or workbooks anyway.
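For what it's worth, a rough XlsxWriter sketch (untested, since I haven't used the module myself; the file names, the formatted column and the strings_to_numbers option are assumptions you'd adjust to your data):

import csv
import xlsxwriter

MAX_ROWS = 1048576  # hard per-worksheet limit of the xlsx format

# constant_memory keeps memory flat on huge files;
# strings_to_numbers converts numeric-looking csv strings so number formats apply
workbook = xlsxwriter.Workbook("export.xlsx",
                               {"constant_memory": True,
                                "strings_to_numbers": True})
num_fmt = workbook.add_format({"num_format": "#,##0.00"})

def new_sheet():
    ws = workbook.add_worksheet()
    ws.set_column(1, 1, 15, num_fmt)  # e.g. treat column B as a number column
    return ws

worksheet = new_sheet()
out_row = 0
last_col = 0

with open("big.csv", "rb") as f:  # 'rb' for the Python 2.7 csv module
    for row in csv.reader(f):
        if out_row == MAX_ROWS:   # spill onto a fresh worksheet when full
            worksheet.autofilter(0, 0, out_row - 1, last_col)
            worksheet = new_sheet()
            out_row = 0
        for col, value in enumerate(row):
            worksheet.write(out_row, col, value)
        last_col = max(last_col, len(row) - 1)
        out_row += 1

worksheet.autofilter(0, 0, out_row - 1, last_col)
workbook.close()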