How to Perform Spark Streaming with Google Spreadsheets? - python

I want to build an application that runs locally, supports real-time data processing, and is built using Python.
The input needs to be provided in real time, in the form of Google Spreadsheets (multiple users are entering their data at the same time).
The code also needs to write its real-time output back to the spreadsheet, in the adjacent column.
Please help me with this.
Thanks

You can use the spark-google-spreadsheets library to read and write to Google Sheets from Spark, as described here.
Here's an example (in Scala) of how you can read data from a Google Sheet into a DataFrame:
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
load("<spreadsheetId>/worksheet1")
Incremental updates will be tough. You might want to just try doing full refreshes.
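If you want to stay in plain Python rather than Spark, a minimal polling sketch using the gspread library is below. It is not true streaming, and it assumes a service-account JSON key (the filename is hypothetical), a sheet shared with that account, inputs in column A, and outputs written to the adjacent column B:
import time
import gspread

gc = gspread.service_account(filename="service-account.json")  # hypothetical key file
ws = gc.open_by_key("<spreadsheetId>").worksheet("worksheet1")

def process(value):
    # stand-in for your real-time computation
    return value.upper()

while True:
    inputs = ws.col_values(1)    # column A: user-provided data
    outputs = ws.col_values(2)   # column B: results already written
    # process only the rows that don't have an output yet
    for row, value in enumerate(inputs[len(outputs):], start=len(outputs) + 1):
        ws.update_cell(row, 2, process(value))
    time.sleep(10)  # poll every 10 seconds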

Related

Google Sheets IMPORTRANGE not working dynamically when worksheet is Programmatically updated via Python

I am using Python and gspread to upload local .csv data to a Google SpreadsheetA.
I have a separate Google SpreadsheetB that uses =IMPORTRANGE to import the data from SpreadsheetA and create a pivot table and corresponding chart (both located on SpreadsheetB).
If I were to manually adjust any data in SpreadsheetA (e.g., alter the value of any cell, add a value to an empty cell, etc.), then the data in SpreadsheetB, with its corresponding pivot table and chart, updates dynamically with the new data from SpreadsheetA.
However, when SpreadsheetA is updated with new data programmatically via Python, IMPORTRANGE in SpreadsheetB does not capture the new data.
Any ideas as to why this happens and how I might be able to fix it?
Both Sheet A and B show the same number of rows. I am a bit confused by your IMPORTRANGE() formula though; why the ampersand?
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/16DyWC8rsQB1ThpLiQh0p5xH9CYK2cPqbPH547ybw2Fo/edit#gid=1875728384",""&"TestgAPI!A:J")
I changed it to this:
=IMPORTRANGE("https://docs.google.com/spreadsheets/d/16DyWC8rsQB1ThpLiQh0p5xH9CYK2cPqbPH547ybw2Fo/edit#gid=1875728384","TestgAPI!A:J")
Although probably not ideal, my solution was to use gspread to add a new worksheet to SpreadsheetA, which somehow manages to kickstart IMPORTRANGE() in SpreadsheetB.
I would still love to see a cleaner solution, if anyone knows of one, but this has continued to work since I implemented it a week ago.
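For reference, the workaround amounts to a couple of gspread calls; a hedged sketch, assuming a service-account key file (the filename and spreadsheet ID are placeholders):
import gspread

gc = gspread.service_account(filename="service-account.json")  # hypothetical key file
sh = gc.open_by_key("<spreadsheetA-id>")

# adding a worksheet seems to nudge IMPORTRANGE in SpreadsheetB to recalculate
tmp = sh.add_worksheet(title="refresh-trigger", rows=1, cols=1)
sh.del_worksheet(tmp)  # optional cleanup once the refresh has kicked in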

How to automate data in email/excel to SQL?

Each morning we receive 15-20 separate emails from different sites with data attached in excel format.
One person then cleans the data, collates it, and inputs it into a single spreadsheet.
This is a daily and very time-consuming task.
Is there a way to automate this process using python/sql?
It depends on how the Excel files are formatted. Are they all the same, or does actual transformation need to happen to get them into a common format? Are they actual .xls(x) files or rather .csv?
Excel itself should have enough tools to transform the data to the desired format in an automated way, at least if the actions are the same all the time.
From what I understand of your question, you don't actually need the data in a database; you just want to combine the files into a new one? Excel has the option to import data from several different formats under the "Data" menu option.
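If the attachments really are uniformly formatted, a short pandas script can do the collating; a minimal sketch, assuming .xlsx attachments saved into a local folder (the folder and file names are hypothetical):
from pathlib import Path
import pandas as pd

# read every attachment in the folder and stack them into one table
frames = [pd.read_excel(p) for p in Path("attachments").glob("*.xlsx")]
combined = pd.concat(frames, ignore_index=True)

# write the collated data to a single spreadsheet
combined.to_excel("combined.xlsx", index=False)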

Python-based PDF parser integrated with Zapier

I am working for a company which is currently storing PDF files on a remote drive and subsequently manually inserting values found within these files into an Excel document. I would like to automate the process using Zapier, and make it scalable (we receive a large number of PDF files). Would anyone know of any useful, and possibly free, applications for converting PDFs into Excel docs that integrate with Zapier? Alternatively, would it be possible to create a Python script in Zapier to access the information and store it into an Excel file?
This option came to mind. I'm using Google Drive as an example; you didn't say what you were using as storage, but Zapier should have an option for it.
Use CloudConvert or a doc parser (depends on what you want to pay; CloudConvert at least gives you some free conversion time per month, so that may be the closest you can get).
Create a zap with these steps:
Trigger on new file in drive (Name: Convert new Google Drive files with CloudConvert)
Convert file with CloudConvert
Those are two options by Zapier that I can find. But you could also do it in Python from your desktop by following something like this idea. Then set up an event controller in the Windows event manager to trigger an upload/download.
Unfortunately it doesn't seem that you can import JS/Python libraries into Zapier, though I may be wrong on that. If you can, or find a way to do so, then just use PDFMiner and "Code by Zapier". A technician might have to confirm this though; I've never gotten libraries to work in zaps.
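If you do go the desktop-Python route, here is a minimal sketch using pdfminer.six and openpyxl; the folder and file names are hypothetical, and real PDFs will likely need more targeted parsing than raw text extraction:
from pathlib import Path

from pdfminer.high_level import extract_text
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["filename", "extracted_text"])  # header row

# pull the raw text out of every PDF in the folder
for pdf in Path("pdfs").glob("*.pdf"):
    ws.append([pdf.name, extract_text(str(pdf))])

wb.save("extracted.xlsx")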
Hope that helps!

Best way to link excel workbooks feeding into each other for memory efficiency. Python?

I am building a tool which displays and compares data for a non-specialist audience. I have to automate the whole procedure as much as possible.
I am extracting select data from several large data sets, processing it into a useful format, and then displaying it in a variety of ways. The problem I foresee is in the updating of the model.
I don't really want the user to have to do anything more than download the relevant files from the relevant database, rename them, and save them to a location; the spreadsheet should do the rest. Then the user will be able to look at the data in a variety of ways, perform a few different analytical functions depending on what they are looking at, output some graphs, etc.
Although some database export files won't be that large, other data will be pulled from very large XML or CSV files (500,000 x 50 cells), and there are several arrays working on the pulled data once it has been chopped down to the minimum possible. So it will be necessary to open and update several files in order, so that the data in the user control panel is up to date, and not all at once, so that the user's machine doesn't freeze.
At the moment I am building all of this just using excel formulas.
My question is how best to do the updating and feeding bit. Perhaps some kind of controller program built with Python? I don't know Python, but I have other reasons to learn it. Something along the lines of the sketch below is roughly what I imagine.
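For a sense of what such a controller could look like, here is a minimal pandas sketch; the file names and columns are purely hypothetical, and chunked reading is one way to trim huge CSVs without freezing the machine:
import pandas as pd

KEEP = ["site", "date", "value"]  # hypothetical columns of interest

# read the big export in manageable chunks, keeping only the needed columns
chunks = pd.read_csv("big_export.csv", usecols=KEEP, chunksize=100_000)
reduced = pd.concat(chunk[chunk["value"].notna()] for chunk in chunks)

# write the chopped-down data to the workbook the Excel front end reads from
reduced.to_excel("model_input.xlsx", index=False)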
Any advice would be very welcome.
Thanks

Google Cloud Dataflow (Python): function to join multiple files

I am new to Google Cloud and know enough Python to write a few scripts; I am currently learning Cloud Functions and BigQuery.
My question:
I need to join a large CSV file with multiple lookup files and replace values from the lookup files.
I've learnt that Dataflow can be used to do ETL, but I don't know how to write the code in Python.
Can you please share your insights?
Appreciate your help.
Rather than joining the data in Python, I suggest you separately extract and load the CSV and lookup data. Then run a BigQuery query that joins the data and writes the result to a permanent table. You can then delete the separately imported data.
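A minimal sketch of that approach with the google-cloud-bigquery client; the dataset, table names, and join key are all hypothetical:
from google.cloud import bigquery

client = bigquery.Client()

# load the main CSV and the lookup CSV into staging tables
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
for path, table in [("main.csv", "my_dataset.main_staging"),
                    ("lookup.csv", "my_dataset.lookup_staging")]:
    with open(path, "rb") as f:
        client.load_table_from_file(f, table, job_config=load_config).result()

# join in BigQuery and write the result to a permanent table
client.query("""
    CREATE OR REPLACE TABLE my_dataset.joined AS
    SELECT m.*, l.replacement_value
    FROM my_dataset.main_staging AS m
    LEFT JOIN my_dataset.lookup_staging AS l ON m.code = l.code
""").result()

# drop the staging tables once the join has run
client.delete_table("my_dataset.main_staging")
client.delete_table("my_dataset.lookup_staging")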