Processing data with different schema - python

I am using Azure Databricks's AutoLoader.
I have blob storage with lots of JSON files. These use a few dozen different schemas.
The current solution is based on inferring the schema and saving the data into Delta Tables. However, as the number of tables and JSON schemas increases, it becomes hard to control. If any error occurs, the whole import process stops.
I am thinking about creating a separate Auto Loader per schema, but I am struggling to find any article that confirms this is a valid approach.
Please let me know what your thoughts are: is having 30-40 writeStreams accessing a single blob storage a valid approach?
I am just starting out in data analysis, so I would appreciate even the most obvious suggestions.
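Roughly what I have in mind is one stream per schema, something like the sketch below. The paths, schema names, and target table names are placeholders, and it assumes the files can be routed (or globbed) into per-schema locations:

```python
# Illustrative sketch only: one Auto Loader stream per schema.
# Paths, schema names and checkpoint locations are made-up placeholders.
schemas = ["orders", "customers", "invoices"]  # ... up to 30-40 entries

for name in schemas:
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"/mnt/checkpoints/{name}/schema")
        .load(f"/mnt/raw/{name}/")            # assumes files are grouped by schema
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"/mnt/checkpoints/{name}/stream")
        .option("mergeSchema", "true")        # tolerate additive schema changes
        .trigger(availableNow=True)           # run as a batch-style refresh
        .toTable(f"bronze.{name}"))
```

With separate checkpoints, an error in one schema's stream would not stop the others, which is my main motivation.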

Related

Extracting only a few columns from a FITS file that is freely available online to download using Python

I'm working on a model of the universe, for which I'm using data available on the Sloan Digital Sky Survey site. The problem is that some files are more than 4 GB large (more than 50 GB in total), and while I know those files contain a lot of data columns, I only want data from a few of them. I have heard about web scraping, so I searched for how to do it, but that didn't help, as all the tutorials explained how to download the whole file using Python. Is there any way I can extract only a few columns from such a file, so that I only have the data I need and don't have to download the whole large file just for a small fraction of its data?
Sorry, my question is just words and no code, because I'm not that good at Python. I just searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest some more ways to reduce the amount of data I'll have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that have coordinates (ra, dec), distance, velocity and redshift from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and the server side.
First of all, the way FITS tables are stored in binary is row-oriented, meaning that if you wanted to stream a portion of a FITS table, you could read it one row at a time. But to read individual columns, you would need to make a partial read of each and every row in the table. Some web servers support what are called "range requests", meaning you can request only a few ranges of bytes from a file instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented, this could be feasible: you could download just the header of the file to determine the byte ranges of the columns, and then download just those ranges.
Unfortunately, since FITS tables are row-oriented, if you wanted to load, say, 3 columns from a table containing a million rows, that would involve 3 million range requests, which would likely carry enough overhead that you wouldn't gain anything from it (and I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme).
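For illustration, this is roughly what a byte-range request looks like with the requests library. Whether the SDSS server honours the Range header is not guaranteed (and, as noted further down, the files there are gzip-compressed, which defeats this anyway); the file name and byte range are placeholders:

```python
import requests

# Hypothetical example: ask the server for only the first 2880 bytes
# (one FITS block, i.e. roughly the primary header) instead of the whole file.
url = "https://data.sdss.org/sas/dr12/boss/lss/some_file.fits"  # placeholder name
resp = requests.get(url, headers={"Range": "bytes=0-2879"})

if resp.status_code == 206:   # 206 Partial Content: range requests supported
    print("Got a partial response of", len(resp.content), "bytes")
else:                         # 200 means the server ignored the Range header
    print("Server does not support range requests; status", resp.status_code)
```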
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, which makes this especially difficult, since you can't know what byte ranges to request without decompressing the files first.
As an aside, since you asked, you will have the same problem with CSV, only worse, since CSV fields are not typically fixed-width, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS, maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. Whether such a thing already exists I don't know, but I don't think it does in a very general sense. So this would a) have to be developed, and b) require you to ask whoever hosts the files you want to access to run such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
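As a hedged sketch of the "download, then extract" route using astropy (the file name and column names here are guesses; inspect the actual header to find the real HDU index and column names):

```python
from astropy.io import fits
from astropy.table import Table

# File and column names are placeholders; inspect the header first
# (e.g. with fits.info(path)) to find the real HDU index and column names.
path = "galaxy_DR12v5_CMASS_North.fits.gz"   # downloaded locally beforehand

with fits.open(path) as hdul:
    data = hdul[1].data                      # binary table usually lives in HDU 1
    subset = Table({
        "ra": data["RA"],
        "dec": data["DEC"],
        "z": data["Z"],                      # redshift column, name may differ
    })

# Save only the columns you need, then delete the original large file.
subset.write("subset.fits", overwrite=True)
```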

Best way to link Excel workbooks feeding into each other for memory efficiency. Python?

I am building a tool which displays and compares data for a non-specialist audience. I have to automate the whole procedure as much as possible.
I am extracting select data from several large data sets, processing it into a format that is useful and then displaying it in a variety of ways. The problem I foresee is in the updating of the model.
I don't really want the user to have to do anything more than download the relevant files from the relevant database, rename them, and save them to a location; the spreadsheet should do the rest. Then the user will be able to look at the data in a variety of ways, perform a few different analytical functions depending on what they are looking at, output some graphs, etc.
Although some database-exported files won't be that large, other data will be pulled from very large XML or CSV files (500,000 x 50 cells), and there are several arrays working on the pulled data once it has been chopped down to the minimum possible. So it will be necessary to open and update several files in order, so that the data in the user control panel is up to date, and not all at once, so that the user's machine doesn't freeze.
At the moment I am building all of this just using excel formulas.
My question is how best to do the updating and feeding bit. Perhaps some kind of controller program built with Python? I don't know Python, but I have other reasons to learn it.
Any advice would be very welcome.
Thanks
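For what it's worth, a Python "controller" step that chops a big CSV down before Excel ever touches it could look roughly like the sketch below. The file names, column list, and filter are placeholders, and it assumes the openpyxl package is available for writing the workbook:

```python
import pandas as pd

# Placeholders: adjust the source file, the columns you actually need,
# and any filtering to your own data.
source = "big_export.csv"
wanted_cols = ["date", "site", "value"]

chunks = []
for chunk in pd.read_csv(source, usecols=wanted_cols, chunksize=100_000):
    chunks.append(chunk[chunk["value"].notna()])   # example filter applied per chunk

trimmed = pd.concat(chunks, ignore_index=True)

# Write the much smaller result to the workbook the Excel front end reads from.
trimmed.to_excel("model_input.xlsx", sheet_name="data", index=False)
```

Reading in chunks keeps memory flat even for very large files, so the user's machine shouldn't freeze during the update.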

Exporting BigQuery data for analysis using python

I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms that I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run in a sequential manner. Each script modifies some columns of the data, and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the BigQuery table to a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
Could somebody tell me which of these would be the better way to accomplish this (in terms of efficiency/cost), or suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is no single correct answer.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside the billing of storage and focusing on the cost of queries themselves (which is more relevant to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1 TB free quota per month, and from then on the cost is $5 per TB of processed data, with a minimum of 10 MB of data billed per query.
Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), which is maintained for approximately 24 hours, with some exceptions listed in the documentation (the cache is also best-effort, so earlier deletion may happen). Results returned from a cached table are not billed, as long as you run the exact same query, because billing is based on the number of bytes processed and reading from a cached table involves no processing. It would be worth having a look at this feature: given your sentence "Since there are multiple scripts that use subsets of the daily data", it may (just guessing here) suit your use case to perform a single query once and then retrieve the results multiple times from the cached version, without having to store them anywhere else.
Partitions: BigQuery offers partitioned tables, which are individual tables divided into smaller segments by date, which will make it easier to query data daily as you require.
Speed: BigQuery offers a real-time analytics platform, so you will be able to run fast queries to retrieve the information you need and apply some initial processing that you can later use in your custom Python algorithms.
So, in general, I would say there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without having to deal with huge expenses or delays in data retrieval. Again, this ultimately depends on your use case and the amount of data you store and need to process simultaneously, but in general terms I would just go with BigQuery on its own.
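As a minimal sketch of your second approach with the google-cloud-bigquery client (project, dataset, table, and column names below are placeholders, and credentials are assumed to be configured in the environment):

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are set up

# Placeholder dataset/table/column names: pull only the subset a given script needs.
query = """
    SELECT customer_id, visit_date, visits
    FROM `my_project.my_dataset.daily_visits`
    WHERE visit_date = CURRENT_DATE()
"""
df = client.query(query).to_dataframe()   # repeating the exact same query can hit the cache

# ... run your algorithm on df, modifying some columns ...
df["visits_normalised"] = df["visits"] / df["visits"].max()

# Write the modified data back to BigQuery once all scripts have run.
job = client.load_table_from_dataframe(
    df, "my_project.my_dataset.daily_visits_processed"
)
job.result()  # wait for the load job to finish
```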

Storing and querying a large amount of data

I have a large amount of data, around 50 GB worth, in a CSV which I want to analyse for ML purposes. It is, however, way too large to fit into memory in Python. I ideally want to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place: I realise I probably can't load it all in at once, so would I do it iteratively? If so, what things can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data while still being able to query and do feature engineering quickly? What I eventually feed into my algorithm should be handled in Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'll need to create a table with the same fields (and data types).
One of the fields might be unique per line and can be used later to find that line; that's your candidate for a PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
In order to read in the data, you have two ways:
Use LOAD DATA INFILE (see the Load Data Infile documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables, and execute the prepared statement with that line's values. A rough sketch follows below.
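Here is roughly what that script could look like with the mysql-connector-python package; the connection details, table name, and column names are placeholders:

```python
import csv
import mysql.connector  # assumes the mysql-connector-python package is installed

# Connection details and table/column names are placeholders.
conn = mysql.connector.connect(
    host="localhost", user="ml_user", password="...", database="ml_data"
)
cur = conn.cursor()

insert_sql = "INSERT INTO measurements (col_a, col_b, col_c) VALUES (%s, %s, %s)"

with open("big_file.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row, if there is one
    batch = []
    for row in reader:
        batch.append(row[:3])         # keep only the columns the table expects
        if len(batch) >= 10_000:      # insert in batches to keep memory flat
            cur.executemany(insert_sql, batch)
            conn.commit()
            batch.clear()
    if batch:
        cur.executemany(insert_sql, batch)
        conn.commit()

conn.close()
```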
You will benefit from a web page designed to search the data. Depends on who needs to use it.
Hope this gives you some ideas
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL gives you the ability to write SQL queries against your dataset, but for best performance you need a distributed mode (you can use it on a local machine, but the results are limited) and a powerful machine. You can use Python, Scala, or Java to write your code.
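For example, a local-mode sketch with PySpark might look like this (file, view, and column names are placeholders):

```python
from pyspark.sql import SparkSession

# Local-mode sketch; paths and column names are placeholders.
spark = SparkSession.builder.master("local[*]").appName("csv-sql").getOrCreate()

df = spark.read.csv("big_file.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("measurements")

# Feature engineering with plain SQL before handing a smaller result to Python/pandas.
features = spark.sql("""
    SELECT col_a,
           AVG(col_b) AS avg_b,
           MAX(col_c) AS max_c
    FROM measurements
    GROUP BY col_a
""")
small_pdf = features.toPandas()   # only the aggregated result comes back to Python
```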

Something wrong with using CSV as a database for a webapp?

I am using Flask to make a small webapp to manage a group project; in this website I need to manage attendance and also meeting reports. I don't have the time to get into SQLAlchemy, so I need to know what the downsides of using CSV as a database might be.
Just don't do it.
The problem with CSV is …
a) Concurrency is not possible: what this means is that when two people access your app at the same time, there is no way to make sure they don't interfere with each other, making changes to each other's data. There is no way to solve this when using a CSV file as a backend.
b) Speed: whenever you make changes to a CSV file, you need to reload more or less the whole file. Parsing the file eats up both memory and time.
Databases were made to solve these issues.
I agree however, that you don't need to learn SQLAlchemy for a small app.
There are lightweight alternatives that you should consider.
What you are looking for is an ORM - an object-relational mapper - which translates Python code into SQL and manages the SQL database for you.
Two options are Peewee and Pony ORM. Both are easy to use and translate between Python and SQL in both directions. Both are free for personal use, but Pony costs money if you use it for commercial purposes. I highly recommend Peewee. You can start with SQLite as a backend with Peewee, and if your app grows larger, you can plug in MySQL or PostgreSQL easily.
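A minimal Peewee-with-SQLite sketch, with model fields made up for an attendance app like yours:

```python
from peewee import SqliteDatabase, Model, CharField, DateField, BooleanField

# SQLite file as the backend; the model fields here are illustrative only.
db = SqliteDatabase("project.db")

class Attendance(Model):
    member = CharField()
    meeting_date = DateField()
    present = BooleanField(default=False)

    class Meta:
        database = db

db.connect()
db.create_tables([Attendance])

# Usage: record and query attendance without writing any SQL by hand.
Attendance.create(member="Alice", meeting_date="2023-05-01", present=True)
absentees = Attendance.select().where(Attendance.present == False)
```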
Don't do it, CSV that is.
There are many other possibilities, for instance the sqlite database, python shelve, etc. The available options from the standard library are summarised here.
Given that your application is a webapp, you will need to consider the effect of concurrency on your solution to ensure data integrity. You could also consider a more powerful database such as postgres for which there are a number of python libraries.
I think there's nothing wrong with that, as long as you abstract away from it, i.e. make sure you have a clean separation between what you write and how you implement it. That will bloat your code a bit, but it will ensure you can swap out your CSV storage in a matter of days.
That is, pretend that you can persist your data as if you're keeping it in memory. Don't write "openCSVFile" in your Flask app; use initPersistence(). Don't write "csvFile.appendRecord()"; use "persister.saveNewReport()". When and if you actually find CSV to be a bottleneck, you can just write a new persister plugin.
There are added benefits, like not having to use a mock library in tests to make them faster; you just provide another persister.
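One way that separation could look (the class and method names are made up for illustration):

```python
import csv

class CsvPersister:
    """Illustrative CSV-backed persister; swap it for a database-backed one later."""

    def __init__(self, path):
        self.path = path

    def save_new_report(self, report):
        # Append one report as a CSV row.
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([report["date"], report["summary"]])

    def load_reports(self):
        # Read all reports back as dicts.
        with open(self.path, newline="") as f:
            return [{"date": d, "summary": s} for d, s in csv.reader(f)]

# The Flask views only ever talk to the persister, never to the CSV file directly.
persister = CsvPersister("reports.csv")
persister.save_new_report({"date": "2023-05-01", "summary": "Kick-off meeting"})
```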
I am absolutely baffled by how many people discourage using CSV as a database storage back-end format.
Concurrency: there is NO reason why CSV cannot be used with concurrency. Just as a database thread can write to one area of a binary file at the same time that another thread writes to another area of the same file, databases can do EXACTLY the same thing with CSV files. And just as a journal is used to maintain the atomic nature of individual transactions, the exact same thing can be done with CSV.
Speed: why on earth would a database read and write a WHOLE file at a time, when it can do what it does for ALL other database storage formats: look up the starting byte of a record in an index file, SEEK to it in constant time, overwrite the data, comment out anything left over, and record the free space for later use in a separate index file, just as it could zero out the bytes of any unneeded areas of a binary "row" and record the free space in a separate index file? I just do not understand this hostility to non-binary formats, when everything that can be done with one format can be done with the other... everything, except perhaps raw binary data compression, depending on the particular CSV syntax in use (special binary comments, etc.).
Emergency access: The added benefit of CSV is that when the database dies, which inevitably happens, you are left with a CSV file that can still be accessed quickly in the case of an emergency... which is the primary reason I do not EVER use binary storage for essential data that should be quickly accessible even when the database breaks due to incompetent programming.
Yes, the CSV file would have to be re-indexed every time you made changes to it in a spreadsheet program, but that is no different from having to re-index a binary database after the index/table gets corrupted, deleted, out of sync, etc.
