I am performing data analysis. I want to segment the steps of data analysis into different projects, as the analysis will be performed in the same order, but not usually all at the same time. There is so much code and data cleaning that keeping all of this in the same project may get confusing.
However, I have been keeping the header files that track the columns of the data consistent across projects. I may change the column order in these headers at some point and then want to rerun all of the code. I also want to make sure the same header is used everywhere so I don't erroneously analyze one piece of data instead of another. I use headers so that, if the column order ever changes, my code looks up each column's index from the header that matches the data, rather than my having to change every hard-coded column number throughout the code.
To accomplish this, I would like to version-track multiple projects that access the SAME header files, and to update the header files in one place rather than from each project individually.
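To make the header-based lookup concrete, here is a minimal sketch of the idea. The sample data and the column names (`subject_id`, `age`, `score`) are made up for illustration, and `column_index` is a hypothetical helper, not part of any existing project.

```python
import csv
import io

# Hypothetical sample data; the column names are illustrative only.
SAMPLE = "subject_id,age,score\n1,34,0.82\n2,29,0.67\n"


def column_index(header_row):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header_row)}


reader = csv.reader(io.StringIO(SAMPLE))
index = column_index(next(reader))  # built once from the header line

# Columns are addressed by name, so reordering them in the file
# does not require touching this line.
scores = [float(row[index["score"]]) for row in reader]
print(scores)
```

Because every access goes through `index`, only the header has to stay in sync with the data.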
Finally, I don't want to just store it somewhere on my computer and not track it, because I work from two different work stations.
Any good solutions or best practice for what I want to do? Have I made an error somewhere in project set-up? I am mostly self-taught and so have developed my own project organization and sequence of data analysis based on my own ideas and research as I go, so if I've done some terribly bad practice that would be great to know.
I've found a possible solution that uses independent branches in the same repo for tracking two separate projects, but I'm not convinced this is the best solution either.
Thanks!
Related
I'm working on a model of the universe using data available on the Sloan Digital Sky Survey site. The problem is that some files are larger than 4 GB (more than 50 GB in total). These files contain many data columns, but I only want data from a few of them. I have heard about web scraping, so I searched for how to do it, but that didn't help because all the tutorials explain how to download the whole file using Python. Is there any way to extract only a few columns from a file, so that I get just the data I need and don't have to download the whole large file for a small fraction of its data?
Sorry, my question is just words and no code because I'm not that experienced in Python. I searched online and learned how to do basic web scraping, but it didn't solve my problem.
It would be even more helpful if you could suggest other ways to reduce the amount of data I have to download.
Here is the URL to download FITS files: https://data.sdss.org/sas/dr12/boss/lss/
I only want to extract the columns that hold coordinates (ra, dec), distance, velocity, and redshift from the files.
Also, is there a way to do the same thing with CSV files or a general way to do it with any file?
I'm afraid what you're asking is generally not possible, at least not without significant effort and software support on both the client and the server side.
First of all, the way FITS tables are stored in binary is row-oriented meaning if you wanted to stream a portion of a FITS table you can read it one row at a time. But to read individual columns you need to make partial reads of each row for every single row in the table. Some web servers support what are called "range requests" meaning you can request only a few ranges of bytes from a file, instead of the whole file. The web server has to have this enabled, and not all servers do. If FITS tables were stored column-oriented this could be feasible, as you could download just the header of the file to determine the ranges of the columns, and then download just the ranges for those columns.
Unfortunately, since FITS tables are row-oriented, if you wanted to load, say, 3 columns from a table containing a million rows, that would involve 3 million range requests, which would likely add enough overhead that you wouldn't gain anything from it. (I'm honestly not sure what limits web servers place on how many ranges you can request in a single request, but I suspect most won't allow something so extreme.)
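To illustrate why a row-oriented layout needs one byte range per row, here is a small sketch of the arithmetic. The layout numbers (a single 2880-byte header block, 32-byte rows, an 8-byte column at offset 8) are hypothetical, and `column_byte_ranges` is an illustrative helper rather than part of any FITS library.

```python
def column_byte_ranges(data_start, row_width, col_offset, col_width, n_rows):
    """Return one inclusive (first_byte, last_byte) pair per row for one column."""
    ranges = []
    for row in range(n_rows):
        first = data_start + row * row_width + col_offset
        ranges.append((first, first + col_width - 1))
    return ranges


# Hypothetical layout: 2880-byte header, 32-byte rows, 8-byte column at offset 8.
ranges = column_byte_ranges(
    data_start=2880, row_width=32, col_offset=8, col_width=8, n_rows=3
)

# The value an HTTP Range header would need for just these three rows.
range_header = "bytes=" + ",".join(f"{first}-{last}" for first, last in ranges)
print(range_header)  # bytes=2888-2895,2920-2927,2952-2959
```

The ranges never merge because the other columns sit between them, so the number of ranges grows with the number of rows times the number of columns requested.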
There are other astronomy data formats (e.g. I think CASA Tables) that can store tables in a column-oriented format, and so are more feasible for this kind of use case.
Further, even if the HTTP limitations could be overcome, you would need software support for loading the file in this manner. This has been discussed to a limited extent here but for the reasons discussed above it would mostly be useful for a limited set of cases, such as loading one HDU at a time (not so helpful in your case if the entire table is in one HDU) or possibly some other specialized cases such as sections of tile-compressed images.
As mentioned elsewhere, Dask supports loading binary arrays from various cloud-based filesystems, but when it comes to streaming data from arbitrary HTTP servers it runs into similar limitations.
Worse still, I looked at the link you provided and all the files there are gzip-compressed, so it is especially difficult to deal with since you can't know what ranges of them to request without decompressing them first.
As an aside, since you asked, you will have the same problem with CSV, only worse since CSV fields are not typically in fixed-width format, so there is no way to know how to extract individual columns without downloading the whole file.
For FITS maybe it would be helpful to develop a web service capable of serving arbitrary extracts from larger FITS files. If such a thing already exists I don't know, but I don't think it exists in a very general sense. So this would a) have to be developed, and b) you would have to ask anyone hosting the files you want to access to host such a service.
Your best bet is to just download the whole file, extract the data you need from it, and delete the original file assuming you no longer need it. It's possible the information you need is also already accessible through some online database.
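As a rough sketch of the "download, extract, delete" workflow, the example below handles the CSV case using only the standard library: it streams a gzip-compressed CSV row by row and writes out just the wanted columns. The file names and column names (`ra`, `dec`, `z`, `flag`) are made up for the demo, and `extract_columns` is a hypothetical helper. For a FITS file the reading side would use a FITS library such as astropy instead; that part is not shown here.

```python
import csv
import gzip
import os
import tempfile


def extract_columns(src_gz, dst_csv, wanted):
    """Stream a gzip-compressed CSV and keep only the named columns."""
    with gzip.open(src_gz, "rt", newline="") as src, \
            open(dst_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column not listed in `wanted`.
        writer = csv.DictWriter(dst, fieldnames=wanted, extrasaction="ignore")
        writer.writeheader()
        for row in reader:  # only one row is held in memory at a time
            writer.writerow(row)


# Demo with a tiny stand-in for the large downloaded file.
workdir = tempfile.mkdtemp()
big = os.path.join(workdir, "catalog.csv.gz")
small = os.path.join(workdir, "catalog_small.csv")
with gzip.open(big, "wt", newline="") as f:
    f.write("ra,dec,z,flag\n10.5,-1.2,0.31,0\n11.0,2.4,0.57,1\n")

extract_columns(big, small, ["ra", "dec", "z"])
os.remove(big)  # delete the original once the extract exists

with open(small, newline="") as f:
    extracted = f.read()
print(extracted)
```

Processing one file at a time this way keeps the disk usage close to the size of the largest single download plus the extracts.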
I'm writing some code using python/numpy that I will be using for data analysis of some experimental data sets. Certain steps of these analysis routines can take a while. It is not practical to rerun every step of the analysis every time (e.g. when debugging), so it makes sense to save the output from these steps to a file and just reuse it if it's already available.
The data I ultimately want to obtain can be derived from various steps along this analysis process. For example, A can be used to calculate B and C. D can be calculated from B. E can then be calculated using C and D, and so on.
The catch here is that it's not uncommon to make it through a few (or many) datasets only to find that there's some tiny little gotcha in the code somewhere that requires some part of the tree to be recalculated. For example, I discover a bug in B, so now anything that depends on B also needs to be recalculated because it was derived from incorrect data.
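The propagation described here can be sketched with a small dependency table. The table below encodes the A to E example from the text; `needs_recalculation` is a hypothetical helper written for this illustration.

```python
# Each step lists the steps it is computed from (the A..E example above).
PARENTS = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C", "D"]}


def needs_recalculation(changed, parents):
    """Return every step that depends, directly or indirectly, on `changed`."""
    stale = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for step, deps in parents.items():
            if current in deps and step not in stale:
                stale.add(step)
                frontier.append(step)
    return stale


stale = needs_recalculation("B", PARENTS)
print(sorted(stale))  # ['D', 'E']
```

A bug in B therefore invalidates D and E but leaves A and C alone.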
The end goal here is to basically protect myself from having sets of data that I forget to reprocess when bugs are found. In other words, I want to be confident that all of my data is calculated using the newest code.
Is there a way to implement this in Python? I have no specific form this solution needs to take so long as it's extensible as I add new steps. I also am okay with the "recalculation step" only being performed when the dependent quantity is recalculated (Rather than at the time one of the parents are changed).
My first thought of how this might be done is to embed information in a header of each saved file (A, B, C, etc) indicating what version of each module it was created with. Then, when loading the saved data the code can check if the version in the file matches the current version of the parent module. (Some sort of parent.getData() which checks if the data has been calculated for that dataset and if it's up to date)
The thing is, at least at first glance, I could see that this could have problems when the change happens several steps up in the dependency chain because the derived file may still be up to date with its module even though its parents are out of date. I suppose I could add some sort of parent.checkIfUpToDate() that checks its own files and then asks each of its parents if they're up to date (which then ask their parents, etc), and updates it if not. The version number can just be a static string stored in each module.
My concern with that approach is that it might mean reading potentially large files from disk just to get a version number. If I went with the "file header" approach, does Python actually load the whole file into memory when I do an open(myFile), or can I do that, just read the header lines, and close the file without loading the whole thing into memory?
Last - is there a good way to embed this type of information beyond just having the first line of the file be some variation of # MyFile made with MyModule V x.y.z and writing some piece of code to parse that line?
I'm kind of curious if this approach makes sense, or if I'm reinventing the wheel and there's already something out there to do this.
edit: And something else that occurred to me after I submitted this - does Python have any mechanism to define templates that modules must follow? Just as a means to make sure I keep the format for the data reading steps consistent from module to module.
I cannot answer all of your questions but you can read in only a small part of data from a large file as you can see here:
How to read specific part of large file in Python
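As a minimal sketch of reading only the header: `open()` by itself does not load the file contents, and `readline()` pulls in a small buffered chunk rather than the whole file. The header format below is the one suggested in the question; the file name and `read_header_line` are made up for the demo.

```python
import os
import tempfile


def read_header_line(path):
    """Return the first line of a file without reading the rest of it."""
    with open(path, "r") as handle:
        return handle.readline().rstrip("\n")


# Demo file: one header line followed by many data lines.
path = os.path.join(tempfile.mkdtemp(), "B.txt")
with open(path, "w") as handle:
    handle.write("# MyFile made with MyModule V 1.2.3\n")
    handle.writelines(f"{i}\n" for i in range(100_000))

header = read_header_line(path)
# Naive parse: take whatever follows the last "V" on the line.
version = header.rsplit("V", 1)[1].strip()
print(version)  # 1.2.3
```

The parsing here is deliberately naive; a structured header (for example JSON after the `#`) would be easier to extend.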
I do not see why you would need a parent.checkIfUpToDate() function. You could as well just store the version number of the parent functions in the file itself as well.
To me your approach sounds reasonable, however I have never done anything similar. Alternatively you could create an additional file that holds the specified information but I think storing the information in the actual file should prevent Version errors between your "data file" and the "function version file".
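One way to sketch the "store the parents' versions in the file itself" idea is to write a one-line stamp that records the version of the step and of every ancestor. The version strings, the `PARENTS` table, and all function names below are hypothetical; this is an illustration of the approach, not an existing tool.

```python
import json

# Hypothetical version strings, one per analysis module.
VERSIONS = {"A": "1.0", "B": "2.1", "D": "1.3"}
PARENTS = {"A": [], "B": ["A"], "D": ["B"]}


def lineage(step, parents):
    """Return `step` plus every ancestor it is derived from."""
    found = {step}
    for parent in parents[step]:
        found |= lineage(parent, parents)
    return found


def make_header(step):
    """Build a one-line header recording the versions of the whole lineage."""
    stamp = {name: VERSIONS[name] for name in sorted(lineage(step, PARENTS))}
    return "# " + json.dumps(stamp)


def is_up_to_date(header_line, step):
    """Compare the stamp saved in a file against the current versions."""
    saved = json.loads(header_line[2:])
    return saved == {name: VERSIONS[name] for name in lineage(step, PARENTS)}


header = make_header("D")
fresh_before = is_up_to_date(header, "D")
VERSIONS["B"] = "2.2"  # a bug fix in B bumps its version
fresh_after = is_up_to_date(header, "D")
print(fresh_before, fresh_after)  # True False
```

Because the stamp covers the whole lineage, a change several steps up is detected from the derived file's own header, without opening the parent files.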
I am building a tool which displays and compares data for a non specialist audience. I have to automate the whole procedure as much as possible.
I am extracting select data from several large data sets, processing it into a useful format, and then displaying it in a variety of ways. The problem I foresee is in the updating of the model.
I don't really want the user to have to do anything more than download the relevant files from the relevant database, rename them, and save them to a location; the spreadsheet should do the rest. The user will then be able to look at the data in a variety of ways, perform a few different analytical functions depending on what they are looking at, output some graphs, etc.
Although some database exports won't be that large, other data will be pulled from very large XML or CSV files (500000x50 cells), and several arrays work on the pulled data once it has been chopped down to the minimum possible. So it will be necessary to open and update several files in order, so that the data in the user control panel is up to date, rather than all at once, which would freeze the user's machine.
At the moment I am building all of this just using Excel formulas.
My question is how best to do the updating and feeding bit. Perhaps some kind of controller program built with Python? I don't know Python, but I have other reasons to learn it.
Any advice would be very welcome.
Thanks
My place of work receives sets of pipe delimited files from many different clients that we use Visual Studio Integration Services projects to import into tables in our MS SQL 2008 R2 server for later processing - specifically with Data Flow Tasks containing Flat File Source to OLE DB Destination steps. Each data flow task has columns that are specifically mapped to columns in our tables, but the chances of a column addition in any file from any client are relatively high (and we are rarely warned that there will be changes), which is becoming tedious as I currently need to...
Run a python script that uses pyodbc to grab the columns contained in the destination tables and compare them to the source files to find out if there is a difference in columns
Execute the necessary SQL to add the columns to the destination tables
Open the corresponding VS Solution, refresh the columns in the flat file sources that have new columns and manually map each new column to the newly created columns in the OLE DB Destination
We are quickly getting more and more sites that I have to do this with, and I desperately need to find a way to automate this. The VS project can easily be automated if we could depend on the changes being accounted for, but as of now this needs to be a manual process to ensure we load all the data properly. Things I've thought about but have been unable to execute...
Using an XML parser - combined with the output of the python script mentioned above - to append new column mappings to the source/destination objects in the VS Package.dtsx.xml. I hit a dead end when I could not find out more information about creating a valid "DTS:DTSID" for new column mapping, and the file became corrupted whenever I edited it. This also seemed a very unstable option
Finding any built-in event handler in Visual Studio to throw an error if the flat file has a new, un-mapped column - I would be fine with this as a solution because we could confidently schedule the import projects to run automatically and only worry about changing the mapping for projects that failed. I could not find a built-in feature that does this. I'm also aware I could do this with a python script similar to the one mentioned above that fails if there are differences, but this would be extremely tedious to implement due to file-naming conventions and the fact that there are 50+ clients with more on the way.
I am open to any type of solution, even if it's just an idea. As this is my first question on Stack Overflow, I apologize if this was asked poorly and ask for feedback if the question could be improved. Thanks in advance to those that take the time to read!
Edit:
@Larnu stated that SSIS by default throws an error when unrecognized columns are found in the files. This however does not currently happen with our Visual Studio Integration Services projects and our team would certainly resist a conversion of all packages to SSIS at this point. It would be wonderful if someone could provide insight as to how to ensure the package would fail if there were new columns - in VS. If this isn't possible, I may have to pursue the difficult route as mentioned by @Dave Cullum, though I don't think I get paid enough for that!
Also, talking sense into the clients has proven to be impossible - the addition of columns will always be a crapshoot!
Using a script task you can read your file and record how many pipes are in a line:
int columnCount = 0;
using (System.IO.StreamReader sr = new System.IO.StreamReader(path))
{
    // Read only the first line; column count = number of pipes + 1.
    string line = sr.ReadLine() ?? "";
    columnCount = line.Length - line.Replace("|", "").Length + 1;
}
I assume you know how to set that to a variable.
Now add an Execute SQL Task and store the result in another variable:
SELECT COUNT(*)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'your_destination_table'
Now, on the arrow leaving the Execute SQL Task, add a condition that compares the two numbers. If they are equal, continue your process. If they are not equal, send an email (or some other type of notification).
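Since the question already uses a Python script for the comparison, the same check can be sketched in Python, extended to compare column names rather than only counts. This assumes the first line of each pipe-delimited file is a header row, which the question implies but does not state outright. The column names are made up, and in practice the destination column list would come from the INFORMATION_SCHEMA query above rather than a hard-coded list.

```python
def header_columns(first_line, delimiter="|"):
    """Split the header line of a delimited file into column names."""
    return [name.strip() for name in first_line.rstrip("\r\n").split(delimiter)]


def compare_columns(file_columns, table_columns):
    """Return (new_in_file, missing_from_file), compared case-insensitively."""
    file_set = {column.lower() for column in file_columns}
    table_set = {column.lower() for column in table_columns}
    return sorted(file_set - table_set), sorted(table_set - file_set)


# Hypothetical header line and destination table columns.
file_cols = header_columns("SiteID|VisitDate|Amount|NewField\n")
new_cols, missing_cols = compare_columns(
    file_cols, ["SiteID", "VisitDate", "Amount"]
)

if new_cols or missing_cols:
    # A scheduled job could fail or send a notification at this point.
    print(f"Column mismatch: new={new_cols} missing={missing_cols}")
```

Comparing names catches the case where one column is added and another removed, which a pure count comparison would miss.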
I have a large amount of data, around 50 GB, in a CSV file that I want to analyse for the purposes of ML. It is, however, way too large to load into memory with Python. Ideally I want to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place. I realise I probably can't load it all at once, so would I do it iteratively? If so, what can I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data and still be able to query and do feature engineering quickly? What I eventually feed into my algorithm should be doable in Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice. This all needs to be done on a personal computer! Thanks!!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the first line of the CSV file to see if there is a header. You'd need to create a table with the same fields (and matching data types).
One of the fields might be unique per line and can be used later to find the line. That's your candidate for PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as the PRIMARY KEY.
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
In order to read in the data, you have 2 ways:
Use LOAD DATA INFILE (see the MySQL LOAD DATA INFILE documentation)
Write your own script: The best technique is to create a prepared statement for the INSERT command. Then read your CSV line by line (in a loop), split the fields into variables and execute the prepared statement with this line's values
You will benefit from a web page designed to search the data. Depends on who needs to use it.
Hope this gives you some ideas
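The second loading approach above (a parameterized INSERT executed while reading the CSV in a loop) can be sketched as follows. To keep the example self-contained it uses Python's built-in `sqlite3` module as a stand-in for MySQL, which is an assumption of this sketch and not what the answer prescribes; with MySQL you would use a connector library, and the placeholder style is typically `%s` instead of `?`. The table, columns, and data are made up, and the batch size is tiny only for the demo.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; a real run would open the large file instead.
CSV_TEXT = "id,city,price\n1,Leeds,10.5\n2,York,7.25\n3,Leeds,3.0\n"
INSERT_SQL = "INSERT INTO sales (id, city, price) VALUES (?, ?, ?)"
BATCH_SIZE = 2  # tiny for the demo; thousands of rows is more realistic

conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, city TEXT, price REAL)")

reader = csv.reader(io.StringIO(CSV_TEXT))
next(reader)  # skip the header line
batch = []
for row in reader:  # the file is never held in memory as a whole
    batch.append((int(row[0]), row[1], float(row[2])))
    if len(batch) >= BATCH_SIZE:
        conn.executemany(INSERT_SQL, batch)
        batch.clear()
if batch:  # flush the final partial batch
    conn.executemany(INSERT_SQL, batch)
conn.commit()

# Index the field that later queries will filter on.
conn.execute("CREATE INDEX idx_sales_city ON sales (city)")
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
leeds_total = conn.execute(
    "SELECT SUM(price) FROM sales WHERE city = ?", ("Leeds",)
).fetchone()[0]
print(row_count, leeds_total)
```

Inserting in batches rather than one row per statement usually cuts the load time considerably, and creating the index after the bulk load avoids updating it on every insert.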
That depends on what you have. You can use Apache Spark and its Spark SQL feature, which lets you write SQL queries against your dataset. For best performance, though, you need a distributed setup and powerful machines (you can run it on a single local machine, but the results are limited). You can write your code in Python, Scala, or Java.