I have some data given to me by my mentor. The data consists of thousands of .fits files. Some of the .fits files are older versions of the others, and their data tables are laid out differently. Here is what I mean:
Let's say I have two .fits files: FITS1.fits and FITS2.fits
$ python
>>> import pyfits
>>> a = pyfits.getdata('FITS1.fits')
>>> b = pyfits.getdata('FITS2.fits')
>>> a.names
['time', 'timeerr', 'sap_flux', 'sap_flux_err']
>>> b.names
['time', 'sap_flux', 'timeerr', 'sap_flux_err']
Does anyone know of a way to switch the columns around in the data tables, so that FITS2.fits has the same layout as FITS1.fits?
Your best bet is to not use pyfits directly, but to use the newer, shinier Astropy Table interface. You can read in a FITS table like:
from astropy.table import Table
table = Table.read('FITS2.fits')
As demonstrated in the section on modifying tables, you can then reorder the columns like:
table = table[['time', 'timeerr', 'sap_flux', 'sap_flux_err']]
(technically this creates a new copy of the table, with the columns selected in the order you wanted them to be in; however IIRC this does not copy the underlying column arrays and should still be a fast operation).
It is also perfectly possible to do this with the legacy pyfits interface, but I wouldn't recommend it for most cases.
You shouldn't write code that depends on the order of keys in a dictionary - a dictionary is a hash table and the order in which its keys are stored is essentially arbitrary.
If you need to match or compare the entries you should get a list of the keys and sort them.
It's probable that pyfits builds a dictionary in the order that the keys are stored in the FITS header, but that isn't necessarily true.
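As a small illustration of comparing by sorted names rather than by position, here is a sketch using the two name lists from the question (plain lists, so no FITS file is needed):

```python
a_names = ['time', 'timeerr', 'sap_flux', 'sap_flux_err']
b_names = ['time', 'sap_flux', 'timeerr', 'sap_flux_err']

# Comparing the lists directly is sensitive to column order.
same_order = a_names == b_names

# Sorting first checks only that both tables have the same set of columns.
same_columns = sorted(a_names) == sorted(b_names)

print(same_order, same_columns)
```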
I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. Simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of translations possible - I have lots of them already in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just as a one-time translation.
It would be nice if you could describe in more detail what data you're working on. I'll take my best guess, though.
Let's say you have a CSV file, you use pandas to read it into a data frame named df, and the "not user friendly" column is named col.
To replace all the values in column col, you first need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top"}  # ... plus the remaining translations
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a key appears in the dictionary, it will be replaced by the new corresponding value in the dictionary, otherwise, it keeps the original value.
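Since the translations already live in a CSV table, the dictionary can be built from that file instead of being typed by hand. The sketch below assumes the translation table has two columns named code and meaning; those names, and the sample rows, are invented for illustration. io.StringIO stands in for the files on disk.

```python
import io
import pandas as pd

# Hypothetical translation table; in practice this would be a CSV file path.
translations_csv = io.StringIO("code,meaning\nBLK,Black\nMNT TP,Mountain Top\n")
# Hypothetical data file containing the column to translate.
data_csv = io.StringIO("id,col\n1,BLK\n2,MNT TP\n3,UNKNOWN\n")

# Build the lookup dictionary from the two columns of the translation table.
translations = pd.read_csv(translations_csv)
my_dict = dict(zip(translations["code"], translations["meaning"]))

# Translate known codes and leave everything else unchanged.
df = pd.read_csv(data_csv)
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))

print(df["col"].tolist())
```

Wrapping these lines in a function would make it straightforward to call them from whatever scheduler runs the job every few minutes.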
I have implemented a program in VBA for Excel that generates automatic communications based on user inputs (cell selections).
The macro makes extensive use of VBA's ListObject object,
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl being now the table where I want to pick up data.
myvariable= ClsTbl.listcolumns("D1").databodyrange.item(34).value
Which means myvariable is the item (row) 34 of the data of the column D1 of the table clstbl
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what the Python equivalent of VBA's ListObject would be. This decision will shape my whole program from the beginning, and I am hesitating a lot over it.
The main idea is to find a way to access table data in a readable manner,
i.e. give me the value of column "text" where column "chapter" is 3 and column paragraph is "2". The values are unique, meaning there is only one value in "text" column where that occurs.
Some observations:
I know everything can be done with lists in python, lists can contain lists that can contain lists..., but this is terrible for readability. mylist1[2][3] (assuming for instance that every row could be a list of values, and the whole table a list of lists of rows).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. That would force me to learn yet another language, SQL or similar, and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters being merged or split).
The data is 100% strings. The only integers are numbers used to order the text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in them.
The tables will be loaded into Python from CSV text files.
Please tell me if anything in the question is unclear and I will complete it.
Would it be necessary to work with numpy? pandas?
i.e give me the value of cell
A pandas DataFrame should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
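The lookup described in the question (the value of "text" where "chapter" is 3 and "paragraph" is 2) could then be written with a boolean mask. This is only a sketch: the column names come from the question, while the sample rows are invented and io.StringIO stands in for the CSV file.

```python
import io
import pandas as pd

# Hypothetical table; in practice pass the path of the CSV file instead.
csv_text = io.StringIO(
    "chapter,paragraph,text\n"
    "3,1,First paragraph\n"
    "3,2,Second paragraph\n"
    "4,1,Another chapter\n"
)

# dtype=str keeps every value as a string, matching the all-text data.
df = pd.read_csv(csv_text, dtype=str)

# Rows where both conditions hold; the question says there is exactly one.
mask = (df["chapter"] == "3") & (df["paragraph"] == "2")
value = df.loc[mask, "text"].iloc[0]

print(value)
```

Selecting by column name keeps the code readable in much the same way as ListColumns("D1") does in VBA.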
I currently have a file with 5 thousand lines, with one dictionary in each line. All dictionaries have the same fields. My question is:
Should I learn SQL to store this data and do the analysis with it, or is the file I've got good enough, so that I should just use pandas or some other module for the data analysis?
I'm really lost on which path I should take.
The question is very general, but it should be noted that how to store a dataset and which tool to use to analyze it are two very different questions.
For datasets that need to be modified or updated at regular intervals, a database is often preferable to, e.g., a compressed file (since modifying the contents of a compressed file requires rewriting all of the data). For example, I probably wouldn't use sqlite for an nltk.corpus, although there may be use cases for that as well.
If you do decide to go with sqlite and your original data is in dictionary format, especially with many fields,
you may find apsw's exectrace and rowtrace useful:
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setrowtrace
and
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setexectrace
For example to get rows out of sqlite in dict rather than tuple format, you may do:
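def rowtracer(cursor, row):
    # Map each column name from the cursor description to its value in the row.
    dictionary = {}
    for index, (name, type_) in enumerate(cursor.getdescription()):
        dictionary[name] = row[index]
    return dictionary

# con is an open apsw.Connection
con.setrowtrace(rowtracer)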
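For comparison, here is a sketch of the same idea with the standard-library sqlite3 module instead of apsw: a row factory plays the role of the row tracer, and a dict supplies named parameters for the insert. The table name, columns and values are invented for illustration.

```python
import sqlite3

def dict_factory(cursor, row):
    # cursor.description holds one tuple per column; its first item is the name.
    return {column[0]: row[index] for index, column in enumerate(cursor.description)}

con = sqlite3.connect(":memory:")
con.row_factory = dict_factory
con.execute("create table my_table(name text, data text)")

# Named placeholders (:name, :data) are filled from the dict keys.
record = {"name": "first", "data": "some payload"}
con.execute("insert into my_table(name, data) values(:name, :data)", record)

# Each fetched row now comes back as a dict rather than a tuple.
rows = con.execute("select name, data from my_table").fetchall()
print(rows)
con.close()
```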
And for inserts you can pass the values in a dict, e.g.
"""insert into my_table(name, data) values(:name, :data)"""
I have a pytable. I often need to copy the rows to an in-memory object and then insert into another pytable. I am wondering what is the easiest way to do this. The following code does not work as one cannot convert a Row object to a dict.
for row in hf.root.my_table.iterrows():
    rec = dict(row)
Also, I want to conditionally copy the data to another file, possibly adding 1-2 new columns. To do this, I will need to extract the table description from one table, modify it, and use the modified description to create a new table. How can I do that?
This won't work either. In general, I find my way of using pytables a little bit awkward, and would like to know better way of doing things.
As mentioned in the documentation, you can use the Row.fetch_all_fields method to obtain an independent copy of the row, which retains its information even after the table file is closed, for example.
Your example code would look like
for row in hf.root.my_table.iterrows():
    rec = row.fetch_all_fields()
rec is a numpy void scalar with the same keys as the row; rec['field'] yields the same data as row['field'].
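Because rec is a numpy record, both the dict conversion and the "add 1-2 columns" step can be sketched with numpy alone. The snippet below does not touch PyTables: a small structured array with invented fields stands in for the table rows. PyTables documents that create_table accepts a NumPy dtype as its description, so a dtype extended this way may be usable for the new table, but that step is an assumption and is not shown here.

```python
import numpy as np

# Stand-in for the records of a PyTables table (invented fields and values).
source = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i8"), ("value", "f8")])

# A single record is a numpy void scalar; its field names live on the dtype.
rec = source[0]
rec_dict = {name: rec[name].item() for name in rec.dtype.names}

# Extend the description with one new column and copy the old fields over.
new_dtype = np.dtype(source.dtype.descr + [("flag", "i4")])
extended = np.zeros(source.shape, dtype=new_dtype)
for name in source.dtype.names:
    extended[name] = source[name]
extended["flag"] = 1

print(rec_dict, extended.dtype.names)
```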
I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as a ndarray, but I'd really like to be able to access the outputs based on the parameter values not just indices. For example, rather than accessing the values as table[16][5][17][14] I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():  # this will iterate over all results
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, you could use a nested Python dictionary instead of an ndarray, and serialize it to a .json text file using the json module.
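A rough sketch of the nested-dictionary idea with two parameters. The model function and the parameter grids are placeholders, not the real model. JSON object keys are always strings, so the parameter values are converted with str() before being used as keys.

```python
import itertools
import json

# Hypothetical parameter grids.
solar_z_values = [30, 45]
reflectance_values = [0.2, 0.37]

def model(solar_z, reflectance):
    # Placeholder for the real scientific model.
    return solar_z * reflectance

# One dictionary level per parameter, keyed by the parameter value as text.
table = {}
for solar_z, reflectance in itertools.product(solar_z_values, reflectance_values):
    table.setdefault(str(solar_z), {})[str(reflectance)] = model(solar_z, reflectance)

# Round-trip through JSON text, as if written to and read back from a file.
text = json.dumps(table)
restored = json.loads(text)

value = restored["45"]["0.37"]
print(value)
```

Iterating over restored.items() returns the parameter values (as strings) together with the stored outputs.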
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
solar_z_dict = {...}
solar_a_dict = {...}
...
def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also convert to string and eval, if you want to have some of the fields to be given as "None" and be translated to ":" (to give the full table for that variable).
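A runnable sketch of this approach with two parameters. The grids and the model function are invented. It uses slice(None) for a parameter given as None, which selects the whole axis like ":" and is swapped in here in place of the string-and-eval idea.

```python
import itertools
import numpy as np

# Hypothetical parameter grids.
solar_z_values = [30, 45, 60]
solar_a_values = [90, 170]

# Map each parameter value to its position along the matching axis.
solar_z_dict = {value: index for index, value in enumerate(solar_z_values)}
solar_a_dict = {value: index for index, value in enumerate(solar_a_values)}

def model(solar_z, solar_a):
    # Placeholder for the real scientific model.
    return solar_z + solar_a / 1000.0

# Fill the array by looping over every parameter combination.
data_array = np.empty((len(solar_z_values), len(solar_a_values)))
for solar_z, solar_a in itertools.product(solar_z_values, solar_a_values):
    data_array[solar_z_dict[solar_z], solar_a_dict[solar_a]] = model(solar_z, solar_a)

def lookup(array, solar_z=None, solar_a=None):
    # A parameter left as None selects the whole axis, like ":" in an index.
    z_index = slice(None) if solar_z is None else solar_z_dict[solar_z]
    a_index = slice(None) if solar_a is None else solar_a_dict[solar_a]
    return array[z_index, a_index]

print(lookup(data_array, solar_z=45, solar_a=170))
print(lookup(data_array, solar_z=45).shape)
```

The value lists double as the reverse mapping: solar_z_values[1] gives the parameter value behind index 1.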
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import sys
import numpy as np
dt = [('L','float64'),('T','float64'),('NMSF','float64'),('err','float64')]
data = np.loadtxt(sys.argv[1], dtype=dt)
Now you can access the data elements by field name, for example data['T'], data['L'] or data['NMSF'].
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
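A small self-contained sketch of field access on such a structured array. The rows are invented and io.StringIO stands in for the text file given on the command line. It also shows selecting rows by parameter value with a boolean mask.

```python
import io
import numpy as np

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]

# Invented rows standing in for the text file passed on the command line.
text = io.StringIO("1.0 300.0 0.5 0.01\n2.0 300.0 0.7 0.02\n2.0 350.0 0.9 0.03\n")
data = np.loadtxt(text, dtype=dt)

# Whole columns are available by field name.
print(data['T'])

# Rows selected by parameter values rather than by position.
selected = data[(data['L'] == 2.0) & (data['T'] == 350.0)]
print(selected['NMSF'])
```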