Efficiency of a list of dictionaries - Python

Hello, I'm moving from VB to Python and doing some of my first projects to learn the syntax and the basics of the language. What I'm trying to do now is a sort of simulation of a "management app"; before working with databases I'm doing it first with files.
I have this file (which will be my database in the future) where I store information about some employees. The fields are:
id, name, surname, date of birth, status, code, contract
In the file they are stored like this:
1|Bob|Brown|07/12/1985|Active|202020|1
(The code is a number I generate to let the user "log in" to see his information, and the contract is the id of a contract I keep in another file, i.e. a foreign key.)
Now I store all of these in a list of dictionaries, so my overall data structure looks like this:
[{'id': 1, 'name': 'Bob', 'surname': 'Brown', 'dateB': '07/12/1985', 'status': 'Active', 'code': 202020, 'contract': 1},
 {'id': 2, 'name': 'Josh', 'surname': 'Allen', 'dateB': '05/02/1999', 'status': 'Active', 'code': 202021, 'contract': 3}]
Each time I add a new employee I create a new dictionary, append it to the list, and write it to the file:
new_empl = dict(id=3, name='Robert', surname='Lasky', dateB='03/11/1997', status='Active', code=202022, contract=2)
list_employees.append(new_empl)
f.write(str(new_empl['id']) + "|" + new_empl['name'] + "|" + new_empl['surname'] + ...
This updates both the file and the list, but I wonder if there is a more efficient way to store the data than the list of dictionaries I'm using right now.

You can use pandas. It is designed for tabular data, and since it is built on NumPy, whose core is implemented in C, it is very efficient. You can turn your list of dicts into a DataFrame as follows:
import pandas as pd

df = pd.DataFrame(list_employees)
And writing it to a CSV file is as simple as:
df.to_csv('file.csv')
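As a follow-up, here is a minimal sketch of the whole round trip under the question's assumptions (the file name 'employees.txt' and the column names are taken from the post above):

import pandas as pd

# read the existing pipe-separated "database"; the file has no header row
columns = ['id', 'name', 'surname', 'dateB', 'status', 'code', 'contract']
df = pd.read_csv('employees.txt', sep='|', names=columns)

# add a new employee as a one-row DataFrame and append it
new_empl = {'id': 3, 'name': 'Robert', 'surname': 'Lasky', 'dateB': '03/11/1997',
            'status': 'Active', 'code': 202022, 'contract': 2}
df = pd.concat([df, pd.DataFrame([new_empl])], ignore_index=True)

# write everything back in the same pipe-separated format
df.to_csv('employees.txt', sep='|', index=False, header=False)

pd.concat is used here rather than DataFrame.append, since append was deprecated and removed in recent pandas versions.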

Related

python equivalent to listObjects in VBA for Excel (tables)

I have implemented a program in VBA for Excel to generate automatic communications based on user inputs (selections of cells).
The macro, written in VBA, makes extensive use of VBA's ListObject feature,
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl is now the table from which I want to pick up data.
myvariable = ClsTbl.ListColumns("D1").DataBodyRange.Item(34).Value
This means myvariable is item (row) 34 of the data in column D1 of the table ClsTbl.
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what the Python equivalent of VBA's ListObject would be. This decision will shape my whole program from the beginning, and I am hesitating a lot over what the right choice is.
The main idea is to have a way to access table data in a readable manner,
i.e. give me the value of column "text" where column "chapter" is 3 and column "paragraph" is "2". The values are unique, meaning there is only one value in the "text" column where that occurs.
Some observations:
I know everything can be done with lists in Python; lists can contain lists that can contain lists..., but this is terrible for readability: mylist1[2][3] (assuming, for instance, that every row is a list of values, and the whole table a list of such row lists).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. A database would force me to learn yet another language (SQL or so), and I have more than enough on my plate with Python and Django.
The user modifies the structure of many tables (chapters coming together or being split).
The data is 100% strings. The only integers are numbers used to sort the text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in texts.
The tables will be loaded into Python as CSV text files.
Please let me know if anything in the question is not clear enough and I will complete it.
Would it be necessary to operate with numpy? pandas?
A DataFrame using pandas should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try:
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row of the file will be used as the labels of the DataFrame columns.
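To get the kind of readable lookup you describe, boolean indexing on the DataFrame works well. A minimal sketch, using the column names from your example (depending on how read_csv infers types, you may need to compare against 3 rather than '3'):

import pandas as pd

df = pd.read_csv('your_file.csv')

# select the rows where chapter is "3" and paragraph is "2"
match = df[(df['chapter'] == '3') & (df['paragraph'] == '2')]

# the question says such a match is unique, so take the single value
value = match['text'].iloc[0]
print(value)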

How to add data (dict) to a specific fieldname (key) in a csv file?

EDIT: Sorry for the confusion, I'll explain what the program is for. It's to keep track of a user's new weight records. The file only updates when they have exceeded their previous record for a lift, with a timestamp. I want the user to be able to see a timeline of their progress for each lift using the timestamps. This is why I was using lift['key'] = {date: data}, so that they can reference each lift type and query by date; for example, lift['snatch']['5/25'] will tell them what they maxed that day. But I can't seem to write this to a csv file properly. Thank you for your time! Happy Friday!
I've been researching for days and can't seem to figure out how to add data to a specific fieldname, which is the highest-level key in my dict.
The data I want to add is a dict in its own right.
How I envision it looking in the CSV file:
snatch <> squat <> jerk
10/25:150lbs <> 10/25:200lbs <> 10/25:0lbs
So this is how it would look when the file is first created. How am I able to update just one field?
Say the user only squatted that day and wants to append data to that field.
What I have so far:
import time
import csv

csv_file = 'lifts.csv'
lifts = {}
csv_columns = ['snatch', 'squat', 'jerk']
creation = time.strftime('%M:%S', time.localtime())  # note: '%M:%S' is minutes:seconds, not a date
lifts['snatch'] = {creation: '150lbs'}
lifts['squat'] = {creation: '200lbs'}
lifts['jerk'] = {creation: '0lbs'}
try:
    with open(csv_file, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for data in lifts:
            writer.writerow(lifts)
except IOError:
    print("Error")
-> One of my issues is that when it writes to the csv file it writes the row three times. I'm not quite sure why; it's the format I want, but it's there three times.
-> I also want to implement this next bit of code and write to one specific column, but when I do so it writes null or blanks in the other columns.
lifts['jerk'].update({time.strftime('%M:%S', time.localtime()): '160lbs'})
Then outputting:
snatch <> squat <> jerk
10/25:150lbs <> 10/25:200lbs <> 10/25:0lbs 10/26:160lbs
Sorry, I'm new to Python and not quite sure how to use this editor. I want that result to land under the {10/25:0lbs}, just like it would show in Excel.
It's important to keep track of what's going on here: lifts is a dictionary with strings for keys ("snatch", "squat", "jerk") and whose values are also dictionaries. This second level of dictionaries has timestamp strings for keys and strings as values.
I suspect that when you want to update the lifts['jerk'] dictionary, you don't use the same key (timestamp) as the existing entry.
It doesn't seem like you need a dictionary for the second level; consider using a list instead. If you must, you can access the first entry like so: lifts['jerk'][list(lifts['jerk'].keys())[0]], which is rather ham-fisted - again, consider either using a different data type for the values of your lifts dictionary or using keys that are easier to reference than timestamps.
EDIT: You could do something like lifts['jerk'] = {'timestamp': creation, 'weight': '165lbs'}, which requires some restructuring of your data.
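A minimal sketch of that restructuring, writing one flat CSV row per lift entry instead of one nested dict per column (the file name and the '%m/%d' date format are assumptions, chosen to match the 10/25-style dates in the question):

import csv
import time

today = time.strftime('%m/%d')

# one flat record per lift event: the lift name, a timestamp, and the weight
records = [
    {'lift': 'snatch', 'timestamp': today, 'weight': '150lbs'},
    {'lift': 'squat', 'timestamp': today, 'weight': '200lbs'},
    {'lift': 'jerk', 'timestamp': today, 'weight': '0lbs'},
]

with open('lifts.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['lift', 'timestamp', 'weight'])
    writer.writeheader()
    writer.writerows(records)  # one row per record, so nothing is written three times

# appending a later jerk entry is then a single extra row
with open('lifts.csv', 'a', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['lift', 'timestamp', 'weight'])
    writer.writerow({'lift': 'jerk', 'timestamp': today, 'weight': '160lbs'})

Querying "what did I max on a given day" then becomes a matter of filtering rows rather than digging through nested dicts.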

How to do data analysis using Python of a file with thousands of dictionaries, one per line

I currently have a file with 5 thousand lines, with one dictionary on each line. All the dictionaries have the same fields. My question is:
should I learn SQL to store this data and do the analysis with it, or is the file I've got good enough, so that I should just use pandas or some other module for the data analysis?
I'm really lost on which path I should take.
While the question is very general, it should be noted that how do I store my dataset and what tool do I use to analyze my data are very different questions.
Very often, for datasets that need to be modified or updated at regular intervals, a database will be preferable to, e.g., a compressed file (since modifying the compressed file's contents requires rewriting all of the data). For example, I probably wouldn't use sqlite for an nltk.corpus, although there may be use cases for that as well.
If you do decide to use sqlite and your original data is in dictionary format, especially with many fields, you may find exectrace and rowtrace useful:
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setrowtrace
and
http://apidoc.apsw.googlecode.com/hg/connection.html#apsw.Connection.setexectrace
For example, to get rows out of sqlite in dict rather than tuple format, you can do:
def rowtracer(cursor, row):
    dictionary = {}
    for index, (name, type_) in enumerate(cursor.getdescription()):
        dictionary[name] = row[index]
    return dictionary

con.setrowtrace(rowtracer)
And for inserts you can pass values in a dict, e.g.
"""insert into my_table(name, data) values(:name, :data)"""

Organizing column and header data with pandas, python

I'm having a go at using NumPy instead of Matlab, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., each file being a measurement period), and I decided pandas was probably the best way to import it. I was thinking of using a top-level descriptor for each file, and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, but there is general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store this data in a way that makes sense when combined with the data frame. I thought of perhaps using a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?
Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it, I would do something along these lines (yes, there are modifications that could be made to "improve" this):
import pandas as pd

with open("filename.csv", 'r') as f:
    header = f.read().split('\n')[:5]  # change to match the number of header rows in your file
    f.seek(0)                          # rewind, since read() moved the file pointer to the end
    data = pd.read_csv(f, skiprows=6, skipinitialspace=True,
                       na_values=[-999, 'Infinity', '-Infinity'])

# now you can parse your header to get out the necessary information;
# continue until you have all the header info you want/need, e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split(' ')[0]
time = header[1].split(': ')[2]

# a lot of the header information will get stored as metadata for me;
# most likely you want more than flight number and date, but you get the point
data.metadata = {'flight': flight,
                 'date': date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
Now, regarding your "metadata". EdChum makes an excellent point: if you copy data you will NOT copy over the metadata dictionary. Also, if you save data to a file via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple of options.
Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
Assuming you want to have multiple flights within one saved file: you can add additional columns to your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file; see the sketch after this list.
Assuming you want to have multiple flights within one saved file (option 2): you can make your metadata dictionary "keyed" by flight number, e.g.
data.metadata = {FLIGHT1: {'date': date},
                 FLIGHT2: {'date': date}}
Now to store the metadata. Check out my IO class for storing additional attributes within an h5 file, posted here.
Your question was quite broad, so you got a broad answer. I hope this was helpful.

How to store numerical lookup table in Python (with labels)

I have a scientific model which I am running in Python which produces a lookup table as output. That is, it produces a many-dimensional 'table' where each dimension is a parameter in the model and the value in each cell is the output of the model.
My question is how best to store this lookup table in Python. I am running the model in a loop over every possible parameter combination (using the fantastic itertools.product function), but I can't work out how best to store the outputs.
It would seem sensible to simply store the output as an ndarray, but I'd really like to be able to access the outputs based on the parameter values, not just the indices. For example, rather than accessing the values as table[16][5][17][14], I'd prefer to access them somehow using variable names/values, for example:
table[solar_z=45, solar_a=170, type=17, reflectance=0.37]
or something similar to that. It'd be brilliant if I were able to iterate over the values and get their parameter values back - that is, being able to find out that table[16]... corresponds to the outputs for solar_z = 45.
Is there a sensible way to do this in Python?
Why don't you use a database? I have found MongoDB (and the official Python driver, Pymongo) to be a wonderful tool for scientific computing. Here are some advantages:
Easy to install - simply download the executables for your platform (2 minutes tops, seriously).
Schema-less data model
Blazing fast
Provides map/reduce functionality
Very good querying functionalities
So, you could store each entry as a MongoDB entry, for example:
{"_id":"run_unique_identifier",
"param1":"val1",
"param2":"val2" # etcetera
}
Then you could query the entries as you will:
import pymongo

# pymongo.Connection is the legacy API; current versions use pymongo.MongoClient
data = pymongo.MongoClient("localhost", 27017)["mydb"]["mycollection"]
for entry in data.find():  # this will yield all results
    print(entry["param1"])  # do something with param1
Whether or not MongoDB/pymongo are the answer to your specific question, I don't know. However, you could really benefit from checking them out if you are into data-intensive scientific computing.
If you want to access the results by name, then you could use a nested Python dictionary instead of an ndarray, and serialize it to a .json text file using the json module.
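A minimal sketch of that idea, with a stub standing in for the actual model and made-up parameter values (note that JSON only allows string keys, hence the str() calls):

import json
from itertools import product

def run_model(solar_z, solar_a):  # stand-in for the real model
    return solar_z * solar_a

table = {}
# build the nested structure over every parameter combination
for solar_z, solar_a in product([30, 45], [170, 180]):
    table.setdefault(str(solar_z), {})[str(solar_a)] = run_model(solar_z, solar_a)

# look up by parameter value rather than by index
value = table['45']['170']

# serialize to disk; json.load reverses this later
with open('lookup.json', 'w') as f:
    json.dump(table, f)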
One option is to use a numpy ndarray for the data (as you do now), and write a parser function to convert the query values into row/column indices.
For example:
# map each parameter value to its index along the corresponding axis
solar_z_dict = {...}
solar_a_dict = {...}
...

def lookup(dataArray, solar_z, solar_a, type, reflectance):
    return dataArray[solar_z_dict[solar_z], solar_a_dict[solar_a], ...]
You could also convert the query to a string and eval it, if you want some of the fields to be given as None and translated to ":" (to give the full table along that variable).
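The same effect can be had without eval by using slice(None), which is the programmatic spelling of ":". A minimal sketch for two parameters, with tiny made-up index dictionaries:

import numpy as np

# example index maps; in practice these come from your parameter grids
solar_z_dict = {30: 0, 45: 1}
solar_a_dict = {170: 0, 180: 1}

def lookup(dataArray, solar_z=None, solar_a=None):
    # None means "all values along this axis", i.e. the ':' slice
    idx = (
        solar_z_dict[solar_z] if solar_z is not None else slice(None),
        solar_a_dict[solar_a] if solar_a is not None else slice(None),
    )
    return dataArray[idx]

table = np.arange(4).reshape(2, 2)
print(lookup(table, solar_z=45))               # the full row for solar_z = 45
print(lookup(table, solar_z=45, solar_a=170))  # a single cell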
For example, rather than accessing the values as table[16][5][17][14]
I'd prefer to access them somehow using variable names/values
That's what numpy's dtypes are for:
import pylab as plb
from sys import argv

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = plb.loadtxt(argv[1], dtype=dt)
Now you can access the columns by name, e.g. data['T'], data['L'] or data['NMSF'].
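For a quick feel of how field access works without reading a file, a tiny self-contained example with invented values:

import numpy as np

dt = [('L', 'float64'), ('T', 'float64'), ('NMSF', 'float64'), ('err', 'float64')]
data = np.array([(1.0, 273.15, 0.5, 0.01),
                 (2.0, 280.00, 0.6, 0.02)], dtype=dt)

print(data['T'])        # the whole T column: [273.15 280.  ]
print(data[0]['NMSF'])  # a single cell, by row index and field name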
More info on dtypes:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html
