Transfer data from table to objects (Python)

I am new to programming and would appreciate any advice regarding my assignment. And before you read what follows, please accept my apologies for being so silly.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3\n
1;01.02.2014;0001;1;25.3;24.2;40.0\n
2;01.02.2014;0002;1;22.1;19.1;40.7\n
3;02.02.2014;0003;1;20.1;19.3;44.2\n
4;04.02.2014;0001;2;22.8;16.1;39.3\n
...
The first line contains column names, and every other line represents an observation. In fact there are over 200 columns and about 3000 lines in each .txt file I get. Moreover, every week column names may be slightly different from what they were a week before, and every week the number of observations increases.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to make a program in Python 2.7.6 that would consist of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, and I would like the objects representing observations related to that patient to be attributes of the patient object. My goal here is to make it easy to check whether a patient's parameter increases from checkpoint 1 to checkpoint 2.
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
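To make this concrete, here is a minimal sketch of the kind of thing I mean; the Observation and Patient classes and the column-name sanitising are only illustrations, not something I already have working:
import csv
import re

class Observation(object):
    """One row of the table; column names become attribute names."""
    def __init__(self, row):
        for column, value in row.items():
            # Turn a header such as "PatientID" into a valid attribute name
            # (a column like "№" would need special handling).
            name = re.sub(r'\W+', '_', column.strip()).strip('_').lower()
            setattr(self, name, value)

class Patient(object):
    """Collects every observation that shares one PatientID."""
    def __init__(self, patient_id):
        self.patient_id = patient_id
        self.observations = []

def load_patients(path):
    patients = {}
    with open(path) as f:
        for row in csv.DictReader(f, delimiter=';'):
            obs = Observation(row)
            if obs.patientid not in patients:
                patients[obs.patientid] = Patient(obs.patientid)
            patients[obs.patientid].observations.append(obs)
    return patients
The observations end up in a dict keyed by PatientID rather than in dynamically named variables, which sidesteps the problem discussed in the second link above.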

Related

How to create a dataframe from a complex (nested) list of dictionaries?

Good morning!
I have retrieved a bunch of data from the Facebook Graph API using the python facebook library. I'd like to organize all the meaningful data neatly in a csv file for later analysis, but the problem is I'm quite new to python and I don't know how to approach the format in which the data has been retrieved. Basically, I have all the data about a page's posts from 01-05-2020 in a list called data_basic:
Every element of the list represents one post and is a dict of size 8.
Every dict has: 3 dict elements, 3 string elements, 1 bool element and 1 list element.
For example, in order to access the media_type of the first post I must type: data_basic[0]['attachments']['data'][0]['value'], because inside the dict representing the first post I have a dict containing the attachments whose key 'data' contains a list in which I have the values (for example, again, media_type). A nightmare...
Every post dict is structured a bit differently... The attachments are the most deeply nested, but something similar happens for the comments and the tags, while the message, created time and so on are much more accessible.
I'd like to obtain a csv table whose rows are the various posts and whose columns are the variables (except, of course, the comments, which I'll store in a different file since there's more than one for each post).
How can I approach the problem? The first thing that comes to my mind is a brute-force approach using a for loop through all the posts and all the variables, filling the dataframe cell by cell. But I hope there's a quicker and more elegant way... I've come across the json_normalize function and tried something, but I really don't understand how it works or whether it can be of any help... Any thoughts?
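To make the json_normalize idea concrete, here is a rough sketch of what I imagine it could look like, assuming pandas >= 1.0 and the nested layout described above (the key names come from the access path I showed, not from the real Graph API response):
import pandas as pd

# data_basic: the list of post dicts described above.
posts = pd.json_normalize(data_basic, sep='.')   # nested dicts -> dotted column names

# Nested lists (e.g. the attachments' 'data' list) stay as list-valued cells,
# so values buried inside them still need to be pulled out explicitly:
posts['media_type'] = [p['attachments']['data'][0]['value'] for p in data_basic]

posts.to_csv('posts.csv', index=False)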
Thanks in advance!

Creating new groups by pattern

I have JSON with data about some products and I have already converted it into a flat table with pandas, so now I have a few columns of data. I selected some products manually and put them into one group. I have sorted them by name, for example, but it is more complicated than that; there are also some features and requirements which need to be checked.
So what I want is to create a script which will group my products in a similar way to the few groups I created manually based on my own judgement.
I'm totally new to machine learning; I have read about it and watched some tutorials, but I haven't seen this type of case.
I saw that if I use a KNN classifier, for example, I have to give it every group that exists as input and it will then assign a single product to one of those groups, but my case must be more complicated, I guess, since I want the script to create those groups on its own, in a way similar to the ones I selected.
I was thinking about unsupervised machine learning, but that doesn't look like the solution because I have my own examples which I want to provide; it seems like I need some kind of hybrid with supervised machine learning.
import pandas as pd
from pandas import json_normalize  # in older pandas: from pandas.io.json import json_normalize
from sklearn import preprocessing

data = pd.read_json('recent.json')['results']
data = json_normalize(data)
le = preprocessing.LabelEncoder()
product_name = le.fit_transform(data['name'])
just some code to show what I have done
I don't know if what I want makes sense. I already made an attempt at this problem in the normal way, without machine learning, just with ifs and loops, but I wish I could also do it in a "smarter" way.
The code above shows nothing on its own. If you have data about products where each entry contains feature fields, you can cluster it with an unsupervised method such as hierarchical clustering (the dendrogram approach described below), rather than a supervised KNN classifier.
I have to put in input every group that exists
No: just define the metric between two entries and the method builds classes or an entire dendrogram according to it, so you can select classes from the dendrogram as you want. If you look at each node there, it contains the common features of the items in the class, so it gives an automatic description for the class.
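A minimal sketch of that idea with SciPy's hierarchical clustering; the feature columns and the distance threshold below are placeholders, not taken from the question:
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical numeric feature columns taken from the flattened JSON table.
features = data[['feature_a', 'feature_b']].values

# The "metric between two entries": pairwise distances over the features.
distances = pdist(features, metric='euclidean')

# Build the dendrogram (agglomerative clustering) and cut it at a distance
# threshold to obtain group labels for every product.
tree = linkage(distances, method='average')
data['group'] = fcluster(tree, t=2.0, criterion='distance')

# scipy.cluster.hierarchy.dendrogram(tree) can be plotted to inspect the tree.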

Compare Two Model Instances And Return Key/Value Differences

I have a list of documents. On the view of each document instance I'd like to highlight prior versions of the same document. I want to show how the document has changed over time, and my idea was to always compare each version to the one that came before it. So, for example, compare version 2 to version 1. And highlight what changed.
To solve this problem, I would like to find out how I could compare two model instances and return a key/value list of differences.
The answer to your question is pretty much contained in this Stack Overflow question:
Iterate over model instance field names and values in template
Once you have extracted the field names it should be pretty simple to iterate over them, using getattr() to extract the data values from the two different record versions. After that it's just a matter of formatting to taste.
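A rough sketch of that approach, assuming two saved instances of the same Django model (only concrete fields are compared, and the function name is just a placeholder):
def model_diff(old, new):
    """Return {field_name: (old_value, new_value)} for every field that changed."""
    diffs = {}
    for field in old._meta.fields:          # concrete model fields
        name = field.name
        old_value = getattr(old, name)
        new_value = getattr(new, name)
        if old_value != new_value:
            diffs[name] = (old_value, new_value)
    return diffs

# e.g. model_diff(document_v1, document_v2) might return
# {'title': ('Draft', 'Final'), 'body': ('...', '...')}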

Dynamically named set, or alternative suggested method?

First of all, thank you for taking the time to look at my problem. Rather than simply describing the solution I have in mind, I thought it best to also outline the problem itself, in order to enable alternative solution ideas to be suggested. It is more than likely that there is a better way to achieve this.
The problem I have:
I generate lists of names with associated scores, ranks and other values. These lists are generated daily but have to change as the day progresses, as a result of needing to remove some names. Currently these lists of names are produced on Excel-based sheets which contain the following data types in the following format:
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
For example;
Mrs Dodgsons class
Rosie,1,123.8,5,Lincoln University
James,2,122.6,7,Lincoln University
Chris,3,120.4,12,Lincoln University
Douglas,4,120.2,18,Lincoln University
Dr Clements class
Hannah,1,126.9,2.56,Durham University
Gill,2,124.54,6.89,Durham University
Jack,3,122.04,15.62,Durham University
Jamie,4,121.09,20.91,Durham University
Douglas,4,120.2,18,Durham University
Now, what I have is a separate list of users and their associated "non unique filter" who need removing from the Excel-generated lists above (don't worry, the final product of this little project is not to re-save a modified Excel doc). This list is generated by a web scraper which is updated every two minutes. The method I currently see as a potentially viable solution is a piece of code which saves each list in the CSV as a set (if this is possible) and then, upon finding a Unique Name / non-unique filter combination, deletes it from the set(s) in which it occurs.
For instance, if Douglas,Durham University were returned on this list, then Douglas would be removed from the second of the two defined sets. Where a unique user name appears in two of the sets, one of them will always appear on the list of users to be removed along with the associated university (so we can identify which set to remove the user from). However, please note that users to be removed do not always appear in two sets at once; for instance, "Rosie,Lincoln University" could just as easily appear on the list of users to be removed.
I previously put a very similar problem on the python forum, however I had made a few mistakes in the way the question was asked, and what I wanted to achieve, instead of confusing the issue on the old thread I have started up a new thread here. On the old thread there were some general questions asked about the problem which I shall answer here in order to provide some clarification.
Q1 So the first list is only generated once a day; what happens to it after that day? Is it thrown away, stored, replaced, etc.?
A1 My gut feeling is that it should be saved to a folder as a simple .txt, .csv or similar, if only as a debugging log.
Q2 Every two minutes the first list needs altering; what happens to the altered list, who needs to know about it, and is it stored or just changed in some in-memory state, etc.?
A2 The ultimate aim of this code is to produce an RSS feed with user statistics, some of which include the (Rank) and the (Calculated Numeric Value). The Rank is self-explanatory with regards to how it could change as a result of a user being removed. However, the (Calculated Numeric Value) is derived from an equation which uses the sum of the (Score)s for each list as well as the number of users in that list. So, in answer to the original question, the list will need to be stored in some way.
Q3 Are names unique per class, or unique throughout the whole data?
A3 Names are unique throughout the entire data set, i.e. the username Douglas will always refer to the same Douglas; if a user appears in more than one class then they will always appear on the list of users to be removed.
Q4 If names are unique, what happens when two people have the same name in the same class, which sounds quite possible?
A4 In this example it seems possible for more than one user in the same class to have the same name; however, in reality it cannot happen.
My questions to Stack Overflow are:
Is the proposed methodology viable with regards to producing multiple uniquely named sets (up to 60 per day)?
Is there a better method of achieving the same result?
Any help or comments would be greatly appreciated
Best regards AEA
No, I don't think you could convert the data in each csv file to a set without a loss of data. You could avoid that by converting them into dictionaries keyed by a tuple of (user, non-unique filter), associated with a list value consisting of all the other quantities in the corresponding row of the csv.
To update these dictionaries, you could simply delete any entries that exist in them that match any on the separate list of users you have of those that need removing.
If you are unsure of how to do either of these things, feel free to ask another question.
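A minimal sketch of the dictionary approach described above; the file layout follows the example data and the function names are placeholders:
import csv

def load_lists(path):
    """Read the daily file into {list_title: {(name, filter): [rank, score, value]}}."""
    lists, current = {}, None
    with open(path) as f:
        for row in csv.reader(f):
            if len(row) == 1:                      # a (Unique List Title) line
                current = lists.setdefault(row[0], {})
            elif len(row) == 5 and current is not None:
                name, rank, score, value, uni = row
                current[(name, uni)] = [rank, score, value]
    return lists

def remove_users(lists, to_remove):
    """Drop every (name, filter) pair on the removal list from every class list."""
    for entries in lists.values():
        for key in to_remove:                      # e.g. [('Douglas', 'Durham University')]
            entries.pop(key, None)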

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simplest solution works best ;-)
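A rough sketch of that single-table idea with PyTables and Blosc; the column names and sizes below are placeholders for the real arguments:
import tables

class Result(tables.IsDescription):
    # Placeholder columns for the five arguments and the twelve float results.
    stringArg1 = tables.StringCol(32)
    stringArg2 = tables.StringCol(32)
    stringArg3 = tables.StringCol(32)
    stringArg4 = tables.StringCol(32)
    intArg1 = tables.Int32Col()
    results = tables.Float64Col(shape=(12,))

filters = tables.Filters(complevel=5, complib='blosc')   # on-the-fly compression

with tables.open_file('simulations.h5', mode='w') as h5:
    table = h5.create_table('/', 'results', Result, filters=filters)
    row = table.row
    row['stringArg1'] = 'configA'              # ...and the other argument columns
    row['intArg1'] = 42
    row['results'] = [0.0] * 12
    row.append()
    table.flush()
    # Later: table.read_where('intArg1 == 42') to query existing entries.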
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
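If it helps, a bare-bones sketch of that layout with sqlite3; the table and column names are placeholders, and only two of the five lookup tables are shown:
import sqlite3

conn = sqlite3.connect('simulations.db')
# Tables 1-5: one single-column table of valid values per argument.
conn.execute('CREATE TABLE IF NOT EXISTS string_arg1_values (value TEXT PRIMARY KEY)')
conn.execute('CREATE TABLE IF NOT EXISTS string_arg2_values (value TEXT PRIMARY KEY)')
# Table 6: only the argument combinations that actually exist, plus the results.
conn.execute('''
    CREATE TABLE IF NOT EXISTS results (
        string_arg1 TEXT, string_arg2 TEXT, string_arg3 TEXT,
        string_arg4 TEXT, int_arg1 INTEGER,
        result01 REAL, result02 REAL,   -- ... through result12
        PRIMARY KEY (string_arg1, string_arg2, string_arg3, string_arg4, int_arg1)
    )
''')
conn.commit()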
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used NumPy many times for simulation data processing and it provides many useful tools, including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
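For example, a small sketch of a structured array with save and reload; the field names and string widths are only illustrative:
import numpy as np

# One record per simulation entry: five argument fields plus twelve float results.
dtype = np.dtype([
    ('stringArg1', 'U32'), ('stringArg2', 'U32'),
    ('stringArg3', 'U32'), ('stringArg4', 'U32'),
    ('intArg1', 'i4'), ('results', 'f8', (12,)),
])

records = np.zeros(2, dtype=dtype)
records[0] = ('a', 'b', 'c', 'd', 42, np.arange(12, dtype='f8'))

np.save('simulations.npy', records)        # easy file storage...
loaded = np.load('simulations.npy')        # ...and reload

mask = (loaded['stringArg1'] == 'a') & (loaded['intArg1'] == 42)
print(loaded[mask]['results'])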
