Dynamically named set, or alternative suggested method? - python

First of all, thank you for taking the time to look at my problem. Rather than simply describing the solution I have in mind, I thought it best to also outline the problem itself so that alternative solutions can be suggested; it is more than likely that there is a better way to achieve this.
The problem I have:
I generate lists of names with associated scores, ranks and other associated values. These lists are generated daily but have to change as the day progresses because some names need to be removed. Currently these lists of names are produced on Excel-based sheets which contain the following data types in the following format:
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique List Title)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
(Unique Name in list),(Rank),(Score),(Calculated Numeric Value),(non unique filter)
For example:
Mrs Dodgsons class
Rosie,1,123.8,5,Lincoln University
James,2,122.6,7,Lincoln University
Chris,3,120.4,12,Lincoln University
Douglas,4,120.2,18,Lincoln University
Dr Clements class
Hannah,1,126.9,2.56,Durham University
Gill,2,124.54,6.89,Durham University
Jack,3,122.04,15.62,Durham University
Jamie,4,121.09,20.91,Durham University
Douglas,4,120.2,18,Durham University
Now, what I have is a separate list of users and their associated "non unique filter" who need removing from the above Excel-generated lists (don't worry, the final product of this little project is not to re-save a modified Excel doc); this list is generated via a web scraper which is updated every two minutes. The method I currently perceive as a potentially viable solution is a piece of code which saves each list in the CSV as a set (if this is possible), then, upon finding a unique name/non-unique filter combination, deletes that entry from the set(s) in which it occurs.
For instance, if "Douglas,Durham University" were returned on this list, then Douglas would be removed from the second of the two defined sets. Where a unique user name appears in two of the sets, one of them will always appear on the list of users to be removed along with their associated university (so we can identify which set to remove the user from). However, please note that users to be removed do not always appear in two sets at once; for instance, "Rosie,Lincoln University" could just as easily appear on the list of users to be removed.
I previously put a very similar problem on the Python forum; however, I had made a few mistakes in the way the question was asked and in what I wanted to achieve. Instead of confusing the issue on the old thread, I have started a new thread here. On the old thread there were some general questions asked about the problem, which I shall answer here to provide some clarification.
Q1: So the first list is only generated once a day; what happens to it after that day? Is it thrown away, stored, replaced, etc.?
A1: My gut feeling is that it should be saved to a folder as a simple .txt, .csv or similar, if only as a debugging log.
Q2: Every two minutes the first list needs altering; what happens to the altered list? Who needs to know about it? Is it stored or just changed in some memory state, etc.?
A2: The ultimate aim of this code is to produce an RSS feed with user statistics; some of these stats include the (Rank) and the (Calculated Numeric Value). The Rank is self-explanatory with regards to how it could change as a result of a user being removed. However, the (Calculated Numeric Value) is derived from an equation which uses the sum of the (Score)s for each list as well as the number of users in said list. So, in answer to the original question, the list will need to be stored in some way.
Q3: Are names unique per class or unique throughout the whole data?
A3: Names are unique throughout the entire data, i.e. the username Douglas will always refer to the same Douglas; if a user appears in more than one class then it will always appear on the list of users to be removed.
Q4: If names are unique, what happens when two people have the same name in the same class, which sounds quite possible?
A4: In this example it seems possible for more than one user in the same class to have the same name; however, in reality it cannot happen.
My questions to Stack Overflow are:
Is the methodology proposed viable with regards to producing multiple uniquely named sets (up to 60 per day)?
Is there a better method of achieving the same result?
Any help or comments would be greatly appreciated.
Best regards AEA

No, I don't think you could convert the data in each CSV file to a set without a loss of data. You could avoid that by converting each list into a dictionary keyed by a tuple of (user, non-unique filter), with the value being a list of all the other quantities in the corresponding row of the CSV.
To update these dictionaries, you would simply delete any entries whose keys match an entry on your separate list of users that need removing.
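For example, a rough sketch of both steps using the standard csv module (the file name "results.csv" and the exact column layout are assumptions based on the example data above):

import csv
from collections import defaultdict

# One dictionary per class, keyed by (name, non-unique filter); the value keeps
# the remaining columns (rank, score, calculated value).
classes = defaultdict(dict)
with open("results.csv", newline="") as f:
    current_title = None
    for row in csv.reader(f):
        if len(row) == 1:                      # a title line starts a new list
            current_title = row[0]
        elif row:
            name, rank, score, calc, filt = row
            classes[current_title][(name, filt)] = [rank, score, calc]

# removals is whatever the scraper returned in the last two minutes
removals = [("Douglas", "Durham University")]
for key in removals:
    for members in classes.values():
        members.pop(key, None)                 # drop the entry wherever it exists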
If you are unsure of how to do either of these things, feel free to ask another question.

Related

How can I find the values that are not names in a pandas column?

I'm working with a dataframe of names from my company's databases. My current job is to find whether some of these values, which total more than 3 million, are not names: whether they were wrongly registered, whether the clients' software recorded some strange error values, etc.
Is there a neural network algorithm or other mechanism that I can use to find that?
(Screenshot of some sample values from the column omitted; I want to flag every value that looks noticeably different from these.)
I tried looking at the number of letters in the strings, but it was useless.
Try to post some code from your attempts so others can help you.

How to create a dataframe from a complex (nested) list of dictionaries?

Good morning!
I have retrieved a bunch of data from the Facebook Graph API using the python facebook library. I'd like to organize all the meaningful data neatly in a csv file for later analysis, but the problem is I'm quite new to Python and I don't know how to approach the format in which the data has been retrieved. Basically, I have all the data about a page's posts from 01-05-2020 in a list called data_basic:
Every instance of the list represents one post and is a dict of size 8.
Every dict has: 3 dict elements, 3 string elements, 1 bool element and 1 list element.
For example, in order to access the media_type of the first post I must type: data_basic[0]['attachments']['data'][0]['value'], because inside the dict representing the first post I have a dict containing the attachments whose key 'data' contains a list in which I have the values (for example, again, media_type). A nightmare...
Every instance of the dict containing the post data is different... Attachment is the most nested, but something similar happens for the comments or the tags, while message, created time and so on are much more accessible.
I'd like to obtain a csv table whose rows are the various posts and whose columns are the variables (except, of course, the comments, which I'll store in a different file since there's more than one for each post).
How can I approach the problem? The first thing that comes to my mind is a brute-force approach using a for loop through all the posts and all the variables, filling the dataframe place by place. But I hope there's a quicker and more elegant way... I've come across the json_normalize function and tried something, but I really don't understand how it works and whether it can be of any help... Any thoughts?
Thanks in advance!
edit: a couple of screenshots added to illustrate the structure better
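Since json_normalize is mentioned above, here is a minimal, hedged sketch of that route (the tiny data_basic below is made up to stand in for the real Graph API output, and the field names are only assumptions based on the access pattern described in the question):

import pandas as pd

# Made-up stand-in for the list returned by the Graph API (one post shown).
data_basic = [
    {
        "id": "1",
        "message": "hello",
        "created_time": "2020-05-01T10:00:00+0000",
        "attachments": {"data": [{"media_type": "photo"}]},
    },
]

# json_normalize flattens nested dicts into dotted columns such as
# "attachments.data"; list-valued fields stay as lists in a single cell.
df = pd.json_normalize(data_basic)

# Pull deeply nested values out explicitly, then write the flat table to CSV.
df["media_type"] = [
    post.get("attachments", {}).get("data", [{}])[0].get("media_type")
    for post in data_basic
]
df.drop(columns=["attachments.data"]).to_csv("posts.csv", index=False)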

Transfer data from table to objects (Python)

I am new to programming and would appreciate any advice regarding my assignment. And before you read what follows, please accept my apologies for being so silly.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3/n
1;01.02.2014;0001;1;25.3;24.2;40.0/n
2;01.02.2014;0002;1;22.1;19.1;40.7/n
3;02.02.2014;0003;1;20.1;19.3;44.2/n
4;04.02.2014;0001;2;22.8;16.1;39.3/n
...
The first line contains column names, and every other line represents an observation. In fact there are over 200 columns and about 3000 lines in each .txt file I get. Moreover, every week column names may be slightly different from what they were a week before, and every week the number of observations increases.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to make a program in Python 2.7.6 that would consist of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, and I would like the objects representing observations related to this patient to be attributes of the patient object. My goal here is to make it easy to check whether a patient's parameter increases from checkpoint 1 to checkpoint 2.
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
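One hedged sketch of an answer to questions 1, 2 and 4 (not the asker's code; the file name, delimiter and column names are assumptions based on the sample above): let csv.DictReader turn each line into a dict keyed by the column names, wrap it in a small object, and group the objects by PatientID rather than inventing a variable name per row.

import csv
from collections import defaultdict

class Observation(object):
    """One row of the file; column names become attribute names."""
    def __init__(self, row):
        for column, value in row.items():
            setattr(self, column, value)   # e.g. obs.Checkpoint, obs.Parameter1

patients = defaultdict(list)               # PatientID -> list of Observation objects
with open("week.txt") as f:
    for row in csv.DictReader(f, delimiter=";"):
        obs = Observation(row)
        patients[obs.PatientID].append(obs)

# Example: did Parameter1 increase between checkpoints 1 and 2 for patient "0001"?
by_checkpoint = {o.Checkpoint: float(o.Parameter1) for o in patients["0001"]}
print(by_checkpoint.get("2", 0.0) > by_checkpoint.get("1", 0.0))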

Mapping ID's to Names and Removing Duplicates Algorithmic Development

I have a column in a CSV file that has names such that each cell in that column could be the same as a slightly misspelled cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from each cell if that value occurs more than 2 times in the column as prefixes or suffixes. I then compared one row to the next after sorting alphabetically in this column.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they are not indicated as such. These examples are:
a) Not exact matches (and not off just by capitalization)
b) Not caught by comparing its stem (without prefix and without suffix) to its alphabetical neighbor
My Current Questions
1) Does anyone have experience mapping IDs to names from all over the world (so accents, Unicode and all that stuff are an issue here too, although I managed to solve most of these Unicode issues) and have good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I know why I know they are duplicates. In one instance there is a period in the middle of a line that is not present in its non-period-containing brother cell. Is one good strategy simply to create an extra column and output the cell values that I suspect of being duplicates, based on the few instances where I know why I know it?
3) How do I check myself? One way is to flag the maximum number of potential duplicates and look over all of these manually. Unfortunately, the size of our dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove all the international symbols, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance, e.g. http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
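A minimal sketch of that idea using only the standard library (unicodedata as a crude transliteration step and difflib as a stand-in for Levenshtein distance; the 0.8 threshold and the sample names are placeholders):

import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    # Strip accents and case so "Niké Inc." and "nike inc" compare closely.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().strip()

def similarity(a, b):
    # Rough 0..1 similarity score; 1.0 means identical after normalization.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Flag likely duplicates among alphabetical neighbours (threshold is a guess).
names = ["Nike", "Nike inc.", "Nike Inc", "Adidas"]
ordered = sorted(names, key=normalize)
for a, b in zip(ordered, ordered[1:]):
    if similarity(a, b) > 0.8:
        print("possible duplicate:", a, "<->", b)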

Storing and reloading large multidimensional data sets in Python

I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into in all three approaches, I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table, or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200] , [300, 400]] you would retrieve values by asking for the value at an address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all 500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simplest solution works best ;-)
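For example, a minimal PyTables sketch of that single big table with Blosc compression (the column names, sizes and file name are illustrative assumptions, not the asker's schema):

import tables

class Entry(tables.IsDescription):
    arg1 = tables.StringCol(32)
    arg2 = tables.StringCol(32)
    arg3 = tables.StringCol(16)
    arg4 = tables.StringCol(16)
    arg5 = tables.Int32Col()
    results = tables.Float64Col(shape=(12,))   # the 12 float results per entry

filters = tables.Filters(complevel=5, complib="blosc")   # on-the-fly compression
with tables.open_file("simulations.h5", mode="w") as h5:
    table = h5.create_table("/", "entries", Entry, filters=filters)
    row = table.row
    row["arg1"], row["arg2"], row["arg3"], row["arg4"] = "a", "b", "c", "d"
    row["arg5"] = 1
    row["results"] = [0.0] * 12
    row.append()
    table.flush()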
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single column tables defining the valid values for each of the fields, and then the final table would be a 5 column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
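A rough sqlite3 sketch of that six-table layout (all table and column names are made up for illustration):

import sqlite3

con = sqlite3.connect("results.db")
con.executescript("""
    CREATE TABLE IF NOT EXISTS arg1_values (value TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS arg2_values (value TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS arg3_values (value TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS arg4_values (value TEXT PRIMARY KEY);
    CREATE TABLE IF NOT EXISTS arg5_values (value INTEGER PRIMARY KEY);
    -- the fact table only stores combinations that actually exist
    CREATE TABLE IF NOT EXISTS results (
        arg1 TEXT, arg2 TEXT, arg3 TEXT, arg4 TEXT, arg5 INTEGER,
        r01 REAL, r02 REAL, r03 REAL,   -- ...and so on through r12
        PRIMARY KEY (arg1, arg2, arg3, arg4, arg5)
    );
""")
con.commit()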
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can be used as an efficient multi-dimensional container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used NumPy many times for simulation data processing and it provides many useful tools, including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
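If you do go the NumPy route, here is a tiny sketch of saving and reloading the result arrays with NumPy's own binary format (the shapes and file name are placeholders, not the asker's data):

import numpy as np

# 12 float results per entry, one row per (filled-in) argument combination.
results = np.zeros((1000, 12))
results[0] = np.linspace(0.0, 1.1, 12)

np.save("results.npy", results)      # write to disk
loaded = np.load("results.npy")      # read back later
assert np.array_equal(results, loaded)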
