I'm working with a dataframe of names from my company's databases. My current job is to find whether some of these values, which total more than 3 million, are not actually names: whether they were wrongly registered, whether client software recorded strange error values, and so on.
Is there a neural network algorithm or other mechanism that I can use to find that?
Here are some values of the column; I want to see every value that looks noticeably different from these.
I tried filtering by the number of letters in each string, but that was useless.
Try to post some code from your attempts so others can help you.
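For a starting point, here is a minimal sketch of a simple heuristic filter (not a neural network) that flags values containing digits, odd symbols, or placeholder-like text; the column name, the patterns, and the placeholder list are assumptions, not taken from the actual data:

import re
import pandas as pd

# Hypothetical column name; adjust to the real dataframe.
df = pd.DataFrame({"name": ["Maria Silva", "JOAO#2", "xxxx", "Ana"]})

# Characters that usually do not appear in names (digits, most symbols); an assumption.
suspicious = re.compile(r"\d|[^\w\s'.\-]", re.UNICODE)
placeholders = {"test", "null", "none", "n/a", "error", "xxx"}  # also an assumption

def looks_suspicious(value):
    v = str(value).strip().lower()
    if not v or v in placeholders:
        return True
    if suspicious.search(v):                 # digits or odd symbols
        return True
    if len(set(v.replace(" ", ""))) <= 2:    # e.g. "aaaa" or "...."
        return True
    return False

flagged = df[df["name"].map(looks_suspicious)]
print(flagged)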
Good morning!
I have retrieved a bunch of data from the Facebook Graph API using the Python facebook library. I'd like to organize all the meaningful data neatly in a CSV file for later analysis, but the problem is that I'm quite new to Python and I don't know how to approach the format in which the data has been retrieved. Basically, I have all the data about a page's posts from 01-05-2020 in a list called data_basic:
Every instance of the list represents one post and is a dict of size 8.
Every dict has: 3 dict elements, 3 string elements, 1 bool element and 1 list element.
For example, in order to access the media_type of the first post I must type: data_basic[0]['attachments']['data'][0]['value'], because inside the dict representing the first post I have a dict containing the attachments whose key 'data' contains a list in which I have the values (for example, again, media_type). A nightmare...
Every instance of the dict containing the post data is different... Attachment is the most nested, but something similar happens for the comments or the tags, while message, created time and so on are much more accessible.
I'd like to obtain a csv table whose rows are the various posts and whose columns are the variables (except, of course, the comments, which I'll store in a different file since there's more than one for each post).
How can I approach the problem? The first thing that comes to mind is a brute-force approach: a for loop through all the posts and all the variables, filling the dataframe cell by cell. But I hope there's a quicker and more elegant way... I've come across the json_normalize function and tried a few things, but I really don't understand how it works or whether it can be of any help... Any thoughts?
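For what it's worth, here is a hedged sketch of how pandas.json_normalize (pandas 1.0+; older versions keep it in pandas.io.json) can flatten this kind of structure. The field names below are stand-ins based on the description, not the exact Graph API keys:

import pandas as pd

# Tiny stand-in for data_basic with made-up values, just to show the mechanics.
data_basic = [
    {
        "id": "1",
        "message": "hello",
        "created_time": "2020-05-01T10:00:00+0000",
        "attachments": {"data": [{"media_type": "photo"}]},
    }
]

# record_path walks into the nested list of attachments; meta keeps the
# top-level fields alongside each attachment row; errors="ignore" tolerates
# posts that are missing one of the meta keys.
flat = pd.json_normalize(
    data_basic,
    record_path=["attachments", "data"],
    meta=["id", "message", "created_time"],
    errors="ignore",
)
print(flat.columns.tolist())
flat.to_csv("posts.csv", index=False)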
Thanks in advance!
edit: a couple of screenshots to help understand the structure better
I am new to programming and would appreciate any advice regarding my assignment. Before you read what follows, please accept my apologies for being so silly.
Context
Every week I receive several .txt documents. Each document goes something like this:
№;Date;PatientID;Checkpoint;Parameter1;Parameter2;Parameter3/n
1;01.02.2014;0001;1;25.3;24.2;40.0/n
2;01.02.2014;0002;1;22.1;19.1;40.7/n
3;02.02.2014;0003;1;20.1;19.3;44.2/n
4;04.02.2014;0001;2;22.8;16.1;39.3/n
...
The first line contains column names, and every other line represents an observation. In fact there are over 200 columns and about 3000 lines in each .txt file I get. Moreover, every week column names may be slightly different from what they were a week before, and every week the number of observations increases.
My job is to select the observations that satisfy certain parameter requirements and build boxplots for some of the parameters.
What I think I should do
I want to make a program in Python 2.7.6 that would consist of four parts.
Code that would turn every observation into an object, so that I can access attributes like this:
obs1.checkpoint = 1
obs4.patientid = "0001"
I literally want column names to become attribute names.
Having done this, it would be nice to create an object for every unique PatientID, and I would like the objects representing a patient's observations to be attributes of that patient object. My goal here is to make it easy to check whether a patient's parameter increases from checkpoint 1 to checkpoint 2.
Code that would select the observations I need.
Code that would build boxplots.
Code that would combine the three parts above into one program.
What I've found so far
I have found some working code that dynamically adds attributes to instances:
http://znasibov.info/blog/html/2010/03/10/python-classes-dynamic-properties.html
I'm afraid I don't fully understand how it works yet, but I think it might be of use in my case to turn column names into attribute names.
I have also found that creating variables dynamically is frowned upon:
http://nedbatchelder.com/blog/201112/keep_data_out_of_your_variable_names.html
Questions
Is it a good idea to turn every line in the table into an object?
If so, how do I go about it?
How do I create as many objects as there are lines in the table, and how do I name these objects?
What is the best way to turn column names into class attributes?
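A minimal sketch of one way to answer these, using csv.DictReader so that column names become attribute names via setattr, and a dict keyed by PatientID instead of individually named objects (the file name is an assumption; the column names follow the sample above):

import csv
from collections import defaultdict

class Observation(object):
    """One line of the file; column names become attribute names."""
    def __init__(self, row):
        for column, value in row.items():
            # Normalise the header so it is a usable identifier,
            # e.g. "PatientID" -> obs.patientid
            setattr(self, column.strip().lower().replace(" ", "_"), value)

patients = defaultdict(list)   # PatientID -> list of Observation objects

with open("week.txt") as f:    # file name is an assumption
    for row in csv.DictReader(f, delimiter=";"):
        obs = Observation(row)
        patients[obs.patientid].append(obs)

# Example check: did Parameter1 increase from checkpoint 1 to checkpoint 2?
for patient_id, observations in patients.items():
    by_checkpoint = dict((o.checkpoint, o) for o in observations)
    if "1" in by_checkpoint and "2" in by_checkpoint:
        increased = (float(by_checkpoint["2"].parameter1) >
                     float(by_checkpoint["1"].parameter1))
        print("%s: %s" % (patient_id, increased))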
I have a column of names in a CSV file where a given cell may be a slightly misspelled version of another cell. For example, "Nike" could be the same as "Nike inc.", which could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from each cell if that value occurs more than 2 times in the column as prefixes or suffixes. I then compared one row to the next after sorting alphabetically in this column.
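For reference, a stripped-down sketch of that normalize-then-compare-neighbours idea (the affix list, file name, and column name are assumptions):

import csv

COMMON_AFFIXES = {"inc", "inc.", "ltd", "ltd.", "co", "co.", "corp", "corp."}  # assumption

def stem(name):
    # Drop leading/trailing tokens that look like common prefixes/suffixes.
    tokens = name.lower().split()
    while tokens and tokens[0] in COMMON_AFFIXES:
        tokens.pop(0)
    while tokens and tokens[-1] in COMMON_AFFIXES:
        tokens.pop()
    return " ".join(tokens)

# File and column name are assumptions.
with open("names.csv") as f:
    names = [row["name"] for row in csv.DictReader(f)]

# Sort by stem, then compare each entry to its alphabetical neighbour.
pairs = sorted((stem(n), n) for n in names)
for (stem_a, name_a), (stem_b, name_b) in zip(pairs, pairs[1:]):
    if stem_a == stem_b:
        print("possible duplicate: %r ~ %r" % (name_a, name_b))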
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they are not indicated as such. These examples are:
a) Not exact matches (and not off just by capitalization)
b) Not caught by comparing its stem (without prefix and without suffix) to its alphabetical neighbor
My current Questions
1) Does anyone have experience mapping IDs to names from all over the world (so accents, Unicode and all that are an issue here too, although I managed to solve most of those Unicode issues) and have good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I can see why they are duplicates. In one instance there is a period in the middle of a line that is not present in its period-free counterpart cell. Is one good strategy to simply create an extra column and output the cell values that I suspect of being duplicates, based on the few instances where I can see why?
3) How do I check myself? One way is to flag the maximum number of potential duplicates and look over all of these manually. Unfortunately, the size of our dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove all the international symbols, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance; e.g. http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
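A small sketch of that suggestion, using the third-party unidecode package for transliteration and the standard-library difflib as a stand-in for Levenshtein-style similarity (both choices are assumptions; any edit-distance or phonetic library would do):

import difflib
from unidecode import unidecode   # third-party: pip install unidecode

def normalize(name):
    # Transliterate accented characters and lower-case, e.g. "Müller" -> "muller".
    return unidecode(name).lower().strip()

def similarity(a, b):
    # Ratio in [0, 1]; swap in python-Levenshtein or a Soundex library if preferred.
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("Nike Inc.", "Nike inc"))   # close to 1.0
print(similarity("Nike", "Adidas"))          # much lower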
I'm going to be running a large number of simulations producing a large amount of data that needs to be stored and accessed again later. Output data from my simulation program is written to text files (one per simulation). I plan on writing a Python program that reads these text files and then stores the data in a format more convenient for analyzing later. After quite a bit of searching, I think I'm suffering from information overload, so I'm putting this question to Stack Overflow for some advice. Here are the details:
My data will basically take the form of a multidimensional array where each entry will look something like this:
data[ stringArg1, stringArg2, stringArg3, stringArg4, intArg1 ] = [ floatResult01, floatResult02, ..., floatResult12 ]
Each argument has roughly the following numbers of potential values:
stringArg1: 50
stringArg2: 20
stringArg3: 6
stringArg4: 24
intArg1: 10,000
Note, however, that the data set will be sparse. For example, for a given value of stringArg1, only about 16 values of stringArg2 will be filled in. Also, for a given combination of (stringArg1, stringArg2) roughly 5000 values of intArg1 will be filled in. The 3rd and 4th string arguments are always completely filled.
So, with these numbers my array will have roughly 50*16*6*24*5000 = 576,000,000 result lists.
I'm looking for the best way to store this array such that I can save it and reopen it later to either add more data, update existing data, or query existing data for analysis. Thus far I've looked into three different approaches:
a relational database
PyTables
Python dictionary that uses tuples as the dictionary keys (using pickle to save & reload)
There's one issue I run into with all three approaches: I always end up storing every tuple combination of (stringArg1, stringArg2, stringArg3, stringArg4, intArg1), either as a field in a table or as the keys in the Python dictionary. From my (possibly naive) point of view, it seems like this shouldn't be necessary. If these were all integer arguments, then they would just form the address of each data entry in the array, and there wouldn't be any need to store all the potential address combinations in a separate field. For example, if I had a 2x2 array = [[100, 200], [300, 400]], you would retrieve values by asking for the value at address array[0][1]. You wouldn't need to store all the possible address tuples (0,0) (0,1) (1,0) (1,1) somewhere else. So I'm hoping to find a way around this.
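To make that addressing idea concrete, here is a rough sketch of mapping each string value to an integer code up front, so the codes rather than the strings become the array address (the value lists below are placeholders, and the sparse intArg1 dimension would still need some keyed or indexed store):

import numpy as np

# Placeholder value lists; in practice these come from the simulation setup.
string1_values = sorted(["fast", "medium", "slow"])
string3_values = sorted(["a", "b", "c", "d", "e", "f"])

# Each distinct string gets an integer code once; only these small code
# tables need to be stored alongside the data.
s1_code = {value: i for i, value in enumerate(string1_values)}
s3_code = {value: i for i, value in enumerate(string3_values)}

# A dense array over just these two small, fully filled dimensions,
# with 12 result floats per cell.
results = np.full((len(string1_values), len(string3_values), 12), np.nan)

results[s1_code["fast"], s3_code["c"]] = np.arange(12)   # store one result list
print(results[s1_code["fast"], s3_code["c"]])            # retrieve it by address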
What I would love to be able to do is define a table in PyTables, where cells in this first table contain other tables. For example, the top-level tables would have two columns. Entries in the first column would be the possible values of stringArg1. Each entry in the second column would be a table. These sub-tables would then have two columns, the first being all the possible values of stringArg2, the second being another column of sub-sub-tables...
That kind of solution would be straightforward to browse and query (particularly if I could use ViTables to browse the data). The problem is PyTables doesn't seem to support having the cells of one table contain other tables. So I seem to have hit a dead end there.
I've been reading up on data warehousing and the star schema approach, but it still seems like your fact table would need to contain tuples of every possible argument combination.
Okay, so that's pretty much where I am. Any and all advice would be very much appreciated. At this point I've been searching around so much that my brain hurts. I figure it's time to ask the experts.
Why not use one big table to keep all ~500 million entries? If you use on-the-fly compression (the Blosc compressor is recommended here), most of the duplicated entries will be deduplicated, so the storage overhead is kept to a minimum. I'd recommend giving this a try; sometimes the simple solution works best ;-)
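Roughly what that big compressed table could look like in PyTables (the table layout, column sizes, and file name are assumptions; the Filters line is where Blosc comes in):

import tables

class Result(tables.IsDescription):
    # One row per filled-in combination; the 12 result floats live together.
    string_arg1 = tables.StringCol(32)
    string_arg2 = tables.StringCol(32)
    string_arg3 = tables.StringCol(32)
    string_arg4 = tables.StringCol(32)
    int_arg1    = tables.Int32Col()
    results     = tables.Float64Col(shape=(12,))

filters = tables.Filters(complevel=5, complib="blosc")   # on-the-fly compression
with tables.open_file("simulations.h5", mode="w", filters=filters) as h5:
    table = h5.create_table("/", "results", Result, "simulation results")
    row = table.row
    row["string_arg1"] = "foo"        # made-up example values
    row["string_arg2"] = "bar"
    row["string_arg3"] = "baz"
    row["string_arg4"] = "qux"
    row["int_arg1"]    = 4071
    row["results"]     = [0.0] * 12
    row.append()
    table.flush()

    # In-kernel query over the compressed table.
    hits = table.read_where("int_arg1 == 4071")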
Is there a reason the basic 6 table approach doesn't apply?
i.e. Tables 1-5 would be single-column tables defining the valid values for each of the fields, and then the final table would be a 5-column table defining the entries that actually exist.
Alternatively, if every value always exists for the 3rd and 4th string values as you describe, the 6th table could just consist of 3 columns (string1, string2, int1) and you generate the combinations with string3 and string4 dynamically via a Cartesian join.
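As a sketch of that layout using SQLite (column names are made up, and only two of the five lookup tables are shown to keep it short):

import sqlite3

conn = sqlite3.connect("simulations.db")
cur = conn.cursor()

# Lookup tables: one row per valid value of each argument.
cur.execute("CREATE TABLE IF NOT EXISTS string_arg1 (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
cur.execute("CREATE TABLE IF NOT EXISTS string_arg2 (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
# ... same idea for string_arg3, string_arg4 and int_arg1.

# Fact table: one row per combination that actually exists, plus the results.
cur.execute("""
CREATE TABLE IF NOT EXISTS results (
    s1_id INTEGER REFERENCES string_arg1(id),
    s2_id INTEGER REFERENCES string_arg2(id),
    s3 TEXT, s4 TEXT, int_arg1 INTEGER,
    r01 REAL, r02 REAL, r03 REAL,          -- ... through r12
    PRIMARY KEY (s1_id, s2_id, s3, s4, int_arg1)
)
""")
conn.commit()
conn.close()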
I'm not entirely sure what you're trying to do here, but it looks like you're trying to create a (potentially) sparse multidimensional array. So I won't go into details for solving your specific problem, but the best package I know of that deals with this is NumPy. NumPy can
be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
I've used Numpy many times for simulation data processing and it provides many useful tools including easy file storage/access.
Hopefully you'll find something in its very easy-to-read documentation:
Numpy Documentation with Examples
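If you go the NumPy route, one hedged option is a structured array that holds only the filled-in entries rather than the full sparse hypercube; a minimal sketch (field names and string sizes are assumptions):

import numpy as np

# One record per filled-in combination: the five arguments plus 12 result floats.
dtype = np.dtype([
    ("s1", "U32"), ("s2", "U32"), ("s3", "U32"), ("s4", "U32"),
    ("i1", "i4"), ("results", "f8", (12,)),
])

records = np.zeros(2, dtype=dtype)            # tiny example with 2 records
records[0] = ("foo", "bar", "baz", "qux", 4071, np.arange(12.0))

# Boolean-mask queries work directly on the named fields.
mask = (records["s1"] == "foo") & (records["i1"] == 4071)
print(records["results"][mask])

np.save("results.npy", records)               # simple file storage...
loaded = np.load("results.npy")               # ...and reload for later analysis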