Mapping ID's to Names and Removing Duplicates Algorithmic Development

Mapping ID's to Names and Removing Duplicates Algorithmic Development - python

I have a column in a CSV file that has names such that each cell in that column could be the same as a slightly misspelled cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from
each cell if that value occurs more than 2 times in the column as prefixes or
suffixes. I then compared one row to the next after sorting alphabetically in
this column.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they
are not indicated as such. These examples are:
a) Not exact matches (and not off just by capitalization)
b) Not caught by comparing its stem (without prefix and without suffix) to
its alphabetical neighbor
My current Questions
1) Does anyone have experience mapping IDs to names from all over the world
(so accents, unicode and all that stuff is an issue here, too, although I managed
to solve most of these unicode issues)
and has good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I know why I
know they are duplicates. In one instance there is a period in the middle of a
line that is not present in its non-period containing brother cell. Is one good
strategy to simply to create an extra column and output cell values that I suspect
of being duplicates based on the few instances where I know why I know it?
3) How do I check myself? One way is to flag the maximum number of potential
duplicates and look over all of these manually. Unfortunately, the size of our
dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!

Try to transliterate the names, to remove all the international symbols, then consider using a function like soundex or http://en.wikipedia.org/wiki/Levenshtein_distance (e.g. http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.

Related

Handling Multiple if/else and Special Cases

So I'm fairly new to coding only having relatively simple scripts here and there when I need them for work. I have a document that has an ID column formatted as:
"Number Word Number" and some values under a spec, lower tol, and upper tol column.
Where sometimes the number under ID is a integer or float and the word can be one of say 30 different possibilities. Ultimately these need to be read and then organized depending on the spec and lower/upper tol columns into something like below:
I'm using Pandas to read the data and do the manipulations I need so my question isn't so much of a how to do it, but more of a how should it best be done.
The way my code is written is basically a series of if statements that handle each of the scenarios I've come across so far, but based on other peoples code I've seen this is generally not done and as I understand considered poor practice. It's very basic if statements like:
if(The ID column has "Note" in it) then its a basic dimension
if(The ID column has Roughness) then its an Ra value
if(The ID column has Position in it) then its a true position etc
Problem is I'm not really sure what the "correct" way to do it would be in terms of making it more efficient and simpler. I have currently a series of 30+ if statements and ways to handle different situations that I've run into so far. Virtually all the code I've written is done in this overly specific and not very general coding methodology that even though it works I find personally overcomplicated but I'm not really sure what capabilities of python/pandas I'm sort of missing and not utilizing to simplify my code.

Since you need to test what the variable in ID is and do some staff accordingly you can't avoid the if statements most probably.What i suggest you to do since you have written the code is to reform the database.If there is not a very specific reason you have database with a structure like this,you should change it asap.
To be specific to ID add an (auto)increment unique number and break the 3 datapoints of ID column into 3 seperate columns.

How can i find the values that are not names in a pandas column?

I'm working with a dataframe of names from the databases of my company. My current job is to find if some of these values, with in total are more than 3 million, are not names. If they were wrongly registrated, if the softwares of clients registered some strange values of error, etc.
Is there a neural network alghoritm or other mechanism that i can use to find that?
[Here are some values of the column. I want to see every value that are kind of different from these1
I tried to see by the number of letters of strings, but it was useless.

Try to post some code of your tries so other can help you

How to make a random name matcher with multiple arguments?

So this is my first year getting into code as a hobby. For my personal side project I want to make a date-matcher (not for a friend haha). This is mainly for me trying to get a better understanding for python structures.
To summarize: People fill 2 lists of names and the matcher returns back a list with random matches. (NO DUPLICATES)
Also, coming with these rules:
1. I want make every 'user'(name) choose between they are (Open, Not Interested, Taken) and match the strings accordingly.
When the are more items in a certain list, left over strings get printed out too
3 [Optional] When users fill in their name, they can fill in a certain 'preference string', making it a higher chance to be matched together with that string.
I'm kinda stuck at the first phase, this is what I have:
import random
VNamen=["Sarah","Annelotte","Kelsey","Mika","Ilse","Yara","Sjouke"]
MNamen=["Kelvin","Xander","Kolten","Ezekiel","Misael","Landon","Noel"]
VR= random.choices(VNamen)
MR= random.choices(MNamen)
print(VR, "together with",MR)
How do I randomly match the strings together?
How do I remove the duplicates in the resulting list
Maybe some suggestions on the rest of the functions above?
I hope someone has the time for this (for me) complicated question!
Greetings,
Quinten

Now, there are things like "re" that i would suggest (like dper did in the comment of your code), but if you want to do it with your own code, would suggest using random.choice(list) after importing random (which you have done) which will chose a random person from that list, do this with both lists, and put them(as in the two given names) together into another list and remove their names from the original lists, do this until one of the lists is empty, then print out everything in the not empty list.
Woah that was a lot of lists...
preference settings would be a little more complicated, you would have to use a list, which goes everywhere the name used to go, and in that list there would be all the information they have, but this way it would be impossible(as far as i am aware) to change the likelihood of getting a certain name.
if you would like me to actually show you it with your code, comment and ask me to do so, but i would suggest giving it a go yourself (if you chose to do it this way that is).

How to choose python pandas arrangement columns vs rows

I am quite new with pandas (couple of months) and I am starting building up a project that will be based on a pandas data array.
Such pandas data array will consist on a table including different kind of words present in a collection of texts (around 100k docs, and around 200 key-words).
imagine for instance the words "car" and the word "motorbike" and documents numbered doc1, doc2 etc.
how should I go about the arrangement?
a) The name of every column is the doc number and the index the words "car" and "motorbike" or
b) the other way around; the index being the docs numbers and the columns head the words?
I don't have enough insights of pandas in order to be able to foreseen what will the consequences of such choice. And all the code will be based on that decision.
As a side note there array is not static, there will be more documents and more words being added to the array every now and again.
what would you recommend? a or b? and why?
thanks.

Generally in pandas, we follow a practice that instances are columns (here doc number) and features are columns (here words). So, prefer to use the approach 'b'.

Prefix search against half a billion strings

I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500mil list which start with 'hi' (for eg. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown, if he types "hey", all strings starting with "hey" will show etc.
I've tried with the Tries algo, but the memory footprint to store 300 mil strings is just huge. It should require me 100GB+ ram for that. And I'm pretty sure the list will grow up to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?

You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.

If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.

If you want to force the user to digit at least 4 letters, for example, you can keep a key-value map, memory or disk, where the keys are all combinations of 4 letters (they are not too many if it is case insensitive, otherwise you can limit to three), and the values are list of positions of all strings that begin with the combination.
After the user has typed the three (or four) letters you have at once all the possible strings. From this point on you just loop on this subset.
On average this subset is small enough, i.e. 500M divided by 26^4...just as example. Actually bigger because probably not all sets of 4 letters can be prefix for your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to the key in the map.

If you doesn't want to use some database, you should create some data related routines pre-existing in all database engines:
Doesn't try to load all data in memory.
Use fixed length for all string. It increase storage memory consumption but significantly decrease seeking time (i-th string can be found at position L*i bytes in file, where L - fixed length). Create additional mechanism to work with extremely long strings: store it in different place and use special pointers.
Sort all of strings. You can use merge sort to do it without load all strings in memory in one time.
Create indexes (address of first line starts with 'a','b',... ) also indexes can be created for 2-grams, 3-grams, etc. Indexes can be placed in memory to increase search speed.
Use advanced strategies to avoid full indexes regeneration on data update: split a data to a number of files by first letters and update only affected indexes, create an empty spaces in data to decrease affect of read-modify-write procedures, create a cache for a new lines before they will be added to main storage and search in this cache.
Use query cache to fast processing a popular requests.

In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addressess allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk-storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be peformed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.

First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version if you are comfortable with python. But to use Apache lucent to solve your problem you should first know about indexing your data(which is the part that your data will be compressed and saved in a more efficient manner).
Also looking to indexing and wildcard query section of IR book will give you a better vision.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.