I am very new to Python and I'd like some advice on where to start and what to learn.
I've got a fantasy name generator (it joins randomly picked letters), which every now and then creates an acceptable name. What I'd like to do, though, is train an AI to generate names that aren't, let's say, just consonants, and ultimately to generate human, elvish, dwarfish, etc. names.
I'd appreciate any advice in this matter.
Edit:
My idea is: I get a string of letters; if it resembles a name, I approve it, and if not, I reject it. That creates a dataset of True/False values which can be used in machine learning. At least that's what I am hoping for; as I said, I am new to programming.
Again, I don't mind learning, but where do I begin?
Single characters are not really a good fit for this, as there are combinatorial restrictions on which letters can be combined into larger sequences. It is much easier to drop single letters and instead move on to bi-grams, tri-grams, or syllables. It doesn't really matter which you choose, as long as they can combine freely.
You need to come up with an inventory of elements which comply with the rules of your language; you can collect those from text samples in the language you are aiming for.
In the simplest case, get a list of names like the ones you want to generate, and collect three-letter sequences from that, preferably with their frequency count. Or simply make some up:
For example, if you have a language with a syllabic structure where you always have a consonant followed by a vowel, then by combining elements which are a consonant followed by a vowel you will always end up with valid names.
Then pick 2 to 5 (or however long you want your names to be) elements randomly from that inventory, perhaps guided by their frequency.
You could also add in a filter to remove those with unsuitable letter combinations (at the element boundaries) afterwards. Or go through the element list and remove invalid ones (eg any ending in 'q' -- either drop them, or add a 'u' to them).
Depending on what inventory you're using, you can simulate different languages/cultures for your names, as languages differ in their phonological structures.
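To make the steps above concrete, here is a minimal sketch in Python. The function names (build_inventory, make_name) and the sample names are made up for illustration. It collects three-letter sequences with their counts and then joins 2 to 5 of them, weighted by frequency. Trigrams cut out of real names will not always join cleanly, so a boundary filter like the one mentioned above may still be needed.

```python
import random
from collections import Counter

def build_inventory(names, n=3):
    """Count every n-letter sequence found in the sample names."""
    counts = Counter()
    for name in names:
        name = name.lower()
        for i in range(len(name) - n + 1):
            counts[name[i:i + n]] += 1
    return counts

def make_name(inventory, min_parts=2, max_parts=5, rng=random):
    """Join a random number of elements, picked in proportion to their frequency."""
    elements = list(inventory)
    weights = [inventory[e] for e in elements]
    k = rng.randint(min_parts, max_parts)
    parts = rng.choices(elements, weights=weights, k=k)
    return "".join(parts).capitalize()

# Hypothetical sample list; swap in names from the culture you want to imitate.
inventory = build_inventory(["Galadriel", "Legolas", "Elrond", "Celeborn"])
print(make_name(inventory))
```

Feeding it a different list of sample names gives a different inventory, which is how you would get a different "language".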
I am writing a web crawler in Python that downloads a list of URLs, extracts all visible text from the HTML, tokenizes the text (using nltk.tokenize) and then creates a positional inverted index of words in each document for use by a search feature.
However, right now, the index contains a bunch of useless entries like:
1) //roarmag.org/2015/08/water-conflict-turkey-middle-east/
2) ———-
3) ykgnwym+ccybj9z1cgzqovrzu9cni0yf7yycim6ttmjqroz3wwuxiseulphetnu2
4) iazl+xcmwzc3da==
Some of these, like #1, are where URLs appear in the text. Some, like #3, are excerpts from PGP keys, or other random data that is embedded in the text.
I am trying to understand how to filter out useless data like this. But I don't just want to keep words that I would find in an English dictionary, but also things like names, places, nonsense words like "Jabberwocky" or "Rumpelstiltskin", acronyms like "TANSTAAFL", obscure technical/scientific terms, etc ...
That is, I'm looking for a way to heuristically strip out strings that are "gibberish": (1) exceedingly long, (2) filled with a bunch of punctuation, or (3) composed of random strings of characters like afhdkhfadhkjasdhfkldashfkjahsdkfhdsakfhsadhfasdhfadskhkf ... I understand that there is no way to do this with 100% accuracy, but if I could remove even 75% of the junk I'd be happy.
Are there any techniques that I can use to separate "words" from junk data like this?
Excessively long words are trivial to filter. It's pretty easy to filter out URLs, too. I don't know about Python, but other languages have libraries you can use to determine if something is a relative or absolute URL. Or you could just use your "strings with punctuation" filter to filter out anything that contains a slash.
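In Python, the cheap filters described above could look roughly like this. The function name and both thresholds are arbitrary choices of mine, not tuned values, and "punctuation" is approximated as any non-alphanumeric character.

```python
def looks_like_junk(token, max_len=25, max_symbol_ratio=0.2):
    """Cheap first-pass filter: too long, contains a slash, or mostly symbols."""
    if len(token) > max_len:
        return True
    if "/" in token:  # crude URL / path check
        return True
    symbols = sum(not ch.isalnum() for ch in token)
    return symbols / max(len(token), 1) > max_symbol_ratio

for token in ["Rumpelstiltskin", "TANSTAAFL", "———-",
              "//roarmag.org/2015/08/water-conflict-turkey-middle-east/"]:
    print(token, looks_like_junk(token))
```

Short base64-like fragments with little punctuation will still slip through this; those are the cases that need a language model.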
Words are trickier, but you can do a good job with n-gram language models. Basically, you build or obtain a language model, and run each string through the model to determine the likelihood of that string being a word in the particular language. For example, "Rumpelstiltskin" will have a much higher likelihood of being an English word than, say, "xqjzipdg".
See https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark for a trained model that might be useful to you in determining if a string is an actual word in some language.
See also NLTK and language detection.
I have a database table containing well over a million strings. Each string is a term that can vary in length from two words to five or six.
["big giant cars", "zebra videos", "hotels in rio de janeiro".......]
I also have a blacklist of several thousand smaller terms in a csv file. What I want to do is identify terms in the database that are similar to the blacklisted terms in my csv file. Similarity in this case can be construed as misspellings of the blacklisted terms.
I am familiar with libraries in Python such as fuzzywuzzy that can assess string similarity using Levenshtein distance and return an integer representation of the similarity. An example from this tutorial would be:
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS") ⇒ 96
A downside with this approach would be that it may falsely identify terms that may mean something in a different context.
A simple example of this would be "big butt", a blacklisted string, being confused with a more innocent string like "big but".
My question is, is it programmatically possible in python to accomplish this or would it be easier to just retrieve all the similar looking keywords and filter for false positives?
I'm not sure there's any definitive answer to this problem, so the best I can do is to explain how I'd approach this problem, and hopefully you'll be able to get any ideas from my ramblings. :-)
First.
On an unrelated angle, fuzzy string matching might not be enough. People are going to be using similar-looking characters and non-character symbols to get around any text matches, to the point where there's nearly zero match between a blacklisted word and actual text, and yet it's still readable for what it is. So perhaps you will need some normalization of your dictionary and search text, like converting all '0' (zeroes) to 'O' (capital O), '><' to 'X' etc. I believe there are libraries and/or conversion references to that purpose. Non-latin symbols are also a distinct possibility and should be accounted for.
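A rough sketch of that kind of normalization, using only the standard library. The look-alike table is a tiny made-up sample, not a complete reference, and I fold everything to lowercase rather than to capitals, which works just as well as long as the blacklist is normalized the same way.

```python
import unicodedata

# Illustrative only: a real table would be much larger.
LOOKALIKES = {"0": "o", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"}
MULTI_CHAR = {"><": "x"}

def normalize(text):
    """Fold accents, case, and a few look-alike symbols to plain letters."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    for src, dst in MULTI_CHAR.items():
        text = text.replace(src, dst)
    return text.translate(str.maketrans(LOOKALIKES))

print(normalize("H0T3L"))
```

Run both the blacklist and the incoming text through the same function before doing any fuzzy matching.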
Second.
I don't think you'll be able to differentiate between blacklisted words and similar-looking legal variants in a single pass. So yes, most likely you will have to search for possible blacklisted matches and then check if what you found matches some legal words too. Which means you will need not only the blacklisted dictionary, but a whitelisted dictionary as well. On a more positive note, there's probably no need to normalize the whitelisted dictionary, as people who're writing acceptable text are probably going to write it in acceptable language without any tricks outlined above. Or you could normalize it if you're feeling paranoid. :-)
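Here is a small sketch of that two-step check. I'm using difflib from the standard library as a stand-in for fuzzywuzzy so the example is self-contained. The classify function, its three labels, and the 0.85 threshold are my own assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def classify(term, blacklist, whitelist, threshold=0.85):
    # Step 1: does the term look like anything on the blacklist?
    black = max((similarity(term, b) for b in blacklist), default=0.0)
    if black < threshold:
        return "allow"
    # Step 2: does it look at least as much like a legal term?
    white = max((similarity(term, w) for w in whitelist), default=0.0)
    if white >= black:
        return "review"  # ambiguous: needs context or a human
    return "block"

blacklist = {"big butt"}
whitelist = {"big but"}
for term in ["big butt", "big but", "zebra videos"]:
    print(term, "->", classify(term, blacklist, whitelist))
```

With over a million terms and several thousand blacklist entries this brute-force loop will be slow, so treat it as an illustration of the logic rather than something to run as is.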
Third.
However, the problem is that matching words/expressions against black and white lists doesn't actually give you a reliable answer. Using your example, a person might write "big butt" as an honest typo which will be obvious in context (or vice versa, write "big but" intentionally to get a higher match against a whitelisted word, even if context makes it quite obvious what the real meaning is). So you might have to actually check the context in case there are good enough matches against both black and white lists. This is an area I'm not intimately familiar with. It's probably possible to build correlation maps for various words (from both dictionaries) to identify what words are more (or less) frequently used in combination with them, and use them to check your specific example. Using this very paragraph as an example, the word "black" could be whitelisted if it's used together with "list" but blacklisted in some other situations.
Fourth.
Even applying all those measures together you might want to leave a certain amount of gray area. That is, unless there's a high enough certainty in either direction, leave the final decision for a human (screening comments/posts for a time, automatically putting them into moderation queue, or whatever else your project dictates).
Fifth.
You might try to dabble in learning algorithms, collecting human input from previous step and using it to automatically fine-tune your algorithm as time goes by.
Hope that helps. :-)
I'm currently trying to solve the hard Challenge #151 on reddit with an unusual method, a genetic algorithm.
In short, after separating a string into consonants and vowels and removing spaces, I need to put it back together without knowing which character comes first.
hello world is separated into hllwrld and eoo and needs to be put together again. One solution, for example, would be hlelworlod, but that doesn't make much sense. The exhaustive approach that takes all possible solutions works, but isn't feasible for longer problem sets.
What I already have
A database with the frequency of English words
An algorithm that constructs a relative cost database using Zipf's law and can consistently separate words from sentences without spaces (borrowed from this question/answer)
A method that puts consonants and vowels into a stack and randomly takes a character from either one and encodes this in a string that consists of 1 and 2, effectively encoding the construction in a gene. The correct gene for the example would be 1211212111
A method that mutates such a string, randomly swapping characters around
What I tried
Generating 500 random sequences, using the infer_spaces() method, evaluating fitness as the cost of all the words, then taking the best 25% and mutating 4 new ones from those works for small strings, but falls into local minima very often, especially for longer sequences. Hello World is found already in the first generation, while thisisnotworkingverygood (which, correctly separated, has a cost of 41.223) converges to th iss n ti wo or king v rye good (270 cost) already in the second generation.
What I need
Clearly, using the calculated cost as an evaluation method only works for separating sentences that are grammatically correct, not for this genetic algorithm. Do you have better ideas I could try? Or is another part of the solution, for example the representation of the gene, the problem?
I would simplify the problem into two parts,
Finding candidate words to split the string into (so hllwrld => hll wrld)
How to then expand those words by adding vowels.
I would first take your dictionary of word frequencies, and process it to create a second list of words without vowels, along with a list of the possible vowel list for each collapsed word (and the associated frequency). You technically don't need a GA to solve this (and I think it would be easier to solve without one), but as you asked, I will provide 2 answers:
Without GA: you should be able to solve the first problem using a depth first search, matching substrings of the word against that dictionary, and doing so with the remaining word parts, only accepting partitions of the word into words (without vowels) where all words are in the second dictionary. Then you have to substitute in the vowels. Given that second dictionary, and the partition you already have, this should be easy. You can also use the list of vowels to further constrain the partitioning, as valid words in the partitions can only be made whole using vowels from the vowel list that is input into the algorithm. Starting at the left hand side of the string and iterating over all valid partitions in a depth first manner should solve this problem relatively quickly.
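A sketch of the depth-first approach in Python, under a few assumptions of mine: input is lowercase, "y" counts as a consonant, the vowels are given in their original order, and the word list here is a toy stand-in for your frequency dictionary. All function names are made up.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def skeleton(word):
    """The word with its vowels removed."""
    return "".join(ch for ch in word if ch not in VOWELS)

def vowels_of(word):
    """The vowels of the word, in order."""
    return "".join(ch for ch in word if ch in VOWELS)

def build_index(words):
    """Map each vowel-less skeleton to the words that collapse to it."""
    index = defaultdict(list)
    for w in words:
        index[skeleton(w)].append(w)
    return index

def solve(consonants, vowels, index):
    """Yield every sentence that uses up both strings, left to right."""
    def dfs(ci, vi, chosen):
        if ci == len(consonants) and vi == len(vowels):
            yield " ".join(chosen)
            return
        for end in range(ci, len(consonants) + 1):
            for word in index.get(consonants[ci:end], ()):
                v = vowels_of(word)
                # Only accept the word if its vowels come next in the vowel list.
                if vowels.startswith(v, vi):
                    yield from dfs(end, vi + len(v), chosen + [word])
    yield from dfs(0, 0, [])

index = build_index(["hello", "world", "hall", "weird", "a"])
print(list(solve("hllwrld", "eoo", index)))
```

On real input this returns many candidate sentences, so you would still rank them by word frequency afterwards.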
With GA: To solve this with a GA, I would create the dictionary of words without vowels. Then using the GA, create binary strings (as your chromosomes) of the same length as the input string of consonants, where a 1 = split a word at that position, and 0 = leave unchanged. These strings will all be the same length. Then create a fitness function that returns the proportion of words obtained after performing a split using the chromosome that are valid words without vowels, according to that dictionary. Create a second fitness function that takes the valid no-vowel words, and computes the proportion of overlap between the vowels missing in all these valid no-vowel words, and the original vowel list. Combine both fitness functions into one by multiplying the value from the first one by ten (assuming the second one returns a value between 0 and 1). That will force the algorithm to focus on the segmentation problem first and the vowel insertion problem second, and among segmentations of the same quality it will prefer those whose set of missing vowels is closer to the original list. I would also include crossover in the solution. As all your chromosomes are the same length, this should be trivial. Once you have a solution that scores perfectly on the fitness function, it should be trivial to recreate the original sentence given that dictionary of words without vowels (provided you maintain a second dictionary that lists the possible missing vowel sets for each non-vowel word; there could be several for each, as different words can collapse to the same vowel-less form).
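The combined fitness function could be sketched like this. The names are mine, I'm assuming a 1 means "split after this character", and for simplicity the vowel overlap only looks at the first vowel set listed for each skeleton; a fuller version would try all of them.

```python
from collections import Counter

def split_by_chromosome(consonants, chromosome):
    """Cut the consonant string after every position whose bit is '1'."""
    pieces, current = [], ""
    for ch, bit in zip(consonants, chromosome):
        current += ch
        if bit == "1":
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces

def fitness(consonants, vowels, chromosome, index):
    """10 * (share of valid pieces) + (share of the vowel list accounted for)."""
    pieces = split_by_chromosome(consonants, chromosome)
    valid = [p for p in pieces if p in index]
    segmentation = len(valid) / len(pieces)
    # Simplification: use only the first vowel set stored for each skeleton.
    missing = Counter("".join(index[p][0] for p in valid))
    overlap = sum((missing & Counter(vowels)).values()) / max(len(vowels), 1)
    return 10 * segmentation + overlap

# Hypothetical index: skeleton -> list of possible vowel strings.
index = {"hll": ["eo", "a"], "wrld": ["o"]}
print(fitness("hllwrld", "eoo", "0010000", index))
```

Because every chromosome has the same length, single-point crossover is just slicing two parent strings at the same index and swapping the tails.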
Let's say you have several generations and you plot the cost for the best specimen in each generation (we consider long sentence). Does this graph go down or converges after 2-3 generations to a specific value (let the algorithm run for example for 10 generations)? Can you run your algorithm several times with various initial conditions (random sequences) and see whether you get good results sometimes or not?
Depending on the results, you may try the following (this graph is a really good tool for improving the performance):
1) If you have a graph that goes up and down too much all the time - you have too much mutation (average number of swaps per gene for example), try to decrease it.
2) If you get stuck in a local minimum (the cost of the best specimen doesn't change much after some time), try to increase mutation, or run several isolated populations (3-4) of, let's say, 100 specimens at the beginning of your algorithm for a few generations. Then select the best population (the one closest to the global minimum) and try to improve it as much as possible through mutation.
PS: By the way, interesting problem. I tried to figure out how to use crossover to improve the algorithm but haven't managed to.
The fitness function is the key to the success of GA algorithm ( Which I kind of agree is suitable here ).
I agree with @Simon that the vowel/non-vowel separation is not that important. Just strip your text corpus to remove the vowels.
what is important in the fitness:
matched word frequency ( frequent words better )
grammar - structure of the sentence ( which you might need to use NLTK to get related information )
and don't forget to update the end result ^^
I'm not sure how exactly to word this question, so here's an example:
string1 = "THEQUICKBROWNFOX"
string2 = "KLJHQKJBKJBHJBJLSDFD"
I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.
In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Bayes() Python library and trained it with some all-caps strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...
Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.
Thanks for the help!
I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.
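A minimal sketch of that model in Python, conditioning each letter on the one before it. The function names, the "^" start marker, and the add-one smoothing (so unseen pairs don't zero out the product) are my own additions. The Nth root is taken in log space to avoid underflow on long strings.

```python
import math
from collections import Counter, defaultdict

START = "^"  # stands for "no previous letter"

def train(words):
    """Count how often each letter follows each previous letter."""
    counts = defaultdict(Counter)
    for word in words:
        prev = START
        for ch in word.upper():
            counts[prev][ch] += 1
            prev = ch
    return counts

def score(text, counts, alphabet_size=26):
    """Geometric mean of the per-letter probabilities (the Nth root of the product)."""
    log_p, prev = 0.0, START
    for ch in text.upper():
        seen = counts.get(prev, Counter())
        # Add-one smoothing: an assumption, not part of the description above.
        p = (seen[ch] + 1) / (sum(seen.values()) + alphabet_size)
        log_p += math.log(p)
        prev = ch
    return math.exp(log_p / max(len(text), 1))

# Toy training list; a real dictionary file would go here.
model = train(["the", "then", "there", "this", "that", "quick", "brown", "fox"])
print(score("THE", model), score("XQJ", model))
```

If speed matters, precompute the log-probabilities into a 27x26 table once, so that scoring a string is just table lookups and additions.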
I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.
Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary with not only single words, but short phrases with the spaces removed as well. Then the scoring would be based not only on how English the separate words are, but also on how English series of words are. It's not a perfect system, but it would provide consistent scoring.
I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).
The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.
For a start just compute the difference of the IC to the expected value of 1.73 (see Wikipedia above). For an advanced usage you might want to calculate the expected value yourself using some example language corpus.
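For reference, a short Python sketch of the normalized IC, i.e. the version where uniformly random letters come out near 1.0. The function name is mine, and non-letters are simply ignored.

```python
from collections import Counter

def index_of_coincidence(text, alphabet_size=26):
    """Normalized IC: c * sum(n_i * (n_i - 1)) / (N * (N - 1))."""
    letters = [ch for ch in text.upper() if "A" <= ch <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return alphabet_size * sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

ic = index_of_coincidence("THEQUICKBROWNFOX")
print(ic, abs(ic - 1.73))
```

Note that the IC only looks at letter counts, not letter order, so on strings as short as the examples in the question it is a weak signal on its own.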
I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."
The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.
Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it just was a highly ranked search result, and I don't really have expertise in this area.)
I'm looking for a reasonably simple algorithm to determine how difficult it is to type a word on the QWERTY layout.
The words would not necessarily be dictionary words, so a list of commonly mistyped words or the like is not an option. I'm sure there must be an existing, well-tested algorithm, but I can't find anything.
Can anyone offer any help or advice? I'm coding the algorithm in python, but any other language or pseudo-code is welcome.
There is this comparison between QWERTY, Colemak and Dvorak layouts, which calculates the distance between the keys typed, the percentage of keys on the same hand, etc. with source code in Java. These metrics in combination should give a very good estimate of the 'typeability' of a word.
I don't have any algorithms to propose, but a few hints:
I use both hands to type, meaning that the keyboard is roughly split into two halves. I frequently have coordination issues between the two hands: each hand types its letters in the right order, but the interleaving is wrong. This is especially true if one hand has more letters to type than the other. A typical example is "the", because the left hand types t and e and the right hand types h.
"Slips" are frequent: one often misses the key and hits another key instead. "Additions" and "deletions" are frequent too, i.e. typing an extra key or not pushing hard enough. This means that (obviously) the more letters there are, the harder it is to get the word right.
Mixed case makes it harder: it requires synchronization between pushing CAPS and hitting the keys, so it's likely that nearby letters won't have the right upper/lower case.
Hope this helps...
Take out your Scrabble set, note down the scores for each letter, total the scores for a word, hey presto you have your algorithm. Not sure it entirely satisfies your requirements, but it might point you in a useful direction. You might, for instance, want to assign scores not only to individual letters but also to di- and tri-grams.
I'm not aware of any existing source of the information you need, perhaps you could come up with your own letter scores by examining the keyboard and assigning higher scores to the more difficult letters: so 1 for 'a', 8 for 'q', 2 for 'm', and so on.
EDIT: I seem to have confused people more than I usually do when I reply on SO. Here's the barebones of my proposal:
a) List all trigrams and digrams which occur in English (or your language). To each of them assign a difficulty-of-typing score. Do the same for individual letters (after all a 4 letter word might be composed of a trigram and a letter rather than two digrams).
b) Score the difficulty of typing a word as the sum of the difficulty of typing its components.
As for the difficulty scores, I haven't a clue, but you could start from 1 for a letter on the home keys on a keyboard, 2 for a letter which uses the index fingers but is not a home key, 3 for a letter which uses the 2nd or 3rd fingers on your hand, and so on. Then for digrams, score low for easy letters on left and right (or right and left) in sequence, high for difficult letters on one hand in sequence (eg qz, though that's perhaps not valid for English). And on you go.
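To show the shape of it in Python, here is a deliberately crude sketch. All the numbers are placeholders I made up: letters are scored by keyboard row only (not by finger), and a digram costs extra when both letters fall on the same hand.

```python
HOME_ROW = set("asdfghjkl")
TOP_ROW = set("qwertyuiop")
BOTTOM_ROW = set("zxcvbnm")
LEFT_HAND = set("qwertasdfgzxcvb")
LETTERS = HOME_ROW | TOP_ROW | BOTTOM_ROW

def letter_score(ch):
    """1 for the home row, 2 for the top row, 3 for the bottom row."""
    if ch in HOME_ROW:
        return 1
    return 2 if ch in TOP_ROW else 3

def digram_score(a, b):
    """0 when the hands alternate; otherwise 1, or 2 if both keys are off the home row."""
    if (a in LEFT_HAND) != (b in LEFT_HAND):
        return 0
    return 2 if a not in HOME_ROW and b not in HOME_ROW else 1

def difficulty(word):
    """Sum of the letter scores plus the scores of each adjacent pair."""
    letters = [ch for ch in word.lower() if ch in LETTERS]
    total = sum(letter_score(ch) for ch in letters)
    total += sum(digram_score(a, b) for a, b in zip(letters, letters[1:]))
    return total

print(difficulty("qz"), difficulty("al"))
```

Swapping in a per-finger table or trigram scores only means changing the two scoring functions.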
I think the Manhattan distance algorithm could be the closest to what you are looking for. It measures the distance from the source to the target along a rectangular grid.
As for the implementation in Python, for your specific need of difficulty on QWERTY you will have to write one yourself; otherwise, a few Manhattan distance implementations can be found if you google for "n puzzle solver in python"
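Writing it yourself is not much code. A sketch, assuming the three letter rows sit on a plain grid (the real keyboard staggers the rows, which this ignores) and with function names of my own choosing:

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Map each letter to its (row, column) on the idealized grid.
POSITION = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

def manhattan(a, b):
    """Grid distance between two keys: row difference plus column difference."""
    (r1, c1), (r2, c2) = POSITION[a], POSITION[b]
    return abs(r1 - r2) + abs(c1 - c2)

def travel(word):
    """Total distance covered when moving from each letter to the next."""
    letters = [ch for ch in word.lower() if ch in POSITION]
    return sum(manhattan(a, b) for a, b in zip(letters, letters[1:]))

print(travel("qaz"), travel("puzzle"))
```

Dividing travel(word) by the word length gives a per-keystroke figure that is easier to compare across words of different lengths.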