So this is my first year getting into coding as a hobby. For my personal side project I want to make a date-matcher (not for a friend, haha). This is mainly for me to get a better understanding of Python structures.
To summarize: people fill two lists with names and the matcher returns a list of random matches (NO DUPLICATES).
It also comes with these rules:
1. I want every 'user' (name) to choose between being Open, Not Interested, or Taken, and match the strings accordingly.
2. When there are more items in one list, the left-over strings get printed out too.
3. [Optional] When users fill in their name, they can also fill in a certain 'preference string', giving them a higher chance of being matched together with that string.
I'm kinda stuck at the first phase, this is what I have:
import random

# two lists of names to match
VNamen = ["Sarah", "Annelotte", "Kelsey", "Mika", "Ilse", "Yara", "Sjouke"]
MNamen = ["Kelvin", "Xander", "Kolten", "Ezekiel", "Misael", "Landon", "Noel"]

# note: random.choices returns a *list* containing one random element
VR = random.choices(VNamen)
MR = random.choices(MNamen)
print(VR, "together with", MR)
How do I randomly match the strings together?
How do I remove the duplicates in the resulting list?
Maybe some suggestions on the rest of the functions above?
I hope someone has the time for this (for me) complicated question!
Greetings,
Quinten
Now, there are things like "re" that I would suggest (like dper did in the comment on your code), but if you want to do it with your own code, I would suggest using random.choice(list) after importing random (which you have done). That will choose a random person from a list; do this with both lists, put the two chosen names together into another list, and remove them from the original lists. Repeat this until one of the lists is empty, then print out everything in the non-empty list.
Woah, that was a lot of lists...
Preference settings would be a little more complicated: you would have to use a list that goes everywhere the name used to go, and in that list there would be all the information they have. But that way it would be impossible (as far as I am aware) to change the likelihood of getting a certain name.
If you would like me to actually show it with your code, comment and ask me to do so, but I would suggest giving it a go yourself (if you choose to do it this way, that is).
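In case it helps, here is a minimal sketch of the approach above (the match helper and the example call are just names I picked; it works on copies so your original lists stay intact):

import random

VNamen = ["Sarah", "Annelotte", "Kelsey", "Mika", "Ilse", "Yara", "Sjouke"]
MNamen = ["Kelvin", "Xander", "Kolten", "Ezekiel", "Misael", "Landon", "Noel"]

def match(list_a, list_b):
    # work on copies so the original lists are left untouched
    a, b = list_a.copy(), list_b.copy()
    pairs = []
    while a and b:
        left = random.choice(a)     # pick one name from each list...
        right = random.choice(b)
        a.remove(left)              # ...and take them out of the pool,
        b.remove(right)             # so nobody gets matched twice
        pairs.append((left, right))
    leftovers = a or b              # whichever list still has names in it
    return pairs, leftovers

pairs, leftovers = match(VNamen, MNamen)
for left, right in pairs:
    print(left, "together with", right)
if leftovers:
    print("Left over:", ", ".join(leftovers))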
Python question, as it's the only language I know.
I've got many very long text files (8,000+ lines) with the sentences fragmented and split across multiple lines, i.e.
Research both sides and you should then
formulate your own answer. It's simple,
straightforward advice that is common sense.
And when it comes to vaccinations, climate
change, and the novel coronavirus SARS-CoV-2
etc.
I need to concatenate the fragments into full sentences, breaking them at the full stops (periods), question marks, quoted full stops, etc., and write them to a new, cleaned-up text file, but I am unsure of the best way to go about it.
I tried looping through, but the results showed me that this method was not going to work.
I have never coded Generators (not sure if that is what is called for in this instance) before as I am an amateur developer and use coding to make my life easier and solve problems.
Any help would be very greatly appreciated.
If you read the file into a variable f, then you can access the text one row at a time (f behaves much like a list of strings). The functions that might be helpful to you are str.join and str.split. join takes a list of strings and joins them with a string in between: 'z'.join(["a", "b", "c"]) will produce "azbzc". split takes a string as a parameter, finds each instance of that string, and splits on it: "azbzc".split('z') will produce ["a", "b", "c"] again. Removing the newline after every line and then joining the lines with something like a space rebuilds the text into a single string; using split on things like question marks, etc. then splits it up the way you want it.
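A minimal sketch of that join/split idea, assuming the text lives in a file called input.txt and sentences end with '.', '!' or '?' (quoted full stops and abbreviations would need extra handling):

import re

with open("input.txt", encoding="utf-8") as f:
    # strip the newline from every row and drop empty lines
    lines = [line.strip() for line in f if line.strip()]

# rebuild the fragmented rows into one long string
text = " ".join(lines)

# split after sentence-ending punctuation; the lookbehind keeps
# the punctuation attached to its sentence
sentences = re.split(r'(?<=[.!?])\s+', text)

with open("cleaned.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sentences))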
Heads up, I'm a bloody beginner:
This may sound quite trivial to most of you, but I haven't figured out an efficient solution yet.
I wanted to write a randomization script to facilitate the process of assigning groups as I have to do this quite often. It should be able to form groups from the participant list that should be random but "evenly distributed", meaning that participants from different semesters should be as equally present in every group as possible.
I had two different ideas to approach this so far:
1) I thought of creating a dictionary containing the participant names and the semester each participant is in. Then I would create groups and append the participants depending on conditions (each group has a maximum size of 5 and should include no more than 2 participants from each semester); if a condition is not met, the participant is assigned to the next group, until every group is full.
2) I thought about creating a list with the names and a number as a combined string (e.g. "2Tom"), shuffling that list, and splitting it into subgroups of five with the condition that no subgroup contains more than two participants of the same semester (which I would check with .startswith("2", 0)).
Both of these solutions seem unnecessarily complicated to me. And I have not come very far yet. I've started with the second idea (even though the first one is better, but I'm not familiar with dictionaries yet). I created a list, managed to shuffle it and to split it into groups, but I could not build in the condition with the semesters.
Also, I searched quite a bit already to see if there are already similar questions/solutions on the web, but could not find any that helped me. If you find a solution, please share the link!
Any help is greatly appreciated! I wanted to make my life easier with this script, but so far it has made it only harder haha.
Thank you very much in advance!
This is what I have so far for the second idea:
import numpy as np

participants = # list of participants that I do not want to disclose here

def check(participants):
    for people in participants:
        if people.startswith("2", 0):
            print("2ndsemester")
        if people.startswith("4", 0):
            print("4thsemester")
        if people.startswith("6", 0):
            print("6thsemester")

# note: check() currently returns None, so this loop body never runs
while check(participants):
    np.random.shuffle(participants)
    myarray = np.asarray(participants)
    groups = np.split(myarray, 5)
    print(groups)
The function "check" should ultimately be able to check if the list is equally distributed, but I do not know hot to do that. I was thinking about checking if the consecutive three items contain the same semesters and if so, shuffle. But this solution is not smart, as it would only work if every semester is represented equally in the list, which is not the case.
To simplify the requirements:
I want to create a function that can split a list of participants into a desired number of groups, where each group consists of the same number of participants from each semester (so there is no group that has all the 6th-semesters only). I think it might be wiser to use a dictionary than a list.
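For what it's worth, here is a minimal sketch of the dictionary-based idea, assuming each participant is stored as a (name, semester) pair; the names below are made up. It shuffles within each semester and then deals the names out round-robin, so every semester is spread across the groups as evenly as possible:

import random
from collections import defaultdict

def make_groups(participants, n_groups):
    # bucket the names by semester
    by_semester = defaultdict(list)
    for name, semester in participants:
        by_semester[semester].append(name)

    groups = [[] for _ in range(n_groups)]
    i = 0
    for semester, names in by_semester.items():
        random.shuffle(names)                  # random order within a semester
        for name in names:
            groups[i % n_groups].append(name)  # deal out round-robin
            i += 1
    return groups

# hypothetical example data
participants = [("Tom", 2), ("Anna", 2), ("Kim", 2), ("Ben", 4),
                ("Mia", 4), ("Joe", 4), ("Lea", 6), ("Sam", 6)]
for group in make_groups(participants, 2):
    print(group)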
I've been noodling around with Python for quite a while in my spare time, and while I have sort of understood and definitely used dictionaries, they've always seemed somewhat foreign to me, like I wasn't quite getting them. Maybe it's the name "dictionary" throwing me off, or the fact that I started way back with BASIC (I know), which had arrays, but they were quite different.
Can I simply think of a dictionary in Python as nothing more or less than a two-column table where we name the contents of the first column "keys" and the contents of the second column "values"? Is this conceptualization extremely accurate and useful, or problematic?
If the former, I think I can finally swallow the concept in such a way to finally make it more natural to my thinking.
The analogy of a 2-column table might work to some degree but it doesn't cover some important aspects of how and why dictionaries are used in practice.
The comment by #Sayse is more conceptually useful. Think of the dictionary as a physical language dictionary, where the key is the word itself and the value is the word's definition. Two items in the dictionary cannot have the same key but could have the same value. In the analogy of a language dictionary, if two words had the same spelling then they are the same word. However, synonyms can exist where two words which are spelled differently could have the same definition.
The table analogy also doesn't cover how items are actually used: in a dictionary, order does not matter, and an item is retrieved by its key rather than by its position (before Python 3.7 even the insertion order wasn't preserved). Perhaps another useful analogy is to think of the key as a person's name and the value as the person themselves (and maybe lots of information about them as well). The people are identified by their names, but they may be in any given order or location... it doesn't matter; since we know their names, we can identify them.
While the order of items in a dictionary may not be preserved, a dictionary has the advantage of having very fast retrieval for a single item. This becomes especially significant as the number of items to lookup grows larger (on the order of thousands or more).
Finally, I would also add that dictionaries can often improve the readability of code. For example, if you wanted to create a lookup table of HTML color codes, an API using a dictionary keyed by HTML color names is much more readable and usable than using a list and relying on documentation of indices to retrieve the values.
So if it helps you to conceptualize a dictionary as a table of 2 columns, that is fine, as long as you also keep in mind the rules for their use and the scenarios where they provide some benefit:
Duplicate keys are not allowed
Items are retrieved by key, not by position (and relying on ordering is only safe from Python 3.7 on)
Retrieving a single item is fast (esp. for many items)
Improved readability of lookup tables
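For example, the color-code lookup mentioned above might look like this (the hex values are just the standard HTML codes):

# looking up an HTML color code by name reads naturally...
html_colors = {
    "red": "#FF0000",
    "green": "#008000",
    "blue": "#0000FF",
}
print(html_colors["green"])   # "#008000" -- retrieved by key, not by index

# ...whereas a list forces you to remember that index 1 happens to mean "green"
color_list = ["#FF0000", "#008000", "#0000FF"]
print(color_list[1])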
I have a list of 500 mil strings. The strings are alphanumeric, ASCII characters, of varying size (usually from 2-30 characters). Also, they're single words (or a combination of words without spaces like 'helloiamastring').
What I need is a fast way to check against a target, say 'hi'. The result should be all strings from the 500mil list which start with 'hi' (for eg. 'hithere', 'hihowareyou' etc.). This needs to be fast because there will be a new query each time the user types something, so if he types "hi", all strings starting with "hi" from the 500 mil list will be shown, if he types "hey", all strings starting with "hey" will show etc.
I've tried a trie, but the memory footprint to store 300 mil strings is just huge. It would require 100+ GB of RAM. And I'm pretty sure the list will grow to a billion.
What is a fast algorithm for this use case?
P.S. In case there's no fast option, the best alternative would be to limit people to enter at least, say, 4 characters, before results show up. Is there a fast way to retrieve the results then?
You want a Directed Acyclic Word Graph or DAWG. This generalizes #greybeard's suggestion to use stemming.
See, for example, the discussion in section 3.2 of this.
If the strings are sorted then a binary search is reasonable. As a speedup, you could maintain a dictionary of all possible bigrams ("aa", "ab", etc.) where the corresponding values are the first and last index starting with that bigram (if any do) and so in O(1) time zero in on a much smaller sublist that contains the strings that you are looking for. Once you find a match, do a linear search to the right and left to get all other matches.
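If the list is kept sorted, the binary-search part might look like this (a sketch using Python's bisect module; prefix_matches is just a name I picked):

import bisect

def prefix_matches(strings, prefix):
    # strings must already be sorted
    lo = bisect.bisect_left(strings, prefix)
    # '\uffff' sorts after any ASCII character, so this marks the end
    # of the block of entries that start with the prefix
    hi = bisect.bisect_right(strings, prefix + "\uffff")
    return strings[lo:hi]

strings = sorted(["hithere", "hihowareyou", "hey", "hello", "hi"])
print(prefix_matches(strings, "hi"))   # ['hi', 'hihowareyou', 'hithere']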
If you want to force the user to type at least 4 letters, for example, you can keep a key-value map, in memory or on disk, where the keys are all combinations of 4 letters (there are not too many of them if it is case-insensitive; otherwise you can limit it to three), and the values are lists of positions of all strings that begin with that combination.
After the user has typed the three (or four) letters, you have all the possible strings at once. From this point on you just loop over this subset.
On average this subset is small enough, i.e. 500M divided by 26^4... just as an example. Actually bigger, because probably not every combination of 4 letters can be a prefix of your strings.
Forgot to say: when you add a new string to the big list, you also update the list of indexes corresponding to the key in the map.
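A small sketch of that idea, keeping the index in memory and using list positions instead of file offsets (build_index and lookup are hypothetical names):

from collections import defaultdict

PREFIX_LEN = 4

def build_index(strings):
    index = defaultdict(list)
    for pos, s in enumerate(strings):
        index[s[:PREFIX_LEN].lower()].append(pos)
    return index

def lookup(strings, index, query):
    # the map narrows the search to strings sharing the first 4 letters;
    # a final startswith check handles queries longer than 4 characters
    candidates = index.get(query[:PREFIX_LEN].lower(), [])
    return [strings[p] for p in candidates
            if strings[p].lower().startswith(query.lower())]

strings = ["hithere", "hihowareyou", "history", "hint", "hello"]
index = build_index(strings)
print(lookup(strings, index, "hith"))   # ['hithere']
# when a new string is added, append its position under its 4-letter key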
If you don't want to use a database, you will have to recreate some of the data-handling routines that already exist in all database engines:
Don't try to load all the data into memory.
Use a fixed length for all strings. It increases storage consumption but significantly decreases seek time (the i-th string can be found at position L*i bytes in the file, where L is the fixed length). Create an additional mechanism for extremely long strings: store them in a different place and use special pointers.
Sort all of the strings. You can use merge sort to do this without loading all the strings into memory at once.
Create indexes (the address of the first line starting with 'a', 'b', ...); indexes can also be created for 2-grams, 3-grams, etc. Indexes can be kept in memory to increase search speed.
Use advanced strategies to avoid full index regeneration on data updates: split the data into a number of files by first letter and update only the affected indexes, leave empty spaces in the data to reduce the impact of read-modify-write procedures, and keep a cache for new lines before they are added to the main storage (and search that cache too).
Use a query cache to speed up processing of popular requests.
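To illustrate the fixed-length point above: with a fixed record length L, the i-th string starts at byte L * i, so a single seek finds it without loading everything into memory (a minimal sketch; the file name and record length are arbitrary):

RECORD_LEN = 32   # strings longer than this would need an overflow area

def write_records(path, strings):
    with open(path, "wb") as f:
        for s in strings:
            # pad every string to the same length with zero bytes
            f.write(s.encode("ascii").ljust(RECORD_LEN, b"\x00"))

def read_record(path, i):
    with open(path, "rb") as f:
        f.seek(RECORD_LEN * i)   # jump straight to record i
        return f.read(RECORD_LEN).rstrip(b"\x00").decode("ascii")

write_records("strings.dat", ["hello", "hithere", "hey"])
print(read_record("strings.dat", 1))   # 'hithere'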
In this hypothetical, where the strings being indexed are not associated with any other information (e.g. other columns in the same row), there is relatively little difference between a complete index and keeping the strings sorted in the first place (as in, some difference, but not as much as you are hoping for). In light of the growing nature of the list and the cost of updating it, perhaps the opposite approach will better accomplish the performance tradeoffs that you are looking for.
For any given character at any given location in the string, your base case is that no string exists containing that letter. For example, once 'hello' has been typed, if the next letter typed is 't', then your base case is that there is no string beginning 'hellot'. There is a finite number of characters that could follow 'hello' at location 5 (say, 26). You need 26 fixed-length spaces in which to store information about characters that follow 'hello' at location 5. Each space either says zero if there is no string in which, e.g., 't' follows 'hello', or contains a number of data-storage addresses by which to advance to find the list of characters for which one or more strings involve that character following 'hellot' at location 6 (or use absolute data-storage addresses, although only relative addressess allow the algorithm I propose to support an infinite number of strings of infinite length without any modification to allow for larger pointers as the list grows).
The algorithm can then move forward through this data stored on disk, building a tree of string-beginnings in memory as it goes, and avoiding delays caused by random-access reads. For an in-memory index, simply store the part of the tree closest to the root in memory. After the user has typed 'hello' and the algorithm has tracked that information about one or more strings beginning 'hellot' exists at data-storage address X, the algorithm finds one of two types of lists at location X. Either it is another sequence of, e.g., 26 fixed-length spaces with information about characters following 'hellot' at location 6, or it is a pre-allocated block of space listing all post-fixes that follow 'hellot', depending on how many such post-fixes exist. Once there are enough post-fixes that using some traditional search and/or sort algorithm to both update and search the post-fix list fails to provide the performance benefits that you desire, it gets divided up and replaced with a sequence of, e.g., 26 fixed-length spaces.
This involves pre-allocating a relatively substantial amount of disk storage upfront, with the tradeoff that your tree can be maintained in sorted form without needing to move anything around for most updates, and your searches can be performed in full in a single sequential read. It also provides more flexibility and probably requires less storage space than a solution based on storing the strings themselves as fixed-length strings.
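A much simplified, in-memory sketch of the fixed-slot node layout described above, assuming lowercase a-z only (on disk each slot would hold a fixed-length offset instead of an object reference, and the post-fix lists would live in pre-allocated blocks):

def make_node():
    # one slot per letter; None means "no string continues with this letter"
    return {"children": [None] * 26, "is_end": False}

def insert(root, word):
    node = root
    for ch in word:
        i = ord(ch) - ord("a")
        if node["children"][i] is None:
            node["children"][i] = make_node()
        node = node["children"][i]
    node["is_end"] = True

def has_prefix(root, prefix):
    node = root
    for ch in prefix:
        i = ord(ch) - ord("a")
        if node["children"][i] is None:
            return False   # the base case: no string continues this way
        node = node["children"][i]
    return True

root = make_node()
for word in ["hello", "hellothere", "help"]:
    insert(root, word)
print(has_prefix(root, "hellot"))   # True
print(has_prefix(root, "helloz"))   # False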
First of all I should say that the tag you should have added for your question is "Information Retrieval".
I think using Apache Lucene's PrefixQuery is the best way you can handle wildcard queries. Apache has a Python version (PyLucene) if you are comfortable with Python. But to use Apache Lucene to solve your problem you should first learn about indexing your data (the part where your data is compressed and saved in a more efficient manner).
Also, looking at the indexing and wildcard-query sections of an information retrieval book will give you a better picture.
I have a column in a CSV file that has names such that each cell in that column could be the same as a slightly misspelled cell. For example, "Nike" could be the same as "Nike inc." could be the same as "Nike Inc".
My Current Script
I've already written a program in Python that removes prefixes and suffixes from each cell if that value occurs more than 2 times in the column as a prefix or suffix. I then compared each row to the next after sorting the column alphabetically.
My Current Problem
There are still many cells that are in reality duplicates of other cells, but they are not indicated as such. These examples are:
a) Not exact matches (and not off just by capitalization)
b) Not caught by comparing its stem (without prefix and without suffix) to its alphabetical neighbor
My Current Questions
1) Does anyone have experience mapping IDs to names from all over the world (so accents, Unicode and all that stuff are an issue here, too, although I managed to solve most of these Unicode issues) and have good ideas for algorithm development that are not listed here?
2) In some of the cases where duplicates are not picked up, I know why they are duplicates. In one instance there is a period in the middle of a line that is not present in its period-free sibling cell. Is one good strategy simply to create an extra column and output the cell values that I suspect of being duplicates, based on the few instances where I know the reason?
3) How do I check myself? One way is to flag the maximum number of potential duplicates and look over all of these manually. Unfortunately, the size of our dataset doesn't make that very pretty, nor very feasible...
Thanks for any help you can provide!
Try transliterating the names to remove all the international symbols, then consider using a function like Soundex or Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance, e.g. http://pypi.python.org/pypi/Fuzzy) to calculate text similarity.
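A minimal sketch of that pipeline using only the standard library (unicodedata for the transliteration step and difflib in place of an external Levenshtein package); the similarity threshold is something you would have to tune on your own data:

import difflib
import unicodedata

def normalize(name):
    # "transliterate" by decomposing accented characters and dropping
    # the combining marks, then lower-case and trim
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.lower().strip()

def similarity(a, b):
    # a ratio close to 1.0 suggests the names are probably duplicates
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("Nike Inc.", "Nike inc"))   # close to 1.0
print(similarity("Nike", "Adidas"))          # much lower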