I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches 'this' and 'thing'. Now if I try to read in a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
    filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it has spaces and it is not a list. I can match single letters: if I search for just 't' I get a bunch of t's, which tells me I am matching single letters. I want to match whole words instead, and more than that, to preserve the wildcard structure.
What I would like to happen is that a user enters text (including what will be a wildcard) and I can substitute it into the place where 'th*' is, with the wildcard still doing what it should. That leads to the question: can I just stick a variable holding the search text in place of 'th*'? After some investigation I am wondering whether I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way of passing in regex formulas, and what do I have wrong in the code, given that the second snippet does not operate on the incoming text the way the first one (correctly) does?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on whitespace (plain .split() also handles newlines) to make a list
    filtered = fnmatch.filter(file_contents, 'th*')
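As for substituting user input in place of 'th*': fnmatch.filter accepts any pattern string, so you can pass a variable directly, and no translate() step is needed unless you want to switch to the re module. A minimal sketch under that assumption (the prompt text is made up):
import fnmatch

user_pattern = input('Enter a search pattern (e.g. th*): ').lower()
with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()  # whole file as a list of words
filtered = fnmatch.filter(words, user_pattern)
print(filtered)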
I want to separate the user's input using two different separators, ":" and ";".
The user should input 4 subjects and their amounts. The format should be:
(Subject:amount;Subject:amount;Subject:amount;Subject:amount)
If the input is wrong it should print "Invalid Input".
Here's my code, but I can only use one separator. How can I validate the user's input?
B = input("Enter 4 subjects and amount separated by (;) like Math:90;Science:80:").split(";")
Please help. I can't figure it out.
If you are fine with using regular expressions in python you could use the following code:
import re
output_list = re.split("[;:]", input_string)
Inside the square brackets you include all the characters (also known as delimiters) that you want to split by. Just make sure to keep the quotes around the square brackets, as that makes a regex string (which is what we use to tell the computer what to split on).
Further reading on regex can be found here if you feel like it: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
However, if you want to do it without importing anything, here is another possible solution (which I would recommend against, although it gets the job done):
input_string = input_string.replace(";", ":")
output_list = input_string.split(":")
This works by first replacing all of the semicolons in the input string with colons (it would also work the other way around) and then splitting on the remaining separator (in this case the colon).
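To handle the "Invalid Input" requirement from the question, here is one possible sketch; the exact validation rules (exactly four Subject:amount pairs, with a numeric amount) are an assumption:
import re

B = input("Enter 4 subjects and amounts like Math:90;Science:80: ")
pairs = B.split(";")
# Expect exactly four "Subject:amount" pairs, where amount is digits only (an assumption)
if len(pairs) == 4 and all(re.fullmatch(r"[^:;]+:\d+", p) for p in pairs):
    print(dict(p.split(":") for p in pairs))
else:
    print("Invalid Input")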
Hope this helped, as it is my first answer on Stack Overflow.
The file is a plain text ("txt") file. I have separate files for phrases of different lengths (spaces count towards the phrase length); I say phrases because a phrase can be multiple words, but in the example below I use three-letter words, all of which are one word. Each phrase is on its own line and is followed by a comma. Imagine you have a file like this:
app,
bar,
car,
eel,
get,
pod,
What I want is to be able to add one or more phrases that will be assumed to only contain lowercase alphabetical letters and/or spaces.
For example, let us say I want to add the phrases in this order:
(cat, bat, car, hat, mom, rat)
Basically, I want to add these phrases to the file without deleting the file, making sure no phrases repeat, and keeping them alphabetically sorted. Spaces are assumed to sort after the letter z. So after inputting these phrases, the file should look like this:
app,
bar,
bat,
car,
eel,
get,
hat,
mom,
pod,
rat
Each file is assumed to grow to at least a gigabyte of data, so what is the fastest and least memory-consuming approach? Copying the file in order to accomplish this is a no-go.
I haven't tried anything that 100% works. I know what to do, I just don't know how to do it. Here are the main points that I need to accomplish.
1) Get the phrase(s) from the user (using the input() function)
2) Open the file of organized words (using "with open(filename)" statements)
3) Put each phrase into the "correct" spot in the file. By "correct" I mean the position that keeps the file alphabetical, without adding a repeat.
4) Make sure the file doesn't get deleted.
Here is what I have currently (changed it a bit and it is doing MORE of what I want, but not everything):
phrase_to_add = input('Please enter the phrase: ').lower()
with open('/Users/ian/Documents/three_character_phrases.txt') as file:
    unique_phrases = list(file.read().split())

unique_phrases.append(phrase_to_add)
unique_phrases.sort()

list_of_phrases = set()
for phrase in unique_phrases:
    list_of_phrases.add(phrase)

with open('/Users/ian/Documents/three_character_phrases.txt', 'w') as fin:
    for phrase in list_of_phrases:
        fin.write(phrase + '\n')
So I started with BOTH files being empty and I added the word 'cow' by putting it into the input, and this is what the file looked like:
three_character_phrases.txt:
cow
then I inputted the word "bat" and I got this:
bat
cow
then I added the word "bawk" (I know it isn't a 3-letter word, but I'll take care of making sure the right words go into the right files) and I got this:
bawk
bat
cow
It looks like you're getting wrapped up in the implementation instead of trying to understand the concept, so let me invite you to take a step back with me.
You have a data structure that resembles a list (since order is relevant) but allows no duplicates.
['act', 'bar', 'dog']
You want to add an entry to that list
['act', 'bar', 'cat', 'dog']
and serialize the whole thing to file afterwards so you can use the same data between multiple sessions.
First up is to establish your serialization method. You've chosen a plain text file, line delimited. There's nothing wrong with that, but if you were looking for alternatives then a csv, a json, or indeed serializing directly to database might be good too. Let's proceed forward under the assumption that you won't change serialization schemas, though.
It's easy to read from file
from pathlib import Path

FILEPATH = Path("/Users/ian/Documents/three_character_phrases.txt")

def read_phrases():
    with FILEPATH.open(mode='r') as f:
        return [line.strip() for line in f]
and it's easy to write to it, too.
# Assume FILEPATH is defined here, and in all future snippets as well.
def write_phrases(phrases):
    with FILEPATH.open(mode='w') as f:
        f.writelines(f'{phrase}\n' for phrase in phrases)
        # this is equivalent to:
        # text = '\n'.join(phrases)
        # f.write(text + '\n')
You've even figured out how to have the user enter a new value (though your algorithm could use work to make the worst case better: since you're always inserting into a sorted list, the bisect stdlib module can help your performance here for large lists; I'll leave that for a different question, though).
Since you've successfully done all the single steps, the only thing holding you back is to put them all together.
phrases = read_phrases()
phrase_to_add = input('Please enter the phrase: ').lower()

if phrase_to_add not in phrases:
    phrases.append(phrase_to_add)
    phrases.sort()  # this is, again, not optimal. Look at bisect!

write_phrases(phrases)
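As a sketch of the bisect idea mentioned above (assuming phrases is already sorted, which holds once the file is kept sorted; insert_phrase is a hypothetical helper):
import bisect

def insert_phrase(phrases, phrase):
    # Binary-search for the insertion point, then insert only if not a duplicate
    index = bisect.bisect_left(phrases, phrase)
    if index == len(phrases) or phrases[index] != phrase:
        phrases.insert(index, phrase)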
I am trying to extract the file name, without extension, from a file pointer. My file names are as follows:
this site:time.list, this.list, this site:time_sec.list, that site:time_sec.list, and so on. The required file name always precedes either a whitespace or a dot.
Currently I am doing one of these to get the part of the file name preceding the whitespace or the dot:
search_term = os.path.basename(f.name).split(" ")[0]
and
search_term = os.path.basename(f.name).split(".")[0]
Expected file name output: this, this, this, that.
How can I combine the above two into a one-liner, in a Pythonic way?
Thanks in advance.
Using a regex as below, where [ .] will split on either a space or a dot character:
re.split('[ .]', os.path.basename(f.name))[0]
Chain the two splits: split on one separator, take the first piece, then split that piece on the other separator. If the second separator is present, you get something smaller; if not, you keep what you got from the first split. You don't need a regex for this.
search_term = os.path.basename(f.name).split(" ")[0].split(".")[0]
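For example, run against the sample names from the question (plain strings stand in for f.name here):
import os

for name in ("this site:time.list", "this.list", "that site:time_sec.list"):
    print(os.path.basename(name).split(" ")[0].split(".")[0])
# this
# this
# that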
Use regex to get the first word at the beginning of the string:
import re
re.match(r"\w+", "this site:time_sec.list").group()
# 'this'
re.match(r"\w+", "this site:time.list").group()
# 'this'
re.match(r"\w+", "that site:time_sec.list").group()
# 'that'
re.match(r"\w+", "this.list").group()
# 'this'
Try this:
import os
import re

pattern = re.compile(r"\w+")
pattern.match(os.path.basename(f.name)).group()
Make sure your file names don't have whitespace inside when you rely on the assumption that a whitespace separates what you want to extract from the rest. You are much more likely to get unexpected results you didn't anticipate if you rely on implicit rules like that, instead of actually looking at the strings you want to extract and tailoring explicit expressions to fit the content.
I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word, using the following tags, where instance is the word that is searched for:
<b><font color=\"red\">"+instance+"</font></b>
I need to find a word (case-insensitive), let's say "port", within a string where it can appear as port, Port, SUPPORT, Support, support, etc., which is easy enough:
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain two or more instances in a single line, and I need to wrap each of those instances in <b><font color=\"red\">"+instance+"</font></b> without changing case.
The problem with my approach is that I am iterating over each instance found with findall (exact match), while multiple identical matches can also be found within the string:
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I could avoid this if I could find out the exact part of the string that pattern.sub is substituting at the moment it does so; however, I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case-insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
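A sketch of that idea: re.sub can use a backreference in the replacement, so each match is wrapped using its own original text, which preserves case and never wraps the same occurrence twice (the sample string is made up):
import re

string_to_search = "support HTTP on this Port, SUPPORT HTTPS on that port"
word = "port"

highlighted = re.sub(
    "(" + re.escape(word) + ")",            # capture the matched text itself
    r'<b><font color="red">\1</font></b>',  # \1 re-inserts it with its case intact
    string_to_search,
    flags=re.IGNORECASE,
)
print(highlighted)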
Okay, so here are two ways I tried quickly. The second loop is definitely the way to go; it uses re.sub (as someone else commented, too). Bear in mind it replaces each match with the lowercase search term.
import re

FILE = open("testing.txt", "r")
word = "port"

# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">"+word+"</font></b>")
    print(newline)

FILE.seek(0)  # rewind, or the second loop has nothing left to read

# THIS LOOP IS CASE-INSENSITIVE
pattern = re.compile(word, re.IGNORECASE)  # compile once, not per line
for line in FILE:
    newline = pattern.sub("<b><font color=\"red\">"+word+"</font></b>", line)
    print(newline)

FILE.close()
I have a 50 MB regex trie that I'm using to split phrases apart.
Here is the relevant code:
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    regex = myfile.read()

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
Since the regex is so large, this takes forever!
Here is the code I'm trying now, with re.compile(TempRegex):
import io
import re

with io.open('REGEXES.rx.txt', encoding='latin-1') as myfile:
    TempRegex = myfile.read()

regex = re.compile(TempRegex)

while True:
    Password = input("Enter a phrase to be split: ")
    Words = re.findall(regex, Password)
    print(Words)
What I'm trying to do is check whether an entered phrase is a combination of names. For example, the phrase "johnsmith123" should return ['john', 'smith', '123']. The regex file was created by a tool from a word list of every first and last name on Facebook. Essentially, I want to see if an entered phrase is a combination of words from that word list. If johns and mith are also names in the list, then I would want "johnsmith123" to return ['john', 'smith', '123', 'johns', 'mith'].
I don't think that regex is the way to go here. It seems to me that all you are trying to do is to find a list of all of the substrings of a given string that happen to be names.
If the user's input is a password or passphrase, that implies a relatively short string. It's easy to break that string up into the set of possible substrings, and then test that set against another set containing the names.
The number of substrings in a string of length n is n(n+1)/2. Assuming that no one is going to enter more than say 40 characters you are only looking at 820 substrings, many of which could be eliminated as being too short. Here is some code to do that:
def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]
So the problem then is loading the names into a suitable data structure. Your regex is 50 MB, but considering the snippet that you showed in one of your comments, the amount of actual data is going to be a lot smaller than that, due to the overhead of the regex syntax.
If you just used text files with one name per line you could do this:
names = set(word.strip().lower() for word in open('names.txt'))

def substrings(s, min_length=1):
    for start in range(len(s)):
        for length in range(min_length, len(s)-start+1):
            yield s[start:start+length]

s = 'johnsmith123'
print(sorted(names.intersection(substrings(s))))
Might give output:
['jo', 'john', 'johns', 'mi', 'smith']
I doubt that there will be memory issues given the likely small data set, but if you find that there's not enough memory to load the full data set at once you could look at using sqlite3 with a simple table to store the names. This will be slower to query, but it will fit in memory.
Another way could be to use the shelve module to create a persistent dictionary with names as keys.
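A minimal sketch of the shelve idea (the file names are made up, and True is just a dummy value since only the keys matter; substrings() is the generator defined above):
import shelve

# Build the persistent dictionary once, with names as keys
with shelve.open('names.shelf') as db:
    for word in open('names.txt'):
        db[word.strip().lower()] = True

# Later runs can query it without holding the whole set in memory
with shelve.open('names.shelf') as db:
    print(sorted(sub for sub in substrings('johnsmith123') if sub in db))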
Python's regex engine is not actually a regular-expression engine in the formal sense, since it includes features such as lookbehind, capture groups, and back references, and uses backtracking to match the leftmost valid branch instead of the longest.
If you use a true regex engine, you will almost always get better results if your regex does not require those features.
One of the most important qualities of a true regular expression is that it will always return a result in time proportional to the length of the input, using only constant memory.
I've written one myself using a DFA implemented in C (but usable from Python via cffi); it has optimal asymptotic performance, though I haven't tried constant-factor improvements such as vectorization and assembly generation. I didn't make a generally usable API, since I only need to call it from within my library, but it shouldn't be too hard to figure out from the examples. (Note that search can be implemented as match with .* up front, then matching backward, but for my purpose I would rather return a single character as an error token.) Link to my project
You might also consider building the DFA offline and using it for multiple runs of your program, but that is what flex does, so there was no point in my doing that for my project; maybe just use flex if you're comfortable with C? Of course, you'd almost certainly have to write a fair bit of custom C code to use my project anyway...
If you compile it, the regex pattern is compiled into bytecodes and then run by a matching engine. If you don't compile it, the same regex has to be loaded over and over whenever it is called. That's why the compiled one is much faster if you are using the same regex for multiple different records.
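A sketch of the difference (the pattern and the records are made up; note that re also keeps a small internal cache of recently used patterns, so the gap is most visible with many distinct patterns or tight loops):
import re

records = ["johnsmith123", "janedoe99", "bobby7"]

# Compile once, reuse the compiled object for every record
pattern = re.compile(r"[a-z]+|\d+")
for record in records:
    print(pattern.findall(record))

# Equivalent results, but the pattern string is handed to re on every call
for record in records:
    print(re.findall(r"[a-z]+|\d+", record))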