Making python input search function case insensitive - python

I'm creating a python 3 search function that accepts input from the user. The user inputs a string, which is then searched in a word document.
I have tested the function with 2 docx files - one with the word "hello" in it and the other with "There".
When I input the words with the exact case as they are in the docx file the search function returns the name of the file - success! However, when I input the words without the correct case I don't get anything returned.
I've seen a few questions on here about case insensitive options but I couldn't really find one that was similar enough to my project. Any help would be much appreciated.
import os
import docx2txt
os.chdir('c:/users/Says/desktop/projectx')
path = 'c:/users/Says/desktop/projectx'
files = []
x = str(input("search: "))
for file in os.listdir(path):
    if file.endswith('.docx'):
        files.append(file)
for i in range(len(files)):
    text = docx2txt.process(files[i])
    if x in text:
        print(files[i])

You could use .lower() to make both strings lower case.
if x.lower() in text.lower():
    print(files[i])

One approach is to convert both x and text to the same case, either upper or lower, before comparing. Either of the following conditions gives the desired result:
if x.upper() in text.upper():
if x.lower() in text.lower():
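As a hedged variation on the answers above, str.casefold() is a more aggressive form of lower() intended for caseless comparison (for example, it maps the German "ß" to "ss"). The helper name contains_ignoring_case below is made up for illustration; this is only a minimal sketch of the comparison, not of the docx handling.

```python
def contains_ignoring_case(needle, haystack):
    """Return True if needle occurs in haystack, ignoring letter case."""
    return needle.casefold() in haystack.casefold()


print(contains_ignoring_case("HELLO", "Well, hello there"))
```

In the question's loop, the condition would then read `if contains_ignoring_case(x, text):`.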

Related

Best regex pattern to replace input function from a separate python file

I am new to regex so please explain how you got to the answer. Anyway I want to know the best way to match input function from a separate python file.
For example:
match.py
a = input("Enter a number")
b = input()
print(a+b)
Now I want to match ONLY the input calls and replace each with a random number. I will do this in a separate file, main.py. My aim is to replace the input function in match.py with random numbers so I can check that the output comes out as expected. You can think of match.py as a coding exercise where the user writes code in that file, and main.py as the file that evaluates whether the user's code is right. To do that I need to supply the input myself and check that it works for all kinds of inputs. I searched for "regex patterns for python input function" but found nothing useful. I have a current way of doing it, but I don't think it works in all cases. I need a pattern that works in every case allowed by Python syntax. Here is my current main.py. It doesn't work for all cases: when the prompt string uses single quotes, it is not replaced. I could just add a single quote to the pattern, but I also need to detect whether both kinds of quote are used:
# Evaluating python file checking if input 2 numbers and print sum is correct
import re
import subprocess
input_pattern = re.compile(r"input\s?\([\"]?[\w]*[\"]?\)")
file = open("match.py", 'r')
read = file.read()
file.close()
code = read
matches = input_pattern.findall(code)
for match in matches:
    code = code.replace(match, '8')
file = open("match.py", 'w')
file.write(code)
file.close()
process = subprocess.Popen('python3 match.py', stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
out = process.communicate()[0]
print(out == b"16\n")
file = open("match.py", 'w')
file.write(read)
file.close()
Please let me know if you don't understand this question.
The following regex statement is very close to what you need:
input\s?\((?(?=[\"\'])[\"\'].*[\"\']\)|\))
I am using a conditional regex statement. However, I think it may need a nested conditional to avoid the situation that the user enters something like:
input(' text ")
But hopefully this gets you on the right track.
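One caveat: Python's built-in re module only accepts a group number or name as the condition in (?(...)yes|no), so a lookahead conditional like the one above is rejected by re.compile. As a hedged alternative, the sketch below uses a backreference so that the closing quote must be the same character as the opening one. The name INPUT_CALL and the sample source string are made up for illustration, and the pattern deliberately ignores escaped quotes, f-strings, and non-literal arguments.

```python
import re

# Optional prompt: an opening quote, any characters that are not that
# same quote, then the same quote again (backreference \1).
INPUT_CALL = re.compile(r"""input\s*\(\s*(?:(["'])(?:(?!\1).)*\1)?\s*\)""")

source = 'a = input("Enter a number")\nb = input()\nprint(a+b)\n'
replaced = INPUT_CALL.sub("8", source)
print(replaced)
```

A call with mismatched quotes, such as input(' text "), is left untouched by this pattern, which is usually what you want because that line is a syntax error anyway.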

How can I add tolerance to an input search engine

I have this code to search text in a big text file:
y = input("Apellido/s:").upper()
for line in fread:
    if y in line:
        print(line)
How can I make it search for similar text (a kind of autocorrect) if nothing is found? Not as extensive as what Google does, just matching text that has maybe one extra letter or an accent in a word. I can't work out the algorithm by myself.
I know that I have to add an if, but I'm asking about the logic.
You can do it using find_near_matches from fuzzysearch:
from fuzzysearch import find_near_matches
y = input("Apellido/s:").upper()
for line in fread:
    if find_near_matches(y, line, max_l_dist=2):
        print(line)
find_near_matches returns a list of matches if any are found; if no match is found it returns an empty list, which evaluates to false.
The max_l_dist option sets the maximum total number of substitutions, insertions and deletions (a.k.a. the Levenshtein distance).
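For the accent part of the question specifically, one possible approach that needs only the standard library is to strip accents from both the query and each line before comparing. This is a minimal sketch under that assumption; the helper names strip_accents and matching_lines and the sample data are hypothetical, and it complements rather than replaces the fuzzy match above.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks so that 'García' and 'Garcia' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matching_lines(query, lines):
    """Yield lines containing query, ignoring case and accents."""
    key = strip_accents(query).upper()
    for line in lines:
        if key in strip_accents(line).upper():
            yield line


sample = ["GARCÍA LÓPEZ, ANA", "PEREZ, LUIS"]
print(list(matching_lines("garcia", sample)))
```

Note that this also folds "Ñ" to "N", which may or may not be acceptable for surnames.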

Extract e-mail addresses from .txt files in python

I would like to parse out e-mail addresses from several text files in Python. In a first attempt, I tried to get the following element that includes an e-mail address from a list of strings ('2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n').
When I try to find the list element that includes the e-mail address via i.find("@") == 0 it does not give me the content[i]. Am I misunderstanding the .find() function? Is there a better way to do this?
from os import listdir
TextFileList = []
PathInput = "C:/Users/p282705/Desktop/PythonProjects/ExtractingEmailList/text/"
# Count the number of different files you have!
for filename in listdir(PathInput):
    if filename.endswith(".txt"):  # In case you accidentally put other files in directory
        TextFileList.append(filename)
for i in TextFileList:
    file = open(PathInput + i, 'r')
    content = file.readlines()
    file.close()
    for i in content:
        if i.find("@") == 0:
            print(i)
The standard way of checking whether a string contains a character, in Python, is using the in operator. In your case, that would be:
for i in content:
    if "@" in i:
        print(i)
The find method, as you were using it, returns the position where the @ character is located, starting at 0, as described in the official Python documentation.
For instance, in the string abc@google.com, it will return 3. If the character is not found, it will return -1. The equivalent code would be:
for i in content:
    if i.find("@") != -1:
        print(i)
However, this is considered unpythonic and the in operator is preferred.
find returns the index of the substring you are searching for, so comparing the result with 0 only matches lines that start with @. This isn't correct for what you are trying to do.
You would be better off using a regular expression (RE) to search for an occurrence of @. You may also run into a situation where there is more than one e-mail address per line (I don't know your input data, so I can't guess).
Something along these lines would help:
import re
for i in content:
    findEmail = re.search(r'[\w\.-]+@[\w\.-]+', i)
    if findEmail:
        print(findEmail.group(0))
You would need to adjust this for valid e-mail addresses... I'm not entirely sure whether symbols like + are allowed.
The find function in Python returns the index of that character in a string. Maybe you can try this?
words = i.split(' ')  # split the string into words
for x in words:  # search each word for the @ character
    if x.find("@") != -1:
        print(x)
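Building on the regular-expression answer, here is a hedged sketch that assumes addresses are written with the usual @ separator, allows + in the local part, and does not swallow a sentence-ending period. The pattern name EMAIL and the function extract_emails are made up for illustration; the pattern is intentionally loose and is not a full validator of e-mail syntax.

```python
import re

# Local part: word characters, dots, plus and hyphen.
# Domain: labels separated by dots, never ending in a dot.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(lines):
    """Return every e-mail-like string found in an iterable of lines."""
    found = []
    for line in lines:
        found.extend(EMAIL.findall(line))
    return found


content = ["2To whom correspondence should be addressed. E-mail: joachim+pnas@uci.edu.\n"]
print(extract_emails(content))
```

Because findall is used, several addresses on one line are all returned.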

Python: Regex a dictionary using user input wildcards

I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
    filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that I am reading from a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter: if I just have 't' then I get a bunch of t's. So this tells me that I am matching single letters. However, I want to match whole words and, even more, to preserve the wildcard structure.
What I would like is for a user to enter text (including a wildcard) that I can substitute in place of 'th*', with the wildcard still behaving as it should. That leads to the question: can I just use a variable holding the search text in place of 'th*'? After some investigation I wonder whether I am supposed to translate the 'th*' first, and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What is the best way to go about "passing in regex formulas"? I would also appreciate an idea of what is wrong in my code, since the second snippet does not operate on the incoming text the way the first (correctly) operates on the list.
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on whitespace (spaces and newlines) to make a list of words
    filtered = fnmatch.filter(file_contents, 'th*')
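To address the "can I just use a variable" part: fnmatch.filter iterates over whatever it is given, so a plain string is treated as a sequence of single characters, while a list of words is matched word by word. The pattern itself can simply be a variable. The sketch below assumes whitespace-separated words; the function name find_words is hypothetical, and punctuation stays attached to words in this simple form.

```python
import fnmatch


def find_words(text, pattern):
    """Return the words in text that match a shell-style wildcard pattern."""
    words = text.lower().split()
    return fnmatch.filter(words, pattern.lower())


user_pattern = "th*"  # in the real program this would come from input()
print(find_words("This is just a test thing", user_pattern))
```

With this approach there is no need to call fnmatch.translate yourself; it is only useful if you want the equivalent regular expression for use with the re module.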

File renaming; Can I get some feedback

Background: A friend of mine, who just might have some OCD issues, was telling me a story of how he was not looking forward to the hours of work he was about to invest into renaming tons of song files that had the words An, The, Of and many more capitalized.
Criteria: He gave me a list of words, omitted here because you will see them in the code, and told me that capitalization is O.K. if they are at the beginning of the song, but otherwise they must be lowercase.
Question 1: This is actually my first script and I am looking for some feedback. If there is a better way to write this, I would like to see it so I can improve my coding. The script is functional and does exactly what I would like it to do.
Question 2: Initially I did not have all 3 functions. I only had the function that replaced words. For some reason it would not work on files that looked like this "The Dark Side Of The Moon". When I ran the code, the "Of" would be replaced but neither of the "The"s would be. So, through trial and error I found that if I lowercase the first letter of the file, do my replace function and finally uppercase the file, it would work. Any clue as to why?
import os
words = ['A','An','The','Of','For','To','By','Or','Is','In','Out','If','Oh','And','On','At']
fileList = []
rootdir = ''
#Where are the files? Is the input a valid directory?
while True:
    rootdir = raw_input('Where is your itunes library? ')
    if os.path.isdir(rootdir): break
    print('That is not a valid directory. Try again.')
#Get a list of all the files in the directory/sub-directories
for root, subFolders, files in os.walk(rootdir):
    for file in files:
        fileList.append(os.path.join(root))
#Create a function that replaces words.
def rename(a,b,c):
    for file in os.listdir(c):
        if file.find(a):
            os.rename(file,file.replace(a,b))
#Create a function that changes the first letter in a filename to lowercase.
def renameL():
    for file in os.listdir(os.getcwd()):
        if file.find(' '):
            os.rename(file,file.replace(file,file[0].lower()+file[1:]))
#Create a function that changes the first letter in a filename to uppercase.
def renameU():
    for file in os.listdir(os.getcwd()):
        if file.find(' '):
            os.rename(file,file.replace(file,file[0].upper()+file[1:]))
#Change directory/lowercase the first letter of the filename/replace the offending word/uppercase the first letter of the filename.
for x in fileList:
    for y in words:
        os.chdir(x)
        renameL()
        rename(y,y.lower(),x)
        renameU()
Exit = raw_input('Press enter to exit.')
OK, some criticisms:
Don't prompt for arguments, get them from the command line. It makes testing, scripting, and a lot of other things much easier.
The implementation you've got wouldn't distinguish, e.g. "the" from "theater"
You're using the current-working-directory to pass around the directory you're working on. Don't do this, just use a variable.
Someone else said, "use set, it's faster". That advice is incorrect; the correct advice is "use set, because you need a set". A set is an unordered collection of unique items (a list is an ordered collection of not-necessarily-unique items). As a bonus for using the right collection, your program will probably run faster.
You need to properly split up the work you're trying to do. I'll explain:
Your program has two parts: 1. You need to loop through all the files in some directory and rename them according to some rule. 2. The rule: given a string (yeah, it's going to be a file name, but forget about that), capitalize the first word and all of the subsequent words that aren't in some given set.
You've got (1) down pretty pat, so dig further into (2). The steps there are a. knock everything down to lower-case. b. Break the string into words. c. For each word, capitalize it if you're supposed to. d. Join the words back into a string.
Write (2) and write a test program that calls it to make sure it works properly:
assert capitalizeSongName('the Phantom Of tHe OPERA') == 'The Phantom of the Opera'
When you're happy with (2), write (1) and the whole thing should work.
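Steps a to d can be sketched as follows. This is one possible reading of the rule, not a definitive implementation: the set name SMALL_WORDS is made up, its contents are taken from the word list in the question, and splitting on whitespace means runs of spaces are collapsed to one.

```python
# Words that stay lowercase unless they are the first word (from the question).
SMALL_WORDS = frozenset([
    'a', 'an', 'the', 'of', 'for', 'to', 'by', 'or',
    'is', 'in', 'out', 'if', 'oh', 'and', 'on', 'at',
])


def capitalizeSongName(name):
    """Capitalize the first word and every later word not in SMALL_WORDS."""
    words = name.lower().split()          # a. lower-case, b. break into words
    result = []
    for index, word in enumerate(words):  # c. capitalize where required
        if index == 0 or word not in SMALL_WORDS:
            result.append(word.capitalize())
        else:
            result.append(word)
    return ' '.join(result)               # d. join back into a string


assert capitalizeSongName('the Phantom Of tHe OPERA') == 'The Phantom of the Opera'
print(capitalizeSongName('The Dark Side Of The Moon'))
```

Because whole words are compared against the set, "theater" is no longer confused with "the".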
Repeated code is usually considered bad style (DRY is the buzzword). Also I usually try not to interleave functionality.
For the "design" of this little script I would first walk the directories and create a large list of all audio files and directories. Then I would write a function that handles changing one of the items in the list and create another list using map. Now you have a current list and a wanted list. Then I would zip those lists together and rename them all.
If your music library is really huge, you can use itertools so that you don't hold large lists in memory but iterators (only one item in memory at once). This is really easy in Python 2: use imap instead of map and izip instead of zip. (In Python 3, the built-in map and zip are already lazy.)
To give you an impression and a few hints to useful functions, here is a rough sketch of how I would do it. (Warning: untested.)
import os
import sys
words = ['A','An','The','Of','For','To','By','Or','Is','In','Out','If','Oh','And','On','At']
wantWords = map(str.lower, words)
def main(args):
    rootdir = args[1]
    files = findFiles(rootdir)
    wantFiles = map(cleanFilename, files)
    rename(files, wantFiles)
def findFiles(rootdir):
    result = []
    for root, subFolders, files in os.walk(rootdir):
        for filename in files:
            result.append(os.path.join(root, filename))
    return result
def cleanFilename(filename):
    # do replacement magic
    return filename  # placeholder so the sketch runs: returns the name unchanged
def rename(files, wantFiles):
    for source, target in zip(files, wantFiles):
        os.rename(source, target)
if __name__ == '__main__':
    main(sys.argv)
The advantage is that you can see in main() what is happening without looking into the details of the functions. Each function does one thing: one only walks the filesystem, one only changes a single filename, and one actually renames files.
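One detail the sketch leaves open is that findFiles returns full paths, while the capitalization rule should presumably touch only the file name, not the directories or the extension. A hedged way to handle that is shown below; the helper name clean_path and its rule parameter are hypothetical, and str.title is used only as a stand-in rule for the demonstration.

```python
import os


def clean_path(path, rule):
    """Apply rule to the file name stem only, keeping directory and extension."""
    directory, name = os.path.split(path)
    stem, extension = os.path.splitext(name)
    return os.path.join(directory, rule(stem) + extension)


print(clean_path(os.path.join("Music", "the dark side.mp3"), str.title))
```

Any function that maps one string to another can be passed as rule, which keeps the renaming loop and the naming rule separate.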
Use a set instead of a list. (It's faster)
I'm not really sure what you're trying to do there. The approach I took was just to lowercase the whole thing, then uppercase the first letter of each word as long as that word isn't in the set, and then uppercase the very first letter regardless (just in case it was one of those special words).
C# version I wrote a little while ago:
private static HashSet<string> _small = new HashSet<string>(new[] { "of", "the", "and", "on", "sur", "de", "des", "le", "la", "les", "par", "et", "en", "aux", "d", "l", "s" });
static string TitleCase(string str)
{
    if (string.IsNullOrEmpty(str)) return string.Empty;
    return string.Concat(char.ToUpper(str[0]),
        Regex.Replace(str, @"\w+", m =>
        {
            string lower = m.Value.ToLower();
            return _small.Contains(lower)
                ? lower
                : string.Concat(char.ToUpper(lower[0]), lower.Substring(1));
        })
        .Substring(1));
}
I used a regex instead of splitting on spaces because I had a lot of French words in there that were separated by apostrophes instead.
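For readers following along in Python, a rough analogue of the same regex-based idea might look like this. It is a sketch rather than a line-by-line port: the names SMALL_WORDS and title_case are made up, and the set is a shortened, illustrative subset of the one in the C# snippet.

```python
import re

# Illustrative subset of the small words from the C# snippet above.
SMALL_WORDS = {"of", "the", "and", "on", "de", "la", "le", "l", "d"}


def title_case(text):
    """Capitalize each word except small ones; always capitalize the first letter."""
    if not text:
        return ""

    def fix_word(match):
        lower = match.group(0).lower()
        return lower if lower in SMALL_WORDS else lower[0].upper() + lower[1:]

    result = re.sub(r"\w+", fix_word, text)
    return result[0].upper() + result[1:]


print(title_case("the dark side of the moon"))
```

Matching on \w+ rather than splitting on spaces is what lets "l'amour" be treated as two words.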
