Find word in file, return definition - python

I'm making a small dictionary application just to learn Python. I have the function for adding words done (just need to add a check to prevent duplicates) but I'm trying to create the function for looking words up.
This is what my text file looks like after I append words to it:
{word|Definition}
And I can check whether the word exists by doing this:
if word in open("words/text.txt").read():
But how do I get the definition? I assume I need to use regex (which is why I split the entry up and placed it inside curly braces), but I have no idea how.

read() would read the entire file contents. You could do this instead:
def lookup(word):
    with open("words/text.txt") as f:
        for line in f:
            split_lines = line.strip().strip('{}').split('|')
            if word == split_lines[0]:  # or `word in line` would look for word anywhere in the line
                return split_lines[1]

You can use a dictionary if you want efficient lookups.
with open("words/text.txt") as fr:
dictionary = dict(line.strip()[1:-1].split('|') for line in fr)
print(dictionary.get(word))
Also, try to avoid constructs like the one below:
if word in open("words/text.txt").read():
Use a context manager (the with statement) to ensure that the file gets closed.

To get all definitions:
with open("words/text.txt") as f:
    for line in f:
        print(line.strip().strip('{}').split('|')[1])
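
Since the question asks specifically about regex, here is a minimal sketch of a regex-based lookup (assuming the {word|Definition} format shown above; the function name is made up):

import re

def find_definition(word):
    # Matches a '{word|Definition}' line and captures the definition
    pattern = re.compile(r'\{%s\|(.*)\}' % re.escape(word))
    with open("words/text.txt") as f:
        for line in f:
            match = pattern.match(line.strip())
            if match:
                return match.group(1)

print(find_definition("word"))  # prints the definition, or None if absent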

Related

How to ignore a line when using readlines in python?

a = open('D:/1.txt', 'r')
b = a.readlines()
Now we get b, which contains all the lines in 1.txt.
But we know that in Python, when we don't want a line of code to run, we can use the # mark to ignore that specific line.
Is there any similar mark we could use in a TXT file to make readlines ignore a specific line?
Text files have no specific syntax, they are just a sequence of characters. It is up to your program to decide if any of these characters have particular meaning for your program.
For example, if you wanted to read all lines but discard those starting with '#', you could filter them out using a list comprehension:
with open('D:/1.txt', 'r') as a:
    lines = [line for line in a if not line.startswith('#')]

Amend list from file - Correct syntax and file format?

I currently have a list hard coded into my python code. As it keeps expanding, I wanted to make it more dynamic by reading the list from a file. I have read through many articles about how to do this, but in practice I can't get this working. So firstly, here is an example of the existing hardcoded list:
serverlist = []
serverlist.append(("abc.com", "abc"))
serverlist.append(("def.com", "def"))
serverlist.append(("hji.com", "hji"))
When I enter the command print serverlist, the output is shown below, and my list works perfectly when I access it:
[('abc.com', 'abc'), ('def.com', 'def'), ('hji.com', 'hji')]
Now I've replaced the above code with the following:
serverlist = []
with open('/server.list', 'r') as f:
    serverlist = [line.rstrip('\n') for line in f]
With the contents of server.list being:
'abc.com', 'abc'
'def.com', 'def'
'hji.com', 'hji'
When I now enter the command print serverlist, the output is shown below:
["'abc.com', 'abc'", "'def.com', 'def'", "'hji.com', 'hji'"]
And the list is not working correctly. So what exactly am I doing wrong? Am I reading the file incorrectly or am I formatting the file incorrectly? Or something else?
The contents of the file are not interpreted as Python code. When you read a line in f, it is a string; and the quotation marks, commas etc. in your file are just those characters as parts of a string.
If you want to create some other data structure from the string, you need to parse it. The program has no way to know that you want to turn the string "'abc.com', 'abc'" into the tuple ('abc.com', 'abc'), unless you instruct it to.
This is the point where the question becomes "too broad".
If you are in control of the file contents, then you can simplify the data format to make this more straightforward. For example, if a line of the file just contains abc.com abc, your string ends up as 'abc.com abc' and you can simply .split() it; this assumes that you don't need to represent whitespace inside either of the two items. You could instead split on another character (like the comma, in your case) if necessary: .split(',').
If you need a general-purpose hammer, you might want to look into JSON. There is also ast.literal_eval, which can be used to treat text as simple Python literal expressions; in this case, you would need the lines of the file to include the enclosing parentheses as well, as sketched below.
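A minimal sketch of the ast.literal_eval route (assuming each line of the file, parentheses included, reads like ('abc.com', 'abc')):

import ast

with open('server.list') as f:
    # literal_eval safely evaluates each line as a Python literal tuple
    serverlist = [ast.literal_eval(line.strip()) for line in f]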
If you are willing to let go of the quotes in your file and rewrite it as
abc.com, abc
def.com, def
hji.com, hji
the code to load it can be reduced to a one-liner, using the fact that files are iterable:
with open('servers.list') as f:
    servers = [tuple(line.strip().split(', ')) for line in f]
Note that iterating over a file does not strip off the newlines, which is why each line is stripped before splitting.
You can allow arbitrary whitespace by doing something like
servers = [tuple(word.strip() for word in line.split(',')) for line in f]
It might be easier to use something like regex to parse the original format. You could use an expression that captures the parts of the line you care about and matches but discards the rest:
import re
pattern = re.compile(r"'(.+)',\s*'(.+)'")
You could then extract the names from the matched groups
with open('servers.list') as f:
    servers = [pattern.fullmatch(line.strip()).groups() for line in f]
This is just a trivialized example. You can make it as complicated as you wish for your real file format.
Try this:
serverlist = []
with open('/server.list', 'r') as f:
    for line in f:
        serverlist.append(tuple(line.rstrip('\n').split(',')))
Explanation
You want an explicit for loop so you cycle through each line as expected.
You need list.append for each line to append to your list.
You need to use split(',') in order to split by commas.
Convert to tuple as this is your desired output.
List comprehension method
The for loop can be condensed as below:
with open('/server.list', 'r') as f:
    serverlist = [tuple(line.rstrip('\n').split(',')) for line in f]

How to automatically change a particular word while writing to a file in python?

Say a method returns me a long list of lines which I am writing to a file. Is there any way I can, on the fly, change the word "Bread" to "Breakfast", assuming the word "Bread" actually exists in several places of the file being generated?
Thanks.
I have assigned sys.stdout to a file object, so that all my console prints go to the file. So an on-the-fly hack would be great.
You could use regular expressions.
import re
word = 'Bread'
rword = 'Breakfast'
line = 'This is a piece of Bread'
line = re.sub(r'\b{0}\b'.format(re.escape(word)), rword, line)
# 'This is a piece of Breakfast'
The advantage of using regular expressions is that it can detect word boundaries (ie. the \b). This prevents it from replacing words that contain your word (ie. Breadth).
You could do this line by line, or replace the word in the whole document at once.
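For the sys.stdout redirection mentioned in the question, one option is a small file-like wrapper that applies the substitution in its write method. This is only a sketch under that assumption; ReplacingWriter and out.txt are made-up names:

import re
import sys

class ReplacingWriter(object):
    # Wraps a stream and substitutes whole words in everything written to it
    def __init__(self, stream, word, replacement):
        self.stream = stream
        self.pattern = re.compile(r'\b%s\b' % re.escape(word))
        self.replacement = replacement

    def write(self, text):
        self.stream.write(self.pattern.sub(self.replacement, text))

    def flush(self):
        self.stream.flush()

sys.stdout = ReplacingWriter(open('out.txt', 'w'), 'Bread', 'Breakfast')
print('This is a piece of Bread')  # lands in out.txt as '... of Breakfast'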
Assuming that it is the list that you want to be changed rather than the file, and assuming that the list is called lines:
lines = [line.replace("Bread", "Breakfast") for line in lines]
You can use the replace string method, like:
text.replace('Bread', 'Breakfast')
Note that this doesn't check if it is a 'word', so it would also change 'Bready' to 'Breakfasty'.
str.replace, as in text.replace('bread', 'breakfast'), where 'bread' is being replaced by 'breakfast'.

Parsing unique words from a text file

I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I'm catching with a regex on my live system.
The parser should walk through each line, and check each word against 3 criteria:
Longer than two characters
Not in a predefined dictionary set dict_file
Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
My working code's below, but it's slow and kludgy; what am I missing?
My production system is running Python 2.5.1 with just the default modules (so NLTK is a no-go) and can't be upgraded to 2.7+.
import string

def process(line):
    line_strip = line.strip()
    return line_strip.translate(None, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                    and (len(word) > 2)):
                report_set.add(word_check)
report_list = list(report_set)
Edit: Updated my code based on steveha's recommendations.
One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.
Example:
report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
Then when you are done:
report_list = list(report_set)
Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:
with open("filename", "r") as f:
for line in f:
... # process each line here
A big problem is that I don't even see how this code can work:
while 1:
    lines = report.readlines()
    if not lines:
        break
This does not actually loop forever. The first statement slurps all input lines with .readlines(); then we loop again, and the next call to .readlines() finds report already exhausted, so it returns an empty list, which breaks out of the loop. But lines has now been overwritten with that empty list, losing all the lines we just read, and the rest of the code must make do with an empty lines variable. How does this even work?
So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.
Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it's already in the set it won't be added again!
Also, you don't have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don't know whether it's important that FOOTNOTES be detected only in upper-case.
So, put all the above together and you get:
import string

def words(file_object):
    for line in file_object:
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
Try replacing report_list with a dictionary or a set. word_check not in report_list is slow if report_list is a list.
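A quick, hypothetical benchmark illustrating the difference (the sizes and values here are made up):

import timeit

setup = "words = [str(i) for i in range(100000)]; s = set(words)"
# Membership testing scans the whole list...
print(timeit.timeit('"99999" in words', setup=setup, number=100))
# ...but is a near-constant-time hash lookup in the set
print(timeit.timeit('"99999" in s', setup=setup, number=100))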

I want to take a user submitted string and search for that exact string in a log file and print the line number

Currently, I am using a regular expression to search for a pattern of numbers in a log file. I also want to add another search capability: a general user-submitted ASCII string search that prints out the line number. This is what I have and am trying to work from (help is appreciated):
logfile = open("13.00.log", "r")
searchString = raw_input("Enter search string: ")

for line in logfile:
    search_string = searchString.findall(line)
    for word in search_string:
        print word  # ideally I would like to create and write to a text file
First of all, strings don't have a findall method -- I don't know where you got that. Second, why use a string method or regex at all? For a simple string search of the kind you're describing, in is sufficient, as in if search_string in line:. To get line numbers, a quick solution is the enumerate built-in function: for line_number, line in enumerate(logfile):.
Your code seems fairly fragmented. Pseudocode would look something like:
get search_string
for line_no, line in enumerate(logfile):
    if search_string in line:
        do output with line_no
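
A minimal runnable version along those lines (writing matches to a hypothetical matches.txt, as the question wanted):

searchString = raw_input("Enter search string: ")
with open("13.00.log", "r") as logfile:
    with open("matches.txt", "w") as out:
        for line_number, line in enumerate(logfile, 1):
            if searchString in line:
                # the matched line already ends with a newline
                out.write("%d: %s" % (line_number, line))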
