I started playing with Python, and programming in general, about three weeks ago, so be gentle ;)
What I'm trying to do is convert text files into the form I want. The text files share the same pattern, but the words I want to replace are unknown, so the program must first find them, build a pattern, and then replace them with the words I want.
For example:
xxxxx
xxxxx
Line3 - word - xxxx xxxx
xxxxx xxxx
word
word
xxxx word
Legend:
xxxxx = template words, present in every file
word = random word, our target
I am able to locate the first appearance of the word because it always appears in the same place in the file; after that it appears randomly.
My code:
f1 = open('test.txt', 'r')
f2 = open('file2.txt', 'w')
pattern = ''
for line in f1.readlines():
    if line.startswith('Seat 1'):
        line = line.split(' ', 3)
        pattern = line[2]
        line = ' '.join(line)
        f2.write(line)
    elif pattern in line.strip():
        f2.write(line.replace(pattern, 'NewWord'))
    else:
        f2.write(line)
f1.close()
f2.close()
This code doesn't work. What's wrong?
Welcome to the world of Python!
I believe you are on the right track and are very close to the correct solution; however, I see a couple of potential issues that may cause your program not to run as expected.
If you are trying to see if a string equals another, I would use == instead of is (see this answer for more info)
When reading a file, lines end with \n which means your variable line might never match your word. To fix this you could use strip, which automatically removes leading and trailing "space" characters (like a space or a new line character)
elif line.strip() == pattern:
This is not really a problem but a recommendation, since you are just starting out. When dealing with files it is highly recommended to use the with statement that Python provides (see question and/or tutorial)
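Here is a sketch of that recommendation applied to your loop. The file names and the sample input are just illustrative (the sample is written out first so the snippet is self-contained); in your program test.txt would already exist.

```python
# Write a tiny sample input so the sketch is self-contained;
# the 'Seat 1' line and the word 'alice' are made up for illustration.
with open('test.txt', 'w') as f:
    f.write('Seat 1: alice (500 chips)\n')
    f.write('alice folds\n')

# The `with` statement closes both files automatically, even on error.
with open('test.txt', 'r') as f1, open('file2.txt', 'w') as f2:
    pattern = ''
    for line in f1:
        if line.startswith('Seat 1'):
            parts = line.split(' ', 3)
            pattern = parts[2]  # the target word found on the 'Seat 1' line
            f2.write(' '.join(parts))
        elif pattern and pattern in line:
            f2.write(line.replace(pattern, 'NewWord'))
        else:
            f2.write(line)
```

Note the extra `pattern and` guard: before the target word is found, pattern is the empty string, and '' in line is always True, which would replace on every line.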
Update:
I saw that the word might be part of the line, so instead of using == as recommended in point 1, you could use in, but you need to reverse the order, i.e.
elif pattern in line:
I'm pretty new to Python and I've been trying to work out how to use if statements that check whether a particular string is preceded by a certain pattern.
For example, take this particular text:
Sep 09 07:54:28 INFO: line of text here
I have multiple lines like this in a file that I have my script reading from. The dates and times change on every line so I can't specify that text exactly.
I'm trying to replace the word INFO with something else.
However, the word INFO is scattered throughout the text file and I don't want to replace every instance of it.
I only want to replace INFO if it's preceded by number number, colon, number number, colon, number number.
So I've used if statements and string.replace(old, new), and I've been reading about 'positive lookbehind assertions', e.g. (?<=abc)def
But I'm unsure how to specify a pattern of text, rather than specifying the exact text.
Just need pointing in the right direction! Thanks
EDIT: I should also point out that there are other instances of INFO that are preceded by a number, so I didn't want to make the rule simply 'preceded by a number'. It will need to be specifically that pattern (xx:xx:xx)
EDIT2: Putting another example here to clarify further based on a comment
Sep 09 07:54:28 INFO: line of text here that contains many words
line of text that also contains the word INFO in the same line
Sep 09 07:56:30 INFO: line of text here that also contains many words
121334234: line of text here that contains INFO as well
I want to replace the word INFO, but only on lines that have the time in that format (num num, colon, num num, colon, num num)
EDIT 3:
with open(infile) as f:
    f = f.read()
with open(infile, 'r') as IN, open('output.html', 'w') as OUT:
    f = re.sub(r'(?<=\d{2}:\d{2}:\d{2})\s*INFO\b', ' INFO2', f)
this isn't returning any error but it doesn't perform any action
EDIT 4:
OUT.write(re.sub(r'(?<=\d{2}:\d{2}:\d{2})\s*INFO\b', ' INFO2', f))
Now this does replace INFO with INFO2, but it also stops all the code below it from working. It depends on where I place the code: if I place it after all of my other code, it doesn't seem to do anything; if I place it straight after where I define my IN and OUT, it breaks all the formatting from the code below it.
You may use the following approach:
import re
s = '''Sep 09 07:54:28 INFO: line of text here that contains many words
line of text that also contains the word INFO in the same line
Sep 09 07:56:30 INFO: line of text here that also contains many words
121334234: line of text here that contains INFO as well'''
repl_str = 'new_info' # sample replacement string
s = re.sub(r'(?<=\d{2}:\d{2}:\d{2})\s*INFO\b', f' {repl_str}', s)
print(s)
The output:
Sep 09 07:54:28 new_info: line of text here that contains many words
line of text that also contains the word INFO in the same line
Sep 09 07:56:30 new_info: line of text here that also contains many words
121334234: line of text here that contains INFO as well
A simple regex like
(?<=\d\d:\d\d:\d\d\s)INFO
would find all such INFO strings
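For instance (the sample strings below are made up):

```python
import re

# The lookbehind requires exactly hh:mm:ss plus one whitespace character
# immediately before INFO; any other INFO is left alone.
s = 'Sep 09 07:54:28 INFO: keep this\nplain INFO stays put\n'
result = re.sub(r'(?<=\d\d:\d\d:\d\d\s)INFO', 'INFO2', s)
print(result)
```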
You can find the pattern without using positive lookbehind assertions as well. Assuming that your file name is test.txt, you can do it as follows:
import re

with open("test.txt", "r") as reader:
    obj = re.compile(r'\d+\s+\d+:\d+:\d+\s+INFO')
    for line in reader:
        x = obj.search(line)
        if x:
            # do what you want to do
            pass
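For example, filling in the placeholder with a replacement step (the sample lines below are made up; in practice they would come from the file):

```python
import re

# Only lines whose timestamp matches the pattern get their INFO replaced.
obj = re.compile(r'\d+\s+\d+:\d+:\d+\s+INFO')
lines = [
    'Sep 09 07:54:28 INFO: line of text here',
    'a line that also contains the word INFO',
]
out = []
for line in lines:
    if obj.search(line):
        line = line.replace('INFO', 'INFO2', 1)
    out.append(line)
```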
I'm trying to make a simple question/answer program, where the questions are written in a normal text file like this.
Problem is, when I split the line (at the #), it also leaves the newline, meaning anyone using the program would have to add the newline to the answer. Is there any way to remove that newline so only the answer is required?
Code:
file1 = open("file1.txt", "r")
p = 0
for line in file1:
    list = line.split("#")
    answer = input(list[0])
    if answer == list[1]:
        p = p + 1
print("points:", p)
Strip all the whitespace from the right side of your input:
list[1] = list[1].rstrip()
Or just the newlines:
list[1] = list[1].rstrip('\n')
Using list as a variable name is a bad idea because it shadows the built-in class name. You can use argument unpacking to avoid the issue entirely and make your code more legible:
prompt, expected = line.split('#')
expected = expected.rstrip()
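For example, with a made-up question line:

```python
# A sample line as it would come out of the file, newline included.
line = 'What is 2+2?#4\n'
prompt, expected = line.split('#')
expected = expected.rstrip()  # drop the trailing newline
```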
You can use rstrip(), which, without arguments, will default to stripping any whitespace character. So just modify the conditional slightly, like this.
if answer == list[1].rstrip():
I have 2 files. One is a text file containing some sentences; the other contains the words I want to delete from the first file. First I have to omit the special words, and then write the unique words into a new file, one word per line. Here is the code I wrote, but it doesn't work. In simple words: I want to omit some words first, then find the unique words.
file1 = open('c:/python34/SimilarityCorpus.txt', 'r')
file2 = open('c:/python34/ListOfStopWords.txt', 'r')
file3 = open('c:/python34/Output1.txt', 'w')
first_words = []
second_words = []
z = []
for line in file1:  # to write unique words
    for word in line.split():
        if word not in z:
            z.append(word)
for line in file1:
    words = line.split()
    for w in words:
        first_words.append(w)
for line in file2:
    w = line.split()
    for i in w:
        second_words.append(i)
for word1 in first_words:
    for word2 in second_words:
        if word1 == word2:
            first_words.remove(word2)
for word in first_words:
    file3.write(word)
    file3.write(' ')
file1.close()
file2.close()
file3.close()
I know this is basic, but I'm new to programming.
Welcome to programming! It's a fun world here :). I hope the answer below will help you.
Firstly, you are looking to get every unique word. Here, the set object may be useful for you. Using the set, you can iterate over every word and add it to the set, without worrying about duplicates.
z = set()
for line in file1:  # to write unique words
    for word in line.split():
        z.add(word)
From my understanding of your code, you want to find the difference between the SimilarityCorpus and the ListOfStopWords, and then write that to disk. Since you are only interested in unique words and not worried about the counts, sets can come to your rescue again.
first_words = set()
for line in file1:
    words = line.split()
    first_words = first_words.union(words)
Here, the set.union(other_iterable) operation removes the need to iterate over the new words one by one. You can do likewise for second_words.
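For example, with a couple of made-up stop-word lines standing in for the open file object file2:

```python
# Sample lines stand in for the file2 object from the question.
stop_lines = ['the a an\n', 'of in\n']
second_words = set()
for line in stop_lines:
    second_words = second_words.union(line.split())
```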
Finally, you want to take the difference between two sets, which is also available in Python. To do that, you will be looking for either:
words in first_words that are absent in second_words, or
words in second_words that are absent in first_words.
In the first case, you would do:
first_words.difference(second_words)
In the second case, you would do:
second_words.difference(first_words)
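A small self-contained example with made-up words:

```python
first_words = {'the', 'quick', 'brown', 'fox'}
second_words = {'the', 'a', 'an'}

# Words in first_words that are absent in second_words.
unique = first_words.difference(second_words)
```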
More documentation on sets can be found here on the Python docs. I would encourage you to use Python 3 rather than 2, which I see you are, so keep sticking with it!
To write to disk, with each word on a new line, you can do the following:
for word in first_words:
    file3.write(word)
    file3.write('\n')  # this will write a new line.
Currently, you have the following code pattern:
file3 = open('/path/to/your/file.txt', 'w')
# do stuff with file3, e.g. write.
file3.close()
I might suggest that you do, instead:
with open('/path/to/file3.txt', 'w') as file3:
    # do stuff with file3.
In this way, you don't need to explicitly open and close the file; the "with open" line can automatically take care of that for you.
I believe the rest of your code is correct, for reading and writing information from and to the disk.
If you could update your question to include more detail on errors that are cropping up, that would really help! Finally, whatever answer you find most useful here, don't forget to upvote/accept it (it doesn't have to be mine, I'm happy to simply add to the corpus of information and help around here).
I have a large text file on my computer (location: /home/Seth/documents/bruteforce/passwords.txt) and I'm trying to find a specific string in it. The list has one word per line and 215,000 lines/words. Does anyone know of a simple Python script I can use to find a specific string?
Here's the code I have so far,
f = open("home/seth/documents/bruteforce/passwords.txt", "r")
for line in f.readlines():
    line = str(line.lower())
    print str(line)
    if str(line) == "abe":
        print "success!"
    else:
        print str(line)
I keep running the script, but it never finds the word in the file (and I know for sure the word is in the file).
Is there something wrong with my code? Is there a simpler method than the one I'm trying to use?
Your help is greatly appreciated.
Ps: I'm using Python 2.7 on a Debian Linux laptop.
I'd rather use the in keyword to look for a string in a line. Here I'm looking for the keyword 'KHANNA' in a csv file, and if it occurs on any line the code returns True.
In [121]: with open('data.csv') as f:
   .....:     print any('KHANNA' in line for line in f)
   .....:
True
It's just because you forgot to strip the newline char at the end of each line.
line = line.strip().lower()
would help.
Usually, when you read lines out of a file, they have a newline character at the end. Thus, they will technically not be equal to the same string without the newline character. You can get rid of this character by adding the line line=line.strip() before the test for equality to your target string. By default, the strip() method removes all white space (such as newlines) from the string it is called on.
What do you want to do? Just test whether the word is in the file? Here:
print 'abe' in open("passwords.txt").read().split()
Or:
print 'abe' in map(str.strip, open("passwords.txt"))
Or if it doesn't have to be Python:
egrep '^abe$' passwords.txt
EDIT: Oh, I forgot the lower. Probably because passwords are usually case sensitive. But if it really does make sense in your case:
print 'abe' in open("passwords.txt").read().lower().split()
or
print 'abe' in (line.strip().lower() for line in open("passwords.txt"))
or
print 'abe' in map(str.lower, map(str.strip, open("passwords.txt")))
Your script doesn't find the line because you didn't check for the newline characters:
Your file is made of many "lines". Each "line" ends with a character that you didn't account for: the newline character ('\n'; see note 1 below). This is the character that creates a new line. It is what gets written to the file when you hit Enter, and it is how the next line is created.
So, when you read the lines out of your file, the string contained in each line actually ends with a newline character. This is why your equality test fails. You should instead, test equality against the line, after it has been stripped of this newline character:
with open("home/seth/documents/bruteforce/passwords.txt") as infile:
    for line in infile:
        line = line.rstrip('\n')
        if line == "abe":
            print 'success!'
1 Note that on some machines, the newline character is in fact two characters - the carriage return (CR), and line-feed (LF). This terminology comes from back in the day when typewriters had to jump a line-width of space on the paper that was being written to, and that the carriage that contained the paper had to be returned to its starting position. When seen in a line in the file, this appears as '\r\n'
I'm working on a project to parse out unique words from a large number of text files. I've got the file handling down, but I'm trying to refine the parsing procedure. Each file has a specific text segment that ends with certain phrases that I'm catching with a regex on my live system.
The parser should walk through each line, and check each word against 3 criteria:
Longer than two characters
Not in a predefined dictionary set dict_file
Not already in the word list
The result should be a 2D array, each row a list of unique words per file, which is written to a CSV file using the .writerow(foo) method after each file is processed.
My working code's below, but it's slow and kludgy. What am I missing?
My production system is running 2.5.1 with just the default modules (so NLTK is a no-go), can't be upgraded to 2.7+.
def process(line):
    line_strip = line.strip()
    return line_strip.translate(punct, string.punctuation)

# Directory walking and initialization here
report_set = set()
with open(fullpath, 'r') as report:
    for line in report:
        # Strip out the CR/LF and punctuation from the input line
        line_check = process(line)
        if line_check == "FOOTNOTES":
            break
        for word in line_check.split():
            word_check = word.lower()
            if ((word_check not in report_set) and (word_check not in dict_file)
                    and (len(word) > 2)):
                report_set.add(word_check)
report_list = list(report_set)
Edit: Updated my code based on steveha's recommendations.
One problem is that an in test for a list is slow. You should probably keep a set to keep track of what words you have seen, because the in test for a set is very fast.
Example:
report_set = set()
for line in report:
    for word in line.split():
        if we_want_to_keep_word(word):
            report_set.add(word)
Then when you are done:
report_list = list(report_set)
Anytime you need to force a set into a list, you can. But if you just need to loop over it or do in tests, you can leave it as a set; it's legal to do for x in report_set:
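For example (sample words made up):

```python
report_set = {'alpha', 'beta', 'gamma'}

# Iterating the set directly; no list conversion needed.
seen = []
for x in report_set:
    seen.append(x)
```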
Another problem that might or might not matter is that you are slurping all the lines from the file in one go, using the .readlines() method. For really large files it is better to just use the open file-handle object as an iterator, like so:
with open("filename", "r") as f:
    for line in f:
        ...  # process each line here
A big problem is that I don't even see how this code can work:
while 1:
    lines = report.readlines()
    if not lines:
        break
This will loop forever. The first statement slurps all input lines with .readlines(), then we loop again, then the next call to .readlines() has report already exhausted, so the call to .readlines() returns an empty list, which breaks out of the infinite loop. But this has now lost all the lines we just read, and the rest of the code must make do with an empty lines variable. How does this even work?
So, get rid of that whole while 1 loop, and change the next loop to for line in report:.
Also, you don't really need to keep a count variable. You can use len(report_set) at any time to find out how many words are in the set.
Also, with a set you don't actually need to check whether a word is in the set; you can just always call report_set.add(word) and if it's already in the set it won't be added again!
Also, you don't have to do it my way, but I like to make a generator that does all the processing. Strip the line, translate the line, split on whitespace, and yield up words ready to use. I would also force the words to lower-case except I don't know whether it's important that FOOTNOTES be detected only in upper-case.
So, put all the above together and you get:
def words(file_object):
    for line in file_object:
        line = line.strip().translate(None, string.punctuation)
        for word in line.split():
            yield word

report_set = set()
with open(fullpath, 'r') as report:
    for word in words(report):
        if word == "FOOTNOTES":
            break
        word = word.lower()
        if len(word) > 2 and word not in dict_file:
            report_set.add(word)

print("Words in report_set: %d" % len(report_set))
Try replacing report_list with a dictionary or set.
word_check not in report_list is slow when report_list is a list, because every lookup scans the whole list.
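A rough illustration of the difference (absolute timings vary by machine, but the set lookup should win by orders of magnitude, since list membership scans every element while set membership is a hash lookup):

```python
import timeit

# 100,000 words; 'word99999' is the worst case for the list scan.
words = ['word%d' % i for i in range(100000)]
word_list = words
word_set = set(words)

list_time = timeit.timeit(lambda: 'word99999' in word_list, number=100)
set_time = timeit.timeit(lambda: 'word99999' in word_set, number=100)
```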