Split large text file using keyword delimiter - python

I'm trying to split a large text file into smaller text files by using a word delimiter. I tried searching, but I've only seen posts about breaking files apart after x lines. I'm fairly new to programming, but I've made a start. I want to go through all the lines, and if a line starts with hello, put it and all the following lines into one file until the next hello is reached. The first word in the file is hello. Ultimately, I'm trying to get the text into R, but I think it would be easier if I split it up like this first. Any help is appreciated, thanks.
text_file = open("myfile.txt","r")
lines = text_file.readlines()
print len(lines)
for line in lines :
    print line
    if line[0:5] == "hello":

If you are looking for very simple logic, try this.
text_file = open("myfile.txt","r")
lines = text_file.readlines()
text_file.close()
print len(lines)
target = open("filename.txt", 'a') ## 'a' will append, 'w' will overwrite
hello1Found = False
hello2Found = False
for line in lines :
    if hello1Found == True :
        if line[0:5] == "hello":
            hello2Found = True
            hello1Found = False
            break ## when the second hello is found, looping/saving to the file stops
            ## (break is not great practice here, but it suffices for this simple requirement)
        else:
            print line # write the line to the new file
            target.write(line)
    if hello1Found == False:
        if line[0:5] == "hello": ## find the first occurrence of hello
            hello1Found = True
            print line
            target.write(line) ## hello found for the first time: write this line and
            ## the subsequent lines to the new file until the second hello occurs
target.close() ## flush the output to disk

I am new to Python. I just finished a Python for Geographic Information Systems class at Northeastern University. This is what I came up with.
def files():
    # generator that opens a new numbered output file on each next() call
    n = 0
    while True:
        n += 1
        yield open('/output/dir/%d.txt' % n, 'w')

pattern = 'hello'
fs = files()
outfile = next(fs)
filename = r'C:\output\dir\filename.txt'
with open(filename) as infile:
    for line in infile:
        if pattern not in line:
            outfile.write(line)
        else:
            items = line.split(pattern)
            outfile.write(items[0])  # text before the pattern stays in the current file
            for item in items[1:]:
                outfile.close()  # finish the current file before starting the next one
                outfile = next(fs)
                # split() drops the pattern itself; write pattern + item to keep it
                outfile.write(item)
outfile.close()
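For comparison, the grouping step can be separated from the file writing. The following is a minimal Python 3 sketch, not any poster's method: it starts a new chunk at every line that begins with the keyword. The names split_by_keyword and write_chunks and the part_N.txt naming scheme are made up for illustration.

```python
def split_by_keyword(lines, keyword="hello"):
    """Group lines into chunks; each chunk starts at a line beginning with keyword."""
    chunks = []
    for line in lines:
        # Start a new chunk on every delimiter line, and also for any
        # leading text that appears before the first delimiter.
        if line.startswith(keyword) or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    return chunks


def write_chunks(chunks, name_template="part_{}.txt"):
    """Write each chunk to its own file (hypothetical naming scheme)."""
    for number, chunk in enumerate(chunks, start=1):
        with open(name_template.format(number), "w") as out:
            out.writelines(chunk)


sample = ["hello one\n", "a\n", "b\n", "hello two\n", "c\n"]
chunks = split_by_keyword(sample)
print(len(chunks))
```

Because split_by_keyword accepts any iterable of lines, an open file object can be passed in directly, so a large file is read line by line rather than through readlines().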

Related

Check if the very last element of a list matches a string of choice

This sounds like it has been asked before but I cannot find anything that helps me.
My code is:
with open('file directory', 'r') as file:
    data = []
    for line in file:
        for word in line.split():
            data.append(str(word))
            last_word = str(data[-1])
            if last_word == "bobby":
                print("yes")
            else:print("no")
I want to know if the last element of the text file which I turned into a list matches a string, for example "bobby". Instead, what I get is basically a counter of how many times bobby was mentioned in the list.
Terminal:
yes
yes
yes
How about:
with open('file directory', 'r') as file:
    last_line = file.readlines()[-1].strip()
print("yes" if last_line == "bobby" else "no")
or, if you want to be short and cryptic:
with open('file directory', 'r') as file:
    print("yes" if file.readlines()[-1].strip() == "bobby" else "no")
If you want to look for the last word of each line, this works.
with open('asdf.txt', 'r') as file:
    for line in file:
        words = line.split()
        try:
            if words[-1] == "bobby":
                print("yes")
            else:
                print("no")
        except IndexError:
            # blank line
            pass
Instead of looping over the file, just turn it into a list and grab the last element. You're very close; the loop is simply unnecessary:
with open('file directory', 'r') as file:
    data = file.readlines()
last_word = data[-1].split()[-1]
if last_word == "bobby":
    print("yes")
else:
    print("no")
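One edge case the snippets above do not cover is a file that ends with a blank line, where data[-1].split() is empty and indexing it raises IndexError. A minimal sketch that remembers the last word seen on any non-blank line could look like this; last_word here is a hypothetical helper name, and the sample list stands in for an open file.

```python
def last_word(lines):
    """Return the last whitespace-separated word in lines, or None if there is none."""
    last = None
    for line in lines:
        words = line.split()
        if words:  # skip blank lines, including a trailing one
            last = words[-1]
    return last


sample = ["alice bob\n", "carol bobby\n", "\n"]
print("yes" if last_word(sample) == "bobby" else "no")
```

The check runs once, after the loop, which is why only a single yes or no is printed.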

Program not recognizing conditional statements in for loop

I'm trying to print "None" if the input entered by the user is not found in a text file I created. It should also print the lines if the word(s) are found in the text file.
My problem right now is that it is not handling both conditionals. If I remove the "line not in user_pass" check, it does not print anything. I just want the user to know whether the strings they entered can be found in the file: print the matching line, or "none" if nothing is found.
I commented out the parts where I tried fixing my code, but to no avail.
My code below:
def text_search(text):
    try:
        filename = "words.txt"
        with open(filename) as search:
            print('\nWord(s) found in file: ')
            for line in search:
                line = line.rstrip()
                if 4 > len(line):
                    continue
                if line.lower() in text.lower():
                    print("\n" + line)
                # elif line not in text: # the function above will not work if this conditional commented out
                #     print("None")
                #     break
                # if line not in text: # None will be printed so many times and line.lower in text.lower() conditional will not work
                #     print("none")
    except OSError:
        print("ERROR: Cannot open file.")

text_search("information")
You could change for line in search: to for line in search.readlines():, although iterating over the file object already yields its lines one by one, so that change alone should not alter the result. Have you tried a plain print(line) to make sure your program is reading anything at all?
#EDIT
Here is how I would approach the problem:
def text_search(text):
    word_found = False
    filename = "words.txt"
    try:
        with open(filename) as file:
            file_by_line = file.readlines() # returns a list
    except OSError:
        print("ERROR: Cannot open file.")
        return # nothing to search if the file could not be read
    print(file_by_line) # lets you know you read the data correctly
    for line in file_by_line:
        line = line.rstrip()
        if 4 > len(line):
            continue
        if line.lower() in text.lower():
            word_found = True
            print("\n" + line)
    if word_found is False:
        print("Could not find that word in the file")

text_search("information")
I like this approach because:
It is clear where you are reading the file and assigning it to a variable.
This variable is then printed, which is useful for debugging.
Less stuff is in the try: clause. I like to not hide my errors. That's not a huge deal here because you did a good job specifying OSError, but what if an OSError occurred during line = line.rstrip() for some reason? You would never know!
If this helped I'd appreciate if you would click that green check :)
Try this:
def find_words_in_line(words, line):
    for word in words:
        if word in line:
            return True
    return False

def text_search(text, case_insensitive=True):
    if case_insensitive:
        text = text.lower()  # lower-case before splitting so the flag takes effect
    words = text.split()
    try:
        filename = 'words.txt'
        with open(filename) as search:
            found = False
            for line in search:
                line = line.strip()
                candidate = line.lower() if case_insensitive else line
                if find_words_in_line(words, candidate):
                    print(line)
                    found = True
            if not found:
                print(None)
    except OSError:
        print('File not found')

text_search('information')
I didn't really understand your code, so I wrote my own according to your requirements.
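The "print None only when nothing matched" requirement is usually easiest to get right by collecting the matches first and deciding once, after the loop. Here is a minimal sketch under the question's own rules (skip entries shorter than 4 characters, test whether the file entry occurs inside the user's text, ignore case). The name find_matches and the sample list standing in for words.txt are assumptions for illustration.

```python
def find_matches(lines, text, min_length=4):
    """Return the entries from lines that occur inside text, ignoring case."""
    haystack = text.lower()
    matches = []
    for line in lines:
        word = line.strip()
        if len(word) < min_length:
            continue  # same short-line filter as in the question
        if word.lower() in haystack:
            matches.append(word)
    return matches


words = ["form\n", "INFO\n", "at\n", "zebra\n"]
found = find_matches(words, "information")
print(found if found else None)
```

Returning a list instead of printing inside the loop keeps the "none found" decision in one place, so it cannot fire once per non-matching line.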

When I open a text file, it only reads the last line

Say customPassFile.txt has two lines in it. First line is "123testing" and the second line is "testing321". If passwordCracking = "123testing", then the output would be that "123testing" was not found in the file (or something similar). If passwordCracking = "testing321", then the output would be that "testing321" was found in the file. I think that the for loop I have is only reading the last line of the text file. Any solutions to fix this?
import time
import linecache
def solution_one(passwordCracking):
    print("Running Solution #1 # " + time.strftime("%Y-%m-%d %H:%M:%S",time.localtime()))
    startingTimeSeconds = time.time()
    currentLine = 1
    attempt = 1
    passwordFound = False
    wordListFile = open("customPassFile.txt", encoding="utf8")
    num_lines = sum(1 for line in open('customPassFile.txt'))
    while(passwordFound == False):
        for i, line in enumerate(wordListFile):
            if(i == currentLine):
                line = line
        passwordChecking = line
        if(passwordChecking == passwordCracking):
            passwordFound = True
            endingTimeSeconds = time.time()
            overallTimeSeconds = endingTimeSeconds - startingTimeSeconds
            print("~~~~~~~~~~~~~~~~~")
            print("Password Found: {}".format(passwordChecking))
            print("ATTEMPTS: {}".format(attempt))
            print("TIME TO FIND: {} seconds".format(overallTimeSeconds))
            wordListFile.close()
            break
        elif(currentLine == num_lines):
            print("~~~~~~~~~~~~~~~~~")
            print("Stopping Solution #1 # " + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            print("REASON: Password could not be cracked")
            print("ATTEMPTS: {}".format(attempt))
            break
        else:
            attempt = attempt + 1
            currentLine = currentLine + 1
            continue
The main problem with your code is that you open the file and you read it multiple times. The first time the file object position goes to the end and stays there. Next time you read the file nothing happens, since you are already at the end of the file.
Example
Sometimes an example is worth more than lots of words.
Take the file test_file.txt with the following lines:
line1
line2
Now open the file and read it twice:
f = open('./test_file.txt')
f.tell()
>>> 0
for l in f:
    print(l, end='')
else:
    print('nothing')
>>> line1
>>> line2
>>> nothing
f.tell()
>>> 12
for l in f:
    print(l, end='')
else:
    print('nothing')
>>> nothing
f.close()
The second time nothing happens, as the file object is already at the end.
Solution
Here you have two options:
1. Read the file only once, save all the lines in a list, and then use the list in your code. It should be enough to replace
wordListFile = open("customPassFile.txt", encoding="utf8")
num_lines = sum(1 for line in open('customPassFile.txt'))
with
with open("customPassFile.txt", encoding="utf8") as f:
    wordListFile = f.readlines()
num_lines = len(wordListFile)
2. Reset the file object position with seek after you have read the file. It would be something along these lines:
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
wordListFile.seek(0)
I would go with option 1 unless you have memory constraints (e.g. the file is bigger than memory).
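Both options can be tried without touching the disk. In this small sketch, io.StringIO is used as a stand-in for an open text file (an assumption made purely so the example is self-contained); a real file object from open() exposes the same iteration, seek and readlines calls.

```python
import io

# io.StringIO stands in for an open text file so the sketch needs no file on disk.
f = io.StringIO("line1\nline2\n")

first_pass = list(f)    # consumes the stream up to the end
second_pass = list(f)   # nothing left to read, so this is empty

f.seek(0)               # option 2: rewind to the start
third_pass = list(f)    # the lines are available again

f.seek(0)
cached = f.readlines()  # option 1: read once, then reuse the list freely
print(len(first_pass), len(second_pass), len(third_pass), len(cached))
```

The cached list can be looped over any number of times, which is what makes option 1 the simpler choice when the file fits in memory.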
Notes
I have a few extra notes:
Python starts counting at 0 (like C/C++), not at 1 (like Fortran). So you probably want to set:
currentLine = 0
When you read a file, the newline character \n is not stripped, so you have to strip it yourself (with strip) or account for it when comparing strings (e.g. with startswith). For example:
passwordChecking == passwordCracking
will likely always return False, as passwordChecking contains \n and passwordCracking very likely doesn't.
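Putting the two notes together, a single pass with enumerate and a stripped comparison might look like the following sketch. The function name find_password and the in-memory word list are hypothetical; the list mirrors the two-line customPassFile.txt described in the question.

```python
def find_password(lines, target):
    """Return the 1-based attempt number at which target matches, or None."""
    for attempt, line in enumerate(lines, start=1):
        # Drop the trailing newline before comparing; otherwise
        # "123testing\n" never equals "123testing".
        if line.rstrip("\n") == target:
            return attempt
    return None


word_list = ["123testing\n", "testing321\n"]
print(find_password(word_list, "123testing"))
print(find_password(word_list, "testing321"))
print(find_password(word_list, "missing"))
```

Since the loop visits every line exactly once, there is no need for a separate line counter or for reopening the file.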
Disclaimer
I haven't tried the code, nor my suggestions, so there might be other bugs lurking around.
I will delete this answer once the OP understands the indentation problem, or once I understand the intention of the code.
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
passwordChecking = line
#rest of the code.
Here your code is outside of the for loop, so only the last line is kept.
for i, line in enumerate(wordListFile):
    if(i == currentLine):
        line = line
        passwordChecking = line
        #rest of the code.

Evaluating the next line in a For Loop while in the current iteration

Here is what I am trying to do:
I am trying to solve an issue that has to do with wrapping in a text file.
I want to open a txt file, read a line and if the line contains what I want it to contain, check the next line to see if it does not contain what is in the first line. If it does not, add the line to the first line.
import re
stuff = open("my file")
for line in stuff:
    if re.search("From ", line):
        first = line
        print first
        if re.search('From ', handle.next()):
            continue
        else: first = first + handle.next()
    else: continue
I have looked at quite a few things and cannot seem to find an answer. Please help!
I would try to do something like this, but it is invalid for triples of "From " and not elegant at all.
lines = open("file", 'r').readlines()
lines2 = open("file2", 'w')
counter_list=[]
last_from = 0
current_count = -2  # sentinel: no "From " line seen yet
for counter, line in enumerate(lines):
    if "From " in line and counter != last_from +1:
        last_from = counter
        current_count = counter
    if current_count+1 == counter:
        if "From " in line:
            counter_list.append(current_count+1)
for counter, line in enumerate(lines):
    if counter in counter_list:
        lines2.write(line)
    else:
        lines2.write(line + '\n')  # write() takes a single string
lines2.close()
Then you can check file2 to see whether it helped.
You could also reverse the order of the lines and then check the next line instead of the previous one. That would solve your problem in one loop.
Thank you Martjin for helping me reset my mind frame! This is what I came up with:
handle = open("my file")
first = ""
second = ""
sent = ""
for line in handle:
    line = line.rstrip()
    if len(first) > 0:
        if line.startswith("From "):
            if len(sent) > 0:
                print sent
            else: continue
            first = line
            second = ""
        else:
            second = second + line
    else:
        if line.startswith("From "):
            first = line
    sent = first + second
It is probably crude, but it definitely got the job done!
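The same idea can be written with a list of records instead of three string variables: a line starting with the marker opens a new record, and any other line is appended to the record before it. This is a minimal Python 3 sketch with a made-up name (unwrap) and made-up sample data, not the poster's code.

```python
def unwrap(lines, marker="From "):
    """Join each line starting with marker with the continuation lines after it."""
    records = []
    for line in lines:
        line = line.rstrip()
        if line.startswith(marker):
            records.append(line)   # a new record begins
        elif records:
            records[-1] += line    # wrapped text belongs to the previous record
        # lines before the first marker are ignored
    return records


sample = ["From a@x.org Sat\n", " Jan 5\n", "From b@y.org Sun\n"]
for record in unwrap(sample):
    print(record)
```

Because nothing needs to be known about the next line in advance, no look-ahead on the iterator is required, and the final record is returned along with the others.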

How to find a specific line of text in a text file with python?

def match_text(raw_data_file, concentration):
    file = open(raw_data_file, 'r')
    lines = ""
    print("Testing")
    for num, line in enumerate(file.readlines(), 0):
        w = ' WITH A CONCENTRATION IN ' + concentration
        if re.search(w, line):
            for i in range(0, 6):
                lines += linecache.getline(raw_data_file, num+1)
            try:
                write(lines, "lines.txt")
                print("Lines Data Created...")
            except:
                print("Could not print Line Data")
        else:
            print("Didn't Work")
I am trying to open a .txt file and search for a specific string.
If you are simply trying to write all of the lines that hold your string to a file, this will do.
def match_text(raw_data_file, concentration):
    look_for = ' WITH A CONCENTRATION IN ' + concentration
    with open(raw_data_file) as fin, open('lines.txt', 'w') as fout:
        fout.writelines(line for line in fin if look_for in line)
Fixed my own issue. The following works to find a specific line and get the lines following the matched line.
def match_text(raw_data_file, match_this_text):
    w = match_this_text
    lines = ""
    with open(raw_data_file, 'r') as inF:
        for line in inF:
            if w in line:
                lines += line  # add the matched line to the lines string
                for i in range(0, however_many_lines_after_matched_text):
                    lines += next(inF)
    # do something with 'lines', which is the final multiline text
This will return multiple lines plus the matched string that the user wants. I apologize if the question was confusing.
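One caveat with calling next() on the file inside the loop is that it raises StopIteration when the match sits too close to the end of the file. A hedged variant using itertools.islice avoids that, since islice simply stops when the input runs out. The function name match_with_following and its following parameter are hypothetical, and the sample list stands in for the opened file.

```python
from itertools import islice


def match_with_following(lines, text, following=2):
    """Return each line containing text plus up to `following` lines after it."""
    iterator = iter(lines)
    collected = []
    for line in iterator:
        if text in line:
            collected.append(line)
            # islice stops quietly at the end of input, unlike a bare next().
            collected.extend(islice(iterator, following))
    return "".join(collected)


sample = ["a\n", "MATCH here\n", "b\n", "c\n", "d\n"]
print(match_with_following(sample, "MATCH"), end="")
```

As in the answer above, the lines consumed after a match are not themselves checked for the search text.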
