I'm writing a program that reads in a directory of text files and finds a specific combination of strings that are overlapping (i.e. shared among all files). My current approach is to take one file from this directory, parse it, build a list of every string combo, and then search for this string combo in the other files. For instance, if I had ten files, I'd read one file, parse it, store the keywords I need, then search the other nine files for this combination. I'd repeat this for every file (making sure that a file doesn't search itself). To do this, I'm trying to use Python's acora module.
The code I have so far is:
def match_lines(f, *keywords):
"""Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
builder = AcoraBuilder('\r', '\n', *keywords)
ac = builder.build()
line_start = 0
matches = False
for kw, pos in ac.filefind(f): # Modified from original function; search a file, not a string.
if kw in '\r\n':
if matches:
yield f[line_start:pos]
matches = False
line_start = pos + 1
else:
matches = True
if matches:
yield f[line_start:]
def find_overlaps(f_in, fl_in, f_out):
"""f_in: input file to extract string combo from & use to search other files.
fl_in: list of other files to search against.
f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
"""
string_list = build_list(f_in) # Open the first file, read each line & build a list of tuples (string #1, string #2). The "build_list" function isn't shown in my pasted code.
found_lines = [] # Create a list to hold all the lines (and file names, from fl_in) that are found to have the matching (string #1, string #2).
for keywords in string_list: # For each tuple (string #1, string #2) in the list of tuples
for f in fl_in: # For each file in the input file list
for line in match_lines(f, *keywords):
found_lines.append(line)
As you can probably tell, I used the match_lines function from the acora web page, "FAQ and recipes" #3, in the mode that parses files (using ac.filefind()), which is also described on that page.
The code seems to work, but it only yields the file name that has the matching string combination. My desired output is the entire line from the other files that contains my matching string combination (tuple).
The file name fragments you are seeing come from yield f[line_start:pos]: f is the file path, not the file's contents, so the code slices pieces out of the name itself. You need to read the file's text and slice that instead.
Regardless, to get line numbers, you just need to count them as you pass them in match_lines():
line_start = 0
line_number = 0
matches = False
text = open(f, 'r').read()
for kw, pos in ac.filefind(f): # Modified from original function; search a file, not a string.
if kw in '\r\n':
if matches:
yield line_number, text[line_start:pos]
matches = False
line_start = pos + 1
line_number += 1
else:
matches = True
if matches:
    yield line_number, text[line_start:]
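For reference, here is the whole generator reassembled as a self-contained sketch. It reads the file's text up front and searches that with acora's finditer (rather than filefind), so the slices come from the file's contents instead of its name; the function name and structure here are illustrative, not part of the original recipe:

from acora import AcoraBuilder

def match_lines_numbered(filename, *keywords):
    """Yield (line_number, line_text) for each line containing a keyword."""
    ac = AcoraBuilder('\r', '\n', *keywords).build()
    with open(filename) as fh:
        text = fh.read()  # search the contents, not the filename
    line_start = 0
    line_number = 0
    matches = False
    for kw, pos in ac.finditer(text):
        if kw in '\r\n':
            if matches:
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        yield line_number, text[line_start:]

Each tuple gives you both the line number and the full line, which is the output you described wanting.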
I am trying to return a portion of the file name after using os.scandir to loop through all .txt files. I'm taking a directory, searching through each text file within the directory for certain words, pulling the section that those words are found in, and then printing. While that portion works, I need to add the file name that the portion of text was found in. Something like HD 354950 : supply chain issues were found with garden gnomes.
Below is the working code for just returning the information from within the texts -
import os
import re

directory = ...  # folder containing the .txt files
results = []
linenumber = 0
pattern = re.compile(r"\b(supply|finance)\b", re.IGNORECASE)
for filename in os.scandir(directory):
    if filename.path.endswith(".txt"):
        f = open(filename, encoding='utf-8')
        lines = f.readlines()
        for line in lines:
            linenumber += 1
            if pattern.search(line) is not None:
                results.append((linenumber, line.rstrip('\n')))
When the text is returned, I want to be able to pull the name of the file the text was found in, alongside the text itself. The filename is typically HD_0000354950_10Q_20200503_Item1A_excerpt.txt and I want to return HD 354950.
I would like to join this with the output of what is returned by
for d in results:
    print(filenamepieces, ":" + d[1])
where 'filenamepieces' is the name of the file that the text tidbit is taken from.
Here's an example using split() and converting the string to int:
fileName = "HD_0000354950_10Q_20200503_Item1A_excerpt.txt" # The name of the file
splitFile = fileName.split("_") # Splits the file name with underscores (_) into sections
index1 = splitFile[0] # Gets the name at the first index
index2 = splitFile[1] # Gets the name at the second index
index2 = int(index2) # Converts the second name into an int to remove the unnecessary zeros
finale = f"{index1} {index2}" # Final string
print(finale) # Prints the final string
# Program outputs : HD 354950
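Putting the two pieces together, something like the sketch below would prefix each matching line with the shortened name. It assumes directory is defined as in your loop and that every file name has the HD_0000354950_... shape; the variable names are illustrative:

import os
import re

pattern = re.compile(r"\b(supply|finance)\b", re.IGNORECASE)
for entry in os.scandir(directory):
    if entry.name.endswith(".txt"):
        parts = entry.name.split("_")                 # split the name on underscores
        label = f"{parts[0]} {int(parts[1])}"         # e.g. "HD 354950"
        with open(entry.path, encoding="utf-8") as f:
            for linenumber, line in enumerate(f, start=1):
                if pattern.search(line):
                    print(label, ":", line.rstrip("\n"))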
I have to compress a file into a list of words and a list of positions to recreate the original file. My program should also be able to take a compressed file and recreate the full text, including punctuation and capitalization, of the original file. I have everything correct apart from the recreation: using the map function, my program can't convert my list of positions into floats because of the '[', as the positions were written out as a list.
My code is:
text = open("speech.txt")
CharactersUnique = []
ListOfPositions = []
DownLine = False
while True:
line = text.readline()
if not line:
break
TwoList = line.split()
for word in TwoList:
if word not in CharactersUnique:
CharactersUnique.append(word)
ListOfPositions.append(CharactersUnique.index(word))
if not DownLine:
CharactersUnique.append("\n")
DownLine = True
ListOfPositions.append(CharactersUnique.index("\n"))
w = open("List_WordsPos.txt", "w")
for c in CharactersUnique:
w.write(c)
w.close()
x = open("List_WordsPos.txt", "a")
x.write(str(ListOfPositions))
x.close()
with open("List_WordsPos.txt", "r") as f:
NewWordsUnique = f.readline()
f.close()
h = open("List_WordsPos.txt", "r")
lines = h.readlines()
NewListOfPositions = lines[1]
NewListOfPositions = map(float, NewListOfPositions)
print("Recreated Text:\n")
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
print(recreation)
The error I get is:
Task 3 Code.py", line 42, in <genexpr>
recreation = " " .join(NewWordsUnique[pos] for pos in (NewListOfPositions))
ValueError: could not convert string to float: '['
I am using Python IDLE 3.5 (32-bit). Does anyone have any ideas on how to fix this?
Why do you want to turn the position values in the list into floats? They are list indices, and those must be integers. I suspected this might be an instance of what is called the XY Problem.
I also found your code difficult to understand because you haven't followed the PEP 8 - Style Guide for Python Code. In particular, many (although not all) of the variable names are CamelCased, which according to the guidelines should be reserved for class names.
In addition, some of your variables had misleading names, like CharactersUnique, which actually [mostly] contained unique words.
So, one of the first things I did was transform all the CamelCased variables into lowercase underscore-separated words, like camel_case. In several instances I also gave them better names to reflect their actual contents or role; for example, CharactersUnique became unique_words.
The next step was to improve the handling of files by using Python's with statement to ensure they all would be closed automatically at the end of the block. In other cases I consolidated multiple file open() calls into one.
After all that I had it almost working, but that's when I discovered a problem with the approach of treating newline "\n" characters as separate words of the input text file. This caused a problem when the file was being recreated by the expression:
" ".join(NewWordsUnique[pos] for pos in (NewListOfPositions))
because it adds one space before and after every "\n" character encountered, spaces that aren't there in the original file. To work around that, I ended up writing out the for loop that recreates the file instead of using a generator expression, because doing so allows the newline "words" to be handled properly.
At any rate, here's the resulting rewritten (and working) code:
input_filename = "speech.txt"
compressed_filename = "List_WordsPos.txt"
# Two lists to represent contents of input file.
unique_words = ["\n"] # preload with newline "word"
word_positions = []
with open(input_filename, "r") as input_file:
for line in input_file:
for word in line.split():
if word not in unique_words:
unique_words.append(word)
word_positions.append(unique_words.index(word))
word_positions.append(unique_words.index("\n")) # add newline at end of each line
# Write representations of the two data-structures to compressed file.
with open(compressed_filename, "w") as compr_file:
words_repr = " ".join(repr(word) for word in unique_words)
compr_file.write(words_repr + "\n")
positions_repr = " ".join(repr(posn) for posn in word_positions)
compr_file.write(positions_repr + "\n")
def strip_quotes(word):
"""Strip the first and last characters from the string (assumed to be quotes)."""
tmp = word[1:-1]
return tmp if tmp != "\\n" else "\n" # newline "words" are special case
# Recreate input file from data in compressed file.
with open(compressed_filename, "r") as compr_file:
line = compr_file.readline()
new_unique_words = list(map(strip_quotes, line.split()))
line = compr_file.readline()
new_word_positions = map(int, line.split()) # using int, not float here
words = []
lines = []
for posn in new_word_positions:
word = new_unique_words[posn]
if word != "\n":
words.append(word)
else:
lines.append(" ".join(words))
words = []
print("Recreated Text:\n")
recreation = "\n".join(lines)
print(recreation)
I created my own speech.txt test file from the first paragraph of your question and ran the script on it with these results:
Recreated Text:
I have to compress a file into a list of words and list of positions to recreate
the original file. My program should also be able to take a compressed file and
recreate the full text, including punctuation and capitalization, of the
original file. I have everything correct apart from the recreation, using the
map function my program can't convert my list of positions into floats because
of the '[' as it is a list.
Per your question in the comments:
You will want to split the input on spaces. You will also likely want to use different data structures.
# we'll map the words to a list of positions
all_words = {}
with open("speech.text") as f:
data = f.read()
# since we need to be able to re-create the file, we'll want
# line breaks
lines = data.split("\n")
for i, line in enumerate(lines):
words = line.split(" ")
for j, word in enumerate(words):
if word in all_words:
all_words[word].append((i, j)) # line and pos
else:
all_words[word] = [(i, j)]
Note that this does not yield maximum compression, as "foo" and "foo." count as separate words. If you want more compression, you'll have to go character by character. Hopefully you can now use a similar approach to do so if desired.
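To round this out, here is a minimal sketch of the reverse step for this data structure, rebuilding the text from all_words. It is an illustration of the approach, assuming all_words was built exactly as above:

# Invert {word: [(line, pos), ...]} into {line: {pos: word}}.
rebuilt = {}
for word, places in all_words.items():
    for line_no, word_no in places:
        rebuilt.setdefault(line_no, {})[word_no] = word

# Reassemble each line with its words in positional order.
lines_out = []
for line_no in sorted(rebuilt):
    line_words = rebuilt[line_no]
    lines_out.append(" ".join(line_words[j] for j in sorted(line_words)))
print("\n".join(lines_out))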
I have a txt file that contains the following data:
chrI
ATGCCTTGGGCAACGGT...(multiple lines)
chrII
AGGTTGGCCAAGGTT...(multiple lines)
I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.
Here is my code:
for i, line in enumerate(sacc_gff):
for match in re.finditer(chromo_val, line):
print(line)
for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
print(match.group())
What the variables mean:
chromo_val = chrI
amino_start = (some start point my program found)
amino_end = (some end point my program found)
Note: amino_start and amino_end need to be in variable form.
Please let me know if I could clarify anything for you, Thank you.
It looks like you are working with fasta data, so I will provide an answer with that in mind; but if it isn't fasta, you can still use the sub-sequence selection part.
fasta_data = {} # creates an empty dictionary
with open( fasta_file, 'r' ) as fh:
for line in fh:
if line[0] == '>':
seq_id = line.rstrip()[1:] # strip newline character and remove leading '>' character
fasta_data[seq_id] = ''
else:
fasta_data[seq_id] += line.rstrip()
# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]
fasta format:
>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC
Since you are working with fasta files which are formatted like this:
>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC
and are a bioinformatics major, I am guessing you will be manipulating sequences often, so I recommend installing the Perl package called FAST. Once it is installed, to get characters 2 through 14 of every sequence you would do this:
fascut 2..14 fasta_file.fa
The recent publication for FAST and its GitHub repository contain a whole toolbox for manipulating molecular sequence data on the command line.
I have a 500 MB text file that was made a long time ago. It has what look like HTML or XML tags, but they are not consistent throughout the file. I am trying to find the information between two tags that do not match. What I am using currently works but is very slow. myDict has a list of keywords in it; I can only guarantee that '<X>' + key and '</N>' exist, and there are no other tags that are consistent. The dictionary has 18,000 keys.
for key in myDict:
start_position = 0
start_position = the_whole_file.find('<X>'+key, start_position)
end_position = the_whole_file.find('</N>', start_position)
date = the_whole_file[start_position:end_position]
Is there a way to do this faster?
Reverse the way you are doing it: instead of iterating through the dictionary and searching for potential matches, iterate through potential matches and search the dictionary.
import re
for part in re.findall(r"<X>(.*?)</N>", the_whole_text):
    key = part.split(" ", 1)[0]
    if key in my_dict:
        do_something(part)
Dictionary lookup is O(1), as opposed to string searching, which is O(N) (searching the whole file for every key is expensive). Searching your file contents is ~O(500,000,000) operations, and you are doing that 18,000 times.
This way you only search the file once to find all potential matches, then look each one up to see if it's in your dictionary.
You can always read the file line by line instead of storing the whole file in memory:
inside_tag = False
data = ''
with open(your_file, 'r') as fil:
    for line in fil:
        if inside_tag and '</N>' in line:
            data += line.split('</N>')[0]
            print(data)
            inside_tag = False
if inside_tag:
data += line
if '<X>' in line:
data = line.split('<X>')[-1]
inside_tag = True
Note that this does not work when the beginning and end tags are on the same line; see the sketch below for a variant that handles that case.
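If you do need to handle tags that open and close on the same line (or several spans per line), a small stateful scanner can cover both cases. This is a sketch of one way to do it, not code from either answer; the tag literals mirror the question:

def extract_spans(path):
    """Yield the text between each '<X>' and the following '</N>',
    whether the two tags share a line or are many lines apart."""
    buf = None  # None = outside a span; a string = text collected so far
    with open(path, 'r') as fh:
        for line in fh:
            pos = 0
            while True:
                if buf is None:
                    start = line.find('<X>', pos)
                    if start == -1:
                        break  # no more opening tags on this line
                    buf = ""
                    pos = start + len('<X>')
                else:
                    end = line.find('</N>', pos)
                    if end == -1:
                        buf += line[pos:]  # span continues on the next line
                        break
                    yield buf + line[pos:end]
                    buf = None
                    pos = end + len('</N>')

Each yielded span can then be split and checked against my_dict exactly as in the regex approach above.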
I am still a learner in Python. I have not been able to find a specific string and insert multiple strings after it. I want to search for a line in the file and insert the content of the write() call after that line.
I have tried the following, which inserts at the end of the file instead:
line = '<abc hij kdkd>'
dataFile = open('C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml', 'a')
dataFile.write('<!--Delivery Date: 02/15/2013-->\n<!--XML Script: 1.0.0.1-->\n')
dataFile.close()
You can use fileinput to modify the same file in place and re to search for a particular pattern:
import fileinput, re, sys
def modify_file(file_name,pattern,value=""):
fh=fileinput.input(file_name,inplace=True)
for line in fh:
replacement=value + line
line=re.sub(pattern,replacement,line)
sys.stdout.write(line)
fh.close()
You can call this function something like this:
modify_file("C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml",
"abc..",
"!--Delivery Date:")
Python strings are immutable, which means that you wouldn't actually modify the input string; you would create a new one which has the first part of the input string, then the text you want to insert, then the rest of the input string.
You can use the find method on Python strings to locate the text you're looking for:
def insertAfter(haystack, needle, newText):
""" Inserts 'newText' into 'haystack' right after 'needle'. """
i = haystack.find(needle)
return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]
You could use it like this:
print(insertAfter("Hello World", "lo", " beautiful"))  # prints 'Hello beautiful World'
Here is a suggestion for dealing with files. I assume the pattern you search for is a whole line (there is nothing more on the line than the pattern, and the pattern fits on one line).
line = ...  # What to match
input_filepath = ...  # input full path
output_filepath = ...  # output full path (must be different than input)
encoding = ...  # text encoding of both files, e.g. "utf-8"
with open(input_filepath, "r", encoding=encoding) as fin, \
     open(output_filepath, "w", encoding=encoding) as fout:
pattern_found = False
for theline in fin:
# Write input to output unmodified
fout.write(theline)
# if you want to get rid of spaces
theline = theline.strip()
# Find the matching pattern
if pattern_found is False and theline == line:
# Insert extra data in output file
fout.write(all_data_to_insert)
pattern_found = True
# Final check
if pattern_found is False:
raise RuntimeError("No data was inserted because line was not found")
This code is for Python 3; some modifications may be needed for Python 2, especially the with statement (see contextlib.nested). If your pattern fits in one line but is not the entire line, you may use "line in theline" instead of "theline == line". If your pattern can spread over more than one line, you need a stronger algorithm. :)
To write to the same file, you can write to another file and then move the output file over the input file. I didn't plan to release this code, but I was in the same situation some days ago, so here is a class that inserts content in a file between two tags and supports writing to the input file: https://gist.github.com/Cilyan/8053594
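The "move the output file over the input file" step can be done with os.replace, which performs an atomic rename on POSIX systems. A minimal sketch, reusing the placeholder names from the snippet above:

import os

# Once fout has been closed, swap the new file into place over the original.
os.replace(output_filepath, input_filepath)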
Frerich Raabe's answer worked perfectly for me; good one, thanks!
def insertAfter(haystack, needle, newText):
#""" Inserts 'newText' into 'haystack' right after 'needle'. """
i = haystack.find(needle)
return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]
import shutil

with open(sddraft) as f1:
    tf = open("<path to your file>", 'a+')
    # Read lines in the file and replace the required content
    for line in f1.readlines():
        build = insertAfter(line, "<string to find in your file>", "<new value to be inserted after the string is found in your file>")  # inserts value
        tf.write(build)
    tf.close()
shutil.copy("<path to the source file --> tf>", "<path to the destination where tf needs to be copied with the file name>")
Hope this helps someone:)