Please consider the following list:
l = ['pfg022G', 'pfg022T', 'pfg068T', 'pfg130T', 'pfg181G', 'pfg181T', 'pfg424G', 'pfg424T']
and the file:
example.conf
"flowcell_unmapped_bams": ["/groups/cgsd/alexandre/gatk-workflows/src/ubam/pfg022G.unmapped.bam"],
"unmapped_bam_suffix": ".unmapped.bam",
"sample_name": "pfg022G",
"base_file_name": "pfg022G.GRCh38DH.target"
I would like to create a function that reads through the list pairwise, searches the file for each pattern, and substitutes it with the subsequent element of the list. For example, the first element of the list is pfg022G: read through the file example.conf, search for pfg022G, and once found, replace it with pfg022T.
Two functions, for readability. You can surely combine them into a single one.
def replace_words(words, content):
    """The list of words is supposed to have an even number of items.
    Items are extracted in pairs: find the first word
    in content and replace it with the next word.
    """
    _iterator = iter(words)
    for _find in _iterator:
        _replace = next(_iterator)
        content = content.replace(_find, _replace)
    return content
def rewrite_file(file, words):
    """Open the file to modify, read its content,
    then apply the replace_words() function. Once
    done, write the replaced content back to the
    file. You could compact them into one single
    function.
    """
    with open(file, 'r') as f:
        content = f.read()
    with open(file, 'w') as f:
        f.write(replace_words(words, content))
FILENAME = 'example.conf'
l = ['pfg022G', 'pfg022T', 'pfg068T', 'pfg130T', 'pfg181G', 'pfg181T', 'pfg424G', 'pfg424T']
rewrite_file(FILENAME, l)
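As an aside, a slightly more compact way to pair consecutive items, shown here as a sketch equivalent to the iter()/next() pattern above, is to zip the same iterator with itself:

def replace_words(words, content):
    # zip() pulls two items at a time from one iterator,
    # yielding (find, replace) pairs
    it = iter(words)
    for find, replace in zip(it, it):
        content = content.replace(find, replace)
    return content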
I have a large text file I have imported in Python and want to split into lines by a keyword, then use those lines to extract the relevant information into a dataframe.
The data follows the same pattern for each line, but the lines won't be the exact same number of characters, and some lines may have extra data.
So I have a text file such as:
{data: name:Mary, friends:2, cookies:10, chairs:4},{data: name:Gerald friends:2, cookies:10, chairs:4, outside:4},{data: name:Tom, friends:2, cookies:10, chairs:4, stools:1}
There is always the keyword data between records; is there any way I can split the text by using this word as the beginning of each line (and then put it into a dataframe)?
I'm not sure where to begin, so any help would be amazing.
When you get the content of a .txt file like this...
with open("file.txt", 'r') as file:
content = file.read()
...you have it as a string, so you can split it with the function str.split():
content = content.split(my_keyword)
You can do it with a function:
def splitter(path: str, keyword: str) -> list[str]:
    with open(path, 'r') as file:
        content = file.read()
    return content.split(keyword)
that you can call this way:
>>> splitter("file.txt", "data")
["I really like to write the word ", ", because I think it has a lot of meaning."]
I currently have the below code in Python 3.x:
lst_exclusion_terms = ['bob','jenny', 'michael']
file_list = ['1.txt', '2.txt', '3.txt']
for f in file_list:
    with open(f, "r", encoding="utf-8") as file:
        content = file.read()
        if any(entry in content for entry in lst_exclusion_terms):
            print(content)
What I am aiming to do is to review the content of each file in the list file_list. When reviewing the content, I then want to check to see if any of the entries in the list lst_exclusion_terms exists. If it does, I want to remove that entry from the list.
So, if 'bob' is within the content of 2.txt, this will be removed (popped) out of the list.
I am unsure how to replace my print(content) with the command to identify the current index number for the item being examined and then remove it.
Any suggestions? Thanks
You want to filter a list of files based on whether they contain some piece(s) of text.
There is a Python built-in function filter which can do that. filter takes a function that returns a boolean, and an iterable (e.g. a list), and returns an iterator over the elements from the original iterable for which the function returns True.
So first you can write that function:
def contains_terms(filepath, terms):
    with open(filepath) as f:
        content = f.read()
    return any(term in content for term in terms)
Then use it in filter, and construct a list from the result:
file_list = list(filter(lambda f: not contains_terms(f, lst_exclusion_terms), file_list))
Of course, the lambda is required because contains_terms takes 2 arguments, and returns True if any of the terms are in the file, which is sort of the opposite of what you want (but sort of makes more sense from the point of view of the function itself). You could specialise the function to your use case and remove the need for the lambda.
def is_included(filepath):
    with open(filepath) as f:
        content = f.read()
    return all(term not in content for term in lst_exclusion_terms)
With this function defined, the call to filter is more concise:
file_list = list(filter(is_included, file_list))
I've had a desire like this before, where I needed to delete a list item while iterating over it. It is often suggested to just recreate a new list with the contents you want, as suggested here (a one-line sketch of that follows).
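For completeness, the recreate-a-new-list approach is a one-liner; this sketch reuses the is_included() helper defined in the previous answer:

file_list = [f for f in file_list if is_included(f)]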
However, here is a quick and dirty approach that can remove the file from the list:
lst_exclusion_terms = ['bob','jenny', 'michael']
file_list = ['1.txt', '2.txt', '3.txt']
print("Before removing item:")
print(file_list)
flag = True
while flag:
    flag = False  # assume nothing more to remove
    for i, f in enumerate(file_list):
        with open(f, "r", encoding="utf-8") as file:
            content = file.read()
        if any(entry in content for entry in lst_exclusion_terms):
            file_list.pop(i)  # drop the matching file
            flag = True       # the list changed, so rescan it
            break
print("After removing item")
print(file_list)
In this case, file 3.txt was removed from the list since its contents matched entries in lst_exclusion_terms.
The following were the contents used in each file:
#1.txt
abcd
#2.txt
5/12/2021
#3.txt
bob
jenny
michael
I'm writing a program that reads in a directory of text files and finds specific combinations of strings that are overlapping (i.e. shared among all files). My current approach is to take one file from the directory, parse it, build a list of every string combination, and then search for each combination in the other files. For instance, if I had ten files, I'd read one file, parse it, store the keywords I need, then search the other nine files for this combination. I'd repeat this for every file (making sure that a file doesn't search itself). To do this, I'm trying to use Python's acora module.
The code I have thus far is:
def match_lines(f, *keywords):
    """Taken from [https://pypi.python.org/pypi/acora/], FAQs and Recipes #3."""
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()

    line_start = 0
    matches = False
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield f[line_start:pos]
                matches = False
            line_start = pos + 1
        else:
            matches = True
    if matches:
        yield f[line_start:]
def find_overlaps(f_in, fl_in, f_out):
    """f_in: input file to extract string combos from & use to search other files.
    fl_in: list of other files to search against.
    f_out: output file that'll have all lines and file names that contain the matching string combo from f_in.
    """
    string_list = build_list(f_in)  # Open the first file, read each line & build a list of tuples (string #1, string #2). The "build_list" function isn't shown in my pasted code.
    found_lines = []  # Holds all the lines (and file names, from fl_in) found to have a matching (string #1, string #2).
    for keywords in string_list:  # For each tuple (string #1, string #2) in the list of tuples
        for f in fl_in:  # For each file in the input file list
            for line in match_lines(f, *keywords):
                found_lines.append(line)
As you can probably tell, I used the function match_lines from the acora web page, "FAQ and recipes" #3, modified to parse files (using ac.filefind()) rather than strings.
The code seems to work, but it's only yielding the file name that has the matching string combination. My desired output is the entire line from the other files that contains my matching string combination (tuple).
The filename fragments come from yield f[line_start:pos]: f is the file path, so the code slices the filename instead of the file's text. The version below reads the file's content and slices that instead.
To get line numbers as well, you just need to count them as you pass them in match_lines():
def match_lines(f, *keywords):
    builder = AcoraBuilder('\r', '\n', *keywords)
    ac = builder.build()

    line_start = 0
    line_number = 0
    matches = False
    with open(f, 'r') as fh:
        text = fh.read()
    for kw, pos in ac.filefind(f):  # Modified from original function; search a file, not a string.
        if kw in '\r\n':
            if matches:
                yield line_number, text[line_start:pos]
                matches = False
            line_start = pos + 1
            line_number += 1
        else:
            matches = True
    if matches:
        yield line_number, text[line_start:]
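A hypothetical usage sketch (the file name and keywords here are placeholders):

for line_number, line in match_lines('other.txt', 'string1', 'string2'):
    print(line_number, line)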
I need to be able to import and manipulate multiple text files in the function parameter. I figured using *args in the function parameter would work, but I get an error about tuples and strings.
def open_file(*filename):
    file = open(filename, 'r')
    text = file.read().strip(punctuation).lower()
    print(text)
open_file('Strawson.txt','BigData.txt')
ERROR: expected str, bytes or os.PathLike object, not tuple
How do I do this the right way?
When you use the *args syntax in a function's parameter list, it allows you to call the function with multiple arguments, which appear as a tuple inside your function. So to process each of those arguments you need to create a loop, like this:
from string import punctuation

# Make a translation table to delete punctuation
no_punct = dict.fromkeys(map(ord, punctuation))

def open_file(*filenames):
    for filename in filenames:
        print('FILE', filename)
        with open(filename) as file:
            text = file.read()
            text = text.translate(no_punct).lower()
            print(text)
        print()

# test
open_file('Strawson.txt', 'BigData.txt')
I've also included a dictionary no_punct that can be used to remove all punctuation from the text. And I've used a with statement so each file will get closed automatically.
If you want the function to "return" the processed contents of each file, you can't just put return into the loop because that tells the function to exit. You could save the file contents into a list, and return that at the end of the loop. But a better option is to turn the function into a generator. The Python yield keyword makes that simple. Here's an example to get you started.
def open_file(*filenames):
    for filename in filenames:
        print('FILE', filename)
        with open(filename) as file:
            text = file.read()
            text = text.translate(no_punct).lower()
            yield text

def create_tokens(*filenames):
    tokens = []
    for text in open_file(*filenames):
        tokens.append(text.split())
    return tokens

files = '1.txt', '2.txt', '3.txt'
tokens = create_tokens(*files)
print(tokens)
Note that I removed the word.strip(punctuation).lower() stuff from create_tokens: it's not needed because we're already removing all punctuation and folding the text to lower-case inside open_file.
We don't really need two functions here. We can combine everything into one:
def create_tokens(*filenames):
    for filename in filenames:
        #print('FILE', filename)
        with open(filename) as file:
            text = file.read()
            text = text.translate(no_punct).lower()
            yield text.split()

tokens = list(create_tokens('1.txt', '2.txt', '3.txt'))
print(tokens)
I have a fastq file like this (part of the file):
#A80HNBABXX:4:1:1344:2224#0/1
AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG
+
\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
#A80HNBABXX:4:1:1515:2211#0/1
TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA
+
ee^e^\`ad`eeee\dd\ddddYeebdd\ddaYbdcYc`\bac^YX[V^\Ybb]]^bdbaZ]ZZ\^K\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_
#A80HNBABXX:4:1:1538:2220#0/1
CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT
+
fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y
#A80HNBABXX:4:1:1666:2222#0/1
CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT
+
deeee`bbcddddad\bbbbeee\ecYZcc^dd^ddd\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB
The FASTQ file uses four lines per sequence. Line 1 begins with a '#' character and is followed by a sequence identifier. Line 2 is the DNA sequence letters. Line 3 begins with a '+' character. Line 4 encodes the quality values for the sequence in Line 2 (the part after "+" and before the next "#") and must contain the same number of symbols as there are letters in the sequence.
I want to read the fastq file into a dictionary like this (the key is the DNA sequence, the value is the quality string, and the lines starting with "#" and "+" can be discarded):
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG':'\\YYWX\PX^YT[TVYaTY]^\^H\`^`a`\UZU__TTbSbb^\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT':'fff^fd\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\a\`]aY]ZZ[XYcccYcZ\\]Y',
....}
I wrote the following code but it does not give me what I want. Can anyone help me fix/improve my code?
class fastq(object):
    def __init__(self, filename):
        self.filename = filename
        self.__sequences = {}

    def parse_file(self):
        symbol = ['#', '+']
        """Stores both the sequence and the quality values for the sequence"""
        f = open(self.filename, 'rU')
        for lines in self.filename:
            if symbol not in lines.startwith()
        data = f.readlines()
        return data
Here's a pretty quick and efficient way of doing it:
def parse_file(self):
    with open(self.filename, 'r') as f:
        content = f.readlines()
    # Recreate content without lines that start with # and +,
    # stripping trailing newlines so they don't end up in the dict
    content = [line.rstrip('\n') for line in content if line[0] not in '#+']
    # Now the lines you want are alternating, so you can make a dict
    # from key/value pairs of lists content[0::2] and content[1::2]
    data = dict(zip(content[0::2], content[1::2]))
    return data
I don't think using the reads as keys is a good idea: what if you get two identical reads? But anyway, if you want to do it:
In [9]:
with open('temp.fastq') as f:
    lines = f.readlines()
head = [item[:-1] for item in lines[::4]]  # get rid of '\n'
read = [item[:-1] for item in lines[1::4]]
qual = [item[:-1] for item in lines[3::4]]
dict(zip(read, qual))
Out[9]:
{'AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG': '\\\\YYWX\\PX^YT[TVYaTY]^\\^H\\`^`a`\\UZU__TTbSbb^\\a^^^`[GOVVXLXMV[Y_^a^BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB',
'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT': 'fff^fd\\c^d^Ycac`dcdcded`effdfedb]beeeeecd^ddccdddddfff`eaeeeffdTecacaLV[QRPa\\\\a\\`]aY]ZZ[XYcccYcZ\\\\]Y',
'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT': 'deeee`bbcddddad\\bbbbeee\\ecYZcc^dd^ddd\\\\`]``L`ccabaVJ`MZ^aaYMbbb__PYWY]RWNUUab`Y`BBBBBBBBBBBBBBBBBBBB',
'TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA': 'ee^e^\\`ad`eeee\\dd\\ddddYeebdd\\ddaYbdcYc`\\bac^YX[V^\\Ybb]]^bdbaZ]ZZ\\^K\\^]VPNME][`_``Ubb_bYddZbbbYbbYT^_'}
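To make that caveat concrete, here is a tiny sketch showing that identical reads collapse to a single dictionary entry, with the last quality string winning:

d = dict(zip(['ACGT', 'ACGT'], ['IIII', 'JJJJ']))
print(d)  # {'ACGT': 'JJJJ'}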
You can use a function from Biopython (Bio), like this:
from Bio import SeqIO

myf = mydir + myfile
startlist = []
for record in SeqIO.parse(myf, "fastq"):
    startlist.append(str(record.seq))  # or without 'str'
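And a hedged sketch of building the sequence-to-quality dictionary the question asks for. Biopython exposes qualities as Phred integer scores in record.letter_annotations["phred_quality"], so the ASCII quality string is rebuilt here assuming the standard Sanger offset of 33; note that Biopython expects standard FASTQ headers starting with '@', and 'temp.fastq' is a placeholder path:

from Bio import SeqIO

seq_to_qual = {}
for record in SeqIO.parse('temp.fastq', 'fastq'):
    # Rebuild the ASCII quality string from the Phred scores (offset 33)
    qual = ''.join(chr(q + 33) for q in record.letter_annotations['phred_quality'])
    seq_to_qual[str(record.seq)] = qual
print(seq_to_qual)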