Write individual array elements to unique files - python

I have two arrays, infile and outfile:
infile = ['Apple', 'Orange', 'Banana']
outfile = ['Applefile', 'Orangefile', 'Bananafile']
I search readin.txt for each element of the infile array, and for any line containing said element, I do a couple of things. This is what readin.txt looks like:
Apple = 13
Celery = 2
Orange = 5
Banana =
Grape = 4
The outfile array contains the names of the files I would like to create, each corresponding to an element in infile. The first element of infile corresponds to the first element (file name) of outfile, and so on.
The problem I'm having is with this bit of code:
for line in open("readin.txt", "r"):
for i in infile:
if i in line:
sp = line.split('=')
sp1 = str(sp[1])
def parseline(l):
return sp1.strip() if len(sp) > 1 and sp[1].strip() != '' else None
for s in outfile:
out = parseline(line)
outw = open(s, "w")
outw.write(str(out))
outw.close()
In the first part of the code, I want to search readin.txt for any one of the words from infile (i.e. Apple, Orange, and Banana). I then want the code to select out the entire line in which that word occurs. I know that any such line in readin.txt will contain an equal sign, so I then want the code to split the line around the equal sign, and produce only that which follows the equal sign.
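To make that concrete, here is what the split produces on one sample line from readin.txt (a small illustrative sketch):
line = "Apple = 13"
sp = line.split('=')    # ['Apple ', ' 13']
print(sp[1].strip())    # prints: 13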
While the last part of the code indeed creates separate files for each element in outfile, the actual output always corresponds to the last element of infile. It's as though each subsequent step in the loop overwrites the previous steps. I feel like I need to be looking at the i-th elements of line, but I'm not sure how to do that in Python. Any help would be great.
Editing for clarity in hopes of having the question re-opened:
In fact, the following code seems to do exactly what I want:
for line in open("parameters.txt", "r"):
for i in infile:
if i in line:
sp = line.split('=')
sp1 = str(sp[1]).strip() if len(sp) > 1 and sp[1].strip() != '' else None
print sp1
On the command line, I get:
13
5
None
So this tells me that the first portion of the code is doing essentially what I want it to (although perhaps not in the most efficient way, so any other suggestions would be appreciated).
At this point, I'd like all of the information that was printed out to be written to individual files based on the outfile array, i.e. 13 should be written to a file called Applefile, None should be written to a file called Bananafile, etc. This is the point at which I'm having trouble. I know that outfile should be indexed in the same way, so that the first element of outfile corresponds to the first element of infile, but my attempts so far have not worked.
This is my most recent attempt:
for line in open("parameters.txt", "r"):
for i in infile:
if i in line:
def parseline(l):
sp = l.split('=')
sp1 = str(sp[1]).strip() if len(sp) > 1 and sp[1].strip() != '' else None
if sp1:
out = parseline(line)
outw = open(outfile[i], "w")
outw.write(line)
outw.close()
where defining parseline any sooner in the code negates the whole beginning part of the code for some reason.
I'm not looking for just the answer. I would like to understand what is going on and be able to figure out how to fix it.

the actual output within every single file created corresponds to the last element of infile

This happens because, for every matching element of infile, you loop over every element of outfile and write the latest line to each, so every file ends up containing the last match. Since your infile and outfile elements correspond by position, you can use enumerate to get the index of the current infile word and use it to pick the matching name from outfile, something like:
for line in open("readin.txt", "r"):
for i in infile:
if i in line:
sp = line.split('=')
sp1 = str(sp[1]).strip() if len(sp) > 1 and sp[1].strip() != '' else None
if sp1:
out = parseline(line)
outw = open(outfile[i], "w")
outw.write(str(out))
outw.close()

I would break this down into two steps:
def parse_params(filename):
    """Convert the parameter file into a map from word to value."""
    out = {}
    with open(filename) as f:
        for line in f:
            word, num = map(str.strip, line.split("="))
            out[word] = num
    return out  # e.g. {'Celery': '2', 'Apple': '13', 'Orange': '5'}

def process(in_, out, paramfile):
    """Write the values defined in paramfile to the out files based on in_."""
    value_map = parse_params(paramfile)
    for word, filename in zip(in_, out):
        if word in value_map:
            with open(filename, 'w') as f:  # or '"{0}.txt".format(filename)'
                f.write(value_map[word])
        else:
            print "No value found for '{0}'.".format(word)

process(infile, outfile, "parameters.txt")
Your current code really doesn't make much sense:
for line in open("parameters.txt", "r"):  # iterate over lines in the file
    for i in infile:                      # iterate over words in the infile list
        if i in line:                     # check whether the word appears in the line
            def parseline(l):             # define a function...
                sp = l.split('=')
                sp1 = str(sp[1]).strip() if len(sp) > 1 and sp[1].strip() != '' else None
                if sp1:
                    out = parseline(line)
                    outw = open(outfile[i], "w")
                    outw.write(line)
                    outw.close()
            # ...but apparently never call it (why?)
Using the same loop variable name in two loops is a bad idea; you only ever see the inner value:
>>> for x in range(2):
...     for x in "ab":
...         print x
...
a
b
a
b
If you find that a function "needs" to be defined in a particular place, it suggests that you are relying on scoping to access variables. It is much better to define specific arguments and return values for the parameters you need; it makes development and testing significantly easier.
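As a minimal sketch of that advice, parseline can take the line as an argument and return the parsed value, so it no longer depends on the enclosing loop's variables and can be defined (and tested) anywhere:
def parseline(line):
    """Return the stripped value after '=', or None if there is none."""
    sp = line.split('=')
    if len(sp) > 1 and sp[1].strip() != '':
        return sp[1].strip()
    return None

print(parseline("Apple = 13"))  # 13
print(parseline("Banana ="))    # None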

Related

If the first 3 characters are the same, delete the line in a text file?

Goal: Open the text file. Check whether the first 3 characters of each line are the same in subsequent lines. If yes, delete the bottom one.
The contents of the text file:
cat1
dog4
cat3
fish
dog8
Desired output:
cat1
dog4
fish
Attempt at code:
line = open("text.txt", "r")
for num in line.readlines():
a = line[num][0:3] #getting first 3 characters
for num2 in line.readlines():
b = line[num2][0:3]
if a in b:
line[num2] = ""
Open the file and read one line at a time. Note the first 3 characters (prefix). Check if the prefix has been previously observed. If not, keep that line and add the prefix to a set. For example:
with open('text.txt') as infile:
    out_lines = []
    prefixes = set()
    for line in map(str.strip, infile):
        if (prefix := line[:3]) not in prefixes:
            out_lines.append(line)
            prefixes.add(prefix)

print(out_lines)
Output:
['cat1', 'dog4', 'fish']
Note:
Requires Python 3.8+
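If you need to support Python older than 3.8, an equivalent sketch without the walrus operator:
with open('text.txt') as infile:
    out_lines = []
    prefixes = set()
    for line in map(str.strip, infile):
        prefix = line[:3]
        if prefix not in prefixes:
            out_lines.append(line)
            prefixes.add(prefix)

print(out_lines)  # ['cat1', 'dog4', 'fish']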
You can use a dictionary to store the first 3 characters and check it while reading. Sample code below:
f = open("text.txt", "r")
first_three_char_dict = {}
result = []
for line in f.readlines():
    a = line[0:3]  # getting first 3 characters
    if first_three_char_dict.get(a):
        continue  # prefix already seen, drop the duplicate line
    else:
        first_three_char_dict[a] = line  # remember the first line with this prefix
        result.append(line)
f.close()
Read each line and add the word (its first 3 characters) into a dict. The key of the dict is the first 3 characters of the word and the value is the word itself. At the end, the dict keys are unique and their values are your desired result.
You just need to check whether the prefix already exists in a temporary list:
f = open("text.txt", "r")
result = []
seen = []
for line in f.readlines():
    data = line[0:3]  # getting first 3 characters
    if data not in seen:  # check whether the prefix already exists in the list
        seen.append(data)  # if not, remember the prefix
        result.append(line)  # and keep the line
f.close()
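If you also want the deduplicated lines written back to the file (the question talks about deleting lines, so this is an assumed final step), something like:
with open("text.txt", "w") as f:
    f.writelines(result)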

Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that id number is unique, but the number assigned to it may repeat. I have several files like this all with the same number of headers (~500).
File2 has all the numbers of File1, each followed by a sequence:
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that each sequence id is unique, as is the sequence that follows it. I have the same number of files as File1, but the number of sequences within each File2 may vary (~150).
My desired output is File1 with the sequences from File2; it is important that File1 maintains its original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract the numbers from File1 and use them as patterns to match in File2. First I am trying to make this work with only a pair of files. Here is what I have achieved:
#!/usr/bin/env python
import re

datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f])  # maybe I could use regex to get only lines with number as pattern?

print(datafile_lines)

outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)

print(outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
Ps: I am open to modifications in the files' structure; I tried substituting \n in File2 for "," to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'

d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()] = 0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i += 1

new_d = {}
with open(schemaseqs, 'r') as f:
    i = 0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()] = 0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i += 1

for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]

print(d)

with open(datafile, 'w') as filee:
    for k, v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating two dictionaries makes this easy; then map the values of one dictionary onto the other.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}

with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1:  # odd-numbered lines (including line 1) hold the ID
            current_id = line
        else:
            id_to_gene_map[current_id] = line  # map the previous line's ID to this line's sequence

with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"):  # leave ">id1" lines unchanged
            line = id_to_gene_map[line]  # otherwise, replace the number with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
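For completeness, a sketch of the list alternative mentioned above, assuming the IDs really are the continuous sequence 1, 2, 3, ... and File2 lists them in order:
genes = []
with open(file2) as f:
    for line_number, line in enumerate(f, start=1):
        if line_number % 2 == 0:  # even-numbered lines hold the sequences
            genes.append(line)
# The lookup for ID n is genes[n - 1]: genes[0] is the sequence for ID 1, and so on.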

How to print the last occurence of a string plus the following N lines

I'm trying to write a file output parser and am having trouble coming up with a way to print the last occurrence of a string plus the following N lines. The output files are generally less than 2 MB, so I shouldn't have any issues reading the file into memory, but a more elegant solution would be nice for learning's sake.
I have tried saving the lines into a list and then printing out the last occurrences, but it splits the lines into words, so the lists end up being too hard to work with. The program also reads the total number of lines to be printed earlier on, in case another solution needs it.
def coord():
    stdOrn = 'Standard orientation'
    coord = {}
    found = False
    with open(name, 'r') as text_file:
        for line in text_file:
            if stdOrn in line:
                found = True
            elif 'Rotational constants (GHZ)' in line:
                found = False
            elif found:
                coord = line
                outFile.write(coord)
You can load it as a string, get the index of the last appearance with .rfind() and then do a string slice from the last index.
stdOrn = 'Standard orientation'

with open(name, 'r') as text_file:
    file_contents = text_file.read()

last_appearance = file_contents.rfind(stdOrn)
following_lines = file_contents[last_appearance:]
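To trim that down to the match plus the following N lines (the question mentions that N is read earlier in the program, so the value here is a placeholder), split and slice:
n = 5  # placeholder; the real count comes from earlier in the program
lines_after = file_contents[last_appearance:].splitlines()
block = lines_after[:n + 1]  # the matching line plus the next N lines
print("\n".join(block))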

Loop within a loop not re-looping with reading a file Python3

Trying to write code that will count all characters of a certain type in a text file.
For vowels, it finds the total number of a's but won't re-loop through the text to count the e's. Help?
def finder_character(file_name, character):
    in_file = open(file_name, "r")
    if character == 'vowel':
        brain_rat = 'aeiou'
    elif character == 'consonant':
        brain_rat = 'bcdfghjklmnpqrstvwxyz'
    elif character == 'space':
        brain_rat = ''
    else:
        brain_rat = '!##$%^&*()_+=-123456789{}|":?><,./;[]\''
    found = 0
    for line in in_file:
        for i in range(len(brain_rat)):
            found += finder(file_name, brain_rat[i+1, i+2])
    in_file.close()
    return found

def finder(file_name, character):
    in_file = open(file_name, "r")
    line_number = 1
    found = 0
    for line in in_file:
        line = line.lower()
        found += line.count(character)
    return found
If you want to use your original code, you have to pass the filename to the finder() function and open the file there, once for each character you are testing for.
The reason for this is that the file object (in_file) is an iterator (it behaves much like a generator), not a list. The way an iterator works is that it returns the next item each time its next() method is called. When you say
for line in in_file:
the for ... in statement calls in_file.next() as long as the next() method returns a value (generators actually use the keyword yield, but don't think about that for now). When the iterator doesn't return any more values, we say that it is exhausted. You can't re-use an exhausted iterator. If you want to start over again, you have to make a new one.
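A quick demonstration of that exhaustion with a file object (using the four-line test.txt shown further down):
f = open("test.txt")
print(sum(1 for _ in f))  # 4: consumes every line
print(sum(1 for _ in f))  # 0: the iterator is exhausted
f.close()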
I allowed myself to rewrite your code. This should give you the desired result. If anything is unclear, please ask!
def finder_character(file_name, character):
    with open(file_name, "r") as ifile:
        if character == 'vowel':
            brain_rat = 'aeiou'
        elif character == 'consonant':
            brain_rat = 'bcdfghjklmnpqrstvwxyz'
        elif character == 'space':
            brain_rat = ' '
        else:
            brain_rat = '!##$%^&*()_+=-123456789{}|":?><,./;[]\''
        return sum(1 if c.lower() in brain_rat else 0 for c in ifile.read())
test.txt:
eeehhh
iii!#
kk ="k
oo o
Output:
>>> print(finder_character('test.txt', 'vowel'))
9
>>> print(finder_character('test.txt', 'consonant'))
6
>>> print(finder_character('test.txt', 'space'))
2
>>> print(finder_character('test.txt', ''))
4
If you are having problems understanding the return line, it should be read backwards, like this:
Sum this generator:
Make a generator with values as v in:
for c in ifile.read():
    if c.lower() in brain_rat:
        v = 1
    else:
        v = 0
If you want to know more about generators, I recommend the Python Wiki page concerning it.
This seems to be what you are trying to do in finder_character. I'm not sure why you need finder at all.
In python you can loop over iterables (like strings), so you don't need to do range(len(string)).
for line in in_file:
    for i in brain_rat:
        if i in line: found += 1
There appear to be a few other oddities in your code too:
You open (and iterate through) the file twice, but only close it once.
line_number is never used
You get the total of a character in a file for each line in the file, so the total will be vastly inflated.
This is probably a much safer version; with open ... is generally better than open() ... file.close(), as you don't need to worry as much about error handling and closing. I've added some comments to help explain what it does.
def finder_character(file_name, character):
    found = 0  # initialise the counter
    with open(file_name, "r") as in_file:
        # Open the file
        opts = {'vowel': 'aeiou',
                'consonant': 'bcdfghjklmnpqrstvwxyz',
                'space': ' '}
        default = '!##$%^&*()_+=-123456789{}|":?><,./;[]\''
        for line in in_file:
            # Iterate through each line in the file
            for c in opts.get(character, default):
                # With each line, also iterate through the set of chars to check
                if c in line.lower():
                    # If the current character is in the line,
                    found += 1  # increment the counter
        return found  # return the counter

How to make lists of integers from a portion of a file with Python?

I have a file which looks like the following:
# junk
...
# junk
1.0 -100.102487081243
1.1 -100.102497023421
... ...
3.0 -100.102473082342
&
# junk
...
I am interested only in the two columns of numbers given between the # and & characters. These characters may appear anywhere else in the file but never inside the number block.
I want to create two lists, one with the first column and one with the second column.
List1 = [1.0, 1.1,..., 3.0]
List2 = [-100.102487081243, -100.102497023421,..., -100.102473082342]
I've been using shell scripting to prep these files for a simpler Python script which makes lists, however, I'm trying to migrate these processes over to Python for a more consistent application. Any ideas? I have limited experience with Python and file handling.
Edit: I should mention, this number block appears in two places in the file. Both number blocks are identical.
Edit2: A general function would be most satisfactory for this as I will put it into a custom library.
Current Efforts
I currently use a shell script to trim out everything but the number block into two separate columns. From there it is trivial for me to use the following function
def ReadLL(infile):
    List = open(infile).read().splitlines()
    intL = [int(i) for i in List]
    return intL
by calling it from my main
import sys
import eLIBc
infile = sys.argv[1]
sList = eLIBc.ReadLL(infile)
The problem is knowing how to extract the number block from the original file with Python rather than using shell scripting.
You want to loop over the file itself, and set a flag for when you find the first line without a # character, after which you can start collecting numbers. Break off reading when you find the & character on a line.
def readll(infile):
    with open(infile) as data:
        floatlist1, floatlist2 = [], []
        reading = False
        for line in data:
            if not reading:
                if '#' not in line:
                    reading = True
                else:
                    continue
            if '&' in line:
                return floatlist1, floatlist2
            numbers = [float(n) for n in line.split()]
            floatlist1.append(numbers[0])
            floatlist2.append(numbers[1])
So the above:
sets reading to False, and only when a line without '#' is found is it set to True;
when reading is True:
it returns the data read so far if the line contains &;
otherwise it assumes the line contains two float values separated by whitespace, which are appended to their respective lists.
By returning, the function ends, with the file closed automatically. Only the first block is read; the rest of the file is simply ignored.
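Hypothetical usage (the filename is an assumption; the question reads it from sys.argv):
xs, ys = readll("data.txt")
print(xs)  # [1.0, 1.1, ..., 3.0]
print(ys)  # [-100.102487081243, ..., -100.102473082342]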
Try this out:
with open("i.txt") as fp:
lines = fp.readlines()
data = False
List1 = []
List2 = []
for line in lines:
if line[0] not in ['&', '#']:
print line
line = line.split()
List1.append(line[0])
List2.append(line[1])
data = True
elif data == True:
break
print List1
print List2
This should give you the first block of numbers.
Input:
# junk
# junk
1.0 -100.102487081243
1.1 -100.102497023421
3.0 -100.102473082342
&
# junk
1.0 -100.102487081243
1.1 -100.102497023421
Output:
['1.0', '1.1', '3.0']
['-100.102487081243', '-100.102497023421', '-100.102473082342']
Update
If you need both blocks, then use this:
with open("i.txt") as fp:
lines = fp.readlines()
List1 = []
List2 = []
for line in lines:
if line[0] not in ['&', '#']:
print line
line = line.split()
List1.append(line[0])
List2.append(line[1])
print List1
print List2
