Let's say I have a file containing this:
xxoxoxoxox
xxxxxxxxxx
xoxoxoxoxo
ooxoxoxoxo
and I want to iterate through each character and line and store them.
I know that if the characters are separated by spaces, I can do this:
mylist = []
with open("myfile.txt") as myfile:
    for line in myfile:
        line = line.strip().split(" ")
        first = line[0]
        second = line[1]
        alist = [first, second]
        mylist.append(alist)
But how do I do something like this without spaces as delimiters? I tried .split()
and
for line in file:
    for char in line:
But neither seems to work.
Thanks in advance for any help!
Here is a small snippet that might be useful to you.
Say you have a file called 'doc.txt' with two lines:
kisskiss
bangbang
With the following python script:
with open('doc.txt', 'r') as f:
    all_lines = []
    # loop through all lines using the f.readlines() method
    for line in f.readlines():
        new_line = []
        # this is how you would loop through each character
        for chars in line:
            new_line.append(chars)
        all_lines.append(new_line)
The output is:
>>> all_lines
Out[94]:
[['k', 'i', 's', 's', 'k', 'i', 's', 's', '\n'],
['b', 'a', 'n', 'g', 'b', 'a', 'n', 'g', '\n']]
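Note that each row in the output above still ends with '\n'. If the newline is not wanted, a minimal sketch along these lines strips it before splitting. The helper name chars_per_line is made up for illustration, and it is assumed to receive any iterable of lines (an open file works as well as a list):

```python
def chars_per_line(lines):
    """Return one list of characters per line, without the trailing newline."""
    return [list(line.rstrip("\n")) for line in lines]


# An open file can be passed directly, e.g. chars_per_line(open("doc.txt")).
# A plain list of strings is used here so the example runs without a file.
sample = ["kisskiss\n", "bangbang\n"]
print(chars_per_line(sample))
```

Only the newline is stripped here, so any other whitespace in a line is kept as characters.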
A more "pythonic" way:
def t():
    r = []
    with open("./test_in.txt") as f_in:
        [r.extend(list(l)) for l in f_in]
    return r
Note that you can't use return [r.extend(list(l)) for l in f_in] because extend returns None.
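If the goal is one flat list of characters, the side-effect comprehension can probably be avoided altogether. The sketch below shows two alternatives; the function names are invented for this example, and both assume they are given an iterable of lines such as an open file:

```python
from itertools import chain


def all_chars(lines):
    """Flatten an iterable of lines into a single list of characters."""
    return list(chain.from_iterable(lines))


def all_chars_comprehension(lines):
    """Same result, written as a nested list comprehension."""
    return [ch for line in lines for ch in line]


sample = ["ab\n", "cd\n"]
print(all_chars(sample))
print(all_chars_comprehension(sample))
```

Both versions build and return the list directly, so there is no intermediate list of None values to discard.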
I am dealing with sequences in a FASTA file. A line that starts with > usually gives the name of the sequence, and the actual sequence begins on the next line.
I am trying to insert my sequences into a dictionary, so that the name is the key and the actual sequence is the value.
For example:
First line: >Ebola 23212
Second line: TAATCGTACTAC--ATCC
Third line: TAATATAGGCGT-A--
Fourth line: >Corona E0232.1
Fifth line: TATTTCGATC----AACT
And so on.
Here is what I have come up with so far:
import sys
name = '/Users/Tom/OneDrive/Desktop/projekt/sequences.fasta'
from collections import defaultdict
f = open(name)
seq = defaultdict(str)
for line in f:
    if line.startswith('>'):
        name = line[1:-1]
        continue
    seq[name] += line.strip()
This works perfectly if I only have one sequence in the file, but with multiple sequences (and obviously multiple names) it fails. When I print name, it gives me only the first sequence name.
Any suggestions?
Use Biopython
Biopython will help you achieve exactly what you are looking for.
or Code it
If you prefer to code it you could implement this sort of pipeline:
def filter_nuc(letter):
    # keep only nucleotide letters and the ">" header marker
    nuc = ['A', 'T', 'G', 'C', '>']
    return letter in nuc

# CHANGE: set this to the path of the FASTA file to import
file_import_directory = "sample dataset.txt"

# import the FASTA file and read through it
with open(file_import_directory, "r") as seq_orig:
    seq = seq_orig.read()
seq_id = seq  # for ID extraction
assert ">" in seq, "should be a valid FASTA file"  # check that the imported file is a FASTA file

# extract the sequences as 'seq'
seq = ''.join(filter(filter_nuc, seq))
seq = seq.split('>')

# exclude empty entries and repeated sequences
seq = [i for i in seq if i and seq.count(i) == 1]

def comparison(inp):
    # build a dictionary mapping each sequence to its list of letters
    seq_dict = {}
    for item in inp:
        seq_dict[item] = list(item)
    print(seq_dict)

comparison(seq)
For example, for this input sequence:
>Rosalind_52
TCATC
>Rosalind_44
TTCAT
>Rosalind_68
TCATC
>Rosalind_28
TGAAA
>Rosalind_95
GAGGA
>Rosalind_66
TTTCA
>Rosalind_33
ATCAA
>Rosalind_21
TTGAT
>Rosalind_18
TTTCC
The expected output is:
{'TTCAT': ['T', 'T', 'C', 'A', 'T'], 'TGAAA': ['T', 'G', 'A', 'A', 'A'], 'GAGGA': ['G', 'A', 'G', 'G', 'A'], 'TTTCA': ['T', 'T', 'T', 'C', 'A'], 'ATCAA': ['A', 'T', 'C', 'A', 'A'], 'TTGAT': ['T', 'T', 'G', 'A', 'T'], 'TTTCC': ['T', 'T', 'T', 'C', 'C']}
Hope that helps.
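The "exclude repeated sequences" step above calls seq.count() once per element, which rescans the list each time. As a rough sketch of the same idea, collections.Counter can tally the sequences in one pass. The function name unique_sequences is hypothetical, and the input is assumed to be a list of already extracted sequence strings:

```python
from collections import Counter


def unique_sequences(sequences):
    """Keep only the sequences that occur exactly once, in original order."""
    counts = Counter(sequences)
    return [s for s in sequences if counts[s] == 1]


sample = ["TCATC", "TTCAT", "TCATC", "TGAAA", "GAGGA",
          "TTTCA", "ATCAA", "TTGAT", "TTTCC"]
print(unique_sequences(sample))
```

As in the answer above, a sequence that appears more than once is dropped entirely rather than kept once.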
Is this the output you expect? I would still recommend using Biopython for reading and writing common file formats like FASTA, but if you really want to code it yourself, this should do the trick:
filename = '/path/to/sequences.fasta'

def create_sequence_dict(text: str) -> dict[str, str]:
    lines = text.split('\n')
    name = lines.pop(0)
    return {name: ''.join(lines)}

with open(filename, mode='r') as file:
    text = file.read()

d = {}
for s in text.split('>'):
    if s:
        d.update(create_sequence_dict(s))
Output
{'Ebola 23212': 'TAATCGTACTAC--ATCCTAATATAGGCGT-A--',
'Corona E0232.1': 'TATTTCGATC----AACT'}
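For comparison, here is a line-by-line sketch of the same idea that does not read the whole file into memory first. It is not Biopython's API. The name parse_fasta is made up for this example, and it assumes every header line starts with ">" and that sequence lines before the first header can be ignored:

```python
def parse_fasta(lines):
    """Map each sequence name to its sequence, joined across lines."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header line: start a new record
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line  # sequence line: extend the current record
    return sequences


sample = [
    ">Ebola 23212\n",
    "TAATCGTACTAC--ATCC\n",
    "TAATATAGGCGT-A--\n",
    ">Corona E0232.1\n",
    "TATTTCGATC----AACT\n",
]
print(parse_fasta(sample))
```

An open file can be passed in place of the list, for example parse_fasta(open(filename)).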
I am trying to create a program that takes in a text file and replaces each word that does not start with a vowel with the same word followed by a plus sign "+". However, every time I run the program, it removes random parts of the text file.
Example input text file:
This is a text file. amazing words should be removed
Example output file:
This+ is a text+ file+. amazing words+ should+ be+ removed+
fin = open("file.txt", "r")
fout = open("new.txt", "w")
vowels = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U',]
for line in fin:
    for word in fin:
        if word[0] not in vowels:
            word = word + "+"
        fout.write(line.replace(line, word))
fin.close()
fout.close()
Please see the inline comments for an explanation.
fin = open("file.txt", "r").read().split()  # read the contents and split them into words
fout = open("new.txt", "wt")
vowels = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U',]
words = []
for word in fin:  # for each word
    if word[0] not in vowels:  # if it doesn't start with a vowel
        word += "+"  # append a plus sign
        words.append(word)  # append to the list of words
    else:
        words.append(word)  # if it does start with a vowel, append the unchanged word
fout.write(' '.join(words))  # write the words, joined by spaces, to the file
fout.close()
Some extra tips:
Using the with statement when dealing with files is highly recommended. For example:
textfile = open(somefile, "rt")  # instead of this
text = textfile.read()

with open(somefile, "rt") as textfile:  # do this
    text = textfile.read()
    # ... do something
# when you exit the with statement, the file is closed automatically
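To check the claim about automatic closing, a small experiment like the following can help. It is only a sketch: it writes a throwaway temporary file (an assumption made so the example does not depend on any existing path) and inspects the file object's closed attribute inside and after the with block:

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on an existing path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some text\n")
    path = tmp.name

with open(path, "rt") as textfile:
    text = textfile.read()
    closed_inside = textfile.closed  # still open inside the block

closed_after = textfile.closed  # closed once the block has exited
os.remove(path)  # clean up the temporary file

print(closed_inside, closed_after)  # prints: False True
```

The file is closed even if the body of the with block raises an exception, which is the main reason to prefer it over a manual close().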
vowels = ['a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U']
with open("file.txt", "r") as fin, open("new.txt", "w") as fout:
    lines = [x.split() for x in fin.readlines()]
    lines = [' '.join(x if x[0] in vowels else f'{x}+' for x in line) for line in lines]
    fout.write('\n'.join(lines))  # writelines() would not add the newlines back
test = ['This is a text file.', 'amazing words should be removed']
lines = [x.split() for x in test]
test = [' '.join(x if x[0] in vowels else f'{x}+' for x in line) for line in lines]
print(test)
Output (correcting the file.+ edge case, where the plus lands after the punctuation, would take a more in-depth program):
['This+ is a text+ file.+', 'amazing words+ should+ be+ removed+']
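One possible way to handle the file.+ edge case is to match words with a regular expression instead of splitting on whitespace, so punctuation stays outside the match. This is only a sketch: the function name is invented, and a "word" is assumed to be a run of ASCII letters, so contractions such as "don't" would be treated as two words:

```python
import re

VOWELS = "aeiouAEIOU"


def mark_consonant_words(text):
    """Append '+' to every word that does not start with a vowel."""
    def mark(match):
        word = match.group(0)
        return word if word[0] in VOWELS else word + "+"

    # Only runs of letters are matched; spaces and punctuation are left alone.
    return re.sub(r"[A-Za-z]+", mark, text)


print(mark_consonant_words("This is a text file. amazing words should be removed"))
# This+ is a text+ file+. amazing words+ should+ be+ removed+
```

Because re.sub only touches the matched words, the original spacing and line breaks are preserved as well.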
Do this and it will work the way you want:
# the re module provides support for regular expressions
import re

# regular expression matching a word that starts with a vowel
regex = r'^[aeiouAEIOU]'

# read the text file line by line
with open('input.txt', 'r') as fp:
    lines = fp.readlines()

for i in range(len(lines)):
    words = lines[i].split()  # split the line into words
    # append "+" to every word that does NOT start with a vowel
    lines[i] = ' '.join(w if re.search(regex, w) else f'{w}+' for w in words)

with open('output.txt', 'w') as f:
    for line in lines:
        f.write(line)
        f.write('\n')
Done!
I am trying to remove '\n' from the end of each element in a list imported from a .txt file
Here is the file I am importing:
NameOfFriends.txt
Joeseph
Mary
Alex
Here is the code I am running
a = open('NameOfFriends.txt', 'r').readlines()
print(a)
newList = str(a)
print(type(newList[0]))
for task in newList:
    newList = [task.replace('\n', '') for task in newList]
print(newList)
Here is the output:
['Joeseph\n', 'Mary\n', 'Alex']
<class 'str'>
['[', "'", 'J', 'o', 'e', 's', 'e', 'p', 'h', '\\', 'n', "'", ',', ' ', "'", 'M', 'a', 'r', 'y', '\\', 'n', "'", ',', ' ', "'", 'A', 'l', 'e', 'x', "'", ']']
Here is the desired output when variable newList is printed:
['Joeseph', 'Mary', 'Alex']
with open('NameOfFriends.txt') as file:
    names = []
    for name in file.readlines():
        names.append(name.strip("\n"))
print(names)
You don't need an extra for loop or the str() call; just apply the comprehension to the list a:
newList = [task.replace('\n', '') for task in a]
Or skip the extra step entirely and strip the newlines on the first line:
a = [line.rstrip('\n') for line in open('NameOfFriends.txt', 'r')]
newList isn't a list, it's a string because you did:
newList = str(a)
There's no need to do that.
You can remove the newlines when you're reading the file:
a = [line.rstrip('\n') for line in open('NameOfFriends.txt', 'r').readlines()]
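To see why str(a) leads to the character-by-character output in the question, the short sketch below reproduces it with an in-memory list. The list literal stands in for what readlines() returned, as shown in the question's output:

```python
# What readlines() returned, according to the output in the question.
a = ['Joeseph\n', 'Mary\n', 'Alex']

# str() turns the whole list into one string of its printed form,
# so iterating over it yields single characters such as '[' and "'".
as_text = str(a)
print(type(as_text), as_text[:3])

# Working on the list itself keeps one element per name.
cleaned = [name.rstrip('\n') for name in a]
print(cleaned)
```

The second print gives ['Joeseph', 'Mary', 'Alex'], which is the desired output from the question.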
If you only want to remove the line breaks from the text, you don't need a loop or comprehension at all. You can use the splitlines() method for that:
a = open('NameOfFriends.txt', 'r').read().splitlines()
This will do what you need.
You can use strip() or replace() for this:
replace()
a = open('NameOfFriends.txt', 'r').readlines()
new_list = []
for task in a:
    new_list.append(task.replace('\n', ''))
print(new_list)
strip()
a = open('NameOfFriends.txt', 'r').readlines()
new_list = []
for task in a:
    new_list.append(task.strip('\n'))
print(new_list)
with open('NameOfFriends.txt', 'r') as friends:
    # with allows the file to close automatically when the code block is done
    friends = friends.read()  # converts friends to a string
    friends_separated = friends.split()  # converts the string to a list
    print(friends_separated)  # prints the list
I'm importing data from a file object. It's pretty simple; it's a .txt file, and the data was entered like this: ABCDEFGHIJKLMNOPQRSTUVWXYZ
I am trying to get it to be a list of individual characters, e.g.
my_list = ['A', 'B', 'C', 'D', ...etc.]
but it's showing up like this:
my_list = ['ABCDEFGHIJK... etc.]
Where am I going wrong?
def split(string):
    return [char for char in string]

# This opens a file, gets a file object, and returns it to the program.
def get_file_object1():
    infile = open(r'#', 'r')
    file_object = infile.readlines()
    testing = split(file_object)  # this is a test line
    print(testing)  # this is a test line
    print(split('This is another test.'))
    infile.close()
    return file_object
Note: when I pass the file object to split(file_object), I get this
['ABCDEFGHIJKLMNOPQRSTUVWXYZ']
But when I pass a string of text to split('This is another test.'), I get this:
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', 'n', 'o', 't', 'h', 'e', 'r', ' ',
't', 'e', 's', 't', '.']
Simple, just read the file and apply list() ;-)
with open(r'C:\Users\Liz\Desktop\PYTHON PROBLEMS\cipherUpper.txt') as f:
    myList = list(f.read().strip())
myList is then your list of characters.
The return type of readlines() is list, so you're basically doing this:
string = ['ABC..', ...]
[char for char in string]
where char is equal to 'ABC..'
so you need to iterate over the result of readlines:
testing = []
for line in f.readlines():
    testing = testing + split(line)
or read a single line with readline() instead.
split is iterating the object it receives
When you pass file_object, the lines of the file are iterated. There is only one line, so you get a list containing that single line as a string.
testing = split(file_object) # >> ['ABCDEFGHIJKLMNOPQRSTUVWXYZ']
When you pass a string, the characters of the string are iterated so you get a list of characters.
print(split('This is another test.')) # >> ['T', 'h', 'i', 's', ....
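The difference can be reproduced without a file. In the sketch below, the list literal is an assumption standing in for what readlines() returned in the question; passing the list iterates its elements, while passing its first element iterates the characters:

```python
def split(string):
    return [char for char in string]


# Stand-in for the result of infile.readlines() on the one-line file.
lines = ['ABCDEFGHIJKLMNOPQRSTUVWXYZ']

print(split(lines))     # iterates the list: one element, the whole line
print(split(lines[0]))  # iterates the string: one element per character
```

So either index into the list (or use readline() or read()) before calling split, or loop over the lines and split each one.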
I'm trying to write a program for the micro:bit which displays text as morse code. I've looked at multiple websites and Stack Overflow posts for a way to split a string into characters.
E.g.
string = "hello"
to
chars = ["h","e","l","l","o"]
I tried creating a function called array, to do this, but this didn't work.
I then tried this:
def getMessage():
    file = open("file.txt", "r")
    data = file.readlines()
    file.close()
    words = []
    for line in data:
        for word in line:
            words.append(word)
    return words
Any ideas?
You can use the built-in list() function:
>>> list("A string")
['A', ' ', 's', 't', 'r', 'i', 'n', 'g']
In your case, you can call list(getMessage()) to convert the contents of the file to chars.
You can try something like this:
word="hello"
result = []
result[:0] = word
print(result)
Now result will be ['h', 'e', 'l', 'l', 'o']
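The slice assignment works because assigning any iterable to result[:0] inserts its items at the front of the list, and iterating a string yields its characters. As a quick sketch, it gives the same result as list() and as unpacking with *:

```python
word = "hello"

via_list = list(word)   # built-in list() iterates the string
via_unpack = [*word]    # unpacking into a list literal

via_slice = []
via_slice[:0] = word    # slice assignment inserts each character

print(via_list, via_unpack, via_slice)
```

All three produce ['h', 'e', 'l', 'l', 'o'], so list(word) is usually the most readable choice.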