Finding a sequence in a line of letters [closed] - python

I work with mRNA and would like to identify particular letter sequences in a line of mRNA. mRNA is a string of letters that codes for proteins (it can be represented as a .txt or .fasta file). The mRNA string consists of the letters "A", "U", "G" and "C" and can be tens of thousands of letters long. In the cell, mRNA can also be methylated at adenine ("A") within particular sequences. In humans, methylation occurs at the sequence "DRACH", where "D" stands for A/G/U, "R" for A/G and "H" for U/A/C, which gives a total of 3x2x3=18 potential letter combinations if my math is right. I want to write code in Python that reads my .txt/.fasta file with the mRNA string, scans it for all 18 "DRACH" sequences, lists them and highlights them in the sequence.
I created a mock .txt file (C:\rna\RNA_met.txt) containing the string: "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
It has 2 DRACH sequences: AGACU and GAACC.
I haven't done any coding yet, but I suspect my task can be broken down into subtasks. Subtask 1 would be to make a program 'read' my .txt file. The second task would be to teach the program to recognise the DRACH sequence. The third task would be to make Python show the mRNA string with the DRACH sequences highlighted.
For subtask 1, I ran the following code in Spyder:
file = open('RNA_met.txt', 'r')
f = file.readlines()
print(f)
The code ran without errors, but unfortunately I did not see my sequence.
I tried changing it to the full file path:
f = open("C:\\rna\RNA_met.txt", "r")
print(f.read())
but it also didn't help.
Any ideas on how I might fix the first subtask before moving on to the second one?
Thanks!
Maria

Here is a full solution using a regex (regular expressions are very useful for working with strings):
import re
from typing import Dict, List

D = "[AGU]"
R = "[AG]"
A = "A"
C = "C"
H = "[UAC]"
RE_DRACH_PATTERN = re.compile(f"{D}{R}{A}{C}{H}")

def find_drach_seq(mrna_seq: str) -> List[Dict]:
    ret = []
    for a_match in re.finditer(RE_DRACH_PATTERN, mrna_seq):
        ret.append(
            {"start": a_match.start(), "end": a_match.end(), "drach": a_match.group()}
        )
    return ret

def find_drach_in_file(in_file_path: str) -> List[Dict]:
    ret = []
    current_line = 0
    with open(in_file_path, "r", encoding="UTF-8") as fr:
        for line in fr:
            current_line += 1
            drach_matches = find_drach_seq(line)
            for a_match in drach_matches:
                a_match["line"] = current_line
                ret.append(a_match)
    return ret

if __name__ == "__main__":
    mrna_seq = "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
    for a_match in find_drach_seq(mrna_seq):
        print(a_match)

    in_file = "m_rna.txt"
    for a_match in find_drach_in_file(in_file_path=in_file):
        print(a_match)
My wife is a pathology doctor and at some point may need to learn about mRNA (I forget the name of the specialization). It would be great if we could share.
Anyway, your second snippet is missing an escape; the path should be C:\\rna\\RNA_met.txt.
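For example, both of these spell the same Windows path (the second uses a raw string so the backslashes are taken literally):

path = "C:\\rna\\RNA_met.txt"   # every backslash escaped
path = r"C:\rna\RNA_met.txt"    # raw string: no escaping needed

with open(path, "r") as f:
    print(f.read())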

You can take advantage of Python dictionaries (hash tables) to come up with an efficient solution like the following:
f = open("RNA_met.txt", "r")
seq = f.read()
#In this case, the content of .txt was "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
combinations = {}
for i in ["A", "G", "U"]:
for j in ["A", "G"]:
for k in ["U", "A", "C"]:
combinations[f"{i}{j}AC{k}"] = ""
for i in range(0, len(seq)-5):
if seq[i:i+5] in combinations:
print(seq[i:i+5], "Sequence found on: ", i)
Output:
AGACU Sequence found on: 15
GAACC Sequence found on: 24
This algorithm stores all 18 possible "DRACH" strings as keys in a hash table, then slides a five-letter window across the sequence read from the .txt file. Whenever the window matches a key, it prints the match and its starting position within the long sequence of letters.
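For the third subtask (highlighting), here is a minimal sketch along the same lines. It reuses the combinations dict built above and wraps each match in brackets, which is one simple way to "highlight" in plain text:

def highlight_drach(seq, combinations):
    # return seq with every DRACH window wrapped in [brackets]
    out = []
    i = 0
    while i < len(seq):
        window = seq[i:i + 5]
        if window in combinations:
            out.append(f"[{window}]")
            i += 5  # skip past the highlighted window
        else:
            out.append(seq[i])
            i += 1
    return "".join(out)

print(highlight_drach(seq, combinations))
# AACGAUUCGACCGCA[AGACU]GGGC[GAACC]AUUCUAA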

Related

How can I pull out text snippets around specific words?

I have a large txt file and I'm trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above.
def occurs(word1, word2, filename):
    import os
    infile = open(filename, 'r')  # opens file, reads, splits into lines
    lines = infile.read().splitlines()
    infile.close()
    wordlist = [word1, word2]  # this list allows for multiple words
    wordsString = ''.join(lines)  # joins the lines into one string
    words = wordsString.split()  # splits the string into individual words
    f = open(filename, 'w')
    f.write("start")
    f.write(os.linesep)
    for word in wordlist:
        matches = [i for i, w in enumerate(words) if w.lower().find(word) != -1]
        for m in matches:
            l = " ".join(words[m - 15:m + 16])
            f.write(f"...{l}...")  # writes the data to the external file
            f.write(os.linesep)
    f.close()
So far, when two of the same word are too close together, the program just doesn't run on one of them. Instead, I want to get out a longer chunk of text that extends 15 words behind the furthest-back word and 15 words in front of the furthest-forward word.
This snippet will get the number of words around the chosen keyword. If some keywords appear close together, it will join their snippets:
s = '''xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15 words on either side. I'm running into a problem when there are two instances of that word within 15 words of each other, which I'm trying to get as one large snippet of text.
I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working code for all instances except the scenario mentioned above. xxx'''
words = s.split()

from itertools import groupby, chain

word = 'xxx'

def get_snippets(words, word, l):
    snippets, current_snippet, cnt = [], [], 0
    for v, g in groupby(words, lambda w: w != word):
        w = [*g]
        if v:
            if len(w) < l:
                current_snippet += [w]
            else:
                current_snippet += [w[:l] if cnt % 2 else w[-l:]]
                snippets.append([*chain.from_iterable(current_snippet)])
                current_snippet = [w[-l:] if cnt % 2 else w[:l]]
                cnt = 0
            cnt += 1
        else:
            if current_snippet:
                current_snippet[-1].extend(w)
            else:
                current_snippet += [w]
    if current_snippet[-1][-1] == word or len(current_snippet) > 1:
        snippets.append([*chain.from_iterable(current_snippet)])
    return snippets

for snippet in get_snippets(words, word, 15):
    print(' '.join(snippet))
Prints:
xxx I have a large txt file and I'm xxx trying to pull out every instance of a specific word, as well as the 15
other, which I'm trying to get as one large snippet of text. I'm trying to xxx get chunks of text to analyze about a specific topic. So far, I have working
topic. So far, I have working code for all instances except the scenario mentioned above. xxx
With the same data and a different length:

for snippet in get_snippets(words, word, 2):
    print(' '.join(snippet))
Prints:
xxx and I'm
I have xxx trying to
trying to xxx get chunks
mentioned above. xxx
As always, a variety of solutions is available here. A fun one would be a recursive wordFind, where it searches the next 15 words and, if it finds the target word, calls itself — something like the sketch below.
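That recursive idea might look roughly like this (a sketch only; the function name and exact bounds are mine and untested):

def snippet_end(words, word, m, reach=15):
    # find where the snippet for the match at index m should stop,
    # recursing whenever another hit appears inside the window
    for offset in range(1, reach + 1):
        if m + offset < len(words) and words[m + offset].lower() == word:
            return snippet_end(words, word, m + offset, reach)
    return min(m + reach, len(words) - 1)

# usage: l = " ".join(words[max(m - 15, 0):snippet_end(words, word, m) + 1])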
A simpler, though perhaps not efficient, solution would be to add words one at a time:
for m in matches:
    snippet = words[max(m - 15, 0):m + 1]  # the match plus the 15 words before it
    i = 1
    end = m + 16
    while m + i < min(end, len(words)):
        snippet.append(words[m + i])
        if words[m + i].lower() == word:
            end = m + i + 16  # another hit inside the window: extend it
        i += 1
    f.write(f"...{' '.join(snippet)}...")  # writes the data to the external file
    f.write(os.linesep)
Or, if you want subsequent occurrences folded into the same snippet rather than starting new ones...

extending = False
for m in matches:
    if not extending:
        snippet = words[max(m - 15, 0):m + 1]
        f.write("...")
    extending = False
    for i in range(1, 16):
        if m + i >= len(words):
            break
        snippet.append(words[m + i])
        if words[m + i].lower() == word:
            extending = True  # the next match keeps extending this snippet
            break
    if not extending:
        f.write(" ".join(snippet))
        f.write("...")
        f.write(os.linesep)
Note: I have not tested this, so it may require a bit of debugging. But the gist is clear: add words piecemeal and extend the addition process when a target word is encountered. This also lets you extend on target words other than the current one, with a small addition to the second conditional.

Word & Line Concordance Program

I originally posted this question here but was then told to post it to Code Review; however, they told me that my question needed to be posted here instead. I will try to explain my problem better so hopefully there is no confusion. I am trying to write a word-concordance program that will do the following:
1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you’re timing) containing only stop words, called stopWordDict. (WARNING: Strip the newline(‘\n’) character from the end of the stop word before adding it to stopWordDict)
2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary(called wordConcordanceDict) containing “main” words for the keys with a list of their associated line numbers as their values.
3) Traverse the wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed out in alphabetical order along with their corresponding line numbers.
I tested my program on a small file with a short list of stop words and it worked correctly (I have provided an example of this below). The outcome was what I expected: a list of the main words with their line numbers, excluding words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I am actually trying to process is that the main file is much longer and contains punctuation. So the problem I am running into is that when I run my program with the main file, I get far more results than expected, because the punctuation is not being removed from the file.
For example, below is a section of the outcome where my code counted the word Dmitri as four separate words because of the different capitalization and punctuation that follows the word. If my code were to remove the punctuation correctly, the word Dmitri would be counted as one word followed by all the locations found. My output is also separating upper and lower case words, so my code is not making the file lower case either.
What my code currently displays:
Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]
Dmitri! : [16671, 16672]
Dmitri, : [2530, 3676, 3685, 13160, 16247]
dmitri : [2000]
What my code should display:
dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]
Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction made between upper and lower case letters, but my program splits those up as well; however, blank lines are to be counted in the line numbering.
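In other words, that word definition maps onto a single pattern like this (just to illustrate the spec, not code I have working):

import re
line = "Dmitri! said Dmitri, to dmitri."
words = re.findall(r"[a-z]+", line.lower())  # letter runs only, case folded
print(words)  # ['dmitri', 'said', 'dmitri', 'to', 'dmitri']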
Below is my code and I would appreciate it if anyone could take a look at it and give me any feedback on what I am doing wrong. Thank you in advance.
import re

def main():
    stopFile = open("stop_words.txt", "r")
    stopWordDict = dict()
    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt", "r")
    wordConcordanceDict = dict()
    lineNum = 1
    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1
    for word in sorted(wordConcordanceDict):
        print(word, " : ", wordConcordanceDict[word])

if __name__ == "__main__":
    main()
Just as another example and reference here is the small file I test with the small list of stop words that worked perfectly.
stop_words_small.txt file
a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was
small_file.txt
This is a sample data (text) file to
be processed by your word-concordance program.
The real data file is much bigger.
correct output
bigger: 4
concordance: 2
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
word: 2
your: 2
You can do it like this:
import re
from collections import defaultdict

wordConcordanceDict = defaultdict(list)

with open('stop_words_small.txt') as sw:
    words = (line.strip() for line in sw)
    stop_words = set(words)

with open('small_file.txt') as f:
    for line_number, line in enumerate(f, 1):
        words = (re.sub(r'[^\w\s]', '', word).lower() for word in line.split())
        good_words = (word for word in words if word not in stop_words)
        for word in good_words:
            wordConcordanceDict[word].append(line_number)

for word in sorted(wordConcordanceDict):
    print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))
Output:
bigger: 4
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
wordconcordance: 2
your: 2

I will add explanations tomorrow, it's getting late here ;). Meanwhile, you can ask in the comments if some part of the code isn't clear to you.

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The letters are all formatted in roughly the same way:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")

with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)

totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)

while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
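For completeness, calling it would look something like this (the input filename here is made up):

split_letters('letters.txt')
# writes letters1.txt, letters2.txt, ... one file per letter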
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?

Count the number of times a letter appears in a text file in python [closed]

I'm aiming to create a python script that counts the number of times each letter appears from a text file. So if the text file contained Hi there, the output would be something like
E is shown 2 times
H is shown 2 times
I is shown 1 time
R is shown 1 time
T is shown 1 time
I've tried different ways of getting this, but no output is shown because I keep getting syntax errors. I've tried the following:
import collections
import string

def count_letters(example.txt, case_sensitive=False):
    with open(example.txt, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        alphabet = string.ascii_letters
        text = original_text
    else:
        alphabet = string.ascii_lowercase
        text = original_text.lower()
    alphabet_set = set(alphabet)
    counts = collections.Counter(c for c in text if c in alphabet_set)
    for letter in alphabet:
        print(letter, counts[letter])
    print("total:", sum(counts.values()))
    return counts
And
def count_letters(example.txt, case_sensitive=False):
    alphabet = "abcdefghijlkmnopqrstuvxyzABCDEFGHIJKLMNOPQRSTUVXYZ"
    with open(example.txt, 'r') as f:
        text = f.read()
    if not case_sensitive:
        alpahbet = alphabet[:26]
        text = text.lower()
    letter_count = {ltr: 0 for ltr in alphabet}
    for char in text:
        if char in alphabet:
            letter_count[char] += 1
    for key in sorted(letter_count):
        print(key, letter_count[key])
    print("total", sum(letter_count()))
There were a few problems I found when running your script. One was correctly identified by @Priyansh Goel in his answer: you can't use example.txt as a parameter name. You should choose a variable name like text_file, and when you call the function, pass in the file's name as a string.
Also there was an indentation error or two. Here's the script I got to work:
import collections
import string

def count_letters(text_file, case_sensitive=False):
    with open(text_file, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        alphabet = string.ascii_letters
        text = original_text
    else:
        alphabet = string.ascii_lowercase
        text = original_text.lower()
    alphabet_set = set(alphabet)
    counts = collections.Counter(c for c in text if c in alphabet_set)
    for letter in alphabet:
        print(letter, counts[letter])
    print("total:", sum(counts.values()))
    return counts

count_letters("example.txt")
If you will only ever use this on "example.txt", just get rid of the first parameter and hard code the file name into the function:
def count_letters(case_sensitive=False):
    with open("example.txt", 'r') as f:
        ...

count_letters()
One of the best skills you can develop as a programmer is learning to read and understand the errors that get thrown. They're not meant to be scary or frustrating (although sometimes they are), they're meant to be helpful. Syntax errors like what you had are especially useful. If it isn't totally obvious what the errors are indicating, copy and paste the error into a Google search and more often than not you'll find the answer to your question already exists out there.
Good luck in learning! Python was a great choice for your (presumably) first language!
In your function, you can't have example.txt as a parameter name.
The following code traverses only the letters of the text rather than the whole alphabet set. I am using a dict to store the frequency of letters; isalpha is used so that only letters are put in the dictionary.
def count_letters(textfile, case_sensitive=False):
    with open(textfile, 'r') as f:
        original_text = f.read()
    if case_sensitive:
        text = original_text
    else:
        text = original_text.lower()
    p = dict()
    for i in text:
        if i in p.keys():
            p[i] += 1
        elif i.isalpha():
            p[i] = 1
    keys = p.keys()
    for k in keys:
        print(str(k) + " " + str(p[k]))

count_letters("example.txt")

Python algorithm - Jumble solver [closed]

I'm writing a program to find all the possible combinations of a jumbled word from a dictionary in Python.
Here's what I've written. It runs in O(n^2) time, so my question is: can it be made faster?
import sys

dictfile = "dictionary.txt"

def get_words(text):
    """ Return a list of dict words """
    return text.split()

def get_possible_words(words, jword):
    """ Return a list of possible solutions """
    possible_words = []
    jword_length = len(jword)
    for word in words:
        jumbled_word = jword
        if len(word) == jword_length:
            letters = list(word)
            for letter in letters:
                if jumbled_word.find(letter) != -1:
                    jumbled_word = jumbled_word.replace(letter, '', 1)
            if not jumbled_word:
                possible_words.append(word)
    return possible_words

if __name__ == '__main__':
    words = get_words(file(dictfile).read())
    if len(sys.argv) != 2:
        print "Incorrect Format. Type like"
        print "python %s <jumbled word>" % sys.argv[0]
        sys.exit()
    jumbled_word = sys.argv[1]
    words = get_possible_words(words, jumbled_word)
    print "possible words :"
    print '\n'.join(words)
The usual fast solution to anagram problems is to build a mapping of sorted letters to a list of the unsorted words.
With that structure in-hand, the lookups are immediate and fast:
def build_table(wordlist):
    table = {}
    for word in wordlist:
        key = ''.join(sorted(word))
        table.setdefault(key, []).append(word)
    return table

def lookup(jumble, table):
    key = ''.join(sorted(jumble))
    return table.get(key, [])

if __name__ == '__main__':
    # Build table
    with open('/usr/share/dict/words') as f:
        wordlist = f.read().lower().split()
    table = build_table(wordlist)

    # Solve some jumbles
    for jumble in ['tesb', 'amgaarn', 'lehsffu', 'tmirlohag']:
        print(lookup(jumble, table))
Notes on speed:
The lookup() code is the fast part.
The slower build_table() function is written for clarity.
Building the table is a one-time operation.
If you care about run-time across repeated runs, the table should be cached in a text file.
Text file format (alpha-order first, followed by the matching words):
aestt state taste tates testa
enost seton steno stone
...
With the preprocessed anagram file, it becomes a simple matter to use subprocess to grep the file for the appropriate line of matching words. This should give a very fast run time (because the sorts and matches were precomputed and because grep is so fast).
Build the preprocessed anagram file like this:
with open('/usr/share/dict/words') as f:
    wordlist = f.read().split()

table = {}
for word in wordlist:
    key = ''.join(sorted(word)).lower()
    table[key] = table.get(key, '') + ' ' + word

lines = ['%s%s\n' % t for t in table.items()]
with open('anagrams.txt', 'w') as f:
    f.writelines(lines)
I was trying to solve this using Ruby -
https://github.com/hackings/jumble_solver
Alter get_words to return a dict(), with each key having a value of True or 1. Then import itertools and use itertools.permutations (combinations alone won't reorder the letters) to make all possible anagrammatic strings from the "jumbled_word", and loop over the candidate strings checking whether they are keys in the dict — see the sketch below.
If you wanted a DIY algorithmic solution, then loading the dictionary into a tree might be "better", but I doubt that in the real world it would be faster.
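A rough sketch of that approach (one reading of the suggestion; note the number of permutations grows factorially with word length):

from itertools import permutations

def get_words(text):
    # return the dictionary as a dict for O(1) membership tests
    return {w: 1 for w in text.split()}

def solve(jumble, word_dict):
    # try every rearrangement of the letters, keep the real words
    candidates = {''.join(p) for p in permutations(jumble)}
    return sorted(c for c in candidates if c in word_dict)

with open("dictionary.txt") as f:  # same dictionary file as in the question
    word_dict = get_words(f.read())
print(solve("tesb", word_dict))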
