I have a file with multiple FASTA sequences such as:
File1.fa
>seq1
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTATATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
And I have a table such as:
tab
Seq positions
seq1 3:10
seq2 10:20,45:60
And for each tab['Seq'] I would like to replace the letters at the corresponding positions within File1.fa with an X.
As you can see in the second row, there can be multiple ranges to replace (these ranges are separated by , in the tab['positions'] column).
Here I should then get a new_File1.fa such as:
>seq1
AAXXXXXXXXATACCCTACCATTACCATTACCATTACCATTACCATTACCATTACCATTTTATTATTATTATATACCACACA
>seq2
AAATTTTTAXXXXXXXXXXXATTACCATTACCATTACCATTACCXXXXXXXXXXXXXXXXTATTATTATTATATACCACACA
where for seq1 I replaced positions 3 to 10 with X, and for seq2 I replaced positions 10 to 20 and positions 45 to 60 with X.
I guess using biopython package should be a solution here?
So far I tried the following:
from Bio import SeqIO
import re

record_dict = SeqIO.to_dict(SeqIO.parse("File1.fa", "fasta"))
for index, row in tab.iterrows():
    start = re.sub(":.*", "", row['positions'])
    end = re.sub(".*:", "", row['positions'])
    print(record_dict[row['Seq']].seq[int(start):int(end)])
But as you can see I only manage to extract the part I want to replace with X and I cannot figure out how to take into account when there are multiple positions to replace in the sequence.
Convert the sequences to lists, replace your chosen ranges, then convert back to strings. For example,
seq = "AAABBBCCC"
s = list(seq)
for idx in range(3, 6):
    s[idx] = "X"
new_seq = ''.join(s)
print(new_seq)
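Applied to your files, a minimal sketch (assuming the positions in tab are 1-based and inclusive, as your expected output suggests, and that tab is already loaded as a pandas DataFrame with 'Seq' and 'positions' columns):

from Bio import SeqIO
from Bio.Seq import Seq

record_dict = SeqIO.to_dict(SeqIO.parse("File1.fa", "fasta"))
for _, row in tab.iterrows():
    s = list(str(record_dict[row['Seq']].seq))
    for rng in row['positions'].split(','):  # e.g. "10:20,45:60"
        start, end = map(int, rng.split(':'))
        for i in range(start - 1, end):       # 1-based, inclusive
            s[i] = 'X'
    record_dict[row['Seq']].seq = Seq(''.join(s))

with open("new_File1.fa", "w") as handle:
    SeqIO.write(record_dict.values(), handle, "fasta")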
I have a dataframe with two columns: the first holds names of organisms and the second holds their sequences, each a string of letters. I am trying to create an algorithm to see whether an organism's sequence occurs in a larger genome, which is also a string of letters. If it is in the genome, I want to add the name of the organism to a list. So, for example, if flu is in the genome below, I want flu to be added to a list.
dict_1={'organisms':['flu', 'cold', 'stomach bug'], 'seq_list':['HTIDIJEKODKDMRM',
'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df=pd.DataFrame(dict_1)
organisms seq_list
0 flu HTIDIJEKODKDMRM
1 cold AGGTTTEFGFGEERDDTER
2 stomach bug EGHDGGEDCGRDSGRDCFD
genome='TLTPSRDMEDHTIDIJEKODKDMRM'
The first function finds the index of the match, if there is one, where p is the organism sequence and t is the genome. The second portion is the one I am having trouble with. I am trying to use a for loop to search each entry in the df, but if I get a match, I am not sure how to reference the first column in the df to add the name to the empty list. Thank you for your help!
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences
Organisms_that_matched = []
for x in df:
    matches = naive(genome, x)
    if len(matches) > 0:
        # add name of organism to Organisms_that_matched list
I'm not sure if you are learning about different ways to traverse a list and apply custom logic, but you can use a list comprehension:
import pandas as pd
dict_1 = {
'organisms': ['flu', 'cold', 'stomach bug'],
'seq_list': ['HTIDIJEKODKDMRM', 'AGGTTTEFGFGEERDDTER', 'EGHDGGEDCGRDSGRDCFD']}
df = pd.DataFrame(dict_1)
genome = 'TLTPSRDMEDHTIDIJEKODKDMRM'
organisms_that_matched = [dict_1['organisms'][index]
                          for index, x in enumerate(dict_1['seq_list'])
                          if x in genome]
print(organisms_that_matched)
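Since the data is already in a DataFrame, an equivalent lookup with boolean indexing might look like this (a sketch using the df and genome defined above):

matched = df.loc[df['seq_list'].apply(lambda s: s in genome), 'organisms'].tolist()
print(matched)  # ['flu']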
I'm an amateur with basic coding skills in Python. I'm working on a data frame that has a column as below. The intent is to group the output of nltk.FreqDist by the first word.
What I have so far
t_words = df_tech['message']
data_analysis = nltk.FreqDist(t_words)
# Let's take the specific words only if their frequency is greater than 3.
filter_words = dict([(m, n) for m, n in data_analysis.items() if len(m) > 3])
for key in sorted(filter_words):
    print("%s: %s" % (key, filter_words[key]))
sample current output:
click full refund showing currently viewed rr number: 1
click go: 1
click post refund: 1
click refresh like replace tokens sending: 1
click refund: 1
click refund order: 1
click resend email confirmation: 1
click responsible party: 1
click send right: 1
click tick mark right: 1
I have 10000+ rows in my output.
My Expected Output
I would like to group the output by the first word and extract it as a dataframe
What I have tried among other solutions
I have tried adapting the solutions given here and here, but with no satisfactory results.
Any help/guidance appreciated.
Try the following (documentation is inside the code):
import itertools

# I assume the input, t_words, is a list of strings (each containing multiple words)
t_words = ...

# This creates a counter from a string to its occurrences
input_frequencies = nltk.FreqDist(t_words)

# Taking inputs only if they appear 3 or more times.
# This is similar to your code, but looks at the frequency. Your previous code
# did len(m) where m was the message. If you want to filter by the string length,
# you can restore it to len(input_str) > 3
frequent_inputs = {
    input_str: count
    for input_str, count in input_frequencies.items()
    if count > 3
}

# We will apply this function on each string to get the first word (to be
# used as the key for the grouping)
def first_word(value):
    # You can replace this by a better implementation from nltk
    return value.split(' ')[0]

# Now we will use itertools.groupby for the grouping, as documented in
# https://docs.python.org/3/library/itertools.html#itertools.groupby
# Note that groupby only groups consecutive items, so the input is sorted first.
first_word_to_inputs = itertools.groupby(
    # Take the strings from the above dictionary, sorted so equal keys are adjacent
    sorted(frequent_inputs.keys()),
    # And key by the first word
    first_word)

# If you would also want to keep the count of each word, we can map from the
# first word to a list of (string, count) pairs:
first_word_to_inputs_and_counts = itertools.groupby(
    # Pairs of strings and counts, sorted by the string
    sorted(frequent_inputs.items()),
    # Extract the string from the pair, and then take the first word
    lambda pair: first_word(pair[0])
)
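Note that itertools.groupby yields (key, group-iterator) pairs, so to inspect the result you have to materialize each group, for example:

for word, group in first_word_to_inputs:
    print(word, list(group))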
I managed to do it as shown below. There may be an easier implementation, but for now this gives me what I expected.
temp = pd.DataFrame(sorted(data_analysis.items()), columns=['word', 'frequency'])
temp['word'] = temp['word'].apply(lambda x: x.strip())
#Removing empty rows
filter = temp["word"] != ""
dfNew = temp[filter]
#Splitting first word
dfNew['first_word'] = dfNew.word.str.split().str.get(0)
#New column with setences split without first word
dfNew['rest_words'] = dfNew['word'].str.split(n=1).str[1]
#Subsetting required columns
dfNew = dfNew[['first_word','rest_words']]
# Grouping by first word
dfNew= dfNew.groupby('first_word').agg(lambda x: x.tolist()).reset_index()
#Transpose
dfNew.T
Sample output: a dataframe grouped by first_word, with the remaining words collected into lists.
How can I search for the value of the first "latitude, longitude" coordinate in a "file.txt" list in Python and get the 3 rows above and the 3 rows below it?
Value
37.0459
file.txt
37.04278,-95.58895
37.04369,-95.58592
37.04369,-95.58582
37.04376,-95.58557
37.04376,-95.58546
37.04415,-95.58429
37.0443,-95.5839
37.04446,-95.58346
37.04461,-95.58305
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
37.04508,-95.57914
37.04494,-95.57842
37.04483,-95.5771
37.0448,-95.57674
37.04474,-95.57606
37.04467,-95.57534
37.04462,-95.57474
37.04458,-95.57396
37.04454,-95.57274
37.04452,-95.57233
37.04453,-95.5722
37.0445,-95.57164
37.04448,-95.57122
37.04444,-95.57054
37.04432,-95.56845
37.04432,-95.56834
37.04424,-95.5668
37.044,-95.56251
37.04396,-95.5618
Expected Result
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04516,-95.57948
Additional information
In Linux I can get the closest line and do the processing I need using grep, sed, cut and others, but I'd like to do it in Python.
Any help will be greatly appreciated!
Thank you.
You can try:
with open("text_filter.txt") as f:
text = f.readlines() # read text lines to list
filter= "37.0459"
match = [i for i,x in enumerate(text) if filter in x] # get list index of item matching filter
if match:
if len(text) >= match[0]+3: # if list has 3 items after filter, print it
print("".join(text[match[0]:match[0]+3]).strip())
print(text[match[0]].strip())
if match[0] >= 3: # if list has 3 items before filter, print it
print("".join(text[match[0]-3:match[0]]).strip())
Output:
37.04597,-95.58127
37.04565,-95.58073
37.04546,-95.58033
37.04597,-95.58127
37.04502,-95.58204
37.04516,-95.58184
37.04572,-95.58139
You can use pandas to import the data into a dataframe and then easily manipulate it. As per your question, the value to check is not an exact match, so I have converted it to a string and compared prefixes.
import pandas as pd

data = pd.read_csv("file.txt", header=None, names=["latitude", "longitude"])  # imports text file as dataframe
value_to_check = 37.0459  # user defined

for i in range(len(data)):
    if str(value_to_check) == str(data.iloc[i, 0])[:len(str(value_to_check))]:
        break
print(data.iloc[i - 3:i + 4, :])
Output:
latitude longitude
9 37.04502 -95.58204
10 37.04516 -95.58184
11 37.04572 -95.58139
12 37.04597 -95.58127
13 37.04565 -95.58073
14 37.04546 -95.58033
15 37.04516 -95.57948
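If you prefer to avoid the row-by-row loop, the same prefix match can be vectorized (a sketch; idxmax returns the label of the first True, which equals the position under the default RangeIndex, and this assumes at least one match exists):

mask = data["latitude"].astype(str).str.startswith(str(value_to_check))
i = mask.idxmax()  # first matching row
print(data.iloc[max(i - 3, 0):i + 4, :])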
A solution with iterators that keeps only the necessary lines in memory and doesn't load the unnecessary part of the file:
from collections import deque
from itertools import islice

def find_in_file(file, target, before=3, after=3):
    queue = deque(maxlen=before)
    with open(file) as f:
        for line in f:
            if target in map(float, line.split(',')):
                out = list(queue) + [line] + list(islice(f, after))
                return out
            queue.append(line)
        else:
            raise ValueError('target not found')
Some tests:
print(find_in_file('test.txt', 37.04597))
# ['37.04502,-95.58204\n', '37.04516,-95.58184\n', '37.04572,-95.58139\n', '37.04597,-95.58127\n',
#  '37.04565,-95.58073\n', '37.04546,-95.58033\n', '37.04516,-95.57948\n']
print(find_in_file('test.txt', 37.044)) # Only one line after the match
# ['37.04432,-95.56845\n', '37.04432,-95.56834\n', '37.04424,-95.5668\n', '37.044,-95.56251\n',
# '37.04396,-95.5618\n']
Also, it works if there are fewer than the expected number of lines before or after the match. We match floats, not strings, as '37.04' would erroneously match '37.0444' otherwise.
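A quick illustration of why matching floats matters here (a hypothetical check against one line of the file):

line = '37.0444,-95.57054'
print('37.04' in line)                        # True  -- substring over-matches
print(37.04 in map(float, line.split(',')))   # False -- exact float comparison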
This solution will print the before and after elements even if there are fewer than 3 of them.
I am also matching on strings, as the question implies that you want partial matches too, i.e. 37.0459 should match 37.04597.
search_term = '37.04462'

with open('file.txt') as f:
    lines = f.readlines()

lines = [line.strip().split(',') for line in lines]  # remove '\n'

# enumerate gives the index directly (lines.index() would return the
# first occurrence, which is wrong if the file contains duplicates)
for index, (lat, lon) in enumerate(lines):
    if search_term in lat:
        break

left = 0
right = 0
for k in range(1, 4):  # because the last one is not included
    if index - k >= 0:
        left += 1
    if index + k <= (len(lines) - 1):
        right += 1

for i in range(index - left, index + right + 1):  # because the last one is not included
    print(lines[i][0], lines[i][1])
Assume I have A1 as the only cell in a workbook, and it's blank.
I want my code to add "1", "2", and "3" to it so it says "1 2 3".
As of now I have:
NUMBERS = [1, 2, 3, 4, 5]
ThisSheet.Cells(1,1).Value = NUMBERS
this just writes the first value to the cell. I tried
ThisSheet.Cells(1,1).Value = NUMBERS[0-2]
but that just puts the LAST value in there. Is there a way for me to just add all of the data in there? This information will always be in string format, and I need to use win32com.
update:
I did
stringVar = ', '.join(str(v) for v in LIST)
UPDATE:this .join works perfectly for the NUMBERS list. Now I tried attributing it to another list that looks like this
LIST = ['Description Good\nBad', 'Description Valid\nInvalid']
If I print LIST[0] The outcome is
Description Good
Bad
Which is what I want. But if I use .join on this one, it prints
('Description Good\nBad, Description Valid\nInvalid')
so for this one I need it to print as though I did LIST[0] and LIST[1]
So if you want to put each number in a different cell, you would do something like:
it = 1
for num in NUMBERS:
    ThisSheet.Cells(1, it).Value = num
    it += 1
Or if you want the first 3 numbers in the same cell:
ThisSheet.Cells(1, 1).Value = ' '.join([str(num) for num in NUMBERS[:3]])
Or all of the elements in NUMBERS:
ThisSheet.Cells(1,1).Value = ' '.join([str(num) for num in NUMBERS])
EDIT
Based on your question edit, for string types containing \n, and assuming that every time you find a newline character you want to jump to the next row:
# Split LIST[0] by the \n character
splitted_lst0 = LIST[0].split('\n')

# Iterate through LIST[0] split by newlines
it = 1
for line in splitted_lst0:
    ThisSheet.Cells(it, 1).Value = line  # Cells(row, column): move down one row per line
    it += 1
If you want to do this for the whole LIST and not only for LIST[0], first merge it with the join method and split it just after it:
joined_list = ('\n'.join(LIST)).split('\n')  # join with '\n' so adjacent items don't fuse together
And then, iterate through it the same way as we did before.
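Putting that together (a sketch, reusing the hypothetical LIST and ThisSheet from above):

joined_list = ('\n'.join(LIST)).split('\n')
it = 1
for line in joined_list:
    ThisSheet.Cells(it, 1).Value = line
    it += 1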
I have data in a text file that is space separated into right-aligned columns. I would like to take each column and put it in a list, basically like you would do with an array. I can't seem to find an equivalent of
left(strname, #ofcharacters) / mid(strname, firstcharacter, lastcharacter) / right(strname, #ofcharacters)
like you would normally use in VB to accomplish the task. How do I split off each column in Python and collect its values across the lines?
Is it possible? Oh yeah, some columns are 12 characters apart (right aligned) while others are 15 characters apart.
-1234 56 32452 68584.4 Extra_data
-5356 9 546 12434.5 Extra_data
- 90 12 2345 43522.1 Extra_data
Desired output:
[-1234, -5356, -90]
[56, 9, 12]
[32452, 546, 2345]
etc
The equivalent method in Python you are looking for is str.split(); called without any arguments, it splits the string on whitespace. It also takes care of any trailing newlines/spaces, and, unlike your VB example, you do not need to care about the data width.
Example
with open("data.txt") as fin:
data = map(str.split, fin) #Split each line of data on white-spaces
data = zip(*data) #Transpose the Data
But if you have columns containing whitespace, you need to split the data based on column position:
>>> def split_on_width(data, pos):
        if pos[-1] != len(data):
            pos = pos + (len(data), )
        # Create index pairs: each start with the next start as its end
        indexes = zip(pos, pos[1:])
        # Slice the data using the indexes
        return [data[start:end].strip() for start, end in indexes]
>>> def trynum(n):
        try:
            return int(n)
        except ValueError:
            pass
        try:
            return float(n)
        except ValueError:
            return n
>>> pos
(0, 5, 13, 22, 36)
>>> with open("test.txt") as fin:
        data = (split_on_width(data.strip(), pos) for data in fin)
        data = [[trynum(n) for n in row] for row in zip(*data)]
>>> data
[[-1234, -5356, -90], [56, 9, 12], [32452, 546, 2345], [68584.4, 12434.5, 43522.1], ['Extra_data', 'Extra_data', 'Extra_data']]
Just use str.split() with no arguments; it splits an input string on arbitrary-width whitespace:
>>> ' some_value another_column 123.45 42 \n'.split()
['some_value', 'another_column', '123.45', '42']
Note that any columns containing whitespace would also be split.
If you wanted to have lists of columns, you need to transpose the rows:
with open(filename) as inputfh:
    columns = zip(*(l.split() for l in inputfh))
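With the sample data above, that caveat actually bites: the '- 90' row splits into an extra field, so the transposed columns misalign. A quick sketch to see it (assuming the rows are saved in a hypothetical data.txt):

with open("data.txt") as inputfh:
    for col in zip(*(l.split() for l in inputfh)):
        print(col)
# ('-1234', '-5356', '-')
# ('56', '9', '90')
# ...

This is why the fixed-width slicing shown above is safer for this particular file.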