Delete unusable characters or data in archive with Python [closed]

I have an archive with unusable data that I want to clean up with Python.
First, the lines have the form:
Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*
I want to delete Xac:, Yac:, and Zac:, and also the trailing characters after each number (#, ?, *), to leave only the numbers.
Also, I want to delete some trash lines in the archive like:
!Data Logger Accelerometer] ,
Initializing...
Lines like those are trash to me, and I need to delete them so the archive contains only numbers in three columns. (Those numbers are really accelerometer readings on the x, y, and z axes, mixed with unusable data as shown above.)
How can I achieve this?

You need to parse the data file.
First, skip invalid lines:
if not line.startswith('Xac:'):
    return None
Second, split on the non-number characters:
parts = re.split('[,Xac:YZ#?*]', line)
Third, filter out the empty strings:
parts = filter(None, parts)
Fourth, convert each string to a float:
parts = map(float, parts)
Finally, convert the result to a tuple:
return tuple(parts)
The full example looks like this:
import re

def parse_line(line):
    """line -> (float, float, float), or None if invalid"""
    if not line.startswith('Xac:'):
        return None
    parts = re.split('[,Xac:YZ#?*]', line)
    parts = filter(None, parts)
    parts = map(float, parts)
    return tuple(parts)

output = []
with open('input.txt') as f:
    for line in f:
        axes = parse_line(line.strip())
        if axes:
            output.append(axes)
print(output)
Input file input.txt:
!Data Logger Accelerometer] ,
Initializing...
Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*
OUTPUT:
[(0.01660156, 0.0234375, 1.00683593)]

You can use Python regular expressions:
import re
x = 'Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*'
print(re.findall(r'\d*\.?\d+', x))  # ['0.01660156', '0.02343750', '1.00683593']
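If the goal is to produce a cleaned archive rather than just print the numbers, the same findall idea extends naturally. A minimal sketch (treating any line without exactly three numbers as trash is my assumption about the data):

```python
import re

NUMBER = re.compile(r'\d*\.?\d+')

def clean_line(line):
    """Return the three readings as floats, or None for a trash line."""
    nums = NUMBER.findall(line)
    return [float(n) for n in nums] if len(nums) == 3 else None

# The sample lines from the question.
raw = [
    '!Data Logger Accelerometer] ,',
    'Initializing...',
    'Xac:0.01660156#,Yac:0.02343750?,Zac:1.00683593*',
]
rows = [r for r in (clean_line(l) for l in raw) if r is not None]
print(rows)  # [[0.01660156, 0.0234375, 1.00683593]]
```

From here, writing the rows out as three space-separated columns is a one-line loop.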

Related

Finding a sequence in a line of letters [closed]

I work with mRNA and would like to identify particular letter sequences in a line of mRNA. mRNA is a string of letters that codes for proteins (it can be represented as a .txt or .fasta file). The mRNA string consists of the letters "A", "U", "G" and "C" and can be tens of thousands of letters long. mRNA can also be methylated at adenine ("A") within particular sequences in the cell. In humans, methylation occurs at the sequence "DRACH", where "D" can represent the letters A/G/U, "R" = A/G and "H" = U/A/C, which would give a total of 3x2x3 = 18 potential letter combinations if my math is right. I want to write code in Python that would read my .txt/.fasta file with the mRNA string, scan it for all 18 "DRACH" sequences, list them and highlight them in the sequence.
I created a mock .txt file (C:\rna\RNA_met.txt) containing the string: "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
It has 2 DRACH sequences: AGACU and GAACC.
I haven't done any coding yet, but I suspect that my task can be broken down into subtasks. Subtask 1 would be to make a program 'read' my .txt file. The second task would be to teach the program to recognise the DRACH sequence. The third task would be to make Python show the mRNA string with the DRACH sequences highlighted.
For subtask 1, I typed the following code in Spyder:
file = open('RNA_met.txt', 'r')
f = file.readlines()
print(f)
There were no mistakes in the code. Unfortunately, I did not see my sequence.
I tried changing it to the whole file path:
f = open("C:\\rna\RNA_met.txt", "r")
print(f.read())
but it also didn't help.
Any ideas on how may I fix the first subtask before moving onto the second one?
Thanks!
Maria
Here is a full solution using regex (very useful for working with strings):
import re
from typing import Dict, List

D = "[AGU]"
R = "[AG]"
A = "A"
C = "C"
H = "[UAC]"
RE_DRACH_PATTERN = re.compile(f"{D}{R}{A}{C}{H}")

def find_drach_seq(mrna_seq: str) -> List[Dict]:
    ret = []
    for a_match in re.finditer(RE_DRACH_PATTERN, mrna_seq):
        ret.append(
            {"start": a_match.start(), "end": a_match.end(), "drach": a_match.group()}
        )
    return ret

def find_drach_in_file(in_file_path: str) -> List[Dict]:
    ret = []
    current_line = 0
    with open(in_file_path, "r", encoding="UTF-8") as fr:
        for line in fr:
            current_line += 1
            drach_matches = find_drach_seq(line)
            for a_match in drach_matches:
                a_match["line"] = current_line
                ret.append(a_match)
    return ret

if __name__ == "__main__":
    mrna_seq = "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
    for a_match in find_drach_seq(mrna_seq):
        print(a_match)
    in_file = "m_rna.txt"
    for a_match in find_drach_in_file(in_file_path=in_file):
        print(a_match)
My wife is a pathology doctor and at some point may need to learn about mRNA (I forgot the name of the specialization). It would be great if we could share.
Anyway, your second snippet is missing an escape; it should be C:\\rna\\RNA_met.txt
You can take advantage of Python dictionaries (hash tables) to come up with an efficient solution like the following:
f = open("RNA_met.txt", "r")
seq = f.read()
# In this case, the content of the .txt was "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
combinations = {}
for i in ["A", "G", "U"]:
    for j in ["A", "G"]:
        for k in ["U", "A", "C"]:
            combinations[f"{i}{j}AC{k}"] = ""
for i in range(len(seq) - 4):  # len(seq) - 4, so the final 5-letter window is checked too
    if seq[i:i+5] in combinations:
        print(seq[i:i+5], "Sequence found on:", i)
Output:
AGACU Sequence found on: 15
GAACC Sequence found on: 24
This algorithm stores all possible combinations of "DRACH" sequences in a hash table and traverses the .txt file to find potential matches. When one is found, it prints the match and its position within the long sequence of letters.
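The question's third subtask (highlighting the matches) isn't covered above. A small sketch of one possible convention, wrapping each DRACH hit in square brackets (the bracket style is just an assumption about what "highlight" should mean):

```python
import re

DRACH = re.compile(r"[AGU][AG]AC[UAC]")

def highlight(seq: str) -> str:
    # Wrap every DRACH match in square brackets.
    return DRACH.sub(lambda m: f"[{m.group()}]", seq)

seq = "AACGAUUCGACCGCAAGACUGGGCGAACCAUUCUAA"
print(highlight(seq))
# AACGAUUCGACCGCA[AGACU]GGGC[GAACC]AUUCUAA
```

In a terminal you could swap the brackets for ANSI colour codes, but the idea is the same: re.sub with a function lets you rewrite each match in place.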

Extract values at specific location in string in Python [closed]

Given a data stream, how do I extract the information (between \x15 and \x15\n) that comes right after a *PAR?
Here is the data stream.
'%gra:bla bla bla\n',
'*PAR:\tthe cat wants\n',
'\tcookies . \x159400_14000\x15\n',
'%mor:\tdet:art|the adj|cat\n',
'\tpro:rel|which aux|be&3S part|tip-PRESP coord|and pro:sub|he\n',
'\tv|want-3S n|cookie-PL .\n',
'%gra:\t1|3|DET 2|3|MOD 3|4|SUBJ 4|0|ROOT 5|4|JCT 6|7|DET 7|5|POBJ 8|10|LINK\n',
'\t9|10|AUX 10|7|CMOD 11|13|LINK 12|13|SUBJ 13|10|CJCT 14|13|OBJ 15|16|INF\n',
'\t16|14|XMOD 17|16|JCT 18|19|DET 19|17|POBJ 20|4|PUNCT\n',
'*PAR:\ cookies biscuit\n',
'\tis eating a cookie . \x1514000_18647\x15\n',
My output should be:
"9400_14000"
"14000_18647"
...
Go over the data line by line and look for the desired pattern only on lines that follow *PAR:
import re

[re.search('\x15(.*)\x15\n', line).group(1)
 for i, line in enumerate(data) if '*PAR' in data[i - 1]]
(Note that for the first line, data[i - 1] wraps around to the last element of the list.)
This code will throw an exception if the pattern cannot be matched on a line that follows *PAR. To get all valid matches use:
[match
for i, line in enumerate(data) if '*PAR' in data[i - 1]
for match in re.findall('\x15(.*)\x15\n', line)]
If you expect more than a single pair of \x15 on a line you can use this regex instead to find the shortest match:
'\x15(.*?)\x15\n'
I like to use this function:
import re

def get_between(string: str, start: str, end: str, with_brackets=True):
    # re.escape escapes all characters that need to be escaped
    new_start = re.escape(start)
    new_end = re.escape(end)
    pattern = f"{new_start}.*?{new_end}"
    res = re.findall(pattern, string)
    if with_brackets:
        return res  # with the enclosing delimiters
    else:
        return [x[len(start):-len(end)] for x in res]  # without the delimiters
To use it in your example do this:
result = []
for i, string in enumerate(data_stream):
    if i > 0 and "*PAR" in data_stream[i-1]:
        result += get_between(string, "\x15", "\x15\n", False)
print(result)
I don't know the type of the data stream, so here is a generator:
import re

def generator(data_stream):
    pattern = r"\x15([^\x15]+)\x15\n"
    search_next = False
    for line in data_stream:
        if search_next:
            for out in re.findall(pattern, line):
                yield out
            search_next = False
        if line.find("*PAR") > -1:
            search_next = True
If you need the matches as a list, you can use this:
[x for x in generator(data)]
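For reference, here is the generator idea run end-to-end on a trimmed copy of the sample stream (the inline list stands in for whatever the real data stream is):

```python
import re

def generator(data_stream):
    # Yield every \x15...\x15 value found on a line that follows a *PAR line.
    pattern = r"\x15([^\x15]+)\x15\n"
    search_next = False
    for line in data_stream:
        if search_next:
            for out in re.findall(pattern, line):
                yield out
            search_next = False
        if "*PAR" in line:
            search_next = True

data = [
    '%gra:bla bla bla\n',
    '*PAR:\tthe cat wants\n',
    '\tcookies . \x159400_14000\x15\n',
    '*PAR:\tcookies biscuit\n',
    '\tis eating a cookie . \x1514000_18647\x15\n',
]
print(list(generator(data)))  # ['9400_14000', '14000_18647']
```

Because it is a generator, it also works on a file object or any other lazy stream without loading everything into memory.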
You could use the newer regex module with
(?:\G(?!\A)|\*PAR) # fast forward to *PAR
(?:(?!:\*PAR).)+?\\x15\K # do not overrun another *PAR, fast forward to \x15
.+? # start matching...
(?=\\x15|\z) # ... up to either \x15 or the end
See a demo on regex101.com (and mind the singleline and verbose modifier!).

Removing Lines that: do not contain a comma, or contain more than one comma (in Python) [closed]

I'm a little stuck on this one. I have to remove (delete) lines that do not contain a comma, and remove (delete) lines that have more than one comma. I have to write the script in Python. I have included a sample of the file below:
Anarchism,Taoism
Anarchism,Laozi
Anarchism,Zhuang Zhou
brigand
Anarchism,Diogenes of Sinope
Anarchism,Cynicism philosophy
Anarchism,Zeno of Citium
Anarchism,Stoicism
Thanks!!!!
So you want to keep lines that have exactly one comma:
>>> lines = """Anarchism,Taoism
... Anarchism,Laozi
... Anarchism,Zhuang Zhou
... brigand
... Anarchism,Diogenes of Sinope
... Anarchism,Cynicism philosophy
... Anarchism,Zeno of Citium
... Anarchism,Stoicism""".split("\n")
>>> [x for x in lines if x.count(",") == 1]
['Anarchism,Taoism', 'Anarchism,Laozi', 'Anarchism,Zhuang Zhou', 'Anarchism,Diogenes of Sinope', 'Anarchism,Cynicism philosophy', 'Anarchism,Zeno of Citium', 'Anarchism,Stoicism']
>>>
Read lines from a file and write the lines with exactly one comma to the result file. Say
Anarchism,Taoism
Anarchism,Laozi
Anarchism,Zhuang Zhou
brigand
Anarchism,Diogenes of Sinope
Anarchism,Cynicism philosophy
Anarchism,Zeno of Citium
Anarchism,Stoicism
are stored in the "test.txt" file.
infile = open("c:\\test.txt", "r")
output = open("c:\\result.txt", "w")
for line in infile:
    if line.count(",") == 1:
        print(line)
        output.write(line)
infile.close()
output.close()
import re
p = re.compile(r'((?:[^,^\n]*\,){2,}[^,^\n]+\n)|(?![^,^\n]*\,)(?<=\n)([^,^\n]+\n)|^(?![^,^\n]*\,)([^,^\n]+)|(?![^,^\n]*\,)(?<=\n)([^,^\n]+)$', re.DOTALL)
text = "Anarchism,Taoism \nAnarchism,Laozi \nAnarchism,Zhuang Zhou \nbrigand \nAnarchism,Diogenes of Sinope \nAnarchism,Cynicism philosophy \nAnarchism,Zeno of Citium \nAnarchism,Stoicism"
text = re.sub(p, '', text)
print(text)
Output:
Anarchism,Taoism
Anarchism,Laozi
Anarchism,Zhuang Zhou
Anarchism,Diogenes of Sinope
Anarchism,Cynicism philosophy
Anarchism,Zeno of Citium
Anarchism,Stoicism
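A more modern way to write the same file-to-file filter is a single with statement, so both files are closed automatically even if something fails (the file names are placeholders):

```python
def keep_single_comma(src, dst):
    # Copy across only the lines containing exactly one comma.
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.count(",") == 1:
                fout.write(line)
```

Called as keep_single_comma("test.txt", "result.txt"), it leaves result.txt with the seven Anarchism lines and drops brigand.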

mapping repeating ID's for an email [closed]

I have a file with columns ID and MAIL (20 million rows):
000000#0000.com 0xE618EF6B90AG
000000#0000.com 0xE618EF6B90AF
00000#00000.com 0xE618EFBCC83D
00000#00000.com 0xE618EFBCC83C
#000000000 0xE618F02C223E432CEA
00000#0000.com 0x01010492A
0000#00000.com 0x52107A
# 0xE618F032F829432CE04343307C570906A
00000#0000.com 0xE618F032F829432CEB
000000#000.com 0xE618F032FE7B432CEC
000000#000.com 0xE618F032FE7B432CED
#hotmail.com 0x41970588
# 0x52087617
I need to map the IDs registered to an email, so we can find which IDs have been registered to a given mail. An email may have several IDs registered to it.
Here is the function I made, but it turns out that I need to exclude mostly non-valid emails like #.com, #, etc.
The first version of the script worked almost perfectly, with one little thing: my parser breaks down if the email has a space somewhere between the symbols.
So I added a regexp to check the email value, but I get an error I don't know how to handle:
import re

def duplicates(filename):
    with open(filename, 'r') as f:
        lines = f.readlines()
    query = (line.replace('\n', '') for line in lines)
    split_query = (line.split(' ') for line in query)
    result_mail = {}
    for line in split_query:
        # added if statement to validate email, remove to check w/o
        if re.match(r"[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", line[0]):
            if line[0] not in result_mail:
                result_mail[line[0]] = []
        result_mail[line[0]].append(line[1])
    for mail, ids in result_mail.iteritems():
        if len(ids) > 1:
            with open('MAIL_ids.txt', 'a') as r_mail:
                r_mail.write(str(mail) + '\n')
                r_mail.write(str(ids) + '\n')

if __name__ == '__main__':
    import sys
    filename = sys.argv[1]
    duplicates(filename)
After running the script I get the error KeyError: ''. Why is this happening?
File ".\dup_1.2.py", line 44, in <module>
duplicates(filename)
File ".\dup_1.2.py", line 32, in duplicates
result_mail[line[0]].append(line[1])
KeyError: ''
I also would like to rewrite the part where I add keys and values to the dictionary. I'd like to use a defaultdict(), something like:
result_mail = defaultdict(list)
for line in lines:
    if line[0] not in result_mail:
        result_mail[line[0]].append(line[1])
It seems you just put the line result_mail[line[0]].append(line[1]) at the wrong level of indentation, so it is executed even when the if re.match condition does not apply.
Also, you might want to use collections.defaultdict to get rid of that if line[0] not in result_mail check.
result_mail = collections.defaultdict(list)
for (id_, mail) in split_query:
    if re.match(r"[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+", id_):
        result_mail[id_].append(mail)
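Putting both fixes together, a sketch of the whole pass. MAIL_ids.txt comes from the question; opening the output once instead of inside the loop, and requiring exactly two columns, are my changes, and the pattern keeps the question's # where an @ would normally appear:

```python
import collections
import re

EMAIL = re.compile(r"[a-zA-Z0-9.-]+#[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+")

def duplicates(filename, out_filename='MAIL_ids.txt'):
    result_mail = collections.defaultdict(list)
    with open(filename) as f:
        for line in f:
            parts = line.split()
            # Skip malformed rows and invalid addresses up front.
            if len(parts) == 2 and EMAIL.match(parts[0]):
                result_mail[parts[0]].append(parts[1])
    # Write only the addresses with more than one registered ID.
    with open(out_filename, 'w') as r_mail:
        for mail, ids in result_mail.items():
            if len(ids) > 1:
                r_mail.write(mail + '\n')
                r_mail.write(str(ids) + '\n')
```

Streaming the input line by line also avoids holding all 20 million rows in memory as a list of strings.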

parsing repeated lines of string based on initial characters [closed]

I am working on lists and strings in python. I have following lines of string.
ID abcd
AC efg
RF hij
ID klmno
AC p
RF q
I want the output as :
abcd, efg, hij
klmno, p, q
This output is based on the first two characters in the line. How can I achieve it in an efficient way?
I'm looking to output the second part of the line for every entry between the ID tags.
I'm having a little trouble parsing the question, but according to my best guess, this should do what you're looking for:
all_data = " ".join(line.strip() for line in file).split("ID")
return [", ".join(chunk.split()[::2]) for chunk in all_data if chunk.strip()]
Basically what you're doing here is first joining together all of your data (removing the newlines), then splitting on your keyphrase of "ID".
After that, if I'm correctly interpreting the question, you're looking to get the second value of each pair. These pairs are space-delimited (as is everything in each chunk, due to the " ".join in the first line), so we just step through each chunk grabbing every other item.
In general, slices have a little more syntactic sugar than is usually used; the full syntax is [start:end:step], so [::2] just returns every other item.
You could use the following, which takes order into account so that transposing the dict's values makes more sense...
from collections import OrderedDict

items = OrderedDict()
with open('/home/jon/sample_data.txt') as fin:
    lines = (line.strip().partition(' ')[::2] for line in fin)
    for key, value in lines:
        items.setdefault(key[0], []).append(value)

res = [', '.join(el) for el in zip(*items.values())]
# ['abcd, efg, hij', 'klmno, p, q']
Use a defaultdict:
from collections import defaultdict

result = defaultdict(list)
for line in lines:
    split_line = line.split(' ')
    result[split_line[0]].append(split_line[1])
This will give you a dictionary result that stores all the values that have the same key in a list. To get all the strings that were in a line that started with e.g. ID:
print(result["ID"])
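Run against the sample lines, that looks like this. The transpose at the end regroups the values back into per-record rows, relying on dictionaries preserving insertion order (guaranteed from Python 3.7 on):

```python
from collections import defaultdict

lines = ["ID abcd", "AC efg", "RF hij", "ID klmno", "AC p", "RF q"]

result = defaultdict(list)
for line in lines:
    key, value = line.split(" ", 1)  # maxsplit=1 keeps multi-word values intact
    result[key].append(value)

print(result["ID"])  # ['abcd', 'klmno']

# Transposing the per-key lists recovers one row per record.
rows = [", ".join(row) for row in zip(*result.values())]
print(rows)  # ['abcd, efg, hij', 'klmno, p, q']
```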
Based on your answers in comments, this should work (if I understand what you're looking for):
data = None
for line in lines:
    fields = line.split(None, 1)
    if fields[0] == "ID":
        # New set of data
        if data is not None:
            # Output the last set of data.
            print(", ".join(data))
        data = []
    data.append(fields[1])
if data is not None:
    # Output the final data set
    print(", ".join(data))
It's pretty straightforward: you're just collecting the second field of each line into data until you see the start of the next data set, at which point you output the previous data set.
I think using itertools.groupby is best for this kind of parsing (do something until next token X):
import itertools

class GroupbyHelper(object):
    def __init__(self):
        self.state = None

    def __call__(self, row):
        if self.state is None:
            self.state = True
        else:
            if row[0] == 'ID':
                self.state = not self.state
        return self.state

# assuming you read data from 'stream'
for _, data in itertools.groupby((line.split() for line in stream), GroupbyHelper()):
    print(','.join(c[1] for c in data))
output:
$ python groupby.py
abcd,efg,hij
klmno,p,q
It looks like you would like to sub-group your data whenever 'ID' is present as your key. A groupby solution could work wonders here, if you know how to group your data. Here is one such implementation that might work for you:
>>> data = [e.split() for e in data.splitlines()]
>>> def new_key(key):
...     toggle = [0, 1]
...     def helper(e):
...         if e[0] == key:
...             toggle[:] = toggle[::-1]
...         return toggle[0]
...     return helper
...
>>> from itertools import groupby
>>> for k, v in groupby(data, key=new_key('ID')):
...     for e in v:
...         print(e[-1], end=' ')
...     print()
...
abcd efg hij
klmno p q
If lines is equal to
['ID abcd', 'AC efg', 'RF hij']
then
[line.split()[1] for line in lines]
Edit: Added everything below after down votes
I am not sure why this was down voted. I thought that code was the simplest way to get started with the information provided at the time. Perhaps this is a better explanation of what I thought/think the data was/is?
If the input is a list of strings in a repeated sequence, called alllines:
alllines = [ #a list of repeated lines of string based on initial characters
'ID abcd',
'AC efg',
'RF hij',
'ID klmno',
'AC p',
'RF q'
]
then the code is:
[[line.split()[1] for line in lines] for lines in
 [[alllines.pop(0) for i in range(3)] for o in range(len(alllines) // 3)]]
This basically says, create a sublist of three split [1] strings from the whole list of all strings for every three strings in the whole list.
and the output is:
[[
'abcd', 'efg', 'hij'
], [
'klmno', 'p', 'q'
]]
Edit: 8-6-13 This is an even better one, without pop():
list(zip(*[iter([line.split()[1] for line in alllines])]*3))
with a slightly different output
[(
'abcd', 'efg', 'hij'
), (
'klmno', 'p', 'q'
)]
