Line split per specific occurrence of delimiters - python

I have a txt file which I need to split according to a specific pattern of delimiters.
For example:
The first split should be on " " (or chr(32)).
The second split in the same line should again be on " " (chr(32) a second time), and so on.
Below is an example of the line that I want to split:
'2018-12-14 23:54:53,105 WARN system.equipment - Timed AC is: 110.375\n'
I found the pattern and, according to it, I want to split the line, building the
delimiters from ASCII codes in an array. I tried to iterate for splitting, but without
success. Thanks to everybody for your help and time!
delim_array = []
delim_array = [chr(32), chr(32), [chr(32)+chr(32)], [chr(32)+chr(45)+chr(32)]]
for j in delim_array:
    part = re.split(j, datafile[1])  # datafile is my list to split
    print(part)
I would like to split the line into parts, with the delimiters between the parts taken in order from delim_array:
1) 2018-12-14
2) 23:54:53,105
3) WARN
4) system.equipment
5) Timed AC is: 110.375
But I am getting a list which is split only by the first delimiter in the array.

You were close. This example will do what you need.
import re

dd = '2018-12-14 23:54:53,105 WARN system.equipment - Timed AC is: 110.375\n'
delim_array = [chr(32), chr(32), chr(32)+chr(32), chr(32)+chr(45)+chr(32)]
part = []
for j in delim_array:
    # split off one field; keep the remainder for the next delimiter
    ap, dd = re.split(j, dd, maxsplit=1)
    part.append(ap)
part.append(dd.strip())
print(part)
This prints: ['2018-12-14', '23:54:53,105', 'WARN', 'system.equipment', 'Timed AC is: 110.375']
The key part here is to use re.split() with maxsplit equal to 1 (see the docs), so that on each iteration you split your line in two parts. The first part is what you want to append to the part list; the second part you keep for further splitting, so reassign it to the same variable that holds the string (dd in my example). Remember, after the loop, to append the last dd obtained, or it will be lost (I also strip the newline there).
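A quick illustration of the maxsplit behavior, using the sample line from the question:

import re

dd = '2018-12-14 23:54:53,105 WARN system.equipment - Timed AC is: 110.375\n'
# maxsplit=1 cuts off only the first field; the rest of the line stays intact
print(re.split(' ', dd, maxsplit=1))
# ['2018-12-14', '23:54:53,105 WARN system.equipment - Timed AC is: 110.375\n']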
If you have a datafile, you need a nested loop to do this. Of course, be sure that all the lines follow the same format you posted; otherwise you may get unexpected results.
import re

delim_array = [chr(32), chr(32), chr(32)+chr(32), chr(32)+chr(45)+chr(32)]
with open("your_file_name.txt") as datafile:
    for dd in datafile:
        part = []
        for j in delim_array:
            ap, dd = re.split(j, dd, maxsplit=1)
            part.append(ap)
        part.append(dd.strip())
        print(part)

Related

Print a list of unique words from a text file after removing punctuation, and find longest word

Goal is to a) print a list of unique words from a text file and also b) find the longest word.
I cannot use imports in this challenge.
The file handling and main functionality work as I want; however, the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation, and therefore maxLength is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique)
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
reader.close()
maxLength = len(max(regex, key=len))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate() to throw away all bad characters (+ newline) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get the list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
Notes:
- since the input is already 100% cleaned and split, there's no need to append to a list/insert to a set as we go, then make another cleaning pass later. We can just create unique_words directly (using set() to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only need to call sort once, and there is no iterative append to lists.
- hence we guarantee that max_length = len(unique_words[0])
- this approach is also going to be more performant than nested loops (for line in <lines>: for word in line.split(): ...) with an iterative append() to a wordlist
- no need to do an explicit writer/reader .open()/.close(); that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
- you could also merge the printing of the max_length words into the writer loop, but it's cleaner code to keep them separate.
- note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
- in Python we use lower_case_with_underscores for variable names, hence max_length, not maxLength. See PEP8.
- in fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines.)
- str.maketrans() is a builtin, but if your teacher objects to the module reference, you can also call it on a bound string, e.g. ' '.maketrans()
- str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; regex on Unicode is easier, since you can define entire character classes.
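For comparison, here is a hedged sketch of the regex equivalent (which the original challenge rules out, since it needs an import): one re.sub() call with a character class does the same cleaning as the translation table:

import re

# [\[\],.\n] matches any of the bad characters, replacing each with a space
cleaned_input = re.sub(r'[\[\],.\n]', ' ', open('doc.txt').read())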
Alternative solution if you don't yet know str.translate()
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, or got an IOError, or unique_words wasn't a list, etc.:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is another solution without any functions.

bad = '`~##$%^&*()-_=+[]{}\\|;\':".>?<,/?'
clean = ' '
# a is the text read from the input file
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '
cleans = [i for i in clean.split(' ') if len(i)]
clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)
print(clean_uniq)
print(len(clean_uniq[-1]))
Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.

with open("unique.txt", "w") as writer:
    with open("doc.txt") as reader:
        cleaned_words = []
        for line in reader.readlines():
            for word in line.split():
                cleaned_word = ''.join([c for c in word if c.isalpha()])
                if len(cleaned_word):
                    cleaned_words.append(cleaned_word)
        # print unique words
        unique_words = set(cleaned_words)
        print(unique_words)
        # write words to file? depends what you need here
        for word in unique_words:
            writer.write(str(word))
            writer.write('\n')
        # print length of longest
        print(len(sorted(unique_words, key=len, reverse=True)[0]))

Extract time values from a list and add to a new list or array

I have a script that reads through a log file that contains hundreds of these logs and looks for the ones that have an "On", "Off", or "Switch" type. Then I output each log into its own list. I'm trying to find a way to extract the Out and In times into a separate list/array, and then subtract the two times to find the duration of each separate log. This is what the outputted logs look like:
['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']
This is my current code:
logfile = '/path/to/my/logfile'
with open(logfile, 'r') as f:
    text = f.read()
words = ["On", "Off", "Switch"]
text2 = text.split('\n')
for l in text.split('\n'):
    if (words[0] in l or words[1] in l or words[2] in l):
        log = l.split(',')[0:3]
I'm stuck on how to target only the Out and In time values from the logs, put them in an array, and convert them to time values to find the duration.
Initial log before the script: everything after the "In" time is useless for what I'm looking for, so I only keep the first three indices:
2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a","Type":"Switch,"In":"2020-01-31T00:30:20.140Z","Path":"interface","message":"interface changed status from unknown to normal","severity":"INFORMATIONAL","display":true,"json_map":"{\"severity\":null,\"eventId\":\"65e-64d9-45-ab62-8ef98ac5e60d\",\"componentPath\":\"interface_css\",\"displayToGui\":false,\"originalState\":\"unknown\",\"closed\":false,\"eventType\":\"InterfaceStateChange\",\"time\":\"2019-04-18T07:04:32.747Z\",\"json_map\":null,\"message\":\"interface_css changed status from unknown to normal\",\"newState\":\"normal\",\"info\":\"Event created with current status\"}","closed":false,"info":"Event created with current status","originalState":"unknown","newState":"normal"}
Below is a possible solution. The wordmatch line is a bit of a hack, until I find something clearer: it's just a one-liner that creates an empty set, or a one-element set of True, depending on whether one of the words matches.
(Untested)
import re

logfile = '/path/to/my/logfile'
words = ["On", "Off", "Switch"]
dateformat = r'\d{4}\-\d{2}\-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[Zz]?'
pattern = fr'Out:\s*\[(?P<out>{dateformat})\].*In":\s*\"(?P<in>{dateformat})\"'
regex = re.compile(pattern)
with open(logfile, 'r') as f:
    for line in f:
        wordmatch = set(filter(None, (word in line for word in words)))
        if wordmatch:
            match = regex.search(line)
            if match:
                intime = match.group('in')
                outtime = match.group('out')
                # whatever you need to store these strings,
                # e.g., append to a list or insert in a dict
As noted, your log example is very awkward, so this works for the example line but may not work for every line. Adjust as necessary.
I have also not included (if so wanted) a conversion to a datetime.datetime object. For that, read through the datetime module documentation, in particular datetime.strptime. (Alternatively, you may want to store your results in a Pandas table. In that case, read through the Pandas documentation on how to convert strings to actual datetime objects.)
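For instance, here is a minimal sketch of that conversion, assuming the timestamps keep the exact format from the sample (ISO 8601 with milliseconds and a trailing Z):

from datetime import datetime

fmt = '%Y-%m-%dT%H:%M:%S.%fZ'
outtime = '2020-01-31T00:30:20.150Z'
intime = '2020-01-31T00:30:20.140Z'
# subtracting two datetimes yields a timedelta; here 10 milliseconds
duration = datetime.strptime(outtime, fmt) - datetime.strptime(intime, fmt)
print(duration.total_seconds())  # 0.01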
You also don't need to read and split on newlines yourself: for line in f will do that for you (provided f is indeed a filehandle).
Regex is probably the way to go (speed, efficiency, etc.) ... but ...
You could take a very simplistic (if very inefficient) approach of cleaning your data:
- join all of it into a string
- replace things that hinder easy parsing
- split wisely and filter the split
like so:
data = ['2020-01-31T12:04:57.976Z 1234 Out: [2020-01-31T00:30:20.150Z] Id: {"Id":"4-f-4-9-6a"', '"Type":"Switch"', '"In":"2020-01-31T00:30:20.140Z"']

all_text = " ".join(data)

# this is inefficient and will create throwaway intermediate strings - if you are
# in a hurry or operate on 100s of MB of data, this is NOT the way to go, unless
# you have time

# iterate pairs of ("bad thing", "what to replace it with") (or list of bad things)
for thing in [(": ", ":"), (list('[]{}"'), "")]:
    whatt = thing[0]
    withh = thing[1]
    # if a list, do so for each bad thing
    if isinstance(whatt, list):
        for p in whatt:
            # replace it
            all_text = all_text.replace(p, withh)
    else:
        all_text = all_text.replace(whatt, withh)

# the format is now far better suited to splitting/filtering
cleaned = [a for a in all_text.split(" ")
           if any(a.startswith(prefix) or "Switch" in a
                  for prefix in {"In:", "Switch:", "Out:"})]
print(cleaned)
Outputs:
['Out:2020-01-31T00:30:20.150Z', 'Type:Switch', 'In:2020-01-31T00:30:20.140Z']
After cleaning your data would look like:
2020-01-31T12:04:57.976Z 1234 Out:2020-01-31T00:30:20.150Z Id:Id:4-f-4-9-6a Type:Switch In:2020-01-31T00:30:20.140Z
You can transform the clean list into a dictionary for ease of lookup:
d = dict(part.split(":", 1) for part in cleaned)
print(d)
will produce:
{'In': '2020-01-31T00:30:20.140Z',
'Type': 'Switch',
'Out': '2020-01-31T00:30:20.150Z'}
You can use the datetime module to parse the times from your values, as shown in 0 0's post.

Python Re-ordering the lines in a dat file by string

Sorry if this is a repeat, but I can't find it for now.
Basically, I am opening and reading a dat file which contains a load of paths that I need to loop through to get certain information.
Each of the lines in the base.dat file contains m.somenumber. For example, some lines in the file might be:
Volumes/hard_disc/u14_cut//u14m12.40_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m12.50_all.beta/beta8
Volumes/hard_disc/u14_cut/u14m11.40_all.beta/beta8
I need to be able to re-write the dat file so that all the lines are re-ordered from the largest m-number to the smallest m-number. Then, when I loop through PATH in database (shown in the code), I am looping through in decreasing m.
Here is the relevant part of the code
base = open('base8.dat', 'r')
database = base.read().splitlines()
base.close()
counter = 0
mu_list = np.array([])
delta_list = np.array([])
ofsset = 0.00136
beta = 0
for PATH in database:
    if os.path.exists(str(PATH)+'/CHI/optimal_spectral_function_CHI.dat'):
        n1_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.n.dat')
        n7_array = numpy.loadtxt(str(PATH)+'/AVERAGES/av-err.npx.dat')
        n1_mean = n1_array[0]
        delta = round(float(5.0+ofsset-(n1_array[0]*2.+4.*n7_array[0])), 6)
        par = open(str(PATH)+"/params10", "r")
        for line in par:
            counter = counter+1
            if re.match("mu", line):
                mioMU = re.findall('\d+', line.translate(None, ';'))
                mioMU2 = line.split()[2][:-1]
                mu = mioMU2
        print mu, delta, PATH
        mu_list = np.append(mu_list, mu)
        delta_list = np.append(delta_list, delta)
        optimal_counter = 0
print delta_list, mu_list
I have checked the possible flagged repeat, but I can't seem to get it to work for mine, because my file doesn't technically contain strings and numbers. The 'number' I need to sort by is contained in the string as a whole:
Volumes/data_disc/u14_cut/from_met/u14m11.40_all.beta/beta16
and I need to sort the entire line by just the m(somenumber) part.
Assuming that the number part of your line has the form of a float, you can use a regular expression to match that part and convert it from a string to a float.
After that, you can use this information to sort all the lines read from your file. I added an invalid line in order to show how invalid data is handled.
As a quick example I would suggest something like this:
import re

# TODO: Read file and get list of lines
l = ['Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8',
     'Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8']

regex = r'^.+\*{2}m{1}(?P<criterion>[0-9\.]*)\*{2}.+$'
p = re.compile(regex)

criterion_list = []
for s in l:
    m = p.match(s)
    if m:
        crit = m.group('criterion')
        try:
            crit = float(crit)
        except Exception as e:
            crit = 0
    else:
        crit = 0
    criterion_list.append(crit)

tuples_list = list(zip(criterion_list, l))
output = [element[1] for element in sorted(tuples_list, key=lambda t: t[0])]
print(output)
# TODO: Write output to new file or overwrite existing one.
# TODO: Write output to new file or overwrite existing one.
Giving:
['Volumes/hard_disc/u14_cut/u14**mm11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m11.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.40**_all.beta/beta8', 'Volumes/hard_disc/u14_cut/u14**m12.50**_all.beta/beta8']
This snippet starts after all lines are read from the file and stored in a list (called l here). The regex group criterion catches the float part contained in **m12.50**, as you can see on regex101. So iterating through all the lines gives you a new list containing all matching groups as floats. If the regex does not match a given string, or casting the group to a float fails, crit is set to zero in order to have those invalid lines at the very beginning of the sorted list later.
After that, zip() is used to get a list of tuples containing the extracted floats and the corresponding strings. Now you can sort this list of tuples based on each tuple's first element and write the corresponding string to a new list, output.
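A more compact variant of the same idea (my sketch, not part of the original answer) reuses p and l from the snippet above and lets sorted() call the extraction function directly as a key:

def m_value(s):
    # extract the float criterion; fall back to 0.0 so invalid lines sort first
    m = p.match(s)
    try:
        return float(m.group('criterion')) if m else 0.0
    except ValueError:
        return 0.0

output = sorted(l, key=m_value)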

Trying to split a txt file into multiple variables

So I'm making a program where it reads a text file, and I need to separate all the info into its own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bits that look like "A.41,52" are numbered positions in the sequence that I need to save for later use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence; I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the header by a colon, you could first pass ":" as your separator. You could then split the remainder of the header to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
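For illustration, a minimal sketch of that two-step split, using the header line from the question (after stripping the leading >):

header = '1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
title, rest = header.split(':')
positions = [p.strip() for p in rest.split(';')]
# title == '1EK9'
# positions == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']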
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read()
title, components = header_line.lstrip('>').split(':')
pairs = components.split(';')
for pair in pairs:
    pair = pair.strip()  # e.g. 'A.41,52'
    start, end = pair[2:].split(',')
    sequence_pairs[pair[:1]] = sequence[int(start):int(end)+1]
for name, data in sequence_pairs.items():
    print('{} - {}'.format(name, data))
The other answer may be very good at tackling the assumed problem in its entirety, but since the OP asked for pointers or an example of the typical split-unsplit transform, which is often so successful, I hereby provide some ideas and working code to show this (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build list

# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely deployed paradigm in Python programming (and programming with powerful datatypes in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or try to search SO for excellent answers to related questions.
Also note that there is no need to line.strip('\n'), as '\n' is considered whitespace just like ' ' (a string with a space only) or a tabulator '\t'; sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Upate:
As requested a further analysis of the "coordinate part" in the line called headline of the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above) the complete line in the headline string:
title, coordinates = headline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you can always reinforce what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/env python
from __future__ import print_function

het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write an explicit nested for loop, but it is a late European evening, so the trick is to read it from the right:
1. for all elements of het_seq
2. split on the dot and store the left part in a and the right part in bc
3. then further split bc into a sequence of k's, convert each to an integer, and put them into a tuple (an ordered pair of integer coordinates)
4. arrived at the left, build a tuple of a (the dimension, like 'A') and the coordinate tuple from step 3
In the end, call the dict() function, which constructs a dictionary from an iterable of (key_1, value_1), (key_2, value_2), ... pairs, giving {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
- you will keep those two coordinates, not extend or append to that pair
- in Python, mapping and remapping is often performed, and a hashable (that is, immutable) type is ready to become a key in a dict
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the slightly obnoxious "one-liner" above (which had already been written on three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and so the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the A.41,52; B... positions in a list for easy access. The code is as follows.
import re

def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
I am not sure I completely understand the goal (and I think this post is more suitable as a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual slicing syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you may try to use it to extract the first line, similar to the above where a space is used to extract "words".
UPDATE (starting from result):
# getting the A indexes
letter_seq = result[5:]
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter seq
long_letter_seq = reduce(lambda x, y: x+y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line is just a union of several operations that were also used earlier.
The same goes for B, C, D, etc., so there is a lot of manual work and calculation...
BE CAREFUL with the indexes of A: numbering in Python starts from 0, which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find the pattern using a mask, but this requires very well-defined rules for how to look up the sequence needed.
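For example, here is a hedged sketch of such a mask for the header shown above, assuming the positions always look like a capital letter, a dot, and two integers (this is my illustration, not part of the original answer):

import re
header = '>1EK9:A.41,52; B.61,74; C.247,257; D.279,289'
# one (letter, start, end) tuple per position entry
coord_map = {letter: (int(a), int(b))
             for letter, a, b in re.findall(r'([A-Z])\.(\d+),(\d+)', header)}
# {'A': (41, 52), 'B': (61, 74), 'C': (247, 257), 'D': (279, 289)}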
UPDATE 2: it is also not clear to me what the role of the spaces is; so far I removed them, but they may matter when counting the letters in the original string.

Making multiple search and replace more precise in Python for lemmatizer

I am trying to make my own lemmatizer for Spanish in Python2.7 using a lemmatization dictionary.
I would like to replace all of the words in a certain text with their lemma form. This is the code that I have been working on so far.
def replace_all(text, dic):
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()
lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        depurated_line = line.rstrip()
        (val, key) = depurated_line.split("\t")
        lemmatize_word_dict[key] = val
txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
Here is an example dictionary file which contains the lemmatized forms used to replace the words in the input (my_text_lower). The example dictionary is a tab-separated two-column file in which column 1 represents the values (lemmas) and column 2 represents the keys to match.
ExampleDictionary
flojo floja
flojo flojas
flojo flojos
cargamento cargamentos
cargante cargantes
decepción decepciones
decepcionante decepcionantes
decentar decenté
decentar decentéis
decentar decentemos
decentar decentó
My desired output is as follows:
flojo y cargante. decepcionante. decentar decentar
Using these inputs (and the example phrase listed in my_text within the code), my actual output currently is:
felitrojo y cargramarramarrartserargramarramarrunirdo. decepáginacionarrtícolitroargramarramarrunirdo. decentar decentar
Currently, I can't seem to understand what is going wrong with the code.
It seems that it is replacing letters or chunks of each word, instead of recognizing the whole word, finding it in the lemma dictionary, and then replacing that.
For instance, this is the result that I am getting when I use the entire dictionary (more than 50,000 entries). This problem does not happen with my small example dictionary, only when I use the complete dictionary, which makes me think that perhaps it is double "replacing" at some point?
Is there a pythonic technique that I am missing and can incorporate into this code to make my search-and-replace function more precise, so that it identifies full words for replacement rather than chunks and/or does NOT make any double replacements?
Because you use text.replace, there's a chance that you'll still be matching a sub-string, and the text will get processed again. It's better to process one input word at a time and build the output string word by word.
I've switched your key-value pairs the other way around (because you want to look up the right-hand column and find the word on the left), and I mainly changed replace_all:
import re

def replace_all(text, dic):
    result = ""
    input = re.findall(r"[\w']+|[.,!?;]", text)
    for word in input:
        changed = dic.get(word, word)
        result = result + " " + changed
    return result

my_text = 'Flojo y cargantes. Decepcionantes. Decenté decentó'
my_text_lower = my_text.lower()
lemmatize_list = 'ExampleDictionary'
lemmatize_word_dict = {}
with open(lemmatize_list) as f:
    for line in f:
        kv = line.split()
        lemmatize_word_dict[kv[1]] = kv[0]
txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
I see two problems with your code:
- it will also replace words if they appear as part of a bigger word
- by replacing words one after the other, you could replace (parts of) words that have already been replaced
Instead of that loop, I suggest using re.sub with word boundaries \b to make sure that you replace complete words only. This way, you can also pass a callable as the replacement function.
import re

def replace_all(text, dic):
    return re.sub(r"\b\w+\b", lambda m: dic.get(m.group(), m.group()), text)
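Called with the inputs from the question, this should give the desired output. One hedged caveat: in Python 2.7 (which the question uses), \w on byte strings only matches ASCII, so accented words like decenté are only matched if the text is decoded to unicode and the pattern is compiled with re.UNICODE.

txt = replace_all(my_text_lower, lemmatize_word_dict)
print txt
# flojo y cargante. decepcionante. decentar decentar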
