Trying to split a txt file into multiple variables - python

So I'm making a program where it reads a text file and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
if line.startswith(">"):
headerline=line.strip("\n")[1:]
else:
nucseq+=line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)

I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!

I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
header_line = next(f)
sequence = f.read()
title,components = header_line.split(':')
pairs = components.split(';')
for pair in pairs:
start,end = pair[2:-1].split(',')
sequence_pars[pair[:1]] = sequence[start:int(end)+1]
for sequence,data in sequence_pairs.iteritems():
print('{} - {}'.format(sequence, data))

As the other answer may be very good to tackle the assumed problem in it's entirety - but the OP has requested for pointers or an example of the tpyical split-unsplit transform which is often so successful I hereby provide some ideas and working code to show this (based on the example of the question).
So let us focus on the else branch below:
from __future__ import print_function
nuc_seq = [] # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
for line in f.readlines():
s_line = line.strip() # this strips whitespace
if line.startswith(title_token):
headerline = line.strip("\n")[1:]
else:
nuc_seq.append(s_line) # build list
# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
# 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
# ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is fast and widely deployed paradigm in Python programming (and programming with powerful datatypes in general).
If the split-unsplit ( a.k.a. join) method is still unclear, just ask or try to sear SO on excellent answers to related questions.
Also note, that there is no need to line.strip('\n') as \nis considered whitespace like ' ' (string with a space only) or a tabulator '\t', sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Upate:
As requested a further analysis of the "coordinate part" in the line called headline of the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above the complete line in headline string):
title, coordinate_string = headline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semi colons, trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'a', 'B', 'C', and 'D' are well known dimensions, than you can "lose" the ordering info from input file (as you could always reinforce what you already know ;-) and might map the coordinats as key: (ordered coordinate-pair):
>>> coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/enc python
from __future__ import print_function
het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
(a, tuple(int(k) for k in bc.split(',')))
for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this explicit a nested for loop but it is a late european evening so trick is to read it from right:
for all elements of het_seq
split on the dot and store left in a and right in b
than further split the bc into a sequence of k's, convert to integer and put into tuple (ordered pair of integer coordinates)
arrived on the left you build a tuple of the a ("The dimension like 'A' and the coordinate tuple from 3.
In the end call the dict() function that constructs a dictionary using here the form dict(key_1, value_1, hey_2, value_2, ...) which gives {key_1: value1, ...}
So all coordinates are integers, stored ordered pairs as tuples.
I'ld prefer tuples here, although split() generates lists, because
You will keep those two coordinates not extend or append that pair
In python mapping and remapping is often performed and there a hashable (that is immutable type) is ready to become a key in a dict.
One last variant (with no knoted comprehensions):
coord_map = {}
for abc in het_seq:
a, bc = abc.split('.')
coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same as above minor obnoxious "one liner" (that already had been written on three lines kept together within parentheses).
HTH.

So I'm assuming you are trying to process a Fasta like file and so the way I would do it is to first get the header and separate the pieces with Regex. Following that you can store the A:42.52 B... in a list for easy access. The code is as follows.
import re
def processHeader(line):
positions = re.search(r':(.*)', line).group(1)
positions = positions.split('; ')
return positions
dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
for line in infile:
if '>' in line:
positions = processHeader(line)
else:
dnaSeq += line.strip()

I am not sure I completely understand the goal (and I think this post is more suitable for a comment, but I do not have enough privileges) but I think that the key to you solution is using .split(). You can then join the elements of the resulting list just by using + similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way I would not remove '\n' to start with as you may try to use it to extract the first line similar to the above with using space to extract "words".
UPDATE (starting from result):
#getting A indexes
letter_seq=result[5:]
ind=result[:4]
Aind=ind[0].split('.')[1].replace(';', '')
#getting one long letter seq
long_letter_seq=reduce(lambda x, y: x+y, letter_seq)
#extracting the final seq fromlong_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
the last line is just a union of several operations that were also used earlier.
Same for B C D etc -- so a lot of manual work and calculations...
BE CAREFUL with indexes of A -- numbering in python starts from 0 which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find pettern using a mask, but this requires very well defined rules for how to look up the sequence needed.
UPDATE2: it is also not clear to me what is the role of spaces -- so far I removed them but they may matter when counting the letters in the original string.

Related

Frequency of keywords in a list

Hi so i have 2 text files I have to read the first text file count the frequency of each word and remove duplicates and create a list of list with the word and its count in the file.
My second text file contains keywords I need to count the frequency of these keywords in the first text file and return the result without using any imports, dict, or zips.
I am stuck on how to go about this second part I have the file open and removed punctuation etc but I have no clue how to find the frequency
I played around with the idea of .find() but no luck as of yet.
Any suggestions would be appreciated this is my code at the moment seems to find the frequency of the keyword in the keyword file but not in the first text file
def calculateFrequenciesTest(aString):
listKeywords= aString
listSize = len(listKeywords)
keywordCountList = []
while listSize > 0:
targetWord = listKeywords [0]
count =0
for i in range(0,listSize):
if targetWord == listKeywords [i]:
count = count +1
wordAndCount = []
wordAndCount.append(targetWord)
wordAndCount.append(count)
keywordCountList.append(wordAndCount)
for i in range (0,count):
listKeywords.remove(targetWord)
listSize = len(listKeywords)
sortedFrequencyList = readKeywords(keywordCountList)
return keywordCountList;
EDIT- Currently toying around with the idea of reopening my first file again but this time without turning it into a list? I think my errors are somehow coming from it counting the frequency of my list of list. These are the types of results I am getting
[[['the', 66], 1], [['of', 32], 1], [['and', 27], 1], [['a', 23], 1], [['i', 23], 1]]
You can try something like:
I am taking a list of words as an example.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = {}
for word in word_list:
if word not in frequency_list:
frequency_list[word] = 1
else:
frequency_list[word] += 1
print(frequency_list)
RESULT: {'test': 1, 'world': 1, 'hello': 2}
Since, you have put a constraint on dicts, I have made use of two lists to do the same task. I am not sure how efficient it is, but it serves the purpose.
word_list = ['hello', 'world', 'test', 'hello']
frequency_list = []
frequency_word = []
for word in word_list:
if word not in frequency_word:
frequency_word.append(word)
frequency_list.append(1)
else:
ind = frequency_word.index(word)
frequency_list[ind] += 1
print(frequency_word)
print(frequency_list)
RESULT : ['hello', 'world', 'test']
[2, 1, 1]
You can change it to how you like or re-factor it as you wish
I agree with #bereal that you should use Counter for this. I see that you have said that you don't want "imports, dict, or zips", so feel free to disregard this answer. Yet, one of the major advantages of Python is its great standard library, and every time you have list available, you'll also have dict, collections.Counter and re.
From your code I'm getting the impression that you want to use the same style that you would have used with C or Java. I suggest trying to be a little more pythonic. Code written this way may look unfamiliar, and can take time getting used to. Yet, you'll learn way more.
Claryfying what you're trying to achieve would help. Are you learning Python? Are you solving this specific problem? Why can't you use any imports, dict, or zips?
So here's a proposal utilizing built in functionality (no third party) for what it's worth (tested with Python 2):
#!/usr/bin/python
import re # String matching
import collections # collections.Counter basically solves your problem
def loadwords(s):
"""Find the words in a long string.
Words are separated by whitespace. Typical signs are ignored.
"""
return (s
.replace(".", " ")
.replace(",", " ")
.replace("!", " ")
.replace("?", " ")
.lower()).split()
def loadwords_re(s):
"""Find the words in a long string.
Words are separated by whitespace. Only characters and ' are allowed in strings.
"""
return (re.sub(r"[^a-z']", " ", s.lower())
.split())
# You may want to read this from a file instead
sourcefile_words = loadwords_re("""this is a sentence. This is another sentence.
Let's write many sentences here.
Here comes another sentence.
And another one.
In English, we use plenty of "a" and "the". A whole lot, actually.
""")
# Sets are really fast for answering the question: "is this element in the set?"
# You may want to read this from a file instead
keywords = set(loadwords_re("""
of and a i the
"""))
# Count for every word in sourcefile_words, ignoring your keywords
wordcount_all = collections.Counter(sourcefile_words)
# Lookup word counts like this (Counter is a dictionary)
count_this = wordcount_all["this"] # returns 2
count_a = wordcount_all["a"] # returns 1
# Only look for words in the keywords-set
wordcount_keywords = collections.Counter(word
for word in sourcefile_words
if word in keywords)
count_and = wordcount_keywords["and"] # Returns 2
all_counted_keywords = wordcount_keywords.keys() # Returns ['a', 'and', 'the', 'of']
Here is a solution with no imports. It uses nested linear searches which are acceptable with a small number of searches over a small input array, but will become unwieldy and slow with larger inputs.
Still the input here is quite large, but it handles it in reasonable time. I suspect if your keywords file was larger (mine has only 3 words) the slow down would start to show.
Here we take an input file, iterate over the lines and remove punctuation then split by spaces and flatten all the words into a single list. The list has dupes, so to remove them we sort the list so the dupes come together and then iterate over it creating a new list containing the string and a count. We can do this by incrementing the count as long the same word appears in the list and moving to a new entry when a new word is seen.
Now you have your word frequency list and you can search it for the required keyword and retrieve the count.
The input text file is here and the keyword file can be cobbled together with just a few words in a file, one per line.
python 3 code, it indicates where applicable how to modify for python 2.
# use string.punctuation if you are somehow allowed
# to import the string module.
translator = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~')
words = []
with open('hamlet.txt') as f:
for line in f:
if line:
line = line.translate(translator)
# py 2 alternative
#line = line.translate(None, string.punctuation)
words.extend(line.strip().split())
# sort the word list, so instances of the same word are
# contiguous in the list and can be counted together
words.sort()
thisword = ''
counts = []
# for each word in the list add to the count as long as the
# word does not change
for w in words:
if w != thisword:
counts.append([w, 1])
thisword = w
else:
counts[-1][1] += 1
for c in counts:
print('%s (%d)' % (c[0], c[1]))
# function to prevent need to break out of nested loop
def findword(clist, word):
for c in clist:
if c[0] == word:
return c[1]
return 0
# open keywords file and search for each word in the
# frequency list.
with open('keywords.txt') as f2:
for line in f2:
if line:
word = line.strip()
thiscount = findword(counts, word)
print('keyword %s appear %d times in source' % (word, thiscount))
If you were so inclined you could modify findword to use a binary search, but its still not going to be anywhere near a dict. collections.Counter is the right solution when there are no restrictions. Its quicker and less code.

In Python,if startswith values in tuple, I also need to return which value

I have an area codes file I put in a tuple
for line1 in area_codes_file.readlines():
if area_code_extract.search(line1):
area_codes.append(area_code_extract.search(line1).group())
area_codes = tuple(area_codes)
and a file I read into Python full of phone numbers.
If a phone number starts with one of the area codes in the tuple, I need to do to things:
1 is to keep the number
2 is to know which area code did it match, as need to put area codes in brackets.
So far, I was only able to do 1:
for line in txt.readlines():
is_number = phonenumbers.parse(line,"GB")
if phonenumbers.is_valid_number(is_number):
if line.startswith(area_codes):
print (line)
How do I do the second part?
The simple (if not necessarily highest performance) approach is to check each prefix individually, and keep the first match:
for line in txt:
is_number = phonenumbers.parse(line,"GB")
if phonenumbers.is_valid_number(is_number):
if line.startswith(area_codes):
print(line, next(filter(line.startswith, area_codes)))
Since we know filter(line.startswith, area_codes) will get exactly one hit, we just pull the hit using next.
Note: On Python 2, you should start the file with from future_builtins import filter to get the generator based filter (which will also save work by stopping the search when you get a hit). Python 3's filter already behaves like this.
For potentially higher performance, the way to both test all prefixes at once and figure out which value hit is to use regular expressions:
import re
# Function that will match any of the given prefixes returning a match obj on hit
area_code_matcher = re.compile(r'|'.join(map(re.escape, area_codes))).match
for line in txt:
is_number = phonenumbers.parse(line,"GB")
if phonenumbers.is_valid_number(is_number):
# Returns None on miss, match object on hit
m = area_code_matcher(line)
if m is not None:
# Whatever matched is in the 0th grouping
print(line, m.group())
Lastly, one final approach you can use if the area codes are of fixed length. Rather than using startswith, you can slice directly; you know the hit because you sliced it off yourself:
# If there are a lot of area codes, using a set/frozenset will allow much faster lookup
area_codes_set = frozenset(area_codes)
for line in txt:
is_number = phonenumbers.parse(line,"GB")
if phonenumbers.is_valid_number(is_number):
# Assuming lines that match always start with ###
if line[:3] in area_codes_set:
print(line, line[:3])

Split string when a certain character is read

I have several strings stored in a file one per line like this:
dsfsdfsd/mhgjghj
cvcv/xcvxc
werwr/erewrwer
nbmbn/iuouiouio
...
As you can see the only character that is always present is the backlash /, the rest being pretty random in its composition. I need to store the first and second part (ie: before and after the backlash respectively) of each line separately, so as to end up with something like this:
first_list = [dsfsdfsd, cvcv, werwr, nbmbn, ...]
secnd_list = [mhgjghj, xcvxc, erewrwer, iuouiouio, ...]
I could do this in python iterating through each line, checking for the existence of the backlash and storing the contents of each part of the line separately. It would look like this:
first_list, secnd_list = [], []
for line in file:
for indx, char in enumerate(line):
if char == '/':
first_list.append(line[:(indx-1)])
secnd_list.append(line[(indx-1):])
break
I'm looking for a prettier (more pythonic) version of this code.
split() might come in handy here:
first_list, secnd_list = [], []
for line in file:
first, second = line.split('/')
first_list.append(first)
secnd_list.append(second)
One of the assumptions made here is that only a single / is present. Knowning that, split('/') will always return a 2-tuple of elements. If this assumption is false, try split('/', 1) instead - it limits the number of splits to 1, counting left-to-right.
As well as str.split you could use str.partition:
first_parts = []
second_parts = []
for line in file:
before, _, after = line.partition('/')
first_parts.append(before)
second_parts.append(after)
An alternative more functional oneliner:
first_parts, _, second_parts = zip(*(line.partition('/') for line in file))
Explanation for the _ in both options - str.partition returns a tuple: (first_part, seperator, last_part). Here, we don't need the seperator (indeed I can't imagine why you ever would), so we assign it to the throwaway variable _.
Here are the docs for str.partition, and here are the docs for str.split.

parsing repeated lines of string based on initial characters [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking for code must demonstrate a minimal understanding of the problem being solved. Include attempted solutions, why they didn't work, and the expected results. See also: Stack Overflow question checklist
Closed 9 years ago.
Improve this question
I am working on lists and strings in python. I have following lines of string.
ID abcd
AC efg
RF hij
ID klmno
AC p
RF q
I want the output as :
abcd, efg, hij
klmno, p, q
This output is based on the the first two characters in the line. How can I achieve it in efficient way?
I'm looking to output the second part of the line for every entry between the ID tags.
I'm having a little trouble parsing the question, but according to my best guess, this should do what you're looking for:
all_data = " ".join([line for line in file]).split("ID")
return [", ".join([item.split(" ")[::2] for item in all_data])]
Basically what you're doing here is first just joining together all of your data (removing the newlines) then splitting on your keyphrase of "ID"
After that, if I'm correctly interpreting the question, you're looking to get the second value of each pair. These pairs are space delimited (as is everything in that item due to the " ".join in the first line), so we just step through that list grabbing every other item.
In general splits have a little more syntactic sugar than is usually used, and the full syntax is: [start:end:step], so [::2] just returns every other item.
You could use the following, which takes order into account so that transposing the dict's values makes more sense...
from collections import OrderedDict
items = OrderedDict()
with open('/home/jon/sample_data.txt') as fin:
lines = (line.strip().partition(' ')[::2] for line in fin)
for key, value in lines:
items.setdefault(key[0], []).append(value)
res = [', '.join(el) for el in zip(*items.values())]
# ['abcd, efg, hij', 'klmno, p, q']
Use a default dict:
from collections import defaultdict
result = defaultdict(list)
for line in lines:
split_line = line.split(' ')
result[split_line[0]].append(split_line[1])
This will give you a dictionary result that stores all the values that have the same key in an array. To get all the strings that were in a line that started with e.g. ID:
print result[ID]
Based on your answers in comments, this should work (if I understand what you're looking for):
data = None
for line in lines:
fields = line.split(2)
if fields[0] == "ID":
#New set of data
if data is not None:
#Output last set of data.
print ", ".join(data)
data = []
data.append(fields[1])
if data is not None:
#Output final data set
print ", ".join(data)
It's pretty straight forward, you're just collecting the second field in each line into data until you see that start of the next data set, at which point you output the previous data set.
I think using itertools.groupby is best for this kind of parsing (do something until next token X)
import itertools
class GroupbyHelper(object):
def __init__(self):
self.state = None
def __call__(self, row):
if self.state is None:
self.state = True
else:
if row[0] == 'ID':
self.state = not self.state
return self.state
# assuming you read data from 'stream'
for _, data in itertools.groupby((line.split() for line in stream), GroupbyHelper()):
print ','.join(c[1] for c in data)
output:
$ python groupby.py
abcd,efg,hij
klmno,p,q
It looks like you would like to sub group your data, when ever 'ID' is present as your key. Groupby solution could work wonder here, if you know how to group your data. Here is one such implementation that might work for you
>>> data=[e.split() for e in data.splitlines()]
>>> def new_key(key):
toggle = [0,1]
def helper(e):
if e[0] == key:
toggle[:] = toggle[::-1]
return toggle[0]
return helper
>>> from itertools import groupby
>>> for k,v in groupby(data, key = new_key('ID')):
for e in v:
print e[-1],
print
abcd efg hij
klmno p q
If lines is equal to
['ID abcd', 'AC efg', 'RF hij']
then
[line.split()[1] for line in lines]
Edit: Added everything below after down votes
I am not sure why this was down voted. I thought that code was the simplest way to get started with the information provided at the time. Perhaps this is a better explanation of what I thought/think the data was/is?
if input is a list of strings in repeated sequence, called alllines;
alllines = [ #a list of repeated lines of string based on initial characters
'ID abcd',
'AC efg',
'RF hij',
'ID klmno',
'AC p',
'RF q'
]
then code is;
[[line.split()[1] for line in lines] for lines in [[alllines.pop(0) \
for i in range(3)] for o in range(len(alllines)/3)]]
This basically says, create a sublist of three split [1] strings from the whole list of all strings for every three strings in the whole list.
and output is;
[[
'abcd', 'efg', 'hij'
], [
'klmno', 'p', 'q'
]]
Edit: 8-6-13 This is an even better one without pop();
zip(*[iter([line.split()[1] for line in alllines])]*3)
with a slightly different output
[(
'abcd', 'efg', 'hij'
), (
'klmno', 'p', 'q'
)]

Trying to use Python script to add strings to file

I have a spanish novel, in a plain textfile, and I want to make a Python script that puts a translation in brackets after difficult words. I have a list of the words (with translations) I want to do this with in a separate text file, which I have tried to format correctly.
I've forgotten everything I knew about Python, which was very little to begin with, so I'm struggling.
This is a script someone helped me with:
bookin = (open("C:\Users\King Kong\Documents\_div_tekstfiler_\coc_es.txt")).read()
subin = open("C:\Users\King Kong\Documents\_div_tekstfiler_\cocdict.txt")
for line in subin.readlines():
ogword, meaning = line.split()
subword = ogword + " (" + meaning + ")"
bookin.replace(ogword, subword)
ogword = ogword.capitalize()
subword = ogword + " (" + meaning + ")"
bookin.replace(ogword, subword)
subin.close()
bookout = open("fileout.txt", "w")
bookout.write(bookin)
bookout.close()
When I ran this, I got this error message:
Traceback (most recent call last):
File "C:\Python27\translscript_secver.py", line 4, in <module>
ogword, meaning = line.split()
ValueError: too many values to unpack
The novel pretty big, and the dictionary I've made consists of about ten thousand key value pairs.
Does this mean there's something wrong with the dictionary? Or it's too big?
Been researching this a lot, but I can't seem to make sense of it. Any advice would be appreciated.
line.split() in ogword, meaning = line.split() returns a list, and in this case it may be returning more than 2 values. Write your code in a way that can handle more than two values. For instance, by assigning line.split() to a list and then asserting that the list has two items:
mylist = line.split()
assert len(mylist) == 2
ogword, meaning = line.split()[:2]
line.split() return a list of words (space separated token) in line. The error you get suggest that somewhere, your dictionnary contains more than just pair. You may add trace message to locate the error (see below).
If your dictionnary contains richer definitions than synonym, you may use following lines, which put the first word in ogword and following ones in meaning.
words = line.split()
ogword, meaning = words[0], " ".join(words[1:])
If your dictionary syntax is more complex (composed ogword), you have to rely on an explicit separator. You can still use split to divide your lines (line.split("=") will split a line on "=" characters)
Edit: to ignore and display bad lines, replace ogword, meaning = line.split() with
try:
ogword,meaning = line.split()
except:
print "wrong formated line:", line
continue
split()
returns a single list, ie one item, you are trying to assign this one thing to two variables.
It will work if the number of items in the list is equal to the number of variables on the left hand side of the assignment statement. I.e., the list is unpacked and the individual parts are assigned to the variables on the left hand side.
In this case, as pointed out by #Josvic Zammit, the problem can occur if there are more than 2 items in the list and can not properly "unpacked" and assigned.

Categories