Convert a Python list of lists to a string

How can I best convert a list of lists in Python to a (for example) comma-delimited-value, newline-delimited-row, string? Ideally, I'd be able to do something like this:
> import csv
> matrix = [["Frodo","Baggins","Hole-in-the-Ground, Shire"],["Sauron", "I forget", "Mordor"]]
> csv_string = csv.generate_string(matrix)
> print(csv_string)
Frodo,Baggins,"Hole-in-the-Ground, Shire"
Sauron,I forget,Mordor
I know that Python has a csv module, as seen in SO questions like this one, but all of its functions seem to operate on a file object. The lists are small enough that using a file is overkill.
I'm familiar with the join function, and there are plenty of SO answers about it. But this doesn't handle values that contain a comma, nor does it handle multiple rows unless I nest a join within another join.

Combine the csv module with io.StringIO:
import io, csv
result = io.StringIO()
writer = csv.writer(result)
writer.writerow([5,6,7])
print(result.getvalue())
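To produce the multi-row string asked for in the question, the same idea extends with writerows; a minimal sketch using the question's matrix (lineterminator is set only to avoid the writer's default \r\n row endings):
import io, csv
matrix = [["Frodo", "Baggins", "Hole-in-the-Ground, Shire"],
          ["Sauron", "I forget", "Mordor"]]
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerows(matrix)  # write every row at once
csv_string = buffer.getvalue()
print(csv_string)
# Frodo,Baggins,"Hole-in-the-Ground, Shire"
# Sauron,I forget,Mordor
Note that the field containing a comma is quoted automatically, which is exactly what plain join cannot do.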

The approach in the question you link to as a reference for join, together with nested joins (what's wrong with that?), works as long as you can convert all of the objects contained in your list of lists to strings:
list_of_lists = [[1, 'a'], [2, 3, 'b'], ['c', 'd']]
joined = '\n'.join(','.join(map(str, row)) for row in list_of_lists)
print(joined)
Output:
1,a
2,3,b
c,d
EDIT:
If the string representation of your objects may contain commas, here are a couple of things you could do to achieve an output that can recover the original list of lists:
escape those commas, or
wrap said string representations in some flavor of quotes (then you have to escape the occurrences of that character inside your values). This is precisely what the combination of io.StringIO and csv does (see Daniel's answer).
To achieve the first, you could do:
import re

def escape_commas(obj):
    return re.sub(',', r'\,', str(obj))  # raw string so the backslash survives in the output

joined = '\n'.join(','.join(map(escape_commas, row)) for row in list_of_lists)
For the second:
import re

def wrap_in_quotes(obj):
    # note the raw string; '\"' on its own is just '"', which would not escape anything
    return '"' + re.sub('"', r'\"', str(obj)) + '"'

joined = '\n'.join(','.join(map(wrap_in_quotes, row)) for row in list_of_lists)

List of strings to create a new list of strings without quotes?

Data
crop_list = ['Cotton','Ragi', 'Groundnut', 'Sugarcane', 'Redgram', 'Sunflower', 'Paddy', 'Maize','Jowar']
Now each element becomes a DataFrame:
for a in crop_list:
    vars()[a] = Data[Data['Crop'] == a]
For the next lines of code I might need to create a list manually, i.e. dfs:
import pandas as pd
from functools import reduce
dfs = [Cotton, Ragi, Groundnut, Sugarcane, Redgram, Sunflower, Paddy, Maize, Jowar]
df_merged = reduce(lambda a,b: pd.merge(a,b, on='Year'), dfs)
So I'm asking: is there any way to get a dynamic list?
Expected output:
Another list with the same strings, without quotes:
new_crop_list = [Cotton,Ragi, Groundnut, Sugarcane, Redgram, Sunflower, Paddy,Maize,Jowar]
I think this is basically what you meant
crop_list = ["'Cotton'","'Ragi'", "'Groundnut'", "'Sugarcane'", "'Redgram'", "'Sunflower'", "'Paddy'", "'Maize'","'Jowar'"]
Without quotes a string is not a string, so assuming this is the case, you can remove the embedded single quotes from the list using the following code:
new_list = [ x.replace("'","") for x in crop_list]
The above code will remove single quotes from around the values in the list.
The output will look like
['Cotton', 'Ragi', 'Groundnut', 'Sugarcane', 'Redgram', 'Sunflower', 'Paddy', 'Maize', 'Jowar']
You will still see single quotes in the output, since it is a list of strings and the quotes denote that.
Hope this answers your question
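If the underlying goal is just to avoid typing the dfs list by hand, a sketch of a common alternative (assuming Data and crop_list as above) is to skip vars() entirely and build the list of DataFrames directly:
import pandas as pd
from functools import reduce

# one DataFrame per crop, in the same order as crop_list
dfs = [Data[Data['Crop'] == a] for a in crop_list]
# or, to refer to them by name later: {a: Data[Data['Crop'] == a] for a in crop_list}

df_merged = reduce(lambda a, b: pd.merge(a, b, on='Year'), dfs)
This avoids creating variables dynamically, which is generally considered fragile.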

Split List Elements in byte format to separate bytes in python

I have a list with byte elements like this:
list = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']
Now I want to separate all bytes like following:
list_new =[b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']
I am assuming here that you wanted to split the data with a split criterion of '\x', which seems to match your desired output; let me know otherwise. I am also not sure why you got this type of data, it is a little awkward to work with, and more context on the question might be helpful. Nevertheless, I tried to get your desired output in the following way (maybe not efficient, but it gets the job done):
import re
from codecs import encode
lists = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']
split = [re.split(r'(?=\\x)', str(item)) for item in lists] ## splitting with assumption of \x using lookarounds here
output = [] ## container to save the final item
for item in split:  ## split is a list of lists, hence two for loops
    for nitem in item:
        if nitem != "b'":  ## drop anything which is only "b'"
            output.append(nitem.replace('\\n', '').replace("'", '').encode())  ## finally append every item
## Note that the items in output still contain two backslashes; to remove them we use the encode function
## from the codecs module, like below
[encode(itm.decode('unicode_escape'), 'raw_unicode_escape') for itm in output]  ## final output
Output:
[b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']
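If the real intent is to group each non-printable byte with the printable ASCII bytes that follow it (which is what the desired output looks like), a sketch that works on the bytes directly, without round-tripping through repr(), could look like this (it also drops the stray trailing newline, an assumption based on the expected output):
def split_bytes(bs):
    # start a new chunk at every non-printable byte; printable ASCII sticks to the previous chunk
    chunks = []
    for byte in bs:
        if not chunks or byte < 0x20 or byte > 0x7e:
            chunks.append(bytearray([byte]))
        else:
            chunks[-1].append(byte)
    return [bytes(c) for c in chunks if c != b'\n']

data = [b'\x00\xcc\n', b'\x14I\x8dy_\xeb\xbc1C']
result = [chunk for item in data for chunk in split_bytes(item)]
print(result)  # [b'\x00', b'\xcc', b'\x14I', b'\x8dy_', b'\xeb', b'\xbc1C']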

Trying to split a txt file into multiple variables

So I'm making a program where it reads a text file and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline = line.strip("\n")[1:]
    else:
        nucseq += line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)
I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!
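For instance, a minimal sketch of that two-step split on the header line (variable names are just illustrative):
header = '1EK9:A.41,52; B.61,74; C.247,257; D.279,289'

title, rest = header.split(':')                   # '1EK9' and the coordinate part
positions = [p.strip() for p in rest.split(';')]

print(title)      # 1EK9
print(positions)  # ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']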
I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f)
    sequence = f.read().replace('\n', '')

title, components = header_line.lstrip('>').split(':')
pairs = components.split(';')
for pair in pairs:
    name, coords = pair.strip().split('.')
    start, end = coords.split(',')
    sequence_pairs[name] = sequence[int(start):int(end) + 1]

for name, data in sequence_pairs.items():
    print('{} - {}'.format(name, data))
The other answer may well tackle the assumed problem in its entirety, but since the OP asked for pointers or an example of the typical split/join transform, which is often so successful, I hereby provide some ideas and working code to show it (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function

nuc_seq = []  # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip()  # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line)  # build list
# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
#  'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
#  ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely deployed paradigm in Python programming (and in programming with powerful data types in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or search SO for the excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as '\n' is considered whitespace just like ' ' (a string with a space only) or a tab '\t'; sample:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears, if there are at least two element sto join and in this case, strip removed all whits space and left us with the empty string.
Upate:
As requested a further analysis of the "coordinate part" in the line called headline of the question:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above) the complete header line in the headline string:
title, coordinates = headline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you can always reinforce what you already know ;-) and map each dimension to its (ordered) coordinate pair:
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/env python
from __future__ import print_function

het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this as an explicit nested for loop, but it is a late European evening, so the trick is to read it from the right:
for all elements of het_seq,
split on the dot and store the left part in a and the right part in bc,
then further split bc into a sequence of k's, convert each to an integer and put them into a tuple (an ordered pair of integer coordinates),
arriving back on the left, build a tuple of a (the dimension, like 'A') and the coordinate tuple from the previous step.
In the end, call the dict() function, which constructs a dictionary from an iterable of (key, value) pairs and gives {key_1: value_1, ...}.
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
you will keep those two coordinates, not extend or append to that pair;
in Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the somewhat obnoxious "one-liner" above (which had already been written on three lines kept together within parentheses).
HTH.
So I'm assuming you are trying to process a FASTA-like file, and the way I would do it is to first get the header and separate the pieces with a regex. Following that you can store the coordinate entries (A.41,52, B.61,74, ...) in a list for easy access. The code is as follows.
import re
def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions

dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()
I am not sure I completely understand the goal (and I think this would be more suitable as a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you may be able to use it to extract the first line, similar to using spaces above to extract "words".
UPDATE (starting from result):
# getting A indexes
letter_seq = result[5:]
ind = result[:4]
Aind = ind[0].split('.')[1].replace(';', '')
# getting one long letter seq
long_letter_seq = reduce(lambda x, y: x + y, letter_seq)
# extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line is just a combination of several operations that were also used earlier.
Same for B, C, D, etc., so a lot of manual work and calculations...
BE CAREFUL with the indexes of A: numbering in Python starts from 0, which may not be the case in your numbering system.
The more elegant solution would be using re (https://docs.python.org/2/library/re.html) to find the pattern with a mask, but this requires very well-defined rules for how to look up the sequence needed.
UPDATE 2: it is also not clear to me what the role of spaces is; so far I removed them, but they may matter when counting the letters in the original string.

Writing csv with python

I wrote a python script in which I generate a csv from numbers I computed.
The rows I write are:
writeRow = [str(t), len(c) , {k for k in c.keys()}, {k for k in c.values()}]
I have two problems:
t is a number that can begin with 0, but in that case the leading 0 is deleted.
I tried without str() but it doesn't change...
the sets are printed as sets in the cells. However, I want to write these numbers separated by commas in the same cell and without the {}. How can I do that?
edit
I am using the csv module; in the code, I create a list for each row to write and then write it with csv.writerow.
I'm gonna post more code:
from csv import reader, writer
with open(fileName1) as inp, open(fileName2,'w') as o:
I then define the reader, writer, and the variables t,c
writeRow = [str(t), len(c) , {k for k in c.keys()}, {k for k in c.values()}]
Then I write the result in the output file
Edit
form of t and what a row should look like
t = 023
t = 123
t is an int
The line in the end should look like:
cell1 cell2 cell3 cell4
123 2 string1,string2 num1,num2
string1 and string2 are the dict keys; num1, num2 the corresponding values.
For starters, you can take advantage of the join() method (assuming keys and values are strings):
",".join(c.keys()) + "," + ",".join(c.values())
This will take care of commas. However, it will break very easily for any non-trivial data, so consider using the csv module instead, which takes care of escaping dangerous characters.
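A minimal sketch of that route (assuming t is an int, c is a dict of strings, and you want the zero-padded t from the question):
import io, csv

t = 23
c = {'string1': 'num1', 'string2': 'num2'}

row = ['{:0>3}'.format(t), len(c), ','.join(c.keys()), ','.join(c.values())]

out = io.StringIO()
csv.writer(out).writerow(row)   # the joined cells get quoted because they contain commas
print(out.getvalue())
# 023,2,"string1,string2","num1,num2"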
Use string formatting for the padding with 0s, and join for the sets:
writeRow = ['{:0>3}'.format(t), len(c) , ','.join(c.keys()), ','.join(c.values())]
note that you should not enter your value for t with a leading 0:
>>> t = 023
>>> t
19
(a leading 0 makes this an octal literal in Python 2, hence 19; in Python 3 it is a syntax error).
What about using format strings?
'%s,%i'%(t,c)
might do what you want. You could also use '%03i' or something similar to produce some padding zeros before your number. Or, I might misunderstand your question. Try posting a more complete (runnable) example.

python parse csv to lists

I have a csv file from which I want to parse the data into lists.
So I am using the Python csv module to read it, basically as follows:
import csv
fin = csv.reader(open(path,'rb'),delimiter=' ',quotechar='|')
print fin[0]
#gives the following
['"1239","2249.00","1","3","2011-02-20"']
#lets say i do the following
ele = str(fin[0])
ele = ele.strip().split(',')
print ele
#gives me following
['[\'"1239"', '"2249.00"', '"1"', '"3"', '"2011-02-20"\']']
now
ele[0] gives me --> output---> ['"1239"
How do I get rid of that ['
In the end, what I want to do is get 1239 and convert it to an integer.
Any clues why this is happening?
Thanks
Edit: Never mind, resolved thanks to the first comment.
Change your delimiter to ',' and you will get a list of those values from the csv reader.
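A minimal sketch of that fix (assuming the file really is comma-delimited with double-quoted fields, opened in text mode under Python 3, and iterating instead of indexing, since a csv reader is not subscriptable):
import csv

with open(path, newline='') as f:
    fin = csv.reader(f, delimiter=',')   # the default quotechar '"' strips the quotes for you
    for row in fin:
        print(row)          # ['1239', '2249.00', '1', '3', '2011-02-20']
        print(int(row[0]))  # 1239
        break               # just the first row, as in the question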
It's because you are converting a list to a string; there is no need to do this. Grab the first element of the list (in this case it is a string) and parse that:
>>> a = ['"1239","2249.00","1","3","2011-02-20"']
>>> a
['"1239","2249.00","1","3","2011-02-20"']
>>> a[0]
'"1239","2249.00","1","3","2011-02-20"'
>>> b = a[0].replace('"', '').split(',')
>>> b[-1]
'2011-02-20'
Of course, before you use the replace and split string methods you should check that the type is a string, or handle the exception if it isn't.
Also, Blahdiblah is correct: your delimiter is probably wrong.
