quick data processing with python?


I have a file in the following format:
[s1,s2,s3,s4,...] SOME_TEXT
(per line)
For example:
[dog,cat,monkey] 1,2,3
[a,b,c,d,e,f] 13,4,6
The brackets are included.
Let's say I have another file like this, which contains two lines:
[banana,cat2,monkey2] 1,2,3
[a2,b2,c2,d,e,f] 13,4,6
I want to take two files of this form and align them the following way:
[dog^banana,cat^cat2,monkey^monkey2] 1,2,3
[a^a2,b^b2,c^c2,d^d2,e^e2,f^f2] 13,4,6
while making sure that "SOME TEXT" in corresponding lines (such as 1,2,3 and 13,4,6) is the same and that the number of elements in the brackets in each corresponding line is the same. What would be a quick compact way to do it?
Thanks.

def read_file(fp,hash):
    for l in fp:
        p = l[1:].find(']')
        k = l[p+3:-1]
        v = l[1:p+1].split(",")
        if k not in hash:
            hash[k] = v
        else:
            hash[k] = zip(hash[k], v)
hash = {}
for fname in ('f1.txt', 'f2.txt'):
    with open(fname) as fp:
        read_file(fp, hash)
for k,v in hash.items():
    print "[{0}] {1}".format(",".join("^".join(vv) for vv in v), k)
This is a basic way to do it. If you need the output lines in the order they were read from the files, you'll have to do a bit more work.
Here's the output I get:
[a^a2,b^b2,c^c2,d^d,e^e,f^f] 13,4,6
[dog^banana,cat^cat2,monkey^monkey2] 1,2,3
Edit:
This also assumes that each key, i.e. 13,4,6, appears only once in a file. If it can appear multiple times, you'll have to change hash[k] = zip(hash[k],v) to something more elaborate, such as
        if k not in hash:
            hash[k] = [[vv] for vv in v]
        else:
            for i,vv in enumerate(v):
                hash[k][i].append(vv)

I'd use a regex to chop off everything after the first ] (and hang on to it). Then another regex to explode the string into an array. Then do whatever you need to do with regard to merging the arrays from the different files; piecing it all back together afterwards shouldn't be too hard. I'll leave the regexes as an exercise for the reader :-)
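The regex step described above might look roughly like this. It is only a sketch: the pattern and the helper name parse_line are assumptions, not part of the answer, and it assumes the bracketed items never contain "]" or ",".

```python
import re

# Assumed line shape: "[item1,item2,...] trailing text"
LINE_RE = re.compile(r'^\[([^\]]*)\]\s*(.*)$')

def parse_line(line):
    """Return (items, tail) for one line, or raise ValueError if it does not fit."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError('unexpected line format: %r' % line)
    # group(1) is the text inside the brackets, group(2) is everything after them
    return m.group(1).split(','), m.group(2)
```

For the first sample line, parse_line returns the three animal names as a list together with the string 1,2,3, which is enough to merge and compare lines from two files.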

for l, m in zip(f1, f2):
    l_head, l_tail = l.strip("[ ").split("]")
    m_head, m_tail = m.strip("[ ").split("]")
    l_head = l_head.split(",")
    m_head = m_head.split(",")
    assert len(l_head) == len(m_head)
    l_tail = l_tail.split(",")
    m_tail = m_tail.split(",")
    assert len(l_tail) == len(m_tail)
    ...
I haven't given your variables good names because I don't know what they are. I would name them something more useful.
I also haven't written the code for reassembling the lines. It shouldn't be too hard...
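One way the missing reassembly step could be filled in is sketched below. The function names merge_lines and merge_files are made up for illustration, and the sketch assumes the two files pair up line by line and that a mismatch should raise an error rather than be skipped.

```python
def merge_lines(line1, line2, sep='^'):
    """Merge two lines of the form "[a,b,c] tail" into "[a^x,b^y,c^z] tail"."""
    head1, tail1 = line1.strip().lstrip('[').split(']', 1)
    head2, tail2 = line2.strip().lstrip('[').split(']', 1)
    items1, items2 = head1.split(','), head2.split(',')
    # The text after the brackets must be identical in both lines
    if tail1.strip() != tail2.strip():
        raise ValueError('trailing text differs: %r vs %r' % (tail1, tail2))
    # Both lines must carry the same number of bracketed items
    if len(items1) != len(items2):
        raise ValueError('different number of bracketed items')
    merged = ','.join(a + sep + b for a, b in zip(items1, items2))
    return '[%s] %s' % (merged, tail1.strip())

def merge_files(path1, path2):
    """Yield merged lines, pairing line N of one file with line N of the other."""
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):
            yield merge_lines(line1, line2)
```

Note that zip stops at the end of the shorter file, so a separate length check would be needed if the files may differ in line count.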

Related

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python so please excuse me for asking something stupid.
From a textfile a dictionary is made to be used as a pass/block filter.
The textfile contains addresses and either a block or allow like "002029568,allow" or "0011*,allow" (without the quotes).
The search-input is a string with a complete code like "001180000".
How can I evaluate if the search-item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter-dictionary is made with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow

If you can use re, you don't have to worry about the wildcard but let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""
# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]
# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings
print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
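A variation on the same idea, swapping re.match for the standard library's fnmatch.fnmatchcase, is sketched here. One difference worth noting: re.match only anchors at the start, so a rule like 002029568 would also match a longer number beginning with those digits, whereas fnmatchcase compares against the whole string. The function name rulings_for is an assumption for this example.

```python
from fnmatch import fnmatchcase

# Same sample rules as above, already parsed into (pattern, ruling) pairs
rules = [("002029568", "allow"), ("0011*", "allow"), ("001180001", "block")]

def rulings_for(number, rules):
    """Return the rulings whose pattern matches the whole number, in file order."""
    return [ruling for pattern, ruling in rules if fnmatchcase(number, pattern)]
```

Be aware that fnmatchcase also treats ? and [...] as wildcards, which is harmless for purely numeric rules but would matter if the rule file could contain those characters.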

If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match", then you can deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (which defeats the purpose of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found matching key-value pair
        print(k, v)
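A different arrangement, not taken from the answer above, keeps dictionary lookups even for wildcards of arbitrary length: store the wildcard rules keyed by their prefix and, for each code, try its prefixes from longest to shortest. The names load_rules and lookup are hypothetical, and the precedence (exact entry first, then the longest matching prefix) is an assumption about what the filter should do.

```python
def load_rules(lines):
    """Split rules into exact entries and wildcard prefixes (pattern ends with '*')."""
    exact, prefixes = {}, {}
    for line in lines:
        fields = [f.strip() for f in line.split(',')]
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pattern, ruling = fields
        if pattern.endswith('*'):
            prefixes[pattern[:-1]] = ruling
        else:
            exact[pattern] = ruling
    return exact, prefixes

def lookup(code, exact, prefixes, default=None):
    """Exact match wins; otherwise the longest matching wildcard prefix."""
    if code in exact:
        return exact[code]
    # Try code[:n] for n = len(code) down to 0 (the empty prefix is a bare '*')
    for end in range(len(code), -1, -1):
        if code[:end] in prefixes:
            return prefixes[code[:end]]
    return default
```

Each lookup then costs at most one dictionary probe per character of the code, independent of how many rules the file contains.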

Python read .txt File -> list

I have a .txt File and I want to get the values in a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values by
def read_values(filename):
    direction = []
    for line in open(filename):
        # use separate names so the result list is not overwritten
        value,t = line.strip().split(',')
        value = float(value)
        t = long(t)
        direction.append([value,t])
    return direction
But I have a big problem: when creating the data I forgot to insert a "\n" after each row.
That's why I have this format:
value0, timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get this data into a list the way I want it? It would be a lot of work to collect the data again.
Thanks
Max

import re
input = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", input):
    print value, timestamp

You will have to strip the last , but you can insert a comma after every 13 chars following a comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.46467665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,1466615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,1466615025741-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})",r"\1"+",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.46467665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,1466615025652,-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,1466615025741,-1.65375,1466615025772,-1.6092329,1466615025801,

I coded a quickie using your example, and not using 13 but len("timestamp") so you can adapt
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i,c in enumerate(instr):
    if c==",":
        next_i = i+len("timestamp")+1
        print(instr[previous_i:next_i])
        previous_i = next_i
output is descrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp

I think you could do something like this:
direction = []
for line in open(filename):
    parts = line.split(',')
    v = parts[0]
    for s in parts[1:]:
        t = s[:13]
        direction.append([float(v), long(t)])
        v = s[13:]
If you're using python 3.X, then the long function no longer exists -- use int.
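For Python 3, the same recovery can be sketched with a single regular expression. This assumes every record is a value, a comma, and a timestamp of exactly 13 digits, as stated in the question; the names RECORD_RE and parse_records are made up for the example.

```python
import re

# Assumption: each record is "<value>,<13-digit timestamp>" with nothing in between
RECORD_RE = re.compile(r'([^,]+),\s*(\d{13})')

def parse_records(text):
    """Return [[value, timestamp], ...] from text with no line breaks between records."""
    return [[float(value), int(timestamp)]
            for value, timestamp in RECORD_RE.findall(text)]
```

Reading the whole file with open(filename).read() and passing the result to parse_records should yield the list of [value, timestamp] pairs the question asks for.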

Trying to split a txt file into multiple variables

So I'm making a program where it reads a text file and I need to separate all the info into their own variables. It looks like this:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD
YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ
DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT
QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN
YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE
QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN
KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS
SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT
TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV
STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN
The code after the > is a title, the next bit that looks like this "A.41,52" are numbered positions in the sequence I need to save to use, and everything after that is an amino acid sequence. I know how to deal with the amino acid sequence, I just need to know how to separate the important numbers in the first line.
In the past when I just had a title and sequence I did something like this:
for line in nucfile:
    if line.startswith(">"):
        headerline=line.strip("\n")[1:]
    else:
        nucseq+=line.strip("\n")
Am I on the right track here? This is my first time, any advice would be fantastic and thanks for reading :)

I suggest you use the split() method.
split() allows you to specify the separator of your choice. Provided the sequence title (here 1EK9) is always separated from the rest of the sequence by a colon, you could first pass ":" as your separator. You could then split the remainder of the sequence to recover the numbered positions (e.g. A.41,52) using ";" as a separator.
I hope this helps!

I think what you are trying to do is extract certain parts of the sequence based on their identifiers given to you on the first line (the line starting with >).
This line contains your title, then a sequence name and the data range you need to extract.
Try this:
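sequence_pairs = {}
with open('somefile.txt') as f:
    header_line = next(f).strip()
    # join the remaining lines into one string without newlines
    sequence = ''.join(line.strip() for line in f)
# drop the leading '>' before splitting title from components
title,components = header_line[1:].split(':')
pairs = components.split(';')
for pair in pairs:
    name,positions = pair.strip().split('.')
    start,end = positions.split(',')
    # slice bounds must be integers; the end position is kept inclusive
    sequence_pairs[name] = sequence[int(start):int(end)+1]
for name,data in sequence_pairs.items():
    print('{} - {}'.format(name, data))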

The other answer may well tackle the assumed problem in its entirety, but the OP asked for pointers or an example of the typical split-unsplit transform, which is often so successful. So here are some ideas and working code to show it (based on the example in the question).
So let us focus on the else branch below:
from __future__ import print_function
nuc_seq = [] # a list
title_token = '>'
with open('some_file_of_a_kind.txt', 'rt') as f:
    for line in f.readlines():
        s_line = line.strip() # this strips whitespace
        if line.startswith(title_token):
            headerline = line.strip("\n")[1:]
        else:
            nuc_seq.append(s_line) # build list
# now nuc_seq is a list of strings like:
# ['ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD',
# 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ',
# ...
# ]
demo_nuc_str = ''.join(nuc_seq)
# now:
# demo_nuc_str == 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGADYTYSNGYR ...'
That is a fast and widely used paradigm in Python programming (and in programming with powerful datatypes in general).
If the split-unsplit (a.k.a. join) method is still unclear, just ask, or search SO for the excellent answers to related questions.
Also note that there is no need for line.strip('\n'), as \n is considered whitespace just like ' ' (a string with a single space) or a tab '\t'. For example:
>>> a = ' \t \n '
>>> '+'.join(a.split())
''
So the "joining character" only appears if there are at least two elements to join; in this case, split() removed all the whitespace and left us with an empty list, which joins to the empty string.
Update:
As requested, a further analysis of the "coordinate part" in the line the question calls headerline:
>1EK9:A.41,52; B.61,74; C.247,257; D.279,289
If you want to retrieve the:
A.41,52; B.61,74; C.247,257; D.279,289
and assume you have (as above) the complete line, minus the leading >, in the headerline string:
title, coordinates = headerline.split(':')
# so now title is '1EK9' and
# coordinates == 'A.41,52; B.61,74; C.247,257; D.279,289'
Now split on the semicolons and trim the entries:
het_seq = [z.strip() for z in coordinates.split(';')]
# now het_seq == ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
If 'A', 'B', 'C', and 'D' are well-known dimensions, then you can "lose" the ordering info from the input file (as you could always re-impose what you already know ;-) and map the coordinates as key: (ordered coordinate pair):
>>> coord_map = dict(
...     (a, tuple(int(k) for k in bc.split(',')))
...     for a, bc in (abc.split('.') for abc in het_seq))
>>> coord_map
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
In context of a micro program:
#! /usr/bin/env python
from __future__ import print_function
het_seq = ['A.41,52', 'B.61,74', 'C.247,257', 'D.279,289']
coord_map = dict(
    (a, tuple(int(k) for k in bc.split(',')))
    for a, bc in (abc.split('.') for abc in het_seq))
print(coord_map)
yields:
{'A': (41, 52), 'C': (247, 257), 'B': (61, 74), 'D': (279, 289)}
Here one might write this as explicit nested for loops, but it is a late European evening, so the trick is to read it from the right:
1. for all elements of het_seq,
2. split on the dot and store the left part in a and the right part in bc,
3. then further split bc into a sequence of k's, convert each to an integer and put them into a tuple (an ordered pair of integer coordinates),
4. arriving on the left, you build a tuple of a (the dimension, like 'A') and the coordinate tuple from step 3.
In the end, call the dict() function, which constructs a dictionary, here using the form dict([(key_1, value_1), (key_2, value_2), ...]), which gives {key_1: value_1, ...}
So all coordinates are integers, stored as ordered pairs in tuples.
I'd prefer tuples here, although split() generates lists, because:
- you will keep those two coordinates as they are, not extend or append to that pair;
- in Python, mapping and remapping is performed often, and a hashable (that is, immutable) type is ready to become a key in a dict.
One last variant (with no knotted comprehensions):
coord_map = {}
for abc in het_seq:
    a, bc = abc.split('.')
    coord_map[a] = tuple(int(k) for k in bc.split(','))
print(coord_map)
The first four lines produce the same result as the slightly obnoxious "one-liner" above (which was already written across three lines held together by parentheses).
HTH.

So I'm assuming you are trying to process a FASTA-like file, so the way I would do it is to first get the header and separate the pieces with a regex. Following that, you can store the A.41,52; B... entries in a list for easy access. The code is as follows.
import re
def processHeader(line):
    positions = re.search(r':(.*)', line).group(1)
    positions = positions.split('; ')
    return positions
dnaSeq = ''
positions = []
with open('myFasta', 'r') as infile:
    for line in infile:
        if '>' in line:
            positions = processHeader(line)
        else:
            dnaSeq += line.strip()

I am not sure I completely understand the goal (and I think this post is more suitable as a comment, but I do not have enough privileges), but I think that the key to your solution is using .split(). You can then join the elements of the resulting list just by using +, similar to this:
>>> result = line.split(' ')
>>> result
['1EK9:A.41,52;', 'B.61,74;', 'C.247,257;', 'D.279,289', 'ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD', 'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN',
'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
>>> result[3]+result[4]
'D.279,289ENLMQVYQQARLSNPELRKSAADRDAAFEKINEARSPLLPQLGLGAD'
>>>
etc. You can also use the usual following syntax to extract the elements of the list that you need:
>>> result[5:]
['YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQ', 'DVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTT', 'QRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGN', 'YYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLARE', 'QIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQN', 'KVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRS', 'SFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDAT', 'TTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPV', 'STNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN']
and join them together:
>>> reduce(lambda x, y: x+y, result[5:])
'YTYSNGYRDANGINSNATSASLQLTQSIFDMSKWRALTLQEKAAGIQDVTYQTDQQTLILNTATAYFNVLNAIDVLSYTQAQKEAIYRQLDQTTQRFNVGLVAITDVQNARAQYDTVLANEVTARNNLDNAVEQLRQITGNYYPELAALNVENFKTDKPQPVNALLKEAEKRNLSLLQARLSQDLAREQIRQAQDGHLPTLDLTASTGISDTSYSGSKTRGAAGTQYDDSNMGQNKVGLSFSLPIYQGGMVNSQVKQAQYNFVGASEQLESAHRSVVQTVRSSFNNINASISSINAYKQAVVSAQSSLDAMEAGYSVGTRTIVDVLDATTTLYNAKQELANARYNYLINQLNIKSALGTLNEQDLLALNNALSKPVSTNPENVAPQTPEQNAIADGYAPDSPAPVVQQTSARTTTSNGHNPFRN'
Remember that + on lists produces a list.
By the way, I would not remove '\n' to start with, as you could use it to extract the first line, similar to how the space is used above to extract "words".
UPDATE (starting from result):
#getting A indexes
letter_seq=result[5:]
ind=result[:4]
Aind=ind[0].split('.')[1].replace(';', '')
#getting one long letter seq
long_letter_seq=reduce(lambda x, y: x+y, letter_seq)
#extracting the final seq from long_letter_seq using Aind
output = long_letter_seq[int(Aind.split(',')[0]):int(Aind.split(',')[1])]
The last line just combines several operations that were also used earlier.
The same goes for B, C, D, etc., so it is a lot of manual work and calculation...
BE CAREFUL with the indexes of A: numbering in Python starts from 0, which may not be the case in your numbering system.
A more elegant solution would be to use re (https://docs.python.org/2/library/re.html) to find a pattern using a mask, but this requires very well-defined rules for how to look up the sequence needed.
UPDATE2: it is also not clear to me what the role of the spaces is. So far I have removed them, but they may matter when counting the letters in the original string.
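To make the indexing caveat concrete, here is a small sketch of the conversion. It assumes the header positions are 1-based with an inclusive end, which the question does not actually state, so treat that as a guess to verify against the data. The names extract and extract_all are hypothetical.

```python
# Ranges as they would come out of the header line of the question
ranges = {'A': (41, 52), 'B': (61, 74), 'C': (247, 257), 'D': (279, 289)}

def extract(sequence, start, end):
    """Return residues start..end, assuming 1-based positions with an inclusive end."""
    # Python slices are 0-based and exclude the end index, hence start - 1 and end
    return sequence[start - 1:end]

def extract_all(sequence, ranges):
    """Apply extract() to every named range and return {name: subsequence}."""
    return {name: extract(sequence, start, end)
            for name, (start, end) in ranges.items()}
```

Under this assumption, the range 41,52 yields 12 residues. If the positions turn out to be 0-based, the slice would be sequence[start:end + 1] instead.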

Extracting data by regex and writing to CSV, Python glob (pandas?)

I have a large list of varyingly dirty CSVs containing phone numbers in various formats. What I want is to comb through all of them and export to a single-column CSV of cleaned phone numbers in a simple format. So far, I have pieced together something to work, though it has some issues: (partial revision further below)
import csv
import re
import glob
import string
with open('phonelist.csv', 'wb') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = glob.glob('CSVs\*.csv')
    for filename in csv_files:
        with open(filename, 'rbU') as ifile:
            read = csv.reader(ifile)
            for row in read:
                for column in row:
                    s1 = column.strip()
                    if re.match(r'\b\d\d\d\d\d\d\d\d\d\d\b', s1):
                        if s1 not in seen:
                            seen.add(s1)
                            output.append(s1)
                    elif re.search(r'\b\(\d\d\d\) \d\d\d-\d\d\d\d\b', s1):
                        s2 = filter(lambda x: x in string.digits, s1)
                        if s2 not in seen:
                            seen.add(s2)
                            output.append(s2)
    for val in output:
        out_writer.writerow([val])
I'm putting this together with no formal python knowledge, just piecing things I've gleaned on this site. Any advice regarding pythonic stylization or utilizing the pandas library for shortcuts would all be welcome.
First issue: What's the simplest way to filter to just the matched values? IE, I may get 9815556667 John Smith, but I just want the number.
Second issue: This takes forever. I assume it's the lambda part. Is there a faster or more efficient method?
Third issue: How do I glob *.csv in the directory of the program and the CSVs directory (as written)?
I know that's several questions at once, but I got myself halfway there. Any additional pointers are appreciated.
As requested, here are some examples. This isn't from a real file (these are multi-gigabyte files), but here's what I'm looking for:
John Smith, (981) 991-0987, 9987765543 extension 541, 671 Maple St 98402
(998) 222-0011, 13949811123, Foo baR Us, 2567 Appleberry Lane
office, www.somewebpage.com, City Group, Anchorage AK
9281239812
(345) 666-7777
Should become:
9819910987
9987765543
9982220011
3949811123
3456667777
(I forgot that I need to drop a leading 1 from 11-digit numbers, too)
EDIT: I've changed my current code to incorporate Shahram's advice, so now, from for column in row above, I have, instead of above:
for column in row:
    s1 = column.strip()
    result = re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)? ?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1) or re.match(
        r'.*(\+?[2-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
    if result:
        tempStr = result.group(1)
        for ch in ['(', ')', '-', ' ']:
            tempStr = tempStr.replace(ch, '')
        if tempStr not in seen:
            seen.add(tempStr)
            output.append(tempStr)
This seems to work for my purposes, but I still don't know how to glob the current directory and subdirectory, and I still don't know if my code has issues that I'm unaware of because of my hodge-podge-ing. Also, in my larger directory, this is taking forever - as in, about a gig of CSVs is timing out for me (by my hand) at around 20 minutes. I don't know if it's hitting a snag, but judging by the speed at which python normally chews through any number of CSVs, it feels like I'm missing something.

About your first question: you can use the regular expression below to capture different types of phone numbers:
result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
if result:
    if result.group(1) not in seen:
        seen.add(result.group(1))
        output.append(result.group(1))
About your second question: you may want to look at the replace function. Strings are immutable, so keep the value that replace returns and do the duplicate check on the cleaned number. The above code can then be changed to:
result = re.match(r'.*(\+?[0-9]?[0-9]?[0-9]?-?\(?[0-9][0-9][0-9]\)?-?[0-9][0-9][0-9]-?[0-9][0-9][0-9][0-9]).*', s1)
if result:
    tempStr = result.group(1).replace('-','')
    if tempStr not in seen:
        seen.add(tempStr)
        output.append(tempStr)
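As a rough sketch of how the three issues in the question could be approached together: the pattern below is an assumption tailored to the sample rows (10 digits, optionally preceded by a 1 and optionally grouped with parentheses, spaces, or dashes), and the names PHONE_RE, extract_numbers, and csv_paths are invented for this example. It has not been timed against the multi-gigabyte files.

```python
import glob
import os
import re

# Assumed phone shape: optional leading 1, then 3-3-4 digits with optional
# parentheses, spaces, or dashes; must not be embedded in a longer digit run.
PHONE_RE = re.compile(r'(?<!\d)1?[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}(?!\d)')

def extract_numbers(text):
    """Return every phone number found in text as a bare 10-digit string."""
    numbers = []
    for match in PHONE_RE.findall(text):
        digits = re.sub(r'\D', '', match)      # keep digits only
        if len(digits) == 11 and digits.startswith('1'):
            digits = digits[1:]                # drop the leading 1
        numbers.append(digits)
    return numbers

def csv_paths(subdir='CSVs'):
    """CSV files in the current directory plus those in one subdirectory."""
    return glob.glob('*.csv') + glob.glob(os.path.join(subdir, '*.csv'))
```

Because findall returns only the matched text, surrounding words such as a name are dropped, and scanning each raw line once with a precompiled pattern avoids the per-cell lambda filter. Whether that is enough to fix the run time is something only a measurement can tell.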

Writing csv with python

I wrote a python script in which I generate a csv from numbers I computed.
The rows I write are:
writeRow = [str(t), len(c) , {k for k in c.keys()}, {k for k in c.values()}]
I have two problems:
1. t is a number that can begin with 0. But in that case, the 0 is deleted.
I tried without str() but it doesn't change anything...
2. The sets are printed as sets in the cells. However, I want to write these numbers separated by commas in the same cell and without the {}. How can I do that?
edit
I am using the csv module; In the code, I create lists for each row to write and then write them with csv.writerow
I'm gonna post more code:
from csv import reader, writer
with open(fileName1) as inp, open(fileName2,'w') as o:
I then define the reader, writer, and the variables t,c
writeRow = [str(t), len(c) , {k for k in c.keys()}, {k for k in c.values()}]
Then I write the result in the output file
Edit
form of t and what a row should look like
t = 023
t = 123
t is an int
The line in the end should look like:
cell1 cell2 cell3 cell4
123 2 string1,string2 num1,num2
string1 and string2 are the dict keys; num1 and num2 are the corresponding values

For starters, you can take advantage of the join() method (assuming keys and values are strings):
",".join(c.keys()) + "," + ",".join(c.values())
This will take care of commas. However, this will break very easily for any non-trivial data, so consider using csv module instead, which would take care of escaping dangerous characters.

Use string formatting for the padding with 0s, and join for the sets:
writeRow = ['{:0>3}'.format(t), len(c) , ','.join(c.keys()), ','.join(str(v) for v in c.values())]
Note that you should not enter your value for t with a leading 0, because Python 2 reads such a literal as octal:
>>> t = 023
>>> t
19
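Put together with the csv module, the row from the question could be built along these lines. This is a sketch under assumptions: t is an int that should be shown with three digits, c is a dict, and the helper name build_row and the width parameter are invented here. It targets Python 3.

```python
import csv
import io

def build_row(t, c, width=3):
    """Build [padded t, len(c), 'key1,key2', 'val1,val2'] for csv.writer."""
    keys = ','.join(str(k) for k in c.keys())
    values = ','.join(str(v) for v in c.values())
    # zfill pads the number with leading zeros up to the assumed width
    return [str(t).zfill(width), len(c), keys, values]

# Write one row to an in-memory buffer instead of a real file
buffer = io.StringIO()
csv.writer(buffer).writerow(build_row(23, {'string1': 1, 'string2': 2}))
line = buffer.getvalue()
```

The writer quotes the two joined cells because they contain commas, so each stays in a single cell. If the leading zero still disappears when the file is opened in a spreadsheet program, that may be the program converting the cell to a number rather than a problem in the CSV itself.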

What about using format strings?
'%s,%i'%(t,c)
might do what you want. You could also use '%03i' or something similar to produce some padding zeros before your number. Or, I might misunderstand your question. Try posting a more complete (runnable) example.
