Error in concatenation and one more error - python

I'm trying to import a CSV file and then write the runs of consecutive numbers from it to a new CSV file.
The contents of the file look like this:
1
5
6
7
8
and so on
here for example the output would be ['1,1','5,5','6,8']
The error i'm getting is
>>> gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s+1 < e]
TypeError: can only concatenate str (not "int") to str
Also for some reason after I do str1 = str1.replace(i, '')
it turns str1 into
['2855']'2856']'3250']'3251']'3252']'3253']'3254']'3255']'3256']'3257']'3258']'3259']'3260']'3261']'3262']'3263']'3264']'3265']'3278']'3279']'3280']'3281']'3299']'3312']'3314']'3331']'3332']'3333']'3334']'3405']'3406']'3407']'3408']'3500']'4849']'4850']'5567']'5568']'5569']'6000']
2856]3250]3251]3252]3253]3254]3255]3256]3257]3258]3259]3260]3261]3262]3263]3264]3265]3278]3279]3280]3281]3299]3312]3314]3331]3332]3333]3334]3405]3406]3407]3408]3500]4849]4850]5567]5568]5569]6000]
instead of giving just
2856]3250]3251]3252]3253]3254]3255]3256]3257]3258]3259]3260]3261]3262]3263]3264]3265]3278]3279]3280]3281]3299]3312]3314]3331]3332]3333]3334]3405]3406]3407]3408]3500]4849]4850]5567]5568]5569]6000]
The code:
with open('Book1.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
str1 = ''.join(str(e) for e in data)
bad_chars = ["[","'"]
for i in bad_chars :
    str1 = str1.replace(i, '')
str1.split("]",-1)
x = list((str1.split("]")))
def ranges(nums):
    nums = sorted(set(nums))
    gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s+1 < e]
    edges = iter(nums[:1] + sum(gaps, []) + nums[-1:])
    return list(zip(edges, edges))
print(ranges(x))

Try this:
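import csv
def ranges(nums):
    nums = sorted(set(nums))
    # a gap is any pair of neighbours that are not consecutive
    gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s+1 < e]
    edges = iter(nums[:1] + sum(gaps, []) + nums[-1:])
    # pair up the edges: (start, end) of each run
    return list(zip(edges, edges))
data = []
with open('Book1.csv', newline='') as f:
    reader = csv.reader(f)
    for i in reader:
        # each row is a list of strings; convert the first field to int
        data.append(int(i[0]))
print(ranges(data))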
The problem with your code was that it made the task more complex than necessary: it joined the lists of strings into one big string and then removed the bad characters from it. Instead, you can convert the first field of each row to an integer as you read it, as I did.
Also, the ranges function raised the error because s+1 tried to add the string s to the integer 1. The x list still contained strings, not integers.
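To see the whole round trip on the sample values from the question, here is a minimal sketch. It assumes one integer per row and uses in-memory io.StringIO buffers in place of Book1.csv and the output file; the names source, values, runs and out are only illustrative.

```python
import csv
import io

def ranges(nums):
    nums = sorted(set(nums))
    # neighbouring values that are not consecutive mark the end of one run
    # and the start of the next
    gaps = [[s, e] for s, e in zip(nums, nums[1:]) if s + 1 < e]
    edges = iter(nums[:1] + sum(gaps, []) + nums[-1:])
    return list(zip(edges, edges))

# stand-in for the input file: one number per line
source = io.StringIO("1\n5\n6\n7\n8\n")
values = [int(row[0]) for row in csv.reader(source) if row]
runs = ranges(values)

# stand-in for the output file: one "start,end" row per run
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(runs)
print(runs)
print(out.getvalue())
```

With these five values the function groups 5 through 8 into a single run, so two rows are written. With a real file, open it with newline='' as in the answer above.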

Related

How to transform a csv file into a multi-dimensional list using Python?

I started out with a 4d list, something like
tokens = [[[["a"], ["b"], ["c"]], [["d"]]], [[["e"], ["f"], ["g"]],[["h"], ["i"], ["j"], ["k"], ["l"]]]]
So I converted this to a csv file using the code
import csv
def export_to_csv(tokens):
    csv_list = [["A", "B", "C", word]]
    for h_index, h in enumerate(tokens):
        for i_index, i in enumerate(h):
            for j_index, j in enumerate(i):
                csv_list.append([h_index, i_index, j_index, j])
    with open('TEST.csv', 'w') as f:
        # using csv.writer method from CSV package
        write = csv.writer(f)
        write.writerows(csv_list)
But now I want to do the reverse: convert a CSV file in this format back to the list format shown above.
Assuming you wanted your csv file to look something like this (there were a couple typos in the posted code):
A,B,C,word
0,0,0,a
0,0,1,b
0,0,2,c
...
here's one solution:
import csv
def import_from_csv(filename):
    retval = []
    with open(filename) as fh:
        reader = csv.reader(fh)
        # discard header row
        next(reader)
        # process data rows
        for (x,y,z,word) in reader:
            x = int(x)
            y = int(y)
            z = int(z)
            # grow each level with fresh lists; list * n would repeat
            # references to one shared inner list
            retval.extend([] for _ in range(x + 1 - len(retval)))
            retval[x].extend([] for _ in range(y + 1 - len(retval[x])))
            retval[x][y].extend(0 for _ in range(z + 1 - len(retval[x][y])))
            retval[x][y][z] = [word]
    return retval
def import_from_csv(file):
    import ast
    import csv
    data = []
    # Read the CSV file
    with open(file) as fp:
        reader = csv.reader(fp)
        # Skip the first line, which contains the headers
        next(reader)
        for line in reader:
            # Read the first 3 elements of the line
            a, b, c = [int(i) for i in line[:3]]
            # When we read it back, everything comes in as strings. Use
            # `literal_eval` to convert it to a Python list
            value = ast.literal_eval(line[3])
            # Extend the list to accommodate the new element
            data.append([[[]]]) if len(data) < a + 1 else None
            data[a].append([[]]) if len(data[a]) < b + 1 else None
            data[a][b].append([]) if len(data[a][b]) < c + 1 else None
            data[a][b][c] = value
    return data
# Test
assert import_from_csv("TEST.csv") == tokens
First, I'd make writing this construction in a CSV format independent from dimensions:
import csv
def deep_iter(seq):
    for i, val in enumerate(seq):
        if type(val) is list:
            for others in deep_iter(val):
                yield i, *others
        else:
            yield i, val
with open('TEST.csv', 'w') as f:
    csv.writer(f).writerows(deep_iter(tokens))
Next, we can use the lexicographic order of the indices to recreate the structure. All we have to do is sequentially move deeper into the output list according to the indices of a word. We stop at the penultimate index to get the last list, because the last index is pointing only at the place of the word in this list and doesn't matter due to the natural ordering:
with open('TEST.csv', 'r') as f:
    rows = [*csv.reader(f)]
res = []
for r in rows:
    index = r[:-2] # skip the last index and word
    e = res
    while index:
        i = int(index.pop(0)) # get next part of a current index
        if i < len(e):
            e = e[i]
        else:
            e.append([]) # add new record at this level
            e = e[-1]
    e.append(r[-1]) # append the word to the corresponding list
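The dimension-independent approach above can be checked end to end with a short round trip. This is only a sketch: it writes to an in-memory io.StringIO buffer instead of TEST.csv, the helper name rebuild is made up for the example, and it assumes rows arrive in the order deep_iter produced them.

```python
import csv
import io

def deep_iter(seq):
    # yield (index, index, ..., value) for every leaf of a nested list
    for i, val in enumerate(seq):
        if isinstance(val, list):
            for rest in deep_iter(val):
                yield (i, *rest)
        else:
            yield (i, val)

def rebuild(rows):
    # rows must be in lexicographic index order, as produced by deep_iter
    res = []
    for r in rows:
        e = res
        for part in r[:-2]:          # all indices except the last one
            i = int(part)
            if i == len(e):          # first time this index is seen at this level
                e.append([])
            e = e[i]
        e.append(r[-1])              # the word lands in the innermost list
    return res

tokens = [[[["a"], ["b"], ["c"]], [["d"]]],
          [[["e"], ["f"], ["g"]], [["h"], ["i"], ["j"], ["k"], ["l"]]]]

buf = io.StringIO()
csv.writer(buf).writerows(deep_iter(tokens))
buf.seek(0)
restored = rebuild(list(csv.reader(buf)))
print(restored == tokens)
```

Because the last index is implied by the append order, the same two functions work for any nesting depth, as long as every leaf sits at the same depth.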

Python: Parsing a specific column (from scratch, no "import csv") in tab-separated-file

I've written some code that can parse a string into tuples as such:
s = '30M3I5X'
l = []
num = ""
for c in s:
    if c in '0123456789':
        num = num + c
        print(num)
    else:
        l.append([int(num), c])
        num = ""
print(l)
I.e.;
'30M3I5X'
becomes
[[30, 'M'], [3, 'I'], [5, 'X']]
That part works just fine. I'm struggling now, however, with figuring out how to get the values from the first column of a tab-separated-value file to become my new 's'. I.e.; for a file that looks like:
# File Example #
30M3I45M2I20M I:AAC-I:TC
50M3X35M2I20M X:TCC-I:AG
There would somehow be a loop incorporated to take only the first column, producing
[[30, 'M'],[3, 'I'],[45, 'M'],[2, 'I'],[20, 'M']]
[[50, 'M'],[3, 'X'],[35, 'M'],[2, 'I'],[20, 'M']]
without having to use
import csv
Or any other module.
Thanks so much!
Just open the path to the file and iterate through the records?
def fx(s):
    l=[]
    num=""
    for c in s:
        if c in '0123456789':
            num=num+c
            print(num)
        else:
            l.append([int(num), c])
            num=""
    return l
with open(fp) as f:
    for record in f:
        s, _ = record.split('\t')
        l = fx(s)
        # process l here ...
The following code would serve your purpose:
rows = ['30M3I45M2I20M I:AAC-I:TC', '30M3I45M2I20M I:AAC-I:TC']
for row in rows:
    words = row.split(' ')
    print(words[0])
    l = []
    num = ""
    for c in words[0]:
        if c in '0123456789':
            num = num + c
        else:
            l.append([int(num), c])
            num = ""  # reset so the next count starts fresh
    print(l)
Change row.split(' ') to row.split('\t') or any other separator as needed.
Something like this should do what you're looking for:
filename = r'\path\to\your\file.txt'
with open(filename,'r') as input:
    for row in input:
        elements = row.split()
        # processing goes here
elements[0] contains the string that is the first column of data in the file.
Edit:
To end up with a list of the lists of processed data:
result = []
filename = r'\path\to\your\file.txt'
with open(filename,'r') as input:
    for row in input:
        elements = row.split()
        # processing goes here
        result.append(l) # l is the result of your processing
So this is what ended up working for me--took bits and pieces from everyone, thank you all!
Note: I know it's a bit verbose, but since I'm new, it helps me keep track of everything :)
# Defining the parser function
def col1parser(col1):
    l = []
    num = ""
    for c in col1:
        if c in '0123456789':
            num = num + c
        else:
            l.append([int(num), c])
            num = ""
    print(l)
    return l  # returned so that "l = col1parser(col1)" below receives the list
# Open file, run function on column 1
filename = r'filepath.txt'
with open(filename,'r') as input:
    for row in input:
        elements = row.split()
        col1 = elements[0]
        l = col1parser(col1)
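Pulling the pieces from the answers together, a module-free version might look like the sketch below. The function names parse_ops and first_column are invented for the example. It assumes the columns are separated by a tab and that lines starting with # are comments to skip, and it uses a list of strings in place of the open file.

```python
def parse_ops(s):
    # turn '30M3I5X' into [[30, 'M'], [3, 'I'], [5, 'X']]
    pairs = []
    num = ""
    for c in s:
        if c in '0123456789':
            num += c
        else:
            pairs.append([int(num), c])
            num = ""
    return pairs

def first_column(lines, sep="\t"):
    # yield the parsed first column of every data line
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment/header line
        yield parse_ops(line.split(sep)[0])

# stand-in for iterating over an open file object
sample = [
    "# File Example #\n",
    "30M3I45M2I20M\tI:AAC-I:TC\n",
    "50M3X35M2I20M\tX:TCC-I:AG\n",
]
parsed = list(first_column(sample))
for row in parsed:
    print(row)
```

An open file can be passed to first_column directly, since iterating over a file yields its lines.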

Python two-dimensional list: exporting as delimited text file

I have a 2-D list, myList[r][c], where there are r rows of c columns each.
I am trying to export it into a text file, with each of the column elements delimited by pipes | , and an ampersand & at the end of each row.
myList = [[[] for a in range(c)] for b in range(r)]
{a bunch of code populating myList}
f = open("myfile.txt","w")
for x in range(0,r):
    thisRow = ''
    for y in range(0,c):
        appendThis = myList[x][y]
        thisRow += appendThis + "|"
    f.write(thisRow)
    f.write("&")
f.close
...but I get TypeError: can only concatenate list (not "str") to list on the line where I add the pipe character.
The csv module is made for this. Python 2.7 example:
import csv
r=7
c=5
myList = [[b*10+a for a in range(c)] for b in range(r)]
# Python 2 wants binary mode here; on Python 3 use open("myfile.txt", "w", newline="")
with open("myfile.txt","wb") as f:
    w = csv.writer(f,delimiter='|',lineterminator='&\r\n')
    w.writerows(myList)
Output:
0|1|2|3|4&
10|11|12|13|14&
20|21|22|23|24&
30|31|32|33|34&
40|41|42|43|44&
50|51|52|53|54&
60|61|62|63|64&
Python has great support for CSV. The following is adapted straight from an example:
import csv
with open('myfile.txt', 'wb') as csvfile:
    mywriter = csv.writer(csvfile, delimiter='|', lineterminator='&',
                          quotechar='"', quoting=csv.QUOTE_MINIMAL)
    mywriter.writerow(myList)
Note the use of lineterminator. If you want each line to end with an & followed by a newline, use '&\r\n' as the terminator.
In case I have mixed up your rows and columns, note that you can write one row at a time:
for line in myList:
    mywriter.writerow(line)
So here you have a working version.
The problem was in the first line, where you had:
myList = [[[] for a in range(c)] for b in range(r)]
but that creates a 2-D list whose elements are themselves (empty) lists, so adding a string to one of them raises the TypeError. I have replaced it with a list that simply contains the column index of each element, and each value is converted with str() before concatenating. (It also now adds a newline at the end of each line.)
myList = [[a for a in range(c)] for b in range(r)]
f = open("myfile.txt","w")
for x in range(0,r):
    thisRow = ''
    for y in range(0,c):
        appendThis = myList[x][y]
        thisRow += str(appendThis) + "|"
    f.write(thisRow)
    f.write("&\n")
f.close()
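For Python 3, the same two approaches can be sketched as follows. This is an illustration under assumptions: the data is a small made-up 2-D list of integers, and an in-memory io.StringIO buffer stands in for myfile.txt. Note that, unlike the hand-written loop above, neither version emits a trailing pipe before the ampersand.

```python
import csv
import io

my_list = [[0, 1, 2], [10, 11, 12]]

# Option 1: let the csv module handle delimiters and line endings
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', lineterminator='&\n')
writer.writerows(my_list)
csv_text = buf.getvalue()

# Option 2: plain string joining; str() avoids the TypeError from the question
joined_text = ''.join('|'.join(str(v) for v in row) + '&\n' for row in my_list)

print(csv_text)
print(csv_text == joined_text)
```

When writing to a real file with the csv module on Python 3, open it in text mode with newline='' so the line terminator is written exactly as given.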

Building a dictionary using two files

I am very very new to python and I have been playing around to write a script using two files. File 1 contains a number of ID numbers such as:
1000012
1000015
1000046
1000047
1000050
1000072
1000076
100008
1000102
100013
The other file has few lines of single ID numbers followed by lines made of one ID number followed by other ID numbers which have a + or - at the end:
951450
8951670
8951800
8951863
8951889
9040311
9255087 147+ 206041- 8852164- 4458078- 1424812- 3631438- 8603144+ 4908786- 4780663+ 4643406+ 3061176- 7523696- 5876052- 163881- 6234800- 395660-
9255088 149+ 7735585+ 6359867+ 620034- 4522360- 2810885- 3705265+ 5966368- 7021344+ 9165926- 2477382+ 4015358- 2497281+ 9166415+ 6837601-
9255089 217+ 6544241+ 5181434+ 4625589+ 7433598+ 7295233+ 3938917+ 4109401+ 2135539+ 4960823+ 1838531+ 1959852+ 5698864+ 1925066+ 8212560+ 3056544+ 82N 1751642+ 4772695+ 2396528+ 2673866+ 2963754+ 5087444+ 977167+ 2892617- 7412278- 6920479- 2539680- 4315259- 8899799- 733101- 5281901- 7055760+ 8508290+ 8559218+ 7985985+ 6391093+ 2483783+ 8939632+ 3373919- 924346+ 1618865- 8670617+ 515619+ 5371996+ 2152211+ 6337329+ 284813+ 8512064+ 3469059+ 3405322+ 1415471- 1536881- 8034033+ 4592921+ 4226887- 6578783-
I want to build a dictionary using these two files. My script has to search inside File 2 for the ID numbers in File 1 and append those lines as values in which the key is the number in File 1. Therefore there may be more than one value for each key. I only want to search the lines in File 2 that have more than one number (if len(x) > 1).
the output would be something like: 1000047: 9292540 1000047+ 9126889+ 3490727- 8991434+ 4296324+ 9193432- 3766395+ 9193431+ 8949379- (I need to print each ID number in File1 as the key and as its value, the chunk of lines that contain that ID number as a whole)
Here is my -very wrong- script:
#!/usr/bin/python
f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary
for l in f:
    p = l.rstrip()
    d[p] = list() # sets the keys in the dictionary as p (IDs with newline characters stripped)
    y = z.readlines() # retrieves a string from the path file
    s = "".join(y) # makes a string from y
    x = str.split(s) #splits the path file at white spaces
    if len(x) > 1: # only the lines that include contigs IDs that were used to make another contig
        for lines in y:
            k = lines.rstrip()
            w = tuple(x) # convert list x into a tuple called w
            for i in w:
                if i[:-1] in d:
                    d[p].append(k)
print d
Try:
#!/usr/bin/python
f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary
for l in f:
    p = l.rstrip()
    d[p] = list() # Change #1
f.close()
# Now we have a dictionary with the keys from file1 and empty lists as values
for line in z:
    items = line.split() # items will be a list from 1 line
    if len(items) > 1: # more than initial item in the list
        k = items[0] # First is the key line
        for i in items[1:]: # rest of items
            if i[:-1] in d: # is it in the dict (ID without its trailing + or -)
                d[i[:-1]].append(k) # Add the k value
z.close()
print d
N.B. This is untested code but shouldn't be too far off.
Is this what you are looking for ?? (I have not tested it ...)
#!/usr/bin/python
f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary
for l in f.readlines():
    for l2 in z.readlines():
        if l.rstrip() in l2.rstrip():
            d[l] = l2
    z.seek(0, 0)
f.close()
z.close()
Here is a simpler version of the same code, if you don't want to deal with the file pointer:
f = open("file1")
z = open("file2")
d = dict() # d is an empty dictionary
file1_lines = f.readlines()
file2_lines = z.readlines()
for l in file1_lines:
    for l2 in file2_lines:
        if l.rstrip() in l2.rstrip():
            d[l] = l2
print d
f.close()
z.close()
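The substring test in the two versions above also matches partial IDs (for example, 100008 is found inside 1000082), and each key keeps only the last matching line. A hedged Python 3 sketch that compares whole tokens and collects every matching line is shown below. The function name build_index and the sample data are made up for illustration, and lists of strings stand in for the two open files.

```python
def build_index(id_lines, path_lines):
    # one empty list per ID from file 1
    index = {line.strip(): [] for line in id_lines if line.strip()}
    for line in path_lines:
        items = line.split()
        if len(items) < 2:
            continue  # lines holding a single ID are ignored
        seen = set()
        for token in items[1:]:
            key = token.rstrip("+-")  # drop the trailing orientation sign
            if key in index and key not in seen:
                index[key].append(line.rstrip())  # store the whole line
                seen.add(key)
    return index

ids = ["1000047\n", "100008\n"]
paths = [
    "951450\n",
    "9292540 1000047+ 9126889+ 3490727-\n",
    "9255087 147+ 1000082-\n",
]
result = build_index(ids, paths)
for key, lines in result.items():
    print(key, lines)
```

Using a set-like lookup on whole tokens keeps the work proportional to the size of file 2, rather than comparing every ID against every line.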

How to parse data in a variable length delimited file?

I have a text file that does not conform to any standard format, but I know the (start, end) positions of each column value.
Sample text file:
#     #   #   #
Techy Inn Val NJ
Found the position of # using this code :
f = open('sample.txt', 'r')
i = 0
positions = []
for line in f:
    if line.find('#') > 0:
        print line
        for each in line:
            i += 1
            if each == '#':
                positions.append(i)
1 7 11 15 => Positions
So far, so good! Now, how do I fetch the values from each row based on the positions I fetched? I am trying to construct an efficient loop but any pointers are greatly appreciated guys! Thanks (:
Here's a way to read fixed width fields using regexp
>>> import re
>>> s="Techy Inn Val NJ"
>>> var1,var2,var3,var4 = re.match("(.{5}) (.{3}) (.{3}) (.{2})",s).groups()
>>> var1
'Techy'
>>> var2
'Inn'
>>> var3
'Val'
>>> var4
'NJ'
>>>
Off the top of my head:
f = open(.......)
header = f.next() # get first line
posns = [i for i, c in enumerate(header + "#") if c == '#']
for line in f:
    fields = [line[posns[k]:posns[k+1]] for k in xrange(len(posns) - 1)]
Update with tested, fixed code:
import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#'] + [-1]
print posns
for line in f:
    posns[-1] = len(line)
    fields = [line[posns[k]:posns[k+1]].rstrip() for k in xrange(len(posns) - 1)]
    print fields
Input file:
#      #  #
Foo    BarBaz
123456789abcd
Debug output:
'#      #  #\n'
[0, 7, 10, -1]
['Foo', 'Bar', 'Baz']
['1234567', '89a', 'bcd']
Robustification notes:
This solution caters for any old rubbish (or nothing) after the last # in the header line; it doesn't need the header line to be padded out with spaces or anything else.
The OP needs to consider whether it's an error if the first character of the header is not #.
Each field has trailing whitespace stripped; this automatically removes a trailing newline from the rightmost field (and doesn't run amok if the last line is not terminated by a newline).
Final(?) update: Leapfrogging @gnibbler's suggestion to use slice(): set up the slices once before looping.
import sys
f = open(sys.argv[1])
header = f.next() # get first line
print repr(header)
posns = [i for i, c in enumerate(header) if c == '#']
print posns
slices = [slice(lo, hi) for lo, hi in zip(posns, posns[1:] + [None])]
print slices
for line in f:
    fields = [line[sl].rstrip() for sl in slices]
    print fields
Adapted from John Machin's answer
>>> header = "#     #   #   #"
>>> row = "Techy Inn Val NJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techy ', 'Inn ', 'Val ', 'NJ']
You can also write the last line like this
>>> [row[i:j] for i,j in zip(posns, posns[1:]+[None])]
For the other example you give in the comments, you just need to have the correct header
>>> header = "#       #     #     #"
>>> row = "Techiyi Iniin Viial NiiJ"
>>> posns = [i for i, c in enumerate(header) if c == '#']
>>> [row[slice(*x)] for x in zip(posns, posns[1:]+[None])]
['Techiyi ', 'Iniin ', 'Viial ', 'NiiJ']
>>>
OK, to be a little different and to give the generalized solution asked for in the comments, I use the header line directly, with a generator function instead of slices. Additionally, the leading columns can be treated as a comment by leaving the field name out of the first column, and multi-character field names can be used instead of only '#'.
The downside is that one-character fields cannot have header names; they can only be marked with '#' in the header line (which, as in the previous solutions, is always treated as the beginning of a field, even directly after letters in the header).
sample="""
HOTEL CAT ST DEP ##
Test line Techy Inn Val NJ FT FT
"""
data=sample.splitlines()[1:]
def fields(header,line):
    previndex=0
    prevchar=''
    for index,char in enumerate(header):
        if char == '#' or (prevchar != char and prevchar == ' '):
            if previndex or header[0] != ' ':
                yield line[previndex:index]
            previndex=index
        prevchar = char
    yield line[previndex:]
header,dataline = data
print list(fields(header,dataline))
Output
['Techy Inn ', 'Val ', 'NJ ', 'FT ', 'F', 'T']
One practical use of this is parsing fixed-width data without knowing the field lengths: use as the header a copy of a data line that has all fields present and no comment, with the spaces inside fields replaced by something else such as '_' and single-character field values replaced by '#'.
Header from sample line:
' Techy_Inn Val NJ FT ##'
def parse(your_file):
    first_line = your_file.next().rstrip()
    slices = []
    start = None
    for e, c in enumerate(first_line):
        if c != '#':
            continue
        if start is None:
            start = e
            continue
        slices.append(slice(start, e))
        start = e
    if start is not None:
        slices.append(slice(start, None))
    for line in your_file:
        parsed = [line[s] for s in slices]
        yield parsed
import re
f = open('sample.txt', 'r')
# each match spans one '#' plus the whitespace that follows it
pos = [m.span() for m in re.finditer(r'#\s*', f.next())]
pos[-1] = (pos[-1][0], None)
for line in f:
    print [line[i:j].strip() for i, j in pos]
f.close()
How about this?
with open('somefile','r') as source:
    line = source.next()
    # each field's width is the '#' itself plus the text up to the next '#'
    sizes = [len(part) + 1 for part in line.split("#")[1:]]
    positions = [ (sum(sizes[:x]),sum(sizes[:x+1])) for x in xrange(len(sizes)) ]
    for line in source:
        fields = [ line[start:end] for start,end in positions ]
Is this what you're looking for?
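The answers above are written for Python 2 (f.next(), xrange, print statements). A Python 3 sketch of the slice-based idea follows. The helper names make_slices and split_fixed are made up for the example, and the header spacing is an assumption chosen to match the positions 1, 7, 11, 15 given in the question.

```python
def make_slices(header, marker="#"):
    # every marker starts a field; a field runs up to the next marker,
    # and the last field runs to the end of the line
    starts = [i for i, ch in enumerate(header) if ch == marker]
    return [slice(lo, hi) for lo, hi in zip(starts, starts[1:] + [None])]

def split_fixed(line, slices):
    # strip() removes column padding and the trailing newline
    return [line[s].strip() for s in slices]

header = "#     #   #   #"        # assumed spacing: markers at 0, 6, 10, 14
slices = make_slices(header)
fields = split_fixed("Techy Inn Val NJ\n", slices)
print(fields)
```

With a real file, read the header with next(f) and then call split_fixed on each remaining line; the slices are built once and reused for every row.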
