I have a very large CSV file containing a matrix like this:
null,A,B,C
A,0,2,3
B,3,4,2
C,1,2,4
It is always an n*n matrix. The first column and the first row hold the names. I want to convert it to a 3-column format (also called an edge list, long form, etc.) like this:
A,A,0
A,B,2
A,C,3
B,A,3
B,B,4
B,C,2
C,A,1
C,B,2
C,C,4
I have used:
row = 0
for line in fin:
    line = line.strip("\n")
    col = 0
    tokens = line.split(",")
    for t in tokens:
        fout.write("\n%s,%s,%s" % (row, col, t))
        col += 1
    row += 1
but it doesn't work. Could you please help? Thank you.
You also need to enumerate the column titles as you print out the individual cells.
For a matrix file mat.csv:
null,A,B,C
A,0,2,3
B,3,4,2
C,1,2,4
The following program:
csv = open("mat.csv")
columns = csv.readline().strip().split(',')[1:]
for line in csv:
    tokens = line.strip().split(',')
    row = tokens[0]
    for column, cell in zip(columns, tokens[1:]):
        print('{},{},{}'.format(row, column, cell))
prints out:
A,A,0
A,B,2
A,C,3
B,A,3
B,B,4
B,C,2
C,A,1
C,B,2
C,C,4
For generating only the upper triangle (including the diagonal), you can use the following script:
csv = open("mat.csv")
columns = csv.readline().strip().split(',')[1:]
for i, line in enumerate(csv):
    tokens = line.strip().split(',')
    row = tokens[0]
    for column, cell in zip(columns[i:], tokens[i+1:]):
        print('{},{},{}'.format(row, column, cell))
which results in the output:
A,A,0
A,B,2
A,C,3
B,B,4
B,C,2
C,C,4
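As an aside, if pandas is available (an assumption; the original question doesn't mention it), the same wide-to-long conversion can be sketched with stack():

```python
import io

import pandas as pd

# In-memory stand-in for mat.csv
raw = "null,A,B,C\nA,0,2,3\nB,3,4,2\nC,1,2,4\n"

# index_col=0 turns the first column into the row labels
df = pd.read_csv(io.StringIO(raw), index_col=0)

# stack() flattens the n*n matrix into (row, column, value) triples
edges = df.stack().reset_index()
edges.columns = ["row", "column", "value"]

print(edges.to_csv(index=False, header=False), end="")
```

stack() walks the matrix in row-major order, so the triples come out in the same order as the hand-written loops above.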
You need to skip the first column in each line:
for t in tokens[1:]:
So I've got this code I've been working on for a few days. I need to iterate through a set of CSVs and, using general logic, find the rows that don't have the same number of columns as the others and strip them out of the new CSV. I've gotten the code to this point, but I'm stuck on how to use slicing to strip out the broken row.
Say each row in file A is supposed to have 10 columns, and for some reason row 2,000 logs with only 7 columns. What is the best way to approach this problem so that the code strips row 2,000 out of the new CSV?
# Comments to the right
for f in TD_files:                                            # FOR ALL TREND FILES:
    with open(f, newline='', encoding='latin1') as g:         # open file as read
        r = csv.reader(line.replace('\0', '') for line in g)  # read rows while stripping nulls
        data = [line for line in r]                           # set list to all data in file
    for j in range(0, len(data)):                             # set up data variable
        if data[j][2] != data[j-1][2] and j != 0:             # compare index 2 of rows j and j-1
            print('Index Not Equal')                          # print debug
    data[0] = TDmachineID                                     # add machine ID line
    data[1] = trendHeader                                     # add trend header line
    with open(f, 'w', newline='') as g:                       # open file as write
        w = csv.writer(g)                                     # declare write variable
        w.writerows(data)
(Screenshot omitted: the index to strip.)
EDIT
Since you loop through the whole data anyway, I would replace the \0 in the same list comprehension that checks the length. It looks cleaner to me and works the same.
with open(f, newline='', encoding='latin1') as g:
    raw_data = csv.reader(g)
    data = [[elem.replace('\0', '') for elem in line] for line in raw_data if len(line) == 10]
data[0] = TDmachineID
data[1] = trendHeader
old answer:
You could add a condition to your list comprehension that only keeps lines of length 10.
with open(f, newline='', encoding='latin1') as g:
    r = csv.reader(line.replace('\0', '') for line in g)
    data = [line for line in r if len(line) == 10]  # condition to check if the line is added to your data
data[0] = TDmachineID
data[1] = trendHeader
Consider the following text file excerpt:
Distance,Velocity,Time
(m),(m/s),(s)
1,1,1
2,1,2
3,1,3
I want it to be transformed into this:
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
In other words, I want to concatenate the rows that contain text, and I want them concatenated column-wise.
I am initially manipulating a text file generated by a piece of software. I have successfully transformed it down to only the numeric columns and their headers, in CSV format. But I have multiple header rows for each column, and I need all the information in each header row, because the column attributes differ from file to file. How can I do this in a smart way in Python?
edit: Thank you for your suggestions, they helped me a lot. I used Daweo's solution and added a dynamic row count, because the number of header rows may differ from 2 to 7, depending on the generated output. Here's the code snippet I ended up with.
# Get column headers
import re

a = 0
header_rows = 0
numlines = 0
with open(full, "r") as input:
    Lines = ""
    for line in input:
        l = line
        g = re.sub(' +', ' ', l)
        y = re.sub('\t', ',', g)
        numlines += 1
        if len(l.encode('ANSI')) > 250:
            # finds header start row
            a += 1
        if a > 0:
            # finds header end row
            if "---" in line:
                header_rows = numlines - (numlines - a + 1)
                break
            else:
                # Lines is my headers string
                Lines = Lines + "%s" % (y) + ' '
# Create concatenated column headers
rows = [i.split(',') for i in Lines.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]
for i in cols:
    for j in rows:
        newcolz = [list(c) for c in zip(*rows)]
        print(newcolz)
I would do it the following way:
txt = " Distance,Velocity,Time \n (m),(m/s),(s) \n 1,1,1 \n 2,1,2 \n 3,1,3 \n "
rows = [i.split(',') for i in txt.rstrip().split('\n')]
cols = [list(c) for c in zip(*rows)]
newcols = [[i[0]+i[1], *i[2:]] for i in cols]
newrows = [','.join(i) for i in zip(*newcols)]
newtxt = '\n'.join(newrows)
print(newtxt)
Output:
Distance (m),Velocity(m/s),Time (s)
1,1,1
2,1,2
3,1,3
Crucial here is the use of zip to transpose the data, so I can deal with columns rather than rows. [[i[0]+i[1],*i[2:]] for i in cols] is responsible for the actual concatenation, so if you had headers spanning 3 lines you could do [[i[0]+i[1]+i[2],*i[3:]] for i in cols], and so on.
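Following that hint, the concatenation generalizes to any number of header rows with join; a minimal sketch (n_header is a name introduced here, not from the answer, and is assumed known):

```python
txt = "Distance,Velocity,Time\n(m),(m/s),(s)\n1,1,1\n2,1,2\n3,1,3"
n_header = 2  # assumed: number of header rows to merge

rows = [line.split(',') for line in txt.split('\n')]
cols = list(zip(*rows))
# join the first n_header entries of each column, keep the data entries as-is
newcols = [[''.join(c[:n_header]), *c[n_header:]] for c in cols]
newtxt = '\n'.join(','.join(c) for c in zip(*newcols))

print(newtxt)
```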
I am not aware of anything that exists to do this, so instead you can write a custom function. In the example below the function takes two strings and a separator, which defaults to ",".
It splits each string into a list, then uses a list comprehension with zip to pair up the lists and join each pair.
Lastly, it joins the consolidated headers again with the separator.
def concat_headers(header1, header2, seperator=","):
    headers1 = header1.split(seperator)
    headers2 = header2.split(seperator)
    consolidated_headers = ["".join(values) for values in zip(headers1, headers2)]
    return seperator.join(consolidated_headers)
data = """Distance,Velocity,Time\n(m),(m/s),(s)\n1,1,1\n2,1,2\n3,1,3\n"""
header1, header2, *lines = data.splitlines()
consolidated_headers = concat_headers(header1, header2)
print(consolidated_headers)
print("\n".join(lines))
OUTPUT
Distance(m),Velocity(m/s),Time(s)
1,1,1
2,1,2
3,1,3
You don't really need a function to do it because it can be done like this using the csv module:
import csv
data_filename = 'position_data.csv'
new_filename = 'new_position_data.csv'
with open(data_filename, 'r', newline='') as inp, \
     open(new_filename, 'w', newline='') as outp:
    reader, writer = csv.reader(inp), csv.writer(outp)
    row1, row2 = next(reader), next(reader)
    new_header = [a + b for a, b in zip(row1, row2)]
    writer.writerow(new_header)
    # Copy the rest of the input file.
    for row in reader:
        writer.writerow(row)
Below is some Python code that runs on a file similar to this (old_file.csv):
A,B,C,D
1,2,XX,3
11,22,XX,33
111,222,XX,333
How can I iterate through all lines in old_file.csv (when I don't know the length of the file) and replace all values in column C, i.e. index 2, or cells[row][2] (based on cells[row][col])? I'd like to ignore the header row. In new_file.csv, all values containing 'XX' should become 'YY', for example.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
cells[1][2] = 'YY'
cells[2][2] = 'YY'
cells[3][2] = 'YY'
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(cells)
Just a small change to @Soviut's answer; try this, I think it will help you:
import csv

rows = csv.reader(open('old_file.csv'))
newRows = []
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
    newRows.append(row)
# write rows to a new CSV file; no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(newRows)
You can very easily loop over the list of rows and replace values in the target cell. Note that the reader has to be materialized into a list first; otherwise it is exhausted by the loop and writerows has nothing left to write.
# get rows from the old CSV file
rows = list(csv.reader(open('old_file.csv')))
# iterate over each row and replace the target cell
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
# write rows to a new CSV file; no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(rows)
csv.reader gives you the rows as lists, so once they are read into a list you can simply operate on cells[1:].
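To make that slicing concrete, here is a minimal sketch against an in-memory copy of the sample file (io.StringIO stands in for the real old_file.csv):

```python
import csv
import io

# In-memory stand-in for old_file.csv
raw = "A,B,C,D\n1,2,XX,3\n11,22,XX,33\n111,222,XX,333\n"

cells = list(csv.reader(io.StringIO(raw)))

# cells[1:] skips the header row; column index 2 is column C
for row in cells[1:]:
    row[2] = 'YY'

print(cells)
```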
len(cells) is the number of rows, and starting the iteration at 1 skips the header line.
import csv

r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
for i in range(1, len(cells)):
    cells[i][2] = 'YY'
w = csv.writer(open('new_file.csv', 'w', newline=''))
w.writerows(cells)
read_handle = open('old_file.csv', 'r')
data = read_handle.read().split('\n')
read_handle.close()

new_data = []
new_data.append(data[0])  # keep the header line unchanged
for line in data[1:]:
    if not line:
        new_data.append(line)
        continue
    line = line.split(',')
    line[2] = 'YY'
    new_data.append(','.join(line))

write_handle = open('new_file.csv', 'w')
write_handle.writelines('\n'.join(new_data))
write_handle.close()
So in my CSV I have people's names in the first row, and I'm trying to get the average of the 3 numbers in columns 1-3 (not the first column, though). Is there a way to skip a column so that I can just pull out columns 1-3? Here is my code for getting the average; any help would be much appreciated. So, just to be clear on what I want:
I want to skip a column so that I can successfully get the mean average from columns 1-3.
if order == ("average score"):
    with open("data.csv") as f:
        reader = csv.reader(f)
        columns = f.readline().strip().split(" ")
        numRows = 0
        sums = [1] * len(columns)
        for line in f:
            # Skip empty lines
            if not line.strip():
                continue
            values = line.split(" ")
            for i in range(len(values)):
                sums[i] += int(values[i])
            numRows += 1
        for index, summedRowValue in enumerate(sums):
            print(columns[index], 1.0 * summedRowValue / numRows)
All you should have to do is change your range to start at 1 instead of 0, to skip the first column:
for i in range(1, len(values)):
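Putting that together, a minimal self-contained sketch (the column names and values here are made up; io.StringIO stands in for data.csv, and the csv module replaces the manual split, assuming the file really is comma-separated):

```python
import csv
import io

# Made-up stand-in for data.csv: a name column followed by three scores
raw = "name,s1,s2,s3\nAlice,3,4,5\nBob,1,2,3\n"

reader = csv.reader(io.StringIO(raw))
columns = next(reader)[1:]              # skip the first (name) column in the header
sums = [0] * len(columns)
numRows = 0
for values in reader:
    for i, v in enumerate(values[1:]):  # skip the name column in each row
        sums[i] += int(v)
    numRows += 1

for name, total in zip(columns, sums):
    print(name, total / numRows)  # -> s1 2.0, s2 3.0, s3 4.0
```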
I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing - thank you so much. The final code, which seems to have been deleted from this post, is:
import collections

with open("Class2.txt", mode="r", encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        name, value = line.split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))
    out = list(reversed(rev_out))
    print(out)
Since this looks like CSV data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with each row so that the output can be written in the same order as the input. Use a bounded deque to keep only the last three rows for each name. Finally, sort the rows and write them out.
import csv
from collections import defaultdict, deque

# a deque with maxlen=3 keeps only the last three items appended to it
by_name = defaultdict(lambda: deque(maxlen=3))

with open('my_data.csv') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# sort the surviving rows by line number, discarding the number
rows = [row for _, row in sorted(pair for value in by_name.values() for pair in value)]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)