So I've got this code I've been working on for a few days. I need to iterate through a set of CSVs and, using general logic, find the rows (indexes) which don't have the same number of columns as index 2, then strip them out of the new CSV. I've gotten the code to this point, but I'm stuck on how to use slicing to strip out a broken index.
Say each index in file A is supposed to have 10 columns, and for some reason index 2,000 logs with only 7 columns. What is the best way to approach this so the code strips index 2,000 out of the new CSV?
#Comments to the right
for f in TD_files:                                              #FOR ALL TREND FILES:
    with open(f, newline='', encoding='latin1') as g:           #open file as read
        r = csv.reader((line.replace('\0', '') for line in g))  #declare read variable for list while stripping nulls
        data = [line for line in r]                             #set list to all data in file
    for j in range(0, len(data)):                               #set up data variable
        if data[j][2] != data[j-1][2] and j != 0:               #compare index j2 and j2-1
            print('Index Not Equal')                            #print debug
    data[0] = TDmachineID                                       #add machine ID line
    data[1] = trendHeader                                       #add trend header line
    with open(f, 'w', newline='') as g:                         #open file as write
        w = csv.writer(g)                                       #declare write variable
        w.writerows(data)
[Screenshot: the index to strip]
EDIT
Since you loop through all the data anyway, I would replace the \0 in the same list comprehension that checks the length. It looks cleaner to me and works the same.
with open(f, newline='', encoding='latin1') as g:
    raw_data = csv.reader(g)
    data = [[elem.replace('\0', '') for elem in line] for line in raw_data if len(line) == 10]
    data[0] = TDmachineID
    data[1] = trendHeader
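To see the filter in isolation, here is a minimal, self-contained sketch; the sample rows and the 10-column requirement are made up for illustration:

```python
import csv
import io

# Hypothetical sample: rows should have 10 columns; the third row is broken (7).
sample = "\n".join([
    ",".join(str(n) for n in range(10)),
    ",".join(str(n) for n in range(10)),
    ",".join(str(n) for n in range(7)),
    ",".join(str(n) for n in range(10)),
])

reader = csv.reader(io.StringIO(sample))
# Strip NULs and keep only the rows with exactly 10 columns.
data = [[elem.replace('\0', '') for elem in line] for line in reader if len(line) == 10]
print(len(data))  # the 7-column row is dropped
```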
old answer:
You could add a condition to your list comprehension that checks whether the line has length 10.
with open(f, newline='', encoding='latin1') as g:
    r = csv.reader((line.replace('\0', '') for line in g))
    data = [line for line in r if len(line) == 10]  #add condition to check if the line is added to your data
    data[0] = TDmachineID
    data[1] = trendHeader
Related
I am trying to read a .txt file and save the data in each column as a list. Each column in the file contains a variable which I will later use to plot a graph. I have tried looking up the best method to do this, and most answers recommend opening the file, reading it, and then either splitting or saving the columns as a list. The data in the .txt is as follows -
0 1.644231726
0.00025 1.651333945
0.0005 1.669593478
0.00075 1.695214575
0.001 1.725409504
The delimiter is a space ' ' or a tab '\t'. I have used the following code to try to append the columns to my variables -
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter='\t')
    time = []
    rim = []
    for line in readfile:
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
However, when I try to print the lists, time and rim, using print(time, rim), I get the following error message -
r = line[1]
IndexError: list index out of range
I am, however, able to print only the 'time' if I comment out the r=line[1] and rim.append(r) parts. How do I approach this problem? Thank you in advance!
I would suggest the following:
import pandas as pd
df = pd.read_csv('./rvt.txt', sep='\t', header=None, names=[a list with your column names])
Then you can use list(your_column) to work with your columns as lists
The problem is with the delimiter. The dataset contains runs of multiple spaces ' '.
When you use '\t' and print each line, you can see it is not being split on the delimiter.
eg:
['0 1.644231726']
['0.00025 1.651333945']
['0.0005 1.669593478']
['0.00075 1.695214575']
['0.001 1.725409504']
To get the desired result you can use a space as the delimiter and filter out the empty values:
readfile = csv.reader(file, delimiter=" ")
time, rim = [], []
for line in readfile:
    line = list(filter(lambda x: len(x), line))
    t = line[0]
    r = line[1]
Here is the code to do this:
import csv
with open('./rvt.txt') as file:
    readfile = csv.reader(file, delimiter=" ")
    time = []
    rim = []
    for line in readfile:
        line = list(filter(lambda x: len(x), line))  #drop empty fields caused by repeated spaces
        t = line[0]
        r = line[1]
        time.append(t)
        rim.append(r)
print(time, rim)
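As an aside (not part of the answer above), `str.split()` with no argument splits on any run of whitespace, so mixed spaces and tabs can also be handled without csv at all. A sketch with made-up lines:

```python
# str.split() with no argument collapses runs of spaces/tabs,
# so no empty fields appear even with repeated separators.
lines = ["0 1.644231726", "0.00025   1.651333945", "0.0005\t1.669593478"]
time, rim = [], []
for line in lines:
    t, r = line.split()  # splits on any whitespace run
    time.append(t)
    rim.append(r)
print(time)
print(rim)
```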
Below is some python code that runs on a file similar to this (old_file.csv).
A,B,C,D
1,2,XX,3
11,22,XX,33
111,222,XX,333
How can I iterate through all lines in old_file.csv (without knowing the length of the file) and replace all values in column C (index 2, i.e. cells[row][2]), while ignoring the header row? In new_file.csv, all values containing 'XX' should become 'YY', for example.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
cells[1][2] = 'YY'
cells[2][2] = 'YY'
cells[3][2] = 'YY'
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
Just a small change to #Soviut's answer; I think this will help you:
import csv
rows = csv.reader(open('old_file.csv'))
newRows = []
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
        newRows.append(row)
# write rows to new CSV file, no header is written unless explicitly told to
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(newRows)
You can very easily loop over the array of rows and replace values in the target cell.
# get rows from old CSV file (as a list, so they can be modified and re-written)
rows = list(csv.reader(open('old_file.csv')))
# iterate over each row and replace target cell
for i, row in enumerate(rows):
    # ignore the first row, modify all the rest
    if i > 0:
        row[2] = 'YY'
# write all rows (header included) to the new CSV file
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(rows)
csv reader makes arrays, so you could just run it on r[1:]
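That slicing idea as a small self-contained sketch, using rows borrowed from the question:

```python
import csv
import io

rows = list(csv.reader(io.StringIO("A,B,C,D\n1,2,XX,3\n11,22,XX,33\n")))
# rows[0] is the header; rows[1:] are the data rows.
for row in rows[1:]:
    row[2] = 'YY'  # replace column C in every data row
print(rows)
```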
len(cells) is the number of rows. Iterating from 1 makes it skip the header line. Also the lines should be cells.
import csv
r = csv.reader(open('old_file.csv'))
cells = [l for l in r]
for i in range(1, len(cells)):
    cells[i][2] = 'YY'
w = csv.writer(open('new_file.csv', 'wb'))
w.writerows(cells)
read_handle = open('old_file.csv', 'r')
data = read_handle.read().split('\n')
read_handle.close()
new_data = []
new_data.append(data[0])
for line in data[1:]:
    if not line:
        new_data.append(line)
        continue
    line = line.split(',')
    line[2] = 'YY'
    new_data.append(','.join(line))
write_handle = open('new_file.csv', 'w')
write_handle.writelines('\n'.join(new_data))
write_handle.close()
I need help sorting a list from a text file. I'm reading a .txt and then adding some data, then sorting it by population change %, then lastly, writing that to a new text file.
The only thing that's giving me trouble now is the sort function. I think the for statement syntax is what's giving me issues -- I'm unsure where in the code I would add the sort statement and how I would apply it to the output of the for loop statement.
The population change data I am trying to sort by is the [1] item in the list.
#Read file into script
NCFile = open("C:\\filelocation\\NC2010.txt")
#Save a write file
PopulationChange = open("C:\\filelocation\\Sorted_Population_Change_Output.txt", "w")
#Read everything into lines, except for first(header) row
lines = NCFile.readlines()[1:]
#Pull relevant data and create population change variable
for aLine in lines:
    dataRow = aLine.split(",")
    countyName = dataRow[1]
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    popChange = ((population2010-population2000)/population2000)*100
    outputRow = countyName + ", %.2f" % popChange + "%\n"
    PopulationChange.write(outputRow)
NCFile.close()
PopulationChange.close()
You can fix your issue with a couple of minor changes. Split the line as you read it in and loop over the sorted lines:
lines = [aLine.split(',') for aLine in NCFile][1:]
#Pull relevant data and create population change variable
for dataRow in sorted(lines, key=lambda row: row[1]):
    population2000 = float(dataRow[6])
    population2010 = float(dataRow[8])
    ...
However, if this is a csv you might want to look into the csv module. In particular, DictReader will read in the data as a list of dictionaries based on the header row. I'm making up the field names below, but you should get the idea. You'll notice I sort the data based on 'countyName' as it is read in:
from csv import DictReader, DictWriter

with open("C:\\filelocation\\NC2010.txt") as NCFile:
    reader = DictReader(NCFile)
    data = sorted(reader, key=lambda row: row['countyName'])

for row in data:
    population2000 = float(row['population2000'])
    population2010 = float(row['population2010'])
    popChange = ((population2010-population2000)/population2000)*100
    row['popChange'] = "{0:.2f}".format(popChange)

with open("C:\\filelocation\\Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    writer = DictWriter(PopulationChange, fieldnames=['countyName', 'popChange'], extrasaction='ignore')
    writer.writeheader()
    writer.writerows(data)
This will give you a 2 column csv of ['countyName', 'popChange']. You would need to replace these with the correct fieldnames.
You need to read all of the lines in the file before you can sort it. I've created a list called change to hold the tuple pair of the population change and the country name. This list is sorted and then saved.
with open("NC2010.txt") as NCFile:
    lines = NCFile.readlines()[1:]

change = []
for line in lines:
    row = line.split(",")
    country_name = row[1]
    population_2000 = float(row[6])
    population_2010 = float(row[8])
    pop_change = ((population_2010 / population_2000) - 1) * 100
    change.append((pop_change, country_name))
change.sort()

output_rows = []
[output_rows.append("{0}, {1:.2f}\n".format(pair[1], pair[0]))
 for pair in change]

with open("Sorted_Population_Change_Output.txt", "w") as PopulationChange:
    PopulationChange.writelines(output_rows)
I used a list comprehension to generate the output rows which swaps the pair back in the desired order, i.e. country name first.
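The tuple trick in that answer works because Python compares tuples element by element, so `(pop_change, name)` pairs sort by the change first. A tiny sketch with made-up values:

```python
# Tuples compare element by element, so sorting (pop_change, name) pairs
# orders by the change first, with the name as a tie-breaker.
change = [(2.5, 'Wake'), (-1.0, 'Tyrrell'), (0.3, 'Durham')]
change.sort()
ordered = ["{0}, {1:.2f}".format(name, pct) for pct, name in change]
print(ordered)
```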
I am trying to write a script that will take several 2 column files, write the first and second columns from the first one to a result file and then only the second columns from all other files and append them on.
Example:
File one              File two
Column 1   Column 2   dont take this column   Column 2
Line 1     Line 2     dont take this column   Line 2

The final result should be

Result file
Column 1   Column 2   Column 2
Line 1     Line 2     Line 2

etc
I have almost everything working except for adding the second columns onto the first. I open the ResultFile as r+; I want to read out the line that's there (the first file's data), then read the corresponding line from each of the other files, append it, and put it back in.
Here's the code I have for the second section:
#Open each subsequent file for 2nd column data
while n < i:
    with open(FileNames[n], "r") as InputFile:
        with ResultFile:
            Temp2 = ResultFile.readline()
            for line in InputFile:
                Temp2 += line.split(",", 1)[-1]
                if line == LastValue:
                    break
                if ResultFile.readline() == "":
                    break
            YData += (Temp2 + "\n")
    n += 1
    InputFile.close()
The break ifs are not working quite right at the moment; I just needed a way to end the infinite loop. Also, LastValue is the last x-column value from the first file.
Any help would be appreciated
EDIT
I'm trying to do this without itertools.
It might help to open up all the files first and store them in a list.
fileHandles = []
for f in fileNames:
    fileHandles.append(open(f))
Then you can just readline() them in order for each line in the first file.
dataLine = fileHandles[0].readline()
while dataLine:
    outFields = dataLine.strip().split(",")[0:2]
    for inFile in fileHandles[1:]:
        dataLine = inFile.readline()
        field = dataLine.strip().split(",")[1]
        outFields.append(field)
    print(",".join(outFields))
    dataLine = fileHandles[0].readline()
Fundamentally you want to loop over all input files simultaneously the way zip does with iterators.
This example illustrates the pattern without the distraction of files and csvs:
file_row_col = [[['1A1', '1A2'],   # File 1, Row A, Column 1 and 2
                 ['1B1', '1B2']],  # File 1, Row B, Column 1 and 2
                [['2A1', '2A2'],   # File 2
                 ['2B1', '2B2']],
                [['3A1', '3A2'],   # File 3
                 ['3B1', '3B2']]]
outrows = []
for rows in zip(*file_row_col):
    outrow = [rows[0][0]]       # Column 1 of the first file
    for row in rows:
        outrow.extend(row[1:])  # Only Column 2 and on
    outrows.append(outrow)

# outrows is now [['1A1', '1A2', '2A2', '3A2'],
#                 ['1B1', '1B2', '2B2', '3B2']]
The key to this is the transformation done by zip(*file_row_col).
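A tiny demonstration of that `zip(*...)` transposition, trimmed to one column per row:

```python
file_row_col = [[['1A'], ['1B']],  # file 1: rows A and B
                [['2A'], ['2B']],  # file 2
                [['3A'], ['3B']]]  # file 3
# zip(*...) regroups "files of rows" into "rows aligned across files".
regrouped = list(zip(*file_row_col))
print(regrouped[0])  # row A from every file
print(regrouped[1])  # row B from every file
```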
Now let's reimplement this pattern with actual files. I'm going to use the csv library to make reading and writing the csvs easier and safer.
import csv

infilenames = ['1.csv', '2.csv', '3.csv']
outfilename = 'result.csv'

with open(outfilename, 'wb') as out:
    outcsv = csv.writer(out)
    infiles = []
    # We can't use `with` with a list of resources, so we use
    # try...finally the old-fashioned way instead.
    try:
        incsvs = []
        for infilename in infilenames:
            infile = open(infilename, 'rb')
            infiles.append(infile)
            incsvs.append(csv.reader(infile))
        for inrows in zip(*incsvs):
            outrow = [inrows[0][0]]  # Column 1 of file 1
            for inrow in inrows:
                outrow.extend(inrow[1:])
            outcsv.writerow(outrow)
    finally:
        for infile in infiles:
            infile.close()
Given these input files:
#1.csv
1A1,1A2
1B1,1B2
#2.csv
2A1,2A2
2B1,2B2
#3.csv
3A1,3A2
3B1,3B2
the code produces this result.csv:
1A1,1A2,2A2,3A2
1B1,1B2,2B2,3B2
I have two sets of data, each with a little over 13,000 rows. One of them (the one I open as a csv in the main function) has two columns that I need to match up to the other file (opened as a text file and put into a list of dictionaries in the example_05() function).
They are from the same source, and I need to make sure the data stays aligned when I add the last two parameters to each row in the list of dicts. The .csv file has about 20 extra rows compared to the list of dicts, so it must contain extra or null data.
To delete these anomalous rows, I'm comparing the list of Q* values from the .csv file, by index, to the {'Q*':} value in each dictionary in the list of dictionaries (each dictionary is a row), looking for mismatches, since they should be the same. On a mismatch I delete the item from mass_list before adding it to the list of dictionaries at the end of the example_05() function.
When I try to compare them I get an 'IndexError: list index out of range' error at this line:
if row10['Q*'] != Q_list_2[check_index]:
Can anybody tell me why? Here's example_05() and the main function:
def example_05(filename):
    with open(filename, 'r') as file:
        data = file.readlines()
    header, data = data[0].split(), data[1:]
    #...... convert each line to a dict, using header words as keys
    global kept
    kept = []
    for line in data:
        line = [to_float(term) for term in line.split()]
        kept.append(dict(zip(header, line)))
    del mass_list[0]
    mass_list_2 = [to_float(j) for j in mass_list]
    del Q_list[0]
    Q_list_2 = [to_float(k) for k in Q_list]
    print "Number in Q_list_2 list = "
    print len(Q_list_2)
    check_index = 0
    delete_index = 0
    for row10 in kept:
        if row10['Q*'] != Q_list_2[check_index]:
            del mass_list_2[delete_index]
            del Q_list_2[delete_index]
            check_index += 1
            delete_index += 1
        else:
            check_index += 1
            delete_index += 1
            continue
    k_index = 0
    for d in kept:
        d['log_10_m'] = mass_list_2[k_index]
        k_index += 1
    print "Number in mass_list_2 list = "
    print len(mass_list_2)
if __name__ == '__main__':
    f = open('MagandMass20150401.csv')
    csv_f = csv.reader(f)
    mag_list = []
    mass_list = []
    Q_list = []
    for row in csv_f:
        mag_list.append(row[17])
        mass_list.append(row[18])
        Q_list.append(row[15])
    del csv_f
    f.close()
    example_05('summ20150401.txt')
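A guess at the cause, shown with a minimal sketch (the values below are made up, not the asker's real data): both branches advance check_index, while deletions shrink Q_list_2, so check_index eventually runs past the end of the shortened list; and after the first deletion every later comparison is against the wrong element anyway. One safer pattern is to advance the csv-side index only until it matches the kept row:

```python
# Hypothetical data: the csv lists have extra rows not present in the txt file.
kept_q = [1.0, 2.0, 4.0]              # Q* values from the txt file
q_csv = [1.0, 2.0, 3.0, 4.0, 9.9]     # Q* column from the csv (extras: 3.0, 9.9)
mass_csv = ['m1', 'm2', 'bad', 'm4', 'bad2']

aligned_mass = []
i = 0                                 # index into the csv lists
for q in kept_q:
    # skip csv rows until the Q* values line up
    while i < len(q_csv) and q_csv[i] != q:
        i += 1
    if i < len(q_csv):
        aligned_mass.append(mass_csv[i])
        i += 1
print(aligned_mass)  # the extra csv rows are skipped
```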