Reading from two files - python
I am trying to write a script that will take several 2-column files, write the first and second columns from the first file to a result file, and then append only the second column from each of the other files.
Example:
File one:                  File two:
Column 1   Column 2        don't take this column   Column 2
Line 1     Line 2          don't take this column   Line 2
The final result should be:

Result file:
Column 1   Column 2   Column 2
Line 1     Line 2     Line 2
etc.
I have almost everything working except for adding the second columns onto the first. I open the ResultFile in r+ mode, and I want to read out the line that's there (the first file's data), read the corresponding line from each of the other files, append it, and write it back.
Here's the code I have for the second section:
# Open each subsequent file for 2nd-column data
while n < i:
    with open(FileNames[n], "r") as InputFile:   # the with block closes InputFile automatically
        Temp2 = ResultFile.readline()            # ResultFile was opened earlier in r+ mode
        for line in InputFile:
            Temp2 += line.split(",", 1)[-1]
            if line == LastValue:
                break
        if ResultFile.readline() == "":          # readline() returns "" at end of file
            break
        YData += (Temp2 + "\n")
        n += 1
The break ifs are not working quite right at the moment; I just needed a way to end the infinite loop. Also, LastValue is equal to the last x-column value from the first file.
Any help would be appreciated
EDIT
I'm trying to do this without itertools.
It might help to open up all the files first and store them in a list.
fileHandles = []
for f in fileNames:
    fileHandles.append(open(f))
Then you can just readline() them in order for each line in the first file.
dataLine = fileHandles[0].readline()
while dataLine:
    outFields = dataLine.strip().split(",")[0:2]
    for inFile in fileHandles[1:]:
        inLine = inFile.readline()
        field = inLine.split(",")[1].strip()  # strip the trailing newline
        outFields.append(field)
    print ",".join(outFields)
    dataLine = fileHandles[0].readline()
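A small addition of mine, not part of the original answer: nothing above closes the handles when the loop finishes, so a cleanup pass at the end is a reasonable sketch:

# close every input file once all lines have been consumed
for fh in fileHandles:
    fh.close()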
Fundamentally you want to loop over all input files simultaneously the way zip does with iterators.
This example illustrates the pattern without the distraction of files and csvs:
file_row_col = [[['1A1', '1A2'],   # File 1, Row A, Columns 1 and 2
                 ['1B1', '1B2']],  # File 1, Row B, Columns 1 and 2
                [['2A1', '2A2'],   # File 2
                 ['2B1', '2B2']],
                [['3A1', '3A2'],   # File 3
                 ['3B1', '3B2']]]

outrows = []
for rows in zip(*file_row_col):
    outrow = [rows[0][0]]       # Column 1 of the first file
    for row in rows:
        outrow.extend(row[1:])  # Only Column 2 and on
    outrows.append(outrow)

# outrows is now [['1A1', '1A2', '2A2', '3A2'],
#                 ['1B1', '1B2', '2B2', '3B2']]
The key to this is the transformation done by zip(*file_row_col).
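To make that transpose concrete, here is a minimal sketch of my own (the variable name is illustrative, not from the answer):

# zip(*seqs) pairs up the i-th element of every inner sequence,
# turning "one entry per file" into "one entry per row"
matrix = [[1, 2], [3, 4], [5, 6]]
print(list(zip(*matrix)))  # [(1, 3, 5), (2, 4, 6)]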
Now let's reimplement this pattern with actual files. I'm going to use the csv library to make reading and writing the CSVs easier and safer.
import csv

infilenames = ['1.csv', '2.csv', '3.csv']
outfilename = 'result.csv'

with open(outfilename, 'wb') as out:
    outcsv = csv.writer(out)
    infiles = []
    # We can't use `with` with a list of resources, so we use
    # try...finally the old-fashioned way instead.
    try:
        incsvs = []
        for infilename in infilenames:
            infile = open(infilename, 'rb')
            infiles.append(infile)
            incsvs.append(csv.reader(infile))
        for inrows in zip(*incsvs):
            outrow = [inrows[0][0]]  # Column 1 of file 1
            for inrow in inrows:
                outrow.extend(inrow[1:])
            outcsv.writerow(outrow)
    finally:
        for infile in infiles:
            infile.close()
Given these input files:
#1.csv
1A1,1A2
1B1,1B2
#2.csv
2A1,2A2
2B1,2B2
#3.csv
3A1,3A2
3B1,3B2
the code produces this result.csv:
1A1,1A2,2A2,3A2
1B1,1B2,2B2,3B2
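A side note of mine, not part of the original answer: the try/finally bookkeeping above predates Python 3's contextlib.ExitStack, which can close any number of files on exit. A sketch, assuming Python 3 and the same file names:

import contextlib
import csv

infilenames = ['1.csv', '2.csv', '3.csv']

with contextlib.ExitStack() as stack, open('result.csv', 'w', newline='') as out:
    outcsv = csv.writer(out)
    # enter_context registers each file so ExitStack closes it on exit
    incsvs = [csv.reader(stack.enter_context(open(name, newline='')))
              for name in infilenames]
    for inrows in zip(*incsvs):
        outrow = [inrows[0][0]]       # Column 1 of file 1
        for inrow in inrows:
            outrow.extend(inrow[1:])  # Column 2 and on from every file
        outcsv.writerow(outrow)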
Related
Iteration and index dropping using general logic in python
So I've got this code I've been working on for a few days. I need to iterate through a set of CSVs and, using general logic, find the indexes which don't have the same number of columns as index 2 and strip them out of the new CSV. I've gotten the code to this point, but I'm stuck as to how to use slicing to strip the broken index. Say each index in file A is supposed to have 10 columns, and for some reason index 2,000 logs with only 7 columns. What is the best way to approach this problem so the code strips index 2,000 out of the new CSV?

# Comments to the right
for f in TD_files:                                              # FOR ALL TREND FILES:
    with open(f, newline='', encoding='latin1') as g:           # open file as read
        r = csv.reader((line.replace('\0', '') for line in g))  # declare read variable for list while stripping nulls
        data = [line for line in r]                             # set list to all data in file
    for j in range(0, len(data)):                               # set up data variable
        if data[j][2] != data[j-1][2] and j != 0:               # compare index j2 and j2-1
            print('Index Not Equal')                            # print debug
    data[0] = TDmachineID                                       # add machine ID line
    data[1] = trendHeader                                       # add trend header line
    with open(f, 'w', newline='') as g:                         # open file as write
        w = csv.writer(g)                                       # declare write variable
        w.writerows(data)

(screenshot: The Index To Strip)
EDIT: Since you loop through the whole data anyway, I would replace that \0 in the same list comprehension that checks for the length. It looks cleaner to me and works the same.

with open(f, newline='', encoding='latin1') as g:
    raw_data = csv.reader(g)
    data = [[elem.replace('\0', '') for elem in line]
            for line in raw_data if len(line) == 10]
data[0] = TDmachineID
data[1] = trendHeader

Old answer: You could add a condition to your list comprehension so that a line is only kept if it has length 10.

with open(f, newline='', encoding='latin1') as g:
    r = csv.reader((line.replace('\0', '') for line in g))
    data = [line for line in r if len(line) == 10]  # condition to check if the line is added to your data
data[0] = TDmachineID
data[1] = trendHeader
How to read a csv file and create a new csv file after every nth number of rows?
I'm trying to write a function that reads a sheet of an existing .csv file and copies every 20 rows to a newly created csv file. It therefore needs a file counter ("file_01, file_02, file_03, ...") so that the first 20 rows are copied to file_01.csv, the next 20 to file_02.csv, and so on. Currently I have this code, which hasn't worked for me so far.

import csv
import os.path
from itertools import islice

N = 20
new_filename = ""
filename = ""

with open(filename, "rb") as file:  # the a opens it in append mode
    reader = csv.reader(file)
    for i in range(N):
        line = next(file).strip()
        #print(line)
    with open(new_filename, 'wb') as outfh:
        writer = csv.writer(outfh)
        writer.writerow(line)
        writer.writerows(islice(reader, 2))

I have attached a file for testing: https://1drv.ms/u/s!AhdJmaLEPcR8htYqFooEoYUwDzdZbg

32.01,18.42,58.98,33.02,55.37,63.25,12.82,-32.42,33.99,179.53,
41.11,33.94,67.85,57.61,59.23,94.69,19.43,-19.15,21.71,-161.13,
49.80,54.12,72.78,100.74,56.97,128.84,26.95,-6.76,10.07,-142.62,
55.49,81.02,68.93,148.17,49.25,157.32,34.94,5.39,0.44,-123.32,
56.01,112.81,59.27,177.87,38.50,179.63,43.43,18.42,-5.81,-102.24,
50.79,142.87,48.06,-162.32,26.60,-161.21,52.38,34.37,-7.42,-79.64,
41.54,167.36,37.12,-145.93,15.01,-142.84,60.90,57.05,-4.47,-56.54,
30.28,-172.09,27.36,-130.24,5.11,-123.66,66.24,91.12,-0.76,-35.44,
18.64,-153.20,19.52,-114.09,-1.54,-102.96,64.77,131.32,5.12,-21.68,
7.92,-134.07,14.24,-96.93,-3.79,-80.91,57.10,162.35,12.51,-9.21,
-0.34,-113.74,11.80,-78.73,-2.49,-58.46,46.75,-175.86,20.81,2.87,
-4.81,-91.85,11.78,-60.28,0.59,-39.26,35.75,-158.12,29.79,15.71,
-4.76,-68.67,13.79,-43.84,6.82,-24.69,25.27,-141.56,39.05,30.71,
-1.33,-46.42,18.44,-30.23,14.53,-11.95,16.21,-124.45,47.91,50.25,
4.14,-29.61,24.89,-18.02,23.01,0.10,9.59,-106.05,54.46,77.07,
11.04,-15.39,32.33,-6.66,31.92,12.48,6.24,-86.34,55.72,110.53,
18.69,-2.32,40.46,4.57,41.11,26.87,6.07,-65.68,50.25,142.78,
26.94,10.56,49.18,16.67,49.92,45.39,8.06,-46.86,40.13,168.29,
35.80,24.58,58.45,31.99,56.83,70.92,12.96,-31.90,28.10,-171.07,
44.90,41.72,67.41,55.89,59.21,103.94,19.63,-18.67,15.97,-152.40,
-5.41,-77.62,11.40,-63.21,4.80,-29.06,31.33,-151.44,43.00,37.25,
-2.88,-54.38,13.08,-46.00,12.16,-15.86,21.21,-134.62,51.25,59.16,
1.69,-35.73,17.44,-32.01,20.37,-3.78,13.06,-117.10,56.18,88.98,
8.15,-20.80,23.70,-19.66,29.11,8.29,7.74,-98.22,54.91,123.30,
15.52,-7.45,31.04,-8.22,38.22,21.78,5.76,-77.99,47.34,153.31,
23.53,5.38,39.07,2.98,47.29,38.71,6.58,-57.45,36.18,176.74,
32.16,18.76,47.71,14.88,55.08,61.71,9.76,-40.52,23.99,-163.75,
41.27,34.36,56.93,29.53,59.23,92.75,15.53,-26.40,12.16,-145.27,
49.92,54.65,66.04,51.59,57.34,126.97,22.59,-13.65,2.14,-126.20,
55.50,81.56,72.21,90.19,49.88,155.84,30.32,-1.48,-4.71,-105.49,
55.92,113.45,70.26,139.40,39.23,178.48,38.55,10.92,-7.09,-83.11,
50.58,143.40,61.40,172.50,27.38,-162.27,47.25,24.86,-4.77,-60.15,
41.30,167.74,50.34,-166.33,15.74,-143.93,56.21,43.14,-0.54,-38.22,
30.03,-171.78,39.24,-149.48,5.71,-124.87,63.77,70.19,4.75,-24.15,
18.40,-152.91,29.17,-133.78,-1.18,-104.31,66.51,108.81,11.86,-11.51,
7.69,-133.71,20.84,-117.74,-3.72,-82.28,61.95,146.15,20.05,0.65,
-0.52,-113.33,14.97,-100.79,-2.58,-59.75,52.78,172.46,28.91,13.29,
-4.91,-91.36,11.92,-82.84,0.34,-40.12,41.93,-167.91,38.21,27.90,
These are some of the problems with your current solution:

- You created a csv.reader object but then you did not use it.
- You read each line but then you did not store it anywhere.
- You are not keeping track of 20 rows, which was supposed to be your requirement.
- You created the output file in a separate with block, which no longer has access to the read lines or the csv.reader object.

Here's a working solution:

import csv

inp_file = "input.csv"
out_file_pattern = "file_{:{fill}2}.csv"
max_rows = 20

with open(inp_file, "r") as inp_f:
    reader = csv.reader(inp_f)
    all_rows = []
    cur_file = 1
    for row in reader:
        all_rows.append(row)
        if len(all_rows) == max_rows:
            with open(out_file_pattern.format(cur_file, fill="0"), "w") as out_f:
                writer = csv.writer(out_f)
                writer.writerows(all_rows)
            all_rows = []
            cur_file += 1

The flow is as follows:

- Read each row of the CSV using a csv.reader.
- Store each row in an all_rows list.
- Once that list reaches 20 rows, open a file and write all the rows to it, using the csv.writer's writerows method. Use a cur_file counter to format the filename.
- Every time 20 rows are dumped to a file, empty out the list and increment the file counter.

This solution includes the blank lines as part of the 20 rows. Your test file actually has 19 rows of CSV data and 1 row for a blank line. If you need to skip the blank line, just add a simple check of if not row: continue. Also, as I mentioned in a comment, I assume the input file is an actual CSV file, meaning a plain-text file with CSV-formatted data. If the input is actually an Excel file, then solutions like this won't work: you'll need special libraries to read Excel files, even if the contents visually look like CSV or the file is renamed to .csv.
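One edge case worth noting (my observation, not the answer's): if the total row count isn't an exact multiple of 20, the rows left in all_rows after the loop are never written. A small sketch of a fix, reusing the names above:

# flush any leftover rows (< max_rows) into one final file
if all_rows:
    with open(out_file_pattern.format(cur_file, fill="0"), "w") as out_f:
        csv.writer(out_f).writerows(all_rows)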
Without using any special CSV libraries (e.g. csv, though you could; I just don't know how to use them, and I don't think it is necessary for this case), you could:

excel_csv_fp = open(r"<file_name>", "r", encoding="utf-8")  # check the proper encoding for your file
csv_data = excel_csv_fp.readlines()

file_counter = 0
new_file_name = ""
new_fp = ""
for line in csv_data:
    if line.strip() == "":  # blank separator line (readlines() keeps the trailing '\n')
        if new_fp != "":
            new_fp.close()
        file_counter += 1
        new_file_name = "file_" + "{:02d}".format(file_counter)  # 1 turns into 01 while 10 stays 10
        new_fp = open("<some_path>/" + new_file_name + ".csv", "w", encoding="utf-8")  # makes a new CSV file to start writing to
    elif new_fp != "":  # updated code to make sure new_fp is a file pointer and not a string
        new_fp.write(line)  # write each line that follows a blank line

If you have any questions about any of the code (how it works, why I chose what, etc.), just ask in the comments and I'll try to reply as soon as possible.
csv file loop results
I'm trying to extract csv files by the cities with re.findall(), but when I try to do that and write the results to another csv file, it loops over and over many times!

import io
import csv
import re
import codecs  # needed for codecs.open below

lines = 0
outfile1 = codecs.open('/mesh/وسطى.csv', 'w', 'utf_8')
outfile6 = codecs.open('/mesh/أخرى.csv', 'w', 'utf_8')

with io.open('/mishal.csv', 'r', encoding="utf-8", newline='') as f:
    reader = csv.reader(f)
    for row in f:
        for rows in row:
            lines += 1
            #الوسطى
            m = re.findall('\u0634\u0642\u0631\u0627\u0621', row)
            if m:
                outfile1.write(row)
            else:
                outfile6.write(row)
print("saved In to mishal !")
f.close()

I want the re.findall() matching to execute once for each match, not loop so many times whenever there's a match. Here's a screenshot of the output showing the excessive looping:
csv readers return a list for each line of the file: your outer loop iterates over the lines/rows and your inner loop iterates over the items in each row. It isn't clear what you want, but your conditional writes happen for each item in each row. If your intent is to check whether there is a match anywhere in the row, rather than per item:

for row in reader:  # iterate the csv reader, so each row is a list of fields
    match = False
    for item in row:
        lines += 1  #??
        #الوسطى
        if re.search('\u0634\u0642\u0631\u0627\u0621', item):
            match = True
            break
    if match:
        outfile1.write(",".join(row) + "\n")  # a csv row is a list, so join it back into a line
    else:
        outfile6.write(",".join(row) + "\n")

You could accomplish the same thing just by iterating over the lines in the file, without using a csv reader:

with io.open('/mishal.csv', 'r', encoding="utf-8", newline='') as f:
    for line in f:
        #الوسطى
        if re.search('\u0634\u0642\u0631\u0627\u0621', line):
            outfile1.write(line)
        else:
            outfile6.write(line)
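A compact variant of that second snippet (my sketch, not the answerer's): compile the pattern once and let a single with statement manage all three files; the Arabic filenames are the asker's:

import io
import re

pattern = re.compile('\u0634\u0642\u0631\u0627\u0621')
with io.open('/mishal.csv', 'r', encoding='utf-8', newline='') as f, \
        io.open('/mesh/وسطى.csv', 'w', encoding='utf-8') as outfile1, \
        io.open('/mesh/أخرى.csv', 'w', encoding='utf-8') as outfile6:
    for line in f:
        # route each line to the matching or non-matching output file
        (outfile1 if pattern.search(line) else outfile6).write(line)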
Trying to compare two csv files and write differences as output
I'm developing a script which takes the difference between 2 csv files and writes a new csv file with the differences, BUT only if the same 2 rows (referring to row number) in the two input files contain different data, e.g. row 3 has "mike", "basketball player" in file 1 and row 3 in file 2 has "mike", "baseball player". The output csv would grab these, print them, and write them to a csv. It works, but there are some issues (I know this question has been asked several times before, but others have done it differently to me, and since I'm fairly new to programming I don't quite understand their code).

The output in the new csv file has each letter of the output in its own cell (see image below), and I believe it's something to do with the delimiter/quotechar/quoting on line 37. I want the values in their own cells without any full stops, multiple spaces, commas or "|". Another issue is that it takes a long time to run: I'm working with datasets of up to 50,000 rows and it can take over an hour. Why is this, and what advice would help speed it up? Put something outside of the for loop, maybe? I did try the difflib method earlier on, but I was only able to print the entire input_file1, not compare that file with another.

# aim of script is to compare csv files and output difference as a new csv

# import necessary libraries
import csv

# File1 = open(raw_input("path:"),"r") #filename, mode
# File2 = open(raw_input("path:"),"r") #filename, mode

# selects the 2 input files to be compared
input_file1 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book1.csv"
input_file2 = "G:/savestuffhereqwerty/electorate_meshblocks/teststuff/Book2.csv"

# creates the blank output csv file
output_path = "G:/savestuffhereqwerty/electorate_meshblocks/outputs/output2.csv"
a = open(input_file1, "r")
output_file = open(output_path, "w")
output_file.close()

count = 0
with open(input_file1) as fp1:
    for row_number1, row_value1 in enumerate(fp1):
        if row_number1 == count:
            print "got to 1st point"
            value1 = row_value1
        with open(input_file2) as fp2:
            for row_number2, row_value2 in enumerate(fp2):
                if row_number2 == count:
                    print "got to 2nd point"
                    value2 = row_value2
        if value1 == value2:
            print value1, value2
        else:
            print value1, value2
            with open(output_path, 'wb') as f:
                writer = csv.writer(f, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
                # testing to see if the code writes text to the csv
                writer.writerow(["test1"])
                writer.writerow(["test2", "test3", "test4"])
                writer.writerows([value1, value2])
                print "code reached writing stage"
        count += 1

print count
print "done"
# replace(",",".")
Since you want to compare the two files line by line, you should not loop through the second file for every line in the first file. You can simply zip two csv readers and filter the rows:

import csv

input_file1 = "foo"
input_file2 = "bar"
output_path = "baz"

with open(input_file1) as fin1:
    with open(input_file2) as fin2:
        read1 = csv.reader(fin1)
        read2 = csv.reader(fin2)
        diff_rows = (row1 for row1, row2 in zip(read1, read2) if row1 != row2)
        with open(output_path, 'w') as fout:
            writer = csv.writer(fout)
            writer.writerows(diff_rows)

This solution assumes that the two files have the same number of lines.
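If the two files can differ in length (an assumption the answer above explicitly does not make), itertools.zip_longest reports the unmatched tail rows as differences instead of silently dropping them. A sketch with hypothetical file names:

import csv
from itertools import zip_longest

with open("file1.csv") as fin1, open("file2.csv") as fin2, \
        open("diff.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for row1, row2 in zip_longest(csv.reader(fin1), csv.reader(fin2)):
        if row1 != row2:
            # a row present in only one file counts as a difference;
            # write whichever side exists
            writer.writerow(row1 if row1 is not None else row2)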
Python: Pandas, dealing with spaced column names
If I have multiple text files that I need to parse that look like so, but can vary in terms of column names and the length of the hashtags above, how would I go about turning this into a pandas DataFrame? I've tried using pd.read_table('file.txt', delim_whitespace=True, skiprows=14), but it has all sorts of problems. My issues are:

- All the text, asterisks, and pounds at the top need to be ignored, but I can't just use skiprows because the size of all the junk up top can vary in length from file to file.
- The columns "stat (+/-)" and "syst (+/-)" are seen as 4 columns because of the whitespace.
- The one pound sign is included in the column names, and I don't want that.
- I can't just assign the column names manually because they vary from text file to text file.

Any help is much obliged; I'm just not really sure where to go after I read the file using pandas.
Consider reading in the raw file and cleaning it line by line while writing to a new file using the csv module. Regex is used to identify the column header row, using the i as the match criterion. The code below assumes more than one space separates columns:

import os
import csv, re
import pandas as pd

rawfile = "path/To/RawText.txt"
tempfile = "path/To/TempText.txt"

with open(tempfile, 'w', newline='') as output_file:
    writer = csv.writer(output_file)
    with open(rawfile, 'r') as data_file:
        for line in data_file:
            if re.match('^.*i', line):                   # KEEP COLUMN HEADER ROW
                line = line.replace('\n', '')
                row = line.split("  ")                   # columns separated by at least two spaces
                writer.writerow(row)
            elif line.startswith('#') == False:          # REMOVE HASHTAG LINES
                line = line.replace('\n', '')
                row = line.split("  ")
                writer.writerow(row)

df = pd.read_csv(tempfile)                               # IMPORT TEMP FILE
df.columns = [c.replace('# ', '') for c in df.columns]   # REMOVE '#' IN COL NAMES
os.remove(tempfile)                                      # DELETE TEMP FILE
This is the approach I mentioned in the comment: it uses the file object to skip the custom dirty data at the beginning. You land the file offset at the appropriate location in the file, where read_fwf simply does the job:

with open(rawfile, 'r') as data_file:
    while data_file.read(1) == '#':
        last_pound_pos = data_file.tell()
        data_file.readline()
    data_file.seek(last_pound_pos)
    df = pd.read_fwf(data_file)

df
Out[88]:
   i      mult  stat (+/-)  syst (+/-)        Q2         x       x.1       Php
0  0  0.322541    0.018731    0.026681  1.250269  0.037525  0.148981  0.104192
1  1  0.667686    0.023593    0.033163  1.250269  0.037525  0.150414  0.211203
2  2  0.766044    0.022712    0.037836  1.250269  0.037525  0.149641  0.316589
3  3  0.668402    0.024219    0.031938  1.250269  0.037525  0.148027  0.415451
4  4  0.423496    0.020548    0.018001  1.250269  0.037525  0.154227  0.557743
5  5  0.237175    0.023561    0.007481  1.250269  0.037525  0.159904  0.750544
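If this preamble-skipping trick is needed more than once, it packs into a small helper. This is my generalization (the function name and file name are mine), assuming the same '#'-prefixed header layout:

import pandas as pd

def read_fwf_after_pounds(path):
    # Skip leading '#' lines, but seek back so the last one
    # (minus its '#') is read as the header row.
    with open(path, 'r') as data_file:
        last_pound_pos = 0
        while data_file.read(1) == '#':
            last_pound_pos = data_file.tell()
            data_file.readline()
        data_file.seek(last_pound_pos)
        return pd.read_fwf(data_file)

df = read_fwf_after_pounds("RawText.txt")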