Reading from a particular tuple onwards from a file in Python
Using seek and tell does not work here, since tell() returns the current position in bytes; I need the line number rather than the byte position of the file pointer to proceed.
I have a file glass.csv and I need to cluster the datasets. Each line in the file ends with a class number (1, 2, 3, ...), like the lines below:
65,1.52172,13.48,3.74,0.90,72.01,0.18,9.61,0.00,0.07,1
66,1.52099,13.69,3.59,1.12,71.96,0.09,9.40,0.00,0.00,1
67,1.52152,13.05,3.65,0.87,72.22,0.19,9.85,0.00,0.17,1
68,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,0.00,0.17,1
69,1.52152,13.12,3.58,0.90,72.20,0.23,9.82,0.00,0.16,1
70,1.52300,13.31,3.58,0.82,71.99,0.12,10.17,0.00,0.03,1
71,1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12,2
72,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0.00,0.32,2
73,1.51593,13.09,3.59,1.52,73.10,0.67,7.83,0.00,0.00,2
74,1.51631,13.34,3.57,1.57,72.87,0.61,7.89,0.00,0.00,2
142,1.51851,13.20,3.63,1.07,72.83,0.57,8.41,0.09,0.17,2
143,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,0.06,0.25,2
144,1.51709,13.00,3.47,1.79,72.72,0.66,8.18,0.00,0.00,2
145,1.51660,12.99,3.18,1.23,72.97,0.58,8.81,0.00,0.24,2
146,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.00,0.35,2
147,1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00,3
148,1.51610,13.33,3.53,1.34,72.67,0.56,8.33,0.00,0.00,3
149,1.51670,13.24,3.57,1.38,72.70,0.56,8.44,0.00,0.10,3
150,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0.00,0.00,3
I need to take some of the tuples having 1 as the last number and save them in one file (train.txt), and the remaining ones in another file (test.txt). Likewise, I need to take certain lines from those having 2 as the last number and append them to the first file, train.txt, and the rest to test.txt.
I cannot get the second set of lines; it appends the first result again instead.
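Since tell() reports a byte offset rather than a line number, one way to keep track of lines is to enumerate the file object while reading. A minimal sketch (the helper name is made up for illustration):

```python
# Track line numbers while reading, instead of relying on f.tell().
# read_from_line is a hypothetical helper name.
def read_from_line(path, start_line):
    """Yield (line_number, line) pairs starting at start_line (1-based)."""
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if lineno >= start_line:
                yield lineno, line.rstrip("\n")
```

Because enumerate counts as the file is consumed, this avoids any byte-offset bookkeeping entirely.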
The easiest way, assuming that you have a large file and cannot simply load it all into memory, would be to use one file per class to do your sorting. If it is a small(ish) input file, then just load it as a comma-separated file using the csv module.
As a quick and dirty method (assuming smallish files):
data = []
with open('glass.csv', 'r') as infile:
    for line in infile:
        linedata = [float(val) for val in line.strip().split(',')]
        data.append(linedata)
adata = sorted(data, key=lambda items: items[-1])
## Then open both your output files and write out the required rows.
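For instance, the sorted rows could be split by taking the first few rows of each class for train.txt and the rest for test.txt. This sketch uses an arbitrary per-class count of 3 and a hypothetical helper name:

```python
# Split sorted rows: the first n_train rows of each class go to the training
# set, the rest to the test set. n_train=3 is an arbitrary illustration value.
def split_rows(adata, n_train=3):
    train, test, counts = [], [], {}
    for row in adata:
        label = row[-1]
        counts[label] = counts.get(label, 0) + 1
        (train if counts[label] <= n_train else test).append(row)
    return train, test
```

The two lists can then be written out row by row with ','.join() to train.txt and test.txt.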
The default behavior for reading a text file is line by line. You can just do something like this:
with open('input.csv', 'r') as f, open('output_1.csv', 'w') as output_1, open('output_2.csv', 'w') as output_2:
    for line in f:
        line_fields = line.strip().split(',')
        if line_fields[-1] == '1':
            output_1.write(line)
            continue
        if line_fields[-1] == '2':
            output_2.write(line)
Or you can use the csv module; it's much easier: https://docs.python.org/2/library/csv.html
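A minimal sketch of the csv-module approach, routing rows ending in '1' to train.txt and the rest to test.txt for brevity (splitting within a class would add a per-class counter; the helper name is hypothetical):

```python
import csv

# Route rows by their last column: '1' goes to the training file,
# everything else to the test file. route_rows is a hypothetical name.
def route_rows(src, train_path, test_path):
    with open(src, newline='') as f, \
         open(train_path, 'w', newline='') as train, \
         open(test_path, 'w', newline='') as test:
        train_w, test_w = csv.writer(train), csv.writer(test)
        for row in csv.reader(f):
            if row:
                (train_w if row[-1] == '1' else test_w).writerow(row)
```

csv.reader handles splitting and quoting, so there is no manual strip()/split() to get wrong.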
Related
Trying to remove rows in a CSV file based on a column value
I'm trying to remove duplicated rows in a csv file based on whether a column has a unique value. My code looks like this:

seen = set()
for line in fileinput.FileInput('DBA.csv', inplace=1):
    if line[2] in seen:
        continue  # skip duplicated line
    seen.add(line[2])
    print(line, end='')

I'm trying to get the value of the column at index 2 in every row and check if it's unique. But for some reason my seen set looks like this:

{'b', '"', 't', '/', 'k'}

Any advice on where my logic is flawed?
You're reading your file line by line, so when you pick line[2] you're actually picking the third character of each line. If you want to capture the value of the third column for each row, you need to parse your CSV first, something like:

import csv

seen = set()
with open("DBA.csv", "rUb") as f:
    reader = csv.reader(f)
    for line in reader:
        if line[2] in seen:
            continue
        seen.add(line[2])
        print(line)  # this will NOT print valid CSV, it will print a Python list

If you want to edit your CSV in place, I'm afraid it will be a bit more complicated than that. If your CSV is not huge, you can load it in memory, truncate the file and then write down your lines:

import csv

seen = set()
with open("DBA.csv", "rUb+") as f:
    handler = csv.reader(f)
    data = list(handler)
    f.seek(0)
    f.truncate()
    handler = csv.writer(f)
    for line in data:
        if line[2] in seen:
            continue
        seen.add(line[2])
        handler.writerow(line)

Otherwise you'll have to read your file line by line, use a buffer that you pass to csv.reader() to parse it, check the value of its third column and, if not seen, write the line to the live-editing file. If seen, you'll have to seek back to the previous line beginning before writing the next line, and so on. Of course, you don't need the csv module if you know your line structure well, which can simplify things (you won't need to deal with passing buffers left and right), but for a universal solution it's highly advisable to let the csv module do your bidding.
I want to replace a certain column of a file with a list - Python
I have a stock file which looks like this:

12334232:seat belt:2.30:12:10:30
14312332:toy card:3.40:52:10:30
12512312:xbox one:5.30:23:10:30
12543243:laptop:1.34:14:10:30
65478263:banana:1.23:23:10:30
27364729:apple:4.23:42:10:30
28912382:orange:1.12:16:10:30
12892829:elephant:6.45:14:10:30

After a certain transaction, I want to replace the items in the fourth column with the numbers in the sixth column if they are below the numbers in the fifth column. How would I replace the items in the fourth column? Every time I use the following lines of code, it overwrites the whole file with nothing (deletes everything):

for line in stockfile:
    c = line.split(":")
    print("pass")
    if stock_order[i] == User_list[i][0]:
        stockfile.write(line.replace(current_stocklevel_list[i], reorder_order[i]))
    else:
        i = i + 1

I want the stockfile to look like this after it has replaced the necessary items in the column:

12334232:seat belt:2.30:30:10:30
14312332:toy card:3.40:30:10:30
12512312:xbox one:5.30:30:10:30
12543243:laptop:1.34:30:10:30
65478263:banana:1.23:30:10:30
27364729:apple:4.23:30:10:30
28912382:orange:1.12:30:10:30
12892829:elephant:6.45:30:10:30
If you are opening the file after some time, you should use "a" (append) as the mode so that the file doesn't get truncated. The write pointer will automatically be at the end of the file. So:

f = open("filename", "a")
f.seek(0)  # to start from the beginning

But if you want to read and write, then add "+" to the mode and the file won't be truncated either:

f = open("filename", "r+")

Both the read and write pointers will be at the beginning of the file; you'll need to seek to the position where you wish to start writing/reading. But you are doing it wrong. See, the file's content will be overwritten, not inserted automatically. If you are in a writable mode and at the end of the file, content will be added. So, you either need to load the whole file, make the changes you need and write everything back; or you have to write changes at some point and shift the remaining content, truncating the file if the new content is shorter than before. The mmap module can help you treat the file as a string: you will be able to shift data efficiently and resize the file. But if you really want to change the file in place, you should have a file with fixed-length columns. Then, when you want to change a value, you do not need to shift anything back and forth. Just find the right row and column, seek there, and write the new value over the old one (making sure to overwrite all of the old value).
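To illustrate the fixed-width idea: if every row has the same byte length, the offset of any field can be computed and overwritten directly. A sketch under that assumption; the layout constants and helper name are hypothetical:

```python
# In-place overwrite is safe when rows and fields have fixed widths:
# the new value is exactly as long as the old one, so nothing shifts.
ROW_LEN = 10       # bytes per row, newline included (hypothetical layout)
FIELD_OFFSET = 5   # byte offset of the target field within a row
FIELD_WIDTH = 4    # width of the target field

def overwrite_field(path, row_index, value):
    with open(path, "r+b") as f:
        f.seek(row_index * ROW_LEN + FIELD_OFFSET)
        f.write(str(value).rjust(FIELD_WIDTH).encode())
```

The value is right-padded to the field width, so the old contents are fully replaced and the file length never changes.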
You should try to read in the data first:

with open('inputfile', 'r') as infile:
    data = infile.readlines()

Then you can loop over the data, edit it as needed and write it out:

import random

with open('outputfile', 'w') as outfile:
    for line in data:
        c = line.strip().split(":")
        if random.randint(1, 3) == 1:  # update fourth column based on some good reason
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')

Or you could do it in one go with something like:

import os
import random

with open('inputfile', 'r') as infile, open('outputfile', 'w') as outfile:
    for line in infile:
        c = line.strip().split(":")
        if random.randint(1, 3) == 1:  # update fourth column based on some good reason
            c[3] = str(int(c[3]) + 2)
        outfile.write(':'.join(c) + '\n')
os.rename('outputfile', 'inputfile')
How can I append to the new line of a file while using write()?
In Python: let's say I have a loop, during each cycle of which I produce a list with the following format: ['n1','n2','n3']. After each cycle I would like to append the produced entry to a file (which contains all the outputs from the previous cycles). How can I do that? Also, is there a way to make a list whose entries are the outputs of the cycles, i.e. [[],[],[]] where each internal [] is ['n1','n2','n3'], etc.?
Writing a single list as a line to a file

Surely you can write it into a file, after converting it to a string:

with open('some_file.dat', 'w') as f:
    for x in xrange(10):  # assume 10 cycles
        line = []
        # ... (here is your code, appending data to line) ...
        f.write('%r\n' % line)  # write the representation on a separate line

Writing all lines at once

When it comes to the second part of your question:

Also, is there a way to make a list whose entries are the outputs of this cycle? i.e. [[],[],[]] where each internal [] = ['n1','n2','n3'] etc.

it is also pretty basic. Assuming you want to save it all at once, just write:

lines = []  # container for a list of lines
for x in xrange(10):  # assume 10 cycles
    line = []
    # ... (here is your code, appending data to line) ...
    lines.append('%r\n' % line)  # add the line to the list of lines

# here "lines" is your list of cycle results
with open('some_file.dat', 'w') as f:
    f.writelines(lines)

A better way of writing a list to a file

Depending on what you need, you should probably use one of the more specialized formats rather than just a text file. Instead of writing list representations (which are okay, but not ideal), you could use e.g. the csv module (similar to Excel's spreadsheets): http://docs.python.org/3.3/library/csv.html
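A short sketch of the csv variant for the append-per-cycle case (Python 3 style; the helper name is made up):

```python
import csv

# Append one result list per cycle as a CSV row.
def append_row(path, row):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)
```

Called once per cycle, this grows the file one row at a time, and csv.reader can later parse the rows back into lists without eval-ing repr() output.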
f = open(file, 'a')

The first parameter is the path of the file; the second is the mode: 'a' is append, 'w' is write, 'r' is read, and so on. In my opinion, you can use f.write(str(list) + '\n') to write a line in a loop; otherwise you can use f.writelines(lines) with a list of strings, which also works.
Hope this can help you:

lVals = []
with open(filename, 'a') as f:
    for x, y, z in zip(range(10), range(5, 15), range(10, 20)):
        lVals.append([x, y, z])
        f.write(str(lVals[-1]) + '\n')
Sorting a list from a file, outputting in another file
I am trying to find the min and max in a csv file and have them output into a text file. Currently my code outputs all the data into the output file, and I am unsure of how to grab the data out of the multiple columns and have them sorted accordingly. Any guidance would be appreciated, as I don't have a good lead on how to figure this out.

read_file = open("riskfactors.csv", 'r')

def create_file():
    read_file = open("riskfactors.csv", 'r')
    write_file = open("best_and_worst.txt", "w")
    for line_str in read_file:
        read_file.readline()
        print(line_str, file=write_file)
    write_file.close()
    read_file.close()
Assuming your file is a standard .csv file containing only numbers separated by semicolons:

1;5;7;6;
3;8;1;1;

Then it's easiest to use the str.split() method, followed by a type conversion to int. You could store all values in a list (or, quicker, a set) and then get the maximum and minimum:

valuelist = []
for line_str in read_file:
    for cell in line_str.split(";"):
        if cell.strip():  # skip the empty cell after the trailing semicolon
            valuelist.append(int(cell))
print(max(valuelist))
print(min(valuelist))

Warning: if your file contains non-number entries, you'd have to filter them out. .csv files can also have different delimiters.
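The same idea with the csv module, which takes the delimiter as a parameter and makes skipping the empty trailing cells explicit (a sketch; the function name is made up):

```python
import csv

# Collect every numeric cell in the file and report the extremes.
def min_max(path, delimiter=";"):
    values = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            values.extend(int(cell) for cell in row if cell.strip())
    return min(values), max(values)
```

Changing the delimiter argument (e.g. to ",") adapts it to other .csv dialects without touching the parsing logic.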
import sys, csv

def cmp_risks(x, y):
    # This assumes risk factors are prioritised by key columns 1, 3
    # and that column 1 is numeric while column 3 is textual
    return cmp(int(x[0]), int(y[0])) or cmp(x[2], y[2])

l = sorted(csv.reader(sys.stdin), cmp=cmp_risks)

# Write out the first and last rows
csv.writer(sys.stdout).writerows([l[0], l[-1]])

Now, I took a shortcut and said the input and output files were sys.stdin and sys.stdout. You'd probably replace these with the file objects you created in your original question (e.g. read_file and write_file). However, in my case, I'd probably just run it (on Linux) with:

$ ./foo.py <riskfactors.csv >best_and_worst.txt
Python: how to search a text file for a number
There's a text file that I'm reading line by line. It looks something like this:

3
3
67
46
67
3
46

Each time the program encounters a new number, it writes it to a text file. The way I'm thinking of doing this is writing the first number to the file, then looking at the second number and checking if it's already in the output file. If it isn't, it writes THAT number to the file. If it is, it skips that line to avoid repetitions and goes on to the next line. How do I do this?
Rather than searching your output file, keep a set of the numbers you've written, and only write numbers that are not in the set.
Instead of checking the output file for the number, it is better to keep this information in a variable (a set or list); it will save you disk reads. To search a file for numbers you need to loop through each line of that file; you can do that with a for line in open('input'): loop, where 'input' is the name of your file. On each iteration, line will contain one line of the input file, ended with the end-of-line character '\n'. In each iteration you should try to convert the value on that line to a number; the int() function may be used. You may want to protect yourself against empty lines or non-number values with a try statement. Having the number, you should check whether it was already written to the output file by looking in a set of already-written numbers. If the value is not in the set yet, add it and write it to the output file.

#!/usr/bin/env python
numbers = set()  # set for storing numbers that were already written
out = open('output', 'w')  # open 'output' file for writing
for line in open('input'):  # loop through each line of the 'input' file
    try:
        i = int(line)  # try to convert the line to an integer
    except ValueError:  # if the conversion fails, display a warning
        print "Warning: cannot convert to number string '%s'" % line.strip()
        continue  # skip to the next line on error
    if i not in numbers:  # check if the number wasn't already added
        out.write('%d\n' % i)  # write the number followed by a newline
        numbers.add(i)  # mark the number as already added

This example assumes that your input file contains a single number on each line. In the case of an empty or incorrect line, a warning will be displayed on stdout. You could also use a list in the above example, but it may be less efficient: instead of numbers = set() use numbers = [], and instead of numbers.add(i), numbers.append(i); the if condition stays the same.
Don't do that. Use a set() to keep track of all the numbers you have seen; it will only hold one of each.

numbers = set()
for line in open("numberfile"):
    numbers.add(int(line.strip()))
open("outputfile", "w").write("\n".join(str(n) for n in numbers))

Note this reads them all, then writes them all out at once, and since set iteration order is not guaranteed, the numbers may come out in a different order than in the original file. If you don't want that, you can also write them as you read them, but only if they are not already in the set:

numbers = set()
with open("outfile", "w") as outfile:
    for line in open("numberfile"):
        number = int(line.strip())
        if number not in numbers:
            outfile.write(str(number) + "\n")
            numbers.add(number)
Are you working with exceptionally large files? You probably don't want to try to "search" the file you're writing to for a value you just wrote. You (probably) want something more like this:

encountered = set()
with open('file1') as fhi, open('file2', 'w') as fho:
    for line in fhi:
        if line not in encountered:
            encountered.add(line)
            fho.write(line)
If you want to scan through a file to see if it contains a number on any line, you could do something like this:

def file_contains(f, n):
    with f:
        for line in f:
            if int(line.strip()) == n:
                return True
    return False

However, as Ned points out in his answer, this isn't a very efficient solution: if you have to search through the file again for each line, the running time of your program will increase proportionally to the square of the number of numbers. If the number of values is not incredibly large, it would be more efficient to use a set (documentation). Sets are designed to keep track of unordered values very efficiently. For example:

with open("input_file.txt", "rt") as in_file:
    with open("output_file.txt", "wt") as out_file:
        encountered_numbers = set()
        for line in in_file:
            n = int(line.strip())
            if n not in encountered_numbers:
                encountered_numbers.add(n)
                out_file.write(line)