Using Python 2.4, I have a .txt file sorted into 3 columns, with 9 spaces between each column (which is the reason for x.split) - roughly 1000 lines. For example:
$1$sda4356:[sgb1_diska5.ldlbat44.libabl]talild_0329_base.rpt talild_0329_base.rpt 00000000000000005062
I'm using the following code to sort by column 3 (which is the file size):
fobj = open('data.txt')
data = [ x.split() for x in fobj ]
fobj.close()
from operator import itemgetter
data.sort(key=itemgetter(2), reverse=True)
I want to print an entire column and, if possible with Python 2.4, even name the columns. If I do something like data[1] it just outputs line 2; how can I get it to show column 2 instead? If I can't name the columns, I see a few options with import csv, but I can't figure out the right command to use the data I've already sorted instead of reading the .txt file again. Most examples expect a file name, as shown below:
with open(filename, 'r') as f:
def getcolumn(n, data):
    return (i[n] for i in data)  # if this doesn't work, replace () with []

for i in getcolumn(1, data):
    print i
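If you also want to refer to the columns by name, one option that works in Python 2.4 is to build a dict of column lists from the already-sorted data. This is just a sketch; the column names used here (path, name, size) are assumptions about what each column holds:

# A sketch: turn the sorted rows into named columns (Python 2.4 compatible).
# The names 'path', 'name', 'size' are assumptions, not part of the original data.
names = ['path', 'name', 'size']
columns = {}
for row in data:
    for position, colname in enumerate(names):
        columns.setdefault(colname, []).append(row[position])

for value in columns['size']:
    print value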
Let's say I have an input file with the following data:
50 50
A
B
C
D
I know that I can extract the first line using the map function as follows:
x,y= map(int, input().split())
But I am unsure how I can retrieve the next 4 lines and put them into a list. I tried using the splitlines() function, since each value is on a separate line, but that only returns the first value.
strings = input().splitlines()
How can I choose what parts of the input file I want to read and then store them in respective variables?
Open the file, read all the lines into a list, and do what you want with them:
with open("[filename]", "r") as f:
lines = list(map(lambda l: l.strip(), f.readlines()))
# do whatever with the lines here
# use lines.pop(0) if you want to remove the line from the list
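As a concrete sketch for the sample input above (assuming it lives in a file named input.txt), you could peel off the first line and keep the rest:

# A sketch assuming the data sits in "input.txt".
with open("input.txt", "r") as f:
    lines = [l.strip() for l in f.readlines()]

x, y = map(int, lines[0].split())   # first line: "50 50"
strings = lines[1:]                 # remaining lines: ["A", "B", "C", "D"]
print(x, y)
print(strings)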
I routinely use PowerShell to split larger text or csv files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files written to a text file. Each record starts with a single line that begins with a 1, and there is nothing else on that line.
What I need to be able to do is split a file based on the number of statements. So, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux or OS X, so pretty much anything is open for the split.
Any ideas would be greatly appreciated.
Maybe try recognizing it by the fact that there is a '1' plus a new line?
with open(input_file, 'r') as f:
    my_string = f.read()
my_list = my_string.split('\n1\n')
This separates each record into a list element, assuming the file has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x in range(len(my_list)):
    out_file = open(str(x) + '.txt', 'w')
    print >> out_file, my_list[x]
    out_file.close()
To avoid loading the whole file into memory, you could define a function that generates records incrementally and then use itertools' grouper recipe to write each 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest
with open('input.txt') as input_file:
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(''.join(lines)
                                   for r in records for lines in r)
where generate_records() yields one record at a time, and each record is itself an iterator over lines in the input file:
from itertools import chain
def generate_records(input_file, start='1\n', eof=[]):
    def record(yield_start=True):
        if yield_start:
            yield start
        for line in input_file:
            if line == start:  # start new record
                break
            yield line
        else:  # EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()
generate_records() is a generator that yields generators, like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.
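As a rough sketch of that idea (not the author's code), you could read the file in fixed-size chunks and carry the incomplete tail across chunk boundaries, detecting record boundaries the same '\n1\n' way as the simpler answer above:

# A sketch: read the input in large chunks instead of line by line.
# Records are assumed to be separated by a line containing only "1".
def read_records_chunked(path, chunk_size=1 << 20):
    buffer = ''
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buffer += chunk
            parts = buffer.split('\n1\n')
            buffer = parts.pop()   # last piece may be incomplete; keep it for the next round
            for part in parts:
                yield part
    if buffer:
        yield buffer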
I am trying to find the min and max in a csv file and have them output into a text file. Currently my code outputs all of the data into the output file, and I am unsure how to grab the data out of the multiple columns and have them sorted accordingly.
Any guidance would be appreciated, as I don't have a good lead on how to figure this out.
read_file = open("riskfactors.csv", 'r')
def create_file():
    read_file = open("riskfactors.csv", 'r')
    write_file = open("best_and_worst.txt", "w")
    for line_str in read_file:
        read_file.readline()
        print (line_str,file=write_file)
    write_file.close()
    read_file.close()
Assuming your file is a standard .csv file containing only numbers separated by semicolons:
1;5;7;6;
3;8;1;1;
Then it's easiest to use the str.split() command, followed by a type conversion to int.
You could store all values in a list (or quicker: set) and then get the maximum:
valuelist = []
for line_str in read_file:
    for cell in line_str.strip().split(";"):
        if cell:  # skip the empty string left after the trailing ";"
            valuelist.append(int(cell))
print(max(valuelist))
print(min(valuelist))
Warning: if your file contains non-number entries, you'd have to filter them out. .csv files can also use different delimiters.
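A hedged sketch of such filtering (same semicolon-delimited layout assumed) could simply skip anything that does not parse as an integer:

# A sketch: keep only cells that parse as integers, ignore everything else.
valuelist = []
for line_str in read_file:
    for cell in line_str.strip().split(";"):
        try:
            valuelist.append(int(cell))
        except ValueError:
            pass  # empty cell, header text, etc.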
import sys, csv
def cmp_risks(x, y):
    # This assumes risk factors are prioritised by key columns 1, 3
    # and that column 1 is numeric while column 3 is textual
    return cmp(int(x[0]), int(y[0])) or cmp(x[2], y[2])

l = sorted(csv.reader(sys.stdin), cmp_risks)

# Write out the first and last rows
csv.writer(sys.stdout).writerows([l[0], l[-1]])
Now, I took a shortcut and said the input and output files were sys.stdin and sys.stdout. You'd probably replace these with the file objects you created in your original question (e.g. read_file and write_file).
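For example, a minimal sketch of that substitution (reusing the riskfactors.csv and best_and_worst.txt names from the question; this is an illustration, not the exact original code):

# A sketch: same comparison-based sort, but reading from and writing to
# named files instead of sys.stdin / sys.stdout (Python 2).
import csv

def cmp_risks(x, y):
    return cmp(int(x[0]), int(y[0])) or cmp(x[2], y[2])

read_file = open("riskfactors.csv", "rb")      # binary mode for csv in Python 2
write_file = open("best_and_worst.txt", "wb")

rows = sorted(csv.reader(read_file), cmp_risks)
csv.writer(write_file).writerows([rows[0], rows[-1]])

read_file.close()
write_file.close()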
However, in my case, I'd probably just run it (if I were using Linux) with:
$ ./foo.py <riskfactors.csv >best_and_worst.txt
I have code in which I first convert a .csv file into multiple lists, and then I have to create a subset of the original file containing only those rows with a particular word in column 5 of my file.
I am trying to use the following code to do so, but it gives me a syntax error for the if statement. Can anyone tell me how to fix this?
import csv
with open('/Users/jadhav/Documents/Hubble files/m4_hubble_1.csv') as f:
    bl = [[],[],[],[],[]]
    reader = csv.reader(f)
    for r in reader:
        for c in range(5):
            bl[c].append(r[c])
print "The files have now been sorted into lists"
name = 'HST_10775_64_ACS_WFC_F814W_F606W'
for c in xrange(0,1):
    if bl[4][c]!='HST_10775_64_ACS_WFC_F814W_F606W'
        print bl[0][c]
You need a colon after your if test, and you need to indent the body of the if:
if bl[4][c]!='HST_10775_64_ACS_WFC_F814W_F606W':
    print bl[0][c]
I have a .txt file, the primary list, with strings like this:
f
r
y
h
g
j
and I have a .csv file, the recipes list, with rows like this:
d,g,r,e,w,s
j,f,o,b,x,q,h
y,n,b,w,q,j
My program goes through each row and counts the number of items that belong to the primary list; for example, in this case the output is:
2
3
2
I always get 0; the mistake must be silly, but I can't figure it out:
from __future__ import print_function
import csv
primary_data = open('test_list.txt','r')
primary_list = []
for line in primary_data.readlines():
    line.strip('\n')
    primary_list.append(line)

recipes_reader = csv.reader(open('test.csv','r'), delimiter =',')
for row in recipes_reader:
    primary_count = 0
    for i in row:
        if i in primary_list:
            primary_count += 1
    print (primary_count)
Here's the bare-essentials pedal-to-the-metal version:
from __future__ import print_function
import csv
with open('test_list.txt', 'r') as f:  # with statement ensures your file is closed
    primary_set = set(line.strip() for line in f)

with open('test.csv', 'rb') as f:  #### see note below ###
    for row in csv.reader(f):  # delimiter=',' is the default
        print(sum(i in primary_set for i in row))  # i in primary_set has int value 0 or 1
Note: In Python 2.x, always open csv files in binary mode. In Python 3.x, always open csv files with newline=''.
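For reference, a sketch of the same read under Python 3 (same file names assumed) would be:

# A sketch of the Python 3 equivalent: open the csv in text mode with newline=''.
import csv

with open('test_list.txt', 'r') as f:
    primary_set = set(line.strip() for line in f)

with open('test.csv', newline='') as f:
    for row in csv.reader(f):
        print(sum(i in primary_set for i in row))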
Reading into primary_list adds a \n to each line - you should remove it:
When appending to primary_list do:
for line in primary_data:
    primary_list.append(line.strip())
Note the strip call. Also, as you can see, you don't really need readlines, since for line in primary_data already does what you need when primary_data is a file object.
Now, as a general comment: since you're using the primary list for lookups, I suggest replacing the list with a set - this will make things much faster if the list is large. Python sets are very efficient for key-based lookup; lists are not designed for that purpose.
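A minimal sketch of that change, assuming the same primary_data file object as above:

# A sketch: a set gives O(1) membership tests, unlike a list.
primary_set = set(line.strip() for line in primary_data)

# later, inside the row loop:
#     if i in primary_set:
#         count += 1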
The following code would solve the problem:
from __future__ import print_function
import csv
primary_data = open('test_list.txt','r')
primary_list = [line.rstrip() for line in primary_data]
recipies_reader = csv.reader(open('recipies.csv','r'), delimiter =',')
for row in recipies_reader:
    count = 0
    for i in row:
        if i in primary_list:
            count += 1
    print (count)
Output
2
3
2