I am working with datasets stored in large text files. For the analysis I am carrying out, I open the files, extract parts of the dataset and compare the extracted subsets. My code works like so:
from math import ceil

with open("seqs.txt","rb") as f:
    f = f.readlines()

assert type(f) == list, "ERROR: file object not converted to list"

fives = int( ceil(0.05*len(f)) )
thirds = int( ceil(len(f)/3) )

## top/bottom 5% of dataset
low_5 = f[0:fives]
top_5 = f[-fives:]

## top/bottom 1/3 of dataset
low_33 = f[0:thirds]
top_33 = f[-thirds:]

## Write lists to file
# top-5
with open("high-5.out","w") as outfile1:
    for i in top_5:
        outfile1.write("%s" % i)

# low-5
with open("low-5.out","w") as outfile2:
    for i in low_5:
        outfile2.write("%s" % i)

# top-33
with open("high-33.out","w") as outfile3:
    for i in top_33:
        outfile3.write("%s" % i)

# low-33
with open("low-33.out","w") as outfile4:
    for i in low_33:
        outfile4.write("%s" % i)
I am trying to find a cleverer way of automating the process of writing the lists out to files. In this case there are only four, but in future cases where I may end up with as many as 15-25 lists, I would like some function to take care of this. I wrote the following:
def write_to_file(*args):
    for i in args:
        with open(".out", "w") as outfile:
            outfile.write("%s" % i)
but the resulting file only contains the final list when I call the function like so:
write_to_file(low_33,low_5,top_33,top_5)
I understand that I have to define an output file for each list (which I am not doing in the function above); I'm just not sure how to implement this. Any ideas?
Make your variable names match your filenames and then use a dictionary to hold them instead of keeping them in the global namespace:
data = {'high_5': f[-fives:],
        'low_5': f[:fives],
        'high_33': f[-thirds:],
        'low_33': f[:thirds]}

for key in data:
    with open('{}.out'.format(key), 'w') as output:
        for i in data[key]:
            output.write(i)
This keeps your data in a single, easy-to-use place, and assuming you want to apply the same actions to each list, you can continue using the same paradigm.
As mentioned by PM2Ring below, it would be advisable to use underscores (as you do in the variable names) instead of dashes (as you do in the filenames), because you can then pass the dictionary keys as keyword arguments into a writing function:
write_to_file(**data)
This would equate to:
write_to_file(low_5=f[:fives], high_5=f[-fives:],...) # and the rest of the data
From this you could use one of the functions defined by the other answers.
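As a concrete sketch of such a keyword-argument writer (the body here is illustrative, not from the original answer; it assumes each value is a list of lines, as in the question):

def write_to_file(**kwargs):
    # Each keyword becomes the filename stem; each value is the list of lines.
    for name, lines in kwargs.items():
        with open('{}.out'.format(name), 'w') as outfile:
            for line in lines:
                outfile.write('%s' % line)

write_to_file(**data)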
You could have one output file per argument by incrementing a counter for each argument. For example:
def write_to_file(*args):
    for index, i in enumerate(args):
        with open("{}.out".format(index + 1), "w") as outfile:
            outfile.write("%s" % i)
The example above will create output files "1.out", "2.out", "3.out", and "4.out".
Alternatively, if you had specific names you wanted to use (as in your original code), you could do something like the following:
def write_to_file(args):
    for name, data in args:
        with open("{}.out".format(name), "w") as outfile:
            outfile.write("%s" % data)
args = [('low-33', low_33), ('low-5', low_5), ('high-33', top_33), ('high-5', top_5)]
write_to_file(args)
which would create output files "low-33.out", "low-5.out", "high-33.out", and "high-5.out".
Don't try to be clever. Instead, aim to have your code readable and easy to understand. You can group repeated code into a function, for example:
from math import ceil

def save_to_file(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            f.write('{}'.format(item))

with open('data.txt') as f:
    numbers = list(f)

five_percent = int(len(numbers) * 0.05)
thirty_three_percent = int(ceil(len(numbers) / 3.0))
# Why not: thirty_three_percent = int(len(numbers) * 0.33)

save_to_file(numbers[:five_percent], 'low-5.out')
save_to_file(numbers[-five_percent:], 'high-5.out')
save_to_file(numbers[:thirty_three_percent], 'low-33.out')
save_to_file(numbers[-thirty_three_percent:], 'high-33.out')
Update
If you have quite a number of lists to write, then it makes sense to use a loop. I suggest having two functions, save_top_n_percent and save_low_n_percent, to help with the job. They contain a little duplicated code, but separating them into two functions makes each clearer and easier to understand.
def save_to_file(data, filename):
    with open(filename, 'w') as f:
        for item in data:
            f.write(item)

def save_top_n_percent(n, data):
    n_percent = int(len(data) * n / 100.0)
    save_to_file(data[-n_percent:], 'top-{}.out'.format(n))

def save_low_n_percent(n, data):
    n_percent = int(len(data) * n / 100.0)
    save_to_file(data[:n_percent], 'low-{}.out'.format(n))

with open('data.txt') as f:
    numbers = list(f)

for n_percent in [5, 33]:
    save_top_n_percent(n_percent, numbers)
    save_low_n_percent(n_percent, numbers)
On this line you are opening a file called .out on every iteration and overwriting it:
with open(".out", "w") as outfile:
You need to make the ".out" name unique for each i in args. You can achieve this by passing in a list for each argument, where the list contains the file name and the data:
def write_to_file(*args):
    for i in args:
        with open("%s.out" % i[0], "w") as outfile:
            outfile.write("%s" % i[1])
And pass in arguments like so...
write_to_file(["low_33",low_33],["low_5",low_5],["top_33",top_33],["top_5",top_5])
You are creating a file called '.out' and overwriting it each time.
def write_to_file(*args):
    for i in args:
        filename = i + ".out"
        contents = globals()[i]
        with open(filename, "w") as outfile:
            outfile.write("%s" % contents)

write_to_file("low_33", "low_5", "top_33", "top_5")
https://stackoverflow.com/a/6504497/3583980 (variable name from a string)
This will create low_33.out, low_5.out, top_33.out, top_5.out and their contents will be the lists stored in these variables.
Related
I have a Python script that goes out and pulls a huge chunk of JSON data, then iterates over it to build two lists:
# Get all price data
response = c.get_price_history_every_minute(symbol)

# Build prices list
prices = list()
for i in range(len(response.json()["candles"])):
    prices.append(response.json()["candles"][i]["prices"])

# Build times list
times = list()
for i in range(len(response.json()["candles"])):
    times.append(response.json()["candles"][i]["datetime"])
This works fine, but it takes a LONG time to pull in all of the data and build the lists. I am doing some testing while trying to build out a complex script, and would like to save these two lists to two files, then import the data from those files and recreate the lists when I run subsequent tests, to skip generating, iterating and parsing the JSON.
I have been trying the following:
# Write Price to a File
a_file = open("prices7.txt", "w")
content = str(prices)
a_file.write(content)
a_file.close()
And then in future scripts:
# Load Prices from File
prices_test = array('d')
a_file = open("prices7.txt", "r")
prices_test = a_file.read()
The output from my JSON lists and the data loaded into the list created from the file look identical, but when I try to do anything with the data loaded from the file, it is garbage...
print (prices)
The output looks like this: [69.73, 69.72, 69.64, ... 69.85, 69.82, etc.]
print (prices_test)
The output looks identical
If I run a simple query like:
print (prices[1], prices[2])
I get the expected output: 69.73 69.72
If I do the same on the list created from the file:
print (prices_test[1], prices_test[2])
I get the output ( [,6 )
It is pulling every character in the string individually instead of using the comma separated values as I would have expected...
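To see why, note that prices7.txt holds str(prices), a single string, and indexing a string yields individual characters. A minimal sketch:

prices = [69.73, 69.72, 69.64]
content = str(prices)          # "[69.73, 69.72, 69.64]"
print(content[1], content[2])  # prints: 6 9  (characters, not values)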
I've googled every combination of search terms I could think of so any help would be GREATLY appreciated!!
I had to do something like this before. I used pickle to do it.
import pickle

def pickle_the_data(pickle_name, list_to_pickle):
    """This function pickles a given list.

    Args:
        pickle_name (str): name of the resulting pickle.
        list_to_pickle (list): list that you need to pickle.
    """
    with open(pickle_name + '.pickle', 'wb') as pikd:
        pickle.dump(list_to_pickle, pikd)
    file_name = pickle_name + '.pickle'
    print(f'{file_name}: Created.')

def unpickle_the_data(pickle_file_name):
    """This will unpickle a pickled file.

    Args:
        pickle_file_name (str): file name of the pickle.

    Returns:
        list: when we pass a pickled list, it will return the
        unpickled list.
    """
    with open(pickle_file_name, 'rb') as pk_file:
        unpickleddata = pickle.load(pk_file)
    return unpickleddata
So first pickle your list: pickle_the_data(name_for_pickle, your_list). Then, when you need to load the list: unpickle_the_data(name_of_your_pickle_file).
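A usage sketch with the lists from the question (the file stem 'prices7' is just illustrative):

# Save both lists once, after the slow JSON pull:
pickle_the_data('prices7', prices)
pickle_the_data('times7', times)

# In later test runs, skip the pull and reload:
prices = unpickle_the_data('prices7.pickle')
times = unpickle_the_data('times7.pickle')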
This is what I was trying to explain in the comments section. Note that I replaced response.json() with jsonData, successfully hoisting it out of each for-loop, and reduced both loops to a single one for efficiency. Now the code should run faster.
import json

def saveData(filename, data):
    # Convert the data to a JSON string
    data = json.dumps(data)

    # Open the file, then save it
    try:
        file = open(filename, "wt")
    except:
        print("Failed to save the file.")
        return False
    else:
        file.write(data)
        file.close()
        return True

def loadData(filename):
    # Open the file, then load its contents
    try:
        file = open(filename, "rt")
    except:
        print("Failed to load the file.")
        return None
    else:
        data = file.read()
        file.close()
        # Data is a JSON string, so now we convert it back
        # to a Python structure:
        data = json.loads(data)
        return data

# Get all price data
response = c.get_price_history_every_minute(symbol)
jsonData = response.json()

# Build the prices and times lists:
#
# As you're iterating over the same "candles" index in both loops
# when building the two lists, just reduce them to a single loop
prices = list()
times = list()
for i in range(len(jsonData["candles"])):
    prices.append(jsonData["candles"][i]["prices"])
    times.append(jsonData["candles"][i]["datetime"])

# Now, when you need to, save each list like this:
saveData("prices_list.json", prices)
saveData("times_list.json", times)

# And retrieve them back later when you need them:
prices = loadData("prices_list.json")
times = loadData("times_list.json")
By the way, pickle does the same thing, but it uses binary data instead of JSON, which is probably faster for saving / loading data. I don't know; I haven't tested it.
With JSON you have the advantage of readability: you can open each file and read it directly, as long as you understand JSON syntax.
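If you want to check that speed claim, a rough in-memory micro-benchmark could look like this (illustrative only; real timings depend on your data and disk):

import json
import pickle
import timeit

data = list(range(100000))
print('pickle:', timeit.timeit(lambda: pickle.dumps(data), number=10))
print('json:  ', timeit.timeit(lambda: json.dumps(data), number=10))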
First, sorry if the title is not clear. I (a noob) am baffled by this...
Here's my code:
import csv
from random import random
from collections import Counter

def rn(dic, p):
    for ptry in parties:
        if p < float(dic[ptry]):
            return ptry
        else:
            p -= float(dic[ptry])

def scotland(r):
    r['SNP'] = 48
    r['Con'] += 5
    r['Lab'] += 1
    r['LibDem'] += 5

def n_ireland(r):
    r['DUP'] = 9
    r['Alliance'] = 1
    # SF = 7

def election():
    results = Counter([rn(row, random()) for row in data])
    scotland(results)
    n_ireland(results)
    return results

parties = ['Con', 'Lab', 'LibDem', 'Green', 'BXP', 'Plaid', 'Other']

with open('/Users/andrew/Downloads/msp.csv', newline='') as f:
    data = csv.DictReader(f)
    for i in range(1000):
        print(election())
What happens is that on every iteration after the first, the variable data seems to have vanished: the function election() creates a Counter object from a list obtained by processing data, but on every pass after the first this object is empty, so the function just returns the hard-coded data from scotland() and n_ireland(). (msp.csv is a CSV file containing detailed polling data.) I'm sure I'm doing something stupid, but I would welcome anyone gently pointing out where...
I'm going to place a bet on your definition of newline. Are you sure you don't want newline='\n'? Otherwise it will interpret the entire file as a single line, which would explain what you're seeing.
EDIT
I now see another issue. The file object in Python acts as a generator for each line. The problem is that once the generator is finished (you hit the end of the file), no more data is generated. To solve this, reset your file pointer to the beginning of the file, like so:
with open('/Users/andrew/Downloads/msp.csv') as f:
    data = csv.DictReader(f)
    for i in range(1000):
        print(election())
        f.seek(0)
Here the call to f.seek(0) resets the file pointer to the beginning of your file. You are correct that data is a global object, given the way you've defined it at the module level; there's no need to pass it as a parameter.
I agree with @smassey; you might need to change the code to
with open('/Users/andrew/Downloads/msp.csv', newline='\n') as f:
or simply try not use that argument
with open('/Users/andrew/Downloads/msp.csv') as f:
So, I'm trying to write a random quantity of random whole numbers (in the range of 0 to 1000), square these numbers, and return the squares as a list. Initially I started off writing to a specific txt file that I had already created, but it didn't work properly. I looked for some methods that might make things a little easier, and I found tempfile.NamedTemporaryFile, which I thought might be useful. Here's my current code, with comments provided:
# This program calculates the squares of numbers read from a file, using several functions.
# It reads a file - or writes a random number of whole numbers to a file - looping through the numbers
# and returning a calculation from (x * x) or (x**2);
# the results are stored in a list and returned.
# Update 1: after errors and logic problems, I found the Python function tempfile.NamedTemporaryFile.
# It operates exactly as TemporaryFile() does, except the file is guaranteed to have a visible name
# in the file system, and it creates a temporary file that can be written to and accessed
# (say, for generating a file with a list of integers that is random every time).
import random, tempfile

# Writes to a temporary file of random length (file_len is >= 1 but <= 100),
# with random numbers in the range 0 - 1000.
def modfile(file_len):
    with tempfile.NamedTemporaryFile(delete = False) as newFile:
        for x in range(file_len):
            newFile.write(str(random.randint(0, 1000)))
    print(newFile)
    return newFile

# Squares random numbers in the file and returns them as a list.
def squared_num(newFile):
    output_box = list()
    for l in newFile:
        exp = newFile(l) ** 2
        output_box[l] = exp
    print(output_box)
    return output_box

print("This program reads a file with numbers in it - i.e. prints numbers into a blank file - and returns their conservative squares.")
file_len = random.randint(1, 100)
newFile = modfile(file_len)
output = squared_num(file_name)
print("The squared numbers are:")
print(output)
Unfortunately, now I'm getting this error on the newFile.write(...) line in my modfile function: TypeError: 'str' does not support the buffer interface. As someone who's relatively new to Python, can someone explain why I'm getting this, and how I can fix it to achieve the desired result? Thanks!
EDIT: the code is now fixed (many thanks to unutbu and Pedro)! Now: how would I be able to print the original file numbers alongside their squares? Additionally, is there any minimal way I could remove the decimals from the outputted floats?
By default tempfile.NamedTemporaryFile creates a binary file (mode='w+b'). To open the file in text mode and be able to write text strings (instead of byte strings), you need to change the temporary file creation call to not use the b in the mode parameter (mode='w+'):
tempfile.NamedTemporaryFile(mode='w+', delete=False)
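A small sketch of the difference (illustrative):

import tempfile

# Text mode: str writes work.
with tempfile.NamedTemporaryFile(mode='w+', delete=False) as tmp:
    tmp.write('42\n')

# Default binary mode ('w+b'): a str write raises TypeError.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    try:
        tmp.write('42\n')
    except TypeError as e:
        print(e)  # e.g. "a bytes-like object is required, not 'str'"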
You need to put newlines after each int, lest they all run together creating a huge integer:
newFile.write(str(random.randint(0, 1000))+'\n')
(Also set the mode, as explained in PedroRomano's answer):
with tempfile.NamedTemporaryFile(mode = 'w+', delete = False) as newFile:
modfile returns a closed filehandle. You can still get a filename out of it, but you can't read from it. So in modfile, just return the filename:
return newFile.name
And in the main part of your program, pass the filename on to the squared_num function:
filename = modfile(file_len)
output = squared_num(filename)
Now inside squared_num you need to open the file for reading.
with open(filename, 'r') as f:
    for l in f:
        exp = float(l)**2       # `l` is a string. Convert to float before squaring
        output_box.append(exp)  # build output_box with append
Putting it all together:
import random, tempfile

def modfile(file_len):
    with tempfile.NamedTemporaryFile(mode = 'w+', delete = False) as newFile:
        for x in range(file_len):
            newFile.write(str(random.randint(0, 1000)) + '\n')
    print(newFile)
    return newFile.name

# Squares the random numbers in the file and returns them as a list.
def squared_num(filename):
    output_box = list()
    with open(filename, 'r') as f:
        for l in f:
            exp = float(l)**2
            output_box.append(exp)
    print(output_box)
    return output_box

print("This program reads a file with numbers in it - i.e. prints numbers into a blank file - and returns their conservative squares.")
file_len = random.randint(1, 100)
filename = modfile(file_len)
output = squared_num(filename)
print("The squared numbers are:")
print(output)
PS. Don't write lots of code without running it. Write little functions, and test that each works as expected. For example, testing modfile would have revealed that all your random numbers were being concatenated. And printing the argument sent to squared_num would have shown it was a closed filehandle.
Testing the pieces gives you firm ground to stand on and lets you develop in an organized way.
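For example, a quick sanity check of modfile alone (a sketch; the numbers are random) would have exposed the bug immediately:

name = modfile(3)
with open(name) as f:
    print(repr(f.read()))  # without the '\n' fix: e.g. '48412997' (one blob)
                           # with the fix:         e.g. '484\n12\n997\n'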
I have this code:
import csv
import collections

def do_work():
    (data,counter)=get_file('thefile.csv')
    b=samples_subset1(data, counter,'/pythonwork/samples_subset3.csv',500)
    return

def get_file(start_file):
    with open(start_file, 'rb') as f:
        data = list(csv.reader(f))
    counter = collections.defaultdict(int)
    for row in data:
        counter[row[10]] += 1
    return (data,counter)

def samples_subset1(data,counter,output_file,sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b_counter=0
        b=[]
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
                b_counter+=1
    return (b)
I recently started learning Python and would like to start off with good habits. Therefore, I was wondering if you can help me get started on turning this code into classes. I don't know where to start.
Per my comment on the original post, I don't think a class is necessary here. Still, if other Python programmers will ever read this, I'd suggest getting it in line with PEP8, the Python style guide. Here's a quick rewrite:
import csv
import collections

def do_work():
    data, counter = get_file('thefile.csv')
    b = samples_subset1(data, counter, '/pythonwork/samples_subset3.csv', 500)

def get_file(start_file):
    with open(start_file, 'rb') as f:
        counter = collections.defaultdict(int)
        data = list(csv.reader(f))
        for row in data:
            counter[row[10]] += 1
    return (data, counter)

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
    return b
Notes:
- No one ever uses more than 4 spaces to indent. Use 2 - 4, and all your levels of indentation should match.
- Use a single space after the commas between arguments to functions ("F(a, b, c)", not "F(a,b,c)").
- Naked return statements at the end of a function are meaningless. Functions without return statements implicitly return None.
- Put a single space around all operators (a = 1, not a=1).
- Do not wrap single values in parentheses. It looks like a tuple, but it isn't.
- b_counter wasn't used at all, so I removed it.
- csv.reader returns an iterator, which you are casting to a list. That's usually a bad idea because it forces Python to load the entire file into memory at once, whereas the iterator will just return each line as needed. Understanding iterators is absolutely essential to writing efficient Python code. I've left data in for now, but you could rewrite to use an iterator everywhere you currently use data, which is a list; a sketch follows this list.
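For example, a sketch of the counting pass working directly on the iterator (illustrative; it changes the interface to take an open file object, so it is not a drop-in replacement for get_file):

import csv
import collections

def count_column(f, column=10):
    # One pass over the reader; only the counter is held in memory.
    counter = collections.defaultdict(int)
    for row in csv.reader(f):
        counter[row[column]] += 1
    f.seek(0)  # rewind so the caller can iterate the rows again
    return counter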
Well, I'm not sure what you want to turn into a class. Do you know what a class is? You want to make a class to represent some type of thing. If I understand your code correctly, you want to filter a CSV to show only those rows whose row[10] is shared by at least sample_cutoff other rows. Surely you could do that with an Excel filter much more easily than by reading through the file in Python?
What the guy in the other thread suggested is true, but not really applicable to your situation. You used a lot of global variables unnecessarily: if they'd been necessary to the code you should have put everything into a class and made them attributes, but as you didn't need them in the first place, there's no point in making a class.
Some tips on your code:
Don't cast the file to a list. That makes Python read the whole thing into memory at once, which is bad if you have a big file. Instead, simply iterate through the file itself: for row in csv.reader(f). Then, when you want to go through the file a second time, just do f.seek(0) to return to the top and start again.
Don't put return at the end of every function; that's just unnecessary. You don't need parentheses, either: return spam is fine.
Rewrite
import csv
import collections

def do_work():
    with open( 'thefile.csv' ) as f:
        # Open the file and count the rows.
        data, counter = get_file(f)

        # Go back to the start of the file.
        f.seek(0)

        # Filter to only common rows.
        b = samples_subset1(data, counter,
                            '/pythonwork/samples_subset3.csv', 500)
    return b

def get_file(f):
    counter = collections.defaultdict(int)
    data = csv.reader(f)
    for row in data:
        counter[row[10]] += 1
    return data, counter

def samples_subset1(data, counter, output_file, sample_cutoff):
    with open(output_file, 'wb') as outfile:
        writer = csv.writer(outfile)
        b = []
        for row in data:
            if counter[row[10]] >= sample_cutoff:
                b.append(row)
                writer.writerow(row)
    return b
I have a file whose contents are of the form:
.2323 1
.2327 1
.3432 1
.4543 1
and so on, for some 10,000 lines in each file.
I have a variable whose value is, say, a = .3344.
From the file, I want to get the row number of the row whose first column is closest to this variable. For example, it should give row_num = 3, as .3432 is closest to it.
I have tried a method of loading the first column's elements into a list and then comparing the variable to each element to get the index number.
Doing it this way is very time-consuming and slows my model... I want a very quick method, as this needs to be called some 1000 times at a minimum...
I want a method with the least overhead that is very quick; can anyone please tell me how this can be done very fast?
As the file size is at most 100 KB, can this be done directly without loading it into any list? If yes, how can it be done?
Any method quicker than the one mentioned above is welcome, but I am desperate to improve the speed -- please help.
def get_list(file, cmp, fout):
    ind, _ = min(enumerate(file), key=lambda x: abs(x[1] - cmp))
    return fout[ind].rstrip('\n').split(' ')

#root = r'c:\begpython\wavnk'
header = 6
for lst in lists:
    save = database_index[lst]
    #print save
    index, base, abs2, _, abs1 = save
    using_data[index] = save
    base = 'C:/begpython/wavnk/' + base.replace('phone', 'text')
    fin, fout = base + '.pm', base + '.mcep'
    file = open(fin)
    fout = open(fout).readlines()
    [next(file) for _ in range(header)]
    file = [float(line.partition(' ')[0]) for line in file]
    join_cost_index_end[index] = get_list(file, float(abs1), fout)
    join_cost_index_strt[index] = get_list(file, float(abs2), fout)
This is the code I was using, copying the file into a list. Please give better alternatives to this.
Building on John Kugelman's answer, here's a way you might be able to do a binary search on a file with fixed-length lines:
class SubscriptableFile(object):
    def __init__(self, file):
        self._file = file
        file.seek(0, 0)
        self._line_length = len(file.readline())
        file.seek(0, 2)
        self._len = file.tell() // self._line_length  # integer division, so len() returns an int

    def __len__(self):
        return self._len

    def __getitem__(self, key):
        self._file.seek(key * self._line_length)
        s = self._file.readline()
        if s:
            return float(s.split()[0])
        else:
            raise KeyError('Line number too large')
This class wraps a file in a list-like structure, so that now you can use the functions of the bisect module on it:
import bisect

def find_row(file, target):
    fw = SubscriptableFile(file)
    i = bisect.bisect_left(fw, target)
    if fw[i + 1] - target < target - fw[i]:
        return i + 1
    else:
        return i
Here file is an open file object and target is the number you want to find. The function returns the number of the line with the closest value.
I will note, however, that the bisect module will try to use a C implementation of its binary search when it is available, and I'm not sure if the C implementation supports this kind of behavior. It might require a true list, rather than a "fake list" (like my SubscriptableFile).
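A usage sketch, assuming the file is sorted with fixed-length lines as described above:

import bisect

with open('numbers.txt') as f:
    print(find_row(f, .3344))  # -> 2 for the sample data at the top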
Is the data in the file sorted in numerical order? Are all the lines of the same length? If not, the simplest approach is best. Namely, reading through the file line by line. There's no need to store more than one line in memory at a time.
Code:
def closest(num):
    closest_row = None
    closest_row_num = None
    closest_value = None
    for row_num, row in enumerate(open('numbers.txt')):
        value = float(row.split()[0])
        if closest_value is None or abs(value - num) < abs(closest_value - num):
            closest_row = row
            closest_row_num = row_num
            closest_value = value
    return (closest_row_num, closest_row)

print(closest(.3344))
Output for sample data:
(2, '.3432 1\n')
If the lines are all the same length and the data is sorted then there are some optimizations that will make this a very fast process. All the lines being the same length would let you seek directly to particular lines (you can't do this in a normal text file with lines of different length). Which would then enable you to do a binary search.
A binary search would be massively faster than a linear search. A linear search will on average have to read 5,000 lines of a 10,000 line file each time, whereas a binary search would on average only read log2 10,000 ≈ 13 lines.
Load it into a list, then use bisect, as sketched below.
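A minimal sketch of that approach, assuming the first column is sorted in ascending order (the file name and target are taken from the question):

import bisect

with open('numbers.txt') as f:
    values = [float(line.split()[0]) for line in f]

def closest_row_num(values, target):
    i = bisect.bisect_left(values, target)
    if i == 0:
        return 0
    if i == len(values):
        return len(values) - 1
    # Compare the two neighbours of the insertion point.
    return i if values[i] - target < target - values[i - 1] else i - 1

print(closest_row_num(values, .3344))  # -> 2 for the sample data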