I am looking for a way to speed up my code. I managed to speed up most parts, reducing the runtime to about 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize it.
An example:
# read the file in chunks, keep only the third column, then flatten
text = pd.read_csv(os.path.join(dir, "text.csv"), chunksize=5000)
new_text = [np.array(chunk)[:, 2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute and is the main bottleneck of my program. Edit: I realized that I wasn't very clear about the main issue: the flattening is the part that takes the most time.
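For reference, a lower-memory sketch of the same read-and-flatten step, assuming the needed text really is in the third column so that pandas can be restricted to it with usecols:
text = pd.read_csv(os.path.join(dir, "text.csv"), usecols=[2], chunksize=5000)
new_text = []
for chunk in text:
    # only one column was read, so just grow the flat list chunk by chunk
    new_text.extend(chunk.iloc[:, 0].tolist())
This avoids materialising every chunk as a full NumPy array before flattening it.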
Also this part of my program takes a long time:
train_dict = dict(izip(text, labels))  # izip comes from itertools (Python 2)
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (this is a machine learning task, with train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test sets, so I need to find those duplicates. I therefore iterate over my test set row by row (2 million rows in total); when I find a duplicate I don't want to use the predicted label, but the label from the duplicate in train_dict. The result of this iteration is assigned to the variable result in the code above.
I have heard there are various libraries in Python that could speed up parts of your code, but I don't know which of them could do the job, and right now I don't have the time to investigate this, which is why I need someone to point me in the right direction. Is there a way to speed up the code snippets above?
edit2
I have investigated again, and it is definitely a memory issue. I tried to read the file row by row, and after a while the speed declined dramatically; furthermore, my RAM usage is nearly 100% and Python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything in memory?
edit3
As memory is the main issue, I'll give an outline of the relevant part of my program. I have dropped the predictions for the time being, which reduced the complexity of my program significantly; instead I insert a standard sample for every non-duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
from itertools import izip  # Python 2
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir, "Samples.csv")))[:, 2])  # this file is only 70 MB
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see, I am only using three important modules in this part: pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. All I am doing is reading in the files, getting the necessary parts, zipping the train documents with the corresponding labels in a dictionary, and then searching for duplicates.
edit 4
As requested, I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first column contains IDs for every sample, the second column contains titles, and the third column contains text bodies (varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the IDs, titles and text bodies. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
    with open(training_file, "rt") as f:
        next(f)  # throw away "headers" in first line
        for line in f:
            lst = line.rstrip('\n').split(',')
            # lst contains: id,title,body,category_labels
            yield (lst[2], lst[3])  # key on the body text, value is the category label
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the "x in train_dict" test we need.
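The same lazy pattern can be used for the test file; this is only a sketch and assumes Test.csv has the body text in its third column and contains no embedded commas (the same assumption the code above makes):
def get_test_data(test_file):
    with open(test_file, "rt") as f:
        next(f)  # skip the header line
        for line in f:
            lst = line.rstrip('\n').split(',')
            yield lst[2]  # the body text

results = [train_dict.get(body, sample) for body in get_test_data("Test.csv")]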
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done
http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important, you can use sets: the intersection (trainset & testset) gives the elements that are in both the train and test sets, which you add to your results first, and the set difference (testset - trainset) gives the elements that are only in your test set, which you add after that. This saves checking whether each sample is in the train set.
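A rough sketch of that idea, reusing train_dict, test_set and sample from the earlier code (note that the original row order of the test set is lost):
train_keys = set(train_dict)
test_keys = set(test_set)

# duplicates: take the known label from the training dictionary
results = [train_dict[doc] for doc in (test_keys & train_keys)]
# documents only in the test set: fall back to the standard sample
results.extend(sample for _ in (test_keys - train_keys))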
I am writing a piece of code that involves generating new parameter values in a double FOR loop and storing these values in a file. The loop iteration count can go as high as 10,000 * 100,000. I store the variable values in a string, which gets appended with newer values on every iteration. Finally, at the end of the loops, I write the complete string to a txt file.
op = open("output file path", "w+")
totresult = ""
for second in range(n):  # this user input parameter can be up to 100,000
    result = ""
    for car in cars_running:  # number of cars can be 10,000
        # code to check if the given car is in range of another car
        ...
        # if the car is in range of another car
        if distance < 1000:
            result = getDetailsofOtherCar()
            totresult = totresult + carName + result
# end of loops
op.write(totresult)
op.close()
My question here is: is there a better, more Pythonic way to perform this kind of logging? I'm guessing the string gets very bulky in the later iterations and may be slowing down execution. Is a string the best option for storing the values, or should I consider other Python data structures like a list or array? I came across the logging module but would like to get an opinion before switching to it.
I tried looking up for similar issues but found nothing similar to my current doubt.
Open to any suggestions
Thank you
Edit: code added
You can write to the file as you go e.g.
with open("output.txt", "w") as log:
for i in range(10):
for j in range(10):
log.write(str((i,j)))
Update: whether directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine the crossover seems to happen when the buffer reaches around 350 MB.
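If you do want to keep buffering in memory, collecting the pieces in a list and joining them once at the end avoids the cost of repeatedly concatenating an ever-growing string; a sketch with the same toy loop:
rows = []
for i in range(10):
    for j in range(10):
        rows.append(str((i, j)))

# one join and one write instead of growing a string on every iteration
with open("output.txt", "w") as log:
    log.write("".join(rows))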
I want to create sequences of my dataset. However, Tensorflow only provides the function:
tf.parse_single_example()
I tried to work around this problem by using tf.py_func and something like this:
dataset.map(lambda x: tf.py_func(_parse_tf_record, [x, sequence_length]))

# the body of _parse_tf_record looks roughly like this:
for sequence_id in range(0, sequence_length):
    filename = x
    # files only contain one record
    for record in tf.python_io.tf_record_iterator(filename, options):
        ...
        tf.parse_single_example(...)
        ...
        break  # only one sample per file
So for every map call I read sequence_length files. However, this cannot be done in parallel since tf.py_func does not allow it.
A tensorflow example is a single conceptual unit and it should be independent from the other examples (so that batching and shuffling work properly).
If you want more data to be grouped together you should write it as a single example.
To make things easier, there is tf.train.SequenceExample, which works with tf.parse_single_sequence_example. It has a context part that is common to all entries in the sequence and a sequence part that is repeated for every step. This is commonly used when working with recurrent networks (LSTMs and the like), but you can use it whenever it makes sense in your context.
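As a rough sketch of what that looks like with the TF 1.x API (the feature names, shapes and file name here are made up for illustration):
import tensorflow as tf

context_features = {
    "label": tf.FixedLenFeature([], dtype=tf.int64),  # one value per sequence
}
sequence_features = {
    "frames": tf.FixedLenSequenceFeature([128], dtype=tf.float32),  # one vector per step
}

def _parse(serialized_example):
    context, sequence = tf.parse_single_sequence_example(
        serialized_example,
        context_features=context_features,
        sequence_features=sequence_features)
    return context["label"], sequence["frames"]

dataset = tf.data.TFRecordDataset(["sequences.tfrecord"]).map(_parse)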
I have built a classifier and trained and tested it on labeled data. Now I want to test it further by making predictions on a dataset without the labels. I already know the labels myself, but I want to remove them for the purpose of testing and have it print out the values with a 0 prediction so I can compare the accuracy myself. I'm using the following code to iterate through my dataset and make a prediction for each row in the DataFrame:
malware = set()
for index, row in dataset.iterrows():
    res = clf.predict([row])
    if res == 0:
        malware.add(index)
print(malware)
f.write(str(malware) + "\n")
It seems to be working; however, it's not a quick process. Is there a better way, or anything I can do to speed it up?
Using a for loop to iterate through elements in a dataset is slow in general. What you want to do is apply your function to every element in the column(s), and generate a series of labels according to the result. (Assuming you're using Pandas for the dataframe, by the way)
labels = dataset.apply(clf.predict)
You can then just scan through this series with a for loop. That should be relatively instant.
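To illustrate the idea, a minimal sketch that hands the whole DataFrame to the classifier in one call (it assumes dataset contains exactly the feature columns the model was trained on):
predictions = clf.predict(dataset)               # one vectorised call instead of a per-row loop
malware = set(dataset.index[predictions == 0])   # indices predicted as class 0
print(malware)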
After a bit of work I have turned the comment from Ding into a workable answer that is much quicker. My new code is:
from collections import OrderedDict
malware = []
malware.append(OrderedDict.fromkeys(dataset.index[clf.predict(dataset) == 0]))
print (malware)
Thanks very much Ding!
I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This returns a list of lists, numericalData[][], where each inner list contains around 140k values.
From this list, I wish to create testing and training data. For my testing data I want to have 30% of the read data, numericalData, so I use the following bit of code:
testingAmount = len(numericalData[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried comparing each element in my testingData and, if the elements weren't equal, adding it to my trainingData; this resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save the remaining column to my trainingData; alas, that didn't work either.
I've only been working with python for the past week so I'm a bit lost on what to try now.
You can use random.shuffle and split list after that. For toy example:
import random
data = list(range(1, 11))  # list() so it can be shuffled in place (also works on Python 3)
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.
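Applied to the situation above, here is a sketch using numpy's random.permutation instead of shuffling a list in place; it assumes numericalData is a list of equal-length columns and splits the rows 70/30 without replacement:
import numpy as np

n_rows = len(numericalData[0])
indices = np.random.permutation(n_rows)      # a shuffled index for every row
split = int(n_rows * 0.3)                    # 30% for testing
test_idx, train_idx = indices[:split], indices[split:]

testingData = [np.asarray(col)[test_idx] for col in numericalData]
trainingData = [np.asarray(col)[train_idx] for col in numericalData]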
I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to do on this dataset is a pairwise comparison between each of the elements. This requires two loops: one to iterate over each element, and an inner loop to iterate over every other element. This operation thus involves N(N-1)/2 comparisons.
For fairly small sets I found it faster to dump the contents into a multidimensional numpy array and then do my iteration. For large sets I run into memory problems and need to access each element of the dataset at run time.
Putting the elements into an array gives me about 600 comparisons per second, while operating on hdf5 data itself gives me about 300 comparisons per second.
Is there a way to speed this process up?
Example follows (this is not my real code, just an example):
Small Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    elements = np.empty((N_elements, 1e5))
    for ii, d in enumerate(data):
        elements[ii] = d['element']

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii + 1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])
Large Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii + 1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
Two approaches I'd suggest here:
numpy memmap: create a memory-mapped array, put the data inside it, and then run the code from "Small Set" on it (see the sketch after this list). Memory maps behave almost like in-memory arrays.
Use multiprocessing-module to allow parallel processing: if the "compare" method consumes at least a noticeable amount of CPU time, you could use more than one process.
Assuming you have more than one core in the CPU, this will speed things up significantly. Use
one process to read the data from the HDF5 file and put it into a queue,
one process to grab items from the queue, do the comparison, and put the results into an "output queue",
and one process to collect the results again.
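A sketch of the memmap variant, following the structure of the "Small Set" example above (the file name and dtype are placeholders):
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    # elements live in a file on disk instead of RAM, but are indexed like an array
    elements = np.memmap('elements.dat', dtype='float64', mode='w+',
                         shape=(N_elements, int(1e5)))
    for ii, d in enumerate(data):
        elements[ii] = d['element']

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii + 1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])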
Before choosing an approach: "Know your enemy", i.e., use profiling! Optimizations are only worth the effort if you improve the bottlenecks, so first find out which methods consume your precious CPU time.
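For the profiling step, the standard-library cProfile module is usually enough; a generic example, where main() stands in for whatever your entry point is:
import cProfile
import pstats

cProfile.run('main()', 'stats.prof')            # profile the entry point
stats = pstats.Stats('stats.prof')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive calls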
Your algorithm is O(n^2), which is not good for large data. Don't you see any chance to reduce this, e.g., by applying some logic? This is always the best approach.
Greetings,
Thorsten