tensorflow reading data from database - python

I am new to TensorFlow. I have a large amount of data in my database and I want a way to train a TensorFlow model on it. I understand how to do this if I were writing the data to a CSV file and then reading it back from the CSV.
But how do I do this directly from the database? I can connect to the database from my Python script and run an SQL query to retrieve the data, but how do I train in batches or epochs and shuffle the data?
Also the data is too big to hold in memory all at once.
Any tips on where to start?
Thank you

Let's reiterate the problem:
it is impossible to load all the data into memory (even if the data is trimmed of all unneeded metadata)
it is not possible (for technical or policy reasons) to first query the database, save the results to disk as a CSV file, and then work with the CSV file.
If we could implement either of the above then we wouldn't have the problem. We are stuck with querying the database somehow and we want to:
get the data in smallish chunks
Well, that is easy enough! Let's say that our database has a numeric primary key. Simply decide how many chunks you want the data in and take the modulus of the key:
# for 7 batches
key % 7 == 0 gets you the first batch
key % 7 == 1 gets you the second batch
... etc
Okay, so you want to add another requirement
get the data in random smallish chunks
Well, that's not too difficult. Let's just pick two random numbers, X (preferably a prime) and Y (less than the number of batches), and do the same thing, but like so:
# for 7 batches
( key * X + Y ) % 7 == 0 gets you the first batch
( key * X + Y ) % 7 == 1 gets you the second batch
... etc
You don't have a list of primes handy? No problem, just get a whole bunch and pick one at random.
For the next epoch use a different X and Y and you'll get different batches.
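A minimal sketch of how that could look in Python, using sqlite3 as a stand-in for whatever database you actually connect to; the table and column names (samples, id, features, label) are made up for illustration:
import random
import sqlite3  # hypothetical stand-in for your actual database driver

NUM_BATCHES = 7
PRIMES = [104729, 15485863, 32452843]  # "a whole bunch" of primes to pick from

def batch_generator(db_path, epoch_seed):
    """Yield one batch of rows at a time, grouped differently each epoch."""
    rng = random.Random(epoch_seed)
    x = rng.choice(PRIMES)            # the random prime X
    y = rng.randrange(NUM_BATCHES)    # the random offset Y
    conn = sqlite3.connect(db_path)
    try:
        for i in range(NUM_BATCHES):
            yield conn.execute(
                "SELECT features, label FROM samples WHERE (id * ? + ?) % ? = ?",
                (x, y, NUM_BATCHES, i),
            ).fetchall()
    finally:
        conn.close()

for epoch in range(10):
    for batch in batch_generator("data.db", epoch_seed=epoch):
        pass  # feed `batch` to your training step here
Only one batch is ever held in memory, and a new epoch_seed gives a new X and Y, hence a different grouping; each batch can then be fed to the model through whatever input mechanism you prefer (e.g. a feed_dict or a tf.data pipeline).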

Related

Python: improve performance in log writing to file

I am writing a piece of code that generates new parameter values over a double FOR loop and stores these values to a file. The loop iteration count can go as high as 10,000 * 100,000. I store the variable values in a string, which gets appended with newer values on every iteration. Finally, at the end of the loops I write the complete string to a txt file.
op = open("output file path", "w+")
totresult = ""
for n seconds:  # this user input parameter can be up to 100,000
    result = ""
    for car in (cars running):  # number of cars can be 10,000
        # Code to check if given car is in range of another car
        # ...
        # if car is in range of another car
        if distance < 1000:
            result = getDetailsofOtherCar()
            totresult = totresult + carName + result
# end of loops
op.write(totresult)
op.close()
My question here is: is there a better, more Pythonic way to perform this kind of logging? I am guessing the string gets very bulky in the later iterations and may be causing the delay in execution. Is a string the best option for storing the values, or should I consider other Python data structures like a list or array? I came across the logging Python module but would like to get an opinion before switching to it.
I tried looking up similar issues but found nothing matching my current problem.
Open to any suggestions
Thank you
Edit: code added
You can write to the file as you go e.g.
with open("output.txt", "w") as log:
for i in range(10):
for j in range(10):
log.write(str((i,j)))
Update: whether or not directly streaming the records is faster than concatenating them in a memory buffer depends crucially on how big the buffer becomes, which in turn depends on the number of records and the size of each record. On my machine this seems to kick in around 350MB.
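If you do want to defer the write to the end, a rough sketch of the list-buffer alternative the question asks about: appending to a list and joining once avoids rebuilding an ever-growing string on every iteration (the i/j loops here are just placeholders for the car loops).
lines = []
for i in range(10):
    for j in range(10):
        lines.append("{},{}\n".format(i, j))  # cheap append instead of string concatenation

with open("output.txt", "w") as log:
    log.write("".join(lines))                 # one join and one write at the end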

Creating two lists from one randomly

I'm using pandas to import a lot of data from a CSV file, and once read I format it to contain only numerical data. This returns a list of lists, numericalData[][], where each list contains around 140k pieces of data.
From this list, I wish to create Testing and Training Data. For my testing data, I want 30% of the data I read into numericalData, so I use the following bit of code:
testingAmount = len(numericalData0[0]) * trainingDataPercentage / 100
Works a treat. Then, I use numpy to select that amount of data from each column of my imported numericalData:
testingData.append(np.random.choice(numericalData[x], testingAmount) )
This then returns a sample with 38 columns (running in a loop), where each column has around 49k elements of data randomly selected from my imported numericalData.
The issue is, my trainingData needs to hold the other 70% of the data, but I'm unsure how to do this. I've tried comparing each element in my testingData and, if two elements aren't equal, adding it to my trainingData. This resulted in an error and didn't work. Next, I tried to delete the selected testingData from my imported data and then save that new column to my trainingData; alas, that didn't work either.
I've only been working with python for the past week so I'm a bit lost on what to try now.
You can use random.shuffle and split the list after that. A toy example:
import random
data = list(range(1, 11))  # shuffle works in place, so the range needs to be a list
random.shuffle(data)
training = data[:5]
testing = data[5:]
To get more information, read the docs.
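For the column-wise 30/70 split in the question, one option (a sketch on toy data) is to shuffle the row indices once with numpy and slice them; note that np.random.choice samples with replacement by default, which is one reason removing the chosen rows afterwards is awkward.
import numpy as np

column = np.arange(1000)                      # toy stand-in for one column of numericalData

testing_amount = int(len(column) * 30 / 100)  # 30% for testing
indices = np.random.permutation(len(column))  # shuffle the row indices once

testing = column[indices[:testing_amount]]    # 30% of the rows
training = column[indices[testing_amount:]]   # the remaining 70%, with no overlap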

Efficient way to intersect multiple large files containing geodata

Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 ~300-400 MB files based on 3 of their 139 attributes.
Now a bit more background. The files range from ~300-400 MB, consisting of 139 columns and typically 400,000-600,000 rows. I have three particular fields I want to join on: a unique ID, and latitude/longitude (with a bit of tolerance if possible). The goal is to determine which of these records existed across certain ranges of files. Worst case, that will mean performing a 22-file intersection.
So far, the following has failed:
I tried using MySQL to perform the join. This was back when I was only looking at 7 years. Attempting the join on 7 years (using INNER JOIN about 7 times... e.g. t1 INNER JOIN t2 ON condition INNER JOIN t3 ON condition ... etc), I let it run for about 48 hours before the timeout ended it. Was it likely to actually still be running, or does that seem overly long? Despite all the suggestions I found to enable better multithreading and more RAM usage, I couldn't seem to get the cpu usage above 25%. If this is a good approach to pursue, any tips would be greatly appreciated.
I tried using ArcMap. I converted the CSVs to tables and imported them into a file geodatabase. I ran the intersection tool on two files, which took about 4 days, and the number of records returned was more than twice the number of input features combined. Each file had about 600,000 records. The intersection returned 2,000,000 results. In other cases, not all records were recognized by ArcMap: ArcMap says there are 5,000 records when in reality there are 400,000+.
I tried combining them in Python. Firstly, I can immediately tell RAM is going to be an issue: each file takes up roughly 2 GB of RAM in Python when fully loaded. I do this with:
f1 = [row for row in csv.reader(open('file1.csv', 'rU'))]
f2 = [row for row in csv.reader(open('file2.csv', 'rU'))]
joinOut = csv.writer(open('Intersect.csv', 'wb'))
# combine the unique IDs from both files (list + list, since .extend() returns None)
uniqueIDs = set([row[uniqueIDIndex] for row in f1] + [row[uniqueIDIndex] for row in f2])
for uniqueID in uniqueIDs:
    f1rows = [row for row in f1 if row[uniqueIDIndex] == uniqueID]
    f2rows = [row for row in f2 if row[uniqueIDIndex] == uniqueID]
    if len(f1rows) == 0 or len(f2rows) == 0:
        pass  # not an intersect
    else:
        # Strings, split at the decimal; if the integer part and first 3 places
        # after the decimal are equal, they are spatially close enough
        f1lat = f1rows[0][latIndex].split('.')
        f1long = f1rows[0][longIndex].split('.')
        f2lat = f2rows[0][latIndex].split('.')
        f2long = f2rows[0][longIndex].split('.')
        if f1lat[0]+f1lat[1][:3] == f2lat[0]+f2lat[1][:3] and f1long[0]+f1long[1][:3] == f2long[0]+f2long[1][:3]:
            joinOut.writerows([f1rows[0], f2rows[0]])
Obviously, this approach requires that the files being intersected are available in memory. Well I only have 16GB of RAM available and 22 files would need ~44GB of RAM. I could change it so that instead, when each uniqueID is iterated, it opens and parses each file for the row with that uniqueID. This has the benefit of reducing the footprint to almost nothing, but with hundreds of thousands of unique IDs, that could take an unreasonable amount of time to execute.
So, here I am, asking for suggestions on how I can best handle this data. I have an i7-3770k at 4.4Ghz, 16GB RAM, and a vertex4 SSD, rated at 560 MB/s read speed. Is this machine even capable of handling this amount of data?
Another avenue I've thought about exploring is an Amazon EC2 cluster and Hadoop. Would that be a better idea to investigate?
Suggestion: Pre-process all the files to extract the 3 attributes you're interested in first. You can always keep track of the file/rownumber as well, so you can reference all the original attributes later if you want.
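A rough sketch of that pre-processing pass; the column indices and the file pattern are placeholders to adjust to the real layout.
import csv
import glob

UNIQUE_ID_IDX, LAT_IDX, LONG_IDX = 0, 1, 2    # adjust to the real column positions

for path in glob.glob("*.csv"):               # the ~22 input files
    with open(path, newline="") as src, open(path + ".slim", "w", newline="") as dst:
        writer = csv.writer(dst)
        for rownum, row in enumerate(csv.reader(src)):
            # keep only the three join attributes plus a pointer back to the source row
            writer.writerow([row[UNIQUE_ID_IDX], row[LAT_IDX], row[LONG_IDX], path, rownum])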

Looking for a quick way to speed up my code

I am looking for a way to speed up my code. I managed to speed up most parts of it, reducing the runtime to about 10 hours, but it's still not fast enough, and since I'm running out of time I'm looking for a quick way to optimize it.
An example:
text = pd.read_csv(os.path.join(dir,"text.csv"),chunksize = 5000)
new_text = [np.array(chunk)[:,2] for chunk in text]
new_text = list(itertools.chain.from_iterable(new_text))
In the code above I read in about 6 million rows of text documents in chunks and flatten them. This code takes about 3-4 hours to execute and is the main bottleneck of my program. Edit: I realized that I wasn't very clear on what the main issue was: the flattening is the part that takes the most time.
Also this part of my program takes a long time:
train_dict = dict(izip(text,labels))
result = [train_dict[test[sample]] if test[sample] in train_dict else predictions[sample] for sample in xrange(len(predictions))]
The code above first zips the text documents with their corresponding labels (this is a machine learning task, with train_dict being the training set). Earlier in the program I generated predictions on a test set. There are duplicates between my train and test set, so I need to find those duplicates. Therefore, I iterate over my test set row by row (2 million rows in total); when I find a duplicate, I don't want to use the predicted label but the label from the duplicate in train_dict. I assign the result of this iteration to the variable result in the above code.
I heard there are various libraries in Python that can speed up parts of your code, but I don't know which of them could do the job, and right now I don't have the time to investigate this, which is why I need someone to point me in the right direction. Is there a way to speed up the code snippets above?
edit2
I have investigated again, and it is definitely a memory issue. I tried to read the file row by row, and after a while the speed declined dramatically; furthermore, my RAM usage is nearly 100% and Python's disk usage increased sharply. How can I decrease the memory footprint? Or should I find a way to make sure that I don't hold everything in memory?
edit3
As memory is the main issue, I'll give an outline of part of my program. I have dropped the predictions for the time being, which reduced the complexity significantly; instead, I insert a standard sample for every non-duplicate in my test set.
import numpy as np
import pandas as pd
import itertools
import os
train = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
train_2 = pd.read_csv(os.path.join(dir,"Train.csv"),chunksize = 5000)
test = pd.read_csv(os.path.join(dir,"Test.csv"), chunksize = 80000)
sample = list(np.array(pd.read_csv(os.path.join(dir,"Samples.csv"))[:,2]))#this file is only 70mb
sample = sample[1]
test_set = [np.array(chunk)[:,2] for chunk in test]
test_set = list(itertools.chain.from_iterable(test_set))
train_set = [np.array(chunk)[:,2] for chunk in train]
train_set = list(itertools.chain.from_iterable(train_set))
labels = [np.array(chunk)[:,3] for chunk in train_2]
labels = list(itertools.chain.from_iterable(labels))
"""zipping train and labels"""
train_dict = dict(izip(train,labels))
"""finding duplicates"""
results = [train_dict[test[item]] if test[item] in train_dict else sample for item in xrange(len(test))]
Although this isn't my entire program, this is the part of my code that needs optimization. As you can see, I am only using three important modules in this part: pandas, numpy and itertools. The memory issues arise when flattening train_set and test_set. The only thing I am doing is reading in the files, getting the necessary parts, zipping the train documents with the corresponding labels into a dictionary, and then searching for duplicates.
edit 4
As requested, I'll give an explanation of my data sets. My Train.csv contains 4 columns. The first column contains IDs for every sample, the second column contains titles, and the third column contains text body samples (varying from 100-700 words). The fourth column contains category labels. Test.csv contains only the IDs, titles and text bodies. The columns are separated by commas.
Could you please post a dummy sample data set of a half dozen rows or so?
I can't quite see what your code is doing and I'm not a Pandas expert, but I think we can greatly speed up this code. It reads all the data into memory and then keeps re-copying the data to various places.
By writing "lazy" code we should be able to avoid all the re-copying. The ideal would be to read one line, transform it as we want, and store it into its final destination. Also this code uses indexing when it should be just iterating over values; we can pick up some speed there too.
Is the code you posted your actual code, or something you made just to post here? It appears to contain some mistakes so I am not sure what it actually does. In particular, train and labels would appear to contain identical data.
I'll check back and see if you have posted sample data. If so I can probably write "lazy" code for you that will have less re-copying of data and will be faster.
EDIT: Based on your new information, here's my dummy data:
id,title,body,category_labels
0,greeting,hello,noun
1,affirm,yes,verb
2,deny,no,verb
Here is the code that reads the above:
def get_train_data(training_file):
    with open(training_file, "rt") as f:
        next(f)  # throw away "headers" in first line
        for line in f:
            lst = line.rstrip('\n').split(',')
            # lst contains: id,title,body,category_labels
            yield (lst[1], lst[2])
train_dict = dict(get_train_data("data.csv"))
And here is a faster way to build results:
results = [train_dict.get(x, sample) for x in test]
Instead of repeatedly indexing test to find the next item, we just iterate over the values in test. The dict.get() method handles the if x in train_dict test we need.
You can try Cython. It supports numpy and can give you a nice speedup.
Here is an introduction and explanation of what needs to be done
http://www.youtube.com/watch?v=Iw9-GckD-gQ
If the order of your rows is not important, you can use sets: find the elements that are in both the train set and the test set (intersection, trainset & testset) and add their training labels to your "result" first, then use the set difference (testset - trainset) to add the elements that are in your test set but not in the train set. This saves checking whether each sample is in the train set.
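A rough sketch of that idea, reusing the train_dict, test_set and sample names from the question's code and assuming the order of the results does not matter:
train_keys = set(train_dict)                  # texts that appear in the training set
test_keys = set(test_set)

duplicates = test_keys & train_keys           # test texts that also occur in training
only_new = test_keys - train_keys             # test texts with no training counterpart

results = [train_dict[text] for text in duplicates]  # reuse the known training label
results += [sample] * len(only_new)                  # standard sample for everything else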

How can I group a large dataset

I have a simple text file containing two columns, both integers:
1 5
1 12
2 5
2 341
2 12
and so on..
I need to group the dataset by the second value, such that the output will be:
5 1 2
12 1 2
341 2
Now the problem is that the file is very big, around 34 GB in size. I tried writing a Python script to group the rows into a dictionary whose values are arrays of integers, but it still takes way too long. (I guess a lot of time is spent allocating the array('i') objects and extending them on append.)
I am now planning to write a Pig script which I plan to run on a pseudo-distributed Hadoop machine (an Amazon EC2 High-Memory Large instance).
data = load 'Net.txt';
gdata = Group data by $1; // I know it will lead to 5 (1,5) (2,5) but thats okay for this snippet
store gdata into 'res.txt';
I wanted to know if there was any simpler way of doing this.
Update:
Keeping such a big file in memory is out of the question. For the Python solution, what I planned was to conduct 4 runs: in the first run only second-column values from 1 to 10 million are considered, in the next run 10 million to 20 million are considered, and so on. But this turned out to be really slow.
The Pig/Hadoop solution is interesting because it keeps everything on disk [well, most of it].
For better understanding: this dataset contains information about the connectivity of ~45 million Twitter users, and the file format means that the user ID given by the second number is following the first one.
The solution I had used:
import array

class AdjDict(dict):
    """
    A special dictionary class to hold an adjacency list
    """
    def __missing__(self, key):
        """
        Missing is changed such that when a key is not found an integer array is initialized
        """
        self.__setitem__(key, array.array('i'))
        return self[key]

Adj = AdjDict()

for line in open("net.txt"):
    entry = line.strip().split('\t')
    node = int(entry[1])
    follower = int(entry[0])
    if node < 10 ** 6:
        Adj[node].append(follower)

# Code for writing the Adj matrix to the file
Assuming you have ~17 characters per line (a number I picked randomly to make the math easier), you have about 2 billion records in this file. Unless you are running with much physical memory on a 64-bit system, you will thrash your pagefile to death trying to hold all this in memory in a single dict. And that's just to read it in as a data structure - one presumes that after this structure is built, you plan to actually do something with it.
With such a simple data format, I should think you'd be better off doing something in C instead of Python. Cracking this data shouldn't be difficult, and you'll have much less per-value overhead. At minimum, just to hold 2 billion 4-byte integers would be 8 Gb (unless you can make some simplifying assumptions about the possible range of the values you currently list as 1 and 2 - if they will fit within a byte or a short, then you can use smaller int variables, which will be worth the trouble for a data set of this size).
If I had to solve this on my current hardware, I'd probably write a few small programs:
The first would work on 500-megabyte chunks of the file, swapping columns and writing the result to new files. (You'll get 70 or more.) (This won't take much memory.)
Then I'd call the OS-supplied sort(1) on each small file. (This might take a few gigs of memory.)
Then I'd write a merge-sort program that would merge together the lines from all 70-odd sub-files. (This won't take much memory.)
Then I'd write a program that would run through the large sorted list; you'll have a bunch of lines like:
5 1
5 2
12 1
12 2
and you'll need to return:
5 1 2
12 1 2
(This won't take much memory.)
By breaking it into smaller chunks, hopefully you can keep the RSS down to something that would fit a reasonable machine -- it will take more disk I/O, but on anything but astonishing hardware, swap use would kill attempts to handle this in one big program.
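For what it's worth, a compressed sketch of that split/sort/merge pipeline using only the Python standard library; a real run would write each sorted chunk to a temporary file and merge those, rather than holding the chunks in memory, but the structure is the same.
import heapq
import itertools

def sorted_chunks(path, chunk_lines=5_000_000):
    """Yield sorted chunks of (second_column, first_column) integer pairs."""
    with open(path) as f:
        while True:
            lines = list(itertools.islice(f, chunk_lines))
            if not lines:
                break
            yield sorted((int(b), int(a)) for a, b in (ln.split() for ln in lines))

merged = heapq.merge(*sorted_chunks("net.txt"))  # the merge-sort step

with open("grouped.txt", "w") as out:            # the final grouping pass
    for key, group in itertools.groupby(merged, key=lambda p: p[0]):
        out.write("%d %s\n" % (key, " ".join(str(v) for _, v in group)))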
Maybe you can do a multi-pass through the file.
Process a range of keys on each pass through the file; for example, if you picked a range size of 100:
1st pass - work out all the keys from 0-99
2nd pass - work out all the keys from 100-199
3rd pass - work out all the keys from 200-299
4th pass - work out all the keys from 300-399
..and so on.
for your sample, the 1st pass would output
5 1 2
12 1 2
and the 4th pass would output
341 2
Choose the range size so that the dict you are creating fits into your RAM.
I wouldn't bother using multiprocessing to try to speed it up with multiple cores; unless you have a very fast hard drive this should be I/O bound, and you would just end up thrashing the disk.
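A minimal sketch of the multi-pass idea; the range size and the upper bound on the keys are illustrative and should be tuned so that one pass's dict fits in RAM.
from collections import defaultdict

RANGE = 10_000_000   # choose this so that one pass's dict fits in memory

def one_pass(path, lo, hi, out):
    groups = defaultdict(list)
    with open(path) as f:
        for line in f:
            first, second = line.split()
            key = int(second)
            if lo <= key < hi:               # only keep keys that belong to this pass
                groups[key].append(first)
    for key in sorted(groups):
        out.write("%d %s\n" % (key, " ".join(groups[key])))

with open("grouped.txt", "w") as out:
    for lo in range(0, 50_000_000, RANGE):   # ~45 million user IDs per the question
        one_pass("net.txt", lo, lo + RANGE, out)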
If you are working with a 34 GB file, I'm assuming that the hard drive, both in terms of storage and access time, is not a problem. How about reading the pairs sequentially, and when you find pair (x,y), opening file "x", appending " y" and closing file "x"? In the end, you will have one file per Twitter user ID, each containing all the users this one is connected to. You can then concatenate all those files if you want your result in the output format you specified.
THAT SAID HOWEVER, I really do think that:
(a) for such a large data set, exact resolution is not appropriate and that
(b) there is probably some better way to measure connectivity, so perhaps you'd like to tell us about your end goal.
Indeed, you have a very large graph and a lot of efficient techniques have been devised to study the shape and properties of huge graphs---most of these techniques are built to work as streaming, online algorithms.
For instance, a technique called triangle counting, coupled with probabilistic cardinality estimation algorithms, efficiently and speedily provides information on the cliques contained in your graph. For a better idea on the triangle counting aspect, and how it is relevant to graphs, see for example this (randomly chosen) article.
I had a similar requirement; you just need one more Pig statement to remove the redundancies in 5 (1,5) (2,5).
a = LOAD 'edgelist' USING PigStorage('\t') AS (user:int,following:int);
b = GROUP a BY user;
x = FOREACH b GENERATE group.user, a.following;
store x INTO 'following-list';
