Converting 10+GB data to ragged tensor - python

I have a Pandas dataframe which is 13.4 GB in size. Since it is a multi-index dataframe whose first-level groups vary in length, I would like to convert it to a ragged tensor. Here is how I do it:
def convert(df):
    X = df.groupby(level=0).apply(tf.convert_to_tensor)
    X = X.tolist()
    X = tf.ragged.stack(X)
    return X
However, due to the enormous size of the data, this process has become absolutely intractable.
I have found out the following:
By far most of the time is being spent in the line "X = tf.ragged.stack(X)"
tf.ragged.constant or even tf.RaggedTensor.from_row_lengths are not faster.
Any advice would be much appreciated!

What I thought I had found was wrong: using the factory functions such as tf.RaggedTensor.from_row_lengths is tremendously faster, especially on large datasets. I think that is because these functions require you to encode the data manually, whereas tf.ragged.constant forces TensorFlow to compute the encoding itself. They are also implemented in C++ rather than in Python. However, be sure to set the validate argument to False, because the difference in runtime is enormous.
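For reference, here is a minimal sketch of that approach, assuming the dataframe is sorted by its first index level (so group order matches row order) and contains only numeric columns; convert_fast is just an illustrative name:

import tensorflow as tf

def convert_fast(df):
    # one row length per first-level group, in the same order as the flat values
    row_lengths = df.groupby(level=0).size().to_numpy()
    flat_values = tf.convert_to_tensor(df.to_numpy())
    # validate=False skips the expensive consistency checks mentioned above
    return tf.RaggedTensor.from_row_lengths(flat_values, row_lengths, validate=False)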

Related

Pandas UDF Function Takes Unusually Long to Complete on Big Data

I'm new to PySpark and Pandas UDFs. I'm running the following Pandas UDF to jumble a column containing strings (for example, an input 'Luke' might result in 'ulek'):
pandas_udf("string")
def jumble_string(column: pd.Series)-> pd.Series:
return column.apply(lambda x: None if x==None else ''.join(random.sample(x, len(x))).lower())
spark_df = spark_df.withColumn("names", jumble_string("names"))
On running the above function on a large dataset I've noticed that the execution takes unusually long.
I'm guessing the .apply function has something to do with this issue.
Is there any way I can rewrite this function so it executes efficiently on a big dataset?
Please advise.
As the .apply method is not vectorized, the given operation is done by looping through the elements, which slows down the execution as the data size becomes large.
For small data sizes, the time difference is usually negligible. However, as the size increases, the difference starts to become noticeable. We are likely to deal with vast amounts of data, so time should always be taken into consideration.
You can read more about Apply vs Vectorized Operations here.
Therefore I decided to use a list comprehension, which did improve my performance marginally.
@pandas_udf("string")
def jumble_string(column: pd.Series) -> pd.Series:
    return pd.Series([None if x is None else ''.join(random.sample(x, len(x))).lower() for x in column])
There is no implemented function in Spark that jumbles the strings in a column, so we are forced to resort to UDFs or Pandas UDFs.
Your solution is actually quite nice; perhaps we can improve it by removing the .apply method from the series and using only base Python on a string for each row.
@pandas_udf("string")
def jumble_string_new(column: pd.Series) -> pd.Series:
    x = column.iloc[0]  # each pd.Series is made of only one element
    if x is None:
        return pd.Series([None])
    else:
        return pd.Series([''.join(random.sample(x, len(x))).lower()])
The results are the same as with your function; however, I've not been able to test it on a very large dataframe. Try it yourself and see whether it's computationally more efficient.
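For what it's worth, here is a small sanity check I would run on a toy DataFrame before scaling up, assuming the UDF above (and its imports) are already defined. The session setup and sample data are purely illustrative, and since the jumbling is random, the comparison is on sorted lowercase characters rather than on the jumbled strings themselves:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sample_df = spark.createDataFrame([("Luke",), ("Leia",), (None,)], ["names"])

# apply the list-comprehension version of the UDF and pull the result locally
checked = sample_df.withColumn("jumbled", jumble_string("names")).toPandas()
for original, jumbled in zip(checked["names"], checked["jumbled"]):
    if original is not None:
        # a jumbled string is a permutation of the lowercased original
        assert sorted(original.lower()) == sorted(jumbled)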

Best data type (in terms of speed/RAM) for millions of pairs of a single int paired with a batch (2 to 100) of ints

I have about 15 million pairs that consist of a single int, paired with a batch of (2 to 100) other ints.
If it makes a difference, the ints themselves range from 0 to 15 million.
I have considered using:
Pandas, storing the batches as python lists
Numpy, where each batch is stored as its own numpy array (since numpy doesn't allow variable-length rows in its 2D data structures)
Python List of Lists.
I also looked at Tensorflow tfrecords but not too sure about this one.
I only have about 12 GB of RAM. I will also be using it to train a machine learning algorithm, so memory is a real constraint.
If you must store all values in memory, numpy will probably be the most efficient way. Pandas is built on top of numpy so it includes some overhead which you can avoid if you do not need any of the functionality that comes with pandas.
Numpy should have no memory issues when handling data of this size, but another thing to consider (depending on how you will be using the data) is a generator that reads from a file with each pair on its own line. This would reduce memory usage significantly, but it would be slower than numpy for aggregate functions like sum() or max(), and it is more suitable if each value pair is processed independently.
with open(file, 'r') as f:
    data = (l for l in f)  # generator: yields one record (line) at a time
    for line in data:
        # parse and process each record here
        pass
I would do the following:
import numpy as np

# create example data
A = np.random.randint(0, 15000000, 100)
B = [np.random.randint(0, 15000000, k) for k in np.random.randint(2, 101, 100)]
int32 is sufficient
A32 = A.astype(np.int32)
We want to glue all the batches together.
First, write down the batch sizes so we can separate them later.
from itertools import chain
sizes = np.fromiter(chain((0,),map(len,B)),np.int32,len(B)+1)
boundaries = sizes.cumsum()
# force int32
B_all = np.empty(boundaries[-1],np.int32)
np.concatenate(B,out=B_all)
After gluing, re-split:
B32 = np.split(B_all, boundaries[1:-1])
Finally, make an array of pairs for convenience:
pairs = np.rec.fromarrays([A32,B32],names=["first","second"])
What was the point of glueing and then splitting again?
First, note that the re-split arrays are all views into B_all, so we do not waste much memory by having both. Also, if we modify either B_all or B32 (or rather, some of its elements) in place, the other one will be updated automatically as well.
The advantage of having B_all around is efficiency via numpy's reduceat ufunc method. If we wanted, for example, the means of all batches, we could do np.add.reduceat(B_all, boundaries[:-1]) / sizes[1:], which is faster than looping through pairs['second'].
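For example (continuing with the names above), a quick sketch of the reduceat version of the per-batch means against the plain loop:

# per-batch means via reduceat on the glued array
batch_sums = np.add.reduceat(B_all, boundaries[:-1])
batch_means = batch_sums / sizes[1:]   # sizes[0] is the leading 0, so skip it

# the equivalent (slower) loop over the ragged batches
batch_means_loop = np.array([b.mean() for b in pairs["second"]])
assert np.allclose(batch_means, batch_means_loop)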
Use numpy. It is the most efficient, and you can use it easily with a machine learning model.

Time complexity vs memory usage in pandas using large datasets

I am wondering what would be the most efficient way to create a column in a pandas dataframe that is 1 if an id_row exists in a given list and 0 otherwise.
I'm currently using a lambda function to apply the result. My problem is that it is taking a long time, as my dataframe is around 2M rows and the list it checks against has between 100k and 200k items. If I'm not mistaken, this is quadratic time (I'm really not sure, though), which in this case runs really slowly given the size of the objects.
The worst part is that I have to repeat this bit of code for over 100 other (different) dataframes.
Here is the function:
lst_to_add = [1,2,3.......,n]
df_table['TEST'] = df_table['id_row'].apply(lambda x: 1 if x in lst_to_add else 0)
I'm wondering how could I make the bit of code (way) more efficient.
I thought of a 'divide and conquer' solution using a recursive function perhaps, but I'm really open to any suggestions.
One last thing: I also have memory constraints, hence I'd prefer a method which takes a little more time but uses less memory than the alternative (if I have the choice).
You could do
df_table['TEST'] = (df_table['id_row'].isin(lst_to_add)).astype(int)
This code checks whether the id_row values are in lst_to_add and returns True or False, which astype(int) converts to 1s and 0s. Since this approach is vectorized (it acts on the entire series), it should be faster than using apply.
As far as time complexity goes, your list should be a set; this turns your O(M*N) solution into O(N), since set membership tests are constant time instead of linear time (as they are for lists). Then use the built-in method .isin:
lst_to_add = set(lst_to_add)
df_table['TEST'] = df_table['id_row'].isin(lst_to_add)
You should stick to the boolean type if memory is an issue, and you only want 0 and 1.
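A rough sketch of how that might look when both time and memory matter; the int8 and int64 lines are only there to show the trade-off, so pick whichever dtype your downstream code expects:

lst_to_add = set(lst_to_add)                 # O(1) membership tests
mask = df_table['id_row'].isin(lst_to_add)   # vectorized boolean Series, 1 byte per row

df_table['TEST'] = mask                      # keep as bool if memory is tight
# df_table['TEST'] = mask.astype('int8')     # still 1 byte per row, but stores 0/1
# df_table['TEST'] = mask.astype('int64')    # 8 bytes per row

print(mask.memory_usage(deep=True))          # compare footprints on your data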

How to Optimize the Python Code

All,
I am going to compute some feature values using the following Python code. But because the input sizes are very big, it is very time-consuming. Please help me optimize the code.
leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
transition_volume=len([x for x in dropoff_ids if x in pickup_ids])
union_ids=list(set(pickup_ids + dropoff_ids))
busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
Hi all,
Here is my revised version. I ran this code on my GPS trace data, and it is faster.
intersect_ids = set(pickup_ids).intersection(set(dropoff_ids))
union_ids = list(set(pickup_ids + dropoff_ids))
leaving_ids = set(pickup_ids) - intersect_ids
leaving_volume = len(leaving_ids)
arriving_ids = set(dropoff_ids) - intersect_ids
arriving_volume = len(arriving_ids)
transition_volume = len(intersect_ids)
busstop_density = np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare > 0])
if not busstop_density > 0:  # np.mean of an empty list is nan, so fall back to 0
    busstop_density = 0
smartcard_balance = np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance > 0])
if not smartcard_balance > 0:  # same nan guard as above
    smartcard_balance = 0
Many thanks for the help.
Just a few things I noticed, as some Python efficiency trivia:
if x not in dropoff_ids
Checking for membership using the in operator is more efficient on a set than a list. But iterating with for through a list is probably more efficient than on a set. So if you want your first two lines to be as efficient as possible you should have both types of data structure around beforehand.
list(set(pickup_ids + dropoff_ids))
It's more efficient to create your sets before you combine data, rather than creating a long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first comment)! A short sketch of both points follows below.
Above all you need to ask yourself the question:
Is the time I save by constructing extra data structures worth the time it takes to construct them?
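To make the first two points concrete, here is a minimal sketch, assuming pickup_ids and dropoff_ids start out as plain lists:

# build each set once, up front
pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)

# iterate over the lists, but test membership against the sets
leaving_volume = sum(1 for x in pickup_ids if x not in dropoff_set)
arriving_volume = sum(1 for x in dropoff_ids if x not in pickup_set)
transition_volume = sum(1 for x in dropoff_ids if x in pickup_set)

# take the union of the sets directly instead of concatenating two long lists first
union_ids = pickup_set | dropoff_set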
Next one:
np.sum([...])
I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure if this applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator and put it in a numpy structure.
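For what it's worth, a tiny sketch of that point: np.fromiter can consume a generator directly when the dtype (and ideally the count) is known, so the temporary list is avoidable:

import numpy as np

values = (x * x for x in range(1000))                 # generator, no intermediate list
arr = np.fromiter(values, dtype=np.int64, count=1000)
total = arr.sum()                                     # same result as np.sum([x * x for x in range(1000)])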
It looks like this is just a small fragment of your code. If you're really concerned about efficiency I'd recommend making use of numpy arrays rather than lists, and trying to stick within numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
If you're really, really concerned about efficiency then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here it might be pretty easy to translate over.
I can only support what machine yerning wrote in this post. If you are thinking of switching to numpy, make sure your variables pickup_ids and dropoff_ids are numpy arrays; maybe they already are, otherwise convert them:
dropoff_ids = np.array( dropoff_ids, dtype='i' )
pickup_ids = np.array( pickup_ids, dtype='i' )
Then you can make use of the function np.in1d(), which gives you a True/False array that you can simply sum over to get the total number of True entries.
leaving_volume = (~np.in1d(pickup_ids, dropoff_ids)).sum()
transition_volume = np.in1d(dropoff_ids, pickup_ids).sum()
arriving_volume = (~np.in1d(dropoff_ids, pickup_ids)).sum()
Somehow I have the feeling that transition_volume = len(dropoff_ids) - arriving_volume, but I'm not 100% sure right now.
Another function that could be useful to you is np.unique() if you want to get rid of duplicate entries which in a way will turn your array into a set.

Speeding slicing of big numpy array

I have a big array (1000 x 500000 x 6) that is stored in a PyTables file. I am doing some calculations on it that are fairly optimized in terms of speed, but what is taking the most time is the slicing of the array.
At the beginning of the script, I need to get a subset of the rows: reduced_data = data[row_indices, :, :] and then, for this reduced dataset, I need to access:
columns one by one: reduced_data[:,clm_indice,:]
a subset of the columns: reduced_data[:,clm_indices,:]
Getting these arrays takes forever. Is there any way to speed that up? Storing the data differently, for example?
You can try choosing the chunkshape of your array wisely, see: http://pytables.github.com/usersguide/libref.html#tables.File.createCArray
This option controls in which order the data is physically stored in the file, so it might help to speed up access.
With some luck, for your data access pattern, something like chunkshape=(1000, 1, 6) might work.
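For example, a hedged sketch of creating the array with an explicit chunkshape, using the newer create_carray spelling of the method linked above (file and node names are illustrative):

import tables

with tables.open_file("data.h5", mode="w") as h5:
    # one chunk per (all rows, single column, all 6 values) slab, so that
    # reduced_data[:, clm_indice, :] touches as few chunks as possible
    carray = h5.create_carray(h5.root, "data",
                              atom=tables.Float64Atom(),
                              shape=(1000, 500000, 6),
                              chunkshape=(1000, 1, 6))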
