Given a DataFrame, I would like to add a row if it's not in the DataFrame already.
if state not in df.index:
    # append a new all-zero row for this state
    df = df.append(pd.Series([0] * len(self.actions), index=df.columns, name=state))
state is a string like this [0 1 12 36 67 0 14 5 6 4] (a list of 10 entries, handed over as a string).
For the first few rows added, this takes about 0.0045 seconds on average. With 10,000+ rows already in the DataFrame it becomes significantly slower, about 0.0623 seconds, and with 100,000+ rows it takes something like 0.1364 seconds...
Is there any way to speed up the check if the index already exists? I am new to Python, but maybe there is a way to keep the index in RAM and check against that for better performance? Maybe hashing the index would speed it up, or maybe a combination of both?
Any hint is highly appreciated!
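In case it helps to make the idea concrete, here is a rough sketch (not tested against the asker's class; `actions` stands in for `self.actions`): keep the already-seen states in a plain Python set for cheap hashed membership checks, and collect the new rows in a dict so the DataFrame is only rebuilt once. Much of the slowdown likely comes from df.append copying the whole frame each time, so batching the appends is probably what matters most.

import pandas as pd

actions = [0, 1, 2, 3]                      # stand-in for self.actions
df = pd.DataFrame(columns=range(len(actions)))

seen_states = set(df.index)                 # plain Python set for cheap membership checks
new_rows = {}

def ensure_state(state):
    if state not in seen_states:
        seen_states.add(state)
        new_rows[state] = [0] * len(actions)

# ... call ensure_state(state) wherever the original check happened ...

# fold the collected rows into the DataFrame in one go
if new_rows:
    new_part = pd.DataFrame.from_dict(new_rows, orient="index", columns=df.columns)
    df = pd.concat([df, new_part])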
To give an overview of the data: there are multiple rows that share the same id and, furthermore, have several columns with identical values. Some functions produce the same result for every row with a given id, so I group by this id, perform those functions once per group, and then loop through each row within the group to perform the functions that yield a different result per row, even for the same id.
Here is some sample data:
id  map_sw_lon  map_sw_lat  map_ne_lon  map_ne_lat  exact_lon  exact_lat
1   10          15          11          16          20         30
1   10          15          11          16          34         50
2   20          16          21          17          44         33
2   20          16          21          17          50         60
Here is my code:
import h3
from shapely.geometry import box

for id, group in df.groupby("id", sort=False):
    viewport = box(group["map_sw_lon"].iloc[0],
                   group["map_sw_lat"].iloc[0],
                   group["map_ne_lon"].iloc[0],
                   group["map_ne_lat"].iloc[0])
    center_of_viewport = viewport.centroid
    center_hex = h3.geo_to_h3(center_of_viewport.y, center_of_viewport.x, 8)
    # everything above here can be done only once per group.
    # everything below needs to be done per row per group.
    for index, row in group.iterrows():
        current_hex = h3.geo_to_h3(row["exact_lat"], row["exact_lon"], 8)
        df.at[index, 'hex_id'] = current_hex
        df.at[index, 'hit_count'] = 1
        df.at[index, 'center_hex'] = center_hex
        distance_to_center = h3.h3_distance(current_hex, center_hex)
        df.at[index, 'hex_dist_to_center'] = distance_to_center
This code runs in around 5 minutes for 1 million rows of data. The problem is that I'm dealing with data much larger than that and need something that works faster. I know it isn't recommended to use for loops in Pandas, but I'm not sure how to solve this problem without them. Any help would be appreciated.
Edit: Still struggling with this... any help would be appreciated!
You need to do some profiling to see how much time each part of the code takes to run. I would conjecture that the most time-consuming parts are the geo_to_h3 and h3_distance calls. If so, other possible improvements to the DataFrame operations (e.g., using DataFrame.apply and GroupBy.transform) would not help a lot.
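For what it's worth, a rough way to test that conjecture (a sketch, reusing the df and column names from the question) is to time the h3 calls separately from the DataFrame writes on a small sample:

import time
import h3

sample = df.head(10000)

t0 = time.perf_counter()
for _, row in sample.iterrows():
    h3.geo_to_h3(row["exact_lat"], row["exact_lon"], 8)
t1 = time.perf_counter()
print(f"geo_to_h3 on {len(sample)} rows: {t1 - t0:.3f}s")

t0 = time.perf_counter()
for index in sample.index:
    df.at[index, "hit_count"] = 1
t1 = time.perf_counter()
print(f"df.at writes on {len(sample)} rows: {t1 - t0:.3f}s")

If the first number dominates, the loop structure itself is not the main cost.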
I have a large dataframe and I want to overwrite X entries of it with a new value I set. The entries have to start at a random position, but they have to be consecutive. For example, I have a column with random numbers and want to overwrite 20 of them in a row with the new value x.
I tried df.sample(X) and then updating the dataframe, but that only gives me individual scattered entries, while I need the X new entries in a row (consecutive).
Does somebody have a solution? I'm quite new to Python and have to get into it for my master's thesis.
CLARIFICATION:
My dataframe has 5 columns with almost 60,000 rows, each row for 10 minutes of the year.
One Column is 'output' with electricity production values for that 10 minutes.
For 2 consecutive hours (120 consecutive minutes, hence 12 consecutive rows) of the year I want to lower that production to 60%. I want it to happen at a random time of the year.
Another column is 'status', with information about if the production is reduced or not.
I tried:
df_update = df.sample(12)
df_update.status = 'reduced'
df.update(df_update)
df.loc[df['status'] == 'reduced', ['production']] *= 0.6
which does the trick for the total amount of time (12*10 minutes), but I want 120 consecutive minutes and not separated.
I decided to pick a random starting row and scale the next 12 entries down to 60%. I think this is what you want.
import numpy as np
import pandas as pd

df = pd.DataFrame({'output': np.random.randn(20), 'status': [0] * 20})
# sample a start row from all but the last 11 rows so 12 consecutive rows always exist
idx = df.iloc[:-11].sample(1).index.values[0]
df.loc[idx:idx+11, "output"] *= 0.6
df.loc[idx:idx+11, "status"] = 1
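As a quick sanity check on the toy frame above, the updated rows should come back as one consecutive block of 12:

changed = df[df["status"] == 1]
print(changed)
assert len(changed) == 12
assert changed.index[-1] - changed.index[0] == 11   # positions are consecutive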
I have been reading documentation for a few hours now and I feel I am approaching the problem with the wrong mindset.
I have two tables in Hive which I read with spark.table(table_A). They have the same number and types of columns, but different origins, so their data is different. Both tables hold flags that show whether or not a condition is met. There are at least 20 columns, and they could increase in the future.
If the first row of table_A is 0 0 1 1 0 and the first row of table_B is 0 1 0 1 0, I would like the result to be the XNOR of the two, comparing positions: 1 0 0 1 1, since they have the same values in the first, fourth, and fifth positions.
So I thought of the XNOR operation, where if both values match it returns a 1, and a 0 otherwise.
I am facing a number of problems, one of them being the volume of my data (a sample of just 1 week is already around the 300 MB mark), so I am working with pyspark and avoiding pandas, since the data usually does not fit in memory and/or slows the operation down a lot.
Summing up, I have two objects of type pyspark.sql.dataframe.DataFrame, each has one of the tables, and so far the best I've got is something like this:
df_bitwise = df_flags_A.flag_column_A.bitwiseXOR(df_flags_B.flag_columns_B)
But sadly this returns a pyspark.sql.column.Column, and I do not know how to read that result or how to build a dataframe from it (I would like the end result to be something like 20 of the above operations, one per column, each forming a column of a dataframe).
What am I doing wrong? I feel like this is not the right approach.
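Not the asker's code, but one possible direction (a sketch): if the two tables share a key that lines their rows up, say a hypothetical id column, you can join on it and build the XNOR columns with when/otherwise, staying entirely in pyspark:

from pyspark.sql import functions as F

flag_cols = [c for c in df_flags_A.columns if c != "id"]   # "id" is an assumed join key

joined = df_flags_A.alias("a").join(df_flags_B.alias("b"), on="id")

# XNOR per flag column: 1 when the two values match, 0 otherwise
xnor_cols = [
    F.when(F.col(f"a.{c}") == F.col(f"b.{c}"), 1).otherwise(0).alias(c)
    for c in flag_cols
]
df_xnor = joined.select("id", *xnor_cols)
df_xnor.show()

The list comprehension covers however many flag columns the tables end up with, which matches the concern that the 20 columns could grow.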
I want to split a Spark dataframe into two pieces and define a row number for each row of the sub-dataframes. But I found that monotonically_increasing_id still defines the row numbers based on the original dataframe.
Here is what I did in python:
from pyspark.sql.functions import monotonically_increasing_id

# df is the original Spark DataFrame
splits = df.randomSplit([7.0, 3.0], 400)

# add a rowid column to the two subframes
set1 = splits[0].withColumn("rowid", monotonically_increasing_id())
set2 = splits[1].withColumn("rowid", monotonically_increasing_id())

# check the results
set1.select("rowid").show()
set2.select("rowid").show()
I would expect the first five values of rowid in both frames to be 1 to 5 (or 0 to 4, I can't remember clearly):
set1: 1 2 3 4 5
set2: 1 2 3 4 5
But what I actually got is:
set1: 1 3 4 7 9
set2: 2 5 6 8 10
The two subframes' row ids are actually their row ids in the original dataframe df, not new ones.
As a Spark newbie, I am seeking help on why this happens and how to fix it.
First of all, what version of Spark are you using? The monotonically_increasing_id implementation has been changed a few times. I can reproduce your problem in Spark 2.0, but the behavior seems to be different in Spark 2.2, so it could be a bug that was fixed in a newer Spark release.
That being said, you should NOT expect the values generated by monotonically_increasing_id to increase consecutively. In your code, I believe there is only one partition for each dataframe. According to http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
So your code should not expect the rowid to increase consecutively.
Besides, you should also consider caching and failure scenarios. Even if monotonically_increasing_id worked the way you expect, increasing the value consecutively, it still would not be reliable. What if a node fails? The partitions on the failed node will be regenerated from the source or the last cache/checkpoint, which might have a different order and thus different rowids. Eviction from the cache also causes issues: suppose that after generating a dataframe you cache it in memory and it later gets evicted; a future action will regenerate the dataframe, again giving out different rowids.
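If consecutive numbering per split is really needed, one workaround (a sketch, not the only option) is row_number over a window ordered by the monotonically increasing id. Note that a window with no partitionBy pulls each split into a single partition, so this is only reasonable when a split fits comfortably on one executor:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.orderBy(F.monotonically_increasing_id())

# rowid now runs 1, 2, 3, ... within each split, independent of the original df
set1 = splits[0].withColumn("rowid", F.row_number().over(w))
set2 = splits[1].withColumn("rowid", F.row_number().over(w))

Alternatively, converting each split to an RDD and using zipWithIndex gives consecutive ids without collapsing everything onto one partition.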
I have a pandas dataframe with a two-element hierarchical index ("month" and "item_id"). Each row represents a particular item at a particular month, and has columns for several numeric measures of interest. The specifics are irrelevant, so we'll just say we have column X for our purposes here.
My problem stems from the fact that items vary in the months for which they have observations, which may or may not be contiguous. I need to calculate the average of X, across all items, for the 1st, 2nd, ..., n-th month in which there is an observation for that item.
In other words, the first row in my result should be the average across all items of the first row in the dataframe for each item, the second result row should be the average across all items of the second observation for that item, and so on.
Stated another way, if we were to take all the date-ordered rows for each item and index them from i=1,2,...,n, I need the average across all items of the values of rows 1,2,...,n. That is, I want the average of the first observation for each item across all items, the average of the second observation across all items, and so on.
How can I best accomplish this? I can't use the existing date index, so do I need to add another index to the dataframe (something like what I describe in the previous paragraph), or is my only recourse to iterate across the rows for each item and keep a running average? That would work, but it doesn't leverage the power of pandas at all.
Adding some example data:
item_id  date        X      DUMMY_ROWS
20       2010-11-01  16759  0
         2010-12-01  16961  1
         2011-01-01  17126  2
         2011-02-01  17255  3
         2011-03-01  17400  4
         2011-04-01  17551  5
21       2007-09-01  4      6
         2007-10-01  5      7
         2007-11-01  6      8
         2007-12-01  10     9
22       2006-05-01  10     10
         2006-07-01  13     11
23       2006-05-01  2      12
24       2008-01-01  2      13
         2008-02-01  9      14
         2008-03-01  18     15
         2008-04-01  19     16
         2008-05-01  23     17
         2008-06-01  32     18
I've added a dummy rows column that does not exist in the data, for explanatory purposes. The operation I'm describing would effectively give the mean of rows 0, 6, 10, 12, and 13 (the first observation for each item), then the mean of rows 1, 7, 11, and 14 (the second observation for each item, excluding item 23 because it has only one observation), and so on.
One option is to reset the index then group by id.
import numpy as np

df_new = df.reset_index()
df_new.groupby(['item_id']).X.agg(np.mean)
This leaves your original df intact and gets you the mean across all months for each item_id.
For your updated question (great example, by the way) I think the approach would be to add an "item_sequence_id". I've done this in the past with similar data.
df.sort_values(['item_id', 'date'], inplace=True)

def sequence_id(item):
    # number the rows within each item_id from 0 to len(item) - 1
    item['seq_id'] = range(len(item))
    return item

df_with_seq_id = df.groupby(['item_id']).apply(sequence_id)
df_with_seq_id.groupby(['seq_id']).agg(np.mean)
The idea here is that the seq_id identifies the position of each data point in time per item_id; assigning non-unique seq_id values to the items allows you to group across multiple items. The context I've used this in before relates to users doing something first in a session. Using this ID structure I can identify all of the first, second, third, etc. actions taken by users regardless of their absolute time and user id.
Hopefully this is more of what you want.
Here's an alternative method for this I finally figured out (which assumes we don't care about the actual dates for the purposes of calculating the mean). Recall the method proposed by @cwharland:
def sequence_id(item):
    item['seq'] = range(len(item))
    return item

dfWithSeqID_old = df.groupby(level='item_id').apply(sequence_id)
Testing this on a 10,000 row subset of the data frame:
%timeit -n10 dfWithSeqID_old = shrink.groupby(level='item_id').apply(sequence_id)
10 loops, best of 3: 301 ms per loop
It turns out we can simplify things by remembering that pandas' default behavior (i.e. without specifying an index column) is to generate a numeric index for a dataframe, numbered from 0 to n-1 (where n is the number of rows in the frame). We can leverage this like so:
dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
The only difference in the output is that we have a new, unlabeled numeric index with the same content as the 'seq' column used in the previous answer, but it's almost 4 times faster (I can't compare the methods on the full 13-million-row dataframe, as the first method was resulting in memory errors):
%timeit -n10 dfWithSeqID_new = df.groupby(level='item_id').apply(lambda x: x.reset_index(drop=True))
10 loops, best of 3: 77.2 ms per loop
Calculating the average as in my original question is only slightly different. The original method was:
dfWithSeqID_old.groupby('seq').agg(np.mean).head()
But now we simply have to account for the fact that we're using the new unlabeled index instead of the 'seq' column:
dfWithSeqID_new.mean(level=1).head()
The result is the same.
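For reference, and not part of the original answers: the same per-position average can also be written with GroupBy.cumcount, which avoids the apply entirely (a sketch assuming the same ('month', 'item_id') index as above):

# position of each row within its item: 0 for the first observation, 1 for the second, ...
pos = df.groupby(level='item_id').cumcount()

# average X across all items at each position
result = df.groupby(pos)['X'].mean()
print(result.head())

On the example data this averages dummy rows 0, 6, 10, 12, and 13 for position 0, then rows 1, 7, 11, and 14 for position 1, and so on.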