Creating a Numpy Array from two Series - python

I had a DataFrame called "segments" that looks like the below:
   ORIGIN_AIRPORT_ID  DEST_AIRPORT_ID  FL_COUNT  ORIGIN_INDEX  DEST_INDEX  OUTDEGREE    WEIGHT
0              10135            10397        77           119         373          3  0.333333
1              10135            11433        85           119        1375          3  0.333333
Using this, I created two Boolean Series objects: One in which I'm storing all the IDs for which the WEIGHT column is not 0 and one in which they are:
Zeroes = (segments['WEIGHT'] == 0).groupby(segments['ORIGIN_INDEX']).all()
Non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()
I want to do two things (because I'm not sure which this task needs):
Create a NumPy vector where all "True" values in the Non_zeroes Series are set to 1/4191 (~0.024%) and all "True" values in the Zeroes Series are set to 0 (or the same logic using the True and False values of a single Series), keeping the IDs (e.g. ORIGIN_INDEX 119 0.024%, etc.)
And I'd also like to create a NumPy vector that is JUST a list of the percentages and zeroes WITHOUT the IDs
EDIT to add extra detail requested!
I tried using a condition as a variable, then using .loc to apply it:
cond_array = copied.WEIGHT is not 0
df.loc[cond_array, ID] = 1/4191
I tried using from_coo(), toarray(), and DataFrame to convert:
pd.Series.sparse.from_coo(P, dense_index=True)
P.toarray()
pd.DataFrame(P)
Finally, I tried applying logic to the DF instead of the COO Matrix. I THINK this gets close, but it is still failing. I believe it fails because it is not including the 0s (copied is just a DF that's a copy of segments):
copied['WEIGHT'] = copied.loc[copied['WEIGHT'] != 0, 'WEIGHT'] = float((1/len(copied))) #0.00023860653
The last code passes the first two tests (testing if it's an array and that it sums to 1.0), but fails the last
assert np.isclose(x0.max(), 1.0/n_actual, atol=10*n*np.finfo(float).eps), "x0` values seem off..."
EDIT 2:
Had the wrong count. It was supposed to be 1/300, not 1/4191. All fixed now, thanks all who took a look :)
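A minimal sketch of one way to get both vectors (the toy segments frame and the denominator below are just for illustration; per the final edit the uniform value should be 1/300):
import numpy as np
import pandas as pd

# toy stand-in for `segments`, using the column names from the question
segments = pd.DataFrame({'ORIGIN_INDEX': [119, 119, 373],
                         'WEIGHT':       [0.333333, 0.333333, 0.0]})

# True for every ORIGIN_INDEX whose weights are all non-zero
non_zeroes = (segments['WEIGHT'] != 0).groupby(segments['ORIGIN_INDEX']).all()

n = 300  # assumed denominator; use whatever count the task actually needs
labeled = pd.Series(np.where(non_zeroes, 1.0 / n, 0.0), index=non_zeroes.index)  # keeps the IDs
plain = labeled.to_numpy()  # just the values, without the IDs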

Iterate through 200 datasets [duplicate]

This question already has answers here:
Creating multiple dataframes with a loop
(3 answers)
Closed 1 year ago.
I have 200 datasets and I want to iterate through them, pick random rows from each, and add them to another, initially empty, dataset using iloc and .values. When I execute the code it does not give an error, but it also does not add anything to the empty dataset. However, when I run the single command to check whether the random row has any value, it gives this error:
AttributeError: 'str' object has no attribute 'iloc'.
my code is given below:
Tdata = np.zeros([20, 6])
k = 0
for j in range(200):
    for j1 in range(0, 20):
        Tdata[k:k+1, :] = (('dataset'+j)).iloc[random.randint(100)].values
        k += 1
('dataset'+j) is basically selecting different datasets. The names of my datasets are dataset0, dataset1, dataset2, ...; they are already defined.
There are multiple issues with your code.
1. Using str in place of the actual DataFrame variable
You are trying to use .iloc on a string such as 'dataset1'. This won't work because str has no attribute .iloc, exactly as the error tells you.
Since you want to work with DataFrame variable names, you may need to use eval() to interpret the string as a variable name. NOTE: BE EXTRA CAREFUL while using eval(). Please read up on the dangers of eval() before using it.
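As an aside, a common way to avoid eval() altogether is to keep the DataFrames in a dict keyed by name, for example:
import numpy as np
import pandas as pd

# hypothetical illustration: store the frames in a dict instead of 200 separate variables
datasets = {f'dataset{i}': pd.DataFrame(np.random.random((100, 3))) for i in range(200)}
row = datasets['dataset17'].iloc[np.random.randint(0, 100)]  # look up by name, no eval() needed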
2. Sampling 20 rows from each DataFrame.
If you are trying to get 20 rows by using for j1 in range(0, 20): along with random.randint(100), there is a better way that avoids the inner loop: use np.random.randint(0, 100, (n,)) to get n random indexes at once, in this case np.random.randint(0, 100, (20,)).
An even simpler way is to use df.sample(20) to sample 20 rows from a given dataframe.
3. Forcing update over views of the dataframe
It's better to use a different approach than forcing an update into slices with Tdata[k:k+1,:] = .... Since you want to combine dataframes, it's better to just collect them in a list and pass that list to pd.concat, which is much more useful.
Here is sample code with a simple setting which should help guide you to what you are looking for.
import pandas as pd
import numpy as np
dataset0 = pd.DataFrame(np.random.random((100,3)))
dataset1 = pd.DataFrame(np.random.random((100,3)))
dataset2 = pd.DataFrame(np.random.random((100,3)))
dataset3 = pd.DataFrame(np.random.random((100,3)))
##Using random.randint
##samples = [eval('dataset'+str(i)).iloc[np.random.randint(0,100,(3,))] for i in range(4)]
##Using df.sample()
samples = [eval('dataset'+str(i)).sample(3) for i in range(4)]
##Change -
##1. The 3 to 20 for 20 samples per dataframe
##2. range(4) to range(200) to work with 200 dataframes
output = pd.concat(samples)
print(output)
0 1 2
42 0.372626 0.445972 0.030467
20 0.376201 0.445504 0.835735
56 0.214806 0.083550 0.582863
85 0.691495 0.346022 0.619638
24 0.290397 0.202795 0.704082
16 0.112986 0.013269 0.903917
51 0.521951 0.115386 0.632143
73 0.946870 0.531085 0.437418
98 0.745897 0.718701 0.280326
56 0.679253 0.010143 0.124667
4 0.028559 0.769682 0.737377
84 0.857553 0.866464 0.827472
4. Storing 200 dataframes??
Last but not least, you should ask yourself why you are storing 200 dataframes as individual variables, only to sample some rows from each.
Why not try to -
Read each of the files iteratively
Sample rows from each
Store them in a list of dataframes
pd.concat once you are done iterating over the 200 files
... instead of saving 200 dataframes and then doing the same.
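A rough sketch of that read-sample-concat pattern, assuming for illustration that the datasets live in CSV files named dataset0.csv through dataset199.csv (adjust the loader to your actual source):
import pandas as pd

samples = []
for i in range(200):
    df = pd.read_csv(f'dataset{i}.csv')  # read one dataset at a time
    samples.append(df.sample(20))        # keep only the 20 sampled rows
output = pd.concat(samples, ignore_index=True)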

Dynamically make comparisons between columns in pandas columns

I have a large database in which I need to drop entries that don't satisfy a boolean criteria, but the criteria may involve several dozen columns.
I have the following which works with copying and pasting the names
df = df[~( (df['FirstCol'] > df['SecondCol']) |
           (df['ThirdCol'] > df['FifthCol']) |
           ...
           (df['FiftiethCol'] > df['TweniethCol']) |
           (df['ThisCouldBeHundredsCol'] > df['LastOne'])
         )]
However, I want to be able to do this in a shorter amount of code. If I have the column names that need to be compared in a list, like so
list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol', ..., 'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol', 'LastOne']
how would I go about doing this as dynamically and with as little code as possible?
Many thanks.
You can do it by taking every second element of your list: [::2] gives ['FirstCol', 'ThirdCol', ...] and [1::2] gives ['SecondCol', 'FifthCol', ...]. Use these to select the columns and compare the to_numpy() arrays on both sides of the inequality. Then use any over axis=1, which corresponds to the | used in your condition.
#example
list_of_comparison_cols = ['FirstCol', 'SecondCol', 'ThirdCol', 'FifthCol',
                           'FiftiethCol', 'TweniethCol', 'ThisCouldBeHundredsCol',
                           'LastOne']

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 50, 8*10).reshape(10, 8),
                  columns=list_of_comparison_cols)

# create the mask
mask = (df[list_of_comparison_cols[::2]].to_numpy()
        > df[list_of_comparison_cols[1::2]].to_numpy()
        ).any(1)

print(df[~mask])
FirstCol SecondCol ThirdCol FifthCol FiftiethCol TweniethCol \
0 44 47 0 3 3 39
ThisCouldBeHundredsCol LastOne
0 9 19

New Dataframe column as a generic function of other rows (pandas)

What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of other rows in pandas?
Consider the following example:
import pandas as pd
d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which yields:
id word
0 1 cat
1 2 hat
2 3 hag
3 4 hog
4 5 dog
5 6 elephant
Suppose I want to create a new column bar containing a value that is based on the output of using a function foo to compare the word in the current row to the other rows in the dataframe.
def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
This does produce the correct output, but it uses itertuples() and apply(), which is not performant for large DataFrames.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another better (faster) way to do this?
Notes / Updates:
In the original post, I used edit distance / Levenshtein distance as the foo function. I have changed the question in an attempt to be more generic. The idea is that the applied function compares the current row's value against all other rows and returns some aggregate value.
If foo was nltk.metrics.distance.edit_distance and the threshold was set to 2 (as in the original post), this produces the output below:
id word bar
0 1 cat 1.0
1 2 hat 2.0
2 3 hag 2.0
3 4 hog 2.0
4 5 dog 1.0
5 6 elephant 0.0
I have the same question for spark dataframes as well. I thought it made sense to split these into two posts so they are not too broad. However, I have generally found that solutions to similar pandas problems can sometimes be modified to work for spark.
Inspired by this answer to my spark version of this question, I tried to use a cartesian product in pandas. My speed tests indicate that this is slightly faster (though I suspect that may vary with the size of the data). Unfortunately, I still can't get around calling apply().
Example code:
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
Let's try to analyze the problem for a second:
If you have N rows, then you have N*N "pairs" to consider in your similarity function. In the general case, there is no escape from evaluating all of them (sounds very rational, but I can't prove it). Hence, you have at least O(n^2) time complexity.
What you can try, however, is to play with the constant factors of that time complexity.
The possible options I found are:
1. Parallelization:
Since you have a large DataFrame, parallelizing the processing is the most obvious choice. That will give you an (almost) linear speedup, so if you have 16 workers you will gain an (almost) 16x improvement.
For example, we can partition the rows of the df into disjoint parts, and process each part individually, then combine the results.
A very basic parallel code might look like this:
from multiprocessing import cpu_count, Pool

def work(part):
    """
    Args:
        part (DataFrame): a part (collection of rows) of the whole DataFrame.
    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
    # Note that we are using the original df (pandas_df) as a global variable
    # But changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples():  # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply(  # Calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...
cores = cpu_count()  # Number of CPU cores on your system
data_split = np.array_split(pandas_df, cores)  # Split the DataFrame into parts
pool = Pool(cores)  # Create a new process pool
new_parts = pool.map(work, data_split)  # apply the function `work` to each part; this gives a list of the new parts
pool.close()  # close the pool
pool.join()
new_df = pd.concat(new_parts)  # Concatenate the new parts
Note: I've tried to keep the code as close to OP's code as possible. This is just a basic demonstration code and a lot of better alternatives exist.
2. "Low level" optimizations:
Another solution is to try to optimize the similarity function computation and iterating/mapping. I don't think this will gain you much speedup compared to the previous option or the next one.
3. Function-dependent pruning:
The last thing you can try is similarity-function-dependent improvements. This doesn't work in the general case, but it works very well if you can analyze the similarity function. For example:
Assuming you are using Levenshtein distance (LD), you can observe that the distance between any two strings is >= the difference between their lengths, i.e. LD(s1,s2) >= abs(len(s1)-len(s2)).
You can use this observation to prune the pairs considered for evaluation: for each string of length l1, compare it only with strings of length l2 where abs(l1-l2) <= limit (limit is the maximum accepted dissimilarity, 2 in your provided example).
Another observation is that LD(s1,s2) = LD(s2,s1). That cuts the number of pairs by a factor of 2.
This solution may actually get you down to O(n) time complexity (depends highly on the data).
Why? you may ask.
That's because if we had 10^9 rows but, on average, only 10^3 rows with a "close" length to each row, then we need to evaluate the function for about 10^9 * 10^3 / 2 pairs instead of 10^9 * 10^9 pairs. But that (again) depends on the data. This approach is useless if (in this example) all of your strings have length 3.
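A minimal sketch of that length-based pruning, using nltk's edit_distance and the threshold of 2 from the question (an illustration, not code from the original answer):
import pandas as pd
from nltk.metrics.distance import edit_distance

words = pd.Series(['cat', 'hat', 'hag', 'hog', 'dog', 'elephant'])
limit = 2  # maximum accepted dissimilarity, as in the question

lengths = words.str.len()
bar = []
for i, w in words.items():
    # strings whose length differs by >= limit can never be within the threshold
    candidates = words[(lengths - len(w)).abs() < limit].drop(i)
    bar.append(sum(edit_distance(w, other) < limit for other in candidates))
print(pd.DataFrame({'word': words, 'bar': bar}))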
Thoughts about preprocessing (groupby)
Because you are looking for an edit distance less than 2, you can first group by the length of the strings. If the difference in length between two groups is greater than or equal to 2, you do not need to compare them. (This part is quite similar to section 3 of Qusai Alothman's answer.)
Thus, the first thing to do is to group by the length of the string.
df["length"] = df.word.str.len()
df.groupby("length")[["id", "word"]]
Then, you compute the edit distance between every two consecutive groups, i.e. groups whose difference in length is less than 2. This does not directly relate to your question, but I hope it is helpful.
Potential vectorization (after groupby)
After that, you may also try to vectorize the computation by splitting each string into characters. Note that if the cost of splitting is greater than the vectorized benefit it brings, you should not do this. Or, when you are creating the data frame, just create one with characters rather than words.
We will use the answer in Pandas split dataframe column for every character to split a string into a list of characters.
# assuming we had grouped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))
0 1 2
0 c a t
1 h a t
2 h a g
3 h o g
4 d o g
splitted.loc[0] == splitted # compare one word to all words
0 1 2
0 True True True -> comparing to itself is always all true.
1 False True True
2 False True False
3 False False False
4 False False False
splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
0 1
1 2
2 2
3 2
4 1
dtype: int64
Explanation of splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1
For each row, lambda x: (x == splitted) compares each row to the whole df just like splitted.loc[0] == splitted above. It will generate a true/false table.
Then, we sum up the table horizontally with a .sum(axis=1) following (x == splitted).
Then, we want to find out which words are similar. Thus, we apply a ge function that checks whether the number of True values is at least a threshold. Here, we only allow a difference of 1, so the threshold is set to len(x)-1.
Finally, we have to subtract 1 from the whole array because each word is compared with itself in the operation, and we want to exclude that self-comparison.
Note that this vectorization part only works for within-group similarity checking. You still need to check groups with different lengths using the edit distance approach, I suppose.

How to efficiently create a SparseDataFrame from a long table?

I have a SQL table which I can read in as a Pandas data frame, that has the following structure:
user_id value
1 100
1 200
2 100
4 200
It's a representation of a matrix, for which all the values are 1 or 0. The dense representation of this matrix would look like this:
   100  200
1    1    1
2    1    0
4    0    1
Normally, to do this conversion you can use pivot, but in my case with tens or hundreds of millions of rows in the first table one gets a big dense matrix full of zeros which is expensive to drag around. You can convert it to sparse, but getting that far requires a lot of resources.
Right now I'm working on a solution to assign row numbers to each user_id, sorting, and then splitting the 'value' column into SparseSeries before recombining into a SparseDataFrame. Is there a better way?
I arrived at a solution, albeit a slightly imperfect one.
What one can do is to manually create from the columns a number of Pandas SparseSeries, combine them into a dict, and then cast that dict to a DataFrame (not a SparseDataFrame). Casting as SparseDataFrame currently hits an immature constructor, which deconstructs the whole object into dense and then back into sparse form regardless of the input. Building SparseSeries into a conventional DataFrame, however, maintains sparsity but creates a viable and otherwise complete DataFrame object.
Here's a demonstration of how to do it, written more for clarity than for performance. One difference from my own implementation is that I created the dict of sparse vectors with a dict comprehension instead of a loop.
import pandas
import numpy

df = pandas.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

# Get unique users and unique features
num_rows = len(df['user_id'].unique())
num_features = len(df['value'].unique())
unique_users = df['user_id'].unique().copy()
unique_features = df['value'].unique().copy()
unique_users.sort()
unique_features.sort()

# assign each user_id to a row_number
user_lookup = pandas.DataFrame({'uid': range(num_rows), 'user_id': unique_users})

vec_dict = {}
# Create a sparse vector for each feature
for i in range(num_features):
    users_with_feature = df[df['value'] == unique_features[i]]['user_id']
    uid_rows = user_lookup[user_lookup['user_id'].isin(users_with_feature)]['uid']
    vec = numpy.zeros(num_rows)
    vec[uid_rows] = 1
    sparse_vec = pandas.Series(vec).to_sparse(fill_value=0)
    vec_dict[unique_features[i]] = sparse_vec

my_pandas_frame = pandas.DataFrame(vec_dict)
my_pandas_frame = my_pandas_frame.set_index(user_lookup['user_id'])
The results:
>>> my_pandas_frame
100 200
user_id
1 1 1
2 1 0
4 0 1
>>> type(my_pandas_frame)
<class 'pandas.core.frame.DataFrame'>
>>> type(my_pandas_frame[100])
<class 'pandas.sparse.series.SparseSeries'>
Complete, but still sparse. There are a few caveats: if you do a simple copy or a not-in-place subset, it will forget itself and try to recast to dense. But for my purposes I'm pretty happy with it.
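Note that Series.to_sparse() and SparseSeries were removed in pandas 1.0; a rough sketch of the same idea with the newer sparse API (pd.arrays.SparseArray), as an adaptation rather than the original answer's code, might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 1, 4], 'value': [100, 100, 200, 200]})

users = np.sort(df['user_id'].unique())
row_of = {u: i for i, u in enumerate(users)}

vec_dict = {}
for feature in np.sort(df['value'].unique()):
    vec = np.zeros(len(users))
    vec[[row_of[u] for u in df.loc[df['value'] == feature, 'user_id']]] = 1
    # SparseArray keeps each column sparse inside a regular DataFrame
    vec_dict[feature] = pd.arrays.SparseArray(vec, fill_value=0)

sparse_frame = pd.DataFrame(vec_dict, index=pd.Index(users, name='user_id'))
print(sparse_frame)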

setting null values in a numpy array

How do I null certain values in a numpy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))

for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np

data = np.random.random((4, 3))
mask = np.random.randint(0, 2, (4, 3))  # random 0/1 mask
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin in bins:
    a[b == bin] = np.mean(c[b == bin])
np.empty sometimes fills the array with 0's; it's undefined what the contents of an empty() array is, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
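For example, a masked array makes the per-bin average straightforward (toy shapes here instead of 71x71x166):
import numpy as np

c = np.random.random((4, 4, 5))          # the float data
b = np.random.randint(0, 3, (4, 4, 5))   # integer "bin" labels, 0 meaning outside any bin

# mask everything that is NOT in bin 1, then average what remains
in_bin_1 = np.ma.masked_where(b != 1, c)
print(in_bin_1.mean())                   # same value as np.mean(c[b == 1])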
