Combining differently scaled arrays in a dataframe - python

Is there a built-in function (numpy or pandas, I'm thinking) that would help combine multiple rows of one column in a dataframe, keeping the same dimensions but a different scale? And, combined with that, summing the values from a different column between the intervals? Or is it something I just need to build from scratch? Example below; I'm not sure exactly how to ask. This would need to be scalable: the example is simple, but in reality I'm working with 250-dim arrays and a theoretically unlimited number of rows.
Ex:
import pandas as pd
import numpy as np
# Creating DF
df = pd.DataFrame([[[-2, -1, 0, 1, 2], [-10, -5, 5, 5, -10]],
                   [[-.5, .5, 1.5, 2.5, 3.5], [-3, -2, 0, -2, -3]]])
output:
                            0                     1
0           [-2, -1, 0, 1, 2]  [-10, -5, 5, 5, -10]
1  [-0.5, 0.5, 1.5, 2.5, 3.5]    [-3, -2, 0, -2, -3]
where the answer is [-2, -0.625, 0.75, 2.125, 3.5] (column 0 combined onto a common 5-point scale) and [-10, -5, 0, -5, -5] (the values of column 1 summed between the steps of the combined scale, where (interval-1) < x <= interval):
answer = pd.DataFrame([[[-2, -.625, .75, 2.125, 3.5], [-10, -5, 0, -5, -5]]])
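There doesn't seem to be a single built-in for this, but it can be composed from numpy primitives. Below is a minimal sketch under two assumptions taken from the example: the common scale is an evenly spaced grid spanning all rows (np.linspace reproduces the first array of the answer), and the binning rule is grid[i-1] < x <= grid[i], which np.digitize(..., right=True) implements directly.
import numpy as np
xs = np.concatenate(df[0].to_list())  # all scale values from every row
ys = np.concatenate(df[1].to_list())  # all data values from every row
n = len(df.iloc[0, 0])                # points per row (5 here, 250 in practice)
grid = np.linspace(xs.min(), xs.max(), n)
# bin i collects x with grid[i-1] < x <= grid[i]; x == grid[0] lands in bin 0
idx = np.digitize(xs, grid, right=True)
sums = np.bincount(idx, weights=ys, minlength=n)
# grid -> [-2, -0.625, 0.75, 2.125, 3.5]
# sums -> [-10, -5, 0, -5, -5]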

How to select rows in pandas based on a condition with a list?

Let's say I have the following dataset:
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1, 2, 3, 2, 3, 4, 3, 4, 5]
import pandas as pd
df = pd.DataFrame({'x':x,'y':y}) #dataframe to work with
which looks like this when plotted with a matplotlib scatter (figure not shown).
I would like to select the bottom three points using pandas, without iterating over the rows of my dataframe (for speed reasons on a large dataframe), and without simply selecting the 1st, 4th and 7th points of the dataframe.
I tried selecting based on a condition:
selected_df = df.loc[df["y"] <=3] #selects an extra point at x=1,y=2
This selects an extra point which I don't want. I also tried building two lists of values representing a line that separates the bottom points from others:
x_line = [1,2,3]
y_line = [1.5, 2.5, 3.5]
selected_df = df.loc[df["y"] <=y_line ] #y_line is a list, doesn't work
Unfortunately, I also must not solve it by filling y_line with more points to make y_line the same size as df["y"].
Can anyone please point me in the right direction for selecting the bottom points, preferably using DataFrame functions such as df.where or a condition? I would appreciate it very much.
IIUC, what you're essentially looking for is the lowest y for each x, so you can phrase this as a groupby problem:
>>> selected_df = df.groupby("x", as_index=False).y.min()
>>> selected_df
x y
0 1 1
1 2 2
2 3 3
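If you need the full rows rather than just the (x, y) pairs, a variant of the same idea (a hedged sketch, using idxmin to pull the row labels of each per-x minimum):
>>> df.loc[df.groupby("x")["y"].idxmin()]
   x  y
0  1  1
3  2  2
6  3  3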

Why is this dictionary comprehension so slow? Please suggest a way to speed it up

Hi, please help me either speed up this dictionary comprehension, offer a better way to do it, or help me understand why it is so slow internally (for example, is the calculation slowing down as the dictionary grows in memory?). I'm sure there must be a quicker way without learning some C!
classes = {i : [1 if x in df['column'].str.split("|")[i] else 0 for x in df['column']] for i in df.index}
with the output:
{1:[0,1,0...0],......, 4000:[0,1,1...0]}
from a df like this:
data_ = {'drugbank_id': ['DB06605', 'DB06606', 'DB06607', 'DB06608', 'DB06609'],
         'drug-interactions': ['DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06605|DB06695|DB01254|DB01609|DB01586|DB0212',
                               'DB06606|DB06607|DB06608|DB06609',
                               'DB06606|DB06607',
                               'DB06608']
         }
df = pd.DataFrame(data=data_, index=range(0, 5))
I am performing it on a df with 4000 rows; the column df['column'] contains a string of IDs separated by |. The number of IDs in each row that needs splitting varies from 1 to 1000, and this is done for all 4000 indexes. I tested it on the head of the df and it seemed quick enough; now the comprehension has been running for 24 hours. So maybe it is just the sheer size of the job, but I feel like I could speed it up, and at this point I want to stop it and re-engineer. However, I am scared that will set me back without much increase in speed, so before I do that I wanted to get some thoughts, ideas and suggestions.
Beyond the 4000x4000 size, I suspect that using the Series and Index objects is another problem, and that I would be better off using lists. But given the size of the task I am not sure how much speed that will gain; maybe I am better off using some other method, such as pd.apply(df, f(write line by line to json)). I am not sure - any help and education appreciated, thanks.
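One note on the slowness before the answers: the expression df['column'].str.split("|") re-splits the entire column from scratch on every iteration of both loops, so the work grows much faster than linearly with the row count. A minimal sketch that precomputes each row's split once, as a set for O(1) membership tests (written against the drugbank columns of the sample frame, since 'column' in the question is a generic placeholder):
# Split each row once and keep the result as a set
interaction_sets = df['drug-interactions'].str.split('|').apply(set)
ids = df['drugbank_id'].tolist()
classes = {i: [1 if x in interaction_sets[i] else 0 for x in ids]
           for i in df.index}
# {0: [1, 0, 0, 0, 0], 1: [1, 0, 0, 0, 0], 2: [0, 1, 1, 1, 1], ...}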
Here is one approach:
import pandas as pd
# create data frame
df = pd.DataFrame({'idx': [1, 2, 3, 4], 'col': ['1|2', '1|2|3', '2|3', '1|4']})
# split on '|' to convert string to list
df['col'] = df['col'].str.split('|')
# explode to get one row for each list element
df = df.explode('col')
# create dummy indicator (this will become 1 in the final result)
df['dummy'] = 1
# use pivot to create dense matrix
df = (df.pivot(index='idx', columns='col', values='dummy')
        .fillna(0)
        .astype(int))
# convert each row to a list
df['test'] = df.apply(lambda x: x.to_list(), axis=1)
print(df)
col 1 2 3 4 test
idx
1 1 1 0 0 [1, 1, 0, 0]
2 1 1 1 0 [1, 1, 1, 0]
3 0 1 1 0 [0, 1, 1, 0]
4 1 0 0 1 [1, 0, 0, 1]
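From here, the dictionary shape asked for in the question should just be the 'test' column keyed by index (a one-line follow-up, assuming the frame built above):
classes = df['test'].to_dict()
# {1: [1, 1, 0, 0], 2: [1, 1, 1, 0], 3: [0, 1, 1, 0], 4: [1, 0, 0, 1]}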
The output you want can be achieved using dummies. We split the column, stack, and use max to turn the result into dummy indicators based on the original index. Then we use reindex to get the columns in the order you want, based on the 'drugbank_id' column.
Finally, to get the dictionary you want, we transpose and use to_dict:
classes = (pd.get_dummies(df['drug-interactions'].str.split('|', expand=True).stack())
             .max(level=0)
             .reindex(df['drugbank_id'], axis=1)
             .fillna(0, downcast='infer')
             .T.to_dict('list'))
print(classes)
{0: [1, 0, 0, 0, 0],  # Has DB06605; no DB06606, DB06607, DB06608 or DB06609
 1: [1, 0, 0, 0, 0],
 2: [0, 1, 1, 1, 1],
 3: [0, 1, 1, 0, 0],
 4: [0, 0, 0, 1, 0]}

TensorFlow Sparse Arithmetic

Hi, I'm learning TensorFlow right now and I have a sparse dataset made up of three columns: date, bond, spread. I figured that storing this data in a sparse tensor, with bond as one dimension and date as another, would make operations on the tensor feel natural; do let me know if you think there is a better way.
I am trying to perform arithmetic on two slices of the tensor, where I add/subtract values on one date only if the given tensor value is not empty. While I found some functions that help with that task, I can't shake off the feeling that I'm missing a really simple solution to the problem.
Using the data below:
import tensorflow as tf
tf.enable_eager_execution()
indicies = [[0, 0], [0, 1], [1, 0], [1, 2], [2, 2]]
values = [10, 10, 10, 11, 11]
spreads = tf.sparse.SparseTensor(indicies, values, [3, 3])
In the above example I intend to use dimension one for bonds and dimension two for dates, such that
tf.sparse.slice(spreads, [0, 2], [3, 1])
gives me all spreads for date 2. But apparently subtraction is not supported for SparseTensor, nor can I use tf.math.subtract, so I am no longer sure what is supported.
Specifically, what I want to accomplish in the above example is to subtract date 0 from all other dates, but only if a bond has a spread on both dates. For example, bond 0 shows up on dates 0 and 1 but not date 2, so I want to subtract its date-0 spread from dates 0 and 1 only.
The final tensor would look like this:
indicies2 = [[0, 0], [0, 1], [1, 0], [1, 2]]
output = [0, 0, 0, 1]
tf.sparse.to_dense(tf.sparse.SparseTensor(indicies2, output, [3, 3]))
<tf.Tensor: id=128, shape=(3, 3), dtype=int32, numpy=
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])>
I guess an easy solution would be to use tf.sparse.to_dense, but that kind of defeats the whole point of using SparseTensor. So I'm not really sure: did I miss something in the API docs that makes my solution possible, or did I get it completely wrong by trying to use SparseTensor?
At the end of the day I am just looking to perform some math on each value of a tensor if that value has a match in another tensor.
UPDATE:
I realized that I can apply tf.math.negative to one of the slices to subtract the two slices; the problem is that the output assumes that if a value in one slice is missing, it can be taken to be some default value (zero).
I'm not sure there is any simple trick to make that work easily. I would either do the dense computation or write the sparse computation myself. The latter is a bit trickier, so it is probably only worth it if the data is really very sparse and you would save a lot of memory and computation. Here is a way to do that:
import tensorflow as tf
tf.enable_eager_execution()
bonds = [0, 0, 1, 1, 2]
dates = [0, 1, 0, 2, 2]
values = [10, 10, 10, 11, 11]
# Find date 0 data
m0 = tf.equal(dates, 0)
bonds0 = tf.boolean_mask(bonds, m0)
values0 = tf.boolean_mask(values, m0)
# Find where date 0 bonds are
match = tf.equal(tf.expand_dims(bonds, 1), bonds0)
# Compute the amount to subtract from each data point
values_sub = tf.reduce_sum(values0 * tf.dtypes.cast(match, values0.dtype), 1)
# Compute new spread values
values_new = values - values_sub
# Mask null values
m_valid = tf.not_equal(values_new, 0)
bonds_new = tf.boolean_mask(bonds, m_valid)
dates_new = tf.boolean_mask(dates, m_valid)
values_new = tf.boolean_mask(values_new, m_valid)
# Make sparse tensor
indices_new = tf.dtypes.cast(tf.stack([bonds_new, dates_new], 1), tf.int64)
spreads_new = tf.sparse.SparseTensor(indices_new, values_new, [3, 3])
tf.print(spreads_new)
# 'SparseTensor(indices=[[1 2]
# [2 2]], values=[1 11], shape=[3 3])'
For the example that you give, I get the outputs (1, 2) => 1 and (2, 2) => 11 - (2, 2) is unaffected because there was no spread for bond 2 on date 0. That is different from what you wrote, so let me know if that is not what you meant.
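If you do want a dense view of the result just for checking (the point of the above is to avoid densifying in the real computation), a quick follow-up using spreads_new from the snippet:
tf.print(tf.sparse.to_dense(spreads_new))
# [[0 0 0]
#  [0 0 1]
#  [0 0 11]]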

Best way to convert data in a pandas DataFrame to a matrix in Python

I found one thread about converting a matrix to a pandas DataFrame. However, I would like to do the opposite - I have a pandas DataFrame with time series data of this structure:
time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as heatmap using matplotlib or alike.
Any suggestions?
What you can try is to first group by the desired index:
g = df.groupby("batch")
And then convert this group to an array by aggregating using the list constructor.
The result can then be converted to an array using the .values property (or .as_matrix() function, but this is getting deprecated soon.)
mtr = g.aggregate(list).values
One downside of this method is that it will create an array of lists instead of a proper 2-D array, even when the result would not be jagged.
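If every batch has the same length, a hedged workaround for that downside is to stack the value lists back into a real 2-D array (column names taken from the question's data):
import numpy as np
mtr = np.array(g['value'].aggregate(list).tolist())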
Alternatively, if you know that you get exactly 4 values for every unique value of batch you can just use the matrix directly.
df = df.sort_values("batch")
my_indices = [1, 2] # Or whatever indices you desire.
mtr = df.values[:, my_indices] # or df.as_matrix()
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
Try crosstab from pandas, pd.crosstab(). You will have to choose the right aggfunc.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
and then .as_matrix() or .values.
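A minimal sketch of that crosstab route, assuming a small frame shaped like the question's data and using a within-batch cumcount as the column key:
import pandas as pd

df = pd.DataFrame({'batch': [1, 1, 1, 1, 2, 2, 2, 2],
                   'value': [0.1, 0.2, 0.3, 0.3, 0.25, 0.32, 0.2, 0.1]})

# position of each sample within its batch becomes the column index
pos = df.groupby('batch').cumcount()
mtr = pd.crosstab(index=df['batch'], columns=pos,
                  values=df['value'], aggfunc='first').values
# mtr -> [[0.1, 0.2, 0.3, 0.3],
#         [0.25, 0.32, 0.2, 0.1]]
# plt.imshow(mtr) would then render the heatmap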

Obtaining indices of n max absolute values in dataframe row

Suppose I create a pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, with seed 0 this generates the 5x5 dataframe shown in the answer below.
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute-value terms. For example, for the first row I would expect [0, 3, 4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use np.argsort(axis=1).
Given the dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
0 1 2 3 4
0 17.640523 4.001572 9.787380 22.408932 18.675580
1 -9.772779 9.500884 -1.513572 -1.032189 4.105985
2 1.440436 14.542735 7.610377 1.216750 4.438632
3 3.336743 14.940791 -2.051583 3.130677 -8.540957
4 -25.529898 6.536186 8.644362 -7.421650 22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
[0, 1, 4],
[1, 2, 4],
[1, 4, 0],
[0, 4, 2]])
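If the full sort ever becomes a bottleneck on wide rows, a hedged variant using np.argpartition, which only partially sorts (the three indices come back in arbitrary order, which is fine since ordering doesn't matter here):
np.argpartition(np.abs(df.values), -3, axis=1)[:, -3:]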
Try this (this is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end of that you will have a dictionary with:
as key, the index of the row
and as value, the indices of the n largest absolute values of that row
