Splitting Pandas dataframe - python

I'd like to split my time-series data into X and y by shifting the data. The dummy dataframe looks like:
i.e. if the time step equals 2, X and y look like: X=[3,0] -> y=[5]
X=[0,5] -> y=[7] (this should be applied to all samples (rows))
I wrote the function below, but it returns empty matrices when I pass a pandas dataframe to the function.
def create_dataset(dataset, time_step=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - time_step - 1):
        a = dataset.iloc[:, i:(i + time_step)]
        dataX.append(a)
        dataY.append(dataset.iloc[:, i + time_step])
    return np.array(dataX), np.array(dataY)
Thank you for any solutions.
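A likely cause of the empty result: the loop range is based on `len(dataset)` (the number of rows) even though the slice moves across columns, so with only a couple of rows `range(len(dataset) - time_step - 1)` is empty. A minimal sketch of the intended windowing, sliding along each row (this assumes the window should move across columns, as the example suggests):

```python
import numpy as np
import pandas as pd

def create_dataset(dataset, time_step=1):
    # slide a window of length `time_step` along each row (sample):
    # X = values[:, i:i+time_step], y = values[:, i+time_step]
    values = dataset.to_numpy()
    dataX, dataY = [], []
    for i in range(values.shape[1] - time_step):
        dataX.append(values[:, i:i + time_step])
        dataY.append(values[:, i + time_step])
    return np.array(dataX), np.array(dataY)

df = pd.DataFrame([[3, 0, 5, 7, 1]])
X, y = create_dataset(df, time_step=2)
# X[1] is [[0, 5]] and y[1] is [7], matching the example
```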

Here is an example that replicates the expected output, IIUC:
import pandas as pd

# function to process each row
def process_row(s):
    assert isinstance(s, pd.Series)
    return pd.concat([
        s.rename('timestep'),
        s.shift(-1).rename('x_1'),
        s.shift(-2).rename('x_2'),
        s.shift(-3).rename('y')
    ], axis=1).dropna(how='any', axis=0).astype(int)

# test case for the example
process_row(pd.Series([2, 3, 0, 5, 6]))

# type in the first two rows of the data frame
df = pd.DataFrame(
    {'x-2': [3, 2], 'x-1': [0, 3],
     'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})

# perform the transformation
ts = list()
for idx, row in df.iterrows():
    t = process_row(row)
    t.index = [idx] * t.index.size
    ts.append(t)
print(pd.concat(ts))
# results
   timestep  x_1  x_2  y
0         3    0    5  7
0         0    5    7  1
1         2    3    0  5  <-- first part of expected results
1         3    0    5  6  <-- second part

Do you mean something like this:
df = df.shift(periods=-2, axis='columns')
# you can also pass a fill_value parameter
df = df.shift(periods=-2, axis='columns', fill_value=0)
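For the two-row frame typed in above, this shift moves each row's values two columns to the left and fills the tail with zeros (a quick check):

```python
import pandas as pd

df = pd.DataFrame({'x-2': [3, 2], 'x-1': [0, 3],
                   'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})
shifted = df.shift(periods=-2, axis='columns', fill_value=0)
print(shifted.iloc[0].tolist())  # [5, 7, 1, 0, 0]
```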

Related

How to generate numeric mapping for categorical columns in pandas?

I want to manipulate categorical data using pandas data frame and then convert them to numpy array for model training.
Say I have the following data frame in pandas.
import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})
>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f
And now I want "compress the categories" horizontally as the following:
compressed_categories
0 c1-a, c2-d <--- this could be a string, ex. "c1-a, c2-d" or array ["c1-a", "c2-d"] or categorical data
1 c1-b, c2-e
2 c1-nan, c2-f
Next I want to generate a dictionary/vocabulary based on the unique occurrences plus "nan" columns in compressed_categories, ex:
volcab = {
    "c1-a": 0,
    "c1-b": 1,
    "c1-c": 2,
    "c1-nan": 3,
    "c2-d": 4,
    "c2-e": 5,
    "c2-f": 6,
    "c2-nan": 7,
}
So I can further encode them numerically as follows:
compressed_categories_numeric
0 [0, 4]
1 [1, 5]
2 [3, 6]
So my ultimate goal is to make it easy to convert them to numpy array for each row and thus I can further convert it to tensor.
input_data = np.asarray(df['compressed_categories_numeric'].tolist())
then I can train my model using input_data.
Can anyone please show me an example how to make this series of conversion? Thanks in advance!
To build the volcab dictionary and compressed_categories_numeric, you can use (numpy is needed as well):
import numpy as np

df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
Output:
>>> volcab
{'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
>>> df2
     c1 c2 compressed_categories_numeric
0     a  d                        [0, 3]
1     b  e                        [1, 4]
2  None  f                        [2, 5]
>>> np.array(df2['compressed_categories_numeric'].tolist())
array([[0, 3],
       [1, 4],
       [2, 5]])
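Putting the answer's pieces together as a self-contained script, through to the array the asker wanted for training (a sketch; `np.unique` flattens the frame's values to build the sorted vocabulary):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ['a', 'b', None], "c2": ['d', 'e', 'f']})

# prefix each value with its column name; None becomes the string 'nan'
df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
volcab = {k: v for v, k in enumerate(np.unique(df3))}

# map every cell through the vocabulary and collect each row into a list
df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)
input_data = np.asarray(df2['compressed_categories_numeric'].tolist())
```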

Pandas dataframe to 3D array

I have a dataframe like this
group         b         c         d         e  label
    A  0.577535  0.299304  0.617103  0.378887      1
       0.167907  0.244972  0.615077  0.311497      0
    B  0.640575  0.768187  0.652760  0.822311      0
       0.424744  0.958405  0.659617  0.998765      1
       0.077048  0.407182  0.758903  0.273737      0
I want to reshape it into a 3D array which an LSTM could use as input, using padding. So group A should feed in a sequence of length 3 (after padding) and group B one of length 3. The desired output is something like:
array1 = [[[0.577535, 0.299304, 0.617103, 0.378887],
           [0.167907, 0.244972, 0.615077, 0.311497],
           [0, 0, 0, 0]],
          [[0.640575, 0.768187, 0.652760, 0.822311],
           [0.424744, 0.958405, 0.659617, 0.998765],
           [0.077048, 0.407182, 0.758903, 0.273737]]]
and then the labels have to be reshaped accordingly too
array2 = [[1,
           0,
           0],
          [0,
           1,
           0]]
How can I put in the padding and reshape my data?
You can first use cumcount to create a count for each group, reindex by MultiIndex.from_product and fill with 0, and finally export to list:
df["count"] = df.groupby("group")["label"].cumcount()
mux = pd.MultiIndex.from_product([df["group"].unique(), range(max(df["count"]+1))], names=["group","count"])
df = df.set_index(["group","count"]).reindex(mux, fill_value=0)
print (df.iloc[:,:4].groupby(level=0).apply(pd.Series.tolist).values.tolist())
[[[0.577535, 0.299304, 0.617103, 0.378887],
  [0.167907, 0.24497199999999997, 0.6150770000000001, 0.31149699999999997],
  [0.0, 0.0, 0.0, 0.0]],
 [[0.640575, 0.768187, 0.65276, 0.822311],
  [0.42474399999999995, 0.958405, 0.659617, 0.998765],
  [0.077048, 0.40718200000000004, 0.758903, 0.273737]]]
print (df.groupby(level=0)["label"].apply(list).tolist())
[[1, 0, 0], [0, 1, 0]]
I'm assuming your group column consists of many values and not just one 'A' and one 'B'. This code worked for me, you can give it a try as well:
import pandas as pd

df = pd.read_csv('file2.csv')
vals = df['group'].unique()
array1 = []
array2 = []
for val in vals:
    val_df = df[df.group == val]
    val_label = val_df.label
    smaller_array = []
    label_small_array = []
    for label in val_label:
        label_small_array.append(label)
    array2.append(label_small_array)
    for i in range(val_df.shape[0]):
        smallest_array = []
        for j in val_df.columns:
            smallest_array.append(val_df[j].iloc[i])  # append the cell value, not the column name
        smaller_array.append(smallest_array)
    array1.append(smaller_array)
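The loop above collects per-group lists but doesn't pad shorter groups to equal length, which the LSTM input needs. A minimal padding sketch (the literal values here stand in for the collected array1/array2):

```python
import numpy as np

# per-group feature rows and labels of unequal length
array1 = [[[0.577535, 0.299304, 0.617103, 0.378887],
           [0.167907, 0.244972, 0.615077, 0.311497]],
          [[0.640575, 0.768187, 0.652760, 0.822311],
           [0.424744, 0.958405, 0.659617, 0.998765],
           [0.077048, 0.407182, 0.758903, 0.273737]]]
array2 = [[1, 0], [0, 1, 0]]

max_len = max(len(g) for g in array1)       # longest group length
n_feat = len(array1[0][0])                  # features per timestep

# append zero rows / zero labels until every group has max_len entries
padded_x = np.array([g + [[0.0] * n_feat] * (max_len - len(g)) for g in array1])
padded_y = np.array([l + [0] * (max_len - len(l)) for l in array2])
```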

Obtaining indices of n max absolute values in dataframe row

Suppose I create a pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, this can generate the below:
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute terms. For example, for the first row, I would expect [0, 3, 4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use np.argsort(axis=1)
Given dataset:
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
           0          1         2          3          4
0  17.640523   4.001572  9.787380  22.408932  18.675580
1  -9.772779   9.500884 -1.513572  -1.032189   4.105985
2   1.440436  14.542735  7.610377   1.216750   4.438632
3   3.336743  14.940791 -2.051583   3.130677  -8.540957
4 -25.529898   6.536186  8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]

array([[3, 4, 0],
       [0, 1, 4],
       [1, 2, 4],
       [1, 4, 0],
       [0, 4, 2]])
Try this (it is not the optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end you will have a dictionary with:
as key, the index of the row
as values, the indices of the n largest absolute values of that row (note the abs() call, since the question asks for absolute values)
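A quick check of the dictionary route on the seeded example from the question (a sketch; the `abs()` call is included so absolute values are compared, as the question asks):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(10 * np.random.randn(5, 5))

idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # indices of the n largest values of this row, in absolute terms
    idx_nmax[index] = list(row.abs().nlargest(n).index)

print(idx_nmax[0])  # [3, 4, 0] -- same columns as the argsort answer's first row
```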

Finding smallest possible difference between lists of unequal length

I have a dataframe with two columns A and B that contains lists:
import pandas as pd
df = pd.DataFrame({"A": [[1, 5, 10], [], [2], [1, 2]],
                   "B": [[15, 2], [], [6], []]})
I want to construct a third column C that is defined such that it is equal to the smallest possible difference between list-elements in A and B if they are non-empty, and 0 if one or both of them are empty.
For the first row the smallest difference is 1 (we take the absolute value), for the second row it is 0 due to the lists being empty, the third row is 4, and the fourth row is 0 again due to one empty list, so we ultimately end up with:
df["C"] = [1, 0, 4, 0]
This isn't easily vectorisable, since you have object dtype series of lists. You can use a list comprehension with itertools.product:
from itertools import product

zipper = zip(df['A'], df['B'])
df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) for vals in zipper]

# alternative:
# df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0)
#            for vals in df[['A', 'B']].values]
print(df)
#             A        B  C
# 0  [1, 5, 10]  [15, 2]  1
# 1          []       []  0
# 2         [2]      [6]  4
# 3      [1, 2]       []  0
You can use the following list comprehension, checking for the min difference of the cartesian product (itertools.product) from both columns
[min(abs(i-j) for i,j in product(*a)) if all(a) else 0 for a in df.values]
[1, 0, 4, 0]
df['C'] = df.apply(lambda row: min([abs(x - y) for x in row['A'] for y in row['B']], default=0), axis=1)
I just want to introduce the unnesting again
df['Diff'] = unnesting(df[['B']], ['B']).join(unnesting(df[['A']], ['A'])).eval('C=B-A').C.abs().min(level=0)
df.Diff = df.Diff.fillna(0).astype(int)
df
Out[60]:
            A        B  Diff
0  [1, 5, 10]  [15, 2]     1
1          []       []     0
2         [2]      [6]     4
3      [1, 2]       []     0
FYI
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
I think this works
def diff(a, b):
    if len(a) > 0 and len(b) > 0:
        return min([abs(i - j) for i in a for j in b])
    return 0

df['C'] = df.apply(lambda x: diff(x.A, x.B), axis=1)
df
            A        B  C
0  [1, 5, 10]  [15, 2]  1
1          []       []  0
2         [2]      [6]  4
3      [1, 2]       []  0

pandas dataframe exponential decay summation

I have a pandas dataframe,
[[1, 3],
 [4, 4],
 [2, 8],
 ...
]
I want to create a column that has this:
1*(a)^(3) # = x
1*(a)^(3 + 4) + 4 * (a)^4 # = y
1*(a)^(3 + 4 + 8) + 4 * (a)^(4 + 8) + 2 * (a)^8 # = z
...
Where "a" is some value. The values 1, 4, 2 come from column one; the repeated 3, 4, 8 come from column two.
Is this possible using some form of transform/apply?
Essentially getting:
[[1, 3, x],
 [4, 4, y],
 [2, 8, z],
 ...
]
Where x, y, z is the respective sums from the new column (I want them next to each other)
There is a "groupby" that is being done on the dataframe, and this is what I want to do for a given group
If I'm understanding your question correctly, this should work:
df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
a = 42
new_lst = []
for n in range(len(df)):
    z = 0
    i = 0
    while i <= n:
        z += df['a'][i] * a ** (sum(df['b'][i:n + 1]))
        i += 1
    new_lst.append(z)
df['new'] = new_lst
Update:
Saw that you are using pandas and updated with dataframe methods. Not sure there's an easy way to do this with apply since you need a mix of values from different rows. I think this for loop is still the best route.
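If the groups get long, the quadratic loop can be avoided. Writing S_n = b_1 + ... + b_n, each result is z_n = a^(S_n) * sum over i <= n of c_i / a^(S_(i-1)), which is just a pair of cumulative sums (a sketch, assuming the same 'a'/'b' columns as above; `base` stands in for the question's "a", and large exponents can overflow floats):

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
base = 1.1  # stand-in for the question's "a"

S = df['b'].cumsum()                          # running exponent totals S_n
w = df['a'] / base ** S.shift(fill_value=0)   # c_i / base^(S_{i-1})
df['new'] = base ** S * w.cumsum()            # z_n = base^(S_n) * cumsum
```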
