I have a dataframe like this
  group         b         c         d         e  label
0     A  0.577535  0.299304  0.617103  0.378887      1
1     A  0.167907  0.244972  0.615077  0.311497      0
2     B  0.640575  0.768187  0.652760  0.822311      0
3     B  0.424744  0.958405  0.659617  0.998765      1
4     B  0.077048  0.407182  0.758903  0.273737      0
I want to reshape it into a 3D array which an LSTM could use as input, using padding. So group A (2 rows) should feed in a padded sequence of length 3, and group B (3 rows) a sequence of length 3. Desired output is something like
array1 = [[[0.577535, 0.299304, 0.617103, 0.378887],
[0.167907, 0.244972, 0.615077, 0.311497],
[0, 0, 0, 0]],
[[0.640575, 0.768187, 0.652760, 0.822311],
[0.424744, 0.958405, 0.659617, 0.998765],
[0.077048, 0.407182, 0.758903, 0.273737]]]
and then the labels have to be reshaped accordingly too
array2 = [[1,
0,
0],
[0,
1,
0]]
How can I put in the padding and reshape my data?
You can first use GroupBy.cumcount to create a counter within each group, reindex by MultiIndex.from_product (filling the new rows with 0), and finally export to nested lists:
import pandas as pd

df["count"] = df.groupby("group").cumcount()
mux = pd.MultiIndex.from_product([df["group"].unique(), range(df["count"].max() + 1)],
                                 names=["group", "count"])
df = df.set_index(["group", "count"]).reindex(mux, fill_value=0)
print(df.iloc[:, :4].groupby(level=0).apply(lambda g: g.values.tolist()).tolist())
[[[0.577535, 0.299304, 0.617103, 0.378887],
[0.167907, 0.24497199999999997, 0.6150770000000001, 0.31149699999999997],
[0.0, 0.0, 0.0, 0.0]],
[[0.640575, 0.768187, 0.65276, 0.822311],
[0.42474399999999995, 0.958405, 0.659617, 0.998765],
[0.077048, 0.40718200000000004, 0.758903, 0.273737]]]
print(df.groupby(level=0)["label"].apply(list).tolist())
[[1, 0, 0], [0, 1, 0]]
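If the LSTM then needs real numpy arrays rather than nested lists, a minimal follow-up sketch (assuming the reindexed df from above):
import numpy as np

X = np.array(df.iloc[:, :4].groupby(level=0).apply(lambda g: g.values.tolist()).tolist())
y = np.array(df.groupby(level=0)["label"].apply(list).tolist())
print(X.shape, y.shape)  # (2, 3, 4) and (2, 3), i.e. (samples, timesteps, features)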
I'm assuming your group column consists of many values and not just one 'A' and one 'B'. This code worked for me; you can give it a try as well:
import pandas as pd

df = pd.read_csv('file2.csv')
vals = df['group'].unique()
feature_cols = ['b', 'c', 'd', 'e']  # exclude 'group' and 'label' from the features

array1 = []
array2 = []
for val in vals:
    val_df = df[df.group == val]
    val_label = val_df.label
    smaller_array = []
    label_small_array = []
    for label in val_label:
        label_small_array.append(label)
    array2.append(label_small_array)
    for i in range(val_df.shape[0]):
        smallest_array = []
        for j in feature_cols:
            smallest_array.append(val_df.iloc[i][j])  # append the value, not the column name
        smaller_array.append(smallest_array)
    array1.append(smaller_array)
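Note this produces ragged lists with no padding yet. If TensorFlow is available, one way to pad every group to the same length is Keras' pad_sequences (a sketch, not part of the original answer):
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad shorter groups with rows of zeros at the end ('post')
X = pad_sequences(array1, padding='post', dtype='float32', value=0.0)
y = pad_sequences(array2, padding='post', value=0)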
I have a Dataframe as follows:
import pandas as pd
df = pd.DataFrame({'Target': [0 ,1, 2],
'Source': [1, 0, 3],
'Count': [1, 1, 1]})
I have to count how many pairs of Sources and Targets there are. (1,0) and (0,1) are treated as duplicates, so that pair's count is 2.
I need to do it several times as I have 79 nodes in total. Any help will be much appreciated.
import pandas as pd
# instantiate without the 'count' column to start over
In[1]: df = pd.DataFrame({'Target': [0, 1, 2],
                          'Source': [1, 0, 3]})
Out[1]:
   Target  Source
0       0       1
1       1       0
2       2       3
Counting pairs regardless of their order is possible by converting to a numpy.ndarray and sorting each row, which makes (1, 0) and (0, 1) identical:
In[1]: array = df.values
In[2]: array.sort(axis=1)
In[3]: array
Out[3]: array([[0, 1],
               [0, 1],
               [2, 3]])
And then turn it back to a DataFrame to perform .value_counts():
In[1]: df_sorted = pd.DataFrame(array, columns=['value1', 'value2'])
In[2]: df_sorted.value_counts()
Out[2]:
value1  value2
0       1         2
2       3         1
dtype: int64
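The row-sort and the count also fit into a single pipeline (a sketch assuming pandas >= 1.1, where DataFrame.value_counts was added):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Target': [0, 1, 2], 'Source': [1, 0, 3]})
# np.sort returns a row-sorted copy, so df itself stays untouched
pairs = pd.DataFrame(np.sort(df.to_numpy(), axis=1), columns=['value1', 'value2'])
print(pairs.value_counts())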
I want to make sure column 2 is smaller than column 1, and where it isn't, set it to 0. My attempt:
x = np.array([[0,1],[1,0]])
x = np.where((x[1] > (x[0])), 0, x)
print(x)  # expected output: [[0, 0], [1, 0]], but this doesn't produce it
Maybe this helps you:
arr = np.array([[0,1],[1,0]])
arr[arr[:,1] > arr[:,0], 1] = 0
print(arr)
Output:
array([[0, 0],
[1, 0]])
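If you would rather stay close to the np.where attempt from the question, the fix is to compare the columns x[:, 0] and x[:, 1] instead of the rows x[0] and x[1] (a small sketch under that reading):
import numpy as np

arr = np.array([[0, 1], [1, 0]])
# replace column 1 where it exceeds column 0, keep it otherwise
arr[:, 1] = np.where(arr[:, 1] > arr[:, 0], 0, arr[:, 1])
print(arr)  # [[0 0]
            #  [1 0]]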
You started with a list (of lists), so I'll give you a list answer.
First define a simple helper function:
def foo(row):
    if row[1] > row[0]:
        row[1] = 0
    return row
And apply it to x row by row:
In [37]: x = [[0,1],[1,0]]
In [38]: [foo(row) for row in x]
Out[38]: [[0, 0], [1, 0]]
I have a 2d NumPy array that looks like this:
array([[1, 1],
[1, 2],
[2, 1],
[2, 2],
[3, 1],
[5, 1],
[5, 2]])
and I want to group it and have an output that looks something like this:
Col1 Col2
group 1: 1-2, 1-2
group 2: 3-3, 1-1
group 3: 5-5, 1-2
I want to group the columns based on if they are consecutive.
So, for a unique value in column 1, group the data in the second column if the values are consecutive between rows. Then, for a unique grouping of column 2, group column 1 if it is consecutive between rows.
The result can be thought of as corner points of a grid. In the above example, group 1 is a square grid, group 2 is a point, and group 3 is a flat line.
My system won't allow me to use pandas, so I cannot use groupby from that library, but I can use other standard libraries.
Any help is appreciated. Thank you
Here you go ...
Steps are:
Get a sorted list xUnique of the unique column 1 values (the input is already ordered by column 1).
Build a list xRanges of items of the form [col1_value, [col2_min, col2_max]] holding the column 2 ranges for each column 1 value.
Build a list xGroups of items of the form [[col1_min, col1_max], [col2_min, col2_max]] where the [col1_min, col1_max] part is created by merging the col1_value part of consecutive items in xRanges if they differ by 1 and have identical [col2_min, col2_max] value ranges for column 2.
Turn the ranges in each item of xGroups into strings and print with the required row and column headings.
Also package and print as a numpy.array to match the form of the input.
import numpy as np
data = np.array([
[1, 1],
[1, 2],
[2, 1],
[2, 2],
[3, 1],
[5, 1],
[5, 2]])
xUnique = sorted({pair[0] for pair in data})  # a set alone would not guarantee order, so sort it
xRanges = list(zip(xUnique, [[0, 0] for _ in range(len(xUnique))]))
rows, cols = data.shape
iRange = -1
for i in range(rows):
    if i == 0 or data[i, 0] > data[i - 1, 0]:
        iRange += 1
        xRanges[iRange][1][0] = data[i, 1]  # start of a new column-2 range
    xRanges[iRange][1][1] = data[i, 1]      # running maximum of the range

xGroups = []
for i in range(len(xRanges)):
    if i and xRanges[i][0] - xRanges[i - 1][0] == 1 and xRanges[i][1] == xRanges[i - 1][1]:
        xGroups[-1][0][1] = xRanges[i][0]   # merge into the previous group
    else:
        xGroups += [[[xRanges[i][0], xRanges[i][0]], xRanges[i][1]]]

xGroupStrs = [[f'{a}-{b}' for a, b in row] for row in xGroups]
groupArray = np.array(xGroupStrs)
print(groupArray)

print()
print(f'{"":<10}{"Col1":<8}{"Col2":<8}')
for i, (col1, col2) in enumerate(xGroupStrs):
    print(f'{"group " + str(i) + ":":<10}{col1:<8}{col2:<8}')
Output:
[['1-2' '1-2']
['3-3' '1-1']
['5-5' '1-2']]
          Col1    Col2
group 0:  1-2     1-2
group 1:  3-3     1-1
group 2:  5-5     1-2
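As an aside, since you can use the standard library, the same merge logic also fits compactly with itertools.groupby (a sketch assuming, as in your example, that the input rows are sorted by column 1):
from itertools import groupby
import numpy as np

data = np.array([[1, 1], [1, 2], [2, 1], [2, 2], [3, 1], [5, 1], [5, 2]])

# column-2 (min, max) range per unique column-1 value
ranges = []
for k, rows in groupby(data.tolist(), key=lambda r: r[0]):
    ys = [r[1] for r in rows]
    ranges.append((k, (min(ys), max(ys))))

# merge consecutive column-1 values that share the same column-2 range
groups = []
for x, yr in ranges:
    if groups and x - groups[-1][0][1] == 1 and groups[-1][1] == yr:
        groups[-1][0][1] = x
    else:
        groups.append([[x, x], yr])

print([[f'{a}-{b}' for a, b in g] for g in groups])
# [['1-2', '1-2'], ['3-3', '1-1'], ['5-5', '1-2']]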
I'd like to split my time-series data into X and y by shifting the data. The first two rows of my dummy dataframe look like:
   x-2  x-1  x0  x1  x2
0    3    0   5   7   1
1    2    3   0   5   6
i.e. if the time step equals 2, X and y look like: X=[3,0] -> y=[5],
X=[0,5] -> y=[7] (this should be applied to all samples (rows)).
I wrote the function below, but it returns empty matrices when I pass pandas dataframe to the function.
def create_dataset(dataset, time_step=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - time_step - 1):
        a = dataset.iloc[:, i:(i + time_step)]
        dataX.append(a)
        dataY.append(dataset.iloc[:, i + time_step])
    return np.array(dataX), np.array(dataY)
Thank you for any solutions.
Here is an approach that replicates the expected output, IIUC:
import pandas as pd
# function to process each row
def process_row(s):
    assert isinstance(s, pd.Series)
    return pd.concat([
        s.rename('timestep'),
        s.shift(-1).rename('x_1'),
        s.shift(-2).rename('x_2'),
        s.shift(-3).rename('y')
    ], axis=1).dropna(how='any', axis=0).astype(int)

# test case for the example
process_row(pd.Series([2, 3, 0, 5, 6]))

# type in the first two rows of the data frame
df = pd.DataFrame(
    {'x-2': [3, 2], 'x-1': [0, 3],
     'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})

# perform the transformation
ts = list()
for idx, row in df.iterrows():
    t = process_row(row)
    t.index = [idx] * t.index.size
    ts.append(t)

print(pd.concat(ts))
# results
   timestep  x_1  x_2  y
0         3    0    5  7
0         0    5    7  1
1         2    3    0  5   <-- first part of expected results
1         3    0    5  6   <-- second part
Do you mean something like this:
df = df.shift(periods=-2, axis='columns')
# you can also pass a fill_value parameter
df = df.shift(periods=-2, axis='columns', fill_value = 0)
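For reference, applied to the two example rows this shifts every value two columns to the left and fills the tail (with NaN by default, or 0 via fill_value), while the column names stay put:
import pandas as pd

df = pd.DataFrame({'x-2': [3, 2], 'x-1': [0, 3],
                   'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})
print(df.shift(periods=-2, axis='columns', fill_value=0))
#    x-2  x-1  x0  x1  x2
# 0    5    7   1   0   0
# 1    0    5   6   0   0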
I have this code:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
for v in np.unique(result['depth']):
    temp_v = (result['depth'] == v)
    values_v = [result[string][temp_v] for string in result.keys()]
    this_v = dict(zip(result.keys(), values_v))
in which I want to create a new dict called 'this_v', with the same keys as the original dict result, but fewer values.
The line:
values_v = [result[string][temp_v] for string in result.keys()]
gives an error
TypeError: only integer scalar arrays can be converted to a scalar index
which I don't understand, since I can create ex = result[list(result.keys())[0]][temp_v] just fine. It just does not let me do this with a for loop so that I can fill the list.
Any idea as to why it does not work?
In order to solve your problem (finding and dropping duplicates) I encourage you to use pandas. It is a Python module that makes your life absurdly simple:
import numpy as np
result = {}
result['depth'] = [1,1,1,2,2,2]
result['generation'] = [1,1,1,2,2,2]
result['dimension'] = [1,2,3,1,2,3]
result['data'] = [np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0]),\
np.array([0,0,0]), np.array([0,0,0]), np.array([0,0,0])]
# Here comes pandas!
import pandas as pd
# Converting your dictionary of lists into a beautiful dataframe
df = pd.DataFrame(result)
#>         data  depth  dimension  generation
#  0  [0, 0, 0]      1          1           1
#  1  [0, 0, 0]      1          2           1
#  2  [0, 0, 0]      1          3           1
#  3  [0, 0, 0]      2          1           2
#  4  [0, 0, 0]      2          2           2
#  5  [0, 0, 0]      2          3           2
# Dropping duplicates... in one single command!
df = df.drop_duplicates('depth')
#>         data  depth  dimension  generation
#  0  [0, 0, 0]      1          1           1
#  3  [0, 0, 0]      2          1           2
If you want your data back in the original format... you need yet again just one line of code!
df.to_dict('list')
#> {'data': [array([0, 0, 0]), array([0, 0, 0])],
# 'depth': [1, 2],
# 'dimension': [1, 1],
# 'generation': [1, 2]}
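For completeness, the TypeError itself comes from indexing a plain Python list with a boolean numpy array: result['depth'] == v broadcasts to a boolean array (because v is a numpy scalar), but list[mask] is not valid indexing. Converting the lists to arrays makes the original loop work, a sketch using the result dict from above:
import numpy as np

for v in np.unique(result['depth']):
    temp_v = np.asarray(result['depth']) == v                    # boolean mask
    this_v = {k: np.asarray(result[k])[temp_v] for k in result}  # mask each column
    print(this_v)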