I would like to turn the names of columns into values, so that I can create a factor variable whose levels are the column names. I am hoping to achieve x2 from x1. In R this would be like using the model.matrix() function.
Thank you
import pandas as pd

x1 = pd.DataFrame({'A': [1, 0, 0],
                   'B': [0, 1, 0],
                   'C': [0, 1, 1]})
x2 = pd.DataFrame({'All': ['A', 'BC', 'C']})
You can also use a list comprehension, as follows:
cols = x1.columns.values
x2 = pd.DataFrame({'All': [''.join(cols[x]) for x in x1.eq(1).values]})
Or simply:
x2 = pd.DataFrame({'All': [''.join(x1.columns[x]) for x in x1.eq(1).values]})
Result:
print(x2)
  All
0   A
1  BC
2   C
Here's one way; there may be a simpler solution:
x1.astype(bool).apply(lambda row: ''.join(x1.columns[row]), axis=1)
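If you want the result as a DataFrame shaped like x2 rather than a Series, one possible tweak (the 'All' column name is just taken from the question):

x2 = x1.astype(bool).apply(lambda row: ''.join(x1.columns[row]), axis=1).to_frame('All')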
Use the @ (matrix multiplication) operator to multiply the 0/1 matrix by the vector of column names:
import pandas as pd

x1 = pd.DataFrame({'A': [1, 0, 0],
                   'B': [0, 1, 0],
                   'C': [0, 1, 1]})

# create result DataFrame
x2 = pd.DataFrame({"all": x1 @ x1.columns})
print(x2)
Output
  all
0   A
1  BC
2   C
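This works because, with the object-dtype column names on the right, the matrix product falls back to Python's operators; a quick sketch of the mechanics:

# with object dtype, @ reduces each row with * and +:
1 * 'A'          # -> 'A'  (string repetition)
0 * 'B'          # -> ''
'A' + '' + ''    # -> 'A'  (concatenation sums the row)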
I am trying to calculate a point-biserial correlation for a set of columns in my datasets. I am able to do it on an individual variable; however, when I try to calculate it for all the columns in one iteration, it throws an error.
Below is the code:
df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10],
                   'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
from scipy import stats

corr_list = {}
y = df['A'].astype(float)
for column in df:
    x = df[['B', 'C', 'D']].astype(float)
    corr = stats.pointbiserialr(x, y)
    corr_list[['B', 'C', 'D']] = corr
print(corr_list)
TypeError: No loop matching the specified signature and casting was found for ufunc add
x must be a column, not a DataFrame. If you pass a single column instead of the whole DataFrame, it will work. You can try this:
df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10],
                   'C': [9, 4, 6, 9, 10], 'D': [8, 9, 5, 7, 10]})
print(df)
from scipy import stats

corr_list = []
y = df['A'].astype(float)
for column in df:    # note: this also correlates 'A' with itself (r = 1.0)
    x = df[column]
    corr = stats.pointbiserialr(list(x), list(y))
    corr_list.append(corr[0])
print(corr_list)
By the way, you can use print(df.corr()), which gives you the correlation matrix of the DataFrame.
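That matrix is relevant here because the point-biserial correlation is mathematically the Pearson correlation computed when one variable is binary; a quick check, as a sketch on a subset of the question's data:

from scipy import stats
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0, 1], 'B': [6, 7, 8, 9, 10]})

# point-biserial r equals Pearson r when one variable is dichotomous
r_pb = stats.pointbiserialr(df['A'], df['B'])[0]
r_pearson = df['A'].corr(df['B'])   # Series.corr uses Pearson by default
print(r_pb, r_pearson)              # the two values match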
You can use the pd.DataFrame.corrwith() function:
df[['B', 'C', 'D']].corrwith(df['A'].astype('float'), method=stats.pointbiserialr)
The output has one column per input column, with the correlation in row 0 and the p-value in row 1, computed against the target Series:
              B         C         D
0  4.547937e-18  0.400066 -0.094916
1  1.000000e+00  0.504554  0.879331
I'd like to split my time-series data into X and y by shifting the data. The dummy dataframe looks like this (first two rows shown):

   x-2  x-1  x0  x1  x2
0    3    0   5   7   1
1    2    3   0   5   6

i.e. if the number of time steps equals 2, X and y look like: X=[3,0] -> y=[5],
X=[0,5] -> y=[7] (and this should be applied to all samples (rows)).
I wrote the function below, but it returns empty matrices when I pass a pandas DataFrame to it.
import numpy as np

def create_dataset(dataset, time_step=1):
    dataX, dataY = [], []
    for i in range(len(dataset) - time_step - 1):
        a = dataset.iloc[:, i:(i + time_step)]
        dataX.append(a)
        dataY.append(dataset.iloc[:, i + time_step])
    return np.array(dataX), np.array(dataY)
Thank you for any solutions.
Here is an example that replicates the expected output, IIUC:
import pandas as pd

# function to process each row
def process_row(s):
    assert isinstance(s, pd.Series)
    return pd.concat([
        s.rename('timestep'),
        s.shift(-1).rename('x_1'),
        s.shift(-2).rename('x_2'),
        s.shift(-3).rename('y')
    ], axis=1).dropna(how='any', axis=0).astype(int)

# test case for the example
process_row(pd.Series([2, 3, 0, 5, 6]))

# type in the first two rows of the data frame
df = pd.DataFrame(
    {'x-2': [3, 2], 'x-1': [0, 3],
     'x0': [5, 0], 'x1': [7, 5], 'x2': [1, 6]})

# perform the transformation
ts = list()
for idx, row in df.iterrows():
    t = process_row(row)
    t.index = [idx] * t.index.size
    ts.append(t)

print(pd.concat(ts))
# results
   timestep  x_1  x_2  y
0         3    0    5  7
0         0    5    7  1
1         2    3    0  5   <-- first part of expected results
1         3    0    5  6   <-- second part
Do you mean something like this:
df = df.shift(periods=-2, axis='columns')
# you can also pass a fill_value parameter
df = df.shift(periods=-2, axis='columns', fill_value=0)
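For example, on a toy frame (hypothetical data), shifting by -2 along the columns moves every value two columns to the left and fills the vacated columns:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
print(df.shift(periods=-2, axis='columns', fill_value=0))
#    a  b  c
# 0  5  0  0
# 1  6  0  0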
I have a dataframe like this
group         b         c         d         e  label
A      0.577535  0.299304  0.617103  0.378887      1
       0.167907  0.244972  0.615077  0.311497      0
B      0.640575  0.768187  0.652760  0.822311      0
       0.424744  0.958405  0.659617  0.998765      1
       0.077048  0.407182  0.758903  0.273737      0
I want to reshape it into a 3D array which an LSTM could use as input, using padding. So group A should feed in a sequence of length 3 (its 2 rows plus 1 row of padding) and group B a sequence of length 3. The desired output is something like
array1 = [[[0.577535, 0.299304, 0.617103, 0.378887],
           [0.167907, 0.244972, 0.615077, 0.311497],
           [0, 0, 0, 0]],
          [[0.640575, 0.768187, 0.652760, 0.822311],
           [0.424744, 0.958405, 0.659617, 0.998765],
           [0.077048, 0.407182, 0.758903, 0.273737]]]
and then the labels have to be reshaped accordingly too
array2 = [[1, 0, 0],
          [0, 1, 0]]
How can I put in the padding and reshape my data?
You can first use cumcount to create a counter within each group, reindex by MultiIndex.from_product filling missing rows with 0, and finally export to nested lists:
df["count"] = df.groupby("group")["label"].cumcount()
mux = pd.MultiIndex.from_product([df["group"].unique(), range(max(df["count"]+1))], names=["group","count"])
df = df.set_index(["group","count"]).reindex(mux, fill_value=0)
print (df.iloc[:,:4].groupby(level=0).apply(pd.Series.tolist).values.tolist())
[[[0.577535, 0.299304, 0.617103, 0.378887],
[0.167907, 0.24497199999999997, 0.6150770000000001, 0.31149699999999997],
[0.0, 0.0, 0.0, 0.0]],
[[0.640575, 0.768187, 0.65276, 0.822311],
[0.42474399999999995, 0.958405, 0.659617, 0.998765],
[0.077048, 0.40718200000000004, 0.758903, 0.273737]]]
print(df.groupby(level=0)["label"].apply(list).tolist())
[[1, 0, 0], [0, 1, 0]]
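If the LSTM needs actual NumPy arrays rather than nested lists, a small follow-up sketch (the X/y names are just illustrative, and it assumes the reindexed df from above):

import numpy as np

X = np.array(df.iloc[:, :4].groupby(level=0).apply(pd.Series.tolist).tolist())
y = np.array(df.groupby(level=0)["label"].apply(list).tolist())
print(X.shape, y.shape)   # (2, 3, 4) (2, 3)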
I'm assuming your group column consists of many values and not just one 'A' and one 'B'. This code worked for me; you can give it a try as well:
import pandas as pd

df = pd.read_csv('file2.csv')
vals = df['group'].unique()
# feature columns, i.e. everything except the group id and the label
feature_cols = df.columns.drop(['group', 'label'])
array1 = []
array2 = []
for val in vals:
    val_df = df[df.group == val]
    val_label = val_df.label
    smaller_array = []
    label_small_array = []
    for label in val_label:
        label_small_array.append(label)
    array2.append(label_small_array)
    for i in range(val_df.shape[0]):
        smallest_array = []
        for j in feature_cols:
            # append the cell value, not the column name
            smallest_array.append(val_df.iloc[i][j])
        smaller_array.append(smallest_array)
    array1.append(smaller_array)
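Note that this loop collects the per-group sequences but does not pad them; one common way to pad them to equal length afterwards is Keras' pad_sequences (a sketch, assuming TensorFlow is installed):

from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad every group's sequence to the length of the longest group, at the end
X = pad_sequences(array1, padding='post', dtype='float32', value=0.0)
y = pad_sequences(array2, padding='post', value=0)
print(X.shape)   # (num_groups, max_len, num_features)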
I have 2 NumPy arrays, and I am using the top row as column headers. Each array has the same columns except for two: arr2 has a different C column as well as an additional column.
How can I combine all of these columns into a single NumPy array?
import numpy as np

arr1 = [['A', 'B', 'C1'], [1, 1, 0], [0, 1, 1]]
arr2 = [['A', 'B', 'C2', 'C3'], [0, 1, 0, 1], [0, 0, 1, 0]]
a1 = np.array(arr1)
a2 = np.array(arr2)
b = np.append(a1, a2, axis=0)
print(b)
# Desired Result
# A B C1 C2 C3
# 1 1 0 - -
# 0 1 1 - -
# 0 1 - 0 1
# 0 0 - 1 0
NumPy arrays aren't great for handling data with named columns, which might contain different types. Instead, I would use pandas for this. For example:
import pandas as pd
arr1 = [[1, 1, 0], [0, 1, 1]]
arr2 = [[0, 1, 0, 1], [0, 0, 1, 0]]
df1 = pd.DataFrame(arr1, columns=['A', 'B', 'C1'])
df2 = pd.DataFrame(arr2, columns=['A', 'B', 'C2', 'C3'])
df = pd.concat([df1, df2], sort=False)
df.to_csv('mydata.csv', index=False)
This results in a 'dataframe', a spreadsheet-like data structure; Jupyter Notebooks render these as a formatted table.
You might notice there's an extra new column; this is the "index", which you can think of as row labels. You don't need it if you don't want it in your CSV, but if you carry on doing things in the dataframe, you might want to do df = df.reset_index() to relabel the rows in a more useful way.
If you want the dataframe back as a NumPy array, you can do df.values and away you go. It doesn't have the column names though.
Last thing: if you really want to stay in NumPy-land, then check out structured arrays, which give you another way to name the columns, essentially, in an array. Honestly, since pandas came along, I hardly ever see these in the wild.
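For example, a minimal structured-array sketch (the field names and dtypes here are just illustrative):

import numpy as np

# each field behaves like a named column with its own dtype
a = np.array([(1, 1, 0), (0, 1, 1)],
             dtype=[('A', 'i4'), ('B', 'i4'), ('C1', 'i4')])
print(a['C1'])   # -> array([0, 1], dtype=int32)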
Suppose I create a Pandas DataFrame as below:
import pandas as pd
import numpy as np
np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)
As an example, this generates the below:

           0          1         2          3          4
0  17.640523   4.001572  9.787380  22.408932  18.675580
1  -9.772779   9.500884 -1.513572  -1.032189   4.105985
2   1.440436  14.542735  7.610377   1.216750   4.438632
3   3.336743  14.940791 -2.051583   3.130677  -8.540957
4 -25.529898   6.536186  8.644362  -7.421650  22.697546
For each row, I am looking for a way to readily obtain the indices corresponding to the largest n (say 3) values in absolute-value terms. For example, for the first row, I would expect [0, 3, 4]. We can assume that the results don't need to be ordered.
I tried searching for solutions similar to idxmax and argmax, but it seems these do not readily handle multiple values.
You can use argsort(axis=1) on the absolute values:
Given the dataset:

np.random.seed(0)
x = 10*np.random.randn(5,5)
df = pd.DataFrame(x)

           0          1         2          3          4
0  17.640523   4.001572  9.787380  22.408932  18.675580
1  -9.772779   9.500884 -1.513572  -1.032189   4.105985
2   1.440436  14.542735  7.610377   1.216750   4.438632
3   3.336743  14.940791 -2.051583   3.130677  -8.540957
4 -25.529898   6.536186  8.644362  -7.421650  22.697546
df.abs().values.argsort(1)[:, -3:][:, ::-1]
array([[3, 4, 0],
[0, 1, 4],
[1, 2, 4],
[1, 4, 0],
[0, 4, 2]])
Try this (it is not the most optimal code):
idx_nmax = {}
n = 3
for index, row in df.iterrows():
    # take abs() first, since the question asks for the largest values in absolute terms
    idx_nmax[index] = list(row.abs().nlargest(n).index)
At the end of that you will have a dictionary with:
the index of the row as the key
and the indices of the n largest (absolute) values of that row as the value
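For the example frame above (same seed), this should give:

print(idx_nmax)
# {0: [3, 4, 0], 1: [0, 1, 4], 2: [1, 2, 4], 3: [1, 4, 0], 4: [0, 4, 2]}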