What is the idiomatic way to store this kind of data structure in a pandas DataFrame:
### Option 1
df = pd.DataFrame(data = [
{'kws' : np.array([0,0,0]), 'x' : i, 'y' : i} for i in range(10)
])
# df.x and df.y work as expected
# the list and array casting is required because df.kws is
# an array of arrays
np.array(list(df.kws))
# this causes problems when trying to assign as well though:
# for any other data type, this would set all kws in df to the rhs [1,2,3]
# but since the rhs is a list, it tries to do an element-wise assignment and
# errors saying that the length of df and the length of the rhs do not match
df.kws = [1,2,3]
### Option 2
df = pd.DataFrame(data = [
{'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y' : i} for i in range(10)
])
# retrieving 2d array:
df[sorted([c for c in df if c.startswith('kw_')])].values
# batch set:
kws = [1,2,3]
for i, kw in enumerate(kws):
    df['kw_' + str(i)] = kw
Neither of these solutions feels right to me. For one, neither of them allows retrieving a 2d matrix without copying all of the data. Is there a better way to handle this kind of mixed-dimension data, or is this just a task that pandas isn't up to right now?
Just use a column MultiIndex; see the docs on hierarchical indexing.
In [31]: df = pd.DataFrame([ {'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y': i} for i in range(10) ])
In [32]: df
Out[32]:
kw_0 kw_1 kw_2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
In [33]: df.columns = pd.MultiIndex.from_tuples([('kw',0),('kw',1),('kw',2),('value','x'),('value','y')])
In [34]: df
Out[34]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Selection is easy
In [35]: df['kw']
Out[35]:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Setting too
In [36]: df.loc[1,'kw'] = [4,5,6]
In [37]: df
Out[37]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 4 5 6 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Alternatively, you can use 2 dataframes, indexed the same, and combine/merge when needed, as sketched below.
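A minimal sketch of that two-DataFrame alternative (the names kws and vals are illustrative, not from the answer):
import numpy as np
import pandas as pd
# the keyword matrix and the scalar columns live in separate frames sharing the same index
kws = pd.DataFrame(np.zeros((10, 3), dtype=int))
vals = pd.DataFrame({'x': range(10), 'y': range(10)})
kws.values               # the 2d matrix, no list/array casting needed
kws.loc[1] = [4, 5, 6]   # row assignment works as expected
# combine on demand by joining on the shared index
combined = vals.join(kws.add_prefix('kw_'))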
I am trying to handle the following dataframe:
import pandas as pd
df = pd.DataFrame(
data = {'m1' : [0,0,1,0,0,0,0,0,0,0,0],
'm2' : [0,0,0,0,0,1,0,0,0,0,0],
'm3' : [0,0,0,0,0,0,0,0,1,0,0],
'm4' : [0,1,0,0,0,0,0,0,0,0,0],
'm5' : [0,0,0,0,0,0,0,0,0,0,0],
'm6' : [0,0,0,0,0,0,0,0,0,1,0]}
)
df
# output:
m1 m2 m3 m4 m5 m6
0 0 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 1 0 0 0 0
6 0 0 0 0 0 0
7 0 0 0 0 0 0
8 0 0 1 0 0 0
9 0 0 0 0 0 1
10 0 0 0 0 0 0
From the above dataframe, I want to separate m1 and other features.
Assign 1 to m_other if any of m2 to m6 is 1.
Ideal results are shown below.
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
I thought about adapting the any function, but I stumbled and couldn't figure it out.
If anyone has any good ideas, I would appreciate it if you could share them with me.
Use DataFrame.any or DataFrame.max over the remaining columns and attach the result alongside the m1 column:
# select all columns except the first
df1 = df[['m1']].assign(m_other=df.iloc[:, 1:].max(axis=1))
df1 = df[['m1']].assign(m_other=df.iloc[:, 1:].any(axis=1).astype(int))
# select all columns except m1
df1 = df[['m1']].assign(m_other=df.drop('m1', axis=1).max(axis=1))
# select columns from m2 through m6
df1 = df[['m1']].assign(m_other=df.loc[:, 'm2':'m6'].max(axis=1))
print (df1)
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
Does this work:
pd.concat([df['m1'], df.iloc[:,1:].apply(lambda x : 1 if x.any() == 1 else 0, axis = 1)], axis = 1, keys = ['m1','m_other'])
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
A simpler way is to separate the data into two dataframes, then recombine them.
# data is the dataframe for the result
data = pd.DataFrame(columns=['m1','m_other'])
# there's no change in m1, so we assign it directly
data.m1 = df.m1
# we create a dataframe for the other columns
data_other = df[['m2','m3','m4','m5','m6']]
# we assign True if any of m2 to m6 has the value 1
data.m_other = [any(data_other.iloc[i] == 1) for i in range(len(df))]
# we map it to 1 and 0 instead of True and False
data.m_other = data.m_other.astype(int)
# this is our final result
data
Here is one way to do it: use concat to combine the first column with the row-wise max of the remaining columns, then rename the resulting column.
df2=pd.concat([df.iloc[:,:1],(df.iloc[:,1:].max(axis=1))], axis=1)
df2=df2.rename(columns={0:'m_other'})
df2
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
I am trying to do a cumulative sum by intervals, i.e. with the cumsum reset to zero whenever the value to accumulate is 0. Below is an example, followed by the desired result. I have tried using numpy 'convolve' and 'groupby' but can't come up with a way to do the reset except by writing a function that loops over all the rows. Is there a clever approach I'm missing? Note that the real data in column 'x' are real numbers separated by 0's.
import numpy as np
import pandas as pd
a = pd.DataFrame([[0,0],[1,0],[1,0],[1,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],\
[0,0],[0,0],[0,0],[0,0],[1,0],[1,0],[0,0]], columns=["x","y"])
def patch(k):
    k["z"] = k.x.cumsum()
    return k
print(patch(a))
Current output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 3
6 0 0 3
7 0 0 3
9 0 0 3
10 0 0 3
12 0 0 3
13 1 0 4
15 1 0 5
16 0 0 5
Desired output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
6 0 0 0
7 0 0 0
9 0 0 0
10 0 0 0
12 0 0 0
13 1 0 1
15 1 0 2
16 0 0 0
Do a groupby keyed on the cumulative count of zeros in x, then take the cumulative sum within each group:
a['z'] = a.groupby(a['x'].eq(0).cumsum())['x'].cumsum()
Output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
6 0 0 0
7 0 0 0
9 0 0 0
10 0 0 0
12 0 0 0
13 1 0 1
15 1 0 2
16 0 0 0
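To see why the reset happens, look at the intermediate group key (a quick sketch): a['x'].eq(0).cumsum() increments at every zero, so each zero opens a new group, and since the zero itself adds nothing to the within-group cumulative sum, z restarts with the next run of nonzero values.
g = a['x'].eq(0).cumsum()
# x: 0 1 1 1 0 0 0 0 0 0 0 0 0  0  1  1  0
# g: 1 1 1 1 2 3 4 5 6 7 8 9 10 11 11 11 12
a.groupby(g)['x'].cumsum()  # cumsum restarts inside each group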
I have a csv with data that I want to import into an ndarray so I can manipulate it. The csv data is formatted like this.
u i r c
1 1 5 1
2 2 5 1
3 3 1 0
4 4 1 1
I want to get all the elements with c = 1 in a row, and the ones with c = 0 in another one, like so, reducing the dimensionality.
1 1 1 5 2 2 5 4 4 1
0 3 3 1
However, different u and i can't be in the same column, hence the final result needing zero padding, like this. I want to keep the c variable column, since this represents a categorical variable, so I need to keep its value to be able to make the correspondence between the information and the c value. I don't want to just separate the data according to the value of c.
1 1 1 5 2 2 5 0 0 0 4 4 1
0 0 0 0 0 0 0 3 3 1 0 0 0
So far, I'm reading the .csv file with df = pd.read_csv and creating a multidimensional array/tensor by using arr=df.to_numpy(). After that, I'm permuting the order of the columns to make the c column the first one, getting this array [[ 1 1 1 5][ 1 2 2 5][ 0 3 3 1][ 1 4 4 1]].
I then do arr = arr.reshape(2,), since there are two possible values for c, and then delete all but the first c column according to the length of the tuples. So in this case, since there are 4 elements in each tuple and 16 elements in total, I'm doing arr = np.delete(arr, (4,8,12), axis=1).
Finally, I'm doing this to pad the array with zeros when the u doesn't match with both columns.
nomatch = 0
cols = arr.shape[1]  # assumed definition; 'cols' was not defined in the snippet
for j in range(1, cols, 3):
    if arr[0][j] != arr[1][j]:
        nomatch += 1
z = np.zeros(nomatch*3, dtype=arr.dtype)
h1 = np.split(arr, [0][0])
new0 = np.concatenate((arr[0],z))
new1 = np.concatenate((z,arr[1])) # problem
final = np.concatenate((new0, new1))
In the line with the comment, the problem is how to concatenate the arrays while keeping the first element in place. Instead of just appending, I'd like to be able to set a start and end index and patch the zeros only at those indexes. Using concatenate, I don't get the expected result, since I'm altering the first element (the head of the array should be untouched).
Additionally, I can't help but wonder if this is a good way to achieve the end result. For example, I tried to pad the array with np.resize() before reshaping, but it doesn't work: when I print the result, the array is the same as before, no matter what dimensions I pass. A good solution would be one that adapts if there are 3 or more possible values for c, and that could include multiple c-like values, such as c1, c2..., which would become rows in the table. I appreciate all the input and suggestions in advance.
Here is a compact numpy approach. np.bitwise_xor.outer(np.arange(2), c) builds a mask that is 1 in row 0 exactly where c == 1 and 1 in row 1 where c == 0; multiplying by the data columns zeroes out the rows that don't belong, and the reshape flattens each masked copy into one output row:
asnp = df.to_numpy()
(np.bitwise_xor.outer(np.arange(2),asnp[:,3:])*asnp[:,:3]).reshape(2,-1)
# array([[1, 1, 5, 2, 2, 5, 0, 0, 0, 4, 4, 1],
# [0, 0, 0, 0, 0, 0, 3, 3, 1, 0, 0, 0]])
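If you also want the c value prepended to each row, as in the desired output in the question, one way (a sketch; it assumes the c=1-then-c=0 row order produced above) is:
out = (np.bitwise_xor.outer(np.arange(2),asnp[:,3:])*asnp[:,:3]).reshape(2,-1)
np.hstack([np.array([[1],[0]]), out])
# array([[1, 1, 1, 5, 2, 2, 5, 0, 0, 0, 4, 4, 1],
#        [0, 0, 0, 0, 0, 0, 0, 3, 3, 1, 0, 0, 0]])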
UPDATE: multi-category:
The categories must be the last k columns and have column headers starting with "cat". We create a row for each unique combination of categories; this combination is prepended to the row.
Code:
import numpy as np
import pandas as pd
import itertools as it
def spreadcats(df):
    # number of trailing category columns (headers starting with "cat")
    cut = sum(map(str.startswith, df.columns, it.repeat("cat")))
    data = df.to_numpy()
    # unique category combinations; idx maps each row to its combination
    cats, idx = np.unique(data[:, -cut:], axis=0, return_inverse=True)
    m, n, k, _ = data.shape + cats.shape
    # one output row per combination: the combination itself, then m slots of data
    out = np.zeros((k, cut + (n - cut) * m), int)
    out[:, :cut] = cats
    # scatter each input row into its combination's row, at its original position
    out[:, cut:].reshape(k, m, n - cut)[idx, np.arange(m)] = data[:, :-cut]
    return out
x = np.random.randint([1,1,1,0,0],[10,10,10,3,2],(10,5))
df = pd.DataFrame(x,columns=[f"data{i}" for i in "123"] + ["cat1","cat2"])
print(df)
print(spreadcats(df))
Sample run:
data1 data2 data3 cat1 cat2
0 9 5 1 1 1
1 7 4 2 2 0
2 3 9 8 1 0
3 3 9 1 1 0
4 9 1 7 2 1
5 1 3 7 2 0
6 2 8 2 1 0
7 1 4 9 0 1
8 8 7 3 1 1
9 3 6 9 0 1
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 9 0 0 0 3 6 9]
[1 0 0 0 0 0 0 0 3 9 8 3 9 1 0 0 0 0 0 0 2 8 2 0 0 0 0 0 0 0 0 0]
[1 1 9 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 7 3 0 0 0]
[2 0 0 0 0 7 4 2 0 0 0 0 0 0 0 0 0 1 3 7 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 0 0 0 0 0 0 0 0 0 0 0 9 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
I have this one-level dataframe:
d = {'A': np.random.randint(0, 10, 5)
, 'B': np.random.randint(0, 10, 5)
, 'C': np.random.randint(0, 10, 5)
, 'D': np.random.randint(0, 10, 5)}
x = pd.DataFrame(d)
print(x)
A B C D
0 8 7 6 0
1 6 5 4 9
2 4 0 5 7
3 1 9 7 9
4 6 9 9 8
And this multi-level one:
from functools import reduce
v = ['u','v','z']
l = ['300','350','400','450','500'] * len(v)
d = ['1','2','3','4'] * len(l)
size = len(v) * len(l) * len(d)
der_v = reduce(lambda x,y: x+y, [[i] * 20 for i in v])
der_l = reduce(lambda x,y: x+y, [[i] * 4 for i in l])
der_d = reduce(lambda x,y: x+y, [[i] for i in d])
arrays =[der_v,der_l,der_d]
y = pd.DataFrame(np.random.randint(0, 1, (5,60)),index=range(0,5), columns=arrays)
print(y)
u ... z
300 350 400 ... 400 450 500
1 2 3 4 1 2 3 4 1 2 ... 3 4 1 2 3 4 1 2 3 4
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
[5 rows x 60 columns]
I'm trying to concat:
z = pd.concat([x, y], axis=1)
So I got something like this:
A B C D (u, 300, 1) (u, 300, 2) (u, 300, 3) (u, 300, 4) \
0 8 7 6 0 0 0 0 0 ...
1 6 5 4 9 0 0 0 0 ...
2 4 0 5 7 0 0 0 0 ...
3 1 9 7 9 0 0 0 0 ...
4 6 9 9 8 0 0 0 0 ...
But I got the columns as tuples, e.g. (u, 300, 1). It's weird! Is it possible to have one-level and multi-level columns on axis 1 at the same time?
Expected output:
u ... z
A B C D 300 350 400 ... 400 450 500
1 2 3 4 1 2 3 4 1 2 ... 3 4 1 2 3 4 1 2 3 4
0 8 7 6 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6 5 4 9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 4 0 5 7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 9 7 9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 6 9 9 8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
I really don't know if it's possible to have columns with one level and multiple levels in the same frame, so I hope slicing can still work. For example, y.loc[:,('u','500')] works fine, but after concatenating it doesn't work anymore.
I solved it without concatenating, because it is not possible to have different numbers of column levels on axis 1. Instead, I decided to use the data in dataframe x as the index of dataframe y.
So, follow these steps:
1. Create dataframe x:
d = {'A': np.random.randint(0, 10, 5)
, 'B': np.random.randint(0, 10, 5)
, 'C': np.random.randint(0, 10, 5)
, 'D': np.random.randint(0, 10, 5)}
x = pd.DataFrame(d)
A B C D
0 7 1 6 8
1 4 0 5 6
2 7 5 0 7
3 8 4 3 8
4 9 1 4 0
2. Create an index based on dataframe x:
index = [x[col] for col in x.columns]
3. Create the features for dataframe y:
from functools import reduce
v = ['u','v','z']
l = ['300','350','400','450','500'] * len(v)
d = ['1','2','3','4'] * len(l)
size = len(v) * len(l) * len(d)
der_v = reduce(lambda x,y: x+y, [[i] * 20 for i in v])
der_l = reduce(lambda x,y: x+y, [[i] * 4 for i in l])
der_d = reduce(lambda x,y: x+y, [[i] for i in d])
arrays =[der_v,der_l,der_d]
4. Now, to create dataframe y, use the index from x as a parameter:
y = pd.DataFrame(np.random.randint(0, 1, (5,60)), columns=arrays, index=index)
y.columns = y.columns.rename(['variables', 'level','days'], level=[0,1,2])
y.index.names = ['A','B','C','D']
print(y)
variables u ... z \
level 300 350 400 ... 400 450 500
days 1 2 3 4 1 2 3 4 1 2 ... 3 4 1 2 3 4 1 2 3
A B C D ...
7 1 6 8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
4 0 5 6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
7 5 0 7 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
8 4 3 8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
9 1 4 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0
variables
level
days 4
A B C D
7 1 6 8 0
4 0 5 6 0
7 5 0 7 0
8 4 3 8 0
9 1 4 0 0
[5 rows x 60 columns]
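As an aside (a sketch, not part of the answer above): pandas can concatenate the original x and y from the question directly if x's flat columns are first padded to the same number of levels, conventionally with empty strings:
x.columns = pd.MultiIndex.from_product([x.columns, [''], ['']])
z = pd.concat([x, y], axis=1)  # both sides now have 3 column levels
z.loc[:, ('u', '500')]         # multi-level slicing works again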
Is there a more efficient way to create multiple new columns in a pandas dataframe df initialized to zero than:
for col in add_cols:
    df.loc[:, col] = 0
UPDATE: using @Jeff's method, but doing it dynamically:
In [208]: add_cols = list('xyz')
In [209]: df.assign(**{i:0 for i in add_cols})
Out[209]:
a b c x y z
0 4 8 6 0 0 0
1 3 7 0 0 0 0
2 4 0 1 0 0 0
3 5 4 5 0 0 0
4 1 3 0 0 0 0
OLD answer:
Another method:
df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
Demo:
In [343]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [344]: add_cols = list('xyz')
In [345]: add_cols
Out[345]: ['x', 'y', 'z']
In [346]: df
Out[346]:
a b c
0 4 9 0
1 1 1 1
2 8 8 1
3 0 1 4
4 8 5 6
In [347]: df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
In [348]: df
Out[348]:
a b c x y z
0 4 9 0 0 0 0
1 1 1 1 0 0 0
2 8 8 1 0 0 0
3 0 1 4 0 0 0
4 8 5 6 0 0 0
In [13]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [14]: df
Out[14]:
a b c
0 7 2 3
1 7 0 7
2 5 1 5
3 9 1 4
4 2 1 4
In [15]: df.assign(x=0, y=0, z=0)
Out[15]:
a b c x y z
0 7 2 3 0 0 0
1 7 0 7 0 0 0
2 5 1 5 0 0 0
3 9 1 4 0 0 0
4 2 1 4 0 0 0
Here is a hack (the list comprehension is used only for its side effect, since df.insert adds each column in place and returns None):
[df.insert(0, col, 0) for col in add_cols]
You can treat a DataFrame with a dict-like syntax:
for col in add_cols:
    df[col] = 0
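One more variant worth knowing (a sketch, not from the answers above): reindex can add all the new columns in one shot and fill them with zero:
df = df.reindex(columns=[*df.columns, *add_cols], fill_value=0)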