I have a CSV file with data that I want to import into an ndarray so I can manipulate it. The CSV data is formatted like this:
u i r c
1 1 5 1
2 2 5 1
3 3 1 0
4 4 1 1
I want to get all the elements with c = 1 in one row and the ones with c = 0 in another, reducing the dimensionality, like so:
1 1 1 5 2 2 5 4 4 1
0 3 3 1
However, different u and i can't be in the same column, so the final result needs zero padding, like this. I want to keep the c value as the first column, since it represents a categorical variable and I need it to match each row of information to its c value; I don't want to just split the data by the value of c.
1 1 1 5 2 2 5 0 0 0 4 4 1
0 0 0 0 0 0 0 3 3 1 0 0 0
So far, I'm reading the .csv file with df = pd.read_csv and creating a multidimensional array/tensor with arr = df.to_numpy(). After that, I'm permuting the order of the columns to make the c column the first one, getting this array:
[[1 1 1 5]
 [1 2 2 5]
 [0 3 3 1]
 [1 4 4 1]]
I then do arr = arr.reshape(2, -1), since there are two possible values for c, and then delete all but the first c column according to the length of the tuples. In this case, since there are 4 elements in each tuple and 16 elements in total, I'm doing arr = np.delete(arr, (4,8,12), axis=1).
Finally, I'm doing this to pad the array with zeros when a u doesn't appear in both rows.
cols = arr.shape[1]
nomatch = 0
for j in range(1, cols, 3):
    if arr[0][j] != arr[1][j]:
        nomatch += 1
z = np.zeros(nomatch * 3, dtype=arr.dtype)
new0 = np.concatenate((arr[0], z))
new1 = np.concatenate((z, arr[1]))  # problem
final = np.concatenate((new0, new1))
In the line with the comment, the problem is how to concatenate while keeping the first element in place. Instead of just prepending, I'd like to be able to set a start and end index and patch the zeros only at those indexes. Using concatenate, I don't get the expected result, since I'm shifting the first element (the head of the array should stay untouched).
Additionally, I can't help but wonder whether this is a good way to achieve the end result. For example, I tried to pad the array with np.resize() before reshaping, but it doesn't work: when I print the result, the array is the same as before, no matter what dimensions I pass as arguments. A good solution would be one that adapts if there were 3 or more possible values for c, and that could handle multiple c-like columns, such as c1, c2, ..., which would become rows in the table. I appreciate all input and suggestions in advance.
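On the specific padding question: slice assignment writes into a chosen index range of a preallocated array and leaves the rest untouched, which avoids the concatenate problem (a minimal sketch with illustrative names and sizes):
padded = np.zeros(12, dtype=int)          # preallocated row of zeros
block = np.array([3, 3, 1])               # the values to place
start = 6                                 # chosen start index
padded[start:start + len(block)] = block  # only these indexes are modified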
Here is a compact numpy approach:
asnp = df.to_numpy()
(np.bitwise_xor.outer(np.arange(2),asnp[:,3:])*asnp[:,:3]).reshape(2,-1)
# array([[1, 1, 5, 2, 2, 5, 0, 0, 0, 4, 4, 1],
# [0, 0, 0, 0, 0, 0, 3, 3, 1, 0, 0, 0]])
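For reference, here is the same expression unpacked step by step; the intermediate names are mine, and the shapes refer to the 4x4 example above:
c = asnp[:, 3:]                               # (4, 1), the category column
mask = np.bitwise_xor.outer(np.arange(2), c)  # (2, 4, 1): mask[0] = c, mask[1] = 1 - c
kept = mask * asnp[:, :3]                     # (2, 4, 3): rows outside each category become zeros
out = kept.reshape(2, -1)                     # (2, 12): one flat row per category value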
UPDATE: multiple categories:
Categories must be the last k columns and have column headers starting with "cat". We create a row for each unique combination of categories; this combination is prepended to the row.
Code:
import numpy as np
import pandas as pd
import itertools as it
def spreadcats(df):
    # number of category columns (headers starting with "cat")
    cut = sum(map(str.startswith, df.columns, it.repeat("cat")))
    data = df.to_numpy()
    # unique category combinations and, for each row, the combination it belongs to
    cats, idx = np.unique(data[:, -cut:], axis=0, return_inverse=True)
    m, n, k, _ = data.shape + cats.shape
    out = np.zeros((k, cut + (n - cut) * m), int)
    out[:, :cut] = cats  # prepend the category combination
    # scatter each input row into its combination's output row, at its own column block
    out[:, cut:].reshape(k, m, n - cut)[idx, np.arange(m)] = data[:, :-cut]
    return out
x = np.random.randint([1,1,1,0,0],[10,10,10,3,2],(10,5))
df = pd.DataFrame(x,columns=[f"data{i}" for i in "123"] + ["cat1","cat2"])
print(df)
print(spreadcats(df))
Sample run:
data1 data2 data3 cat1 cat2
0 9 5 1 1 1
1 7 4 2 2 0
2 3 9 8 1 0
3 3 9 1 1 0
4 9 1 7 2 1
5 1 3 7 2 0
6 2 8 2 1 0
7 1 4 9 0 1
8 8 7 3 1 1
9 3 6 9 0 1
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 9 0 0 0 3 6 9]
[1 0 0 0 0 0 0 0 3 9 8 3 9 1 0 0 0 0 0 0 2 8 2 0 0 0 0 0 0 0 0 0]
[1 1 9 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 7 3 0 0 0]
[2 0 0 0 0 7 4 2 0 0 0 0 0 0 0 0 0 1 3 7 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 0 0 0 0 0 0 0 0 0 0 0 9 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
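As a sanity check, here is spreadcats applied to data mirroring the 4-row example from the first question (my own illustrative frame, with c renamed to cat1):
df0 = pd.DataFrame({"u": [1, 2, 3, 4], "i": [1, 2, 3, 4],
                    "r": [5, 5, 1, 1], "cat1": [1, 1, 0, 1]})
print(spreadcats(df0))
# [[0 0 0 0 0 0 0 3 3 1 0 0 0]
#  [1 1 1 5 2 2 5 0 0 0 4 4 1]]
This reproduces the zero-padded layout asked for, with the c value kept as the first column.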
I have a column called 'on' with a series of 0 and 1:
d1 = {'on': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0]}
df = pd.DataFrame(d1)
I want to create a new column called 'value' that does a cumulative count cumsum() only while the 'on' column is 1, and restarts from zero once the 'on' column shows zero.
I tried using a combination of cumsum() and np.where, but I don't get what I want:
df['value_try'] = df['on'].cumsum()
df['value_try'] = np.where(df['on'] == 0, 0, df['value_try'])
Attempt:
on value_try
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 4
9 1 5
10 0 0
What my desired output would be:
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
You can set groups on consecutive 0s or 1s by checking whether the value of on equals that of the previous row via .shift(), and getting a group number with Series.cumsum(). Then use GroupBy.cumsum() to get the value within each group.
g = df['on'].ne(df['on'].shift()).cumsum()
df['value'] = df.groupby(g)['on'].cumsum()
Result:
print(df)
on value
0 0 0
1 0 0
2 0 0
3 1 1
4 1 2
5 1 3
6 0 0
7 0 0
8 1 1
9 1 2
10 0 0
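To see what the grouper looks like on this data (each run of equal values gets its own label):
print(g.tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5]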
Let us try cumcount + cumsum
df['out'] = df.groupby(df['on'].eq(0).cumsum()).cumcount()
Out[18]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 0
8 1
9 2
10 0
dtype: int64
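The trick here is that every 0 opens a new group, so within each group the zero row gets cumcount 0 and the following 1s count up from there:
print(df['on'].eq(0).cumsum().tolist())
# [1, 2, 3, 3, 3, 3, 4, 5, 5, 5, 6]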
I have a pandas DataFrame. One of the columns contains nested lists. I would like to create new columns from the nested lists.
Example:
L = [[1, 2, 4],
     [5, 6, 7, 8],
     [9, 3, 5]]
I want all the elements in the nested lists as columns. The value should be one if the list has the element and zero if it does not.
1 2 4 5 6 7 8 9 3
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0
0 0 0 1 0 0 0 1 1
You can try the following:
df = pd.DataFrame({"A": L})
df
# A
#0 [1, 2, 4]
#1 [5, 6, 7, 8]
#2 [9, 3, 5]
# for each cell, use `pd.Series(1, x)` to create a Series object with the elements in the
# list as the index which will become the column headers in the result
df.A.apply(lambda x: pd.Series(1, x)).fillna(0).astype(int)
# 1 2 3 4 5 6 7 8 9
#0 1 1 0 1 0 0 0 0 0
#1 0 0 0 0 1 1 1 1 0
#2 0 0 1 0 1 0 0 0 1
pandas
Very similar to @Psidom's answer. However, I use pd.value_counts, which will handle repeats.
Using @Psidom's df:
df = pd.DataFrame({'A': L})
df.A.apply(pd.value_counts).fillna(0).astype(int)
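For instance, with a repeated element (my own illustrative data), the count shows up instead of a plain 1:
df2 = pd.DataFrame({'A': [[1, 1, 2], [2, 3]]})
df2.A.apply(pd.value_counts).fillna(0).astype(int)
#    1  2  3
# 0  2  1  0
# 1  0  1  1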
numpy
More involved, but speedy
lst = df.A.values.tolist()
n = len(lst)
lengths = [len(sub) for sub in lst]
flat = np.concatenate(lst)
# unique values become the columns; inv maps each flat element to its column
u, inv = np.unique(flat, return_inverse=True)
rng = np.arange(n)
# (row, column) pairs: each element's row number and its column position
slc = np.hstack([
    rng.repeat(lengths)[:, None],
    inv[:, None]
])
data = np.zeros((n, u.shape[0]), dtype=np.uint8)
data[slc[:, 0], slc[:, 1]] = 1
pd.DataFrame(data, df.index, u)
Results
1 2 3 4 5 6 7 8 9
0 1 1 0 1 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0
2 0 0 1 0 1 0 0 0 1
I want to start with an empty data frame and then add to it one row each time.
I can even start with a 0 data frame data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"]) and then replace one line each time.
How can I do that?
Use .loc for label-based selection. It is important that you understand how to slice properly: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label, and that you understand why you should avoid chained assignment: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
a b
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
In [15]:
data.loc[2:2,'a':'b']=5,6
data
Out[15]:
a b
0 0 0
1 0 0
2 5 6
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
If you are replacing the entire row then you can just use an index and not need row,column slices.
...
data.loc[2]=5,6
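If you really do want to grow from empty one row at a time, .loc also supports setting with enlargement. A sketch (note that appending in a loop is slow; collecting rows in a list and constructing the frame once is usually faster):
df = pd.DataFrame(columns=['a', 'b'])
for i in range(3):
    df.loc[i] = [i, i * 2]  # each assignment appends a new labeled row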
Consider the DataFrames P1 and P2:
P1 =
A B
0 0 0
1 0 1
2 1 0
3 1 1
P2 =
A B C
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I would like to know if there is a concise and efficient way of getting the indices in P1 for the rows (tuples/configurations/assignments) of columns ['A','B'] in P2.
That is, given P2['A','B']:
P2['A','B'] =
A B
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 1 1
I would like to get [0, 0, 1, 1, 2, 2, 3, 3], since the first and second rows in P2['A','B'] correspond to the first row in P1, and so on.
You could use merge and extract the overlapping keys:
In [3]: tmp = p2[['A', 'B']].merge(p1.reset_index())
In [4]: tmp
Out[4]:
A B index
0 0 0 0
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 2
5 1 0 2
6 1 1 3
7 1 1 3
Get the values.
In [5]: tmp['index'].values
Out[5]: array([0, 0, 1, 1, 2, 2, 3, 3], dtype=int64)
However, there could be a native NumPy method to do this as well.
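One possible NumPy route (a sketch; it assumes the rows of p1 are unique, so argmax picks the single match per row):
a = p2[['A', 'B']].to_numpy()
b = p1.to_numpy()
# compare every row of p2 against every row of p1, then take the matching index
idx = (a[:, None, :] == b[None, :, :]).all(axis=2).argmax(axis=1)
# array([0, 0, 1, 1, 2, 2, 3, 3])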
What is the idiomatic way to store this kind of data structure in a pandas DataFrame:
### Option 1
df = pd.DataFrame(data=[
    {'kws': np.array([0, 0, 0]), 'x': i, 'y': i} for i in range(10)
])
# df.x and df.y work as expected
# the list and array casting is required because df.kws is
# an array of arrays
np.array(list(df.kws))
# this causes problems when trying to assign as well though:
# for any other data type, this would set all kws in df to the rhs [1,2,3]
# but since the rhs is a list, it tries to do an element-wise assignment and
# errors saying that the length of df and the length of the rhs do not match
df.kws = [1,2,3]
### Option 2
df = pd.DataFrame(data=[
    {'kw_0': 0, 'kw_1': 0, 'kw_2': 0, 'x': i, 'y': i} for i in range(10)
])
# retrieving 2d array:
df[sorted([c for c in df if c.startswith('kw_')])].values
# batch set :
kws = [1,2,3]
for i, kw in enumerate(kws):
    df['kw_' + str(i)] = kw
Neither of these solutions feels right to me. For one, neither of them allows retrieving a 2D matrix without copying all of the data. Is there a better way to handle this kind of mixed-dimension data, or is this just a task that pandas isn't up to right now?
Just use a column MultiIndex; see the docs.
In [31]: df = pd.DataFrame([ {'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y': i} for i in range(10) ])
In [32]: df
Out[32]:
kw_0 kw_1 kw_2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
In [33]: df.columns = pd.MultiIndex.from_tuples([('kw',0),('kw',1),('kw',2),('value','x'),('value','y')])
In [34]: df
Out[34]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Selection is easy
In [35]: df['kw']
Out[35]:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Setting too
In [36]: df.loc[1,'kw'] = [4,5,6]
In [37]: df
Out[37]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 4 5 6 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Alternatively, you can use two DataFrames, indexed the same, and combine/merge when needed.
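With the MultiIndex layout, the 2D block also comes back without a per-column loop; whether it is a copy or a view depends on the frame's internal block layout:
kw_matrix = df['kw'].to_numpy()  # shape (10, 3)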