I have a pandas DataFrame. One of the columns contains a nested list, and I would like to create new columns from the nested list.
Example:
L = [[1,2,4],
[5,6,7,8],
[9,3,5]]
I want all the elements in the nested lists as columns. The value should be one if the list has the element and zero if it does not.
1 2 4 5 6 7 8 9 3
1 1 1 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0
0 0 0 1 0 0 0 1 1
You can try the following:
df = pd.DataFrame({"A": L})
df
# A
#0 [1, 2, 4]
#1 [5, 6, 7, 8]
#2 [9, 3, 5]
# for each cell, use `pd.Series(1, x)` to create a Series object with the elements in the
# list as the index which will become the column headers in the result
df.A.apply(lambda x: pd.Series(1, x)).fillna(0).astype(int)
# 1 2 3 4 5 6 7 8 9
#0 1 1 0 1 0 0 0 0 0
#1 0 0 0 0 1 1 1 1 0
#2 0 0 1 0 1 0 0 0 1
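On newer pandas (0.25+), an equivalent sketch using explode and crosstab (same df as above) would be:
s = df['A'].explode()
pd.crosstab(s.index, s).clip(upper=1)
# clip keeps the result as 0/1 indicators even if a list repeats a value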
pandas
Very similar to @Psidom's answer; however, I use pd.value_counts, which will also handle repeats.
Using @Psidom's df:
df = pd.DataFrame({'A': L})
df.A.apply(pd.value_counts).fillna(0).astype(int)
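As a quick illustration of the repeat handling, with a hypothetical frame whose first list contains a duplicate:
df_rep = pd.DataFrame({'A': [[1, 1, 2], [2, 3]]})
df_rep.A.apply(pd.value_counts).fillna(0).astype(int)
#    1  2  3
# 0  2  1  0
# 1  0  1  1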
numpy
More involved, but speedy
import numpy as np

lst = df.A.values.tolist()
n = len(lst)
lengths = [len(sub) for sub in lst]
flat = np.concatenate(lst)
u, inv = np.unique(flat, return_inverse=True)
rng = np.arange(n)
slc = np.hstack([
    rng.repeat(lengths)[:, None],
    inv[:, None]
])
data = np.zeros((n, u.shape[0]), dtype=np.uint8)
data[slc[:, 0], slc[:, 1]] = 1
pd.DataFrame(data, df.index, u)
Results
1 2 3 4 5 6 7 8 9
0 1 1 0 1 0 0 0 0 0
1 0 0 0 0 1 1 1 1 0
2 0 0 1 0 1 0 0 0 1
I have the following dataframe:
The basket_new column contains numbers from 0 to 5 in a list (the amount can vary for each number and transaction). I would like to count the occurrences of every number for each transaction and save that number in another DataFrame like this:
I just created a lambda function for Cat_0 to test it; unfortunately, it's not working, as it creates "None" entries (see picture 2).
This is the function:
df_cat["Cat_0"] = df_train["basket_new"].map(lambda x: df_cat["Cat_0"]+1 if "0" in x else None)
Can you please just tell me what I'm doing wrong / how to fix my issue?
Use explode and crosstab.
Let say you have a df like this:
df = pd.DataFrame({'a':[1,2,3,4], 'b':[[1,2],[0],[3,1,2,3],[4,2,2,2,1]]})
df:
a b
0 1 [1, 2]
1 2 [0]
2 3 [3, 1, 2, 3]
3 4 [4, 2, 2, 2, 1]
df1 = df['b'].explode()
df[['a', 'b']].join(pd.crosstab(df1.index, df1))
a b 0 1 2 3 4
0 1 [1, 2] 0 1 1 0 0
1 2 [0] 1 0 0 0 0
2 3 [3, 1, 2, 3] 0 1 1 2 0
3 4 [4, 2, 2, 2, 1] 0 1 3 0 1
If you want to rename columns:
df[['a', 'b']].join(pd.crosstab(df1.index, df1, colnames=['b']).add_prefix('cat_'))
a b cat_0 cat_1 cat_2 cat_3 cat_4
0 1 [1, 2] 0 1 1 0 0
1 2 [0] 1 0 0 0 0
2 3 [3, 1, 2, 3] 0 1 1 2 0
3 4 [4, 2, 2, 2, 1] 0 1 3 0 1
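If some categories never occur in the data, a hedged tweak that guarantees all six columns (assuming the categories are 0 to 5) is to reindex the crosstab:
df[['a', 'b']].join(pd.crosstab(df1.index, df1).reindex(columns=range(6), fill_value=0))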
Using the list.count() method:
df_cat["Cat_0"] = df['basket_new'].map(lambda x: x.count(0))
Using the first 5 rows of your data:
for i in range(6): df_cat["Cat_{}".format(i)] = df['basket_new'].map(lambda x: x.count(i))
Cat_0 Cat_1 Cat_2 Cat_3 Cat_4 Cat_5
0 0 0 0 1 0 0
1 1 0 0 2 0 1
2 0 1 0 1 1 1
3 0 0 1 0 0 0
4 0 0 0 0 4 0
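If you prefer to avoid the explicit loop, a hedged sketch that builds all six columns in one pass (assuming the lists hold the integers 0 to 5):
df_cat = pd.DataFrame({'Cat_{}'.format(i): df['basket_new'].map(lambda x, i=i: x.count(i))
                       for i in range(6)})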
This is rather lengthy, but it works:
df.explode('basket_new').groupby(['transaction_id','customerType','basket_new']).agg(count=('basket_new','count'))\
    .reset_index().pivot_table(index=['transaction_id','customerType'], columns='basket_new', values='count', fill_value=0)\
    .reset_index()
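A shorter equivalent sketch, assuming the same transaction_id, customerType and basket_new column names, uses crosstab on the exploded column:
e = df.explode('basket_new')
pd.crosstab([e['transaction_id'], e['customerType']], e['basket_new']).reset_index()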
I have a csv with data that I want to import to an ndarray so I can manipulate it. The csv data is formatted like this.
u i r c
1 1 5 1
2 2 5 1
3 3 1 0
4 4 1 1
I want to get all the elements with c = 1 in a row, and the ones with c = 0 in another one, like so, reducing the dimensionality.
1 1 1 5 2 2 5 4 4 1
0 3 3 1
However, different u and i can't be in the same column, hence the final result needing zero padding, like this. I want to keep the c variable column, since this represents a categorical variable, so I need to keep its value to be able to make the correspondence between the information and the c value. I don't want to just separate the data according to the value of c.
1 1 1 5 2 2 5 0 0 0 4 4 1
0 0 0 0 0 0 0 3 3 1 0 0 0
So far, I'm reading the .csv file with df = pd.read_csv and creating a multidimensional array/tensor with arr = df.to_numpy(). After that, I'm permuting the order of the columns so that the c column comes first, getting this array: [[1 1 1 5] [1 2 2 5] [0 3 3 1] [1 4 4 1]].
I then do arr = arr.reshape(2,), since there are two possible values for c, and then delete all but the first c column according to the length of the tuples. So in this case, since there are 4 elements in each tuple and 16 elements in total, I'm doing arr = np.delete(arr, (4,8,12), axis=1).
Finally, I'm doing this to pad the array with zeros when the u doesn't match with both columns.
nomatch = 0
for j in range(1, cols, 3):
    if arr[0][j] != arr[1][j]:
        nomatch += 1
z = np.zeros(nomatch*3, dtype=arr.dtype)
h1 = np.split(arr, [0][0])
new0 = np.concatenate((arr[0], z))
new1 = np.concatenate((z, arr[1]))  # problem
final = np.concatenate((new0, new1))
In the line marked with the comment, the problem is how to concatenate the array while keeping the first element untouched. Instead of just appending, I'd like to be able to set a start and end index and patch the zeros only at those indexes. By using concatenate, I don't get the expected result, since I'm altering the first element (the head of the array should be untouched).
Additionally, I can't help but wonder if this is a good way to achieve the end result. For example, I tried to pad the array with np.resize() before reshaping, but it doesn't work: when I print the result, the array is the same as before, no matter which dimensions I pass as arguments. A good solution would be one that adapts if there are 3 or more possible values for c, and that could include multiple c-like values, such as c1, c2..., which would become rows in the table. I appreciate all input and suggestions in advance.
Here is a compact numpy approach:
asnp = df.to_numpy()
(np.bitwise_xor.outer(np.arange(2),asnp[:,3:])*asnp[:,:3]).reshape(2,-1)
# array([[1, 1, 5, 2, 2, 5, 0, 0, 0, 4, 4, 1],
# [0, 0, 0, 0, 0, 0, 3, 3, 1, 0, 0, 0]])
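A small sketch unpacking the one-liner above (same asnp, columns u, i, r, c):
mask = np.bitwise_xor.outer(np.arange(2), asnp[:, 3:])  # shape (2, n_rows, 1); row 0 selects c == 1, row 1 selects c == 0
rows = mask * asnp[:, :3]                                # zero out the (u, i, r) triples that belong to the other class
rows.reshape(2, -1)                                      # flatten each class into one zero-padded row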
UPDATE: multi category:
Categories must be the last k columns and have column headers starting with "cat". We create a row for each unique combination of categories; this combination is prepended to the row.
Code:
import numpy as np
import pandas as pd
import itertools as it
def spreadcats(df):
    # number of trailing category columns (headers starting with "cat")
    cut = sum(map(str.startswith, df.columns, it.repeat("cat")))
    data = df.to_numpy()
    # unique category combinations and, for each row, the combination it belongs to
    cats, idx = np.unique(data[:, -cut:], axis=0, return_inverse=True)
    m, n, k, _ = data.shape + cats.shape
    # one output row per category combination: the combination itself, then an (n - cut)-wide slot for each input row
    out = np.zeros((k, cut + (n - cut) * m), int)
    out[:, :cut] = cats
    out[:, cut:].reshape(k, m, n - cut)[idx, np.arange(m)] = data[:, :-cut]
    return out
x = np.random.randint([1,1,1,0,0],[10,10,10,3,2],(10,5))
df = pd.DataFrame(x,columns=[f"data{i}" for i in "123"] + ["cat1","cat2"])
print(df)
print(spreadcats(df))
Sample run:
data1 data2 data3 cat1 cat2
0 9 5 1 1 1
1 7 4 2 2 0
2 3 9 8 1 0
3 3 9 1 1 0
4 9 1 7 2 1
5 1 3 7 2 0
6 2 8 2 1 0
7 1 4 9 0 1
8 8 7 3 1 1
9 3 6 9 0 1
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 9 0 0 0 3 6 9]
[1 0 0 0 0 0 0 0 3 9 8 3 9 1 0 0 0 0 0 0 2 8 2 0 0 0 0 0 0 0 0 0]
[1 1 9 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 7 3 0 0 0]
[2 0 0 0 0 7 4 2 0 0 0 0 0 0 0 0 0 1 3 7 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 0 0 0 0 0 0 0 0 0 0 0 9 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
I have a pandas dataframe like the one below, where the first four columns form a multiindex:
import pandas as pd
data = [[1, 'A', 1, 0, 10],
[1, 'A', 0, 1, 10],
[1, 'A', 1, 1, 10],
[1, 'A', 0, 0, 10],
[1, 'B', 1, 0, 10],
[1, 'B', 0, 1, 10],
[1, 'B', 1, 1, 10],
[1, 'B', 0, 0, 10]]
cols = ['user_id','type','flag1','flag2','cnt']
df = pd.DataFrame(data,columns = cols)
df = df.set_index(['user_id','type','flag1','flag2'])
print(df)
user_id type flag1 flag2 cnt
________________________________________
1 A 1 0 10
1 A 0 1 10
1 A 1 1 10
1 A 0 0 10
1 B 1 0 10
1 B 0 1 10
1 B 1 1 10
1 B 0 0 10
I'd like to iterate over the index values to get the grouped total count for each combination of unique index values, like so:
user_id type flag1 flag2 cnt
________________________________________
1 ALL ALL ALL 80
1 ALL ALL 0 40
1 ALL ALL 1 40
1 ALL 1 ALL 40
1 ALL 0 ALL 40
1 A ALL ALL 40
1 B ALL ALL 40
1 A ALL 0 20
1 A ALL 1 20
1 B ALL 0 20
1 B ALL 1 20
1 A 1 ALL 20
1 A 0 ALL 20
1 B 1 ALL 20
1 B 0 ALL 20
1 A 1 0 10
1 A 0 1 10
1 A 1 1 10
1 A 0 0 10
1 B 1 0 10
1 B 0 1 10
1 B 1 1 10
1 B 0 0 10
I'm able to generate each group easily using query and groupby, but ideally I'd like to be able to iterate over any number of index columns to get the sum of the cnt column.
Similar to previous answers, here's a slightly more streamlined approach using itertools and groupby:
from itertools import chain, combinations
indices = ['user_id','type','flag1','flag2']
powerset = list(chain.from_iterable(combinations(indices, r) for r in range(1,len(indices)+1)))
master = (pd.concat([df.reset_index().groupby(p, as_index=False).sum()
                     for p in powerset if p[0] == "user_id"])[cols]
          .replace([None, 4, 2], "ALL")
          .sort_values("cnt", ascending=False))
Output:
user_id type flag1 flag2 cnt
0 1 ALL ALL ALL 80
0 1 A ALL ALL 40
1 1 B ALL ALL 40
0 1 ALL 0 ALL 40
1 1 ALL 1 ALL 40
0 1 ALL ALL 0 40
1 1 ALL ALL 1 40
3 1 ALL 1 1 20
2 1 ALL 1 0 20
1 1 ALL 0 1 20
0 1 ALL 0 0 20
3 1 B 1 1 20
2 1 B 1 0 20
1 1 A 1 1 20
0 1 A 1 0 20
3 1 B 1 1 20
2 1 B 0 1 20
1 1 A 1 1 20
0 1 A 0 1 20
0 1 A 0 0 10
1 1 A 0 1 10
2 1 A 1 0 10
3 1 A 1 1 10
4 1 B 0 0 10
5 1 B 0 1 10
6 1 B 1 0 10
7 1 B 1 1 10
The powerset computation is taken directly from the itertools docs.
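The replace([None, 4, 2], "ALL") step relies on what the non-key columns happen to sum to (4 and 2 here). A hedged variant that avoids that, by reindexing each grouped frame to the full column list from the question (cols) and filling the missing key columns, could look like this:
parts = []
for p in powerset:
    if p[0] != "user_id":
        continue
    g = df.reset_index().groupby(list(p), as_index=False)["cnt"].sum()
    parts.append(g.reindex(columns=cols).fillna("ALL"))
master = pd.concat(parts, ignore_index=True).sort_values("cnt", ascending=False)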
# build all groupby key combinations
import itertools
keys = ['user_id', 'type', 'flag1', 'flag2']
key_combos = [c for i in range(len(keys)) for c in itertools.combinations(keys, i+1)]
# make sure to select only the combos that include 'user_id'
key_combos = [list(e) for e in key_combos if 'user_id' in e]
# groupby using each key combination and concatenate the results into a DataFrame
df2 = pd.concat([df.groupby(by=key).cnt.sum().to_frame().reset_index() for key in sorted(key_combos)])
# fill NaN with ALL and re-order the columns
df2.fillna('ALL')[['user_id','type','flag1','flag2','cnt']]
Out[521]:
user_id type flag1 flag2 cnt
0 1 ALL ALL ALL 80
0 1 ALL 0 ALL 40
1 1 ALL 1 ALL 40
0 1 ALL 0 0 20
1 1 ALL 0 1 20
2 1 ALL 1 0 20
3 1 ALL 1 1 20
0 1 ALL ALL 0 40
1 1 ALL ALL 1 40
0 1 A ALL ALL 40
1 1 B ALL ALL 40
0 1 A 0 ALL 20
1 1 A 1 ALL 20
2 1 B 0 ALL 20
3 1 B 1 ALL 20
0 1 A 0 0 10
1 1 A 0 1 10
2 1 A 1 0 10
3 1 A 1 1 10
4 1 B 0 0 10
5 1 B 0 1 10
6 1 B 1 0 10
7 1 B 1 1 10
0 1 A ALL 0 20
1 1 A ALL 1 20
2 1 B ALL 0 20
3 1 B ALL 1 20
I mainly used combinations and product from itertools.
combinations is for all combinations of values within each column.
product is for all combinations of values across all columns.
import pandas as pd
from itertools import combinations, product
import numpy as np
def iterativeSum(df, cols, target_col):
    # All possible combinations within each column
    comb_each_col = []
    for col in cols:
        # Take 1 to n elements from the unique set of values in each column
        each_col = [list(combinations(set(df[col]), i))
                    for i in range(1, len(set(df[col])) + 1)]
        # Flatten the list
        each_col = [list(x) for sublist in each_col for x in sublist]
        # Record the combinations
        comb_each_col.append(each_col)
    # All possible combinations across all columns
    comb_all_col = list(product(*comb_each_col))
    result = pd.DataFrame()
    # Iterate over all combinations
    for value in comb_all_col:
        # Get a boolean mask of the rows that match the values in each column
        condition = np.array(
            [df[col].isin(v).values for col, v in zip(cols, value)]).all(axis=0)
        # Get the sum of the rows which meet the condition
        condition_sum = df.loc[condition][target_col].sum()
        # Format values for output
        value2 = []
        for x in value:
            try:
                # Strings can be joined together directly
                value2.append(','.join(x))
            except TypeError:
                # Numbers are converted to strings before joining
                x = [str(y) for y in x]
                value2.append(','.join(x))
        # Put the result into the table
        result = pd.concat([result, pd.DataFrame([value2 + [condition_sum]])])
    result.columns = cols + [target_col]
    return result
data = [[1, 'A', 1, 0, 10],
[1, 'A', 0, 1, 10],
[1, 'A', 1, 1, 10],
[1, 'A', 0, 0, 10],
[1, 'B', 1, 0, 10],
[1, 'B', 0, 1, 10],
[1, 'B', 1, 1, 10],
[1, 'B', 0, 0, 10]]
cols = ['user_id', 'type', 'flag1', 'flag2', 'cnt']
df = pd.DataFrame(data, columns=cols)
# Columns for grouping
grouped_cols = ['type', 'flag1', 'flag2']
# Columns for summing
target_col = 'cnt'
print(iterativeSum(df, grouped_cols, target_col))
Result:
type flag1 flag2 cnt
0 A 0 0 10
0 A 0 1 10
0 A 0 0,1 20
0 A 1 0 10
0 A 1 1 10
0 A 1 0,1 20
0 A 0,1 0 20
0 A 0,1 1 20
0 A 0,1 0,1 40
0 B 0 0 10
0 B 0 1 10
0 B 0 0,1 20
0 B 1 0 10
0 B 1 1 10
0 B 1 0,1 20
0 B 0,1 0 20
0 B 0,1 1 20
0 B 0,1 0,1 40
0 A,B 0 0 20
0 A,B 0 1 20
0 A,B 0 0,1 40
0 A,B 1 0 20
0 A,B 1 1 20
0 A,B 1 0,1 40
0 A,B 0,1 0 40
0 A,B 0,1 1 40
0 A,B 0,1 0,1 80
Consider the DataFrame P1 and P2:
P1 =
A B
0 0 0
1 0 1
2 1 0
3 1 1
P2 =
A B C
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I would like to know if there is a concise and efficient way of getting the indices in P1 for the rows (tuples/configurations/assignments) of columns ['A','B'] in P2.
That is, given P2['A','B']:
P2['A','B'] =
A B
0 0 0
1 0 0
2 0 1
3 0 1
4 1 0
5 1 0
6 1 1
7 1 1
I would like to get [0, 0, 1, 1, 2, 2, 3, 3], since the first and second rows in P2['A','B'] correspond to the first row in P1, and so on.
You could use merge and extract the overlapping keys
In [3]: tmp = p2[['A', 'B']].merge(p1.reset_index())
In [4]: tmp
Out[4]:
A B index
0 0 0 0
1 0 0 0
2 0 1 1
3 0 1 1
4 1 0 2
5 1 0 2
6 1 1 3
7 1 1 3
Get the values.
In [5]: tmp['index'].values
Out[5]: array([0, 0, 1, 1, 2, 2, 3, 3], dtype=int64)
However, there could be a native NumPy method to do this as well.
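For completeness, a hedged NumPy sketch (assuming every (A, B) row of P2 also occurs in P1, and that B is 0/1 as in the example):
import numpy as np
a1 = p1[['A', 'B']].to_numpy()
a2 = p2[['A', 'B']].to_numpy()
# encode each (A, B) pair as a single integer, then look the codes up with searchsorted
codes1 = a1[:, 0] * 2 + a1[:, 1]
codes2 = a2[:, 0] * 2 + a2[:, 1]
order = np.argsort(codes1)
idx = order[np.searchsorted(codes1, codes2, sorter=order)]
# idx -> array([0, 0, 1, 1, 2, 2, 3, 3])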
What is the idiomatic way to store this kind of data structure in a pandas DataFrame?
### Option 1
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[
    {'kws': np.array([0, 0, 0]), 'x': i, 'y': i} for i in range(10)
])
# df.x and df.y works as expected
# the list and array casting is required because df.kws is
# an array of arrays
np.array(list(df.kws))
# this causes problems when trying to assign as well though:
# for any other data type, this would set all kws in df to the rhs [1, 2, 3],
# but since the rhs is a list, pandas tries to do an element-wise assignment and
# raises an error saying that the length of df and the length of the rhs do not match
df.kws = [1,2,3]
### Option 2
df = pd.DataFrame(data=[
    {'kw_0': 0, 'kw_1': 0, 'kw_2': 0, 'x': i, 'y': i} for i in range(10)
])
# retrieving 2d array:
df[sorted([c for c in df if c.startswith('kw_')])].values
# batch set :
kws = [1,2,3]
for i, kw in enumerate(kws):
    df['kw_' + str(i)] = kw
Neither of these solutions feels right to me. For one, neither of them allows retrieving a 2d matrix without copying all of the data. Is there a better way to handle this kind of mixed-dimension data, or is this just a task that pandas isn't up to right now?
Just use a column MultiIndex; see the docs.
In [31]: df = pd.DataFrame([ {'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y': i} for i in range(10) ])
In [32]: df
Out[32]:
kw_0 kw_1 kw_2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
In [33]: df.columns = pd.MultiIndex.from_tuples([('kw',0),('kw',1),('kw',2),('value','x'),('value','y')])
In [34]: df
Out[34]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Selection is easy
In [35]: df['kw']
Out[35]:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Setting too
In [36]: df.loc[1,'kw'] = [4,5,6]
In [37]: df
Out[37]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 4 5 6 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Alternatively you can use 2 dataframes, indexed the same, and combine/merge when needed.
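A minimal sketch of that two-DataFrame alternative (hypothetical names):
kw = pd.DataFrame(0, index=range(10), columns=[0, 1, 2])
values = pd.DataFrame({'x': range(10), 'y': range(10)})
# the 2d block can be pulled out directly (for a single-dtype frame this is typically not a copy)
kw.to_numpy()
# and the pieces can be combined into the multi-indexed frame when a flat table is needed
combined = pd.concat({'kw': kw, 'value': values}, axis=1)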