I have a df:
       a      b
0  (0,1)      1
1      1  (1,2)
2      2      3
The desired output is:
   w  x  y  z
0  0  1  1  0
1  1  0  1  2
2  2  0  3  0
The problem is that the tuples can have different lengths. The following tolist() approach only works for tuples of length 2, not for columns that mix tuples and scalars:
df[['w', 'x']] = pd.DataFrame(df['a'].tolist(), index=df.index)
Any ideas?
Thanks in advance.
The idea is to wrap scalars into one-element tuples and then create the new columns:
def f(col):
    # wrap scalars so every row is a tuple; shorter tuples are padded with 0
    return pd.DataFrame([x if isinstance(x, tuple) else (x,)
                         for x in col]).fillna(0).astype(int)

df[['w', 'x']] = df.pop('a').pipe(f)
df[['y', 'z']] = df.pop('b').pipe(f)
print(df)
w x y z
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0
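One caveat: f builds its frame with a default RangeIndex, so the assignment back to df only aligns when df has one too. A small variant (an extra safeguard, not needed for this example) passes the column's own index through:
def f(col):
    # index=col.index keeps the result aligned for any index, not just 0..n-1
    return pd.DataFrame([x if isinstance(x, tuple) else (x,)
                         for x in col], index=col.index).fillna(0).astype(int)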
A more general solution with concat:
dfs = [pd.DataFrame([x if isinstance(x, tuple) else (x,) for x in df.pop(c)],
                    index=df.index) for c in df.columns]
df = pd.concat(dfs, axis=1, ignore_index=True).fillna(0).astype(int)
print(df)
0 1 2 3
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0
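If you want the w, x, y, z names from the desired output instead of 0-3, assign them afterwards:
df.columns = ['w', 'x', 'y', 'z']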
You can convert to str, strip the parentheses, and split on the comma:
>>> df[['w', 'x']] = pd.DataFrame(df.pop('a')
                                    .astype(str)
                                    .str.strip('()')
                                    .str.split(',')
                                    .tolist()).fillna(0).astype(int)
>>> df[['y', 'z']] = pd.DataFrame(df.pop('b')
                                    .astype(str)
                                    .str.strip('()')
                                    .str.split(',')
                                    .tolist()).fillna(0).astype(int)
>>> df
w x y z
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0
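If there are many such columns, the same string trick can be wrapped in a loop; here is a sketch (the generated names a0, a1, b0, b1 are just an illustrative choice):
for c in ['a', 'b']:
    parts = (df.pop(c).astype(str)
                      .str.strip('()')
                      .str.split(',', expand=True)
                      .fillna(0).astype(int))
    # name the pieces after the source column, e.g. a0, a1
    parts.columns = [f'{c}{i}' for i in parts.columns]
    df = df.join(parts)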
I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e], and as rows a flag indicating whether each value appears in that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To split column Data, what I did is:
df['Data'].str.split(',', expand=True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep=r"\s+")
We can use pandas.Series.str.split with expand=True, then value_counts each row with axis=1.
Finally, fillna with zero and convert the data to integer with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
Then concatenate it with the original column.
new = df["Data"].str.split(pat=",", expand=True).apply(lambda x: x.value_counts(), axis=1).fillna(0).astype(int)
pd.concat([df, new], axis=1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
If you split the strings into lists, then explode them, it makes pivot possible (with a constant flag column to aggregate).
(df.assign(data_list=df.Data.str.split(','), flag=1)
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                values='flag',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
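The apply is not strictly needed, since Series.str.count is vectorized; an equivalent sketch (str.count treats its argument as a regex, which is harmless for single letters):
for k in ["a", "b", "c", "d", "e"]:
    # count occurrences of the letter k in each row's Data string
    df[k] = df["Data"].str.count(k)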
I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrences of each input, but only while they are consecutive; otherwise reset to zero again (count only within the same ID). Also, the output columns should be named X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences date- and ID-wise?
Use Series.str.get_dummies for indicator columns, then count consecutive 1s per group: take GroupBy.cumsum and subtract its value masked and forward-filled with GroupBy.ffill, change the column names with DataFrame.add_prefix, and finally DataFrame.join to the original:
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
df1 = (b - b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print(df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
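To see why the subtraction resets each streak, you can print the intermediate frames from the code above (a and b as defined there):
print(b)  # running count of each letter within each ID
#    A  B  C  D
# 0  1  1  0  0
# 1  2  1  0  0
# 2  1  1  1  0
# 3  1  2  1  1
print(b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int))
#    A  B  C  D
# 0  0  0  0  0
# 1  0  1  0  0
# 2  0  0  0  0
# 3  1  0  1  0
# masking b where the letter occurs and forward-filling carries the count
# "frozen" at the last break, so subtracting it from b restarts each streak at 1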
First add the new indicator columns, then use groupby to build a cumulative sum:
import numpy as np

# find which columns to add
cols = set([l for sublist in df['Input'].apply(lambda x: x.split(',')).values for l in sublist])
# add the new indicator columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x))
# group by ID and take the cumulative sum, zeroed where the indicator is 0
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].apply(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then:
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2
Note that this zeroes broken streaks rather than restarting them: for an indicator pattern like 1, 0, 1 within one ID it yields 1, 0, 2 instead of 1, 0, 1, so the first answer's mask/ffill approach is safer in general.
I have a dataframe, it's in one hot format:
dummy_data = {'a': [0,0,1,0],'b': [1,1,1,0], 'c': [0,1,0,1],'d': [1,1,1,0]}
data = pd.DataFrame(dummy_data)
Output:
a b c d
0 0 1 0 1
1 0 1 1 1
2 1 1 0 1
3 0 0 1 0
I am trying to get the co-occurrence matrix from the dataframe. If I instead have the column names in lists, like this:
raw = [['b','d'],['b','c','d'],['a','b','d'],['c']]
unique_categories = ['a','b','c','d']
Then I am able to find the occurrence matrix like this:
df = pd.DataFrame(raw).stack().rename('val').reset_index().drop(columns='level_1')
df = df.loc[df.val.isin(unique_categories)]
df = df.merge(df, on='level_0').query('val_x != val_y')
final = pd.crosstab(df.val_x, df.val_y)
adj_matrix = (final.reindex(unique_categories, axis=0)
                   .reindex(unique_categories, axis=1).fillna(0))
Output:
val_y a b c d
val_x
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
How can I get the co-occurrence matrix directly from the one-hot dataframe?
You can have some fun with matrix math!
import numpy as np

# `data` is the one-hot frame from the question
u = np.diag(np.ones(data.shape[1], dtype=bool))  # boolean identity matrix
# data.T.dot(data) counts, for each pair of columns, the rows where both
# are 1; multiplying by ~u zeroes the diagonal (self co-occurrence)
data.T.dot(data) * (~u)
a b c d
a 0 1 0 1
b 1 0 1 3
c 0 1 0 1
d 1 3 1 0
I have a numeric DataFrame, for example:
x = np.array([[1,2,3],[-1,-1,1],[0,0,0]])
df = pd.DataFrame(x, columns=['A','B','C'])
df
A B C
0 1 2 3
1 -1 -1 1
2 0 0 0
And I want to count, for each row, the number of positive values, negative values and values equal to 0. I've been trying the following:
df['positive_count'] = df.apply(lambda row: (row > 0).sum(), axis=1)
df['negative_count'] = df.apply(lambda row: (row < 0).sum(), axis=1)
df['zero_count'] = df.apply(lambda row: (row == 0).sum(), axis=1)
But I'm getting the following result, which is obviously incorrect:
A B C positive_count negative_count zero_count
0 1 2 3 3 0 1
1 -1 -1 1 1 2 0
2 0 0 0 0 0 5
Does anyone know what might be going wrong, or could you help me find the best way to do what I'm looking for?
Thank you.
The counts drift because each apply runs after the previous count column has already been added, so the new columns get included in the row sums. There are several ways around this; one option is using np.sign and get_dummies:
u = (pd.get_dummies(np.sign(df.stack()))
       .groupby(level=0).sum()  # .sum(level=0) is deprecated in newer pandas
       .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
u
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
df = pd.concat([df, u], axis=1)
df
A B C negative_count zero_count positive_count
0 1 2 3 0 0 3
1 -1 -1 1 2 0 1
2 0 0 0 0 3 0
np.sign treats zero differently from positive and negative values, so it is ideal to use here.
Another option is groupby and value_counts:
(np.sign(df)
.stack()
.groupby(level=0)
.value_counts()
.unstack(1, fill_value=0)
.rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
Slightly more verbose but still worth knowing about.
I have a pandas dataframe that contains a column with possible duplicates. I would like to create a column that will produce a 1 if the row is a duplicate and 0 if it is not.
So if I have:
   A  B
1  1  x
2  2  y
3  1  x
4  3  z
I would get:
   A  B  C
1  1  x  1
2  2  y  0
3  1  x  1
4  3  z  0
I tried df['C'] = np.where(df['A']==df['A'], '1', '0'), but this just created a column of all 1s in C, since every value equals itself and the condition is always True.
You need Series.duplicated with keep=False to mark all duplicates first, then cast the boolean mask (Trues and Falses) to 1s and 0s with astype(int), and if necessary cast to str:
df['C'] = df['A'].duplicated(keep=False).astype(int).astype(str)
print(df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
If you need to check duplicates in columns A and B together, use DataFrame.duplicated:
df['C'] = df.duplicated(subset=['A','B'], keep=False).astype(int).astype(str)
print(df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0
And a numpy.where solution:
df['C'] = np.where(df['A'].duplicated(keep=False), '1', '0')
print(df)
A B C
1 1 x 1
2 2 y 0
3 1 x 1
4 3 z 0