Make a table from 2 columns in Python

I'm fairly new to Python.
I have 2 columns in a dataframe, something like:
db = pd.read_excel('path_to_file/file.xlsx')
db = db.loc[:, ['col1', 'col2']]
col1  col2
C     4
C     5
A     1
B     6
B     1
A     2
C     4
I need them to be like this:
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
so that col1 values act as rows, col2 values act as columns, and each cell holds the number of occurrences.

Say your columns are called cat and val:
In [26]: df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'], 'val': [4, 5, 1, 6, 1, 2, 4]})
In [27]: df
Out[27]:
  cat  val
0   C    4
1   C    5
2   A    1
3   B    6
4   B    1
5   A    2
6   C    4
Then you can group the values hierarchically by both columns, count the occurrences, and unstack:
In [28]: df.val.groupby([df.cat, df.val]).count().unstack().fillna(0).astype(int)
Out[28]:
val  1  2  4  5  6
cat
A    1  1  0  0  0
B    1  0  0  0  1
C    0  0  2  1  0
Edit
As IanS pointed out, column 3 is missing here (thanks!). If there's a range of columns you must have, you can add the missing ones explicitly:
r = df.val.groupby([df.cat, df.val]).count().unstack().fillna(0).astype(int)
for c in set(range(1, 7)) - set(df.val.unique()):
    r[c] = 0
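For completeness, the same counting table can be produced in one step with pd.crosstab, which none of the answers here use; a minimal sketch, where reindex forces the full 1..6 column range so missing values such as 3 appear as zero columns:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'],
                   'val': [4, 5, 1, 6, 1, 2, 4]})

# Count occurrences of each (cat, val) pair, then force the full 1..6 column range
result = (pd.crosstab(df['cat'], df['val'])
            .reindex(columns=range(1, 7), fill_value=0))
```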

I think you need to aggregate by size and add the missing columns with reindex:
print (df)
   a  b
0  C  4
1  C  5
2  A  1
3  B  6
4  B  1
5  A  2
6  C  4
df1 = (df.b.groupby([df.a, df.b])
           .size()
           .unstack()
           .reindex(columns=range(1, df.b.max() + 1))
           .fillna(0)
           .astype(int))
df1.index.name = None
df1.columns.name = None
print (df1)
   1  2  3  4  5  6
A  1  1  0  0  0  0
B  1  0  0  0  0  1
C  0  0  0  2  1  0
Instead of size you can use count; size counts NaN values, while count does not.
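To illustrate that distinction with a small made-up frame containing a NaN:

```python
import numpy as np
import pandas as pd

s = pd.DataFrame({'grp': ['x', 'x', 'y'], 'val': [1.0, np.nan, 2.0]})

sizes = s.groupby('grp').size()           # counts rows per group, NaN included
counts = s.groupby('grp')['val'].count()  # counts only non-NaN values per group
```

Group x has two rows but only one non-NaN value, so size reports 2 while count reports 1.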

Related

Pandas binary table from column based on index

How can I binarize a dataset according to the index? E.g.
        A  B  C
idUser
3       1  1  1
2       0  1  0
4       1  0  0
I have tried using pd.get_dummies but the result is almost what I need.
dictio = {'idUser': [3, 3, 3, 2, 4], 'artist': ['A', 'B', 'C', 'B', 'A']}
df = pd.DataFrame(dictio)
df = df.set_index('idUser')
df_binary = pd.get_dummies(df, columns=['artist'])
print(df_binary)
        artist_A  artist_B  artist_C
idUser
3              1         0         0
3              0         1         0
3              0         0         1
2              0         1         0
4              1         0         0
In [27]: df_binary.groupby(level=0).any().astype(int)
Out[27]:
        artist_A  artist_B  artist_C
idUser
2              0         1         0
3              1         1         1
4              1         0         0
Alternatively, starting from your df before the .set_index():
In [33]: df.pivot_table(index='idUser', columns='artist', aggfunc='size', fill_value=0).rename_axis(columns=None)
Out[33]:
        A  B  C
idUser
2       0  1  0
3       1  1  1
4       1  0  0
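If the artist_ prefix from the first approach is unwanted, one workaround (my variation, not shown in the answers above) is to pass the Series itself to pd.get_dummies, which produces bare column names:

```python
import pandas as pd

df = pd.DataFrame({'idUser': [3, 3, 3, 2, 4],
                   'artist': ['A', 'B', 'C', 'B', 'A']}).set_index('idUser')

# Dummies of the Series carry no 'artist_' prefix;
# collapsing duplicate index entries with any() gives one row per user
binary = pd.get_dummies(df['artist']).groupby(level=0).any().astype(int)
```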

How can I create a new column containing 0 and 1 values via groupby("col1")?

I have a dataframe like this:
df = pd.DataFrame({"col1":["a","a","a","b","b","c","c","c","c","d"]})
How can I create a new column containing 0 and 1 values via groupby("col1")?
  col1  col2
0    a     0
1    a     0
2    a     0
3    b     1
4    b     1
5    c     0
6    c     0
7    c     0
8    c     0
9    d     1
You can groupby col1 and take the remainder of the group number divided by 2:
df['col2'] = df.groupby('col1', sort=False).ngroup()%2
output:
  col1  col2
0    a     0
1    a     0
2    a     0
3    b     1
4    b     1
5    c     0
6    c     0
7    c     0
8    c     0
9    d     1
Alternative form:
df['col2'] = df.groupby('col1', sort=False).ngroup().mod(2)
And in case you want odd groups to be 1 and even groups 0:
df['col2'] = df.groupby('col1', sort=False).ngroup().add(1).mod(2)
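To make the parity trick concrete, here is the intermediate group numbering spelled out (a minimal, self-contained sketch of the approach above):

```python
import pandas as pd

df = pd.DataFrame({'col1': list('aaabbccccd')})

# ngroup labels groups in order of appearance: a -> 0, b -> 1, c -> 2, d -> 3
codes = df.groupby('col1', sort=False).ngroup()
df['col2'] = codes % 2
```

Taking the remainder modulo 2 then flags every other group with 1.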
Without groupby, try factorize:
df['new'] = df.col1.factorize()[0]%2
df
Out[151]:
  col1  new
0    a    0
1    a    0
2    a    0
3    b    1
4    b    1
5    c    0
6    c    0
7    c    0
8    c    0
9    d    1
Or try with itertools.cycle:
from itertools import cycle
df['new'] = df.col1.map(dict(zip(df.col1.unique(), cycle([0,1]))))
df
Out[155]:
  col1  new
0    a    0
1    a    0
2    a    0
3    b    1
4    b    1
5    c    0
6    c    0
7    c    0
8    c    0
9    d    1
[It appears the question was asking about flagging every other group with 0/1; this was not clear from the initial framing of the question, so this answer perhaps appears overly simplistic.]
Check if col1 is either b or d and convert the boolean True/False to an integer:
df = pd.DataFrame({"col1":["a","a","a","b","b","c","c","c","c","d"]})
df['col2'] = df['col1'].isin(['b','d']).astype(int)
  col1  col2
0    a     0
1    a     0
2    a     0
3    b     1
4    b     1
5    c     0
6    c     0
7    c     0
8    c     0
9    d     1

How to categorize two categories in one dataframe in Pandas

I have a DataFrame with two categorical columns holding 150 categories. A value that appears in column A may not appear in column B. For example:
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')
The output is
Out[217]:
   A  B
0  b  c
1  b  c
2  a  c
3  b  a
4  a  a
And also
cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a
The output is
Out[220]:
   A  B
0  1  1
1  1  1
2  0  1
3  1  0
4  0  0
My problem is that in column A, b is coded as 1, but in column B, c is coded as 1. Instead, I want something like this:
Out[220]:
   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
where 2 corresponds to c.
Please note that I have 150 different labels.
Using pd.Categorical() you can specify a list of categories:
In [44]: cats = a[['A','B']].stack().sort_values().unique()
In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)
In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)
In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)
In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
In [49]: a
Out[49]:
   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
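An equivalent route (my sketch, not part of the answer above) is a single CategoricalDtype holding the union of categories, applied to both columns so the codes are guaranteed to match:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# One dtype with the sorted union of categories, shared by both columns
shared = CategoricalDtype(categories=sorted(pd.unique(a[['A', 'B']].values.ravel())))
codes = a.astype(shared).apply(lambda col: col.cat.codes)
```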
We can use pd.factorize all at once.
pd.DataFrame(
    pd.factorize(a.values.ravel())[0].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  0  1
1  0  1
2  2  1
3  0  2
4  2  2
Or if you wanted to factorize by sorted category value, use the sort=True argument
pd.DataFrame(
    pd.factorize(a.values.ravel(), True)[0].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
Or equivalently with np.unique
pd.DataFrame(
    np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize may be more convenient. The algorithm for getting unique values across columns is via @AlexRiley.
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.applymap(fact.get)
Result:
   A  B
0  0  2
1  0  2
2  1  2
3  0  1
4  1  1
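One benefit of keeping the mapping dictionary is that it can be inverted to decode the codes back into the original labels. A sketch of the round trip, using apply with Series.map rather than the deprecated DataFrame.applymap:

```python
import pandas as pd

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# value -> code mapping, in order of first appearance across both columns
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.apply(lambda col: col.map(fact))

# Inverting the dictionary recovers the original labels
inv = {code: val for val, code in fact.items()}
restored = b.apply(lambda col: col.map(inv))
```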

Interweave two dataframes

Suppose I have two dataframes d1 and d2
d1 = pd.DataFrame(np.ones((3, 3), dtype=int), list('abc'), [0, 1, 2])
d2 = pd.DataFrame(np.zeros((3, 2), dtype=int), list('abc'), [3, 4])
d1

   0  1  2
a  1  1  1
b  1  1  1
c  1  1  1

d2

   3  4
a  0  0
b  0  0
c  0  0
What is an easy and generalized way to interweave the columns of two dataframes? We can assume that the number of columns in d2 is always one less than the number of columns in d1, and the indices are the same.
I want this:
pd.concat([d1[0], d2[3], d1[1], d2[4], d1[2]], axis=1)
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
Using pd.concat to combine the DataFrames, and toolz.interleave to reorder the columns:
from toolz import interleave
pd.concat([d1, d2], axis=1)[list(interleave([d1, d2]))]
The resulting output is as expected:
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
Here's one NumPy approach -
def numpy_interweave(d1, d2):
    c1 = list(d1.columns)
    c2 = list(d2.columns)
    N = len(c1) + len(c2)
    cols = [None]*N
    cols[::2] = c1
    cols[1::2] = c2
    out_dtype = np.result_type(d1.values.dtype, d2.values.dtype)
    out = np.empty((d1.shape[0], N), dtype=out_dtype)
    out[:, ::2] = d1.values
    out[:, 1::2] = d2.values
    df_out = pd.DataFrame(out, columns=cols, index=d1.index)
    return df_out
Sample run -
In [346]: d1
Out[346]:
   x  y  z
a  6  7  4
b  3  5  6
c  4  6  2

In [347]: d2
Out[347]:
   p  q
a  4  2
b  7  7
c  7  2

In [348]: numpy_interweave(d1, d2)
Out[348]:
   x  p  y  q  z
a  6  4  7  2  4
b  3  7  5  7  6
c  4  7  6  2  2
Interweave the columns:
c = np.empty((d1.columns.size + d2.columns.size,), dtype=object)
c[0::2], c[1::2] = d1.columns, d2.columns
Now, do a join and re-order the columns with the interleaved list:
d1.join(d2)[c]
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
You may prefer pd.concat when dealing with multiple dataframes.
Write a function to abstract away the generic merge-reorder:
from itertools import zip_longest

def weave(df1, df2):
    col1 = df1.columns
    col2 = df2.columns
    weaved = [col for zipped in zip_longest(col1, col2)
                  for col in zipped
                  if col is not None]
    return pd.concat([df1, df2], axis=1)[weaved]
weave(d1, d2)
# Output:
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
We can use itertools.zip_longest:
In [75]: from itertools import zip_longest
In [76]: cols = pd.Series(np.concatenate(list(zip_longest(d1.columns, d2.columns)))).dropna()
In [77]: cols
Out[77]:
0    0
1    3
2    1
3    4
4    2
dtype: object
In [78]: df = pd.concat([d1, d2], axis=1)[cols]
In [79]: df
Out[79]:
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
My solution was to use pd.DataFrame.insert, making sure to insert from the back first:
df = d1.copy()
for i in range(d2.shape[1], 0, -1):
    df.insert(i, d2.columns[i - 1], d2.iloc[:, i - 1])
df

   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
The roundrobin itertools recipe has an interleaving characteristic. This option offers the choice between directly implementing the recipe from the Python docs, or importing a third-party package such as more_itertools that implements the recipe for you:
from more_itertools import roundrobin
pd.concat([d1, d2], axis=1)[list(roundrobin(d1, d2))]
# Output
   0  3  1  4  2
a  1  0  1  0  1
b  1  0  1  0  1
c  1  0  1  0  1
Inspired by @root's answer, the column indices are interleaved and used to slice a concatenated DataFrame.
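For reference, the recipe itself, as given in the itertools documentation, avoids the third-party dependency:

```python
from itertools import cycle, islice

import numpy as np
import pandas as pd

def roundrobin(*iterables):
    # itertools-docs recipe: roundrobin('ABC', 'D', 'EF') yields A D E B F C
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            # One input ran out: rebuild the cycle without it
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

d1 = pd.DataFrame(np.ones((3, 3), dtype=int), list('abc'), [0, 1, 2])
d2 = pd.DataFrame(np.zeros((3, 2), dtype=int), list('abc'), [3, 4])

# Iterating a DataFrame yields its column labels, so this interleaves them
result = pd.concat([d1, d2], axis=1)[list(roundrobin(d1, d2))]
```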

Get values and column names

I have a pandas data frame that looks something like this:
data = {'1' : [0, 2, 0, 0], '2' : [5, 0, 0, 2], '3' : [2, 0, 0, 0], '4' : [0, 7, 0, 0]}
df = pd.DataFrame(data, index = ['a', 'b', 'c', 'd'])
df

   1  2  3  4
a  0  5  2  0
b  2  0  0  7
c  0  0  0  0
d  0  2  0  0
I know I can get the maximum value and the corresponding column name for each row by doing (respectively):
df.max(1)
df.idxmax(1)
How can I get the values and the column name for every cell that is not zero?
So in this case, I'd want 2 tables, one giving me each value != 0 for each row:
a  5
a  2
b  2
b  7
d  2
And one giving me the column names for those values:
a  2
a  3
b  1
b  4
d  2
Thanks!
You can use stack to get a Series, filter it by boolean indexing, then rename_axis and reset_index; finally, either drop a column or select a subset of columns:
s = df.stack()
df1 = s[s!= 0].rename_axis(['a','b']).reset_index(name='c')
print (df1)
   a  b  c
0  a  2  5
1  a  3  2
2  b  1  2
3  b  4  7
4  d  2  2

df2 = df1.drop('b', axis=1)
print (df2)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2

df3 = df1.drop('c', axis=1)
print (df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2

df3 = df1[['a','c']]
print (df3)
   a  c
0  a  5
1  a  2
2  b  2
3  b  7
4  d  2

df3 = df1[['a','b']]
print (df3)
   a  b
0  a  2
1  a  3
2  b  1
3  b  4
4  d  2
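Another way to get both pieces at once (my sketch, not part of the answer above) is melt, which yields the row label, column name, and value side by side in one long-form frame:

```python
import pandas as pd

data = {'1': [0, 2, 0, 0], '2': [5, 0, 0, 2],
        '3': [2, 0, 0, 0], '4': [0, 7, 0, 0]}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

# Long form: one row per (index, column, value) cell, then drop the zeros
long_form = (df.reset_index()
               .melt(id_vars='index', var_name='col', value_name='val')
               .query('val != 0'))
```

Selecting the ('index', 'val') or ('index', 'col') pairs from long_form reproduces the two requested tables.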
