Keystroke-efficient way to create new columns in pandas - python

Is there a more efficient way to create multiple new columns in a pandas dataframe df initialized to zero than:
for col in add_cols:
df.loc[:, col] = 0

UPDATE: using #jeff's method, but doing it dynamically:
In [208]: add_cols = list('xyz')
In [209]: df.assign(**{i:0 for i in add_cols})
Out[209]:
a b c x y z
0 4 8 6 0 0 0
1 3 7 0 0 0 0
2 4 0 1 0 0 0
3 5 4 5 0 0 0
4 1 3 0 0 0 0
OLD answer:
Another method:
df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
Demo:
In [343]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [344]: add_cols = list('xyz')
In [345]: add_cols
Out[345]: ['x', 'y', 'z']
In [346]: df
Out[346]:
a b c
0 4 9 0
1 1 1 1
2 8 8 1
3 0 1 4
4 8 5 6
In [347]: df[add_cols] = pd.DataFrame(0, index=df.index, columns=add_cols)
In [348]: df
Out[348]:
a b c x y z
0 4 9 0 0 0 0
1 1 1 1 0 0 0
2 8 8 1 0 0 0
3 0 1 4 0 0 0
4 8 5 6 0 0 0

In [13]: df = pd.DataFrame(np.random.randint(0, 10, (5,3)), columns=list('abc'))
In [14]: df
Out[14]:
a b c
0 7 2 3
1 7 0 7
2 5 1 5
3 9 1 4
4 2 1 4
In [15]: df.assign(x=0, y=0, z=0)
Out[15]:
a b c x y z
0 7 2 3 0 0 0
1 7 0 7 0 0 0
2 5 1 5 0 0 0
3 9 1 4 0 0 0
4 2 1 4 0 0 0

Here is a hack:
[df.insert(0, col, 0) for col in add_cols]

You can treat a DataFrame with a dict-like syntax:
for col in add_cols:
df[col] = 0

Related

Count number of occurences of values per column of DataFrame

I have the following dataframe:
df = pd.DataFrame(np.array([[4, 1], [1,1], [5,1], [1,3], [7,8], [np.NaN,8]]), columns=['a', 'b'])
a b
0 4 1
1 1 1
2 5 1
3 1 3
4 7 8
5 Nan 8
Now I would like to do a value_counts() on the columns for values from 1 to 9 which should give me the following:
a b
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0
That means I just count the number of occurences of the values 1 to 9 for each column. How can this be done? I would like to get this format so that I can apply afterwards df.plot(kind='bar', stacked=True) to get e stacked bar plot with the discrete values from 1 to 9 at the x axis and the count for a and b on the y axis.
Use pd.value_counts:
df.apply(pd.value_counts).reindex(range(10)).fillna(0)
Use np.bincount on each column:
df.apply(lambda x: np.bincount(x.dropna(),minlength=10))
a b
0 0 0
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0
Alternatively, using a list comprehension instead of apply.
pd.DataFrame([
np.bincount(df[c].dropna(), minlength=10) for c in df
], index=df.columns).T
a b
0 0 0
1 2 3
2 0 0
3 0 1
4 1 0
5 1 0
6 0 0
7 1 0
8 0 2
9 0 0

Map columns in pandas using a dictionary

I need to create a new column, which will contain letters {a,b,c,d} based on rules:
{'a' if (df['q1']==0 & df['q2']==0),
'b' if (df['q1']==0 & df['q2']==1),
'c' if (df['q1']==1 & df['q2']==0),
'd' if (df['q1']==1 & df['q2']==1)}
so, the new third column should contain a letter which corresponds to a particular combination of {0,1} in two columns.
q1 q2
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 0 1
8 0 1
9 0 1
10 1 1
11 1 1
12 0 1
13 0 1
14 1 0
15 0 0
16 0 0
17 0 0
18 0 0
19 0 0
20 0 0
21 0 0
I thought about converting numbers in each row from binary to decimal format and then apply dictionary rules.
You can use join by Series with MultiIndex:
idx = pd.MultiIndex.from_product([[0,1],[0,1]], names=('q1','q2'))
s = pd.Series(['a','b','c','d'], index=idx, name='val')
print (s)
q1 q2
0 0 a
1 b
1 0 c
1 d
Name: val, dtype: object
df = df.join(s, on=['q1','q2'])
print (df)
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
Another method with df.map and df.transform:
In [90]: mapping = {(0, 0) :'a', (0, 1) : 'b', (1, 0): 'c', (1, 1): 'd'}
In [91]: df['val'] = df.transform(lambda x: (x['q1'], x['q2']), axis=1).map(mapping); df
Out[91]:
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
You can also use zip to generate columns, apply pd.Series and then do the mapping:
In [119]: df['val'] = pd.Series(list(zip(df.q1, df.q2))).map(mapping); df
Out[119]:
q1 q2 val
0 0 1 b
1 0 1 b
2 0 1 b
3 0 1 b
4 0 1 b
5 0 1 b
6 0 1 b
7 0 1 b
8 0 1 b
9 0 1 b
10 1 1 d
11 1 1 d
12 0 1 b
13 0 1 b
14 1 0 c
15 0 0 a
16 0 0 a
17 0 0 a
18 0 0 a
19 0 0 a
20 0 0 a
21 0 0 a
Performance
jezrael's solution:
In [552]: %%timeit
...: idx = pd.MultiIndex.from_product([[0,1],[0,1]], names=('q1','q2'))
...: s = pd.Series(['a','b','c','d'], index=idx, name='val')
...: df.join(s, on=['q1','q2'])
...:
100 loops, best of 3: 2.84 ms per loop
Proposed in this post:
In [553]: %%timeit
...: mapping = {(0, 0) :'a', (0, 1) : 'b', (1, 0): 'c', (1, 1): 'd'}
...: df.transform(lambda x: (x['q1'], x['q2']), axis=1).map(mapping)
...:
1000 loops, best of 3: 1.7 ms per loop

Make a table from 2 columns

I'm fairly new on Python.
I have 2 columns on a dataframe, columns are something like:
db = pd.read_excel(path_to_file/file.xlsx)
db = db.loc[:,['col1','col2']]
col1 col2
C 4
C 5
A 1
B 6
B 1
A 2
C 4
I need them to be like this:
1 2 3 4 5 6
A 1 1 0 0 0 0
B 1 0 0 0 0 1
C 0 0 0 2 1 0
so they act like rows and columns and values refer to the number of coincidences.
Say your columns are called cat and val:
In [26]: df = pd.DataFrame({'cat': ['C', 'C', 'A', 'B', 'B', 'A', 'C'], 'val': [4, 5, 1, 6, 1, 2, 4]})
In [27]: df
Out[27]:
cat val
0 C 4
1 C 5
2 A 1
3 B 6
4 B 1
5 A 2
6 C 4
Then you can groupby the table hierarchicaly, then unstack it:
In [28]: df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
Out[28]:
val 1 2 4 5 6
cat
A 1 2 0 0 0
B 1 0 0 0 6
C 0 0 8 5 0
Edit
As IanS pointed out, 3 is missing here (thanks!). If there's a range of columns you must have, then you can use
r = df.val.groupby([df.cat, df.val]).sum().unstack().fillna(0).astype(int)
for c in set(range(1, 7)) - set(df.val.unique()):
r[c] = 0
I think you need aggreagate by size and add missing values to columns by reindex:
print (df)
a b
0 C 4
1 C 5
2 A 1
3 B 6
4 B 1
5 A 2
6 C 4
df1 = df.b.groupby([df.a, df.b])
.size()
.unstack()
.reindex(columns=(range(1,df.b.max() + 1)))
.fillna(0)
.astype(int)
df1.index.name = None
df1.columns.name = None
print (df1)
1 2 3 4 5 6
A 1 1 0 0 0 0
B 1 0 0 0 0 1
C 0 0 0 2 1 0
Instead size you can use count, size counts NaN values, count does not.

Problems with pandas and numpy where condition/multiple values?

I have the follwoing pandas dataframe:
A B
1 3
0 3
1 2
0 1
0 0
1 4
....
0 0
I would like to add a new column at the right side, following the following condition:
If the value in B has 3 or 2 add 1 in the new_col for instance:
(*)
A B new_col
1 3 1
0 3 1
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
So I tried the following:
df['new_col'] = np.where(df['B'] == 3 & 2,'1','0')
However it did not worked:
A B new_col
1 3 0
0 3 0
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
Any idea of how to do a multiple contidition statement with pandas and numpy like (*)?.
You can use Pandas isin which will return a boolean showing whether the elements you're looking for are contained in column 'B'.
df['new_col'] = df['B'].isin([3, 2])
A B new_col
0 1 3 True
1 0 3 True
2 1 2 True
3 0 1 False
4 0 0 False
5 1 4 False
Then, you can use astype to convert the boolean values to 0 and 1, True being 1 and False being 0
df['new_col'] = df['B'].isin([3, 2]).astype(int)
Output:
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
Using numpy:
>>> df['new_col'] = np.where(np.logical_or(df['B'] == 3, df['B'] == 2), '1','0')
>>> df
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
df['new_col'] = [1 if x in [2, 3] else 0 for x in df.B]
The operators * + ^ work on booleans as expected, and mixing with integers give the expected result. So you can also do:
df['new_col'] = [(x in [2, 3]) * 1 for x in df.B]
using numpy
df['new'] = (df.B.values[:, None] == np.array([2, 3])).any(1) * 1
Timing
over given data set
over 60,000 rows
df=pd.DataFrame({'A':[1,0,1,0,0,1],'B':[3,3,2,1,0,4]})
print df
df['C']=[1 if vals==2 or vals==3 else 0 for vals in df['B'] ]
print df
A B
0 1 3
1 0 3
2 1 2
3 0 1
4 0 0
5 1 4
A B C
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0

mixing 3 dimenional and 2 dimensional data in a pandas dataframe

What is the idiomatic way to store this kind of data structure in a pandas :
### Option 1
df = pd.DataFrame(data = [
{'kws' : np.array([0,0,0]), 'x' : i, 'y', i} for i in range(10)
])
# df.x and df.y works as expected
# the list and array casting is required because df.kws is
# an array of arrays
np.array(list(df.kws))
# this causes problems when trying to assign as well though:
# for any other data type, this would set all kws in df to the rhs [1,2,3]
# but since the rhs is a list, it tried to do an element-wise assignment and
# errors saying that the length of df and the length of the rhs do not match
df.kws = [1,2,3]
### Option 2
df = pd.DataFrame(data = [
{'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y', i} for i in range(10)
])
# retrieving 2d array:
df[sorted([c for c in df if c.startswith('kw_')])].values
# batch set :
kws = [1,2,3]
for i, kw in enumerate(kws) :
df['kw_'+i] = kw
Neither of these solutions feel right to me. For one, neither of them allow retrieving a 2d matrix out without copying all of the data. Is there a better way to handle this kind of mixed dimension data, or is this just a task that pandas isn't up to right now?
Just use a column multi-index, the docs
In [31]: df = pd.DataFrame([ {'kw_0' : 0, 'kw_1' : 0, 'kw_2' : 0, 'x' : i, 'y': i} for i in range(10) ])
In [32]: df
Out[32]:
kw_0 kw_1 kw_2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
In [33]: df.columns = MultiIndex.from_tuples([('kw',0),('kw',1),('kw',2),('value','x'),('value','y')])
In [34]: df
Out[34]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 0 0 0 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Selection is easy
In [35]: df['kw']
Out[35]:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Setting too
In [36]: df.loc[1,'kw'] = [4,5,6]
In [37]: df
Out[37]:
kw value
0 1 2 x y
0 0 0 0 0 0
1 4 5 6 1 1
2 0 0 0 2 2
3 0 0 0 3 3
4 0 0 0 4 4
5 0 0 0 5 5
6 0 0 0 6 6
7 0 0 0 7 7
8 0 0 0 8 8
9 0 0 0 9 9
Alternatively you can use 2 dataframes, indexed the same, and combine/merge when needed.

Categories