When one stores the groupby object before calling apply, the grouping column is dropped somewhere. How can this happen?
MWE:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 0, 0], 'b': list(range(4))})
df.groupby('a').apply(lambda x: x)
a b
0 1 0
1 1 1
2 0 2
3 0 3
dfg = df.groupby('a')
dfg.apply(lambda x: x)
b
0 0
1 1
2 2
3 3
EDIT:
I was on pandas 0.23.2, but this bug is not reproducible with pandas 0.24.x. So upgrading is a solution.
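If upgrading is not possible, a minimal workaround sketch for 0.23.x, based on the MWE above, is simply not to store the groupby object and to create it right before the apply call:
# Workaround sketch for pandas 0.23.x: calling groupby inline keeps
# column 'a' in the result, as the MWE above shows.
result = df.groupby('a').apply(lambda x: x)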
I have a seemingly simple problem: based on a condition, e.g. that a value in the dataframe is smaller than two, change the value to 1, otherwise to 0. A kind of "if-else".
Toy example, input:
a b
0 1 -5
1 2 0
2 3 10
Output:
a b
0 1 1
1 0 1
2 0 0
Here is my solution:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1,2,3], 'b': [-5, 0, 10]})
arr = np.where(df < 2, 1, 0)
df_fin = pd.DataFrame(data=arr, index=df.index, columns=df.columns)
I don't like the direct dependency on numpy, and it also looks a little verbose to me. Could it be done in a cleaner, more idiomatic way?
General solutions:
Pandas is built on numpy, so in my opinion the extra import is not a problem. It is possible to set the values with df[:]:
import numpy as np
df[:] = np.where(df < 2, 1, 0)
print (df)
a b
0 1 1
1 0 1
2 0 0
A bit overcomplicated if using only pandas functions:
m = df < 2
df = df.mask(m, 1).where(m, 0)
Replace-with-0/1 solution:
Convert the boolean mask so True maps to 1 and False maps to 0, using view('i1') or by casting as in another answer:
df = (df < 2).view('i1')
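If view is not available for your pandas version or object type, a cast with astype is a safe equivalent (a sketch; 'i1' is the 8-bit signed integer dtype):
# Equivalent cast: True becomes 1, False becomes 0.
df = (df < 2).astype('i1')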
Pandas' replace might be handy here:
df.lt(2).replace({False : 0, True: 1})
Out[7]:
a b
0 1 1
1 0 1
2 0 0
or you just convert the booleans to integers:
df.lt(2).astype(int)
Out[9]:
a b
0 1 1
1 0 1
2 0 0
I have a dataframe with columns A and B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
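For reference, a sketch of that apply variant; it produces the same column C, though it is generally slower than the vectorized max:
# Row-wise apply of the built-in max; each row is passed as a Series.
df["C"] = df[["A", "B"]].apply(max, axis=1)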
#DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along the first axis.
# Data borrowed from #DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce method:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot
np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))
perfplot.show(
setup=lambda n: pd.concat([df_] * n, ignore_index=True),
kernels=[
lambda df: df.assign(new=df.max(axis=1)),
lambda df: df.assign(new=df.values.max(1)),
lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
],
labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
n_range=[2**k for k in range(0, 15)],
xlabel='N (* len(df))',
logx=True,
logy=True)
For finding the max among multiple columns:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
I want to merge two datasets by their indexes and columns.
I want to merge the entire datasets:
df1 = pd.DataFrame([[1, 0, 0], [0, 2, 0], [0, 0, 3]],columns=[1, 2, 3])
df1
1 2 3
0 1 0 0
1 0 2 0
2 0 0 3
df2 = pd.DataFrame([[0, 0, 1], [0, 2, 0], [3, 0, 0]],columns=[1, 2, 3])
df2
1 2 3
0 0 0 1
1 0 2 0
2 3 0 0
I have tried this code but got the error below. I can't see why it complains about the size of the axis.
df_sum = pd.concat([df1, df2])\
.groupby(df2.index)[df2.columns]\
.sum().reset_index()
ValueError: Grouper and axis must be same length
This is what I expected the output of df_sum to be:
df_sum
1 2 3
0 1 0 1
1 0 4 0
2 3 0 3
You can use df1.add(df2, fill_value=0). It adds df2 to df1 and treats NaN values as 0.
>>> import numpy as np
>>> import pandas as pd
>>> df2 = pd.DataFrame([(10,9),(8,4),(7,np.nan)], columns=['a','b'])
>>> df1 = pd.DataFrame([(1,2),(3,4),(5,6)], columns=['a','b'])
>>> df1.add(df2, fill_value=0)
a b
0 11 11.0
1 11 8.0
2 12 6.0
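As for the error in the question: after pd.concat([df1, df2]) the combined frame has six rows, while df2.index supplies only three labels, hence "Grouper and axis must be same length". A sketch of the groupby route that does work is to group by the index level instead:
# After concat, index labels 0, 1, 2 each appear twice; grouping by
# the index level and summing pairs them up.
df_sum = pd.concat([df1, df2]).groupby(level=0).sum()
For the frames in the question this gives the same result as df1.add(df2, fill_value=0).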
I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), I want to create a new DataFrame with equally spaced index values and forward fill column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np
df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or, if the index does not necessarily start at 0:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
I have a weird problem: I am trying to add a new row to my table with multi-indices. However, even though I do it exactly as solved here: adding a row to a MultiIndex DataFrame/Series. Their example and solution work, but it does not work with my data. Either I am doing something wrong, or there is a bug.
import pandas as pd
import datetime as dt
Their code (working):
df = pd.DataFrame({'Time': [dt.datetime(2013,2,3,9,0,1), dt.datetime(2013,2,3,9,0,1)],
'hsec': [1,25], 'vals': [45,46]})
df.set_index(['Time','hsec'],inplace=True)
print df
df.ix[(dt.datetime(2013,2,3,9,0,2),0),:] = 5
print df
Output:
vals
Time hsec
2013-02-03 09:00:01 1 45
25 46
and
vals
Time hsec
2013-02-03 09:00:01 1 45
25 46
2013-02-03 09:00:02 0 5
My code (not working):
d = [[0, 0, 2], [0, 2, 2], [1, 0, 2], [1, 2, 2]]
df = pd.DataFrame(d, columns=('frames', 'classID', 'amount'))
df.set_index(['frames', 'classID'], inplace=True)
print df
df.ix[(1,1),:] = 5
print df
Output:
amount
frames classID
0 0 2
2 2
1 0 2
2 2
and
amount
frames classID
0 0 2
2 5
1 0 2
2 2
Note the 5 appeared at df.loc[(0,2)] !
This seems like a bug in pandas to me, but it is apparently fixed in the newly released 0.16:
In [9]: pd.__version__
Out[9]: '0.16.0'
In [10]: df.ix[(1,1),:] = 5
In [11]: df
Out[11]:
amount
frames classID
0 0 2
2 2
1 0 2
2 2
1 5
But I can confirm this was indeed not working in pandas 0.15.2. If you can upgrade to pandas 0.16, I would also advise explicitly using loc in this case (so it certainly does not fall back to positional integer location). But note the bug is also present in loc below pandas 0.16.
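If you cannot upgrade, a minimal workaround sketch is to build the new row as its own frame and concatenate, which avoids the buggy indexing path entirely:
# Workaround sketch for pandas < 0.16: construct the row explicitly
# with the right MultiIndex, then concatenate and re-sort.
new_row = pd.DataFrame({'amount': [5]},
                       index=pd.MultiIndex.from_tuples([(1, 1)],
                                                       names=['frames', 'classID']))
df = pd.concat([df, new_row]).sort_index()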