Pandas: add new row to Multi index table does not work - python

I have a weird problem: I am trying to add a new row to my table with a MultiIndex. However, even though I do it exactly as solved here: adding a row to a MultiIndex DataFrame/Series. Their example and solution work, but the same approach does not work with my data. Either I am doing something wrong, or there is a bug.
import pandas as pd
import datetime as dt
Their code (working):
df = pd.DataFrame({'Time': [dt.datetime(2013,2,3,9,0,1), dt.datetime(2013,2,3,9,0,1)],
                   'hsec': [1, 25], 'vals': [45, 46]})
df.set_index(['Time','hsec'],inplace=True)
print df
df.ix[(dt.datetime(2013,2,3,9,0,2),0),:] = 5
print df
Output:
vals
Time hsec
2013-02-03 09:00:01 1 45
25 46
and
vals
Time hsec
2013-02-03 09:00:01 1 45
25 46
2013-02-03 09:00:02 0 5
My code (not working):
d = [[0, 0, 2], [0, 2, 2], [1, 0, 2], [1, 2, 2]]
df = pd.DataFrame(d, columns=('frames', 'classID', 'amount'))
df.set_index(['frames', 'classID'], inplace=True)
print df
df.ix[(1,1),:] = 5
print df
Output:
amount
frames classID
0 0 2
2 2
1 0 2
2 2
and
amount
frames classID
0 0 2
2 5
1 0 2
2 2
Note the 5 appeared at df.loc[(0,2)] !

This seems like a bug in pandas to me, but it is apparently fixed in the newly released 0.16:
In [9]: pd.__version__
Out[9]: '0.16.0'
In [10]: df.ix[(1,1),:] = 5
In [11]: df
Out[11]:
amount
frames classID
0 0 2
2 2
1 0 2
2 2
1 5
But I can confirm this was indeed not working in pandas 0.15.2. If you can upgrade to pandas 0.16, I would also advise explicitly using loc in this case (so it certainly does not fall back to positional integer indexing). But note the bug is also present in loc below pandas 0.16.
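For anyone on a current pandas version: .ix has since been removed entirely, and label-based enlargement with .loc is the way to do this. A minimal sketch with the asker's data (my addition, assuming pandas 0.16 or later; not part of the original answer):
import pandas as pd

d = [[0, 0, 2], [0, 2, 2], [1, 0, 2], [1, 2, 2]]
df = pd.DataFrame(d, columns=('frames', 'classID', 'amount'))
df.set_index(['frames', 'classID'], inplace=True)

# Label-based enlargement: adds a new row at (frames=1, classID=1)
# rather than silently overwriting a positional location.
df.loc[(1, 1), :] = 5
print(df)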

Related

Append function in python (pandas)

import pandas as pd
df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')
df_app = df.append(ser)
print('{}\n'.format(df_app)) #has 3 rows
df_app = df.append(ser, ignore_index=True)
print('{}\n'.format(df_app)) #has 3 rows
df2 = pd.DataFrame([[0,0],[9,9]])
df_app = df.append(df2)
print(format(df_app)) # didn't understand this part, where did the series row go?
OUTPUT
0 1
0 5.0 6
1 1.2 3
r3 0.0 0
0 1
0 5.0 6
1 1.2 3
2 0.0 0
0 1
0 5.0 6
1 1.2 3
0 0.0 0
1 9.0 9
I didn't understand where the appended series went in the last append.
df has 2 rows; after the [0, 0] series is appended there are 3 rows.
df2 also has 2 rows, yet after appending it there is a total of only 4 rows. Where did the series row go?
You have been appending your series to the df DataFrame, which stays the same every time: append returns a new DataFrame rather than modifying df in place.
Each df_app you print therefore contains only df plus whatever was appended in that single call. If you want to accumulate all of the rows, append to df_app itself instead of df.
You should change the code to the following:
df_app = df_app.append(ser)
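For completeness, here is a sketch of the accumulating version, plus the pd.concat spelling (my addition: DataFrame.append was deprecated in the pandas 1.x series and removed in 2.0, so on current versions concat is the replacement):
import pandas as pd

df = pd.DataFrame([[5, 6], [1.2, 3]])
ser = pd.Series([0, 0], name='r3')
df2 = pd.DataFrame([[0, 0], [9, 9]])

df_app = df.append(ser)      # rows: 0, 1, r3
df_app = df_app.append(df2)  # rows: 0, 1, r3, 0, 1 -- all five rows kept
print(df_app)

# On pandas >= 2.0, append is gone; an equivalent with concat:
# df_app = pd.concat([df, ser.to_frame().T, df2])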

Combining two integer columns without addition [duplicate]

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
#DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for <0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along the first axis.
# Data borrowed from #DSM's post.
df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
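The sample data above has no NaNs, so both calls give the same answer there. A small sketch with a NaN injected (my own data, not from the original post) shows the difference:
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"A": [1, np.nan, 3], "B": [-2, 8, 1]})

df_nan[["A", "B"]].values.max(1)              # array([ 1., nan,  3.]) -- NaN propagates
np.nanmax(df_nan[["A", "B"]].values, axis=1)  # array([1., 8., 3.])    -- NaN ignored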
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)
To find the overall max across multiple columns (a single scalar), take the row-wise max and then the max of that:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
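The same scalar can also be obtained in one reduction over the underlying array (a minor variation on the df above, assuming pandas >= 0.24 for .to_numpy()):
print(df[['A', 'B']].to_numpy().max())  # 17.221073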

Pandas groupby drops index when groupby object is stored

When one stores the groupby object before calling apply, the index is dropped somewhere. How can this happen?
MWE:
import pandas as pd
df = pd.DataFrame({'a': [1, 1, 0, 0], 'b': list(range(4))})
df.groupby('a').apply(lambda x: x)
a b
0 1 0
1 1 1
2 0 2
3 0 3
dfg = df.groupby('a')
dfg.apply(lambda x: x)
b
0 0
1 1
2 2
3 3
EDIT:
I was on pandas 0.23.2, but this bug is not reproducible with pandas 0.24.x. So upgrading is a solution.

merge/duplicate two data sets by pandas

I am trying to merge two datasets using pandas. One is locations (longitude and latitude) and the other is a time frame (0 to 24 hrs in 15-minute steps = 96 data points).
Here is the sample code:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])
df = pd.DataFrame([list(s1), list(s2)], columns = ["A", "B", "C"])
timeframe_array = []
for i in range(0, 3600, timeframe):
    timeframe_array.append(i)
And I want to get the data like this:
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
...
How can I get the data like this?
While not particularly elegant, this should work:
from __future__ import division # only needed if you're using Python 2
import pandas as pd
from math import ceil
# Constants
timeframe = 15
total_t = 3600
Create df1:
s1 = [1, 2, 3]
s2 = [4, 5, 6]
df1 = pd.DataFrame([s1, s2], columns=['A', 'B', 'C'])
Next, we want to build df2 such that the sequence 0-3600 (step=15) is replicated for each row in df1. We can extract the number of rows with df1.shape[0] (which is 2 in this case).
df2 = pd.DataFrame({'time': range(0, total_t * df1.shape[0], timeframe)})
Next, you need to replicate the rows in df1 to match df2.
factor = ceil(df2.shape[0] / df1.shape[0])
df1_f = pd.concat([df1] * factor).sort_index().reset_index(drop=True)
Lastly, join the two data frames together and trim off any excess rows.
df3 = df1_f.join(df2, how='left')[:df2.shape[0]]
Pandas may have a built-in way to do this, but to my knowledge both join and merge can only make up a difference in rows by filling with a constant (NaN by default).
Result:
>>> print(df3.head(4))
A B C time
0 1 2 3 0
1 1 2 3 15
2 1 2 3 30
3 1 2 3 45
>>> print(df3.tail(4))
A B C time
476 4 5 6 7140
477 4 5 6 7155
478 4 5 6 7170
479 4 5 6 7185
>>> df3.shape # (480, 4)
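As an aside, newer pandas (1.2+) has a built-in cross join that expresses this replication directly. A sketch of that approach (not part of the original answer; note the time column restarts at 0 for each location row instead of running on to 7185):
import pandas as pd

timeframe, total_t = 15, 3600
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
times = pd.DataFrame({'time': range(0, total_t, timeframe)})

# Every row of df1 paired with every time step: 2 * 240 = 480 rows
df3 = df1.merge(times, how='cross')
print(df3.shape)    # (480, 4)
print(df3.head(4))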

Why does a pandas Series of DataFrame mean() fail, but sum() does not, and how to make it work?

There may be a smarter way to do this in pandas, but the following example should work, yet it doesn't:
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0]], columns=['a', 'b'])
df2 = df1.copy()
df3 = df1.copy()
idx = pd.date_range("2010-01-01", freq='H', periods=3)
s = pd.Series([df1, df2, df3], index=idx)
# This causes an error
s.mean()
I won't post the whole traceback, but the main error message is interesting:
TypeError: Could not convert melt T_s
0 6 12
1 0 6
2 6 10 to numeric
It looks like the dataframe was successfully summed, but not divided by the length of the series.
However, we can take the sum of the dataframes in the series:
s.sum()
... returns:
a b
0 6 12
1 0 6
2 6 10
Why wouldn't mean() work when sum() does? Is this a bug or a missing feature? This does work:
(df1 + df2 + df3)/3.0
... and so does this:
s.sum()/3.0
a b
0 2 4.000000
1 0 2.000000
2 2 3.333333
But this of course is not ideal.
You could (as suggested by #unutbu) use a hierarchical index, but when you have a three-dimensional array you should consider using a pandas Panel, especially when one of the dimensions represents time, as in this case.
The Panel is often overlooked, but it is after all where the name pandas comes from (panel data).
Data slightly different from your original so there are not two dimensions with the same length:
df1 = pd.DataFrame([[1, 0], [1, 2], [2, 0], [2, 3]], columns=['a', 'b'])
df2 = df1 + 1
df3 = df1 + 10
Panels can be created a couple of different ways but one is from a dict. You can create the dict from your index and the dataframes with:
s = pd.Panel(dict(zip(idx,[df1,df2,df3])))
The mean you are looking for is simply a matter of operating on the correct axis (axis=0 in this case):
s.mean(axis=0)
Out[80]:
a b
0 4.666667 3.666667
1 4.666667 5.666667
2 5.666667 3.666667
3 5.666667 6.666667
With your data, sum(axis=0) returns the expected result.
EDIT: OK, too late for Panels, as the hierarchical index approach is already "accepted". I will say that that approach is preferable if the data is known to be "ragged", with an unknown and different number of rows in each grouping. For "square" data, the Panel is absolutely the way to go and will be significantly faster, with more built-in operations. Pandas 0.15 has many improvements for multi-level indexing but still has limitations and dark edge cases in real-world apps.
When you define s with
s = pd.Series([df1, df2, df3], index=idx)
you get a Series with DataFrames as items:
In [77]: s
Out[77]:
2010-01-01 00:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 a b
0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 a b
0 1 0
1 1 2
2 2 0
Freq: H, dtype: object
The sum of the items is a DataFrame:
In [78]: s.sum()
Out[78]:
a b
0 3 0
1 3 6
2 6 0
but when you take the mean, nanops.nanmean is called:
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_max))
    ...
Notice that _ensure_numeric (source code) is called on the resultant sum.
An error is raised because a DataFrame is not numeric.
Here is a workaround. Instead of making a Series with DataFrames as items,
you can concatenate the DataFrames into a new DataFrame with a hierarchical index:
In [79]: s = pd.concat([df1, df2, df3], keys=idx)
In [80]: s
Out[80]:
a b
2010-01-01 00:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 01:00:00 0 1 0
1 1 2
2 2 0
2010-01-01 02:00:00 0 1 0
1 1 2
2 2 0
Now you can take the sum and the mean:
In [82]: s.sum(level=1)
Out[82]:
a b
0 3 0
1 3 6
2 6 0
In [84]: s.mean(level=1)
Out[84]:
a b
0 1 0
1 1 2
2 2 0
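A version note (my addition, not part of the original answer): the level= argument to sum/mean was deprecated in the pandas 1.x series and removed in 2.0, so on current versions the equivalent goes through groupby:
# Equivalent of s.sum(level=1) and s.mean(level=1) on recent pandas
s.groupby(level=1).sum()
s.groupby(level=1).mean()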
