pandas groupby first column shifted down

So I have read in a csv file as a pandas dataframe:
But when I group it, the year column is shifted down by one:
So when I try to pull the years out into a numpy array, I get the error "KeyError: 'Year'".
Is there a way to make the lookup find the years, or a way to shift that first column up by one?
I have found a way to shift a dataframe column up by one, but I need to shift the grouping, not the dataframe.
I also tried turning the new grouping into a new dataframe so that I can shift the year column up, but haven't been successful.

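Year is the name of the index, not a column. That is why it is printed one line below the column headers, and why looking it up as a column raises a KeyError.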
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df.index.name = "foo"
In [14]: df
Out[14]:
A B
foo
0 1 2
1 3 4
Pull out the index with .index:
In [15]: df.index
Out[15]: Int64Index([0, 1], dtype='int64', name='foo')
In [16]: df.index.values
Out[16]: array([0, 1])
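Applied to the grouping case from the question, this could look like the sketch below. The data is hypothetical (the original CSV is not shown), and the column name "Value" is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical data standing in for the CSV from the question.
df = pd.DataFrame({"Year": [2000, 2000, 2001], "Value": [1.0, 2.0, 3.0]})

grouped = df.groupby("Year").sum()
# "Year" is now the index name, so grouped["Year"] would raise a KeyError.
years_from_index = grouped.index.to_numpy()

# Option 1: move the index back into an ordinary column.
flat = grouped.reset_index()
years_from_column = flat["Year"].to_numpy()

# Option 2: keep the group key as a column from the start.
flat2 = df.groupby("Year", as_index=False).sum()
```

Reading the index directly avoids reshaping anything; reset_index or as_index=False is only needed if Year really has to be a regular column afterwards.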

Related

Pandas explode to create new columns

The pandas explode method creates a new row for each value found in the inner list of a given column, so it is a row-wise explode.
Is there an easy column-wise explode already implemented in pandas, i.e. something to transform df into the second dataframe?
MWE:
>>> s = pd.DataFrame([[1, 2], [3, 4]]).agg(list, axis=1)
>>> df = pd.DataFrame({"a": ["a", "b"], "s": s})
>>> df
Out:
a s
0 a [1, 2]
1 b [3, 4]
>>> pd.DataFrame(s.tolist()).assign(a=["a", "b"]).reindex(["a", 0, 1], axis=1)
Out[121]:
a 0 1
0 a 1 2
1 b 3 4
You can use apply to convert those values to Pandas Series, which will ultimately transform the dataframe in the required format:
>>> df.apply(pd.Series)
Out[28]:
0 1
0 1 2
1 3 4
As a side note, your df becomes a pandas Series after using agg.
For the updated data, you can concat the above result to the existing data frame:
>>> pd.concat([df, df['s'].apply(pd.Series)], axis=1)
Out[48]:
a s 0 1
0 a [1, 2] 1 2
1 b [3, 4] 3 4
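A variant that avoids apply(pd.Series) is sketched below. It assumes every list in s has the same length; the names expanded and result are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": ["a", "b"], "s": [[1, 2], [3, 4]]})

# Build the new columns straight from the lists; reusing df.index
# keeps the new columns aligned with the original rows.
expanded = pd.DataFrame(df["s"].tolist(), index=df.index)

# Drop the list column and attach the expanded columns next to "a".
result = df.drop(columns="s").join(expanded)
```

Constructing the frame from s.tolist() tends to be faster than apply(pd.Series) on larger inputs, since it does not create one Series per row.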

Apply an operation on pandas DataFrame rows based on a Series that matches the number of columns of the DataFrame

Let's say that I have a DataFrame df and a Series s like this:
>>> df = pd.DataFrame(np.random.randn(2,3), columns=["A", "B", "C"])
>>> df
A B C
0 -0.625816 0.793552 -1.519706
1 -0.955960 0.142163 0.847624
>>> s = pd.Series([1, 2, 3])
>>> s
0 1
1 2
2 3
dtype: int64
I'd like to add the values of s to each row in df. I guess I should use some apply with axis=1 or applymap, but I can't figure out how (do I have to transpose at some point?).
Actually my problem is more complex than that: the final DataFrame will be composed of the elements of the initial DataFrame after they have been processed according to the values of two Series.
One possible solution is to add a 1-D numpy array created from the Series, which prevents pandas from aligning the columns of the DataFrame with the index of the Series:
df = df + s.values
print (df)
A B C
0 0.207070 1.995021 4.829518
1 0.819741 2.802982 2.801355
If the index of the Series is the same as the column names of the DataFrame, plain addition works:
# the index is the same as the column names
s = pd.Series([1, 2, 3], index=df.columns)
print (s)
A 1
B 2
C 3
dtype: int64
df = df + s
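To make the alignment behaviour visible, here is a small sketch using zeros instead of random values so the results are easy to check. The variable names are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((2, 3)), columns=["A", "B", "C"])
s = pd.Series([1, 2, 3])  # default index 0, 1, 2

# Labels A, B, C never match 0, 1, 2, so every cell ends up NaN.
misaligned = df + s

# A plain array has no labels, so it is broadcast across rows by position.
by_position = df + s.to_numpy()

# Relabelling the Series with the column names restores label alignment.
by_label = df.add(s.set_axis(df.columns), axis="columns")
```

The positional form relies on the Series having exactly as many values as df has columns, and on them being in the same order.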

Creating Pivot DataFrame using Multiple Columns in Pandas

I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
a b
-1 0 1 -1 0 1
id
1 1 1 2 1 2 1
2 1 2 1 1 1 2
3 1 2 0 1 1 1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to replace the apply/value_counts/fillna with something cleaner and more efficient, but at the moment it eludes me...
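One candidate for a cleaner route, not necessarily the trick the answer has in mind, is to melt the frame to long form and count with pd.crosstab. The sketch below uses the data from the question; the labels mapping and the names long, counts and flat are assumptions made for this example.

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
        'a': [-1, 1, 1, 0, 0, 0, -1, 1, -1, 0, 0],
        'b': [1, 0, 0, -1, 0, 1, 1, -1, -1, 1, 0]}
df = pd.DataFrame(data)

# Long form: one row per (id, original column name, value).
long = df.melt(id_vars='id')

# Count occurrences; the columns become a (variable, value) MultiIndex.
counts = pd.crosstab(long['id'], [long['variable'], long['value']])

# Optional: flatten to the a_neg / a_zero / a_pos style from the question.
labels = {-1: 'neg', 0: 'zero', 1: 'pos'}
flat = counts.copy()
flat.columns = [f"{col}_{labels[val]}" for col, val in counts.columns]
flat = flat.reset_index()
```

This scales to any number of value columns without extra code, since melt picks up every column other than id.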

How to add a hierarchically-named column to a Pandas DataFrame

I have an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
I want to add a hierarchically-named column. I tried this:
df['foo', 'bar'] = [1,2,3]
But it gives a column whose name is a tuple:
(foo, bar)
0 1
1 2
2 3
I want this:
foo
bar
0 1
1 2
2 3
Which I can get if I construct a brand new DataFrame this way:
pd.DataFrame([1,2,3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
How can I create such a layout when adding new columns to an existing DataFrame? The number of levels is always 2...and I know all the possible values for the first level in advance.
If you are looking to build the multi-index DF one column at a time, you could append the frames and drop the NaNs this introduces, leaving you with the desired multi-index DF, as shown:
Demo:
df = pd.DataFrame()
df['foo', 'bar'] = [1,2,3]
df['foo', 'baz'] = [3,4,5]
df
Take one column at a time and build the corresponding headers:
pd.concat([df[[0]], df[[1]]]).apply(lambda x: x.dropna())
Because of the NaNs produced, the values are cast to float dtype; they can be cast back to integers with DF.astype(int).
Note:
This assumes that the number of levels matches during concatenation.
I'm not sure there is a way to do this without redefining the index of the columns to be a MultiIndex. If I am not mistaken, the levels of the MultiIndex class are actually made up of Index objects. While you can have DataFrames with hierarchical indices that do not have values for one or more of the levels, the index object itself must still be a MultiIndex. For example:
In [2]: df = pd.DataFrame({'foo': [1,2,3], 'bar': [4,5,6]})
In [3]: df
Out[3]:
bar foo
0 4 1
1 5 2
2 6 3
In [4]: df.columns
Out[4]: Index([u'bar', u'foo'], dtype='object')
In [5]: df.columns = pd.MultiIndex.from_tuples([('', 'foo'), ('foo','bar')])
In [6]: df.columns
Out[6]:
MultiIndex(levels=[[u'', u'foo'], [u'bar', u'foo']],
labels=[[0, 1], [1, 0]])
In [7]: df.columns.get_level_values(0)
Out[7]: Index([u'', u'foo'], dtype='object')
In [8]: df
Out[8]:
foo
foo bar
0 4 1
1 5 2
2 6 3
In [9]: df['bar', 'baz'] = [7,8,9]
In [10]: df
Out[10]:
foo bar
foo bar baz
0 4 1 7
1 5 2 8
2 6 3 9
So as you can see, once the MultiIndex is in place you can add columns as you thought, but unfortunately I am not aware of any way of coercing the DataFrame to adaptively adopt a MultiIndex.
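A compact version of that idea is sketched below: promote the existing flat columns to two levels first, then add further columns with tuple keys. The first-level labels "foo" and "qux" are placeholders, on the assumption that the first-level values are known in advance as the question states.

```python
import pandas as pd

df = pd.DataFrame({"bar": [1, 2, 3], "baz": [3, 4, 5]})

# Promote the flat columns to two levels under a common first-level label.
df.columns = pd.MultiIndex.from_product([["foo"], df.columns])

# With a MultiIndex in place, a tuple key adds a hierarchical column
# rather than a column whose name is a tuple.
df["qux", "bar"] = [7, 8, 9]
```

After this, df["foo"] selects the sub-frame with columns bar and baz, and df["qux"] the one with bar.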

Should multi-index levels be updated after dropna() called on pandas DataFrame?

After calling dropna on a multi index dataframe, the levels metadata in the index does not appear to be updated. Is this a bug?
In [1]: import pandas
In [2]: print pandas.__version__
0.10.1
In [3]: df_multi = pandas.DataFrame(index=[[1, 2],['a', 'b',]],
data=[[float('nan'), 5], [6, 7]])
In [4]: print df_multi
0 1
1 a NaN 5
2 b 6 7
In [5]: df_multi = df_multi.dropna(axis=0, how='any')
In [6]: print df_multi
0 1
2 b 6 7
In [7]: print df_multi.index
MultiIndex
[(2, b)]
In [8]: print df_multi.index.levels
[Int64Index([1, 2], dtype=int64), Index([a, b], dtype=object)]
Note above that the MultiIndex only has (2, b), but it reports 1 and 'a' are in the index.levels.
The workaround I have is to reindex with a "clean" Multi-Index as follows:
In [10]: c_clean = pandas.MultiIndex.from_tuples(df_multi.index)
In [11]: df_multi = df_multi.reindex(c_clean)
In [12]: print df_multi
0 1
2 b 6 7
In [13]: print df_multi.index.levels
[Int64Index([2], dtype=int64), Index([b], dtype=object)]
Edit:
This problem also occurs during a slicing with .ix, and probably with other indexing operations as well.
This is a known issue, tracked here:
https://github.com/pydata/pandas/issues/2655
People are currently contemplating how to deal with it.
My workaround is to use index.get_level_values(level): a dropna(how='all') may remove only some of the entries along an axis, and I may still need all remaining values of one level of the multi-index.
For some reason the result of index.get_level_values(level) is correct, while index.levels has not been updated (perhaps updating it would be too costly for speed reasons?).
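In pandas versions considerably newer than the 0.10.1 shown in the question, MultiIndex also has a remove_unused_levels() method that rebuilds the levels from the values actually present. The sketch below assumes such a version and uses Python 3 syntax; it compares the stale levels, get_level_values and the pruned index.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")])
df_multi = pd.DataFrame([[float("nan"), 5], [6, 7]], index=index)

dropped = df_multi.dropna(axis=0, how="any")

# The levels still list every label the index was built with.
stale = [list(level) for level in dropped.index.levels]

# get_level_values reflects only the rows that remain.
present = list(dropped.index.get_level_values(0))

# remove_unused_levels returns a new MultiIndex with pruned levels.
pruned = dropped.index.remove_unused_levels()
clean = [list(level) for level in pruned.levels]
```

Assigning the result back with dropped.index = pruned gives the same effect as the reindex workaround from the question.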
