I have two index-related questions on Python Pandas dataframes.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id' : range(1,9),
'B' : ['one', 'one', 'two', 'three',
'two', 'three', 'one', 'two'],
'amount' : np.random.randn(8)})
df = df[df.B != 'three'] # remove rows where B == 'three'
df.index
>> Int64Index([0, 1, 2, 4, 6, 7], dtype='int64') # the original index is preserved.
1) I do not understand why the indexing is not automatically updated after I modify the dataframe. Is there a way to automatically update the indexing while modifying a dataframe? If not, what is the most efficient manual way to do this?
2) I want to be able to set the B column of the 5th element of df to 'three'. But df.iloc[5]['B'] = 'three' does not do that. I checked the manual, but it does not cover how to change a specific cell value accessed by location.
If I were accessing by row name, I could do: df.loc[5,'B'] = 'three' but I don't know what the index access equivalent is.
P.S. Link1 and link2 are relevant answers to my second question. However, they do not answer my question.
1) I do not understand why the indexing is not automatically updated after I modify the dataframe.
If you want to reset the index after removing/adding rows, you can do this:
df = df[df.B != 'three'] # remove where B = three
df = df.reset_index(drop=True)  # reset_index returns a new frame; reassign to keep it
df
B amount id
0 one -1.176137 1
1 one 0.434470 2
2 two -0.887526 3
3 two 0.126969 5
4 one 0.090442 7
5 two -1.511353 8
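Putting the filter and the reset together, a minimal runnable sketch (with made-up amount values in place of the random ones, so the result is deterministic):

```python
import pandas as pd

# Hypothetical data mirroring the question's frame
df = pd.DataFrame({'id': range(1, 9),
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'three', 'one', 'two'],
                   'amount': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]})

df = df[df.B != 'three']        # labels 0, 1, 2, 4, 6, 7 survive the filter
df = df.reset_index(drop=True)  # reassign: reset_index returns a new frame
print(list(df.index))           # [0, 1, 2, 3, 4, 5]
```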
Indexes are meant to label/tag/identify a row, so you might consider making your 'id' column the index; then you'll appreciate that pandas doesn't 'automatically update' the index when deleting rows.
df.set_index('id')
B amount
id
1 one -0.410671
2 one 0.092931
3 two -0.100324
4 three 0.322580
5 two -0.546932
6 three -2.018198
7 one -0.459551
8 two 1.254597
2) I want to be able to set the B column of the 5th element of df to 'three'. But df.iloc[5]['B'] = 'three' does not do that. I checked the manual, but it does not cover how to change a specific cell value accessed by location.
Jeff already answered this...
In [5]: df = pd.DataFrame({'id' : range(1,9),
...: 'B' : ['one', 'one', 'two', 'three',
...: 'two', 'three', 'one', 'two'],
...: 'amount' : np.random.randn(8)})
In [6]: df
Out[6]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
3 three -0.654062 4
4 two 0.587660 5
5 three -0.719589 6
6 one 0.860739 7
7 two -2.041390 8
[8 rows x 3 columns]
Your question 1): your code above is correct (see Briford Wylie's answer for resetting the index, which is what I think you want).
In [7]: df[df.B != 'three']
Out[7]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 two -2.041390 8
[6 rows x 3 columns]
In [8]: df = df[df.B != 'three']
In [9]: df.index
Out[9]: Int64Index([0, 1, 2, 4, 6, 7], dtype='int64')
In [10]: df.iloc[5]
Out[10]:
B two
amount -2.04139
id 8
Name: 7, dtype: object
Question 2):
You are trying to set a copy; in 0.13 this will raise/warn. See here:
In [11]: df.iloc[5]['B'] = 5
/usr/local/bin/ipython:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
In [24]: df.iloc[5,df.columns.get_indexer(['B'])] = 'foo'
In [25]: df
Out[25]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 foo -2.041390 8
[6 rows x 3 columns]
You can also do this. This is NOT setting a copy, and since it selects a Series (which is what df['B'] is), it CAN be set directly. (Note, though, that under the copy-on-write behaviour of recent pandas, this chained pattern also stops updating the frame, so the single-call .iloc form above is the safer idiom.)
In [30]: df['B'].iloc[5] = 5
In [31]: df
Out[31]:
B amount id
0 one -1.236735 1
1 one -0.427070 2
2 two -2.330888 3
4 two 0.587660 5
6 one 0.860739 7
7 5 -2.041390 8
[6 rows x 3 columns]
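For completeness, a single indexing call avoids the copy problem entirely; a small sketch using `df.columns.get_loc` (with deterministic amounts substituted for the random ones):

```python
import pandas as pd

df = pd.DataFrame({'id': range(1, 9),
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'three', 'one', 'two'],
                   'amount': range(8)})
df = df[df.B != 'three'].copy()  # labels 0, 1, 2, 4, 6, 7 remain

# Single call -> no chained indexing, no SettingWithCopyWarning:
df.iloc[5, df.columns.get_loc('B')] = 'three'
# df.iat[5, df.columns.get_loc('B')] = 'three'  # scalar fast path, same effect
print(df.iloc[5]['B'])  # 'three'
```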
Related
In my application I am multiplying two Pandas Series which both have multiple index levels. Sometimes, a level contains only a single unique value, in which case I don't get all the index levels from both Series in my result.
To illustrate the problem, let's take two series:
s1 = pd.Series(np.random.randn(4), index=[[1, 1, 1, 1], [1,2,3,4]])
s1.index.names = ['A', 'B']
A B
1 1 -2.155463
2 -0.411068
3 1.041838
4 0.016690
s2 = pd.Series(np.random.randn(4), index=[['a', 'a', 'a', 'a'], [1,2,3,4]])
s2.index.names = ['C', 'B']
C B
a 1 0.043064
2 -1.456251
3 0.024657
4 0.912114
Now, if I multiply them, I get the following:
s1.mul(s2)
A B
1 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
While my desired result would be
A C B
1 a 1 -0.092822
2 0.598618
3 0.025689
4 0.015223
How can I keep index level C in the multiplication?
I have so far been able to get the right result as shown below, but would much prefer a neater solution that keeps my code simpler and more readable.
s3 = s2.mul(s1).to_frame()
s3['C'] = 'a'
s3.set_index('C', append=True, inplace=True)
You can use Series.unstack with DataFrame.stack:
s = s2.unstack(level=0).mul(s1, level=1, axis=0).stack().reorder_levels(['A','C','B'])
print (s)
A C B
1 a 1 0.827482
2 -0.476929
3 -0.473209
4 -0.520207
dtype: float64
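An alternative sketch that drops the single-valued levels before multiplying and re-attaches them afterwards. This only works because A and C each hold exactly one value, as in the question; the tiny input values here are made up:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0],
               index=pd.MultiIndex.from_tuples([(1, 1), (1, 2)], names=['A', 'B']))
s2 = pd.Series([10.0, 20.0],
               index=pd.MultiIndex.from_tuples([('a', 1), ('a', 2)], names=['C', 'B']))

# Multiply on the shared level B, then re-attach the single-valued levels
out = s1.droplevel('A') * s2.droplevel('C')
out.index = pd.MultiIndex.from_arrays(
    [[s1.index[0][0]] * len(out),   # the lone A value
     [s2.index[0][0]] * len(out),   # the lone C value
     out.index],
    names=['A', 'C', 'B'])
print(out)
```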
Here's an excerpt from the pandas pivot docs:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html
>>> df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6]})
>>> df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
>>> df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
When I run the exact code above (pandas 0.19.2), I instead get the following output:
bar A B C
foo
one 1 2 3
two 4 5 6
My questions are:
Do other people get this behaviour?
Why does the behaviour differ from the documentation?
What actually is the nature of this resulting DataFrame? I am quite new to pandas, so this is probably a stupid question, but I don't think I've seen a name (bar) over the index before and can't work out what it is.
Thanks.
I think this is because the docs were generated with an older version of pandas; in the latest versions pandas names the index if one is passed, in this case 'foo'.
In [111]:
pv = df.pivot(index='foo', columns='bar', values='baz')
pv.index
Out[111]:
Index(['one', 'two'], dtype='object', name='foo')
You can see that the index now has its 'name' attribute set.
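If you prefer the docs' older, nameless rendering, the name attributes can simply be cleared afterwards; a small sketch using the docs' example frame:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6]})
pv = df.pivot(index='foo', columns='bar', values='baz')
print(pv.index.name, pv.columns.name)   # foo bar

# Clear both names to reproduce the older docs' look
pv.index.name = None
pv.columns.name = None
print(pv)
```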
I have an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
I want to add a hierarchically-named column. I tried this:
df['foo', 'bar'] = [1,2,3]
But it gives a column whose name is a tuple:
(foo, bar)
0 1
1 2
2 3
I want this:
foo
bar
0 1
1 2
2 3
Which I can get if I construct a brand new DataFrame this way:
pd.DataFrame([1,2,3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
How can I create such a layout when adding new columns to an existing DataFrame? The number of levels is always 2...and I know all the possible values for the first level in advance.
If you are looking to build the multi-index DF one column at a time, you could append the frames and drop the NaNs introduced, leaving you with the desired multi-index DF as shown:
Demo:
df = pd.DataFrame()
df['foo', 'bar'] = [1,2,3]
df['foo', 'baz'] = [3,4,5]
df
Take one column at a time and build the corresponding headers:
pd.concat([df.iloc[:, [0]], df.iloc[:, [1]]]).apply(lambda x: x.dropna())
Due to the NaNs produced, the values are typecast to float dtype, which can be cast back to integers with DataFrame.astype(int).
Note:
This assumes that the number of levels matches during concatenation.
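A sketch of an alternative that skips the NaN round-trip: pd.concat along axis=1 with a dict key builds the column MultiIndex directly (the inner frame here is hypothetical):

```python
import pandas as pd

inner = pd.DataFrame({'bar': [1, 2, 3], 'baz': [3, 4, 5]})

# The dict key becomes the outer column level
df = pd.concat({'foo': inner}, axis=1)
print(df.columns)
# MultiIndex([('foo', 'bar'), ('foo', 'baz')], ...)
print(df[('foo', 'bar')].tolist())  # [1, 2, 3]
```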
I'm not sure there is a way to get away with this without redefining the columns' index as a MultiIndex. If I am not mistaken, the levels of the MultiIndex class are actually made up of Index objects. While you can have DataFrames with hierarchical indices that lack values at one or more levels, the index object itself must still be a MultiIndex. For example:
In [2]: df = pd.DataFrame({'foo': [1,2,3], 'bar': [4,5,6]})
In [3]: df
Out[3]:
bar foo
0 4 1
1 5 2
2 6 3
In [4]: df.columns
Out[4]: Index([u'bar', u'foo'], dtype='object')
In [5]: df.columns = pd.MultiIndex.from_tuples([('', 'foo'), ('foo','bar')])
In [6]: df.columns
Out[6]:
MultiIndex(levels=[[u'', u'foo'], [u'bar', u'foo']],
labels=[[0, 1], [1, 0]])
In [7]: df.columns.get_level_values(0)
Out[7]: Index([u'', u'foo'], dtype='object')
In [8]: df
Out[8]:
foo
foo bar
0 4 1
1 5 2
2 6 3
In [9]: df['bar', 'baz'] = [7,8,9]
In [10]: df
Out[10]:
foo bar
foo bar baz
0 4 1 7
1 5 2 8
2 6 3 9
So as you can see, once the MultiIndex is in place you can add columns as you thought, but unfortunately I am not aware of any way of coercing the DataFrame to adaptively adopt a MultiIndex.
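That said, a one-off conversion after all the tuple-named columns have been added gets you there too; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame()
df['foo', 'bar'] = [1, 2, 3]   # creates a tuple-named column, as in the question
df['foo', 'baz'] = [3, 4, 5]

# Convert the tuple labels into a proper two-level MultiIndex
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df['foo'])  # sub-frame with columns 'bar' and 'baz'
```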
Basically I am trying to do the opposite of How to generate a list from a pandas DataFrame with the column name and column values?
To borrow that example, I want to go from the form:
data = [['Name','Rank','Complete'],
['one', 1, 1],
['two', 2, 1],
['three', 3, 1],
['four', 4, 1],
['five', 5, 1]]
which should output:
Name Rank Complete
One 1 1
Two 2 1
Three 3 1
Four 4 1
Five 5 1
However when I do something like:
pd.DataFrame(data)
I get a dataframe that treats the header row as data, whereas the first list should be my column names and the first element of each remaining list should be the row name.
EDIT:
To clarify, I want the first element of each list to be the row name. I am scraping data, so it is formatted this way...
One way to do this would be to use the first list as the column names and pass only the remaining lists (from index 1 onward) to pd.DataFrame:
In [8]: data = [['Name','Rank','Complete'],
...: ['one', 1, 1],
...: ['two', 2, 1],
...: ['three', 3, 1],
...: ['four', 4, 1],
...: ['five', 5, 1]]
In [10]: df = pd.DataFrame(data[1:],columns=data[0])
In [11]: df
Out[11]:
Name Rank Complete
0 one 1 1
1 two 2 1
2 three 3 1
3 four 4 1
4 five 5 1
If you want to set the Name column as the index, use the .set_index() method and pass in the column to use as the index. Example:
In [16]: df = pd.DataFrame(data[1:],columns=data[0]).set_index('Name')
In [17]: df
Out[17]:
Rank Complete
Name
one 1 1
two 2 1
three 3 1
four 4 1
five 5 1
I have a dataframe that has two columns, user_id and item_bought.
Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item-wise count for each user.
How do I do that?
From version 0.20.1 it is simpler:
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
'B': np.arange(8)}, index=index)
print (df)
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
print (df.groupby(['second', 'A']).sum())
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
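Applied to the original question (user_id as the index, counting items per user), a sketch with hypothetical purchase data:

```python
import pandas as pd

df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))

# Mix the index level name and a column name in one groupby (pandas >= 0.20.1)
counts = df.groupby(['user_id', 'item_bought']).size()
print(counts)
# user_id  item_bought
# b        x              2
#          y              1
# c        y              1
```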
this should work:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df
col1 col2
ind1 ind2
A 0 3 2
1 2 0
2 2 3
B 3 2 4
C 4 3 1
5 0 0
>>> df.groupby([df.index.get_level_values(0),'col1']).count()
col2
ind1 col1
A 2 2
3 1
B 2 1
C 0 1
3 1
I had the same problem using one of the columns from a multiindex. With a multiindex, you cannot use df.index.levels[0], since it contains only the distinct values from that particular index level and will most likely be a different length than the whole dataframe...
Check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html; get_level_values will "Return vector of label values for requested level, equal to the length of the index".
import pandas as pd
import numpy as np
In [11]:
df = pd.DataFrame()
In [12]:
df['user_id'] = ['b','b','b','c']
In [13]:
df['item_bought'] = ['x','x','y','y']
In [14]:
df['ct'] = 1
In [15]:
df
Out[15]:
user_id item_bought ct
0 b x 1
1 b x 1
2 b y 1
3 c y 1
In [16]:
pd.pivot_table(df, values='ct', index=['user_id', 'item_bought'], aggfunc=np.sum)
Out[16]:
user_id item_bought
b x 2
y 1
c y 1
I had the same problem: I imported a bunch of data and I wanted to group by a field that was the index. I didn't have a multi-index or any of that jazz, and neither do you.
I figured the problem is that the field I want is the index, so at first I just reset the index - but this gives me a useless index field that I don't want. So now I do the following (two levels of grouping):
grouped = df.reset_index().groupby(by=['Field1','Field2'])
then I can use 'grouped' in a bunch of ways for different reports
grouped[['Field3','Field4']].agg([np.mean, np.std])
(which was what I wanted, giving me Field3 and Field4 averages, grouped by Field1 (the index) and Field2).
For you, if you just want the count of items per user in one simple line using groupby, the code could be:
df.reset_index().groupby(by=['user_id']).count()
If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.
Please note that reset_index is not in-place, so it will not mess up your original dataframe.