How to add a hierarchically-named column to a Pandas DataFrame - python

I have an empty DataFrame:
import pandas as pd
df = pd.DataFrame()
I want to add a hierarchically-named column. I tried this:
df['foo', 'bar'] = [1,2,3]
But it gives a column whose name is a tuple:
(foo, bar)
0 1
1 2
2 3
I want this:
foo
bar
0 1
1 2
2 3
Which I can get if I construct a brand new DataFrame this way:
pd.DataFrame([1,2,3], columns=pd.MultiIndex.from_tuples([('foo', 'bar')]))
How can I create such a layout when adding new columns to an existing DataFrame? The number of levels is always 2...and I know all the possible values for the first level in advance.

If you are looking to build the multi-index DF one column at a time, you could concatenate the frames and drop the NaNs introduced, leaving you with the desired multi-index DF as shown:
Demo:
df = pd.DataFrame()
df['foo', 'bar'] = [1,2,3]
df['foo', 'baz'] = [3,4,5]
df
Take one column at a time and build the corresponding headers:
pd.concat([df.iloc[:, [0]], df.iloc[:, [1]]]).apply(lambda x: x.dropna())
Due to the NaNs produced, the values are cast to float dtype; they can be cast back to integers with the help of DF.astype(int).
Note:
This assumes that the number of levels matches during concatenation.
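An alternative sketch (not from the answer above, but using only documented pandas API): assign the tuple-named columns first, then promote the tuple labels to a real MultiIndex in one step with pd.MultiIndex.from_tuples:

```python
import pandas as pd

df = pd.DataFrame()
df['foo', 'bar'] = [1, 2, 3]   # the column label is the tuple ('foo', 'bar')
df['foo', 'baz'] = [3, 4, 5]

# Promote the tuple labels to a proper two-level MultiIndex.
df.columns = pd.MultiIndex.from_tuples(df.columns)
```

After this, df['foo'] returns a sub-frame with columns 'bar' and 'baz', as in the desired layout.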

I'm not sure there is a way to do this without redefining the columns' index to be a MultiIndex. If I am not mistaken, the levels of the MultiIndex class are actually made up of Index objects. While you can have DataFrames with hierarchical indices that lack values for one or more of the levels, the index object itself must still be a MultiIndex. For example:
In [2]: df = pd.DataFrame({'foo': [1,2,3], 'bar': [4,5,6]})
In [3]: df
Out[3]:
bar foo
0 4 1
1 5 2
2 6 3
In [4]: df.columns
Out[4]: Index([u'bar', u'foo'], dtype='object')
In [5]: df.columns = pd.MultiIndex.from_tuples([('', 'foo'), ('foo','bar')])
In [6]: df.columns
Out[6]:
MultiIndex(levels=[[u'', u'foo'], [u'bar', u'foo']],
labels=[[0, 1], [1, 0]])
In [7]: df.columns.get_level_values(0)
Out[7]: Index([u'', u'foo'], dtype='object')
In [8]: df
Out[8]:
foo
foo bar
0 4 1
1 5 2
2 6 3
In [9]: df['bar', 'baz'] = [7,8,9]
In [10]: df
Out[10]:
foo bar
foo bar baz
0 4 1 7
1 5 2 8
2 6 3 9
So as you can see, once the MultiIndex is in place you can add columns as you thought, but unfortunately I am not aware of any way of coercing the DataFrame to adaptively adopt a MultiIndex.
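To retrofit a MultiIndex onto an existing flat-columned frame, one option (a sketch; the blank top level is an arbitrary choice, mirroring the transcript above) is pd.MultiIndex.from_product:

```python
import pandas as pd

df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})

# Lift the flat columns into a two-level MultiIndex with a blank top level.
df.columns = pd.MultiIndex.from_product([[''], df.columns])

# Tuple assignment now lands inside the hierarchy instead of
# creating a tuple-named flat column.
df['extra', 'baz'] = [7, 8, 9]
```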

Related

What is the recommended method for accessing pandas data? [duplicate]

In both of the below cases:
import pandas
d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])
print(df['col2'])
print(df.col2)
Both methods can be used to index on a column and yield the same result, so is there any difference between them?
The "dot notation", i.e. df.col2 is the attribute access that's exposed as a convenience.
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
you cannot add a column (df.new_col = x won't work; worse, it will silently create a new attribute rather than a column - think monkey-patching here)
it won't work if the column name contains spaces or if the column name is an integer.
They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces or other special characters). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), neither of which can be done with dot access.
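For instance (a minimal sketch): brackets can select several columns or create a new one, which dot access cannot:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

sub = df[['col1', 'col2']]            # multi-column selection: only brackets can do this
df['col3'] = df['col1'] + df['col2']  # creating a column: df.col3 = ... would not work
```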
The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
Short answer for differences:
[] indexing (square-bracket access) has the full functionality to operate on DataFrame column data.
Attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, but has limitations (e.g. special column names, creating a new column).
More explanation
Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and a normal Python object. But it's well documented and easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, the index and columns are closely tied to the data structure; you may access an index on a Series or a column on a DataFrame as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers (1, space bar) or conflict with an existing attribute name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined ways to fully access/operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0, this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, to create a new column for DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is an new added column
a b c
0 7 6 13
1 5 8 13
. notation is very useful when working interactively and for exploration. However, for code clarity and to avoid surprises, you should definitely use [] notation. An example of why you should use [] when creating a new column:
df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': [4, 5, 6]})
# this has no effect
df.D = 11
df
A B
0 1 4
1 2 5
2 3 6
# but this works
df['D'] = 11
df
Out[19]:
A B D
0 1 4 11
1 2 5 11
2 3 6 11
If you had a dataframe like this (I am not recommending these column names)...
df = pd.DataFrame({'min':[1,2], 'max': ['a','a'], 'class': [1975, 1981], 'sum': [3,4]})
print(df)
min max class sum
0 1 a 1975 3
1 2 a 1981 4
It all looks OK and there are no errors. You can even access the columns via df['min'] etc...
print(df['min'])
0 1
1 2
Name: min, dtype: int64
However, if you tried with df.<column_name> you would get problems:
print(df.min)
<bound method NDFrame._add_numeric_operations.<locals>.min of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.max)
<bound method NDFrame._add_numeric_operations.<locals>.max of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.class)
File "<ipython-input-31-3472b02a328e>", line 1
print(df.class)
^
SyntaxError: invalid syntax
print(df.sum)
<bound method NDFrame._add_numeric_operations.<locals>.sum of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
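The clash above can be seen directly (a minimal sketch): bracket access always resolves to the column, while dot access resolves method names first.

```python
import pandas as pd

df = pd.DataFrame({'min': [1, 2], 'sum': [3, 4]})

col = df['min']   # always the column
meth = df.min     # the DataFrame.min method shadows the column
```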

How to apply function to index named with at-sign in pandas? [duplicate]


UserWarning: Pandas doesn't allow columns to be created via a new attribute name

I am stuck with my pandas script.
Actually, I am working with two CSV files (one input and one output file).
I want to copy all the rows of two columns, do a calculation, and then copy the result to another dataframe (the output file).
The columns are as follows:
'lat', 'long', 'PHCount', 'latOffset_1', 'longOffset_1', 'PH_Lat_1', 'PH_Long_1', 'latOffset_2', 'longOffset_2', 'PH_Lat_2', 'PH_Long_2', 'latOffset_3', 'longOffset_3', 'PH_Lat_3', 'PH_Long_3', 'latOffset_4', 'longOffset_4', 'PH_Lat_4', 'PH_Long_4'.
I want to take the columns 'lat' and 'latOffset_1', do some calculation, and put the result in a new column ('PH_Lat_1') which I have already created.
My function is :
def calculate_latoffset(latoffset):  # Calculating lat offset.
    a = (df2['lat'] - (2 * latoffset))
    return a
The main code :
for i in range(1, 5):
    print(i)
    a = 'PH_lat_%d' % i
    print(a)
    b = 'latOffset_%d' % i
    print(b)
    df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Since the column names differ only by the suffix (1, 2, 3, 4), I want to call the function calculate_latoffset and fill all the rows of all the columns (PH_Lat_1, PH_Lat_2, PH_Lat_3, PH_Lat_4) in one go.
When using the above code I am getting this warning:
basic_conversion.py:46: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
df2.a = df2.apply(lambda x: calculate_latoffset(x[b]), axis=1)
Is it possible? Please kindly help.
Simply use df2['a'] instead of df2.a
This is a warning, not an error, so your code will still run, but probably not as you intend.
Short answer: to create a new column for a DataFrame, never use attribute access; the correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is an new added column
a b c
0 7 6 13
1 5 8 13
More explanation: Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and a normal Python object. But it's well documented and easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, index and column are closely related to the data structure, you may access an index on a Series, column on a DataFrame as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame object with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers (1, space bar) or conflict with an existing method name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined ways to fully access/operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
As to the topic: to create a new column for a DataFrame, as you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so starting from version 0.21.0, this behavior raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, back to the Short answer.
The solution I can think of is to use .loc to get the column. You can try df2.loc[:, a] instead of df2.a.
Pandas DataFrame columns cannot be created using dot notation, in order to avoid potential conflicts with DataFrame attributes. Hope this helps.
Although the other answers are likely much better solutions, I figured it does no harm to just ignore the warning and move on.
import warnings
warnings.filterwarnings(
    "ignore",
    "Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access",
    UserWarning,
)
using the code above, the script will just disregard the warning and move on.
In df2.apply(lambda x: calculate_latoffset(x[b]), axis=1) you are creating a whole dataframe (calculate_latoffset returns a full Series for every row) and then trying to assign it to a single attribute. Doing df2[a] = calculate_latoffset(df2[b]) instead should deliver the desired output.
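Putting the fixes together (a sketch; the df2 below is a hypothetical stand-in for the question's CSV data, with only two offset columns for brevity): use bracket indexing for the new columns and vectorized arithmetic instead of apply:

```python
import pandas as pd

# Hypothetical stand-in for the question's df2 read from a CSV.
df2 = pd.DataFrame({
    'lat': [10.0, 20.0],
    'latOffset_1': [0.5, 1.0],
    'latOffset_2': [0.25, 0.75],
})

for i in (1, 2):
    # Bracket indexing creates real columns; df2.a would only set an attribute.
    df2['PH_Lat_%d' % i] = df2['lat'] - 2 * df2['latOffset_%d' % i]
```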

Group by index + column in pandas

I have a dataframe that has two columns, user_id and item_bought.
Here user_id is the index of the dataframe. I want to group by both user_id and item_bought and get the item wise count for the user.
How do I do that?
From version 0.20.1 it is simpler:
Strings passed to DataFrame.groupby() as the by parameter may now reference either column names or index level names
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 3, 3],
                   'B': np.arange(8)}, index=index)
print(df)
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
print(df.groupby(['second', 'A']).sum())
B
second A
one 1 2
2 4
3 6
two 1 4
2 5
3 7
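Applied to the question's data (a sketch; the frame below is a stand-in with user_id as the index, as described), the same feature lets you pass the index level name together with a column name:

```python
import pandas as pd

# Stand-in for the question's frame: user_id is the index, item_bought a column.
df = pd.DataFrame({'item_bought': ['x', 'x', 'y', 'y']},
                  index=pd.Index(['b', 'b', 'b', 'c'], name='user_id'))

# 'user_id' resolves to the index level, 'item_bought' to the column.
counts = df.groupby(['user_id', 'item_bought']).size()
```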
this should work:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df['ind1'] = list('AAABCC')
>>> df['ind2'] = range(6)
>>> df.set_index(['ind1','ind2'], inplace=True)
>>> df
col1 col2
ind1 ind2
A 0 3 2
1 2 0
2 2 3
B 3 2 4
C 4 3 1
5 0 0
>>> df.groupby([df.index.get_level_values(0),'col1']).count()
col2
ind1 col1
A 2 2
3 1
B 2 1
C 0 1
3 1
I had the same problem using one of the columns from a multiindex. With a multiindex, you cannot use df.index.levels[0], since it contains only the distinct values from that particular index level and will most likely have a different length than the whole dataframe.
Check http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_level_values.html - get_level_values returns a "vector of label values for requested level, equal to the length of the index".
import pandas as pd
import numpy as np
In [11]:
df = pd.DataFrame()
In [12]:
df['user_id'] = ['b','b','b','c']
In [13]:
df['item_bought'] = ['x','x','y','y']
In [14]:
df['ct'] = 1
In [15]:
df
Out[15]:
user_id item_bought ct
0 b x 1
1 b x 1
2 b y 1
3 c y 1
In [16]:
pd.pivot_table(df,values='ct',index=['user_id','item_bought'],aggfunc=np.sum)
Out[16]:
user_id item_bought
b x 2
y 1
c y 1
I had the same problem - imported a bunch of data and I wanted to groupby a field that was the index. I didn't have a multi-index or any of that jazz and nor do you.
I figured the problem is that the field I want is the index, so at first I just reset the index - but that gives me a useless index field that I don't want. So now I do the following (two levels of grouping):
grouped = df.reset_index().groupby(by=['Field1','Field2'])
then I can use 'grouped' in a bunch of ways for different reports
grouped[['Field3','Field4']].agg([np.mean, np.std])
(which was what I wanted, giving me Field3 and Field4 averages, grouped by Field1 (the index) and Field2)
For you, if you just want to do the count of items per user, in one simple line using groupby, the code could be
df.reset_index().groupby(by=['user_id']).count()
If you want to do more things then you can (like me) create 'grouped' and then use that. As a beginner, I find it easier to follow that way.
Please note that reset_index is not in place, so it will not mess up your original dataframe.

Create a Pandas DataFrame from series without duplicating their names?

Is it possible to create a DataFrame from a list of series without duplicating their names?
Ex, creating the same DataFrame as:
>>> pd.DataFrame({ "foo": data["foo"], "bar": other_data["bar"] })
But without needing to explicitly name the columns?
Try pandas.concat which takes a list of items to combine as its argument:
df1 = pd.DataFrame(np.random.randn(100, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.random.randn(100, 3), columns=list('xyz'))
df3 = pd.concat([df1['a'], df2['y']], axis=1)
Note that you need to use axis=1 to stack things together side by side and axis=0 (the default) to combine them one over the other.
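A minimal sketch of why this avoids repeating names (data and other_data are hypothetical stand-ins for the question's frames): each selected Series carries its .name, and concat reuses it as the column label:

```python
import pandas as pd

data = pd.DataFrame({'foo': [1, 2, 3]})
other_data = pd.DataFrame({'bar': [4, 5, 6]})

# Each Series keeps its .name, which becomes the column label in the result.
df3 = pd.concat([data['foo'], other_data['bar']], axis=1)
```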
Seems like you want to join the dataframes (works similar to SQL):
import numpy as np
import pandas

df1 = pandas.DataFrame(
    # np.random.random_integers is deprecated; randint's high bound is exclusive.
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['foo', 'bar'],
    index=list('ABCDEFHIJK')
)
df2 = pandas.DataFrame(
    np.random.randint(low=0, high=11, size=(10, 2)),
    columns=['bar', 'bax'],
    index=list('DEFHIJKLMN')
)
df1[['foo']].join(df2['bar'], how='outer')
The on kwarg takes a list of columns or None. If None, it'll join on the indices of the two dataframes. You just need to make sure that you're using a dataframe for the left side - hence the double brackets to force df1[['foo']] to be a dataframe (df1['foo'] returns a series).
This gives me:
foo bar
A 4 NaN
B 0 NaN
C 10 NaN
D 8 3
E 2 0
F 3 3
H 9 10
I 0 9
J 5 6
K 2 9
L NaN 3
M NaN 1
N NaN 1
You can also do inner, left, and right joins.
I prefer the explicit way, as presented in your original post, but if you really want to write each name only once, you could try this:
import pandas as pd
import numpy as np

def dictify(*args):
    return dict((i, n[i]) for i, n in args)

data = {'foo': np.random.randn(5)}
other_data = {'bar': np.random.randn(5)}
print(pd.DataFrame(dictify(('foo', data), ('bar', other_data))))
The output is as expected:
bar foo
0 0.533973 -0.477521
1 0.027354 0.974038
2 -0.725991 0.350420
3 1.921215 0.648210
4 0.547640 1.652310
[5 rows x 2 columns]
