Weird behaviour with groupby on ordered categorical columns - python

MCVE
import pandas as pd

df = pd.DataFrame({
    'Cat': ['SF', 'W', 'F', 'R64', 'SF', 'F'],
    'ID': [1, 1, 1, 2, 2, 2]
})
df.Cat = pd.Categorical(
    df.Cat, categories=['R64', 'SF', 'F', 'W'], ordered=True)
As you can see, I've defined an ordered categorical column on Cat. To verify, check df.Cat:
0 SF
1 W
2 F
3 R64
4 SF
5 F
Name: Cat, dtype: category
Categories (4, object): [R64 < SF < F < W]
I want to find the largest category PER ID. Doing groupby + max works.
df.groupby('ID').Cat.max()
ID
1 W
2 F
Name: Cat, dtype: object
But I don't want ID to be the index, so I specify as_index=False.
df.groupby('ID', as_index=False).Cat.max()
ID Cat
0 1 W
1 2 SF
Oops! Now the max is taken lexicographically ('SF' > 'F' as plain strings), and the categorical order is ignored. Can anyone explain whether this is intended behaviour, or is this a bug?
Note: for this problem, the workaround is df.groupby('ID').Cat.max().reset_index().
For reference:
>>> pd.__version__
'0.22.0'

This is not intended behaviour; it's a bug.
Source diving shows that the as_index flag takes two completely different code paths. One simply ignores the grouper levels and names and just takes the values with a new range index; the other clearly keeps them.
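Until that's fixed, the workaround above is the safe path, since the ordered dtype survives the as_index=True code path. A minimal sketch of the workaround mentioned in the question:

# aggregate with ID as the index, which respects the categorical order,
# then move ID back into a regular column
out = df.groupby('ID').Cat.max().reset_index()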

Related

What is the recommended method for accessing pandas data? [duplicate]

In both of the below cases:
import pandas
d = {'col1': 2, 'col2': 2.5}
df = pandas.DataFrame(data=d, index=[0])
print(df['col2'])
print(df.col2)
Both methods can be used to index on a column and yield the same result, so is there any difference between them?
The "dot notation", i.e. df.col2 is the attribute access that's exposed as a convenience.
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
you cannot add a column (df.new_col = x won't work; worse, it will silently create a new attribute rather than a column - think monkey-patching here)
it won't work if you have spaces in the column name or if the column name is an integer.
They are the same as long as you're accessing a single column with a simple name, but you can do more with the bracket notation. You can only use df.col if the column name is a valid Python identifier (e.g., it does not contain spaces and the like). Also, you may encounter surprises if your column name clashes with a pandas method name (like sum). With brackets you can select multiple columns (e.g., df[['col1', 'col2']]) or add a new column (df['newcol'] = ...), which can't be done with dot access.
The other question you linked to applies, but that is a much more general question. Python objects get to define how the . and [] operators apply to them. Pandas DataFrames have chosen to make them the same for this limited case of accessing single columns, with the caveats described above.
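To make the bracket-only operations concrete, here is a minimal sketch (the frame and column names are illustrative, not from the question):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [2.5, 3.5]})

sub = df[['col1', 'col2']]      # multi-column selection: brackets only
df['newcol'] = df['col1'] * 2   # creating a column: brackets only
# df.newcol = ... would silently bind an attribute instead of adding a column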
Short answer for differences:
[] indexing (square bracket access) has the full functionality to operate on DataFrame column data.
Attribute access (dot access) is mainly a convenience for accessing existing DataFrame column data, but it occasionally has limitations (e.g. special column names, creating a new column).
More explanation
Series and DataFrame are core classes and data structures in pandas, and of course they are Python classes too, so there are some minor distinctions in attribute access between a pandas DataFrame and a normal Python object. But it's well documented and easily understood. Just a few points to note:
In Python, users may dynamically add data attributes of their own to an instance object using attribute access.
>>> class Dog(object):
... pass
>>> dog = Dog()
>>> vars(dog)
{}
>>> superdog = Dog()
>>> vars(superdog)
{}
>>> dog.legs = 'I can run.'
>>> superdog.wings = 'I can fly.'
>>> vars(dog)
{'legs': 'I can run.'}
>>> vars(superdog)
{'wings': 'I can fly.'}
In pandas, the index and columns are closely related to the data structure; you may access an index on a Series or a column on a DataFrame as an attribute.
>>> import pandas as pd
>>> import numpy as np
>>> data = np.random.randint(low=0, high=10, size=(2,2))
>>> df = pd.DataFrame(data, columns=['a', 'b'])
>>> df
a b
0 7 6
1 5 8
>>> vars(df)
{'_is_copy': None,
'_data': BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64,
'_item_cache': {}}
But pandas attribute access is mainly a convenience for reading from and modifying an existing element of a Series or column of a DataFrame.
>>> df.a
0 7
1 5
Name: a, dtype: int64
>>> df.b = [1, 1]
>>> df
a b
0 7 1
1 5 1
And the convenience is a tradeoff for full functionality. E.g. you can create a DataFrame with column names ['space bar', '1', 'loc', 'min', 'index'], but you can't access them as attributes, because they are either not valid Python identifiers ('1', 'space bar') or conflict with an existing attribute name.
>>> data = np.random.randint(0, 10, size=(2, 5))
>>> df_special_col_names = pd.DataFrame(data, columns=['space bar', '1', 'loc', 'min', 'index'])
>>> df_special_col_names
space bar 1 loc min index
0 4 4 4 8 9
1 3 0 1 2 3
In these cases, .loc, .iloc and [] indexing are the defined ways to fully access and operate on the index and columns of Series and DataFrame objects.
>>> df_special_col_names['space bar']
0 4
1 3
Name: space bar, dtype: int64
>>> df_special_col_names.loc[:, 'min']
0 8
1 2
Name: min, dtype: int64
>>> df_special_col_names.iloc[:, 1]
0 4
1 0
Name: 1, dtype: int64
Another important difference arises when trying to create a new column for a DataFrame. As you can see, df.c = df.a + df.b just creates a new attribute alongside the core data structure, so from version 0.21.0 onwards this behaviour raises a UserWarning (silent no more).
>>> df
a b
0 7 1
1 5 1
>>> df.c = df.a + df.b
__main__:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
>>> df['d'] = df.a + df.b
>>> df
a b d
0 7 1 8
1 5 1 6
>>> df.c
0 8
1 6
dtype: int64
>>> vars(df)
{'_is_copy': None,
'_data':
BlockManager
Items: Index(['a', 'b', 'd'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
IntBlock: slice(0, 2, 1), 2 x 2, dtype: int64
IntBlock: slice(2, 3, 1), 1 x 2, dtype: int64,
'_item_cache': {},
'c': 0 8
1 6
dtype: int64}
Finally, to create a new column on a DataFrame, never use attribute access. The correct way is to use either [] or .loc indexing:
>>> df
a b
0 7 6
1 5 8
>>> df['c'] = df.a + df.b
>>> # OR
>>> df.loc[:, 'c'] = df.a + df.b
>>> df # c is a newly added column
a b c
0 7 6 13
1 5 8 13
Dot notation is very useful when working interactively and for exploration. However, for code clarity and to avoid unpleasant surprises, you definitely should use [] notation. An example of why you should use [] when creating a new column:
df = pd.DataFrame(data={'A': [1, 2, 3],
                        'B': [4, 5, 6]})
# this has no effect
df.D = 11
df
A B
0 1 4
1 2 5
2 3 6
# but this works
df['D'] = 11
df
A B D
0 1 4 11
1 2 5 11
2 3 6 11
If you had a dataframe like this (I am not recommending these column names)...
df = pd.DataFrame({'min':[1,2], 'max': ['a','a'], 'class': [1975, 1981], 'sum': [3,4]})
print(df)
min max class sum
0 1 a 1975 3
1 2 a 1981 4
It all looks OK and there are no errors. You can even access the columns via df['min'] etc...
print(df['min'])
0 1
1 2
Name: min, dtype: int64
However, if you tried with df.<column_name> you would get problems:
print(df.min)
<bound method NDFrame._add_numeric_operations.<locals>.min of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.max)
<bound method NDFrame._add_numeric_operations.<locals>.max of min max class sum
0 1 a 1975 3
1 2 a 1981 4>
print(df.class)
File "<ipython-input-31-3472b02a328e>", line 1
print(df.class)
^
SyntaxError: invalid syntax
print(df.sum)
<bound method NDFrame._add_numeric_operations.<locals>.sum of min max class sum
0 1 a 1975 3
1 2 a 1981 4>

How do I create a function to perform label encoding

I have the dataframe -
df = pd.DataFrame({'colA':['a', 'a', 'a', 'b' ,'b'], 'colB':['a', 'b', 'a', 'c', 'b'], 'colC':['x', 'x', 'y', 'y', 'y']})
I would like to write a function to replace each value with its frequency count in that column. For example, colA will become [3, 3, 3, 2, 2].
I have attempted to do this by creating a dictionary with the value and the frequency count, assign that dictionary to a variable freq, then map the column values to freq. I have written the following function
def LabelEncode_method1(col):
    freq = col.value_counts().to_dict()
    col = col.map(freq)
    return col.head()
When I run LabelEncode_method1(df.colA), I get the result 3, 3, 3, 2, 2. However, when I then look at the dataframe df, the values in colA are still 'a', 'a', 'a', 'b', 'b'.
What am I doing wrong, and how do I fix my function?
How do I write another function that loops through all columns and maps the values to freq, as opposed to calling the function 3 separate times for each column.
You can do groupby + transform
df['new'] = df.groupby('colA')['colA'].transform('count')
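If you want the same encoding for every column rather than just colA, the transform idea extends with a plain loop. A sketch (it overwrites df in place, so copy the frame first if you need the original values):

for col in df.columns:
    df[col] = df.groupby(col)[col].transform('count')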
You can use map + value_counts (which you have already found; you just need to assign the result back to your DataFrame).
df['colA'].map(df['colA'].value_counts())
0 3
1 3
2 3
3 2
4 2
Name: colA, dtype: int64
For all columns, which will create a new DataFrame:
pd.concat([
df[col].map(df[col].value_counts()) for col in df
], axis=1)
colA colB colC
0 3 2 2
1 3 2 2
2 3 2 3
3 2 1 3
4 2 2 3
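To answer the function question directly: the mapping in LabelEncode_method1 is fine; the problem is that it returns a new Series (only its head(), in fact) and never assigns anything back, so df is untouched. A sketch of an in-place version that also loops over all columns (the function name is illustrative):

def freq_encode(df):
    # replace each column's values with their frequency counts, in place
    for col in df.columns:
        df[col] = df[col].map(df[col].value_counts())

freq_encode(df)  # df.colA is now 3, 3, 3, 2, 2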

Sorting and ranking subsets of complex data

I have a large and complex GIS datafile on road accidents in "cities" within "counties". Rows represent roads. Columns provide "City", "County" and "Sum of accidents in city". A city thus contains several roads (so its accident sum is repeated on each of its rows), and a county contains several cities.
For each 'County', I now want to rank cities by their number of accidents, so that within each 'County' the city with the most accidents is ranked "1" and cities with fewer accidents are ranked "2" and higher. This rank value shall be written back into the original datafile.
My original approach was to:
1. sort the data by 'County_ID' and 'Accidents' (descending), and
2. then calculate, for each row n:
if ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 == 'Accidents' in row n):
    return value: n      # maintain the same rank for cities within a 'County'
else if ('County' in row n+1 == 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: n+1    # increasing rank value within the 'County'
else if ('County' in row n+1 < 'County' in row n) AND ('Accidents' in row n+1 < 'Accidents' in row n):
    return value: 1      # new 'County', i.e. start ranking from 1
else:
    return value: 0      # error
However, I could not figure out how to code this properly, and maybe this approach is not appropriate anyway; perhaps a loop would do the trick?
Any recommendations?
I suggest using the Python pandas module.
Fictitious Data
Create data with columns county, accidents, and city. You would use pandas.read_csv to load the actual data.
import pandas as pd

df = pd.DataFrame([
    ['a', 1, 'A'],
    ['a', 2, 'B'],
    ['a', 5, 'C'],
    ['b', 5, 'D'],
    ['b', 5, 'E'],
    ['b', 6, 'F'],
    ['b', 8, 'G'],
    ['c', 2, 'H'],
    ['c', 2, 'I'],
    ['c', 7, 'J'],
    ['c', 7, 'K']
], columns=['county', 'accidents', 'city'])
Resultant Dataframe
df:
county accidents city
0 a 1 A
1 a 2 B
2 a 5 C
3 b 5 D
4 b 5 E
5 b 6 F
6 b 8 G
7 c 2 H
8 c 2 I
9 c 7 J
10 c 7 K
Group data rows by county, and rank rows within group by accidents
Ranking Code
# ascending=False causes cities with the most accidents to be ranked 1
df["rank"] = df.groupby("county")["accidents"].rank("dense", ascending=False)
Result
df:
county accidents city rank
0 a 1 A 3.0
1 a 2 B 2.0
2 a 5 C 1.0
3 b 5 D 3.0
4 b 5 E 3.0
5 b 6 F 2.0
6 b 8 G 1.0
7 c 2 H 2.0
8 c 2 I 2.0
9 c 7 J 1.0
10 c 7 K 1.0
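rank() returns floats. If integer ranks are preferred in the output, one extra cast does it (my addition, not part of the answer above):

df["rank"] = df["rank"].astype(int)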
I think the approach by @DarrylG is correct, but it doesn't take into account that the environment is ArcGIS.
Since you tagged your question with Python, I came up with a workflow utilizing pandas. There are other ways to do the same using ArcGIS tools and/or the Field Calculator.
import arcpy  # needed if you are using this script outside ArcGIS
import pandas as pd

# change this to your actual shapefile; you might have to include a path
filename = "road_accidents"
sFields = ['County', 'City', 'SumOfAccidents']  # consider this to be your columns

# read everything in your file into a pandas DataFrame with a SearchCursor
with arcpy.da.SearchCursor(filename, sFields) as sCursor:
    df = pd.DataFrame(data=[row for row in sCursor], columns=sFields)
df = df.drop_duplicates()  # since each row represents a street, we can remove duplicates

# we use the code from DarrylG to calculate a rank
df['Rank'] = df.groupby('County')['SumOfAccidents'].rank('dense', ascending=False)

# set a multi-index, since there might be duplicate city names across counties
df = df.set_index(['County', 'City'])
dct = df.to_dict()  # convert the DataFrame into a dictionary of dictionaries

# add a field to your shapefile
arcpy.AddField_management(filename, 'Rank', 'SHORT')

# now we can update the shapefile
uFields = ['County', 'City', 'Rank']
with arcpy.da.UpdateCursor(filename, uFields) as uCursor:  # open an UpdateCursor on the file
    for row in uCursor:  # for each row (street)
        # get the county/city combo
        County_City = (row[uFields.index('County')], row[uFields.index('City')])
        if County_City in dct['Rank']:  # see if it is in your dictionary (it should be)
            # give it the value from the dictionary
            row[uFields.index('Rank')] = dct['Rank'][County_City]
        else:
            # otherwise flag it
            row[uFields.index('Rank')] = 999
        uCursor.updateRow(row)  # update the row
You can run this code inside the ArcGIS Pro Python console or from a Jupyter notebook. Hope it helps!

How to apply function to index named with at-sign in pandas? [duplicate]


Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, true if the row corresponds to one of the identifiers and false otherwise (hoping to be able to filter afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where acids is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin().
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that hold your ids? DataFrame.isin won't combine two columns for you, but we can get around that with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls gives something like an OR operation. Then, like before, we can index into this boolean series:
In [29]: new = df[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
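Coming back to the question itself: attempt 3 was essentially right, but df[df['ACIDS'].isin(acids)] returns a new, filtered DataFrame rather than modifying df, which is why "the dataframe doesn't change". A sketch using the column names from the question (assuming AcNo and Sortcode are both strings):

df['ACIDS'] = df['AcNo'] + df['Sortcode']
filtered = df[df['ACIDS'].isin(acids)]  # assign the result; df itself never changes

And if you ever need to OR two isin() masks, the idiomatic operator is |, e.g. df.ids.isin(other_ids) | df.ids2.isin(other_ids), which makes the intent clearer than bool addition.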
