groupby name and position in the group - python

I would like to group by a column and then split one or more of the groups into two.
Example
This
import numpy as np
import pandas as pd

np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
df.sort_values("animal")
gives me this dataframe
animal number
1 panda 1
4 panda 1
7 panda 1
9 panda 1
0 python 1
2 python 1
3 python 1
5 python 1
8 python 1
6 shark 1
Now I would like to group by animal, but also split the "pythons" into the first two and the rest, so that
df.groupby(your_magic).sum()
gives me
number
animal
panda 4
python_1 2
python_2 3
shark 1

What about
np.random.seed(11)
df = pd.DataFrame({"animal":np.random.choice( ['panda','python','shark'], 10),
"number": 1})
## find index on which you split python into python_1 and python_2
python_split_idx = df[df['animal'] == 'python'].iloc[2].name
## rename python according to index
df[df['animal'] == 'python'] = df[df['animal'] == 'python'].apply(
    lambda row: pd.Series(
        ['python_1' if row.name < python_split_idx else 'python_2', row.number],
        index=['animal', 'number']),
    axis=1)
## group according to all animals and sum the number
df.groupby('animal').agg({'number': sum})
Output:
number
animal
panda 4
python_1 2
python_2 3
shark 1

I ended up using something similar to Stefan's answer but slightly reformulated it to avoid having to use apply. This looks a bit cleaner to me.
idx1 = df[df['animal'] == 'python'].iloc[:2].index
idx2 = df[df['animal'] == 'python'].iloc[2:].index
df.loc[idx1, "animal"] = "python_1"
df.loc[idx2, "animal"] = "python_2"
df.groupby("animal").sum()
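For completeness, the renaming can be avoided entirely by passing a derived key to groupby, which is literally the your_magic placeholder from the question. This is only a sketch starting from the unmodified df; the cumcount-based rule below is just my own way of expressing "the first two pythons":
key = df['animal'].copy()
mask = key.eq('python')
# cumcount numbers the python rows 0, 1, 2, ...; the first two become python_1, the rest python_2
key[mask] = 'python_' + (df[mask].groupby('animal').cumcount().ge(2).astype(int) + 1).astype(str)
df.groupby(key)['number'].sum()
Grouping by an index-aligned Series like this leaves the animal column untouched.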

This is a quick and dirty (and somewhat inefficient) way to do it if you want to rename all your pythons before you group them.
indices = []
for i, v in enumerate(df['animal']):
    if v == 'python':
        if len(indices) < 2:
            indices.append(i)
            df.loc[i, 'animal'] = 'python_1'
        else:
            df.loc[i, 'animal'] = 'python_2'
grouped = df.groupby('animal').agg('sum')
print(grouped)
This provides your desired output exactly.
As an alternative, here's a totally different approach that creates another column to capture whether each animal is a member of the group of the top two pythons and then groups on both columns.
snakes = df[df['animal'] == 'python']
df['special_snakes'] = [1 if i not in snakes.index[:2] else 0 for i in df.index]
df.groupby(['animal', 'special_snakes']).agg('sum')
The output looks a bit different, but achieves the same outcome. This approach also has the advantage of capturing the condition on which you are grouping your animals without actually changing the values in the animal column.
number
animal special_snakes
panda 1 4
python 0 2
1 3
shark 1 1
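If you do want the flat python_1/python_2 labels from the question rather than the MultiIndex, one optional post-processing step (a sketch, not part of the original answer) is to collapse the two index levels:
out = df.groupby(['animal', 'special_snakes']).agg('sum')
# keep the plain animal name, except for pythons, which get a _1/_2 suffix from the flag
out.index = ['{}_{}'.format(animal, flag + 1) if animal == 'python' else animal
             for animal, flag in out.index]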

Related

How to change the values of column items using pandas?

This is my first question on Stack Overflow.
I'm implementing a machine learning classification algorithm and I want to generalize it to any input dataset that has its target class in the last column. To do that, I want to modify all values of this column without needing to know the column or row names, using pandas in Python.
For example, let's suppose I load a dataset:
dataset = pd.read_csv('random_dataset.csv')
Let's say the last column has the following data:
0 dog
1 dog
2 cat
3 dog
4 cat
I want to change each "dog" appearance to 1 and each "cat" appearance to 0, so that the column would look like this:
0 1
1 1
2 0
3 1
4 0
I have found some ways of changing the values of specific cells using pandas, but for this case, what would be the best way to do that?
I appreciate each answer.
You can use pandas.Categorical:
df['column'] = pd.Categorical(df['column']).codes
You can also use the built-in categorical functionality for this:
df['column'] = df['column'].astype('category').cat.codes
Use map and map the values as required:
df['col_name'] = df['col_name'].map({'dog' : 1 , 'cat': 0})
Or use factorize (encode the object as an enumerated type) if you just want arbitrary numeric codes:
df['col_name'] = df['col_name'].factorize()[0]
OUTPUT:
0 1
1 1
2 0
3 1
4 0
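Since the question is about not knowing the column names, here is a minimal sketch that applies the mapping to the last column selected by position (the file name and labels are taken from the question):
import pandas as pd

dataset = pd.read_csv('random_dataset.csv')
# pick the last column by position, whatever it happens to be called
target = dataset.columns[-1]
dataset[target] = dataset[target].map({'dog': 1, 'cat': 0})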

Iterate over results of value_counts() on a groupby object

I have a dataframe like df = pd.DataFrame({'ID':[1,1,2,2,3,3,4,4,5,5,5],'Col1':['Y','Y','Y','N','N','N','Y','Y','Y','N','N']}). What I would like to do is group by the 'ID' column and then get statistics on three conditions:
How many groups have only 'Y's
How many groups have at least 1 'Y' and at least 1 'N'
How many groups have only 'N's
groups = df.groupby('ID')
groups.Col1.value_counts()
gives me a visual representation of what I'm looking for, but how can I then iterate over the results of the value_counts() method to check for these conditions?
I think pd.crosstab() may be more suitable for your use case.
Code
df_crosstab = pd.crosstab(df["ID"], df["Col1"])
Col1 N Y
ID
1 0 2
2 1 1
3 2 0
4 0 2
5 2 1
Groupby can also do the job, but it is much more tedious:
df_crosstab = df.groupby('ID')["Col1"]\
    .value_counts()\
    .rename("count")\
    .reset_index()\
    .pivot(index="ID", columns="Col1", values="count")\
    .fillna(0)
Filtering the groups
After producing df_crosstab, the filters for your 3 questions could be easily constructed:
# 1. How many groups have only 'Y's
df_crosstab[df_crosstab['N'] == 0]
Col1 N Y
ID
1 0 2
4 0 2
# 2. How many groups have at least 1 'Y' and at least 1 'N'
df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)]
Col1 N Y
ID
2 1 1
5 2 1
# 3. How many groups have only 'N's
df_crosstab[df_crosstab['Y'] == 0]
Col1 N Y
ID
3 2 0
If you want the number of groups only, just take the length of the filtered crosstab DataFrame, for example as sketched below. I believe this also makes automation much easier.
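For example, the three counts could be read straight off the crosstab (a sketch based on the filters above):
only_y = len(df_crosstab[df_crosstab['N'] == 0])
only_n = len(df_crosstab[df_crosstab['Y'] == 0])
mixed = len(df_crosstab[(df_crosstab['N'] > 0) & (df_crosstab['Y'] > 0)])
print(only_y, only_n, mixed)  # 2, 1, 2 for the sample data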
groups = df.groupby('ID')
answers = groups.Col1.value_counts()
for item in answers.items():
print(item)
value_counts() gives you a Series, and you can iterate over it, but note that iterating alone does not answer your question: you would still have to check each of these items against the tests you are looking for.
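One way to perform those checks directly (a sketch, not taken from the original answer) is to aggregate each group into a set of its labels and compare:
labels = df.groupby('ID')['Col1'].agg(set)
only_y = labels.apply(lambda s: s == {'Y'}).sum()
only_n = labels.apply(lambda s: s == {'N'}).sum()
mixed = labels.apply(lambda s: s == {'Y', 'N'}).sum()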
If you group by 'ID' and use the 'sum' function, you get all of the letters concatenated into one string per group. Then you can count the letters to check your conditions, and take the sums of those checks to know the exact numbers for all of the groups:
strings = df.groupby(['ID']).sum()
only_y = sum(strings['Col1'].str.count('N') == 0)
only_n = sum(strings['Col1'].str.count('Y') == 0)
both = sum((strings['Col1'].str.count('Y') > 0) & (strings['Col1'].str.count('N') > 0))
print('Number of groups with Y only: ' + str(only_y),
      'Number of groups with N only: ' + str(only_n),
      'Number of groups with at least one Y and one N: ' + str(both),
      sep='\n')

Update dataframe values that match a regex condition and keep remaining values intact

The following is an excerpt from my dataframe:
In[1]: df
Out[1]:
LongName BigDog
1 Big Dog 1
2 Mastiff 0
3 Big Dog 1
4 Cat 0
I want to use regex to update BigDog values to 1 if LongName is a mastiff. I need other values to stay the same. I tried this, and although it assigns 1 to mastiffs, it nulls all other values instead of keeping them intact.
import re

def BigDog(longname):
    if re.search('(?i)mastiff', longname):
        return '1'

df['BigDog'] = df['LongName'].apply(BigDog)
I'm not sure what to do, could anybody please help?
Your function returns None whenever the pattern does not match, which is why apply overwrites the other values with nulls. You don't need a loop or apply; use str.match with DataFrame.loc:
df.loc[df['LongName'].str.match('(?i)mastiff'), 'BigDog'] = 1
LongName BigDog
1 Big Dog 1
2 Mastiff 1
3 Big Dog 1
4 Cat 0
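Note that str.match anchors the pattern at the start of the string, whereas the original code used re.search; a case-insensitive str.contains is an equivalent alternative (just a sketch of another option):
df.loc[df['LongName'].str.contains('mastiff', case=False), 'BigDog'] = 1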

Improving performance of Python for loops with Pandas data frames

please consider the following DataFrame df:
timestamp id condition
1234 A
2323 B
3843 B
1234 C
8574 A
9483 A
Based on the condition contained in the column condition, I have to define a new column in this DataFrame which counts how many ids are in that condition.
However, please note that since the DataFrame is ordered by the timestamp column, one could have multiple entries of the same id, so a simple .cumsum() is not a viable option.
I have come up with the following code, which works properly but is extremely slow:
# I start by defining empty arrays
ids_with_condition_a = np.empty(0)
ids_with_condition_b = np.empty(0)
ids_with_condition_c = np.empty(0)

# Initializing the new column
df['count'] = 0

# Using a for loop to do the task, but this is sooo slow!
for r in range(0, df.shape[0]):
    if df.condition[r] == 'A':
        ids_with_condition_a = np.append(ids_with_condition_a, df.id[r])
    elif df.condition[r] == 'B':
        ids_with_condition_b = np.append(ids_with_condition_b, df.id[r])
        ids_with_condition_a = np.setdiff1d(ids_with_condition_a, ids_with_condition_b)
    elif df.condition[r] == 'C':
        ids_with_condition_c = np.append(ids_with_condition_c, df.id[r])
    df.loc[r, 'count'] = ids_with_condition_a.size
Keeping these NumPy arrays is very useful to me because they give the list of ids in a particular condition. I would also like to be able to put these arrays dynamically into a corresponding cell of the df DataFrame.
Can you come up with a better solution in terms of performance?
You need to use groupby on the column 'condition' and cumcount to count how many ids are in each condition up to the current row (which seems to be what your code does):
df['count'] = df.groupby('condition').cumcount()+1 # +1 is to start at 1 not 0
with your input sample, you get:
id condition count
0 1234 A 1
1 2323 B 1
2 3843 B 2
3 1234 C 1
4 8574 A 2
5 9483 A 3
which is faster than using a for loop.
If you just want the rows with condition A, for example, you can use a mask: print(df[df['condition'] == 'A']) shows only the rows whose condition equals A. So to get an array:
arr_A = df.loc[df['condition'] == 'A','id'].values
print (arr_A)
array([1234, 8574, 9483])
EDIT: to create the two columns per condition, you can do, for example for condition A:
# put 1 in the column where the condition is met
df['nb_cond_A'] = np.where(df['condition'] == 'A', 1, np.nan)
# then use cumsum to increment the number, ffill to fill the same number down
# where the condition is not met, and fillna(0) to fill the remaining missing values
df['nb_cond_A'] = df['nb_cond_A'].cumsum().ffill().fillna(0).astype(int)
# for the partial list, first create the full array
arr_A = df.loc[df['condition'] == 'A','id'].values
# create the column with apply (other ways might exist, but this is one)
df['partial_arr_A'] = df['nb_cond_A'].apply(lambda x: arr_A[:x])
the output looks like this:
id condition nb_cond_A partial_arr_A
0 1234 A 1 [1234]
1 2323 B 1 [1234]
2 3843 B 1 [1234]
3 1234 C 1 [1234]
4 8574 A 2 [1234, 8574]
5 9483 A 3 [1234, 8574, 9483]
Then do the same thing for B and C. A loop over for cond in set(df['condition']) would be practical for generalisation, as sketched below.
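A sketch of that generalisation loop, assuming the same column-naming pattern as above:
import numpy as np

for cond in df['condition'].unique():
    col = 'nb_cond_{}'.format(cond)
    df[col] = np.where(df['condition'] == cond, 1, np.nan)
    df[col] = df[col].cumsum().ffill().fillna(0).astype(int)
    arr = df.loc[df['condition'] == cond, 'id'].values
    # the default argument pins the current array inside the lambda
    df['partial_arr_{}'.format(cond)] = df[col].apply(lambda x, a=arr: a[:x])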
EDIT 2: here is one idea to do what you explained in the comments, but I am not sure it improves the performance:
# array of unique condition
arr_cond = df.condition.unique()
#use apply to create row-wise the list of ids for each condition
df[arr_cond] = (df.apply(lambda row: (df.loc[:row.name]
                                        .drop_duplicates('id', keep='last')
                                        .groupby('condition').id.apply(list)), axis=1)
                  .applymap(lambda x: [] if not isinstance(x, list) else x))
Some explanations: for each row, select the DataFrame up to this row with loc[:row.name], drop the duplicated 'id' values and keep the last one with drop_duplicates('id', keep='last') (in your example, once we reach row 3, row 0 is dropped, as the id 1234 appears twice), then group by condition with groupby('condition') and put the ids for each condition into a list with id.apply(list). The part starting with applymap fills the missing values with empty lists (you can't use fillna([]); it's not possible).
For the length for each condition, you can do:
for cond in arr_cond:
    df['len_{}'.format(cond)] = df[cond].str.len().fillna(0).astype(int)
The result looks like this:
id condition A B C len_A len_B len_C
0 1234 A [1234] [] [] 1 0 0
1 2323 B [1234] [2323] [] 1 1 0
2 3843 B [1234] [2323, 3843] [] 1 2 0
3 1234 C [] [2323, 3843] [1234] 0 2 1
4 8574 A [8574] [2323, 3843] [1234] 1 2 1
5 9483 A [8574, 9483] [2323, 3843] [1234] 2 2 1

Pandas Multi-Column Boolean Indexing/Selection with Dict Generator

Let's imagine you have a DataFrame df with a large number of columns, say 50, and df does not have any index (i.e. index_col=None). You would like to select a subset of the columns as defined by a required_columns_list, but would like to return only those rows meeting multiple criteria as defined by various boolean indexes. Is there a way to concisely generate the selection statement using a dict generator?
As an example:
df = pd.DataFrame(np.random.randn(100,50),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# df.columns = Index[u'Col001', u'Col002', ..., u'Col050']
required_columns_list = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
Now let's imagine that I define:
boolean_index_dict = {'Col001':"MyAccount", 'Col002':"Summary", 'Col005':"Total"}
I would like to select out using a dict generator to construct the multiple boolean indices:
df.loc[GENERATOR_USING_boolean_index_dict, required_columns_list].values
The above generator boolean method would be the equivalent of:
df.loc[(df['Col001']=="MyAccount") & (df['Col002']=="Summary") & (df['Col005']=="Total"), ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']].values
Hopefully you can see that this would be a really useful 'template' for operating on large DataFrames, where the boolean indexing can then be defined in boolean_index_dict. I would greatly appreciate it if you could let me know whether this is possible in Pandas and how to construct the GENERATOR_USING_boolean_index_dict.
Many thanks and kind regards,
Bertie
p.s. If you would like to test this out, you will need to populate some of the df columns with text. The definition of df using random numbers was simply given as a starting point if required for testing...
Suppose this is your df:
df = pd.DataFrame(np.random.randint(0,4,(100,50)),index=None,columns=["Col" + ("%03d" % (i + 1)) for i in range(50)])
# the first five cols and rows:
df.iloc[:5,:5]
Col001 Col002 Col003 Col004 Col005
0 2 0 2 3 1
1 0 1 0 1 3
2 0 1 1 0 3
3 3 1 0 2 1
4 1 2 3 1 0
Compared to your example, all columns are filled with ints of 0, 1, 2 or 3.
Let's define the criteria:
req = ['Col002', 'Col012', 'Col025', 'Col032', 'Col033']
filt = {'Col001': 2, 'Col002': 2, 'Col005': 2}
So we want the required columns, from those rows where the filter columns all contain the value 2.
You can then get the result with:
df.loc[df[list(filt)].apply(lambda x: x.tolist() == list(filt.values()), axis=1), req]
In my case this is the result:
Col002 Col012 Col025 Col032 Col033
43 2 2 1 3 3
98 2 1 1 1 2
Let's check the filter columns for those rows:
df[filt.keys()].iloc[[43,98]]
Col005 Col001 Col002
43 2 2 2
98 2 2 2
And some other (non-matching) rows:
df[filt.keys()].iloc[[44,99]]
Col005 Col001 Col002
44 3 0 3
99 1 0 0
I'm starting to like Pandas more and more.
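As a closing note on the dict-driven selection the question asks about: one concise way to build the combined mask (a sketch, assuming exact-equality filters as in boolean_index_dict) is to reduce the per-column comparisons with a logical AND:
import numpy as np

# one equality test per (column, value) pair, combined into a single boolean mask
mask = np.logical_and.reduce([df[col] == val for col, val in boolean_index_dict.items()])
result = df.loc[mask, required_columns_list]
This keeps the filters in the dict and the selection statement generic, which is what the 'template' idea seems to be after.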
