Grabbing the keys and values of a groupby object: How does it work?

I'm pretty new, and SO is intimidating, so please be gentle.
Since I can't actually see what's in a groupby object, I'm trying to understand how the iterables a and b below access the keys ('name') and the group data, respectively. The results below imply that the groupby object is a list of tuples like this: (name, group data). Is that correct?
ETA: I'm trying to understand how a grabs (iterates on) grouped.groups and b grabs grouped.get_group(a). It appears that they're being grabbed from grouped.__iter__(). Is that correct? Are these the first two elements of that list/tuple?
Thanks in advance.
import pandas as pd

data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40]}

# create a pandas DataFrame from the dictionary
df = pd.DataFrame(data)
grouped = df.groupby('name')

for a, b in grouped:
    print(a)
    print(b)
Output is the following, as expected.
Alice
    name  age
0  Alice   25
Bob
  name  age
1  Bob   30
Charlie
      name  age
2  Charlie   35
David
    name  age
3  David   40

Yes. Iterating over a DataFrameGroupBy yields (key, sub-DataFrame) tuples, which is exactly what the for loop unpacks into a and b.
Here is a more interesting example.
data = {'name': ['Alice', 'Bob', 'David', 'David'],
        'age': [25, 30, 35, 40]}
Among the b dataframes it produces is this one:
>>> b
    name  age
2  David   35
3  David   40
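You can also materialize the iterator to see the structure directly. A minimal sketch, reusing grouped from above:
>>> pairs = list(grouped)   # each element is a (key, sub-DataFrame) tuple
>>> type(pairs[0][0])       # the group key, here a value from the 'name' column
<class 'str'>
>>> type(pairs[0][1])       # the group's rows
<class 'pandas.core.frame.DataFrame'>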
BTW, notice that the returned a here is always of type str, since the grouping column holds strings. Typically, reporting some aggregate will be the motivation for grouping:
>>> print(grouped.max())
       age
name
Alice   25
Bob     30
David   40
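As for the ETA: a and b come from grouped.__iter__() rather than from grouped.groups or grouped.get_group(a), but those expose the same pieces without a loop. A small sketch:
>>> grouped.groups               # dict-like: group key -> row labels in that group
{'Alice': [0], 'Bob': [1], 'David': [2, 3]}
>>> grouped.get_group('David')   # the same sub-DataFrame the loop binds to b
    name  age
2  David   35
3  David   40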

Related

Replacing values for Pandas categorical column based on category codes

I am looking for a more elegant approach to replacing the values of a categorical column based on its category codes. I am not able to use the map method because the original values are not known in advance.
I am currently using the following approach:
df['Gender'] = pd.Categorical.from_codes(df['Gender'].cat.codes.fillna(-1), categories=['Female', 'Male'])
This approach feels inelegant because I convert the categorical column to integer codes and then convert it back to categorical. Full code is below.
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jack', 'John', 'Jil', 'Jax'],
    'Gender': ['M', 'M', 'F', pd.NA],
})
df['Gender'] = df['Gender'].astype('category')

# don't want to do this as original values may not be known to establish the dict
# df['Gender'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})

# offline, we know 0 = Female, 1 = Male
# what is a more elegant way to do the below?
df['Gender'] = pd.Categorical.from_codes(df['Gender'].cat.codes.fillna(-1), categories=['Female', 'Male'])
Here is one way to do that: create a dictionary of unique items, using enumerate to assign an index to each, then use map to map the values.
d = {item: i for i, item in enumerate(df['Gender'].unique())}
df['cat'] = df['Gender'].map(d)
df
   Name Gender  cat
0  Jack      M    0
1  John      M    0
2   Jil      F    1
3   Jax   <NA>    2
What about using cat.rename_categories?
df['Gender'] = (df['Gender'].astype('category')
                            .cat.rename_categories(['Female', 'Male']))
output:
   Name  Gender
0  Jack    Male
1  John    Male
2   Jil  Female
3   Jax     NaN
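rename_categories also accepts a dict-like mapping, which avoids relying on the alphabetical order of the existing categories. A small sketch:
df['Gender'] = (df['Gender'].astype('category')
                            .cat.rename_categories({'F': 'Female', 'M': 'Male'}))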

How to merge 2 dataframes on multiple columns containing duplicates

So I have 2 dataframes, df1 and df2.
df1:
names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
df1 = pd.DataFrame({'names': names, 'ages': ages})
df2:
names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
df2 = pd.DataFrame({'student_names': names_in_student_db, 'grades': grades})
Now, I want to merge these 2 dataframes, but obviously there are 2 problems:
names and names_in_student_db are not fully identical.
Both of them contain duplicates, which seems to make the merge functions throw an error. Also, duplicates within one column are not the same person (say, the 1st Bob and the 3rd Bob in either column are different people), but the 2nd Bob in the 1st column and the 2nd Bob in the 2nd column are the same person.
So how do I write general code (not tailored to these specific dataframes) to solve this? I'm looking for an outer join, btw.
My guess is I could create another column in each dataframe, call it an 'order' column, which would basically be the integers from 0 to 9, and then merge the dataframes based on 2 columns (matching the 'order1' column with 'order2' and 'names' with 'student_names'). Is that possible? I think that still throws a duplicate-related error though.
You could clean up the student names, and assign a sequence number to all repeated names (on both DFs), assuming the order is the same. In reality, you would rather use a last name and optionally some other identifier, just to make sure you are joining the right people with their grades :-)
z = df2.assign(names=df2['student_names'].str.strip())

out = (
    df1
    .assign(seq=df1.groupby('names').cumcount())
    .merge(
        z.assign(seq=z.groupby('names').cumcount()),
        on=['names', 'seq'],
        how='left',
    )
)
>>> out
   names  ages  seq student_names grades
0    Bob    33    0           Bob      A
1    Joe    45    0           Joe      F
2           21    0                    B
3    Bob    38    1           Bob      F
4   0000    44    0           NaN    NaN
5  Alice    10    0         Alice      C
6    Joe    10    1           Joe      C
7  Alice    46    1         Alice      A
8    Bob    15    2           Bob      B
9           48    1           NaN    NaN
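If you don't want the helper column in the final result, drop it once the merge is done:
out = out.drop(columns='seq')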
PS: actual setup
The setup in the question was incomplete (missing ages and grades), so I improvised:
import numpy as np
import pandas as pd

names = ['Bob', 'Joe', '', 'Bob', '0000', 'Alice', 'Joe', 'Alice', 'Bob', '']
ages = np.random.randint(10, 50, len(names))
df1 = pd.DataFrame({'names': names, 'ages': ages})

names_in_student_db = [' Bob', ' Joe ', '', ' Bob ', 'Chris', 'Alice', 'Joe ', 'Alice ', ' Bob ', 'Daniel']
grades = np.random.choice(list('ABCDF'), len(names_in_student_db))
df2 = pd.DataFrame({'student_names': names_in_student_db, 'grades': grades})
If they always match up based on index, you can just concat them together and then drop the columns you no longer want:
pd.concat([df1, df2], axis=1)
Or specify the column names like this:
pd.merge(left=df1, right=df2, left_on='names', right_on='student_names', how='left')
depending on your expected result.

How to extract data from dataframe in pandas and assign them to normal variables

I am trying to get individual data from a groupby() result in pandas and assign it to variables, but I don't know how. For example:
df
     Names  Grades  Ages
0      Bob       4    20
1  Jessica       3    21
3      Bob       3    22
4     John       2    20
5      Bob       4    24
print(df.groupby('Names').Ages.mean())
Names
Bob        22
Jessica    21
John       20
Now I want to get the mean value for Bob into a scalar variable, like:
Bob_mean = 22 <-- how to extract this value from the dataframe object in pandas
Please help.
Thanks.
You can try:
import pandas as pd

df = pd.DataFrame([['Bob', 4, 20],
                   ['Jessica', 3, 21],
                   ['Bob', 3, 22],
                   ['John', 2, 20],
                   ['Bob', 4, 24]], columns=['Names', 'Grades', 'Ages'])
bob_mean = df.groupby(by='Names').Ages.mean()['Bob']
You have two options.
Simply amend your code to select the index 'Bob':
df.groupby('Names').Ages.mean()['Bob']
However, groupby operations can become very slow; instead, we can use df.loc:
df.loc[df['Names'] == 'Bob'].Ages.mean()
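For what it's worth, the label lookup on the resulting Series can also be written with the explicit accessors; a small sketch:
means = df.groupby('Names').Ages.mean()
bob_mean = means.loc['Bob']   # label-based indexing
bob_mean = means.at['Bob']    # fast scalar access by label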

Write Pandas DataFrame with List in Column to a File

I have a simple dataframe that has emails being sent to different receivers:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                   'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
                   'Emails': [9, 3, 5, 1, 6, 7]})
df
That looks like this:
   Emails Receiver Sender
0       9    David  Alice
1       3     Eric  Alice
2       5    Frank    Bob
3       1   Ginger   Carl
4       6    Holly    Bob
5       7   Ingrid  Alice
For each sender, I can get a list of receivers by performing a groupby along with a custom aggregation:
grouped = df.groupby('Sender')
grouped.agg({'Receiver': lambda x: list(x),
             'Emails': np.sum})
Which produces this dataframe output:
        Emails               Receiver
Sender
Alice       19  [David, Eric, Ingrid]
Bob         11         [Frank, Holly]
Carl         1               [Ginger]
I want to write the dataframe to a file (not a CSV since it will be jagged) with spaces separating each element (including splitting out the list) so it would look like this:
Alice 19 David Eric Ingrid
Bob 11 Frank Holly
Carl 1 Ginger
I could iterate over each row and write the contents to a file, but I was wondering if there is a better approach to get the same output starting from the original dataframe?
You can do that like so:
output_file = './out.txt'
with open(output_file, 'w') as fout:
    for group, df in grouped:
        fout.write('{} {} {}\n'.format(group,
                                       sum(df['Emails'].values),
                                       ' '.join(df['Receiver'].values)))
Now, the out.txt file will be:
Alice 19 David Eric Ingrid
Bob 11 Frank Holly
Carl 1 Ginger
You are almost there, just use ' '.join as the aggregating function for the Receiver column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                   'Receiver': ['David', 'Eric', 'Frank', 'Ginger', 'Holly', 'Ingrid'],
                   'Emails': [9, 3, 5, 1, 6, 7]})
grouped = df.groupby('Sender')
result = grouped.agg({'Receiver': ' '.join,
                      'Emails': np.sum})
print(result)
print(result)
Output
                     Receiver  Emails
Sender
Alice   David Eric Ingrid         19
Bob           Frank Holly         11
Carl               Ginger          1
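From there, writing the space-separated file the question asked for takes one more step. Note that result.to_csv(sep=' ') would quote the Receiver field because it contains spaces, so plain string formatting is simpler; a sketch writing to a hypothetical out.txt:
with open('out.txt', 'w') as fout:
    for sender, row in result.iterrows():
        fout.write('{} {} {}\n'.format(sender, row['Emails'], row['Receiver']))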
For the sake of completeness, if the Receiver column were ints instead of strings, you could convert to string first and then join:
df = pd.DataFrame({'Sender': ['Alice', 'Alice', 'Bob', 'Carl', 'Bob', 'Alice'],
                   'Receiver': [1, 2, 3, 4, 5, 6],
                   'Emails': [9, 3, 5, 1, 6, 7]})
grouped = df.groupby('Sender')
result = grouped.agg({'Receiver': lambda x: ' '.join(map(str, x)),
                      'Emails': np.sum})
print(result)
print(result)
Output
       Receiver  Emails
Sender
Alice     1 2 6      19
Bob         3 5      11
Carl          4       1

How to insert rows at specific positions into a dataframe in Python?

Suppose you have a dataframe
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                   'Age': [28, 34, 29, 42]})
and another dataframe
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})
as well as a list with indices
pos = [0, 2]
What is the most pythonic way to create a new dataframe df2 where df1 is integrated into df right before the index positions of df specified in pos?
So, the new dataframe should look like this:
df2 =
   Age   Name
0   20   Anna
1   28    Tom
2   34   Jack
3   50  Susie
4   29  Steve
5   42  Ricky
Thank you very much.
Best,
Nathan
The behavior you are searching for is implemented by numpy.insert. It will not play very well with pandas.DataFrame objects, but no matter: a pandas.DataFrame holds a numpy.ndarray inside it (sort of; depending on various factors it may be multiple arrays, but you can think of it as one array, accessible via the .values attribute).
You will simply have to reconstruct the columns of your dataframe, but otherwise I suspect this is the easiest and fastest way:
In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
   ...:                    'Age': [28, 34, 29, 42]})

In [3]: df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]})

In [4]: np.insert(df.values, (0, 2), df1.values, axis=0)
Out[4]:
array([['Anna', 20],
       ['Tom', 28],
       ['Jack', 34],
       ['Susie', 50],
       ['Steve', 29],
       ['Ricky', 42]], dtype=object)
So this returns an array, but this array is exactly what you need to make a dataframe. And you already have the other piece, the columns, on the original dataframe, so you can just do:
In [5]: pd.DataFrame(np.insert(df.values, (0, 2), df1.values, axis=0), columns=df.columns)
Out[5]:
    Name Age
0   Anna  20
1    Tom  28
2   Jack  34
3  Susie  50
4  Steve  29
5  Ricky  42
So that single line is all you need.
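One caveat: np.insert runs everything through a single object-dtype array, so the Age column in the reconstructed dataframe comes back as dtype object. A sketch of restoring proper dtypes afterwards with infer_objects:
df2 = pd.DataFrame(np.insert(df.values, (0, 2), df1.values, axis=0),
                   columns=df.columns).infer_objects()
df2.dtypes   # Age is int64 again rather than object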
Tricky solution with float indexes:
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]})
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]}, index=[-0.5, 1.5])

# DataFrame.append was removed in pandas 2.0; pd.concat is the modern equivalent
result = pd.concat([df, df1]).sort_index().reset_index(drop=True)
print(result)
Output:
    Name  Age
0   Anna   20
1    Tom   28
2   Jack   34
3  Susie   50
4  Steve   29
5  Ricky   42
Pay attention to the index parameter in the df1 creation. You can construct the index from pos with a simple list comprehension:
[x - 0.5 for x in pos]
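Putting it together for an arbitrary pos; a sketch:
pos = [0, 2]
df1 = pd.DataFrame({'Name': ['Anna', 'Susie'], 'Age': [20, 50]},
                   index=[x - 0.5 for x in pos])
result = pd.concat([df, df1]).sort_index().reset_index(drop=True)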
