Pandas data frame sum of column and collecting the results - python

Given the following dataframe:
import pandas as pd
p1 = {'name': 'willy', 'age': 11, 'interest': "Lego"}
p2 = {'name': 'willy', 'age': 11, 'interest': "games"}
p3 = {'name': 'zoe', 'age': 9, 'interest': "cars"}
df = pd.DataFrame([p1, p2, p3])
df
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
I want to know the sum of interests of each person and let each person only show once in the list. I do the following:
Interests = df[['age', 'name', 'interest']].groupby(['age' , 'name']).count()
Interests.reset_index(inplace=True)
Interests.sort('interest', ascending=False, inplace=True)
Interests
age name interest
1 11 willy 2
0 9 zoe 1
This works but I have the feeling that I'm doing it wrong. Now I'm using the column 'interest' to display my sum values which is okay but like I said I expect there to be a nicer way to do this.
I saw many questions about counting/sum in Pandas but for me the part where I leave out the 'duplicates' is key.

You can use size (the length of each group), rather than count, the non-NaN enties in each column of the group.
In [11]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size()
Out[11]:
age name
9 zoe 1
11 willy 2
dtype: int64
In [12]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size().reset_index(name='count')
Out[12]:
age name count
0 9 zoe 1
1 11 willy 2

In [2]: df
Out[2]:
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
In [3]: for name,group in df.groupby('name'):
...: print name
...: print group.interest.count()
...:
willy
2
zoe
1

Related

How to create a subtotal record in pandas using certain rows data

My current code is something like this:
df = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],
'num_specimen_seen': [10, 2, 1, 8]})
df = df.append({'animal': 'Total Land', 'num_specimen_seen': df.loc[df['animal']=='falcon','num_specimen_seen']+df.loc[df['animal']=='dog','num_specimen_seen']+df.loc[df['animal']=='spider','num_specimen_seen']}, ignore_index=True)
df
In the above code, I'm creating a new record for animal column called Total Land and its num_specimen_seen is being calculated by referencing corresponding land animals count. Is there a better way to achieve my desired result provided below ?
I preferably wouldn't want to create a subset of current dataframe to use .sum functionality as I need to do above operation multiple times
Current Output:
animal
num_specimen_seen
0
falcon
10
1
dog
2
2
spider
1
3
fish
8
4
Total Land
0 NaN 1 NaN 2 NaN Name: num_specimen_see...
Expected Output:
animal
num_specimen_seen
0
falcon
10
1
dog
2
2
spider
1
3
fish
8
4
Total Land
13
Use Series.isin with sum:
new = df.loc[df['animal'].isin(['falcon', 'dog','spider']), 'num_specimen_seen'].sum()
df = df.append({'animal': 'Total Land', 'num_specimen_seen': new}, ignore_index=True)
print (df)
animal num_specimen_seen
0 falcon 10
1 dog 2
2 spider 1
3 fish 8
4 Total Land 13
For one line solution use (unfortunately hard readable):
df = df.append({'animal': 'Total Land',
'num_specimen_seen': df.loc[df['animal'].isin(['falcon', 'dog','spider']), 'num_specimen_seen'].sum()}, ignore_index=True)
use sum like below:
df = df.append({'animal': 'Total Land',
'num_specimen_seen': df['num_specimen_seen'][df['animal'].isin(['falcon', 'dog','spider'])].sum()}, ignore_index=True)
output:

Python: Compare two dataframes in Python with different number rows and a Compsite key

I have two different dataframes which i need to compare.
These two dataframes are having different number of rows and doesnt have a Pk its Composite primarykey of (id||ver||name||prd||loc)
df1:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
b 1 alex 1b y
b 2 david 1b z
df2:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
I tried the below code and this workingif there are same number of rows , but if its like the above case its not working.
df1 = pd.DataFrame(Source)
df1 = df1.astype(str) #converting all elements as objects for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str) #converting all elements as objects for easy comparison
header_list = df1.columns.tolist() #creating a list of column names from df1 as the both df has same structure
df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)) :
df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')
df3.to_csv('Output', index=False)
Please leet me know how to compare the datasets if there are different number od rows.
You can try this:
~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()
Lets consider a quick example:
df1 = pd.DataFrame({
'Buyer': ['Carl', 'Carl', 'Carl'],
'Quantity': [18, 3, 5, ]})
# Buyer Quantity
# 0 Carl 18
# 1 Carl 3
# 2 Carl 5
df2 = pd.DataFrame({
'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
'Quantity': [2, 1, 18, 5]})
# Buyer Quantity
# 0 Carl 2
# 1 Mark 1
# 2 Carl 18
# 3 Carl 5
~df2.isin(df1)
# Buyer Quantity
# 0 False True
# 1 True True
# 2 False True
# 3 True True
df2[~df2.isin(df1)].dropna()
# Buyer Quantity
# 1 Mark 1
# 3 Carl 5
Another idea can be merge on the same column names.
Sure, tweak the code to your needs. Hope this helped :)

How do you replace duplicate values with multiple unique strings in Pandas?

import pandas as pd
import numpy as np
data = {'Name':['Tom', 'Tom', 'Jack', 'Terry'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
Lets say I have a dataframe that looks like this. I am trying to figure out how to check the Name column for the value 'Tom' and if I find it the first time I replace it with the value 'FirstTom' and the second time it appears I replace it with the value 'SecondTom'. How do you accomplish this? I've used the replace method before but only for replacing all Toms with a single value. I don't want to add a 1 on the end of the value, but completely change the string to something else.
Edit:
If the df looked more like this below, how would we check for Tom in the first column and the second column and then replace the first instance with FirstTom and the second instance with SecondTom
data = {'Name':['Tom', 'Jerry', 'Jack', 'Terry'], 'OtherName':[Tom, John, Bob,Steve]}
Just adding in to the existing solutions , you can use inflect to create dynamic dictionary
import inflect
p = inflect.engine()
df['Name'] += df.groupby('Name').cumcount().add(1).map(p.ordinal).radd('_')
print(df)
Name Age
0 Tom_1st 20
1 Tom_2nd 21
2 Jack_1st 19
3 Terry_1st 18
We can do cumcount
df.Name=df.Name+df.groupby('Name').cumcount().astype(str)
df
Name Age
0 Tom0 20
1 Tom1 21
2 Jack0 19
3 Terry0 18
Update
suf = lambda n: "%d%s"%(n,{1:"st",2:"nd",3:"rd"}.get(n if n<20 else n%10,"th"))
g=df.groupby('Name')
df.Name=df.Name.radd(g.cumcount().add(1).map(suf).mask(g.Name.transform('count')==1,''))
df
Name Age
0 1stTom 20
1 2ndTom 21
2 Jack 19
3 Terry 18
Update 2 for column
suf = lambda n: "%d%s"%(n,{1:"st",2:"nd",3:"rd"}.get(n if n<20 else n%10,"th"))
g=s.groupby([s.index.get_level_values(0),s])
s=s.radd(g.cumcount().add(1).map(suf).mask(g.transform('count')==1,''))
s=s.unstack()
Name OtherName
0 1stTom 2ndTom
1 Jerry John
2 Jack Bob
3 Terry Steve
EDIT: For count duplicated per rows use:
df = pd.DataFrame(data = {'Name':['Tom', 'Jerry', 'Jack', 'Terry'],
'OtherName':['Tom', 'John', 'Bob','Steve'],
'Age':[20, 21, 19, 18]})
print (df)
Name OtherName Age
0 Tom Tom 20
1 Jerry John 21
2 Jack Bob 19
3 Terry Steve 18
import inflect
p = inflect.engine()
#map by function for dynamic counter
f = lambda i: p.number_to_words(p.ordinal(i))
#columns filled by names
cols = ['Name','OtherName']
#reshaped to MultiIndex Series
s = df[cols].stack()
#counter per groups
count = s.groupby([s.index.get_level_values(0),s]).cumcount().add(1)
#mask for filter duplicates
mask = s.reset_index().duplicated(['level_0',0], keep=False).values
#filter only duplicates and map, reshape back and add to original data
df[cols] = count[mask].map(f).unstack().add(df[cols], fill_value='')
print (df)
Name OtherName Age
0 firstTom secondTom 20
1 Jerry John 21
2 Jack Bob 19
3 Terry Steve 18
Use GroupBy.cumcount with Series.map, but only for duplicated values by Series.duplicated:
data = {'Name':['Tom', 'Tom', 'Jack', 'Terry'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
nth = {
0: "First",
1: "Second",
2: "Third",
3: "Fourth"
}
mask = df.Name.duplicated(keep=False)
df.loc[mask, 'Name'] = df[mask].groupby('Name').cumcount().map(nth) + df.loc[mask, 'Name']
print (df)
Name Age
0 FirstTom 20
1 SecondTom 21
2 Jack 19
3 Terry 18
Dynamic dictionary should be like:
import inflect
p = inflect.engine()
mask = df.Name.duplicated(keep=False)
f = lambda i: p.number_to_words(p.ordinal(i))
df.loc[mask, 'Name'] = df[mask].groupby('Name').cumcount().add(1).map(f) + df.loc[mask, 'Name']
print (df)
Name Age
0 firstTom 20
1 secondTom 21
2 Jack 19
3 Terry 18
transform
nth = ['First', 'Second', 'Third', 'Fourth']
def prefix(d):
n = len(d)
if n > 1:
return d.radd([nth[i] for i in range(n)])
else:
return d
df.assign(Name=df.groupby('Name').Name.transform(prefix))
Name Age
0 FirstTom 20
1 SecondTom 21
2 Jack 19
3 Terry 18
4 FirstSteve 17
5 SecondSteve 16
6 ThirdSteve 15
​

How to insert rows at specific positions into a dataframe in Python?

suppose you have a dataframe
df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':
[28,34,29,42]})
and another dataframe
df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]})
as well as a list with indices
pos = [0,2].
What is the most pythonic way to create a new dataframe df2 where df1 is integrated into df right before the index positions of df specified in pos?
So, the new array should look like this:
df2 =
Age Name
0 20 Anna
1 28 Tom
2 34 Jack
3 50 Susie
4 29 Steve
5 42 Ricky
Thank you very much.
Best,
Nathan
The behavior you are searching for is implemented by numpy.insert, however, this will not play very well with pandas.DataFrame objects, but no-matter, pandas.DataFrame objects have a numpy.ndarray inside of them (sort of, depending on various factors, it may be multiple arrays, but you can think of them as on array accessible via the .values parameter).
You will simply have to reconstruct the columns of your data-frame, but otherwise, I suspect this is the easiest and fastest way:
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':
...: [28,34,29,42]})
In [3]: df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]})
In [4]: np.insert(df.values, (0,2), df1.values, axis=0)
Out[4]:
array([['Anna', 20],
['Tom', 28],
['Jack', 34],
['Susie', 50],
['Steve', 29],
['Ricky', 42]], dtype=object)
So this returns an array, but this array is exactly what you need to make a data-frame! And you have the other elements, i.e. the columns already on the original data-frames, so you can just do:
In [5]: pd.DataFrame(np.insert(df.values, (0,2), df1.values, axis=0), columns=df.columns)
Out[5]:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
So that single line is all you need.
Tricky solution with float indexes:
df = pd.DataFrame({'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age': [28,34,29,42]})
df1 = pd.DataFrame({'Name':['Anna', 'Susie'],'Age':[20,50]}, index=[-0.5, 1.5])
result = df.append(df1, ignore_index=False).sort_index().reset_index(drop=True)
print(result)
Output:
Name Age
0 Anna 20
1 Tom 28
2 Jack 34
3 Susie 50
4 Steve 29
5 Ricky 42
Pay attention to index parameter in df1 creation. You can construct index from pos using simple list comprehension:
[x - 0.5 for x in pos]

Python, pandas: How to append a series to a dataframe

I have the following dataframe df1:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy', 'Lisa', 'Molly', 'Lisa', 'Molly', 'Fred'],
'gender': ['m', 'f', 'f', 'm', 'f', 'f', 'f', 'f','f', 'm'],
}
df1 = pd.DataFrame(data, index = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
I want to create a table with some standard and some custom summary statistics df2.
df2 = df1.describe()
df2.rename(index={'top':'mode'},inplace=True)
df2.rename(index={'freq':'mode freq'},inplace=True)
df2
df2:
gender name
count 10 10
unique 2 7
mode f Molly
mode freq 7 3
I want to append one row to df2 for the second mode and one for the frequency of the second mode:
Example:
gender name
count 10 10
unique 2 7
mode f Molly
mode freq 7 3
2nd mode m Lisa
2nd freq 3 2
I figured out that you can get the second mode & frequency by doing this:
my_series
for column in df1:
my_series=df1[column].value_counts()[1:2]
print(my_series)
But how do I append this to df2?
You can do apply with value_counts, then we need modify your dataframe shape .
df1.apply(lambda x : x.value_counts().iloc[[1]]).stack().reset_index(level=0).T
Out[172]:
name gender
level_0 Lisa m
0 2 3
The final out put (Change the index name using what you show to us rename)
pd.concat([df1.describe(),df1.apply(lambda x : x.value_counts().iloc[[1]]).stack().reset_index(level=0).T])
Out[173]:
gender name
count 10 10
unique 2 7
top f Molly
freq 7 3
level_0 m Lisa
0 3 2
With Counter
from collections import Counter
def f(s):
return pd.Series(Counter(s).most_common(2)[1], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))
name gender
count 10 10
unique 7 2
mode1 Molly f
mode1 freq 3 7
mode2 Lisa m
mode2 freq 2 3
value_counts
Same thing without Counter
def f(s):
c = s.value_counts()
return pd.Series([s.iat[1], s.index[1]], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))
Numpy bits
def f(s):
f, u = pd.factorize(s)
c = np.bincount(f)
i = np.argpartition(c, -2)[-2]
return pd.Series([u[i], c[i]], ['mode2', 'mode2 freq'])
df1.describe().rename(dict(top='mode1', freq='mode1 freq')).append(df1.apply(f))

Categories