make a change in each group after groupby - python

Suppose we have a dataframe like
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21, 34, 37, 40, 55, 65, 67, 84, 88, 90, 91, 97, 104, 105, 108]
df = pd.DataFrame({'name': names, 'time_of_action': times})
We would like to group it by name and, within each group, measure times starting from 0. That is, in each group we want to subtract the minimum time_of_action of that group from all times in that group. How could we do this systematically with pandas?

If I understand correctly, you want this:
df['new time'] = df['time_of_action'] - df.groupby('name')['time_of_action'].transform('min')
df:
name time_of_action new time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
6 bob 67 46
7 ali 84 29
8 moji 88 0
9 ali 90 35
10 moji 91 3
11 ali 97 42
12 bob 104 83
13 bob 105 84
14 bob 108 87
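As a side note, transform aligns its result with the original index, so the whole subtraction can also be done per group in a single step; a minimal equivalent sketch:
# equivalent to subtracting the broadcast group minimum shown above
df['new time'] = df.groupby('name')['time_of_action'].transform(lambda x: x - x.min())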

Try this:
df['new_time'] = df.groupby('name')['time_of_action'].apply(lambda x: x - x.min())
df
Output:
name time_of_action new_time
0 bob 21 0
1 alice 34 0
2 bob 37 16
3 bob 40 19
4 ali 55 0
5 alice 65 31
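A version note, hedged because this behavior has changed across pandas releases: in pandas 2.x, groupby(...).apply prepends the group keys to the result index by default, which breaks the assignment above; passing group_keys=False keeps the result aligned with the original rows:
# group_keys=False leaves the result on the original index, so assignment aligns
df['new_time'] = df.groupby('name', group_keys=False)['time_of_action'].apply(lambda x: x - x.min())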

Others have already answered, but here's mine:
import pandas as pd
names = ['bob', 'alice', 'bob', 'bob', 'ali', 'alice', 'bob', 'ali', 'moji', 'ali', 'moji', 'ali', 'bob', 'bob', 'bob']
times = [21, 34, 37, 40, 55, 65, 67, 84, 88, 90, 91, 97, 104, 105, 108]
df = pd.DataFrame({'name': names, 'time_of_action': times})
def subtract_min(df):
    df['new_time'] = df['time_of_action'] - df['time_of_action'].min()
    return df
df.groupby('name').apply(subtract_min).sort_values('name')
As others have said, I am kind of guessing as well.

Related

How to drop Duplicates for each unique row value in Pandas?

I have the following dataframe:
df = pd.DataFrame({
    'ID': [42, 42, 42, 43, 43, 43, 58, 58, 58],
    'Thing': ['cup', 'cup', 'plate', 'plate', 'plate', 'plate', 'cup', 'cup', 'plate']
})
df
ID Thing
0 42 cup
1 42 cup
2 42 plate
3 43 plate
4 43 plate
5 43 plate
6 58 cup
7 58 cup
8 58 plate
I want to drop duplicates from the "Thing" column, but only for each unique ID. I want the result to look like this:
ID Thing
0 42 cup
2 42 plate
6 58 cup
8 58 plate
I tried this:
for id in df['ID'].unique():
    df = df.drop_duplicates(subset=['Thing'], keep='first')
But the result looks like this:
ID Thing
0 42 cup
2 42 plate
Does anyone know the best way to accomplish this in pandas?
Try:
df = df.drop_duplicates(subset=['ID', 'Thing'])
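For context on why the original loop fails: each iteration runs drop_duplicates over the whole frame, so duplicates are dropped across all IDs at once rather than per ID. Passing both columns to subset treats a row as a duplicate only when the same (ID, Thing) pair repeats. A minimal runnable sketch:
import pandas as pd

df = pd.DataFrame({
    'ID': [42, 42, 42, 43, 43, 43, 58, 58, 58],
    'Thing': ['cup', 'cup', 'plate', 'plate', 'plate', 'plate', 'cup', 'cup', 'plate']
})

# a row is dropped only if the same (ID, Thing) pair was already seen,
# so each ID keeps its first 'cup' and its first 'plate'
print(df.drop_duplicates(subset=['ID', 'Thing']))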

Pandas: Convert annual data to decade data

Background
I want to determine the global cumulative value of a variable for three separate decades between 1990 and 2014, i.e. the 1990s, 2000s, and 2010s. I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How to obtain a global sum for the period of different decades using features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []
# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []
for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]
# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names
# computing global values
global_values = data_binned.sum(axis=0)
This is a non-optimal method, owing to my limited experience with pandas. Kindly suggest a better method that uses pandas features. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decades statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all the decades implied by the column names of df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so that df is enriched with the decade statistics from df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
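A more compact alternative, offered as a sketch that assumes the column labels are integer years: group the columns themselves by decade and sum within each group. Transposing first keeps the operation on the ordinary row axis, which avoids the axis=1 groupby keyword deprecated in recent pandas versions.
import pandas as pd

df = pd.DataFrame(
    {1990: [1, 12], 1999: [1, 12], 2000: [34, 6], 2009: [34, 6], 2010: [3, 6], 2015: [3, 6]},
    index=['country_1', 'country_2']
)

# map each year column to the decade it falls in, e.g. 1999 -> 1990
decades = (df.columns // 10) * 10
df_decades = df.T.groupby(decades).sum().T
df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]
print(df_decades)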

How to flatten pandas dataframe

Here is my pandas dataframe, and I would like to flatten it. How can I do that?
The input I have
key column
1 {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'}
2 {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'}
3 {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
The expected output
All the health and name will become a column name of their own with their corresponding values. In no particular order.
health_1 health_2 health_3 health_4 name key
45 60 34 60 Tom 1
28 10 42 7 John 2
86 65 14 52 Adam 3
You can do it with a one-line solution:
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
Full version:
import pandas as pd
df = pd.DataFrame({"column":[
{'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'} ,
{'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'} ,
{'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
]})
df_expected = pd.concat([df, df['column'].apply(pd.Series)], axis = 1).drop('column', axis = 1)
print(df_expected)
DEMO: https://repl.it/repls/ButteryFrightenedFtpclient
This should work:
df['column'].apply(pd.Series)
Gives:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
Try:
pd.concat([pd.DataFrame(i, index=[0]) for i in df.column], ignore_index=True)
Output:
health_1 health_2 health_3 health_4 name
0 45 60 34 60 Tom
1 28 10 42 7 John
2 86 65 14 52 Adam
The solutions using apply are going overboard. You can create your desired DataFrame using a list of dictionaries like you have in your column Series. You can easily get this list of dictionaries by using the tolist method:
res = pd.concat([df.key, pd.DataFrame(df.column.tolist())], axis=1)
print(res)
key health_1 health_2 health_3 health_4 name
0 1 45 60 34 60 Tom
1 2 28 10 42 7 John
2 3 86 65 14 52 Adam
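For what it's worth, recent pandas versions also ship pd.json_normalize, which expands a list of dicts into columns and, unlike the tolist approach, also flattens nested dictionaries; a minimal sketch on the same data:
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3],
    'column': [
        {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'},
        {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'},
        {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
    ]
})

# json_normalize builds one column per dict key; join re-attaches 'key',
# since both frames share the same default RangeIndex
res = df[['key']].join(pd.json_normalize(df['column'].tolist()))
print(res)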
Not sure I understand - This is the default format for a DataFrame?
import pandas as pd
df = pd.DataFrame([
    {'health_1': 45, 'health_2': 60, 'health_3': 34, 'health_4': 60, 'name': 'Tom'},
    {'health_1': 28, 'health_2': 10, 'health_3': 42, 'health_4': 7, 'name': 'John'},
    {'health_1': 86, 'health_2': 65, 'health_3': 14, 'health_4': 52, 'name': 'Adam'}
])

Order Dataframe Index based on a second Dataframe

I have two DataFrames in Python, but the columns to be used as indexes (CodeNumber) are not in the same order. They need to be ordered the same way. Here is the code:
#generating DataFrames:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3)
data_SKU4 = pd.DataFrame(data=d4)
Then I set CodeNumber as the index:
data_SKU3.set_index('CodeNumber', inplace=True)
data_SKU4.set_index('CodeNumber', inplace=True)
If we print the resulting DataFrames, note that data_SKU3 has CodeNumber in the order 1234, 1235, 111, 101, while data_SKU4 has 1235, 1234, 101, 111.
Is there a way to order the CodeNumbers so both DataFrames are in the same order?
You can also sort values by CodeNumber on each dataframe by calling .sort_values(by = 'CodeNumber') before setting them as index:
d3 = {'CodeNumber': [1234, 1235, 111, 101], 'Date': [20150808, 20141201, 20180119, 20120720], 'Weight': [26, 32, 41, 24]}
d4 = {'CodeNumber': [1235, 1234, 101, 111], 'Date': [20160808, 20151201, 20180219, 20130720], 'Weight': [28, 25, 47, 3]}
data_SKU3 = pd.DataFrame(data=d3).sort_values(by = 'CodeNumber')
data_SKU4 = pd.DataFrame(data=d4).sort_values(by = 'CodeNumber')
data_SKU3.set_index('CodeNumber', inplace = True)
data_SKU4.set_index('CodeNumber', inplace = True)
Use sort_index if both indices contain the same values:
data_SKU3 = data_SKU3.set_index('CodeNumber').sort_index()
data_SKU4 = data_SKU4.set_index('CodeNumber').sort_index()
print (data_SKU3)
Date Weight
CodeNumber
101 20120720 24
111 20180119 41
1234 20150808 26
1235 20141201 32
print (data_SKU4)
Date Weight
CodeNumber
101 20180219 47
111 20130720 3
1234 20151201 25
1235 20160808 28
Another approach is to use reindex with the other DataFrame's index values, but this requires unique index values whose only difference is the ordering:
data_SKU3 = data_SKU3.set_index('CodeNumber')
data_SKU4 = data_SKU4.set_index('CodeNumber').reindex(index=data_SKU3.index)
print (data_SKU3)
Date Weight
CodeNumber
1234 20150808 26
1235 20141201 32
111 20180119 41
101 20120720 24
print (data_SKU4)
Date Weight
CodeNumber
1234 20151201 25
1235 20160808 28
111 20130720 3
101 20180219 47
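If the goal is only to get the two frames into a consistent row order for comparison, DataFrame.align can conform one index to the other in a single call; a sketch assuming CodeNumber is already the index on both frames:
# join='left' keeps data_SKU3's ordering and reindexes data_SKU4 to match it
data_SKU3, data_SKU4 = data_SKU3.align(data_SKU4, join='left', axis=0)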

Python numpy where function behavior

I have a question regarding numpy's where condition. I am able to use a where condition with the == operator, but not with an "is one string a substring of another string?" test.
CODE:
import pandas as pd
import datetime as dt
import numpy as np

data = {'name': ['Smith, Jason', 'Bush, Molly', 'Smith, Tina',
                 'Clinton, Jake', 'Hamilton, Amy'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns=['name', 'age', 'preTestScore',
                                 'postTestScore'])
print("BEFORE-----")
print(df)
print("AFTER-----")
df["Smith Family"] = np.where("Smith" in df['name'], 'Y', 'N')
print(df)
OUTPUT:
BEFORE-----
name age preTestScore postTestScore
0 Smith, Jason 42 4 25
1 Bush, Molly 52 24 94
2 Smith, Tina 36 31 57
3 Clinton, Jake 24 2 62
4 Hamilton, Amy 73 3 70
AFTER-----
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 N
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 N
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Why does the numpy.where condition not work in the above case?
I had expected Smith Family to have the values
Y
N
Y
N
N
but I did not get that output; as seen above, it is all N, N, N, N, N.
Besides the condition "Smith" in df['name'], I also tried str(df['name']).find("Smith") > -1, but that did not work either.
Any idea what is wrong, or what I could have done differently?
I think you need str.contains for a boolean mask:
print (df['name'].str.contains("Smith"))
0 True
1 False
2 True
3 False
4 False
Name: name, dtype: bool
df["Smith Family"]=np.where(df['name'].str.contains("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
Or str.startswith:
df["Smith Family"]=np.where(df['name'].str.startswith("Smith"),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
If you want to use in, which works with scalars, you need apply. This solution is faster, but doesn't work if there are NaN values in the name column:
df["Smith Family"]=np.where(df['name'].apply(lambda x: "Smith" in x),'Y','N' )
print (df)
name age preTestScore postTestScore Smith Family
0 Smith, Jason 42 4 25 Y
1 Bush, Molly 52 24 94 N
2 Smith, Tina 36 31 57 Y
3 Clinton, Jake 24 2 62 N
4 Hamilton, Amy 73 3 70 N
The behavior of np.where("Smith" in df['name'], 'Y', 'N') depends on what df['name'] produces; I assume some sort of numpy array. The rest is numpy:
In [733]: x=np.array(['one','two','three'])
In [734]: 'th' in x
Out[734]: False
In [744]: 'two' in np.array(['one','two','three'])
Out[744]: True
in is a whole string test, both for a list and an array of strings. It's not a substring test.
np.char has a bunch of functions that apply string functions to elements of an array. These are roughly the equivalent of np.array([x.fn() for x in arr]).
In [754]: x=np.array(['one','two','three'])
In [755]: np.char.startswith(x,'t')
Out[755]: array([False, True, True], dtype=bool)
In [756]: np.where(np.char.startswith(x,'t'),'Y','N')
Out[756]:
array(['N', 'Y', 'Y'],
dtype='<U1')
Or with find:
In [760]: np.char.find(x,'wo')
Out[760]: array([-1, 1, -1])
The pandas .str accessor does something similar, applying string methods to the elements of a Series.
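To see concretely why the original expression broadcast a single value: for a pandas Series, the in operator tests membership in the index, not in the values.
import numpy as np
import pandas as pd

s = pd.Series(['Smith, Jason', 'Bush, Molly'])

# 'in' checks the index labels (0 and 1 here), so this is False...
print('Smith' in s)                      # False
# ...and np.where collapses that single False to one scalar 'N',
# which then broadcasts down the whole column on assignment
print(np.where('Smith' in s, 'Y', 'N'))  # 'N'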
