Pandas: Calculate remaining time in grouping - python

I have a requirement to sort a table by date starting from the oldest. Total field is created by grouping name and kind fields and applying sum. Now for each row I need to calculate the remaining time in the same name-kind grouping.
The CSV looks like this:
date      name  kind  duration  total  remaining
1-1-2017  a     1     10        100    ?  should be 90
2-1-2017  b     1     5         35     ?  should be 30
3-1-2017  a     2     3         50     ?  should be 47
4-1-2017  b     2     1         25     ?  should be 24
5-1-2017  a     1     8         100    ?  should be 82
6-1-2017  b     1     2         35     ?  should be 28
7-1-2017  a     2     3         50     ?  should be 44
8-1-2017  b     2     6         25     ?  should be 18
...
My question is how do I calculate the remaining value while having the DataFrame grouped by name and kind?
My initial approach was to shift the column and add the duration values to each other, like this:
df['temp'] = df.groupby(['name', 'kind'])['duration'].apply(lambda x: x.shift() + x)
and then:
df['duration'] = df.apply(lambda x: x['total'] - x['temp'], axis=1)
But it did not work as expected.
Is there a clean way to do it, or is using iloc, ix, or loc somehow the way to go?
Thanks.

You could do something like:
df["cumsum"] = df.groupby(['name', 'kind'])["duration"].cumsum()
df["remaining"] = df["total"] - df["cumsum"]
You may need to be careful about resetting the index, though.
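For reference, a minimal sketch on the sample rows from the question (values copied from the table above) shows the cumulative sum reproducing the expected remaining values:
import pandas as pd

# sample data from the question, already sorted by date
df = pd.DataFrame({
    'name':     ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'kind':     [1, 1, 2, 2, 1, 1, 2, 2],
    'duration': [10, 5, 3, 1, 8, 2, 3, 6],
    'total':    [100, 35, 50, 25, 100, 35, 50, 25],
})

# subtract the running duration within each (name, kind) group
df['remaining'] = df['total'] - df.groupby(['name', 'kind'])['duration'].cumsum()
print(df['remaining'].tolist())   # [90, 30, 47, 24, 82, 28, 44, 18]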

Related

How to call a column by combining a string and another variable in a python dataframe?

Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the following additional variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags symbolize that the main variable values (e.g. Weight, Age and Height) are between the corresponding lower and upper limits. The limit columns may start with 2 different letters (in this dataframe I gave four examples: LR, UR, LS, US, but in my real dataframe I have more), and the limit values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can reshape using a temporary MultiIndex:
(df.set_index('ID')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
             d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
         axis=1))
   .stack()
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
ID Flag_Age Flag_Height Flag_Weight
0 1 1 1 1
1 2 0 0 1
2 3 0 1 1
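If the MultiIndex reshaping feels dense, a more explicit sketch is possible when the limit columns are known up front. The mapping below is assumed from the sample columns; a real dataframe with more prefixes would need its own mapping:
# assumed mapping of each variable to its (lower, upper) limit columns
limits = {'Weight': ('LR Weight', 'UR Weight'),
          'Age':    ('LS Age',    'US Age'),
          'Height': ('LS Height', 'US Height')}

flags = df[['ID']].copy()
for var, (lo, hi) in limits.items():
    # 1 if the value sits between its lower and upper limit, else 0
    flags['Flag_' + var] = df[var].between(df[lo], df[hi]).astype(int)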
So, if I understood correctly, you want to add columns with these new variables. The simplest solution to this would be df.insert().
You could make it something like this:
df.insert(loc, column, value)
where loc is the integer position at which the new column is inserted, column is its name, and value holds its values.
You can make up the new values in pretty much every way you can imagine: just copying a column or simple mathematical operations like +, -, *, / can be performed. But you can also apply a whole function that returns the flags based on your conditions as the values of the new column.
If the new columns can just be appended, you can even make a new column like this:
df['new column name'] = any values you want
I hope this helped.
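As a concrete sketch of df.insert with the question's columns (the flag computation below reuses the Weight limits from the question as an assumed example of the "whole function" mentioned above):
# place Flag_Weight at position 1, i.e. right after ID
df.insert(1, 'Flag_Weight',
          df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int))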

Is there a way to make a custom function in a pandas aggregation function?

I want to apply a custom function to a DataFrame.
e.g. DataFrame:
   index  City  Age
0      1     A   50
1      2     A   24
2      3     B   65
3      4     A   40
4      5     B   68
5      6     B   48
Function to apply
def count_people_above_60(age):
    ...  # I don't know if age can or can't be passed as a Series or list to perform any operation later
    return count_people_above_60
expecting to do something like
df.groupby(['City']).agg({"Age": ["mean", count_people_above_60]})
expected Output
City Mean People_Above_60
A 38 0
B 60.33 2
If performance is important, create a new column with the compared values converted to integers, so the count can be computed with the sum aggregation:
df = (df.assign(new=df['Age'].gt(60).astype(int))
        .groupby(['City'])
        .agg(Mean=('Age', 'mean'), People_Above_60=('new', 'sum')))
print (df)
Mean People_Above_60
City
A 38.000000 0
B 60.333333 2
Your solution should be changed to compare the values and sum, but it is slow with many groups or a large DataFrame:
def count_people_above_60(age):
    return (age > 60).sum()

df = (df.groupby(['City']).agg(Mean=('Age', 'mean'),
                               People_Above_60=('Age', count_people_above_60)))
print (df)
Mean People_Above_60
City
A 38.000000 0
B 60.333333 2
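For completeness, the custom function can also be written as an inline lambda in the named aggregation — a minimal sketch equivalent to the named function above, with the same performance caveat:
df = df.groupby(['City']).agg(Mean=('Age', 'mean'),
                              People_Above_60=('Age', lambda s: (s > 60).sum()))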

Sequential column names in a DataFrame - python

I work in python.
I have a large DataFrame df1 (25000 x 484) where, except for the first 4 columns, all the others can be divided into groups of 4 and carry a sequential number.
To be clear, not considering the first 4 columns, this is how the column headers look:
comp_type_1 / tag_1 / length_1 / value_1 / comp_type_2 / tag_2 / length_2 / value_2 / comp_type_3 / tag_3 / length_3 / value_3 ...
I would like to create df2 such that it contains only the columns length_i, where i goes from 1 to the last number (120). Is there a way to do that, considering that part of the column name is the same and what changes is only a number?
Thanks!
If I understand the question correctly, this is what you're looking for.
# setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(3, 12)),
                  columns=["comp_type_1", "tag_1", "length_1", "value_1",
                           "comp_type_2", "tag_2", "length_2", "value_2",
                           "comp_type_3", "tag_3", "length_3", "value_3"])
# column filter
df2 = df[[c for c in df.columns if 'length' in c]]
Output (df2)
length_1 length_2 length_3
0 91 81 23
1 42 92 50
2 61 79 76
Given a dataframe df, you can filter on columns:
df = df.filter(regex="length")
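As a side note, filter also accepts a like parameter for plain substring matching, and an anchored regex avoids accidental matches if some other column name ever contains the word "length" — a small sketch:
df2 = df.filter(like='length')            # substring match, no regex
df2 = df.filter(regex=r'^length_\d+$')    # only length_1 ... length_120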

Filtering rows on multiple string conditions on the same column

I want to filter a dataframe on multiple conditions. Let's say I have one column called 'detail'; I want to get a dataframe where the 'detail' column values match the following:
detail = unidecode.unidecode(str(row['detail']).lower())
So now I have all detail rows unidecoded and lowercased; then I want to extract the rows that start with some substring, like:
detail.startswith('bomb')
And finally also take the rows where another integer column equals 100.
I tried to do this but obviously it doesn't work:
llista_dfs['df_bombes'] = df_filtratge[df_filtratge['detail'].str.lower().startswith('bomb') or df_filtratge['family']==100]
This line above is what I would like to execute but I'm not sure which is the syntax to be able to achieve this in a single line of code (if that's possible).
That's an example of what the code should do:
Initial table:
detail family
0 bòmba 90
1 boMbá 87
2 someword 100
3 someotherword 65
4 Bombá 90
Result table:
detail family
0 bòmba 90
1 boMbá 87
2 someword 100
4 Bombá 90
Actually, #user3483203's comment is the right solution: to filter in pandas you use & and | instead of and and or. In any case, if you want to get rid of unidecode, you might use this solution:
import pandas as pd
txt="""0 bòmba 90
1 boMbá 87
2 someword 100
3 someotherword 65
4 Bombá 90"""
df = [list(filter(lambda x: x != '', t.split(' ')))[1:]
      for t in txt.split("\n")]
df = pd.DataFrame(df, columns=["details", 'family'])
df["family"] = df["family"].astype(int)
cond1 = (df["details"].str.normalize('NFKD')
           .str.encode('ascii', errors='ignore')
           .str.decode('utf-8')
           .str.lower()
           .str.startswith('bomba'))
cond2 = df["family"]==100
df[cond1 | cond2]
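Combining that accent normalization with the asker's original one-liner (keeping the asker's variable names), a sketch of the corrected line would be:
llista_dfs['df_bombes'] = df_filtratge[
    df_filtratge['detail'].str.normalize('NFKD')
        .str.encode('ascii', errors='ignore').str.decode('utf-8')
        .str.lower().str.startswith('bomb')
    | (df_filtratge['family'] == 100)
]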

Summing up values from one column based on values in another column

I have a dataframe something like below,
Timestamp count
20180702-06:26:20 50
20180702-06:27:11 10
20180702-07:05:10 20
20180702-07:10:10 30
20180702-08:27:11 40
I want output something like below,
Timestamp Sum_of_count
20180702-06 60
20180702-07 50
20180702-08 40
Basically, I need to find sum of count for every hour.
Any help is really appreciated.
You need to separate the hour part some way. One option is to split on the first ':' and select the first part with str[0], and then aggregate with sum:
s = df['Timestamp'].str.split(':', n=1).str[0]
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
Or convert the values to datetimes with to_datetime and format them back with strftime:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y%m%d-%H:%M:%S')
s = df['Timestamp'].dt.strftime('%Y%m%d-%H')
df1 = df['count'].groupby(s).sum().reset_index(name='Sum_of_count')
print (df1)
Timestamp Sum_of_count
0 20180702-06 60
1 20180702-07 50
2 20180702-08 40
Use
In [252]: df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
Out[252]:
Timestamp
2018-07-02-06 60
2018-07-02-07 50
2018-07-02-08 40
Name: count, dtype: int64
In [254]: (df.groupby(df.Timestamp.dt.strftime('%Y-%m-%d-%H'))['count'].sum()
.reset_index(name='Sum_of_count'))
Out[254]:
Timestamp Sum_of_count
0 2018-07-02-06 60
1 2018-07-02-07 50
2 2018-07-02-08 40
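As a side note, once Timestamp is a real datetime, dt.floor is another way to bucket by hour — a sketch that keeps datetime keys instead of formatted strings:
# round each timestamp down to the start of its hour and sum per bucket
df.groupby(df['Timestamp'].dt.floor('H'))['count'].sum()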
