Take sum of values before the row's date - python

I have a dataframe that looks like this:
df = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5}})
I need the resulting dataframe to look like this:
output = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5},
'total_score_per_id_before_date': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1, 5: 1}})
My code so far:
output= df[["id","score"]].groupby("id").sum()
However, this gives me the total sum of scores for each id. I need the sum of scores before the date in that specific row. Only the first score should not be discarded.

Use the cumulative sum on a series. Then subtract the current values, as you asked for the cumulative sum before the current index. Finally, add back the first values, otherwise they’re zero.
previously_accumulated_scores = df.groupby("id")["score"].cumsum() - df["score"]
firsts = df.groupby("id").first().reset_index()
df2 = df.merge(firsts, on=["id", "date"], how="left", suffixes=("", "_r"))
df["total_score_per_id_before_date"] = previously_accumulated_scores + df2.score_r.fillna(0)
The merge could be done more elegantly, by changing the index to a MultiIndex, but that’s a style preference.
Note: this assumes your DataFrame is sorted by the date-like column. groupby preserves the order of rows within each group (source: docs).
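As a possible alternative that avoids the merge, here is a minimal sketch, again assuming the rows are already in date order. It flags the first row of each id with cumcount and keeps that row's own score there. The names `scores`, `before` and `is_first` are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 3, 2, 2, 1, 3],
    "date": ["11/11/2018", "11/12/2018", "11/13/2018",
             "11/14/2018", "11/15/2018", "11/16/2018"],
    "score": [1, 1, 3, 2, 0, 5],
})

scores = df.groupby("id")["score"]

# Sum of all earlier scores for the same id (the current row is excluded).
before = scores.cumsum() - df["score"]

# True for the first row of each id, where nothing comes "before".
is_first = scores.cumcount() == 0

# Keep the row's own score on first rows, the earlier total everywhere else.
df["total_score_per_id_before_date"] = before.where(~is_first, df["score"])
print(df)
```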

Related

Creating a nested dictionary using for loop

I have a larger program that produces datetime objects (YYYY-MM-DD) for some events over two years (2021, 2022), and I want to group them in a nested dictionary structure. For a particular event, I want the following structure:
event_name:
{2021:
{01:
number_of_datetime_having_month january,
02:
number_of_datetime_having_month_feb
...etc etc upto december},
2022:
{01:
number_of_datetime_having_month_january,
........etc etc upto december}
}
I am planning to write this data to csv and plot this afterwards.
I am wondering what will be the best approach. Hard-coding the schema beforehand?
from datetime import datetime, timedelta
datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]
# Sparse result (zero-counts excluded):
result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1
# Non-sparse result:
result = {}
for y in set(o.year for o in datetimes):
    result[y] = {}
    for m in range(1,13):
        result[y][m] = 0
for dt in datetimes:
    result[dt.year][dt.month] += 1
# Output result
from pprint import pprint
pprint(result)
Sparse output:
{2022: {9: 1, 10: 2, 11: 1, 12: 2},
2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 2}}
Non-sparse output:
{2022: {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 1,
10: 2,
11: 1,
12: 2},
2023: {1: 1,
2: 2,
3: 1,
4: 2,
5: 1,
6: 2,
7: 1,
8: 2,
9: 2,
10: 0,
11: 0,
12: 0}}
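The same non-sparse result can also be sketched with collections.Counter, which counts (year, month) pairs in one pass. This is only an illustration: the fixed start date is an assumption, chosen so that the output is reproducible and falls in 2021 and 2022.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed, fixed start date (not from the question) so results are repeatable.
start = datetime(2021, 1, 15)
datetimes = [start + timedelta(days=20 * i) for i in range(20)]

# Count how many datetimes fall into each (year, month) pair.
counts = Counter((dt.year, dt.month) for dt in datetimes)

# Build the nested dict, filling months without events with 0.
years = sorted({dt.year for dt in datetimes})
result = {y: {m: counts[(y, m)] for m in range(1, 13)} for y in years}

for year, months in result.items():
    print(year, months)
```

Because a Counter returns 0 for missing keys, no schema has to be hard-coded beyond the range of months.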
I made some changes to kwiknik's earlier answer (the sparse one). I have to admit his approach is more elegant; nevertheless, I am posting my approach too.
from datetime import datetime, timedelta
datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]
result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1
############################################################################
count=0
for year in result.keys():
    for k in range(1,13,1):
        for items in result[year].keys():
            if items==k:
                pass
            else:
                count=count+1
        if count==len(result[year].keys()):
            result[year][k]='0'
        count=0
    kk= dict(sorted(result[year].items()))
    result[year]=kk
print(result)
Output
{2022: {1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: 1, 10: 2, 11: 1, 12: 2}, 2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 1, 11: '0', 12: '0'}}

Custom function to replace missing values in dataframe with median located in pivot table

I am attempting to write a function to replace missing values in the 'total_income' column with the median 'total_income' provided by the pivot table, using the row's 'education' and 'income_type' to index the pivot table. I want to fill the gaps with these medians so that the imputed values are as representative as possible. Here is what I am testing.
These are the first 5 rows of the dataframe as a dictionary:
{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'children': {0: 1, 1: 1, 2: 0, 3: 3, 4: 0},
'days_employed': {0: 8437.673027760233,
1: 4024.803753850451,
2: 5623.422610230956,
3: 4124.747206540018,
4: 340266.07204682194},
'dob_years': {0: 42, 1: 36, 2: 33, 3: 32, 4: 53},
'education': {0: "bachelor's degree",
1: 'secondary education',
2: 'secondary education',
3: 'secondary education',
4: 'secondary education'},
'education_id': {0: 0, 1: 1, 2: 1, 3: 1, 4: 1},
'family_status': {0: 'married',
1: 'married',
2: 'married',
3: 'married',
4: 'civil partnership'},
'family_status_id': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'gender': {0: 'F', 1: 'F', 2: 'M', 3: 'M', 4: 'F'},
'income_type': {0: 'employee',
1: 'employee',
2: 'employee',
3: 'employee',
4: 'retiree'},
'debt': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0},
'total_income': {0: 40620.102,
1: 17932.802,
2: 23341.752,
3: 42820.568,
4: 25378.572},
'purpose': {0: 'purchase of the house',
1: 'car purchase',
2: 'purchase of the house',
3: 'supplementary education',
4: 'to have a wedding'},
'age_group': {0: 'adult',
1: 'adult',
2: 'adult',
3: 'adult',
4: 'older adult'}}
def fill_income(row):
    total_income = row['total_income']
    age_group = row['age_group']
    income_type = row['income_type']
    education = row['education']
    table = df.pivot_table(index=['age_group','income_type' ], columns='education', values='total_income', aggfunc='median')
    if total_income == 'NaN':
        if age_group =='adult':
            return table.loc[education, income_type]
My desired output is the pivot table value (the median total_income) for the dataframe row's given education and income_type. When I test it, it returns 'None'.
Thanks in advance for your time helping me with this problem!
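One likely reason for the None: a missing value in a float column is NaN, not the string 'NaN', so the comparison is never true and the function falls through without reaching a return. A possible way around the row-wise function is sketched below. It assumes the goal is the median per ('education', 'income_type') group and runs on a small made-up frame, not on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the columns the fill needs.
df = pd.DataFrame({
    "education": ["secondary education"] * 4 + ["bachelor's degree"],
    "income_type": ["employee", "employee", "employee", "retiree", "employee"],
    "total_income": [10000.0, 30000.0, np.nan, 25000.0, 40000.0],
})

# Median of each (education, income_type) group, aligned to the original rows.
# NaN values are skipped when the median is computed.
group_median = (
    df.groupby(["education", "income_type"])["total_income"].transform("median")
)

# Replace only the missing values; existing incomes are left unchanged.
df["total_income"] = df["total_income"].fillna(group_median)
print(df)
```

If a row-wise function is still preferred, pd.isna(total_income) is the usual test for a missing value.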

python dataframe to dictionary with multiple columns in keys and values

I am working on an optimization problem and need to create indexing to build a mixed-integer mathematical model. I am using Python dictionaries for the task. Below is a sample of my dataset. The full dataset is expected to have about 400K rows, if that matters.
# sample input data
pd.DataFrame.from_dict({'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}})
The data frame looks like this -
I am looking to generate a dictionary from each row of this dataframe where the keys and values are tuples made of multiple column values. The output would look like this -
(origin, dest, product, ship_date): (origin, dest, product, truck_in)
# for example, first two rows will become a dictionary key-value pair like
{('perris', 'alexandria', 'bike', '02/25/2022'): ('perris', 'alexandria', 'bike', '03/01/2022'),
('perris', 'alexandria', 'bike', '02/26/2022'): ('perris', 'alexandria', 'bike', '03/02/2022')}
I am very new to python and couldn't figure out how to do this. Any help is appreciated. Thanks!
You can loop through the DataFrame.
Assuming your DataFrame is called "df" this gives you the dict.
result_dict = {}
for idx, row in df.iterrows():
    result_dict[(row.origin, row.dest, row['product'], row.ship_date)] = (
        row.origin, row.dest, row['product'], row.truck_in)
Since looping through 400k rows will take some time, have a look at tqdm (https://tqdm.github.io/) to get a progress bar with a time estimate that quickly tells you if the approach works for your dataset.
Also, note that 400K dictionary entries may take up a lot of memory, so you may want to estimate whether the dict fits in memory.
Another way, which uses more memory but is faster, is to do it in pandas.
Create a new column with the value for the dictionary:
df['value'] = df.apply(lambda x: (x.origin, x.dest, x['product'], x.truck_in), axis=1)
Then set the index and convert to dict
df.set_index(['origin','dest','product','ship_date'])['value'].to_dict()
The approach below splits the initial dataframe into two dataframes that will be the source of the keys and values in the dictionary. These are then converted to arrays in order to get away from working with dataframes as soon as possible. The arrays are converted to tuples and zipped together to create the key:value pairs.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(
{'origin': {0: 'perris', 1: 'perris', 2: 'perris', 3: 'perris', 4: 'perris'},
'dest': {0: 'alexandria', 1: 'alexandria', 2: 'alexandria', 3: 'alexandria', 4: 'alexandria'},
'product': {0: 'bike', 1: 'bike', 2: 'bike', 3: 'bike', 4: 'bike'},
'lead_time': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4}, 'build_time': {0: 2, 1: 2, 2: 2, 3: 2, 4: 2},
'ship_date': {0: '02/25/2022', 1: '02/26/2022', 2: '02/27/2022', 3: '02/28/2022', 4: '03/01/2022'},
'ship_day': {0: 5, 1: 6, 2: 7, 3: 1, 4: 2},
'truck_in': {0: '03/01/2022', 1: '03/02/2022', 2: '03/03/2022', 3: '03/04/2022', 4: '03/07/2022'},
'product_in': {0: '03/03/2022', 1: '03/04/2022', 2: '03/05/2022', 3: '03/06/2022', 4: '03/09/2022'}}
)
#display(df)
#desired output: (origin, dest, product, ship_date): (origin, dest, product, truck_in)
#slice df to key/value chunks
#list to array
# set_index returns a new frame, which avoids modifying a slice of df in place
ship = df[['origin', 'dest', 'product', 'ship_date']].set_index('origin')
keys_array = ship.to_records()
truck = df[['origin', 'dest', 'product', 'truck_in']].set_index('origin')
values_array = truck.to_records()
#array_of_tuples = map(tuple, an_array)
keys_map = map(tuple, keys_array)
values_map = map(tuple, values_array)
#tuple_of_tuples = tuple(array_of_tuples)
keys_tuple = tuple(keys_map)
values_tuple = tuple(values_map)
zipp = zip(keys_tuple, values_tuple)
dict2 = dict(zipp)
print(dict2)
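The zip idea above can probably be shortened by zipping the columns directly, without the intermediate record arrays. This is a sketch on the sample rows, not a benchmarked recommendation for the 400K-row case. Note that the column has to be accessed as df["product"], because df.product is a DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({
    "origin": ["perris"] * 5,
    "dest": ["alexandria"] * 5,
    "product": ["bike"] * 5,
    "ship_date": ["02/25/2022", "02/26/2022", "02/27/2022", "02/28/2022", "03/01/2022"],
    "truck_in": ["03/01/2022", "03/02/2022", "03/03/2022", "03/04/2022", "03/07/2022"],
})

# Each zip yields one tuple per row, built from the listed columns.
keys = zip(df["origin"], df["dest"], df["product"], df["ship_date"])
values = zip(df["origin"], df["dest"], df["product"], df["truck_in"])

# Pair the key tuples with the value tuples row by row.
result = dict(zip(keys, values))
print(result)
```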

How to change from index to multiindex - pandas

I've got a data frame structured as below:
dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
id is set as an index, but now I would like to create a MultiIndex from var3 and id. But my following attempt fails:
ex.set_index(['var3', 'id'])
How can I set a MultiIndex straight from the existing Index? I know I can reset_index first and then set a MultiIndex, but it feels like there has to be a more elegant way.
DataFrame.set_index has an append argument, which is False by default.
If you have a DataFrame already indexed by "id", and you'd like to append "var3" to that, simply invoke:
new_df = ex.set_index("var3", append=True)
As suggested by @piRSquared in the comments, you can also swap the order if you would like "var3" to come first, by chaining a call to swaplevel:
new_df = ex.set_index("var3", append=True).swaplevel(0, 1)
Like this:
ex.set_index(['var3', ex.index])
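A small check that these options agree, run on a cut-down, hypothetical version of the frame above (three rows, one value column). It compares the append-and-swap route, the route that passes the existing index, and the reset_index route mentioned in the question.

```python
import pandas as pd

# Reduced, illustrative version of the question's frame.
ex = pd.DataFrame(
    {"id": [11, 12, 19], "var1": [20.3, 21.1, 39.2], "var3": [1, 1, 2]}
).set_index("id")

# Option 1: append var3 to the existing index, then put var3 first.
appended = ex.set_index("var3", append=True).swaplevel(0, 1)

# Option 2: pass the column name together with the existing index object.
direct = ex.set_index(["var3", ex.index])

# Option 3: the reset_index route from the question.
via_reset = ex.reset_index().set_index(["var3", "id"])

print(list(appended.index.names))
print(list(direct.index))
```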

Probability Density Function using pandas data

I would like to model the probability of an event occurring given the existence of the previous event.
To give you more context, I plan to group my data by anonymous_id, sort the values of the grouped dataset by timestamp (ts) and calculate the probability of the sequence of sources (utm_source) the person goes through. The person is represented by a unique anonymous_id. So the desired end goal is the probability of someone who came from a Facebook source then coming through from a Google source, etc.
I have been told that something like SciPy's gaussian_kde would be useful for this. However, from playing around with it, it seems to require numerical inputs.
test_sample = test_sample.groupby('anonymous_id').apply(lambda x: x.sort_values(['ts'])).reset_index(drop=True)
That is what I have so far, and I am not sure what to try next.
I have also tried this, but I don't think that it makes much sense:
stats.gaussian_kde(test_two['utm_source'])
Here is a sample of my data
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902',
5: '00022b83-240e-4ef9-aaad-ac84064bb902',
6: '00022b83-240e-4ef9-aaad-ac84064bb902',
7: '00022b83-240e-4ef9-aaad-ac84064bb902',
8: '00022b83-240e-4ef9-aaad-ac84064bb902',
9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000',
5: '2017-10-11 08:20:37.345000',
6: '2017-10-11 08:21:01.322000',
7: '2017-10-11 08:21:14.145000',
8: '2017-10-11 08:23:47.526000',
9: '2019-06-12 10:42:50.401000'},
'utm_source': {0: nan,
1: 'facebook',
2: 'facebook',
3: 'google',
4: nan,
5: 'facebook',
6: 'google',
7: 'adwords',
8: 'youtube',
9: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 9: 1}}
Note: I converted the dataframe to a dictionary.
Here is one way you can do it (if I understand correctly):
from itertools import chain
from collections import Counter
groups = (df
.sort_values(by='ts')
.dropna()
.groupby('anonymous_id').utm_source
.agg(list)
.reset_index()
)
groups['transitions'] = groups.utm_source.apply(lambda x: list(zip(x,x[1:])))
all_transitions = Counter(chain(*groups.transitions.tolist()))
Which gives you (on your example data):
In [42]: all_transitions
Out[42]:
Counter({('google', 'facebook'): 1,
('facebook', 'google'): 1,
('google', 'adwords'): 1,
('adwords', 'youtube'): 1})
Or are you looking for something different?
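If probabilities rather than raw counts are needed, one possible next step is to divide each transition count by the total number of transitions leaving its source. The sketch below starts from the Counter shown above. The names `totals` and `probabilities` are illustrative, and treating the data as first-order transitions is an assumption.

```python
from collections import Counter, defaultdict

# Transition counts as produced on the example data above.
all_transitions = Counter({
    ("google", "facebook"): 1,
    ("facebook", "google"): 1,
    ("google", "adwords"): 1,
    ("adwords", "youtube"): 1,
})

# Total number of transitions that start from each source.
totals = defaultdict(int)
for (src, _dst), n in all_transitions.items():
    totals[src] += n

# Estimated P(next source = dst | current source = src).
probabilities = {
    (src, dst): n / totals[src] for (src, dst), n in all_transitions.items()
}

for pair, p in probabilities.items():
    print(pair, p)
```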
