I have a larger program from which I obtain datetime objects for some events (YYYY-MM-DD) spanning two years (2021, 2022), and I want to group the data into a nested dictionary. For a particular event, I want the following structure:
event_name:
    {2021:
        {01: number_of_datetimes_with_month_january,
         02: number_of_datetimes_with_month_february,
         ...and so on up to december},
     2022:
        {01: number_of_datetimes_with_month_january,
         ...and so on up to december}
    }
I am planning to write this data to a CSV file and plot it afterwards.
I am wondering what the best approach would be. Hard-coding the schema beforehand?
from datetime import datetime, timedelta

datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]

# Sparse result (zero-counts excluded):
result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1

# Non-sparse result:
result = {}
for y in set(o.year for o in datetimes):
    result[y] = {}
    for m in range(1, 13):
        result[y][m] = 0
for dt in datetimes:
    result[dt.year][dt.month] += 1

# Output result
from pprint import pprint
pprint(result)
Sparse output:
{2022: {9: 1, 10: 2, 11: 1, 12: 2},
2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 2}}
Non-sparse output:
{2022: {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 1,
10: 2,
11: 1,
12: 2},
2023: {1: 1,
2: 2,
3: 1,
4: 2,
5: 1,
6: 2,
7: 1,
8: 2,
9: 2,
10: 0,
11: 0,
12: 0}}
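Since the question mentions writing the data to a CSV for plotting afterwards, here is a minimal sketch of that step (the layout is my assumption, one row per year and one column per month, building on the non-sparse result above):

import csv

with open('event_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Header: year followed by month numbers 1..12.
    writer.writerow(['year'] + list(range(1, 13)))
    for year in sorted(result):
        writer.writerow([year] + [result[year][m] for m in range(1, 13)])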
I made some changes to the earlier answer by kwiknik (the sparse one); however, I have to admit his approach is more elegant. Nevertheless, I am posting my approach too.
from datetime import datetime, timedelta

datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]

result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1

############################################################################
# Fill in the missing months with the string '0', then sort each year's months.
count = 0
for year in result.keys():
    for k in range(1, 13):
        # Count how many existing month keys differ from k.
        for items in result[year].keys():
            if items == k:
                pass
            else:
                count = count + 1
        # If every existing key differed, month k is missing: fill it in.
        if count == len(result[year].keys()):
            result[year][k] = '0'
        count = 0
    kk = dict(sorted(result[year].items()))
    result[year] = kk
print(result)
Output
{2022: {1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: 1, 10: 2, 11: 1, 12: 2}, 2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 1, 11: '0', 12: '0'}}
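For what it's worth, the fill-in loop above could also be compressed into a dict comprehension (a sketch; note it fills with the integer 0 rather than the string '0' used above, which keeps the counts numeric for plotting):

# Replacing each value with an existing key's count, or 0 when missing;
# iterating 1..12 in order also leaves the months sorted.
for year, months in result.items():
    result[year] = {m: months.get(m, 0) for m in range(1, 13)}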
When I use:
df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
it stops with an error because the Type 2 column contains values that are invalid for hex conversion (negative values, NaN, strings, ...). How can I ignore this error or mark the invalid values as zero? Here is my dataframe as a dict:
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
You can create a small helper function that handles this exception and use it in applymap. For example:
def lambda_int(n):
    try:
        return int(n, 16)
    except ValueError:
        return 0

df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].applymap(lambda_int)
Please go through this; I reconstructed your question and give steps to follow.
1. The first dictionary you provided does not have a NaN value, it has the string "NaN":
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
import pandas as pd
df = pd.DataFrame(data)
df.head()
To check for NaN values in your df and remove them:
columns_with_na = df.isna().sum()
#keep only columns with at least 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False))) #print them in descending order
This prints 0 and 0 because there are no NaN values.
Next, I reconstructed your data to include a real NaN using numpy.nan:
import numpy as np
#recreated a dataset and included a nan value : np.nan at Type 2
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: np.nan,
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df2 = pd.DataFrame(data)
df2.head()
#count missing values per column
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False)))
This prints 1 and 1 because there is a NaN in the Type 2 column.
#drop nan values
df2 = df2.dropna(how = 'any')
#count missing values per column again
columns_with_na = df2.isna().sum()
#keep only columns with at least 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
#prints 0 because I dropped all the nan values
df2.head()
To fill NaN values in the df with 0, use:
df2.fillna(0, inplace = True)
To fill NaN with 0 in df2['Type 2'] only:
#if you don't want to change the original dataframe, set inplace to False
df2['Type 2'].fillna(0, inplace = True) #inplace is set to True to change the original df
I am struggling with the ipywidgets module. I am trying to make a plot where you can toggle lines on/off with checkboxes, one per province.
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets

fig, ax = plt.subplots(figsize=(10, 10))
sns.lineplot(data=df5, x="Date_of_report", y="Total_reported", hue="Province", ax=ax)

provinces = df5["Province"].unique()
chk = [widgets.Checkbox(description=a) for a in provinces]

def updatePlot(**kwargs):
    print([(k, v) for k, v in kwargs.items()])

widgets.interact(updatePlot, **{c.description: c.value for c in chk})
As you can see, I can draw the checkboxes and print out their status, but I don't know how to update the seaborn line plot, so that when you select, say, Drenthe, it only shows the line for Drenthe.
Here is the dataframe as a dict:
{'Date_of_report': {0: Timestamp('2020-03-13 10:00:00'), 1: Timestamp('2020-03-13 10:00:00'), 2: Timestamp('2020-03-13 10:00:00'), 3: Timestamp('2020-03-13 10:00:00'), 4: Timestamp('2020-03-13 10:00:00'), 5: Timestamp('2020-03-13 10:00:00'), 6: Timestamp('2020-03-13 10:00:00'), 7: Timestamp('2020-03-13 10:00:00'), 8: Timestamp('2020-03-13 10:00:00'), 9: Timestamp('2020-03-13 10:00:00')}, 'Province': {0: 'Drenthe', 1: 'Flevoland', 2: 'Friesland', 3: 'Gelderland', 4: 'Groningen', 5: 'Limburg', 6: 'Noord-Brabant', 7: 'Noord-Holland', 8: 'Overijssel', 9: 'Utrecht'}, 'Total_reported': {0: 14, 1: 7, 2: 8, 3: 64, 4: 4, 5: 71, 6: 377, 7: 66, 8: 18, 9: 83}, 'Hospital_admission': {0: 0, 1: 3, 2: 2, 3: 9, 4: 1, 5: 17, 6: 65, 7: 4, 8: 0, 9: 7}, 'Deceased': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 3, 6: 5, 7: 0, 8: 0, 9: 0}}
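One possible direction (a sketch, not a tested answer; it assumes df5 and chk from the question above): rebuild the plot inside updatePlot, keeping only the provinces whose checkboxes are ticked.

def updatePlot(**kwargs):
    # interact passes each checkbox's current value, keyed by its description.
    selected = [prov for prov, checked in kwargs.items() if checked]
    fig, ax = plt.subplots(figsize=(10, 10))
    if selected:
        sns.lineplot(data=df5[df5["Province"].isin(selected)],
                     x="Date_of_report", y="Total_reported",
                     hue="Province", ax=ax)
    plt.show()

widgets.interact(updatePlot, **{c.description: c.value for c in chk})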
I have a dataframe that looks like this:
df = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5}})
I need the resulting dataframe to look like this:
output = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5},
'total_score_per_id_before_date': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1, 5: 1}})
My code so far:
output = df[["id","score"]].groupby("id").sum()
However, this gives me the total sum of scores for each id. I need the sum of the scores before the date in that specific row; only the first score per id should not be discarded (an id's first row should keep its own score, as in the expected output).
Use the cumulative sum on a series. Then subtract the current values, as you asked for the cumulative sum before the current index. Finally, add back the first values, otherwise they’re zero.
# Select the score column before cumsum; cumulating the whole frame would choke on the string date column.
previously_accumulated_scores = df.groupby("id")["score"].cumsum() - df["score"]
firsts = df.groupby("id").first().reset_index()
df2 = df.merge(firsts, on=["id", "date"], how="left", suffixes=("", "_r"))
df["total_score_per_id_before_date"] = previously_accumulated_scores + df2.score_r.fillna(0)
The merge could be done more elegantly, by changing the index to a MultiIndex, but that’s a style preference.
Note: this assumes your DataFrame is sorted by the date-like column (groupby preserves the order of rows within each group (source: docs)).
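A more compact alternative (a sketch, relying on the same date-sorted assumption): shift each group's cumulative sum down one row and seed the first row with its own score.

df["total_score_per_id_before_date"] = (
    df.groupby("id")["score"]
      .transform(lambda s: s.cumsum().shift(fill_value=s.iloc[0]))
)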
I would like to model the probability of an event occurring given the occurrence of the previous event.
To give you more context: I plan to group my data by anonymous_id, sort the values of each group by timestamp (ts), and calculate the probability of the sequence of sources (utm_source) the person goes through. The person is represented by a unique anonymous_id. So the desired end goal is the probability that someone who came from a Facebook source then comes through a Google source, etc.
I have been told that a package such as SciPy's gaussian_kde would be useful for this. However, from playing around with it, it requires numerical inputs. So far I have tried:
test_sample = test_sample.groupby('anonymous_id').apply(lambda x: x.sort_values(['ts'])).reset_index(drop=True)
but I am not sure what to try next. I have also tried this, though I don't think it makes much sense:
stats.gaussian_kde(test_two['utm_source'])
Here is a sample of my data:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902',
5: '00022b83-240e-4ef9-aaad-ac84064bb902',
6: '00022b83-240e-4ef9-aaad-ac84064bb902',
7: '00022b83-240e-4ef9-aaad-ac84064bb902',
8: '00022b83-240e-4ef9-aaad-ac84064bb902',
9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000',
5: '2017-10-11 08:20:37.345000',
6: '2017-10-11 08:21:01.322000',
7: '2017-10-11 08:21:14.145000',
8: '2017-10-11 08:23:47.526000',
9: '2019-06-12 10:42:50.401000'},
'utm_source': {0: nan,
1: 'facebook',
2: 'facebook',
3: 'google',
4: nan,
5: 'facebook',
6: 'google',
7: 'adwords',
8: 'youtube',
9: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 9: 1}}
Note: I converted the dataframe to a dictionary.
Here is one way you can do it (if I understand correctly):
from itertools import chain
from collections import Counter
groups = (df
.sort_values(by='ts')
.dropna()
.groupby('anonymous_id').utm_source
.agg(list)
.reset_index()
)
groups['transitions'] = groups.utm_source.apply(lambda x: list(zip(x,x[1:])))
all_transitions = Counter(chain(*groups.transitions.tolist()))
Which gives you (on your example data):
In [42]: all_transitions
Out[42]:
Counter({('google', 'facebook'): 1,
('facebook', 'google'): 1,
('google', 'adwords'): 1,
('adwords', 'youtube'): 1})
Or are you looking for something different?
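If you are after probabilities rather than raw counts, a minimal sketch (my addition, not part of the original answer) normalizes the counts per source to get the conditional probability P(next source | current source):

from collections import defaultdict

# Total outgoing transitions per source.
totals = defaultdict(int)
for (src, dst), n in all_transitions.items():
    totals[src] += n

transition_probs = {(src, dst): n / totals[src]
                    for (src, dst), n in all_transitions.items()}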
I want to compare average revenue "in offer" vs average revenue "out of offer" for each SKU.
When I merge the two dataframes below on sku, I get multiple rows for each entry because sku is not unique in the second dataframe. For example, every instance of sku = 1 will have two entries, because test_offer contains 2 separate offers for sku 1. However, only one offer can be live for a SKU at any time, which should satisfy the condition:
test_ga['day'] >= test_offer['start_day'] & test_ga['day'] <= test_offer['end_day']
dataset 1
test_ga = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 3, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 3},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34, 16: 71, 17: 69, 18: 90, 19: 93, 20: 43, 21: 45, 22: 57, 23: 89}} )
dataset 2
test_offer = pd.DataFrame( {'sku': {0: 1, 1: 1, 2: 2},
'offer_number': {0: 5, 1: 6, 2: 7},
'start_day': {0: 2, 1: 6, 2: 4},
'end_day': {0: 4, 1: 7, 2: 8}} )
Expected Output
expected_output = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2},
'offer': {0: float('nan'), 1: '5', 2: '5', 3: '5', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '7', 12: '7', 13: '7', 14: '7', 15: '7'},
'start_day': {0: float('nan'), 1: '2', 2: '2', 3: '2', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '4', 12: '4', 13: '4', 14: '4', 15: '4'},
'end_day': {0: float('nan'), 1: '4', 2: '4', 3: '4', 4: float('nan'), 5: '7', 6: '7', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '8', 12: '8', 13: '8', 14: '8', 15: '8'},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34}} )
I did actually find a solution based on this SO answer, but it took me a while, and the question there is not really clear. I thought it could still be useful to create this question even though I found a solution. Besides, there are probably better ways to achieve this that do not require creating a dummy variable and sorting the dataframe. If this question is a duplicate, let me know and I will delete it.
One possible solution:
test_data = pd.merge(test_ga, test_offer, on = 'sku')
# Flag whether each row's day falls inside the offer window.
test_data['is_offer'] = np.where((test_data['day'] >= test_data['start_day']) & (test_data['day'] <= test_data['end_day']), 1, 0)
expected_output = test_data.sort_values(['sku','day','is_offer']).groupby(['day', 'sku']).tail(1)
and then clean up the data, setting NaN values for the rows that are not in an offer:
expected_output['start_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['start_day'])
expected_output['end_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['end_day'])
expected_output['offer_number'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['offer_number'])
expected_output
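Alternatively, here is a sketch without the dummy variable and the sort (my suggestion, under the question's assumption that at most one offer is live per SKU at a time, so the join cannot duplicate rows): keep only the merged rows whose day falls inside the offer window, then left-join them back onto the offer-bearing part of test_ga so out-of-offer days come back as NaN.

merged = test_ga.merge(test_offer, on='sku')
in_offer = merged[(merged['day'] >= merged['start_day']) &
                  (merged['day'] <= merged['end_day'])]
# Restrict to SKUs that have offers, mirroring the inner merge above
# (the expected output excludes sku 3, which has no offers).
expected_output = (test_ga[test_ga['sku'].isin(test_offer['sku'])]
                   .merge(in_offer[['sku', 'day', 'offer_number',
                                    'start_day', 'end_day']],
                          on=['sku', 'day'], how='left'))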