I have a larger program from which I obtain datetime objects for some events (YYYY-MM-DD) spanning two years (2021, 2022), and I want to group the data into a nested dictionary. For a particular event, I want the following structure:
event_name:
    {2021:
        {01: number_of_datetimes_with_month_january,
         02: number_of_datetimes_with_month_february,
         ...and so on up to december},
     2022:
        {01: number_of_datetimes_with_month_january,
         ...and so on up to december}
    }
I am planning to write this data to a CSV file and plot it afterwards.
I am wondering what the best approach would be. Hard-coding the schema beforehand?
from datetime import datetime, timedelta

datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]

# Sparse result (zero-counts excluded):
result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1

# Non-sparse result:
result = {}
for y in set(o.year for o in datetimes):
    result[y] = {}
    for m in range(1, 13):
        result[y][m] = 0
for dt in datetimes:
    result[dt.year][dt.month] += 1

# Output result
from pprint import pprint
pprint(result)
Sparse output:
{2022: {9: 1, 10: 2, 11: 1, 12: 2},
2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 2}}
Non-sparse output:
{2022: {1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 1,
10: 2,
11: 1,
12: 2},
2023: {1: 1,
2: 2,
3: 1,
4: 2,
5: 1,
6: 2,
7: 1,
8: 2,
9: 2,
10: 0,
11: 0,
12: 0}}
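Since the question mentions writing the data to a CSV for plotting afterwards, here is a minimal sketch of that step (the layout is my assumption, one row per year and one column per month, building on the non-sparse result above):

import csv

with open('event_counts.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # Header: year followed by month numbers 1..12.
    writer.writerow(['year'] + list(range(1, 13)))
    for year in sorted(result):
        writer.writerow([year] + [result[year][m] for m in range(1, 13)])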
I made some changes to the earlier answer by kwiknik (the sparse one); however, I have to admit his approach is more elegant. Nevertheless, I am posting my approach too.
from datetime import datetime, timedelta

datetimes = [datetime.now() + timedelta(days=20*i) for i in range(20)]

result = {}
for dt in datetimes:
    months_data = result.setdefault(dt.year, {})
    months_data[dt.month] = months_data.setdefault(dt.month, 0) + 1

############################################################################
# Fill in the missing months with the string '0', then sort each year's months.
count = 0
for year in result.keys():
    for k in range(1, 13):
        # Count how many existing month keys differ from k.
        for items in result[year].keys():
            if items == k:
                pass
            else:
                count = count + 1
        # If every existing key differed, month k is missing: fill it in.
        if count == len(result[year].keys()):
            result[year][k] = '0'
        count = 0
    kk = dict(sorted(result[year].items()))
    result[year] = kk
print(result)
Output
{2022: {1: '0', 2: '0', 3: '0', 4: '0', 5: '0', 6: '0', 7: '0', 8: '0', 9: 1, 10: 2, 11: 1, 12: 2}, 2023: {1: 1, 2: 2, 3: 1, 4: 2, 5: 1, 6: 2, 7: 1, 8: 2, 9: 1, 10: 1, 11: '0', 12: '0'}}
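For what it's worth, the fill-in loop above could also be compressed into a dict comprehension (a sketch; note it fills with the integer 0 rather than the string '0' used above, which keeps the counts numeric for plotting):

# Replacing each value with an existing key's count, or 0 when missing;
# iterating 1..12 in order also leaves the months sorted.
for year, months in result.items():
    result[year] = {m: months.get(m, 0) for m in range(1, 13)}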
When I use:
df[["Type 2", "Type 4"]].applymap(lambda n: int(n, 16))
it stops with an error because the Type 2 column contains values that are invalid for hex conversion (negative values, NaN, strings, ...). How can I ignore this error or mark the invalid values as zero? Here is my dataframe as a dict:
{'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
You can create a small helper function that handles this exception and use it in applymap. For example:
def lambda_int(n):
    try:
        return int(n, 16)
    except ValueError:
        return 0

df[["Type 2", "Type 4"]] = df[["Type 2", "Type 4"]].applymap(lambda_int)
Please go through this; I reconstructed your question and give steps to follow.
1. The first dictionary you provided does not have a NaN value, it has the string "NaN":
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: 'NaN',
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
import pandas as pd
df = pd.DataFrame(data)
df.head()
To check for NaN values in your df and remove them:
columns_with_na = df.isna().sum()
#keep only columns with at least 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False))) #print them in descending order
This prints 0 and 0 because there are no NaN values.
Next, I reconstructed your data to include a real NaN using numpy.nan:
import numpy as np
#recreated a dataset and included a nan value : np.nan at Type 2
data = {'Type 1': {0: 1, 1: 3, 2: 5, 3: 7, 4: 9, 5: 11, 6: 13, 7: 15, 8: 17},
'Type 2': {0: 'AA',
1: 'BB',
2: np.nan,
3: '55',
4: '3.14',
5: '-96',
6: 'String',
7: 'FFFFFF',
8: 'FEEE'},
'Type 3': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0},
'Type 4': {0: '23',
1: 'fefe',
2: 'abcd',
3: 'dddd',
4: 'dad',
5: 'cfe',
6: 'cf42',
7: '321',
8: '0'},
'Type 5': {0: -120,
1: -120,
2: -120,
3: -120,
4: -120,
5: -120,
6: -120,
7: -120,
8: -120}}
df2 = pd.DataFrame(data)
df2.head()
#count missing values per column
columns_with_na = df2.isna().sum()
#filter starting from 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
print(len(columns_with_na.sort_values(ascending = False)))
This prints 1 and 1 because there is a NaN in the Type 2 column.
#drop nan values
df2 = df2.dropna(how = 'any')
#count missing values per column again
columns_with_na = df2.isna().sum()
#keep only columns with at least 1 missing value
columns_with_na = columns_with_na[columns_with_na != 0]
print(len(columns_with_na))
#prints 0 because I dropped all the nan values
df2.head()
To fill NaN values in the df with 0, use:
df2.fillna(0, inplace = True)
To fill NaN with 0 in df2['Type 2'] only:
#if you don't want to change the original dataframe, set inplace to False
df2['Type 2'].fillna(0, inplace = True) #inplace is set to True to change the original df
I am struggling with the ipywidgets module. I am trying to make a plot where you can toggle lines on/off with checkboxes, one per province.
import matplotlib.pyplot as plt
import seaborn as sns
import ipywidgets as widgets

fig, ax = plt.subplots(figsize=(10, 10))
sns.lineplot(data=df5, x="Date_of_report", y="Total_reported", hue="Province", ax=ax)

provinces = df5["Province"].unique()
chk = [widgets.Checkbox(description=a) for a in provinces]

def updatePlot(**kwargs):
    print([(k, v) for k, v in kwargs.items()])

widgets.interact(updatePlot, **{c.description: c.value for c in chk})
As you can see, I can draw the checkboxes and print out their status, but I don't know how to update the seaborn line plot, so that when you select, say, Drenthe, it only shows the line for Drenthe.
Here is the dataframe as a dict:
{'Date_of_report': {0: Timestamp('2020-03-13 10:00:00'), 1: Timestamp('2020-03-13 10:00:00'), 2: Timestamp('2020-03-13 10:00:00'), 3: Timestamp('2020-03-13 10:00:00'), 4: Timestamp('2020-03-13 10:00:00'), 5: Timestamp('2020-03-13 10:00:00'), 6: Timestamp('2020-03-13 10:00:00'), 7: Timestamp('2020-03-13 10:00:00'), 8: Timestamp('2020-03-13 10:00:00'), 9: Timestamp('2020-03-13 10:00:00')}, 'Province': {0: 'Drenthe', 1: 'Flevoland', 2: 'Friesland', 3: 'Gelderland', 4: 'Groningen', 5: 'Limburg', 6: 'Noord-Brabant', 7: 'Noord-Holland', 8: 'Overijssel', 9: 'Utrecht'}, 'Total_reported': {0: 14, 1: 7, 2: 8, 3: 64, 4: 4, 5: 71, 6: 377, 7: 66, 8: 18, 9: 83}, 'Hospital_admission': {0: 0, 1: 3, 2: 2, 3: 9, 4: 1, 5: 17, 6: 65, 7: 4, 8: 0, 9: 7}, 'Deceased': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 3, 6: 5, 7: 0, 8: 0, 9: 0}}
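One possible direction (a sketch, not a tested answer; it assumes df5 and chk from the question above): rebuild the plot inside updatePlot, keeping only the provinces whose checkboxes are ticked.

def updatePlot(**kwargs):
    # interact passes each checkbox's current value, keyed by its description.
    selected = [prov for prov, checked in kwargs.items() if checked]
    fig, ax = plt.subplots(figsize=(10, 10))
    if selected:
        sns.lineplot(data=df5[df5["Province"].isin(selected)],
                     x="Date_of_report", y="Total_reported",
                     hue="Province", ax=ax)
    plt.show()

widgets.interact(updatePlot, **{c.description: c.value for c in chk})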
I have a dataframe that looks like this:
df = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5}})
I need the resulting dataframe to look like this:
output = pd.DataFrame({'id': {0: 1, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3},
'date': {0: '11/11/2018',
1: '11/12/2018',
2: '11/13/2018',
3: '11/14/2018',
4: '11/15/2018',
5: '11/16/2018'},
'score': {0: 1, 1: 1, 2: 3, 3: 2, 4: 0, 5: 5},
'total_score_per_id_before_date': {0: 1, 1: 1, 2: 3, 3: 3, 4: 1, 5: 1}})
My code so far:
output = df[["id","score"]].groupby("id").sum()
However, this gives me the total sum of scores for each id. I need the sum of the scores before the date in that specific row; only the first score per id should not be discarded (an id's first row should keep its own score, as in the expected output).
Use the cumulative sum on a series. Then subtract the current values, as you asked for the cumulative sum before the current index. Finally, add back the first values, otherwise they’re zero.
# Select the score column before cumsum; cumulating the whole frame would choke on the string date column.
previously_accumulated_scores = df.groupby("id")["score"].cumsum() - df["score"]
firsts = df.groupby("id").first().reset_index()
df2 = df.merge(firsts, on=["id", "date"], how="left", suffixes=("", "_r"))
df["total_score_per_id_before_date"] = previously_accumulated_scores + df2.score_r.fillna(0)
The merge could be done more elegantly, by changing the index to a MultiIndex, but that’s a style preference.
Note: this assumes your DataFrame is sorted by the date-like column (groupby preserves the order of rows within each group (source: docs)).
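A more compact alternative (a sketch, relying on the same date-sorted assumption): shift each group's cumulative sum down one row and seed the first row with its own score.

df["total_score_per_id_before_date"] = (
    df.groupby("id")["score"]
      .transform(lambda s: s.cumsum().shift(fill_value=s.iloc[0]))
)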
I would like to model the probability of an event occurring given the occurrence of the previous event.
To give you more context: I plan to group my data by anonymous_id, sort the values of each group by timestamp (ts), and calculate the probability of the sequence of sources (utm_source) the person goes through. The person is represented by a unique anonymous_id. So the desired end goal is the probability that someone who came from a Facebook source then comes through a Google source, etc.
I have been told that a package such as SciPy's gaussian_kde would be useful for this. However, from playing around with it, it requires numerical inputs. So far I have tried:
test_sample = test_sample.groupby('anonymous_id').apply(lambda x: x.sort_values(['ts'])).reset_index(drop=True)
but I am not sure what to try next. I have also tried this, though I don't think it makes much sense:
stats.gaussian_kde(test_two['utm_source'])
Here is a sample of my data:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9},
'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902',
5: '00022b83-240e-4ef9-aaad-ac84064bb902',
6: '00022b83-240e-4ef9-aaad-ac84064bb902',
7: '00022b83-240e-4ef9-aaad-ac84064bb902',
8: '00022b83-240e-4ef9-aaad-ac84064bb902',
9: '0002ed69-4aff-434d-a626-fc9b20ef1b02'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000',
5: '2017-10-11 08:20:37.345000',
6: '2017-10-11 08:21:01.322000',
7: '2017-10-11 08:21:14.145000',
8: '2017-10-11 08:23:47.526000',
9: '2019-06-12 10:42:50.401000'},
'utm_source': {0: nan,
1: 'facebook',
2: 'facebook',
3: 'google',
4: nan,
5: 'facebook',
6: 'google',
7: 'adwords',
8: 'youtube',
9: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 5, 8: 6, 9: 1}}
Note: I converted the dataframe to a dictionary.
Here is one way you can do it (if I understand correctly):
from itertools import chain
from collections import Counter
groups = (df
.sort_values(by='ts')
.dropna()
.groupby('anonymous_id').utm_source
.agg(list)
.reset_index()
)
groups['transitions'] = groups.utm_source.apply(lambda x: list(zip(x,x[1:])))
all_transitions = Counter(chain(*groups.transitions.tolist()))
Which gives you (on your example data):
In [42]: all_transitions
Out[42]:
Counter({('google', 'facebook'): 1,
('facebook', 'google'): 1,
('google', 'adwords'): 1,
('adwords', 'youtube'): 1})
Or are you looking for something different?
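If you are after probabilities rather than raw counts, a minimal sketch (my addition, not part of the original answer) normalizes the counts per source to get the conditional probability P(next source | current source):

from collections import defaultdict

# Total outgoing transitions per source.
totals = defaultdict(int)
for (src, dst), n in all_transitions.items():
    totals[src] += n

transition_probs = {(src, dst): n / totals[src]
                    for (src, dst), n in all_transitions.items()}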
I want to compare average revenue "in offer" vs average revenue "out of offer" for each SKU.
When I merge the two dataframes below on sku, I get multiple rows for each entry because sku is not unique in the second dataframe. For example, every instance of sku = 1 will have two entries, because test_offer contains 2 separate offers for sku 1. However, only one offer can be live for a SKU at any time, which should satisfy the condition:
test_ga['day'] >= test_offer['start_day'] & test_ga['day'] <= test_offer['end_day']
dataset 1
test_ga = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5, 21: 6, 22: 7, 23: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 3, 17: 3, 18: 3, 19: 3, 20: 3, 21: 3, 22: 3, 23: 3},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34, 16: 71, 17: 69, 18: 90, 19: 93, 20: 43, 21: 45, 22: 57, 23: 89}} )
dataset 2
test_offer = pd.DataFrame( {'sku': {0: 1, 1: 1, 2: 2},
'offer_number': {0: 5, 1: 6, 2: 7},
'start_day': {0: 2, 1: 6, 2: 4},
'end_day': {0: 4, 1: 7, 2: 8}} )
Expected Output
expected_output = pd.DataFrame( {'day': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 1, 9: 2, 10: 3, 11: 4, 12: 5, 13: 6, 14: 7, 15: 8},
'sku': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2},
'offer': {0: float('nan'), 1: '5', 2: '5', 3: '5', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '7', 12: '7', 13: '7', 14: '7', 15: '7'},
'start_day': {0: float('nan'), 1: '2', 2: '2', 3: '2', 4: float('nan'), 5: '6', 6: '6', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '4', 12: '4', 13: '4', 14: '4', 15: '4'},
'end_day': {0: float('nan'), 1: '4', 2: '4', 3: '4', 4: float('nan'), 5: '7', 6: '7', 7: float('nan'), 8: float('nan'), 9: float('nan'), 10: float('nan'), 11: '8', 12: '8', 13: '8', 14: '8', 15: '8'},
'revenue': {0: 12, 1: 34, 2: 28, 3: 76, 4: 30, 5: 84, 6: 55, 7: 78, 8: 23, 9: 58, 10: 11, 11: 15, 12: 73, 13: 9, 14: 69, 15: 34}} )
I did actually find a solution based on this SO answer, but it took me a while, and the question there is not really clear. I thought it could still be useful to create this question even though I found a solution. Besides, there are probably better ways to achieve this that do not require creating a dummy variable and sorting the dataframe. If this question is a duplicate, let me know and I will delete it.
One possible solution:
test_data = pd.merge(test_ga, test_offer, on = 'sku')
# Flag whether each row's day falls inside the offer window.
test_data['is_offer'] = np.where((test_data['day'] >= test_data['start_day']) & (test_data['day'] <= test_data['end_day']), 1, 0)
expected_output = test_data.sort_values(['sku','day','is_offer']).groupby(['day', 'sku']).tail(1)
and then clean up the data, setting NaN values for the rows that are not in an offer:
expected_output['start_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['start_day'])
expected_output['end_day'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['end_day'])
expected_output['offer_number'] = np.where(expected_output['is_offer'] == 0, np.nan, expected_output['offer_number'])
expected_output
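Alternatively, here is a sketch without the dummy variable and the sort (my suggestion, under the question's assumption that at most one offer is live per SKU at a time, so the join cannot duplicate rows): keep only the merged rows whose day falls inside the offer window, then left-join them back onto the offer-bearing part of test_ga so out-of-offer days come back as NaN.

merged = test_ga.merge(test_offer, on='sku')
in_offer = merged[(merged['day'] >= merged['start_day']) &
                  (merged['day'] <= merged['end_day'])]
# Restrict to SKUs that have offers, mirroring the inner merge above
# (the expected output excludes sku 3, which has no offers).
expected_output = (test_ga[test_ga['sku'].isin(test_offer['sku'])]
                   .merge(in_offer[['sku', 'day', 'offer_number',
                                    'start_day', 'end_day']],
                          on=['sku', 'day'], how='left'))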