My codebase relies on managing data that's currently in a very deeply nested dictionary. Example:
'USA': {
    'Texas': {
        'Austin': {
            '2017-01-01': 169,
            '2017-02-01': 231
        },
        'Houston': {
            '2017-01-01': 265,
            '2017-02-01': 310
        }
    }
}
This extends for multiple countries, states/regions, cities, and dates.
I encounter a problem when trying to access values since I need to have a deeply nested for-loop to iterate over each country, state, city, and date to apply some kind of operation. I'm looking for some kind of alternative.
Assuming the nested dict structure is the same, is there an alternative to so many loops? Perhaps using map, reduce or lambda?
Is there a better way to store all of this data without using nested dicts?
You can use a pandas DataFrame (see the DataFrame documentation), which stores your data in a tabular format, similar to a spreadsheet. In that case, your DataFrame should have a column for each key in your nested data (one column for Country, another for State, and so on).
Pandas DataFrames also support filtering, grouping, and other useful operations on your records (rows) based on each column. Let's say you want to filter your data to return only the rows from Texas with dates after '2018-02-01' (df is your DataFrame). This could be achieved with something like this:
df[(df['State'] == 'Texas') & (df['Date'] > '2018-02-01')]
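Note that each comparison needs its own parentheses because & binds more tightly than == and >. An equivalent, arguably more readable form uses DataFrame.query (a sketch assuming the same df and column names):
df.query("State == 'Texas' and Date > '2018-02-01'")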
To build these DataFrame objects, you could start from your data formatted as a collection of records:
data = [['USA', 'Texas', 'Austin', '2017-01-01', 169],
['USA', 'Texas', 'Austin', '2017-02-01', 231],
['USA', 'Texas', 'Houston', '2017-01-01', 265],
['USA', 'Texas', 'Houston', '2017-02-01', 310]]
and then build them like this:
import pandas as pd

df = pd.DataFrame(data, columns=['Country', 'State', 'City', 'Date', 'Value'])
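Once the DataFrame is built, grouping is just as direct. A minimal sketch that sums the values per state for the records above:
# Total value per state across all cities and dates
totals = df.groupby('State')['Value'].sum()
print(totals)  # Texas    975 (169 + 231 + 265 + 310)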
If DataFrame objects are not an option, and you do not want to use nested loops, you could also access inner data using list comprehensions with nested predicates and filters:
[
d[country][state][city][date]
for country in d.keys()
for state in d[country].keys()
for city in d[country][state].keys()
for date in d[country][state][city].keys()
if country == 'USA' and state == 'Texas' and city == 'Houston'
]
However, I cannot see much difference between that approach and the nested loops, and there is a penalty in code readability, in my opinion.
Using the collection-of-records approach shown earlier (data) instead of a nested structure, you could filter your rows using:
[r for r in data if r[2] == 'Houston']
For improved readability, you could use a list of namedtuple objects as your list of records. Your data would be:
from collections import namedtuple
record = namedtuple('Record', 'country state city date value')
data = [
record('USA', 'Texas', 'Austin', '2017-01-01', 169),
record('USA', 'Texas', 'Austin', '2017-02-01', 231),
record('USA', 'Texas', 'Houston', '2017-01-01', 265),
record('USA', 'Texas', 'Houston', '2017-02-01', 310)
]
and your filtering would be improved, e.g.:
Getting specific records
[r for r in data if r.city == 'Houston']
returning
[
Record(country='USA', state='Texas', city='Houston', date='2017-01-01', value=265),
Record(country='USA', state='Texas', city='Houston', date='2017-02-01', value=310)
]
Getting only the values for those specific records
[r.value for r in data if r.city == 'Houston']
returning
[265, 310]
This last approach can also handle custom object instances, since namedtuple fields can store any value.
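Aggregations stay just as readable with this record layout. For example, summing the values for Houston from the data above:
sum(r.value for r in data if r.city == 'Houston')  # 265 + 310 = 575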
You can create a class that overloads item access (__getitem__) and uses recursion:
d = {'USA': {
'Texas': {
'Austin': {
'2017-01-01': 169,
'2017-02-01': 231
},
'Houston': {
'2017-01-01': 265,
'2017-02-01': 310
}
}
}
}
class StateData:
    def __init__(self, structure):
        self.structure = structure
        # Map each level name to its depth in the nested dict
        self.levels = {'country': 0, 'state': 1, 'city': 2, 'date': 3}
    def get_level(self, d, target, current=0):
        # At the target depth collect the keys (or the (date, value) pairs at the
        # deepest level); otherwise recurse one level further down.
        total_listing = [((a, b) if target == 3 else a) if current == target
                         else self.get_level(b, target, current + 1)
                         for a, b in d.items()]
        # Flatten one level whenever every collected entry is itself a list
        return [i for b in total_listing for i in b] if all(isinstance(i, list) for i in total_listing) else total_listing
    def __getitem__(self, val):
        return self.get_level(self.structure, self.levels[val])
s = StateData(d)
print(s['city'])
print(s['date'])
Output:
['Austin', 'Houston']
[('2017-01-01', 169), ('2017-02-01', 231), ('2017-01-01', 265), ('2017-02-01', 310)]
It may be best to store your data as a list of lists, which then makes it possible to group it according to the needs of each individual operation. For instance:
state_data = [['USA', 'Texas', 'Austin', '2017-01-01', 169],
              ['USA', 'Texas', 'Austin', '2017-02-01', 231],
              ['USA', 'Texas', 'Houston', '2017-01-01', 265],
              ['USA', 'Texas', 'Houston', '2017-02-01', 310]]
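As a sketch of that grouping using just the standard library (and the rows above), here the records are grouped by city:
from collections import defaultdict

by_city = defaultdict(list)
for country, state, city, date, value in state_data:
    by_city[city].append((date, value))

print(by_city['Houston'])  # [('2017-01-01', 265), ('2017-02-01', 310)]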
There's also pandas' json_normalize (exposed as pandas.json_normalize since pandas 1.0; older versions import it from pandas.io.json).
It allows you to flatten a dict into a pandas DataFrame like this:
import json
from pandas.io.json import json_normalize

d = json.load(f)  # f is an open file handle to your JSON data
# Parent node of dict d is 'programs'
n = json_normalize(d['programs'])
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
'phone': [912341.0, np.nan , 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan ,np.nan , 'joe#gmail.com', 'rick#gmail.com'],
'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
'most_visited_place': ['Turkey', 'Spain',np.nan , 'Germany', 'Germany', 'Spain',np.nan , 'Spain']
}
df = pandas.DataFrame(data)
For every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to generate personal information and output it to a file.
E.g., if we look at most_visited_airport and Heathrow, I need to output three files containing the names, emails, and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [ x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
values = df[each].dropna().unique()
for i in values:
df1 = df.loc[df[each]==i,'name']
df2 = df.loc[df[each]==i,' email']
df3 = df.loc[df[each]==i,'phone']
df1.to_csv(f'{each}_{i}_{df1.name}.csv')
df2.to_csv(f'{each}_{i}_{df2.name}.csv')
df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but also split up the DataFrame for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):  # note the leading space in the question's ' email' column
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way.
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, when writing to CSV you can keep the index if you want, but do not forget to reset it before writing.
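A minimal sketch of that last point, reusing the selection from the comprehension above (most_col, i, and j stand for one concrete column, value, and attribute):
subset = df[df[most_col] == i][j].dropna().reset_index(drop=True)
subset.to_csv(f'{most_col}_{i}_{j}.csv')  # index is kept, but is now a clean 0..n range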
I want to get a sub-DataFrame whose rows contain all the elements in a list.
Let's take this DataFrame as an example.
import pandas as pd

my_dict = {
    'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
    'Job_Detail': ['all sort of painting',
                   'kitchen utensils, all types of roofing etc.',
                   'skill and practical oriented teaching',
                   'all agricultural practices']
}
df = pd.DataFrame(my_dict)
Output looks thus:
Job Job_Detail
0 Painting all sort of painting
1 Capentry kitchen utensils, all types of roofing etc.
2 Teacher skill and practical oriented teaching
3 Farming all agricultural practices
my_lst = ['of', 'all']
I want to filter df with my_lst to get a sub-DataFrame that looks like this:
Job Job_Detail
0 Painting all sort of painting
1 Capentry kitchen utensils, all types of roofing etc.
I've tried df[df.Job_Detail.isin(['of', 'all'])] but it returns an empty DataFrame.
I'm no pandas expert, but the best function to use here seems to be str.contains
From the docs:
Series.str.contains(pat, case=True, flags=0, na=None, regex=True)
Test if pattern or regex is contained within a string of a Series or Index.
Return boolean Series or Index based on whether a given pattern or regex is contained within a string of a Series or Index.
Edit: this mask matches rows containing any of the words (or), not all of them (and):
import pandas as pd
my_dict = {
'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
'Job_Detail': ['all sort of painting',
'kitchen utensils, all types of roofing etc.',
'skill and practical oriented teaching',
'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)
mask = df.Job_Detail.str.contains('|'.join(my_lst), regex=True)
print(df[mask])
Here's a solution that masks using and:
import pandas as pd
my_dict = {
'Job': ['Painting', 'Capentry', 'Teacher', 'Farming'],
'Job_Detail': ['all sort of painting',
'kitchen utensils, all types of roofing etc.',
'skill and practical oriented teaching',
'all agricultural practices']
}
my_lst = ['of', 'all']
df = pd.DataFrame(my_dict)
print(df)
print("------")
masks = [df.Job_Detail.str.contains(word) for word in my_lst]
mask = pd.concat(masks, axis=1).all(axis=1)
print(df[mask])
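The final print(df[mask]) should then show only the two rows that contain both words, roughly:
        Job                                   Job_Detail
0  Painting                         all sort of painting
1  Capentry  kitchen utensils, all types of roofing etc.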
@Lone Your code answered a different question, but it helped me arrive at the answer. Thank you, much appreciated.
Here's the closest to what I needed:
df[(df.Job_Detail.str.contains('of')) & (df.Job_Detail.str.contains('all'))]
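If the word list can grow, the same "and" filter can be built programmatically instead of chaining & by hand (a sketch using the my_lst defined above):
from functools import reduce

mask = reduce(lambda a, b: a & b,
              (df.Job_Detail.str.contains(word) for word in my_lst))
print(df[mask])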
Given a pandas Series, is it possible to remove certain items from each entry and retain the ones that I want to keep? For example:
country=['China','USA','Asia','Brazil']
contents={'Tags': ['China;Panda;Brazil;Plug','USA;China;Asia','Brazil;Peanut']}
df=pd.DataFrame(contents)
tags=df["Tags"]
I wish to discard the values that are not in the country list and keep the rest. So for tags[0] the result should be ['China', 'Brazil'], for tags[1] all the values remain, and for tags[2] the result should be ['Brazil'].
tags = tags.str.split(';')
I have attempted to split on the ';' between each value, but I am uncertain how to proceed.
[[val for val in a_list if val in country] for a_list in df.Tags.str.split(";")]
For each list after splitting, keep only the values that are in the country list
to get
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
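If you want the filtered lists back in the DataFrame rather than as a plain list, you could assign the same comprehension to the column:
df['Tags'] = [[val for val in a_list if val in country]
              for a_list in df.Tags.str.split(';')]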
Exploding and regrouping is also an option:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
'USA;China;Asia',
'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode()
tags = e[e.isin(country)].groupby(level=0).agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]
An option with merge rather than isin:
import pandas as pd
country = ['China', 'USA', 'Asia', 'Brazil']
contents = {'Tags': ['China;Panda;Brazil;Plug',
'USA;China;Asia',
'Brazil;Peanut']}
df = pd.DataFrame(contents)
e = df.Tags.str.split(';').explode().reset_index()
tags = e.merge(
pd.DataFrame(country, columns=['Tags']),
on='Tags'
).groupby('index')['Tags'].agg(list).tolist()
print(tags)
tags:
[['China', 'Brazil'], ['China', 'USA', 'Asia'], ['Brazil']]
You can create a function to do the parsing and write the result back to the column. So, for example, something like:
from functools import partial
import numpy as np
def pars_func(countries, content_str):
    ret_str = ''
    for tag in content_str.split(';'):
        if tag in countries:  # check against the parameter bound by partial
            ret_str = ret_str + ';' + tag
    if ret_str == '':
        return np.nan
    else:
        return ret_str[1:].split(';')

df['Tags'] = df['Tags'].apply(partial(pars_func, country))
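With that in place, the column should end up holding only the kept countries; checking the result (values traced from the example data above):
print(df['Tags'].tolist())
# [['China', 'Brazil'], ['USA', 'China', 'Asia'], ['Brazil']]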
Here is an example input:
[{'name': 'susan', 'wins': 1, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'},
 {'name': 'susan', 'wins': 1, 'team': 'team1'}]
Desired output
[{'name': 'susan', 'wins': 2, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'}]
I have lots of these dictionaries and want to add up only the 'wins' values, grouped by the 'name' value, while keeping the 'team' values.
I've tried to use Counter, but the result was
{'name': 'all the names added together',
 'wins': 'all the wins added together'}
I was able to use defaultdict which seemed to work
result = defaultdict(int)
for d in data:
    result[d['name']] += d['wins']
but the result was something like
{'susan': 2, 'jack':1}
Here it added the values correctly but didn't keep the 'team' key
I guess I'm confused about defaultdict and how it works.
Any help is very appreciated.
Did you consider using pandas?
import pandas as pd
dicts = [
{'name':'susan', 'wins': 1, 'team': 'team1'},
{'name':'jack', 'wins':1, 'team':'team2'},
{'name':'susan', 'wins':1, 'team':'team1'},
]
agg_by = ["name", "team"]
df = pd.DataFrame(dicts)
df = df.groupby(agg_by)["wins"].apply(sum)
df = df.reset_index()
aggregated_dict = df.to_dict("records")
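That should leave aggregated_dict looking roughly like this (groupby sorts by the grouping keys, so 'jack' comes first):
[{'name': 'jack', 'team': 'team2', 'wins': 1},
 {'name': 'susan', 'team': 'team1', 'wins': 2}]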
I have a dictionary my_dict having some elements like:
my_dict = {
'India':'Delhi',
'Canada':'Ottawa',
}
Now I want to add multiple dictionary key-value pair to a dict like:
my_dict = {
'India': 'Delhi',
'Canada': 'Ottawa',
'USA': 'Washington',
'Brazil': 'Brasilia',
'Australia': 'Canberra',
}
Is there any possible way to do this?
Because I don't want to add the elements one after another.
Use the update() method.
d = {'India': 'Delhi', 'Canada': 'Ottawa'}
d.update({'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra'})
PS: Naming your dictionary dict is a horrible idea; it shadows the built-in dict.
To make things more interesting in this answer section, you can
add multiple dictionary key-value pair to a dict
by doing so (In Python 3.5 or greater):
d = {'India': 'Delhi', 'Canada': 'Ottawa'}
d = {**d, 'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra', 'India': 'Blaa'}
Which produces an output:
{'India': 'Blaa', 'Canada': 'Ottawa', 'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra'}
This alternative doesn't even seem memory inefficient, which kind of contradicts one of "The Zen of Python" postulates,
There should be one-- and preferably only one --obvious way to do it
What I didn't like about the d.update() alternative is the round brackets; when I skim-read and see round brackets, I usually think tuples.
Either way, added this answer just to have some fun.
You have a few options:
Use update():
d = {'India': 'Delhi', 'Canada': 'Ottawa'}
d.update({'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra'})
Use the merge operator (Python 3.9+):
d = {'India': 'Delhi', 'Canada': 'Ottawa'}
d2 = {'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra'}
new_dict = d | d2
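There is also an in-place form of the merge operator (likewise Python 3.9+), which updates d directly instead of building a new dict:
d = {'India': 'Delhi', 'Canada': 'Ottawa'}
d |= {'USA': 'Washington', 'Brazil': 'Brasilia', 'Australia': 'Canberra'}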
The update() method works well.
As someone who primarily works with pandas DataFrames, I wanted to share how you can take values from a DataFrame and add them to a dictionary using update() and DataFrame.to_dict().
import pandas as pd
Existing dictionary
my_dict = {
'India':'Delhi',
'Canada':'Ottawa',
}
Data frame with additional values you want to add to dictionary
country_index = ['USA','Brazil','Australia']
city_column = ['Washington','Brasilia','Canberra']
new_values_df = pd.DataFrame(data=city_column, index=country_index, columns=['cities'])
Adding data frame values to dictionary
my_dict.update(new_values_df.to_dict(orient='dict')['cities'])
Dictionary now looks like
my_dict = {
'India': 'Delhi',
'Canada': 'Ottawa',
'USA': 'Washington',
'Brazil': 'Brasilia',
'Australia': 'Canberra',
}