I am working on loading a dataset from a pickle file like this
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
It works fine and loads the data correctly. This is an example of one row:
'GLISAN JR BEN F': {'salary': 274975, 'to_messages': 873, 'deferral_payments': 'NaN', 'total_payments': 1272284, 'exercised_stock_options': 384728, 'bonus': 600000, 'restricted_stock': 393818, 'shared_receipt_with_poi': 874, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 778546, 'expenses': 125978, 'loan_advances': 'NaN', 'from_messages': 16, 'other': 200308, 'from_this_person_to_poi': 6, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 71023, 'email_address': 'ben.glisan@enron.com', 'from_poi_to_this_person': 52}
Now, how can I get the number of features, e.g. (salary, to_messages, ..., from_poi_to_this_person)?
I got this row by printing my whole dataset (print(data_dict)) and this is one of the results. I want to know how many features there are in general, i.e. in the whole dataset, without specifying a key in the dictionary.
Thanks
Try this.
no_of_features = len(data_dict[next(iter(data_dict))])  # take any one person's record
This will only work if all the keys in data_dict have the same number of features.
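If you want to verify that assumption first, a quick sketch using the data_dict loaded above:
lengths = {len(row) for row in data_dict.values()}
print(lengths)  # a single element here means every person has the same number of features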
or simply
no_of_features = len(data_dict['GLISAN JR BEN F'])
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
print len(data_dict)
I think you want to find out the size of the set of all unique field names used in the row dictionaries. You can find that like this:
data_dict = {
    'red': {'alpha': 1, 'bravo': 2, 'golf': 3, 'kilo': 4},
    'green': {'bravo': 1, 'delta': 2, 'echo': 3},
    'blue': {'foxtrot': 1, 'tango': 2}
}
unique_features = set(
    feature
    for row_dict in data_dict.values()
    for feature in row_dict.keys()
)
print(unique_features)
# {'golf', 'delta', 'foxtrot', 'alpha', 'bravo', 'echo', 'tango', 'kilo'}
print(len(unique_features))
# 8
Apply sum to the len of each nested dictionary:
sum(len(v) for _, v in data_dict.items())
v represents a nested dictionary object.
Calling len on a dictionary returns its number of keys, so len(v) counts the number of features in each nested dictionary.
If the features may be duplicated across nested objects, then collect them in a set and apply len
len(set(f for v in data_dict.values() for f in v.keys()))
Here is the answer
https://discussions.udacity.com/t/lesson-5-number-of-features/44253/4
We choose one person, in this case SKILLING JEFFREY K, within the database called enron_data, and then print the length of the keys in the dictionary.
print(len(enron_data["SKILLING JEFFREY K"].keys()))
Related
I'm having to make a dictionary from a file that looks like this:
example =
'Computer science', random name, 17
'Computer science', another name, 18
'math', one name, 19
I want the majors to be keys, but I'm having trouble grouping them. This is what I've tried:
dictionary = {}
for example in example_file:
    dictionary = {example[0]: {example[1]: example[2]}}
The problem with this is that it turns each line into its own dictionary, one by one, instead of grouping the lines with the same key into one dictionary.
This is what it's returning:
{'computer science': {'random name': 17}}
{'computer science': {'another name': 18}}
{'math': {'one name': 19}}
This is how I want it to look:
{'computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
How do I group these?
You need to update the dictionary elements, not assign the whole dictionary each time through the loop.
You can use defaultdict(dict) to automatically create the nested dictionaries as needed.
from collections import defaultdict
dictionary = defaultdict(dict)
for subject, name, score in example_file:
    dictionary[subject][name] = int(score)
It's a pretty well-known problem with an elegant solution, making use of dict's setdefault() method.
dictionary = {}
for example in example_file:
    names = dictionary.setdefault(example[0], {})
    names[example[1]] = int(example[2])
print(dictionary)
This code prints:
{'Computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
An alternative code:
(but @hhimko's solution is almost 50 times faster)
import pandas as pd
df = pd.read_csv("file.csv", header=None).sort_values(0).reset_index(drop=True)
result = dict()
major_holder = None
for index, row in df.iterrows():
    if row.iloc[0] != major_holder:
        major_holder = row.iloc[0]
        result[major_holder] = dict()
    result[major_holder][row.iloc[1]] = row.iloc[2]
print(result)
I have a csv file that looks something like this:
apple 12 yes
apple 15 no
apple 19 yes
and I want to use the fruit as a key and turn rest of the row into a list of lists that's a value, so it looks like:
{'apple': [[12, 'yes'],[15, 'no'],[19, 'yes']]}
A sample of my code is below; it turns each row into its own dictionary, when I want to combine the data.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    D = {row[0]: [row[1], row[2]]}
    print(D)
My output looks like:
{'apple': [12,'yes']}
{'apple': [15,'no']}
{'apple': [19,'yes']}
Your problem is that you reset D in every iteration. Don't do that.
Note that the output may look somewhat related to what you want, but this isn't actually the case. If you inspect the variable D after this code is finished running, you'll see that it contains only the last value that you set it to:
{'apple': [19,'yes']}
Instead, add keys to the dictionary whenever you encounter a new fruit. The value at this key will be an empty list. Then append the data you want to this empty list.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    if row[0] not in D:  # if the key doesn't already exist in D, add an empty list
        D[row[0]] = []
    D[row[0]].append(row[1:])  # append the rest of this row to the list in the dictionary
print(D)  # print the dictionary AFTER you finish creating it
Alternatively, define D as a collections.defaultdict(list) and you can skip the entire if block
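A minimal sketch of that variant, assuming the same fruits.csv as above:
import csv
from collections import defaultdict

with open('fruits.csv', 'r') as fp:
    reader = csv.reader(fp)
    next(reader, None)
    D = defaultdict(list)
    for row in reader:
        D[row[0]].append(row[1:])  # a missing key is created with an empty list automatically
print(dict(D))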
Note that in a single dictionary, one key can only have one value. There can not be multiple values assigned to the same key. In this case, each fruit name (key) has a single list value assigned to it. This list contains more lists inside it, but that is immaterial to the dictionary.
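A quick demonstration of that point:
d = {}
d['apple'] = [12, 'yes']
d['apple'] = [19, 'yes']  # assigning to an existing key replaces its value
print(d)  # {'apple': [19, 'yes']}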
You can use a mix of sorting and groupby:
from itertools import groupby
from operator import itemgetter
_input = """apple 12 yes
apple 15 no
apple 19 yes
"""
entries = [l.split() for l in _input.splitlines()]
{
    key: [values[1:] for values in grp]
    for key, grp in groupby(sorted(entries, key=itemgetter(0)), key=itemgetter(0))
}
Sorting is applied before groupby so that equal keys end up adjacent (groupby only merges consecutive equal elements); both use the first element of each line as the key.
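To see why the sort matters, a small self-contained sketch:
from itertools import groupby

subjects = ['math', 'cs', 'math']
print([k for k, _ in groupby(subjects)])          # ['math', 'cs', 'math'], unsorted input splits the groups
print([k for k, _ in groupby(sorted(subjects))])  # ['cs', 'math']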
Part of the issue you are running into is that rather than "adding" data to D[key] via append, you are just replacing it. In the end you get only the last result per key.
You might look at collections.defaultdict(list) as a strategy to initialize D, or use setdefault(). In this case I'll use setdefault() as it is straightforward, but don't discount defaultdict() in more complicated scenarios.
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = {}
for item in data:
    result.setdefault(item[0], []).append(item[1:])
print(result)
This should give you:
{
    'apple': [
        [12, 'yes'],
        [15, 'no'],
        [19, 'yes']
    ]
}
If you were keen on looking at defaultdict(), a solution based on it might look like:
import collections
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = collections.defaultdict(list)
for item in data:
    result[item[0]].append(item[1:])
print(dict(result))
I have a list of dictionaries, as shown below, and I would like to extract the partID and the corresponding quantity for a specific orderID using Python, but I don't know how to do it.
dataList = [{'orderID': 'D00001', 'customerID': 'C00001', 'partID': 'P00001', 'quantity': 2},
{'orderID': 'D00002', 'customerID': 'C00002', 'partID': 'P00002', 'quantity': 1},
{'orderID': 'D00003', 'customerID': 'C00003', 'partID': 'P00001', 'quantity': 1},
{'orderID': 'D00004', 'customerID': 'C00004', 'partID': 'P00003', 'quantity': 3}]
So for example, when I search my dataList for a specific orderID == 'D00003', I would like to receive both the partID ('P00001') and the corresponding quantity (1) of the specified order. How would you go about this? Any help is much appreciated.
It depends. If you are not going to do this a lot of times, you can just iterate over the list of dictionaries until you find the "correct" one:
search_for_order_id = 'D00001'
for d in dataList:
    if d['orderID'] == search_for_order_id:
        print(d['partID'], d['quantity'])
        break  # assuming orderID is unique
Outputs
P00001 2
Since this solution is O(n), if you are going to do this search a lot of times it will add up.
In that case it will be better to transform the data to a dictionary of dictionaries, with orderID being the outer key (again, assuming orderID is unique):
better = {d['orderID']: d for d in dataList}
This is also O(n) but you pay it only once. Any subsequent lookup is an O(1) dictionary lookup:
search_for_order_id = 'D00001'
print(better[search_for_order_id]['partID'], better[search_for_order_id]['quantity'])
Also outputs
P00001 2
You may want to familiarize yourself with the pandas package, which is very useful for data analysis. If these are the kinds of problems you're up against, I advise you to take the time to work through a pandas tutorial. It can do a lot and is very popular.
Your dataList is very similar to a DataFrame structure, so what you're looking for would be as simple as:
import pandas as pd
df = pd.DataFrame(dataList)
df[df['orderID']=='D00003']
You can use this:
results = [[x['orderID'], x['partID'], x['quantity']] for x in dataList]
for i in results:
    print(i)
Also,
results = [['Order ID: ' + x['orderID'], 'Part ID: ' + x['partID'], 'Quantity: ' + str(x['quantity'])] for x in dataList]
To get the partID you can make use of the filter function.
myData = [{"x": 1, "y": 1}, {"x": 2, "y": 5}]
filtered = filter(lambda item: item["x"] == 1, myData)  # search for an object with x equal to 1
# Get the next item from the filter (the matching item) and get the y property.
print(next(filtered)["y"])
You should be able to apply this to your situation.
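For instance, applied to the dataList from the question, a sketch might look like this:
match = next(filter(lambda d: d['orderID'] == 'D00003', dataList), None)
if match is not None:
    print(match['partID'], match['quantity'])  # P00001 1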
I need to create lookup tables in Python from a csv. I have to do this, though, by unique values in my columns. The example is attached. I have a name column that is the name of the model. For each model, I need a dictionary with the title from the variable column, the key from the level column and the value from the value column. I'm thinking the best thing is a dictionary of dictionaries. I will use this lookup table in the future to multiply the values together based on the keys.
Here is code to generate sample data set:
import pandas as pd

Name = ['model1', 'model1', 'model1', 'model2', 'model2', 'model2',
        'model1', 'model1', 'model1', 'model1', 'model2', 'model2', 'model2', 'model2']
Variable = ['channel_model', 'channel_model', 'channel_model', 'channel_model', 'channel_model', 'channel_model',
            'driver_age', 'driver_age', 'driver_age', 'driver_age', 'driver_age', 'driver_age', 'driver_age', 'driver_age']
channel_Level = ['Dir', 'IA', 'EA', 'Dir', 'IA', 'EA', '21', '22', '23', '24', '21', '22', '23', '24']
Value = [1.11, 1.18, 1.002, 2.2, 2.5, 2.56, 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4]
df = {'Name': Name, 'Variable': Variable, 'Level': channel_Level, 'Value': Value}
factor_table = pd.DataFrame(df)
I have read the following but it hasn't yielded great results:
Python Creating Dictionary from excel data
I've also tried:
import pandas as pd
factor_table = pd.read_excel('...\\factor_table_example.xlsx')
#define function to be used multiple times
def factor_tables(file, model_column, variable_column, level_column, value_column):
    for i in file[model_column]:
        for row in file[variable_column]:
            lookup = {}
            lookup = dict(zip(file[level_column], file[value,column]))
This yields the error:
dict expected at most 1 arguments, got 2
What I would ultimately like is:
{'model2': {'channel': {'EA': 1.002, 'IA': 1.18, 'DIR': 1.11}}, 'model1': {'channel': {'EA': 1.86, 'IA': 1.66, 'DIR': 1.64}}}
Using collections.defaultdict, you can create a nested dictionary while iterating your dataframe. Then realign into a list of dictionaries via a list comprehension.
from collections import defaultdict
tree = lambda: defaultdict(tree)
d = tree()
for row in factor_table.itertuples(index=False):
    d[(row.Name, row.Variable)].update({row.Level: row.Value})
res = [{k[0]: {k[1]: dict(v)}} for k, v in d.items()]
print(res)
[{'model1': {'channel_model': {'Dir': 1.110, 'EA': 1.002, 'IA': 1.180}}},
{'model2': {'channel_model': {'Dir': 2.200, 'EA': 2.560, 'IA': 2.500}}},
{'model1': {'driver_age': {'21': 1.100, '22': 1.200, '23': 1.300, '24': 1.400}}},
{'model2': {'driver_age': {'21': 2.100, '22': 2.200, '23': 2.300, '24': 2.400}}}]
It looks like your error could be coming from this line:
lookup = dict(zip(file[level_column], file[value,column]))
where file is a dict expecting one key, yet you give it value,column, so it received two arguments. The loop you might be looking for is like so:
def factor_tables(file, model_column, variable_column, level_column, value_column):
    lookup = {}
    for i in file[model_column]:
        lookup[i] = dict(zip(file[level_column], file[value_column]))
    return lookup
This will return to you a single dictionary with keys corresponding to individual (and unique) models:
{'model_1':{'level_col': 'val_col'}, 'model_2':...}
Allowing you to use:
lookups.get('model_1')
{'level_col': 'val_col'}
If you need the variable_column, you can wrap it one level deeper:
def factor_tables(file, model_column, variable_column, level_column, value_column):
    lookup = {}
    for i in file[model_column]:
        lookup[i] = {variable_column: dict(zip(file[level_column], file[value_column]))}
    return lookup
I'm trying to get unique values from the column 'name' for every distinct value in column 'gender'.
Here's the sample input_file_data:
index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N
I could get it when I gave a value corresponding to 'gender', for example "Male", in the code below:
filtered_data = filter(lambda person: person["gender"] == "Male", input_file_data)
reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in filtered_data)
countt = [rec[gender] for rec in reader]
final1 = input_file_name + ".txt", "gender", "Male"
output1 = str(final1).replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
final2 = set(re.findall(r"name': '(.*?)'", str(filtered_data)))
final_count = len(final2)
output = str(final_count) + " occurrences", str(final2)
output2 = output1, str(output)
output_final = str(output2).replace('\\', "").replace('"',"").replace(']"', "]").replace("set", "").replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
output_final = output_final + "\n"
current output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc]
Expected output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc], Female, 1 occurrences [Bella]
which should show all the unique occurrences of names for every distinct gender value (without hardcoding). Also, I do not want to use Pandas. Any help is highly appreciated.
PS- I have multiple files and not all files have the same columns. So I can't hardcode them. Also, all the files have a 'name' column, but not all files have a 'gender' column. And this script should work for any other column like 'index' or 'alive' or anything else for that matter and not just gender.
I would use the csv module along with the defaultdict from collections for this. Say this is stored in a file called test.csv:
>>> import csv
>>> from collections import defaultdict
>>> with open('test.csv', 'r', newline='') as fin: data = list(csv.reader(fin))[1:]
>>> gender_dict = defaultdict(set)
>>> for idx, name, gender, alive in data:
        gender_dict[gender].add(name)
>>> gender_dict
defaultdict(<class 'set'>, {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}})
You now have a dictionary. Each key is a unique value from the gender column. Each value is a set, so you'll only get unique items. Notice that we added 'Adam' twice, but only see one in the resulting set.
You don't need defaultdict, but it allows you to use less boilerplate code to check if a key exists.
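For comparison, a sketch of the same loop with a plain dict and setdefault:
gender_dict = {}
for idx, name, gender, alive in data:
    gender_dict.setdefault(gender, set()).add(name)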
EDIT: It might help to have better visibility into the data itself. Given your code, I can make the following assumptions:
input_file_data is an iterable (list, tuple, something like that) containing dictionaries.
Each dictionary contains a 'gender' key. If it didn't include at least 'gender', you would get a key error when trying to filter it.
Each dictionary has a 'name' key, it looks like.
Rather than doing all of that regex, what about this?
>>> gender_dict = {'Male': set(), 'Female': set()}
>>> for item in input_file_data:
        gender_dict[item['gender']].add(item['name'])
You can use item.get('name') instead of item['name'] if not every entry will have a name.
Edit #2: Ok, the first thing you need to do is get your data into a consistent state. We can absolutely get to a point where you have a column name (gender, index, alive, whatever you want) and a set of unique names corresponding to those columns. Something like this:
data_dict = {'gender':
                 {'Male': ['Adam', 'Marc'],
                  'Female': ['Bella']},
             'alive':
                 {'Y': ['Adam', 'Marc'],
                  'N': ['Bella', 'Adam']},
             'index':
                 {1: ['Adam'],
                  2: ['Bella'],
                  3: ['Marc']}
             }
If that's what you want, you could try this:
>>> data_dict = defaultdict(lambda: defaultdict(set))
>>> for element in input_file_data:
        for key, value in element.items():
            if key != 'name':
                data_dict[key][value].add(element['name'])
That should get you what you want, I think? I can't test as I don't have your data, but give it a try.