I'm having to make a dictionary from a file that looks like this:
example =
'Computer science', random name, 17
'Computer science', another name, 18
'math', one name, 19
I want the majors to be keys, but I'm having trouble grouping them. This is what I've tried:
dictionary = {}
for example in example_file:
    dictionary = {example[0]: {example[1]: example[2]}}
the problem with this is that it does turn the lines into a dictionary, but one by one, instead of grouping the ones with the same key into one dictionary
this is what it's returning:
{computer science: {random name: 17}}
{computer science: {another name: 18}}
{math: {one name: 19}}
this is how I want it to look
{computer science: {random name: 17, another name: 18}, math:{one name:19}}
how do I group these?
You need to update the dictionary elements, not assign the whole dictionary each time through the loop.
You can use defaultdict(dict) to automatically create the nested dictionaries as needed.
from collections import defaultdict
dictionary = defaultdict(dict)
for subject, name, score in example_file:
    dictionary[subject][name] = int(score)
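If example_file is a raw text file rather than pre-split rows, the lines need parsing first. A minimal runnable sketch, assuming the comma-separated format from the question (io.StringIO stands in for the real file, so it is illustrative only):

```python
import csv
import io
from collections import defaultdict

# Stand-in for the real file; csv.reader with quotechar handles the quoted majors
raw = io.StringIO(
    "'Computer science',random name,17\n"
    "'Computer science',another name,18\n"
    "'math',one name,19\n"
)

dictionary = defaultdict(dict)
for subject, name, score in csv.reader(raw, quotechar="'"):
    dictionary[subject][name] = int(score)

print(dict(dictionary))
# {'Computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
```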
It's a pretty well-known problem with an elegant solution that uses dict's setdefault() method.
dictionary = {}
for example in example_file:
    names = dictionary.setdefault(example[0], {})
    names[example[1]] = example[2]
print(dictionary)
This code prints:
{'Computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
An alternative using pandas (though #hhimko's solution is almost 50 times faster):
import pandas as pd
df = pd.read_csv("file.csv", header=None).sort_values(0).reset_index(drop=True)
result = dict()
major_holder = None
for index, row in df.iterrows():
    if row.iloc[0] != major_holder:
        major_holder = row.iloc[0]
        result[major_holder] = dict()
    result[major_holder][row.iloc[1]] = row.iloc[2]
print(result)
Related
I have a csv file that looks something like this:
apple 12 yes
apple 15 no
apple 19 yes
and I want to use the fruit as a key and turn the rest of each row into a list of lists as the value, so it looks like:
{'apple': [[12, 'yes'],[15, 'no'],[19, 'yes']]}
A sample of my code is below; it turns each row into its own dictionary, when I want to combine the data.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    D = {row[0]:[row[1],row[2]]}
    print(D)
My output looks like:
{'apple': [12,'yes']}
{'apple': [15,'no']}
{'apple': [19,'yes']}
Your problem is that you reset D in every iteration. Don't do that.
Note that the output may look somewhat related to what you want, but this isn't actually the case. If you inspect the variable D after this code is finished running, you'll see that it contains only the last value that you set it to:
{'apple': [19,'yes']}
Instead, add keys to the dictionary whenever you encounter a new fruit. The value at this key will be an empty list. Then append the data you want to this empty list.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    if row[0] not in D: # if the key doesn't already exist in D, add an empty list
        D[row[0]] = []
    D[row[0]].append(row[1:]) # append the rest of this row to the list in the dictionary
print(D) # print the dictionary AFTER you finish creating it
Alternatively, define D as a collections.defaultdict(list) and you can skip the entire if block.
Note that in a single dictionary, one key can only have one value. There can not be multiple values assigned to the same key. In this case, each fruit name (key) has a single list value assigned to it. This list contains more lists inside it, but that is immaterial to the dictionary.
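A sketch of that defaultdict(list) variant; io.StringIO stands in for fruits.csv so the example is self-contained (the header row and column names are assumptions):

```python
import csv
import io
from collections import defaultdict

# Stand-in for fruits.csv, header row included as in the question
fp = io.StringIO("fruit,number,answer\napple,12,yes\napple,15,no\napple,19,yes\n")
reader = csv.reader(fp)
next(reader, None)  # skip the header

D = defaultdict(list)  # missing keys start out as empty lists automatically
for row in reader:
    D[row[0]].append(row[1:])

print(dict(D))
# {'apple': [['12', 'yes'], ['15', 'no'], ['19', 'yes']]}
```

Note that csv.reader yields strings, so the numbers stay as '12' etc. unless you convert them with int().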
You can use a mix of sorting and groupby:
from itertools import groupby
from operator import itemgetter
_input = """apple 12 yes
apple 15 no
apple 19 yes
"""
entries = [l.split() for l in _input.splitlines()]
result = {key: [values[1:] for values in grp]
          for key, grp in groupby(sorted(entries, key=itemgetter(0)), key=itemgetter(0))}
Sorting is applied before groupby so that equal keys end up adjacent and are not split into separate groups; both use the first element of each line as the key.
Part of the issue you are running into is that rather than "adding" data to D[key] via append, you are just replacing it. In the end you get only the last result per key.
You might look at collections.defaultdict(list) as a strategy to initialize D, or use setdefault(). In this case I'll use setdefault() as it is straightforward, but don't discount defaultdict() in more complicated scenarios.
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = {}
for item in data:
    result.setdefault(item[0], []).append(item[1:])
print(result)
This should give you:
{
'apple': [
[12, 'yes'],
[15, 'no'],
[19, 'yes']
]
}
If you were keen on looking at defaultdict(), a solution based on it might look like:
import collections
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = collections.defaultdict(list)
for item in data:
    result[item[0]].append(item[1:])
print(dict(result))
Here is an example input:
[{'name': 'susan', 'wins': 1, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'},
 {'name': 'susan', 'wins': 1, 'team': 'team1'}]
Desired output
[{'name': 'susan', 'wins': 2, 'team': 'team1'},
 {'name': 'jack', 'wins': 1, 'team': 'team2'}]
I have lots of these dictionaries and want to add up only the 'wins' values, grouped by the 'name' value, while keeping the 'team' values.
I've tried to use Counter, but the result was
{'name': 'all the names added together',
 'wins': 'all the wins added together'}
I was able to use defaultdict which seemed to work
result = defaultdict(int)
for d in data:
    result[d['name']] += d['wins']
but the results was something like
{'susan': 2, 'jack':1}
Here it added the values correctly but didn't keep the 'team' key
I guess I'm confused about defaultdict and how it works.
Any help is very appreciated.
Did you consider using pandas?
import pandas as pd
dicts = [
{'name':'susan', 'wins': 1, 'team': 'team1'},
{'name':'jack', 'wins':1, 'team':'team2'},
{'name':'susan', 'wins':1, 'team':'team1'},
]
agg_by = ["name", "team"]
df = pd.DataFrame(dicts)
df = df.groupby(agg_by)["wins"].sum()
df = df.reset_index()
aggregated_dict = df.to_dict("records")
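If pandas feels like overkill here, the same aggregation can be done with a plain dict keyed by name. A sketch, assuming (as in the sample) that each name always belongs to a single team:

```python
dicts = [
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
    {'name': 'jack', 'wins': 1, 'team': 'team2'},
    {'name': 'susan', 'wins': 1, 'team': 'team1'},
]

merged = {}
for d in dicts:
    if d['name'] in merged:
        merged[d['name']]['wins'] += d['wins']  # accumulate wins for a name we've seen
    else:
        merged[d['name']] = dict(d)  # copy so the input dicts are not mutated

result = list(merged.values())
print(result)
# [{'name': 'susan', 'wins': 2, 'team': 'team1'}, {'name': 'jack', 'wins': 1, 'team': 'team2'}]
```

Because regular dicts preserve insertion order, the output keeps the order in which names first appear.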
I have an Excel file with a structure like this:
name age status
anna 35 single
petr 27 married
I have converted such a file into a nested dictionary with a structure like this:
{'anna': {'age': 35, 'status': 'single'},
 'petr': {'age': 27, 'status': 'married'}}
using pandas:
import pandas as pd
df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True)
print(df.to_dict(orient='index'))
But now when running list(df.keys()) it returns me a list of all keys in the dictionary ('age', 'status', etc) but not 'name'.
My eventual goal is that it returns me all the keys and values by typing a name.
Is it possible somehow? Or maybe I should use some other way to import a data in order to achieve a goal? Eventually I should anyway come to a dictionary because I will merge it with other dictionaries by a key.
I think you need the parameter drop=False in set_index so that the name column is not dropped:
import pandas as pd
df = pd.read_excel('path/to/file')
df.set_index('name', inplace=True, drop=False)
print (df)
name age status
name
anna anna 35 single
petr petr 27 married
d = df.to_dict(orient='index')
print (d)
{'anna': {'age': 35, 'status': 'single', 'name': 'anna'},
'petr': {'age': 27, 'status': 'married', 'name': 'petr'}}
print (list(df.keys()))
['name', 'age', 'status']
Given a dataframe from excel, you should do this to obtain the thing you want:
resulting_dict = {}
for name, info in df.groupby('name').apply(lambda x: x.to_dict()).items():
    stats = {}
    for key, values in info.items():
        if key != 'name':
            value = list(values.values())[0]
            stats[key] = value
    resulting_dict[name] = stats
Try this:
import pandas as pd
df = pd.read_excel('path/to/file')
df[df['name']=='anna'] #Get all details of anna
df[df['name']=='petr'] #Get all details of petr
I am working on loading a dataset from a pickle file like this
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
It works fine and loads the data correctly. This is an example of one row:
'GLISAN JR BEN F': {'salary': 274975, 'to_messages': 873, 'deferral_payments': 'NaN', 'total_payments': 1272284, 'exercised_stock_options': 384728, 'bonus': 600000, 'restricted_stock': 393818, 'shared_receipt_with_poi': 874, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 778546, 'expenses': 125978, 'loan_advances': 'NaN', 'from_messages': 16, 'other': 200308, 'from_this_person_to_poi': 6, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 71023, 'email_address': 'ben.glisan@enron.com', 'from_poi_to_this_person': 52}
Now, how can I get the number of features? e.g. (salary, to_messages, ..., from_poi_to_this_person)?
I got this row by printing my whole dataset (print(data_dict)) and this is one of the results. I want to know how many features there are in general, i.e. in the whole dataset, without specifying a key in the dictionary.
Thanks
Try this.
no_of_features = len(next(iter(data_dict.values())))
This will work only if all your keys in data_dict have same number of features.
or simply
no_of_features = len(data_dict['GLISAN JR BEN F'])
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
print len(data_dict)
I think you want to find out the size of the set of all unique field names used in the row dictionaries. You can find that like this:
data_dict = {
'red':{'alpha':1,'bravo':2,'golf':3,'kilo':4},
'green':{'bravo':1,'delta':2,'echo':3},
'blue':{'foxtrot':1,'tango':2}
}
unique_features = set(
feature
for row_dict in data_dict.values()
for feature in row_dict.keys()
)
print(unique_features)
# {'golf', 'delta', 'foxtrot', 'alpha', 'bravo', 'echo', 'tango', 'kilo'}
print(len(unique_features))
# 8
Apply sum to the len of each nested dictionary:
sum(len(v) for v in data_dict.values())
v represents a nested dictionary object.
Iterating over a dictionary naturally yields its keys, so calling len returns the number of keys in each nested dictionary, viz. the number of features.
If the features may be duplicated across nested objects, then collect them in a set and apply len
len(set(f for v in data_dict.values() for f in v.keys()))
Here is the answer
https://discussions.udacity.com/t/lesson-5-number-of-features/44253/4
where we choose one person, in this case SKILLING JEFFREY K, within the database called enron_data, and then print the length of the keys in the dictionary.
print(len(enron_data["SKILLING JEFFREY K"].keys()))
I'm trying to get unique values from the column 'name' for every distinct value in column 'gender'.
Here's sample data:
sample input_file_data:
index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N
I could get it when I give a value corresponding to 'gender', for example "Male" in the code below:
filtered_data = filter(lambda person: person["gender"] == "Male", input_file_data)
reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in filtered_data)
countt = [rec["gender"] for rec in reader]
final1 = input_file_name + ".txt", "gender", "Male"
output1 = str(final1).replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
final2 = set(re.findall(r"name': '(.*?)'", str(filtered_data)))
final_count = len(final2)
output = str(final_count) + " occurrences", str(final2)
output2 = output1, str(output)
output_final = str(output2).replace('\\', "").replace('"',"").replace(']"', "]").replace("set", "").replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
output_final = output_final + "\n"
current output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc]
Expected output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc], Female, 1 occurrences [Bella]
which should show all the unique name occurrences for every distinct gender value (without hardcoding). Also, I do not want to use pandas. Any help is highly appreciated.
PS- I have multiple files and not all files have the same columns. So I can't hardcode them. Also, all the files have a 'name' column, but not all files have a 'gender' column. And this script should work for any other column like 'index' or 'alive' or anything else for that matter and not just gender.
I would use the csv module along with the defaultdict from collections for this. Say this is stored in a file called test.csv:
>>> import csv
>>> from collections import defaultdict
>>> with open('test.csv', newline='') as fin: data = list(csv.reader(fin))[1:]
>>> gender_dict = defaultdict(set)
>>> for idx, name, gender, alive in data:
...     gender_dict[gender].add(name)
>>> gender_dict
defaultdict(<class 'set'>, {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}})
You now have a dictionary. Each key is a unique value from the gender column. Each value is a set, so you'll only get unique items. Notice that we added 'Adam' twice, but only see one in the resulting set.
You don't need defaultdict, but it allows you to use less boilerplate code to check if a key exists.
EDIT: It might help to have better visibility into the data itself. Given your code, I can make the following assumptions:
input_file_data is an iterable (list, tuple, something like that) containing dictionaries.
Each dictionary contains a 'gender' key. If it didn't include at least 'gender', you would get a key error when trying to filter it.
Each dictionary has a 'name' key, it looks like.
Rather than doing all of that regex, what about this?
>>> gender_dict = {'Male': set(), 'Female': set()}
>>> for item in input_file_data:
...     gender_dict[item['gender']].add(item['name'])
You can use item.get('name') instead of item['name'] if not every entry will have a name.
Edit #2: Ok, the first thing you need to do is get your data into a consistent state. We can absolutely get to a point where you have a column name (gender, index, alive, whatever you want) and a set of unique names corresponding to those columns. Something like this:
data_dict = {'gender':
                 {'Male': ['Adam', 'Marc'],
                  'Female': ['Bella']},
             'alive':
                 {'Y': ['Adam', 'Marc'],
                  'N': ['Bella', 'Adam']},
             'index':
                 {1: ['Adam'],
                  2: ['Bella'],
                  3: ['Marc']}
            }
If that's what you want, you could try this:
>>> data_dict = defaultdict(lambda: defaultdict(set))
>>> for element in input_file_data:
...     for key, value in element.items():
...         if key != 'name':
...             data_dict[key][value].add(element['name'])
That should get you what you want, I think? I can't test as I don't have your data, but give it a try.
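A self-contained sketch of that approach, with sample rows standing in for input_file_data (the sample values are taken from the question's CSV):

```python
from collections import defaultdict

# Sample rows standing in for input_file_data (dicts parsed from the CSV)
input_file_data = [
    {'index': '1', 'name': 'Adam', 'gender': 'Male', 'alive': 'Y'},
    {'index': '2', 'name': 'Bella', 'gender': 'Female', 'alive': 'N'},
    {'index': '3', 'name': 'Marc', 'gender': 'Male', 'alive': 'Y'},
    {'index': '1', 'name': 'Adam', 'gender': 'Male', 'alive': 'N'},
]

# column -> column value -> set of unique names
data_dict = defaultdict(lambda: defaultdict(set))
for element in input_file_data:
    for key, value in element.items():
        if key != 'name':
            data_dict[key][value].add(element['name'])

print(sorted(data_dict['gender']['Male']))  # ['Adam', 'Marc']
print(sorted(data_dict['alive']['N']))      # ['Adam', 'Bella']
```

The sets deduplicate automatically, so the repeated 'Adam' row contributes only one entry per group.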