Get unique values from a column using Python

I'm trying to get unique values from the column 'name' for every distinct value in column 'gender'.
Here's sample data:
sample input_file_data:
index,name,gender,alive
1,Adam,Male,Y
2,Bella,Female,N
3,Marc,Male,Y
1,Adam,Male,N
I can get the result when I hardcode a value for 'gender', for example "Male" in the code below:
filtered_data = filter(lambda person: person["gender"] == "Male", input_file_data)
reader = (dict((k, v.strip()) for k, v in row.items() if v) for row in filtered_data)
countt = [rec[gender] for rec in reader]
final1 = input_file_name + ".txt", "gender", "Male"
output1 = str(final1).replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
final2 = set(re.findall(r"name': '(.*?)'", str(filtered_data)))
final_count = len(final2)
output = str(final_count) + " occurrences", str(final2)
output2 = output1, str(output)
output_final = str(output2).replace('\\', "").replace('"',"").replace(']"', "]").replace("set", "").replace("(", "").replace(")", "").replace("'","").replace(", [{", " -- [").replace("}", "")
output_final = output_final + "\n"
current output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc]
Expected output:
input_file_name.txt, gender, Male, 2 occurrences, [Adam,Marc], Female, 1 occurrences [Bella]
which should show up all the unique occurrences of names, for every distinct gender value (without hardcoding). Also I do not want to use Pandas. Any help is highly appreciated.
PS- I have multiple files and not all files have the same columns. So I can't hardcode them. Also, all the files have a 'name' column, but not all files have a 'gender' column. And this script should work for any other column like 'index' or 'alive' or anything else for that matter and not just gender.

I would use the csv module along with the defaultdict from collections for this. Say this is stored in a file called test.csv:
>>> import csv
>>> from collections import defaultdict
>>> with open('test.csv', newline='') as fin:
...     data = list(csv.reader(fin))[1:]  # skip the header row
...
>>> gender_dict = defaultdict(set)
>>> for idx, name, gender, alive in data:
...     gender_dict[gender].add(name)
...
>>> gender_dict
defaultdict(<class 'set'>, {'Male': {'Adam', 'Marc'}, 'Female': {'Bella'}})
You now have a dictionary. Each key is a unique value from the gender column. Each value is a set, so you'll only get unique items. Notice that we added 'Adam' twice, but only see one in the resulting set.
You don't need defaultdict, but it allows you to use less boilerplate code to check if a key exists.
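To illustrate the boilerplate that defaultdict saves, here is a minimal sketch that builds the same mapping both ways. The `rows` list is a hand-typed stand-in for the parsed CSV rows (an assumption for the example), not data read from a file.

```python
from collections import defaultdict

# Hypothetical stand-in for the rows parsed from test.csv (header removed).
rows = [
    ("1", "Adam", "Male", "Y"),
    ("2", "Bella", "Female", "N"),
    ("3", "Marc", "Male", "Y"),
    ("1", "Adam", "Male", "N"),
]

# Plain dict: the key has to be created before the first add().
plain = {}
for idx, name, gender, alive in rows:
    if gender not in plain:
        plain[gender] = set()
    plain[gender].add(name)

# defaultdict: a missing key is created with an empty set automatically.
auto = defaultdict(set)
for idx, name, gender, alive in rows:
    auto[gender].add(name)

print(plain == dict(auto))  # both approaches build the same mapping
```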
EDIT: It might help to have better visibility into the data itself. Given your code, I can make the following assumptions:
input_file_data is an iterable (list, tuple, something like that) containing dictionaries.
Each dictionary contains a 'gender' key. If it didn't include at least 'gender', you would get a key error when trying to filter it.
Each dictionary has a 'name' key, it looks like.
Rather than doing all of that regex, what about this?
>>> gender_dict = {'Male': set(), 'Female': set()}
>>> for item in input_file_data:
...     gender_dict[item['gender']].add(item['name'])
You can use item.get('name') instead of item['name'] if not every entry will have a name.
Edit #2: Ok, the first thing you need to do is get your data into a consistent state. We can absolutely get to a point where you have a column name (gender, index, alive, whatever you want) and a set of unique names corresponding to those columns. Something like this:
data_dict = {'gender':
                 {'Male': {'Adam', 'Marc'},
                  'Female': {'Bella'}},
             'alive':
                 {'Y': {'Adam', 'Marc'},
                  'N': {'Bella', 'Adam'}},
             'index':
                 {1: {'Adam'},
                  2: {'Bella'},
                  3: {'Marc'}}
             }
If that's what you want, you could try this:
>>> data_dict = defaultdict(lambda: defaultdict(set))
>>> for element in input_file_data:
...     for key, value in element.items():
...         if key != 'name':
...             data_dict[key][value].add(element['name'])
That should get you what you want, I think? I can't test as I don't have your data, but give it a try.
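Since the question asks for something that works for any column and for files that may not have that column at all, here is a rough sketch of how the grouping and the output line could be wrapped in functions. The names `unique_names_by_column` and `format_summary` are made up for this example, the sample records are a hand-typed stand-in for input_file_data, and values are sorted only to make the output order predictable.

```python
from collections import defaultdict


def unique_names_by_column(records, column, name_key="name"):
    """Map each distinct value of `column` to the set of names seen with it."""
    groups = defaultdict(set)
    for record in records:
        # Skip records that lack the grouping column or the name column.
        if column in record and name_key in record:
            groups[record[column]].add(record[name_key])
    return groups


def format_summary(file_name, column, groups):
    """Build one summary line; values and names are sorted for stable output."""
    parts = [file_name, column]
    for value in sorted(groups):
        names = sorted(groups[value])
        parts.append("%s, %d occurrences, [%s]" % (value, len(names), ",".join(names)))
    return ", ".join(parts)


# Hypothetical stand-in for input_file_data.
input_file_data = [
    {"index": "1", "name": "Adam", "gender": "Male", "alive": "Y"},
    {"index": "2", "name": "Bella", "gender": "Female", "alive": "N"},
    {"index": "3", "name": "Marc", "gender": "Male", "alive": "Y"},
    {"index": "1", "name": "Adam", "gender": "Male", "alive": "N"},
]

groups = unique_names_by_column(input_file_data, "gender")
print(format_summary("input_file_name.txt", "gender", groups))
```

If a file has no such column, the function returns an empty mapping instead of raising a KeyError, so the same call can be used across files with different columns.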

Related

how to group a file into a dictionary without importing

I'm having to make a dictionary from a file that looks like this:
example =
'Computer science', random name, 17
'Computer science', another name, 18
'math', one name, 19
I want the majors to be keys, but I'm having trouble grouping them. This is what I've tried:
dictionary = {}
for example in example_file:
    dictionary = {example[0]: {example[1]: example[2]}}
The problem is that it turns the lines into dictionaries one by one, instead of putting the lines with the same key into one dictionary.
This is what it's returning:
{computer science: {random name: 17}}
{computer science: {another name: 18}}
{math: {one name: 19}}
This is how I want it to look:
{computer science: {random name: 17, another name: 18}, math: {one name: 19}}
How do I group these?
You need to update the dictionary elements, not assign the whole dictionary each time through the loop.
You can use defaultdict(dict) to automatically create the nested dictionaries as needed.
from collections import defaultdict
dictionary = defaultdict(dict)
for subject, name, score in example_file:
    dictionary[subject][name] = int(score)
It's a pretty well known problem with an elegant solution, making use of dict's setdefault() method.
dictionary = {}
for example in example_file:
    names = dictionary.setdefault(example[0], {})
    names[example[1]] = example[2]
print(dictionary)
This code prints:
{'Computer science': {'random name': 17, 'another name': 18}, 'math': {'one name': 19}}
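Both answers assume that example_file already yields three fields per line. As a minimal sketch of that parsing step without any imports, assuming the fields are separated by commas and contain no commas themselves (the `lines` list is a hand-typed stand-in for the file):

```python
# Hypothetical stand-in for the lines of the file in the question.
lines = [
    "'Computer science', random name, 17",
    "'Computer science', another name, 18",
    "'math', one name, 19",
]

dictionary = {}
for line in lines:
    # Split on commas, then trim spaces, newlines and the quotes around the major.
    major, name, score = [field.strip(" '\n") for field in line.split(",")]
    # setdefault returns the existing inner dict or creates an empty one.
    dictionary.setdefault(major, {})[name] = int(score)

print(dictionary)
```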
An alternative code:
(but @hhimko's solution is almost 50 times faster)
import pandas as pd
df = pd.read_csv("file.csv", header=None).sort_values(0).reset_index(drop=True)
result = dict()
major_holder = None
for index, row in df.iterrows():
    if row.iloc[0] != major_holder:
        # a new major starts here, so create its inner dictionary
        major_holder = row.iloc[0]
        result[major_holder] = dict()
    result[major_holder][row.iloc[1]] = row.iloc[2]
print(result)

How would I merge identical dictionary keys into one?

I have a csv file that looks something like this:
apple 12 yes
apple 15 no
apple 19 yes
and I want to use the fruit as a key and turn rest of the row into a list of lists that's a value, so it looks like:
{'apple': [[12, 'yes'],[15, 'no'],[19, 'yes']]}
A sample of my code below, turns each row into its own dictionary, when I want to combine the data.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    D = {row[0]:[row[1],row[2]]}
    print(D)
My output looks like:
{'apple': [12,'yes']}
{'apple': [15,'no']}
{'apple': [19,'yes']}
Your problem is you reset D in every iteration. Don't do that.
Note that the output may look somewhat related to what you want, but this isn't actually the case. If you inspect the variable D after this code is finished running, you'll see that it contains only the last value that you set it to:
{'apple': [19,'yes']}
Instead, add keys to the dictionary whenever you encounter a new fruit. The value at this key will be an empty list. Then append the data you want to this empty list.
import csv
fp = open('fruits.csv', 'r')
reader = csv.reader(fp)
next(reader,None)
D = {}
for row in reader:
    if row[0] not in D: # if the key doesn't already exist in D, add an empty list
        D[row[0]] = []
    D[row[0]].append(row[1:]) # append the rest of this row to the list in the dictionary
print(D) # print the dictionary AFTER you finish creating it
Alternatively, define D as a collections.defaultdict(list) and you can skip the entire if block.
Note that in a single dictionary, one key can only have one value. Multiple values cannot be assigned to the same key. In this case, each fruit name (key) has a single list value assigned to it. This list contains more lists inside it, but that is immaterial to the dictionary.
You can use a mix of sorting and groupby:
from itertools import groupby
from operator import itemgetter
_input = """apple 12 yes
apple 15 no
apple 19 yes
"""
entries = [l.split() for l in _input.splitlines()]
{key : [values[1:] for values in grp] for key, grp in groupby( sorted(entries, key=itemgetter(0)), key=itemgetter(0))}
groupby only groups consecutive items with equal keys, so the entries are sorted first to make sure each key appears only once. Both calls use the first element of each line as the key.
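To see why the sort matters, here is a small sketch with made-up rows in which the 'apple' entries are not adjacent. Without sorting, groupby starts a new group every time the key changes.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows in which the 'apple' entries are not adjacent.
entries = [["apple", "12", "yes"], ["pear", "7", "no"], ["apple", "19", "yes"]]

first = itemgetter(0)

# Without sorting, 'apple' starts a second group after 'pear'.
unsorted_keys = [key for key, _ in groupby(entries, key=first)]

# After sorting, each key forms exactly one run, so it appears once.
grouped = {
    key: [row[1:] for row in rows]
    for key, rows in groupby(sorted(entries, key=first), key=first)
}

print(unsorted_keys)
print(grouped)
```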
Part of the issue you are running into is that rather than "adding" data to D[key] via append, you are just replacing it. In the end you get only the last result per key.
You might look at collections.defaultdict(list) as a strategy to initialize D, or use setdefault(). In this case I'll use setdefault() as it is straightforward, but don't discount defaultdict() in more complicated scenarios.
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = {}
for item in data:
    result.setdefault(item[0], []).append(item[1:])
print(result)
This should give you:
{
'apple': [
[12, 'yes'],
[15, 'no'],
[19, 'yes']
]
}
If you were keen on looking at defaultdict(), a solution based on it might look like:
import collections
data = [
["apple", 12, "yes"],
["apple", 15, "no"],
["apple", 19, "yes"]
]
result = collections.defaultdict(list)
for item in data:
    result[item[0]].append(item[1:])
print(dict(result))

Parse one string column in dataframe column into many other columns

I have a column in a pandas data frame that contains strings in the following format, for example:
fullyRandom=true+mapSizeDividedBy64=51048
mapSizeDividedBy16000=9756+fullyRandom=false
qType=MpmcArrayQueue+qCapacity=822398+burstSize=664
count=11087+mySeed=2+maxLength=9490
capacity=27281
capacity=79882
Each row can be read as a list of parameters separated by '+'; within each parameter, '=' separates the parameter name from its value. The first row, for example, has 2 parameters.
As output, I'm asking whether there is a Python script that extracts the parameters and returns a list of the unique ones, like the following:
[fullyRandom, mapSizeDividedBy64, mapSizeDividedBy16000, qType, qCapacity, burstSize, count, mySeed, maxLength, capacity]
Notice that the list contains only the unique parameters, without their values.
Or, if it's not too difficult, an extended pandas data frame in which the column is parsed into many columns, one per parameter, each storing that parameter's value.
Try this, it will store the values in a list.
data = []
with open('<your text file>', 'r') as file:
    content = file.readlines()
for row in content:
    if '+' in row:
        sub_row = row.strip('\n').split('+')
        for r in sub_row:
            data.append(r)
    else:
        data.append(row.strip('\n'))
print(data)
Output:
['fullyRandom=true', 'mapSizeDividedBy64=51048', 'mapSizeDividedBy16000=9756', 'fullyRandom=false', 'qType=MpmcArrayQueue', 'qCapacity=822398', 'burstSize=664', 'count=11087', 'mySeed=2', 'maxLength=9490', 'capacity=27281', 'capacity=79882']
To convert to a list of dicts that could be used in pandas:
dict_list = []
for item in data:
    df = {
        item.split('=')[0]: item.split('=')[1]
    }
    dict_list.append(df)
print(dict_list)
Output:
[{'fullyRandom': 'true'}, {'mapSizeDividedBy64': '51048'}, {'mapSizeDividedBy16000': '9756'}, {'fullyRandom': 'false'}, {'qType': 'MpmcArrayQueue'}, {'qCapacity': '822398'}, {'burstSize': '664'}, {'count': '11087'}, {'mySeed': '2'}, {'maxLength': '9490'}, {'capacity': '27281'}, {'capacity': '79882'}]
To just get the headers, append only the part before '=' inside the same loop, instead of the dict:
dict_list.append(item.split('=')[0])
Output:
['fullyRandom', 'mapSizeDividedBy64', 'mapSizeDividedBy16000', 'fullyRandom', 'qType', 'qCapacity', 'burstSize', 'count', 'mySeed', 'maxLength', 'capacity', 'capacity']
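The list above still contains repeats, while the question asks for unique parameters and for one dict per row. Here is a sketch of one way to get both, assuming the column values are available as a plain list of strings (the `column` list is copied from the question; `rows` and `unique_params` are names chosen for this example):

```python
# The sample column values from the question, one string per row.
column = [
    "fullyRandom=true+mapSizeDividedBy64=51048",
    "mapSizeDividedBy16000=9756+fullyRandom=false",
    "qType=MpmcArrayQueue+qCapacity=822398+burstSize=664",
    "count=11087+mySeed=2+maxLength=9490",
    "capacity=27281",
    "capacity=79882",
]

# One dict per row: {parameter: value}.
rows = [
    dict(pair.split("=", 1) for pair in cell.split("+"))
    for cell in column
]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_params = list(dict.fromkeys(key for row in rows for key in row))

print(unique_params)
print(rows[0])
```

Because each element of `rows` describes one original row, passing `rows` to pandas.DataFrame should give one column per parameter, with missing values where a row does not have that parameter.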

How to combine multiple dictionaries in a list based on the given key columns?

I am working with a list that contains many dictionaries. I am trying to combine those dictionaries based on the value of one of their keys. For illustration, see the example below.
my_dict =[{'COLUMN_NAME': 'TABLE_1_COL_1', 'TABLE_NAME': 'TABLE_1'},
{'COLUMN_NAME': 'TABLE_1_COL_2', 'TABLE_NAME': 'TABLE_1'},
{'COLUMN_NAME': 'TABLE_1_COL_3', 'TABLE_NAME': 'TABLE_1'},
{'COLUMN_NAME': 'TABLE_2_COL_1', 'TABLE_NAME': 'TABLE_2'},
{'COLUMN_NAME': 'TABLE_2_COL_2', 'TABLE_NAME': 'TABLE_2'}]
Whenever dictionaries share the same value for a key (here TABLE_NAME), the values of their other keys should be combined.
Below is the sample output I expect from the above list of dicts.
new_lst = [{'TABLE_NAME': 'TABLE_1','COLUMN_NAME':['TABLE_1_COL_1','TABLE_1_COL_2','TABLE_1_COL_3']}, {'TABLE_NAME': 'TABLE_2','COLUMN_NAME': ['TABLE_2_COL_1','TABLE_2_COL_2']}]
How can I achieve this in the most efficient way?
You can use defaultdict to get similar output.
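from collections import defaultdict
grouped = {}
for some_dict in my_dict:
    # one defaultdict(list) per distinct TABLE_NAME
    merged = grouped.setdefault(some_dict['TABLE_NAME'], defaultdict(list))
    for key, value in some_dict.items():
        if value not in merged[key]:
            merged[key].append(value)
new_lst = list(grouped.values())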
new_lst will be of the form:
[{'TABLE_NAME': ['TABLE_1'],'COLUMN_NAME':['TABLE_1_COL_1','TABLE_1_COL_2','TABLE_1_COL_3']}, {'TABLE_NAME': ['TABLE_2'],'COLUMN_NAME': ['TABLE_2_COL_1','TABLE_2_COL_2']}]
Which is slightly different from what you wanted (even the singular elements are in arrays). I would recommend you leave it in this format if given the choice.
To get exactly what you wanted, add this after the above code:
for some_dict in new_lst:
    for key, value in some_dict.items():
        if len(value) == 1:
            some_dict[key] = value[0]
Now, new_lst is exactly like you expected:
[{'TABLE_NAME': 'TABLE_1','COLUMN_NAME':['TABLE_1_COL_1','TABLE_1_COL_2','TABLE_1_COL_3']}, {'TABLE_NAME': 'TABLE_2','COLUMN_NAME': ['TABLE_2_COL_1','TABLE_2_COL_2']}]
Something like this?
data = {}
for element in my_dict:
    table_name = element['TABLE_NAME']
    column_name = element['COLUMN_NAME']
    if table_name not in data:
        data[table_name] = []
    data[table_name].append(column_name)
new_lst = [{'TABLE_NAME': key, 'COLUMN_NAME': val} for key, val in data.items()]

Number of features in dictionary

I am working on loading a dataset from a pickle file like this
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
It works fine and loads the data correctly. This is an example of one row:
'GLISAN JR BEN F': {'salary': 274975, 'to_messages': 873, 'deferral_payments': 'NaN', 'total_payments': 1272284, 'exercised_stock_options': 384728, 'bonus': 600000, 'restricted_stock': 393818, 'shared_receipt_with_poi': 874, 'restricted_stock_deferred': 'NaN', 'total_stock_value': 778546, 'expenses': 125978, 'loan_advances': 'NaN', 'from_messages': 16, 'other': 200308, 'from_this_person_to_poi': 6, 'poi': True, 'director_fees': 'NaN', 'deferred_income': 'NaN', 'long_term_incentive': 71023, 'email_address': 'ben.glisan@enron.com', 'from_poi_to_this_person': 52}
Now, how can I get the number of features, e.g. (salary, to_messages, ..., from_poi_to_this_person)?
I got this row by printing my whole dataset (print data_dict); it is one of the results. I want to know how many features there are in general, i.e. in the whole dataset, without specifying a key in the dictionary.
Thanks
Try this.
no_of_features = len(next(iter(data_dict.values())))
This will work only if all the entries in data_dict have the same number of features.
or simply
no_of_features = len(data_dict['GLISAN JR BEN F'])
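Both one-liners rely on every entry having the same features. A quick way to check that assumption before trusting the count could look like the sketch below. The small `data_dict` is a made-up stand-in for the unpickled dataset, not the real data.

```python
# Hypothetical stand-in for the unpickled data_dict.
data_dict = {
    "PERSON A": {"salary": 1, "bonus": 2, "poi": True},
    "PERSON B": {"salary": 3, "bonus": "NaN", "poi": False},
}

# Collect the distinct feature-name sets used across all entries.
feature_sets = {frozenset(features) for features in data_dict.values()}

if len(feature_sets) == 1:
    # Every entry has the same features, so any entry gives the count.
    no_of_features = len(next(iter(feature_sets)))
    print(no_of_features)
else:
    print("entries do not all share the same features")
```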
""" Load the dictionary containing the dataset """
with open("final_project_dataset.pkl", "rb") as data_file:
    data_dict = pickle.load(data_file)
print(len(data_dict))  # number of entries (people) in data_dict
I think you want to find out the size of the set of all unique field names used in the row dictionaries. You can find that like this:
data_dict = {
'red':{'alpha':1,'bravo':2,'golf':3,'kilo':4},
'green':{'bravo':1,'delta':2,'echo':3},
'blue':{'foxtrot':1,'tango':2}
}
unique_features = set(
feature
for row_dict in data_dict.values()
for feature in row_dict.keys()
)
print(unique_features)
# {'golf', 'delta', 'foxtrot', 'alpha', 'bravo', 'echo', 'tango', 'kilo'}
print(len(unique_features))
# 8
Apply sum to the len of each nested dictionary:
sum(len(v) for _, v in data_dict.items())
v represents a nested dictionary object.
Calling len on a dictionary returns its number of keys, so len(v) is the number of features in one nested dictionary, and the sum is the total across all entries.
If the features may be duplicated across nested objects, then collect them in a set and apply len:
len(set(f for v in data_dict.values() for f in v.keys()))
Here is the answer
https://discussions.udacity.com/t/lesson-5-number-of-features/44253/4
where we choose one person, in this case SKILLING JEFFREY K, within the dictionary called enron_data, and then print the length of that person's keys.
print(len(enron_data["SKILLING JEFFREY K"].keys()))
