iterable from pandas dataframe - python

I need to create an iterable of the form (id, {feature name: feature weight}) to use with a Python package.
My data are stored in a pandas DataFrame; here is an example:
data = pd.DataFrame({"id": [1, 2, 3],
                     "gender": [1, 0, 1],
                     "age": [25, 23, 40]})
For the {feature name: feature weight} part, I know I can use this:
fe = data.to_dict(orient='records')
Out[28]:
[{'age': 25, 'gender': 1, 'id': 1},
{'age': 23, 'gender': 0, 'id': 2},
{'age': 40, 'gender': 1, 'id': 3}]
I know I can also iterate over the dataframe to get the id, like this:
(row[1] for row in data.itertuples())
But I can't get these two together into one iterable (generator object).
I tried:
((row[1] for row in data.itertuples()), fe[i] for i in range(len(data)))
but the syntax is wrong.
Do you know how to do this?

pd.DataFrame.itertuples returns named tuples. You can iterate and convert each row to a dictionary via the purpose-built method _asdict. You can wrap this in a generator function to create a lazy reader:
data = pd.DataFrame({"id": [1, 2, 3],
                     "gender": [1, 0, 1],
                     "age": [25, 23, 40]})

def gen_rows(df):
    for row in df.itertuples(index=False):
        yield row._asdict()

G = gen_rows(data)
print(next(G))  # OrderedDict([('age', 25), ('gender', 1), ('id', 1)])
print(next(G))  # OrderedDict([('age', 23), ('gender', 0), ('id', 2)])
print(next(G))  # OrderedDict([('age', 40), ('gender', 1), ('id', 3)])
Note that the results are OrderedDict objects (on Python 3.8+, where namedtuple._asdict returns a regular dict, they will be plain dicts). As a subclass of dict, this should be sufficient for most purposes.

I think you need to first set_index by column id and then call to_dict with orient='index':
fe = data.set_index('id', drop=False).to_dict(orient='index')
print (fe)
{1: {'id': 1, 'gender': 1, 'age': 25},
2: {'id': 2, 'gender': 0, 'age': 23},
3: {'id': 3, 'gender': 1, 'age': 40}}


Ordering an array by value

I have this array:
a = [{'id': 1, 'date': '2020-01-31'}, {'id': 2, 'date': '2020-01-25'}, {'id': 3, 'date': '2020-01-26'}]
I would like to order it by date, like this:
a = [{'id': 2, 'date': '2020-01-25'}, {'id': 3, 'date': '2020-01-26'}, {'id': 1, 'date': '2020-01-31'}]
How can I sort my array like that?
Thank you very much!
Simplest solution: use the built-in sorted function, and for the key argument use a lambda expression that extracts the date of each element, as shown below:
a = [
    {'id': 1, 'date': '2020-01-31'},
    {'id': 2, 'date': '2020-01-25'},
    {'id': 3, 'date': '2020-01-26'}
]
a_sorted = sorted(a, key=lambda e: e['date'])
print(a_sorted)
Console output:
[{'id': 2, 'date': '2020-01-25'}, {'id': 3, 'date': '2020-01-26'}, {'id': 1, 'date': '2020-01-31'}]
If you want to sort the array yourself, you can implement a sorting algorithm, for example bubble sort.
sorted(a, key=lambda item: item['date'])
If you need descending order:
sorted(a, key=lambda item: item['date'], reverse=True)
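Worth noting: this works because ISO-8601 dates (YYYY-MM-DD) sort the same way lexicographically as chronologically. operator.itemgetter is an equivalent key function that avoids the lambda; a quick sketch:

```python
from operator import itemgetter

a = [{'id': 1, 'date': '2020-01-31'},
     {'id': 2, 'date': '2020-01-25'},
     {'id': 3, 'date': '2020-01-26'}]

# itemgetter('date') behaves like lambda e: e['date']
a_sorted = sorted(a, key=itemgetter('date'))
print([e['id'] for e in a_sorted])  # [2, 3, 1]
```

If the dates were in a non-ISO format (e.g. DD/MM/YYYY), the key would need to parse them first, e.g. key=lambda e: datetime.strptime(e['date'], '%d/%m/%Y').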

Correlation in Apache Spark and groupBy with Python

I'm new to Python and Apache Spark, and I am trying to understand how the function pyspark.sql.functions.corr(val1, val2) works.
I have a big dataframe with auto brand, age and price. I want to get the correlation between age and price for each auto brand.
I have 2 solutions:
# get all brands
get_all_maker = data.groupBy("brand").agg(F.count("*").alias("counts")).collect()
for row in get_all_maker:
    print(row["brand"], ": ", data.filter(data["brand"] == row["brand"]).corr("age", "price"))
This solution is slow, because I call "corr" many times.
So I tried to do it with one aggregation:
get_all_maker_corr = data.groupBy("brand").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
for row in get_all_maker_corr:
    print(row["brand"], ": ", row["correlation"])
If I compare the results, they are different. But why?
I tried it with simple examples. Here I generate a simple data frame:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
Let's test two methods:
# first version
get_all_maker = b.groupBy("name").agg(F.count("*").alias("counts")).collect()
print("Correlation (1st)")
for row in get_all_maker:
    print(row["name"], "(", row["counts"], "):", b.filter(b["name"] == row["name"]).corr("age", "price"))

# second version
get_all_maker_corr = b.groupBy("name").agg(
    F.count("*").alias("counts"),
    F.corr("age", "price").alias("correlation")).collect()
print("Correlation (2nd)")
for row in get_all_maker_corr:
    print(row["name"], "(", row["counts"], "):", row["correlation"])
Both of them give me the same answer:
Correlation (1st)
b ( 2 ): 1.0
a ( 2 ): 1.0
Let's add another entry with a None value to the data frame:
d = [
    {'name': 'a', 'age': 1, 'price': 2},
    {'name': 'a', 'age': 2, 'price': 4},
    {'name': 'a', 'age': 3, 'price': None},
    {'name': 'b', 'age': 1, 'price': 1},
    {'name': 'b', 'age': 2, 'price': 2}
]
b = spark.createDataFrame(d)
With the first version you get these results:
Correlation (1st)
b ( 2 ): 1.0
a ( 3 ): -0.5
and the second version gives you different results:
Correlation (2nd)
b ( 2 ): 1.0
a ( 3 ): 1.0
I think that dataframe.filter followed by the corr function treats the None value as 0, while dataframe.groupBy with F.corr inside agg ignores the None value.
So these two methods are not equivalent. I don't know whether this is a bug or a feature of Spark, but if you want to compute a correlation, you should only use data without None values.
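The arithmetic behind this discrepancy can be checked without Spark. Below is a plain-Python Pearson correlation (a hand-rolled sketch, not Spark's implementation) applied to group 'a' both ways: treating None as 0 versus dropping the incomplete pair:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient for two equal-length lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ages, prices = [1, 2, 3], [2, 4, None]

# Treat None as 0 (what the filter + corr path appears to do):
r_zero = pearson(ages, [p if p is not None else 0 for p in prices])

# Drop the pair containing None (what F.corr in agg appears to do):
pairs = [(a, p) for a, p in zip(ages, prices) if p is not None]
r_drop = pearson([a for a, _ in pairs], [p for _, p in pairs])

print(r_zero)  # approximately -0.5
print(r_drop)  # approximately 1.0
```

The two results reproduce the -0.5 and 1.0 seen above, which supports the explanation that the two Spark code paths handle nulls differently.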

How to combine multiple numpy arrays into a dictionary list

I have the following array:
column_names = ['id', 'temperature', 'price']
And three numpy arrays as follows:
idArry = ([1, 2, 3, 4, ....])
tempArry = ([20.3, 30.4, 50.4, .....])
priceArry = ([1.2, 3.5, 2.3, .....])
I want to combine the above into a list of dictionaries as follows:
table_dict = [{'id': 1, 'temperature': 20.3, 'price': 1.2},
              {'id': 2, 'temperature': 30.4, 'price': 3.5}, ...]
I can use a for loop together with append to build this, but the list is huge, at about 15000 rows. Can someone show me how to use Python's zip functionality or another more efficient and fast way to achieve the above?
You can use a listcomp and the function zip():
[{'id': i, 'temperature': j, 'price': k} for i, j, k in zip(idArry, tempArry, priceArry)]
# [{'id': 1, 'temperature': 20.3, 'price': 1.2}, {'id': 2, 'temperature': 30.4, 'price': 3.5}]
If your ids are 1, 2, 3, ... and you use a list, you don't need the ids in your dicts; that information is redundant in a list.
[{'temperature': i, 'price': j} for i, j in zip(tempArry, priceArry)]
You can also use a dict of dicts. Lookup in a dict should be faster than in a list.
{i: {'temperature': j, 'price': k} for i, j, k in zip(idArry, tempArry, priceArry)}
# {1: {'temperature': 20.3, 'price': 1.2}, 2: {'temperature': 30.4, 'price': 3.5}}
I'd take a look at the functionality of the pandas package. In particular there is a pandas.DataFrame.to_dict method.
I'm confident that for large arrays this method should be pretty fast (though I'm willing to have the zip method proved more efficient).
In the example below I first construct a pandas dataframe from your arrays and then use the to_dict method.
import numpy as np
import pandas as pd
column_names = ['id', 'temperature', 'price']
idArry = np.array([1, 2, 3])
tempArry = np.array([20.3, 30.4, 50.4])
priceArry = np.array([1.2, 3.5, 2.3])
df = pd.DataFrame(np.vstack([idArry, tempArry, priceArry]).T, columns=column_names)
table_dict = df.to_dict(orient='records')
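One caveat with this approach: np.vstack coerces everything to a single float dtype, so the ids come out as floats ('id': 1.0). If integer ids matter, the column can be cast back before conversion; a small sketch under that assumption:

```python
import numpy as np
import pandas as pd

column_names = ['id', 'temperature', 'price']
idArry = np.array([1, 2, 3])
tempArry = np.array([20.3, 30.4, 50.4])
priceArry = np.array([1.2, 3.5, 2.3])

df = pd.DataFrame(np.vstack([idArry, tempArry, priceArry]).T,
                  columns=column_names)
df['id'] = df['id'].astype(int)  # undo the float upcast from vstack
table_dict = df.to_dict(orient='records')
# table_dict[0] == {'id': 1, 'temperature': 20.3, 'price': 1.2}
```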
This could work. enumerate is used to create a counter that starts at 0, and the corresponding values are pulled out of tempArry and priceArry. This is also a generator expression, which helps with memory (especially if your lists are really large).
new_dict = ({'id': i + 1 , 'temperature': tempArry[i], 'price': priceArry[i]} for i, _ in enumerate(idArry))
You can use list-comprehension to achieve this by just iterating over one of the arrays:
[{'id': idArry[i], 'temperature': tempArry[i], 'price': priceArry[i]} for i in range(len(idArry))]
You could build a NumPy matrix and then convert it to a list of dictionaries as follows. Given your data (values changed just for the example):
import numpy as np
idArry = np.array([1,2,3,4])
tempArry = np.array([20,30,50,40])
priceArry = np.array([200,300,100,400])
Build the matrix:
table = np.array([idArry, tempArry, priceArry]).transpose()
Create the dictionary:
dict_table = [dict(zip(column_names, values)) for values in table]
#=> [{'id': 1, 'temperature': 20, 'price': 200}, {'id': 2, 'temperature': 30, 'price': 300}, {'id': 3, 'temperature': 50, 'price': 100}, {'id': 4, 'temperature': 40, 'price': 400}]
I don't know your exact purpose, but maybe you can also use the matrix directly, as follows:
temp_col = table[:,1]
table[temp_col >= 40]
# [[ 3 50 100]
# [ 4 40 400]]
A way to do it would be as follows:
column_names = ['id', 'temperature', 'price']
idArry = [1, 2, 3, 4]
tempArry = [20.3, 30.4, 50.4, 4]
priceArry = [1.2, 3.5, 2.3, 4.5]
You can zip all the elements of the different lists:
rows = list(zip(idArry, tempArry, priceArry))
print(rows)
[(1, 20.3, 1.2), (2, 30.4, 3.5), (3, 50.4, 2.3), (4, 4, 4.5)]
Then build the inner dictionaries with a list comprehension, zipping each row with the column names:
[dict(zip(column_names, row)) for row in rows]
[{'id': 1, 'temperature': 20.3, 'price': 1.2},
 {'id': 2, 'temperature': 30.4, 'price': 3.5},
 {'id': 3, 'temperature': 50.4, 'price': 2.3},
 {'id': 4, 'temperature': 4, 'price': 4.5}]
The advantage of this method is that it only uses built-ins and works for an arbitrary number of column_names.

Grouping python list of dictionaries and aggregation value data

I have an input list:
inlist = [{"id": 123, "hour": 5, "groups": "1"}, {"id": 345, "hour": 3, "groups": "1;2"}, {"id": 65, "hour": -2, "groups": "3"}]
I need to group the dictionaries by their 'groups' value. After that, I need to add the min and max hour of each group to the dictionaries in the grouped lists. The output should look like this:
outlist = [(1, [{"id": 123, "hour": 5, "min_group_hour": 3, "max_group_hour": 5}, {"id": 345, "hour": 3, "min_group_hour": 3, "max_group_hour": 5}]),
           (2, [{"id": 345, "hour": 3, "min_group_hour": 3, "max_group_hour": 3}]),
           (3, [{"id": 65, "hour": -2, "min_group_hour": -2, "max_group_hour": -2}])]
So far I have managed to group the input list:
new_list = []
for domain in inlist:
    for group in domain['groups'].split(';'):
        d = dict()
        d['id'] = domain['id']
        d['group'] = group
        d['hour'] = domain['hour']
        new_list.append(d)

for k, v in itertools.groupby(new_list, key=itemgetter('group')):
    print(int(k), max(list(v), key=itemgetter('hour')))
And the output is:
('1', [{'group': '1', 'id': 123, 'hour': 5}])
('2', [{'group': '2', 'id': 345, 'hour': 3}])
('3', [{'group': '3', 'id': 65, 'hour': -2}])
I don't know how to aggregate the values by group. And is there a more pythonic way of grouping dictionaries by a key value that needs to be split?
Start by creating a dict that maps group numbers to dictionaries:
from collections import defaultdict

dicts_by_group = defaultdict(list)
for dic in inlist:
    groups = map(int, dic['groups'].split(';'))
    for group in groups:
        dicts_by_group[group].append(dic)
This gives us a dict that looks like
{1: [{'id': 123, 'hour': 5, 'groups': '1'},
     {'id': 345, 'hour': 3, 'groups': '1;2'}],
 2: [{'id': 345, 'hour': 3, 'groups': '1;2'}],
 3: [{'id': 65, 'hour': -2, 'groups': '3'}]}
Then iterate over the grouped dicts and set the min_group_hour and max_group_hour for each group:
outlist = []
for group in sorted(dicts_by_group.keys()):
    dicts = dicts_by_group[group]
    min_hour = min(dic['hour'] for dic in dicts)
    max_hour = max(dic['hour'] for dic in dicts)
    dicts = [{'id': dic['id'], 'hour': dic['hour'],
              'min_group_hour': min_hour, 'max_group_hour': max_hour}
             for dic in dicts]
    outlist.append((group, dicts))
Result:
[(1, [{'id': 123, 'hour': 5, 'min_group_hour': 3, 'max_group_hour': 5},
      {'id': 345, 'hour': 3, 'min_group_hour': 3, 'max_group_hour': 5}]),
 (2, [{'id': 345, 'hour': 3, 'min_group_hour': 3, 'max_group_hour': 3}]),
 (3, [{'id': 65, 'hour': -2, 'min_group_hour': -2, 'max_group_hour': -2}])]
IIUC: Here is another way to do it in pandas:
import pandas as pd

inlist = [{"id": 123, "hour": 5, "group": "1"},
          {"id": 345, "hour": 3, "group": "1;2"},
          {"id": 65, "hour": -2, "group": "3"}]
df = pd.DataFrame(inlist)
# Minimum hour per group, renamed to min_hour
dfmi = df.groupby('group')['hour'].min().rename('min_hour').reset_index()
# Maximum hour per group, renamed to max_hour
dfmx = df.groupby('group')['hour'].max().rename('max_hour').reset_index()
# Merge the min df with the main df
df = df.merge(dfmi, on='group', how='outer')
# Merge the max df with the main df
df = df.merge(dfmx, on='group', how='outer')
output = list(df.apply(lambda x: x.to_dict(), axis=1))
# Dictionary of dictionaries
dict_out = df.to_dict(orient='index')
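As a sketch of a more compact pandas variant (using groupby(...).transform, not part of the original answer), the group min/max can be attached directly as new columns without any merges:

```python
import pandas as pd

inlist = [{"id": 123, "hour": 5, "group": "1"},
          {"id": 345, "hour": 3, "group": "1;2"},
          {"id": 65, "hour": -2, "group": "3"}]
df = pd.DataFrame(inlist)

# transform broadcasts each group's aggregate back to the original rows
df['min_group_hour'] = df.groupby('group')['hour'].transform('min')
df['max_group_hour'] = df.groupby('group')['hour'].transform('max')

records = df.to_dict(orient='records')
# records[0] == {'id': 123, 'hour': 5, 'group': '1',
#                'min_group_hour': 5, 'max_group_hour': 5}
```

Note that, as in the answer above, the 'group' strings are not split on ';' here; splitting would still need a preprocessing step like the one in the question.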

How to Extract Values from Dictionaries and Create Lists From Them

I have a list of dictionaries:
dictionaries = [
    {'id': 1, 'name': 'test1', 'description': 'foo'},
    {'id': 2, 'name': 'test2', 'description': 'bar'}
]
I would like to separate the values from the keys in each dictionary, making a list of tuples like this:
[(1, 'test1', 'foo'), (2, 'test2', 'bar')]
I have the following code to perform this...
values_list = []
for dict in dictionaries:
    values_list.append(list(dict.values()))
When I run this code in my app, I get:
TypeError: list() takes 0 positional arguments but 1 was given
What's the right way to do this type of list comprehension?
That can be done with a couple of comprehensions like:
Code:
def get_values_as_tuple(dict_list, keys):
    return [tuple(d[k] for k in keys) for d in dict_list]
How?
How does this work? This is a nested comprehension. Let's go from the inside out, and start with:
tuple(d[k] for k in keys)
This creates a tuple of all of the elements in d that are specified via k in keys. So, that is nice, but what the heck are d, k and keys?
keys is passed to the function, and are the keys we will look for in our dicts.
k is the individual values in keys from k in keys.
d is the individual dicts in dict_list from d in dict_list.
The outer comprehension builds a list of the tuples discussed above:
[tuple(d[k] for k in keys) for d in dict_list]
Test Code:
dictionaries = [{'id': 1, 'name': 'test1', 'description': 'foo'},
                {'id': 2, 'name': 'test2', 'description': 'bar'}]

def get_values_as_tuple(dict_list, keys):
    return [tuple(d[k] for k in keys) for d in dict_list]

print(get_values_as_tuple(dictionaries, ('id', 'name', 'description')))
Results:
[(1, 'test1', 'foo'), (2, 'test2', 'bar')]
Hi, if you want a list of tuples, that is easy with the list comprehension technique. The code you are using produces a list of lists, but your expected solution looks like a list of tuples:
dictionaries = [{'id': 1, 'name': 'test1', 'description': 'foo'},
                {'id': 2, 'name': 'test2', 'description': 'bar'}]

values_list = []
for d in dictionaries:
    values_list.append(tuple(d.values()))

data = [tuple(d.values()) for d in dictionaries]

print(values_list)
print(data)
This might help you; it will give you output like this:
[(1, 'test1', 'foo'), (2, 'test2', 'bar')]
[(1, 'test1', 'foo'), (2, 'test2', 'bar')]
In [14]: dictionaries = [{'id': 1, 'name': 'test1', 'description': 'foo'},
    ...:                 {'id': 2, 'name': 'test2', 'description': 'bar'}]

In [15]: [tuple(d.values()) for d in dictionaries]
Out[15]: [('foo', 1, 'test1'), ('bar', 2, 'test2')]
Hi Trey, dicts were unordered in Python before 3.7, so the values in the tuples may come out in a different order, as above. If you want an ordered result, you can use OrderedDict, or rely on Python 3.7+, where plain dicts preserve insertion order.
In Python 2.x:
from collections import namedtuple

dictionaries = [
    {'id': 1, 'name': 'test1', 'description': 'foo'},
    {'id': 2, 'name': 'test2', 'description': 'bar'}
]
main_list = []
keys = namedtuple("keys", ['id', 'name', 'description'])
for val in dictionaries:
    data = keys(**val)
    main_list.append((data.id, data.name, data.description))
print(main_list)
>>> [(1, 'test1', 'foo'), (2, 'test2', 'bar')]
Python 3.x:
dictionaries = [
    {'id': 1, 'name': 'test1', 'description': 'foo'},
    {'id': 2, 'name': 'test2', 'description': 'bar'}
]
data = [tuple(val.values()) for val in dictionaries]
print(data)
>>> [(1, 'test1', 'foo'), (2, 'test2', 'bar')]
Code:
dicts = [
    {'id': 1, 'name': 'test1', 'description': 'foo'},
    {'id': 2, 'name': 'test2', 'description': 'bar'}
]
v_list = []
for d in dicts:
    v_list.append(tuple(d.values()))
print(v_list)
Results:
[(1, 'test1', 'foo'), (2, 'test2', 'bar')]
