Pandas - Concatenating two columns of string lists - python

I've got an... interesting data frame that comes from a database. The data frame has two columns, which are lists of strings. I need to concatenate the values in these two lists to create a new column of lists. For example:
data = [
{'id': 1, 'l1': ['Luke', 'Han'], 'l2': ['Skywalker', 'Solo']},
{'id': 2, 'l1': ['Darth', 'Kylo'], 'l2': ['Vader', 'Ren']},
{'id': 3, 'l1': [], 'l2': []}
]
df = pd.DataFrame(data)
Notice the third row has no values. You can also assume that l1 and l2 are of the same length.
And I need to concat the values in l1 and l2 (with a space between), e.g.:
result = [
{'id': 1, 'name': ['Luke Skywalker', 'Han Solo']},
{'id': 2, 'name': ['Darth Vader', 'Kylo Ren']},
{'id': 3, 'name': []}
]
result_df = pd.DataFrame(result)

You can use ' '.join in combination with zip inside a comprehension to iterate over your dataset, for example:
import pandas as pd
data = [
{'id': 1, 'l1': ['Luke', 'Han'], 'l2': ['Skywalker', 'Solo']},
{'id': 2, 'l1': ['Darth', 'Kylo'], 'l2': ['Vader', 'Ren']},
{'id': 3, 'l1': [], 'l2': []}
]
df = pd.DataFrame(data)
result = [
{
'id': row['id'],
'name': [' '.join(l1_l2) for l1_l2 in zip(row['l1'], row['l2'])]
} for row in data
]
print(pd.DataFrame(result))
>>>
id name
0 1 [Luke Skywalker, Han Solo]
1 2 [Darth Vader, Kylo Ren]
2 3 []

This should get you to where you want, assuming you only have two columns (if you have more, just add another ' ' + df.iloc[j, 3][i], ' ' + df.iloc[j, 4][i], and so on):
Voila = []
for j in range(len(df)):
    Voila.append([df.iloc[j, 1][i] + ' ' + df.iloc[j, 2][i] for i in range(len(df.loc[j, 'l1']))])
df['Voila'] = Voila
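As an alternative sketch (not from either answer above), the same result can be computed row-wise with DataFrame.apply; this assumes, as the question states, that l1 and l2 always have the same length:

```python
import pandas as pd

data = [
    {'id': 1, 'l1': ['Luke', 'Han'], 'l2': ['Skywalker', 'Solo']},
    {'id': 2, 'l1': ['Darth', 'Kylo'], 'l2': ['Vader', 'Ren']},
    {'id': 3, 'l1': [], 'l2': []}
]
df = pd.DataFrame(data)

# Pair up the two lists row by row and join each pair with a space;
# zip on two empty lists simply yields nothing, so the empty row is handled for free
df['name'] = df.apply(
    lambda row: [' '.join(pair) for pair in zip(row['l1'], row['l2'])],
    axis=1
)
result_df = df[['id', 'name']]
print(result_df)
```

apply with axis=1 is not vectorized, so for very large frames the plain list comprehension over the records may be just as fast.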

Related

Python: consolidate list of lists

I want to consolidate a list of lists (of dicts), but I have honestly no idea how to get it done.
The list looks like this:
l1 = [
[
{'id': 1, 'category': 5}, {'id': 3, 'category': 7}
],
[
{'id': 1, 'category': 5}, {'id': 4, 'category': 8}, {'id': 6, 'category': 9}
],
[
{'id': 6, 'category': 9}, {'id': 9, 'category': 16}
],
[
{'id': 2, 'category': 4}, {'id': 5, 'category': 17}
]
]
If one of the dicts from l1[0] is also present in l1[1], I want to concatenate the two lists and delete l1[0]. Afterwards I want to check if there are values from l1[1] also present in l1[2].
So my desired output would eventually look like this:
new_list = [
[
{'id': 1, 'category': 5}, {'id': 3, 'category': 7}, {'id': 4, 'category': 8}, {'id': 6, 'category': 9}, {'id': 9, 'category': 16}
],
[
{'id': 2, 'category': 4}, {'id': 5, 'category': 17}
]
]
Any idea how it can be done?
I tried it with 3 different for loops, but it wouldn't work, because I change the length of the list and by doing so I provoke an index-out-of-range error (apart from that, it would be an ugly solution anyway):
for list in l1:
    for dictionary in list:
        for index in range(0, len(l1), 1):
            if dictionary in l1[index]:
                dictionary in l1[index].append(list)
                dictionary.remove(list)
Can I apply some map or list_comprehension here?
Thanks a lot for any help!
IIUC, the following algorithm works.
Initialize result to empty
For each sublist in l1:
    if sublist and last item in result overlap
        append into last list of result without overlapping items
    otherwise
        append sublist at end of result
Code
# Helper functions
def append(list1, list2):
    ' append list1 and list2 (without duplicating elements) '
    return list1 + [d for d in list2 if d not in list1]

def is_intersect(list1, list2):
    ' True if list1 and list2 have an element in common '
    return any(d in list2 for d in list1) or any(d in list1 for d in list2)

# Generate desired result
result = []  # resulting list
for sublist in l1:
    if not result or not is_intersect(sublist, result[-1]):
        result.append(sublist)
    else:
        # Intersection with last list, so append to last list in result
        result[-1] = append(result[-1], sublist)
print(result)
Output
[[{'id': 1, 'category': 5},
{'id': 3, 'category': 7},
{'id': 4, 'category': 8},
{'id': 6, 'category': 9},
{'id': 9, 'category': 16}],
[{'id': 2, 'category': 4}, {'id': 5, 'category': 17}]]
Maybe you can try appending the elements to a new list. By doing so, the original list remains the same and the index-out-of-range error wouldn't be raised.
new_list = []
for list in l1:
    inner_list = []
    for ...
        if dictionary in l1[index]:
            inner_list.append(list)
        ...
    new_list.append(inner_list)
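Fleshing that sketch out into something runnable (a hedged variant of the same idea, not the answerer's exact code): build the merged groups in a fresh list so the original is never mutated while iterating:

```python
l1 = [
    [{'id': 1, 'category': 5}, {'id': 3, 'category': 7}],
    [{'id': 1, 'category': 5}, {'id': 4, 'category': 8}, {'id': 6, 'category': 9}],
    [{'id': 6, 'category': 9}, {'id': 9, 'category': 16}],
    [{'id': 2, 'category': 4}, {'id': 5, 'category': 17}]
]

new_list = []
for sublist in l1:
    for group in new_list:
        # Merge into the first existing group that shares at least one dict
        if any(d in group for d in sublist):
            group.extend(d for d in sublist if d not in group)
            break
    else:
        # No overlap with any existing group: start a new one
        new_list.append(list(sublist))

print(new_list)
```

Like the accepted approach, this is a single pass, so a chain that appears out of order (a later sublist bridging two earlier groups) would need the pass repeated until nothing merges.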

How to convert list of dict into two lists?

For example:
persons = [{'id': 1, 'name': 'john'}, {'id': 2, 'name': 'mary'}, {'id': 3, 'name': 'tom'}]
I want to get two lists from it:
ids = [1, 2, 3]
names = ['john', 'mary', 'tom']
What I did:
names = [d['name'] for d in persons]
ids = [d['id'] for d in persons]
Is there a better way to do it?
I'd stick with using a list comprehension, or use #Woodford's technique:
ids, names = [dcts['id'] for dcts in persons], [dcts['name'] for dcts in persons]
output
[1, 2, 3]
['john', 'mary', 'tom']
What you did works fine. Another way to handle this (not necessarily better, depending on your needs) is to store your data in a more efficient dictionary and pull the names/ids out of it when you need them:
>>> persons = [{'id': 1, 'name': 'john'}, {'id': 2, 'name': 'mary'}, {'id': 3, 'name': 'tom'}]
>>> p2 = {x['id']: x['name'] for x in persons}
>>> p2
{1: 'john', 2: 'mary', 3: 'tom'}
>>> list(p2.keys())
[1, 2, 3]
>>> list(p2.values())
['john', 'mary', 'tom']
You can do it with pandas in a vectorized fashion:
import pandas as pd
persons = [{'id': 1, 'name': 'john'}, {'id': 2, 'name': 'mary'}, {'id': 3, 'name': 'tom'}]
df = pd.DataFrame(persons)
id_list = df.id.tolist() #[1, 2, 3]
name_list = df.name.tolist() #['john', 'mary', 'tom']
An alternative, inspired by this question, is
ids, names = zip(*map(lambda x: x.values(), persons))
which returns tuples. If you need lists:
ids, names = map(list, zip(*map(lambda x: x.values(), persons)))
It is a little bit slower on my laptop using python3.9 than the accepted answer but it might be useful.
It sounds like you're trying to iterate through the values of your list while unpacking your dictionaries:
persons = [{'id': 1, 'name': 'john'}, {'id': 2, 'name': 'mary'}, {'id': 3, 'name': 'tom'}]
ids, names = [], []
for x in persons:
    id, name = x.values()
    ids.append(id)
    names.append(name)
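For completeness, one more small sketch using operator.itemgetter, which keeps the key lookup explicit instead of relying on the order of dict.values():

```python
from operator import itemgetter

persons = [{'id': 1, 'name': 'john'}, {'id': 2, 'name': 'mary'}, {'id': 3, 'name': 'tom'}]

# itemgetter('id') is a callable that returns person['id'] for each dict
ids = list(map(itemgetter('id'), persons))
names = list(map(itemgetter('name'), persons))
```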

How to create a Dataframe from a list containing multiple dictionaries coming from multiple sources

I have a list containing multiple dictionaries coming from multiple sources. The key for each dictionary in this list is its source. For example, the list looks something like below:
data = [{'source1': {'time': 1, 'name': 'abc', 'memory': 9.82}},
{'source1': {'time': 2, 'name': 'def', 'memory': 9.14}},
{'source2': {'time': 1,'name': 'random1', 'memory': 1.45}},
{'source2': {'time': 2,'name': 'random2', 'memory': 1.49}}]
The above list contains dictionaries from multiple sources at a time, with many more attributes.
I want to create a data frame that looks something like below.
I tried what I could; here is the code:
import pandas as pd
data = [{'source1': {'time': 1, 'name': 'abc', 'memory': 9.82}},
{'source1': {'time': 2, 'name': 'def', 'memory': 9.14}},
{'source2': {'time': 1,'name': 'random1', 'memory': 1.45}},
{'source2': {'time': 2,'name': 'random2', 'memory': 1.49}}]
dfs = []
last_source = next(iter(data[0]))
df = pd.DataFrame()
for i in data:
    for key, val in i.items():
        new_source = key
        cols = []
        rows = []
        for subkey in val:
            cols.append(subkey)
            rows.append(val[subkey])
        if new_source != last_source:
            last_source = new_source
            dfs.append(df)
            df = pd.DataFrame([rows], columns=cols)
        else:
            dft = pd.DataFrame([rows], columns=cols)
            df = pd.concat([df, dft])  # df.append was removed in pandas 2.0
dfs.append(df)
#print(pd.concat(dfs, axis=1, join='inner'))
print(dfs[0].join(dfs[1].set_index('time'), on='time', lsuffix='_source1', rsuffix='_source2'))
OUTPUT
time name_source1 memory_source1 name_source2 memory_source2
0 1 abc 9.82 random1 1.45
0 2 def 9.14 random2 1.49
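A shorter alternative sketch (assuming each source has exactly one record per time value, and a pandas version that accepts a list for pivot's values, i.e. >= 1.1): flatten the nested dicts into one row per record, then pivot on the source:

```python
import pandas as pd

data = [{'source1': {'time': 1, 'name': 'abc', 'memory': 9.82}},
        {'source1': {'time': 2, 'name': 'def', 'memory': 9.14}},
        {'source2': {'time': 1, 'name': 'random1', 'memory': 1.45}},
        {'source2': {'time': 2, 'name': 'random2', 'memory': 1.49}}]

# One flat row per record, tagged with the source it came from
rows = [dict(record, source=source) for item in data for source, record in item.items()]
flat = pd.DataFrame(rows)

# Pivot so each source contributes its own name/memory columns,
# then flatten the resulting (value, source) MultiIndex column names
wide = flat.pivot(index='time', columns='source', values=['name', 'memory'])
wide.columns = [f'{value}_{source}' for value, source in wide.columns]
wide = wide.reset_index()
print(wide)
```

This also scales to any number of sources and attributes without tracking the previous source by hand.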

How to combine multiple numpy arrays into a dictionary list

I have the following array:
column_names = ['id', 'temperature', 'price']
And three numpy array as follows:
idArry = ([1,2,3,4,....])
tempArry = ([20.3,30.4,50.4,.....])
priceArry = ([1.2,3.5,2.3,.....])
I wanted to combine the above into a dictionary as follows:
table_dict = ( {'id':1, 'temperature':20.3, 'price':1.2 },
{'id':2, 'temperature':30.4, 'price':3.5},...)
I can use a for loop together with append to create the dictionary but the list is huge at about 15000 rows. Can someone show me how to use python zip functionality or other more efficient and fast way to achieve the above requirement?
You can use a listcomp and the function zip():
[{'id': i, 'temperature': j, 'price': k} for i, j, k in zip(idArry, tempArry, priceArry)]
# [{'id': 1, 'temperature': 20.3, 'price': 1.2}, {'id': 2, 'temperature': 30.4, 'price': 3.5}]
If your ids are 1, 2, 3... and you use a list, you don't need ids in your dicts; that would be redundant information in the list.
[{'temperature': i, 'price': j} for i, j in zip(tempArry, priceArry)]
You can use also a dict of dicts. The lookup in the dict must be faster than in the list.
{i: {'temperature': j, 'price': k} for i, j, k in zip(idArry, tempArry, priceArry)}
# {1: {'temperature': 20.3, 'price': 1.2}, 2: {'temperature': 30.4, 'price': 3.5}}
I'd take a look at the functionality of the pandas package. In particular there is a pandas.DataFrame.to_dict method.
I'm confident that for large arrays this method should be pretty fast (though I'm willing to have the zip method proved more efficient).
In the example below I first construct a pandas dataframe from your arrays and then use the to_dict method.
import numpy as np
import pandas as pd
column_names = ['id', 'temperature', 'price']
idArry = np.array([1, 2, 3])
tempArry = np.array([20.3, 30.4, 50.4])
priceArry = np.array([1.2, 3.5, 2.3])
df = pd.DataFrame(np.vstack([idArry, tempArry, priceArry]).T, columns=column_names)
table_dict = df.to_dict(orient='records')
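One caveat with the np.vstack route (my own observation, not part of the original answer): stacking the integer id array together with the float arrays upcasts everything to float, so the records come back with 'id': 1.0. Building the DataFrame column-wise preserves each array's dtype:

```python
import numpy as np
import pandas as pd

idArry = np.array([1, 2, 3])
tempArry = np.array([20.3, 30.4, 50.4])
priceArry = np.array([1.2, 3.5, 2.3])

# Each column keeps its own dtype (int64 for id, float64 for the rest)
df = pd.DataFrame({'id': idArry, 'temperature': tempArry, 'price': priceArry})
table_dict = df.to_dict(orient='records')
```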
This could work. Enumerate is used to create a counter that starts at 0 and then each applicable value is pulled out of your tempArry and priceArray. This also creates a generator expression which helps with memory (especially if your lists are really large).
new_dict = ({'id': i + 1 , 'temperature': tempArry[i], 'price': priceArry[i]} for i, _ in enumerate(idArry))
You can use list-comprehension to achieve this by just iterating over one of the arrays:
[{'id': idArry[i], 'temperature': tempArry[i], 'price': priceArry[i]} for i in range(len(idArry))]
You could build a NumPy matrix then convert to a dictionary as follow. Given your data (I changed the values just for example):
import numpy as np
idArry = np.array([1,2,3,4])
tempArry = np.array([20,30,50,40])
priceArry = np.array([200,300,100,400])
Build the matrix:
table = np.array([idArry, tempArry, priceArry]).transpose()
Create the dictionary:
dict_table = [dict(zip(column_names, values)) for values in table]
#=> [{'id': 1, 'temperature': 20, 'price': 200}, {'id': 2, 'temperature': 30, 'price': 300}, {'id': 3, 'temperature': 50, 'price': 100}, {'id': 4, 'temperature': 40, 'price': 400}]
I don't know your end goal, but maybe you can also use the matrix directly, as follows:
temp_col = table[:,1]
table[temp_col >= 40]
# [[ 3 50 100]
# [ 4 40 400]]
A way to do it would be as follows:
column_names = ['id', 'temperature', 'price']
idArry = [1, 2, 3, 4]
tempArry = [20.3, 30.4, 50.4, 4]
priceArry = [1.2, 3.5, 2.3, 4.5]
You can zip all the elements of the different lists:
l = zip(idArry, tempArry, priceArry)
print(list(l))
[(1, 20.3, 1.2), (2, 30.4, 3.5), (3, 50.4, 2.3), (4, 4, 4.5)]
Note that list(l) consumes the zip iterator, so rebuild it before creating the inner dictionaries with a list comprehension:
l = zip(idArry, tempArry, priceArry)
[dict(zip(column_names, row)) for row in l]
[{'id': 1, 'temperature': 20.3, 'price': 1.2},
{'id': 2, 'temperature': 30.4, 'price': 3.5},
{'id': 3, 'temperature': 50.4, 'price': 2.3},
{'id': 4, 'temperature': 4, 'price': 4.5}]
The advantage of using this method is that it only uses built-in methods and that it works for an arbitrary amount of column_names.

Create a dict from two lists of dicts with matching value

I have two lists of dictionaries, dict1 and dict2.
dict1 = [
{
'id': 0,
'name': 'James'
}, {
'id': 1,
'name': 'Bob'
}
]
dict2 = [
{
'id': 0,
'name': 'James'
}, {
'id': 1,
'name': 'James'
}, {
'id': 2,
'name': 'Bob'
}
]
And I want to create a dict like this:
result = {'James': [0, 1], 'Bob': [2]}
With the names from dict1 as keys and as value the list of "id" fields having the same name.
What's a clean way to do that in Python?
IIUC, I believe the last line should be Bob and not James. That way, using pandas:
import pandas as pd
>>> df1 = pd.DataFrame(dict1)
>>> df2 = pd.DataFrame(dict2)
>>> df2.groupby('name').agg(list).to_dict()['id']
{'Bob': [2], 'James': [0, 1]}
To filter only names that are in dict1,
>>> df2 = df2[df2['name'].isin(df1['name'])]
and group and agg after that
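Putting the filter and the groupby together, a runnable sketch of that approach:

```python
import pandas as pd

dict1 = [{'id': 0, 'name': 'James'}, {'id': 1, 'name': 'Bob'}]
dict2 = [{'id': 0, 'name': 'James'}, {'id': 1, 'name': 'James'}, {'id': 2, 'name': 'Bob'}]

df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)

# Keep only rows whose name appears in dict1, then collect the ids per name
df2 = df2[df2['name'].isin(df1['name'])]
result = df2.groupby('name').agg(list).to_dict()['id']
print(result)
```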
You can put the names in dict1 in a set first, so that when you iterate over dict2, you can check if the current name is in the set before adding it to the resulting dict of lists:
names = {d['name'] for d in dict1}
result = {}
for d in dict2:
if d['name'] in names:
result.setdefault(d['name'], []).append(d['id'])
result becomes:
{'James': [0, 1], 'Bob': [2]}
You can also do it in pure Python as follows:
dict1 = [
{
'id': 0,
'name': 'James'
}, {
'id': 1,
'name': 'Bob'
}
]
dict2 = [
{
'id': 0,
'name': 'James'
}, {
'id': 1,
'name': 'James'
}, {
'id': 2,
'name': 'Bob'
}
]
names = [elem['name'] for elem in dict1]
result = dict((name, []) for name in names)
for elem in dict2:
    result[elem['name']].append(elem['id'])
print(result)
Output:
{'James': [0, 1], 'Bob': [2]}
IIUC, here's a defaultdict solution:
>>> from collections import defaultdict
>>>
>>> result = defaultdict(list)
>>> names = {d['name'] for d in dict1}
>>>
>>> for d in dict2:
...: name = d['name']
...: if name in names:
...: result[name].append(d['id'])
...:
>>> result
defaultdict(list, {'Bob': [2], 'James': [0, 1]})
