How to combine two rows in a Python list

Suppose I have a 2D list,
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
I want to obtain a list such that, if the first two elements in rows are identical, the fourth elements are summed, the third element is dropped, and the rows are combined, like the following:
b = [['a','b',3],
     ['a','e',7]]
What is the most efficient way to do this?

If your list is already sorted, then you can use itertools.groupby. Once you group by the first two elements, you can use a generator expression to sum the 4th element and create your new lists.
>>> from itertools import groupby
>>> a = [['a','b','c',1],
...      ['a','b','d',2],
...      ['a','e','d',3],
...      ['a','e','c',4]]
>>> [g[0] + [sum(i[3] for i in g[1])] for g in groupby(a, key = lambda i : i[:2])]
[['a', 'b', 3],
['a', 'e', 7]]
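
If the input might not already be sorted, a minimal sketch of the same idea: sort by the same key first, since groupby only merges adjacent rows with equal keys.
from itertools import groupby

a = [['a','e','d',3], ['a','b','c',1], ['a','e','c',4], ['a','b','d',2]]  # unsorted
key = lambda r: r[:2]  # group on the first two elements
b = [k + [sum(r[3] for r in g)] for k, g in groupby(sorted(a, key=key), key=key)]
print(b)  # [['a', 'b', 3], ['a', 'e', 7]]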

Using pandas's groupby (selecting column 3 before summing keeps the string column out of the sum on newer pandas versions):
import pandas as pd

df = pd.DataFrame(a)
df.groupby([0, 1])[3].sum().reset_index().values.tolist()
Output:
[['a', 'b', 3], ['a', 'e', 7]]

You can use pandas groupby methods to achieve that goal.
import pandas as pd
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
df = pd.DataFrame(a)
df_sum = df.groupby([0,1])[3].sum().reset_index()
array_return = df_sum.values
list_return = array_return.tolist()
print(list_return)
list_return is the result you want.

If you're interested, here is an implementation in plain Python. I've only tested it on the dataset you provided.
a = [['a','b','c',1],
     ['a','b','d',2],
     ['a','e','d',3],
     ['a','e','c',4]]
b_dict = {}
for row in a:
    key = (row[0], row[1])
    b_dict[key] = b_dict[key] + row[3] if key in b_dict else row[3]
b = [[key[0], key[1], value] for key, value in b_dict.items()]
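
Equivalently, collections.defaultdict removes the key-existence check; a small sketch of the same accumulation:
from collections import defaultdict

totals = defaultdict(int)  # missing keys start at 0
for row in a:
    totals[(row[0], row[1])] += row[3]
b = [[k0, k1, total] for (k0, k1), total in totals.items()]
print(b)  # [['a', 'b', 3], ['a', 'e', 7]]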

Related

How to create key and list of elements from column as values from dataframes

How to create a Python dictionary using the data below
Df1:
Id  mail-id
1   xyz#gm
1   ygzbb
2   Ghh
2   Hjkk
I want it as
{1: [xyz#gm, ygzbb], 2: [Ghh, Hjkk]}
Something like this?
data = [
    [1, "xyz#gm"],
    [1, "ygzbb"],
    [2, "Ghh"],
    [2, "Hjkk"],
]
dataDict = {}
for k, v in data:
    if k not in dataDict:
        dataDict[k] = []
    dataDict[k].append(v)
print(dataDict)
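
The same grouping can also be written with dict.setdefault, if you prefer (a minor variation, same behavior):
dataDict = {}
for k, v in data:
    dataDict.setdefault(k, []).append(v)  # create the list the first time k is seen
print(dataDict)  # {1: ['xyz#gm', 'ygzbb'], 2: ['Ghh', 'Hjkk']}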
One option is to groupby the Id column and turn the mail-id into a list in a dictionary comprehension:
{k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
One option is to iterate over the set of ids and check one by one (note the exact column names, "Id" and "mail-id"):
>>> _d = {}
>>> df = pd.DataFrame({"Id":[1,1,2,2],"mail-id":["xyz#gm","ygzbb","Ghh","Hjkk"]})
>>> for x in set(df["Id"]):
...     _d.update({x: df[df["Id"]==x]["mail-id"].tolist()})
But it's much faster to use a dictionary comprehension and the built-in pandas DataFrame.groupby; a quick look at the official documentation:
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
As @fsimonjetz pointed out, this code will be sufficient:
>>> df = pd.DataFrame({"Id":[1,1,2,2],"mail-id":["xyz#gm","ygzbb","Ghh","Hjkk"]})
>>> {k:v["mail-id"].values.tolist() for k,v in df.groupby("Id")}
You can do:
df.groupby('Id').agg(list).to_dict()['mail-id']
Output:
{1: ['xyz#gm', 'ygzbb'], 2: ['Ghh', 'Hjkk']}

How to check if any element of a dictionary value is in a string?

I have a dataframe with strings and a dictionary whose values are lists of strings.
I need to check whether each string in the dataframe contains any element of any value in the dictionary, and if it does, label it with the corresponding key from the dictionary. In short, I need to categorize all the strings in the dataframe with keys from the dictionary.
For example.
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
    'a':['x1','x2','x3','x4'],
    'Category':['one','two','x3','x4']})
I tried this, but it has not worked:
df['Category'] = np.nan
for k, v in d.items():
    for l in v:
        df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
Firstly, create a function that does this for you:
def func(val):
    for x in range(0, len(d.values())):
        if val in list(d.values())[x]:
            return list(d.keys())[x]
Now make use of the split() and apply() methods:
df['Category'] = df['a'].str.split('', expand=True)[2].apply(func)
Finally, use the fillna() method:
df['Category'] = df['Category'].fillna(df['a'])
Now if you print df you will get your expected output:
    a Category
0  x1      one
1  x2      two
2  x3       x3
3  x4       x4
Edit:
You can also do this by:
def func(val):
    for x in range(0, len(d.values())):
        if any(l in val for l in list(d.values())[x]):
            return list(d.keys())[x]
then:
df['Category'] = df['a'].apply(func)
Finally:
df['Category'] = df['Category'].fillna(df['a'])
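
For what it's worth, iterating d.items() directly avoids the repeated list(d.values()) calls; a sketch equivalent to the edited func above:
def func(val):
    # return the first key whose list has an element occurring in val
    for k, v in d.items():
        if any(l in val for l in v):
            return k  # implicitly returns None when nothing matches, so fillna still applies

df['Category'] = df['a'].apply(func).fillna(df['a'])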
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire. (I write into the dataframe by row index with .loc; plain chained indexing like df['Category'][i] would trigger SettingWithCopyWarning.)
import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
    d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
    found = False
    i = 0
    df['Category'] = np.nan
    for x in df['a']:
        for k, v in d.items():
            for item in v:
                if item in x:
                    df.loc[i, 'Category'] = k
                    found = True
                    break
            else:
                # inner loop finished without a match for this key
                df.loc[i, 'Category'] = x
            if found:
                found = False
                break
        i += 1
    print(df)

main()
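
For comparison, the whole per-row search also fits in a single comprehension using next() with a default (a sketch, not from either answer above):
import pandas as pd

df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
# first matching key, or the original string when no listed substring occurs
df['Category'] = [next((k for k, v in d.items() if any(s in x for s in v)), x)
                  for x in df['a']]
print(df)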

2D list to csv - by column

I'd like to export the content of a 2D-list into a csv file.
The size of the sublists can be different. For example, the 2D-list can be something like :
a = [ ['a','b','c','d'], ['e','f'], ['g'], [], ['h','i'] ]
I want my csv to store the data like this - "by column" :
a,e,g, ,h
b,f, , ,i
c
d
Do I have to add some blank spaces to get the same size for each sublist? Or is there another way to do it?
Thank you for your help
You can use itertools.zip_longest:
import itertools, csv
a = [ ['a','b','c','d'], ['e','f'], ['g'], [], ['h','i'] ]
with open('filename.csv', 'w', newline='') as f:  # newline='' as recommended by the csv module docs
    write = csv.writer(f)
    write.writerows(itertools.zip_longest(*a, fillvalue=''))
Output:
a,e,g,,h
b,f,,,i
c,,,,
d,,,,
It can be done using pandas and its transpose (T):
import pandas as pd
pd.DataFrame(a).T.to_csv('test.csv')
Result:
(test.csv)
,0,1,2,3,4
0,a,e,g,,h
1,b,f,,,i
2,c,,,,
3,d,,,,
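
Note that to_csv writes the index and header by default, which is why test.csv has the extra row and column above; passing index=False and header=False should reproduce the bare layout from the question:
pd.DataFrame(a).T.to_csv('test.csv', index=False, header=False)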
import itertools
import pandas as pd
First create a dataframe from the nested lists:
a = ['a','b','c','d']
b = ['e','f']
c = ['g']
d = []
e = ['h','i']
nest = [a, b, c, d, e]
df = pd.DataFrame(itertools.zip_longest(*nest), columns=['a', 'b', 'c', 'd', 'e'])
which looks like this:
   a     b     c     d     e
0  a     e     g  None     h
1  b     f  None  None     i
2  c  None  None  None  None
3  d  None  None  None  None
and then store it using pandas:
df.to_csv('filename.csv', index=False)
We have three tasks to do here: pad the sublists so all have the same length, transpose, and write to csv.
Sadly Python has no built-in function for padding, but it can be done relatively easily. I would do it the following way
(the following code is intended to give the result exactly as requested in the OP):
a = [['a','b','c','d'],['e','f'],['g'],[],['h','i']]
ml = max(len(i) for i in a)  # number of elements in the longest sublist
a = [(i+[' ']*ml)[:ml] for i in a]  # pad sublists shorter than the longest with ' '
a = list(zip(*a))  # transpose
a = [','.join(i) for i in a]  # create list of lines to be written
a = [i.rstrip(', ') for i in a]  # jettison spaces/commas not followed by a value
a = '\n'.join(a)  # create the single string to be written to file
with open('myfile.csv','w') as f:
    f.write(a)
Content of myfile.csv:
a,e,g, ,h
b,f, , ,i
c
d

Pandas use cell value as dict key to return dict value

My question relates to using the values in a dataframe column as dictionary keys in order to look up their respective values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" that has values either "A", "B", or "C"
I have a dictionary, dct, containing pairs A:2, B:4, C:6
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or greater than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to map your dictionary onto the existing dataframe and then apply a query on the new dataframe:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A':5, 'B':4, 'C':-1}
df['min_count'] = df['category'].map(dct)
df = df.query('count>min_count')
following your logic:
import pandas as pd
dct = {'A':2, 'B':4, 'C':6}
df = pd.DataFrame({'count':[1,2,5,6],
                   'category':['A','B','C','A']})
print('original dataframe')
print(df)
def process_row(x):
    # True when the row's count reaches its category's threshold
    return x['count'] >= dct[x['category']]
f = df.apply(process_row, axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
   count category
0      1        A
1      2        B
2      5        C
3      6        A
final output
   count category
3      6        A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] returns a Series, which is mutable and unhashable, so it cannot be used as a dictionary key (dictionary keys need to be hashable).
So, apply and lambda to the rescue! :)
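
As a minor variation, Series.map does the same per-row lookup without a Python-level lambda:
result = df[df['count'] >= df['category'].map(dct)]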

python pandas groupby sorting and concatenating

I have a pandas data frame:
df = pd.DataFrame({'a': [1,1,1,1,2,2,2],
                   'b': ['a','a','a','a','b','b','b'],
                   'c': ['o','o','o','o','p','p','p'],
                   'd': [[2,3,4], [1,3,3,4], [3,3,1,2], [4,1,2], [8,2,1], [0,9,1,2,3], [4,3,1]],
                   'e': [13,12,5,10,3,2,5]})
What I want is:
First group by columns a, b, c --- there are two groups
Then sort within each group according to column e in an ascending order
Lastly concatenate within each group column d
So the result I want is:
result = pd.DataFrame({'a':[1,2], 'b':['a','b'], 'c':['o','p'], 'd':[[3,3,1,2,4,1,2,1,3,3,4,2,3,4],[0,9,1,2,3,8,2,1,4,3,1]]})
Could anyone share a quick/elegant way to do this? Thanks very much.
You can sort by column e, group by a, b and c, and then use a list comprehension to concatenate (flatten) the d column. Notice that we can sort and then group because, according to the docs, groupby preserves the order in which observations are sorted within each group:
(df.sort_values('e').groupby(['a', 'b', 'c'])['d']
 .apply(lambda g: [j for i in g for j in i]).reset_index())
An alternative to the list comprehension is chain from itertools:
from itertools import chain
(df.sort_values('e').groupby(['a', 'b', 'c'])['d']
.apply(lambda g: list(chain.from_iterable(g))).reset_index())
