functools.reduce in-place modifies original dataframe - python

I am currently facing the issue that "functools.reduce(operator.iadd, ...)" alters the original input. E.g.
I have a simple dataframe
df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])
0
0 [A, B]
1 [C, D]
Applying the iadd operator leads to the following result:
functools.reduce(operator.iadd, df[0])
['A', 'B', 'C', 'D']
Now, the original df changed to
0
0 [A, B, C, D]
1 [C, D]
Also copying the df using df.copy(deep=True) beforehand does not help.
Does anyone have an idea how to overcome this issue?
THX, Lazloo

Use operator.add instead of operator.iadd:
In [8]: functools.reduce(operator.add, df[0])
Out[8]: ['A', 'B', 'C', 'D']
In [9]: df
Out[9]:
0
0 [A, B]
1 [C, D]
After all, operator.iadd(a, b) is the same as a += b. So it modifies df[0]. In contrast, operator.add(a, b) returns a + b, so there is no modification of df[0].
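As a quick illustration on plain lists (a small sketch, independent of pandas), iadd mutates its first argument in place:
import operator

a = [1, 2]
b = [3, 4]
result = operator.iadd(a, b)  # equivalent to a += b
print(result)       # [1, 2, 3, 4]
print(a)            # [1, 2, 3, 4] -- a itself was modified
print(result is a)  # True: iadd returns the mutated a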
Or, you could compute the same quantity using df[0].sum():
In [39]: df[0].sum()
Out[39]: ['A', 'B', 'C', 'D']
The docs for df.copy warn:
When deep=True, data is copied but actual Python objects
will not be copied recursively, only the reference to the object.
Since df[0] contains Python lists, the lists are not copied even with df.copy(deep=True). This is why modifying the copy still affects df.
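If you want to keep operator.iadd anyway, one workaround (a sketch, not part of the original answer) is to pass a fresh list as reduce's initializer, so only that accumulator is ever mutated:
import functools
import operator
import pandas as pd

df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])

# The empty-list initializer becomes the accumulator, so iadd only
# mutates that fresh list, never the lists stored in df.
result = functools.reduce(operator.iadd, df[0], [])
print(result)  # ['A', 'B', 'C', 'D']
print(df)      # unchanged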

In addition to @unutbu's good answer, you can also call the list.__add__ method directly (which, like operator.add, returns a new list rather than modifying its arguments):
df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])
functools.reduce(lambda x, y: x.__add__(y), df[0])
print(df)
and you can see from the output that the original df is unchanged:
0
0 [A, B]
1 [C, D]

Related

python: sort array when sorting other array

I have two arrays:
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
These two arrays are linked (in the sense that there is a 1-1 correspondence between their elements), so when I sort a in decreasing order I would like to sort b in the same order.
For instance, when I do:
a = np.sort(a)[::-1]
I get:
a = [6, 4, 3, 2, 1]
and I would like to be able to get also:
b = ['g', 'e', 'd', 'f', 'c']
I would do something like this:
import numpy as np
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
idx_order = np.argsort(a)[::-1]
a = a[idx_order]
b = b[idx_order]
output:
a = [6 4 3 2 1]
b = ['g' 'e' 'd' 'f' 'c']
I don't know how, or even if, you can do this with numpy arrays. However, there is a way using standard lists, albeit slightly convoluted. Consider this:
a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']
assert len(a) == len(b)
c = []
for i in range(len(a)):
    c.append((a[i], b[i]))
r = sorted(c, reverse=True)
for i in range(len(r)):
    a[i], b[i] = r[i]
print(a)
print(b)
In your problem statement, there is no explicit relationship between the two arrays. What happens here is that we make one by grouping the corresponding elements from each array into a temporary list of tuples. In this scenario, sorted() with reverse=True carries out a descending sort on the first element of each tuple, matching the order you asked for. We then just rebuild our original arrays.
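For completeness, the same pairing idea can be written more compactly (a sketch; it returns new lists rather than sorting in place):
a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']

# Pair the elements, sort the pairs in descending order of a's values,
# then unzip back into two lists.
pairs = sorted(zip(a, b), reverse=True)
a_sorted, b_sorted = (list(t) for t in zip(*pairs))
print(a_sorted)  # [6, 4, 3, 2, 1]
print(b_sorted)  # ['g', 'e', 'd', 'f', 'c']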

Write many columns (more than 20)

I want to know an efficient and simple way to write many columns to a single file with Python.
For example, I have arrays a and b, each holding 20 values per row, for N rows.
Each row has different a and b values.
I would like to write a file with a format like this:
Names of each column
0 a[0] b[0] a[1] b[1] ... a[19] b[19]
1 a[0] b[0] a[1] b[1] ... a[19] b[19]
I can only think of this way:
data = open(output_filename, 'w')
for i in range(0, N):
    data.write('{} {} {} ...\n'.format(i, a[0], b[0], ...))
import numpy as np

# Assuming a, b are numpy arrays of shape (N, 20); else convert them first:
# a = np.array(a)
# b = np.array(b)
c = np.zeros((a.shape[0], 40))
for i in range(20):
    c[:, 2*i] = a[:, i]
    c[:, 2*i+1] = b[:, i]
np.savetxt("test.txt", c)
This is the simplest way I could think of.
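If you want to avoid the explicit loop, here is a sketch of the same interleaving using slice assignment (the example arrays are hypothetical stand-ins for the real a and b, assumed to be of shape (N, 20)):
import numpy as np

N = 100
a = np.arange(N * 20).reshape(N, 20)
b = -a

# Even columns come from a, odd columns from b.
c = np.empty((a.shape[0], a.shape[1] + b.shape[1]))
c[:, 0::2] = a
c[:, 1::2] = b
np.savetxt("test.txt", c)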
If I understand correctly, you want to interleave two lists. You can do that with zip and some post-processing of it:
>>> a = ['a', 'b', 'c']
>>> b = ['d', 'e', 'f']
>>> print(list(zip(a, b)))
[('a', 'd'), ('b', 'e'), ('c', 'f')]
>>> from itertools import chain
>>> print(list(chain.from_iterable(zip(a, b))))
['a', 'd', 'b', 'e', 'c', 'f']
>>> print(' '.join(chain.from_iterable(zip(a, b))))
a d b e c f
You would probably apply that to your case something like this (wrapping the values in map(str, ...), since join needs strings):
data.write('{} {}\n'.format(i, ' '.join(map(str, chain.from_iterable(zip(a, b))))))
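Putting it together for the original N-row case, a sketch (the data here is hypothetical; it assumes a and b are sequences of per-row sequences):
from itertools import chain

N = 3
a = [[row * 20 + j for j in range(20)] for row in range(N)]
b = [[-(row * 20 + j) for j in range(20)] for row in range(N)]

with open('output.txt', 'w') as data:
    for i in range(N):
        row = ' '.join(map(str, chain.from_iterable(zip(a[i], b[i]))))
        data.write('{} {}\n'.format(i, row))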

How to make a new pandas column that is a list of every value in an index range not including the row value

I was wondering if it is possible to make a new column in a pandas dataframe that is a list of every value in the same index group, NOT including the value of the row itself. For example, in the df below, the first row would get the values [b, c] in a 'List' column, since the value of the row itself is 'a'. Is this possible to do per index?
I have tried this, but it returns a list of all values combined per index:
import pandas as pd
d = {'index': [1, 1, 1, 2, 2, 3], 'col1': ['a', 'b', 'c', 'd', 'e, f', 'g']}
df = pd.DataFrame(d)
df = df.groupby("index")["col1"].apply(list)
Whereas I am looking for something that retains all of the rows and produces each list in a new column without the row's own value included.
Thank you for any help!!
We can use explode with groupby to create the whole list within each index, then do a set subtraction:
df['l'] = df.col1.str.split(',')
df['new'] = df.explode('l').groupby('index')['l'].agg(list).reindex(df['index']).tolist()
df['List'] = (df.new.apply(set) - df['l'].apply(set)).apply(list)
df.loc[~df.List.astype(bool), 'List'] = df.l
df
index col1 l new List
0 1 a [a] [a, b, c] [c, b]
1 1 b [b] [a, b, c] [a, c]
2 1 c [c] [a, b, c] [a, b]
3 2 d [d] [d, e, f] [e, f]
4 2 e, f [e, f] [d, e, f] [d]
5 3 g [g] [g] [g]
Update
l = []
for x, y in zip(df.l, df.new):
    y = y.copy()
    for i in x:
        if i in y:
            y.remove(i)
    l.append(y)

print(l)
[['b', 'c'], ['a', 'c'], ['a', 'b'], ['e', ' f'], ['d'], []]
df['List'] = l
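For reference, here is a self-contained sketch of the same remove-one-occurrence idea starting from the original data (the own and group_all names are just illustrative):
import pandas as pd

d = {'index': [1, 1, 1, 2, 2, 3], 'col1': ['a', 'b', 'c', 'd', 'e, f', 'g']}
df = pd.DataFrame(d)

# Each row's own values as a list (splitting on ', ' drops the space).
df['own'] = df['col1'].str.split(', ')

# The full list of values in each index group.
group_all = df.groupby('index')['own'].apply(
    lambda s: [v for vals in s for v in vals])

def minus_own(row):
    # Remove one occurrence of each of this row's own values.
    rest = list(group_all[row['index']])
    for v in row['own']:
        rest.remove(v)
    return rest

df['List'] = df.apply(minus_own, axis=1)
print(df[['index', 'col1', 'List']])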

Remove rows of dataframe of values in a list present in another list

I'm attempting to remove the rows in df whose lists in Column1 contain any value present in lst.
I'm aware of using df[df[x].isin(y)] for singular strings but am not sure how to adjust the same method to work with lists within a dataframe.
lst = ['f','a']
df:
Column1 Out1
0 ['x', 'y'] a
1 ['a', 'b'] i
2 ['c', 'd'] o
3 ['e', 'f'] u
etc.
I've attempted to use a list comprehension, but it doesn't seem to work the same way with Pandas:
df = df[[i for x in list for i in df['Column1']]]
Error:
TypeError: unhashable type: 'list'
My expected output would be as follows, removing the rows whose lists contain any of the values in lst:
Column1 Out1
0 ['x', 'y'] a
1 ['c', 'd'] o
etc.
You can convert the values to sets and take the intersection with &; inverting with not gives the keep-mask:
import pandas as pd

df = pd.DataFrame({'Column1':[['x','y'], ['a','b'], ['c','d'],['e','f']],
                   'Out1':list('aiou')})
lst = ['f','a']

df1 = df[df['Column1'].apply(lambda x: not (set(x) & set(lst)))]
print (df1)
Column1 Out1
0 [x, y] a
2 [c, d] o
Solution with a nested list comprehension - it builds a list of booleans, so all is needed to check that every value passes:
df1 = df[[all(x not in lst for x in i) for i in df['Column1']]]
print (df1)
Column1 Out1
0 [x, y] a
2 [c, d] o
print ([[x not in lst for x in i] for i in df['Column1']])
[[True, True], [False, True], [True, True], [True, False]]
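Another compact option (a sketch using the same df and lst) is set.isdisjoint, which is True exactly when a row's list shares no element with lst:
import pandas as pd

df = pd.DataFrame({'Column1': [['x', 'y'], ['a', 'b'], ['c', 'd'], ['e', 'f']],
                   'Out1': list('aiou')})
lst = ['f', 'a']

# Keep only rows whose list shares no element with lst.
mask = df['Column1'].apply(lambda vals: set(vals).isdisjoint(lst))
print(df[mask])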

How do I do a SQL style disjoint or set difference on two Pandas DataFrame objects?

I'm trying to use Pandas to solve an issue courtesy of an idiot DBA not doing a backup of a now crashed data set, so I'm trying to find differences between two columns. For reasons I won't get into, I'm using Pandas rather than a database.
What I'd like to do is, given:
Dataset A = [A, B, C, D, E]
Dataset B = [C, D, E, F]
I would like to find values which are disjoint.
Dataset A!=B = [A, B, F]
In SQL, this is standard set logic, accomplished differently depending on the dialect, but standard functionality nonetheless. How do I elegantly apply this in Pandas? I would love to include some code, but nothing I have is even remotely correct. It's a situation in which I don't know what I don't know... Pandas has set logic for intersection and union, but nothing for disjoint/set difference.
Thanks!
You can use the set.symmetric_difference method:
In [1]: df1 = pd.DataFrame(list('ABCDE'), columns=['x'])
In [2]: df1
Out[2]:
x
0 A
1 B
2 C
3 D
4 E
In [3]: df2 = pd.DataFrame(list('CDEF'), columns=['y'])
In [4]: df2
Out[4]:
y
0 C
1 D
2 E
3 F
In [5]: set(df1.x).symmetric_difference(df2.y)
Out[5]: {'A', 'B', 'F'}
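If you are already working with numpy-backed columns, numpy's setxor1d computes the same symmetric difference, returned sorted (a sketch using the frames above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(list('ABCDE'), columns=['x'])
df2 = pd.DataFrame(list('CDEF'), columns=['y'])

print(np.setxor1d(df1.x, df2.y))  # ['A' 'B' 'F']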
Here's a solution for multiple columns. It's probably not very efficient, and I would love to get some feedback on making it faster:
input = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
def set_difference(input_set, limit_on_set):
    limit_on_set_sub = limit_on_set[['A', 'B']]
    limit_on_tuples = [tuple(x) for x in limit_on_set_sub.values]
    limit_on_dict = dict.fromkeys(limit_on_tuples, 1)
    entries_in_limit = input_set.apply(
        lambda row: (row['A'], row['B']) in limit_on_dict, axis=1)
    return input_set[~entries_in_limit]
>>> set_difference(input, limit)
   A  B
1  2  a
3  3  a
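A pandas-native sketch of the same multi-column difference uses merge with indicator=True, the usual "anti-join" idiom (input is renamed to input_df here to avoid shadowing the builtin):
import pandas as pd

input_df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Rows of input_df whose (A, B) pair does not appear in limit.
merged = input_df.merge(limit, on=['A', 'B'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)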
