I'm currently facing the issue that "functools.reduce(operator.iadd, ...)" alters the original input. E.g.
I have a simple dataframe
df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])
0
0 [A, B]
1 [C, D]
Applying the iadd operator leads to the following result:
functools.reduce(operator.iadd, df[0])
['A', 'B', 'C', 'D']
Now, the original df changed to
0
0 [A, B, C, D]
1 [C, D]
Copying the df beforehand using df.copy(deep=True) does not help either.
Does anyone have an idea how to overcome this issue?
Thanks, Lazloo
Use operator.add instead of operator.iadd:
In [8]: functools.reduce(operator.add, df[0])
Out[8]: ['A', 'B', 'C', 'D']
In [9]: df
Out[9]:
0
0 [A, B]
1 [C, D]
After all, operator.iadd(a, b) is the same as a += b. So it modifies df[0]. In contrast, operator.add(a, b) returns a + b, so there is no modification of df[0].
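The difference is easy to see on plain lists (a minimal sketch):

```python
import operator

a = [1, 2]
b = [3]

operator.iadd(a, b)      # same as a += b: extends a in place
print(a)                 # [1, 2, 3]

c = operator.add(a, b)   # same as a + b: builds a new list
print(a)                 # a is unchanged: [1, 2, 3]
print(c)                 # [1, 2, 3, 3]
```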
Or, you could compute the same quantity using df[0].sum():
In [39]: df[0].sum()
Out[39]: ['A', 'B', 'C', 'D']
The docs for df.copy warn:
When deep=True, data is copied but actual Python objects
will not be copied recursively, only the reference to the object.
Since df[0] contains Python lists, the lists are not copied even with df.copy(deep=True). This is why modifying the copy still affects df.
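A minimal sketch of the effect (variable names are illustrative): even the deep copy shares the same underlying list objects, and rebuilding each list element gives truly independent copies:

```python
import pandas as pd

df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])
df2 = df.copy(deep=True)

# Despite deep=True, both frames reference the SAME list objects:
print(df[0][0] is df2[0][0])   # True

# Rebuild the lists element-wise to get genuinely independent copies:
df3 = df.copy()
df3[0] = df3[0].apply(list)
print(df[0][0] is df3[0][0])   # False
```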
In addition to unutbu's good answer, you can call list.__add__ explicitly, which, like operator.add, returns a new list rather than mutating its operands:
df = pd.DataFrame([[['A', 'B']], [['C', 'D']]])
functools.reduce(lambda x, y: x.__add__(y), df[0])
print(df)
The output shows that df is unchanged:
0
0 [A, B]
1 [C, D]
Related
I have two arrays:
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
These two arrays are linked (in the sense that there is a one-to-one correspondence between their elements), so when I sort a in decreasing order I would like b to be sorted in the same order.
For instance, when I do:
a = np.sort(a)[::-1]
I get:
a = [6, 4, 3, 2, 1]
and I would like to be able to get also:
b = ['g', 'e', 'd', 'f', 'c']
I would do something like this:
import numpy as np
a = np.array([1,3,4,2,6])
b = np.array(['c', 'd', 'e', 'f', 'g'])
idx_order = np.argsort(a)[::-1]
a = a[idx_order]
b = b[idx_order]
output:
a = [6 4 3 2 1]
b = ['g' 'e' 'd' 'f' 'c']
I don't know how, or even whether, you can do this with NumPy arrays. However, there is a way using standard lists, albeit slightly convoluted. Consider this:
a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']
assert len(a) == len(b)
c = []
for i in range(len(a)):
    c.append((a[i], b[i]))
r = sorted(c, reverse=True)
for i in range(len(r)):
    a[i], b[i] = r[i]
print(a)
print(b)
In your problem statement, there is no relationship between the two arrays. What happens here is that we create one by grouping the corresponding elements of each array into a temporary list of tuples. sorted() sorts on the first element of each tuple (reverse=True gives the descending order you asked for), and we then rebuild the original arrays from the sorted pairs.
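The temporary-tuple idea can be written more compactly with zip; a sketch that passes reverse=True to get the descending order the question asks for:

```python
a = [1, 3, 4, 2, 6]
b = ['c', 'd', 'e', 'f', 'g']

# Pair the elements, sort the pairs by a (descending), then unzip.
pairs = sorted(zip(a, b), reverse=True)
a, b = (list(t) for t in zip(*pairs))
print(a)  # [6, 4, 3, 2, 1]
print(b)  # ['g', 'e', 'd', 'f', 'c']
```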
I want to know an efficient and simple way to write many columns to a single file with Python.
For example, I have arrays a and b, each of size 20, for N rows; each row has different a and b values.
I would like to write a file with a format like this:
Names of each column
0 a[0] b[0] a[1] b[1] ... a[19] b[19]
1 a[0] b[0] a[1] b[1] ... a[19] b[19]
I can only think of doing it this way:
data = open(output_filename, 'w')
for i in range(0, N):
    data.write('{} {} {} ...\n'.format(i, a[0], b[0], ....))
import numpy as np
# Assuming a and b are 2-D numpy arrays of shape (N, 20); else convert them first:
# a = np.array(a)
# b = np.array(b)
c = np.zeros((a.shape[0], 2 * a.shape[1]))
for i in range(a.shape[1]):
    c[:, 2*i] = a[:, i]
    c[:, 2*i+1] = b[:, i]
np.savetxt("test.txt", c)
This is the simplest way I could think of doing it.
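The per-column loop can also be replaced by two slice assignments; a sketch assuming a and b are 2-D arrays of the same shape (small hypothetical arrays here):

```python
import numpy as np

# Hypothetical small example: 2 rows, 3 columns each.
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[10, 20, 30], [40, 50, 60]])

# Interleave columns: even slots get a's columns, odd slots get b's.
c = np.empty((a.shape[0], 2 * a.shape[1]), dtype=a.dtype)
c[:, ::2] = a
c[:, 1::2] = b
print(c)
# [[ 1 10  2 20  3 30]
#  [ 4 40  5 50  6 60]]
```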
If I understand correctly, you want to interleave two lists. You can do that with zip and some post-processing of it:
>>> a = ['a', 'b', 'c']
>>> b = ['d', 'e', 'f']
>>> print(list(zip(a, b)))
[('a', 'd'), ('b', 'e'), ('c', 'f')]
>>> from itertools import chain
>>> print(list(chain.from_iterable(zip(a, b))))
['a', 'd', 'b', 'e', 'c', 'f']
>>> print(' '.join(chain.from_iterable(zip(a, b))))
a d b e c f
You probably want to apply that something like this:
data.write('{} {}\n'.format(i, ' '.join(chain.from_iterable(zip(a, b)))))
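Putting it together, a sketch of the full write loop (the rows data, column names, and filename are hypothetical):

```python
from itertools import chain

# Hypothetical per-row data: row i has its own lists a and b.
rows = [
    (['a0', 'a1'], ['b0', 'b1']),
    (['a2', 'a3'], ['b2', 'b3']),
]

with open('out.txt', 'w') as data:
    data.write('i a0 b0 a1 b1\n')  # column names (illustrative)
    for i, (a, b) in enumerate(rows):
        # Interleave a and b, then join into one space-separated line.
        data.write('{} {}\n'.format(i, ' '.join(chain.from_iterable(zip(a, b)))))
```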
I was wondering if it is possible to make a new column in a pandas DataFrame that is a list of every value in the same index group, NOT including the value of the row itself. For example, in the df below, the first row should get the list [b, c] - everything in its group except its own value, 'a'. Is this possible to do per index?
I have tried this, but it returns a list of all values combined per index:
import pandas as pd
d = {'index': [1, 1, 1, 2, 2, 3], 'col1': ['a', 'b', 'c', 'd', 'e, f', 'g']}
df = pd.DataFrame(d)
df = df.groupby("index")["col1"].apply(list)
Whereas I am looking for something that retains all of the rows and produces each list in a new column, without the row's own value included.
Thank you for any help!!
We can use explode with groupby to build the whole list within each index, then take the set difference:
df['l'] = df.col1.str.split(',')
df['new'] = df.explode('l').groupby('index')['l'].agg(list).reindex(df['index']).tolist()
df['List'] = (df.new.apply(set) - df['l'].apply(set)).apply(list)
df.loc[~df.List.astype(bool), 'List'] = df.l
df
index col1 l new List
0 1 a [a] [a, b, c] [c, b]
1 1 b [b] [a, b, c] [a, c]
2 1 c [c] [a, b, c] [a, b]
3 2 d [d] [d, e, f] [e, f]
4 2 e, f [e, f] [d, e, f] [d]
5 3 g [g] [g] [g]
Update
l = []
for x, y in zip(df.l, df.new):
    x = x.copy()
    y = y.copy()
    for i in x:
        if i in y:
            y.remove(i)
    l.append(y)

l
[['b', 'c'], ['a', 'c'], ['a', 'b'], ['e', ' f'], ['d'], []]
df['List'] = l
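As a sketch of an alternative that avoids the explode/reindex round-trip (splitting on ', ' so that 'e, f' becomes ['e', 'f']): build the full list per index once, then remove each row's own items from a copy of it:

```python
import pandas as pd

d = {'index': [1, 1, 1, 2, 2, 3], 'col1': ['a', 'b', 'c', 'd', 'e, f', 'g']}
df = pd.DataFrame(d)

# Split multi-value cells into real lists.
df['items'] = df['col1'].str.split(', ')

# Concatenate all items within each index group.
full = {k: sum(g, []) for k, g in df.groupby('index')['items']}

def others(row):
    # Copy the group's full list, then drop this row's own items once each.
    pool = list(full[row['index']])
    for v in row['items']:
        pool.remove(v)
    return pool

df['List'] = df.apply(others, axis=1)
print(df['List'].tolist())
# [['b', 'c'], ['a', 'c'], ['a', 'b'], ['e', 'f'], ['d'], []]
```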
I'm attempting to remove the rows of values in a list within df which are present in lst.
I'm aware of using df[df[x].isin(y)] for singular strings but am not sure as to how to adjust the same method to work with lists within a dataframe.
lst = ['f','a']
df:
Column1 Out1
0 ['x', 'y'] a
1 ['a', 'b'] i
2 ['c', 'd'] o
3 ['e', 'f'] u
etc.
I've attempted to use list comprehension but it doesn't seem to work the same with Pandas
df = df[[i for x in list for i in df['Column1']]]
Error:
TypeError: unhashable type: 'list'
My expected output would be as followed; removing the rows that contain the lists of which have the values in lst:
Column1 Out1
0 ['x', 'y'] a
1 ['c', 'd'] o
etc.
You can convert the values to sets and then use &; for inverting the mask, use ~:
df = pd.DataFrame({'Column1':[['x','y'], ['a','b'], ['c','d'],['e','f']],
'Out1':list('aiou')})
lst = ['f','a']
df1 = df[~(df['Column1'].apply(set) & set(lst)).astype(bool)]
print (df1)
Column1 Out1
0 [x, y] a
2 [c, d] o
Solution with a nested list comprehension - it builds a list of booleans per row, so all is needed to check that every value is True:
df1 = df[[all(x not in lst for x in i) for i in df['Column1']]]
print (df1)
Column1 Out1
0 [x, y] a
2 [c, d] o
print ([[x not in lst for x in i] for i in df['Column1']])
[[True, True], [False, True], [True, True], [True, False]]
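An equivalent and arguably more explicit mask uses set.isdisjoint, keeping only the rows whose list shares no element with lst:

```python
import pandas as pd

df = pd.DataFrame({'Column1': [['x', 'y'], ['a', 'b'], ['c', 'd'], ['e', 'f']],
                   'Out1': list('aiou')})
lst = ['f', 'a']

# Keep rows whose list has no element in common with lst.
df1 = df[df['Column1'].apply(lambda vals: set(vals).isdisjoint(lst))]
print(df1)
```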
I'm trying to use Pandas to solve an issue courtesy of an idiot DBA not doing a backup of a now crashed data set, so I'm trying to find differences between two columns. For reasons I won't get into, I'm using Pandas rather than a database.
What I'd like to do is, given:
Dataset A = [A, B, C, D, E]
Dataset B = [C, D, E, F]
I would like to find values which are disjoint.
Dataset A!=B = [A, B, F]
In SQL, this is standard set logic, accomplished differently depending on the dialect, but a standard function. How do I elegantly apply this in Pandas? I would love to input some code, but nothing I have is even remotely correct. It's a situation in which I don't know what I don't know..... Pandas has set logic for intersection and union, but nothing for disjoint/set difference.
Thanks!
You can use the set.symmetric_difference method:
In [1]: df1 = DataFrame(list('ABCDE'), columns=['x'])
In [2]: df1
Out[2]:
x
0 A
1 B
2 C
3 D
4 E
In [3]: df2 = DataFrame(list('CDEF'), columns=['y'])
In [4]: df2
Out[4]:
y
0 C
1 D
2 E
3 F
In [5]: set(df1.x).symmetric_difference(df2.y)
Out[5]: set(['A', 'B', 'F'])
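The same computation can be written with the ^ operator on plain sets:

```python
a = set('ABCDE')
b = set('CDEF')

# ^ is the symmetric-difference operator: elements in exactly one set.
print(sorted(a ^ b))  # ['A', 'B', 'F']
```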
Here's a solution for multiple columns; it's probably not very efficient, and I would love to get some feedback on making it faster:
df_in = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
def set_difference(input_set, limit_on_set):
    limit_on_set_sub = limit_on_set[['A', 'B']]
    limit_on_tuples = [tuple(x) for x in limit_on_set_sub.values]
    limit_on_dict = dict.fromkeys(limit_on_tuples, 1)
    entries_in_limit = input_set.apply(
        lambda row: (row['A'], row['B']) in limit_on_dict, axis=1)
    return input_set[~entries_in_limit]
>>> set_difference(df_in, limit)
   A  B
1  2  a
3  3  a
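Since this answer asks for feedback on speed: one vectorized alternative is a left merge with indicator=True, keeping the rows that appear only in the input frame. A sketch on the same data (frame names are illustrative):

```python
import pandas as pd

input_df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# indicator=True adds a _merge column saying where each row came from.
merged = input_df.merge(limit, on=['A', 'B'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(result)
#    A  B
# 1  2  a
# 3  3  a
```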