Optimizing pandas filter inside apply function

I have a list of pairs--stored in a DataFrame--each pair having an 'a' column and a 'b' column. For each pair I want to return the 'b's that have the same 'a'. For example, given the following set of pairs:
a b
0 c d
1 e f
2 c g
3 e h
4 i j
5 e k
I would like to end up with:
a b equivalents
0 c d [g]
1 e f [h, k]
2 c g [d]
3 e h [f, k]
4 i j []
5 e k [f, h]
I can do this with the following:
def equivalents(x):
    # All 'b' values that share this row's 'a', minus the row's own 'b'
    matches = pairs[pairs["a"] == x["a"]]["b"].tolist()
    return [b for b in matches if b != x["b"]]

pairs["equivalents"] = pairs.apply(equivalents, axis=1)
But it is painfully slow on larger sets (e.g. 1 million plus pairs). Any suggestions how I could do this faster?

I think this ought to be a bit faster. First, concatenate the 'b' strings within each 'a' group (summing strings joins them); df here is the pairs DataFrame.
df['equiv'] = df.groupby('a')['b'].transform(sum)
a b equiv
0 c d dg
1 e f fhk
2 c g dg
3 e h fhk
4 i j j
5 e k fhk
Now split the string back into a list of characters and remove whichever letter is already in column 'b'. This relies on every 'b' being a single character.
df.apply(lambda x: [y for y in list(x.equiv) if y != x.b], axis=1)
0 [g]
1 [h, k]
2 [d]
3 [f, k]
4 []
5 [f, h]
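If the 'b' values are longer than one character, the string-sum trick above no longer splits cleanly. A minimal sketch of an alternative follows. It assumes that 'b' values are unique within each group and that column 'a' has no missing values; the variable name groups is only illustrative. It builds each group's list once and then filters per row in plain Python, avoiding a DataFrame filter on every row:

```python
import pandas as pd

pairs = pd.DataFrame({
    "a": ["c", "e", "c", "e", "i", "e"],
    "b": ["d", "f", "g", "h", "j", "k"],
})

# Build each group's list of 'b' values once, keyed by 'a'.
groups = pairs.groupby("a")["b"].agg(list).to_dict()

# For every row, look up its group's list and drop the row's own 'b'.
pairs["equivalents"] = [
    [other for other in groups[a] if other != b]
    for a, b in zip(pairs["a"], pairs["b"])
]
print(pairs)
```

The per-row work is now proportional to the size of the row's group rather than to the whole DataFrame, which is where the original apply spent most of its time.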

Related

Alphabetical Grid using python3

How do I write a function grid that returns an alphabetical grid of size NxN, where a = 0, b = 1, c = 2, and so on, in Python?
Example:
a b c d
b c d e
c d e f
d e f g
Here I tried to write a script using three for loops, but it prints the whole alphabet on every pass:
def grid(N):
    for i in range(N):
        for j in range(N):
            for k in range(ord('a'),ord('z')+1):
                print(chr(k))
    pass

Not the most elegant, but gets the job done.
import string

def grid(N):
    i = 0
    for x in range(N):
        # Slice N letters starting at offset i, then shift by one for the next row
        for y in string.ascii_lowercase[i:N+i]:
            print(y, end=" ")
        i += 1
        print()

grid(4)
Output
a b c d
b c d e
c d e f
d e f g

Extending @MichHeng's suggestion, and using a list comprehension:
letters = [chr(x) for x in range(ord('a'),ord('z')+1)]

def grid(N):
    for i in range(N):
        # Use a separate name (j) so the row index i is not shadowed
        print(' '.join([letters[j] for j in range(i, N+i)]))

grid(4)
output is
a b c d
b c d e
c d e f
d e f g

You have specified for k in range(ord('a'),ord('z')+1), which prints out the entire series from 'a' to 'z' on every pass. What you probably need is a reference list of letters to pick from, built with a list comprehension, for example
[chr(x) for x in range(ord('a'),ord('z')+1)]
Try this:
letters = [chr(x) for x in range(ord('a'),ord('z')+1)]

def grid(N):
    for i in range(N):
        for j in range(i, N+i):
            print(letters[j], end=' ')
            if j == N+i-1:
                print('')  # move to the next line

grid(4)
Output
a b c d
b c d e
c d e f
d e f g
Do you need to add a check for N <= 13? With only 26 letters, letters[j] raises IndexError once N is 14 or more.
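Since the question asks for a function that returns the grid rather than printing it, here is a minimal sketch that does so and includes the N <= 13 check. Returning a list of row strings and raising ValueError for out-of-range sizes are assumptions on my part, not something the question specifies:

```python
import string

def grid(n):
    """Return an n x n alphabetical grid as a list of row strings."""
    # Row i needs letters i .. i + n - 1, so the last index is 2n - 2 <= 25.
    if not 0 <= n <= 13:
        raise ValueError("n must be between 0 and 13")
    letters = string.ascii_lowercase
    return [" ".join(letters[i:i + n]) for i in range(n)]

print("\n".join(grid(4)))
```

Keeping the printing outside the function makes the result easy to test or reuse.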

Trying to verify last position of a string

I'm trying to verify that the last character is not in my list:
def acabar_char(input):
    list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
    tam = 0
    tam = (len(input)-1)
    for char in input:
        if char[tam] in list_chars:
            return False
        else:
            return True
When I try this I get this error:
if char[tam] in list_chars:
IndexError: string index out of range

You can index from the end (of a string or a list) with negative numbers:
def acabar_char(input, list_chars):
    return input[-1] not in list_chars

It seems that you are trying to assert that the last element of an input string (or also list/tuple) is NOT in a subset of disallowed chars.
Currently, your loop never gets past its first iteration, because both branches return inside the loop. In addition, char is a single character, so char[tam] raises IndexError whenever tam is greater than 0; the check only works when the input has a length of 1.
I suggest something like this instead (also using the string.ascii_letters definition):
import string

DISALLOWED_CHARS = string.ascii_letters + string.digits

def acabar_char(val, disallowed_chars=DISALLOWED_CHARS):
    # An empty input has no last character to check
    if len(val) == 0:
        return False
    return val[-1] not in disallowed_chars
Does this work for you?
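One possible variation, shown only as a sketch: store the disallowed characters in a frozenset so that membership is an exact match, not a substring test (with a plain string, a multi-character last element of a list such as "ab" would count as "in" the string). The function name last_char_is_allowed is made up for this example:

```python
import string

# Exact-match membership: only single characters are members of this set.
DISALLOWED = frozenset(string.ascii_letters + string.digits)

def last_char_is_allowed(text, disallowed=DISALLOWED):
    # Empty input has no last character, so treat it as failing the check.
    return bool(text) and text[-1] not in disallowed

for sample in ["abc!", "abc", "abc1", ""]:
    print(repr(sample), last_char_is_allowed(sample))
```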

You are already iterating through the input in that for loop, so there's no need to use indices. You could use a list comprehension as the other answer suggests, but I'm guessing you're trying to learn Python, so here is a way to rewrite your function. Note that it checks every character, not only the last one.
list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()

def acabar_char(input):
    for char in input:
        # Stop as soon as any character is found in list_chars
        if char in list_chars:
            return False
    return True

list_chars = "a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0".split()
def acabar_char(input):
    if input in list_chars:
        print('True')

How to create data frame from pandas series containing lists of different length

I've got a pandas Series with the structure below:
0 [{k1:a,k2:b,k3:c},{k1:d,k2:e,k3:f}]
1 [{k1:g,k2:h,k3:i},{k1:j,k2:k,k3:l},{k1:ł,k2:m,k3:n}]
2 [{k1:o,k2:p,k3:r}]
3 [{k1:s,k2:t,k3:w},{k1:q,k2:z,k3:w},{k1:x,k2:y,k3:z},{k1:v,k2:f,k3:g}]
As you can see, the elements of this Series are lists of different lengths, and the items in each list are dictionaries. I would like to create a data frame that looks like this:
k1 k2 k3
0 a b c
1 d e f
2 g h i
3 j k l
4 ł m n
5 o p r
6 s t w
7 q z w
8 x y z
9 v f g
I have tried the code below:
for index_val, series_val in series.iteritems():
    for dict in series_val:
        for key,value in dict.items():
            actions['key']=value
However, PyCharm stops and produces nothing. Is there another way to do this?

Use pd.concat after applying pd.DataFrame to each element, i.e.
import pandas as pd

x = pd.Series([
    [{'k1':'a','k2':'b','k3':'c'},{'k1':'d','k2':'e','k3':'f'}],
    [{'k1':'g','k2':'h','k3':'i'},{'k1':'j','k2':'k','k3':'l'},{'k1':'ł','k2':'m','k3':'n'}],
    [{'k1':'o','k2':'p','k3':'r'}],
    [{'k1':'s','k2':'t','k3':'w'},{'k1':'q','k2':'z','k3':'w'},{'k1':'x','k2':'y','k3':'z'},{'k1':'v','k2':'f','k3':'g'}],
])
# Turn each list of dicts into its own DataFrame, then stack them
df = pd.concat(x.apply(pd.DataFrame).tolist(), ignore_index=True)
Output :
k1 k2 k3
0 a b c
1 d e f
2 g h i
3 j k l
4 ł m n
5 o p r
6 s t w
7 q z w
8 x y z
9 v f g
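A different way to get the same frame, sketched here as an alternative and not as the answer's method, is to flatten the Series into one list of dictionaries with itertools.chain and build a single DataFrame from it. This assumes every element of the Series is a list of dicts with the same keys:

```python
from itertools import chain

import pandas as pd

x = pd.Series([
    [{'k1': 'a', 'k2': 'b', 'k3': 'c'}, {'k1': 'd', 'k2': 'e', 'k3': 'f'}],
    [{'k1': 'g', 'k2': 'h', 'k3': 'i'}, {'k1': 'j', 'k2': 'k', 'k3': 'l'},
     {'k1': 'ł', 'k2': 'm', 'k3': 'n'}],
    [{'k1': 'o', 'k2': 'p', 'k3': 'r'}],
    [{'k1': 's', 'k2': 't', 'k3': 'w'}, {'k1': 'q', 'k2': 'z', 'k3': 'w'},
     {'k1': 'x', 'k2': 'y', 'k3': 'z'}, {'k1': 'v', 'k2': 'f', 'k3': 'g'}],
])

# Flatten the lists into one list of dicts, then build the frame in one go.
df = pd.DataFrame(list(chain.from_iterable(x)))
print(df)
```

This avoids creating one intermediate DataFrame per row, which may matter on long Series.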

Arrange sequences of entries in pairs in a dataframe

Given a table of the form:
ID Sequence
1 A,C,D,E,F,G
2 D,F,G,B
3 A,B,A,C
and so on
I want to arrange this data so that it can be fed sequentially into an RNN that predicts the next entry in each sequence. What I need, in a new dataframe, is every possible window of each sequence together with the entry that follows it:
X Y
A,C,D E
C,D,E F
D,E,F G
D,F,G B
A,B,A C
X could be of length 3 or any custom length. How should I go about it?

Here's another way, using str.split and applying pd.Series to the sublists:
In [623]: df.Sequence.str.split(',')\
...: .apply(lambda x: pd.Series([x[i : i + 3], x[i + 3]] for i in range(0, len(x)- 3))).stack()\
...: .apply(lambda x: pd.Series([x[0], x[1]]))\
...: .reset_index(drop=True)
Out[623]:
0 1
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
Setting the columns is as simple as df.columns = ['X', 'Y'].

This will do the job:
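# The column in the question is named 'Sequence'
vals = [l.split(',') for l in df.Sequence.values]
# Build [window, next entry] pairs per sequence, flatten them, then unzip into X and Y
X, Y = zip(*sum([[[','.join(el[i:i+3]), el[i+3]] for i in range(len(el)-3)] for el in vals], []))
res = pd.DataFrame({'X': X, 'Y': Y})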
Then res is
X Y
0 A,C,D E
1 C,D,E F
2 D,E,F G
3 D,F,G B
4 A,B,A C

Here's one of the (many) ways of doing it.
In [52]: vals = df.Sequence.str.split(',')
In [53]: seqs = []
In [54]: for val in vals:
    ...:     seqs += [{'X': val[i:i+3], 'Y': val[i+3]} for i in range(len(val)-3)]
    ...:
In [55]: pd.DataFrame(seqs)
Out[55]:
X Y
0 [A, C, D] E
1 [C, D, E] F
2 [D, E, F] G
3 [D, F, G] B
4 [A, B, A] C
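The question also asks for a custom window length. A minimal sketch that wraps the same sliding-window idea in a function with a length parameter is shown below; the function name make_windows and its parameters n and column are made up for this example:

```python
import pandas as pd

def make_windows(df, n=3, column="Sequence"):
    """Return a frame of (X, Y) rows: X is n consecutive entries, Y the next one."""
    rows = []
    for seq in df[column].str.split(","):
        # A sequence of length L yields L - n windows (none if L <= n).
        for i in range(len(seq) - n):
            rows.append({"X": ",".join(seq[i:i + n]), "Y": seq[i + n]})
    return pd.DataFrame(rows, columns=["X", "Y"])

df = pd.DataFrame({"ID": [1, 2, 3],
                   "Sequence": ["A,C,D,E,F,G", "D,F,G,B", "A,B,A,C"]})
print(make_windows(df, n=3))
```

Calling make_windows(df, n=2) would give shorter X windows from the same data.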

Using apply on a column

I have a dataframe like this one.
A B C D E
0 a b c d e
1 f g h i j
2 k l m n o
3 p q r s t
What I'd like is to get a dataframe with each column as a list.
0
0 [a, f, k, p]
1 [b, g, l, q]
2 [c, h, m, r]
3 [d, i, n, s]
4 [e, j, o, t]
I'd like to somehow apply a function to each column, converting it to a list and placing it in a new DataFrame. However, apply only operates on individual entries.

df2 = pd.DataFrame(df.transpose().apply(lambda x: [', '.join(x)], axis=1))
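Note that the one-liner above stores a single joined string wrapped in a list (for example ['a, f, k, p']), which only looks like a list of values when printed. If real lists are wanted, one possible sketch is to convert each column with tolist(); the sample frame is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    [list("abcde"), list("fghij"), list("klmno"), list("pqrst")],
    columns=list("ABCDE"),
)

# One list per original column, stored in a single column named 0.
df2 = pd.DataFrame({0: [df[col].tolist() for col in df.columns]})
print(df2)
```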
