create df from list comprehension within loop - python

I have the following code to create a df from a list comprehension within a loop. However, the output is not what I want.
I would like to create a new column for each group in the list. In this example, 3 groups implies 3 columns.
Input:
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame([], dtype=object)
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    tmp = tmp.append(l1, ignore_index=False)
Output:
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp=
0
0 0.0
1 0.8
0 0.0
1 6.4
0 0.0
1 28.8
Desired Output:
0.0 0.0 0.0
0.8 6.4 28.8
How can I get the above desired output?

I believe you can build a list of lists and then call the DataFrame constructor once, which also improves performance:
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = []
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    print(l1)
    tmp.append(l1)
df = pd.DataFrame(tmp, dtype=object).T
print(df)
0 1 2
0 0 0 0
1 0.8 6.4 28.8
If you need to use DataFrame.append (note that it was removed in pandas 2.0, so this only runs on older versions):
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame([], dtype=object)
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    print(l1)
    tmp = tmp.append([l1])
df = tmp.T
df.columns = range(len(df.columns))
print(df)
0 1 2
0 0.0 0.0 0.0
1 0.8 6.4 28.8

You can use concat instead of append:
for i in range(len(l)):
    l1 = [l[i][1]*l[i][0]*l[i][2]*t[j] for j in range(len(t))]
    l1 = pd.DataFrame(l1)
    tmp = pd.concat([tmp, l1], axis=1)

If you want to make your code a little cleaner and more readable, I suggest using a double list comprehension in combination with the numpy.prod and numpy.array functions.
import pandas as pd
import numpy as np
t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]
tmp = pd.DataFrame(
    np.array(
        [
            np.prod(np.array(i)) * j
            for j in t
            for i in l
        ]
    ).reshape(len(t), len(l))
)
The result looks like this:
>>> print(tmp)
0 1 2
0 0.0 0.0 0.0
1 0.8 6.4 28.8
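The same table can also be built without an explicit loop. The following is a minimal sketch, not taken from the answers above, that assumes NumPy is acceptable here: it multiplies each group's three values together and then uses np.outer to pair every product with every value in t. The names group_products and df are illustrative only.

```python
import numpy as np
import pandas as pd

t = [x * .001 for x in range(2)]
l = [[10, 2, 40], [20, 4, 80], [30, 6, 160]]

# Product of the three values in each group: one number per group.
group_products = np.prod(np.array(l), axis=1)

# Outer product: one row per value in t, one column per group.
df = pd.DataFrame(np.outer(t, group_products))
print(df)
```

Each column corresponds to one group in l and each row to one value in t, which is the layout asked for in the question.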

Related

Division by 0 in pandas -Avoid it

df = pd.DataFrame({f'Diff (a - b)': c['a'] - c['b'],
'Diff in %': (c['a'] - c['b']) * 100 / c['a']})
If some value in c['a'] is 0, the expression divides by 0 and the result is not meaningful.
The expression doesn't fail, though; it outputs inf for these cases.
How can I avoid this and output 0 instead of inf when c['a'] == 0?
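You can replace infinite values with 0 using the replace method (both np.inf and -np.inf can occur, depending on the sign of the numerator):
import numpy as np
import pandas as pd
a = [0, 1, 2]
b = [4, 5, 6]
c = pd.DataFrame({'a': a, 'b': b})
df = pd.DataFrame({'col21': (c['a'] - c['b']) * 100 / c['a']})
df = df.replace([np.inf, -np.inf], 0)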
print(df)
# Output
col21
0 0.0
1 -400.0
2 -200.0
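Another option is to avoid producing inf in the first place by masking the zero denominators before dividing. This is only a sketch, under the assumption that 0 is the desired result whenever c['a'] is 0 (including the 0/0 case, which yields NaN rather than inf and so is not touched by replacing inf alone). The names denom and pct are illustrative.

```python
import pandas as pd

c = pd.DataFrame({'a': [0, 1, 2], 'b': [4, 5, 6]})

# Turn zero denominators into NaN so the division yields NaN rather than inf.
denom = c['a'].where(c['a'] != 0)

# Rows with a zero denominator come out as NaN, which fillna then maps to 0.
pct = ((c['a'] - c['b']) * 100 / denom).fillna(0)

df = pd.DataFrame({'Diff (a - b)': c['a'] - c['b'], 'Diff in %': pct})
print(df)
```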

Multiply all elements of a column in a pandas dataframe

I have a pandas dataframe:
Idx  A  B  C
1    2  5  1
2    1  2  2
3    3  1  1
4    2  3  0
I want to calculate the product of all elements in all columns. e.g.
- P_A = 2*1*3*2 = 12
- P_B = 5*2*1*3 = 30
- P_C = 1*2*1*0 = 0
Ideally, the result would be in a list format [P_A, P_B, P_C].
What is the most efficient way to compute this?
Try:
>>> df[['A', 'B', 'C']].prod().tolist()
[12, 30, 0]
>>>
Or:
>>> df.set_index('Idx').prod().tolist()
[12, 30, 0]
>>>
Or also:
>>> df.filter(regex='[^Idx]').prod().tolist()
[12, 30, 0]
>>>
Or with iloc:
>>> df.iloc[:, 1:].prod().tolist()
[12, 30, 0]
>>>
Or with drop:
>>> df[df.columns.drop('Idx')].prod().tolist()
[12, 30, 0]
>>>
You can apply numpy.prod (numpy.product was an alias for it that was removed in NumPy 2.0):
import numpy as np
np.prod(df.set_index('Idx'), axis=0)
output:
A 12
B 30
C 0
as list:
products = np.prod(df.set_index('Idx'), axis=0).to_list()

Multiplying Elements in a List

I want to multiply the first and second element of sum_row by 13 individually. And multiply the third and fourth by 11 individually and the last element by 9.
I guess my question really is how do I access the elements in lists, so I can use them for calculations later on?
matrix5x5 = [[1 for row in range (5)] for col in range (5)]
for row in matrix5x5:
    for item in row:
        print(item, end=" ")
    print()
sum_row = [sum(i) for i in matrix5x5]
print(sum_row)
OUTPUT:
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
[5, 5, 5, 5, 5]
You can try this one:
sum_row = [1,1,1,1,1] # example
YourList = [13,13,11,11,9]
result = []
for i in range(len(sum_row)):
    result.append(sum_row[i] * YourList[i])
print(result)
and the output is going to be:
[13, 13, 11, 11, 9]
You can even try with [5,5,5,5,5] as the sum of each row.
You can use zip() function:
result = [a * b for a, b in zip(sum_row, [13,13,11,11,9])]
print(result)
# OUTPUT
# [65, 65, 55, 55, 45]
For vectorized calculations, use numpy:
import numpy as np
result = np.array(sum_row) * np.array([13,13,11,11,9])
result:
>>> result
array([65, 65, 55, 55, 45])
The simplest answer is:
l = [1,2,3,4,5]
a = l[0] * 13
b = l[1] * 13
c = l[2] * 11
d = l[3] * 11
e = l[4] * 9
print(a, b, c, d, e)
Your results will be 13 26 33 44 45.
Other users have provided much shorter and better ways of doing this, but you should try to understand what they did if you want to follow theirs.

Find overlapping columns ratio in pandas

Dataframe (Assume all values as categorical):
df = pd.DataFrame(
{"a" : [1 ,2, 3, 4, 5],
"b" : [2,1,3,4,5],
"c" : [1,3,4,2,5]},
index = [1, 2, 3, 4, 5])
I want to find what percentage of overlap is present between different columns
check_a_b = df.a == df.b
check_b_c = df.b == df.c
check_a_c = df.a == df.c
print(np.sum(check_a_b)/len(check_a_b)) # 0.6
print(np.sum(check_b_c)/len(check_b_c)) # 0.2
print(np.sum(check_a_c)/len(check_a_c)) # 0.4
Final output required as a matrix / DataFrame ( Triangular matrix):
     a    b    c
a       0.6  0.4
b            0.2
c
Now I want to implement this for 15 columns in an automated way for a data of more than 100K rows.
What would be the optimized way to do this?
Dropping down to numpy is usually efficient. Only return to pandas when you have the result.
from itertools import combinations
df = pd.DataFrame({"a" : [1 ,2, 3, 4, 5],
"b" : [2,1,3,4,5],
"c" : [1,3,4,2,5]},
index = [1, 2, 3, 4, 5])
a = df.values
d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}
res, c, vals = np.zeros((a.shape[1], a.shape[1])), \
list(map(list, zip(*d.keys()))), list(d.values())
res[c[0], c[1]] = vals
res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)
# a b c
# a 0.0 0.6 0.4
# b 0.0 0.0 0.2
# c 0.0 0.0 0.0
One way you can do this is as follows:
from itertools import combinations
df = pd.DataFrame({"a" : [1 ,2, 3, 4, 5],
"b" : [2,1,3,4,5],
"c" : [1,3,4,2,5]},
index = [1, 2, 3, 4, 5])
df_out = pd.DataFrame()
for i in combinations(df.columns, 2):
    s = pd.DataFrame((df[i[0]] == df[i[1]]).mean(), index=[i[0]], columns=[i[1]])
    df_out = pd.concat([df_out, s])
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df_out.groupby(level=0).sum().reindex(df.columns).reindex(df.columns, axis=1).fillna(0)
Output:
a b c
a 0.0 0.6 0.4
b 0.0 0.0 0.2
c 0.0 0.0 0.0
Here is one way:
Yourdf=pd.DataFrame(columns=df.columns,index=df.columns)
Yourdf=Yourdf.stack(dropna=False).to_frame().apply(lambda x : (df[x.name[0]]==df[x.name[1]]).sum()/len(df),axis=1).unstack()
Yourdf=Yourdf.where(np.triu(np.ones(Yourdf.shape),1).astype(bool))
Yourdf
Out[169]:
a b c
a NaN 0.6 0.4
b NaN NaN 0.2
c NaN NaN NaN
Update (as mentioned by Scott): use mean instead of sum()/len(df):
Yourdf=Yourdf.stack(dropna=False).to_frame().apply(lambda x : (df[x.name[0]]==df[x.name[1]]).mean(),axis=1).unstack()
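For a moderate number of columns, all pairwise comparisons can also be done in a single broadcasted NumPy operation. The sketch below is an illustration rather than a benchmarked solution: it builds a rows x columns x columns boolean array, so it assumes that array fits in memory (for 100K rows and 15 columns that is about 22.5 million booleans). The variable names are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 1, 3, 4, 5],
                   "c": [1, 3, 4, 2, 5]},
                  index=[1, 2, 3, 4, 5])

a = df.to_numpy()

# equal[r, i, j] is True when columns i and j hold the same value in row r.
equal = a[:, :, None] == a[:, None, :]

# Averaging over the rows gives the match ratio for every pair of columns.
ratios = equal.mean(axis=0)

# Keep only the part above the diagonal (each pair once, no self-comparison).
res_df = pd.DataFrame(np.triu(ratios, k=1), index=df.columns, columns=df.columns)
print(res_df)
```

If memory becomes a concern, looping over the column pairs as in the answers above trades some speed for a much smaller footprint.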

loop through a list and use previous elements

I have a list that contains decimal numbers, however in this example I use ints:
my_list = [40, 60, 100, 240, ...]
I want to print each element of the list in reverse order, then a second line where every value is divided by 2, then a third line where the previous value is divided by 3, and so on...
Output should be:
240 120 60 36
120 60 30 18 #previous number divided by 2
40 20 10 6 #previous number divided by 3
... ... ... ... #previous number divided by 4 ...
My solution is ugly: I can make a slice and reverse that list and make n for loops and append the result in a new list. But there must be a better way. How would you do that?
I'd write a generator to yield lists in turn:
def divider(lst, n):
    lst = [float(x) for x in lst[::-1]]
    for i in range(1, n+1):
        lst = [x/i for x in lst]
        yield lst
If we want to make it slightly more efficient, we could factor out the first iteration (division by 1) and yield it separately:
def divider(lst, n):
    lst = [float(x) for x in reversed(lst)]
    yield lst
    for i in range(2, n+1):
        lst = [x/i for x in lst]
        yield lst
*Note that in this context there isn't a whole lot of difference between lst[::-1] and reversed(lst). The former is typically a little faster, but the latter is a little more memory efficient. Choose according to your constraints.
Demo:
>>> def divider(lst, n):
...     lst = [float(x) for x in reversed(lst)]
...     yield lst
...     for i in range(2, n+1):
...         lst = [x/i for x in lst]
...         yield lst
...
>>> for lst in divider([40, 60, 100, 240], 3):
...     print(lst)
...
[240.0, 100.0, 60.0, 40.0]
[120.0, 50.0, 30.0, 20.0]
[40.0, 16.666666666666668, 10.0, 6.666666666666667]
To print the columnar output you want, use format strings. You may have to tweak this to get the alignment and precision you want for your actual data:
def print_list(L):
    print(' '.join('{:>3d}'.format(i) for i in L))
Normally to do the division we could use a function with recursion, but we can also use a simple loop where each iteration produces the list that is worked on next:
my_list = [40, 60, 100, 240, 36, 60, 120, 240]
maxdiv = 20
baselist = list(reversed(my_list))
for div in range(1, maxdiv+1):
    # integer division keeps the values ints, as '{:>3d}' requires
    baselist = [i // div for i in baselist]
    print_list(baselist)
Output:
240 120  60  36 240 100  60  40
120  60  30  18 120  50  30  20
 40  20  10   6  40  16  10   6
 10   5   2   1  10   4   2   1
  2   1   0   0   2   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
...
max_n = 3
vals = [40, 60, 100, 240]
grid = [list(reversed(vals))]
for n in range(2, max_n + 1):
    grid.append([v/n for v in grid[-1]])
for g in grid:
    print(g)
# Output
[240, 100, 60, 40]
[120.0, 50.0, 30.0, 20.0]
[40.0, 16.666666666666668, 10.0, 6.666666666666667]
new_list = my_list[::-1]  # reverse the list
print('\t'.join(map(str, new_list)))
for _counter in range(2, 21):  # count from 2 to 20
    for _index in range(len(new_list)):  # iterate the whole list
        new_list[_index] = new_list[_index] / _counter
    print('\t'.join(map(str, new_list)))
Which will produce output like the following (I used floats instead of ints; the digits shown are from Python 2, and Python 3 prints floats with more digits):
240.0 100.0 60.0 40.0
120.0 50.0 30.0 20.0
40.0 16.6666666667 10.0 6.66666666667
10.0 4.16666666667 2.5 1.66666666667
2.0 0.833333333333 0.5 0.333333333333
my_list = [40, 60, 100, 240]
def dostuff(l, limit):
    print('\t'.join(map(str, reversed(l))))
    print('\n'.join(['\t'.join(map(str, [v/float(i) for v in reversed(l)])) for i in range(2, limit+1)]))
dostuff(my_list, 20)
Produces:
240 100 60 40
120.0 50.0 30.0 20.0
80.0 33.333333333333336 20.0 13.333333333333334
60.0 25.0 15.0 10.0
48.0 20.0 12.0 8.0
40.0 16.666666666666668 10.0 6.666666666666667
34.285714285714285 14.285714285714286 8.571428571428571 5.714285714285714
30.0 12.5 7.5 5.0
26.666666666666668 11.11111111111111 6.666666666666667 4.444444444444445
24.0 10.0 6.0 4.0
21.818181818181817 9.090909090909092 5.454545454545454 3.6363636363636362
20.0 8.333333333333334 5.0 3.3333333333333335
18.46153846153846 7.6923076923076925 4.615384615384615 3.076923076923077
17.142857142857142 7.142857142857143 4.285714285714286 2.857142857142857
16.0 6.666666666666667 4.0 2.6666666666666665
15.0 6.25 3.75 2.5
14.117647058823529 5.882352941176471 3.5294117647058822 2.3529411764705883
13.333333333333334 5.555555555555555 3.3333333333333335 2.2222222222222223
12.631578947368421 5.2631578947368425 3.1578947368421053 2.1052631578947367
12.0 5.0 3.0 2.0
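Since each line in the question divides the previous line by the next integer, line n is the reversed list divided by 1*2*...*n. The sketch below (Python 3, standard library only) uses itertools.accumulate to produce those running divisors. The function name divided_rows is made up for this example, and because it divides the original values once per row, the last floating-point digit may differ slightly from what repeated division gives.

```python
from itertools import accumulate
from operator import mul

def divided_rows(values, n):
    """Yield n rows: the reversed values divided by 1, 1*2, 1*2*3, ..."""
    base = list(reversed(values))
    # accumulate with mul yields the running products 1, 2, 6, 24, ...
    for divisor in accumulate(range(1, n + 1), mul):
        yield [v / divisor for v in base]

rows = list(divided_rows([40, 60, 100, 240], 3))
for row in rows:
    print(row)
```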
