How to iterate over rows with multiple dataframes which have multiple columns - python

I created a function with three input parameters: x, y, z. I want to loop over them.
x is a dataframe with one column
y same
z asks for a dataframe with multiple columns
I tried this:
result = [f(x,y,z) for x,y,z in zip(df1["1com"], df2["1com"], df3["3com"])]
df1, df2 and df3 have the same index length.
This doesn't work because the list comprehension doesn't handle the multi-column dataframe like this. I tried a bunch of things without success.
btw I found the list comprehension method here: How to iterate over rows in a DataFrame in Pandas

You could zip with individual columns of the multi-column DataFrame:
import pandas as pd
df1 = pd.DataFrame({"col_1": [1, 2, 3]})
df2 = pd.DataFrame({"col_1": [4, 5, 6]})
df3 = pd.DataFrame({"col_1": [7, 8, 9], "col_2": [10, 11, 12]})
def f(w, x, y, z):
    return sum([w, x, y, z])

result = [
    f(w, x, y, z)
    for w, x, y, z
    in zip(
        df1["col_1"], df2["col_1"],
        df3["col_1"], df3["col_2"]  # list all required df3 columns individually
    )
]
print(result)
Output:
[22, 26, 30]
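If you'd rather not list each df3 column by hand, one variation (a sketch over the same toy frames) unpacks all of df3's columns into zip; f then has to accept as many arguments as there are columns in total:

```python
import pandas as pd

df1 = pd.DataFrame({"col_1": [1, 2, 3]})
df2 = pd.DataFrame({"col_1": [4, 5, 6]})
df3 = pd.DataFrame({"col_1": [7, 8, 9], "col_2": [10, 11, 12]})

def f(w, x, y, z):
    return sum([w, x, y, z])

# unpack every df3 column into zip, so adding df3 columns needs no edits here
result = [f(*row) for row in zip(df1["col_1"], df2["col_1"],
                                 *(df3[c] for c in df3.columns))]
print(result)  # [22, 26, 30]
```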
Or you could join the DataFrames into a single one first:
df = df1.join(df2, lsuffix="_df1").join(df3, lsuffix="_df2")
print(df)
result = [
    f(w, x, y, z)
    for idx, (w, x, y, z)
    in df.iterrows()
]
print(result)
Output:
   col_1_df1  col_1_df2  col_1  col_2
0          1          4      7     10
1          2          5      8     11
2          3          6      9     12
[22, 26, 30]
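If iteration speed matters, a variation of the join approach (a sketch, same frames as above) uses itertuples, which is generally faster than iterrows:

```python
import pandas as pd

df1 = pd.DataFrame({"col_1": [1, 2, 3]})
df2 = pd.DataFrame({"col_1": [4, 5, 6]})
df3 = pd.DataFrame({"col_1": [7, 8, 9], "col_2": [10, 11, 12]})

def f(w, x, y, z):
    return sum([w, x, y, z])

df = df1.join(df2, lsuffix="_df1").join(df3, lsuffix="_df2")
# itertuples avoids building a Series per row, unlike iterrows
result = [f(*row) for row in df.itertuples(index=False)]
print(result)  # [22, 26, 30]
```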
Or you could convert df3 to a list of Series and "pivot" it using zip like below.
def f(x, y, z):
    return x, y, z

result = [
    f(x, y, z)
    for x, y, z
    in zip(
        df1["col_1"],
        df2["col_1"],
        zip(*[df3[c] for c in df3.columns])
    )
]
print(result)
Output:
[(1, 4, (7, 10)), (2, 5, (8, 11)), (3, 6, (9, 12))]

Related

Cumulative average in python

I'm working with csv files.
I'd like to create a continuously updated average of a sequence, e.g.
I'd like to output the running average at each individual value of a list
list: [a, b, c, d, e, f]
formula:
(a)/1= ?
(a+b)/2=?
(a+b+c)/3=?
(a+b+c+d)/4=?
(a+b+c+d+e)/5=?
(a+b+c+d+e+f)/6=?
To demonstrate:
if I have the list [1, 4, 7, 4, 19]
my output should be [1, 2.5, 4, 4, 7]
explained:
(1)/1=1
(1+4)/2=2.5
(1+4+7)/3=4
(1+4+7+4)/4=4
(1+4+7+4+19)/5=7
As for my python file, it is simple code:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('somecsvfile.csv')
x = [] #has to be a list of 1 to however many rows are in the "numbers" column, will be a simple [1, 2, 3, 4, 5] etc...
#x will be used to divide the numbers selected in y to give us z
y = df[numbers]
z = #new dataframe derived from the continuous average of y
plt.plot(x, z)
plt.show()
If numpy is needed that is no problem.
pandas.DataFrame.expanding is what you need.
Using it you can just call df.expanding().mean() to get the result you want:
mean = df.expanding().mean()
print(mean)
Out[10]:
0    1.0
1    2.5
2    4.0
3    4.0
4    7.0
If you want to do it just in one column, use pandas.Series.expanding.
Just use the column instead of df:
df['column_name'].expanding().mean()
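A minimal check with the sample values from the question:

```python
import pandas as pd

s = pd.Series([1, 4, 7, 4, 19])
# expanding().mean() averages everything seen so far at each position
avg = s.expanding().mean().tolist()
print(avg)  # [1.0, 2.5, 4.0, 4.0, 7.0]
```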
You can use cumsum to get cumulative sum and then divide to get the running average.
import numpy as np

x = np.array([1, 4, 7, 4, 19])
z = np.cumsum(x) / np.arange(1, len(x) + 1)
print(z)
output:
[1.  2.5 4.  4.  7. ]
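If you'd rather avoid numpy entirely, a plain-stdlib sketch of the same running average uses itertools.accumulate:

```python
from itertools import accumulate

values = [1, 4, 7, 4, 19]
# accumulate yields the running sums; divide each by its 1-based position
running_avg = [total / count
               for count, total in enumerate(accumulate(values), start=1)]
print(running_avg)  # [1.0, 2.5, 4.0, 4.0, 7.0]
```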
To give a complete answer to your question, filling in the blanks of your code using numpy and plotting:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#df = pd.read_csv('somecsvfile.csv')
#instead I just create a df with a column named 'numbers'
df = pd.DataFrame([1, 4, 7, 4, 19], columns = ['numbers',])
x = range(1, len(df)+1) #x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = np.cumsum(y) / np.array(x)
plt.plot(x, z, 'o')
plt.xticks(x)
plt.xlabel('Entry')
plt.ylabel('Cumulative average')
But as pointed out by Augusto, you can also just put the whole thing into a DataFrame. Adding a bit more to his approach:
n = [1, 4, 7, 4, 19]
df = pd.DataFrame(n, columns = ['numbers',])
#augment the index so it starts at 1 like you want
df.index = np.arange(1, len(df)+1)
# create a new column for the cumulative average
df = df.assign(cum_avg = df['numbers'].expanding().mean())
# numbers cum_avg
# 1 1 1.0
# 2 4 2.5
# 3 7 4.0
# 4 4 4.0
# 5 19 7.0
# plot
df['cum_avg'].plot(linestyle = 'none',
                   marker = 'o',
                   xticks = df.index,
                   xlabel = 'Entry',
                   ylabel = 'Cumulative average')

Extract rows where the lists of columns contain certain values in a pandas dataframe

I have a dataframe that looks like this:
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
1 2 [5] [D]
2 3 [4, 12] [A, D]
3 4 [2, 6, 13, 12] [X, Z, T, D]
I would like to extract the rows where the list in the AgeGroups column has at least 2 values less than 7 and at least 1 value greater than 8.
So the result should look like this:
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
I'm not sure how to do it.
First create a helper DataFrame and compare with DataFrame.lt and
DataFrame.gt, then count matches per row and compare the counts with Series.ge, chaining the masks with & for bitwise AND:
import ast
#if not lists
#df['AgeGroups'] = df['AgeGroups'].apply(ast.literal_eval)
df1 = pd.DataFrame(df['AgeGroups'].tolist())
df = df[df1.lt(7).sum(axis=1).ge(2) & df1.gt(8).sum(axis=1).ge(1)]
print (df)
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
Or use a list comprehension that compares numpy arrays, counts matches with sum and combines both count conditions with and, because they are scalars:
m = [(np.array(x) < 7).sum() >= 2 and (np.array(x) > 8).sum() >=1 for x in df['AgeGroups']]
df = df[m]
print (df)
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
Simple if/else logic I wrote for each row using the apply function; you could also use a list comprehension over the rows.
data = {'ID':['1', '2', '3', '4'], 'AgeGroups':[[3,3,10],[2],[4,12],[2,6,13,12]],'PaperIDs':[['A','B','C'],['D'],['A','D'],['X','Z','T','D']]}
df = pd.DataFrame(data)
def extract_age(row):
    my_list = row['AgeGroups']
    count1 = 0
    count2 = 0
    if len(my_list) >= 3:
        for i in my_list:
            if i < 7:
                count1 = count1 + 1
            elif i > 8:
                count2 = count2 + 1
        if (count1 >= 2) and (count2 >= 1):
            print(row['AgeGroups'], row['PaperIDs'])

df.apply(lambda x: extract_age(x), axis=1)
Output
[3, 3, 10] ['A', 'B', 'C']
[2, 6, 13, 12] ['X', 'Z', 'T', 'D']
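A more compact variation (a sketch over the same data) builds a boolean mask with a single apply over AgeGroups, then indexes with it:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'AgeGroups': [[3, 3, 10], [5], [4, 12], [2, 6, 13, 12]],
    'PaperIDs': [['A', 'B', 'C'], ['D'], ['A', 'D'], ['X', 'Z', 'T', 'D']],
})

# at least 2 values < 7 and at least 1 value > 8
mask = df['AgeGroups'].apply(
    lambda ages: sum(v < 7 for v in ages) >= 2 and sum(v > 8 for v in ages) >= 1)
print(df[mask])
```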

How can I use lambdify to evaluate my function?

I have an expression with several variables, let's say something like below:
import numpy as np
import sympy as sym
from sympy import Symbol, Add, re, lambdify
x = sym.Symbol('x')
y = sym.Symbol('y')
z = sym.Symbol('z')
F = x + y + z
I have three lists for the variables like below:
x = [3, 2 ,3]
y = [4, 5 , 6]
z = [7, 10 ,3]
I want to evaluate my function for each element of my variables.
I know I can define something like below:
f_dis = lambdify([x, y, z], x + y + z, 'numpy')
d = f_dis(3, 4, 7)
print ( "f_dis =", d)
which give me 14 as the desired result. But how can I pass the x, y, and z as three lists (instead of writing the elements separately) and get a result like below:
[14, 17, 12]
It seems using lambdify is a more efficient way to evaluate a function, based on this note:
https://www.sympy.org/scipy-2017-codegen-tutorial/notebooks/22-lambdify.html
Thanks.
import sympy as sp
x = sp.Symbol('x')
y = sp.Symbol('y')
z = sp.Symbol('z')
X = [3, 2 ,3]
Y = [4, 5 , 6]
Z = [7, 10 ,3]
values = list(zip(X, Y, Z))
f_dis = sp.lambdify([x, y, z], x + y + z, 'numpy')
ans = [f_dis(*value) for value in values]
for d in ans:
    print("f_dis =", d)
this will give you:
f_dis = 14
f_dis = 17
f_dis = 12
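Since the function was lambdified with the 'numpy' backend, it should also accept arrays directly, so the loop isn't strictly needed (a sketch with the same values):

```python
import numpy as np
import sympy as sym

x, y, z = sym.symbols('x y z')
f_dis = sym.lambdify([x, y, z], x + y + z, 'numpy')

# pass whole arrays; the generated function evaluates elementwise
res = f_dis(np.array([3, 2, 3]), np.array([4, 5, 6]), np.array([7, 10, 3]))
print(res)  # [14 17 12]
```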

Efficiently find index of DataFrame values in array

I have a DataFrame that resembles:
x y z
--------------
0 A 10
0 D 13
1 X 20
...
and I have two sorted arrays for every possible value for x and y:
x_values = [0, 1, ...]
y_values = ['a', ..., 'A', ..., 'D', ..., 'X', ...]
so I wrote a function:
def lookup(record, lookup_list, lookup_attr):
    return np.searchsorted(lookup_list, getattr(record, lookup_attr))
and then call:
df_x_indices = df.apply(lambda r: lookup(r, x_values, 'x'), axis=1)
df_y_indices = df.apply(lambda r: lookup(r, y_values, 'y'), axis=1)
# df_x_indices: [0, 0, 1, ...]
# df_y_indices: [26, ...]
but is there a more performant way to do this? And possibly for multiple columns at once, to get back a DataFrame rather than a Series?
I tried:
np.where(np.in1d(x_values, df.x))[0]
but this removes duplicate values and that is not desired.
You can convert your index arrays to pd.Index objects to make lookup fast(er).
u, v = map(pd.Index, [x_values, y_values])
pd.DataFrame({'x': u.get_indexer(df.x), 'y': v.get_indexer(df.y)})
   x  y
0  0  1
1  0  2
2  1  3
Where,
x_values
# [0, 1]
y_values
# ['a', 'A', 'D', 'X']
As to your requirement of having this work for multiple columns, you will have to iterate over each one. Here's a version of the code above that should generalise to N columns and indices.
val_list = [x_values, y_values]  # [x_values, y_values, z_values, ...]
idx_list = map(pd.Index, val_list)

pd.DataFrame({
    f'{c}': idx.get_indexer(df[c]) for idx, c in zip(idx_list, df)})

   x  y
0  0  1
1  0  2
2  1  3
Update: using a Series with .loc; you may also try reindex
pd.Series(range(len(x_values)),index=x_values).loc[df.x].tolist()
Out[33]: [0, 0, 1]
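Since both lookup arrays are already sorted, another option is to skip apply entirely and call np.searchsorted once per column, which is vectorized over the whole column. A sketch with hypothetical small data; note that y_values must really be sorted under numpy's ordering, where uppercase letters precede lowercase ones:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [0, 0, 1], 'y': ['A', 'D', 'X'], 'z': [10, 13, 20]})
x_values = np.array([0, 1])
y_values = np.array(['A', 'D', 'X', 'a'])  # sorted; uppercase before lowercase

# one vectorized searchsorted call per column instead of a per-row apply
out = pd.DataFrame({
    'x': np.searchsorted(x_values, df['x'].to_numpy()),
    'y': np.searchsorted(y_values, df['y'].to_numpy(str)),
})
print(out)
```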

pandas dataframe exponential decay summation

I have a pandas dataframe,
[[1, 3],
[4, 4],
[2, 8]...
]
I want to create a column that has this:
1*(a)^(3) # = x
1*(a)^(3 + 4) + 4 * (a)^4 # = y
1*(a)^(3 + 4 + 8) + 4 * (a)^(4 + 8) + 2 * (a)^8 # = z
...
Where "a" is some value.
The values 1, 4, 2 are from column one; the repeated 3, 4, 8 are from column two.
Is this possible using some form of transform/apply?
Essentially getting:
[[1, 3, x],
[4, 4, y],
[2, 8, z]...
]
Where x, y, z is the respective sums from the new column (I want them next to each other)
There is a "groupby" that is being done on the dataframe, and this is what I want to do for a given group
If I'm understanding your question correctly, this should work:
df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
a = 42
new_lst = []
for n in range(len(df)):
    z = 0
    i = 0
    while i <= n:
        z += df['a'][i] * a**(sum(df['b'][i:n+1]))
        i += 1
    new_lst.append(z)
df['new'] = new_lst
Update:
Saw that you are using pandas and updated with dataframe methods. Not sure there's an easy way to do this with apply since you need a mix of values from different rows. I think this for loop is still the best route.
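For what it's worth, the sum can be vectorized by factoring the exponent: row n equals base**S[n] times the running sum of a_i / base**S[i-1], where S is the cumulative sum of column b. A sketch (using base 2 instead of a large "a" to keep the numbers readable; exact only while base**S stays within float range):

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [4, 4], [2, 8]], columns=['a', 'b'])
base = 2.0  # the "a" in the question; kept small so values stay readable

S = df['b'].cumsum()             # S[n] = b_0 + ... + b_n
prev = S.shift(1, fill_value=0)  # S[n-1], with S[-1] defined as 0
# row n: sum_i a_i * base**(S[n] - prev[i]) = base**S[n] * cumsum(a_i / base**prev[i])
df['new'] = base**S * (df['a'] / base**prev).cumsum()
print(df['new'].tolist())  # [8.0, 192.0, 49664.0]
```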
