I have a dataframe with columns m, n:
m=[0, 0, 1, 0, 0, 0, 4, 0, 0]
n=[6, 1, 2, 1, 4, 3, 1, 3, 5]
I am looking for an iterative loop that adds up the values of column n whenever the value in column m is non-zero. For example, at the 3rd position of column m the value is 1 (non-zero), so it should sum column n from index 0 to 2, i.e. 6+1+2=9. Similarly, m[6]=4 (non-zero) gives 1+4+3+1=9, and so on.
Let's say you have a dataframe and you want to sum the elements in each column based on the positions of non-zero values in column "m". The following code gives you the output as a dataframe. See the comment in the code if you only want to sum the values in column "n":
import pandas as pd
from random import randint
m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
r = [randint(1, 3) for _ in m]
names = ['lev', 'yan', 'coke' , 'coke', 'yan', 'lev', 'lev', 'yan', 'lev']
df = pd.DataFrame({'m': m, 'n': n, 'r': r, 'names': names})
print(f"Input dataframe:\n{df}")
# if you want to iterate over all columns
iter_cols = df.columns.tolist()
iter_cols.remove('m')
# To iterate over a specific column (e.g. 'n'), use iter_cols = ['n']
starting_idx = 0
sum_df = pd.DataFrame()
for idx, val in enumerate(df.m):
    if val != 0:
        sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
        starting_idx = idx+1
print(f"Output dataframe:\n{sum_df}")
Output:
Input dataframe:
m n r names
0 0 1 2 lev
1 1 1 3 yan
2 0 3 1 coke
3 0 4 2 coke
4 1 1 2 yan
5 0 1 3 lev
6 0 2 3 lev
7 0 1 3 yan
8 2 3 2 lev
Output dataframe:
n names r
0 2.0 levyan 5.0
1 8.0 cokecokeyan 5.0
2 7.0 levlevyanlev 11.0
And if you want to iterate over distinct values in names column and sum the values in 'n' column accordingly:
iter_cols = ['n']
distinct_names = set(df.names)
print(distinct_names)
out_dct = {}
for name in distinct_names:
    starting_idx = 0
    sum_df = pd.DataFrame()
    for idx, val in enumerate(df.names):
        if val == name:
            sum_df = sum_df.append(df.iloc[starting_idx: (idx+1)][iter_cols].sum(), ignore_index=True)
            starting_idx = idx+1
    out_dct[name] = sum_df
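If you'd rather avoid growing a dataframe row by row, here is a vectorized sketch for the column 'n' case, assuming you only need the sums; the group key is derived from the positions of non-zero values in 'm':
import pandas as pd

m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
df = pd.DataFrame({'m': m, 'n': n})

# Each segment runs up to and including a non-zero value of m, so start a new
# group id right after every non-zero row.
group_id = (df['m'] != 0).shift(fill_value=False).cumsum()

# Sum 'n' per segment and keep only segments that actually end in a non-zero m
# (this drops any trailing zeros after the last non-zero value).
sums = df.groupby(group_id)['n'].sum()
ends_nonzero = df.groupby(group_id)['m'].last() != 0
print(sums[ends_nonzero].reset_index(drop=True))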
Let's say I have the following dataframes:
df_t1 = pd.DataFrame([["AAA", 1 ,2],["BBB", 0, 3],["CCC", 1, 2],["DDD", 0, 0],["EEE", 0, 3]], columns=list('ABC'))
A B C
0 AAA 1 2
1 BBB 0 3
2 CCC 1 2
3 DDD 0 0
4 EEE 0 3
and
df_t2 = pd.DataFrame([["XXX", 4, 1],["YYY", 5 ,6],["ZZZ", 0, 3]], columns=list('ABC'))
A B C
0 XXX 4 1
1 YYY 5 6
2 ZZZ 0 3
I can locate the rows in df_t1 that meet a certain condition using the code below:
df_t1.loc[(df_t1['B'] <= 2) & (df_t1['C'] > 2)]
A B C
1 BBB 0 3
4 EEE 0 3
df_t1.loc[(df_t1['B'] <= 3) & (df_t1['C'] > 3)]
A B C
[Empty Dataframe]
I can create a for loop that returns those same results:
value_check = [2, 3]
for i in value_check:
    print(df_t1.loc[(df_t1['B'] <= i) & (df_t1['C'] > i)])
A B C
1 BBB 0 3
4 EEE 0 3
Empty DataFrame
Columns: [A, B, C]
Index: []
But when I try to use that code to attach those values to df_t2:
value_check = [2,3]
for i in value_check:
    df_t2.append(df_t1.loc[(df_t1['B'] <= i) & (df_t1['C'] > i)])
df_t2 is unchanged
Using pd.concat
df_t1 = pd.DataFrame([["AAA", 1 ,2],["BBB", 0, 3],["CCC", 1, 2],["DDD", 0, 0],["EEE", 0, 3]], columns=list('ABC'))
df_t2 = pd.DataFrame([["XXX", 4, 1],["YYY", 5 ,6],["ZZZ", 0, 3]], columns=list('ABC'))
value_check = [2, 3]
for i in value_check:
    condition = (df_t1['B'] <= i) & (df_t1['C'] > i)
    row_to_add = df_t1.loc[condition]
    df_t2 = pd.concat([df_t2, row_to_add], axis=0)
From the docs, the append method "appends rows of other to the end of this frame, returning a new object". You have to assign the result back to df_t2 in your loop:
value_check = [2,3]
for i in value_check:
    df_t2 = df_t2.append(df_t1.loc[(df_t1['B'] <= i) & (df_t1['C'] > i)])
That said, concatenating is more efficient.
Edit: Here is an implementation using concatenation and a list comprehension:
df_t2 = pd.concat([df_t2] + [df_t1.loc[(df_t1['B'] <= i) & (df_t1['C'] > i)] for i in value_check], axis=0)
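If you want to keep the explicit loop but still concatenate only once, a common pattern is to collect the matching slices in a plain list first; a minimal sketch:
import pandas as pd

df_t1 = pd.DataFrame([["AAA", 1, 2], ["BBB", 0, 3], ["CCC", 1, 2],
                      ["DDD", 0, 0], ["EEE", 0, 3]], columns=list('ABC'))
df_t2 = pd.DataFrame([["XXX", 4, 1], ["YYY", 5, 6], ["ZZZ", 0, 3]],
                     columns=list('ABC'))
value_check = [2, 3]

# Collect every matching slice, then call pd.concat a single time at the end.
pieces = [df_t2]
for i in value_check:
    pieces.append(df_t1.loc[(df_t1['B'] <= i) & (df_t1['C'] > i)])
df_t2 = pd.concat(pieces, ignore_index=True)
print(df_t2)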
Dataframe (assume all values are categorical):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5],
     "b": [2, 1, 3, 4, 5],
     "c": [1, 3, 4, 2, 5]},
    index=[1, 2, 3, 4, 5])
I want to find what percentage of overlap is present between different columns
check_a_b = df.a == df.b
check_b_c = df.b == df.c
check_a_c = df.a == df.c
print(np.sum(check_a_b)/len(check_a_b)) # 0.6
print(np.sum(check_b_c)/len(check_b_c)) # 0.2
print(np.sum(check_a_c)/len(check_a_c)) # 0.4
Final output required as a matrix / DataFrame (triangular matrix):
     a    b    c
a         0.6  0.4
b              0.2
c
Now I want to implement this for 15 columns in an automated way, for data with more than 100K rows.
What would be the optimized way to do this?
Dropping down to numpy is usually efficient. Only return to pandas when you have the result.
from itertools import combinations
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 1, 3, 4, 5],
                   "c": [1, 3, 4, 2, 5]},
                  index=[1, 2, 3, 4, 5])
a = df.values
d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}
res = np.zeros((a.shape[1], a.shape[1]))
c, vals = list(map(list, zip(*d.keys()))), list(d.values())
res[c[0], c[1]] = vals
res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)
# a b c
# a 0.0 0.6 0.4
# b 0.0 0.0 0.2
# c 0.0 0.0 0.0
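For completeness, here is a broadcasting sketch of the same idea; it assumes the intermediate (rows x cols x cols) boolean array fits in memory, which it comfortably does for 15 columns and ~100K rows:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 1, 3, 4, 5],
                   "c": [1, 3, 4, 2, 5]})

a = df.to_numpy()
# Compare every column against every other column in one broadcast, then
# average over the rows to get the fraction of matching values.
overlap = (a[:, :, None] == a[:, None, :]).mean(axis=0)
res = pd.DataFrame(overlap, index=df.columns, columns=df.columns)
# Keep only the upper triangle, as in the requested output.
print(res.where(np.triu(np.ones(res.shape, dtype=bool), k=1)))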
One way you can do this is as follows:
from itertools import combinations
df = pd.DataFrame({"a" : [1 ,2, 3, 4, 5],
"b" : [2,1,3,4,5],
"c" : [1,3,4,2,5]},
index = [1, 2, 3, 4, 5])
df_out = pd.DataFrame()
for i in combinations(df.columns, 2):
    s = pd.DataFrame((df[i[0]] == df[i[1]]).mean(), index=[i[0]], columns=[i[1]])
    df_out = pd.concat([df_out, s])
df_out.sum(level=0).reindex(df.columns).reindex(df.columns, axis=1).fillna(0)
Output:
a b c
a 0.0 0.6 0.4
b 0.0 0.0 0.2
c 0.0 0.0 0.0
Here is one way:
Yourdf=pd.DataFrame(columns=df.columns,index=df.columns)
Yourdf=Yourdf.stack(dropna=False).to_frame().apply(lambda x : (df[x.name[0]]==df[x.name[1]]).sum()/len(df),axis=1).unstack()
Yourdf=Yourdf.where(np.triu(np.ones(Yourdf.shape),1).astype(bool))
Yourdf
Out[169]:
a b c
a NaN 0.6 0.4
b NaN NaN 0.2
c NaN NaN NaN
Update: as mentioned by Scott, change to mean:
Yourdf=Yourdf.stack(dropna=False).to_frame().apply(lambda x : (df[x.name[0]]==df[x.name[1]]).mean(),axis=1).unstack()
I am trying to filter a pandas data frame using thresholds for three columns
import pandas as pd
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
"B" : [2, 5, 3, 2, 6],
"C" : [-5, 2, 1, 8, 2]})
df = df.loc[(df.A > 0) & (df.B > 2) & (df.C > -1)].reset_index(drop = True)
df
A B C
0 2 5 2
1 10 3 1
2 3 6 2
However, I want to do this inside a function where the names of the columns and their thresholds are given to me in a dictionary. Here's my first try, which works OK. Essentially I am putting the filter inside the cond variable and just running it:
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
"B" : [2, 5, 3, 2, 6],
"C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = "df = df.loc["
for key in limits_dic.keys():
    cond += "(df." + key + " > " + str(limits_dic[key]) + ") & "
cond = cond[:-2] + "].reset_index(drop = True)"
exec(cond)
df
A B C
0 2 5 2
1 10 3 1
2 3 6 2
Now, finally, I put everything inside a function and it stops working (perhaps the exec function does not like being used inside a function!):
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
"B" : [2, 5, 3, 2, 6],
"C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
def filtering(df, limits_dic):
    cond = "df = df.loc["
    for key in limits_dic.keys():
        cond += "(df." + key + " > " + str(limits_dic[key]) + ") & "
    cond = cond[:-2] + "].reset_index(drop = True)"
    exec(cond)
    return(df)
df = filtering(df, limits_dic)
df
A B C
0 6 2 -5
1 2 5 2
2 10 3 1
3 -5 2 8
4 3 6 2
I know that the exec function acts differently when used inside a function, but I was not sure how to address the problem. Also, I suspect there must be a more elegant way to define a function that does the filtering, given two inputs: 1) df and 2) limits_dic = {"A" : 0, "B" : 2, "C" : -1}. I would appreciate any thoughts on this.
If you're trying to build a dynamic query, there are easier ways. Here's one using a list comprehension and str.join:
query = ' & '.join(['{}>{}'.format(k, v) for k, v in limits_dic.items()])
Or, using f-strings with python-3.6+,
query = ' & '.join([f'{k}>{v}' for k, v in limits_dic.items()])
print(query)
'A>0 & B>2 & C>-1'
Pass the query string to df.query, it's meant for this very purpose:
out = df.query(query)
print(out)
A B C
1 2 5 2
2 10 3 1
4 3 6 2
What if my column names have whitespace, or other weird characters?
From pandas 0.25, you can wrap your column name in backticks so this works:
query = ' & '.join([f'`{k}`>{v}' for k, v in limits_dic.items()])
See this Stack Overflow post for more.
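For example, a quick sketch with a made-up frame whose first column name contains a space:
import pandas as pd

df_ws = pd.DataFrame({"col A": [6, 2, 10], "B": [2, 5, 3]})
limits = {"col A": 0, "B": 2}

# Backticks let query() reference the column whose name contains a space.
query = ' & '.join([f'`{k}`>{v}' for k, v in limits.items()])
print(df_ws.query(query))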
You could also use df.eval if you want to obtain a boolean mask for your query, and then indexing becomes straightforward after that:
mask = df.eval(query)
print(mask)
0 False
1 True
2 True
3 False
4 True
dtype: bool
out = df[mask]
print(out)
A B C
1 2 5 2
2 10 3 1
4 3 6 2
String Data
If you need to query columns that use string data, the code above will need a slight modification.
Consider (data from this answer):
df = pd.DataFrame({'gender':list('MMMFFF'),
'height':[4,5,4,5,5,4],
'age':[70,80,90,40,2,3]})
print (df)
gender height age
0 M 4 70
1 M 5 80
2 M 4 90
3 F 5 40
4 F 5 2
5 F 4 3
And a list of columns, operators, and values:
column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']
The appropriate modification here is:
query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)
age gender height
3 40 F 5
For information on the pd.eval() family of functions, their features and use cases, please visit Dynamic Expression Evaluation in pandas using pd.eval().
An alternative to @coldspeed's version:
conditions = None
for key, val in limits_dic.items():
    cond = df[key] > val
    if conditions is None:
        conditions = cond
    else:
        conditions = conditions & cond
print(df[conditions])
An alternative to both posted, that may or may not be more pythonic:
import pandas as pd
import operator
from functools import reduce
df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
"B": [2, 5, 3, 2, 6],
"C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}
# equiv to [df['A'] > 0, df['B'] > 2 ...]
loc_elements = [df[key] > val for key, val in limits_dic.items()]
df = df.loc[reduce(operator.and_, loc_elements)]
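A closely related sketch uses NumPy's logical_and.reduce instead of functools.reduce to combine the masks:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})
limits_dic = {"A": 0, "B": 2, "C": -1}

# Stack the per-column boolean masks and AND them element-wise in one call.
masks = [df[col] > threshold for col, threshold in limits_dic.items()]
print(df.loc[np.logical_and.reduce(masks)].reset_index(drop=True))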
Here is how I do this without building a string and using df.query:
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = None
# Build the conjunction one clause at a time
for key, val in limits_dic.items():
    if cond is None:
        cond = df[key] > val
    else:
        cond = cond & (df[key] > val)
df.loc[cond]
A B C
1 2 5 2
2 10 3 1
4 3 6 2
Note the hard-coded (>, &) operators (since I wanted to follow your example exactly).
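If the comparison operators themselves need to be configurable, one possible sketch carries them in the dictionary via the operator module (the dictionary layout below is made up for illustration):
import operator
import pandas as pd

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})

# Hypothetical layout: column -> (comparison function, threshold).
limits = {"A": (operator.gt, 0), "B": (operator.gt, 2), "C": (operator.gt, -1)}

cond = None
for col, (op, threshold) in limits.items():
    clause = op(df[col], threshold)
    cond = clause if cond is None else (cond & clause)
print(df.loc[cond])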
How can I aggregate to get the average of b within each group a, excluding the current row (the target result is in column c)?
a b c
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 1 0.5 # (avg of 0 & 1, excluding 1)
1 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
2 0 1 # (avg of 1 & 1, excluding 0)
2 1 0.5 # (avg of 0 & 1, excluding 1)
3 1 0.5 # (avg of 0 & 1, excluding 1)
3 0 1 # (avg of 1 & 1, excluding 0)
3 1 0.5 # (avg of 0 & 1, excluding 1)
Data dump:
import pandas as pd
data = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
Suppose a group has values x_1, ..., x_n.
The average of the entire group would be
m = (x_1 + ... + x_n)/n
The sum of the group without x_i would be
(m*n - x_i)
The average of the group without x_i would be
(m*n - x_i)/(n-1)
Therefore, you could compute the desired column of values with
import pandas as pd
df = pd.DataFrame([[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1], [2, 1, 0.5], [2, 0, 1],
[2, 1, 0.5], [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]],
columns=['a', 'b', 'c'])
grouped = df.groupby(['a'])
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
which yields
In [32]: df
Out[32]:
a b c result
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
In [33]: assert df['result'].equals(df['c'])
Per the comments below, in the OP's actual use case, the DataFrame's a column
contains strings:
import numpy as np

def make_random_str_array(letters, strlen, size):
    return (np.random.choice(list(letters), size*strlen)
            .view('|S{}'.format(strlen)))
N = 3*10**6
df = pd.DataFrame({'a':make_random_str_array(letters='ABCD', strlen=10, size=N),
'b':np.random.randint(10, size=N)})
so that there are about a million unique values in df['a'] out of 3 million
total:
In [87]: uniq, key = np.unique(df['a'], return_inverse=True)
In [88]: len(uniq)
Out[88]: 988337
In [89]: len(df)
Out[89]: 3000000
In this case the calculation above requires (on my machine) about 11 seconds:
In [86]: %%timeit
   ....: grouped = df.groupby(['a'])
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 10.5 s per loop
Pandas converts all string-valued columns to object dtype. But we could convert the DataFrame column to a NumPy array with a fixed-width dtype, and then group according to those values.
Here is a benchmark showing that if we convert the Series with object dtype to a NumPy array with fixed-width string dtype, the calculation requires less than 2 seconds:
In [97]: %%timeit
   ....: grouped = df.groupby(df['a'].values.astype('|S4'))
   ....: n = grouped['b'].transform('count')
   ....: mean = grouped['b'].transform('mean')
   ....: df['result'] = (mean*n - df['b'])/(n-1)
   ....:
1 loops, best of 3: 1.39 s per loop
Beware that you need to know the maximum length of the strings in df['a'] to choose the appropriate fixed-width dtype. In the example above, all the strings have length 4, so |S4 works. If you use |Sn for some integer n and n is smaller than the longest string, those strings will be silently truncated with no error or warning, which could lead to grouping values that should not be grouped together. Thus, the onus is on you to choose the correct fixed-width dtype.
You could use
dtype = '|S{}'.format(df['a'].str.len().max())
grouped = df.groupby(df['a'].values.astype(dtype))
to ensure the conversion uses the correct dtype.
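In recent pandas versions, another option worth trying (not benchmarked here) is converting the string column to a categorical dtype before grouping; a small self-contained sketch with an illustrative frame:
import pandas as pd

# Stand-in for the large frame above; the values here are only illustrative.
df = pd.DataFrame({'a': ['x', 'x', 'y', 'y', 'y'],
                   'b': [1, 2, 3, 4, 5]})

# Grouping by the integer category codes is typically faster than grouping by
# object-dtype strings, and there is no fixed-width truncation concern.
grouped = df.groupby(df['a'].astype('category'), observed=True)
n = grouped['b'].transform('count')
mean = grouped['b'].transform('mean')
df['result'] = (mean*n - df['b'])/(n-1)
print(df)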
You can calculate the statistics manually by iterating group by group:
# Set up input
import pandas as pd
df = pd.DataFrame([
[1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
[2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
[3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])
df
a b c
0 1 1 0.5
1 1 1 0.5
2 1 0 1.0
3 2 1 0.5
4 2 0 1.0
5 2 1 0.5
6 3 1 0.5
7 3 0 1.0
8 3 1 0.5
# Perform grouping, excluding the current row
results = []
grouped = df.groupby(['a'])
for key, group in grouped:
    for idx, row in group.iterrows():
        # The group excluding the current row
        group_other = group.drop(idx)
        avg = group_other['b'].mean()
        results.append(row.tolist() + [avg])
# Compare our results with what is expected
results_df = pd.DataFrame(
results, columns=['a', 'b', 'c', 'c_new']
)
results_df
a b c c_new
0 1 1 0.5 0.5
1 1 1 0.5 0.5
2 1 0 1.0 1.0
3 2 1 0.5 0.5
4 2 0 1.0 1.0
5 2 1 0.5 0.5
6 3 1 0.5 0.5
7 3 0 1.0 1.0
8 3 1 0.5 0.5
This way you can use any statistic you want.
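Building on that, here is a sketch of the same leave-one-out idea without the explicit Python loops; it assumes the frame has a unique index so that drop removes exactly one row, and unlike the closed-form mean it also works for statistics such as the median:
import pandas as pd

df = pd.DataFrame([
    [1, 1, 0.5], [1, 1, 0.5], [1, 0, 1],
    [2, 1, 0.5], [2, 0, 1], [2, 1, 0.5],
    [3, 1, 0.5], [3, 0, 1], [3, 1, 0.5]
], columns=['a', 'b', 'c'])

# For each group, recompute the statistic with the current row dropped.
df['c_new'] = df.groupby('a')['b'].transform(
    lambda s: pd.Series([s.drop(i).mean() for i in s.index], index=s.index)
)
print(df)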