I have a dataframe like this:
right_answer rater1 rater2 rater3 item
1 1 1 2 S01
1 1 2 2 S02
2 1 2 1 S03
2 2 1 2 S04
and I need to get the rows (or the values in 'item') where at least two of the three raters gave the wrong answer. I can already check whether all the raters gave the right answer with this code:
df.where(df[['rater1', 'rater2', 'rater3']].eq(df.iloc[:, 0], axis=0).all(1) == True)
I don't want to calculate a majority-vote column, because I may need to adjust the number of raters that have to agree or disagree with the right answer.
Thanks for help
Use DataFrame.filter to select the columns whose names contain "rater", then DataFrame.ne along axis=0 to compare those columns with the right_answer column. DataFrame.sum along axis=1 gives the number of raters who answered wrongly in each row, and Series.ge turns that count into a boolean mask. Finally, filter the dataframe rows with this mask:
mask = (
    df.filter(like='rater')
    .ne(df['right_answer'], axis=0).sum(axis=1).ge(2)
)
df = df[mask]
Result:
# print(df)
right_answer rater1 rater2 rater3 item
1 1 1 2 2 S02
2 2 1 2 1 S03
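Since the question asks for an adjustable number of disagreeing raters, the same mask can be wrapped in a small helper. This is only a sketch: the function name rows_with_wrong_raters and the parameters min_wrong, rater_prefix and answer_col are hypothetical names introduced here, not part of the original answer.

```python
import pandas as pd

def rows_with_wrong_raters(data, min_wrong=2, rater_prefix="rater",
                           answer_col="right_answer"):
    """Return rows where at least `min_wrong` raters differ from the answer.

    All names here are illustrative assumptions, not from the original answer.
    """
    raters = data.filter(like=rater_prefix)
    # Count, per row, how many rater columns differ from the right answer.
    wrong_counts = raters.ne(data[answer_col], axis=0).sum(axis=1)
    return data[wrong_counts.ge(min_wrong)]

# The example data from the question.
df = pd.DataFrame({
    "right_answer": [1, 1, 2, 2],
    "rater1": [1, 1, 1, 2],
    "rater2": [1, 2, 2, 1],
    "rater3": [2, 2, 1, 2],
    "item": ["S01", "S02", "S03", "S04"],
})

flagged = rows_with_wrong_raters(df, min_wrong=2)
print(flagged["item"].tolist())
```

Changing min_wrong to 1 or 3 adjusts the threshold without touching the rest of the code.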
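For a speed-up, use pure NumPy broadcasting (recent pandas versions no longer allow indexing a Series with [:, None], so convert to a NumPy array first):
import numpy as np
diffs = np.not_equal(df.filter(like='rater'), df['right_answer'].to_numpy()[:, None])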
diffs = np.sum(diffs, axis=1) >= 2
df[diffs]
right_answer rater1 rater2 rater3 item
1 1 1 2 2 S02
2 2 1 2 1 S03
Let's time it!
# create dataframe with 4 million rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape
# (4000000, 5)
def numpy_broadcasting(data):
    diffs = np.not_equal(data.filter(like='rater'), data['right_answer'].to_numpy()[:, None])
    return np.sum(diffs, axis=1) >= 2
def pandas_method(data):
    # compare against the column of the frame passed in, not the global df
    return (
        data.filter(like='rater')
        .ne(data['right_answer'], axis=0).sum(axis=1).ge(2)
    )
%%timeit
numpy_broadcasting(dfbig)
# 92.5 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pandas_method(dfbig)
# 296 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy broadcasting is 296 / 92.5 = 3.2 times faster
Related
Intention: to filter binary numbers based on their Hamming weights using pandas. Here I count the number of 1s in the binary representation of each number and write the count to df.
Effort so far:
import pandas as pd
def ones(num):
    return bin(num).count('1')
num = list(range(1,8))
C = pd.Index(["num"])
df = pd.DataFrame(num, columns=C)
df['count'] = df.apply(lambda row : ones(row['num']), axis = 1)
print(df)
output:
num count
0 1 1
1 2 1
2 3 2
3 4 1
4 5 2
5 6 2
6 7 3
Intended output:
1 2 3
0 1 3 7
1 2 5
2 4 6
Help!
You can use pivot_table. Though you'll need to define the index as the cumcount of the grouped count column, pivot_table can't figure it out all on its own :)
(df.pivot_table(index=df.groupby('count').cumcount(),
columns='count',
values='num'))
count 1 2 3
0 1.0 3.0 7.0
1 2.0 5.0 NaN
2 4.0 6.0 NaN
You also have the parameter fill_value, though I wouldn't recommend you to use it, since you'll get mixed types. Now it looks like NumPy would be a good option from here, you can easily obtain an array from the result with new_df.to_numpy().
Also, focusing on the logic in ones, we can vectorise this with (based on this answer):
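import numpy as np
# itemsize is in bytes (8 for int64), so this tests only the lowest 8 bits;
# that is enough for values below 256, as in this example
m = df.num.to_numpy().itemsize
df['count'] = (df.num.to_numpy()[:,None] & (1 << np.arange(m)) > 0).view('i1').sum(1)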
Here's a check on both approaches' performance:
df_large = pd.DataFrame({'num':np.random.randint(0,10,(10_000))})
def vect(df):
    m = df.num.to_numpy().itemsize
    return (df.num.to_numpy()[:,None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
%timeit vect(df_large)
# 340 µs ± 5.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_large.apply(lambda row : ones(row['num']), axis = 1)
# 103 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
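The bit-mask approach above only inspects as many bits as np.arange(m) covers. As a hedged alternative sketch (not from the original answer), the Hamming weight of non-negative integers can be computed over the full 64-bit width by viewing each value as 8 bytes and unpacking the bits with np.unpackbits. The helper name popcount is hypothetical, and the sketch assumes all values are non-negative and fit in 64 bits.

```python
import numpy as np
import pandas as pd

def popcount(values):
    """Count set bits per element (hypothetical helper).

    Assumes non-negative integers that fit in 64 bits.
    """
    arr = np.ascontiguousarray(values, dtype=np.uint64)
    # View each 64-bit value as 8 separate bytes: shape (n, 8).
    as_bytes = arr.view(np.uint8).reshape(-1, 8)
    # Unpack every byte into 8 bits, giving shape (n, 64), then sum the bits.
    return np.unpackbits(as_bytes, axis=1).sum(axis=1)

df = pd.DataFrame({"num": list(range(1, 8))})
df["count"] = popcount(df["num"].to_numpy())
print(df)
```

Byte order does not matter here because only the total number of set bits is used.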
I suggest a different output:
df.groupby("count").agg(list)
which will give you
num
count
1 [1, 2, 4]
2 [3, 5, 6]
3 [7]
it's the same information in a slightly different format. In your original pivoted format, the rows are meaningless and you have an undetermined number of columns. I suggest it is more common to have an undetermined number of rows. I think you'll find this easier to work with going forward.
Or consider just creating a dictionary as a DataFrame is adding a lot of overhead here for no benefit:
df.groupby("count").agg(list).to_dict()["num"]
which gives you
{
1: [1, 2, 4],
2: [3, 5, 6],
3: [7],
}
Here's one approach
df.groupby('count')['num'].agg(list).apply(pd.Series).T
I have a dataframe with 50 columns and more than 200 rows of binary values:
a1 a2 a3 a4 ….. a50
0 1 0 1 ….. 1
1 0 0 1 …. 0
0 1 1 0 …. 0
1 1 1 0 …. 1
I would like to compare the cell values of the first row with each of the other rows, one by one, and create a 51st column that lists the non-matching columns, as below (since the first row is not compared with any other row, it gets a NaN value):
a51
NAN
a1,a2,…,a50
a3,a4…,a50
a1,a3,a4,…
I am not sure how to do this efficiently, and I have not found any answer to a similar question. Sorry if this is a duplicate. Thank you in advance!
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(2,size=(200,50)),
                  columns =[f'a{i}' for i in range(1,51)])
DataFrame.dot + DataFrame.add_suffix and Series.str.rstrip
df['a51']=df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
Time comparison for 50 columns and 200 rows
%%timeit
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
25.4 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
41.1 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
147 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here's one approach:
import numpy as np
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
0 a1, a2, a50
1 a3, a4, a50
2 a1, a3, a4
dtype: object
Input data:
print(df)
a1 a2 a3 a4 a50
0 0 1 0 1 1
1 1 0 0 1 0
2 0 1 1 0 0
3 1 1 1 0 1
I assume you want the list of column names that don't match the first row:
df['a51'] = df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
200 rows is small enough so that apply(..., axis=1) is not a performance concern.
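To make the expected result concrete, here is a small self-contained sketch on a four-column version of the data. It compares every later row with the first row in one NumPy operation and joins the names of the mismatching columns. The column name a5 and the variable names are illustrative assumptions. It deliberately uses a plain list comprehension rather than the dot trick, so it is not a performance claim.

```python
import numpy as np
import pandas as pd

# Reduced example: four binary columns instead of fifty.
df = pd.DataFrame(
    [[0, 1, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0],
     [1, 1, 1, 0]],
    columns=["a1", "a2", "a3", "a4"],
)

cols = df.columns.to_numpy()
values = df.to_numpy()

# Boolean matrix: True where a row differs from the first row.
mismatch = values[1:] != values[0]

# Join the names of the mismatching columns for each compared row.
labels = [", ".join(cols[row]) for row in mismatch]

# The first row is not compared with anything, so it gets NaN.
result = pd.Series([np.nan] + labels, index=df.index, name="a5")
print(result)
```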
I have these columns in a dataframe. I want to iterate over the rows and, if the value in 'Entry_Point' is a str, put the Client_Num in Delivery_Point; otherwise Delivery_Point should get the Entry_Point value.
df
Client_Num Entry_Point Delivery_Point
1 0
2 a
3 3
4 4
5 b
6 c
8 d
It should look like this:
Client_Num Entry_Point Delivery_Point
1 10 10
2 a 2
3 32 32
4 14 14
5 b 5
6 c 6
8 d 8
I already tried a for loop, but it takes too long, especially with 20k rows.
for i in range(len(df)):
    if type(df.loc[i]['Entry_Point']) == str:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
    else:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
Let's use pandas to_numeric:
df['New']=pd.to_numeric(df.Entry_Point,errors='coerce').fillna(df.Client_Num)
df
Out[22]:
Client_Num Entry_Point New
0 1 0 0.0
1 2 a 2.0
2 3 3 3.0
3 4 4 4.0
4 5 b 5.0
5 6 c 6.0
6 8 d 8.0
A pandas column is usually imported with a single data type, so checking the type of each value may not give the result you expect. I think you want to do the following (this assumes every Entry_Point value is a string):
df['Delivery_Point'] = df.apply(lambda x: x.Client_Num if not x.Entry_Point.strip().isnumeric() else x.Entry_Point, axis=1)
Another option that might perform even better on very large datasets is to use vectorized numpy functions:
import numpy as np
@np.vectorize
def get_if_str(client_num, entry_point):
    # return the client number when the entry point is a string
    if isinstance(entry_point, str):
        return client_num
    return entry_point
df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
We can compare the times here:
## slow way
def generic(df):
    for i in range(len(df)):
        if type(df.loc[i]['Entry_Point']) == str:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
        else:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
%timeit generic(df)
# 237 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Milliseconds
%timeit df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
#185 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Microseconds
As you can see, there are considerable gains from using NumPy vectorized functions. More about them can be found in the NumPy documentation for np.vectorize.
EDIT
If you actually use the numpy array of the values, you should get an even better performance from the vectorization:
df['Delivery_Point'] = get_if_str(df['Client_Num'].values, df['Entry_Point'].values)
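As a hedged alternative sketch, the same rule can be written without a per-row Python function call by building a boolean mask and using Series.where. This assumes Entry_Point is a mixed-type (object) column holding both integers and strings, as the question's isinstance check implies. The sample values are taken from the expected output in the question, and the variable name is_str is introduced here for illustration.

```python
import pandas as pd

# Mixed-type Entry_Point column, following the question's expected output.
df = pd.DataFrame({
    "Client_Num": [1, 2, 3, 4, 5, 6, 8],
    "Entry_Point": [10, "a", 32, 14, "b", "c", "d"],
})

# True where the entry point is a string (hypothetical helper mask).
is_str = df["Entry_Point"].map(lambda v: isinstance(v, str))

# Keep Client_Num where the mask is True, otherwise take Entry_Point.
df["Delivery_Point"] = df["Client_Num"].where(is_str, df["Entry_Point"])
print(df)
```

The mask is still built element by element, but the selection itself is a single vectorized operation.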
Given a pandas Series with an index:
import pandas as pd
s = pd.Series(data=[1,2,3],index=['a','b','c'])
How can a Series be used to fill the diagonal entries of an empty DataFrame in pandas version >= 0.23.0?
The resulting DataFrame would look like:
a b c
a 1 0 0
b 0 2 0
c 0 0 3
There is a prior similar question which will fill the diagonal with the same value, my question is asking to fill the diagonal with varying values from a Series.
Thank you in advance for your consideration and response.
First create DataFrame and then numpy.fill_diagonal:
import numpy as np
s = pd.Series(data=[1,2,3],index=['a','b','c'])
df = pd.DataFrame(0, index=s.index, columns=s.index, dtype=s.dtype)
np.fill_diagonal(df.values, s)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
Another solution is to create an empty 2D array, fill its diagonal, and finally pass it to the DataFrame constructor:
arr = np.zeros((len(s), len(s)), dtype=s.dtype)
np.fill_diagonal(arr, s)
print (arr)
[[1 0 0]
[0 2 0]
[0 0 3]]
df = pd.DataFrame(arr, index=s.index, columns=s.index)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
I'm not sure about directly doing it with Pandas, but you can do this easily enough if you don't mind using numpy.diag() to build the diagonal data matrix for your series and then plugging that into a DataFrame:
diag_data = np.diag(s) # don't need s.as_matrix(), turns out
df = pd.DataFrame(diag_data, index=s.index, columns=s.index)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
In one line:
df = pd.DataFrame(np.diag(s),
index=s.index,
columns=s.index)
Timing comparison with a Series made from a random array of 10000 elements:
s = pd.Series(np.random.rand(10000), index=np.arange(10000))
df = pd.DataFrame(np.diag(s), ...)
173 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
df = pd.DataFrame(0, ...)
np.fill_diagonal(df.values, s)
212 ms ± 909 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
mat = np.zeros(...)
np.fill_diagonal(mat, s)
df = pd.DataFrame(mat, ...)
175 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
It looks like the first and third option shown here are essentially the same, while the middle option is the slowest.
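As a quick sanity check, the following sketch builds the frame with np.diag and with np.zeros plus np.fill_diagonal on the small example Series and confirms that both give the same result. The variable names via_diag, via_fill and same are illustrative. The variant that fills df.values in place is left out because whether df.values is writable depends on the pandas version.

```python
import numpy as np
import pandas as pd

s = pd.Series(data=[1, 2, 3], index=["a", "b", "c"])

# Option 1: build the diagonal matrix directly.
via_diag = pd.DataFrame(np.diag(s.to_numpy()), index=s.index, columns=s.index)

# Option 2: start from zeros and fill the diagonal in place.
arr = np.zeros((len(s), len(s)), dtype=s.dtype)
np.fill_diagonal(arr, s.to_numpy())
via_fill = pd.DataFrame(arr, index=s.index, columns=s.index)

# Both constructions should yield identical frames.
same = via_diag.equals(via_fill)
print(via_diag)
print(same)
```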
I am trying to convert values within the current dataframe as the "Index" and the dataframe's Index as the "Labels". For Example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I managed to do this using a loop to check and apply the necessary labels/values but with millions of labels to mark this process becomes extremely time consuming. Is there a way to do this in a smarter and quicker way? Thanks in advance.
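Use stack with the DataFrame constructor (the explicit dropna keeps this working on pandas versions where stack no longer drops missing values by default):
s = df.stack().dropna()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()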
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
I came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
If the filtered result is assigned back to s, a subsequent conversion gives an integer index:
s.index = s.index.astype(int)
A similar (slightly faster depending on your data) solution which also results in an integer index is performing the filtering before converting to Series -
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
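For reference, here is a self-contained sketch of the repeat/ravel idea wrapped in a function that works for any number of value columns, run on the example data from the question. The function name labels_from_values is hypothetical, and the sketch assumes all non-missing values are whole numbers.

```python
import numpy as np
import pandas as pd

def labels_from_values(frame):
    """Map each non-missing cell value to the row label it came from.

    Hypothetical helper; assumes non-missing values are whole numbers.
    """
    values = frame.to_numpy(dtype=float).ravel()
    # Repeat each row label once per column so it lines up with ravel().
    labels = frame.index.repeat(frame.shape[1])
    keep = ~np.isnan(values)
    return pd.Series(labels[keep], index=values[keep].astype(int),
                     name="Labels").sort_index()

df = pd.DataFrame({"Value1": [0, 2, np.nan], "Value2": [1, 4, 3]})
result = labels_from_values(df)
print(result)
```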
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)