Given a pandas Series with an index:
import pandas as pd
s = pd.Series(data=[1,2,3],index=['a','b','c'])
How can a Series be used to fill the diagonal entries of an empty DataFrame in pandas version >= 0.23.0?
The resulting DataFrame would look like:
   a  b  c
a  1  0  0
b  0  2  0
c  0  0  3
A similar earlier question covers filling the diagonal with a single value; my question is about filling the diagonal with varying values from a Series.
Thank you in advance for your consideration and response.
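First create the DataFrame, then fill its diagonal with numpy.fill_diagonal. This writes through df.values, so it relies on that array being a writable view of the frame's data, which may not hold when copy-on-write is enabled in newer pandas: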
import numpy as np
s = pd.Series(data=[1,2,3],index=['a','b','c'])
df = pd.DataFrame(0, index=s.index, columns=s.index, dtype=s.dtype)
np.fill_diagonal(df.values, s)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
Another solution is to create an empty 2D array, fill its diagonal, and finally pass it to the DataFrame constructor:
arr = np.zeros((len(s), len(s)), dtype=s.dtype)
np.fill_diagonal(arr, s)
print (arr)
[[1 0 0]
[0 2 0]
[0 0 3]]
df = pd.DataFrame(arr, index=s.index, columns=s.index)
print (df)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
I'm not sure about directly doing it with Pandas, but you can do this easily enough if you don't mind using numpy.diag() to build the diagonal data matrix for your series and then plugging that into a DataFrame:
diag_data = np.diag(s) # don't need s.as_matrix(), turns out
df = pd.DataFrame(diag_data, index=s.index, columns=s.index)
a b c
a 1 0 0
b 0 2 0
c 0 0 3
In one line:
df = pd.DataFrame(np.diag(s),
index=s.index,
columns=s.index)
Timing comparison with a Series made from a random array of 10000 elements:
s = pd.Series(np.random.rand(10000), index=np.arange(10000))
df = pd.DataFrame(np.diag(s), ...)
173 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
df = pd.DataFrame(0, ...)
np.fill_diagonal(df.values, s)
212 ms ± 909 µs per loop (mean ± std. dev. of 7 runs, 20 loops each)
mat = np.zeros(...)
np.fill_diagonal(mat, s)
df = pd.DataFrame(mat, ...)
175 ms ± 3.72 ms per loop (mean ± std. dev. of 7 runs, 20 loops each)
The first and third options take essentially the same time, while the middle option (filling df.values in place) is the slowest.
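As a quick sanity check, the np.diag route and the fill-then-wrap route can be compared on the small Series from the question. This is only a sketch; the names df_diag, df_fill and same are made up for illustration, and s.to_numpy() is used so that NumPy receives a plain array.

```python
import numpy as np
import pandas as pd

s = pd.Series(data=[1, 2, 3], index=['a', 'b', 'c'])

# Option 1: np.diag builds the full diagonal matrix in one call.
df_diag = pd.DataFrame(np.diag(s.to_numpy()), index=s.index, columns=s.index)

# Option 3: fill a zero array first, then wrap it in a DataFrame.
arr = np.zeros((len(s), len(s)), dtype=s.dtype)
np.fill_diagonal(arr, s.to_numpy())
df_fill = pd.DataFrame(arr, index=s.index, columns=s.index)

# DataFrame.equals compares values, labels and dtypes.
same = df_diag.equals(df_fill)
print(same)
```

If the two frames ever differ, the dtype is the first thing to look at, since np.zeros defaults to float64 when no dtype is passed.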
I have a dataframe like this:
right_answer rater1 rater2 rater3 item
1 1 1 2 S01
1 1 2 2 S02
2 1 2 1 S03
2 2 1 2 S04
and I need to get the rows, or the values in 'item', where at least two of the three raters gave the wrong answer. I can already check whether all raters agree with the right answer using this code:
df.where(df[['rater1', 'rater2', 'rater3']].eq(df.iloc[:, 0], axis=0).all(1) == True)
I don't want to compute a majority-vote column, because I may need to adjust the number of raters that have to agree or disagree with the right answer.
Thanks for the help
Use DataFrame.filter to select the columns whose names contain 'rater', then DataFrame.ne along axis=0 to compare them with the right_answer column. DataFrame.sum along axis=1 then counts the raters who gave a wrong answer in each row, and Series.ge turns that count into a boolean mask for filtering the rows:
mask = (
df.filter(like='rater')
.ne(df['right_answer'], axis=0).sum(axis=1).ge(2)
)
df = df[mask]
Result:
# print(df)
   right_answer  rater1  rater2  rater3 item
1             1       1       2       2  S02
2             2       1       2       1  S03
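Since the question mentions wanting to adjust how many raters must disagree, the same mask can be wrapped in a small function. This is a sketch only: rows_with_wrong_raters and min_wrong are hypothetical names, and the frame is rebuilt from the table in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'right_answer': [1, 1, 2, 2],
    'rater1': [1, 1, 1, 2],
    'rater2': [1, 2, 2, 1],
    'rater3': [2, 2, 1, 2],
    'item': ['S01', 'S02', 'S03', 'S04'],
})

def rows_with_wrong_raters(data, min_wrong=2):
    """Return rows where at least `min_wrong` rater columns differ from right_answer."""
    raters = data.filter(like='rater')
    # Compare every rater column with right_answer row by row, then count mismatches.
    n_wrong = raters.ne(data['right_answer'], axis=0).sum(axis=1)
    return data[n_wrong.ge(min_wrong)]

at_least_two = rows_with_wrong_raters(df, min_wrong=2)
at_least_one = rows_with_wrong_raters(df, min_wrong=1)
print(at_least_two['item'].tolist())
```

Changing min_wrong is then the only adjustment needed when the required number of disagreeing raters changes.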
For a speed-up, use pure NumPy broadcasting:
diffs = np.not_equal(df.filter(like='rater').to_numpy(), df['right_answer'].to_numpy()[:, None])
diffs = np.sum(diffs, axis=1) >= 2
df[diffs]
   right_answer  rater1  rater2  rater3 item
1             1       1       2       2  S02
2             2       1       2       1  S03
Let's time it!
# create dataframe with 4 million rows
dfbig = pd.concat([df]*1000000, ignore_index=True)
dfbig.shape
# (4000000, 5)
def numpy_broadcasting(data):
    diffs = np.not_equal(data.filter(like='rater').to_numpy(), data['right_answer'].to_numpy()[:, None])
    return np.sum(diffs, axis=1) >= 2
def pandas_method(data):
    # compare against the frame that was passed in, not the global df
    mask = (
        data.filter(like='rater')
        .ne(data['right_answer'], axis=0).sum(axis=1).ge(2)
    )
    return mask
%%timeit
numpy_broadcasting(dfbig)
# 92.5 ms ± 789 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pandas_method(dfbig)
# 296 ms ± 7.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NumPy broadcasting is 296 / 92.5 ≈ 3.2 times faster.
I have a dataframe with 50 columns and more than 200 rows of binary values:
a1 a2 a3 a4 ….. a50
0 1 0 1 ….. 1
1 0 0 1 …. 0
0 1 1 0 …. 0
1 1 1 0 …. 1
I would like to compare the cell values of the first row with each of the other rows, one by one, and create a 51st column that lists the non-matching columns, as below (since the first row is not compared with any row, it gets a NaN value):
a51
NAN
a1,a2,…,a50
a3,a4…,a50
a1,a3,a4,…
I am not sure how to do this efficiently. I have not found any answer similar to this question. Sorry if this is a duplicate. Thank you in advance!
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(2,size=(200,50)),
                  columns =[f'a{i}' for i in range(1,51)])
DataFrame.dot + DataFrame.add_suffix and Series.str.rstrip
df['a51']=df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
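To see why the dot product yields column names, it may help to run the same steps on a tiny frame. The sketch below uses a made-up four-column frame and illustrative names (mismatch, labels, joined); the idea is that True times a string keeps the string, False times a string gives an empty string, and the row-wise sum concatenates what is left.

```python
import pandas as pd

df = pd.DataFrame({
    'a1': [0, 1, 0, 1],
    'a2': [1, 0, 1, 1],
    'a3': [0, 0, 1, 1],
    'a4': [1, 1, 0, 0],
})

# True wherever a cell in rows 1.. differs from the same column in row 0.
mismatch = df.iloc[1:].ne(df.iloc[0])

# Column names with a separator appended: 'a1, ', 'a2, ', ...
labels = df.add_suffix(', ').columns

# Boolean-times-string dot product concatenates the names of mismatching columns;
# rstrip then removes the trailing separator.
joined = mismatch.dot(labels).str.rstrip(', ')

# Assignment aligns on the index, so row 0 (never compared) becomes NaN.
df['a51'] = joined
print(df)
```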
Time comparison for 50 columns and 200 rows
%%timeit
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
25.4 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
41.1 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
147 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here's one approach:
import numpy as np
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
0 a1, a2, a50
1 a3, a4, a50
2 a1, a3, a4
dtype: object
Input data:
print(df)
a1 a2 a3 a4 a50
0 0 1 0 1 1
1 1 0 0 1 0
2 0 1 1 0 0
3 1 1 1 0 1
I assume you want the list of column names that don't match the first row:
df['a51'] = df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
200 rows is small enough so that apply(..., axis=1) is not a performance concern.
I want to apply a function to each column of a DataFrame.
Which rows to apply this to depends on some column-specific condition.
The parameter values to use also depend on the column.
Take this very simple DataFrame:
>>> df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
I want to apply a function to each column using column-specific values contained in an array, say:
>>> multiplier = np.array([0, 100, 1000]) # First column multiplied by 0, second by 100...
I also only want to multiply rows whose index is within a column-specific range, say below the values contained in this array:
>>> limit = np.array([2, 3, 4]) # Only first two elements in first column get multiplied, first three in second column...
What works is this:
>>> for i in range(limit.shape[0]):
...     df.loc[df.index<limit[i], i] = multiplier[i] * df.loc[:, i]
>>> df
0 1 2
0 0 100 2000
1 0 400 5000
2 6 700 8000
3 9 10 11000
4 12 13 14
But this approach is way too slow for the large DataFrames I'm dealing with.
Is there some way to vectorize this?
You could take advantage of the underlying NumPy array.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])
# work on a writable copy of the frame's data
df1 = df.to_numpy(copy=True)
for i in np.arange(limit.size):
    # multiply only the first limit[i] rows of column i
    df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]
df2 = pd.DataFrame(df1)
print (df2)
0 1 2
0 0 100 2000
1 0 400 5000
2 6 700 8000
3 9 10 11000
4 12 13 14
Performance:
# Your implementation
%timeit for i in range(limit.shape[0]): df.loc[df.index<limit[i], i] = multiplier[i] * df.loc[:, i]
3.92 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy implementation (High Performance Gain)
%timeit for i in np.arange(limit.size): df1[: limit[i], i] = df1[: limit[i], i] * multiplier[i]
25 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
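The loop over columns can also be removed entirely with a broadcast mask. The following is a sketch under the assumptions of the question (a numeric frame, one multiplier and one limit per column); the names values, mask and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])

values = df.to_numpy()

# Index labels as a column vector, compared against the per-column limits:
# broadcasting gives a (rows, columns) mask, True where index < that column's limit.
mask = df.index.to_numpy()[:, None] < limit

# values * multiplier scales each column by its own factor;
# np.where keeps the scaled value only where the mask is True.
result = pd.DataFrame(
    np.where(mask, values * multiplier, values),
    index=df.index,
    columns=df.columns,
)
print(result)
```

This computes the product for every cell and then discards the ones outside the mask, so it trades some extra arithmetic for having no Python-level loop.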
Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is a mapping: if Amount is more than 3, Map is 1; if Amount is less than 3, Map is 0.
Id Amount Map
1 6 1
2 2 0
3 0 0
4 6 1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Map'] = df['Map'].fillna(0)
It works, but it is not very configurable and not efficient.
Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
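Because the question asks for something configurable, the comparison can be wrapped in a helper with the threshold and the two output labels as parameters. This is only a sketch; add_map, threshold, high and low are hypothetical names, and np.where is swapped in for astype(int) so that labels other than 1 and 0 are possible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4], 'Amount': [6, 2, 0, 6]})

def add_map(data, threshold=3, high=1, low=0):
    """Return a copy of `data` with a Map column: `high` where Amount >= threshold, else `low`."""
    out = data.copy()
    out['Map'] = np.where(out['Amount'] >= threshold, high, low)
    return out

mapped = add_map(df)
print(mapped)
```

Calling add_map(df, threshold=5) or passing different labels changes the mapping without touching the rest of the code.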
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I am trying to use the values in the current dataframe as the new "Index" and the dataframe's original Index as the "Labels". For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I do this with a loop that checks and applies the necessary labels/values, but with millions of labels to mark, this becomes extremely time consuming. Is there a smarter and quicker way? Thanks in advance.
Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
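The approach above depends on stack leaving out the NaN entry, as the output shows. Whether stack drops missing values by default can depend on the pandas version (worth checking against your installed release), so a slightly more defensive sketch calls dropna explicitly. The frame is rebuilt from the question and the name labels is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value1': [0, 2, np.nan], 'Value2': [1, 4, 3]})

# Stack to a Series indexed by (row label, column name); drop NaN explicitly
# so the later integer conversion cannot fail on a missing value.
s = df.stack().dropna()

labels = pd.DataFrame(
    s.index.get_level_values(0).to_numpy(),  # original row labels become the data
    columns=['Labels'],
    index=s.to_numpy().astype(int),          # original cell values become the index
).sort_index()
print(labels)
```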
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
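Converting the index of the filtered result afterwards gives an integer index (res is just a name for that result):
res = s[s.index.notnull()].sort_index()
res.index = res.index.astype(int)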
A similar solution (slightly faster, depending on your data) that also produces an integer index filters before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
columns=['Labels'],
index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)