Pandas Convert Dataframe Values to Labels - python

I am trying to use the values within the current dataframe as the new index and the dataframe's index as the labels. For example:
   Value1  Value2
0       0       1
1       2       4
2     NaN       3
This would result in
   Labels
0       0
1       0
2       1
3       2
4       1
Currently I manage to do this using a loop that checks and applies the necessary labels/values, but with millions of labels to mark this process becomes extremely time consuming. Is there a smarter and quicker way to do this? Thanks in advance.

Use stack with DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
   Labels
0       0
1       0
2       1
3       2
4       1
Detail:
print (df.stack())
0  Value1    0.0
   Value2    1.0
1  Value1    2.0
   Value2    4.0
2  Value2    3.0
dtype: float64

Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s[s.index.notnull()].sort_index()
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion of the Series index results in an integer index:
s.index = s.index.astype(int)
A similar (and slightly faster, depending on your data) solution which also results in an integer index is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Related

Compare cell value of first row of a dataframe to cell value of other rows

I have a dataframe which has 50 columns and over 200 rows with binary values:
a1 a2 a3 a4 ….. a50
0 1 0 1 ….. 1
1 0 0 1 …. 0
0 1 1 0 …. 0
1 1 1 0 …. 1
I would like to compare the cell values of the first row to the other rows one by one and make a 51st column which outputs the non-matching cells, as below (since the first row is not compared with any row, it will get a NaN value):
a51
NAN
a1,a2,…,a50
a3,a4…,a50
a1,a3,a4,…
I am not sure how to do this efficiently. I have not found any answer similar to this question. Sorry if I am asking a repeated question. Thank you in advance!
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(2, size=(200, 50)),
                  columns=[f'a{i}' for i in range(1, 51)])
Series.dot + DataFrame.add_suffix and Series.str.rstrip
(ne(df.iloc[0]) builds a boolean mismatch matrix; dotting it with the suffixed column names string-concatenates the names of the non-matching columns for each row, and rstrip removes the trailing separator.)
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
Time comparison for 50 columns and 200 rows
%%timeit
df['a51'] = df.iloc[1:].ne(df.iloc[0]).dot(df.add_suffix(', ').columns).str.rstrip(', ')
25.4 ms ± 681 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
41.1 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
147 ms ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Here's one approach:
import numpy as np
a = df.to_numpy()
m = np.where(a[0,:] != a[1:,None], df.columns, np.nan)
pd.DataFrame(m.squeeze()).stack().groupby(level=0).agg(', '.join)
0 a1, a2, a50
1 a3, a4, a50
2 a1, a3, a4
dtype: object
Input data:
print(df)
a1 a2 a3 a4 a50
0 0 1 0 1 1
1 1 0 0 1 0
2 0 1 1 0 0
3 1 1 1 0 1
I assume you want the list of column names that don't match the first row:
df['a51'] = df.iloc[1:].apply(lambda row: df.columns[df.iloc[0] != row].values, axis=1)
200 rows is small enough so that apply(..., axis=1) is not a performance concern.
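If you prefer the a51 values as comma-separated strings (as in the question) rather than arrays of column names, a small variation of the same apply should work (a sketch, not part of the original answer; run it against the original 50-column frame):
df['a51'] = df.iloc[1:].apply(lambda row: ', '.join(df.columns[df.iloc[0] != row]), axis=1)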

Is there a faster way than "for" to compare values in column to choose the one i want?

I have these two different columns in a dataframe. I want to iterate and, if the value in 'Entry_Point' is a str, put the Client_Num in Delivery_Point.
df
Client_Num Entry_Point Delivery_Point
1          0
2          a
3          3
4          4
5          b
6          c
8          d
It should look like this:
Client_Num Entry_Point Delivery_Point
1          0           0
2          a           2
3          3           3
4          4           4
5          b           5
6          c           6
8          d           8
I already tried doing a for loop, but it takes too long, especially when I have 20k rows.
for i in range(len(df)):
    if type(df.loc[i]['Entry_Point']) == str:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
    else:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
Let us use pandas to_numeric:
df['New'] = pd.to_numeric(df.Entry_Point, errors='coerce').fillna(df.Client_Num)
df
Out[22]:
Client_Num Entry_Point New
0 1 0 0.0
1 2 a 2.0
2 3 3 3.0
3 4 4 4.0
4 5 b 5.0
5 6 c 6.0
6 8 d 8.0
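If you need integer rather than float values in the new column, appending .astype(int) should work (my addition, assuming every Client_Num is an integer so no NaN remains after fillna):
df['New'] = pd.to_numeric(df.Entry_Point, errors='coerce').fillna(df.Client_Num).astype(int)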
A pandas column is imported as a single data type, so the method you apply may not fetch the correct result. I think you want to do the following:
df['Delivery_Point'] = df.apply(lambda x: x.Client_Num if not x.Entry_Point.strip().isnumeric() else x.Entry_Point, axis=1)
Another option that might perform even better on very large datasets is to use vectorized numpy functions:
import numpy as np
@np.vectorize
def get_if_str(client_num, entry_point):
    if isinstance(entry_point, str):
        return client_num
    return entry_point
df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
We can compare the times here:
## slow way
def generic(df):
    for i in range(len(df)):
        if type(df.loc[i]['Entry_Point']) == str:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
        else:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
%timeit generic(df)
# 237 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Milliseconds
%timeit df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
#185 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Microseconds
As you can see, there are considerable gains from using NumPy vectorized functions. More about them can be found in the NumPy documentation for np.vectorize.
EDIT
If you pass the underlying numpy arrays of the values, you should get even better performance from the vectorization:
df['Delivery_Point'] = get_if_str(df['Client_Num'].values, df['Entry_Point'].values)
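A non-vectorize alternative (my own sketch, not from the answers above) is to build a boolean mask of which entries are strings and pick values with numpy.where:
import numpy as np

# True where Entry_Point holds a string (assumes a mixed-type object column)
is_str = df['Entry_Point'].map(lambda x: isinstance(x, str)).to_numpy()
df['Delivery_Point'] = np.where(is_str, df['Client_Num'], df['Entry_Point'])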

How to map new variable in pandas in effective way

Here's my data
Id Amount
1 6
2 2
3 0
4 6
What I need is to map: if Amount is more than 3, Map is 1; if Amount is less than 3, Map is 0.
Id Amount Map
1  6      1
2  2      0
3  0      0
4  6      1
What I did
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df= df.merge(a, on='Id', how='left')
df['Amount'].fillna(0)
It works, but it is not very configurable and not efficient.
Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print (df)
Id Amount Map
0 1 6 1
1 2 2 0
2 3 0 0
3 4 6 1
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
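If you later need to map to values other than 0/1, numpy.where keeps the same vectorized approach while letting you choose arbitrary outputs (a sketch, not part of the original answer):
import numpy as np

# same result as the astype(int) version, but the two outputs can be anything
df['Map'] = np.where(df['Amount'] >= 3, 1, 0)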

Selecting row based on a column variation

Suppose we have a file, named any_csv.csv, containing...
A,B,random
1,2,300
3,4,300
5,6,300
1,2,300
3,4,350
8,9,350
4,5,350
5,6,320
7,8,300
3,3,300
I wish to keep all the rows where random varies/changes.
I made this little program to achieve this, but, as I wish to learn more about pandas and as my program is slower than I expect it to be (~130 seconds to process a 1.2 million line log file), I ask for your help.
import pandas as pd
import numpy as np
df = pd.read_csv('any_csv.csv')
mask = np.zeros(len(df.index), dtype=bool)
# Initializing my current value for comparison purposes.
mask[0] = 1
previous_val = df.iloc[0]['random']
for index, row in df.iterrows():
    if row['random'] != previous_val:
        # If a variation has been detected, switch to True current, and previous index.
        previous_val = row['random']
        mask[index] = 1
        mask[index - 1] = 1
# Keeping the last item.
mask[-1] = 1
df = df.loc[mask]
df.to_csv('any_other_csv.csv', index=False)
I guess that, in short, I wish to know how to apply my if condition without this homemade for-loop, which is overall pretty slow.
Results:
A,B,random
1,2,300
1,2,300
3,4,350
4,5,350
5,6,320
7,8,300
3,3,300
You can utilize pd.Series.shift to create a Boolean mask. The mask indicates when a value is different from the value above or below it within the series.
You can then apply the Boolean mask to your dataframe directly.
mask = (df['random'] != df['random'].shift()) | \
       (df['random'] != df['random'].shift(-1))
df = df[mask]
print(df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
Use boolean indexing with 2 masks to check for different values, with shift and ne for not equal:
df = df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))]
print (df)
A B random
0 1 2 300
3 1 2 300
4 3 4 350
6 4 5 350
7 5 6 320
8 7 8 300
9 3 3 300
For better verification, here are the individual masks:
df['mask1'] = df['random'].ne(df['random'].shift())
df['mask2'] = df['random'].ne(df['random'].shift(-1))
df['mask3'] = df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))
print (df)
A B random mask1 mask2 mask3
0 1 2 300 True False True
1 3 4 300 False False False
2 5 6 300 False False False
3 1 2 300 False True True
4 3 4 350 True False True
5 8 9 350 False False False
6 4 5 350 False True True
7 5 6 320 True True True
8 7 8 300 True False True
9 3 3 300 False True True
Timings:
N = 1000
In [157]: %timeit orig(df)
56.8 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit (df[df['random'].ne(df['random'].shift()) |
                      df['random'].ne(df['random'].shift(-1))])
939 µs ± 7.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# jpp solution - a bit slower
In [159]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.11 ms ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
N = 10000
In [160]: %timeit orig(df)
538 ms ± 3.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit (df[df['random'].ne(df['random'].shift()) | df['random'].ne(df['random'].shift(-1))])
1.16 ms ± 75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# jpp solution - a bit slower
In [162]: %timeit df[(df['random'] != df['random'].shift()) | (df['random'] != df['random'].shift(-1))]
1.28 ms ± 8.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(123)
N = 1000
df = pd.DataFrame({'random':np.random.randint(2, size=N)})
print (df)
def orig(df):
    mask = np.zeros(len(df.index), dtype=bool)
    # Initializing my current value for comparison purposes.
    mask[0] = 1
    previous_val = df.iloc[0]['random']
    for index, row in df.iterrows():
        if row['random'] != previous_val:
            # If a variation has been detected, switch to True current, and previous index.
            previous_val = row['random']
            mask[index] = 1
            mask[index - 1] = 1
    # Keeping the last item.
    mask[-1] = 1
    return df.loc[mask]
You could try something like below:
df.groupby(["A", "random"]).filter(lambda df: df.shape[0] == 1)

'DataFrame' object has no attribute 'sort'

I am facing a problem here. In my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code:
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
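In the code from the question, final.sort() was called without arguments, which sorted by the index, so the direct one-line replacement (a sketch of the fix, applied to the snippet above) is:
final = final.sort_index()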
Pandas Sorting 101
sort was removed in v0.20, replaced by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
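For example (a quick illustration, not part of the original text, using the same df as above):
df.sort_values(by='A').reset_index(drop=True)
   A  B
0  a  7
1  a  5
2  b  2
3  c  9
4  c  3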
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
