'DataFrame' object has no attribute 'sort'

I'm running into a problem here: in my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some ideas?
This is my code:
final.loc[-1] = ['', 'P', 'Actual']
final.index = final.index + 1  # shifting index
final = final.sort()
final.columns = [final.columns, final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)

sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
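Since a bare final.sort() sorted by the index, the drop-in replacement in the snippet above is sort_index (a minimal fix, assuming restoring the row order after the index shift was the intent):
final = final.sort_index()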

Pandas Sorting 101
sort has been replaced in v0.20 by DataFrame.sort_values and DataFrame.sort_index. Aside from this, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
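For instance, combining the two (a small illustrative addition, not part of the original answer):
df.sort_values(by='A').reset_index(drop=True)
A B
0 a 7
1 a 5
2 b 2
3 c 9
4 c 3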
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
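You can also sort each column in a different direction by passing a list to the ascending parameter; for example, ascending "A" and descending "B" (an illustrative extension):
df.sort_values(by=['A', 'B'], ascending=[True, False])
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3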
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem: passing integer positions to iloc will do.
df2.iloc[idx]
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
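The same trick extends to sorting by a column's values: argsort the column, then index with iloc. A short sketch, equivalent to df.sort_values(by='B'):
df.iloc[df['B'].argsort()]
A B
4 b 2
2 c 3
3 a 5
0 a 7
1 c 9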

How to filter values using Pandas?

Intention: to group binary numbers by their Hamming weight using pandas. Here I count the number of 1s in each number's binary representation and write the count to df.
Effort so far:
import pandas as pd

def ones(num):
    return bin(num).count('1')

num = list(range(1, 8))
C = pd.Index(["num"])
df = pd.DataFrame(num, columns=C)
df['count'] = df.apply(lambda row: ones(row['num']), axis=1)
print(df)
output:
num count
0 1 1
1 2 1
2 3 2
3 4 1
4 5 2
5 6 2
6 7 3
Intended output:
1 2 3
0 1 3 7
1 2 5
2 4 6
Help!
You can use pivot_table. Though you'll need to define the index as the cumcount of the grouped count column, pivot_table can't figure it out all on its own :)
(df.pivot_table(index=df.groupby('count').cumcount(),
                columns='count',
                values='num'))
count 1 2 3
0 1.0 3.0 7.0
1 2.0 5.0 NaN
2 4.0 6.0 NaN
You also have the fill_value parameter, though I wouldn't recommend using it here, since you'll get mixed types. From this point NumPy looks like a good option: you can easily obtain an array from the result with new_df.to_numpy().
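For instance (new_df is just a hypothetical name for the pivoted result above):
new_df = df.pivot_table(index=df.groupby('count').cumcount(),
                        columns='count',
                        values='num')
arr = new_df.to_numpy()  # 2-D float array, NaN where a column has fewer entries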
Also, focusing on the logic in ones, we can vectorise it (based on this answer):
# itemsize is in bytes, so m = 8 here; the mask below therefore checks only
# the low 8 bits, which is enough for values under 256, as in this example.
m = df.num.to_numpy().itemsize
df['count'] = (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
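To see what the broadcasted expression computes, here is a worked illustration for a single value, num = 5 (binary 101):
bits = 1 << np.arange(8)  # array([1, 2, 4, 8, 16, 32, 64, 128])
(5 & bits) > 0            # array([ True, False, True, False, False, False, False, False])
((5 & bits) > 0).sum()    # 2, matching bin(5).count('1')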
Here's a check on both approaches' performance:
df_large = pd.DataFrame({'num':np.random.randint(0,10,(10_000))})
def vect(df):
    m = df.num.to_numpy().itemsize
    return (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
%timeit vect(df_large)
# 340 µs ± 5.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_large.apply(lambda row : ones(row['num']), axis = 1)
# 103 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I suggest a different output:
df.groupby("count").agg(list)
which will give you
num
count
1 [1, 2, 4]
2 [3, 5, 6]
3 [7]
it's the same information in a slightly different format. In your original pivoted format the rows are meaningless and you have an undetermined number of columns; I'd argue it is more common to have an undetermined number of rows. I think you'll find this easier to work with going forward.
Or consider just creating a dictionary, since a DataFrame adds a lot of overhead here for no benefit:
df.groupby("count").agg(list).to_dict()["num"]
which gives you
{
1: [1, 2, 4],
2: [3, 5, 6],
3: [7],
}
Here's one approach:
df.groupby('count')['num'].agg(list).apply(pd.Series).T
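For reference, this reproduces the intended pivoted layout (the same values as the pivot_table result above):
count 1 2 3
0 1.0 3.0 7.0
1 2.0 5.0 NaN
2 4.0 6.0 NaN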

Is there a faster way than "for" to compare values in column to choose the one i want?

I have these two columns in a dataframe. I want to iterate over the rows and, if the value in column 'Entry_Point' is a str, put the Client_Num into Delivery_Point; otherwise copy Entry_Point over.
df
Client_Num Entry_Point Delivery_Point
1 0
2 a
3 3
4 4
5 b
6 c
8 d
It should look like this:
Client_Num Entry_Point Delivery_Point
1 10 10
2 a 2
3 32 32
4 14 14
5 b 5
6 c 6
8 d 8
I already tried doing a for but it takes too long, especially when I have 20k rows.
for i in range(len(df)):
    if type(df.loc[i]['Entry_Point']) == str:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
    else:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
Let's use pandas to_numeric:
df['New']=pd.to_numeric(df.Entry_Point,errors='coerce').fillna(df.Client_Num)
df
Out[22]:
Client_Num Entry_Point New
0 1 0 0.0
1 2 a 2.0
2 3 3 3.0
3 4 4 4.0
4 5 b 5.0
5 6 c 6.0
6 8 d 8.0
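To write the result into Delivery_Point with an integer dtype (a hedged completion of the above; it assumes every value is either numeric or has a Client_Num to fall back on):
df['Delivery_Point'] = (pd.to_numeric(df.Entry_Point, errors='coerce')
                        .fillna(df.Client_Num)
                        .astype(int))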
A pandas column is stored with a single dtype, so checking the Python type of individual values may not fetch the correct result. I think you want to do the following (this assumes Entry_Point was read in as strings):
df['Delivery_Point'] = df.apply(lambda x: x.Client_Num if not x.Entry_Point.strip().isnumeric() else x.Entry_Point, axis=1)
Another option that might perform even better on very large datasets is to use vectorized numpy functions:
import numpy as np
@np.vectorize
def get_if_str(client_num, entry_point):
    if isinstance(entry_point, str):
        return client_num
    return entry_point
df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
We can compare the times here:
## slow way
def generic(df):
    for i in range(len(df)):
        if type(df.loc[i]['Entry_Point']) == str:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
        else:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
%timeit generic(df)
# 237 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Milliseconds
%timeit df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
#185 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Microseconds
As you can see, there are considerable gains from using NumPy vectorized functions.
EDIT
If you pass the underlying NumPy arrays of the values, you should get even better performance from the vectorization:
df['Delivery_Point'] = get_if_str(df['Client_Num'].values, df['Entry_Point'].values)
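A hedged alternative that avoids a per-element Python function entirely: mark the non-numeric entries with to_numeric and select with np.where (this assumes non-numeric Entry_Point values are exactly the ones to replace):
mask = pd.to_numeric(df['Entry_Point'], errors='coerce').isna()
df['Delivery_Point'] = np.where(mask, df['Client_Num'], df['Entry_Point'])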

Efficient and fastest way in Pandas to create sorted list from column values

Given a dataframe
A B C
3 1 2
2 1 3
3 2 1
I would like to get a new column containing the column names in ascending order of each row's values:
A B C new_col
3 1 2 [B,C,A]
2 1 3 [B,A,C]
3 2 1 [C,B,A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns  # the columns to rank (defined here for completeness)

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)
I will appreciate a better approach to solve this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
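To see why this works: np.argsort returns, for each row, the column positions in ascending order of that row's values, and np.take maps those positions back to the column labels. For the example df:
np.argsort(df[['A', 'B', 'C']].to_numpy())
# array([[1, 2, 0],
#        [1, 0, 2],
#        [2, 1, 0]])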
Timing for a 30,000-row DataFrame:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128

Pandas Convert Dataframe Values to Labels

I am trying to use the values within the current dataframe as the index and the dataframe's index as the labels. For example:
Value1 Value2
0 0 1
1 2 4
2 NaN 3
This would result in
Labels
0 0
1 0
2 1
3 2
4 1
Currently I manage to do this using a loop that checks and applies the necessary labels/values, but with millions of labels to mark this process becomes extremely time-consuming. Is there a smarter and quicker way to do this? Thanks in advance.
Use stack with the DataFrame constructor:
s = df.stack()
df = pd.DataFrame(s.index.get_level_values(0).values,
                  columns=['Labels'],
                  index=s.values.astype(int)).sort_index()
print (df)
Labels
0 0
1 0
2 1
3 2
4 1
Detail:
print (df.stack())
0 Value1 0.0
Value2 1.0
1 Value1 2.0
Value2 4.0
2 Value2 3.0
dtype: float64
Came up with a really good one (thanks to the collective effort of the pandas community). This one should be fast.
It uses the power and flexibility of repeat and ravel to flatten your data.
s = pd.Series(df.index.repeat(2), index=df.values.ravel())
s = s[s.index.notnull()].sort_index()
s
0.0 0
1.0 0
2.0 1
3.0 2
4.0 1
dtype: int64
A subsequent conversion gives an integer index:
s.index = s.index.astype(int)
A similar (slightly faster, depending on your data) solution which also results in an integer index is to perform the filtering before converting to a Series:
v = df.index.repeat(df.shape[1])
i = df.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
s
0 0
1 0
2 1
3 2
4 1
dtype: int64
Performance
df2 = pd.concat([df] * 10000, ignore_index=True)
# jezrael's solution
%%timeit
s = df2.stack()
pd.DataFrame(s.index.get_level_values(0).values,
             columns=['Labels'],
             index=s.values.astype(int)).sort_index()
4.57 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
s = pd.Series(df2.index.repeat(2), index=df2.values.ravel())
s[s.index.notnull()].sort_index()
3.12 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
v = df2.index.repeat(df2.shape[1])
i = df2.values.ravel()
m = ~np.isnan(i)
s = pd.Series(v[m], index=i[m].astype(int)).sort_index()
3.1 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Get first letter of a string from column

I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame containing the first digit of the values in column 'First':
a) change the numbers in column 'First' to strings
b) extract the first character from each newly created string
c) save the results from b) as a new column in the data frame
I don't know how to apply this to a pandas data frame object. I would be grateful for help with that.
Cast the dtype of the column to str and you can perform vectorised slicing by calling str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
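For example (assuming new_col should be numeric again):
df['new_col'] = df['new_col'].astype(int)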
.str.get
This is the simplest way to pick out a single character with the string accessor methods.
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read:object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
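A small illustration:
pd.Series(['xyz', np.nan]).str[0]
# 0      x
# 1    NaN
# dtype: object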
For non-string (e.g., numeric) columns, an .astype(str) conversion is required beforehand, as shown in EdChum's answer.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest that a simple list comprehension will work well here and will probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are roughly 3-4x faster here.
