Intention: To group binary numbers by their Hamming weight using pandas. Here I count the number of 1s in each number's binary representation and write the count to df.
Effort so far:
import pandas as pd
def ones(num):
    return bin(num).count('1')
num = list(range(1,8))
C = pd.Index(["num"])
df = pd.DataFrame(num, columns=C)
df['count'] = df.apply(lambda row : ones(row['num']), axis = 1)
print(df)
output:
   num  count
0    1      1
1    2      1
2    3      2
3    4      1
4    5      2
5    6      2
6    7      3
Intended output:
   1  2  3
0  1  3  7
1  2  5
2  4  6
Help!
You can use pivot_table, though you'll need to define the index as the cumcount of the grouped count column; pivot_table can't figure that out all on its own :)
(df.pivot_table(index=df.groupby('count').cumcount(),
                columns='count',
                values='num'))
count    1    2    3
0      1.0  3.0  7.0
1      2.0  5.0  NaN
2      4.0  6.0  NaN
You also have the fill_value parameter, though I wouldn't recommend using it, since you'll end up with mixed types. From here it looks like NumPy would be a good option; you can easily obtain an array from the result with new_df.to_numpy().
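For example, a minimal sketch (new_df is just an assumed name for the pivoted result above):
new_df = df.pivot_table(index=df.groupby('count').cumcount(),
                        columns='count',
                        values='num')
arr = new_df.to_numpy()  # float array, with NaN where a count has fewer members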
Also, focusing on the logic in ones, we can vectorise this with (based on this answer):
import numpy as np

m = df.num.to_numpy().itemsize  # note: itemsize is in bytes, so this checks the lowest 8 bits (enough for values up to 255)
df['count'] = (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
Here's a check on both approaches' performance:
df_large = pd.DataFrame({'num':np.random.randint(0,10,(10_000))})
def vect(df):
    m = df.num.to_numpy().itemsize
    return (df.num.to_numpy()[:, None] & (1 << np.arange(m)) > 0).view('i1').sum(1)
%timeit vect(df_large)
# 340 µs ± 5.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df_large.apply(lambda row : ones(row['num']), axis = 1)
# 103 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
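For reference, the ones logic can also be vectorised with np.unpackbits. This is a sketch, not taken from the answer above, and it assumes the values are non-negative integers:
vals = df['num'].to_numpy().astype(np.uint64)
# view each 8-byte integer as 8 separate bytes, expand those to 64 bits, then count the set bits
bits = np.unpackbits(vals.view(np.uint8).reshape(len(vals), -1), axis=1)
df['count'] = bits.sum(axis=1)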
I suggest a different output:
df.groupby("count").agg(list)
which will give you
num
count
1 [1, 2, 4]
2 [3, 5, 6]
3 [7]
It's the same information in a slightly different format. In your original pivoted format the rows are meaningless and you have an undetermined number of columns; having an undetermined number of rows instead is more common. I think you'll find this format easier to work with going forward.
Or consider just creating a dictionary, as a DataFrame adds a lot of overhead here for no benefit:
df.groupby("count").agg(list).to_dict()["num"]
which gives you
{
    1: [1, 2, 4],
    2: [3, 5, 6],
    3: [7],
}
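For example, a small usage sketch (groups is just an assumed name for the resulting dict):
groups = df.groupby("count").agg(list).to_dict()["num"]
groups[2]          # [3, 5, 6]
groups.get(4, [])  # [] -- no number below 8 has four set bits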
Here's one approach, which produces the same NaN-padded wide layout as the pivot_table answer above:
df.groupby('count')['num'].agg(list).apply(pd.Series).T
I have a data frame like this.
mydf = pd.DataFrame({'a':[1,1,3,3],'b':[np.nan,2,3,6],'c':[1,3,3,9]})
a b c
0 1 NaN 1
1 1 2.0 3
2 3 3.0 3
3 3 6.0 9
I would like to have a resulting dataframe like this.
myResults = pd.concat([mydf.groupby('a').apply(lambda x: (x.b/x.c).max()), mydf.groupby('a').apply(lambda x: (x.c/x.b).max())], axis =1)
myResults.columns = ['b_c','c_b']
b_c c_b
a
1 0.666667 1.5
3 1.000000 1.5
Basically, I would like to have the max of the ratio of column b to column c (and of c to b) for each group (grouped by column a).
Is it possible to achieve this with agg?
I tried mydf.groupby('a').agg([lambda x: (x.b/x.c).max(), lambda x: (x.c/x.b).max()]). It does not work; it seems the column names b and c are not recognized.
Is there a better way to achieve this (preferably in one line) through agg or another function? In summary, I would like to apply a customized function to a grouped DataFrame, where the function needs to read multiple columns (possibly more than just the b and c columns mentioned above) from the original DataFrame.
One way of doing it
def func(x):
    C = (x['b'] / x['c']).max()
    D = (x['c'] / x['b']).max()
    return pd.Series([C, D], index=['b_c', 'c_b'])

mydf.groupby('a').apply(func).reset_index()
Output
a b_c c_b
0 1 0.666667 1.5
1 3 1.000000 1.5
Add new temporary columns to the dataframe via assign, then do your groupby and max. This method should provide significant performance benefits.
>>> (mydf
...     .assign(b_c=mydf['b'].div(mydf['c']), c_b=mydf['c'].div(mydf['b']))
...     .groupby('a')[['b_c', 'c_b']]
...     .max()
... )
b_c c_b
a
1 0.666667 1.5
3 1.000000 1.5
Timings
# Sample data.
n = 1000 # Sample data number of rows = 4 * n.
data = {
    'a': list(range(n)) * 4,
    'b': [np.nan, 2, 3, 6] * n,
    'c': [1, 3, 3, 9] * n
}
df = pd.DataFrame(data)
# Solution 1.
%timeit df.assign(b_c=df['b'].div(df['c']), c_b=df['c'].div(df['b'])).groupby('a')[['b_c', 'c_b']].max()
# 3.96 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Solution 2.
def func(x):
    C = (x['b'] / x['c']).max()
    D = (x['c'] / x['b']).max()
    return pd.Series([C, D], index=['b_c', 'c_b'])
%timeit df.groupby('a').apply(func)
# 1.09 s ± 56.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Both solutions give the same result.
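If you want to check that claim programmatically, something like this should work (a sketch; res1 and res2 are just assumed names for the two results above):
res1 = df.assign(b_c=df['b'].div(df['c']), c_b=df['c'].div(df['b'])).groupby('a')[['b_c', 'c_b']].max()
res2 = df.groupby('a').apply(func)
pd.testing.assert_frame_equal(res1, res2)  # raises if the two frames differ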
I have these two different columns in a dataframe. I want to iterate over the rows and, whenever the value in column 'Entry_Point' is a str, put the Client_Num into 'Delivery_Point'; otherwise keep the Entry_Point value.
df
Client_Num Entry_Point Delivery_Point
1 0
2 a
3 3
4 4
5 b
6 c
8 d
It should look like this:
Client_Num Entry_Point Delivery_Point
1 10 10
2 a 2
3 32 32
4 14 14
5 b 5
6 c 6
8 d 8
I already tried a for loop, but it takes too long, especially since I have 20k rows.
for i in range(len(df)):
    if type(df.loc[i]['Entry_Point']) == str:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
    else:
        df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
Let's use pandas to_numeric:
df['New']=pd.to_numeric(df.Entry_Point,errors='coerce').fillna(df.Client_Num)
df
Out[22]:
Client_Num Entry_Point New
0 1 0 0.0
1 2 a 2.0
2 3 3 3.0
3 4 4 4.0
4 5 b 5.0
5 6 c 6.0
6 8 d 8.0
A pandas column is stored with a single data type, so the type check you apply may not give the correct result. I think you want to do the following:
df['Delivery_Point'] = df.apply(lambda x: x.Client_Num if not x.Entry_Point.strip().isnumeric() else x.Entry_Point, axis=1)
Another option that might perform even better on very large datasets is to use vectorized numpy functions:
import numpy as np
@np.vectorize
def get_if_str(client_num, entry_point):
    if isinstance(entry_point, str):
        return client_num
    return entry_point
df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
We can compare the times here:
## slow way
def generic(df):
    for i in range(len(df)):
        if type(df.loc[i]['Entry_Point']) == str:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Client_Num']
        else:
            df.loc[i]['Delivery_Point'] = df.loc[i]['Entry_Point']
%timeit generic(df)
# 237 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Milliseconds
%timeit df['Delivery_Point'] = get_if_str(df['Client_Num'], df['Entry_Point'])
#185 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Microseconds
As you can see, there are considerable gains from using NumPy vectorized functions. More about them can be found here
EDIT
If you pass the underlying numpy arrays of values, you should get even better performance from the vectorization:
df['Delivery_Point'] = get_if_str(df['Client_Num'].values, df['Entry_Point'].values)
I am working with a DataFrame which looks like this
List Numb Name
1 1 one
1 2 two
2 3 three
4 4 four
3 5 five
and I am trying to compute the following output.
List Numb Name
one 1 one
one 2 two
two 3 three
four 4 four
three 5 five
In my current approach I'm trying to iterate through the columns, then replace values with the contents of a third column.
For example, if List[0][1] is equal to Numb[1][1] replace column List[0][1] with 'one'.
How could I make an iteration like this work, or alternatively solve the problem without explicitly iterating at all?
Use map
df['List'] = df['List'].map(df.set_index('Numb')['Name'])
List Numb Name
0 one 1 one
1 one 2 two
2 two 3 three
3 four 4 four
4 three 5 five
How about creating a dict to help you?
import pandas as pd
df = pd.DataFrame({'List': [1, 1, 2, 4, 3], 'Numb': [1, 2, 3, 4, 5], 'Name': ['one', 'two', 'three', 'four', 'five']})
d = dict(zip(df['Numb'], df['Name']))
df = df.replace({'List': d})
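One practical difference from the map-based answer above (a side note, not part of the original answer): replace leaves values that have no matching key unchanged, whereas map turns them into NaN. A tiny illustration:
# hypothetical example: 99 is not a key in d
pd.Series([1, 99]).replace(d)  # -> ['one', 99]
pd.Series([1, 99]).map(d)      # -> ['one', NaN]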
You can do this in one line. Looks like you want to join your dataframe onto itself:
df.rename(columns={"List": "List_numb"}).join(df.set_index("Numb")["Name"].to_frame("List"), on="List_numb")[["List", "Numb", "Name"]]
Using set_index and then reindex:
df['List'] = df.set_index('Numb')['Name'].reindex(df['List']).values
print(df)
List Numb Name
0 one 1 one
1 one 2 two
2 two 3 three
3 four 4 four
4 three 5 five
import pandas as pd
df = pd.DataFrame({
    'List': [1, 1, 2, 4, 3],
    'Numb': [1, 2, 3, 4, 5],
    'Name': ['one', 'two', 'three', 'four', 'five']
})
dfnew = pd.merge(df, df, how='inner', left_on=['List'], right_on=['Numb'])
dfnew = dfnew.rename({'List_x': 'List', 'Numb_x': 'Numb', 'Name_x': 'Name'}, axis='columns')
dfnew['List'] = dfnew['Name_y']
dfnew = dfnew[['List', 'Numb', 'Name']]
print(dfnew)
#     List  Numb   Name
# 0    one     1    one
# 1    one     2    two
# 2    two     3  three
# 3   four     4   four
# 4  three     5   five
Similar to Vaishali's answer, but building a Series explicitly seems to be a bit faster.
df['List'] = df['List'].map(pd.Series(df['Name'].values, df['Numb']))
Timings (the Numb and Name columns have unique-value dummy data and I only included the three fastest solutions so far):
>>> df
List Numb Name
0 1 1 one_0
1 1 2 two_1
2 2 3 three_2
3 4 4 four_3
4 3 5 five_4
... ... ... ...
4995 1 4996 one_4995
4996 1 4997 two_4996
4997 2 4998 three_4997
4998 4 4999 four_4998
4999 3 5000 five_4999
[5000 rows x 3 columns]
# Timings (i5-6200U CPU @ 2.30GHz, but only relative times are interesting)
>>> %timeit df.set_index('Numb')['Name'].reindex(df['List']).values # jpp
1.14 ms ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['List'].map(df.set_index('Numb')['Name']) # Vaishali
1.04 ms ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['List'].map(pd.Series(df['Name'].values, df['Numb'])) # timgeb
437 µs ± 3.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I want to apply a function to each column of a DataFrame.
Which rows to apply this to depends on some column-specific condition.
The parameter values to use also depends on the function.
Take this very simple DataFrame:
>>> df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
I want to apply a function to each column using column-specific values contained in an array, say:
>>> multiplier = np.array([0, 100, 1000]) # First column multiplied by 0, second by 100...
I also only want to multiply rows whose index are within a column-specific range, say below the values contained in the array:
>>> limit = np.array([2, 3, 4]) # Only first two elements in first column get multiplied, first three in second column...
What works is this:
>>> for i in range(limit.shape[0]):
...     df.loc[df.index < limit[i], i] = multiplier[i] * df.loc[:, i]
>>> df
0 1 2
0 0 100 2000
1 0 400 5000
2 6 700 8000
3 9 10 11000
4 12 13 14
But this approach is way too slow for the large DataFrames I'm dealing with.
Is there some way to vectorize this?
You could take advantage of the underlying numpy array.
import numpy as np

df = pd.DataFrame(data=np.arange(15).reshape(5, 3))
multiplier = np.array([0, 100, 1000])
limit = np.array([2, 3, 4])
df1 = df.values
for i in np.arange(limit.size):
    df1[:limit[i], i] = df1[:limit[i], i] * multiplier[i]
df2 = pd.DataFrame(df1)
print(df2)
0 1 2
0 0 100 2000
1 0 400 5000
2 6 700 8000
3 9 10 11000
4 12 13 14
Performance:
# Your implementation
%timeit for i in range(limit.shape[0]): df.loc[df.index<limit[i], i] = multiplier[i] * df.loc[:, i]
3.92 ms ± 120 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Numpy implementation (High Performance Gain)
%timeit for i in np.arange(limit.size): df1[:limit[i], i] = df1[:limit[i], i] * multiplier[i]
25 µs ± 1.27 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
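For completeness, the Python-level loop can be dropped entirely with broadcasting. This is a sketch, not from the answers above, and it assumes df still holds the original, unmodified values:
arr = df.to_numpy()
mask = np.arange(len(df))[:, None] < limit  # True where the row index is below that column's limit
result = pd.DataFrame(np.where(mask, arr * multiplier, arr), columns=df.columns)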
I am facing a problem here: in my Python environment I have installed numpy, but I still get this error:
'DataFrame' object has no attribute 'sort'
Can anyone give me some idea?
This is my code:
final.loc[-1] =['', 'P','Actual']
final.index = final.index + 1 # shifting index
final = final.sort()
final.columns=[final.columns,final.iloc[0]]
final = final.iloc[1:].reset_index(drop=True)
final.columns.names = (None, None)
sort() was deprecated for DataFrames in favor of either:
sort_values() to sort by column(s)
sort_index() to sort by the index
sort() was deprecated (but still available) in Pandas with release 0.17 (2015-10-09) with the introduction of sort_values() and sort_index(). It was removed from Pandas with release 0.20 (2017-05-05).
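Applied to the snippet in the question, the minimal fix is a sketch like this (assuming the intent of final.sort() was to sort by the shifted index, which is what the old argument-less sort() did):
final = final.sort_index()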
Pandas Sorting 101
sort was removed in v0.20, replaced by DataFrame.sort_values and DataFrame.sort_index. Aside from these, we also have argsort.
Here are some common use cases in sorting, and how to solve them using the sorting functions in the current API. First, the setup.
# Setup
np.random.seed(0)
df = pd.DataFrame({'A': list('accab'), 'B': np.random.choice(10, 5)})
df
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
Sort by Single Column
For example, to sort df by column "A", use sort_values with a single column name:
df.sort_values(by='A')
A B
0 a 7
3 a 5
4 b 2
1 c 9
2 c 3
If you need a fresh RangeIndex, use DataFrame.reset_index.
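For instance, a minimal sketch:
df.sort_values(by='A').reset_index(drop=True)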
Sort by Multiple Columns
For example, to sort by both col "A" and "B" in df, you can pass a list to sort_values:
df.sort_values(by=['A', 'B'])
A B
3 a 5
0 a 7
4 b 2
2 c 3
1 c 9
Sort By DataFrame Index
df2 = df.sample(frac=1)
df2
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2
You can do this using sort_index:
df2.sort_index()
A B
0 a 7
1 c 9
2 c 3
3 a 5
4 b 2
df.equals(df2)
# False
df.equals(df2.sort_index())
# True
Here are some comparable methods with their performance:
%timeit df2.sort_index()
%timeit df2.iloc[df2.index.argsort()]
%timeit df2.reindex(np.sort(df2.index))
605 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
610 µs ± 24.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
581 µs ± 7.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Sort by List of Indices
For example,
idx = df2.index.argsort()
idx
# array([1, 0, 2, 3, 4])
This "sorting" problem is actually a simple indexing problem. Just passing integer labels to iloc will do.
df.iloc[idx]
A B
1 c 9
0 a 7
2 c 3
3 a 5
4 b 2