Is it okay to use lambda in this case? - python

I'm trying to figure out a way to loop over a pandas DataFrame to generate a new key.
Here's an example of the dataframe:
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
for index, row in df.iterrows():
    df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
And now I want to create a new key that takes the first and last numbers of df["range"] and fills the list with the numbers in between (i.e., the first one will be [1 2 3 4 5 6 7 8 9 10]).
So far I think that I should be using something like this, but I could be completely wrong:
df["total"] = df["range"].map(lambda x: #and here I should append all the "x" that are between df["range"][0] and df["range"][1]
Here's an example of the result that I'm looking for:
  pdb  beg  end  range  total
0   a    1   10   1 10  1 2 3 4 5 6 7 8 9 10
1   b    2   11   2 11  2 3 4 5 6 7 8 9 10 11
I could use some help with the lambda function; I get really confused by the syntax.

Try with apply
df['new'] = df.apply(lambda x : list(range(x['beg'],x['end']+1)),axis=1)
Out[423]:
0 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1 [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
dtype: object

This should work:
df['total'] = df['range'].apply(lambda x: [n for n in range(x[0], x[1]+1)])
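
For what it's worth, the lambda is not strictly required here. A minimal sketch, assuming the same df as in the question, builds the lists with a plain comprehension over zip instead of apply:

```python
import pandas as pd

df = pd.DataFrame({"pdb": ["a", "b"], "beg": [1, 2], "end": [10, 11]})

# range() stops before its second argument, so add 1 to include "end"
df["total"] = [list(range(b, e + 1)) for b, e in zip(df["beg"], df["end"])]
print(df["total"].tolist())
```

On larger frames this usually avoids some of the per-row overhead of apply with axis=1, though that is worth measuring on your own data.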

As per your output, you need
In [18]: df['new'] = df.apply(lambda x : " ".join(list(map(str,range(x['beg'],x['end']+1)))),axis=1)
In [19]: df
Out[19]:
  pdb  beg  end    range                     new
0   a    1   10  [1, 10]    1 2 3 4 5 6 7 8 9 10
1   b    2   11  [2, 11]   2 3 4 5 6 7 8 9 10 11

If you want to use iterrows, you can build each list inside the loop itself. Note that range() excludes its stop value, so 1 is added to include end:
Code :
import pandas as pd
df = pd.DataFrame({"pdb" : ["a", "b"], "beg": [1, 2], "end" : [10, 11]})
df['range'] = [list(x) for x in zip(df['beg'], df['end'])]
totals = []
for index, row in df.iterrows():
    # range() stops before its second argument, hence the + 1
    totals.append(list(range(row['beg'], row['end'] + 1)))
df['total'] = totals
Output:
  pdb  beg  end    range                             total
0   a    1   10  [1, 10]   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
1   b    2   11  [2, 11]  [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Related

How to find column numbers in increasing order

I have a pandas dataframe, with a column containing item numbers that are supposed to increase by 1, each row.
df1 = pd.DataFrame({
    "item_number" : [1, 2, 3, 4, 5, 6, 8, 10],
    "col_A" : ['aaa','bbb','ccc','ddd','eee','fff','hhh', 'jjj']})
df1
   item_number col_A
0            1   aaa
1            2   bbb
2            3   ccc
3            4   ddd
4            5   eee
5            6   fff
6            8   hhh
7           10   jjj
As you can see, the item number increases by two between 6 and 8 and between 8 and 10. Is there a way to write a function that will return a list of the skipped numbers, i.e. ['7','9'], and otherwise return True?
s=pd.Series(range(df1['item_number'].min(), (df1['item_number'].max()+1)))
s[~s.isin(df1['item_number'])].values
array([7, 9], dtype=int64)
one-liner:
set(range(df1.item_number.min(), df1.item_number.max()+1)) - set(df1.item_number) or True
You can take advantage of Python's set and list operations to check whether the input list has gaps:
li = [1, 2, 3, 4, 5, 6, 8, 10]
def fun(l):
    # Numbers in the full range that are absent from the list
    a = list(set(list(range(l[0], l[-1]+1))) - set(l))
    if a == []:
        return True
    else:
        return a
print(fun(li))
Output:
[9, 7]
Also, you can use return sorted(a) if you want the list elements to be returned in order.
Use range with np.setdiff1d:
In [1518]: import numpy as np
In [1519]: rng = range(df1.item_number.min(), df1.item_number.max() + 1)
In [1523]: res = np.setdiff1d(rng, df1.item_number)
In [1524]: res
Out[1524]: array([7, 9])
This will do it:
def foo(df):
    x = df.set_index('item_number').reindex(range(df.item_number.min(), df.item_number.max() + 1))
    x = list(x.index[x.col_A.isna()])
    return x if x else True
Examples:
y = foo(df1)
print(y)
y = foo(df1.loc[range(1, 6)])
print(y)
Output:
[7, 9]
True
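
The answers above share one idea: compare the numbers present against the full range from the minimum to the maximum. A minimal sketch of that idea as a single helper follows; the name missing_items is hypothetical and not taken from any of the answers:

```python
import pandas as pd

def missing_items(numbers):
    """Return the sorted gaps between min and max, or True if there are none."""
    present = {int(n) for n in numbers}          # plain ints, duplicates removed
    full = range(min(present), max(present) + 1)
    gaps = [n for n in full if n not in present]  # range is ordered, so gaps are too
    return gaps if gaps else True

df1 = pd.DataFrame({"item_number": [1, 2, 3, 4, 5, 6, 8, 10]})
print(missing_items(df1["item_number"]))
print(missing_items([1, 2, 3]))
```

Unlike the plain set difference, this keeps the result in ascending order without a separate sort.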

Automatically Rename Columns in Pandas using string and number

I have the current DataFrame:
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
I am trying to automatically rename the columns in pairs so that they become lo1, up1, lo2, up2. This is just an example, but I was hoping for a way to extend this to an entire DataFrame with many columns.
Thanks.
An efficient approach is to use itertools.product:
from itertools import product
cols = [f'{a}{b+1}' for b,a in product(range(len(df.columns)//2), ['lo','up'])]
df.columns = cols
Output:
   lo1  up1  lo2  up2
0   -1    2    1    3
1    4    6    7    8
2   -2   10   11   13
3    5    6    8    9
Here's one approach:
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
max_nr = int(len(df.columns)/2)
for i in range(max_nr):
    df.rename(columns={'{}'.format(i*2):'lo{}'.format(i), '{}'.format(i*2+1):'hi{}'.format(i)}, inplace=True)
output:
   lo0  hi0  lo1  hi1
0   -1    2    1    3
1    4    6    7    8
2   -2   10   11   13
3    5    6    8    9
Although it is longer than the concise solutions posted above, another option is to define a function named col_namer, where col_size is the number of columns in the DataFrame and names_list is the list of strings used to name the columns:
def col_namer(col_size, names_list):
    rep = len(names_list)
    col_names = []
    mul = col_size // rep
    rem = col_size % rep
    if rem == 0:
        for i in range(1, mul + 1):
            for j in names_list:
                col_names.append(j + str(i))
        return col_names
    else:
        for i in range(1, mul + 1):
            for j in names_list:
                col_names.append(j + str(i))
        # Fill the leftover columns with the next group number
        for i in range(mul + 1, mul + rem + 2):
            for j in names_list:
                if len(col_names) < col_size:
                    col_names.append(j + str(i))
        return col_names
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]], columns=['0', '1', '2', '3'])
df.columns = col_namer(df.shape[1], ['lo', 'up'])
df
Out:
   lo1  up1  lo2  up2
0   -1    2    1    3
1    4    6    7    8
2   -2   10   11   13
3    5    6    8    9
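
The same naming scheme can also be sketched with integer division and modulo on the column position. This is only an illustration; the name cycle_names is hypothetical, and it assumes the columns should cycle through names_list while the group number starts at 1:

```python
def cycle_names(col_size, names_list):
    """Cycle through names_list, increasing the suffix after each full cycle."""
    rep = len(names_list)
    # i % rep picks the prefix, i // rep + 1 picks the group number
    return [f"{names_list[i % rep]}{i // rep + 1}" for i in range(col_size)]

print(cycle_names(4, ["lo", "up"]))
print(cycle_names(5, ["lo", "up"]))
```

It also covers a column count that is not a multiple of len(names_list), since the last group is simply cut short.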

ValueError: Wrong number of items passed 5, placement implies 1, error while finding the second max value in a row

I am trying to obtain the second maximum value for each row in a data frame, but I am getting a ValueError:
column = [col for col in dataframe.columns if '%' in col]
dataframe["Max_2nd"] = dataframe[column].apply(lambda row: row.nlargest(2).values[-1],axis=1)
How can I resolve this?
So, given the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
    (
        {
            "col1": [1, 2, 3, 4, 5, 6, 7, 8],
            "col2": [3, 0, 1, 3, 0, 1, 8, 5],
            "col3": [7, 9, 2, 6, 7, 8, 0, 1],
            "col4": [0, 4, 5, 0, 4, 3, 4, 0],
        }
    )
)
You can find the second max value in each row like this:
df["Max_2nd"] = df.apply(lambda x: sorted(x, reverse=True)[1], axis=1)
print(df)
# Outputs
   col1  col2  col3  col4  Max_2nd
0     1     3     7     0        3
1     2     0     9     4        4
2     3     1     2     5        3
3     4     3     6     0        4
4     5     0     7     4        5
5     6     1     8     3        6
6     7     8     0     4        7
7     8     5     1     0        5
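
If the selected columns are all numeric, a vectorized variant avoids the row-wise apply altogether. The sketch below assumes the toy frame above and an explicit column list (cols is a name introduced here); it sorts each row with NumPy and takes the second-to-last entry:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6, 7, 8],
        "col2": [3, 0, 1, 3, 0, 1, 8, 5],
        "col3": [7, 9, 2, 6, 7, 8, 0, 1],
        "col4": [0, 4, 5, 0, 4, 3, 4, 0],
    }
)

cols = ["col1", "col2", "col3", "col4"]  # select explicitly so Max_2nd is never included
# Sort every row in ascending order, then pick the second-largest value
df["Max_2nd"] = np.sort(df[cols].to_numpy(), axis=1)[:, -2]
print(df)
```

As with sorted(x, reverse=True)[1], a tied maximum counts twice, so a row like 9, 9, 1 yields 9.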

non fixed rolling window

I am looking to implement a rolling window on a list, but instead of a fixed length of window, I would like to provide a rolling window list:
Something like this:
l1 = [5, 3, 8, 2, 10, 12, 13, 15, 22, 28]
l2 = [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]
get_custom_roling( l1, l2, np.average)
and the result would be:
[5, 4, 5.5, 5, 6.67, ....]
6.67 is calculated as average of 3 elements 10, 2, 8.
I implemented a slow solution, and every idea is welcome to make it quicker :):
import numpy as np
import pandas as pd
def get_the_list(end_point, number_points):
    """
    Return number_points indexes ending at end_point, newest first.
    example: get_the_list(6, 3) ==> [6, 5, 4]
    example: get_the_list(9, 5) ==> [9, 8, 7, 6, 5]
    """
    if np.isnan(number_points):
        return []
    number_points = int(number_points)
    return list(range(end_point, end_point - number_points, -1))
def get_idx(s):
    # Pair each position with its window length and expand it into indexes
    ss = list(enumerate(s))
    sss = (get_the_list(*elem) for elem in ss)
    return sss
def get_custom_roling(s, ss, funct):
    # s must be a pandas Series so that list-based indexing and s.index work
    output_get_idx = get_idx(ss)
    agg_stuff = [s[elem] for elem in output_get_idx]
    res_agg_stuff = [funct(elem) for elem in agg_stuff]
    res_agg_stuff = pd.Series(data=res_agg_stuff, index=s.index)
    return res_agg_stuff
pandas' custom window rolling lets you vary the size of the window for each row.
Simple explanation: the start and end arrays hold the index values used to slice your data.
#start = [0 0 1 2 2 2 5 5 4 7]
#end = [1 2 3 4 5 6 7 8 9 10]
Arguments passed to get_window_bounds are given by BaseIndexer.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple
class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None,
                          step: Optional[int] = None  # also passed by pandas 1.5+
                          ) -> Tuple[np.ndarray, np.ndarray]:
        # Each window ends just after the current row and reaches back l2 rows
        end = np.arange(1, num_values+1, dtype=np.int64)
        start = end - np.array(self.custom_name_whatever, dtype=np.int64)
        return start, end
df = pd.DataFrame({"l1": [5, 3, 8, 2, 10, 12, 13, 15, 22, 28],
                   "l2": [1, 2, 2, 2, 3, 4, 2, 3, 5, 3]})
indexer = CustomIndexer(custom_name_whatever=df.l2)
df["variable_mean"] = df.l1.rolling(indexer).mean()
print(df)
Outputs:
   l1  l2  variable_mean
0   5   1       5.000000
1   3   2       4.000000
2   8   2       5.500000
3   2   2       5.000000
4  10   3       6.666667
5  12   4       8.000000
6  13   2      12.500000
7  15   3      13.333333
8  22   5      14.400000
9  28   3      21.666667
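
For the mean specifically, the same start and end bounds can be combined with a prefix sum, which avoids slicing each window separately. This is a sketch in plain NumPy rather than part of the answer above; it assumes every window length is at least 1, and it clips windows that would reach before the first element:

```python
import numpy as np

l1 = np.array([5, 3, 8, 2, 10, 12, 13, 15, 22, 28], dtype=float)
l2 = np.array([1, 2, 2, 2, 3, 4, 2, 3, 5, 3])

end = np.arange(1, len(l1) + 1)          # exclusive end of each window
start = np.clip(end - l2, 0, None)       # inclusive start, never below 0

# prefix[k] holds the sum of the first k elements of l1
prefix = np.concatenate(([0.0], np.cumsum(l1)))
variable_mean = (prefix[end] - prefix[start]) / (end - start)
print(variable_mean)
```

The trick only works for aggregations that can be expressed through running totals (sum, mean); an arbitrary function such as a median still needs the indexer or an explicit loop.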

delete elements from a data frame w.r.t columns of another data frame

I have a data frame say df1 with MULTILEVEL INDEX:
     A  B   C   D
0 0  0  1   2   3
     4  5   6   7
1 2  8  9  10  11
  3  2  3   4   5
and I have another data frame with 2 common columns in df2 also with MULTILEVEL INDEX
     X  B  C   Y
0 0  0  0  7   3
  1  4  5  6   7
1 2  8  2  3  11
  3  2  3  4   5
I need to remove the rows from df1 where the values of columns B and C are the same as in df2, so I should be getting something like this:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
I have tried to do this by getting the index of the common elements and then remove them via a list, but they are all messed up and are in multi-level form.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or like this (I find it to be the simplest way). Note that the element-wise comparison requires df1 and df2 to have identically labeled indexes:
df1 = df1.iloc[np.where(np.logical_or(df1['B']!=df2['B'],df1['C']!=df2['C']))]
of course don't forget to:
import numpy as np
output:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
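
The same filtering can also be phrased as an anti-join. The sketch below is an alternative to the MultiIndex approach, not a replacement for it: it assumes the example frames above and uses merge with indicator=True to flag the (B, C) pairs of df1 that have no match in df2. The names keys, merged and keep are introduced here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(
    {'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 0, 2, 3]]))
df2 = pd.DataFrame(
    {'X': [0, 4, 8, 2], 'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'Y': [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 2, 3]]))

# De-duplicate the right side so the left merge keeps one row per df1 row
keys = df2[['B', 'C']].drop_duplicates()
merged = df1[['B', 'C']].merge(keys, on=['B', 'C'], how='left', indicator=True)

# A left merge preserves the row order of df1, so the mask lines up by position
keep = (merged['_merge'] == 'left_only').to_numpy()
result = df1[keep]
print(result)
```

Because the mask is applied by position, the duplicated (0, 0) label in the index of df1 does not cause any ambiguity.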
