Use a dictionary to key a range of values [duplicate] - python

I have a pandas dataframe and I want to create categories in a new column based on the values of another column. I can solve my basic problem by doing this:
ranges = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
    range(11, 1000): 'above',
}
df['range'] = df['value'].map(ranges)
For the final dictionary key I have chosen a large upper value to ensure the range captures all the values I am trying to map. However, this seems like an ugly hack, and I am wondering how to generalise it without specifying the upper limit, i.e. if > 10: 'above'.
Thanks

Assume you have a dataframe like this:
range value
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
Then you can apply the following function to the column 'range':
def get_value(val):
    if val < 5:
        return 'Below 5'
    elif val < 10:
        return 'Between 5 and 10'
    else:
        return 'Above 10'

df['value'] = df.apply(lambda row: get_value(row['range']), axis=1)
This gives the result you want.
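Since only a single column feeds the function, an equivalent (and usually faster) variant applies it to the Series directly:
# same result, without the row-wise apply
df['value'] = df['range'].apply(get_value)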

You could set all values first to 'above', and then fill in the mapped values (so your ranges dict needs only two items). Note that simply reassigning the column with map() would replace the default with NaN for unmatched values, so only write back where the map succeeded:
ranges = {
    range(0, 5): 'Below 5',
    range(6, 10): 'between',
}
df['range'] = 'above'
mapped = df['value'].map(ranges)
df.loc[mapped.notna(), 'range'] = mapped

Thanks for the hints. I see I can achieve the same with:
df['range'] = df['value'].map(ranges).fillna('above')

pandas.Series.map also accepts a function as its first argument, so you could do:
def fun(x):
    if x in range(0, 5):
        return 'Below 5'
    elif x in range(6, 10):
        return 'between'
    elif x >= 11:
        return 'above'
then:
df['range'] = df['value'].map(fun)
Note that 5 and 10 match none of the branches (mirroring the gaps in your original ranges) and will map to NaN.

Here's another approach using numpy.select, where you specify a list of boolean conditions, and a list of choices:
import numpy as np
import pandas as pd

# Setup
df = pd.DataFrame({'value': [1, 3, 6, 8, 20, 10000000]})

condlist = [
    df['value'].lt(5),
    df['value'].between(5, 10),
    df['value'].gt(10)]
choicelist = ['Below 5', 'between', 'above']

df['out'] = np.select(condlist, choicelist)
print(df)
[out]
value out
0 1 Below 5
1 3 Below 5
2 6 between
3 8 between
4 20 above
5 10000000 above
Another idea would be to use pandas.cut with bins and labels parameters specified:
df['out'] = pd.cut(df['value'], bins=[-np.inf, 5, 10, np.inf],
                   labels=['below', 'between', 'above'])
value out
0 1 below
1 3 below
2 6 between
3 8 between
4 20 above
5 10000000 above

df['range'] = pd.cut(df['value'], bins=[0, 5, 10, 1000], labels=["below 5", "between", "above"])

Related

Labeling whether the numbers in a dataframe are going up first or down first

Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know if the data in column B is trending down or trending up compared to the data at [i, 'A'].
Here is a loop:
import pandas as pd

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label": [0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i + 5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # label 2 means trending down
        j = j + 1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using pandas methods. The task can be accomplished with vectorized operations:
- the rolling method, which does computations in a rolling window
- the min & max methods, which we compute over the rolling window
- numpy.select, which lets us set values based upon a list of conditions
Code
import numpy as np

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find max/min values in a window from the current point.
    A rolling window normally looks at backward values; we use the technique from
    https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values.
    '''
    # To have a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]
    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]
    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in non-reversed row order
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]
    # Use np.select to implement the label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,   # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,  # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)
    # Drop scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0

How can I evenly split up a pandas.DataFrame into n-groups? [duplicate]

I need to perform n-fold (in my particular case, a 5-fold) cross validation on a dataset that I've stored in a pandas.DataFrame. My current way seems to rearrange the row labels:
spreadsheet1 = pd.ExcelFile("Testing dataset.xlsx")
dataset = spreadsheet1.parse('Sheet1')

data = 5 * [pd.DataFrame()]
i = 0
while i < len(dataset):
    j = 0
    while j < 5 and i < len(dataset):
        data[j] = data[j].append(dataset.iloc[i]).reset_index(drop=True)
        i += 1
        j += 1
How can I split my DataFrame efficiently/intelligently without tampering with the order of the columns?
Use np.array_split to break it up into a list of "evenly" sized DataFrames. You can shuffle too if you sample the full DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(24).reshape(-1,2), columns=['A', 'B'])
N = 5
np.array_split(df, N)
#np.array_split(df.sample(frac=1), N) # Shuffle and split
[ A B
0 0 1
1 2 3
2 4 5,
A B
3 6 7
4 8 9
5 10 11,
A B
6 12 13
7 14 15,
A B
8 16 17
9 18 19,
A B
10 20 21
11 22 23]
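Since the goal is the 5-fold cross validation mentioned in the question, the chunks can serve as folds directly. A minimal sketch (the shuffle via sample is optional):
chunks = np.array_split(df.sample(frac=1, random_state=0), N)
for k in range(N):
    test = chunks[k]                                # fold k is held out
    train = pd.concat(chunks[:k] + chunks[k + 1:])  # remaining folds form the training set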
I am still not sure why you want to do it this way, but here is a solution that assigns each row to a random fold from 1 to 5:
import numpy as np
df['fold'] = np.random.randint(1, 6, df.shape[0])
For example, your first fold is
df.loc[df['fold'] == 1]
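Note that random assignment does not guarantee evenly sized folds. If scikit-learn is available (an assumption; the question does not mention it), its KFold handles the split and the index bookkeeping for you:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(df):
    # positional indices, so iloc is the right accessor
    train, test = df.iloc[train_idx], df.iloc[test_idx]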

How to get location of nearest number in DataFrame

I have the following DataFrame:
0
0 5
1 10
2 15
3 20
I want to get the location of the value nearest to n. For example: if n=7, the nearest number is 5, and it should then return the location of 5, i.e. [0][0].
Use Series.abs and Series.idxmin:
# Setup
df = pd.DataFrame({0: {0: 5, 1: 10, 2: 15, 3: 20}})
n = 7
(n - df[0]).abs().idxmin()
[out]
0
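If the frame had several columns and you wanted both the row and the column label (the [0][0] from the question), a variant that should work for numeric frames stacks first:
# stack() turns the frame into a Series indexed by (row, column),
# so idxmin() returns both labels at once
row, col = (df - n).abs().stack().idxmin()
print(row, col)
# 0 0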
Use numpy.argmin to get the index of the closest number:
import numpy as np

df[1] = 7                # the target value n
df[2] = df[1] - df[0]    # signed distance from n
df[2] = df[2].abs()      # absolute distance
print(np.argmin(df[2]))  # position of the smallest distance

Get index of a row of a pandas dataframe as an integer

Assume an easy dataframe, for example
A B
0 1 0.810743
1 2 0.595866
2 3 0.154888
3 4 0.472721
4 5 0.894525
5 6 0.978174
6 7 0.859449
7 8 0.541247
8 9 0.232302
9 10 0.276566
How can I retrieve an index value of a row, given a condition?
For example:
dfb = df[df['A']==5].index.values.astype(int)
returns [4], but what I would like to get is just 4. This is causing me trouble later in the code.
Based on some conditions, I want to have a record of the indexes where that condition is fulfilled, and then select rows between.
I tried
dfb = df[df['A']==5].index.values.astype(int)
dfbb = df[df['A']==8].index.values.astype(int)
df.loc[dfb:dfbb,'B']
for a desired output
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
but I get TypeError: '[4]' is an invalid key
The easiest way is to add [0] to select the first value of the one-element array:
dfb = df[df['A']==5].index.values.astype(int)[0]
dfbb = df[df['A']==8].index.values.astype(int)[0]
Or:
dfb = int(df[df['A']==5].index[0])
dfbb = int(df[df['A']==8].index[0])
But if it's possible that some values don't match, an error is raised, because the first value does not exist. The solution is to use next with iter to get a default value if nothing matches:
dfb = next(iter(df[df['A']==5].index), 'no match')
print (dfb)
4
dfb = next(iter(df[df['A']==50].index), 'no match')
print (dfb)
no match
Then it seems you need to subtract 1:
print (df.loc[dfb:dfbb-1,'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
Another solution with boolean indexing or query:
print (df[(df['A'] >= 5) & (df['A'] < 8)])
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
print (df.loc[(df['A'] >= 5) & (df['A'] < 8), 'B'])
4 0.894525
5 0.978174
6 0.859449
Name: B, dtype: float64
print (df.query('A >= 5 and A < 8'))
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
To answer the original question on how to get the index as an integer for the desired selection, the following will work:
df[df['A']==5].index.item()
A little summary for searching by row: this can be useful if you don't know the column values, or if the columns have non-numeric values.
If you want to get the index number as an integer, you can also do:
item = df[4:5].index.item()
print(item)
4
It also works via numpy or a list:
numpy = df[4:7].index.to_numpy()[0]
lista = df[4:7].index.to_list()[0]
In [x] you pick the position within the [4:7] range; for example, if you want 6:
numpy = df[4:7].index.to_numpy()[2]
print(numpy)
6
For a DataFrame:
df[4:7]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
or:
df[(df.index>=4) & (df.index<7)]
A B
4 5 0.894525
5 6 0.978174
6 7 0.859449
The nature of wanting to include the row where A == 5 and all rows up to but not including the row where A == 8 means we will end up using iloc (loc includes both ends of a slice).
In order to get the index labels we use idxmax. This returns the first position of the maximum value; I run this on a boolean series where A == 5 (then where A == 8), which returns the index value where A == 5 first happens (same thing for A == 8).
Then I use searchsorted to find the ordinal position of where the index label (that I found above) occurs. This is what I use in iloc.
i5, i8 = df.index.searchsorted([df.A.eq(5).idxmax(), df.A.eq(8).idxmax()])
df.iloc[i5:i8]
numpy
You can further enhance this by using the underlying numpy objects and the analogous numpy functions. I wrapped it up into a handy function.
def find_between(df, col, v1, v2):
    vals = df[col].values
    mx1, mx2 = (vals == v1).argmax(), (vals == v2).argmax()
    idx = df.index.values
    i1, i2 = idx.searchsorted([mx1, mx2])
    return df.iloc[i1:i2]

find_between(df, 'A', 5, 8)
timing
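Here is a minimal timeit harness to compare the two approaches yourself (a sketch, not a definitive benchmark; absolute numbers depend on your data and machine):
import timeit

def pandas_way():
    # the idxmax + searchsorted approach from above
    i5, i8 = df.index.searchsorted([df.A.eq(5).idxmax(), df.A.eq(8).idxmax()])
    return df.iloc[i5:i8]

# both callables assume the df and find_between defined above
t1 = timeit.timeit(pandas_way, number=1000)
t2 = timeit.timeit(lambda: find_between(df, 'A', 5, 8), number=1000)
print(f'pandas: {t1:.4f}s, numpy: {t2:.4f}s')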
Or you can add a for loop:
for i in dfb:
    dfb = i
for j in dfbb:
    dfbb = j
This way the element 4 is taken out of the list.

python pandas isin method?

I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list called 'stopword' if the value is more than 5 and the key is also not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5, and uses isin with the inverted boolean mask ~ to keep only words not present in the other df.
You can then now get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
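For completeness, a plain-Python sketch that skips building a DataFrame from the dict entirely (assuming, as above, that df1 holds the words to exclude):
exclude = set(df1['word'])  # set lookup is O(1) per key
stopword = [k for k, v in wordfreq.items() if v > 5 and k not in exclude]
# ['techsmart', 'jobvark', 'midgley', 'walter'] (order follows dict insertion)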
