Subtract from each cell in pandas dataframe based on value - python

I have a df like this (it's a DataFrame and all values are floats):
import numpy as np
import pandas as pd

data = np.random.randint(3000, size=(10, 1))
data = pd.DataFrame(data)
For each value, if it's between 570 and 1140, I want to subtract 570.
If it's over 1140, I want to subtract 1140 from the value. I wrote this function to do that.
def AdjustTimes(val):
    if val > 570 and val < 1140:
        val = val - 570
    elif val > 1140:
        val = val - 1140
Based on another question I tried to apply it using data.applymap(AdjustTimes). I got no error but the function does not seem to have been applied.

Setup
data

      0
0  1863
1  2490
2  2650
3  2321
4   822
5    82
6  2192
7   722
8  2537
9   874
First, let's create masks for each of your conditions. One idiomatic approach is using between to build a mask for the first condition -
m1 = data.loc[:, 0].between(570, 1140)  # inclusive on both ends by default
Or, you can do this with a couple of logical operators -
m1 = data.loc[:, 0].ge(570) & data.loc[:, 0].le(1140)
And,
m2 = data.loc[:, 0].gt(1140)
Now, to perform replacement, you have a couple of options.
Option 1
Use loc to index and subtract -
data.loc[m1, 0] -= 570
data.loc[m2, 0] -= 1140
data
      0
0   723
1  1350
2  1510
3  1181
4   252
5    82
6  1052
7   152
8  1397
9   304
Equivalent version for a pd.Series -
m1 = data.ge(570) & data.le(1140)
m2 = data.gt(1140)
data.loc[m1] -= 570
data.loc[m2] -= 1140
Option 2
You can also do this with np.where (but it'd be a bit less efficient).
v = data.loc[:, 0]
data.loc[:, 0] = np.where(m1, v - 570, np.where(m2, v - 1140, v))
Here, m1 and m2 are the masks computed from before.
data
      0
0   723
1  1350
2  1510
3  1181
4   252
5    82
6  1052
7   152
8  1397
9   304
Equivalent pd.Series code -
data[:] = np.where(m1, data - 570, np.where(m2, data - 1140, data))
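For completeness, here's a minimal, self-contained sketch of Option 1 on a reproducible frame (the seed and values are only for illustration):
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed purely so the example can be rerun
data = pd.DataFrame(np.random.randint(3000, size=(10, 1)))

m1 = data.loc[:, 0].between(570, 1140)  # 570 <= value <= 1140
m2 = data.loc[:, 0].gt(1140)            # value > 1140

data.loc[m1, 0] -= 570
data.loc[m2, 0] -= 1140
print(data)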

Could you try something like:
data=np.random.randint(3000,size=(10,1))
data=pd.DataFrame(data)
data = data - 570*((data > 570) & (data < 1140)) - 1140*(data > 1140)
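This works because the boolean masks behave as 0/1 in arithmetic, so each value has exactly one offset (or none) subtracted. A small sketch on hand-picked values:
import pandas as pd

s = pd.Series([82, 822, 1863])
print((s - 570*((s > 570) & (s < 1140)) - 1140*(s > 1140)).tolist())   # [82, 252, 723]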

The applymap method is designed to generate a new dataframe, not to modify an existing one (and the function it calls should return a value for the new cell rather than modifying its argument). You don't show the line where you actually use applymap, but I suspect it's just data.applymap(AdjustTimes) on its own. If you change your code to the following it should work fine:
def AdjustTimes(val):
    if val >= 1140:
        return val - 1140
    elif val >= 570:
        return val - 570
    return val  # values below 570 are left unchanged

data = data.applymap(AdjustTimes)
(I've also cleaned up the if statements to be a little faster, handled the case where val = 1140, which your original code wouldn't adjust, and returned values below 570 unchanged so they don't become None.)
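A quick check on hand-picked values (note that on newer pandas versions, DataFrame.map is the preferred name for the same element-wise operation):
import pandas as pd

small = pd.DataFrame([[400], [600], [1140], [2000]])
print(small.applymap(AdjustTimes))
# 0: 400 -> 400 (below 570, unchanged)
# 1: 600 -> 30
# 2: 1140 -> 0
# 3: 2000 -> 860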

Related

Pandas cumsum + cumcount on multiple columns

Aloha,
I have the following DataFrame
import numpy as np
import pandas as pd

stores = [1,2,3,4,5]
weeks = [1,1,1,1,1]
df = pd.DataFrame({'Stores': stores,
                   'Weeks': weeks})
df = pd.concat([df]*53)
df['Weeks'] = df['Weeks'].add(df.groupby('Stores').cumcount())
df['Target'] = np.random.randint(400,600,size=len(df))
df['Actual'] = np.random.randint(350,800,size=len(df))
df['Variance %'] = (df['Target'] - df['Actual']) / df['Target']
df.loc[df['Variance %'] >= 0.01, 'Status'] = 'underTarget'
df.loc[df['Variance %'] <= 0.01, 'Status'] = 'overTarget'
df['Status'] = df['Status'].fillna('atTarget')
df.sort_values(['Stores','Weeks'],inplace=True)
this gives me the following
print(df.head())
Stores Weeks Target Actual Variance % Status
0 1 1 430 605 -0.406977 overTarget
0 1 2 549 701 -0.276867 overTarget
0 1 3 471 509 -0.080679 overTarget
0 1 4 549 378 0.311475 underTarget
0 1 5 569 708 -0.244288 overTarget
0 1 6 574 650 -0.132404 overTarget
0 1 7 466 623 -0.336910 overTarget
Now what I'm trying to do is a cumulative count of each store's consecutive weeks either over or under target, resetting when the status changes.
I thought this would be the best way to do this (and many variants of this) but this does not reset the counter.
s = df.groupby(['Stores','Weeks','Status'])['Status'].shift().ne(df['Status'])
df['Count'] = s.groupby(df['Stores']).cumsum()
my logic was to group by my relevant columns, and do a != shift to reset the cumsum
Naturally I've scoured lots of different questions but I can't seem to figure this out. Would anyone be so kind to explain to me what would be the best method to tackle this problem?
I hope everything here is clear and reproducible. Please let me know if you need any additional information.
Expected Output
Stores Weeks Target Actual Variance % Status Count
0 1 1 430 605 -0.406977 overTarget 1
0 1 2 549 701 -0.276867 overTarget 2
0 1 3 471 509 -0.080679 overTarget 3
0 1 4 549 378 0.311475 underTarget 1 # Reset here as status changes
0 1 5 569 708 -0.244288 overTarget 1 # Reset again.
0 1 6 574 650 -0.132404 overTarget 2
0 1 7 466 623 -0.336910 overTarget 3
Try pd.Series.groupby() after creating the key with cumsum:
s=df.groupby('Stores')['Status'].apply(lambda x : x.ne(x.shift()).ne(0).cumsum())
df['Count']=df.groupby([df.Stores,s]).cumcount()+1
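To see why the key resets the counter, here is a hedged sketch on a tiny hand-made frame that mirrors the expected output above; it writes the same idea with a groupby shift instead of apply, which is equivalent here:
import pandas as pd

tiny = pd.DataFrame({
    'Stores': [1, 1, 1, 1, 1, 1, 1],
    'Status': ['overTarget', 'overTarget', 'overTarget', 'underTarget',
               'overTarget', 'overTarget', 'overTarget'],
})
# True wherever Status differs from the previous row of the same store
changed = tiny['Status'].ne(tiny.groupby('Stores')['Status'].shift())
key = changed.cumsum()                        # run id: 1,1,1,2,3,3,3
tiny['Count'] = tiny.groupby([tiny['Stores'], key]).cumcount() + 1
print(tiny['Count'].tolist())                 # [1, 2, 3, 1, 1, 2, 3]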

pandas find close values on same line

I'd like to find and extract, from a pandas DataFrame, all text that is close together: on the same line and with the distance between texts < 10 (x2 - x < 10). x, y, x2, y2 are the coordinates of the bounding box which contains the text. The texts can be different each time (string, float, int, ...).
In my example, I want to extract 'Amount VAT' at idx 70 and 71: they are on the same line, and the distance 'VAT'[x] - 'Amount'[x2] < 10.
line text x y x2 y2
29 11 Amount 2184 1140 2311 1166
51 14 Amount 1532 1450 1660 1476
66 15 Amount 1893 1500 2021 1527
70 16 Amount 1893 1551 2022 1578
71 16 VAT 2031 1550 2121 1578
Final result must be:
line text x y x2 y2
70 16 Amount 1893 1551 2022 1578
71 16 VAT 2031 1550 2121 1578
and the extraction should work for 2 or more texts on the same line with (x2 - x < 10). Another result, with 3 values:
line text x y x2 y2
5 16 Total 1755 1551 1884 1578
8 16 Amount 1893 1551 2022 1578
20 16 VAT 2031 1550 2121 1578
I found a way to find rows on the same line:
same_line = find_labels['line'].map(find_labels['line'].value_counts() > 1)
and I tried to find near values (x2 - x < 10), but I don't know how to do this.
I tried a loop and .cov(), but neither worked.
Can someone help me?
Thanks for your help
Assuming VAT and Amount are both indexed by the same line value, I would do this:
# set the index to line
df.set_index('line', inplace=True)
# split up the table into the 2 parts to work on
amount_df = df[df['text'] == 'Amount']
vat_df = df[df['text'] == 'VAT']
# join the 2 tables to get everything on one row
df2 = amount_df.join(vat_df, how='outer', on='line', lsuffix='amount', rsuffix='vat')
# do the math: keep rows where VAT starts less than 10 past the end of Amount
condition = df2['xvat'] - df2['x2amount'] < 10
df2 = df2[condition]
df2['text'] = 'Total'
df2['x'] = df2['xvat'] - (df2['xamount'] - df2['xvat'])
df2['y'] = df2['yvat'] - (df2['yamount'] - df2['yvat'])
df2['x2'] = df2['x2vat'] - (df2['x2amount'] - df2['x2vat'])
df2['y2'] = df2['y2vat'] - (df2['y2amount'] - df2['y2vat'])
pd.concat([df, df2[['text', 'x', 'y', 'x2', 'y2']]])
I get something close to (though not exactly) what you asked, but you get the idea. I'm not sure what the right math is that would give exactly the results you show.
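If you need something that generalises to 2 or more words per line without splitting by text value, here is a rough alternative sketch (my own, not the answer above): sort by line and x, compute each word's horizontal gap to the previous word on the same line, and keep any word that is within 10 of a neighbour. Column names and the sample rows come from the question.
import pandas as pd

df = pd.DataFrame({
    'line': [11, 14, 15, 16, 16],
    'text': ['Amount', 'Amount', 'Amount', 'Amount', 'VAT'],
    'x':  [2184, 1532, 1893, 1893, 2031],
    'y':  [1140, 1450, 1500, 1551, 1550],
    'x2': [2311, 1660, 2021, 2022, 2121],
    'y2': [1166, 1476, 1527, 1578, 1578],
}, index=[29, 51, 66, 70, 71])

df = df.sort_values(['line', 'x'])
gap = df['x'] - df.groupby('line')['x2'].shift()    # distance to the previous word on the line
near_prev = gap < 10                                # close to the word before it
near_next = near_prev.shift(-1, fill_value=False) & df['line'].eq(df['line'].shift(-1))
print(df[near_prev | near_next])                    # keeps rows 70 and 71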

iterating through rows and columns of stock price python

I have the code below:
result, diff = [], []
for index, row in final.iterrows():
    for column in final.columns:
        if ((final['close'] - final['open']) > 20):
            diff = final['close'] - final['open']
            result = 1
        elif ((final['close'] - final['open']) < -20):
            diff = final['close'] - final['open']
            result = -1
        elif (-20 < (final['close'] - final['open']) < 20):
            diff = final['close'] - final['open']
            result = 0
        else:
            continue
The intention is to for every time stamp, check if close - open is greater than 20 pips, then assign a buy value to it. If it's less than -20 assign a sell value, if in between assign a 0.
I am getting this error
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
[Finished in 35.418s]
Someone more experienced with pandas would give a better answer, but since no one is answering, here's mine. You generally don't want to iterate directly over pandas DataFrames, as that defeats the purpose. A pandas solution would look more like:
import pandas as pd
data = {
    'symbol': ['WZO', 'FDL', 'KXW', 'GYU', 'MIR', 'YAC', 'CDE', 'DSD', 'PAM', 'BQE'],
    'open': [356, 467, 462, 289, 507, 654, 568, 646, 440, 625],
    'close': [399, 497, 434, 345, 503, 665, 559, 702, 488, 608]
}
df = pd.DataFrame.from_dict(data)
df['diff'] = df['close'] - df['open']
df.loc[(df['diff'] < 20) & (df['diff'] > -20), 'result'] = 0
df.loc[df['diff'] >= 20, 'result'] = 1
df.loc[df['diff'] <= -20, 'result'] = -1
df now contains:
symbol open close diff result
0 WZO 356 399 43 1.0
1 FDL 467 497 30 1.0
2 KXW 462 434 -28 -1.0
3 GYU 289 345 56 1.0
4 MIR 507 503 -4 0.0
5 YAC 654 665 11 0.0
6 CDE 568 559 -9 0.0
7 DSD 646 702 56 1.0
8 PAM 440 488 48 1.0
9 BQE 625 608 -17 0.0
Regarding your code, I'll repeat my comment from above: you are iterating by row, but then using the whole DataFrame final in your conditions; I think you meant to use row there. You don't need to iterate over columns, grabbing your values by index. Your conditions also miss the case where final['close'] - final['open'] is exactly 20 or -20. Finally, result, diff = [], [] are lists at the top, but they are assigned integers in the loop; perhaps you wanted result.append()?
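For contrast, a hedged sketch of the row-wise fix the comment alludes to (use row, not the whole DataFrame, and return values instead of assigning to loop variables); this keeps the spirit of the original loop but will be slower than the vectorized version above:
def classify(row):
    diff = row['close'] - row['open']
    if diff >= 20:
        return pd.Series({'diff': diff, 'result': 1})
    elif diff <= -20:
        return pd.Series({'diff': diff, 'result': -1})
    return pd.Series({'diff': diff, 'result': 0})

df[['diff', 'result']] = df.apply(classify, axis=1)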

Selecting rows with lowest values based on combination two columns from pandas

I'm not even sure if the title makes sense.
I have a pandas dataframe with 3 columns: x, y, time. There are a few thousand rows. Example below:
x y time
0 225 0 20.295270
1 225 1 21.134015
2 225 2 21.382298
3 225 3 20.704367
4 225 4 20.152735
5 225 5 19.213522
.......
900 437 900 27.748966
901 437 901 20.898460
902 437 902 23.347935
903 437 903 22.011992
904 437 904 21.231041
905 437 905 28.769945
906 437 906 21.662975
.... and so on
What I want to do is retrieve the rows which have the smallest time associated with x and y. Basically, for every element in y, I want to find the row with the smallest time value, but I want to exclude those that have time 0.0 (which happens when x has the same value as y).
So, for example, the fastest way to get to y=0 is by starting from x=225, and so on; it could therefore be the case that x repeats itself, but for a different y.
e.g.
x y time
225 0 20.295270
438 1 19.648954
27 20 4.342732
9 438 17.884423
225 907 24.560400
Up until now I have tried groupby, but I'm only getting the same x as y.
print(df.groupby('id_y', sort=False)['time'].idxmin())
y
0 0
1 1
2 2
3 3
4 4
The one below just returns the df that I already have.
df.loc[df.groupby("id_y")["time"].idxmin()]
Just to point out one thing: I'm open to options other than groupby if there are better ways.
So you need to remove the rows where time equals 0 first, by boolean indexing, and then use your solution:
df = df[df['time'] != 0]
df2 = df.loc[df.groupby("y")["time"].idxmin()]
A similar alternative, filtering with query:
df = df.query('time != 0')
df2 = df.loc[df.groupby("y")["time"].idxmin()]
Or use sort_values with drop_duplicates:
df2 = df[df['time'] != 0].sort_values(['y','time']).drop_duplicates('y')
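A quick sanity check on a toy frame (values invented for illustration) showing the variants agree:
import pandas as pd

toy = pd.DataFrame({
    'x':    [0, 225, 437, 1, 225, 437],
    'y':    [0, 0, 0, 1, 1, 1],
    'time': [0.0, 20.3, 27.7, 0.0, 21.1, 20.9],
})
clean = toy[toy['time'] != 0]
a = clean.loc[clean.groupby('y')['time'].idxmin()]
b = clean.sort_values(['y', 'time']).drop_duplicates('y')
print(a)                                      # (225, 0, 20.3) and (437, 1, 20.9)
print(a.sort_index().equals(b.sort_index()))  # True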

Numpy "where" with multiple conditions

I'm trying to add a new column "energy_class" to a dataframe "df_energy" that contains the string "high" if the "consumption_energy" value is > 400, "medium" if the "consumption_energy" value is between 200 and 400, and "low" if the "consumption_energy" value is under 200.
I tried to use np.where from numpy, but I see that numpy.where(condition[, x, y]) handles only two outcomes, not 3 like in my case.
Any idea to help me please?
Thank you in advance
Try this:
Using the setup from @MaxU -
col = 'consumption_energy'
conditions = [df2[col] >= 400, (df2[col] < 400) & (df2[col] > 200), df2[col] <= 200]
choices = ['high', 'medium', 'low']
df2["energy_class"] = np.select(conditions, choices, default=np.nan)
consumption_energy energy_class
0 459 high
1 416 high
2 186 low
3 250 medium
4 411 high
5 210 medium
6 343 medium
7 328 medium
8 208 medium
9 223 medium
You can use nested np.where calls (a vectorized ternary):
np.where(consumption_energy > 400, 'high',
         np.where(consumption_energy < 200, 'low', 'medium'))
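For reference, the result is a plain NumPy string array, so you would assign it back to the frame yourself; a small sketch with made-up values:
import numpy as np
import pandas as pd

consumption_energy = pd.Series([459, 186, 250])
print(np.where(consumption_energy > 400, 'high',
               np.where(consumption_energy < 200, 'low', 'medium')))
# ['high' 'low' 'medium']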
I like to keep the code clean. That's why I prefer np.vectorize for such tasks.
def conditions(x):
    if x > 400:
        return "High"
    elif x > 200:
        return "Medium"
    else:
        return "Low"

func = np.vectorize(conditions)
energy_class = func(df_energy["consumption_energy"])
Then just add numpy array as a column in your dataframe using:
df_energy["energy_class"] = energy_class
The advantage in this approach is that if you wish to add more complicated constraints to a column, it can be done easily.
Hope it helps.
I would use the cut() method here, which will generate very efficient and memory-saving category dtype:
In [124]: df
Out[124]:
consumption_energy
0 459
1 416
2 186
3 250
4 411
5 210
6 343
7 328
8 208
9 223
In [125]: pd.cut(df.consumption_energy,
[0, 200, 400, np.inf],
labels=['low','medium','high']
)
Out[125]:
0 high
1 high
2 low
3 medium
4 high
5 medium
6 medium
7 medium
8 medium
9 medium
Name: consumption_energy, dtype: category
Categories (3, object): [low < medium < high]
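To attach the result as a new column, the Categorical from pd.cut can be assigned directly; a small hedged sketch (note that values falling outside the bins, here exactly 0 or NaN, come out as NaN):
df['energy_class'] = pd.cut(df.consumption_energy,
                            [0, 200, 400, np.inf],
                            labels=['low', 'medium', 'high'])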
WARNING: Be careful with NaNs
Always be careful that if your data has missing values np.where may be tricky to use and may give you the wrong result inadvertently.
Consider this situation:
df['cons_ener_cat'] = np.where(df.consumption_energy > 400, 'high',
                               np.where(df.consumption_energy < 200, 'low', 'medium'))
# if we do not use this second line, then
# if consumption energy is missing it would be shown medium, which is WRONG.
df.loc[df.consumption_energy.isnull(), 'cons_ener_cat'] = np.nan
Alternatively, you can use one more nested np.where for medium versus NaN, which would be ugly.
IMHO the best way to go is pd.cut. It deals with NaNs and is easy to use.
Examples:
import numpy as np
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
# pd.cut
df['age_cat'] = pd.cut(df.age, [0, 20, 60, np.inf], labels=['child','medium','old'])
# manually add another line for nans
df['age_cat2'] = np.where(df.age > 60, 'old', (np.where(df.age <20, 'child', 'medium')))
df.loc[df.age.isnull(), 'age_cat'] = np.nan
# multiple nested where
df['age_cat3'] = np.where(df.age > 60, 'old',
                          np.where(df.age < 20, 'child',
                                   np.where(df.age.isnull(), np.nan, 'medium')))
# outputs
print(df[['age','age_cat','age_cat2','age_cat3']].head(7))
age age_cat age_cat2 age_cat3
0 22.0 medium medium medium
1 38.0 medium medium medium
2 26.0 medium medium medium
3 35.0 medium medium medium
4 35.0 medium medium medium
5 NaN NaN medium nan
6 54.0 medium medium medium
Let's start by creating a dataframe with 1000000 random numbers between 0 and 1000 to be used as a test
df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})
[Out]:
consumption_energy
0 683
1 893
2 545
3 13
4 768
5 385
6 644
7 551
8 572
9 822
A bit of a description of the dataframe
print(df_energy.describe())
[Out]:
consumption_energy
count 1000000.000000
mean 499.648532
std 288.600140
min 0.000000
25% 250.000000
50% 499.000000
75% 750.000000
max 999.000000
There are various ways to achieve that, such as:
Using numpy.where
df_energy['energy_class'] = np.where(df_energy['consumption_energy'] > 400, 'high', np.where(df_energy['consumption_energy'] > 200, 'medium', 'low'))
Using numpy.select
df_energy['energy_class'] = np.select([df_energy['consumption_energy'] > 400, df_energy['consumption_energy'] > 200], ['high', 'medium'], default='low')
Using numpy.vectorize
df_energy['energy_class'] = np.vectorize(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))(df_energy['consumption_energy'])
Using pandas.cut
df_energy['energy_class'] = pd.cut(df_energy['consumption_energy'], bins=[0, 200, 400, 1000], labels=['low', 'medium', 'high'])
Using Python's built-in functions
def energy_class(x):
    if x > 400:
        return 'high'
    elif x > 200:
        return 'medium'
    else:
        return 'low'

df_energy['energy_class'] = df_energy['consumption_energy'].apply(energy_class)
Using a lambda function
df_energy['energy_class'] = df_energy['consumption_energy'].apply(lambda x: 'high' if x > 400 else ('medium' if x > 200 else 'low'))
Time Comparison
From all the tests that I've done, by measuring time with time.perf_counter() (for other ways to measure time of execution see this), pandas.cut was the fastest approach.
                        method      time
0                   np.where()  0.124139
1                  np.select()  0.155879
2            numpy.vectorize()  0.452789
3                 pandas.cut()  0.046143
4  Python's built-in functions  0.138021
5              lambda function  0.190810
Notes:
For the difference between pandas.cut and pandas.qcut see this: What is the difference between pandas.qcut and pandas.cut?
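For reference, a minimal sketch of how such a comparison could be set up with time.perf_counter (the structure and names here are my own, not the exact benchmark used above):
import time
import numpy as np
import pandas as pd

df_energy = pd.DataFrame({'consumption_energy': np.random.randint(0, 1000, 1000000)})

def time_it(label, fn):
    start = time.perf_counter()
    fn()
    print(label, time.perf_counter() - start)

time_it('np.where', lambda: np.where(
    df_energy['consumption_energy'] > 400, 'high',
    np.where(df_energy['consumption_energy'] > 200, 'medium', 'low')))
time_it('pandas.cut', lambda: pd.cut(
    df_energy['consumption_energy'], bins=[0, 200, 400, 1000],
    labels=['low', 'medium', 'high']))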
Try this: even if consumption_energy contains nulls, don't worry about it.
def egy_class(x):
    '''
    This function assigns classes as per the energy consumed.
    '''
    return ('high' if x > 400 else
            'low' if x < 200 else 'medium')

chk = df_energy.consumption_energy.notnull()
df_energy['energy_class'] = df_energy.consumption_energy[chk].apply(egy_class)
I second using np.vectorize. It is much faster than np.where and also cleaner code-wise. You can definitely tell the speed-up with larger data sets. You can use a dictionary format for your conditionals as well as for the output of those conditions.
# Vectorizing with numpy
row_dic = {'Condition1': 'high',
           'Condition2': 'medium',
           'Condition3': 'low',
           'Condition4': 'lowest'}

def Conditions(dfSeries_element, dictionary):
    '''
    dfSeries_element is an element from df_series
    dictionary: is the dictionary of your conditions with their outcome
    '''
    if dfSeries_element in dictionary.keys():
        return dictionary[dfSeries_element]

def VectorizeConditions():
    func = np.vectorize(Conditions)
    result_vector = func(df['Series'], row_dic)
    df['new_Series'] = result_vector

# running the below function will apply multi-conditional formatting to your df
VectorizeConditions()
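For a plain value-to-label mapping like this, Series.map with the same dictionary is an equivalent (and simpler) alternative; a hedged sketch with invented column values:
import pandas as pd

row_dic = {'Condition1': 'high', 'Condition2': 'medium',
           'Condition3': 'low', 'Condition4': 'lowest'}
df = pd.DataFrame({'Series': ['Condition1', 'Condition3', 'Condition2']})
df['new_Series'] = df['Series'].map(row_dic)
print(df['new_Series'].tolist())   # ['high', 'low', 'medium']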
myassign["assign3"]=np.where(myassign["points"]>90,"genius",(np.where((myassign["points"]>50) & (myassign["points"]<90),"good","bad"))
when you wanna use only "where" method but with multiple condition. we can add more condition by adding more (np.where) by the same method like we did above. and again the last two will be one you want.
