Pandas DataFrame: multiply values in a column, based on condition [duplicate] - python

This question already has an answer here:
Pandas: update a column with an if statement
(1 answer)
Closed 3 years ago.
Hi, I have a DataFrame column like the following:
dataframe['BETA'], which holds float numbers between 0 and 100.
I need every number to have the same number of digits. Example:
dataframe['BETA']:
[0] 0.11 to [0] 110
[1] 1.54 to [1] 154
[2] 22.1 to [2] 221
I tried to change the values one by one, but it's a super inefficient process:
for i in range(len(df_ld)):
    nbeta = df_ld['BETA'][i]
    if nbeta < 1:
        val = nbeta
        val = val * 1000
        df_ld.loc[i, 'BETA'] = val
    if (nbeta >= 1) and (nbeta <= 10):
        val = nbeta
        val = val * 100
        df_ld.loc[i, 'BETA'] = val
    if (nbeta > 10) and (nbeta <= 100):
        val = nbeta
        val = val * 10
        df_ld.loc[i, 'BETA'] = val
        #print('%.f >10, %.f Nuevo valor'% (nbeta,val))
Note: The dataframe has more than 80k elements.
Please help!
Edited: Solution
numpy.select
import numpy as np
x = df_ld['BETA']
condlist = [x<1, (x>=1) & (x<10),(x>=10) & (x<100)]
choicelist = [x*1000, x*100,x*10]
output=np.select(condlist, choicelist)
df_ld.insert(4,'BETA3',output,True)
Thank you!

Try this.
I'm guessing your dataframe is called df_ld and your target column is df_ld['BETA'].
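def multiply(column):
    # Scale each value so that all results have the same number of digits
    newcol = []
    for item in column:
        # elif ensures exactly one value is appended per input item
        if item < 1:
            newcol.append(item * 1000)
        elif item <= 10:
            newcol.append(item * 100)
        elif item <= 100:
            newcol.append(item * 10)
        else:
            # outside the expected 0-100 range: keep the value unchanged
            newcol.append(item)
    return newcol
# apply function and create new column
df_ld['newcol'] = multiply(df_ld['BETA'])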
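For 80k+ rows the same branching can probably be done without a Python loop. The following is a minimal sketch using numpy.select, assuming the boundaries from the question's loop (below 1, up to 10, up to 100); the sample values and the choice to leave out-of-range values unscaled are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real df_ld has 80k+ rows
df_ld = pd.DataFrame({'BETA': [0.11, 1.54, 22.1]})

x = df_ld['BETA']
# Conditions are checked in order and the first match wins
conditions = [x < 1, x <= 10, x <= 100]
factors = [1000, 100, 10]
# Assumption: values matching no condition keep a factor of 1
df_ld['newcol'] = x * np.select(conditions, factors, default=1)
print(df_ld)
```

Because np.select takes the first matching condition, the later conditions do not need an explicit lower bound.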

Related

how do you divide each value from a pandas series in sequence

Hi, I was trying to figure out how to divide values in a DataFrame. Here is an example using a pandas Series:
a = pd.Series([1, 2, 16,64,128,360,720])
a
-----------------
0 1
1 2
2 16
3 64
4 128
5 360
6 720
So is there any way I could divide a number in a given row by the value from the previous row?
0 2
1 8
2 4
3 2
4 2.8
5 2
Furthermore, I would also like to get output like "if the value is double the previous one, print the index".
Thank you for your help!
It seems that you are trying to divide the number in each row by the one in the previous row. This can be achieved with the following code:
import pandas as pd
import numpy as np
a = pd.Series([1, 2, 16,64,128,360,720])
division = pd.Series(np.divide(a.values[1:],a.values[:-1]))
index = pd.Series(np.multiply(division == 2, [i for i in range(len(a)-1)]))
Note: your question is poorly posed. You didn't specify what you wanted to achieve; I had to work it out from the example. You also included an incorrect code snippet. Please take care to write a clearer question next time.
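The same row-over-previous-row division can also be sketched with Series.shift, which keeps the original index labels. This is only an illustration: the variable names ratio and doubled are made up here, and the result is labelled by the row of the numerator (1 to 6) rather than renumbered from 0 as in the question's expected output.

```python
import pandas as pd

a = pd.Series([1, 2, 16, 64, 128, 360, 720])

# Divide each value by the previous one; the first row has no predecessor
ratio = (a / a.shift(1)).dropna()

# Index labels where the value is exactly double the previous value
doubled = ratio.index[ratio == 2].tolist()

print(ratio)
print(doubled)
```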
Well, I don't know exactly what you want to divide your pandas Series by, but you can do it like this:
# If you want to store a new result
b = a / value_to_divide_by
# or if you want to apply it directly to your Series
a /= value_to_divide_by
or using a list comprehension
b = [int(nb / your_value_here) for nb in a]
# with a minimum value required to do the division
b = [int(nb / your_value_here) for nb in a if nb > min_value]
There are probably other ways to do what you want, but these are two easy solutions.

pandas: getting the name of the column corresponding to the highest value in the row [duplicate]

This question already has answers here:
Find the column name which has the maximum value for each row
(5 answers)
Closed 1 year ago.
I have a pandas DF as follows:
What I want to get is a one-column df that contains, for each row, the name of the column holding the row's maximum value, or the order number of that column (see the red circles in the picture).
Any suggestions?
Here is the data for copy and paste:
matrix = np.array([[0.92234683, 0.94209485, 0.90884652, 0.99763808],
[0.86166401, 0.96755855, 0.9243107 , 0.94240756],
[0.85457367, 0.9169915 , 0.95042024, 0.90661279],
[0.83972504, 0.93902909, 0.91985442, 0.93765059],
[0.84373323, 0.87762977, 0.91005636, 0.88525626]])
thanks
Use idxmax:
df = pd.DataFrame(matrix, columns=['Y_clm1', 'Y_clm2', 'Y_clm3', 'Y_clm4'])
>>> df.idxmax(axis=1)
0 Y_clm4
1 Y_clm2
2 Y_clm3
3 Y_clm2
4 Y_clm3
dtype: object
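Note that the built-in max does not help here:
df = pd.DataFrame(matrix)
max(df) == 3
max(df) iterates over the column labels and returns the largest one (3, the label of the last column), whatever the data contains. To get the order number of the column holding each row's maximum, use df.values.argmax(axis=1) instead.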
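Both variants asked for in the question (column name and order number) can be sketched together on the posted data. The column names are the ones used in the idxmax answer; the variable names names and positions are only illustrative.

```python
import numpy as np
import pandas as pd

matrix = np.array([[0.92234683, 0.94209485, 0.90884652, 0.99763808],
                   [0.86166401, 0.96755855, 0.9243107, 0.94240756],
                   [0.85457367, 0.9169915, 0.95042024, 0.90661279],
                   [0.83972504, 0.93902909, 0.91985442, 0.93765059],
                   [0.84373323, 0.87762977, 0.91005636, 0.88525626]])
df = pd.DataFrame(matrix, columns=['Y_clm1', 'Y_clm2', 'Y_clm3', 'Y_clm4'])

# Name of the column holding each row's maximum
names = df.idxmax(axis=1)

# Zero-based position of that column in each row
positions = df.to_numpy().argmax(axis=1)

print(names.tolist())
print(positions.tolist())
```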

label intervals based on other intervals in pandas [duplicate]

This question already has answers here:
Add/fill pandas column based on range in rows from another dataframe
(3 answers)
Closed 3 years ago.
I have two dataframes, a and b
b has a datetime index, while a has a Start and End datetime columns
I need to set 'Label' to True for all the rows of b whose index falls within any [Start, End] interval from a.
Right now I am doing:
for _, r in a.iterrows():
    b.loc[np.logical_and(b.index >= r.Start,
                         b.index <= r.End), 'Label'] = True
but this is extremely slow when b is large.
How to optimize the provided code snippet?
MVCE:
b=pd.DataFrame(index=[pd.Timestamp('2017-01-01'),pd.Timestamp('2018-01-01')],columns=['Label'])
a=pd.DataFrame.from_dict([{'Start':pd.Timestamp('2018-01-01'),'End':pd.Timestamp('2020-01-01')}])
EDIT:
The solution at
Add/fill pandas column based on range in rows from another dataframe
does not work for me (they use range to fill the intervals, while we are working with datetimes).
Here's one solution using apply -
Dummy CSV data
Date,Start,End
01-08-2019,01-02-2019, 01-10-2019
01-08-2019,01-02-2020, 01-10-2020
Code
df = pd.read_csv('dummy.csv').apply(pd.to_datetime)
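# label-based access avoids relying on positional indexing of a labelled Series
df.apply(lambda row: row['Start'] < row['Date'] < row['End'], axis=1)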
Result
0 True
1 False
dtype: bool
How about doing something like this?
def func(date):  # date is one entry of b.index
    # an interval contains the date when Start <= date <= End
    mask = (a['Start'] <= date) & (a['End'] >= date)
    df = a.loc[mask]
    if len(df) > 0:
        return True
    else:
        return False
b['Label'] = b.index.to_series().apply(func)
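When b is large, the per-row Python call can probably be avoided by comparing every index entry against every interval at once with NumPy broadcasting. A minimal sketch on the question's example data follows; it assumes inclusive interval bounds, and it builds a len(b) by len(a) boolean array, so memory use may become a concern if both frames are very large.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(index=[pd.Timestamp('2017-01-01'), pd.Timestamp('2018-01-01')],
                 columns=['Label'])
a = pd.DataFrame([{'Start': pd.Timestamp('2018-01-01'),
                   'End': pd.Timestamp('2020-01-01')}])

# Column vector of index timestamps, shape (len(b), 1)
idx = b.index.values[:, None]
starts = a['Start'].values   # shape (len(a),)
ends = a['End'].values

# inside[i] is True when b.index[i] lies in at least one [Start, End] interval
inside = ((idx >= starts) & (idx <= ends)).any(axis=1)
b['Label'] = inside
print(b)
```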

Slicing a Data frame by checking consecutive elements [duplicate]

This question already has answers here:
Pandas: Drop consecutive duplicates
(8 answers)
Closed 4 years ago.
I have a DF indexed by time, and one of its columns (which takes 2 values) looks like [x,x,y,y,x,x,x,y,y,y,y,x]. I want to slice this DF so that the column has no identical consecutive values. In this example the result is [x,y,x,y,x], where every kept row was the first of its run.
Still trying to figure it out...
Thanks!!
Assuming you have df like below
df=pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
We use shift to check whether each value differs from the previous one, and keep only the rows where it does
df[df[0].shift()!=df[0]]
Out[142]:
0
0 x
2 y
4 x
7 y
11 x
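Since the question describes a time-indexed frame, here is a small sketch of the same shift comparison on such a frame. The date index and the column name state are hypothetical, chosen only to make the example self-contained.

```python
import pandas as pd

# Hypothetical time index and column name
idx = pd.date_range('2020-01-01', periods=12, freq='D')
df = pd.DataFrame({'state': list('xxyyxxxyyyyx')}, index=idx)

# True where a value differs from the one in the previous row
first_of_run = df['state'].ne(df['state'].shift())

# Keep only the first row of every run; the time index is preserved
result = df[first_of_run]
print(result)
```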
You can just loop through the rows and save the last element used
df=pd.DataFrame(['x','x','y','y','x','x','x','y','y','y','y','x'])
kept = []   # index labels of the rows to keep
old = None  # last element seen
for idx, value in df[0].items():
    if value != old:
        kept.append(idx)
        old = value
df2 = df.loc[kept]
EDIT:
Or for a vector use a list
>>> L=[1,1,1,1,1,1,2,3,4,4,5,1,2]
>>> from itertools import groupby
>>> [x[0] for x in groupby(L)]
[1, 2, 3, 4, 5, 1, 2]

moving average or rolling mean pandas without any window size [duplicate]

This question already has an answer here:
Access the result of a previous calculation in custom function passed to apply()
(1 answer)
Closed 5 years ago.
How do I calculate a rolling mean or moving average that considers all the items seen so far?
Let's say I have a data frame like the one below
col new_col
0 1 1
1 2 1.5
2 3 2
and so on.
Now I would like to add a new column where I calculate the average of all the items in col up to that point.
Specifying a window means that the first few values come out as NaN and after that only a fixed-size rolling window is used. But I need something like the above.
The snippet below does what you're asking. There is plenty of room for improvement, though: it uses a plain for loop, and a vectorized function would surely be faster. It writes each value through df.loc rather than chained indexing, so it does not trigger SettingWithCopyWarning.
But it does the job:
# Libraries
import pandas as pd
import numpy as np
# Dataframe with desired input
df = pd.DataFrame({'col':[1,2,3]})
# Make room for a new column
df['new_col'] = np.nan
# Fill the new column with the mean of all values seen so far
for i in range(1, len(df) + 1):
    df.loc[df.index[i-1], 'new_col'] = df['col'].iloc[:i].mean()
print(df)
Output:
   col  new_col
0    1      1.0
1    2      1.5
2    3      2.0
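A vectorized alternative is likely the expanding window, which grows to include every row seen so far instead of using a fixed size. The sketch below uses the question's three-row example; the cumulative-sum variant (named alt here) is shown only as a cross-check of the same quantity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})

# Mean of all values from the first row up to and including the current row
df['new_col'] = df['col'].expanding().mean()

# Same quantity computed as running sum divided by running count
alt = df['col'].cumsum() / np.arange(1, len(df) + 1)

print(df)
```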
