I would like to create the following dataframe:
import pandas as pd

df = pd.DataFrame({
    'A': ['0','0','0','8.020833015','8.009259224','8.003472328','8.020833015','0','0','5','4.994213104','0','0','0','8.012152672','8.009259224','0'],
    'Step_ID': ['Step_1','Step_1','Step_1','Step_2','Step_2','Step_2','Step_2','Step_3','Step_3','Step_4','Step_4','Step_5','Step_5','Step_5','Step_6','Step_6','Step_7']})
print(df)
What I have is column A; based on these values I would like to set the values in column Step_ID.
Step_ID starts at Step_1 for the leading zeros. When the values become greater than 0, the step changes to Step_2 and keeps that label until the values return to zero; the next run of zeros gets Step_3, and so on. In other words, the step number increments every time the values switch between zero and non-zero.
# add a Step ID
import pandas as pd

df = pd.DataFrame({
    'A': ['0','0','0','8.020833015','8.009259224','8.003472328','8.020833015','0','0','5','4.994213104','0','0','0','8.012152672','8.009259224','0']})

step = 0
value = None

def get_step(x):
    global step
    global value
    if x != value:  # starts a new step on every value change, not only on zero/non-zero switches
        value = x
        step += 1
    return f'Step_{step}'

df['Step_ID'] = df['A'].apply(get_step)
df.to_csv('test.csv', index=None)
The code above does something similar, but it starts a new step for every unique value. Should there be one more "if" (checking whether value > 0) to achieve the desired behavior?
I can see you implemented an XOR gate, but we need some customization; I have added a new function that checks whether the zero/non-zero state changes.
import pandas as pd

df = pd.DataFrame({
    'A': ['0','0','0','8.020833015','8.009259224','8.003472328','8.020833015','0','0','5','4.994213104','0','0','0','8.012152672','8.009259224','0']})

step = 0
value = None

def check(x, y):
    # returns 1 exactly when one value is zero and the other is positive,
    # i.e. when the zero/non-zero state changes
    try:
        x = float(x)
        y = float(y)
        if x == 0 and y == 0:
            return 0
        elif x == 0 and y > 0:
            return 1
        elif x > 0 and y == 0:
            return 1
        else:
            return 0
    except (TypeError, ValueError):
        # the initial value is None, which counts as a change
        return 1

def get_step(x):
    global step
    global value
    # if x != value:
    if check(x, value):
        step += 1
    value = x
    return f'Step_{step}'

df['Step_ID'] = df['A'].apply(get_step)
df.to_csv('GSH0211.csv', index=None)
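For reference, check is just an exclusive-or on the zero/non-zero state of its two arguments, so it can be collapsed into a single comparison; a minimal sketch of the same idea, assuming the values are non-negative as in the sample data:
def check(x, y):
    # XOR on the zero/non-zero state: true exactly when one of the
    # two values is zero and the other is positive
    try:
        return (float(x) > 0) != (float(y) > 0)
    except (TypeError, ValueError):
        return True  # the initial None counts as a change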
Try this. You can adjust the threshold to the value you want.
import pandas as pd

df = pd.DataFrame({'A': ['0','0','0','8.020833015','8.009259224','8.003472328','8.020833015','0','0','5','4.994213104','0','0','0','8.012152672','8.009259224','0']})
df['A'] = df['A'].astype(float)

diff = df['A'] - df['A'].shift().fillna(0)
threshold = 0.1
df['Step_ID'] = (abs(diff) > threshold).cumsum().add(1)
df['Step_ID'] = 'Step_' + df['Step_ID'].astype(str)
df
A Step_ID
0 0.000000 Step_1
1 0.000000 Step_1
2 0.000000 Step_1
3 8.020833 Step_2
4 8.009259 Step_2
5 8.003472 Step_2
6 8.020833 Step_2
7 0.000000 Step_3
8 0.000000 Step_3
9 5.000000 Step_4
10 4.994213 Step_4
11 0.000000 Step_5
12 0.000000 Step_5
13 0.000000 Step_5
14 8.012153 Step_6
15 8.009259 Step_6
16 0.000000 Step_7
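Since the rule in the question is about switching between zero and non-zero rather than about the size of the jump, the same cumsum trick can key off the zero/non-zero state directly; a minimal sketch, assuming the series starts with a run of zeros as in the example (otherwise the first label would be Step_2):
is_on = df['A'].gt(0)  # df['A'] is already float here
# a step boundary is wherever the zero/non-zero state differs from the previous row
df['Step_ID'] = 'Step_' + (is_on != is_on.shift(fill_value=False)).cumsum().add(1).astype(str)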
I have a data frame that consists of a time series of integers. I'm trying to group the data frame by year and then, for each year, count the number of times that the sum of the absolute values of consecutive entries with the same sign is greater than or equal to 5.
>>> import pandas as pd
>>> l = [1, -1, -4, 2, 2, 4, 5, 1, -3, -4]
>>> idx1 = pd.date_range('2019-01-01',periods=5)
>>> idx2 = pd.date_range('2020-01-01',periods=5)
>>> idx = idx1.union(idx2)
>>> df = pd.DataFrame(l, index=idx, columns=['a'])
>>> df
a
2019-01-01  1
2019-01-02 -1
2019-01-03 -4   # 2019 count = 1: abs(-1) + abs(-4) >= 5
2019-01-04  2
2019-01-05  2
2020-01-01  4
2020-01-02  5   # 2020 count = 1: abs(4) + abs(5) + abs(1) = 10 >= 5
2020-01-03  1
2020-01-04 -3
2020-01-05 -4   # 2020 count = 2: abs(-3) + abs(-4) = 7 >= 5
The desired output is:
2019 1
2020 2
My approach to solve this problem is to chain groupby and apply. Below are the implementations of the functions I created to pass to groupby and apply respectively.
>>> import numpy as np
>>> def get_year(x):
...     return x.year
>>> def count(group, t=5):
...     c = 0  # counter
...     s = 0  # sum of consecutive values with the same sign
...     for i in range(1, len(group)):
...         if np.sign(group['a'].iloc[i-1]) == np.sign(group['a'].iloc[i]):
...             if s == 0:
...                 s = group['a'].iloc[i-1] + group['a'].iloc[i]
...             else:
...                 s += group['a'].iloc[i]
...             if i == (len(group) - 1):
...                 # note: the final streak is counted without re-checking abs(s) >= t
...                 return c + 1
...         elif (np.sign(group['a'].iloc[i-1]) != np.sign(group['a'].iloc[i])) and (abs(s) >= t):
...             # if the streak of same-signed values is broken and abs(s) >= t, increment c and reset s
...             c += 1
...             s = 0
...         elif (np.sign(group['a'].iloc[i-1]) != np.sign(group['a'].iloc[i])) and (abs(s) < t):
...             # if the streak of same-signed values is broken and abs(s) < t, just reset s
...             s = 0
...     return c
>>> by_year = df.groupby(get_year)
>>> by_year.apply(count)
2019 1
2020 2
My question is:
Is there a more "pythonic" implementation of the above count function that produces the desired result but doesn't rely on for loops?
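For what it's worth, one possible vectorized approach labels each run of same-signed values with a cumulative sum and aggregates per run. A sketch, matching the loop's behavior of counting only runs of at least two entries (note that the loop's final-streak branch returns c + 1 without re-checking the threshold, while this version applies the threshold to every run):
import numpy as np

def count_vectorized(group, t=5):
    sign = np.sign(group['a'])
    run_id = (sign != sign.shift()).cumsum()           # new run id at every sign change
    run_sums = group['a'].abs().groupby(run_id).sum()  # sum of absolute values per run
    run_lens = group['a'].groupby(run_id).size()       # run lengths
    return ((run_sums >= t) & (run_lens > 1)).sum()    # count qualifying runs of 2+ entries

by_year.apply(count_vectorized)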
What is the best way to calculate the RMS of a column in sections in Python/pandas? Here is an example for a better understanding of what I mean:
index    x    x_rms
0        2
1        3    2.55
2       10
3       22    17.09
...    ...    ...
So 2.55 is the RMS of 2 and 3, 17.09 is the RMS of 10 and 22, and so on.
The following will work:
import numpy as np
import pandas as pd

df = pd.DataFrame([2, 3, 10, 22], columns=["x"])

def rms(a, b):
    # return round(np.sqrt((a**2 + b**2) / 2), 2)  # for only two decimals
    return np.sqrt((a**2 + b**2) / 2)

# compute the RMS of each pair of rows (0-1, 2-3, ...) and place it on the second row of the pair
df["rms"] = [rms(df.loc[idx-1, "x"], val["x"]) if idx % 2 != 0 else np.nan
             for idx, val in df.iterrows()]
Output:
x rms
0 2 NaN
1 3 2.549510
2 10 NaN
3 22 17.088007
EDIT regarding comment
If your index is a date, you should do the following to get the same output:
values = [2, 3, 10, 22]
tidx = pd.date_range('2019-01-01', periods=len(values), freq='D')
df = pd.DataFrame(values, columns=["x"], index=tidx)

def rms(a, b):
    # return round(np.sqrt((a**2 + b**2) / 2), 2)  # for only two decimals
    return np.sqrt((a**2 + b**2) / 2)

# reset to a positional index so that idx-1 addresses the previous row
df = df.reset_index()
df["rms"] = [rms(df.loc[idx-1, "x"], val["x"]) if idx % 2 != 0 else np.nan
             for idx, val in df.iterrows()]
df = df.set_index("index")
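For reference, the pairwise RMS can also be computed without iterrows by grouping rows in non-overlapping pairs; a sketch, assuming an even number of rows and a default RangeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame([2, 3, 10, 22], columns=["x"])

pair = np.arange(len(df)) // 2                      # 0, 0, 1, 1, ...
rms = df["x"].pow(2).groupby(pair).mean().pow(0.5)  # RMS of each pair

df["rms"] = np.nan
df.loc[1::2, "rms"] = rms.values                    # place each RMS on the pair's second row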
We're trying to figure out a way to easily pull values from what I guess I would describe as a grid of conditional statements. We've got two variables, x and y, and depending on those values, we want to pull one of (something1, ..., another1, ... again1...). We could definitely do this using if statements, but we were wondering if there was a better way. Some caveats: we would like to be able to easily change the bounds on the x and y conditionals. The problem with a bunch of if statements is that it's not very easy to compare the values of those bounds with the values in the example table below.
Example:
So if x = 4% and y = 30%, we would get back another1. Whereas if x = 50% and y = 10%, we would get something3.
Overall two questions:
Is there a general name for this kind of problem?
Is there an easy framework or library that could do this for us without if statements?
Even though Pandas is not really made for this kind of usage, with function aggregation and boolean indexing it allows for an elegant-ish solution to your problem. Alternatively, constraint-based programming might be an option (see python-constraint on PyPI).
Define the constraints as functions.
x_constraints = [lambda x: 0 <= x < 5,
                 lambda x: 5 <= x < 10,
                 lambda x: 10 <= x < 15,
                 lambda x: x >= 15]

y_constraints = [lambda y: 0 <= y < 20,
                 lambda y: 20 <= y < 50,
                 lambda y: y >= 50]

x = 15
y = 30
Now we want to make two dataframes: one that only holds the x-values, and another that only holds the y-values, where the number of columns equals the number of x-constraints and the number of rows equals the number of y-constraints.
import pandas as pd

def make_dataframe(value):
    return pd.DataFrame(data=value,
                        index=range(len(y_constraints)),
                        columns=range(len(x_constraints)))

x_df = make_dataframe(x)
y_df = make_dataframe(y)
The dataframes look like this:
>>> x_df
0 1 2 3
0 15 15 15 15
1 15 15 15 15
2 15 15 15 15
>>> y_df
0 1 2 3
0 30 30 30 30
1 30 30 30 30
2 30 30 30 30
Next, we need the dataframe label_df that holds the possible outcomes. The shape must match the dimension of x_df and y_df above. (What's cool about this is that you can store the data in a
CSV-file and directly read it into a dataframe with pd.read_csv if you wish.)
label_df = pd.DataFrame([[f"{w}{i+1}" for i in range(len(x_constraints))] for w in "something another again".split()])
>>> label_df
0 1 2 3
0 something1 something2 something3 something4
1 another1 another2 another3 another4
2 again1 again2 again3 again4
Next, we want to apply the x_constraints to the columns of x_df, and the y_constraints to the rows of y_df. .aggregate takes a dictionary that maps column or row names to functions ({colname: func}), which we construct inline using dict(zip(...)). axis=1 means "apply the functions row-wise", so for y_mask the dictionary keys are the row labels (y_df.index).
x_mask = x_df.aggregate(dict(zip(x_df.columns, x_constraints)))
y_mask = y_df.aggregate(dict(zip(y_df.index, y_constraints)), axis=1)
The result is two dataframes holding boolean values, and ideally there should be exactly one column in x_mask and one row in y_mask that is all True, e.g.
>>> x_mask
0 1 2 3
0 False False False True
1 False False False True
2 False False False True
>>> y_mask
0 1 2 3
0 False False False False
1 True True True True
2 False False False False
If we combine them with bit-wise and &, we get a boolean mask with exactly
one True value.
>>> m = x_mask & y_mask
>>> m
0 1 2 3
0 False False False False
1 False False False True
2 False False False False
Use m to select the target value from label_df. The result df is all NaN except one value, which we extract with df.stack().iloc[0]:
>>> df = label_df[m]
>>> df
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN another4
2 NaN NaN NaN NaN
>>> df.stack().iloc[0]
'another4'
And that's it! It should be very easy to maintain, by just changing the list of constraints and adapting the possible outcomes in label_df.
I haven't heard of a name for this kind of problem.
If (ha-ha) it feels conceptually closer to you, I might suggest creating two mapper functions that map the x and y values to the row and column positions of your contingency table.
map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2
df.iloc[map_x(x), map_y(y)]
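A self-contained example of that idea, with a hypothetical label table made up here to match the question's examples (rows are the x-categories, columns the y-categories):
import pandas as pd

# hypothetical label table: rows = x-categories, columns = y-categories
df = pd.DataFrame([["something1", "another1", "again1"],
                   ["something2", "another2", "again2"],
                   ["something3", "another3", "again3"]])

map_x = lambda x: 0 if x < 0.05 else 1 if x < 0.1 else 2
map_y = lambda y: 0 if y < 0.2 else 1 if y < 0.5 else 2

print(df.iloc[map_x(0.04), map_y(0.30)])  # another1
print(df.iloc[map_x(0.50), map_y(0.10)])  # something3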
If you have just a handful of conditionals, you may define two lists with the upper bounds and use a simple linear search:
x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]

def linear(x_bounds, y_bounds, x, y):
    # return the index of the first bound each value falls under
    for i, xb in enumerate(x_bounds):
        if x <= xb:
            break
    for j, yb in enumerate(y_bounds):
        if y <= yb:
            break
    return i, j

linear(x_bounds, y_bounds, 0.04, 0.3)  # (0, 1)
If there are many conditionals a binary search will be better:
def binary(x_bounds, y_bounds, x, y):
    lower = 0
    upper = len(x_bounds) - 1
    while upper > lower + 1:
        mid = (lower + upper) // 2
        if x_bounds[mid] < x:
            lower = mid
        elif x_bounds[mid] >= x:
            if mid > 0 and x_bounds[mid-1] < x:
                xmid = mid
                break
            else:
                xmid = mid - 1
                break
        else:
            upper = mid
    lower = 0
    upper = len(y_bounds) - 1
    while upper > lower + 1:
        mid = (lower + upper) // 2
        if y_bounds[mid] < y:
            lower = mid
        elif y_bounds[mid] >= y:
            if mid > 0 and y_bounds[mid-1] < y:
                ymid = mid
                break
            else:
                ymid = mid - 1
                break
        else:
            upper = mid
    return xmid, ymid

binary(x_bounds, y_bounds, 0.04, 0.3)  # (0, 1)
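For what it's worth, the hand-rolled search above can exit its while loop without ever setting xmid or ymid (for example x = 0.4 only ever takes the lower = mid branch), so the standard library's bisect module may be a safer choice; a minimal sketch over the same bound lists:
from bisect import bisect_left

x_bounds = [0.05, 0.1, 1.0]
y_bounds = [0.2, 0.5, 1.0]

def binary_bisect(x_bounds, y_bounds, x, y):
    # bisect_left returns the index of the first bound >= the value,
    # matching the "x <= xb" test of the linear search; values above
    # the last bound return len(bounds)
    return bisect_left(x_bounds, x), bisect_left(y_bounds, y)

print(binary_bisect(x_bounds, y_bounds, 0.04, 0.3))  # (0, 1)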
I am attempting to change the values of two columns in my dataset from specific numeric values (2, 10, 25 etc.) to single values (1, 2, 3 or 4) based on the percentile of the specific value within the dataset.
Using the pandas quantile() function I have got the ranges I wish to replace between, but I haven't figured out a working method to do so.
age1 = datasetNB.Age.quantile(0.25)
age2 = datasetNB.Age.quantile(0.5)
age3 = datasetNB.Age.quantile(0.75)
fare1 = datasetNB.Fare.quantile(0.25)
fare2 = datasetNB.Fare.quantile(0.5)
fare3 = datasetNB.Fare.quantile(0.75)
My current solution attempt for this problem is as follows:
for elem in datasetNB['Age']:
    if elem <= age1:
        datasetNB[elem].replace(to_replace=elem, value=1)
        print("set to 1")
    elif (elem > age1) & (elem <= age2):
        datasetNB[elem].replace(to_replace=elem, value=2)
        print("set to 2")
    elif (elem > age2) & (elem <= age3):
        datasetNB[elem].replace(to_replace=elem, value=3)
        print("set to 3")
    elif elem > age3:
        datasetNB[elem].replace(to_replace=elem, value=4)
        print("set to 4")
    else:
        pass

for elem in datasetNB['Fare']:
    if elem <= fare1:
        datasetNB[elem] = 1
    elif (elem > fare1) & (elem <= fare2):
        datasetNB[elem] = 2
    elif (elem > fare2) & (elem <= fare3):
        datasetNB[elem] = 3
    elif elem > fare3:
        datasetNB[elem] = 4
    else:
        pass
What should I do to get this working?
pandas already has a function to do exactly that: pandas.qcut.
You can simply do
q_list = [0, 0.25, 0.5, 0.75, 1]
labels = range(1, 5)
df['Age'] = pd.qcut(df['Age'], q_list, labels=labels)
df['Fare'] = pd.qcut(df['Fare'], q_list, labels=labels)
Input
import numpy as np
import pandas as pd

# Generate fake data for the sake of example
df = pd.DataFrame({
    'Age': np.random.randint(10, size=6),
    'Fare': np.random.randint(10, size=6)
})
>>> df
Age Fare
0 1 6
1 8 2
2 0 0
3 1 9
4 9 6
5 2 2
Output
DataFrame after running the above code
>>> df
Age Fare
0 1 3
1 4 1
2 1 1
3 1 4
4 4 3
5 3 1
Note that in your specific case, since you want quartiles, you can just assign q_list = 4.
How can I replace the values of a DataFrame if are smaller or greater than a particular value?
print(df)
name seq1 seq11
0 seq102 -14 -5.99
1 seq103 -5.25 -7.94
I want to set the values smaller than -8.5 to 1 and those greater than -8.5 to 0.
I tried this, but all the values get set to zero:
import pandas as pd
df = pd.read_csv('df.csv')
num = df._get_numeric_data()
num[num < -8.50] = 1
num[num > -8.50] = 0
The desired output should be:
name seq1 seq11
0 seq102 1 0
1 seq103 0 0
Thank you
Try
df.iloc[:, 1:] = df.iloc[:, 1:].applymap(lambda x: 1 if x < -8.50 else 0)
Note that values equal to -8.50 will be set to zero here.
def thresh(x):
    if x < -8.5:
        return 1
    elif x > -8.5:
        return 0
    return x  # values exactly equal to -8.5 are left unchanged

print(df[["seq1", "seq11"]].applymap(thresh))
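As an aside, the original attempt zeroes everything because the first assignment (num[num < -8.50] = 1) turns those cells into 1, which is greater than -8.50, so the second assignment immediately overwrites them. A single vectorized pass avoids testing any value twice; a sketch with numpy.where:
import numpy as np

num = df._get_numeric_data()
df[num.columns] = np.where(num < -8.5, 1, 0)  # one pass, so no cell is re-tested after being changed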