So I have a df that looks like this:
   some_int  another_int
0         1            5
1         2            6
2        10            7
3        11            8
4        15            9
So I want to perform a groupby operation on those elements that have a diff of only 1 between each other. Let's say I want to group by some_int (diff of 1) and perform a sum on another_int. By doing that I would obtain something like:
   some_int  another_int
0         1            5
1         2            6
2        10            7
3        11            8
4        15            9

          sum
0  5 + 6 = 11
1  7 + 8 = 15
2     15 = 15
What is the most pythonic way to do this? I tried creating a diff mask, shifting it, and OR-ing the two together, but that seems rather verbose. What do you think?
I suggest making a new column called group
df['group'] = (df.some_int.diff() > 1).cumsum()
then you can group by this column and apply a custom function that returns either the sum of another_int or the single some_int value:
def sum_or_val(x):
    if len(x) > 1:
        return sum(x['another_int'])
    return x['some_int'].values[0]
grouped = df.groupby('group').apply(sum_or_val)
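As a quick sanity check against the frame in the question (a sketch, assuming the data is exactly as shown above):

import pandas as pd

df = pd.DataFrame({'some_int': [1, 2, 10, 11, 15],
                   'another_int': [5, 6, 7, 8, 9]})
df['group'] = (df.some_int.diff() > 1).cumsum()   # groups: 0, 0, 1, 1, 2
grouped = df.groupby('group').apply(sum_or_val)
# expected result: group 0 -> 5 + 6 = 11, group 1 -> 7 + 8 = 15, group 2 -> 15 (the lone some_int)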
It seems simple, but I can't find an efficient way to solve this in Python 3: is there a loop I can use on my dataframe that takes every column after the current column (starting with the 1st column), subtracts it from the current column, and adds the resulting column to a new dataframe?
This is what my data looks like:
This is what I have so far, but when running run_analysis the "result" line raises an error, and I do not know how to store the results in a new dataframe. I'm a beginner at all of this, so any help would be much appreciated.
storage = []  # container that will store the results of the subtracted columns

def subtract(a, b):  # function to call to do the column-wise subtractions
    return a - b

def run_analysis(frame, store):
    for first_col_index in range(len(frame)):  # finding the first column to use
        temp = []  # temporary place to store the column-wise values from the analysis
        for sec_col_index in range(len(frame)):  # finding the second column to subtract from the first column
            if sec_col_index <= first_col_index:
                # if the column is below the current column or is equal to
                # the current column, then skip to the next column
                continue
            else:
                # if the column is after our current column, then subtract the values
                # in the columns and keep the result in temp
                result = [r for r in map(subtract, frame[sec_col_index], frame[first_col_index])]
                temp.append(result)
        store.append(temp)  # save the complete analysis in the store
Something like this?
import pandas as pd

# dummy dataframe
df = pd.DataFrame({'a': list(range(10)), 'b': list(range(10, 20)), 'c': list(range(10))})
print(df)
output:
a b c
0 0 10 0
1 1 11 1
2 2 12 2
3 3 13 3
4 4 14 4
5 5 15 5
6 6 16 6
7 7 17 7
8 8 18 8
9 9 19 9
Now iterate over adjacent pairs of columns, subtract them, and assign the result as a new column of the dataframe:
for c1, c2 in zip(df.columns[:-1], df.columns[1:]):
df[f'{c2}-{c1}'] = df[c2]-df[c1]
print(df)
output:
a b c b-a c-b
0 0 10 0 10 -10
1 1 11 1 10 -10
2 2 12 2 10 -10
3 3 13 3 10 -10
4 4 14 4 10 -10
5 5 15 5 10 -10
6 6 16 6 10 -10
7 7 17 7 10 -10
8 8 18 8 10 -10
9 9 19 9 10 -10
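If you actually need the difference for every pair of columns (each column against every column after it, as in the question) rather than just adjacent pairs, itertools.combinations yields all such pairs; a sketch, assuming the same dummy frame as above:

from itertools import combinations

cols = ['a', 'b', 'c']  # the original columns of the dummy frame
for c1, c2 in combinations(cols, 2):
    df[f'{c2}-{c1}'] = df[c2] - df[c1]
# adds b-a, c-a and c-b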
This is my desired output:
I am trying to calculate the columns df[Value] and df[Value_Compensed]. However, to do that, I need to use the previous row's value of df[Value_Compensed]. In terms of my table:
In the first row, all the values are 0.
For the following rows: df[Remained] = previous df[Value_Compensed]. Then df[Value] = df[Initial_value] + df[Remained]. Then df[Value_Compensed] = df[Value] - df[Compensation].
...And So on...
I am struggling to pass the value of Value_Compensed from one row to the next. I tried the shift() function, but as you can see in the image the values in df[Value_Compensed] come out wrong, because the value is not static and changes on every row. Any idea?
Thanks.
Manuel.
You can use apply to create your customised operations. I've made a dummy dataset as you didn't provide the initial dataframe.
from itertools import zip_longest

import numpy as np
import pandas as pd

# dummy data
df = pd.DataFrame(np.random.randint(1, 10, (8, 5)),
                  columns=['compensation', 'initial_value',
                           'remained', 'value', 'value_compensed'])
df.loc[0] = 0, 0, 0, 0, 0
>>> print(df)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 2 9 1 9 7
2 1 4 9 8 3
3 3 4 5 7 6
4 3 2 5 5 6
5 9 1 5 2 4
6 4 5 9 8 2
7 1 6 9 6 8
Use apply with axis=1 to iterate row-wise, passing the full dataframe as an extra argument so you can look up the previous row via x.name - 1 and do your calculations. I'm not sure I fully understood the intended result, but you can adjust the individual column calculations inside the function.
def f(x, data):
    if x.name == 0:
        return [0] * data.shape[1]
    else:
        x_remained = data.loc[x.name - 1]['value_compensed']
        x_value = data.loc[x.name - 1]['initial_value'] + x_remained
        x_compensed = x_value - x['compensation']
        return [x['compensation'], x['initial_value'], x_remained,
                x_value, x_compensed]
adj = df.apply(f, args=(df,), axis=1)
adj = pd.DataFrame.from_records(zip_longest(*adj.values), index=df.columns).T
>>> print(adj)
compensation initial_value remained value value_compensed
0 0 0 0 0 0
1 5 9 0 0 -5
2 5 7 4 13 8
3 7 9 1 8 1
4 6 6 5 14 8
5 4 9 6 12 8
6 2 4 2 11 9
7 9 2 6 10 1
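If the intent is a true carry-forward, where each row's Remained comes from the previously computed Value_Compensed rather than from the original column, a plain loop over the rows is the simplest way to express the recurrence. A minimal sketch, assuming input columns named initial_value and compensation as in the dummy data above:

remained, value, value_compensed = [0], [0], [0]   # first row: all values are 0
for i in range(1, len(df)):
    r = value_compensed[i - 1]                     # Remained = previous Value_Compensed
    v = df['initial_value'].iloc[i] + r            # Value = Initial_value + Remained
    remained.append(r)
    value.append(v)
    value_compensed.append(v - df['compensation'].iloc[i])  # Value_Compensed = Value - Compensation

df['remained'], df['value'], df['value_compensed'] = remained, value, value_compensed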
Is there an efficient way to change the value of a previous row whenever a condition is met in a subsequent entry? Specifically, I am wondering if there is any way to adapt pandas.where to modify the entry in the row prior or subsequent to the conditional test. Suppose
Data={'Energy':[12,13,14,12,15,16],'Time':[2,3,4,2,5,6]}
DF = pd.DataFrame(Data)
DF
Out[123]:
Energy Time
0 12 2
1 13 3
2 14 4
3 12 2
4 15 5
5 16 6
If I wanted to change the value of Energy to 'X' whenever Time <= 2, I could just do something like:
DF['Energy'] = DF['Energy'].where(DF['Time'] > 2, 'X')
or
DF.loc[DF['Time']<=2,'Energy']='X'
Which would output
Energy Time
0 X 2
1 13 3
2 14 4
3 X 2
4 15 5
5 16 6
But what if I want to change the value of 'Energy' in the row after Time <= 2, so that the output would actually be:
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Is there an easy modification for a vectorized approach to this?
Shift the values one row down using Series.shift and then compare:
df.loc[df['Time'].shift() <= 2, 'Energy'] = 'X'
df
Energy Time
0 12 2
1 X 3
2 14 4
3 12 2
4 X 5
5 16 6
Side note: I assume 'X' is actually something else here, but FYI, mixing strings and numeric data leads to object type columns which is a known pandas anti-pattern.
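Going the other way (changing the row before the condition rather than the row after) works the same way, just shifting in the opposite direction; a sketch:

# mark the row *before* each Time <= 2 by shifting the condition upwards
df.loc[df['Time'].shift(-1) <= 2, 'Energy'] = 'X'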
I have a large time-series df (2.5 million rows) that contains 0 values in some rows, some of which are legitimate. However, if there are repeated continuous occurrences of zero values, I would like to remove them from my df.
Example:
Col. A contains [1,2,3,0,4,5,0,0,0,1,2,3,0,8,8,0,0,0,0,9]. I would like to remove the [0,0,0] and [0,0,0,0] runs from the middle, leave the remaining single 0s, and end up with a new df of [1,2,3,0,4,5,1,2,3,0,8,8,9].
The run length that triggers deletion should be a settable parameter - in this case > 2.
Is there a clever way to do this in pandas?
It looks like you want to remove a row if its value is 0 and either the previous or the next row in the same column is also 0. You can use shift to look at the previous and next values and compare them with the current value, as below:
result_df = df[~(((df.ColA.shift(-1) == 0) & (df.ColA == 0)) | ((df.ColA.shift(1) == 0) & (df.ColA == 0)))]
print(result_df)
Result:
ColA
0 1
1 2
2 3
3 0
4 4
5 5
9 1
10 2
11 3
12 0
13 8
14 8
19 9
Update for runs of more than 2 consecutive zeros
Following the example in the linked answer, add a new column that tracks the length of each consecutive run, then filter on it:
# https://stackoverflow.com/a/37934721/5916727
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
df[~((df.consecutive > 2) & (df.ColA == 0))]  # 2 = the run-length parameter from the question
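Applied to the example list from the question (a quick sketch, assuming a single column named ColA and the run-length parameter set to 2):

import pandas as pd

df = pd.DataFrame({'ColA': [1, 2, 3, 0, 4, 5, 0, 0, 0, 1, 2, 3, 0, 8, 8, 0, 0, 0, 0, 9]})
# size of each run of identical consecutive values
df['consecutive'] = df.ColA.groupby((df.ColA != df.ColA.shift()).cumsum()).transform('size')
out = df[~((df.consecutive > 2) & (df.ColA == 0))].ColA.tolist()
# out == [1, 2, 3, 0, 4, 5, 1, 2, 3, 0, 8, 8, 9]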
We need to build a new helper column here, then use drop_duplicates:
df['New'] = df.A.eq(0).astype(int).diff().ne(0).cumsum()
s = pd.concat([df.loc[df.A.ne(0), :],
               df.loc[df.A.eq(0), :].drop_duplicates(keep=False)]).sort_index()
s
Out[190]:
A New
0 1 1
1 2 1
2 3 1
3 0 2
4 4 3
5 5 3
9 1 5
10 2 5
11 3 5
12 0 6
13 8 7
14 8 7
19 9 9
Explanation:
df.A.eq(0) finds the values equal to 0.
diff().ne(0).cumsum() starts a new group number whenever the zero/non-zero flag changes, so consecutive zeros end up in the same group and drop_duplicates(keep=False) removes them, while isolated zeros are kept.
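If the run-length threshold needs to be configurable (the question asks for runs longer than 2, while drop_duplicates removes any run of two or more), one way to generalise this is to count the size of each New group and filter on it; a sketch reusing the New column built above:

run_size = df.groupby('New').A.transform('size')   # length of each consecutive run
s = df[~(df.A.eq(0) & (run_size > 2))]             # drop zeros that sit in runs longer than 2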
I know that Pandas has a get_dummies function which you can use to convert categorical variables to dummy variables in a DataFrame. What I'm trying to do is slightly different.
I have a column containing percentage values from 0.0 to 100.0. I need to convert it to a column that has 1s for any value >= 10.0 and 0s for any value < 10.0. Is there a good way to do this by repurposing get_dummies, or will I have to write a loop?
You can convert bools to ints directly:
(df.column_of_interest >= 10).astype(int)
I assume you're discussing pandas.get_dummies here, and I don't think this is a use case for it. You are trying to set two values based on a boolean condition. One approach is to get a boolean Series and take its integer representation as the indicator column, with
df['indicators'] = (df.percentages >= 10.).astype('int')
Demo
>>> df
percentages
0 70.176341
1 70.638246
2 55.078803
3 42.586290
4 73.340089
5 53.308670
6 3.059331
7 49.494812
8 10.379713
9 7.676286
10 55.023261
11 4.417545
12 51.744169
13 49.513638
14 39.189640
15 90.521703
16 29.696734
17 11.546118
18 5.737921
19 83.258049
>>> df['indicators'] = (df.percentages >= 10.).astype('int')
>>> df
percentages indicators
0 70.176341 1
1 70.638246 1
2 55.078803 1
3 42.586290 1
4 73.340089 1
5 53.308670 1
6 3.059331 0
7 49.494812 1
8 10.379713 1
9 7.676286 0
10 55.023261 1
11 4.417545 0
12 51.744169 1
13 49.513638 1
14 39.189640 1
15 90.521703 1
16 29.696734 1
17 11.546118 1
18 5.737921 0
19 83.258049 1
Let's assume you have a dataframe df, with a column Perc that contains your percentages:
import numpy as np
import pandas as pd

np.random.seed(111)
df = pd.DataFrame({"Perc": np.random.uniform(1, 100, 20)})
Now, you can easily form a new column by using a lambda function that recodes your percentages, like so:
df["Category"] = df.Perc.apply(lambda x: 0 if x < 10.0 else 1)