I need to create a column that computes the difference between another column's elements:
Column A    Computed Column
10          blank    # nothing to compute for first record
9           1        # = 10-9
7           2        # = 9-7
4           3        # = 7-4
I am assuming this calls for a lambda function, but I am not sure how to reference the elements in 'Column A'.
Any help/direction you can provide would be great- thanks!
You can do it by shifting the column.
import pandas as pd
dict1 = {'A': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
df['Computed'] = df['A'].shift() - df['A']
print(df)
giving
A Computed
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
EDIT: the OP extended the requirement to multiple columns.
dict1 = {'A': [10,9,7,4], 'B': [10,9,7,4], 'C': [10,9,7,4]}
df = pd.DataFrame.from_dict(dict1)
columns_to_update = ['A', 'B']
for col in columns_to_update:
    df['Computed' + col] = df[col].shift() - df[col]
print(df)
By editing columns_to_update, you can choose which columns get a computed counterpart.
A B C ComputedA ComputedB
0 10 10 10 NaN NaN
1 9 9 9 1.0 1.0
2 7 7 7 2.0 2.0
3 4 4 4 3.0 3.0
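The loop can also be collapsed into a single vectorised step over all chosen columns; a minimal sketch of that alternative spelling:
import pandas as pd

dict1 = {'A': [10, 9, 7, 4], 'B': [10, 9, 7, 4], 'C': [10, 9, 7, 4]}
df = pd.DataFrame.from_dict(dict1)
columns_to_update = ['A', 'B']

# shift/subtract all chosen columns at once, then join under prefixed names
computed = df[columns_to_update].shift() - df[columns_to_update]
df = df.join(computed.add_prefix('Computed'))
print(df)
This produces the same ComputedA and ComputedB columns without iterating in Python.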
Use diff.
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = df.A.diff(-1).shift(1)  # diff(-1) gives current - next; shift(1) realigns it to previous - current
Output:
df
Out[140]:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
I would just do:
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
The reason for abs() is that diff() computes current - previous, whereas you want previous - current. diff() is already built into the Series class, and because your example column decreases monotonically, taking the absolute value flips every sign to the expected result.
To demonstrate:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df['B'] = abs(df['A'].diff())
>>> df
# Output
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df2['B'] = abs(df2['A'].diff())
>>> df2
# Output
A B
0 10 NaN
1 4 6.0
2 7 3.0
3 9 2.0
To still outperform @cosmic_inquiry's solution while preserving the sign of the difference:
import pandas as pd
df = pd.DataFrame(data=[10,9,7,4], columns=['A'])
df2 = pd.DataFrame(data=[10,4,7,9], columns=['A'])
df['B'] = df['A'].diff() * -1
df2['B'] = df2['A'].diff() * -1
>>> df
# Output:
A B
0 10 NaN
1 9 1.0
2 7 2.0
3 4 3.0
>>> df2
# Output:
A B
0 10 NaN
1 4 6.0
2 7 -3.0
3 9 -2.0
I need to interpolate the NaN values of a DataFrame, but I want the interpolation to use the first values of the DataFrame when the NaN is the last value. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"a": [1,2,3], "b":[1,2,np.nan]})
So the DataFrame is:
a b
0 1 1.0
1 2 2.0
2 3 NaN
But when I interpolate the NaN values like this:
df.interpolate(method="linear", inplace=True)
I got:
a b
0 1 1.0
1 2 2.0
2 3 2.0
The interpolation doesn't wrap around to the first value. My desired output would be a fill value of 1.5, reflecting that circular interpolation.
One possible solution is to append the first row, interpolate, and then drop the appended row:
df = pd.concat([df, df.iloc[[0]]]).interpolate(method="linear").iloc[:-1]
print (df)
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 1.5
EDIT:
More general solution:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df = pd.concat([df] * 3).interpolate(method="linear").iloc[len(df):-len(df)]
print (df)
a b
0 1 1.333333
1 2 1.000000
2 3 2.000000
3 4 1.666667
Or, if you need to work only with the first and last non-missing values:
df = pd.DataFrame.from_dict({"a": [1,2,3,4], "b":[np.nan,1,2,np.nan]})
df1 = df.ffill().iloc[[-1]]
df2 = df.bfill().iloc[[0]]
df = pd.concat([df1, df, df2]).interpolate(method="linear").iloc[1:-1]
print (df)
a b
0 1 1.5
1 2 1.0
2 3 2.0
3 4 1.5
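If this wrap-around behaviour is needed in several places, the tiling idea above can be packaged as a small helper; circular_interpolate is a hypothetical name, a sketch only:
import pandas as pd

def circular_interpolate(df):
    # tile the frame three times so edge NaNs see values from the
    # opposite end, interpolate, then keep only the middle copy
    n = len(df)
    tiled = pd.concat([df] * 3, ignore_index=True)
    filled = tiled.interpolate(method="linear")
    return filled.iloc[n:2 * n].reset_index(drop=True)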
I would like to discard all cells that contain a value below a given threshold. Not just whole rows or whole columns, but every individual cell.
I tried the code below, requiring every cell to be at least 3, but it doesn't work.
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3":
df[df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3" (starting again from the original df):
df = df[df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 NaN NaN 5.0
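An equivalent spelling for both cases uses DataFrame.where, which keeps values where the condition holds and substitutes the rest (NaN by default):
# each line below assumes the original, unmodified df
masked = df.where(df >= 3)      # same as df[df >= 3]: cells below 3 become NaN
clipped = df.where(df >= 3, 3)  # same as df[df < 3] = 3: cells below 3 become 3
For the clipping variant, df.clip(lower=3) is another built-in equivalent.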
You can check whether each value is >= 3, then drop all rows containing a NaN.
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
I'm working with a huge dataframe in python, and sometimes I need to add an empty row, or several rows, at a specific position. For this question I create a small dataframe df in order to show what I want to achieve.
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
   A  B  C
0  4  5  2
1  6  7  0
2  8  1  9
Let's say I need to add an empty row if I have a zero value in column 'C'. Here the empty row should be added after the second row, so at the end I want a new dataframe like:
new_df
     A    B    C
0    4    5    2
1    6    7    0
2  nan  nan  nan
3    8    1    9
I tried concat and append, but I didn't get what I wanted. Could you help me, please?
You can try it this way:
l = df[df['C']==0].index.tolist()
for c, i in enumerate(l):
    dfs = np.split(df, [i + 1 + c])  # c compensates for the rows already inserted
    df = pd.concat([dfs[0], pd.DataFrame([[np.nan, np.nan, np.nan]], columns=df.columns), dfs[1]], ignore_index=True)
print(df)
Input:
A B C
0 4 3 0
1 4 0 4
2 4 4 2
3 3 2 1
4 3 1 2
5 4 1 4
6 1 0 4
7 0 2 0
8 2 0 3
9 4 1 3
Output:
A B C
0 4.0 3.0 0.0
1 NaN NaN NaN
2 4.0 0.0 4.0
3 4.0 4.0 2.0
4 3.0 2.0 1.0
5 3.0 1.0 2.0
6 4.0 1.0 4.0
7 1.0 0.0 4.0
8 0.0 2.0 0.0
9 NaN NaN NaN
10 2.0 0.0 3.0
11 4.0 1.0 3.0
Last thing: it can happen that the last row has 0 in 'C', so you can add:
if df["C"].iloc[-1] == 0:
    df.loc[len(df)] = [np.nan, np.nan, np.nan]
Try using slicing.
First, you need to find the rows where C == 0. So let's create a bool df for this. I'll just name it 'a':
a = (df['C'] == 0)
So, whenever C == 0, a == True.
Now we need to find the index of each row where C == 0, create an empty row and add it to the df:
df2 = df.copy()  # make a copy because we want to be safe here
for n, i in enumerate(df.loc[a].index):
    empty_row = pd.DataFrame(index=[i])  # creating the empty data
    j = i + n + 1  # position just after row i, shifted by earlier insertions
    df2 = pd.concat([df2.iloc[:j], empty_row, df2.iloc[j:]])  # slicing the df
df2 = df2.reset_index(drop=True)  # reset the index
I must say, I don't know the size of your df or whether this is fast enough, but give it a try.
If you know the index where you want to insert the new row, concat can be a solution.
Example dataframe:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# A B C
# 0 1 4 7
# 1 2 5 8
# 2 3 6 9
Your new row as a dataframe with index 1:
new_row = pd.DataFrame({'A': np.nan, 'B': np.nan,'C': np.nan}, index=[1])
Inserting your new row after the second row:
new_df = pd.concat([df.loc[:1], new_row, df.loc[2:]]).reset_index(drop=True)
# A B C
# 0 1.0 4.0 7.0
# 1 2.0 5.0 8.0
# 2 NaN NaN NaN
# 3 3.0 6.0 9.0
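If the position varies, the same concat pattern generalises to a small helper; insert_empty_row is a hypothetical name, sketched under the assumption of a default RangeIndex:
import numpy as np
import pandas as pd

def insert_empty_row(df, pos):
    # build a one-row all-NaN frame with matching columns and splice it in
    empty = pd.DataFrame([[np.nan] * len(df.columns)], columns=df.columns)
    return pd.concat([df.iloc[:pos], empty, df.iloc[pos:]]).reset_index(drop=True)

new_df = insert_empty_row(df, 2)  # same result as the concat above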
Something like this should work for you, using a fractional index to slot each new row in, then sorting:
df2 = df.copy()
for key, row in df.iterrows():
    if row['C'] == 0:
        df2.loc[key + 0.5] = [np.nan] * len(df.columns)  # fractional label lands just after row `key`
df2 = df2.sort_index().reset_index(drop=True)
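For large frames, a loop-free variant of the same idea is possible: compute each row's final position, leaving a gap after every zero, and reindex into it. A sketch, assuming the default RangeIndex:
import numpy as np

mask = df['C'].eq(0)
offsets = mask.cumsum().shift(fill_value=0)  # how many gaps precede each row
new_pos = np.arange(len(df)) + offsets       # final position of each original row
new_df = df.set_axis(new_pos).reindex(range(len(df) + mask.sum()))
The rows missing from new_pos come back as all-NaN rows, one after each row where C == 0.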
I am interested in whether the pandas.core.groupby.DataFrameGroupBy.agg function can be used to perform arithmetic operations on multiple columns. For example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(15).reshape(5, 3))
df['C'] = [0, 0, 2, 2, 5]
print(df.groupby('C').mean()[0] - df.groupby('C').mean()[1])
print(df.groupby('C').agg({0: 'mean', 1: 'sum', 2: 'nunique', 'C': 'mean0-mean1'}))  # 'mean0-mean1' is made-up syntax illustrating what I want
Is it somehow possible to get a result like in this example: the difference between the means of column 0 and column 1, grouped by column 'C'?
df
0 1 2 C
0 0 1 2 0
1 3 4 5 0
2 6 7 8 2
3 9 10 11 2
4 12 13 14 5
Grouped difference
C
0 -1.0
2 -1.0
5 -1.0
dtype: float64
I am not interested in solutions that do not use the agg method. I am curious only whether agg can take multiple columns as arguments and operate on them to return one column when the job is done.
IIUC:
In [12]: df.groupby('C').mean().diff(axis=1)
Out[12]:
0 1 2
C
0 NaN 1.0 1.0
2 NaN 1.0 1.0
5 NaN 1.0 1.0
or
In [13]: df.groupby('C').mean().diff(-1, axis=1)
Out[13]:
0 1 2
C
0 -1.0 -1.0 NaN
2 -1.0 -1.0 NaN
5 -1.0 -1.0 NaN
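Regarding the literal question: agg applies each function to one column at a time, so a cross-column expression like 'mean0-mean1' cannot be passed to it directly; GroupBy.apply is the usual way to combine columns, shown here only for contrast:
result = df.groupby('C').apply(lambda g: g[0].mean() - g[1].mean())
print(result)
# C
# 0   -1.0
# 2   -1.0
# 5   -1.0
# dtype: float64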
I want to delete the values that are greater than a certain threshold from a pandas dataframe. Is there an efficient way to perform this? I am doing it with apply and lambda, which works fine but is a bit slow for a large dataframe, and I feel like there must be a better method.
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df
A B
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
This is my current approach; how can the same result be achieved without apply and lambda?
df['A'] = df.apply(lambda x: x['A'] if x['A'] < 3 else None, axis=1)
df
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
Use a boolean mask against the df (the outputs below are from the original, single-column version of your df):
In[21]:
df[df<3]
Out[21]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
Where the boolean condition is not met, the mask is False and the corresponding df value is replaced with NaN.
If you want to keep the masked result, assign it back:
df = df[df<3]
To compare a specific column:
In[22]:
df[df['A']<3]
Out[22]:
A
0 1
1 2
If you want NaN in the removed rows, you can use a trick: double square brackets return a single-column DataFrame, so the comparison yields a frame-shaped mask:
In[25]:
df[df[['A']]<3]
Out[25]:
A
0 1.0
1 2.0
2 NaN
3 NaN
4 NaN
If the df has multiple columns, the above won't work because the boolean mask must match the shape of the original df; in that case, reindex the masked result against the original index:
In[31]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df[df['A']<3].reindex(df.index)
Out[31]:
A B
0 1.0 1.0
1 2.0 2.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
EDIT
You've updated your question again; if you want to overwrite just the single column:
In[32]:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [1,2,3,4,5]})
df['A'] = df.loc[df['A'] < 3,'A']
df
Out[32]:
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
4 NaN 5
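Series.where and its inverse Series.mask offer an equivalent one-liner for this single-column overwrite; just an alternative spelling:
# starting again from the original df
df['A'] = df['A'].where(df['A'] < 3)   # keep values < 3, NaN out the rest
# or, equivalently:
df['A'] = df['A'].mask(df['A'] >= 3)   # NaN out values >= 3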