I have two pandas Series objects with slightly different indexes. I want to divide one series by the other. The default method gives me NaN wherever either series is missing an indexed element. There is a fill_value option, but it can only be set to a single value; I want the fill to depend on which series is missing the value.
For example
series1
0 10
1 20
2 30
3 40
series2
1 2
2 3
3 4
4 5
expected result: series1.divide(series2)
0 inf
1 10
2 10
3 10
4 0
actual result: series1.divide(series2)
0 NaN
1 10
2 10
3 10
4 NaN
Is there an easy way to do this?
You could use reindex to expand both indexes to their union, filling the missing entries of each series with 0. Dividing then yields inf where series2 was missing a label (10/0) and 0 where series1 was missing one (0/5), matching the expected result:
import pandas as pd

series1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
series2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])
idx = series1.index.union(series2.index)
series1 = series1.reindex(idx, fill_value=0)
series2 = series2.reindex(idx, fill_value=0)
print(series1.div(series2))
# 0     inf
# 1    10.0
# 2    10.0
# 3    10.0
# 4     0.0
# dtype: float64
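If the two series should get different fills, as the question asks, the same pattern works with a separate fill_value per reindex. A sketch, where the choices of 0 for the numerator and 1 for the denominator are purely illustrative:

```python
import pandas as pd

series1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
series2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])

idx = series1.index.union(series2.index)
# Each series gets its own default before dividing:
# a missing numerator becomes 0, a missing denominator becomes 1.
result = series1.reindex(idx, fill_value=0).div(series2.reindex(idx, fill_value=1))
```

When both fills happen to be the same value, `series1.div(series2, fill_value=0)` collapses this to a one-liner.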
I have the following dataframe in Python:

ID  maths  value
0   add    12
1   sub    30
0   add    10
2   mult   3
0   sub    10
1   add    11
3   sub    40
2   add    21
My idea is to perform the following operations to get the result I want:
First step: group the rows of the dataframe by ID, keeping the groups in the order in which each ID first appears in the original dataframe.

ID  maths  value
0   add    12
0   add    10
0   sub    10
1   sub    30
1   add    11
2   mult   3
2   add    21
3   sub    40
Second step: within each group, create a new column 'result' by applying the operation named in the previous row's 'maths' to the previous row's value and the current row's value. If there is no previous row in the group, 'result' is NaN.

ID  maths  value  result
0   add    12     NaN
0   add    10     22
0   sub    10     20
1   sub    30     NaN
1   add    11     19
2   mult   3      NaN
2   add    21     63
3   sub    40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this with the pandas groupby method, but I have trouble iterating with conditions over each row and each group, and I don't know how to create the new column 'result' on a groupby object.
grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))
I don't know whether to use sort_values or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.
import pandas as pd

ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# Shift Maths and Value within each ID group; Series.append was removed in
# pandas 2.x, so the per-operation pieces are stitched back with pd.concat.
df["Maths"] = df.groupby(["ID"]).pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby(["ID"]).pipe(lambda g: g.Value.shift(1))
df["result"] = df.groupby(["Maths"]).pipe(lambda g: pd.concat([
    g.get_group("add")["Value1"] + g.get_group("add")["Value"],
    g.get_group("sub")["Value1"] - g.get_group("sub")["Value"],
    g.get_group("mult")["Value1"] * g.get_group("mult")["Value"],
])).sort_index()
Here is the Output:
df
Out[168]:
ID Maths Value Value1 result
0 0 add 12 NaN NaN
1 0 add 10 12.0 22.0
2 0 add 10 10.0 20.0
3 1 add 30 NaN NaN
4 1 sub 11 30.0 19.0
5 2 add 3 NaN NaN
6 2 mult 21 3.0 63.0
7 3 add 40 NaN NaN
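A flatter alternative, not from the answer above but a sketch of the same idea: compute the previous row's value and operation per group with shift, then pick the arithmetic row by row with np.select. This assumes, as in the expected table, that the previous row's maths is applied to the previous and current values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID":    list("00011223"),
    "maths": ["add", "add", "sub", "sub", "add", "mult", "add", "sub"],
    "value": [12, 10, 10, 30, 11, 3, 21, 40],
})

# Previous row's value and operation, restarting at each ID group.
prev_value = df.groupby("ID")["value"].shift(1)
prev_maths = df.groupby("ID")["maths"].shift(1)

# Choose the arithmetic row-by-row; first rows of a group fall through to NaN.
df["result"] = np.select(
    [prev_maths == "add", prev_maths == "sub", prev_maths == "mult"],
    [prev_value + df["value"], prev_value - df["value"], prev_value * df["value"]],
    default=np.nan,
)
```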
Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row that uses the previous two rows, via an "apply" statement. For example, multiply the current row by the previous two rows where they exist (this could be any function).
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
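Since the question says the operation could be any function, the same rolling pattern accepts an arbitrary callable; with raw=True the window arrives as a plain NumPy array. A sketch with a made-up weighted sum in place of the product:

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4]})

# w is a NumPy array of [two rows back, one row back, current row];
# the weights 1, 2, 3 are invented purely for illustration.
df['weighted'] = df['value'].rolling(3).apply(lambda w: w[0] + 2 * w[1] + 3 * w[2], raw=True)
```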
Use the assign function:
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))
I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last two, and joining back to the original dataframe:
df['value']=df.index.map(df.assign(result=df['value'].cumprod(0)).groupby(df.index//4).result.tail(2).to_dict())
If there are just four rows, then combining .cumprod() and .tail() should do:
df['result']=df['value'].cumprod(0).tail(2)
Note that cumprod only reproduces a three-row rolling product here because the first value is 1; for arbitrary data the rolling approach above is the general solution.
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
I want to find the difference between rows prior to and following to the specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this one)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this one)
As you can see, each row in column B equals the difference between the previous and following rows in column A. For example, the second row of column B equals the value in the first row of column A minus the value in the third row of column A. IMPORTANT POINT: I do not need only the immediately previous and following rows; in general I need the difference between the row two before and the row two after. That is, the value in row 23 of column B would equal the value in row 21 of column A minus the value in row 25 of column A. I use the adjacent rows here for simplicity.
I hope that explains it.
Seems like you need a centered rolling window. You can specify that with the arg center=True. Passing raw=True hands the window to the lambda as a plain NumPy array, so the positional indexing s[0] and s[-1] works:
>>> df.A.rolling(3, center=True).apply(lambda s: s[0] - s[-1], raw=True)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2; if you need 5 and 5, then N = 5 (and window = 11), etc. The apply lambda stays the same.
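For instance, with N = 2 on a slightly longer series (the extra values are invented for illustration), each result is the value two rows back minus the value two rows ahead:

```python
import pandas as pd

s = pd.Series([4, 2, 2, 3, 2, 5, 1])
N = 2  # rows to look back and forward

# Window of 2*N + 1 centered on each row; raw=True gives a NumPy array,
# so w[0] is N rows back and w[-1] is N rows ahead.
diff = s.rolling(2 * N + 1, center=True).apply(lambda w: w[0] - w[-1], raw=True)
```

The first and last N rows come out NaN, since their windows are incomplete.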
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (Series) where you want to run your calculation. With shift(1) you get the previous row; with shift(-1) you get the next row. From there you calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
Using Python and pandas: for a given data set, how does one find the total number of missing attributes? I have found the number for each column, but I need to sum across the columns to find the total. Below is the code I have currently used.
def num_missing(x):
    return sum(x.isnull())

print("Missing Values per Column:")
print(data_file1.apply(num_missing))
Consider df -
df
A B C
0 1.0 4 NaN
1 2.0 5 1.0
2 NaN 6 6.0
3 NaN 7 3.0
Column-wise NaN count -
df.isnull().sum(0)
A 2
B 0
C 1
dtype: int64
Row-wise NaN count -
df.isnull().sum(1)
0 1
1 0
2 1
3 1
dtype: int64
df-wide NaN count -
df.isnull().values.sum()
3
Option 1: call .sum() twice, where the second call finds the sum of the intermediate Series.
df = pd.DataFrame(np.ones((5,5)))
df.iloc[2:4, 1:3] = np.nan
df.isnull().sum().sum()
# 4
Option 2: use underlying NumPy array.
np.isnan(df.values).sum()
# 4
Option 2 should be significantly faster (8.5 us vs. 249 us on this sample data).
As noted by @root and here, np.isnan() works only on numeric data, not object dtypes; pandas.DataFrame.isnull() doesn't have this problem.
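A minimal illustration of that caveat, on a made-up frame with one object column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': ['x', None, 'z']})

# np.isnan(df.values) would raise TypeError here, because column B is
# object dtype; isnull() handles mixed dtypes and counts the None too.
total = df.isnull().values.sum()
print(total)  # 2
```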
I have two Pandas DataFrames indexed by a timeline. We'll call the first df_A, in which the 'epoch' corresponds to the index.
df_A:
timeline epoch price z-value
0 1476336104 0 434.313 1
1 1476336120 1 434.312 false
2 1476336134 2 434.312 false
3 1476336149 3 435.900 false
4 1476336165 4 435.900 1
5 1476336178 5 435.500 1
The second, df_B, has entries that may have one, none, or multiple entries per index of df_A, as you can see by the 'epoch' column.
df_B:
timeline epoch send-value tx-in
0 1476336123 1 10000 False
1 1476336169 4 299950000 False
2 1476336187 5 22879033493 False
3 1476336194 5 130000000 False
4 1476336212 7 10000000000 False
How can I merge these on the index of df_A, and add extra values contained in df_B as columns? I'd like to also add a suffix to differentiate the additional columns. The two example datasets should create a new DataFrame, df_AB that looks like this:
timeline epoch price z-value send-value tx-in send-value_1 tx-in_1
0 1476336104 0 434.313 1 NaN NaN NaN NaN
1 1476336120 1 434.312 false 10000 False NaN NaN
2 1476336134 2 434.312 false NaN NaN NaN NaN
3 1476336149 3 435.900 false NaN NaN NaN NaN
4 1476336165 4 435.900 1 299950000 False NaN NaN
5 1476336178 5 435.500 1 22879033493 False 130000000 False
It looks like there are a few different methods that might work, such as reindexing and then merging on 'timeline', or something like merge_asof, but I can't seem to get any of them to produce the result I am looking for.
How can I do this?
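One sketch of an approach, under the assumption that rows should be matched by 'epoch' rather than 'timeline': number the duplicate epochs in df_B with cumcount, unstack them into side-by-side columns, and left-join onto df_A. The `_1` suffix scheme follows the desired output above; the rest of the naming is illustrative.

```python
import pandas as pd

df_A = pd.DataFrame({
    'timeline': [1476336104, 1476336120, 1476336134,
                 1476336149, 1476336165, 1476336178],
    'epoch': [0, 1, 2, 3, 4, 5],
    'price': [434.313, 434.312, 434.312, 435.900, 435.900, 435.500],
    'z-value': [1, 'false', 'false', 'false', 1, 1],
})
df_B = pd.DataFrame({
    'timeline': [1476336123, 1476336169, 1476336187, 1476336194, 1476336212],
    'epoch': [1, 4, 5, 5, 7],
    'send-value': [10000, 299950000, 22879033493, 130000000, 10000000000],
    'tx-in': [False, False, False, False, False],
})

# 0 for the first entry of each epoch, 1 for the second, and so on.
dup = df_B.groupby('epoch').cumcount()

# Unstack so each duplicate number gets its own set of columns.
wide = df_B.drop(columns='timeline').set_index(['epoch', dup]).unstack()
# Flatten ('send-value', 0) -> 'send-value', ('send-value', 1) -> 'send-value_1'.
wide.columns = [name if d == 0 else f'{name}_{d}' for name, d in wide.columns]

# Left-join keeps df_A's rows; epochs only in df_B (like 7) are dropped.
df_AB = df_A.join(wide, on='epoch')
```

The column order comes out grouped by duplicate number per source column; reorder with a reindex on columns if the exact layout above matters.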