Using Python and pandas: for a given data set, how does one find the total number of missing values? I have found the number for each column, but I need to sum the columns to find the total. Below is the code I currently use.
def num_missing(x):
    return sum(x.isnull())
print("Missing Values per Column:")
print(data_file1.apply(num_missing))
Consider df -
df
A B C
0 1.0 4 NaN
1 2.0 5 1.0
2 NaN 6 6.0
3 NaN 7 3.0
Column-wise NaN count -
df.isnull().sum(0)
A 2
B 0
C 1
dtype: int64
Row-wise NaN count -
df.isnull().sum(1)
0 1
1 0
2 1
3 1
dtype: int64
df-wide NaN count -
df.isnull().values.sum()
3
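For reference, here is a self-contained sketch that rebuilds the sample df above and computes all three counts. The variable names (per_column, per_row, total) are mine, chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the four-row sample frame shown above.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, np.nan],
    "B": [4, 5, 6, 7],
    "C": [np.nan, 1.0, 6.0, 3.0],
})

per_column = df.isnull().sum(axis=0)   # NaN count for each column
per_row = df.isnull().sum(axis=1)      # NaN count for each row
total = int(df.isnull().values.sum())  # NaN count for the whole frame

print(per_column)
print(per_row)
print(total)
```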
Option 1: call .sum() twice, where the second call finds the sum of the intermediate Series.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((5,5)))
df.iloc[2:4, 1:3] = np.nan
df.isnull().sum().sum()
# 4
Option 2: use underlying NumPy array.
np.isnan(df.values).sum()
# 4
Option 2 should be significantly faster (8.5 us vs. 249 us on this sample data).
As noted by @root, np.isnan() works only on numeric data, not object dtypes (it raises a TypeError on an object array). pandas.DataFrame.isnull() doesn't have this problem.
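A small sketch of that caveat, assuming a made-up frame (mixed) with one numeric and one string column. The frame and the flag name isnan_failed are hypothetical and only serve to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a float column and a string column.
mixed = pd.DataFrame({"num": [1.0, np.nan], "txt": ["a", None]})

# isnull() handles both columns, so the double sum counts every missing cell.
total_missing = int(mixed.isnull().sum().sum())

# mixed.values is an object array here, which np.isnan cannot process.
try:
    np.isnan(mixed.values)
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(total_missing, isnan_failed)
```

So Option 2 is worth the speed-up only when every column is known to be numeric.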
Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row that uses the previous two rows, via an "apply" statement. For example, say I want to multiply the current row by the previous 2 rows, if they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.
You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
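Here is a runnable sketch of the same idea on a five-row frame (one row more than in the question, which is my own addition), so that the window of 3 is visible on the last row. Passing np.prod with raw=True is a choice I made here so that each window arrives as a NumPy array.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one extra row (an assumption for illustration).
df = pd.DataFrame({"value": [1, 2, 3, 4, 5]})

# Product of the current row and the two rows before it.
# The first two rows have an incomplete window, so they come out as NaN.
df["result"] = df["value"].rolling(3).apply(np.prod, raw=True)

print(df)
```

Any other function that reduces a window to a single number can be passed in place of np.prod.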
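Use the assign function (this matches the expected output only for this four-row sample, because cumprod multiplies all preceding rows, not just the previous two):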
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))
I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last 2 rows of each group, and joining the result back to the original dataframe.
df['result']=df.index.map(df.assign(result=df['value'].cumprod(0)).groupby(df.index//4).result.tail(2).to_dict())
If there are just four rows, then this should do.
Let's try combining .cumprod() and .tail():
df['result']=df['value'].cumprod(0).tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
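If the function really is a product over a fixed number of rows, a shift-based version may be worth trying on a large frame, since it avoids calling a Python function per window. This is only a sketch on a five-row sample (the fifth row is my addition), and it also shows where a plain cumprod starts to differ.

```python
import pandas as pd

# Sample data from the question plus one extra row (an assumption for illustration).
df = pd.DataFrame({"value": [1, 2, 3, 4, 5]})
s = df["value"]

# Current row times the previous row times the row before that.
# shift() introduces NaN for the first two rows, which propagates to the result.
df["result"] = s * s.shift(1) * s.shift(2)

# For comparison: cumprod keeps multiplying every earlier row.
cumulative = s.cumprod()

print(df)
print(cumulative)
```

On the fifth row the windowed product is 60 while the cumulative product is 120, so the two only agree on the first four rows.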
The actual dataframe consists of more than a million rows.
Say for example a dataframe is:
UniqueID Code Value OtherData
1 A 5 Z01
1 B 6 Z02
1 C 7 Z03
2 A 10 Z11
2 B 11 Z24
2 C 12 Z23
3 A 10 Z21
4 B 8 Z10
I want to obtain the ratio of A/B for each UniqueID and put it in a new dataframe. For example, for UniqueID 1, the ratio A/B = 5/6.
What is the most efficient way to do this in Python?
Want:
UniqueID RatioAB
1 5/6
2 10/11
3 Inf
4 0
Thank you.
One approach is to use pivot_table, aggregating with sum in case there are multiple occurrences of the same letter (otherwise a simple pivot will do), and then evaluating A/B on the resulting columns:
df.pivot_table(index='UniqueID', columns='Code', values='Value', aggfunc='sum').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
If there is at most one occurrence of each letter per group:
df.pivot(index='UniqueID', columns='Code', values='Value').eval('A/B')
UniqueID
1 0.833333
2 0.909091
3 NaN
4 NaN
dtype: float64
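The outputs above give NaN for UniqueID 3 and 4, whereas the question asked for Inf and 0. One possible way to get there, sketched under the assumption that a missing code should count as 0, is to pass fill_value=0 to pivot_table. The names wide and ratio are mine.

```python
import numpy as np
import pandas as pd

# Sample data from the question (OtherData omitted because it is not needed here).
df = pd.DataFrame({
    "UniqueID": [1, 1, 1, 2, 2, 2, 3, 4],
    "Code": list("ABCABCAB"),
    "Value": [5, 6, 7, 10, 11, 12, 10, 8],
})

# Assumption: a code that is absent for a UniqueID is treated as 0.
wide = df.pivot_table(index="UniqueID", columns="Code", values="Value",
                      aggfunc="sum", fill_value=0)

# 10 / 0 becomes inf and 0 / 8 becomes 0, matching the "Want" table.
ratio = (wide["A"] / wide["B"]).rename("RatioAB").reset_index()

print(ratio)
```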
If you only care about A/B ratio:
df1 = df[df['Code'].isin(['A','B'])][['UniqueID', 'Code', 'Value']]
df1 = df1.pivot(index='UniqueID',
                columns='Code',
                values='Value')
df1['RatioAB'] = df1['A']/df1['B']
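The most apparent way is via groupby. Note that this raises an IndexError for any UniqueID that lacks an A or a B row (such as 3 and 4 in the sample), so those groups need to be filtered out or handled first.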
df.groupby('UniqueID').apply(lambda g: g.query("Code == 'A'")['Value'].iloc[0] / g.query("Code == 'B'")['Value'].iloc[0])
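A sketch of a groupby version that tolerates a missing A or B. The helper name ratio_ab is hypothetical, it assumes each code appears at most once per UniqueID, and returning NaN for incomplete groups is my own choice rather than something the question specifies.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "UniqueID": [1, 1, 1, 2, 2, 2, 3, 4],
    "Code": list("ABCABCAB"),
    "Value": [5, 6, 7, 10, 11, 12, 10, 8],
})

def ratio_ab(group):
    """Return A/B for one UniqueID, or NaN when either code is absent."""
    values = group.set_index("Code")["Value"]
    if "A" in values.index and "B" in values.index:
        return values["A"] / values["B"]
    return np.nan

# Select the needed columns so the helper only sees Code and Value.
result = df.groupby("UniqueID")[["Code", "Value"]].apply(ratio_ab)

print(result)
```

With a million rows this will likely be slower than the pivot approaches, because the helper runs once per group in Python.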
I want to find the difference between the rows before and after a specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this row)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this row)
As you can see, each row in column B equals the difference between the previous and following rows in column A. For example, the second row in column B equals the difference between the value in the first row of column A and the value in the third row of column A. IMPORTANT POINT: I do not need only the previous and following rows. I need the difference between the row 2 before and the row 2 after. That is, the value in row number 23 of column B should equal the difference between the value in row number 21 of column A and the value in row number 25 of column A. I used the previous and following rows above for simplicity.
I hope I have explained it clearly.
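Seems like you need a centered rolling window. You can specify that with the argument center=True. Passing raw=True hands each window to the lambda as a NumPy array, so s[0] and s[-1] are positional.
>>> df.A.rolling(3, center=True).apply(lambda s: s[0]-s[-1], raw=True)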
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2. If you need 5 and 5, then N = 5 (and window = 11), etc. The apply lambda stays the same.
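To make the window = 2*N + 1 rule concrete, here is a sketch wrapped in a helper. The function name centered_difference and the seven-value series are made up for illustration.

```python
import pandas as pd

def centered_difference(series, n):
    """Value n rows back minus value n rows ahead, via a centered window."""
    window = 2 * n + 1
    # raw=True passes each window as a NumPy array, so w[0] and w[-1]
    # are the first and last elements of the window.
    return series.rolling(window, center=True).apply(
        lambda w: w[0] - w[-1], raw=True
    )

# Hypothetical data, long enough to show N = 2.
a = pd.Series([4, 2, 2, 3, 2, 7, 1])
b = centered_difference(a, 2)

print(b)
```

With N = 2 the first two and last two rows have an incomplete window and come out as NaN.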
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (series) where you want to run your calculation.
With shift(1) you get the previous row; with shift(-1) you get the next row.
From there you need to calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
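Since the question mentions needing 2 rows before and 2 rows after, the same shift idea can be parameterised. A minimal sketch, with a helper name (shifted_difference) that I made up:

```python
import pandas as pd

def shifted_difference(series, n=1):
    """Value n rows before minus value n rows after, NaN near the edges."""
    return series.shift(n) - series.shift(-n)

s = pd.Series([4, 2, 2, 3, 2])

b1 = shifted_difference(s, 1)  # previous row minus next row
b2 = shifted_difference(s, 2)  # two rows back minus two rows ahead

print(b1)
print(b2)
```

On a five-row series only the middle row has both a row 2 before and a row 2 after, so b2 has a single non-NaN value.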
I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa=pd.DataFrame([{'user':'F','time':0},
{'user':'T','time':0},
{'user':'T','time':0},
{'user':'T','time':1},
{'user':'B','time':1},
{'user':'K','time':2},
{'user':'J','time':2},
{'user':'T','time':3},
{'user':'J','time':4},
{'user':'B','time':4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more important, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
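One way to look at the difference, offered as a sketch rather than a full explanation: sum reduces each group to a single value, so the result is indexed by user, while diff produces one value per original row and keeps the original index. The comparison against a per-group shift below is my own illustration of that.

```python
import pandas as pd

aa = pd.DataFrame({
    "user": list("FTTTBKJTJB"),
    "time": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4],
})

grouped = aa.groupby("user")["time"]

summed = grouped.sum()   # one row per user (a reduction)
diffed = grouped.diff()  # one row per original row (same index as aa)

# The same numbers by hand: each row minus the previous row of the same user.
manual = aa["time"] - grouped.shift(1)

print(summed)
print(diffed)
```

Because diffed shares the index of aa, it can be assigned straight back as a new column, which is what aa['diff'] = ... does.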
I have two pandas series objects with slightly different indexes. I want to divide one series by another. The default method gives me NAs when one of the two series is missing an indexed element. There is an option to fill missing values, but it can only be set to one value. I want to fill a value based on which series is missing the value.
For example
series1
0 10
1 20
2 30
3 40
series2
1 2
2 3
3 4
4 5
expected result: series1.divide(series2)
0 inf
1 10
2 10
3 10
4 0
actual result: series1.divide(series2)
0 NaN
1 10
2 10
3 10
4 NaN
Is there an easy way to do this?
You could use reindex to expand series1.index to include series2.index, filling missing values with 0. Then you could use the div method, which fills in missing values with NaN by default:
series1 = pd.Series([10,20,30,40], index=[0,1,2,3])
series2 = pd.Series([2,3,4,5], index=[1,2,3,4])
series1 = series1.reindex(series1.index.union(series2.index), fill_value=0)
print(series1.div(series2))
# 0 nan
# 1 10.00000
# 2 10.00000
# 3 10.00000
# 4 0.00000
# dtype: float64
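In the output above, index 0 is still NaN rather than the inf the question asked for, because series2 has no value there. A possible variation, assuming a missing value on either side should be treated as 0, is to reindex both series before dividing. The names full_index, num, den, and alt are mine.

```python
import numpy as np
import pandas as pd

series1 = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])
series2 = pd.Series([2, 3, 4, 5], index=[1, 2, 3, 4])

# Align both series on the union of their indexes, filling gaps with 0.
full_index = series1.index.union(series2.index)
num = series1.reindex(full_index, fill_value=0)
den = series2.reindex(full_index, fill_value=0)

# 10 / 0 gives inf at index 0, and 0 / 5 gives 0 at index 4.
result = num.div(den)

# With 0 as the fill on both sides, div's own fill_value should give the same.
alt = series1.div(series2, fill_value=0)

print(result)
```

If the two sides needed different fill values, the separate reindex calls are where to set them.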