df.mean returns imaginary numbers - python

I have a dataframe with around 50 columns and around 3000 rows. Most cells are empty, but not all of them. I am trying to add a new row at the end of the dataframe with the mean value of each column, and I need it to ignore the empty cells.
I am using df.mean(axis=0), which somehow turns all values of the dataframe into imaginary numbers. All values stay the same, but a +0j is added. I have no idea why.
Turbine.loc['Mean_Values'] = Turbine.mean(axis=0)
I couldn't find a solution for this. Is it because of the empty cells?

Based on this, df.mean() automatically skips NaN/null values because the default parameter value is skipna=True. Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, np.nan, 5, 6, np.nan]})
# DataFrame.append was removed in pandas 2.0; add the mean as a new row via loc
df.loc[len(df)] = df.mean(numeric_only=True)
print(df)
Output:
value
0 1.0
1 2.0
2 3.0
3 NaN
4 5.0
5 6.0
6 NaN
7 3.4
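For contrast, a minimal sketch (with an illustrative Series) of what happens when skipna is turned off:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, 5, 6, np.nan])
print(s.mean())              # 3.4 (NaN values skipped; skipna=True is the default)
print(s.mean(skipna=False))  # nan (a single NaN poisons the result)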
But if there is a complex number in a cell, the whole column (and hence the result of df.mean()) is cast to complex. Example:
df = pd.DataFrame({'value': [1, 2, 3, np.nan, 5, 6, np.nan, complex(1, 0)]})
print(df)
print('\n')
df.loc[len(df)] = df.mean(numeric_only=True)
print(df)
Output with a complex value in a cell:
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
value
0 (1+0j)
1 (2+0j)
2 (3+0j)
3 NaN
4 (5+0j)
5 (6+0j)
6 NaN
7 (1+0j)
8 (3+0j)
Hope this can help you :)

It turned out that some cells had information about directions (north, west, ...) in them, which were interpreted as imaginary numbers.
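If you hit the same problem, here is a minimal sketch for tracking down such columns and keeping only their real part; the Turbine frame and its stray complex entry are just stand-ins for illustration:
import pandas as pd

# Stand-in frame: one stray complex entry forces the whole column to complex dtype
Turbine = pd.DataFrame({'speed': [1.0, 2.0, complex(5, 0)]})

# List the columns that ended up complex
complex_cols = Turbine.select_dtypes(include='complex').columns
print(list(complex_cols))  # ['speed']

# Keep only the real part (safe when every imaginary part is 0)
for col in complex_cols:
    Turbine[col] = Turbine[col].to_numpy().real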

Related

How to replace missing values with group mode in Pandas?

I followed the method in this post to replace missing values with the group mode, but encountered an "IndexError: index out of bounds".
df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x.mode()[0]))
I guess this is probably because some groups have all missing values and do not have a mode. Is there a way to get around this? Thank you!
mode is quite difficult, given that there really isn't any agreed-upon way to deal with ties, and it's typically very slow. Here's one way that will be "fast": we'll define a function that calculates the mode for each group, then fill the missing values afterwards with a map. We don't run into issues with all-NaN groups, though for ties we arbitrarily choose the modal value that comes first when sorted:
def fast_mode(df, key_cols, value_col):
    """
    Calculate a column mode, by group, ignoring null values.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calculate the mode.
    key_cols : list of str
        Columns to group by for calculation of mode.
    value_col : str
        Column for which to calculate the mode.

    Returns
    -------
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If there are
        ties, returns the one which is sorted first.
    """
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              .sort_values('counts', ascending=False)
              .drop_duplicates(subset=key_cols)).drop(columns='counts')
Sample data df:
CIK SIK
0 C 2.0
1 C 1.0
2 B NaN
3 B 3.0
4 A NaN
5 A 3.0
6 C NaN
7 B NaN
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
Code:
df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)
Output df:
CIK SIK
0 C 2.0
1 C 1.0
2 B 3.0
3 B 3.0
4 A 2.0
5 A 3.0
6 C 1.0
7 B 3.0
8 C 1.0
9 A 2.0
10 D NaN
11 D NaN
12 D NaN
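For completeness, a shorter (if slower) fix for the original IndexError is simply to guard against groups whose mode is empty before indexing into it; a sketch using the question's column names:
import numpy as np
import pandas as pd

df = pd.DataFrame({'CIK': ['C', 'C', 'B', 'B', 'A', 'D'],
                   'SIC': [2, 1, np.nan, 3, 3, np.nan]})

# mode() returns an empty Series for an all-NaN group, so check before taking [0]
df['SIC'] = df.groupby('CIK')['SIC'].transform(
    lambda s: s.fillna(s.mode()[0]) if not s.mode().empty else s)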

Compute difference between rows prior to and following a specific row - pandas

I want to find the difference between the rows prior to and following a specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this row)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this row)
As you can see, each row in column B equals the difference between the previous and following rows in column A. For example, the second row in column B equals the difference between the value in the first row of column A and the value in the third row of column A. IMPORTANT POINT: I do not need only the immediately previous and following rows. I need the difference between the 2 rows before and the 2 rows after: for instance, the value in row number 23 of column B would equal the difference between the value in row number 21 of column A and the value in row number 25 of column A. I used the immediately previous and following rows above for simplicity.
I hope I could explain it.
Seems like you need a centered rolling window. You can specify that with the argument center=True. Passing raw=True hands the lambda a plain numpy array, so s[0] and s[-1] index by position:
>>> df.A.rolling(3, center=True).apply(lambda s: s[0] - s[-1], raw=True)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2; if you need 5 and 5, N = 5 (and window = 11), etc. The apply lambda stays the same.
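For instance, a sketch of the question's N = 2 case (data made up to match the example's shape):
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2, 3, 2, 5, 1]})

N = 2
window = 2 * N + 1  # = 5, centered on the current row

# raw=True passes each window as a plain numpy array, so indexing is positional
df['B'] = df.A.rolling(window, center=True).apply(lambda s: s[0] - s[-1], raw=True)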
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (series) where you want to run your calculation.
With shift(1) you get the previous row, with shift(-1) you get the next row.
From there you calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
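The shift approach generalizes the same way; a sketch for the question's 2-back/2-forward case:
import pandas as pd

df = pd.DataFrame({'A': [4, 2, 2, 3, 2, 5, 1]})

# For N rows back and N rows forward, shift by N in each direction
N = 2
df['B'] = df['A'].shift(N) - df['A'].shift(-N)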

Baffled by dataframe groupby.diff()

I have just read this question:
In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column?
and I am completely baffled by the answer. How does this work???
I mean, when I groupby('user') shouldn't the result be, well, grouped by user?
Whatever the function I use (mean, sum etc) I would expect a result like this:
aa = pd.DataFrame([{'user': 'F', 'time': 0},
                   {'user': 'T', 'time': 0},
                   {'user': 'T', 'time': 0},
                   {'user': 'T', 'time': 1},
                   {'user': 'B', 'time': 1},
                   {'user': 'K', 'time': 2},
                   {'user': 'J', 'time': 2},
                   {'user': 'T', 'time': 3},
                   {'user': 'J', 'time': 4},
                   {'user': 'B', 'time': 4}])
aa2=aa.groupby('user')['time'].sum()
print(aa2)
user
B 5
F 0
J 6
K 2
T 4
Name: time, dtype: int64
How does diff() instead return a diff of each row with the previous, within each group?
aa['diff']=aa.groupby('user')['time'].diff()
print(aa)
time user diff
0 0 F NaN
1 0 T NaN
2 0 T 0.0
3 1 T 1.0
4 1 B NaN
5 2 K NaN
6 2 J NaN
7 3 T 2.0
8 4 J 2.0
9 4 B 3.0
And more importantly, how is the result not a unique list of 'user' values?
I found many answers that use groupby.diff() but none of them explain it in detail. It would be extremely useful to me, and hopefully to others, to understand the mechanics behind it. Thanks.
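One way to see the mechanics (a sketch on a trimmed-down version of the question's data): groupby operations come in two flavors. Aggregations such as sum() reduce each group to a single value, so the result is indexed by the unique group keys. diff() is a transformation: it is computed within each group but returns one value per original row, aligned on the original index, which is why it can be assigned straight back as a new column and why the users are not collapsed to a unique list.
import pandas as pd

aa = pd.DataFrame({'user': ['F', 'T', 'T', 'T', 'B'],
                   'time': [0, 0, 0, 1, 1]})

# Aggregation: one row per group, indexed by 'user'
print(aa.groupby('user')['time'].sum())

# Transformation: one value per original row; NaN marks each group's first row
print(aa.groupby('user')['time'].diff())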

Take first 6 digits of Pandas Column

I have a task to take the first 6 digits of a column in pandas. However, if a number is less than 6 digits long, my current approach leaves a decimal at the end of it. Unfortunately, this is not acceptable for my needs later down the road.
I'm sure I can get rid of the decimal with various code, but it will probably be inefficient as DataFrames get larger.
Current code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 5, np.NaN],
                    'B': [1, 0, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                    'C': [10, 0, 30, 50, 0, 0, 4, 10, 1, 0],
                    'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.NaN],
                    'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
wow2 = df1
wow2['D'] = wow2['D'][:6]
print(wow2)
A B C D E
0 NaN 1.0 10 123456 Assign
1 NaN 0.0 0 123456 Unassign
2 3.0 3.0 30 123456 Assign
3 4.0 5.0 50 123456 Ugly
4 5.0 0.0 0 12345. Appreciate <--- Notice Decimal
5 5.0 0.0 0 12345. Undo <--- Notice Decimal
6 3.0 NaN 4 NaN Assign
7 1.0 9.0 10 NaN Unicycle
8 5.0 0.0 1 NaN Assign
9 NaN 0.0 0 NaN Unicorn
Is there a way I can leave the digit alone if its length is not over 6? I thought about converting the column to string and doing a loop... but I believe that would be wildly inefficient and create more problems than solutions.
To trim the numbers without converting to string and back, you may use the modulo operator. Note that x % 10**6 keeps the last 6 digits rather than the first 6 (1234567 becomes 234567, as in the output below); for the leading digits you would integer-divide by a suitable power of 10 instead.
In order to represent your numeric values as non-floating-point numbers, you need to convert them into integers. However, mixing integers and np.NaN within the same column will result in float64 (see here for more). To get around this (which is kind of ugly), you need to convert the integers into strings, which forces the dtype to be object because you mix strings and float values.
The solution looks like the following:
wow2['D'] = wow2['D'].mod(10**6)\
                     .dropna()\
                     .astype(int)\
                     .astype(str)
print(wow2['D'])
0 123456
1 123456
2 234567
3 345678
4 12345
5 12345
6 345678
7 456789
8 234567
9 NaN
Name: D, dtype: object
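As an aside, newer pandas versions offer a less ugly way around the int-plus-NaN problem: the nullable Int64 dtype holds missing values without forcing float64 or strings. A sketch:
import numpy as np
import pandas as pd

s = pd.Series([123456, 1234567, np.nan])

trimmed = s.mod(10**6).astype('Int64')  # missing values become <NA>, the rest stay integers
print(trimmed)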

Using fillna() selectively in pandas

I would like to fill N/A values in a DataFrame in a selective manner. In particular, if there is a sequence of consecutive nans within a column, I want them to be filled by the preceding non-nan value, but only if the length of the nan sequence is below a specified threshold. For example, if the threshold is 3, then a within-column sequence of 3 or fewer nans will be filled with the preceding non-nan value, whereas a sequence of 4 or more nans will be left as is.
That is, if the input DataFrame is
2 5 4
nan nan nan
nan nan nan
5 nan nan
9 3 nan
7 9 1
I want the output to be:
2 5 4
2 5 nan
2 5 nan
5 5 nan
9 3 nan
7 9 1
The fillna function, when applied to a DataFrame, has the method and limit options. But these are unfortunately not sufficient to achieve the task. I tried to specify method='ffill' and limit=3, but that fills in the first 3 nans of any sequence, not selectively as described above.
I suppose this can be coded by going column by column with some conditional statements, but I suspect there must be something more Pythonic. Any suggestions on an efficient way to achieve this?
Working with contiguous groups is still a little awkward in pandas.. or at least I don't know of a slick way to do this, which isn't at all the same thing. :-)
One way to get what you want would be to use the compare-cumsum-groupby pattern:
In [68]: nulls = df.isnull()
...: groups = (nulls != nulls.shift()).cumsum()
...: to_fill = groups.apply(lambda x: x.groupby(x).transform(len) <= 3)
...: df.where(~to_fill, df.ffill())
...:
Out[68]:
0 1 2
0 2.0 5.0 4.0
1 2.0 5.0 NaN
2 2.0 5.0 NaN
3 5.0 5.0 NaN
4 9.0 3.0 NaN
5 7.0 9.0 1.0
Okay, another alternative which I don't like because it's too tricky:
def method_2(df):
    nulls = df.isnull()
    filled = df.ffill(limit=3)
    unfilled = nulls & (~filled.notnull())
    nf = nulls.replace({False: 2.0, True: np.nan})
    do_not_fill = nf.combine_first(unfilled.replace(False, np.nan)).bfill() == 1
    return df.where(do_not_fill, df.ffill())
This doesn't use any groupby tools and so should be faster. Note that a different approach would be to manually (using shifts) determine which elements are to be filled because they're a group of length 1, 2, or 3.
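A sketch of that shift-based idea for a threshold of 3 (the helper name is hypothetical): a null cell belongs to a run of length 4 or more exactly when some window of 4 consecutive nulls covers it, and such windows can be found with shifts alone.
import pandas as pd

def fill_short_runs(df, limit=3):
    n = df.isnull()
    # w is True at position i when i and the `limit` positions before it are
    # all null, i.e. a null run of length limit + 1 ends at i
    w = n.copy()
    for k in range(1, limit + 1):
        w &= n.shift(k, fill_value=False)
    # A cell sits in a long run when any such window covers it
    in_long = w.copy()
    for k in range(1, limit + 1):
        in_long |= w.shift(-k, fill_value=False)
    # Forward-fill only the nulls that are outside long runs
    return df.where(~(n & ~in_long), df.ffill())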
