I have the following dataframe.
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
5 NaN NaN NaN NaN
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
11 NaN NaN NaN NaN
12 4.43 29.724285 126.620003 24.764999
13 NaN NaN NaN NaN
14 4.29 29.010000 120.309998 24.730000
15 4.11 29.420000 119.480003 25.035000
I want to split this df into multiple dfs wherever a row is all NaN.
I explored the following links but could not figure out how to apply them to my problem.
Split pandas dataframe in two if it has more than 10 rows
Splitting dataframe into multiple dataframes
In my example, I would have 4 dataframes with 5,5,1 and 2 rows as the output.
Please suggest the way forward.
Using isna, all, cumsum and groupby.
First we check whether all the values in a row are NaN, then use cumsum to build a group indicator, and finally we collect the resulting dataframes in a list with groupby:
grps = df.isna().all(axis=1).cumsum()
dfs = [df.dropna() for _, df in df.groupby(grps)]
for df in dfs:
    print(df)
a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001
a b c d
12 4.43 29.724285 126.620003 24.764999
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035
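For reference, this is what the intermediate grps indicator works out to on the sample data (the all-NaN rows at positions 5, 11 and 13 each start a new group), shown here just as a quick sanity check:
grps.tolist()
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3]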
Something like this should do the trick:
import pandas as pd
import numpy as np
data_frame = pd.DataFrame({"a": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "b": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "c": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "d": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "e": [1, np.nan, 3, np.nan, 4, np.nan, 5],
                           "f": [1, np.nan, 3, np.nan, 4, np.nan, 5]})

all_nan = data_frame.index[data_frame.isnull().all(axis=1)]

df_list = []
prev = 0
for i in all_nan:
    df_list.append(data_frame[prev:i])
    prev = i + 1
df_list.append(data_frame[prev:])  # keep the chunk after the last all-NaN row

for i in df_list:
    print(i)
Just another flavor of doing the same thing:
nan_indices = df.index[df.isna().all(axis=1)]
df_list = [df.dropna() for df in np.split(df, nan_indices)]
df_list
[ a b c d
0 4.65 30.572857 133.899994 23.705000
1 4.77 30.625713 134.690002 23.225000
2 4.73 30.138571 132.250000 23.040001
3 5.07 30.082857 130.000000 23.290001
4 4.98 30.282858 133.520004 23.389999,
a b c d
6 4.82 29.674286 127.349998 23.700001
7 4.83 30.092857 129.110001 24.254999
8 4.85 29.918571 127.349998 24.695000
9 4.70 29.418571 127.139999 24.424999
10 4.69 30.719999 127.610001 25.200001,
a b c d
12 4.43 29.724285 126.620003 24.764999,
a b c d
14 4.29 29.01 120.309998 24.730
15 4.11 29.42 119.480003 25.035]
I have a pandas dataframe with multiple columns and I would like to create a new dataframe by flattening all columns into one using the melt function. But I do not want the column names from the original dataframe to be a part of the new dataframe.
Below is the sample dataframe and code. Is there a way to make it more concise?
date Col1 Col2 Col3 Col4
1990-01-02 12:00:00 24 24 24.8 24.8
1990-01-02 01:00:00 59 58 60 60.3
1990-01-02 02:00:00 43.7 43.9 48 49
The output desired:
Rates
0 24
1 59
2 43.7
3 24
4 58
5 43.9
6 24.8
7 60
8 48
9 24.8
10 60.3
11 49
Code:
df = df.melt(var_name='ColumnNames', value_name='Rates')  # using melt to flatten the columns
df.drop(['ColumnNames'], axis=1, inplace=True)  # dropping 'ColumnNames'
Set value_name and value_vars params for your purpose:
In [137]: pd.melt(df, value_name='Price', value_vars=df.columns[1:]).drop('variable', axis=1)
Out[137]:
Price
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
As an alternative you can use stack() and transpose():
dfx = df.T.stack().reset_index(drop=True) #date must be index.
Output:
0
0 24.0
1 59.0
2 43.7
3 24.0
4 58.0
5 43.9
6 24.8
7 60.0
8 48.0
9 24.8
10 60.3
11 49.0
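The "date must be index" comment matters: if date is still an ordinary column, a minimal sketch of the same idea (assuming the column is literally named 'date', as in the sample) would move it into the index first so that only the value columns are stacked:
dfx = df.set_index('date').T.stack().reset_index(drop=True)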
I'm trying to run an ANOVA test on a dataframe that looks like this:
>>>code 2020-11-01 2020-11-02 2020-11-03 2020-11-04 ...
0 1 22.5 73.1 12.2 77.5
1 1 23.1 75.4 12.4 78.3
2 2 43.1 72.1 13.4 85.4
3 2 41.6 85.1 34.1 96.5
4 3 97.3 43.2 31.1 55.3
5 3 12.1 44.4 32.2 52.1
...
I want to calculate a one-way ANOVA for each column, grouped by code. For that I have used statsmodels and a for loop:
keys = []
tables = []
for variable in df.columns[1:]:
    model = ols('{} ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)
    keys.append(variable)
    tables.append(anova_table)
df_anova = pd.concat(tables, keys=keys, axis=0)
df_anova
The problem is that I keep getting an error for the 4th line:
PatsyError: numbers besides '0' and '1' are only allowed with **
2020-11-01 ~ code
^^^^
I have tried to use the Q argument as suggested here:
...
model = ols('{Q(x)} ~ code'.format(x=variable), data=df).fit()
KeyError: 'Q(x)'
I have also tried to place the Q outside the braces but got the same error.
My end goal: to calculate a one-way ANOVA for each day (each column) based on the "code" column.
You can pivot the data to long format and skip the iteration over columns:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
df = pd.DataFrame({"code":[1,1,2,2,3,3],
"2020-11-01":[22.5,23.1,43.1,41.6,97.3,12.1],
"2020-11-02":[73.1,75.4,72.1,85.1,43.2,44.4]})
df_long = df.melt(id_vars="code")
df_long
code variable value
0 1 2020-11-01 22.5
1 1 2020-11-01 23.1
2 2 2020-11-01 43.1
3 2 2020-11-01 41.6
4 3 2020-11-01 97.3
5 3 2020-11-01 12.1
6 1 2020-11-02 73.1
7 1 2020-11-02 75.4
8 2 2020-11-02 72.1
9 2 2020-11-02 85.1
10 3 2020-11-02 43.2
11 3 2020-11-02 44.4
Then applying your code:
tables = []
keys = df_long.variable.unique()
for D in keys:
    model = ols('value ~ code', data=df_long[df_long.variable == D]).fit()
    anova_table = sm.stats.anova_lm(model)
    tables.append(anova_table)
pd.concat(tables, keys=keys)
Or simply:
def aov_func(x):
    model = ols('value ~ code', data=x).fit()
    return sm.stats.anova_lm(model)

df_long.groupby("variable").apply(aov_func)
Gives this result:
df sum_sq mean_sq F PR(>F)
variable
2020-11-01 code 1.0 1017.6100 1017.610000 1.115768 0.350405
Residual 4.0 3648.1050 912.026250 NaN NaN
2020-11-02 code 1.0 927.2025 927.202500 6.194022 0.067573
Residual 4.0 598.7725 149.693125 NaN NaN
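As a side note on the original error: if you would rather keep the wide layout and loop over columns as in the question, the usual fix is to quote the column name inside patsy's Q(...) in the formula string. A rough sketch of that spelling, assuming the same df and imports as in the question:
for variable in df.columns[1:]:
    # Q("...") lets patsy handle column names that are not valid Python identifiers
    model = ols('Q("{}") ~ code'.format(variable), data=df).fit()
    anova_table = sm.stats.anova_lm(model)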
I have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.nan, index=range(1,16), columns=['A','B','C','D','E','F','G','H'])
a = [1550, 41, 9.41, 22.6, 4.74, 3.2, 11.64, 2.23]
b = [1540, 43, 9.41, 22.3, 4.84, 3.12, 11.64, 2.23]
c = [1590, 39, 9.41, 23.7, 4.74, 3.0, 11.64, 2.23]
d = [1540, 41, 9.41, 22.5, 4.74, 3.2, 11.64, 2.23]
df.loc[[1,8,13,15],:] = [a,b,c,d]
Looking like this:
A B C D E F G H
1 1550.0 41.0 9.41 22.6 4.74 3.20 11.64 2.23
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN
8 1540.0 43.0 9.41 22.3 4.84 3.12 11.64 2.23
9 NaN NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN
13 1590.0 39.0 9.41 23.7 4.74 3.00 11.64 2.23
14 NaN NaN NaN NaN NaN NaN NaN NaN
15 1540.0 41.0 9.41 22.5 4.74 3.20 11.64 2.23
I want the null values to be filled with:
"Average(All the preceding values before null, first non-null succeeding value after null)"
Note: if the first succeeding value after the null is also null, then the code should look for the first succeeding value that is not null.
Example:
Row 2 of column A should be filled with Average(1550, 1540) = 1545.
Here "all the preceding values before the null" = 1550, and the "first non-null succeeding value after the null" = 1540.
Similarly,
row 3 of column A should be filled with Average(1550, 1545, 1540) = 1545.
Here all the preceding values before the null are 1550 and 1545 (1545 is what we found in the step above),
and the first non-null succeeding value after the null is again 1540.
It goes on like that, and row 9 of column A should be filled with
Average(all the values before the null, 1590), where 1590 is now the first non-null succeeding value after the null.
So at the end my desired output in Column A looks like this:
Desired Output Example for A column:
Row A
1 1550
2 1545
3 1545
4 1545
5 1545
6 1545
7 1545
8 1540
9 1550
10 1550
11 1550
12 1550
13 1590
14 1549.285
15 1540
Similarly, I want the null values to be filled for all the other columns as well.
Since I am new to Python, I don't know how to write code for this.
Any help on the code is much appreciated.
This is a very similar post to this, but I suppose this is different enough (and the operation convoluted enough) to warrant a different answer.
You can define an apply function to use for every row:
def foo(row):
    if any(row.isna()):
        next_non_null = df.loc[df.index > row.name, row.isna()].dropna(how='all').index[0]
        df.loc[row.name, row.isna()] = df.expanding().mean().loc[next_non_null, :]
The basic logic is this: loop over df and look at every row. For every row:
- check if there are any missing values in the row (this can save time, see the post linked above)
- if there are, find the index of the next non-null entry for those missing values: take the rows of df that come after the current row, restrict to the columns that are missing, drop the rows that are entirely empty there, and take the first index
- rewrite the current row's empty values with those from the expanding mean of df at that first non-null index
Applying this function is then simply:
df.apply(foo, axis=1)
Converting df into:
A B C D E F G H
1 1550.000000 41.000000 9.41 22.600000 4.740000 3.200000 11.64 2.23
2 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
3 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
4 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
5 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
6 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
7 1545.000000 42.000000 9.41 22.450000 4.790000 3.160000 11.64 2.23
8 1540.000000 43.000000 9.41 22.300000 4.840000 3.120000 11.64 2.23
9 1550.000000 41.666667 9.41 22.588889 4.784444 3.142222 11.64 2.23
10 1550.000000 41.666667 9.41 22.588889 4.784444 3.142222 11.64 2.23
11 1550.000000 41.666667 9.41 22.588889 4.784444 3.142222 11.64 2.23
12 1550.000000 41.666667 9.41 22.588889 4.784444 3.142222 11.64 2.23
13 1590.000000 39.000000 9.41 23.700000 4.740000 3.000000 11.64 2.23
14 1549.285714 41.619048 9.41 22.582540 4.781270 3.146349 11.64 2.23
15 1540.000000 41.000000 9.41 22.500000 4.740000 3.200000 11.64 2.23
I'm not going to check if the other columns are right 😂
But note that this apply modifies df in place while returning a DataFrame of None values. So if you are working in a console and run the apply line, you will see a DataFrame of None returned; if you then check df again, you should see that it has been updated.
def fill_nulls(ls):
    non_null_index = [i for i in range(len(ls)) if not np.isnan(ls[i])]
    non_null_values = [i for i in ls if not np.isnan(i)]
    if 0 not in non_null_index:
        ls[0] = non_null_values[0]
    for i in range(len(ls)):
        if i == 0:
            pass
        else:
            if np.isnan(ls[i]):
                left_non_null = [j for j in ls[:i] if not np.isnan(j)]
                right_non_null = [[j for j in ls[i:] if not np.isnan(j)][0]]
                fill_value = np.mean(left_non_null + right_non_null)
                ls[i] = fill_value
            else:
                pass
    return ls
df['A'] = fill_nulls(df['A'].values)
# Output for new df['A'].values
[1550.0,
1545.0,
1545.0,
1545.0,
1545.0,
1545.0,
1545.0,
1540.0,
1550.0,
1550.0,
1550.0,
1550.0,
1590.0,
1549.2857142857142,
1540.0]
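To fill the remaining columns as well, one straightforward option (a small sketch reusing the function above on every column) is:
for col in df.columns:
    df[col] = fill_nulls(df[col].values)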
I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are fewer than 5 data-filled cells, then take the average of however many cells there are.
Use the justify function for improved performance, selecting all columns except the first with DataFrame.iloc:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
    """
    Justifies a 2D array

    Parameters
    ----------
    a : ndarray
        Input array to be justified
    axis : int
        Axis along which justification is to be made
    side : str
        Direction of justification. It could be 'left', 'right', 'up', 'down'
        It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
    """
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slower
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slower
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
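Once the values are right-justified, the question's end goal (the mean of the last five data-filled cells in each row) reduces to the row-wise mean of the last five columns, since mean skips NaN by default and therefore averages whatever is present when fewer than five values exist. A minimal sketch (the column name mean_last_5 is just an illustrative choice):
df['mean_last_5'] = df.iloc[:, -5:].mean(axis=1)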
Assuming you need to move all NaN to the first columns, I would define a function that takes all the NaN values and places them first, leaving the rest as it is:
def fun(row):
    index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
    row.iloc[:] = row[index_order].values
    return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
Suppose I have a DataFrame like this. I want to convert it to a DataFrame with a 2-level MultiIndex.
dt st close volume
0 20100101 000001.sz 1 10000
1 20100101 000002.sz 10 50000
2 20100101 000003.sz 5 1000
3 20100101 000004.sz 15 7000
4 20100101 000005.sz 100 100000
5 20100102 000001.sz 2 20000
6 20100102 000002.sz 20 60000
7 20100102 000003.sz 6 2000
8 20100102 000004.sz 20 8000
9 20100102 000005.sz 110 110000
But when I try this code:
data = pd.read_csv('data/trial.csv')
print(data)
idx = pd.MultiIndex.from_product([data.dt.unique(),
data.st.unique()],
names=['dt', 'st'])
col = ['close', 'volume']
df = pd.DataFrame(data, idx, col)
print(df)
I find that all the elements are NaN:
close volume
dt st
20100101 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
20100102 000001.sz NaN NaN
000002.sz NaN NaN
000003.sz NaN NaN
000004.sz NaN NaN
000005.sz NaN NaN
How do I handle this situation? Thanks.
You only need the index_col parameter in read_csv:
#by positions of columns
data = pd.read_csv('data/trial.csv', index_col=[0,1])
Or:
#by names of columns
data = pd.read_csv('data/trial.csv', index_col=['dt', 'st'])
print (data)
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
Why are all elements NaN when constructing a MultiIndex DataFrame?
The reason is in the DataFrame constructor:
df = pd.DataFrame(data, idx, col)
The DataFrame called data has a RangeIndex, which does not align with the new MultiIndex, so the constructor fills the result with NaNs.
A possible solution, if each dt always has the same st values, is to select the value columns and pass the underlying NumPy array, but the index_col and set_index solutions are better:
df = pd.DataFrame(data[col].values, idx, col)
Try using set_index() like this:
new_df = df.set_index(['dt', 'st'])
Result:
>>> new_df
close volume
dt st
20100101 000001.sz 1 10000
000002.sz 10 50000
000003.sz 5 1000
000004.sz 15 7000
000005.sz 100 100000
20100102 000001.sz 2 20000
000002.sz 20 60000
000003.sz 6 2000
000004.sz 20 8000
000005.sz 110 110000
>>> new_df.index
MultiIndex(levels=[[20100101, 20100102], ['000001.sz', '000002.sz', '000003.sz', '000004.sz', '000005.sz']],
labels=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]],
names=['dt', 'st'])
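Once the two-level index is in place, individual rows can be selected with (dt, st) tuples; a quick sketch against the frame above (the expected values are taken from the sample data):
new_df.loc[(20100101, '000002.sz')]
# close        10
# volume    50000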