Nested lists to python dataframe - python

I have a nested numpy.ndarray of the following format (each of the sublists has the same size)
len(exp_data) # Timepoints
Out[205]: 42
len(exp_data[0])
Out[206]: 1
len(exp_data[0][0]) # Y_bins
Out[207]: 13
len(exp_data[0][0][0]) # X_bins
Out[208]: 43
type(exp_data[0][0][0][0])
Out[209]: numpy.float64
I want to move these into a pandas DataFrame with three columns of indices numbered from 0 to N, and a last column holding the float value.
I could do this with a series of loops, but that seems like a very inefficient way of solving the problem.
In addition, I would like to get rid of any NaN values (not present in the sample data). Do I do this after creating the df, or is there a way to skip adding them in the first place?
NOTE: code below has been edited and I've added sample data
import random
import numpy as np
import pandas as pd

exp_data = [[[ [random.random() for x in range(5)],
               [random.random() for x in range(5)],
               [random.random() for x in range(5)],
             ]]]*5
exp_data[0][0][0][1] = np.nan

df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [int(t), int(y), int(x), x_bin]
df = df.dropna().reset_index(drop=True)
The final format should be as follows (except I'd preferably like integers instead of floats in the first 3 columns, but that's not essential; int(t) etc. doesn't do the trick):
df
Out[291]:
Timepoint Y_bin X_bin Values
0 0.0 0.0 0.0 0.095391
1 0.0 0.0 2.0 0.963608
2 0.0 0.0 3.0 0.855735
3 0.0 0.0 4.0 0.392637
4 0.0 1.0 0.0 0.555199
5 0.0 1.0 1.0 0.118981
6 0.0 1.0 2.0 0.201782
...
len(df) # has received a total of 75 (5*3*5) input values of which 5 are nan
Out[293]: 70
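
For reference, the loops can be avoided entirely by flattening the array and generating the three index columns with numpy (a sketch, assuming every sublist really has the same size as stated):

import numpy as np
import pandas as pd

arr = np.asarray(exp_data)[:, 0, :, :]   # drop the singleton axis -> shape (Timepoints, Y_bins, X_bins)
t, y, x = np.indices(arr.shape)          # integer index grids, one per axis
df = pd.DataFrame({'Timepoint': t.ravel(),
                   'Y_bin': y.ravel(),
                   'X_bin': x.ravel(),
                   'Values': arr.ravel()})
df = df.dropna().reset_index(drop=True)  # drop the nan rows after building the frame

The index columns come out as integers automatically, and dropna() afterwards is cheap, so there is no need to avoid adding the nan values in the first place.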

You can change the displayed format of the floats by adding this line
pd.options.display.float_format = '{:,.0f}'.format
to the end of your code, like this:
df = pd.DataFrame(columns=['Timepoint', 'Y_bin', 'X_bin', 'Values'])
for t, timepoint in enumerate(exp_data):
    for y, y_bin in enumerate(timepoint[0]):
        for x, x_bin in enumerate(y_bin):
            df.loc[len(df)] = [t, y, x, x_bin]
df = df.dropna().reset_index(drop=True)
pd.options.display.float_format = '{:,.0f}'.format
df
Out[250]:
Timepoint Y_bin X_bin Values
0 0 4 10 -2
1 0 4 11 -1
2 0 4 12 -2
3 0 4 13 -2
4 0 4 14 -2
5 0 4 15 -2
6 0 4 16 -3
...
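
Note that float_format only changes how the frame is printed, not the underlying dtypes. If you actually want integer dtypes in the first three columns, a cast after dropping the nans should work (a sketch):

df = df.dropna().reset_index(drop=True)
df[['Timepoint', 'Y_bin', 'X_bin']] = df[['Timepoint', 'Y_bin', 'X_bin']].astype(int)

The astype(int) has to happen after dropna(), because a column that still contains nan cannot hold a plain integer dtype.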

Related

python dataframe number of last consecutive rows less than current

I need to compute, for each row, the number of immediately preceding consecutive rows whose value is less than the current row's value.
Below is a sample input and the expected result.
df = pd.DataFrame([10, 9, 8, 11, 10, 11, 13], columns=['value'])
df_result = pd.DataFrame({'value': [10, 9, 8, 11, 10, 11, 13],
                          'number of last consecutive rows less than current': [0, 0, 0, 3, 0, 1, 6]})
Is it possible to achieve this without a loop?
Otherwise a solution with a loop would be good.
A further question: could I do it with a groupby operation, for the following input?
df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]], columns=['value','group'])
The following raises an error:
df.groupby('group')['value'].expanding()
Assuming this input:
value
0 10
1 9
2 8
3 11
4 10
5 13
You can use a cummax and expanding custom function:
df['out'] = (df['value'].cummax().expanding()
             .apply(lambda s: s.lt(df.loc[s.index[-1], 'value']).sum())
)
For the particular case of < comparison, you can use a much faster trick with numpy. If a value is greater than all previous values, then it is greater than n values where n is the rank:
m = df['value'].lt(df['value'].cummax())
df['out'] = np.where(m, 0, np.arange(len(df)))
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 13 5.0
Update, for consecutive values:
df['out'] = (
    df['value'].expanding()
    .apply(lambda s: s.iloc[-2::-1].lt(s.iloc[-1]).cummin().sum())
)
Output:
value out
0 10 0.0
1 9 0.0
2 8 0.0
3 11 3.0
4 10 0.0
5 11 1.0
6 13 6.0
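
For the grouped variant asked about above, one option is to apply the same expanding logic per group through transform (a sketch, reusing the lambda from the update; count_prev_consecutive_less is a helper name introduced here):

def count_prev_consecutive_less(s):
    # count backwards from each position while the preceding values are smaller
    return s.expanding().apply(lambda w: w.iloc[-2::-1].lt(w.iloc[-1]).cummin().sum())

df = pd.DataFrame([[10,0],[9,0],[7,0],[8,0],[11,1],[10,1],[11,1],[13,1]],
                  columns=['value','group'])
df['out'] = df.groupby('group')['value'].transform(count_prev_consecutive_less)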

How to generate a new column whose values are between two existing columns

I need to add a new column based on the values of two existing columns.
My data set looks like this:
Date Bid Ask Last Volume
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0
... ... ... ... ... ...
80356 2021.02.01 23:58:54.332 1.20603 1.20605 0.0 0
and I need to generate a new column named "New" whose values are a random number between column "Bid" and column "Ask". Each value of column "New" has to be in the range from Bid to Ask (it can equal Bid or Ask).
I have tried this:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Ask,x.Bid), axis=1)
But I got this
Exception has occurred: ValueError
low >= high
I am new to Python.
Use np.random.uniform so you get a random float with equal probability between your low and high bounds, drawn from the half-open interval [low_bound, high_bound).
Also ditch the apply; np.random.uniform can generate the numbers using arrays of bounds. (I added a row at the bottom to make this obvious).
import numpy as np
df['New'] = np.random.uniform(df.Bid, df.Ask, len(df))
Date Bid Ask Last Volume New
0 2021.02.01 00:01:02.327 1.21291 1.21336 0.0 0 1.213114
1 2021.02.01 00:01:21.445 1.21290 1.21336 0.0 0 1.212969
2 2021.02.01 00:01:31.912 1.21287 1.21336 0.0 0 1.213342
3 2021.02.01 00:01:32.600 1.21290 1.21336 0.0 0 1.212933
4 2021.02.01 00:02:08.920 1.21290 1.21338 0.0 0 1.212948
5 2021.02.01 00:02:08.920 100.00000 115.00000 0.0 0 100.552836
All you need to do is switch the order of x.Ask and x.Bid in your code. In your dataframe, the ask prices are always higher than the bid; that's why you are getting the error:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask), axis=1)
If your ask value is sometimes greater and sometimes less than the bid, use:
df['rand_between'] = df.apply(lambda x: np.random.randint(x.Bid,x.Ask) if x.Ask > x.Bid else np.random.randint(x.Ask,x.Bid), axis=1)
Finally, if it is possible for ask to be greater than, less than, or equal to bid, use:
def helper(x):
    if x.Ask > x.Bid:
        return np.random.randint(x.Bid, x.Ask)
    elif x.Bid > x.Ask:
        return np.random.randint(x.Ask, x.Bid)
    else:
        return None

df['rand_between'] = df.apply(helper, axis=1)
You can loop through the rows using apply and then use your randint function (for floats you might want to use random.uniform). For example:
In [1]: import pandas as pd
...: from random import randint
...: df = pd.DataFrame({'bid':range(10),'ask':range(0,20,2)})
...:
...: df['new'] = df.apply(lambda x: randint(x['bid'],x['ask']), axis=1)
...: df
Out[1]:
bid ask new
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 6
4 4 8 6
5 5 10 9
6 6 12 9
7 7 14 12
8 8 16 13
9 9 18 9
The axis=1 is telling the apply function to loop over rows, not columns.
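
If you need integers without the row-wise apply, the numpy Generator API also accepts array bounds (a sketch; rng.integers has an exclusive upper bound by default, hence the + 1, and it assumes bid <= ask in every row):

import numpy as np
import pandas as pd

df = pd.DataFrame({'bid': range(10), 'ask': range(0, 20, 2)})
rng = np.random.default_rng()
df['new'] = rng.integers(df['bid'], df['ask'] + 1)  # closed range [bid, ask]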

how to replace infinite value with maximum value of a pandas column?

I have a dataframe which looks like
City Crime_Rate
A 10
B 20
C inf
D 15
I want to replace the inf with the max value of the Crime_Rate column , so that my resulting dataframe should look like
City Crime_Rate
A 10
B 20
C 20
D 15
I tried
df['Crime_Rate'].replace([np.inf], max(df['Crime_Rate']), inplace=True)
but Python takes inf as the maximum value. Where am I going wrong here?
Filter out inf values first and then get max of Series:
m = df.loc[df['Crime_Rate'] != np.inf, 'Crime_Rate'].max()
df['Crime_Rate'].replace(np.inf,m,inplace=True)
Another solution:
mask = df['Crime_Rate'] != np.inf
df.loc[~mask, 'Crime_Rate'] = df.loc[mask, 'Crime_Rate'].max()
print (df)
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
Here is a solution for a whole matrix/data frame:
highest_non_inf = df.max().loc[lambda v: v<np.Inf].max()
df.replace(np.Inf, highest_non_inf)
Set use_inf_as_na to True and then use fillna (use this if you want to consider both inf and nan as missing values), i.e.
pd.options.mode.use_inf_as_na = True
df['Crime_Rate'].fillna(df['Crime_Rate'].max(),inplace=True)
City Crime_Rate
0 A 10.0
1 B 20.0
2 C 20.0
3 D 15.0
One way to do it is to nest replace(np.inf, np.nan) inside the max() call.
It replaces inf with nan for the operation happening inside max(), so max() returns the expected maximum value rather than inf.
In the example below, the maximum is 100, and it replaces the inf values:
#Create dummy data frame
import pandas as pd
import numpy as np
a = float('Inf')
v = [1,2,5,a,10,5,a,5,100,2]
df = pd.DataFrame({'Col_A': v})
#Data frame looks like this
In [33]: df
Out[33]:
Col_A
0 1.000000
1 2.000000
2 5.000000
3 inf
4 10.000000
5 5.000000
6 inf
7 5.000000
8 100.000000
9 2.000000
# Replace inf
df['Col_A'].replace([np.inf], max(df['Col_A'].replace(np.inf, np.nan)), inplace=True)
In[35]: df
Out[35]:
Col_A
0 1.0
1 2.0
2 5.0
3 100.0
4 10.0
5 5.0
6 100.0
7 5.0
8 100.0
9 2.0
Hope that works!
Use numpy clip. It's elegant and blazingly fast:
import numpy as np
import pandas as pd
df = pd.DataFrame({"x": [-np.inf, +np.inf, np.nan, 4, 3]})
df["x"] = np.clip(df["x"], -np.inf, 100)
# Out:
# x
# 0 -inf
# 1 100.0
# 2 NaN
# 3 4.0
# 4 3.0
To get rid of the negative infinity as well, replace -np.inf with a small number. NaN is always unaffected. To get the max, use max(df["x"]).
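
For completeness, a variant not shown above that replaces only the infinities (both signs) with the finite maximum and leaves nan values untouched, using a finite-values mask (a sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Crime_Rate': [10, 20, np.inf, 15]})
finite_max = df['Crime_Rate'][np.isfinite(df['Crime_Rate'])].max()  # max over finite values only
df['Crime_Rate'] = df['Crime_Rate'].mask(np.isinf(df['Crime_Rate']), finite_max)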

How to use previous N values in pandas column to fill NaNs?

Say I have time series data as below.
df
   priceA  priceB
0   25.67   30.56
1   34.12   28.43
2   37.14   29.08
3     NaN   34.23
4   32.00     NaN
5   18.75   41.10
6     NaN   45.12
7   23.00   39.67
8     NaN   36.45
9   36.00     NaN
Now I want to fill the NaNs in column priceA by taking the mean of the previous N values in the column; in this case take N=3.
For column priceB I have to fill each NaN with the value M rows above (current index - M).
I tried to write a for loop for it, which is not good practice as my data is too large. Is there a better way to do this?
N = 3
M = 2

def fillPriceA(df, indexval, n):
    temp = []
    for i in range(n):
        if indexval - (i + 1) < 0:
            continue
        temp.append(df.loc[indexval - (i + 1), 'priceA'])
    return np.nanmean(np.array(temp, dtype=float))

def fillPriceB(df, indexval, m):
    return df.loc[indexval - m, 'priceB']

for idx, row in df.iterrows():
    if idx < N:
        continue
    if pd.isnull(row['priceA']):
        df.loc[idx, 'priceA'] = fillPriceA(df, idx, N)
    if pd.isnull(row['priceB']):
        df.loc[idx, 'priceB'] = fillPriceB(df, idx, M)
Expected output:
priceA priceB
0 25.67 30.56
1 34.12 28.43
2 37.14 29.08
3 32.31 34.23
4 32 29.08
5 18.75 41.1
6 27.68 45.12
7 23 39.67
8 23.14 36.45
9 36 39.67
A solution could be to work only with the nan indices (see dataframe boolean indexing):
param = dict(priceA=3, priceB=2)  # number of previous values to consider

for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the nan indices
        _window = df.iloc[max(0, i - param[col]):i][col]  # get the n preceding elements
        df.loc[i, col] = _window.mean() if col == 'priceA' else _window.iloc[0]  # fill with the right method
print(df)
Result:
priceA priceB
0 25.670000 30.56
1 34.120000 28.43
2 37.140000 29.08
3 32.310000 34.23
4 32.000000 29.08
5 18.750000 41.10
6 27.686667 45.12
7 23.000000 39.67
8 23.145556 36.45
9 36.000000 39.67
Note
1. Using np.isnan() implies that your columns are numeric. If they are not, convert them first with pd.to_numeric():
...
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
...
Or use pd.isnull() instead (see the example below). Be aware of the performance (numpy is faster):
from random import randint

# a sample with 10k elements and some np.nan
arr = np.random.rand(10000)
for i in range(100):
    arr[randint(0, 9999)] = np.nan
#Performances
%timeit pd.isnull(arr)
10000 loops, best of 3: 24.8 µs per loop
%timeit np.isnan(arr)
100000 loops, best of 3: 5.6 µs per loop
2. A more generic alternative could be to define, for each column, the window size and the method to apply in a dict:
import numpy as np
import pandas as pd

param = {}
param['priceA'] = {'n': 3,
                   'method': lambda x: np.nanmean(x)}
param['priceB'] = {'n': 2,
                   'method': lambda x: x[0]}
param now contains n, the number of elements to consider, and method, a lambda expression. Rewrite your loops accordingly:
for col in df.columns:
    for i in df[np.isnan(df[col])].index:  # iterate over the nan indices
        _window = df.iloc[max(0, i - param[col]['n']):i][col]  # get the n preceding elements
        df.loc[i, col] = param[col]['method'](_window.values)  # fill with the right method
print(df)  # this leads to a similar result
You can use an NA mask to do what you need per column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, 3, 4, None, 5, 6], 'b': [1, None, 2, 3, 4, None, 7]})
df
#      a    b
# 0  1.0  1.0
# 1  2.0  NaN
# 2  3.0  2.0
# 3  4.0  3.0
# 4  NaN  4.0
# 5  5.0  NaN
# 6  6.0  7.0

for col in df.columns:
    s = df[col]
    na_indices = s[s.isnull()].index.tolist()
    prev = 0
    for k in na_indices:
        s[k] = np.mean(s[prev:k])
        prev = k
    df[col] = s

print(df)
#      a    b
# 0  1.0  1.0
# 1  2.0  1.0
# 2  3.0  2.0
# 3  4.0  3.0
# 4  2.5  4.0
# 5  5.0  2.5
# 6  6.0  7.0
While this is still a custom operation, I am pretty sure it will be slightly faster, because it iterates only over the NA values rather than over every row, and the NA values should be sparse compared to the actual data.
To fill priceA, use rolling to compute the mean, then shift it and use the result in fillna:
# make some data
df = pd.DataFrame({'priceA': range(10)})
#make some rows missing
df.loc[[4, 6], 'priceA'] = np.nan
n = 3
df.priceA = df.priceA.fillna(df.priceA.rolling(n, min_periods=1).mean().shift(1))
The only edge case here is when two nans fall within n of one another, but it seems to handle this as in your question.
For priceB, just use shift:
df = pd.DataFrame({'priceB': range(10)})
df.loc[[4, 8], 'priceB'] = np.nan
m = 2
df.priceB = df.priceB.fillna(df.priceB.shift(m))
Like before, there is the edge case where there is a nan exactly m before another nan.
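
Putting both rules together on the frame from the question (a usage sketch combining the two snippets above):

N, M = 3, 2
df['priceA'] = df['priceA'].fillna(df['priceA'].rolling(N, min_periods=1).mean().shift(1))
df['priceB'] = df['priceB'].fillna(df['priceB'].shift(M))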

splitting length (metre) data by interval with Pandas

I have a dataframe of length-interval data (from boreholes) which looks something like this:
df
Out[46]:
from to min intensity
0 0 10 py 2
1 5 15 cpy 3.5
2 14 27 spy 0.7
I need to pivot this data, but also break it on the least common length interval, resulting in the 'min' values as column headers and the 'intensity' values as the cell values. The output would look like this:
df.somefunc(index=['from','to'], columns='min', values='intensity', fill_value=0)
Out[47]:
from to py cpy spy
0 0 5 2 0 0
1 5 10 2 3.5 0
2 10 14 0 3.5 0
3 14 15 0 3.5 0.7
4 15 27 0 0 0.7
so basically the "From" and "To" describe non-overlapping intervals down a borehole, where the intervals have been split by the least common denominator - as you can see the "py" interval from the original table has been split, the first (0-5m) into py:2, cpy:0 and the second (5-10m) into py:2, cpy:3.5.
The result from just a basic pivot_table function is this:
pd.pivot_table(df, values='intensity', index=['from', 'to'], columns="min", aggfunc="first", fill_value=0)
Out[48]:
min cpy py spy
from to
0 10 0 2 0
5 15 3.5 0 0
14 27 0 0 0.7
which just treats the from and to columns combined as an index. An important point is that my output cannot have overlapping from and to values (IE the subsequent 'from' value cannot be less than the previous 'to' value).
Is there an elegant way to accomplish this using Pandas? Thanks for the help!
I don't know of native interval arithmetic in Pandas, so you need to do it yourself.
Here is a way to do that, if I understand the bound conditions correctly.
This can be an O(n^3) problem; it will create a huge table for big inputs.
# make the new bounds
bounds = np.unique(np.hstack((df["from"], df["to"])))
df2 = pd.DataFrame({"from": bounds[:-1], "to": bounds[1:]})

# find inclusions
isin = df.apply(lambda x:
                df2['from'].between(x[0], x[1] - 1)
                | df2['to'].between(x[0] + 1, x[1]),
                axis=1).T

# data
data = np.where(isin, df.intensity, 0)

# result
df3 = pd.DataFrame(data,
                   pd.MultiIndex.from_arrays(df2.values.T),
                   df["min"])
This gives:
In [26]: df3
Out[26]:
min py cpy spy
0 5 2.0 0.0 0.0
5 10 2.0 3.5 0.0
10 14 0.0 3.5 0.0
14 15 0.0 3.5 0.7
15 27 0.0 0.0 0.7
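
The inclusion test can also be written with pd.IntervalIndex.overlaps, which may read more naturally than the between arithmetic (a sketch; it assumes the intervals are half-open [from, to) and that each 'min' value appears only once):

import numpy as np
import pandas as pd

bounds = np.unique(np.hstack((df['from'], df['to'])))
pieces = pd.IntervalIndex.from_breaks(bounds, closed='left')  # the split intervals
out = pd.DataFrame({'from': bounds[:-1], 'to': bounds[1:]})
for mineral, lo, hi, value in zip(df['min'], df['from'], df['to'], df['intensity']):
    hit = pieces.overlaps(pd.Interval(lo, hi, closed='left'))  # which pieces fall in this run
    out[mineral] = np.where(hit, value, 0)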
