What is the best way to calculate the RMS of a column in sections in Python/pandas? Here is an example for a better understanding of what I mean:
index    x     x_rms
0        2
1        3     2.55
2        10
3        22    17.09
...      ...   ...
So 2.55 is the RMS of 2 and 3, 17.09 is the RMS of 10 and 22 and so on.
The following will work:
import pandas as pd
import numpy as np

df = pd.DataFrame([2, 3, 10, 22], columns=["x"])

def rms(a, b):
    # return round(np.sqrt((a**2 + b**2) / 2), 2)  # for only two decimals
    return np.sqrt((a**2 + b**2) / 2)

df["rms"] = [rms(df.loc[idx - 1, "x"], val["x"]) if idx % 2 != 0 else np.nan
             for idx, val in df.iterrows()]
Output:
    x        rms
0   2        NaN
1   3   2.549510
2  10        NaN
3  22  17.088007
EDIT regarding a comment: if your index is a date, you should do this to get the same output:
import pandas as pd
import numpy as np

values = [2, 3, 10, 22]
tidx = pd.date_range('2019-01-01', periods=len(values), freq='D')
df = pd.DataFrame(values, columns=["x"], index=tidx)

def rms(a, b):
    # return round(np.sqrt((a**2 + b**2) / 2), 2)  # for only two decimals
    return np.sqrt((a**2 + b**2) / 2)

df = df.reset_index()
df["rms"] = [rms(df.loc[idx - 1, "x"], val["x"]) if idx % 2 != 0 else np.nan
             for idx, val in df.iterrows()]
df = df.set_index("index")
df
This is my first question on Stack Overflow.
I have two dataframes of different sizes: df1 (266808 rows) and df2 (201 rows).
I want to append the count of each value/number in df1['WS_140m'] to df2['count'] if the number falls in a class interval given in df2['Class_interval'].
I have tried:

1)

df2['count'] = pd.cut(x=df1['WS_140m'], bins=df2['Class_interval'])

2)

df2['count'] = df1['WS_140m'].groupby(df1['Class_interval'])

3)

for anum in df1['WS_140m']:
    if anum in df2['Class_interval']:
        df2['count'] = df2['count'] + 1
Please guide, if someone knows.
Please try something like:

def in_class_interval(value, interval):
    # TODO: return True if value falls inside interval
    ...

def in_class_interval_closure(interval):
    return lambda x: in_class_interval(x, interval)

df2['count'] = df2['Class_interval'].apply(
    lambda interval: df1[in_class_interval_closure(interval)(df1['WS_140m'])].size)
Define your function in_class_interval(value, interval), which returns boolean.
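A possible implementation, assuming the intervals have already been parsed into (low, high] tuples of floats (the strings shown in the question would need parsing first, and NaN rows would need to be skipped); a minimal sketch:

def in_class_interval(value, interval):
    # Left-open, right-closed membership test, matching the "(low, high]" notation.
    # Works elementwise when `value` is a pandas Series, so it can serve as a boolean mask.
    low, high = interval
    return (value > low) & (value <= high)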
I guess something like this would do it:
In [330]: df1
Out[330]:
WS_140m
0 5.10
1 5.16
2 5.98
3 5.58
4 4.81
In [445]: df2
Out[445]:
count Class_interval
0 0 NaN
1 0 (0.05,0.15]
2 0 (0.15,0.25]
3 0 (0.25,0.35]
4 0 (3.95,5.15]
In [446]: df2.Class_interval = df2.Class_interval.str.replace(']', ')')
In [451]: from ast import literal_eval
In [449]: for i, v in df2.Class_interval.iteritems():
     ...:     if pd.notnull(v):
     ...:         df2.at[i, 'Class_interval'] = literal_eval(df2.Class_interval[i])
In [342]: df2['falls_in_range'] = df1.WS_140m.between(df2.Class_interval.str[0], df2.Class_interval.str[1])
You can increase the count wherever True appears, like below:
In [360]: df2['count'] = df2.loc[df2.index[df2['falls_in_range'] == True].tolist()]['count'] +1
In [361]: df2
Out[361]:
count Class_interval falls_in_range
0 NaN NaN False
1 NaN (0.05, 0.15) False
2 NaN (0.15, 0.25) False
3 NaN (0.25, 0.35) False
4 1.0 (3.95, 5.15) True
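For reference, a compact end-to-end sketch of the counting itself, assuming the same column names as in the question and that the intervals are left-open/right-closed strings like '(3.95,5.15]' (the data here is made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'WS_140m': [5.10, 5.16, 5.98, 5.58, 4.81]})
df2 = pd.DataFrame({'count': [0] * 5,
                    'Class_interval': [None, '(0.05,0.15]', '(0.15,0.25]',
                                       '(0.25,0.35]', '(3.95,5.15]']})

# Parse the "(low,high]" strings into numeric bounds, skipping the NaN row.
bounds = (df2['Class_interval'].dropna()
          .str.strip('(]')
          .str.split(',', expand=True)
          .astype(float))

# Count how many values of df1 fall into each (low, high] interval.
vals = df1['WS_140m']
df2.loc[bounds.index, 'count'] = [
    int(((vals > low) & (vals <= high)).sum())
    for low, high in zip(bounds[0], bounds[1])
]
print(df2)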
I have a dataframe like this:
azimuth id
15 100
15 1
15 100
150 2
150 100
240 3
240 100
240 100
350 100
What I need is to replace the 100 values with the id from the row whose azimuth is the closest:
Desired output:
azimuth id
15 1
15 1
15 1
150 2
150 2
240 3
240 3
240 3
350 1
350 is near to 15 because this is a circle (angle representation). The difference is 25.
What I have:
def mysubstitution(x):
    for i in x.index[x['id'] == 100]:
        i = int(i)
        diff = (x['azimuth'] - x.loc[i, 'azimuth']).abs()
        for ind in diff.index:
            if diff[ind] > 180:
                diff[ind] = 360 - diff[ind]
        exclude = [y for y in x.index if y not in x.index[x['id'] == 100]]
        closer_idx = diff[exclude]
        closer_df = pd.DataFrame(closer_idx)
        sorted_df = closer_df.sort_values('azimuth', ascending=True)
        try:
            a = sorted_df.index[0]
            x.loc[i, 'id'] = x.loc[a, 'id']
        except Exception as a:
            print(a)
    return x
Which works ok most of the time, but I guess there is some simpler solution.
Thanks in advance.
I tried to implement the functionality in two steps. First, I built a grouped dataframe that holds the id value for each azimuth (for ids other than 100).
Then, using this grouped dataframe, I implemented the replaceAzimuth function, which takes each row of the original dataframe and first checks whether a real id already exists for that azimuth. If so, it uses it directly. Otherwise, it replaces the id with the id of the closest azimuth from the grouped dataframe.
Here is the implementation:
df = pd.DataFrame([[15, 100], [15, 1], [15, 100], [150, 2], [150, 100],
                   [240, 3], [240, 100], [240, 100], [350, 100]],
                  columns=['azimuth', 'id'])
df_non100 = df[df['id'] != 100]
df_grouped = df_non100.groupby(['azimuth'])['id'].min().reset_index()

def replaceAzimuth(df_grouped, id_val):
    real_id = df_grouped[df_grouped['azimuth'] == id_val['azimuth']]['id']
    if real_id.size == 0:
        df_diff = df_grouped
        df_diff['azimuth'] = df_diff['azimuth'].apply(
            lambda x: min(abs(id_val['azimuth'] - x), (360 - id_val['azimuth'] + x)))
        id_val['id'] = df_grouped.iloc[df_diff['azimuth'].idxmin()]['id']
    else:
        id_val['id'] = real_id
    return id_val

df = df.apply(lambda x: replaceAzimuth(df_grouped, x), axis=1)
df
For me, the code seems to give the output you have shown, but I'm not sure it will work in all cases!
First set all ids to nan if they are 100.
df.id = np.where(df.id==100, np.nan, df.id)
Then calculate the pairwise angle differences and use the id of the closest azimuth to fill the NaNs.

df.id = df.id.combine_first(
    pd.DataFrame(np.abs(((df.azimuth.values[:, None] - df.azimuth.values) + 180) % 360 - 180))
      .pipe(np.argsort)
      .applymap(lambda x: df.id.iloc[x])
      .apply(lambda x: x.dropna().iloc[0], axis=1)
)
df
azimuth id
0 15 1.0
1 15 1.0
2 15 1.0
3 150 2.0
4 150 2.0
5 240 3.0
6 240 3.0
7 240 3.0
8 350 1.0
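As a side note, the wrap-around distance trick used above (add 180, take modulo 360, subtract 180) can be checked in isolation; a minimal sketch:

import numpy as np

a = np.array([350, 15, 240])
b = 15
# Fold the signed difference into [-180, 180) so that 350 vs 15 gives 25, not 335.
diff = np.abs(((a - b) + 180) % 360 - 180)
print(diff)  # [ 25   0 135]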
I'm looking to find the max run of consecutive zeros in a DataFrame with the result grouped by user. I'm interested in running the RLE on usage.
sample input:
user--day--usage
A-----1------0
A-----2------0
A-----3------1
B-----1------0
B-----2------1
B-----3------0
Desired output
user---longest_run
a - - - - 2
b - - - - 1
mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i])  # Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0  # some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]
This works in R, but I want to do the same thing in Python and I'm totally stumped.
First use groupby with size on the columns user and usage, together with a helper Series that identifies runs of consecutive values:
print (df)
user day usage
0 A 1 0
1 A 2 0
2 A 3 1
3 B 1 0
4 B 2 1
5 B 3 0
6 C 1 1
df1 = (df.groupby([df['user'],
                   df['usage'].rename('val'),
                   df['usage'].ne(df['usage'].shift()).cumsum()])
         .size()
         .to_frame(name='longest_run'))
print (df1)
                longest_run
user val usage
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1
Then filter only the zero rows, get the max per user, and reindex to append the users with no zero runs:

df2 = (df1.query('val == 0')
          .max(level=0)
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
user longest_run
0 A 2
1 B 1
2 C 0
Detail:
print (df['usage'].ne(df['usage'].shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
Name: usage, dtype: int32
Get the max number of consecutive zeros in a series:

def max0(sr):
    counts = (sr != 0).cumsum().value_counts()
    return counts.max() - (0 if counts.idxmax() == 0 else 1)
max0(pd.Series([1,0,0,0,0,2,3]))
4
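To get the per-user result asked for in the question, this helper can be applied per group; a minimal sketch on the sample data:

import pandas as pd

df = pd.DataFrame({'user': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'day': [1, 2, 3, 1, 2, 3],
                   'usage': [0, 0, 1, 0, 1, 0]})

# Longest run of zeros per user, reusing max0 from above.
longest = df.groupby('user')['usage'].apply(max0)
print(longest)
# user
# A    2
# B    1
# Name: usage, dtype: int64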
I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.
Hope this helps!
import pandas as pd
from itertools import groupby
df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list(len_iter(run) for val, run in groupby(data) if val == 0)
    if len(x) == 0:
        return 0
    else:
        return max(x)
df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))
Output:
user
A 2
B 1
C 0
dtype: int64
If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.
Setup:
# pip install pyrle
# or
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
# User Number
# 0 0 0
# 1 0 1
# 2 0 1
# 3 0 0
# 4 0 1
# ... ... ...
# 9999995 4 1
# 9999996 4 1
# 9999997 4 0
# 9999998 4 0
# 9999999 4 1
#
# [10000000 rows x 2 columns]
Execution:
for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)
# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
I can convert all text features in a pandas dataframe by casting to 'category' using the df.astype() method as below. However, I find category hard to work with (e.g. for plotting data) and would prefer to create a new column of integers.
# convert all objects to categories
object_types = dataset.select_dtypes(include=['O'])
for col in object_types:
    dataset['{0}_category'.format(col)] = dataset[col].astype('category')
I can convert the text to integers using this hack:
# convert all objects to int values
object_types = dataset.select_dtypes(include=['O'])
new_cols = {}
for col in object_types:
    data_set = set(dataset[col].tolist())
    data_indexed = {}
    for i, item in enumerate(data_set):
        data_indexed[item] = i
    new_list = []
    for item in dataset[col].tolist():
        new_list.append(data_indexed[item])
    new_cols[col] = new_list
for key, val in new_cols.items():
    dataset['{0}_int_value'.format(key)] = val
But is there a better (or existing) way to do the same?
I would use the factorize method, which is designed for this particular task:
In [90]: x
Out[90]:
A B
9 c z
10 c z
4 b x
5 b y
1 a w
7 b z
In [91]: x.apply(lambda col: pd.factorize(col, sort=True)[0])
Out[91]:
A B
9 2 3
10 2 3
4 1 1
5 1 2
1 0 0
7 1 3
or:
In [92]: x.apply(lambda col: pd.factorize(col)[0])
Out[92]:
A B
9 0 0
10 0 0
4 1 1
5 1 2
1 2 3
7 1 0
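Since the question already creates category columns, it is also worth noting that the integer codes are directly available through the .cat accessor; for string columns this matches factorize with sort=True. A minimal sketch on made-up data:

import pandas as pd

x = pd.DataFrame({'A': list('ccbbab'), 'B': list('zzxywz')})

# .cat.codes holds the integer code of each category (categories are sorted for strings).
codes = x.apply(lambda col: col.astype('category').cat.codes)
print(codes)
#    A  B
# 0  2  3
# 1  2  3
# 2  1  1
# 3  1  2
# 4  0  0
# 5  1  3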
Consider df:

df = pd.DataFrame(dict(A=list('aaaabbbbcccc'),
                       B=list('wwxxxyyzzzzz')))
df
You can convert to integers like this:

import numpy as np

def intify(s):
    u = np.unique(s)
    i = np.arange(len(u))
    return s.map(dict(zip(u, i)))
Or a shorter version:

def intify(s):
    u = np.unique(s)
    return s.map({k: i for i, k in enumerate(u)})
df.apply(intify)
Or in a single line
df.apply(lambda s: s.map({k:i for i,k in enumerate(s.unique())}))
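If you also need to map the integer codes back to the original labels, note that pd.factorize returns the uniques alongside the codes; a minimal sketch:

import pandas as pd

codes, uniques = pd.factorize(pd.Series(list('ccbbab')), sort=True)
print(codes)    # [2 2 1 1 0 1]
print(uniques)  # Index(['a', 'b', 'c'], dtype='object')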
I have a pandas Series 'A' containing comma-separated values like this:
index A
1 null
2 5,6
3 3
4 null
5 5,18,22
... ...
I need a dataframe like this :
index A_5 A_6 A_18 A_20
1 0 0 0 ...
2 1 1 0 ...
3 0 0 0 ...
4 0 0 0 ...
5 1 0 1 ...
... ... ... ... ...
Values that don't occur at least MIN_OBS times should be ignored and not get their own column, because there are so many distinct values that the df would become too big if this threshold isn't applied.
I designed the solution below. It works, but is way too slow (due to iterating over rows I suppose). Could anyone suggest a faster approach ?
from collections import defaultdict

temp_dict = defaultdict(int)
for k, v in A.iteritems():
    temp_list = v.split(',')
    for item in temp_list:
        temp_dict[item] += 1

cols_to_make = []
for k, v in temp_dict.iteritems():
    if v > MIN_OBS:
        cols_to_make.append('A_' + k)

result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)

for k, v in A.iteritems():
    temp_list = v.split(',')
    for item in temp_list:
        if ('A_' + item) in cols_to_make:
            result_df['A_' + item][k] = 1
You can use get_dummies to create the indicator variables, then convert the column labels to numbers with to_numeric, and finally filter the columns by the variable TRESH using ix:
print df
A
index
1 null
2 5,6
3 3
4 null
5 5,18,22
df = df.A.str.get_dummies(sep=",")
print df
18 22 3 5 6 null
index
1 0 0 0 0 0 1
2 0 0 0 1 1 0
3 0 0 1 0 0 0
4 0 0 0 0 0 1
5 1 1 0 1 0 0
df.columns = pd.to_numeric(df.columns, errors='coerce')
df = df.sort_index(axis=1)
TRESH = 5
cols = [col for col in df.columns if col > TRESH]
print cols
[6.0, 18.0, 22.0]
df = df.ix[:, cols]
print df
6 18 22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
df.columns = ["A_" + str(int(col)) for col in df.columns]
print df
A_6 A_18 A_22
index
1 0 0 0
2 1 0 0
3 0 0 0
4 0 0 0
5 0 1 1
EDIT:
I tried to modify unutbu's excellent original answer: I changed how the Series is created, removed the values containing null, and added the prefix parameter to get_dummies:
import numpy as np
import pandas as pd
s = pd.Series(['null', '5,6', '3', 'null', '5,18,22', '3,4'])
print s
#result = s.str.split(',').apply(pd.Series).stack()
#replacing to:
result = pd.DataFrame([ x.split(',') for x in s ]).stack()
count = pd.value_counts(result)
min_obs = 2
#add removing Series, which contains null
count = count[(count >= min_obs) & ~(count.index.isin(['null'])) ]
result = result.loc[result.isin(count.index)]
#add prefix to function get_dummies
result = pd.get_dummies(result, prefix="A")
result.index = result.index.droplevel(1)
result = result.reindex(s.index)
print(result)
A_3 A_5
0 NaN NaN
1 0 1
2 1 0
3 NaN NaN
4 0 1
5 1 0
Timings:
In [143]: %timeit pd.DataFrame([ x.split(',') for x in s ]).stack()
1000 loops, best of 3: 866 µs per loop
In [144]: %timeit s.str.split(',').apply(pd.Series).stack()
100 loops, best of 3: 2.46 ms per loop
Since memory is an issue, we have to be careful not to build large intermediate
data structures if possible.
Let's start with the OP's posted code that works:
def orig(A, MIN_OBS):
    temp_dict = collections.defaultdict(int)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    cols_to_make = []
    for k, v in temp_dict.iteritems():
        if v > MIN_OBS:
            cols_to_make.append('A_' + k)
    result_df = pd.DataFrame(0, index=A.index, columns=cols_to_make)
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            if ('A_' + item) in cols_to_make:
                result_df['A_' + item][k] = 1
    return result_df
and extract the first loop into its own function:
def count(A, MIN_OBS):
    temp_dict = collections.Counter()
    for k, v in A.iteritems():
        temp_list = v.split(',')
        for item in temp_list:
            temp_dict[item] += 1
    temp_dict = {k: v for k, v in temp_dict.items() if v > MIN_OBS}
    return temp_dict
From experimentation in an interactive session, we can see this is not the bottleneck; even for "large" DataFrames, count(A, MIN_OBS) completes fairly quickly.
The slowness of orig occurs in the double for-loop at the end of orig,
which modifies cells in the DataFrame one value at a time
(e.g. result_df['A_' + item][k] = 1).
We could replace that double for-loop with a single for-loop over the columns of the DataFrame, using the vectorized string method A.str.contains to search for values in the strings. Since we never split the original strings into Python lists of strings (or Pandas DataFrames holding the string fragments), we save some memory.
Since orig and alt use similar data structures, their memory footprint is about the same.
def alt(A, MIN_OBS):
    temp_dict = count(A, MIN_OBS)
    df = pd.DataFrame(0, index=A.index, columns=temp_dict)
    for col in df:
        # Anchor each value between delimiters (or string boundaries) so that,
        # for example, '5' does not also match '52'.
        df[col] = A.str.contains(r'(?:^|,){v}(?:,|$)'.format(v=col)).astype(int)
    df.columns = ['A_{}'.format(col) for col in df]
    return df
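A quick sanity check of the delimiter-anchored pattern used in alt (a bare '^5' would also match a string starting with '52'):

import pandas as pd

s = pd.Series(['5,6', '52,6', '15'])

# Only the first string contains '5' as a whole comma-separated item;
# '52' and '15' merely start or end with the digit 5.
print(s.str.contains(r'(?:^|,)5(?:,|$)'))
# 0     True
# 1    False
# 2    False
# dtype: bool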
Here is an example, on a 200K row DataFrame with 40K different possible values:
import numpy as np
import pandas as pd
import collections
np.random.seed(2016)
ncols = 5
nrows = 200000
nvals = 40000
MIN_OBS = 200
# nrows = 20
# nvals = 4
# MIN_OBS = 2
idx = np.random.randint(ncols, size=nrows).cumsum()
data = np.random.choice(np.arange(nvals), size=idx[-1])
data = np.array_split(data, idx[:-1])
data = map(','.join, [map(str, arr) for arr in data])
A = pd.Series(data)
A.loc[A == ''] = 'null'
Here is a benchmark:
In [48]: %timeit expected = orig(A, MIN_OBS)
1 loops, best of 3: 3.03 s per loop
In [49]: %timeit expected = alt(A, MIN_OBS)
1 loops, best of 3: 483 ms per loop
Note that the majority of the time required for alt to complete is spent in count:
In [60]: %timeit count(A, MIN_OBS)
1 loops, best of 3: 304 ms per loop
Would something like this work or could it be modified to fit your need?
df = pd.DataFrame({'A': ['null', '5,6', '3', 'null', '5,18,22']}, columns=['A'])
A
0 null
1 5,6
2 3
3 null
4 5,18,22
Then use get_dummies()
pd.get_dummies(df['A'].str.split(',').apply(pd.Series), prefix=df.columns[0])
Result:
A_3 A_5 A_null A_18 A_6 A_22
index
1 0 0 1 0 0 0
2 0 1 0 0 1 0
3 1 0 0 0 0 0
4 0 0 1 0 0 0
5 0 1 0 1 0 1
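To also honour the MIN_OBS threshold from the original question, here is a sketch combining str.get_dummies with a column-sum filter. The 'A_' prefix and the threshold variable follow the question; note that all dummy columns are still built before filtering, so memory can remain a concern for very wide data:

import pandas as pd

s = pd.Series(['null', '5,6', '3', 'null', '5,18,22'])
MIN_OBS = 2

dummies = s.str.get_dummies(sep=',').drop(columns='null', errors='ignore')
dummies = dummies.loc[:, dummies.sum() >= MIN_OBS]      # keep only frequent values
dummies.columns = ['A_' + c for c in dummies.columns]
print(dummies)
#    A_5
# 0    0
# 1    1
# 2    0
# 3    0
# 4    1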