I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
# The above are example values; it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
    for j in tresholds:
        tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)
metrics = dict()
for i in trains:
    m = binary_metric_train(True, i)
    # The above function returns a binary array of length 35
    # Example: [1, 0, 0, 1, ...]
    metrics[i] = m
for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                trB = metrics[k]
                corr = abs(pearsonr(trA, trB)[0])
                df[k][i][j] = corr
            else:
                df[k][i][j] = np.nan
My problem is, when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaNs are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately; it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing: chained assignments like df[k][i][j] = corr often write to a temporary copy, so the original DataFrame is never modified. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd

trains = [ 1,  1,  1,  2,  2,  2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1,  0,  1,  0,  1,  0]
df = pd.DataFrame({
    'trains': trains,
    'thresholds': thresholds,
    'C1': data,
    'C2': data
}).set_index(['trains', 'thresholds'])
print(df)

df.iloc[df.index.get_loc((2, 30)), 0] = 3  # using the positional column index
# or...
df.loc[(2, 30), 'C1'] = 3                  # using the column name
# but not...
df.loc[(2, 30), 1] = 3                     # creates a new column named 1
print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
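Applied back to the loop in the question (with binary_metric_train, metrics and pearsonr as defined there), the chained df[k][i][j] = corr assignment becomes a single .loc call keyed by the (trains, tresholds) tuple and the column k:

for i in trains:
    for j in tresholds:
        trA = binary_metric_train(True, i, tresh=j)
        for k in trains:
            if k != i:
                # row key is the full MultiIndex tuple, column key is k
                df.loc[(i, j), k] = abs(pearsonr(trA, metrics[k])[0])
            else:
                df.loc[(i, j), k] = np.nan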
Related
Let's label a dataframe with two columns, A and B, and 100M rows. Starting at index i, we want to know whether the data in column B is trending up or trending down compared to the value at [i, 'A'].
Here is a loop:
import pandas as pd

df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
for i in range(0, 5):
    j = i
    while j in range(i, i+5) and df.at[i, 'label'] == 0:  # if classified, no need to continue
        if df.at[j, 'B'] - df.at[i, 'A'] >= 10:
            df.at[i, 'label'] = 1  # Label 1 means trending up
        if df.at[j, 'B'] - df.at[i, 'A'] <= -10:
            df.at[i, 'label'] = 2  # Label 2 means trending down
        j = j + 1
[out]
A B label
0 1 1
1 10 2
2 -10 2
3 2 0
5 3 0
...
The estimated finishing time for this code is 30 days. (A human with a plot and a ruler might finish this task faster.)
What is a fast way to do this? Ideally without a loop.
Looping over a DataFrame is slow compared to using Pandas methods.
The task can be accomplished using Pandas vectorized methods:
the rolling method, which does computations over a rolling window
the min & max methods, which we compute within the rolling window
np.select (rather than DataFrame.where), which lets us assign labels based upon a list of conditions
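The forward-looking window is the only subtle part, so here is a minimal sketch of the reverse-roll-reverse trick in isolation before the full implementation below:

import pandas as pd

s = pd.Series([1, 10, -10, 2, 3])
# rolling() normally looks backward; reversing the series before and
# after the rolling call makes the window look forward instead
fwd_max = s[::-1].rolling(2, min_periods=0).max()[::-1]
print(fwd_max.tolist())  # [10.0, 10.0, 2.0, 3.0, 3.0]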
Code
import numpy as np
import pandas as pd

def set_trend(df, threshold=10, window_size=2):
    '''
    Use a rolling window to find the max/min values in a window starting
    at the current point. A rolling window normally looks at backward values;
    we use the technique from
    https://stackoverflow.com/questions/22820292/how-to-use-pandas-rolling-functions-on-a-forward-looking-basis/22820689#22820689
    to look at forward values instead.
    '''
    # To get a rolling window over lookahead values in column B,
    # we reverse the values in column B
    df['B_rev'] = df["B"].values[::-1]
    # Max & min in B_rev, then reverse the order of these max/min
    # https://stackoverflow.com/questions/50837012/pandas-rolling-min-max
    df['max_'] = df.B_rev.rolling(window_size, min_periods=0).max().values[::-1]
    df['min_'] = df.B_rev.rolling(window_size, min_periods=0).min().values[::-1]
    nrows = df.shape[0] - 1  # adjustment for argmax & argmin indexes since rows are in reverse order
    # i.e. idx = nrows - x.argmax() gives the index of the max in the non-reversed rows
    df['max_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmax(), raw=True).values[::-1]
    df['min_idx'] = df.B_rev.rolling(window_size, min_periods=0).apply(lambda x: nrows - x.argmin(), raw=True).values[::-1]
    # Use np.select to implement the label assignment logic
    conditions = [
        (df['max_'] - df["A"] >= threshold) & (df['max_idx'] <= df['min_idx']),   # max above & comes first
        (df['min_'] - df["A"] <= -threshold) & (df['min_idx'] <= df['max_idx']),  # min below & comes first
        df['max_'] - df["A"] >= threshold,   # max above threshold but didn't come first
        df['min_'] - df["A"] <= -threshold,  # min below threshold but didn't come first
    ]
    choices = [
        1,  # max above & came first
        2,  # min below & came first
        1,  # max above threshold
        2,  # min below threshold
    ]
    df['label'] = np.select(conditions, choices, default=0)
    # Drop the scratch computation columns
    df.drop(['B_rev', 'max_', 'min_', 'max_idx', 'min_idx'], axis=1, inplace=True)
    return df
Tests
Case 1
df = pd.DataFrame({'A': [0,1,2,3,5,0,0,0,0,0], 'B': [1, 10, -10, 2, 3,0,0,0,0,0], "label":[0,0,0,0,0,0,0,0,0,0]})
display(set_trend(df, 10, 4))
Case 2
df = pd.DataFrame({'A': [0,1,2], 'B': [1, -10, 10]})
display(set_trend(df, 10, 4))
Output
Case 1
A B label
0 0 1 1
1 1 10 2
2 2 -10 2
3 3 2 0
4 5 3 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
Case 2
A B label
0 0 1 2
1 1 -10 2
2 2 10 0
I have a dataframe with columns m, n:
m=[0, 0, 1, 0, 0, 0, 4, 0, 0]
n=[6, 1, 2, 1, 4, 3, 1, 3, 5, 1]
I am looking for an iterative loop that adds up the values of column n whenever the value in column m is non-zero. For example, at the 3rd place of column m the value is 1 (non-zero), so it should add up column n from index 0 to 2, i.e. 6+1+2=9. Similarly, m[6]=4 (non-zero) implies 1+4+3+1=9, and so on.
Let's say you have a dataframe and you want to sum the elements in each column based on the position of non-zero values in column "m". The following code gives you the output as a dataframe. See the comment in the code if you are just looking for summing the values in column "n":
import pandas as pd
from random import randint

m = [0, 1, 0, 0, 1, 0, 0, 0, 2]
n = [1, 1, 3, 4, 1, 1, 2, 1, 3]
r = [randint(1, 3) for _ in m]
names = ['lev', 'yan', 'coke', 'coke', 'yan', 'lev', 'lev', 'yan', 'lev']
df = pd.DataFrame({'m': m, 'n': n, 'r': r, 'names': names})
print(f"Input dataframe:\n{df}")

# if you want to iterate over all columns
iter_cols = df.columns.tolist()
iter_cols.remove('m')
# To iterate over a specific column (e.g. 'n'), use iter_cols = ['n']

# DataFrame.append was removed in pandas 2.0, so collect the row sums
# in a list and build the output frame once at the end
starting_idx = 0
rows = []
for idx, val in enumerate(df.m):
    if val != 0:
        rows.append(df.iloc[starting_idx:(idx + 1)][iter_cols].sum())
        starting_idx = idx + 1
sum_df = pd.DataFrame(rows).reset_index(drop=True)
print(f"Output dataframe:\n{sum_df}")
Output:
Input dataframe:
m n r names
0 0 1 2 lev
1 1 1 3 yan
2 0 3 1 coke
3 0 4 2 coke
4 1 1 2 yan
5 0 1 3 lev
6 0 2 3 lev
7 0 1 3 yan
8 2 3 2 lev
Output dataframe:
     n     r         names
0  2.0   5.0        levyan
1  8.0   5.0   cokecokeyan
2  7.0  11.0  levlevyanlev
And if you want to iterate over distinct values in names column and sum the values in 'n' column accordingly:
iter_cols = ['n']
distinct_names = set(df.names)
print(distinct_names)
out_dct = {}
for name in distinct_names:
    starting_idx = 0
    rows = []
    for idx, val in enumerate(df.names):
        if val == name:
            rows.append(df.iloc[starting_idx:(idx + 1)][iter_cols].sum())
            starting_idx = idx + 1
    out_dct[name] = pd.DataFrame(rows).reset_index(drop=True)
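If all you need are the sums in column 'n', a vectorized alternative is to build group labels with a cumulative sum and let groupby do the work. A minimal sketch, assuming (as in the examples above) that every block of rows ends at a non-zero value in m, since any trailing rows after the last non-zero marker would otherwise be lumped into the final group:

import pandas as pd

df = pd.DataFrame({'m': [0, 1, 0, 0, 1, 0, 0, 0, 2],
                   'n': [1, 1, 3, 4, 1, 1, 2, 1, 3]})
# each non-zero value in m closes a block, so start a new group id
# on the row immediately after each non-zero marker
group_id = df['m'].ne(0).shift(fill_value=False).cumsum()
print(df.groupby(group_id)['n'].sum().tolist())  # [2, 8, 7]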
So I have this data frame with about 5 columns. Two of them are longitude and latitude pairs in the form of tuples. I also have a user-defined function that calculates the distance between two given tuples of lon/lat.
data_all['gc_distance'] = ""
### let's start calculating the great circle distance
for idx, row in data_all.iterrows():
    row['gc_distance'] = gcd.dist(row['ping_location'], row['destination'])
    print(row)
So basically, I created an empty column named gc_distance, then I iterate through each row to calculate the distance. When I print each row, the data looks great;
sample print of a row:
created_at_des 2018-01-17 18:55:55.154000
location_missing 0
ping_location (-121.9419444444, 37.4897222222)
destination (-122.15057, 37.39465)
gc_distance 23.85 km
Name: 393529, dtype: object
As you can see, gc_distance DOES have a value.
Here's the sample output from the print statement after the loop:
   location_missing               ping_location  \
0                 0    (-152.859052, 51.218273)
1                 0     (120.585289, 31.298974)
2                 0     (120.585289, 31.298974)
3                 0     (120.585289, 31.298974)
4                 0   (121.4737021, 31.2303904)

                destination gc_distance
0  (-122.057005, 37.606922)
1  (-122.057005, 37.606922)
2  (-122.057005, 37.606922)
3  (-122.057005, 37.606922)
4  (-122.057005, 37.606922)
However, when I print the dataframe again outside of the for loop, the gc_distance column has only blank values! :(
Why is this? There's no compile-time or runtime error... And all the other outputs look good, so why is this calculated field empty, even though it has a value when I print it inside the for loop?
iterrows yields a copy of each row, so assigning to row['gc_distance'] modifies the copy, not the original DataFrame; you have to write into the DataFrame itself. Try this method out:
import pandas as pd
import numpy as np
import math

def dist(i):
    diff = list(map(lambda a, b: a - b, df['a'][i], df['b'][i]))
    squared = [k**2 for k in diff]
    squared_diff = sum(squared)
    root = math.sqrt(squared_diff)
    return root

df = pd.DataFrame([[0, 0, 5, 6, '', '', ''], [2, 6, -5, 8, '', '', '']],
                  columns=["x_a", "y_a", "x_b", "y_b", "a", "b", "dist"])
print(df)

#data_all['ping_location'] = list(zip(data_all.longitude_evnt, data_all.lattitude_evnt))
df['a'] = list(zip(df.x_a, df.y_a))
df['b'] = list(zip(df.x_b, df.y_b))
print(df)

for i in range(0, len(df)):
    df['dist'][i] = dist(i)
    print(dist(i))
print(df)
This is my terminal output:
x_a y_a x_b y_b a b dist
0 0 0 5 6
1 2 6 -5 8
x_a y_a x_b y_b a b dist
0 0 0 5 6 (0, 0) (5, 6)
1 2 6 -5 8 (2, 6) (-5, 8)
test.py:24: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['dist'][i] = dist(i)
7.810249675906654
7.280109889280518
x_a y_a x_b y_b a b dist
0 0 0 5 6 (0, 0) (5, 6) 7.81025
1 2 6 -5 8 (2, 6) (-5, 8) 7.28011
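Since the real goal is just to fill gc_distance for every row, an even simpler fix for the original loop is to build the whole column at once instead of writing into row copies. A minimal sketch, assuming gcd.dist is the asker's user-defined distance function:

# build the whole column in one pass; nothing is assigned to row copies
data_all['gc_distance'] = [
    gcd.dist(ping, dest)
    for ping, dest in zip(data_all['ping_location'], data_all['destination'])
]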
I understand how to create simple quantiles in Pandas using pd.qcut. But after searching around, I don't see anything to create weighted quantiles. Specifically, I wish to create a variable which bins the values of a variable of interest (from smallest to largest) such that each bin contains an equal weight. So far this is what I have:
def wtdQuantile(dataframe, var, weight=None, n=10):
    if weight == None:
        return pd.qcut(dataframe[var], n, labels=False)
    else:
        dataframe.sort_values(var, ascending=True, inplace=True)
        cum_sum = dataframe[weight].cumsum()
        cutoff = max(cum_sum)/n
        quantile = cum_sum/cutoff
        quantile[-1:] -= 1
        return quantile.map(int)
Is there an easier way, or something prebuilt from Pandas that I'm missing?
Edit: As requested, I'm providing some sample data. In the following, I'm trying to bin the "Var" variable using "Weight" as the weight. Using pd.qcut, we get an equal number of observations in each bin. Instead, I want an equal weight in each bin, or in this case, as close to equal as possible.
Weight Var pd.qcut(n=5) Desired_Rslt
10 1 0 0
14 2 0 0
18 3 1 0
15 4 1 1
30 5 2 1
12 6 2 2
20 7 3 2
25 8 3 3
29 9 4 3
45 10 4 4
I don't think this is built-in to Pandas, but here is a function that does what you want in a few lines:
import numpy as np
import pandas as pd
from pandas._libs.lib import is_integer

def weighted_qcut(values, weights, q, **kwargs):
    'Return weighted quantile cuts from a given series, values.'
    if is_integer(q):
        quantiles = np.linspace(0, 1, q + 1)
    else:
        quantiles = q
    order = weights.iloc[values.argsort()].cumsum()
    bins = pd.cut(order / order.iloc[-1], quantiles, **kwargs)
    return bins.sort_index()
We can test it on your data this way:
data = pd.DataFrame({
'var': range(1, 11),
'weight': [10, 14, 18, 15, 30, 12, 20, 25, 29, 45]
})
data['qcut'] = pd.qcut(data['var'], 5, labels=False)
data['weighted_qcut'] = weighted_qcut(data['var'], data['weight'], 5, labels=False)
print(data)
The output matches your desired result from above:
var weight qcut weighted_qcut
0 1 10 0 0
1 2 14 0 0
2 3 18 1 0
3 4 15 1 1
4 5 30 2 1
5 6 12 2 2
6 7 20 3 2
7 8 25 3 3
8 9 29 4 3
9 10 45 4 4
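As a quick sanity check, you can total the weights inside each bin; with this data the per-bin sums work out to 42, 45, 32, 54 and 45 against an ideal of 218/5 = 43.6, i.e. as close to equal weight as these discrete weights allow:

# total weight per weighted bin (42, 45, 32, 54, 45 for this data)
print(data.groupby('weighted_qcut')['weight'].sum())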
I have a dictionary 'wordfreq' like this:
{'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
and I want to put the keys in a list called 'stopword' if the value is more than 5 and the key is not in another dataframe 'df'. Here is the df dataframe:
word freq
1 paradies 1
5 tucuman 1
and here is the code I am using:
stopword = []
for k, v in wordfreq.items():
    if v >= 5:
        if k not in list_c:
            stopword.append(k)
Does anybody know how I can do the same thing with the isin() method, or at least more efficiently?
I'd load your dict into a df:
In [177]:
wordfreq = {'techsmart': 30, 'paradies': 57, 'jobvark': 5000, 'midgley': 100, 'weisman': 2, 'tucuman': 1, 'amdahl': 2, 'frogfeet': 1, 'd8848': 1, 'jiaoyuwang': 1, 'walter': 19}
df = pd.DataFrame({'word':list(wordfreq.keys()), 'freq':list(wordfreq.values())})
df
Out[177]:
freq word
0 1 frogfeet
1 1 tucuman
2 57 paradies
3 1 d8848
4 5000 jobvark
5 100 midgley
6 1 jiaoyuwang
7 30 techsmart
8 2 weisman
9 19 walter
10 2 amdahl
And then filter using isin against the other df (df1 in my case) like this:
In [181]:
df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]
Out[181]:
freq word
4 5000 jobvark
5 100 midgley
7 30 techsmart
9 19 walter
So the boolean condition looks for freq values greater than 5, and uses isin with the inverted boolean mask ~ to keep the words that are not in the other df.
You can then now get a list easily:
In [182]:
list(df[(df['freq'] > 5) & (~df['word'].isin(df1['word']))]['word'])
Out[182]:
['jobvark', 'midgley', 'techsmart', 'walter']
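If you'd rather stay close to your original loop, another efficient option is to turn the other df's words into a set first, so each membership test is O(1). A small sketch using the same wordfreq and df1 as above:

# a set makes `not in` checks constant-time, unlike a list
df_words = set(df1['word'])
stopword = [k for k, v in wordfreq.items() if v > 5 and k not in df_words]
print(stopword)  # the same four words; order may differ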