How to assign values based on an interval in Pandas - python

I am trying to assign a value to a dataframe column based on a value that falls between two values of another dataframe:
intervals = pd.DataFrame(columns = ['From','To','Value'], data = [[0,100,'A'],[100,200,'B'],[200,500,'C']])
print('intervals\n',intervals,'\n')
points = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'X'],[125,'X'],[145,'X'],[345,'X']])
print('points\n',points,'\n')
DesiredResult = pd.DataFrame(columns = ['Point', 'Value'], data = [[45,'A'],[125,'B'],[145,'B'],[345,'C']])
print('DesiredResult\n',DesiredResult,'\n')
Many thanks

Let's use map. First, create a Series indexed by a pd.IntervalIndex built with the from_arrays method:
intervals = intervals.set_index(pd.IntervalIndex.from_arrays(intervals['From'],
intervals['To']))['Value']
points['Value'] = points['Point'].map(intervals)
Output:
Point Value
0 45 A
1 125 B
2 145 B
3 345 C
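One detail worth checking: from_arrays builds intervals that are closed on the right by default, so a point exactly equal to 100 would map to 'A'. The sketch below assumes you want half-open intervals [From, To) instead and passes closed='left'; the boundary test point of 100 is made up for illustration.

```python
import pandas as pd

intervals = pd.DataFrame(columns=['From', 'To', 'Value'],
                         data=[[0, 100, 'A'], [100, 200, 'B'], [200, 500, 'C']])

# Index the values by left-closed intervals: [0, 100), [100, 200), [200, 500)
lookup = intervals.set_index(
    pd.IntervalIndex.from_arrays(intervals['From'], intervals['To'], closed='left')
)['Value']

# 100 is a boundary value chosen for illustration
test_points = pd.Series([45, 100, 345])
mapped = test_points.map(lookup)
print(mapped.tolist())
```

With closed='right' (the default) the same boundary point would fall into the first interval instead.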

Another approach:
def calculate_value(x):
    return intervals.loc[(x >= intervals['From']) & (x < intervals['To']), 'Value'].squeeze()
desired_result = points.copy()
desired_result['Value'] = desired_result['Point'].apply(calculate_value)
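If the intervals are sorted and contiguous, as they are in the question, pd.cut may be a simpler alternative. This is only a sketch under that assumption: the bin edges are derived from the From and To columns, and right=False makes each bin include its lower edge.

```python
import pandas as pd

intervals = pd.DataFrame(columns=['From', 'To', 'Value'],
                         data=[[0, 100, 'A'], [100, 200, 'B'], [200, 500, 'C']])
points = pd.DataFrame(columns=['Point', 'Value'],
                      data=[[45, 'X'], [125, 'X'], [145, 'X'], [345, 'X']])

# Assumes contiguous, sorted intervals: edges are the first From plus every To
bins = [intervals['From'].iloc[0]] + intervals['To'].tolist()

result = points.copy()
# right=False gives half-open bins [From, To); labels come from the Value column
result['Value'] = pd.cut(result['Point'], bins=bins,
                         labels=intervals['Value'].tolist(), right=False).astype(str)
print(result)
```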

Related

Cover all columns using the least amount of rows in a pandas dataframe

I have a pandas dataframe looking like the following picture:
The goal here is to select the least amount of rows to have a "1" in all columns. In this scenario, the final selection should be these two rows:
The algorithm should work even if I add columns and rows. It should also work if I change the combination of 1 and 0 in any given row.
Sum each row, compare the result with Series.ge (>=, greater than or equal), and filter by boolean indexing:
df[df.sum(axis=1).ge(2)]
If you want to test for 1 or 0 values, first compare with DataFrame.eq (equivalent to ==):
df[df.eq(1).sum(axis=1).ge(2)]
df[df.eq(0).sum(axis=1).ge(2)]
For those interested, this is how I managed to do it:
def _getBestRowsFinalSelection(self, df, cols):
    """
    Get the selected rows for the final selection
    Parameters:
    1. df: Dataframe to use
    2. cols: Columns of the binary variables in the Dataframe object (df)
    RETURNS -> DataFrame : dfSelected
    """
    # Keep only the rows that contain at least one 1 in the binary columns
    isOne = df.loc[df[df.loc[:, cols] == 1].sum(axis=1) > 0, :]
    lstIsOne = isOne.loc[:, cols].values.tolist()
    lstIsOne = [(x, lstItem) for x, lstItem in zip(isOne.index.values.tolist(), lstIsOne)]
    winningComb = None
    stopFlag = False
    # Try combinations of increasing size; the first one that covers every column wins
    for i in range(1, isOne.shape[0] + 1):
        if stopFlag:
            break
        combs = combinations(lstIsOne, i)  # from itertools
        for c in combs:
            data = [x[1] for x in c]
            index = [x[0] for x in c]
            dfTmp = pd.DataFrame(data=data, columns=cols, index=index)
            if (dfTmp.sum() > 0).all():
                dfTmp["Final Selection"] = "Yes"
                winningComb = dfTmp
                stopFlag = True
                break
    return winningComb
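For reference, the same exhaustive search can be sketched as a standalone function. Since the picture from the question is not available, the small binary frame below is made up, and the function name smallest_covering_rows is hypothetical. Like the method above, the search is exponential in the number of rows, so it is only practical for small frames.

```python
from itertools import combinations

import pandas as pd


def smallest_covering_rows(df, cols):
    """Return the first smallest set of rows that has a 1 in every column of cols."""
    # Rows without any 1 can never help, so drop them up front
    candidates = df.loc[df[cols].eq(1).any(axis=1), cols]
    for size in range(1, len(candidates) + 1):
        for idx in combinations(candidates.index, size):
            subset = candidates.loc[list(idx)]
            # Every column must contain at least one 1 within the subset
            if subset.eq(1).any(axis=0).all():
                return subset
    return None


# Made-up example data
df = pd.DataFrame({'a': [1, 0, 1, 0], 'b': [0, 1, 0, 0], 'c': [0, 1, 0, 1]},
                  index=['r0', 'r1', 'r2', 'r3'])
selected = smallest_covering_rows(df, ['a', 'b', 'c'])
print(selected)
```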

How to compare rows of two different dataframes

I have 2 dataframes (df and df_flagMax) that are not the same size. I need help with how to structure a comparison between two dataframes of different sizes. I want to compare the rows of both dataframes.
df = pd.read_excel('df.xlsx')
df_flagMax = df.groupby(['Name'], as_index=False)['Max'].max()
df['flagMax'] = 0
num = len(df)
for i in range(num):
    colMax = df.at[i, 'Name']
    df['flagMax'][(df['Max'] == colMax)] = 1
print(df)
df_flagMax data:
Name Max
0 Sf 39.91
1 Th -25.74
df data:
For example: I want to compare 'Sf' from both df and df_flagMax and then perform this line:
df['flag'][(df['Max'] == colMax)] = 1
if and only if the 'Sf' is in both dataframes on the same row index. The same goes for the next Name value ... 'Th'
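One possible reading of the goal is "flag each row of df whose Max equals the maximum for its Name". Under that assumption, a groupby with transform avoids both the second dataframe and the loop. The sample rows below are invented, since the df data is not shown in the question.

```python
import pandas as pd

# Invented sample data; only the column names come from the question
df = pd.DataFrame({'Name': ['Sf', 'Sf', 'Th', 'Th'],
                   'Max': [39.91, 12.50, -25.74, -30.00]})

# transform('max') returns the per-Name maximum aligned to every row of df
group_max = df.groupby('Name')['Max'].transform('max')
df['flagMax'] = df['Max'].eq(group_max).astype(int)
print(df)
```

Because transform keeps the original row index, no size mismatch between the two frames arises.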

Feature extraction from the training data

I have training data like below, which has all the information in a single column. The data set has more than 300000 rows.
id features label
1 name=John Matthew;age=25;1.=Post Graduate;2.=Football Player; 1
2 name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games; 1
3 name=David;age=12;1:=High School;2:=Cricketer;native=america; 2
4 name=George;age=11;1:=High School;2:=Carpenter;married=yes 2
.
.
300000 name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No 3
Now I need to convert this training data into the form below:
id name age 1 2 Interest married Smoker
1 John Matthew 25 Post Graduate Football Player Nan Nan Nan
2 Mark clark 21 Under Graduate Nan Video Games Nan Nan
.
.
Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete:
#Getting the proper features from the features column
cols = {}
for choices in set_label:
    collection_list = []
    array = train["features"][train["label"] == choices].values
    for i in range(1, len(array)):
        var_split = array[i].split(";")
        try:
            d = (dict(s.split('=') for s in var_split))
            for x in d.keys():
                collection_list.append(x)
        except ValueError:
            Error = ValueError
    count = Counter(collection_list)
    for k, v in count.most_common(5):
        key = k.replace(":", "").replace(" ", "_").lower()
        cols[key] = v
columns_add = list(cols.keys())
train = train.reindex(columns=np.append(train.columns.values, columns_add))
print(train.columns)
print(train.shape)
#Adding the values for the newly created problem
for row in train.itertuples():
    dummy_dic = {}
    new_dict = {}
    value = train.loc[row.Index, 'features']
    v_split = value.split(";")
    try:
        dummy_dict = (dict(s.split('=') for s in v_split))
        for k, v in dummy_dict.items():
            new_key = k.replace(":", "").replace(" ", "_").lower()
            new_dict[new_key] = v
    except ValueError:
        Error = ValueError
    for k, v in new_dict.items():
        if k in train.columns:
            train.loc[row.Index, k] = v
Is there a function I can apply here to extract the features more efficiently?
Create two DataFrames (in the first one all the features are the same for every data point and the second one is a modification of the first one introducing different features for some data points) meeting your criteria:
import pandas as pd
import numpy as np
import random
import time
import itertools
# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000
NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]
# Build the column in one go instead of assigning into an empty DataFrame
df = pd.DataFrame({"features": ["name={0};age={1};feature1={2};feature2={3}".format(
    NAMES[np.random.randint(0, len(NAMES))],
    AGES[np.random.randint(0, len(AGES))],
    FEATURES1[np.random.randint(0, len(FEATURES1))],
    FEATURES2[np.random.randint(0, len(FEATURES2))]) for i in range(num)]})
df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]
print(df.head(20))
# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.
mod_df = df.copy()  # copy, so that df itself keeps the same keys in every row
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)
INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']
mod_df.loc[random_positions1, 'features'] = ["name={0};age={1};interest={2}".format(
    NAMES[np.random.randint(0, len(NAMES))],
    AGES[np.random.randint(0, len(AGES))],
    INTERESTS[np.random.randint(0, len(INTERESTS))]) for i in range(len(random_positions1))]
mod_df.loc[random_positions2, 'features'] = ["name={0};age={1};smoking={2}".format(
    NAMES[np.random.randint(0, len(NAMES))],
    AGES[np.random.randint(0, len(AGES))],
    SMOKING[np.random.randint(0, len(SMOKING))]) for i in range(len(random_positions2))]
print(mod_df.head(20))
Assume that your original data is stored in a DataFrame called df.
Solution 1 (all the features are the same for every data point).
def func2(y):
    # Keep only the value part of a "key=value" item
    lista = y.split('=')
    value = lista[1]
    return value
def function(x):
    # Split one row into its items and return the list of values
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array
# Calculate the execution time
start = time.time()
array = pd.Series(df.features.apply(function)).tolist()
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])
end = time.time()
new_df
print('Total time:', end - start)
Total time: 1.80923295021
Edit: The only thing you need to adjust is the columns list.
Solution 2 (The features might be the same or different for every data point).
import pandas as pd
import numpy as np
import time
import itertools
# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]
def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista]
    return keys
df = mod_df
columns = pd.Series(df.features.apply(def_columns)).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
flattened_columns
# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        key, val = i.split('=')
        dict_[key] = val
    return dict_
df.features.apply(function)
arr = pd.Series(df.features.apply(function)).tolist()
pd.DataFrame.from_dict(arr)
Suppose your data is like this :
features= ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
'name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;',
"name=David;age=12;1:=High School;2:=Cricketer;",
"name=George;age=11;1:=High School;2:=Carpenter;",
'name=Kevin;age=16;1:=High School;2:=Driver; ']
df = pd.DataFrame({'features': features})
I will start by replacing all the separators (name=, ;age=, ;1:= and ;2:=) with ; using this function:
def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))
After applying that function to your df, every value will be a list of features, and you can get each one by index.
Then I use 4 custom functions to get each attribute: name, age, grade and job.
Note: there may be a better way to do this using only one function.
def get_name(df):
    return df['features'][1]
def get_age(df):
    return df['features'][2]
def get_grade(df):
    return df['features'][3]
def get_job(df):
    return df['features'][4]
And finally, apply those functions to your dataframe:
df = df.assign(name = df.apply(get_name, axis=1),
age = df.apply(get_age, axis=1),
grade = df.apply(get_grade, axis=1),
job = df.apply(get_job, axis=1))
Hope this will be quick and fast
As far as I understand your code, the poor performance comes from the fact that you create the dataframe element by element. It's better to create the whole dataframe at once from a list of dictionaries.
Let's recreate your input dataframe :
import pandas as pd
from io import StringIO
# Columns are separated by three or more spaces so that the spaces inside names are preserved
data = StringIO("""id   features   label
1   name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;   1
2   name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;   1
3   name=David;age=12;1:=High School;2:=Cricketer;   2
4   name=George;age=11;1:=High School;2:=Carpenter;   2""")
df = pd.read_table(data, sep=r'\s{3,}', engine='python')
we can check :
print(df)
id features label
0 1 name=John Matthew;age=25;1.=Post Graduate;2.=F... 1
1 2 name=Mark clark;age=21;1.=Under Graduate;2.=Fo... 1
2 3 name=David;age=12;1:=High School;2:=Cricketer; 2
3 4 name=George;age=11;1:=High School;2:=Carpenter; 2
Now we can create the needed list of dictionaries with the following code:
feat = []
for line in df['features']:
    line = line.replace(':', '.')
    lsp = line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))
And the resulting dataframe :
print(pd.DataFrame(feat))
1. 2. age name
0 Post Graduate Football Player 25 John Matthew
1 Under Graduate Football Player 21 Mark clark
2 High School Cricketer 12 David
3 High School Carpenter 11 George
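The list-of-dictionaries idea can also be sketched so that it tolerates rows with different keys, a trailing ";" and the mixed "1." / "1:" key spellings from the question. The helper name parse_features is hypothetical, and the three rows are copied from the question's sample.

```python
import pandas as pd


def parse_features(text):
    """Turn 'k1=v1;k2=v2;' into a dict, normalising keys such as '1.' and '1:' to '1'."""
    record = {}
    for item in text.split(';'):
        if '=' not in item:
            continue  # skips the empty piece after a trailing ';'
        key, value = item.split('=', 1)
        record[key.strip().rstrip(':.')] = value.strip()
    return record


train = pd.DataFrame({
    'id': [1, 2, 3],
    'features': ['name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;',
                 'name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;',
                 'name=David;age=12;1:=High School;2:=Cricketer;native=america;'],
    'label': [1, 1, 2],
})

# Parse every row once, then build the whole frame in a single call
parsed = pd.DataFrame([parse_features(t) for t in train['features']], index=train.index)
result = train.drop(columns='features').join(parsed)
print(result)
```

Keys missing from a row simply end up as NaN in that row, so no per-cell assignment is needed.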

Efficient (fast) way to group continuous data in one DataFrame based on ranges taken from another DataFrame in Python Pandas?

I have experimental data produced by different programs. One is logging the start and end time of a trial as well as the type of trial (a category).
start trial type end
0 6.002987 2 c 7.574240
1 7.967054 3 b 19.084946
2 21.864419 5 b 23.298480
3 23.656995 7 c 24.087210
4 24.194764 9 c 27.960752
The other one records a continuous datastream and logs the time for each observation.
X Y Z
0.0000 0.324963 -0.642636 -2.305040
0.0333 0.025089 -0.480412 -0.637273
0.0666 0.364149 0.966594 0.789467
0.0999 -0.087334 -0.761769 0.399813
0.1332 0.841872 2.306711 -1.059608
I have the 2 tables as pandas DataFrames and want to retrieve only those parts of the continuous data that is between the start to end ranges found in the rows of the trials DataFrame. I managed that by using a for-loop that iterates over the rows, but I was thinking that there must be more of a "pandas way" of doing this. So I looked into apply, but what I came up with so far was even considerably slower than the loop.
As I'm working on a lot of large datasets I'm looking for the most efficient way in terms of execution time to solve this.
This is a slice of the expected result for the continuous DataFrame:
X Y Z trial type
13.6863 0.265358 0.116529 1.196689 NaN NaN
13.7196 -0.715096 -0.413416 0.696454 NaN NaN
13.7529 0.714897 -0.158183 1.735958 4.0 b
13.7862 -0.259513 0.194762 -0.531482 4.0 b
13.8195 -0.929080 -1.200593 -1.233834 4.0 b
[EDIT:] Here I test performance of different approaches. I found a way using apply(), but it isn't much faster than using iterrows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def create_trials_df(num_trials=360, max_start=1400.0):
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(num_trials,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    # Fix the column order: the positional lookups in the test functions expect start, trial, type, end.
    trials_df = pd.DataFrame(d, columns=['start', 'trial', 'type'])
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']
    return trials_df
def create_continuous_df(num_trials=360, max_start=1400.0):
    # Second df has continuously recorded data with time as index.
    time_delta = 1.0/30.0
    rows = int((max_start+2) * 1/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))
    print("continuous rows:", continuous_df.index.size)
    print("continuous last time:", continuous_df.last_valid_index())
    return continuous_df
# I want to group the continuous data by trial and type later on.
def iterrows_test(trials_df, continuous_df):
    for index, row in trials_df.iterrows():
        continuous_df.loc[row['start']:row['end'], 'trial'] = row['trial']
        continuous_df.loc[row['start']:row['end'], 'type'] = row['type']
def itertuples_test(trials_df, continuous_df):
    continuous_df['trial'] = np.nan
    continuous_df['type'] = np.nan
    for row in trials_df.itertuples():
        continuous_df.loc[slice(row[1], row[4]), ['trial', 'type']] = [row[2], row[3]]
def apply_test(trials_df, continuous_df):
    trial_series = pd.Series([x[0] for x in zip(trials_df.values)])
    continuous_df['trial'] = np.nan
    continuous_df['type'] = np.nan
    def insert_trial_data_to_continuous(vals, con_df):
        con_df.loc[slice(vals[0], vals[3]), ['trial', 'type']] = [vals[1], vals[2]]
    trial_series.apply(insert_trial_data_to_continuous, args=(continuous_df,))
def real_slow_index_map(trials_df, continuous_df):
    # Transform trial_data to new df: merge start and end ordered, make it float index.
    trials_df['pre-start'] = trials_df['start'] - 0.0001
    trials_df['post-end'] = trials_df['end'] + 0.0001
    start_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                            index=trials_df['start'])
    end_df = pd.DataFrame(data={'type': trials_df['type'].values, 'trial': trials_df['trial'].values},
                          index=trials_df['end'])
    # Fill inbetween trials with NaN.
    pre_start_df = pd.DataFrame({'trial': np.nan, 'type': np.nan}, index=trials_df['pre-start'])
    post_end_df = pd.DataFrame({'trial': np.nan, 'type': np.nan}, index=trials_df['post-end'])
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    new_df = pd.concat([start_df, end_df, pre_start_df, post_end_df])
    new_df.sort_index(inplace=True)
    # Each start/end index in new_df has corresponding value in type and trial column.
    def get_tuple(idx):
        # get_loc(idx, method='nearest') was removed in pandas 2.0; get_indexer is the replacement.
        res = new_df.iloc[new_df.index.get_indexer([idx], method='nearest')[0]]
        # return trial and type column values, in that order.
        return tuple(res[['trial', 'type']].values)
    # Apply this to all indices.
    idx_series = continuous_df.index.to_series()
    continuous_df['trial'] = idx_series.apply(get_tuple).values
    continuous_df[['trial', 'type']] = continuous_df['trial'].apply(pd.Series)
def jp_data_analysis_answer(trials_df, continuous_df):
    ranges = trials_df[['trial', 'type', 'start', 'end']].values
    def return_trial(n):
        for i, r in enumerate(ranges):
            if r[2] <= n <= r[3]:
                return tuple((i, r[1]))
        else:
            return np.nan, np.nan
    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))
def performance_test(func, trials_df, continuous_df):
    return_df = continuous_df.copy()
    time_ref = time.perf_counter()
    func(trials_df, return_df)
    time_delta = time.perf_counter() - time_ref
    print("time delta for {}:".format(func.__name__), time_delta)
    return return_df
# Just to illustrate where this is going:
def plot_trial(continuous_df):
    continuous_df['type'] = continuous_df['type'].astype('category')
    continuous_df = continuous_df.groupby('type').filter(lambda x: x is not np.nan)
    # Without the NaNs in column, let's set the trial column to dtype integer.
    continuous_df['trial'] = continuous_df['trial'].astype('int64')
    # Plot the data by trial.
    for key, group in continuous_df.groupby('trial'):
        group.drop(['trial', 'type'], axis=1).plot()
        plt.title('Trial {}, Type: {}'.format(key, group['type'].iloc[0]))
        plt.show()
        break
if __name__ == '__main__':
    import time
    num_trials = 360
    max_start_time = 1400
    trials_df = create_trials_df(max_start=max_start_time)
    data_df = create_continuous_df(max_start=max_start_time)
    # My original approach with a for-loop over iterrows.
    iterrows_df = performance_test(iterrows_test, trials_df, data_df)
    # itertuples test
    itertuples_df = performance_test(itertuples_test, trials_df, data_df)
    # apply() on trial data, continuous data is manipulated therein
    apply_df = performance_test(apply_test, trials_df, data_df)
    # Mapping on index of continuous data. SLOW!
    map_idx_df = performance_test(real_slow_index_map, trials_df, data_df)
    # method by jp_data_analysis' answer. Works well with small continuous_df, but doesn't scale well.
    jp_df = performance_test(jp_data_analysis_answer, trials_df, data_df)
    plot_trial(apply_df)
I see a ~7x improvement with the logic below. The trick is to use index.map(custom_function) on continuous_df and unpack the results, together with the (in my opinion) underused for..else construct. This is still sub-optimal, but may be sufficient for your purposes, and is certainly better than iterating over rows.
import numpy as np
import pandas as pd
def test2():
    # First df holds start and end times (as seconds) of a trial as well as type of trial.
    num_trials = 360
    max_start = 1400.0
    d = {'trial': pd.Series(np.sort(np.random.choice(np.arange(1, 400), replace=False, size=(360,)))),
         'type': pd.Series(np.random.choice(('a', 'b', 'c', 'd'), size=num_trials)),
         'start': pd.Series(np.sort(np.random.random_sample((num_trials,))) * max_start)}
    trials_df = pd.DataFrame(d)
    # Create column for when the trial ended.
    trials_df['end'] = trials_df['start'].shift(-1)
    trials_df.loc[num_trials-1, 'end'] = trials_df['start'].iloc[-1] + 2.0
    trials_df['diff'] = trials_df['end'] - trials_df['start']
    trials_df['end'] = trials_df['end'] - trials_df['diff'] * 0.2
    del trials_df['diff']
    # Second df has continuously recorded data with time as index.
    time_delta = 0.0333
    # Parenthesised so that the rows span the whole recording, as in the question's code.
    rows = int((max_start+2)/time_delta)
    idx_time = pd.Index(np.arange(rows) * time_delta)
    continuous_df = pd.DataFrame(np.random.randn(rows, 3), index=idx_time, columns=list('XYZ'))
    ranges = trials_df[['trial', 'type', 'start', 'end']].values
    def return_trial(n):
        for r in ranges:
            if r[2] <= n <= r[3]:
                return tuple(r[:2])
        else:
            # Reached only when no range contained n
            return (np.nan, '')
    continuous_df['trial'], continuous_df['type'] = list(zip(*continuous_df.index.map(return_trial)))
    return trials_df, continuous_df
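If the trials never overlap, which seems to be the case here since each end lies before the next start, the lookup can be sketched without any Python-level loop by using np.searchsorted. This is not the method benchmarked above, only a possible alternative. The function name label_by_trial and the tiny example frames are made up for illustration.

```python
import numpy as np
import pandas as pd


def label_by_trial(trials_df, continuous_df):
    """Attach 'trial' and 'type' columns; assumes trials do not overlap."""
    trials = trials_df.sort_values('start').reset_index(drop=True)
    times = continuous_df.index.to_numpy()
    # Position of the last trial that starts at or before each timestamp (-1 if none)
    pos = np.searchsorted(trials['start'].to_numpy(), times, side='right') - 1
    safe = pos.clip(min=0)
    # A timestamp belongs to that trial only if it is not past the trial's end
    inside = (pos >= 0) & (times <= trials['end'].to_numpy()[safe])
    out = continuous_df.copy()
    out['trial'] = np.where(inside, trials['trial'].to_numpy()[safe], np.nan)
    out['type'] = pd.Series(trials['type'].to_numpy()[safe], index=out.index).where(inside)
    return out


# Made-up miniature data
trials_df = pd.DataFrame({'start': [1.0, 5.0], 'trial': [2, 3],
                          'type': ['c', 'b'], 'end': [2.0, 6.0]})
continuous_df = pd.DataFrame({'X': [0.1, 0.2, 0.3, 0.4, 0.5]},
                             index=[0.5, 1.5, 3.0, 5.0, 6.5])
labelled = label_by_trial(trials_df, continuous_df)
print(labelled)
```

The binary search costs O(log n) per timestamp, whereas the for..else version scans the ranges linearly for every index value.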

appending to a pandas dataframe

I want to make a pandas dataframe with two columns: read_id and score
I am using the following code :
reads_array = []
for x in Bio.SeqIO.parse("inp.fasta","fasta"):
    reads_array.append(x)
columns = ["read_id","score"]
df = pd.DataFrame(columns = columns)
df = df.fillna(0)
for x in reads_array:
    alignments=pairwise2.align.globalms("ACTTGAT",str(x.seq),2,-1,-.5,-.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2),reverse = True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df['read_id'] = read_id
    df['score'] = score
But this does not work. Can you suggest a way of generating the dataframe df?
At the top make sure you have
import numpy as np
Then replace the code you shared with
reads_array = []
# Same input file and format as in the question
for x in Bio.SeqIO.parse("inp.fasta", "fasta"):
    reads_array.append(x)
# Pre-allocate one row per read
df = pd.DataFrame(np.zeros((len(reads_array), 2)), columns=["read_id", "score"])
for index, x in enumerate(reads_array):
    alignments = pairwise2.align.globalms("ACTTGAT", str(x.seq), 2, -1, -.5, -.1)
    sorted_alignments = sorted(alignments, key=operator.itemgetter(2), reverse=True)
    read_id = x.name
    score = sorted_alignments[0][2]
    df.loc[index, 'read_id'] = read_id
    df.loc[index, 'score'] = score
Your original code had two main problems:
1) Your dataframe had 0 rows
2) df['column_name'] refers to the entire column, not a single cell, so when you execute df['column_name'] = value, all cells in that column get set to that value
df['read_id'] and df['score'] are Series. So if you want to iterate over reads_array, calculate some value, and assign it to df's columns, try the following:
for i, x in enumerate(reads_array):
    ...
    # .ix has been removed from pandas; .loc with row and column sets a single cell
    df.loc[i, 'read_id'] = read_id
    df.loc[i, 'score'] = score
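Another common pattern is to collect one dictionary per read and build the dataframe once at the end, which avoids cell-by-cell assignment. In the sketch below score_read is a hypothetical stand-in for the pairwise2 alignment score, and the reads list replaces the parsed FASTA records so that the example runs without Biopython.

```python
import pandas as pd


def score_read(seq, reference="ACTTGAT"):
    # Hypothetical stand-in for the alignment score: counts matching positions
    return sum(a == b for a, b in zip(reference, seq))


# Stand-in for the records returned by Bio.SeqIO.parse: (name, sequence) pairs
reads = [("read1", "ACTTGAT"), ("read2", "ACTAGAT"), ("read3", "GGGG")]

# One dict per read, then a single DataFrame construction
records = [{"read_id": name, "score": score_read(seq)} for name, seq in reads]
df = pd.DataFrame(records, columns=["read_id", "score"])
print(df)
```

Building the frame once also keeps read_id as strings and score as numbers, rather than starting from a frame of zeros.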
