How to show top 10 ranking with different magnitude - python

I want to create an overall ranking, but in my real data the features do not have the same magnitude at all.
For example, if the top 10 values in feature 6 are on the order of 10^6, 9^6, ..., 2^6, the values in feature 1 are more like 10^2, 9^2, ..., 2^2.
As a result, the overall ranking ends up being the same as the ranking for feature 6: it is dominated by the magnitude, and the given weights have almost no influence on the result.
I want to create a new column (or a new dataframe) for the overall ranking.
That column should take into account only the ranking within each feature (thereby eliminating the raw values).
In a second step, I want to rank the countries using the different weights given for each feature, in order to plot the overall ranking across the 10 features.
It would also be great if I could visualise the result with matplotlib, even though each column currently holds a dictionary.
This is the dataframe I have:
import pandas as pd
import numpy as np

data = np.random.randint(100, size=(12, 10))
countries = [
    'Country1',
    'Country2',
    'Country3',
    'Country4',
    'Country5',
    'Country6',
    'Country7',
    'Country8',
    'Country9',
    'Country10',
    'Country11',
    'Country12',
]
feature_names_weights = {
    'feature1': 1.0,
    'feature2': 4.0,
    'feature3': 1.0,
    'feature4': 7.0,
    'feature5': 1.0,
    'feature6': 1.0,
    'feature7': 8.0,
    'feature8': 1.0,
    'feature9': 9.0,
    'feature10': 1.0,
}
feature_names = list(feature_names_weights.keys())
df = pd.DataFrame(data=data, index=countries, columns=feature_names)

data_etude_copy = df
data_sorted_by_feature = {}
country_scores = pd.DataFrame(data=np.zeros(len(countries)), index=countries)[0]
for feature in feature_names:
    # Adds to each country's score and multiplies by weight factor for each feature
    for country in countries:
        country_scores[country] += data_etude_copy[feature][country] * feature_names_weights[feature]
    # Sorts the countries by feature
    data_sorted_by_feature[feature] = data_etude_copy.sort_values(by=[feature], ascending=False).head(10)
    data_sorted_by_feature[feature].drop(
        data_sorted_by_feature[feature].loc[:, data_sorted_by_feature[feature].columns != feature],
        inplace=True, axis=1)
# Sort country total scores
ranked_countries = country_scores.sort_values(ascending=False).head(10)

## Put everything into one DataFrame
# Create empty DataFrame
empty_data = np.empty((10, 10), str)
outputDF = pd.DataFrame(data=empty_data, columns=feature_names)
# Add entries for all features
for feature in feature_names:
    for index in range(10):
        country = list(data_sorted_by_feature[feature].index)[index]
        outputDF[feature][index] = f'{country}: {data_sorted_by_feature[feature][feature][country]}'
# Add column for overall country score
# Print DataFrame
outputDF
The features in my dataframe are not normalized, just "ranked".
The expected output would be something like a sum of the per-feature rankings, each multiplied by its corresponding weight.
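For reference, a minimal sketch of one way to build such a weighted, rank-based overall ranking (it reuses the country/feature/weight setup above; the bar chart at the end is only one possible matplotlib visualisation, and since the data are random the actual ordering will vary):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)  # reproducible dummy data
data = np.random.randint(100, size=(12, 10))
countries = [f'Country{i}' for i in range(1, 13)]
weights = pd.Series({f'feature{i}': w for i, w in
                     zip(range(1, 11), [1.0, 4.0, 1.0, 7.0, 1.0, 1.0, 8.0, 1.0, 9.0, 1.0])})
df = pd.DataFrame(data, index=countries, columns=weights.index)

# Rank each feature column separately (1 = best), so the raw magnitudes no longer matter.
ranks = df.rank(ascending=False)

# Weighted sum of per-feature ranks: a lower score means a better overall position.
overall = ranks.mul(weights, axis=1).sum(axis=1).sort_values()
top10 = overall.head(10)

# One possible matplotlib view of the overall ranking.
top10.plot(kind='barh')
plt.gca().invert_yaxis()  # best-ranked country on top
plt.xlabel('weighted rank score (lower is better)')
plt.tight_layout()
plt.show()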

Related

How to dynamically name a dataframe within this for loop

I have numerous dataframes and each dataframe has about 100 different chemical compounds and a categorical variable listing the type of material. For example, a smaller version of my datasets would look something like this:
Decane  Octanal  Material
1       20       Water
2       1        Glass
10      5        Glass
9       4        Water
I am using a linear regression model to regress the chemicals onto the material type. I want to be able to dynamically rename the results dataframe based on which dataset I am using. My code looks like this (where 'feature_cols' are the names of the chemicals):
count = 0
dataframe = []
# loop through the three datasets (in reality I have many more than three)
for dataset in [first, second, third]:
    count += 1
    for feature in feature_cols:
        # define the model and fit it (Q() quotes the column name inside the formula)
        mod = smf.ols(formula=f'Q("{feature}") ~ material', data=dataset)
        res = mod.fit()
        # create a dataframe of the p-values
        # I would like to be able to dynamically name pvalues so that when looping through
        # the chemicals of the first dataframe it is called 'pvalues_first' and so on.
        pvalues = pd.DataFrame(res.pvalues)
You can use a dictionary (here with dummy values):
names = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
pvalues = {}
for i in range(len(names)):
    pvalues["pvalues_" + names[i]] = i + 1
print(pvalues)
Output:
{'pvalues_first': 1, 'pvalues_second': 2, 'pvalues_third': 3, 'pvalues_fourth': 4, 'pvalues_fifth': 5, 'pvalues_sixth': 6}
To access (or update) pvalues_third, for example:
pvalues["pvalues_third"] = 20
print(pvalues)
Output:
{'pvalues_first': 1, 'pvalues_second': 2, 'pvalues_third': 20, 'pvalues_fourth': 4, 'pvalues_fifth': 5, 'pvalues_sixth': 6}
count = 0
results = {}  # dictionary that will hold one dynamically named p-value DataFrame per dataset
names = ["first", "second", "third"]
# loop through the three datasets (in reality there are many more than three)
for dataset in [first, second, third]:
    for feature in feature_cols:
        # define the model and fit it (Q() quotes the column name inside the formula)
        mod = smf.ols(formula=f'Q("{feature}") ~ material', data=dataset)
        res = mod.fit()
        # build the dynamic name, e.g. 'pvalues_first', and store the
        # p-value DataFrame in the dictionary under that key
        name_str = "pvalues_" + names[count]
        pvalues = {'Intercept': [res.pvalues[0]], 'cap_type': [res.pvalues[1]]}
        results[name_str] = pd.DataFrame(pvalues)
    count += 1
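With that in place, each dynamically named DataFrame can be retrieved from the dictionary by its key (a usage sketch assuming the loop above has run and the dictionary is called results, as in the corrected code):

# Access one of the dynamically named DataFrames ...
print(results["pvalues_second"])

# ... or loop over all of them.
for name, frame in results.items():
    print(name, frame.shape)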

Grouped Time Series forecasting with scikit-hts

I am trying to forecast sales for multiple time series I took from Kaggle's Store Item Demand Forecasting Challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days × 5 years × 10 stores × 50 items = 913,000 records.
From my understanding based on what I've read so far on Hierarchical and Grouped time series, the whole dataframe could be structured as a Grouped Time Series and not simply as a strict Hierarchical Time Series as aggregation could be done at the store or item levels interchangeably.
I want to find a way to forecast all 500 time series (for store1_item1, store1_item2,..., store10_item50) for the next year (from 01-jan-2015 to 31-dec-2015) using the scikit-hts library and its AutoArimaModel function which is a wrapper function of pmdarima's AutoArima function.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality while auto_arima deals with the weekly seasonality.
My problem is that I get an error during the prediction step.
Here's the error message:
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary but I do not know how to solve the issue as I'm using scikit-hts for the first time. To do this, I followed the official documentation of scikit-hts here.
EDIT:
I had not seen that a similar bug had already been reported on GitHub. After applying the proposed fix locally, I was able to get some results. However, even though the code now runs without errors, some of the forecasts are negative, as raised in the comments below this post, and the positive ones take disproportionately large values.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = store_item_ts.columns.droplevel(0)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns = [f'store_{i}_item_{j}_' + x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns = [f'store_{i}_item_{j}_' + x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of, and that I already implemented successfully for one series (for store 1 and item 1, for example):
- TBATS applied to each series independently, inside a loop across all 500 time series
- auto_arima (SARIMAX) with exogenous features (Fourier terms to deal with the weekly and annual seasonalities) for each series independently, inside a loop across all 500 time series
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
I also want to try an LSTM, but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in its original (long) format and one-hot encode the train_data['store'] and train_data['item'] columns, or should I start from the df I ended up with here?
I hope this helps you fix the issue with the exogenous regressors. To handle the negative forecasts, I would suggest trying a square-root transformation.
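For illustration, here is a minimal sketch of how that square-root transformation might be wired around the existing fit/predict calls. This is an assumption on my part rather than part of the original answer; it reuses the variable names from the code above and leaves the FOURIER exogenous columns untouched:

import numpy as np

# Square-root transform the hierarchy columns only; keep the FOURIER exogenous terms as they are.
value_cols = [c for c in df.columns if 'FOURIER' not in c]
df_sqrt = df.copy()
df_sqrt[value_cols] = np.sqrt(df_sqrt[value_cols])

# Fit and predict exactly as before, but on the transformed data.
model = autoarima.fit(df_sqrt, hierarchy, exogenous=exogenous)
predictions_sqrt = model.predict(exogenous_df=exog_test_df, steps_ahead=365)

# Back-transform: squaring the forecasts returns them to the original scale
# and guarantees non-negative values.
predictions = predictions_sqrt ** 2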

How create summary table for every column?

I have pandas DataFrames with around 100 columns each. I have to create a summary table for all of those columns. In the summary DataFrame I want to have one row per file, with a name (one from each of these data frames; this part I'm doing okay) plus the mean and std of every column.
So my final table should have the shape n x m,
where n is the number of files
and m is the number of columns x 2 (mean and std).
Something like this:
name  mean_col1  std_col1  mean_col2  std_col2
ABC   22.815293  0.103567  90.277533  0.333333
DCE   22.193991  0.12389   87.17391   0.123457
I tried the following, but I'm not getting what I wanted:
import glob
import pandas as pd

list_with_names = []
list_for_mean_and_std = []
for file in glob.glob("/data/path/*.csv"):
    df = pd.read_csv(file)
    output = {'name': df['name'][0]}
    list_with_names.append(output)
    numerical_cols = df.select_dtypes('float64')
    for column in numerical_cols:
        mean_col = numerical_cols[column].mean()
        std_col = numerical_cols[column].std()
        output_2 = {'mean': mean_col,
                    'std': std_col}
        list_for_mean_and_std.append(output_2)
summary = pd.DataFrame(list_with_names, list_for_mean_and_std)
And I'm getting the error Shape of passed values is (183, 1), indices imply (7874, 1), because I'm assigning these mean and std values in the wrong way, but I have no idea how to fix it.
I would be glad for any advice on how to change it.
Of course, pandas has a method for that: describe():
df.describe()
This gives more statistics than you requested. If you are interested only in mean and std, you can do:
df.describe().loc[['mean', 'std']]
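To get the full n x m summary table from the question, with one row per file, a minimal sketch along the same lines (it assumes the same glob pattern and that every CSV has a 'name' column, as in the code above):

import glob
import pandas as pd

rows = []
for file in glob.glob("/data/path/*.csv"):
    df = pd.read_csv(file)
    # mean and std of every numeric column, as a 2 x m block
    stats = df.select_dtypes('float64').agg(['mean', 'std'])
    # flatten that block into a single row: mean_col1, std_col1, mean_col2, ...
    row = {'name': df['name'][0]}
    for col in stats.columns:
        row[f'mean_{col}'] = stats.loc['mean', col]
        row[f'std_{col}'] = stats.loc['std', col]
    rows.append(row)

summary = pd.DataFrame(rows)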

Mapping nearest values from two pandas dataframes (latitude and longitude)

How to map the closest values from two dataframes:
I have two dataframes in the format below, and I'm looking to map values based on o_lat, o_long from data1 and near_lat, near_lon from data2:
data1 ={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024]}
Here lat, long are the coordinates of the destination, d is the distance between origin and destination, and o_lat, o_long are the coordinates of the origin.
data2={'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.8185,-37.8126,-37.8099],
'lon':[144.9695,144.9470,144.9952]}
I want to produce another column in data1 which holds the nearest_warehouse, in the following format, based on the closest values:
result={'lat': [-0.659901, -0.659786, -0.659821],
'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050],
'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse':['Bakers','Thompson','Nickolson']}
I've tried the following code:
lat_diff = []
long_diff = []
min_distance = []
for i in range(0, 3):
    lat_diff.append(float(warehouse.near_lat[i]) - lat_long_d.o_lat[0])
for j in range(0, 3):
    long_diff.append(float(warehouse.near_lon[j]) - lat_long_d.o_long[0])
min_distance = [min(lat_diff), min(long_diff)]
min_distance
This gives the following result, which is the minimum difference in latitude and in longitude for o_lat=-37.8095 and o_long=145.0000:
[-0.00897867136701791, -0.05300973586690816]
I feel this approach is not viable for mapping closest values over a large dataset.
I'm looking for a better approach in this regard.
From the first dataframe, you can go through each row with lambda x: and, against all rows of the second dataframe, compute the absolute difference in latitude plus the absolute difference in longitude. This effectively gives you a distance to every warehouse.
Now, what you are interested in is the index, i.e. the position, of the minimum of that sum for each row. You can find this with idxmin(). In dataframe 1, this returns the index number, which you can use to merge against the index of dataframe 2 to pull in the closest warehouse:
setup:
data1 = pd.DataFrame({'lat': [-0.659901, -0.659786, -0.659821], 'long':[2.530561, 2.530797, 2.530587],
'd':[0.4202, 1.0957, 0.6309],
'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024]})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'lat':[-37.818595, -37.812673, -37.809996], 'lon':[144.969551, 144.947069, 144.995232],
'near_lat':[-37.8185,-37.8126,-37.8099], 'near_lon':[144.9695,144.9470,144.9952]})
code:
data1['key'] = data1.apply(lambda x: ((x['o_lat'] - data2['near_lat']).abs()
+ (x['o_long'] - data2['near_lon']).abs()).idxmin(), axis=1)
data1 = pd.merge(data1, data2[['nearest_warehouse']], how='left', left_on='key', right_index=True).drop('key', axis=1)
data1
Out[1]:
         lat      long       d    o_lat    o_long  nearest_warehouse
0  -0.659901  2.530561  0.4202 -37.8095  145.0000             Bakers
1  -0.659786  2.530797  1.0957 -37.8030  145.0077             Bakers
2  -0.659821  2.530587  0.6309 -37.8050  145.0024             Bakers
This result looks accurate if you append the two dataframes into one and do a basic scatter plot. As you can see, the Bakers warehouse is right there compared to the other points (the graph IS to scale, thanks to the last line of code):
import matplotlib.pyplot as plt
data1 = pd.DataFrame({'o_lat':[-37.8095,-37.8030,-37.8050], 'o_long':[145.0000,145.0077,145.0024],
'nearest_warehouse': ['0','1','2']})
data2= pd.DataFrame({'nearest_warehouse': ['Nickolson','Thompson','Bakers'],
'o_lat':[-37.8185,-37.8126,-37.8099], 'o_long':[144.9695,144.9470,144.9952]})
df = pd.concat([data1, data2])
y = df['o_lat'].to_list()
z = df['o_long'].to_list()
n = df['nearest_warehouse'].to_list()
fig, ax = plt.subplots()
ax.scatter(z, y)
for i, txt in enumerate(n):
    ax.annotate(txt, (z[i], y[i]))
plt.gca().set_aspect('equal', adjustable='box')
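If this has to scale to many more rows, one possible alternative to the row-wise apply is a vectorised nearest-neighbour query, for example with scipy's cKDTree. This is a sketch reusing data1 and data2 from the setup block above; note that it uses Euclidean distance on the raw coordinates, whereas the answer sums absolute differences, so borderline cases could differ:

from scipy.spatial import cKDTree

# Build a tree on the warehouse coordinates and query all origin points at once.
tree = cKDTree(data2[['near_lat', 'near_lon']].to_numpy())
_, idx = tree.query(data1[['o_lat', 'o_long']].to_numpy(), k=1)
data1['nearest_warehouse'] = data2['nearest_warehouse'].to_numpy()[idx]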

Simplifying Python Pandas code for selecting co-occurrences in a window of time

I am a beginner at programming. I was able to build the thing below, which achieves what I want with a small dataset. With larger datasets, my RAM gets swamped, bringing the computer to a halt (2014 MacBook Pro with 16 GB RAM). Can I simplify my process somehow?
# This code starts from a co-occurrence list with dates in the first column, like this:
#
# Jan-20; Monkey; Dog; Horse
# Jan-21; Dog; Horse; Cat
# Jan-22; Monkey; Cat; Dog
# Jan-23; Monkey; Dog; Horse
#
# That is, these animals occurred together on these specific days.
#
# This code cleans out the list, keeping only those lines that have co-occurrences
# including an animal below a certain "age" in the dataset. Let's say 2 days. Like this:
#
# Monkey occurred the first time on Jan-20, meaning that lines including 'Monkey' should
# be kept only if they are dated Jan-20 or Jan-21. Cat occurred the first time on Jan-21,
# meaning that lines including 'Cat' should be kept in the dataset only if they are dated
# Jan-21 or Jan-22.
# 1. Gather data on dates of earliest occurrence for all included animals
# 2. Set a time window
# 3. Extract lines based on 1 and 2
import pandas
# STEP 1: Gather data on earliest occurrence ('entrydates') for all included items
## Set column names
colnames=['Date','Item1','Item2']
## Read csv adding column names
data = pandas.read_csv('/Users/Simon/Dropbox/Work/Datasets/idlehash.csv', names=colnames)
## Create a dataframe with info on dates for first column
datelist1 = data[['Date', 'Item1']]
## Create a dataframe with info on dates for second column
datelist2 = data[['Date', 'Item2']]
## Join the two dataframes into one
entrydates = pandas.concat([datelist1, datelist2])
## Melt the resulting dataframe into two columns
entrydates = pandas.melt(entrydates, id_vars='Date')[['Date', 'value']]
## Sort the dataframe by Date and keep only the earliest occurrence of a value
## drop_duplicates considers the column 'value' and keeps only the first occurrence
entrydates = entrydates.sort_values('Date').drop_duplicates(subset=['value'])
# STEP 2: Calculate item "ages" in dataset at each co-occurence event
## Create a dataframe with co-occurrence pairs and the entrydates of Item1 in each pair
matrix = pandas.merge(left=data, right=entrydates, left_on='Item1', right_on='value')
## Create a dataframe with co-occurrence pairs and the entrydates of Item2 in each pair
matrix2 = pandas.merge(left=data, right=entrydates, left_on='Item2', right_on='value')
## Rename some of the columns for clarity
matrix = matrix.rename(columns={'Date_x':'co-oc date', 'Date_y':'entrydate of item 1',
'value':'Item1 (check)'})
matrix2 = matrix2.rename(columns={'Date_x':'co-oc date', 'Date_y':'entrydate of item 2',
'value':'Item2 (check)'})
## Sort them
matrix = matrix.sort_values(['co-oc date', 'entrydate of item 1'], ascending=False)
matrix2 = matrix2.sort_values(['co-oc date', 'entrydate of item 2'], ascending=False)
## Join them
gorillaking = pandas.merge(matrix, matrix2, on='Item2', how='outer')
## Build dataframe with selected columns from gorillaking
gorillaking = pandas.concat([gorillaking['co-oc date_x'], gorillaking['Item1_x'],
gorillaking['Item2'], gorillaking['entrydate of item 1'], gorillaking['entrydate of item 2']],
axis=1, keys=['date', 'item1', 'item2', 'item1 birth', 'item2 birth'])
## Add a column calculating the "age" of Item 1 on the occasion of each co-occurrence
gorillaking['item1 age'] = gorillaking['date'] - gorillaking['item1 birth']
## Add a column calculating the "age" of Item 2 on the occasion of each co-occurrence
gorillaking['item2 age'] = gorillaking['date'] - gorillaking['item2 birth']
# STEP 3: Select only the co-occurrences that happen in a certain window of time
## Set a timewindow
timewindow = 7
## Extract only the rows where the "age" of Item 1
## is less than or equal to the user defined timewindow
## That is: ('date' - 'item1 birth') <= timewindow
mask = (gorillaking['date'] - gorillaking['item1 birth'] <= timewindow)
keptpairs = gorillaking.loc[mask]
## Output kept pairs to a file
dataset = keptpairs[['item1', 'item2']]
dataset.to_csv('/Users/Simon/Dropbox/Work/Datasets/#keptpairs.csv', sep='\t', encoding='utf-8',
index=False, header=False)
## Print result
print(dataset)
EDITED
I finally got the code to achieve what I want, without stressing pandas. I guess this was just as much about realizing what I actually wanted the code to do. In the code above I calculated the "age" for both Item1 and Item2, which was not needed. I finally came up with the code below to keep only the pairs where the first item of the pair has been present in the dataset for less than time T.
import pandas
## Set column names
colnames=['Date','Item1','Item2']
## Read csv adding column names
## The csv must be formatted like:
## date;item1;item2
data = pandas.read_csv('/path/file.csv', names=colnames)
# STEP 1: GET "PUBLICATIONDATES" OF THE TAGS
## Create a dataframe with info
## on dates for first column
pubdates = data[['Date', 'Item1']]
## Sort the dataframe by Date and
## keep only the earliest occurrence of a value
## drop_duplicates considers the column 'Item1'
## and keeps only the first occurrence
pubdates = pubdates.sort_values('Date').drop_duplicates(subset=['Item1'])
## Create a dataframe with co-occurrence pairs
## and the pubdates of Item1 in each pair
timematrix = pandas.merge(left=data, right=pubdates, left_on='Item1', right_on='Item1')
## Rename some of the columns for clarity
timematrix = timematrix.rename(columns={'Date_x':'Coocdate', 'Date_y':'Item1-pubdate',
'value':'Item1 (check)'})
## Sort them
timematrix = timematrix.sort_values(['Coocdate', 'Item1-pubdate'], ascending=False)
## Add a column calculating the "age"
## of Item 1 on the occasion of each co-occurrence
timematrix['Item1-age'] = timematrix['Coocdate'] - timematrix['Item1-pubdate']
# STEP 2: KEEP ONLY COOCS THAT HAPPEN IN TIME T AFTER ITEM1 PUBDATE
## Set a timeframe
timeframe = 1
## Extract only the rows where the
## "age" of Item 1 is less than or
## equal to the user defined timeframe
mask = (timematrix['Coocdate'] - timematrix['Item1-pubdate'] <= timeframe)
keptpairs = timematrix.loc[mask]
## Output kept pairs to a file
dataset = keptpairs[['Item1', 'Item2']]
dataset.to_csv('/path/dataset.csv', sep='\t', encoding='utf-8', index=False, header=False)
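As a possible further simplification of the same idea (a sketch rather than the original code; it assumes the Date column can be parsed into real dates so that the time window can be expressed as a Timedelta), the first date of each Item1 can be attached with groupby/transform instead of a merge, which avoids building any intermediate merged frames:

import pandas

colnames = ['Date', 'Item1', 'Item2']
data = pandas.read_csv('/path/file.csv', names=colnames, parse_dates=['Date'])

# First date on which each Item1 appears, broadcast back onto every row.
data['Item1-pubdate'] = data.groupby('Item1')['Date'].transform('min')

# Keep only the co-occurrences that happen within the time window after Item1 first appeared.
timeframe = pandas.Timedelta(days=1)
keptpairs = data[data['Date'] - data['Item1-pubdate'] <= timeframe]

keptpairs[['Item1', 'Item2']].to_csv('/path/dataset.csv', sep='\t',
                                     encoding='utf-8', index=False, header=False)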
