Calculate Overlapping & Non-Overlapping Data Points in a Data frame across Years - python

I have a single DataFrame and I need to find, for each toy, how many colors stay the same and how many change across years.
For example: Toy1's color remains intact from 2019 to 2020, but in 2021 there were two toys, one red and the other green. Hence there is no change from 2019 to 2020, giving an overlap of 1 and a new count of 0. However, for 2020 to 2021 the overlap count stays at 1 (due to the red color) while the new count becomes 1 (due to the addition of the green toy).
Attaching sample data; the original data has millions of records.
Input data -
import pandas as pd

input_data = pd.DataFrame({
    'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
    'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
    'Color': ['Red', 'Red', 'Red', 'Green', 'Green', 'Green', 'Red', 'Green', 'Red', 'Blue', 'Yellow', 'Yellow']})
Output data -
output_data = pd.DataFrame({
    'Year': ['2019-2020', '2019-2020', '2019-2020', '2020-2021', '2020-2021', '2020-2021'],
    'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy1', 'Toy2', 'Toy3'],
    'overlap_count': [1, 1, 0, 1, 1, 1],
    'new_count': [0, 1, 1, 1, 1, 0]})
I am trying the below method but it is very slow -
toy_list = ['Toy1', 'Toy2', 'Toy3']
year_list = [2019, 2020]

for i in toy_list:
    for j in year_list:
        y1 = j
        y2 = j + 1
        x1 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y1)]
        x2 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y2)]
        z1 = list(set(x1.Color) & set(x2.Color))
        print(x1)
        print(x2)
        print(z1)
Any leads are really appreciated.

A few steps here. First we unstack the data to get a cross table of toy/year vs. color, where 1 indicates that that color was present for that toy/year:
df1 = input_data.assign(count=1).set_index(['Toy','Toy_year','Color']).unstack(level=2)
df1
df1 looks like this:
count
Color Blue Green Red Yellow
Toy Toy_year
Toy1 2019 NaN NaN 1.0 NaN
2020 NaN NaN 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy2 2019 NaN 1.0 NaN NaN
2020 NaN 1.0 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy3 2019 1.0 NaN NaN NaN
2020 NaN NaN NaN 1.0
2021 NaN NaN NaN 1.0
Now we can aggregate these, by row, to come up with the summary statistics 'overlap_count' and 'new_count'. overlap_count is the number of matches between each row and the next row within the same toy group, and new_count is the sum across the next row minus the overlap from the current row:
ccols = df1.columns
df2 = df1.copy()
df2['overlap_count'] = df1.groupby(['Toy'], group_keys=False).apply(lambda g: (g[ccols] == g[ccols].shift(-1)).sum(axis=1))
df2['new_count'] = df2.groupby(['Toy'], group_keys=False).apply(lambda g: g[ccols].shift(-1).sum(axis=1) - g['overlap_count'])
Now we just massage the result into the required form:
df3 = df2[['overlap_count', 'new_count']].reset_index().droplevel(1, axis=1)
df3['Year'] = df3['Toy_year'].astype(str) + '-' + df3['Toy_year'].astype(str).shift(-1)
df3 = df3[df3['Toy_year'] != 2021].drop(columns=['Toy_year'])
df3
output:
Toy overlap_count new_count Year
-- ----- --------------- ----------- ---------
0 Toy1 1 0 2019-2020
1 Toy1 1 1 2020-2021
3 Toy2 1 1 2019-2020
4 Toy2 2 0 2020-2021
6 Toy3 0 1 2019-2020
7 Toy3 1 0 2020-2021
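
If the frame really has millions of rows, a groupby/shift approach that skips the wide pivot may also be worth a try. This is only a sketch and assumes each toy's years are consecutive, as in the sample (the result column names follow the question's output_data):

import pandas as pd

# build the set of colours per toy/year, then compare each year with the previous one
colour_sets = (input_data.groupby(['Toy', 'Toy_year'])['Color']
               .agg(set)
               .reset_index())
colour_sets['prev'] = colour_sets.groupby('Toy')['Color'].shift(1)
res = colour_sets.dropna(subset=['prev']).copy()
res['overlap_count'] = [len(cur & prev) for cur, prev in zip(res['Color'], res['prev'])]
res['new_count'] = [len(cur - prev) for cur, prev in zip(res['Color'], res['prev'])]
res['Year'] = (res['Toy_year'] - 1).astype(str) + '-' + res['Toy_year'].astype(str)
res = res[['Year', 'Toy', 'overlap_count', 'new_count']].sort_values(['Year', 'Toy'])
res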


Selecting rows based on value counts of TWO columns

My question is similar to Pandas: Selecting rows based on value counts of a particular column but with TWO columns:
This is a very small snippet from the dataframe (The main df contains millions of entries):
overall vote verified reviewTime reviewerID productID
4677505 5.0 NaN True 11 28, 2017 A2O8EJJBFJ9F1 B00NR2VMNC
1302483 5.0 NaN True 04 1, 2017 A1YMYW7EWN4RL3 B001S2PPT0
5073908 3.0 83 True 02 12, 2016 A3H796UY7GIX0K B00ULRFQ1A
200512 5.0 NaN True 07 14, 2016 A150W68P8PYXZE B0000DC0T3
1529831 5.0 NaN True 12 19, 2013 A28GVVNJUZ3VFA B002WE3BZ8
1141922 5.0 NaN False 12 20, 2008 A2UOHALGF2X77Q B001CCLBSA
5930187 3.0 2 True 05 21, 2018 A2CUSR21CZQ6J7 B01DCDG9JC
1863730 5.0 NaN True 05 6, 2017 A38A3VQL8RLS8D B004HKIB6E
1835030 5.0 NaN True 06 20, 2016 A30QT3MWWEPNIE B004D09HRK
4226935 5.0 NaN True 12 27, 2015 A3UORFPF49N96B B00JP12170
Now I want to filter the dataframe so that each reviewerID and productID appears at least k times (let's say k = 2) in the final filtered dataframe. In other words: each user and product has at least k distinct entries/rows.
I would greatly appreciate any help.
Try this way:
import pandas as pd

k = 2
df = pd.read_csv('text.csv')
df['count'] = 1
df_group = df[['reviewerID', 'productID', 'count']].groupby(['reviewerID', 'productID'], as_index=False).sum()
df_group = df_group[df_group['count'] >= k]
df_group.drop(['count'], axis=1, inplace=True)
df.drop(['count'], axis=1, inplace=True)
df = df.merge(df_group, on=['reviewerID', 'productID'])
df
Hope it helps.
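
Note that the grouped sum above counts how often each (reviewerID, productID) pair occurs. If the requirement is instead "at least k rows per reviewer AND at least k rows per product", a transform-based sketch like the following might be closer; filter_min_counts is a hypothetical helper name, and the loop re-filters because dropping rows can push other counts back below k:

import pandas as pd

def filter_min_counts(df, k=2, cols=('reviewerID', 'productID')):
    # keep only rows whose value in every column of `cols` occurs at least k times,
    # repeating until the frame stops shrinking
    prev_len = -1
    while len(df) != prev_len:
        prev_len = len(df)
        for col in cols:
            df = df[df.groupby(col)[col].transform('size') >= k]
    return df

filtered = filter_min_counts(df, k=2)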

More efficient alternative to nested For loop

I have two dataframes which contain data collected at two different frequencies.
I want to update the label of df2 to that of df1 if its timestamp falls within the duration of an event.
I created a nested for-loop to do it, but it takes a rather long time.
Here is the code I used:
for i in np.arange(len(df1) - 1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j, "label"] = df1.loc[i, "label"]
Is there a more efficient way of doing this?
df1 size (367, 4)
df2 size (342423, 9)
short example data:
import numpy as np
import pandas as pd

data1 = {'timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9],
         'duration': [0.5, 0.3, 0.8, 0.2, 0.4, 0.5, 0.3, 0.7, 0.5],
         'label': ['inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh']}
df1 = pd.DataFrame(data1, columns=['timestamp', 'duration', 'label'])

data2 = {'timestamp': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5],
         'label': ['plc'] * 18}
df2 = pd.DataFrame(data2, columns=['timestamp', 'label'])
I would first use a merge_asof to select the highest timestamp from df1 below the timestamp from df2. Next a simple (vectorized) comparison of df2.timestamp and df1.timestamp + df1.duration is enough to select matching lines.
Code could be:
df1['t2'] = df1['timestamp'].astype('float64') # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y
It gives for df2:
timestamp label
0 1.0 inh
1 1.5 inh
2 2.0 exh
3 2.5 plc
4 3.0 inh
5 3.5 inh
6 4.0 exh
7 4.5 plc
8 5.0 inh
9 5.5 plc
10 6.0 exh
11 6.5 exh
12 7.0 inh
13 7.5 plc
14 8.0 exh
15 8.5 exh
16 9.0 inh
17 9.5 inh
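
If the events in df1 never overlap, an IntervalIndex lookup is another possible sketch for the same update; get_indexer returns -1 for timestamps that fall inside no event, and the endpoint handling can be tuned with the closed= argument:

import pandas as pd

# build one interval per event; left/right cast to a common float dtype
intervals = pd.IntervalIndex.from_arrays(df1['timestamp'].astype(float),
                                         df1['timestamp'] + df1['duration'],
                                         closed='both')
pos = intervals.get_indexer(df2['timestamp'])   # -1 where the timestamp is outside every event
hit = pos >= 0
df2.loc[hit, 'label'] = df1['label'].to_numpy()[pos[hit]]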

How to manipulate pandas dataframe with multiple IF statements/ conditions?

I'm relatively new to Python and pandas and am trying to work out how to write an IF statement (or any other construct) that, once it initially returns a value, keeps applying further IF statements within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?
import pandas as pd

data = {'Yrs': ['2018', '2019', '2020', '2021', '2022'],
        'Val': [1.50, 1.75, 2.0, 2.25, 2.5]}
data2 = {'F': ['2015', '2018', '2020'], 'L': ['2019', '2022', '2024'],
         'Base': ['2', '5', '5'], 'O': [20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()

# use this code to get the first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0, 'Yrs']), '2018'] = \
    (1 / pd.to_numeric(df2['Base'])) * pd.to_numeric(df2['S']) * \
    df.at[0, 'Val'] + pd.to_numeric(df2['O'])

# use this code to get the rest of the values until L == Yrs
r.loc[(df2['L'] <= df.at[1, 'Yrs']) & (df2['L'] >= df.at[1, 'Yrs']), '2019'] = \
    (pd.to_numeric(r['2018']) - pd.to_numeric(df2['O'])) * \
    (df.at[1, 'Val'] / df.at[0, 'Val']) + pd.to_numeric(df2['O'])
r
I expect the output to be (the values may be different, but it's the pattern I want):
2018 2019 2020 2021 2022
0 7.75 8.375 NaN NaN NaN
1 11.0 11.5 12 12.5 13.0
2 NaN NaN 18 18.75 19.25
but I get:
2018 2019 2020 2021 2022
0 7.75 8.375 9.0 9.625 10.25
1 11.0 11.5 12 NaN NaN
2 16.50 17.25 18 NaN NaN
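
No answer is included here, but below is a minimal sketch of one way to get that pattern, assuming the intended rule (taken from the attempt above) is that the first year of df falling inside [F, L] uses (1/Base) * S * Val + O, and each later year up to L is (previous - O) * (Val_year / Val_prev_year) + O. The constants will not reproduce the exact numbers shown, only the shape of the output:

import pandas as pd

vals = df.set_index('Yrs')['Val']            # e.g. '2018' -> 1.50
years = list(vals.index)
base = pd.to_numeric(df2['Base'])
o = df2['O'].astype(float)
s = df2['S'].astype(float)

r = pd.DataFrame(index=df2.index, columns=years, dtype=float)
for i, yr in enumerate(years):
    in_range = (df2['F'] <= yr) & (yr <= df2['L'])   # string comparison is safe for 4-digit years
    prev_filled = r[years[i - 1]].notna() if i > 0 else pd.Series(False, index=df2.index)
    first = in_range & ~prev_filled                  # first in-range year for this row
    cont = in_range & prev_filled                    # later in-range years
    r.loc[first, yr] = (1 / base[first]) * s[first] * vals[yr] + o[first]
    if i > 0:
        prev_yr = years[i - 1]
        r.loc[cont, yr] = (r.loc[cont, prev_yr] - o[cont]) * (vals[yr] / vals[prev_yr]) + o[cont]
r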

Pandas DF Multiple Conditionals using np.where

I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like the df_so below, with four columns. I would like to add the two columns named below, built with the following conditions; the desired output is the df df_so_v2.
Days since activity
* Find the most recent prior row with the same ID, then subtract the dates column.
* If there is no prior row, return NA.
Chg. Avg. Value
* Condition 1: if Count == 0, NA.
* Condition 2: if Count != 0, find the most recent prior row with BOTH the same ID and Count != 0, then take the difference in the Avg. Value column.
However, I am building off simple np.where queries like the one below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
import pandas as pd

df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914'] * 5,
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)

df_dict_v2 = dict(df_dict,
                  **{'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
                     'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA']})
df_so_v2 = pd.DataFrame(df_dict_v2)
Here is the answer to part 1 of the question; I need more clarification on the conditions of part 2.
1) Days since activity: find the most recent prior row with the same ID, then subtract the dates column; if there is no prior row, return NA.
First you need to convert the strings to datetime, then sort the dates in ascending order. Finally, use .transform to find the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914'] * 5,
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)

df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so.sort_index()
Edited based on your comment:
Find the most recent previous day that does not have a count of Zero and calculate the difference.
import numpy as np
import pandas as pd

df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914'] * 5,
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df = pd.DataFrame(df_dict)
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID', 'DateOf'], inplace=True)

df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)                   # first row of each ID
mask2 = df.groupby('ID').Count.shift(1) == 0     # previous row of this ID had Count == 0
df.loc[mask, 'Days_since_activity'] = np.nan
df.loc[mask2, 'Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff(2)

df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff(2)

conditions = [df['Count'] == 0]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df
New unsorted Output for easy comparison:
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 should be NaT because the most recent previous row has a count of zero and there is nothing else to compare it to.
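
The comment above points at a limitation of the single shift: it only looks back one row. A sketch of a more general "most recent prior row with Count != 0" comparison for Chg. Avg Value, assuming the frame is still sorted by ID and DateOf (Days_since_activity is left untouched):

import numpy as np

# carry forward the last Avg. Value seen on a non-zero-Count row, shifted so each
# row only sees strictly earlier rows of the same ID
prev_nonzero = (df['Avg. Value']
                .where(df['Count'] != 0)
                .groupby(df['ID'])
                .transform(lambda s: s.shift().ffill()))
df['Chg. Avg Value'] = np.where(df['Count'] != 0,
                                df['Avg. Value'] - prev_nonzero,
                                np.nan)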

keep rows that have data in list of columns python

How can I select rows that contain data in a specific list of columns and drop the ones that have no data at all in those specific columns?
This is the code that I have so far:
VC_sub_selection = final[final['VC'].isin(['ACTIVE', 'SILENT']) & final['Status'].isin(['Test'])]
data_usage_months = list(data_usage_res.columns)
This is an example of the data set
item VC Status Jun 2016 Jul 2016
1 Active Test Nan 1.0
2 Silent Test Nan Nan
3 Active Test 2.0 3.0
4 Silent Test 5.0 Nan
What I would like to achieve is that items 1, 3 and 4 stay in the data set and that item 2 is deleted. So the condition that applies is: if all months are NaN, then drop the row.
Thank you,
Jeroen
Though Nickil's solution answers the question, it does not take into account that more date columns may be added later. Hence, using the index position of a column might not be sufficient in future situations.
The solution presented below does not use the index, rather it uses a regex to find the date columns:
import pandas as pd
import re
# item VC Status Jun 2016 Jul 2016
# 1 Active Test Nan 1.0
# 2 Silent Test Nan Nan
# 3 Active Test 2.0 3.0
# 4 Silent Test 5.0 Nan
df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'VC': ['Active', 'Silent', 'Active', 'Silent'],
                   'Status': ['Test'] * 4,
                   'Jun 2016': [None, None, 2.0, 5.0],
                   'Jul 2016': [1.0, None, 3.0, None]})
regex_pattern = r'[a-zA-Z]{3}\s\d{4}'
date_cols = list(filter(lambda x: re.search(regex_pattern, x), df.columns.tolist()))
df_res = df.dropna(subset=date_cols, how='all')
# Jul 2016 Jun 2016 Status VC item
# 0 1.0 NaN Test Active 1
# 2 3.0 2.0 Test Active 3
# 3 NaN 5.0 Test Silent 4
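
The same column selection can also be written with DataFrame.filter instead of compiling the list by hand; a small equivalent sketch:

# filter() applies the regex to the column labels, anchors make it a full match
date_cols = df.filter(regex=r'^[A-Za-z]{3}\s\d{4}$').columns
df_res = df.dropna(subset=date_cols, how='all')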
