Calculate overlapping and non-overlapping data points in a DataFrame across years - Python
I have a single DataFrame and need to find, for each toy, how many colors stay the same and how many are new from one year to the next.
For example: Toy1's color stays the same from 2019 to 2020, but in 2021 there are two toys, one red and one green. So from 2019 to 2020 there is no change, giving an overlap count of 1 and a new count of 0. From 2020 to 2021 the overlap count is still 1 (the red toy), but the new count becomes 1 (the green toy was added).
A sample of the data is attached below; the original data has millions of records.
Input data -
import pandas as pd

input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
                           'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
                           'Color': ['Red', 'Red', 'Red', 'Green ', 'Green ', 'Green ', 'Red', 'Green ', 'Red', 'Blue', 'Yellow', 'Yellow']})
Output data -
output_data = pd.DataFrame({'Year': ['2019-2020', '2019-2020', '2019-2020', '2020-2021', '2020-2021', '2020-2021'],
                            'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy1', 'Toy2', 'Toy3'],
                            'overlap_count': [1, 1, 0, 1, 1, 1],
                            'new_count': [0, 1, 1, 1, 1, 0]})
I am trying the method below, but it is very slow -
toy_list = ['Toy1', 'Toy2', 'Toy3']
year_list = [2019, 2020]
for i in toy_list:
    for j in year_list:
        y1 = j
        y2 = j + 1
        x1 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y1)]
        x2 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y2)]
        z1 = list(set(x1.Color) & set(x2.Color))
        print(x1)
        print(x2)
        print(z1)
Any leads are really appreciated.
There are a few steps here. First we unstack the data into a cross table of toy/year vs. color, where 1 indicates that the color was present for that toy/year:
df1 = input_data.assign(count=1).set_index(['Toy','Toy_year','Color']).unstack(level=2)
df1
df1 looks like this:
count
Color Blue Green Red Yellow
Toy Toy_year
Toy1 2019 NaN NaN 1.0 NaN
2020 NaN NaN 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy2 2019 NaN 1.0 NaN NaN
2020 NaN 1.0 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy3 2019 1.0 NaN NaN NaN
2020 NaN NaN NaN 1.0
2021 NaN NaN NaN 1.0
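As an aside, pd.crosstab can build the same toy/year-by-color table and, unlike unstack, it tolerates duplicate toy/year/color rows (it counts them instead of raising). A minimal sketch, assuming input_data as defined in the question; the zeros crosstab produces are turned back into NaN so the comparisons below still count only genuine matches:

import numpy as np
import pandas as pd

# presence table: 1 = color present for that toy/year, NaN = absent
table = pd.crosstab([input_data['Toy'], input_data['Toy_year']], input_data['Color'])
table = table.clip(upper=1).replace(0, np.nan)

The column index of this table is flat (there is no 'count' level), so the droplevel step at the end of the answer would not be needed when starting from it.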
Now we can aggregate these, row by row, to compute the summary statistics 'overlap_count' and 'new_count'. overlap_count is the number of matches between each row and the next row within the same toy group, and new_count is the sum across the next row minus the overlap of the current row:
ccols = df1.columns
df2 = df1.copy()
# overlap: colors present both in the current year and in the next year for the same toy
df2['overlap_count'] = df1.groupby(['Toy'], group_keys=False).apply(lambda g: (g[ccols] == g[ccols].shift(-1)).sum(axis=1))
# new: colors present in the next year minus those that overlap with the current year
df2['new_count'] = df2.groupby(['Toy'], group_keys=False).apply(lambda g: g[ccols].shift(-1).sum(axis=1) - g['overlap_count'])
Now we just massage the result into the required form:
df3 = df2[['overlap_count', 'new_count']].reset_index().droplevel(1, axis=1)
# build the 'YYYY-YYYY' label from each year and the year on the following row
df3['Year'] = df3['Toy_year'].astype(str) + '-' + df3['Toy_year'].astype(str).shift(-1)
# the last year has no following year, so drop those rows
df3 = df3[df3['Toy_year'] != 2021].drop(columns=['Toy_year'])
df3
output:
Toy overlap_count new_count Year
-- ----- --------------- ----------- ---------
0 Toy1 1 0 2019-2020
1 Toy1 1 1 2020-2021
3 Toy2 1 1 2019-2020
4 Toy2 2 0 2020-2021
6 Toy3 0 1 2019-2020
7 Toy3 1 0 2020-2021
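Since the original data has millions of records, a merge-based variant that never builds the wide toy/year-by-color table may scale better. This is only a sketch of an alternative, not the code above: it pairs every toy/year with the following year through an outer merge on color and then counts the merge indicator. It assumes input_data as defined in the question and strips the stray trailing space from 'Green ' first:

# one row per toy/year/color
d = (input_data.assign(Color=input_data['Color'].str.strip())
               .drop_duplicates(['Toy', 'Toy_year', 'Color']))

# relabel next year's rows so they line up with the current year, then outer-merge on color
nxt = d.assign(Toy_year=d['Toy_year'] - 1)
m = d.merge(nxt, on=['Toy', 'Toy_year', 'Color'], how='outer', indicator=True)

# 'both' = color present in both years, 'right_only' = color new in the following year
res = (m.groupby(['Toy', 'Toy_year'])['_merge']
        .agg(overlap_count=lambda s: (s == 'both').sum(),
             new_count=lambda s: (s == 'right_only').sum())
        .reset_index())
res['Year'] = res['Toy_year'].astype(str) + '-' + (res['Toy_year'] + 1).astype(str)
# keep only year pairs that actually exist in the data
res = res[res['Toy_year'].between(d['Toy_year'].min(), d['Toy_year'].max() - 1)]
res = res.drop(columns='Toy_year')[['Year', 'Toy', 'overlap_count', 'new_count']]

On the sample data this reproduces the table above, including the overlap_count of 2 for Toy2 in 2020-2021.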
Related
Selecting rows based on value counts of TWO columns
My question is similar to Pandas: Selecting rows based on value counts of a particular column, but with TWO columns. This is a very small snippet from the dataframe (the main df contains millions of entries):

         overall vote verified   reviewTime      reviewerID   productID
4677505      5.0  NaN     True  11 28, 2017   A2O8EJJBFJ9F1  B00NR2VMNC
1302483      5.0  NaN     True   04 1, 2017  A1YMYW7EWN4RL3  B001S2PPT0
5073908      3.0   83     True  02 12, 2016  A3H796UY7GIX0K  B00ULRFQ1A
200512       5.0  NaN     True  07 14, 2016  A150W68P8PYXZE  B0000DC0T3
1529831      5.0  NaN     True  12 19, 2013  A28GVVNJUZ3VFA  B002WE3BZ8
1141922      5.0  NaN    False  12 20, 2008  A2UOHALGF2X77Q  B001CCLBSA
5930187      3.0    2     True  05 21, 2018  A2CUSR21CZQ6J7  B01DCDG9JC
1863730      5.0  NaN     True   05 6, 2017  A38A3VQL8RLS8D  B004HKIB6E
1835030      5.0  NaN     True  06 20, 2016  A30QT3MWWEPNIE  B004D09HRK
4226935      5.0  NaN     True  12 27, 2015  A3UORFPF49N96B  B00JP12170

Now I want to filter the dataframe so that each reviewerID and productID appears at least k times (let's say k = 2) in the final filtered dataframe. In other words: each user and each product has at least k distinct entries/rows. I would greatly appreciate any help.
Try this way:

k = 2
df = pd.read_csv('text.csv')
df['count'] = 1
df_group = df[['reviewerID', 'productID', 'count']].groupby(['reviewerID', 'productID'], as_index=False).sum()
df_group = df_group[df_group['count'] >= k]
df_group.drop(['count'], axis=1, inplace=True)
df.drop(['count'], axis=1, inplace=True)
df = df.merge(df_group, on=['reviewerID', 'productID'])
df

Hope it will help.
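A note on the requirement: the groupby above keeps (reviewerID, productID) pairs that occur together at least k times. If the intent is instead that each reviewerID and each productID independently appears at least k times, a transform-based filter is a common alternative. A sketch, assuming the same columns and the same hypothetical text.csv used in the answer:

import pandas as pd

k = 2
df = pd.read_csv('text.csv')

# size of each reviewer group and each product group, broadcast back onto the rows
reviewer_counts = df.groupby('reviewerID')['reviewerID'].transform('size')
product_counts = df.groupby('productID')['productID'].transform('size')
df_filtered = df[(reviewer_counts >= k) & (product_counts >= k)]

Dropping rows can push other counts below k again, so if the guarantee must hold exactly in the final frame, repeat the filter until the shape stops changing.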
More efficient alternative to nested For loop
I have two dataframes which contain data collected at two different frequencies. I want to update the label of df2 to the label of df1 if it falls into the duration of an event. I created a nested for-loop to do it, but it takes a rather long time. Here is the code I used:

for i in np.arange(len(df1) - 1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j, "label"] = df1.loc[i, "label"]

Is there a more efficient way of doing this? df1 has shape (367, 4) and df2 has shape (342423, 9). Short example data:

import numpy as np
import pandas as pd

data1 = {'timestamp': [1, 2, 3, 4, 5, 6, 7, 8, 9],
         'duration': [0.5, 0.3, 0.8, 0.2, 0.4, 0.5, 0.3, 0.7, 0.5],
         'label': ['inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh', 'exh', 'inh']}
df1 = pd.DataFrame(data1, columns=['timestamp', 'duration', 'label'])

data2 = {'timestamp': [1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5],
         'label': ['plc'] * 18}
df2 = pd.DataFrame(data2, columns=['timestamp', 'label'])
I would first use merge_asof to select the highest timestamp from df1 below the timestamp from df2. Next, a simple (vectorized) comparison of df2.timestamp against df1.timestamp + df1.duration is enough to select matching lines. Code could be:

df1['t2'] = df1['timestamp'].astype('float64')  # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y

It gives for df2:

    timestamp label
0         1.0   inh
1         1.5   inh
2         2.0   exh
3         2.5   plc
4         3.0   inh
5         3.5   inh
6         4.0   exh
7         4.5   plc
8         5.0   inh
9         5.5   plc
10        6.0   exh
11        6.5   exh
12        7.0   inh
13        7.5   plc
14        8.0   exh
15        8.5   exh
16        9.0   inh
17        9.5   inh
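An alternative sketch, not part of the answer above: if the events in df1 never overlap, an IntervalIndex can map each df2 timestamp straight to its enclosing event (get_indexer raises on overlapping intervals, so merge_asof remains the safer general tool). Assuming df1 and df2 as defined in the question:

import pandas as pd

# one interval per event; closed='both' roughly mirrors the <= comparison used above
events = pd.IntervalIndex.from_arrays(df1['timestamp'],
                                      df1['timestamp'] + df1['duration'],
                                      closed='both')
pos = events.get_indexer(df2['timestamp'])   # -1 where a timestamp falls inside no event
hit = pos >= 0
df2.loc[hit, 'label'] = df1['label'].to_numpy()[pos[hit]]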
How to manipulate pandas dataframe with multiple IF statements/conditions?
I'm relatively new to python and pandas and am trying to determine how to create an IF statement, or any other statement, that once it initially returns a value continues with other IF statements within a given range. I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?

import pandas as pd

data = {'Yrs': ['2018', '2019', '2020', '2021', '2022'],
        'Val': [1.50, 1.75, 2.0, 2.25, 2.5]}
data2 = {'F': ['2015', '2018', '2020'],
         'L': ['2019', '2022', '2024'],
         'Base': ['2', '5', '5'],
         'O': [20, 40, 60],
         'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()

# use this code to get the first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0, 'Yrs']), '2018'] = \
    (1 / pd.to_numeric(df2['Base'])) * (pd.to_numeric(df2['S'])) * \
    (pd.to_numeric(df.at[0, 'Val'])) + (pd.to_numeric(df2['O']))

# use this code to get the rest of the values until L = Yrs
r.loc[(df2['L'] <= df.at[1, 'Yrs']) & (df2['L'] >= df.at[1, 'Yrs']), '2019'] = \
    (pd.to_numeric(r['2018']) - pd.to_numeric(df2['O'])) * \
    pd.to_numeric(df.at[1, 'Val'] / pd.to_numeric(df.at[0, 'Val'])) + \
    pd.to_numeric(df2['O'])
r

I expect the output to be (the values may be different, but it's the pattern I want):

    2018    2019  2020   2021   2022
0   7.75   8.375   NaN    NaN    NaN
1  11.00  11.500    12  12.50  13.00
2    NaN     NaN    18  18.75  19.25

but I get:

    2018    2019  2020   2021   2022
0   7.75   8.375   9.0  9.625  10.25
1  11.00  11.500    12    NaN    NaN
2  16.50  17.250    18    NaN    NaN
Pandas DF Multiple Conditionals using np.where
I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic. My current dataframe looks like df_so below, with four columns. I would like to add two columns with the conditions below; the desired output is the df df_so_v2.

Days since activity
* Find the most recent prior row with the same ID, then subtract the dates column.
* If there is no most recent value, return NA.

Chg. Avg. Value
* Condition 1: if Count = 0, NA.
* Condition 2: if Count != 0, find the most recent prior row with BOTH the same ID and Count != 0, then take the difference in the Avg. Value column.

However, I am building off simple np.where queries like the one below and do not know how to combine the multiple conditions needed in this case.

df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission, df['CASH'])

Thank you very much for your help on this.

df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)

df_dict_v2 = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                         '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                         '2017-08-01', '2017-08-01', '2017-08-01'],
              'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
              'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
              'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
              'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
              'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA']}
df_so_v2 = pd.DataFrame(df_dict_v2)
Here is the answer to the first part of the question; I need more clarification on the conditions of the second part.

1) Days since activity
* Find the most recent prior row with the same ID, then subtract the dates column.
* If there is no most recent value, return NA.

First you need to convert the strings to datetime, then sort the dates in ascending order. Finally, use .transform to find the difference.

df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)
df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so.sort_index()

Edited based on your comment: find the most recent previous day that does not have a Count of zero and calculate the difference.

df = pd.DataFrame(df_dict)  # same dict as above
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID', 'DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)
mask2 = df.groupby('ID').Count.shift(1) == 0
df['Days_since_activity'][mask] = np.nan
df['Days_since_activity'][mask2] = df.groupby(['ID'])['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df['Chg. Avg Value'][mask2] = df.groupby(['ID'])['Avg. Value'].diff(2)
conditions = [(df['Count'] == 0)]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df

New unsorted output for easy comparison:

       DateOf   ID  Count  Avg. Value Days_since_activity  Chg. Avg Value
12 2017-08-01  553      4         4.4                 NaT             NaN
9  2017-08-02  553      1         3.0              1 days            -1.4
6  2017-08-03  553      3         5.3              1 days             2.3
3  2017-08-04  553      0         0.0              1 days             NaN
0  2017-08-07  553      0         0.0              4 days             NaN
13 2017-08-01  559      4         6.4                 NaT             NaN
10 2017-08-02  559      0         0.0              1 days             NaN
7  2017-08-03  559      9         5.0              2 days            -1.4
4  2017-08-04  559     11         4.2              1 days            -0.8
1  2017-08-07  559      4         3.5              3 days            -0.7
14 2017-08-01  914      0         0.0                 NaT             NaN
11 2017-08-02  914      2         2.0                 NaT             NaN
8  2017-08-03  914      0         0.0              1 days             NaN
5  2017-08-04  914     10         3.3              2 days             1.3
2  2017-08-07  914      5         2.2              3 days            -1.1

Index 11 should be NaT because the most recent previous row has a count of zero and there is nothing else to compare it to.
keep rows that have data in list of columns python
How can I select rows that contain data in a specific list of columns and drop the ones that have no data at all in those specific columns? This is the code that I have so far:

VC_sub_selection = final[final['VC'].isin(['ACTIVE', 'SILENT']) & final['Status'].isin(['Test'])]
data_usage_months = list(data_usage_res.columns)

This is an example of the data set:

item  VC      Status  Jun 2016  Jul 2016
1     Active  Test    NaN       1.0
2     Silent  Test    NaN       NaN
3     Active  Test    2.0       3.0
4     Silent  Test    5.0       NaN

What I would like to achieve is that items 1, 3, and 4 stay in the data set and that item 2 is deleted. So the condition that applies is: if all months are NaN, then drop the row.

Thank you,
Jeroen
Though Nickil's solution answers the question, it does not take into account that more date columns may be added later. Hence, using the index position of a column might not be sufficient in future situations. The solution presented below does not use the index; rather, it uses a regex to find the date columns:

import pandas as pd
import re

# item  VC      Status  Jun 2016  Jul 2016
# 1     Active  Test    NaN       1.0
# 2     Silent  Test    NaN       NaN
# 3     Active  Test    2.0       3.0
# 4     Silent  Test    5.0       NaN
df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'VC': ['Active', 'Silent', 'Active', 'Silent'],
                   'Status': ['Test'] * 4,
                   'Jun 2016': [None, None, 2.0, 5.0],
                   'Jul 2016': [1.0, None, 3.0, None]})

regex_pattern = r'[a-zA-Z]{3}\s\d{4}'
date_cols = list(filter(lambda x: re.search(regex_pattern, x), df.columns.tolist()))
df_res = df.dropna(subset=date_cols, how='all')

#    Jul 2016  Jun 2016 Status      VC  item
# 0       1.0       NaN   Test  Active     1
# 2       3.0       2.0   Test  Active     3
# 3       NaN       5.0   Test  Silent     4
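A small variation on the same idea, assuming the df defined above: pandas' own filter method can collect the date columns by regex, which avoids the explicit re/filter/lambda combination:

# columns whose name looks like 'Mon YYYY'
date_cols = df.filter(regex=r'^[A-Za-z]{3}\s\d{4}$').columns
df_res = df.dropna(subset=date_cols, how='all')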