Selecting rows based on value counts of TWO columns - python

My question is similar to "Pandas: Selecting rows based on value counts of a particular column", but with TWO columns.
This is a very small snippet from the dataframe (the main df contains millions of entries):
overall vote verified reviewTime reviewerID productID
4677505 5.0 NaN True 11 28, 2017 A2O8EJJBFJ9F1 B00NR2VMNC
1302483 5.0 NaN True 04 1, 2017 A1YMYW7EWN4RL3 B001S2PPT0
5073908 3.0 83 True 02 12, 2016 A3H796UY7GIX0K B00ULRFQ1A
200512 5.0 NaN True 07 14, 2016 A150W68P8PYXZE B0000DC0T3
1529831 5.0 NaN True 12 19, 2013 A28GVVNJUZ3VFA B002WE3BZ8
1141922 5.0 NaN False 12 20, 2008 A2UOHALGF2X77Q B001CCLBSA
5930187 3.0 2 True 05 21, 2018 A2CUSR21CZQ6J7 B01DCDG9JC
1863730 5.0 NaN True 05 6, 2017 A38A3VQL8RLS8D B004HKIB6E
1835030 5.0 NaN True 06 20, 2016 A30QT3MWWEPNIE B004D09HRK
4226935 5.0 NaN True 12 27, 2015 A3UORFPF49N96B B00JP12170
Now I want to filter the dataframe so that each reviewerID and productID appears at least k times (let's say k=2) in the final filtered dataframe. In other words, each user and each product should have at least k distinct entries/rows.
I would greatly appreciate any help.

Try it this way:
import pandas as pd

k = 2
df = pd.read_csv('text.csv')

# count the rows for each (reviewerID, productID) pair
df['count'] = 1
df_group = df[['reviewerID', 'productID', 'count']].groupby(['reviewerID', 'productID'], as_index=False).sum()

# keep only the pairs that occur at least k times
df_group = df_group[df_group['count'] >= k]
df_group.drop(['count'], axis=1, inplace=True)
df.drop(['count'], axis=1, inplace=True)

# the (inner) merge keeps only the rows belonging to the surviving pairs
df = df.merge(df_group, on=['reviewerID', 'productID'])
df
Hope it helps.
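For what it's worth, the same filter can be written without the helper column and the merge by using groupby(...).transform('size'). Here is a minimal sketch with made-up data (only the column names are taken from the question); it also shows a variant where the counts are checked per column rather than per pair, in case that is what "each user and product" means:

import pandas as pd

# made-up example frame with the question's column names
df = pd.DataFrame({'reviewerID': ['A', 'A', 'B', 'B', 'C'],
                   'productID':  ['P1', 'P1', 'P2', 'P2', 'P3']})
k = 2

# transform('size') broadcasts each group's row count back onto every row,
# so the filter needs no merge
pair_counts = df.groupby(['reviewerID', 'productID'])['productID'].transform('size')
filtered_pairs = df[pair_counts >= k]

# per-column variant: each user AND each product must have at least k rows
user_counts = df.groupby('reviewerID')['reviewerID'].transform('size')
product_counts = df.groupby('productID')['productID'].transform('size')
filtered_per_column = df[(user_counts >= k) & (product_counts >= k)]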

Related

How to create a new dataframe that contains the value changes from multiple columns between two existing dataframes

I am looking at football player development over a five-year period.
I have two dataframes (DFs), one that contains all 20-year-old strikers from FIFA 17 and another that contains all 25-year-old strikers from FIFA 22. I want to create a third DF that contains the attribute changes for each player. There are about 30 columns denoting the attributes, e.g. tackling, shooting, passing, etc. So I want the new DF to contain +3 for tackling, +2 for shooting, +6 for passing, etc.
The best way of solving this that I can think of is by merging the two DFs and then applying a function to every column that gives the difference between the x and y values, which represent the FIFA 17 and FIFA 22 data respectively.
Any tips much appreciated. Thank you.
As stated, use the difference of the dataframes. I suspect they are not ALL NaN values, as you'll only get that for rows where the same player isn't in both FIFA 17 and FIFA 22.
When I do it, there are only 533 players in both 17 and 22 (that were 20 years old in FIFA 17 and 25 in FIFA 22).
Here's an example:
import pandas as pd

# load both datasets, keep the relevant age group, and index by player ID
fifa17 = pd.read_csv('D:/test/fifa/players_17.csv')
fifa17 = fifa17[fifa17['age'] == 20]
fifa17 = fifa17.set_index('sofifa_id')
fifa22 = pd.read_csv('D:/test/fifa/players_22.csv')
fifa22 = fifa22[fifa22['age'] == 25]
fifa22 = fifa22.set_index('sofifa_id')
compareCols = ['pace', 'shooting', 'passing', 'dribbling', 'defending',
               'physic', 'attacking_crossing', 'attacking_finishing',
               'attacking_heading_accuracy', 'attacking_short_passing',
               'attacking_volleys', 'skill_dribbling', 'skill_curve',
               'skill_fk_accuracy', 'skill_long_passing',
               'skill_ball_control', 'movement_acceleration',
               'movement_sprint_speed', 'movement_agility',
               'movement_reactions', 'movement_balance', 'power_shot_power',
               'power_jumping', 'power_stamina', 'power_strength',
               'power_long_shots', 'mentality_aggression',
               'mentality_interceptions', 'mentality_positioning',
               'mentality_vision', 'mentality_penalties',
               'mentality_composure', 'defending_marking_awareness',
               'defending_standing_tackle', 'defending_sliding_tackle']
# subtract on the shared sofifa_id index; players missing from either file become NaN
df = fifa22[compareCols] - fifa17[compareCols]
df = df.dropna(axis=0)
df = pd.merge(df, fifa22[['short_name']], how='left', left_index=True, right_index=True)
Output:
print(df)
pace shooting ... defending_sliding_tackle short_name
sofifa_id ...
205291 -1.0 0.0 ... 3.0 H. Stengel
205988 -7.0 3.0 ... -1.0 L. Shaw
206086 0.0 8.0 ... 5.0 H. Toffolo
206113 -2.0 21.0 ... -2.0 S. Gnabry
206463 -3.0 8.0 ... 3.0 J. Dudziak
... ... ... ... ...
236311 -2.0 -1.0 ... 18.0 M. Rog
236393 2.0 5.0 ... 0.0 Marc Cardona
236415 3.0 1.0 ... 9.0 R. Alfani
236441 10.0 31.0 ... 18.0 F. Bustos
236458 1.0 0.0 ... 5.0 A. Poungouras
[533 rows x 36 columns]
You can simply subtract pandas DataFrames; consider the following simple example:
import pandas as pd
df1 = pd.DataFrame({'X':[1,2],'Y':[3,4]})
df2 = pd.DataFrame({'X':[10,20],'Y':[30,40]})
dfdiff = df2 - df1
print(dfdiff)
which gives the output
X Y
0 9 27
1 18 36
I have found a solution but it is very tedious as it requires a line of code for each and every attribute.
I'm simply assigning a new column for each attribute change. So for Passing, for instance, the code is:
mergedDF = mergedDF.assign(PassingChange = mergedDF.Passing_x - mergedDF.Passing_y)
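For reference, those per-attribute assign calls could be replaced by building all of the "Change" columns in one pass. This is a minimal sketch with a tiny made-up mergedDF, assuming the '_x' columns hold the FIFA 17 values and the '_y' columns the FIFA 22 values (as described in the question):

import pandas as pd

# tiny made-up mergedDF standing in for the real pd.merge(fifa17, fifa22, ...)
mergedDF = pd.DataFrame({'Passing_x': [70, 65], 'Passing_y': [76, 71],
                         'Shooting_x': [60, 80], 'Shooting_y': [62, 82]})

# derive the attribute names once, then compute every change (FIFA 22 minus FIFA 17) at once
attributes = [c[:-2] for c in mergedDF.columns if c.endswith('_x')]
changes = pd.DataFrame({a + 'Change': mergedDF[a + '_y'] - mergedDF[a + '_x'] for a in attributes})
mergedDF = pd.concat([mergedDF, changes], axis=1)
print(mergedDF)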

Calculate Overlapping & Non-Overlapping Data Points in a Data frame across Years

I have a single DataFrame and I need to find, for each toy, how many colors stay the same and how many change across years.
For example: Toy1's color stays intact from 2019 to 2020, but in 2021 there are two entries, one red and one green. So from 2019 to 2020 there is no change, giving an overlap count of 1 and a new count of 0. From 2020 to 2021 the overlap count stays at 1 (the red color), while the new count becomes 1 (the added green color).
Attaching sample data; the original data has millions of records.
Input data -
input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
                           'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
                           'Color': ['Red', 'Red', 'Red', 'Green ', 'Green ', 'Green ', 'Red', 'Green ', 'Red', 'Blue', 'Yellow', 'Yellow']})
Output data -
output_data = pd.DataFrame({'Year': ['2019-2020', '2019-2020', '2019-2020', '2020-2021', '2020-2021', '2020-2021'],
                            'Toy': ['Toy1', 'Toy2', 'Toy3', 'Toy1', 'Toy2', 'Toy3'],
                            'overlap_count': [1, 1, 0, 1, 1, 1],
                            'new_count': [0, 1, 1, 1, 1, 0]})
I am trying the method below, but it is very slow -
toy_list = ['Toy1', 'Toy2', 'Toy3']
year_list = [2019, 2020]
for i in toy_list:
    for j in year_list:
        y1 = j
        y2 = j + 1
        x1 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y1)]
        x2 = input_data[(input_data['Toy'] == i) & (input_data['Toy_year'] == y2)]
        z1 = list(set(x1.Color) & set(x2.Color))
        print(x1)
        print(x2)
        print(z1)
Any leads are really appreciated.
A few steps here. First we unstack the data to get a cross table of toy/year vs color, where 1 indicates that that color was present for that toy/year:
df1 = input_data.assign(count=1).set_index(['Toy','Toy_year','Color']).unstack(level=2)
df1
df1 looks like this:
count
Color Blue Green Red Yellow
Toy Toy_year
Toy1 2019 NaN NaN 1.0 NaN
2020 NaN NaN 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy2 2019 NaN 1.0 NaN NaN
2020 NaN 1.0 1.0 NaN
2021 NaN 1.0 1.0 NaN
Toy3 2019 1.0 NaN NaN NaN
2020 NaN NaN NaN 1.0
2021 NaN NaN NaN 1.0
Now we can aggregate these, by row, to come up with the summary statistics 'overlap_count' and 'new_count'. overlap_count is the sum of matches between each row and the next (within each toy group), and new_count is the sum across the next row minus the overlap from the current row:
ccols = df1.columns
df2 = df1.copy()
df2['overlap_count'] = df1.groupby(['Toy'], group_keys=False).apply(lambda g: (g[ccols] == g[ccols].shift(-1)).sum(axis=1))
df2['new_count'] = df2.groupby(['Toy'], group_keys=False).apply(lambda g: g[ccols].shift(-1).sum(axis=1) - g['overlap_count'])
Now we just massage the result into the required form:
df3 = df2[['overlap_count','new_count']].reset_index().droplevel(1,axis=1)
df3['Year'] = df3['Toy_year'].astype(str) + '-' + df3['Toy_year'].astype(str).shift(-1)
df3 = df3[df3['Toy_year'] != 2021].drop(columns = ['Toy_year'])
df3
output:
Toy overlap_count new_count Year
-- ----- --------------- ----------- ---------
0 Toy1 1 0 2019-2020
1 Toy1 1 1 2020-2021
3 Toy2 1 1 2019-2020
4 Toy2 2 0 2020-2021
6 Toy3 0 1 2019-2020
7 Toy3 1 0 2020-2021
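As an aside, the same two counts can also be computed by collecting each toy/year's colors into a set and comparing consecutive years with set operations. This is a minimal sketch, assuming (as in the sample data) that consecutive years are exactly one apart:

import pandas as pd

input_data = pd.DataFrame({'Toy': ['Toy1', 'Toy1', 'Toy1', 'Toy1', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy2', 'Toy3', 'Toy3', 'Toy3'],
                           'Toy_year': [2019, 2020, 2021, 2021, 2019, 2020, 2020, 2021, 2021, 2019, 2020, 2021],
                           'Color': ['Red', 'Red', 'Red', 'Green ', 'Green ', 'Green ', 'Red', 'Green ', 'Red', 'Blue', 'Yellow', 'Yellow']})

# one set of colors per (toy, year)
colors = input_data.groupby(['Toy', 'Toy_year'])['Color'].apply(set)

rows = []
for (toy, year), current in colors.items():
    nxt = colors.get((toy, year + 1))  # colors of the same toy in the following year
    if nxt is None:
        continue
    rows.append({'Year': f'{year}-{year + 1}',
                 'Toy': toy,
                 'overlap_count': len(current & nxt),  # colors kept
                 'new_count': len(nxt - current)})     # colors added
result = pd.DataFrame(rows)
print(result)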

How to manipulate pandas dataframe with multiple IF statements/ conditions?

I'm relatively new to Python and pandas and am trying to work out how to create an IF statement (or any other statement) that, once it initially returns a value, continues with other IF statements within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?
import pandas as pd
data = {'Yrs': ['2018', '2019', '2020', '2021', '2022'], 'Val': [1.50, 1.75, 2.0, 2.25, 2.5]}
data2 = {'F': ['2015', '2018', '2020'], 'L': ['2019', '2022', '2024'], 'Base': ['2', '5', '5'],
         'Of': [20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()
# use this code to get the first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0, 'Yrs']), '2018'] = \
    (1 / pd.to_numeric(df2['Base'])) * (pd.to_numeric(df2['S'])) * \
    (pd.to_numeric(df.at[0, 'Val'])) + (pd.to_numeric(df2['Of']))
# use this code to get the rest of the values until L == Yrs
r.loc[(df2['L'] <= df.at[1, 'Yrs']) & (df2['L'] >= df.at[1, 'Yrs']),
      '2019'] = (pd.to_numeric(r['2018']) - pd.to_numeric(df2['Of'])) * \
    pd.to_numeric(df.at[1, 'Val'] / pd.to_numeric(df.at[0, 'Val'])) + \
    pd.to_numeric(df2['Of'])
r
I expect the output to be (the values may be different, but it's the pattern I want):
2018 2019 2020 2021 2022
0 7.75 8.375 NaN NaN NaN
1 11.0 11.5 12 12.5 13.0
2 NaN NaN 18 18.75 19.25
but I get:
2018 2019 2020 2021 2022
0 7.75 8.375 9.0 9.625 10.25
1 11.0 11.5 12 NaN NaN
2 16.50 17.25 18 NaN NaN

keep rows that have data in list of columns python

How can I select rows that contain data in a specific list of columns and drop the ones that have no data at all in those specific columns?
This is the code that I have so far:
VC_sub_selection = final[final['VC'].isin(['ACTIVE', 'SILENT']) & final['Status'].isin(['Test'])]
data_usage_months = list(data_usage_res.columns)
This is an example of the data set:
item VC Status Jun 2016 Jul 2016
1 Active Test NaN 1.0
2 Silent Test NaN NaN
3 Active Test 2.0 3.0
4 Silent Test 5.0 NaN
What I would like to achieve is that items 1, 3 and 4 stay in the data set and that item 2 is deleted. So the condition that applies is: if all months are NaN, then drop the row.
Thank you,
Jeroen
Though Nickil's solution answers the question, it does not take into account that more date columns may be added later. Hence, using the index position of a column might not be sufficient in future situations.
The solution presented below does not use the index, rather it uses a regex to find the date columns:
import pandas as pd
import re
# item VC Status Jun 2016 Jul 2016
# 1 Active Test NaN 1.0
# 2 Silent Test NaN NaN
# 3 Active Test 2.0 3.0
# 4 Silent Test 5.0 NaN
df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'VC': ['Active', 'Silent', 'Active', 'Silent'],
                   'Status': ['Test'] * 4,
                   'Jun 2016': [None, None, 2.0, 5.0],
                   'Jul 2016': [1.0, None, 3.0, None]})
regex_pattern = r'[a-zA-Z]{3}\s\d{4}'
date_cols = list(filter(lambda x: re.search(regex_pattern, x), df.columns.tolist()))
df_res = df.dropna(subset=date_cols, how='all')
# Jul 2016 Jun 2016 Status VC item
# 0 1.0 NaN Test Active 1
# 2 3.0 2.0 Test Active 3
# 3 NaN 5.0 Test Silent 4
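A slightly shorter variant of the same idea, assuming every date column matches that pattern, is to let DataFrame.filter do the column matching (continuing with the df defined in the snippet above):

# filter(regex=...) selects the date columns; dropna(how='all') then keeps rows
# that have at least one non-NaN value among them
date_cols = df.filter(regex=r'[a-zA-Z]{3}\s\d{4}').columns
df_res = df.dropna(subset=date_cols, how='all')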

Multiple Groupings on Pandas DataFrame

Forgive any bad wording as I'm rather new to Pandas. I've done a fair amount of Googling but can't quite figure out the keywords I need to get the answer I'm looking for. I have some rather simple data containing counts of a certain flag grouped by IDs and dates, similar to the below:
id date flag count
-------------------------------------
CAZ1 02/03/2012 Y 12
CAZ1 02/03/2012 N 7
CAZ2 03/03/2012 Y 6
CAZ2 03/03/2012 N 2
CRI2 02/03/2012 Y 14
CRI2 02/03/2012 G 5
LMU3 01/12/2013 G 7
LMU4 02/12/2013 G 4
LMU5 01/12/2014 G 3
LMU6 01/12/2014 G 2
LMU7 05/12/2014 G 2
EUR4 01/16/2014 N 3
What I'm looking to do is group the IDs by certain flag combinations, sum their counts, and then get means for these per year. Resulting data should look something like:
2012 2013 2014 Mean Calculations:
--------------------------------------
Y,N | 6.75 NaN NaN (((12+7)/2)+((6+2)/2))/2
--------------------------------------
Y,G | 9.5 NaN NaN (14+5)/2
--------------------------------------
G | NaN 5.5 2.33 (7+4)/2, (3+2+2)/3
--------------------------------------
N | NaN NaN 3 (3)
Not sure if this makes sense. I think I need to perform multiple GroupBys at the same time, with the option to define the different criteria for each of the different groupings.
Happy to clarify further if needed. My initial attempts at coding this have been filled with errors so I don't think there's much benefit in posting progress so far. In fact, I just tried to write something and it seemed more misleading than helpful. Sorry, >_<.
IIUC, you can get what you want by first doing a groupby and then building a pivot_table:
[original version]
df["date"] = pd.to_datetime(df["date"])
grouped = df.groupby(["id","date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "sum"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index="flag", columns="year")
produces
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 19.0 NaN NaN
Y,N 13.5 NaN NaN
[updated after the question was edited]
If you want the mean instead of the sum, just write mean instead of sum when doing the aggregation, i.e.
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
which gives
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 9.50 NaN NaN
Y,N 6.75 NaN NaN
The only tricky part is passing the dictionary to agg so we can perform two aggregation operations at once:
>>> df_new
id date count flag year
0 CAZ1 2012-02-03 19 Y,N 2012
1 CAZ2 2012-03-03 8 Y,N 2012
2 CRI2 2012-02-03 19 Y,G 2012
3 EUR4 2014-01-16 3 N 2014
4 LMU3 2013-01-12 7 G 2013
5 LMU4 2013-02-12 4 G 2013
6 LMU5 2014-01-12 3 G 2014
7 LMU6 2014-01-12 2 G 2014
8 LMU7 2014-05-12 2 G 2014
It's usually easier to work with these flat formats as much as you can and then pivot only at the end.
For example, if your real dataset is more complicated than the one you posted, you might need another groupby -- but that's easy enough using this pattern.
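Purely as a hypothetical illustration of that last point (the region column below is invented, not part of the question's data), an extra grouping key slots straight into the same pattern:

import pandas as pd

# invented data with an extra 'region' column to show the pattern extending
df = pd.DataFrame({"id": ["CAZ1", "CAZ1", "CAZ1", "CAZ1"],
                   "region": ["EU", "EU", "US", "US"],
                   "date": ["02/03/2012", "02/03/2012", "03/03/2012", "03/03/2012"],
                   "flag": ["Y", "N", "Y", "N"],
                   "count": [12, 7, 6, 2]})
df["date"] = pd.to_datetime(df["date"])

# same groupby/agg/pivot_table pattern, just with 'region' added as a key
grouped = df.groupby(["id", "region", "date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index=["flag", "region"], columns="year", values="count")
print(df_final)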
