How can I select rows that contain data in a specific list of columns and drop the ones that have no data at all in those specific columns?
This is the code that I have so far:
VC_sub_selection = final[final['VC'].isin(['ACTIVE', 'SILENT']) & final['Status'].isin(['Test'])]
data_usage_months = list(data_usage_res.columns)
This is an example of the data set
item    VC      Status   Jun 2016   Jul 2016
1       Active  Test     NaN        1.0
2       Silent  Test     NaN        NaN
3       Active  Test     2.0        3.0
4       Silent  Test     5.0        NaN
What I would like to achieve is that items 1, 3 and 4 stay in the data set and that item 2 is deleted. So the condition that applies is: if all month columns are NaN, then drop the row.
Thank you,
Jeroen
Though Nickil's solution answers the question, it does not take into account that more date columns may be added later. Hence, using the index position of a column might not be sufficient in future situations.
The solution presented below does not use the index, rather it uses a regex to find the date columns:
import pandas as pd
import re
# item  VC      Status  Jun 2016  Jul 2016
# 1     Active  Test    NaN       1.0
# 2     Silent  Test    NaN       NaN
# 3     Active  Test    2.0       3.0
# 4     Silent  Test    5.0       NaN
df = pd.DataFrame({'item': [1,2,3,4],
'VC': ['Active', 'Silent', 'Active', 'Silent'],
'Status': ['Test'] * 4,
'Jun 2016': [None, None, 2.0, 5.0],
'Jul 2016': [1.0, None, 3.0, None]})
regex_pattern = r'[a-zA-Z]{3}\s\d{4}'
date_cols = list(filter(lambda x: re.search(regex_pattern, x), df.columns.tolist()))
df_res = df.dropna(subset=date_cols, how='all')
# Jul 2016 Jun 2016 Status VC item
# 0 1.0 NaN Test Active 1
# 2 3.0 2.0 Test Active 3
# 3 NaN 5.0 Test Silent 4
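As an aside, the same date-column selection can be written without the explicit re loop by letting DataFrame.filter apply the regex (a minimal sketch, assuming the same 'Mon YYYY' column-name pattern):
# Sketch: DataFrame.filter picks the columns whose names match the regex
date_cols = list(df.filter(regex=r'[A-Za-z]{3}\s\d{4}').columns)
df_res = df.dropna(subset=date_cols, how='all')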
I am looking at football player development over a five year period.
I have two dataframes (DFs), one that contains all 20 year-old strikers from FIFA 17 and another that contains all 25 year-old strikers from FIFA 22. I want to create a third DF that contains the attribute changes for each player. There are about 30 columns denoting each attribute, e.g. tackling, shooting, passing etc. So I want the new DF to contain +3 for tackling, +2 for shooting, +6 for passing etc.
The best way of solving this that I can think of is by merging the two DFs and then applying a function to every column that gives the difference between the x and y values, which represent the FIFA 17 and FIFA 22 data respectively.
Any tips much appreciated. Thank you.
As stated, use the difference of the DataFrames. I suspect they are not all NaN values, as you'll only get NaN for rows where the same player isn't in both FIFA 17 and FIFA 22.
When I do it, there are only 533 players in both 17 and 22 (that were 20 years old in FIFA 17 and 25 in FIFA 22).
Here's an example:
import pandas as pd
fifa17 = pd.read_csv('D:/test/fifa/players_17.csv')
fifa17 = fifa17[fifa17['age'] == 20]
fifa17 = fifa17.set_index('sofifa_id')
fifa22 = pd.read_csv('D:/test/fifa/players_22.csv')
fifa22 = fifa22[fifa22['age'] == 25]
fifa22 = fifa22.set_index('sofifa_id')
compareCols = ['pace', 'shooting', 'passing', 'dribbling', 'defending',
'physic', 'attacking_crossing', 'attacking_finishing',
'attacking_heading_accuracy', 'attacking_short_passing',
'attacking_volleys', 'skill_dribbling', 'skill_curve',
'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration',
'movement_sprint_speed', 'movement_agility',
'movement_reactions', 'movement_balance', 'power_shot_power',
'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression',
'mentality_interceptions', 'mentality_positioning',
'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking_awareness',
'defending_standing_tackle', 'defending_sliding_tackle']
df = fifa22[compareCols] - fifa17[compareCols]
df = df.dropna(axis=0)
df = pd.merge(df,fifa22[['short_name']], how = 'left', left_index=True, right_index=True)
Output:
print(df)
pace shooting ... defending_sliding_tackle short_name
sofifa_id ...
205291 -1.0 0.0 ... 3.0 H. Stengel
205988 -7.0 3.0 ... -1.0 L. Shaw
206086 0.0 8.0 ... 5.0 H. Toffolo
206113 -2.0 21.0 ... -2.0 S. Gnabry
206463 -3.0 8.0 ... 3.0 J. Dudziak
... ... ... ... ...
236311 -2.0 -1.0 ... 18.0 M. Rog
236393 2.0 5.0 ... 0.0 Marc Cardona
236415 3.0 1.0 ... 9.0 R. Alfani
236441 10.0 31.0 ... 18.0 F. Bustos
236458 1.0 0.0 ... 5.0 A. Poungouras
[533 rows x 36 columns]
You can subtract pandas DataFrames directly; consider the following simple example:
import pandas as pd
df1 = pd.DataFrame({'X':[1,2],'Y':[3,4]})
df2 = pd.DataFrame({'X':[10,20],'Y':[30,40]})
dfdiff = df2 - df1
print(dfdiff)
which gives the output
X Y
0 9 27
1 18 36
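Note that the subtraction aligns on index and columns: an index value present in only one of the two frames yields a row of NaN. A small sketch of that alignment behaviour:
import pandas as pd

# Indices only partially overlap, so non-matching rows come out as NaN
df1 = pd.DataFrame({'X': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'X': [10, 20]}, index=['b', 'c'])
print(df2 - df1)
#      X
# a  NaN
# b  8.0
# c  NaN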
I have found a solution but it is very tedious as it requires a line of code for each and every attribute.
I'm simply assigning a new column for each attribute change. So for Passing, for instance, the code is:
mergedDF = mergedDF.assign(PassingChange = mergedDF.Passing_x - mergedDF.Passing_y)
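A loop over the attribute names keeps this to a couple of lines instead of one assign per attribute (a sketch; the attribute list and the _x/_y suffixes are assumptions mirroring the merge described above):
# Hypothetical list of attribute columns shared by both frames
attributes = ['Passing', 'Shooting', 'Tackling']
for attr in attributes:
    # _x and _y are the default merge suffixes; which year each holds
    # depends on the order of the merge
    mergedDF[f'{attr}Change'] = mergedDF[f'{attr}_x'] - mergedDF[f'{attr}_y']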
My question is similar to "Pandas: Selecting rows based on value counts of a particular column", but with TWO columns:
This is a very small snippet from the dataframe (The main df contains millions of entries):
overall vote verified reviewTime reviewerID productID
4677505 5.0 NaN True 11 28, 2017 A2O8EJJBFJ9F1 B00NR2VMNC
1302483 5.0 NaN True 04 1, 2017 A1YMYW7EWN4RL3 B001S2PPT0
5073908 3.0 83 True 02 12, 2016 A3H796UY7GIX0K B00ULRFQ1A
200512 5.0 NaN True 07 14, 2016 A150W68P8PYXZE B0000DC0T3
1529831 5.0 NaN True 12 19, 2013 A28GVVNJUZ3VFA B002WE3BZ8
1141922 5.0 NaN False 12 20, 2008 A2UOHALGF2X77Q B001CCLBSA
5930187 3.0 2 True 05 21, 2018 A2CUSR21CZQ6J7 B01DCDG9JC
1863730 5.0 NaN True 05 6, 2017 A38A3VQL8RLS8D B004HKIB6E
1835030 5.0 NaN True 06 20, 2016 A30QT3MWWEPNIE B004D09HRK
4226935 5.0 NaN True 12 27, 2015 A3UORFPF49N96B B00JP12170
Now I want to filter the dataframe so that each reviewerID and productID appears at least k times (lets say k=2) in the final filtered dataframe. In other words: That each user and product has at least k distinct entries/rows.
I would greatly appreciate any help.
Try it this way:
k = 2
df = pd.read_csv('text.csv')
df['count'] = 1
df_group = (df[['reviewerID', 'productID', 'count']]
            .groupby(['reviewerID', 'productID'], as_index=False)
            .sum())
df_group = df_group[df_group['count'] >= k]
df_group.drop(['count'], axis=1, inplace=True)
df.drop(['count'], axis=1, inplace=True)
df = df.merge(df_group, on=['reviewerID', 'productID'])
df
Hope it helps.
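The same pair-count filter can also be written in one step with groupby().transform('size'), which avoids the helper column and the merge (a sketch):
k = 2
# Size of each (reviewerID, productID) group, broadcast back onto every row
pair_counts = df.groupby(['reviewerID', 'productID'])['overall'].transform('size')
df_filtered = df[pair_counts >= k]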
I'm relatively new to Python and pandas and am trying to determine how to create an IF statement (or any other statement) that, once it initially returns a value, continues with other IF statements within a given range.
I have tried .between, .loc, and if statements but am still struggling. I have tried to recreate what is happening in my code but cannot replicate it precisely. Any suggestions or ideas around this problem?
import pandas as pd
data = {'Yrs': [ '2018','2019', '2020', '2021', '2022'], 'Val': [1.50, 1.75, 2.0, 2.25, 2.5] }
data2 = {'F':['2015','2018', '2020'], 'L': ['2019','2022', '2024'], 'Base':['2','5','5'],
'Of': [20, 40, 60], 'S': [5, 10, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
r = pd.DataFrame()
#use this code to get first value when F <= Yrs
r.loc[(df2['F'] <= df.at[0,'Yrs']), '2018'] = \
(1/pd.to_numeric(df2['Base']))*(pd.to_numeric(df2['S']))* \
(pd.to_numeric(df.at[0, 'Val']))+(pd.to_numeric(df2['Of']))
#use this code to get the rest of the values until L = Yrs
r.loc[(df2['L'] <= df.at[1,'Yrs']) & (df2['L'] >= df.at[1,'Yrs']),\
'2019'] = (pd.to_numeric(r['2018'])- pd.to_numeric(df2['Of']))* \
pd.to_numeric(df.at[1, 'Val'] / pd.to_numeric(df.at[0, 'Val'])) + \
pd.to_numeric(df2['Of'])
r
I expect output to be:(the values may be different but its the pattern I want)
2018 2019 2020 2021 2022
0 7.75 8.375 NaN NaN NaN
1 11.0 11.5 12 12.5 13.0
2 NaN NaN 18 18.75 19.25
but i get:
2018 2019 2020 2021 2022
0 7.75 8.375 9.0 9.625 10.25
1 11.0 11.5 12 NaN NaN
2 16.50 17.25 18 NaN NaN
I have a sample data table like:
import pandas as pd
companies = ['Microsoft', 'Google', 'Amazon', 'Microsoft', 'Facebook', 'Google']
products = ['OS', 'Search', 'E-comm', 'E-comm', 'Social Media', 'OS']
count = [5,7,3,19,23,54]
average = [1.2,3.4,2.4,5.2,3.2,4.4]
df = pd.DataFrame({'company': companies, 'product': products,
                   'count': count, 'average': average})
df
average company count product
0 1.2 Microsoft 5 OS
1 3.4 Google 7 Search
2 2.4 Amazon 3 E-comm
3 5.2 Microsoft 19 E-comm
4 3.2 Facebook 23 Social Media
5 4.4 Google 54 OS
Now I want to create a pivot view on both 'average' and 'count', but I am not able to define both values. Here is the sample code with just 'average':
df.pivot_table(index='company', columns='product', values='average', fill_value=0)
The output will be a pivot with companies as rows, products as columns, and the 'average' values.
But I need the data in the format below, which I want to download to Excel: both 'count' and 'average' for each company. I tried stack and groupby, which create a multi-index DataFrame, but that does not give the desired output; I will share the code if needed. Can someone please help?
Use set_index with stack and unstack:
df = (df.set_index(['company', 'product'])
        .stack()
        .unstack('product')
        .rename_axis([None, None])
        .rename_axis(None, axis=1))
print(df)
E-comm OS Search Social Media
Amazon count 3.0 NaN NaN NaN
average 2.4 NaN NaN NaN
Facebook count NaN NaN NaN 23.0
average NaN NaN NaN 3.2
Google count NaN 54.0 7.0 NaN
average NaN 4.4 3.4 NaN
Microsoft count 19.0 5.0 NaN NaN
average 5.2 1.2 NaN NaN
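Alternatively (a sketch, starting again from the original flat df), pivot_table itself accepts a list of value columns, and stacking the value level of the resulting columns gives essentially the same layout:
# Pivot both value columns at once, then move the value names into the row index
out = (df.pivot_table(index='company', columns='product',
                      values=['count', 'average'])
         .stack(level=0))
print(out)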
Forgive any bad wording as I'm rather new to Pandas. I've done a fair amount of Googling but can't quite figure out the keywords I need to get the answer I'm looking for. I have some rather simple data containing counts of a certain flag grouped by IDs and dates, similar to the below:
id date flag count
-------------------------------------
CAZ1 02/03/2012 Y 12
CAZ1 02/03/2012 N 7
CAZ2 03/03/2012 Y 6
CAZ2 03/03/2012 N 2
CRI2 02/03/2012 Y 14
CRI2 02/03/2012 G 5
LMU3 01/12/2013 G 7
LMU4 02/12/2013 G 4
LMU5 01/12/2014 G 3
LMU6 01/12/2014 G 2
LMU7 05/12/2014 G 2
EUR4 01/16/2014 N 3
What I'm looking to do is group the IDs by certain flag combinations, sum their counts, and then get means for these per year. Resulting data should look something like:
        2012   2013   2014   Mean Calculations:
--------------------------------------
Y,N |   6.75    NaN    NaN   (((12+7)/2)+((6+2)/2))/2
--------------------------------------
Y,G |   9.5     NaN    NaN   (14+5)/2
--------------------------------------
G   |    NaN    5.5   2.33   (7+4)/2, (3+2+2)/3
--------------------------------------
N   |    NaN    NaN   3      (3)
Not sure if this makes sense. I think I need to perform multiple GroupBys at the same time, with the option to define the different criteria for each of the different groupings.
Happy to clarify further if needed. My initial attempts at coding this have been filled with errors so I don't think there's much benefit in posting progress so far. In fact, I just tried to write something and it seemed more misleading than helpful. Sorry, >_<.
IIUC, you can get what you want by first doing a groupby and then building a pivot_table:
[original version]
df["date"] = pd.to_datetime(df["date"])
grouped = df.groupby(["id","date"], as_index=False)
df_new = grouped.agg({"flag": ",".join, "count": "sum"})
df_new["year"] = df_new["date"].dt.year
df_final = df_new.pivot_table(index="flag", columns="year")
produces
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 19.0 NaN NaN
Y,N 13.5 NaN NaN
[updated after the question was edited]
If you want the mean instead of the sum, just write mean instead of sum when doing the aggregation, i.e.
df_new = grouped.agg({"flag": ",".join, "count": "mean"})
which gives
>>> df_final
count
year 2012 2013 2014
flag
G NaN 5.5 2.333333
N NaN NaN 3.000000
Y,G 9.50 NaN NaN
Y,N 6.75 NaN NaN
The only tricky part is passing the dictionary to agg so we can perform two aggregation operations at once:
>>> df_new
id date count flag year
0 CAZ1 2012-02-03 19 Y,N 2012
1 CAZ2 2012-03-03 8 Y,N 2012
2 CRI2 2012-02-03 19 Y,G 2012
3 EUR4 2014-01-16 3 N 2014
4 LMU3 2013-01-12 7 G 2013
5 LMU4 2013-02-12 4 G 2013
6 LMU5 2014-01-12 3 G 2014
7 LMU6 2014-01-12 2 G 2014
8 LMU7 2014-05-12 2 G 2014
It's usually easier to work with these flat formats as much as you can and then pivot only at the end.
For example, if your real dataset is more complicated than the one you posted, you might need another groupby -- but that's easy enough using this pattern.
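For instance, if the raw data held several rows per (id, date, flag) combination, you could collapse those first and then reuse the same pattern (a sketch):
# Hypothetical pre-aggregation step: collapse duplicate (id, date, flag) rows
collapsed = df.groupby(['id', 'date', 'flag'], as_index=False)['count'].sum()
grouped = collapsed.groupby(['id', 'date'], as_index=False)
df_new = grouped.agg({'flag': ','.join, 'count': 'mean'})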