I am looking at football player development over a five-year period.
I have two dataframes (DFs): one containing all 20-year-old strikers from FIFA 17 and another containing all 25-year-old strikers from FIFA 22. I want to create a third DF that contains the attribute changes for each player. There are about 30 columns denoting the attributes, e.g. tackling, shooting, passing, etc. So I want the new DF to contain +3 for tackling, +2 for shooting, +6 for passing, and so on.
The best way of solving this that I can think of is by merging the two DFs and then applying a function to every column that gives the difference between the x and y values, which represent the FIFA 17 and FIFA 22 data respectively.
Any tips much appreciated. Thank you.
As stated, take the difference of the two dataframes. I suspect they are not ALL NaN values — you'll only get NaN for rows where the same player isn't in both FIFA 17 and FIFA 22.
When I do it, there are only 533 players present in both (players who were 20 years old in FIFA 17 and 25 in FIFA 22).
Here's an example:
import pandas as pd
fifa17 = pd.read_csv('D:/test/fifa/players_17.csv')
fifa17 = fifa17[fifa17['age'] == 20]
fifa17 = fifa17.set_index('sofifa_id')
fifa22 = pd.read_csv('D:/test/fifa/players_22.csv')
fifa22 = fifa22[fifa22['age'] == 25]
fifa22 = fifa22.set_index('sofifa_id')
compareCols = ['pace', 'shooting', 'passing', 'dribbling', 'defending',
'physic', 'attacking_crossing', 'attacking_finishing',
'attacking_heading_accuracy', 'attacking_short_passing',
'attacking_volleys', 'skill_dribbling', 'skill_curve',
'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration',
'movement_sprint_speed', 'movement_agility',
'movement_reactions', 'movement_balance', 'power_shot_power',
'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression',
'mentality_interceptions', 'mentality_positioning',
'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking_awareness',
'defending_standing_tackle', 'defending_sliding_tackle']
# subtract the aligned attribute columns (rows line up on the sofifa_id index)
df = fifa22[compareCols] - fifa17[compareCols]
# drop players that are missing from either game
df = df.dropna(axis=0)
# bring the player name back in from the FIFA 22 frame
df = pd.merge(df, fifa22[['short_name']], how='left', left_index=True, right_index=True)
Output:
print(df)
pace shooting ... defending_sliding_tackle short_name
sofifa_id ...
205291 -1.0 0.0 ... 3.0 H. Stengel
205988 -7.0 3.0 ... -1.0 L. Shaw
206086 0.0 8.0 ... 5.0 H. Toffolo
206113 -2.0 21.0 ... -2.0 S. Gnabry
206463 -3.0 8.0 ... 3.0 J. Dudziak
... ... ... ... ...
236311 -2.0 -1.0 ... 18.0 M. Rog
236393 2.0 5.0 ... 0.0 Marc Cardona
236415 3.0 1.0 ... 9.0 R. Alfani
236441 10.0 31.0 ... 18.0 F. Bustos
236458 1.0 0.0 ... 5.0 A. Poungouras
[533 rows x 36 columns]
You can simply subtract pandas DataFrames; consider the following simple example:
import pandas as pd
df1 = pd.DataFrame({'X':[1,2],'Y':[3,4]})
df2 = pd.DataFrame({'X':[10,20],'Y':[30,40]})
dfdiff = df2 - df1
print(dfdiff)
which gives the output:
X Y
0 9 27
1 18 36
I have found a solution but it is very tedious as it requires a line of code for each and every attribute.
I'm simply assigning a new column for each attribute change. So for Passing, for instance, the code is:
mergedDF = mergedDF.assign(PassingChange = mergedDF.Passing_x - mergedDF.Passing_y)
I am new to pandas, so apologies for the naiveté.
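The per-column assign lines above can be collapsed into one vectorized subtraction over all the suffixed columns at once. A minimal sketch, using a small hypothetical merged frame with pandas' default _x/_y merge suffixes (here computed as _y minus _x, i.e. FIFA 22 minus FIFA 17; swap the operands if you want the opposite sign):

```python
import pandas as pd

# hypothetical merged frame: _x columns from FIFA 17, _y columns from FIFA 22
mergedDF = pd.DataFrame({
    'Passing_x': [60, 70], 'Passing_y': [66, 72],
    'Shooting_x': [55, 65], 'Shooting_y': [57, 70],
})

# find every attribute that has an _x column, and strip the suffix
x_cols = [c for c in mergedDF.columns if c.endswith('_x')]
attrs = [c[:-2] for c in x_cols]

# subtract the FIFA 17 block from the FIFA 22 block in one step;
# to_numpy() aligns the two column blocks positionally
changes = pd.DataFrame(
    mergedDF[[a + '_y' for a in attrs]].to_numpy()
    - mergedDF[[a + '_x' for a in attrs]].to_numpy(),
    columns=[a + 'Change' for a in attrs],
    index=mergedDF.index,
)
print(changes)
```

This produces one FooChange column per attribute without writing a line of code per column.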
I have two dataframes.
One is out.hdf:
999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074
999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292
999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030
999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340
and another one is out.res (the first column is station name):
061Z 56.72 0.0 P 603879074
061Z 29.92 0.0 P 603879074
0614 46.24 0.0 P 603879292
109C 87.51 0.0 P 603947030
113A 66.93 0.0 P 603947030
113A 26.93 0.0 P 603947030
121A 31.49 0.0 P 603947340
The last column in each dataframe is the ID.
I want to create a new dataframe that puts the rows with the same ID from the two dataframes together, in this way: first read a line from hdf, then put the lines from res with the same ID beneath it, without keeping the ID column from res.
The new dataframe:
"999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074"
061Z 56.72 0.0 P
061Z 29.92 0.0 P
"999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292"
0614 46.24 0.0 P
"999999 2014 1 9 12 27 10.98 13.265 -89.835 55.0 4.5 0.0 0.0 0.0 603947030"
109C 87.51 0.0 P
113A 66.93 0.0 P
113A 26.93 0.0 P
"999999 2014 1 9 20 57 44.88 23.273 -80.778 15.0 5.1 0.0 0.0 0.0 603947340"
121A 31.49 0.0 P
My code to do this is:
import csv
import pandas as pd
import numpy as np

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input to the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    i = 0
    with open('./out.hdf', 'r') as hdf_file:
        for line in hdf_file:
            liney = line.strip()
            writer.writerow(np.array([liney]))
            print(liney)
            j = 0
            with open('./out.res', 'r') as res_file:
                for line in res_file:
                    if res.iloc[j, 4] == hdf.iloc[i, 14]:
                        strng = res.iloc[j, [0, 1, 2, 3]]
                        print(strng)
                        writer.writerow(np.array(strng))
                    j += 1
            i += 1
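The nested loop above rescans the whole out.res file once per hdf line, which becomes slow on thousands of rows. A groupby on the ID column can do the same interleaving in a single pass. A minimal sketch, with inline strings standing in for the two files (sample rows taken from the question; the ID is the last field in each):

```python
import io
import pandas as pd

# inline stand-ins for out.hdf and out.res
hdf_txt = ("999999 2014 1 2 15 19 45.19 14.095 -91.528 69.7 4.5 0.0 0.0 0.0 603879074\n"
           "999999 2014 1 2 23 53 57.58 16.128 -97.815 23.2 4.8 0.0 0.0 0.0 603879292\n")
res_txt = ("061Z 56.72 0.0 P 603879074\n"
           "061Z 29.92 0.0 P 603879074\n"
           "0614 46.24 0.0 P 603879292\n")

res = pd.read_csv(io.StringIO(res_txt), sep=r'\s+', header=None)

# group the station rows by their ID (last column) once, for O(1) lookup per event
res_by_id = {event_id: group for event_id, group in res.groupby(res.columns[-1])}

lines = []
for raw in hdf_txt.strip().split('\n'):
    lines.append(raw)                    # the hdf event line, verbatim
    event_id = int(raw.split()[-1])      # ID is the last field
    group = res_by_id.get(event_id)
    if group is not None:
        for _, station in group.iterrows():
            # station row with the trailing ID column dropped
            lines.append(' '.join(map(str, station.iloc[:-1])))

print('\n'.join(lines))
```

Because every res row is touched exactly once, this stays fast even when both files have thousands of lines.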
The goal is to keep just unique stations in the 3rd dataframe. I used these commands for res to keep unique stations before creating the 3rd dataframe:
res.drop_duplicates([0], keep = 'last', inplace = True)
and
res.groupby([0], as_index = False).last()
and it works fine. The problem is that for a large data set, with thousands of lines, using these commands causes some lines of the res file to be omitted from the 3rd dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I'm going crazy here; thanks for your time and help in advance.
I found the problem and hope it is helpful for others in the future.
In a large data set, the duplicated stations repeat many times, but not consecutively, and drop_duplicates() was keeping just one of them.
However, I wanted to drop only the consecutive duplicates, not all of them, and I've done this using shift:
unique_stations = res.loc[res[0].shift() != res[0]]
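The shift trick keeps a row whenever its station differs from the station directly above it, so only consecutive repeats are dropped. A small self-contained illustration, using a hypothetical res frame with the station names in column 0 as above:

```python
import pandas as pd

res = pd.DataFrame({0: ['113A', '113A', '121A', '113A']})

# drop_duplicates keeps only one '113A' overall, merging the non-adjacent repeat
assert len(res.drop_duplicates([0])) == 2

# the shift comparison drops only the row whose station equals the previous row's,
# so the later, non-adjacent '113A' survives
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations[0].tolist())  # ['113A', '121A', '113A']
```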
I'm trying to get data from a txt file with pandas.read_csv, but it doesn't show the repeated (same) values in the file. For example, I have 2043 in every row, but it is shown only once, not in every row.
My file sample
Result set
All the circles I've drawn should also contain 2043, but they are empty.
My code is :
import pandas as pd
df = pd.read_csv('samplefile.txt', sep='\t', header=None,
                 names=["234", "235", "236"])
You get a MultiIndex, so repeated values in the first index levels are shown only once.
You can convert MultiIndex to columns by reset_index:
df = df.reset_index()
Or specify every column in the names parameter to avoid the MultiIndex:
df = pd.read_csv('samplefile.txt', sep='\t', names=["one", "two", "next", "234", "235", "236"])
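The blank cells come from pandas' MultiIndex display: when names lists fewer names than the file has columns, the leading leftover columns become the index, and repeated index values are printed only on their first row. A minimal reproduction with inline data standing in for the file (hypothetical values; 2043 plays the role of the repeated value from the question):

```python
import io
import pandas as pd

# six tab-separated columns, but only three names below
data = "a\tx\t2043\t10\t11\t12\nb\ty\t2043\t20\t21\t22\n"

# the first three columns become a MultiIndex, and the repeated 2043
# is blanked after its first appearance in the printed output
df = pd.read_csv(io.StringIO(data), sep='\t', header=None, names=["234", "235", "236"])
print(df)

# reset_index moves the index levels back into ordinary columns,
# so every row shows its value again
df = df.reset_index()
print(df)
```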
A word of warning with MultiIndex, as I was bitten by this yesterday and wasted time trying to troubleshoot a non-existent problem.
If one of your index levels is of type float64, you may find that the index values are not shown in full. I had a dataframe I was running df.groupby().describe() on, and the variable I was performing the groupby() on was originally a long int; at some point it was converted to a float, and when printed this index was rounded. There were a number of values very close to each other, and so it appeared on printing that the groupby() had found only one value of the first index with multiple levels of the second.
That's not very clear, so here is an illustrative example...
import numpy as np
import pandas as pd
index = np.random.uniform(low=89908893132829,
high=89908893132929,
size=(50,))
df = pd.DataFrame({'obs': np.arange(100)},
index=np.append(index, index)).sort_index()
df.index.name = 'index1'
df['index2'] = [1, 2] * 50
df.reset_index(inplace=True)
df.set_index(['index1', 'index2'], inplace=True)
Look at the dataframe and it appears that there is only one level of index1...
df.head(10)
obs
index1 index2
8.990889e+13 1 4
2 54
1 61
2 11
1 89
2 39
1 65
2 15
1 60
2 10
Run groupby(['index1', 'index2']).describe() and it looks like there is only one level of index1...
summary = df.groupby(['index1', 'index2']).describe()
summary.head()
obs
count mean std min 25% 50% 75% max
index1 index2
8.990889e+13 1 1.0 4.0 NaN 4.0 4.0 4.0 4.0 4.0
2 1.0 54.0 NaN 54.0 54.0 54.0 54.0 54.0
1 1.0 61.0 NaN 61.0 61.0 61.0 61.0 61.0
2 1.0 11.0 NaN 11.0 11.0 11.0 11.0 11.0
1 1.0 89.0 NaN 89.0 89.0 89.0 89.0 89.0
But if you look at the actual values of index1 in either you see that there are multiple unique values. In the original dataframe...
df.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132848.5,
89908893132848.5, 89908893132855.17, 89908893132855.17,
89908893132855.45, 89908893132855.45, 89908893132864.62,
89908893132864.62, 89908893132868.61, 89908893132868.61,
89908893132873.16, 89908893132873.16, 89908893132875.6,
89908893132875.6, 89908893132875.83, 89908893132875.83,
89908893132878.73, 89908893132878.73, 89908893132879.9,
89908893132879.9, 89908893132880.67, 89908893132880.67,
89908893132880.69, 89908893132880.69, 89908893132881.31,
89908893132881.31, 89908893132881.69, 89908893132881.69,
89908893132884.45, 89908893132884.45, 89908893132887.27,
89908893132887.27, 89908893132887.83, 89908893132887.83,
89908893132892.8, 89908893132892.8, 89908893132894.34,
89908893132894.34, 89908893132894.5, 89908893132894.5,
89908893132901.88, 89908893132901.88, 89908893132903.27,
89908893132903.27, 89908893132904.53, 89908893132904.53,
89908893132909.27, 89908893132909.27, 89908893132910.38,
89908893132910.38, 89908893132911.86, 89908893132911.86,
89908893132913.4, 89908893132913.4, 89908893132915.73,
89908893132915.73, 89908893132916.06, 89908893132916.06,
89908893132922.48, 89908893132922.48, 89908893132923.44,
89908893132923.44, 89908893132924.66, 89908893132924.66,
89908893132925.14, 89908893132925.14, 89908893132928.28,
89908893132928.28],
dtype='float64', name='index1')
...and in the summarised dataframe...
summary.index.get_level_values('index1')
Float64Index([89908893132833.12, 89908893132833.12, 89908893132834.08,
89908893132834.08, 89908893132835.05, 89908893132835.05,
89908893132836.3, 89908893132836.3, 89908893132837.95,
89908893132837.95, 89908893132838.1, 89908893132838.1,
89908893132838.6, 89908893132838.6, 89908893132841.89,
89908893132841.89, 89908893132841.95, 89908893132841.95,
89908893132845.81, 89908893132845.81, 89908893132845.83,
89908893132845.83, 89908893132845.88, 89908893132845.88,
89908893132846.02, 89908893132846.02, 89908893132847.2,
89908893132847.2, 89908893132847.67, 89908893132847.67,
89908893132848.5, 89908893132848.5, 89908893132855.17,
89908893132855.17, 89908893132855.45, 89908893132855.45,
89908893132864.62, 89908893132864.62, 89908893132868.61,
89908893132868.61, 89908893132873.16, 89908893132873.16,
89908893132875.6, 89908893132875.6, 89908893132875.83,
89908893132875.83, 89908893132878.73, 89908893132878.73,
89908893132879.9, 89908893132879.9, 89908893132880.67,
89908893132880.67, 89908893132880.69, 89908893132880.69,
89908893132881.31, 89908893132881.31, 89908893132881.69,
89908893132881.69, 89908893132884.45, 89908893132884.45,
89908893132887.27, 89908893132887.27, 89908893132887.83,
89908893132887.83, 89908893132892.8, 89908893132892.8,
89908893132894.34, 89908893132894.34, 89908893132894.5,
89908893132894.5, 89908893132901.88, 89908893132901.88,
89908893132903.27, 89908893132903.27, 89908893132904.53,
89908893132904.53, 89908893132909.27, 89908893132909.27,
89908893132910.38, 89908893132910.38, 89908893132911.86,
89908893132911.86, 89908893132913.4, 89908893132913.4,
89908893132915.73, 89908893132915.73, 89908893132916.06,
89908893132916.06, 89908893132922.48, 89908893132922.48,
89908893132923.44, 89908893132923.44, 89908893132924.66,
89908893132924.66, 89908893132925.14, 89908893132925.14,
89908893132928.28, 89908893132928.28],
dtype='float64', name='index1')
I wasted time scratching my head wondering why my groupby(['index1', 'index2']) had produced only one level of index1!
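When a printed index looks collapsed like this, checking the actual number of distinct values sidesteps the display rounding entirely. A small sketch on the same kind of float index, using three hypothetical values taken from the output above that all round to the same display string:

```python
import numpy as np
import pandas as pd

# three distinct float index values that the repr rounds to 8.990889e+13
index = np.array([89908893132833.12, 89908893132834.08, 89908893132835.05])
df = pd.DataFrame({'obs': np.arange(6)}, index=np.append(index, index)).sort_index()
df.index.name = 'index1'

print(df)                  # every row appears to share one index1 value
print(df.index.nunique())  # 3 -- the index really holds three distinct values
```

If nunique() disagrees with what the repr suggests, the "missing" levels are a display artifact, not a groupby bug.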