I have a DataFrame with columns: City, Wind direction, Temperature. Each City occurs only once and has only one data point each for Wind direction and Temperature. For instance:
0 New York 225.0 21.0
How can I create my own method and use it on a DataFrame? For example, I would like to create a method "aa" which returns the Temperature in a City minus the mean of the entire "Temperature" column, and use this method when aggregating my DataFrame.
Currently I have created the method "aa" as you can see below and I use it in the aggregation; nevertheless, "aa" shows "0" everywhere. Could you show me the appropriate code? Did I make a mistake in def aa(x)?
import numpy as np

def aa(x):
    return x - np.mean(x)

file.groupby(["City"]).agg({"Wind direction": [np.mean, aa], "Temperature": ["mean", aa]})
Sample data (taken from comments provided by the OP):
file = pd.DataFrame({"City":["New York", "Berlin", "London"], "Wind direction":[225.0, 252.0, 310.0], "Temperature":[21.0, 18.5, 22.0]})
You are getting zeros because the input that aa receives is each group, not the full series, and for a single-element group x - np.mean(x) is the element minus itself, i.e. zero.
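To see it concretely, here is a minimal sketch: a one-element series minus its own mean is always zero.
import numpy as np
import pandas as pd

s = pd.Series([21.0])   # a one-element group, like each City here
print(s - np.mean(s))   # prints 0.0 -> zero for every single-element group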
Now, it's a bit weird to use groupby when you know that each group has only a single element, but you can force it through using something like
def aa(x):
    return x - file[x.name].mean()
With your given example:
In [23]: file.groupby(["City"]).agg({"Wind direction":[np.mean, aa], "Temperature":["mean", aa]})
Out[23]:
Wind direction Temperature
mean aa mean aa
City
Berlin 252.0 -10.333333 18.5 -2.0
London 310.0 47.666667 22.0 1.5
New York 225.0 -37.333333 21.0 0.5
Much more straightforward would be to simply operate on the data frame directly, e.g.
In [26]: file['Wind direction aa'] = file['Wind direction'] - file['Wind direction'].mean()
In [27]: file['Temperature aa'] = file['Temperature'] - file['Temperature'].mean()
In [28]: file
Out[28]:
City Wind direction Temperature Wind direction aa Temperature aa
0 New York 225.0 21.0 -37.333333 0.5
1 Berlin 252.0 18.5 -10.333333 -2.0
2 London 310.0 22.0 47.666667 1.5
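For the general case where each City did have several rows, the original x - np.mean(x) idea does make sense, and groupby().transform broadcasts the per-group result back to the original shape; a minimal sketch:
# A sketch: demean Temperature within each City group.
file['Temperature aa'] = file.groupby('City')['Temperature'].transform(lambda x: x - x.mean())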
Related
I am looking at football player development over a five year period.
I have two dataframes (DFs), one that contains all 20 year-old strikers from FIFA 17 and another that contains all 25 year-old strikers from FIFA 22. I want to create a third DF that contains the attribute changes for each player. There are about 30 columns denoting each attribute, e.g. tackling, shooting, passing etc. So I want the new DF to contain +3 for tackling, +2 for shooting, +6 for passing etc.
The best way of solving this that I can think of is by merging the two DFs and then applying a function to every column that gives the difference between the x and y values, which represent the FIFA 17 and FIFA 22 data respectively.
Any tips much appreciated. Thank you.
As stated, use the difference of the dataframes. I suspect they are not ALL NaN values; you'll only get NaN for rows where the same player isn't in both FIFA 17 and FIFA 22.
When I do it, there are only 533 players present in both (that were 20 years old in FIFA 17 and 25 in FIFA 22).
Here's an example:
import pandas as pd

# Load FIFA 17, keep only the 20-year-olds, and index by player id
fifa17 = pd.read_csv('D:/test/fifa/players_17.csv')
fifa17 = fifa17[fifa17['age'] == 20]
fifa17 = fifa17.set_index('sofifa_id')

# Load FIFA 22, keep only the 25-year-olds, and index by player id
fifa22 = pd.read_csv('D:/test/fifa/players_22.csv')
fifa22 = fifa22[fifa22['age'] == 25]
fifa22 = fifa22.set_index('sofifa_id')
compareCols = ['pace', 'shooting', 'passing', 'dribbling', 'defending',
'physic', 'attacking_crossing', 'attacking_finishing',
'attacking_heading_accuracy', 'attacking_short_passing',
'attacking_volleys', 'skill_dribbling', 'skill_curve',
'skill_fk_accuracy', 'skill_long_passing',
'skill_ball_control', 'movement_acceleration',
'movement_sprint_speed', 'movement_agility',
'movement_reactions', 'movement_balance', 'power_shot_power',
'power_jumping', 'power_stamina', 'power_strength',
'power_long_shots', 'mentality_aggression',
'mentality_interceptions', 'mentality_positioning',
'mentality_vision', 'mentality_penalties',
'mentality_composure', 'defending_marking_awareness',
'defending_standing_tackle', 'defending_sliding_tackle']
# Subtract aligned on sofifa_id, drop players missing from either game,
# then pull the player names back in from the FIFA 22 data
df = fifa22[compareCols] - fifa17[compareCols]
df = df.dropna(axis=0)
df = pd.merge(df, fifa22[['short_name']], how='left', left_index=True, right_index=True)
Output:
print(df)
pace shooting ... defending_sliding_tackle short_name
sofifa_id ...
205291 -1.0 0.0 ... 3.0 H. Stengel
205988 -7.0 3.0 ... -1.0 L. Shaw
206086 0.0 8.0 ... 5.0 H. Toffolo
206113 -2.0 21.0 ... -2.0 S. Gnabry
206463 -3.0 8.0 ... 3.0 J. Dudziak
... ... ... ... ...
236311 -2.0 -1.0 ... 18.0 M. Rog
236393 2.0 5.0 ... 0.0 Marc Cardona
236415 3.0 1.0 ... 9.0 R. Alfani
236441 10.0 31.0 ... 18.0 F. Bustos
236458 1.0 0.0 ... 5.0 A. Poungouras
[533 rows x 36 columns]
You can subtract pandas DataFrames directly; consider the following simple example:
import pandas as pd
df1 = pd.DataFrame({'X':[1,2],'Y':[3,4]})
df2 = pd.DataFrame({'X':[10,20],'Y':[30,40]})
dfdiff = df2 - df1
print(dfdiff)
which gives the output
X Y
0 9 27
1 18 36
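This alignment behaviour is also why the original attempt showed NaN rows: subtraction aligns on the index labels, and rows present in only one frame become NaN. A minimal sketch, with df4 as a hypothetical frame whose index doesn't overlap df2's:
df4 = pd.DataFrame({'X': [1], 'Y': [3]}, index=[5])   # hypothetical, index 5 only
print(df2 - df4)   # all NaN: no index labels overlap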
I have found a solution, but it is very tedious as it requires a line of code for each and every attribute.
I'm simply assigning a new column for each attribute change. For Passing, for instance, the code is:
mergedDF = mergedDF.assign(PassingChange = mergedDF.Passing_x - mergedDF.Passing_y)
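A loop can take the tedium out of that approach; here is a minimal sketch, assuming the merge used pandas' default _x/_y suffixes and with a placeholder attribute list:
# Hypothetical attribute names; substitute the real ~30 columns.
attributes = ['Passing', 'Shooting', 'Tackling']
for attr in attributes:
    mergedDF[attr + 'Change'] = mergedDF[attr + '_x'] - mergedDF[attr + '_y']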
A pandas.Series called "bla" in my example contains pressures in Pa as the index and wind speeds in m/s as the values:
bla
100200.0 2.0
97600.0 NaN
91100.0 NaN
85000.0 3.0
82600.0 NaN
...
6670.0 NaN
5000.0 2.0
4490.0 NaN
3880.0 NaN
3000.0 9.0
Length: 29498, dtype: float64
bla.index
Float64Index([100200.0, 97600.0, 91100.0, 85000.0, 82600.0, 81400.0,
79200.0, 73200.0, 70000.0, 68600.0,
...
11300.0, 10000.0, 9970.0, 9100.0, 7000.0, 6670.0,
5000.0, 4490.0, 3880.0, 3000.0],
dtype='float64', length=29498)
As the wind speed values are NaN more often than not, I intended to interpolate considering the different pressure levels in order to have more wind speed values to work with.
The docs of interpolate() state that there's a method called "index" which interpolates based on the index values, but the results don't make sense compared to the initial values:
bla.interpolate(method="index", axis=0, limit=1, limit_direction="both")
100200.0 **2.00**
97600.0 10.40
91100.0 8.00
85000.0 **3.00**
82600.0 9.75
...
6670.0 3.00
5000.0 **2.00**
4490.0 9.00
3880.0 5.00
3000.0 **9.00**
Length: 29498, dtype: float64
I marked the original values in boldface.
I'd rather expect something like what I get when using "linear":
bla.interpolate(method="linear", axis=0, limit=1, limit_direction="both")
100200.0 **2.000000**
97600.0 2.333333
91100.0 2.666667
85000.0 **3.000000**
82600.0 4.600000
...
6670.0 4.500000
5000.0 **2.000000**
4490.0 4.333333
3880.0 6.666667
3000.0 **9.000000**
Nevertheless, I'd like to properly use "index" as the interpolation method, since it should be the most accurate here: the pressure levels mark the "distance" between the wind speed values.
By and large, I'd like to understand how the interpolation results using "index" with the pressure levels could become so counterintuitive, and how I could make them more sound.
Thanks to @ALollz in the first comment underneath my question, I found out where the issue lay:
My dataframe had 2 index levels, the outer being unique measurement timestamps, the inner being a standard range index.
I should have looked at each subset associated with a unique timestamp separately.
Within these subsets, interpolation makes sense and the results come out just right.
Example:
# Loop over all unique timestamps in the outermost index level
for timestamp in df.index.get_level_values(level=0).unique():
    # Extract the current subset
    df_subset = df.loc[timestamp, :]
    # Carry out interpolation on a column of interest
    df_subset["column of interest"] = df_subset[
        "column of interest"].interpolate(method="linear",
                                          axis=0,
                                          limit=1,
                                          limit_direction="both")
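An equivalent, more concise formulation is to let groupby do the looping; a sketch, assuming "column of interest" is the target column:
# Interpolate within each outer-level (timestamp) group in one pass.
df["column of interest"] = df.groupby(level=0)["column of interest"].transform(
    lambda s: s.interpolate(method="linear", limit=1, limit_direction="both"))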
I'm setting up a dataframe by reading a csv file in pandas; the columns represent one-dimensional positions for different samples, and the rows each represent a 0.01 s time segment. I want to create a new dataframe to represent velocity and acceleration, so basically apply the operation [point(i) - point(i-1)]/0.01 to every cell in the dataframe.
I'm having trouble using pandas.applymap or other approaches because I don't quite know how to refer to multiple arguments in the dataframe for every operation, if that makes sense.
import pandas as pd
import numpy as np

data = pd.read_csv("file_name")

def velocity(xf, xi):
    v = (xf - xi)*100
    return v

velocity = data.applymap(velocity)
This is what the first few column and rows of the original data frame look like:
X LFHD Y LFHD Z LFHD X RFHD Y RFHD
0 700.003 -1769.61 1556.05 811.922 -1878.46
1 699.728 -1769.50 1555.99 811.942 -1878.14
2 699.465 -1769.38 1555.99 811.980 -1877.81
3 699.118 -1769.38 1555.83 812.005 -1877.48
4 699.017 -1768.78 1556.19 812.003 -1877.11
For every positional value in each column, I want to calculate the velocity where the initial positional value is the cell above (xi as the input in the velocity function) and the final positional value is the cell in question (xf).
When I try to run the above code, it gives me an error because only one argument is provided to velocity, when it expects 2. I don't know how to provide the second argument so that it outputs the proper new dataframe with the velocity calculated in each cell.
Pandas already computes the cell-above difference for you: DataFrame.diff() subtracts the previous row from each row, and dividing by the 0.01 s time step is the same as multiplying by 100:
df_velocity = data.diff()*100
df_velocity
Out[6]:
X_LFHD Y_LFHD Z_LFHD X_RFHD Y_RFHD
0 NaN NaN NaN NaN NaN
1 -27.5 11.0 -6.0 2.0 32.0
2 -26.3 12.0 0.0 3.8 33.0
3 -34.7 0.0 -16.0 2.5 33.0
4 -10.1 60.0 36.0 -0.2 37.0
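The question also asks about acceleration; applying the same pattern once more to the velocity frame gives it (a sketch):
# Second difference, again divided by the 0.01 s time step.
df_acceleration = df_velocity.diff()*100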
I have a dataframe, sega_df:
Month 2016-11-01 2016-12-01
Character
Sonic 12.0 3.0
Shadow 5.0 23.0
I would like to create multiple new columns, by applying a formula for each already existing column within my dataframe (to put it shortly, pretty much double the number of columns). That formula is (100 - [5*eachcell])*0.2.
For example, for November for Sonic, (100 - [5*12.0])*0.2 = 8.0, and for December for Sonic, (100 - [5*3.0])*0.2 = 17.0. My ideal output is:
Month 2016-11-01 2016-12-01 Weighted_2016-11-01 Weighted_2016-12-01
Character
Sonic 12.0 3.0 8.0 17.0
Shadow 5.0 23.0 15.0 -3.0
I know how to create a for loop to create one column. This is what I used when only one month was in consideration:
for w in range(1, len(sega_df.index)):
    sega_df['Weighted'] = (100 - 5*sega_df)*0.2
    sega_df[sega_df < 0] = 0
I don't yet have the skills or experience to create multiple columns. I've looked for other questions that might cover exactly what I am doing, but haven't gotten anything to work yet. Thanks in advance.
One vectorised approach is to drop down to NumPy:
A = sega_df.values
A = (100 - 5*A) * 0.2
res = pd.DataFrame(A, index=sega_df.index, columns=('Weighted_'+sega_df.columns))
Then join the result to your original dataframe:
sega_df = sega_df.join(res)
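Equivalently, you can stay in pandas entirely; a sketch replacing both steps above:
# The arithmetic broadcasts over the whole frame, and add_prefix
# names the derived columns.
sega_df = sega_df.join(((100 - 5*sega_df)*0.2).add_prefix('Weighted_'))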
I have a dataframe like this:
long lat Place
-6.779 61.9 Aarhus
-6.790 62.0 Aarhus
54.377 24.4 Dhabi
38.834 9.0 Addis
35.698 9.2 Addis
Is it possible to transform the dataframe into a format like below?
Office long + lat
Aarhus [[-6.779,61.9], [-6.790,62.0]]
Dhabi [[54.377,24.4]]
Addis [[38.834,9.0], [35.698,9.2]]
I tried different methods but still couldn't work this out. This is what I tried, to get a list for each distinct place value:
df2["index"] = df2.index
df2["long"]=df2.groupby('index')['long'].apply(list)
list 1= []
for values in ofce_list:
if df['Office'].any() == values:
list1.append(df.loc[df['Office'] == values, 'long'])
But this returned a series in a list instead which is not desired. Please help. Thank you so much.
df.groupby('Place')[['long', 'lat']].apply(lambda x: x.values.tolist()).\
    reset_index(name='long + lat')
Out[1380]:
Place long + lat
0 Aarhus [[-6.779, 61.9], [-6.79, 62.0]]
1 Addis [[38.834, 9.0], [35.698, 9.2]]
2 Dhabi [[54.376999999999995, 24.4]]
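Here x is each group's two-column sub-frame, so x.values.tolist() turns its rows into [long, lat] pairs, and reset_index(name='long + lat') converts the grouped result back into a regular dataframe with the requested column name.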