Python: Remodeling the presentation of data from a pandas DataFrame / grouping duplicates

Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the "Name" column and regroups them with the different accounts, and, if there is a currency "Ccy", adds a new column next to it with all the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start! So I simplified the problem to go step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i, 0] not in new_data:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in new_data)))
        new_data.append([df.iloc[i, 0], df.iloc[i, 1]])
    else:
        new_data[str(df.iloc[i, 0])].append(df.iloc[i, 1])
Then I thought that it would be easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best way to get to the expected final result:
from collections import defaultdict
dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i, 0] not in dico:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in dico)))
        dico[str(df.iloc[i, 0])] = df.iloc[i, 1]
        print(dico)
    else:
        dico[df.iloc[i, 0]].append(df.iloc[i, 1])
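(For reference, the error above comes from first storing a scalar and only calling .append on later hits; a consistent version of the dictionary idea would be a sketch like the following, though it still doesn't build the final wide layout:)
from collections import defaultdict

dico = defaultdict(list)
# Always append, so every value in dico is a list from the start.
for name, account in zip(df['Name'], df['Account']):
    dico[name].append(account)
# dico -> {'Dan': ['100', '50', '200'], 'Mike': ['30', '65'], ...}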
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you

Use GroupBy.cumcount for a counter, reshape with DataFrame.set_index and DataFrame.unstack, and finally flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print(df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN
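As a side note, the same reshape can be written with DataFrame.pivot, using the cumcount as the new row index. A minimal sketch, starting again from the original df (before the set_index above):
# Pivot all remaining columns (Account, Ccy) against Name, indexed by the per-name counter
out = df.assign(idx=df.groupby('Name').cumcount()).pivot(index='idx', columns='Name')
out.columns = [f'{a}_{b}' for a, b in out.columns]  # e.g. 'Account_Dan', 'Ccy_Dan'
print(out)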

Related

Find the second largest date in df python [duplicate]

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.ix[:, 'Unnamed: 9': 'Zeb']
I can find the winning party (i.e. the party which polled the highest number of votes) and the number of votes it polled using:
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest():
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
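Applied row-wise to the question's setting, the same idea would look roughly like this (a hedged sketch; RawResults is the party-vote slice defined in the question):
# Series.nlargest keeps the column names as the index, so both value and name are available
second_votes = RawResults.apply(lambda row: row.nlargest(2).iloc[-1], axis=1)   # runner-up value per row
second_party = RawResults.apply(lambda row: row.nlargest(2).index[-1], axis=1)  # runner-up column name per row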
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from @Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
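Note that both array-based routes drop the column names. If you also need which column came second, one hedged sketch uses argsort on the same array (treating NaN as 0 here, which is an assumption about the data):
vals = np.nan_to_num(df.values.astype(float))       # NaN -> 0 so argsort behaves
order = np.argsort(vals, axis=1)                    # ascending positions per row
second_pos = order[:, -2]                           # column position of the runner-up
second_name = df.columns[second_pos]                # runner-up column names
second_value = vals[np.arange(len(df)), second_pos] # runner-up values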
Here is an interesting approach: what if we replace the maximum value with the minimum value and recompute the max? It is a quick hack, though, and not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(), df.min()).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.
RawResults = Results.ix[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Here is a solution using the nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col, n):
...     largest = col.nlargest(n).reset_index(drop=True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1, len(largest) + 1)]
...     return pd.Series(data, index=index)
...
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n largest values of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis=axis, n=n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def second_largest(col):
    # nlargest(2) keeps the two biggest values; min() of those is the runner-up
    return col.nlargest(2).min()

print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
Transpose and use nlargest in a for loop to get the results ordered by each row:
df1 = df.T
results = list()
for col in df1.columns:
    results.append(df1[col].nlargest(len(df.columns)))
The results variable is a list of pandas Series, where the first item in the list is the df's first row sorted in descending order, and so on. Since each item in the list is a pandas Series and the frame was transposed, each Series carries the df's column names as its index, so you get both the values and the column names of each row, sorted.
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....

How do I create new columns by combining data in existing columns?

I have a dataset that includes 5 columns. Excuse the formatting:
id Price Service Rater Name Cleanliness
401013357 5 3 A 1
401014972 2 1 A 5
401022510 3 4 B 2
401022510 5 1 C 9
401022510 3 1 D 4
401022510 2 2 E 2
I would like for there to be only one row for each ID. Therefore, I need to create columns for each of the raters' names and rating categories (e.g. Rater Name Price, Rater Name Service, Rater Name Cleanliness), each in its own column. Thank you.
I've explored groupby but cannot figure out how to manipulate these into new columns. Thank you!
Here's the code and data I'm actually using:
import requests
from pandas import DataFrame
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])
From this, I get
formattedSpread overUnder spread
id provider
401013357 consensus UMass -21 68.0 -21.0
401014972 consensus Rice -22.5 58.5 -22.5
401022510 Caesars Colorado State -17.5 57.5 -17.5
consensus Colorado State -17 57.5 -17.0
numberfire Colorado State -17 58.5 -17.0
teamrankings Colorado State -17 58.0 -17.0
401013437 numberfire Wyoming -5 47.0 5.0
teamrankings Wyoming -5 47.0 5.0
401020671 consensus Ball State -19.5 61.5 -19.5
401019470 Caesars UCF -22.5 NaN 22.5
consensus UCF -22.5 NaN 22.5
numberfire UCF -24 70.0 24.0
teamrankings UCF -24 70.0 24.0
401013328 numberfire Minnesota -21.5 47.0 -21.5
teamrankings Minnesota -21.5 49.0 -21.5
The outcome I am looking for is for each of the 4 different providers to have three columns each, so that it's caesars_formattedSpread, caesars_overUnder, Caesars spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, etc.
When I run unstack as suggested, I don't get what I expect. Instead I get:
formattedSpread 0 UMass -21
1 Rice -22.5
2 Colorado State -17.5
3 Colorado State -17
4 Colorado State -17
5 Colorado State -17
6 Wyoming -5
7 Wyoming -5
8 Ball State -19.5
9 UCF -22.5
10 UCF -22.5
11 UCF -24
12 UCF -24
* Edited, based on the edited question *
Given that your dataframe is df:
df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
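On the question's 5-column sample, that pattern would look roughly like this (a hedged sketch; the data is retyped from the table above, and the flattened names are one possible choice):
import pandas as pd

df = pd.DataFrame({
    'id': [401013357, 401014972, 401022510, 401022510, 401022510, 401022510],
    'Price': [5, 2, 3, 5, 3, 2],
    'Service': [3, 1, 4, 1, 1, 2],
    'Rater Name': ['A', 'A', 'B', 'C', 'D', 'E'],
    'Cleanliness': [1, 5, 2, 9, 4, 2],
})
wide = df.set_index(['id', 'Rater Name']).unstack()
# Flatten the MultiIndex columns, e.g. ('Price', 'A') -> 'A_Price'
wide.columns = [f'{rater}_{cat}' for cat, rater in wide.columns]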
The problem with your edited code is that you don't assign dflinestable.set_index(['id', 'provider']) to anything. So when you then use dflinestable.unstack(), you are unstacking the original dflinestable.
So with your entire code, it should be:
import requests
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True) # Add inplace=True
dflinestable_unstacked = dflinestable.unstack()  # unstack (you could also reassign to the same df)
# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]
This will give you a DataFrame like (abbreviated):
Caesars_formattedSpread ... teamrankings_spread
id ...
401012246 Alabama -24 ... -23.5
401012247 Arkansas -34 ... NaN
401012248 Auburn -1 ... -1.5
401012249 NaN ... NaN
401012250 Georgia -44 ... NaN

df.apply(sorted, axis=1) removes column names?

Working through Pandas Cookbook. Counting the Total Number of Flights Between Cities.
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
print('NumPy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('-----')
desired_width = 320
pd.set_option('display.width', desired_width)
pd.options.display.max_rows = 50
pd.options.display.max_columns = 14
# pd.options.display.float_format = '{:,.2f}'.format
file = "e:\\packt\\data_analysis_and_exploration_with_pandas\\section07\\data\\flights.csv"
flights = pd.read_csv(file)
print(flights.head(10))
print()
# This returns the total number of rows for each group.
flights_ct = flights.groupby(['ORG_AIR', 'DEST_AIR']).size()
print(flights_ct.head(10))
print()
# Get the number of flights between Atlanta and Houston in both directions.
print(flights_ct.loc[[('ATL', 'IAH'), ('IAH', 'ATL')]])
print()
# Sort the origin and destination cities:
# flights_sort = flights.sort_values(by=['ORG_AIR', 'DEST_AIR'], axis=1)
flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)
print(flights_sort.head(10))
print()
# Passing just the first row.
print(sorted(flights.loc[0, ['ORG_AIR', 'DEST_AIR']]))
print()
# Once each row is independently sorted, the column names are no longer correct.
# We will rename them to something generic, then again find the total number of flights between all cities.
rename_dict = {'ORG_AIR': 'AIR1', 'DEST_AIR': 'AIR2'}
flights_sort = flights_sort.rename(columns=rename_dict)
flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()
print(flights_ct2.head(10))
print()
When I get to this line of code, my output differs from the author's:
```flights_sort = flights[['ORG_AIR', 'DEST_AIR']].apply(sorted, axis=1)```
My output does not contain any column names. As a result, when I get to:
```flights_ct2 = flights_sort.groupby(['AIR1', 'AIR2']).size()```
it throws a KeyError. This makes sense, as I am trying to rename columns when no column names exist.
My question is, why are the column names gone? All other output matches the author's output exactly:
Connected to pydev debugger (build 191.7141.48)
NumPy: 1.16.3
Pandas: 0.24.2
-----
MONTH DAY WEEKDAY AIRLINE ORG_AIR DEST_AIR SCHED_DEP DEP_DELAY AIR_TIME DIST SCHED_ARR ARR_DELAY DIVERTED CANCELLED
0 1 1 4 WN LAX SLC 1625 58.0 94.0 590 1905 65.0 0 0
1 1 1 4 UA DEN IAD 823 7.0 154.0 1452 1333 -13.0 0 0
2 1 1 4 MQ DFW VPS 1305 36.0 85.0 641 1453 35.0 0 0
3 1 1 4 AA DFW DCA 1555 7.0 126.0 1192 1935 -7.0 0 0
4 1 1 4 WN LAX MCI 1720 48.0 166.0 1363 2225 39.0 0 0
5 1 1 4 UA IAH SAN 1450 1.0 178.0 1303 1620 -14.0 0 0
6 1 1 4 AA DFW MSY 1250 84.0 64.0 447 1410 83.0 0 0
7 1 1 4 F9 SFO PHX 1020 -7.0 91.0 651 1315 -6.0 0 0
8 1 1 4 AA ORD STL 1845 -5.0 44.0 258 1950 -5.0 0 0
9 1 1 4 UA IAH SJC 925 3.0 215.0 1608 1136 -14.0 0 0
ORG_AIR DEST_AIR
ATL ABE 31
ABQ 16
ABY 19
ACY 6
AEX 40
AGS 83
ALB 33
ANC 2
ASE 1
ATW 10
dtype: int64
ORG_AIR DEST_AIR
ATL IAH 121
IAH ATL 148
dtype: int64
*** No column names *** Why?
0 [LAX, SLC]
1 [DEN, IAD]
2 [DFW, VPS]
3 [DCA, DFW]
4 [LAX, MCI]
5 [IAH, SAN]
6 [DFW, MSY]
7 [PHX, SFO]
8 [ORD, STL]
9 [IAH, SJC]
dtype: object
The author's output. Note the column names are present.
sorted returns a list object and obliterates the columns:
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df.apply(sorted, axis=1)
Out[12]:
0 [1, 2]
1 [3, 4]
dtype: object
In [13]: type(df.apply(sorted, axis=1).iloc[0])
Out[13]: list
It's possible that this wouldn't have been the case in earlier pandas... but it would still be bad code.
You can do this by passing the columns explicitly:
In [14]: df.apply(lambda x: pd.Series(sorted(x), df.columns), axis=1)
Out[14]:
A B
0 1 2
1 3 4
A more efficient way to do this is to sort the underlying numpy array:
In [21]: df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
In [22]: df
Out[22]:
A B
0 1 2
1 3 1
In [23]: arr = df[["A", "B"]].values
In [24]: arr.sort(axis=1)
In [25]: df[["A", "B"]] = arr
In [26]: df
Out[26]:
A B
0 1 2
1 1 3
As you can see this sorts each row.
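A variant of the same idea that builds a new frame in one expression (a sketch, assuming you want a fresh object rather than in-place assignment):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 1]], columns=["A", "B"])
# np.sort sorts each row; reusing the columns and index keeps the labels intact
df_sorted = pd.DataFrame(np.sort(df.values, axis=1), columns=df.columns, index=df.index)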
A final note: I just applied @AndyHayden's numpy-based solution from above.
flights_sort = flights[["ORG_AIR", "DEST_AIR"]].values
flights_sort.sort(axis=1)
flights[["ORG_AIR", "DEST_AIR"]] = flights_sort
All I can say is … wow, what an enormous performance difference. I get the exact same correct answer, and I get it as soon as I click the mouse, compared to the pandas lambda solution (also provided by @AndyHayden), which takes about 20 seconds to perform the sort. That dataset is 58,000+ rows; the numpy solution returns the sort instantly.

pandas new column from values in others

I have a df that is populated with XY coordinates from different subjects. I want to create a new column that takes specified XY coordinates from those subjects.
This happens whenever a subject's name appears in the 'Person' column: this returns the XY coordinates of that subject at that index.
import pandas as pd
import numpy as np
import random
AA = 10, 20
k = 5
N = 10
df = pd.DataFrame({
    'John Doe_X': np.random.uniform(k, k + 100, size=N),
    'John Doe_Y': np.random.uniform(k, k + 100, size=N),
    'Kevin Lee_X': np.random.uniform(k, k + 100, size=N),
    'Kevin Lee_Y': np.random.uniform(k, k + 100, size=N),
    'Liam Smith_X': np.random.uniform(k, k + -100, size=N),
    'Liam Smith_Y': np.random.uniform(k, k + 100, size=N),
    'Event': ['AA', 'nan', 'BB', 'nan', 'nan', 'CC', 'nan', 'CC', 'DD', 'nan'],
    'Person': ['nan', 'nan', 'John Doe', 'John Doe', 'nan', 'Kevin Lee', 'nan', 'Liam Smith', 'John Doe', 'John Doe']})
df['X'] = df.apply(lambda row: row.get(row['Person']+'_X') if pd.notnull(row['Person']) else np.nan, axis=1)
df['Y'] = df.apply(lambda row: row.get(row['Person']+'_Y') if pd.notnull(row['Person']) else np.nan, axis=1)
Output:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan NaN NaN
1 33.206289 nan NaN NaN
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I'm now hoping to use the 'Event' column to refine the new ['X', 'Y'] columns. Specifically, I want to return the coordinates of AA (10, 20) when the value 'AA' is in the 'Event' column. Furthermore, I'd like to keep the same coordinates until the next coordinates appear.
So the output would look like:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan 10 20
1 33.206289 nan 10 20
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I have tried to write something like this:
for value in df['Event']:
    if value == 'AA':
        df['X', 'Y'] = AA
But I get a ValueError: Length of values does not match length of index
If you want to iterate through rows you can try:
# iterate through rows
for index, row in df.iterrows():
    # check Event value for the row
    if row['Event'] == 'AA':
        # update dataframe
        df.loc[index, ('X', 'Y')] = AA
print(df)
Result:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 12.603084 81.636376 25.997186 76.733337 -17.683132
1 nan 104.652839 104.064767 56.762357 83.599629 -34.714117
2 BB 69.724434 33.324135 98.452840 57.407782 -8.479175
3 nan 16.361719 51.290716 41.929234 46.494053 -81.882100
4 nan 30.874579 34.683986 95.434111 80.343098 -62.448286
5 CC 77.619875 70.164773 7.385376 40.142712 -55.590472
6 nan 31.214066 54.081010 36.249414 34.218611 -21.754019
7 CC 91.487647 28.307019 71.235864 48.915612 -37.196812
8 DD 45.036216 61.655465 50.231592 29.511502 -4.583804
9 nan 95.249002 25.649100 31.959114 10.234085 -93.106746
X NaN NaN NaN NaN NaN NaN
Liam Smith_Y Person X Y
0 86.267909 nan 10.000000 20.000000
1 43.090388 nan NaN NaN
2 56.330139 John Doe 69.724434 33.324135
3 65.648633 John Doe 16.361719 51.290716
4 16.349304 nan NaN NaN
5 5.528887 Kevin Lee 7.385376 40.142712
6 75.717007 nan NaN NaN
7 100.925457 Liam Smith -37.196812 100.925457
8 87.256541 John Doe 45.036216 61.655465
9 35.361163 John Doe 95.249002 25.649100
X NaN NaN NaN NaN
Your code has some errors (Person is mistaken for Player, among other things). I assume this is a paste error.
Your problem, however, is easily solved by using a mask and assigning the tuple AA to the masked subset with df.loc:
m = df['Event'] == 'AA'
df.loc[m, ['X','Y']] = AA
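To generalize this to several events, one hedged sketch keeps a lookup of event coordinates (only 'AA' is given in the question; any further keys would be placeholders):
coords = {'AA': (10, 20)}  # extend with 'BB', 'CC', ... as needed
m = df['Event'].isin(coords)
df.loc[m, ['X', 'Y']] = df.loc[m, 'Event'].map(coords).tolist()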

Efficient pandas rolling aggregation over date range by group - Python 2.7 Windows - Pandas 0.19.2

I'm trying to find an efficient way to generate rolling counts or sums in pandas given a grouping and a date range. Eventually, I want to be able to add conditions, i.e. evaluating a 'type' field, but I'm not there just yet. I've written something to get the job done, but feel that there could be a more direct way of getting to the desired result.
My pandas data frame currently looks like this, with the desired output being put in the last column 'rolling_sales_180'.
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
My current solution and environment are below. I've been modeling my solution on this R Q&A on Stack Overflow: Efficient way to perform running total in the last 365 day window.
import pandas as pd
import numpy as np

def trans_date_to_dist_matrix(date_col):  # used to create a distance matrix
    x = date_col.tolist()
    y = date_col.tolist()
    data = []
    for i in x:
        tmp = []
        for j in y:
            tmp.append(abs((i - j).days))
        data.append(tmp)
    del tmp
    return pd.DataFrame(data=data, index=date_col.values, columns=date_col.values)

def lower_tri(x_col, date_col, win):  # x_col = column user wants a rolling sum of, date_col = dates, win = time window
    dm = trans_date_to_dist_matrix(date_col=date_col)  # dm = distance matrix
    dm = dm.where(dm <= win)  # find all elements of the distance matrix that are less than window (time)
    lt = dm.where(np.tril(np.ones(dm.shape)).astype(np.bool))  # lt = lower tri of distance matrix so we get only future dates
    lt[lt >= 0.0] = 1.0  # cleans up our lower tri so that we can sum events that happen on the day we are evaluating
    lt = lt.fillna(0)  # replaces NaN with 0's for multiplication
    return pd.DataFrame(x_col.values * lt.values).sum(axis=1).tolist()

def flatten(x):
    try:
        n = [v for sl in x for v in sl]
        return [v for sl in n for v in sl]
    except:
        return [v for sl in x for v in sl]

data = [
    ['David', '1/1/2015', 100], ['David', '1/5/2015', 500], ['David', '5/30/2015', 50], ['David', '7/25/2015', 50],
    ['Ryan', '1/4/2014', 100], ['Ryan', '1/19/2015', 500], ['Ryan', '3/31/2016', 50],
    ['Joe', '7/1/2015', 100], ['Joe', '9/9/2015', 500], ['Joe', '10/15/2015', 50]
]
list_of_vals = []
dates_df = pd.DataFrame(data=data, columns=['name', 'date', 'amount'], index=None)
dates_df['date'] = pd.to_datetime(dates_df['date'])
list_of_vals.append(dates_df.groupby('name', as_index=False).apply(
    lambda x: lower_tri(x_col=x.amount, date_col=x.date, win=180)))
new_data = flatten(list_of_vals)
dates_df['rolling_sales_180'] = new_data
print dates_df
Your time and feedback are appreciated.
Pandas has support for time-aware rolling via the rolling method, so you can use that instead of writing your own solution from scratch:
def get_rolling_amount(grp, freq):
    return grp.rolling(freq, on='date')['amount'].sum()

df['rolling_sales_180'] = df.groupby('name', as_index=False, group_keys=False) \
                            .apply(get_rolling_amount, '180D')
The resulting output:
name date amount rolling_sales_180
0 David 2015-01-01 100 100.0
1 David 2015-01-05 500 600.0
2 David 2015-05-30 50 650.0
3 David 2015-07-25 50 100.0
4 Ryan 2014-01-04 100 100.0
5 Ryan 2015-01-19 500 500.0
6 Ryan 2016-03-31 50 50.0
7 Joe 2015-07-01 100 100.0
8 Joe 2015-09-09 500 600.0
9 Joe 2015-10-15 50 650.0
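In more recent pandas versions (an assumption; the question targets pandas 0.19.2 on Python 2.7), the apply step can be dropped, since groupby objects support time-aware rolling directly:
rolled = df.groupby('name').rolling('180D', on='date')['amount'].sum()
# rolled has a (name, original row index) MultiIndex; drop the 'name' level to realign by row index
df['rolling_sales_180'] = rolled.reset_index(level=0, drop=True)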
