pandas new column from values in others - python

I have a df that is populated with XY coordinates for different subjects. I want to create new columns that take specified XY coordinates from those subjects.
Whenever a subject's name appears in the 'Person' column, the row should get that subject's XY coordinates at that index.
import pandas as pd
import numpy as np
import random
AA = 10, 20
k = 5
N = 10
df = pd.DataFrame({
    'John Doe_X': np.random.uniform(k, k + 100, size=N),
    'John Doe_Y': np.random.uniform(k, k + 100, size=N),
    'Kevin Lee_X': np.random.uniform(k, k + 100, size=N),
    'Kevin Lee_Y': np.random.uniform(k, k + 100, size=N),
    'Liam Smith_X': np.random.uniform(k, k - 100, size=N),
    'Liam Smith_Y': np.random.uniform(k, k + 100, size=N),
    'Event': ['AA', 'nan', 'BB', 'nan', 'nan', 'CC', 'nan', 'CC', 'DD', 'nan'],
    'Person': ['nan', 'nan', 'John Doe', 'John Doe', 'nan', 'Kevin Lee', 'nan', 'Liam Smith', 'John Doe', 'John Doe']})
df['X'] = df.apply(lambda row: row.get(row['Person']+'_X') if pd.notnull(row['Person']) else np.nan, axis=1)
df['Y'] = df.apply(lambda row: row.get(row['Person']+'_Y') if pd.notnull(row['Person']) else np.nan, axis=1)
Output:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan NaN NaN
1 33.206289 nan NaN NaN
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I'm now hoping to use the 'Event' column to refine the new ['X','Y'] columns. Specifically, I want to return the coordinates of AA (10, 20) when the value 'AA' is in the 'Event' column. Furthermore, I'd like to keep the same coordinates until the next coordinates appear.
So the output would look like:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan 10 20
1 33.206289 nan 10 20
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I have tried to write something like this:
for value in df['Event']:
    if value == 'AA':
        df['X', 'Y'] = AA
But I get a ValueError: Length of values does not match length of index

If you want to iterate through rows you can try:
# iterate through rows
for index, row in df.iterrows():
    # check Event value for the row
    if row['Event'] == 'AA':
        # update dataframe
        df.loc[index, ('X', 'Y')] = AA
print(df)
Result:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 12.603084 81.636376 25.997186 76.733337 -17.683132
1 nan 104.652839 104.064767 56.762357 83.599629 -34.714117
2 BB 69.724434 33.324135 98.452840 57.407782 -8.479175
3 nan 16.361719 51.290716 41.929234 46.494053 -81.882100
4 nan 30.874579 34.683986 95.434111 80.343098 -62.448286
5 CC 77.619875 70.164773 7.385376 40.142712 -55.590472
6 nan 31.214066 54.081010 36.249414 34.218611 -21.754019
7 CC 91.487647 28.307019 71.235864 48.915612 -37.196812
8 DD 45.036216 61.655465 50.231592 29.511502 -4.583804
9 nan 95.249002 25.649100 31.959114 10.234085 -93.106746
X NaN NaN NaN NaN NaN NaN
Liam Smith_Y Person X Y
0 86.267909 nan 10.000000 20.000000
1 43.090388 nan NaN NaN
2 56.330139 John Doe 69.724434 33.324135
3 65.648633 John Doe 16.361719 51.290716
4 16.349304 nan NaN NaN
5 5.528887 Kevin Lee 7.385376 40.142712
6 75.717007 nan NaN NaN
7 100.925457 Liam Smith -37.196812 100.925457
8 87.256541 John Doe 45.036216 61.655465
9 35.361163 John Doe 95.249002 25.649100
X NaN NaN NaN NaN

Your code has some errors (Person is mixed up with Player, among other things); I assume these are paste errors.
Your problem, however, is easily solved by building a boolean mask and assigning the tuple AA to the masked subset with df.loc:
m = df['Event'] == 'AA'
df.loc[m, ['X','Y']] = AA
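To also cover the second requirement, keeping the event coordinates until the next coordinates appear, one option is to forward-fill the event labels first and assign only where no person is given. This is a sketch that reproduces the expected output above; note that in this df the missing entries are the literal string 'nan', not NaN, and event_coords is a made-up name holding the only fixed coordinates given in the question:
import numpy as np
event_coords = {'AA': (10, 20)}  # only 'AA' has fixed coordinates in the question
# carry the last seen event label forward ('nan' is a string here, so convert it first)
ev = df['Event'].replace('nan', np.nan).ffill()
no_person = df['Person'] == 'nan'
for label, xy in event_coords.items():
    df.loc[no_person & (ev == label), ['X', 'Y']] = xy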

Related

How to Split a huge csv into multiple csv's based on column header

I'll try to represent my problem based on the simple example below. I have the main csv below and I am trying to split it into 2 or more csvs based on the column headers, keeping the unique id column intact in every csv file.
Below is the code I am trying to figure out, but it doesn't quite get the result.
import pandas as pd
df = pd.read_csv('abc.csv')
df[['id','name','age']] = df['csv1'].str.split(' ', expand=True)
csv
id name age color Gender
0 101 Jack 23 white M
1 102 Mary 25 black F
2 103 Tom 24 brown M
Output required
csv1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
csv2 -
id color Gender
0 101 white M
1 102 black F
2 103 brown M
UPDATE
I found a better approach with np.array_split
I used this example df:
x y R TR x_c y_c xxx yyy RRR TTTR xxx_c yyy_c
id
1256780.0 13989 6241 6.689222 20.986341 14050.83 6315.33 213989 36241 46.689222 520.986341 614050.83 76315.33
12000.0 14013 6278 53.152036 0.000000 14060.00 6288.00 214013 36278 453.152036 5.000000 614060.00 76288.00
1100.0 14111 6379 87.598357 5.000000 14070.55 7000.00 214111 36379 487.598357 55.000000 614070.55 76288.00
which has 12 columns.
import numpy as np

# the 4 means: split the df into 4 evenly sized chunks (column-wise)
chunks = np.array_split(df, 4, axis=1)
chunks is a list containing the separate dataframes.
Output:
# chunks[0]
x y R
id
1256780.0 13989.0 6241.0 6.689222
12000.0 14013.0 6278.0 53.152036
1100.0 14111.0 6379.0 87.598357
# chunks[1]
TR x_c y_c
id
1256780.0 20.986341 14050.83 6315.33
12000.0 0.000000 14060.00 6288.00
1100.0 5.000000 14070.55 7000.00
# chunks[2]
xxx yyy RRR
id
1256780.0 213989.0 36241.0 46.689222
12000.0 214013.0 36278.0 453.152036
1100.0 214111.0 36379.0 487.598357
# chunks[3]
TTTR xxx_c yyy_c
id
1256780.0 520.986341 614050.83 76315.33
12000.0 5.000000 614060.00 76288.00
1100.0 55.000000 614070.55 76288.00
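To then write each chunk to its own CSV file, keeping the id index in every file, a short sketch (the filenames are arbitrary):
for i, chunk in enumerate(chunks, start=1):
    # to_csv writes the id index as a column in each output file by default
    chunk.to_csv(f'csv{i}.csv')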
Old answer:
You could calculate half of the columns and then use iloc to split the dataframes into two parts.
df = df.set_index('id')
half = len(df.columns)//2
df1, df2 = df.iloc[:,:half], df.iloc[:,half:]
df1 = df1.reset_index()
df2 = df2.reset_index()
Output:
#df1
id name age
0 101 Jack 23
1 102 Mary 25
2 103 Tom 24
#df2
id color Gender
0 101 white M
1 102 black F
2 103 brown M
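To finish the split into actual files, each keeping the id column, a sketch with arbitrary filenames:
df1.to_csv('csv1.csv', index=False)
df2.to_csv('csv2.csv', index=False)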

Find the second largest date in df python [duplicate]

I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.loc[:, 'Unnamed: 9':'Zeb']
I can find the winning party (i.e. the party which polled highest number of votes) and the number of votes it polled using:
RawResults = Results.loc[:, 'Unnamed: 9':'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest():
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
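Since the question asks for the top values per row rather than per column, the same idea can be applied along axis=1. A sketch reusing the RawResults slice from the question (Votes2 and Party2 are made-up column names):
Results['Votes2'] = RawResults.apply(lambda r: r.nlargest(2).iloc[-1], axis=1)  # second-highest vote count
Results['Party2'] = RawResults.apply(lambda r: r.nlargest(2).index[-1], axis=1)  # its party name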
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from @Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
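For readability, the one-liner can be unpacked; this sketch (variable names are mine) keeps the same logic and also recovers the column names, which the question asked for:
a = df.values
# argpartition on -a moves the positions of the two largest entries to the
# front of each row, in descending order, without a full sort
top2_idx = np.argpartition(-a, np.arange(2), axis=1)[:, :2]
rows = np.arange(len(a))[:, None]          # broadcast row selector
top2_vals = a[rows, top2_idx]              # shape (n_rows, 2): largest, second largest
top2_names = df.columns.values[top2_idx]   # matching column labels per row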
Here is an interesting approach: what if we replace the maximum value with the minimum value and take idxmax again? It is a quick hack and not recommended!
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(), df.min()).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort a row's values, so that the first entries contain the max; simple indexing then gives the first n places:
ranked = Results.loc[:, 'Unnamed: 9':'Zeb'].iloc[0].sort_values(ascending=False)  # one constituency
ranked.iloc[0]  # First place
ranked.iloc[1]  # Second place
ranked.iloc[n]  # nth place
Here is a solution using nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col, n):
...     largest = col.nlargest(n).reset_index(drop=True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1, len(largest) + 1)]
...     return pd.Series(data, index=index)
...
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n-largest value of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis=axis, n=n)
...
>>> n_largest(df,axis = 1, n = 2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def second_largest(df):
    return df.nlargest(2).min()

print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
Transpose and use nlargest in a for loop to get the results ordered by each line:
df1 = df.T
results = list()
for col in df1.columns:
    results.append(df1[col].nlargest(len(df.columns)))
The results variable is a list of pandas Series, where the first item in the list is the df's first row sorted in descending order, and so on. Since each item is a pandas Series, it carries the df's columns as its index (the frame was transposed), so you get both the values and the column names of each row, sorted.
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....

Python : Remodeling the presentation data from a pandas Dataframe / group duplicates

Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and, if there is a currency "Ccy", adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start! So I simplified the problem to do it step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i,0] not in new_data:
        print(str(df.iloc[i,0]) + '\t' + str(df.iloc[i,1]) + '\t' + str(bool(df.iloc[i,0] not in new_data)))
        new_data.append([df.iloc[i,0], df.iloc[i,1]])
    else:
        new_data[str(df.iloc[i,0])].append(df.iloc[i,1])
Then I thought that it would be easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best way to get to the expected final result.
from collections import defaultdict
dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i,0] not in dico:
        print(str(df.iloc[i,0]) + '\t' + str(df.iloc[i,1]) + '\t' + str(bool(df.iloc[i,0] not in dico)))
        dico[str(df.iloc[i,0])] = df.iloc[i,1]
        print(dico)
    else:
        dico[df.iloc[i,0]].append(df.iloc[i,1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you
Use GroupBy.cumcount as a counter, reshape with DataFrame.set_index and DataFrame.unstack, and finally flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN

How do I create new columns by combining data in existing columns?

I have a dataset that includes 5 columns (excuse the formatting):
id Price Service Rater Name Cleanliness
401013357 5 3 A 1
401014972 2 1 A 5
401022510 3 4 B 2
401022510 5 1 C 9
401022510 3 1 D 4
401022510 2 2 E 2
I would like there to be only one row for each ID. Therefore, I need to create columns for each of the raters' names and rating categories (e.g. Rater Name Price, Rater Name Service, Rater Name Cleanliness), each in its own column.
I've explored groupby but cannot figure out how to manipulate these into new columns. Thank you!
Here's the code and data I'm actually using:
import requests
from pandas import DataFrame
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'])
From this, I get
formattedSpread overUnder spread
id provider
401013357 consensus UMass -21 68.0 -21.0
401014972 consensus Rice -22.5 58.5 -22.5
401022510 Caesars Colorado State -17.5 57.5 -17.5
consensus Colorado State -17 57.5 -17.0
numberfire Colorado State -17 58.5 -17.0
teamrankings Colorado State -17 58.0 -17.0
401013437 numberfire Wyoming -5 47.0 5.0
teamrankings Wyoming -5 47.0 5.0
401020671 consensus Ball State -19.5 61.5 -19.5
401019470 Caesars UCF -22.5 NaN 22.5
consensus UCF -22.5 NaN 22.5
numberfire UCF -24 70.0 24.0
teamrankings UCF -24 70.0 24.0
401013328 numberfire Minnesota -21.5 47.0 -21.5
teamrankings Minnesota -21.5 49.0 -21.5
The outcome I am looking for is for each of the 4 different providers to have three columns each, so that it's caesars_formattedSpread, caesars_overUnder, Caesars spread, numberfire_formattedSpread, numberfire_overUnder, numberfire_spread, etc.
When I run unstack as suggested, I don't get what I expect. Instead I get:
formattedSpread 0 UMass -21
1 Rice -22.5
2 Colorado State -17.5
3 Colorado State -17
4 Colorado State -17
5 Colorado State -17
6 Wyoming -5
7 Wyoming -5
8 Ball State -19.5
9 UCF -22.5
10 UCF -22.5
11 UCF -24
12 UCF -24
* Edited, based on the edited question *
Given that your dataframe is df:
df = df.set_index(['id', 'Rater Name']) # Make it a Multi Index
df_unstacked = df.unstack()
The problem with your edited code is that you don't assign dflinestable.set_index(['id', 'provider']) to anything. So when you then use dflinestable.unstack(), you are unstacking the original dflinestable.
So with your entire code, it should be:
import requests
import pandas as pd
linesinfo_url = 'https://api.collegefootballdata.com/lines?year=2018&seasonType=regular'
linesresp = requests.get(linesinfo_url)
dflines = pd.DataFrame(linesresp.json())
#nesteddata in lines like game info
#setting game ID as index
dflines.set_index('id', inplace=True)
a = linesresp.json()
#defining a as the response to our get request for this data, in JSON format
buf = []
#i believe this creates a receptacle for nested data I'm extracting from json
for game in a:
    for line in game['lines']:
        game_dict = dict(id=game['id'])
        for cat in ('provider', 'spread', 'formattedSpread', 'overUnder'):
            game_dict[cat] = line[cat]
        buf.append(game_dict)
dflinestable = pd.DataFrame(buf)
dflinestable.set_index(['id', 'provider'], inplace=True) # Add inplace=True
dflinestable_unstacked = dflinestable.unstack()  # unstack (you could also reassign to the same df)
# Flatten columns to single level, in the order as described
dflinestable_unstacked.columns = [f'{j}_{i}' for i, j in dflinestable_unstacked.columns]
This will give you a DataFrame like (abbreviated):
Caesars_formattedSpread ... teamrankings_spread
id ...
401012246 Alabama -24 ... -23.5
401012247 Arkansas -34 ... NaN
401012248 Auburn -1 ... -1.5
401012249 NaN ... NaN
401012250 Georgia -44 ... NaN

How to create a Triangular moving average in python using a for loop

I use python pandas to calculate the following formula
(https://i.stack.imgur.com/XIKBz.png)
I do it in python like this:
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that the code gets long when I calculate TMA 100, so I need to use a for loop to make it easy to change the TMA period.
Thanks in advance
Edited:
I had found the code but there is an error:
values = []
for i in range(1, 201):
    values.append(eurusd['Close']).rolling(window=i).mean()
values.mean()
TMA is an average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0']=df.mean(0)
df['mean1']=df.mean(1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
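The answer above computes one fixed window. Generalising the asker's own definition (TMAn as the average of SMA1 through SMAn, so that TMA2 = (Close + SMA2) / 2 because SMA1 is just the Close), a loop-based sketch could look like this; EURUSD and the column names are placeholders from the question:
import pandas as pd

def tma(close, n):
    # average of the SMAs with windows 1..n; mean(axis=1) skips the leading NaNs of longer windows
    smas = pd.concat([close.rolling(i).mean() for i in range(1, n + 1)], axis=1)
    return smas.mean(axis=1)

for n in (2, 50, 100):
    EURUSD[f'TMA{n}'] = tma(EURUSD['Close'], n)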
