I have a df with a column, Critic_Score, that has NaN values. I am trying to replace them with the average of the Critic_Score values from the same platform. This question has been asked on Stack Overflow several times, and I tried four suggestions that did not give me the desired output. Please tell me how to fix this.
This is a subset of the df:
x[['Platform','Critic_Score']].head()
Platform Critic_Score
0 wii 76.0
1 nes NaN
2 wii 82.0
3 wii 80.0
4 gb NaN
More information on the original df:
x.head().to_dict('list')
{'Name': ['wii sports',
'super mario bros.',
'mario kart wii',
'wii sports resort',
'pokemon red/pokemon blue'],
'Platform': ['wii', 'nes', 'wii', 'wii', 'gb'],
'Year_of_Release': [2006.0, 1985.0, 2008.0, 2009.0, 1996.0],
'Genre': ['sports', 'platform', 'racing', 'sports', 'role-playing'],
'NA_sales': [41.36, 29.08, 15.68, 15.61, 11.27],
'EU_sales': [28.96, 3.58, 12.76, 10.93, 8.89],
'JP_sales': [3.77, 6.81, 3.79, 3.28, 10.22],
'Other_sales': [8.45, 0.77, 3.29, 2.95, 1.0],
'Critic_Score': [76.0, nan, 82.0, 80.0, nan],
'User_Score': ['8', nan, '8.3', '8', nan],
'Rating': ['E', nan, 'E', 'E', nan]}
These are the statements I tried followed by their output:
1.
x['Critic_Score'] = x['Critic_Score'].fillna(x.groupby('Platform')['Critic_Score'].transform('mean'), inplace = True)
0 None
1 None
2 None
3 None
4 None
Name: Critic_Score, dtype: object
2.
x.loc[x.Critic_Score.isnull(), 'Critic_Score'] = x.groupby('Platform').Critic_Score.transform('mean')
#no change in column
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
3.
x['Critic_Score'] = x.groupby('Platform')['Critic_Score']\
    .transform(lambda y: y.fillna(y.mean()))
#no change in column
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
Name: Critic_Score, dtype: float64
4.
x['Critic_Score'] = x.groupby('Platform')['Critic_Score'].apply(lambda y: y.fillna(y.mean()))
x['Critic_Score'].head()
Out[73]:
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
Name: Critic_Score, dtype: float64
x.update(
x.groupby('Platform').Critic_Score.transform('mean'),
overwrite=False)
First you create a new df with the same number of rows, but with the platform average on every row.
Then you use that to update the original.
Bear in mind that your sample has only one row for nes and one for gb, both with NaN scores, so there is nothing to average for those platforms.
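Putting it together, a minimal sketch on the question's subset. Note that attempt 1 failed for a separate reason: fillna(..., inplace=True) returns None, and that None was assigned back to the column.

import numpy as np
import pandas as pd

x = pd.DataFrame({'Platform': ['wii', 'nes', 'wii', 'wii', 'gb'],
                  'Critic_Score': [76.0, np.nan, 82.0, 80.0, np.nan]})

# assign the result instead of passing inplace=True
x['Critic_Score'] = x['Critic_Score'].fillna(
    x.groupby('Platform')['Critic_Score'].transform('mean'))

# 'nes' and 'gb' each have a single row with a NaN score here, so their
# group means are NaN and those rows legitimately stay NaN
print(x)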
I'm trying to make a report and then convert it to the prescribed form but I don't know how. Below is my code:
data = pd.read_csv('https://raw.githubusercontent.com/hoatranobita/reports/main/Loan_list_test.csv')
data_pivot = pd.pivot_table(data,('CLOC_CUR_XC_BL'),index=['BIZ_TYPE_SBV_CODE'],columns=['TERM_CODE','CURRENCY_CD'],aggfunc=np.sum).reset_index
print(data_pivot)
Pivot table shows as below:
<bound method DataFrame.reset_index of TERM_CODE Ngắn hạn Trung hạn
CURRENCY_CD 1. VND 2. USD 1. VND 2. USD
BIZ_TYPE_SBV_CODE
201 170000.00 NaN 43533.42 NaN
202 2485441.64 5188792.76 2682463.04 1497309.06
204 35999.99 NaN NaN NaN
301 1120940.65 NaN 190915.62 453608.72
401 347929.88 182908.01 239123.29 NaN
402 545532.99 NaN 506964.23 NaN
403 21735.74 NaN 1855.92 NaN
501 10346.45 NaN NaN NaN
601 881974.40 NaN 50000.00 NaN
602 377216.09 NaN 828868.61 NaN
702 9798.74 NaN 23616.39 NaN
802 155099.66 NaN 762294.95 NaN
803 23456.79 NaN 97266.84 NaN
804 151590.00 NaN 378000.00 NaN
805 182925.30 54206.52 4290216.37 NaN>
Here is the prescribed form:
form = pd.read_excel('https://github.com/hoatranobita/reports/blob/main/Form%20A00034.xlsx?raw=true')
form.head()
Mã ngành kinh tế Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp) Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 NaN Ngắn hạn NaN Trung và dài hạn NaN Tổng cộng
1 NaN Bằng VND Bằng ngoại tệ Bằng VND Bằng ngoại tệ NaN
2 101.0 NaN NaN NaN NaN NaN
3 201.0 NaN NaN NaN NaN NaN
4 202.0 NaN NaN NaN NaN NaN
As you can see, the pivot table has no code 101, but the form does. What do I have to do to convert the DataFrame into the form while handling codes like 101 that are missing from the data?
Thank you.
First, create a worksheet using xlsxwriter:
import xlsxwriter
#start workbook
workbook = xlsxwriter.Workbook('merge1.xlsx')
#Introduce formatting
format = workbook.add_format({'border': 1,'bold': True})
#Adding a worksheet
worksheet = workbook.add_worksheet()
merge_format = workbook.add_format({
'bold':1,
'border': 1,
'align': 'center',
'valign': 'vcenter'})
#Starting the Headers
worksheet.merge_range('A1:A3', 'Mã ngành kinh tế', merge_format)
worksheet.merge_range('B1:F1', 'Dư nợ tín dụng (không bao gồm mua, đầu tư trái phiếu doanh nghiệp)', merge_format)
worksheet.merge_range('B2:C2', 'Ngắn hạn', merge_format)
worksheet.merge_range('D2:E2', 'Trung và dài hạn', merge_format)
worksheet.merge_range('F2:F3', 'Tổng cộng', merge_format)
worksheet.write(2, 1, 'Bằng VND',format)
worksheet.write(2, 2, 'Bằng ngoại tệ',format)
worksheet.write(2, 3, 'Bằng VND',format)
worksheet.write(2, 4, 'Bằng ngoại tệ',format)
After this formatting, you can start writing to the sheet by looping with worksheet.write(); a sample is included below:
expenses = (
    ['Rent', 1000],
    ['Gas', 100],
    ['Food', 300],
    ['Gym', 50],
)
row, col = 3, 0  # start below the three header rows
for item, cost in expenses:
    worksheet.write(row, col, item)
    worksheet.write(row, col + 1, cost)
    row += 1
row and col specify the cell's row and column as zero-based numeric indices, like a matrix.
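Before closing, a hedged sketch of how the question's pivot could be written into this sheet: reindex it to the form's full code list so codes missing from the data (such as 101) still get a blank row. This assumes data_pivot keeps BIZ_TYPE_SBV_CODE as its index, i.e. the trailing .reset_index is dropped (note the question's call is also missing its parentheses, which is why the printed pivot shows a bound method).

# codes as listed in the form's first column (its first two rows are headers)
codes = form.iloc[2:, 0].dropna().astype(int).tolist()
data_pivot = data_pivot.reindex(codes)  # 101 becomes an all-NaN row

row = 3  # first data row, below the merged header rows
for code, values in data_pivot.iterrows():
    worksheet.write(row, 0, code, format)
    for col, value in enumerate(values, start=1):
        if pd.notna(value):  # leave missing cells blank
            worksheet.write(row, col, value, format)
    row += 1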
And finally, close the workbook:
workbook.close()
I am using pandas to analyse some election results. I have a DF, Results, which has a row for each constituency and columns representing the votes for the various parties (over 100 of them):
In[60]: Results.columns
Out[60]:
Index(['Constituency', 'Region', 'Country', 'ID', 'Type', 'Electorate',
'Total', 'Unnamed: 9', '30-50', 'Above',
...
'WP', 'WRP', 'WVPTFP', 'Yorks', 'Young', 'Zeb', 'Party', 'Votes',
'Share', 'Turnout'],
dtype='object', length=147)
So...
In[63]: Results.head()
Out[63]:
Constituency Region Country ID Type \
PAID
1 Aberavon Wales Wales W07000049 County
2 Aberconwy Wales Wales W07000058 County
3 Aberdeen North Scotland Scotland S14000001 Burgh
4 Aberdeen South Scotland Scotland S14000002 Burgh
5 Aberdeenshire West & Kincardine Scotland Scotland S14000058 County
Electorate Total Unnamed: 9 30-50 Above ... WP WRP WVPTFP \
PAID ...
1 49821 31523 NaN NaN NaN ... NaN NaN NaN
2 45525 30148 NaN NaN NaN ... NaN NaN NaN
3 67745 43936 NaN NaN NaN ... NaN NaN NaN
4 68056 48551 NaN NaN NaN ... NaN NaN NaN
5 73445 55196 NaN NaN NaN ... NaN NaN NaN
Yorks Young Zeb Party Votes Share Turnout
PAID
1 NaN NaN NaN Lab 15416 0.489040 0.632725
2 NaN NaN NaN Con 12513 0.415052 0.662230
3 NaN NaN NaN SNP 24793 0.564298 0.648550
4 NaN NaN NaN SNP 20221 0.416490 0.713398
5 NaN NaN NaN SNP 22949 0.415773 0.751528
[5 rows x 147 columns]
The per-constituency results for each party are given in the columns Results.loc[:, 'Unnamed: 9': 'Zeb'].
I can find the winning party (i.e. the party which polled the highest number of votes) and the number of votes it polled using:
RawResults = Results.loc[:, 'Unnamed: 9': 'Zeb']
Results['Party'] = RawResults.idxmax(axis=1)
Results['Votes'] = RawResults.max(axis=1).astype(int)
But, I also need to know how many votes the second-place party got (and ideally its index/name). So is there any way in pandas to return the second highest value/index in a set of columns for each row?
To get the highest values of a column, you can use nlargest():
df['High'].nlargest(2)
The above will give you the 2 highest values of column High.
You can also use nsmallest() to get the lowest values.
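Since the question asks for this per row, the same idea works row-wise with apply. A hedged sketch using the question's RawResults (Votes2 and Party2 are just illustrative names): Series.nlargest keeps the column labels as its index, so the last position yields both the runner-up's votes and its party.

Results['Votes2'] = RawResults.apply(lambda r: r.nlargest(2).iloc[-1], axis=1)
Results['Party2'] = RawResults.apply(lambda r: r.nlargest(2).index[-1], axis=1)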
Here is a NumPy solution:
In [120]: df
Out[120]:
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
In [121]: np.sort(df.values)[:,-2:]
Out[121]:
array([[ 1.33444404, 1.52208164],
[ 1.28237078, 2.05657214],
[ 0.17379254, 0.95558613],
[ 1.06729107, 1.20100071],
[ 0.86201603, 1.28471676],
[ 1.19706331, 1.57417327],
[ 0.61145573, 1.35202868],
[ 0.15513379, 0.40842477],
[ 0.28792928, 1.42722604],
[ 0.48749578, 2.41126532]])
or as a pandas Data Frame:
In [122]: pd.DataFrame(np.sort(df.values)[:,-2:], columns=['2nd-largest','largest'])
Out[122]:
2nd-largest largest
0 1.334444 1.522082
1 1.282371 2.056572
2 0.173793 0.955586
3 1.067291 1.201001
4 0.862016 1.284717
5 1.197063 1.574173
6 0.611456 1.352029
7 0.155134 0.408425
8 0.287929 1.427226
9 0.487496 2.411265
or a faster solution from @Divakar:
In [6]: df
Out[6]:
a b c d e f g h
0 0.649517 -0.223116 0.264734 -1.121666 0.151591 -1.335756 -0.155459 -2.500680
1 0.172981 1.233523 0.220378 1.188080 -0.289469 -0.039150 1.476852 0.736908
2 -1.904024 0.109314 0.045741 -0.341214 -0.332267 -1.363889 0.177705 -0.892018
3 -2.606532 -0.483314 0.054624 0.979734 0.205173 0.350247 -1.088776 1.501327
4 1.627655 -1.261631 0.589899 -0.660119 0.742390 -1.088103 0.228557 0.714746
5 0.423972 -0.506975 -0.783718 -2.044002 -0.692734 0.980399 1.007460 0.161516
6 -0.777123 -0.838311 -1.116104 -0.433797 0.599724 -0.884832 -0.086431 -0.738298
7 1.131621 1.218199 0.645709 0.066216 -0.265023 0.606963 -0.194694 0.463576
8 0.421164 0.626731 -0.547738 0.989820 -1.383061 -0.060413 -1.342769 -0.777907
9 -1.152690 0.696714 -0.155727 -0.991975 -0.806530 1.454522 0.788688 0.409516
In [7]: a = df.values
In [8]: a[np.arange(len(df))[:,None],np.argpartition(-a,np.arange(2),axis=1)[:,:2]]
Out[8]:
array([[ 0.64951665, 0.26473378],
[ 1.47685226, 1.23352348],
[ 0.17770473, 0.10931398],
[ 1.50132666, 0.97973383],
[ 1.62765464, 0.74238959],
[ 1.00745981, 0.98039898],
[ 0.5997243 , -0.0864306 ],
[ 1.21819904, 1.13162068],
[ 0.98982033, 0.62673128],
[ 1.45452173, 0.78868785]])
Here is an interesting approach: what if we replace the maximum value with the minimum value and take idxmax again? It is a quick hack and not recommended!
# note: this assumes df is a Series, not a DataFrame
first_highest_value_index = df.idxmax()
second_highest_value_index = df.replace(df.max(), df.min()).idxmax()
first_highest_value = df[first_highest_value_index]
second_highest_value = df[second_highest_value_index]
You could just sort your results, such that the first rows will contain the max. Then you can simply use indexing to get the first n places.
RawResults = Results.loc[:, 'Unnamed: 9': 'Zeb'].sort_values(by='votes', ascending=False)
RawResults.iloc[0, :] # First place
RawResults.iloc[1, :] # Second place
RawResults.iloc[n, :] # nth place
Here is a solution using the nlargest function:
>>> df
a b c
0 4 20 2
1 5 10 2
2 3 40 5
3 1 50 10
4 2 30 15
>>> def give_largest(col, n):
...     largest = col.nlargest(n).reset_index(drop=True)
...     data = [x for x in largest]
...     index = [f'{i}_largest' for i in range(1, len(largest) + 1)]
...     return pd.Series(data, index=index)
...
>>> def n_largest(df, axis, n):
...     '''
...     Function to return the n largest values of each
...     column/row of the input DataFrame.
...     '''
...     return df.apply(give_largest, axis=axis, n=n)
...
>>> n_largest(df, axis=1, n=2)
1_largest 2_largest
0 20 4
1 10 5
2 40 5
3 50 10
4 30 15
>>> n_largest(df,axis = 0, n = 2)
a b c
1_largest 5 50 15
2_largest 4 40 10
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [4, 5, 3, 1, 2],
'b': [20, 10, 40, 50, 30],
'c': [25, 20, 5, 15, 10]
})
def second_largest(s):
    # the two largest values of the Series; their min is the runner-up
    return s.nlargest(2).min()
print(df.apply(second_largest))
a 4
b 40
c 20
dtype: int64
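Hedged note: the question wants this per row rather than per column, which the same helper gives with axis=1 on the example frame above.

print(df.apply(second_largest, axis=1))
0    20
1    10
2     5
3    15
4    10
dtype: int64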
df
a b c d e f g h
0 1.334444 0.322029 0.302296 -0.841236 -0.360488 -0.860188 -0.157942 1.522082
1 2.056572 0.991643 0.160067 -0.066473 0.235132 0.533202 1.282371 -2.050731
2 0.955586 -0.966734 0.055210 -0.993924 -0.553841 0.173793 -0.534548 -1.796006
3 1.201001 1.067291 -0.562357 -0.794284 -0.554820 -0.011836 0.519928 0.514669
4 -0.243972 -0.048144 0.498007 0.862016 1.284717 -0.886455 -0.757603 0.541992
5 0.739435 -0.767399 1.574173 1.197063 -1.147961 -0.903858 0.011073 -1.404868
6 -1.258282 -0.049719 0.400063 0.611456 0.443289 -1.110945 1.352029 0.215460
7 0.029121 -0.771431 -0.285119 -0.018216 0.408425 -1.458476 -1.363583 0.155134
8 1.427226 -1.005345 0.208665 -0.674917 0.287929 -1.259707 0.220420 -1.087245
9 0.452589 0.214592 -1.875423 0.487496 2.411265 0.062324 -0.327891 0.256577
Transpose and use nlargest in a for loop to get the results ordered for each row:
df1 = df.T
results = list()
for col in df1.columns:
    results.append(df1[col].nlargest(len(df.columns)))
The results variable is a list of pandas Series, where the first item is the df's first row sorted in descending order, and so on. Since each item is a Series, it carries the df's columns as its index (the frame was transposed), so you get both the sorted values and the column names for each row.
results
[h 1.522082
a 1.334444
b 0.322029
c 0.302296
g -0.157942
e -0.360488
d -0.841236
f -0.860188
Name: 0, dtype: float64,
a 2.056572
g 1.282371
b 0.991643
f 0.533202
e 0.235132
c 0.160067
d -0.066473
h -2.050731
Name: 1, dtype: float64,
....
Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and, where there is a currency in "Ccy", adds a new column next to it with the associated currencies.
So something like this:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start, so I simplified the problem to proceed step by step. I tried to regroup the duplicates by name with a list, but it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len):
    if df.iloc[i, 0] not in new_data:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in new_data)))
        new_data.append([df.iloc[i, 0], df.iloc[i, 1]])
    else:
        new_data[str(df.iloc[i, 0])].append(df.iloc[i, 1])
Then I thought it would be easier to use a dictionary, so I tried this loop, but it raises an error, and maybe it is not the best way to reach the expected final result:
from collections import defaultdict

dico = defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len):
    if df.iloc[i, 0] not in dico:
        print(str(df.iloc[i, 0]) + '\t' + str(df.iloc[i, 1]) + '\t' + str(bool(df.iloc[i, 0] not in dico)))
        dico[str(df.iloc[i, 0])] = df.iloc[i, 1]
        print(dico)
    else:
        dico[df.iloc[i, 0]].append(df.iloc[i, 1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you.
Use GroupBy.cumcount as a counter, reshape with DataFrame.set_index and DataFrame.unstack, and finally flatten the column names:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN
I have to parse through a file that has data I need to put/sort in a pandas dataframe. Below is an example of part of a file I parse through:
TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME
------ -------------- -------- -------------- -------------- ----- ----------
1.0 1234 TESTNAME1
1.1 H647333 TESTNAME2
1.2 30 C TEMPTOTAL
1.3 1 cnt CEREAL
1.4 364003 cnt POINTNUM
1.5 20200505 cnt Date
1.6 174143 cnt Time
1.7 2.020051e+007 cnt DateTime
1.8 123 cnt SMT
1.9 23.16 C TEMP1
1.10 23.55 C 123 TEMP2
1.11 22.88 C -23 TEMP3
1.12 22.86 C TEMP4
1.13 1.406 Meter -1.450 1.500 DIST1
1.14 0.718 Meter -0.800 0.350 FAIL DIST2
My issue is: how do I account for having a lower limit but no upper limit, OR an upper limit but no lower limit?
NOTE: My actual text file does not have this case, but my application/project needs to account for the possibility.
How I check each line is below:
line = file_object.readline()
while line.strip():
    # extract data from the line and format all info in one list
    xline = line.strip().split()
    # the length of the info list of the line read
    # is correlated to the data
    if len(xline) == 3:
        number = xline[0]
        results = xline[1]
        testname = xline[2]
        units = None
        lower = None
        upper = None
        # alarm = None
    elif len(xline) == 4:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        testname = xline[3]
        lower = None
        upper = None
        # alarm = None
    elif len(xline) == 6:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        testname = xline[5]
        # alarm = None
    elif len(xline) == 7:
        number = xline[0]
        results = xline[1]
        units = xline[2]
        lower = xline[3]
        upper = xline[4]
        # alarm = xline[5]
        testname = xline[6]
    # create a dictionary containing this row of data
    row = {
        'Test #': number,
        'Result': results,
        'Units': units,
        'Lower Limit': lower,
        'Upper Limit': upper,
        # 'Alarm': alarm,
        'Test Name': testname,
    }
    data.append(row)
    line = file_object.readline()
My idea is that I compare each line read of data to the "TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME" line header positions, but I have no idea on how to do that. If anyone could point me in a direction that could work that would be great!
EDIT: The file is not solely in the table format shown above. It has a whole bunch of staggered block text at the start, as well as multiple "tables" with staggered block text between them.
You can use pd.read_fwf:
df = pd.read_fwf(inputtxt, colspecs='infer')
Output:
TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME
0 ------ -------------- -------- -------------- -------------- ----- ----------
1 1.0 1234 NaN NaN NaN NaN TESTNAME1
2 1.1 H647333 NaN NaN NaN NaN TESTNAME2
3 1.2 30 C NaN NaN NaN TEMPTOTAL
4 1.3 1 cnt NaN NaN NaN CEREAL
5 1.4 364003 cnt NaN NaN NaN POINTNUM
6 1.5 20200505 cnt NaN NaN NaN Date
7 1.6 174143 cnt NaN NaN NaN Time
8 1.7 2.020051e+007 cnt NaN NaN NaN DateTime
9 1.8 123 cnt NaN NaN NaN SMT
10 1.9 23.16 C NaN NaN NaN TEMP1
11 1.10 23.55 C NaN 123 NaN TEMP2
12 1.11 22.88 C -23 NaN NaN TEMP3
13 1.12 22.86 C NaN NaN NaN TEMP4
14 1.13 1.406 Meter -1.450 1.500 NaN DIST1
15 1.14 0.718 Meter -0.800 0.350 FAIL DIST2
And you can drop index 0 to get rid of the dashes:
df = df.drop(0)
Output:
TEST# RESULT UNITS LOWER UPPER ALARM TEST NAME
1 1.0 1234 NaN NaN NaN NaN TESTNAME1
2 1.1 H647333 NaN NaN NaN NaN TESTNAME2
3 1.2 30 C NaN NaN NaN TEMPTOTAL
4 1.3 1 cnt NaN NaN NaN CEREAL
5 1.4 364003 cnt NaN NaN NaN POINTNUM
6 1.5 20200505 cnt NaN NaN NaN Date
7 1.6 174143 cnt NaN NaN NaN Time
8 1.7 2.020051e+007 cnt NaN NaN NaN DateTime
9 1.8 123 cnt NaN NaN NaN SMT
10 1.9 23.16 C NaN NaN NaN TEMP1
11 1.10 23.55 C NaN 123 NaN TEMP2
12 1.11 22.88 C -23 NaN NaN TEMP3
13 1.12 22.86 C NaN NaN NaN TEMP4
14 1.13 1.406 Meter -1.450 1.500 NaN DIST1
15 1.14 0.718 Meter -0.800 0.350 FAIL DIST2
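A hedged alternative to dropping the row afterwards: skip the dashed line at parse time, since read_fwf shares skiprows with read_csv.

df = pd.read_fwf(inputtxt, colspecs='infer', skiprows=[1])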
A non-pandas solution that infers field widths from the dashed ruler line (but do use pandas 🙂):
import re
with open('table.txt') as fin:
    next(fin)  # skip the header line
    # capture the start/end of each run of dashes to get the field widths
    spans = [m.span() for m in re.finditer(r'-+', next(fin))]
    for line in fin:
        # slice each line on the field widths and strip surrounding whitespace
        column = [line[start:end].strip() for start, end in spans]
        print(column)
Output:
['1.0', '1234', '', '', '', '', 'TESTNAME1']
['1.1', 'H647333', '', '', '', '', 'TESTNAME2']
['1.2', '30', 'C', '', '', '', 'TEMPTOTAL']
['1.3', '1', 'cnt', '', '', '', 'CEREAL']
['1.4', '364003', 'cnt', '', '', '', 'POINTNUM']
['1.5', '20200505', 'cnt', '', '', '', 'Date']
['1.6', '174143', 'cnt', '', '', '', 'Time']
['1.7', '2.020051e+007', 'cnt', '', '', '', 'DateTime']
['1.8', '123', 'cnt', '', '', '', 'SMT']
['1.9', '23.16', 'C', '', '', '', 'TEMP1']
['1.10', '23.55', 'C', '', '123', '', 'TEMP2']
['1.11', '22.88', 'C', '-23', '', '', 'TEMP3']
['1.12', '22.86', 'C', '', '', '', 'TEMP4']
['1.13', '1.406', 'Meter', '-1.450', '1.500', '', 'DIST1']
['1.14', '0.718', 'Meter', '-0.800', '0.350', 'FAIL', 'DIST2']
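If you then want the DataFrame the question is ultimately after, a hedged follow-up to the loop above: collect the parsed rows and take the column names from the header line, assuming the header text aligns with the dashed ruler; empty fields become NA.

import re
import pandas as pd

rows = []
with open('table.txt') as fin:
    header = next(fin)
    spans = [m.span() for m in re.finditer(r'-+', next(fin))]
    names = [header[start:end].strip() for start, end in spans]
    for line in fin:
        rows.append([line[start:end].strip() for start, end in spans])

df = pd.DataFrame(rows, columns=names).replace('', pd.NA)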
I have a df populated with XY coordinates for different subjects. I want to create new columns that take specified XY coordinates from those subjects.
This happens whenever a subject's name appears in the 'Person' column: the row then gets that subject's XY coordinates at that index.
import pandas as pd
import numpy as np
import random
AA = 10, 20
k = 5
N = 10
df = pd.DataFrame({
'John Doe_X' : np.random.uniform(k, k + 100 , size=N),
'John Doe_Y' : np.random.uniform(k, k + 100 , size=N),
'Kevin Lee_X' : np.random.uniform(k, k + 100 , size=N),
'Kevin Lee_Y' : np.random.uniform(k, k + 100 , size=N),
'Liam Smith_X' : np.random.uniform(k, k + -100 , size=N),
'Liam Smith_Y' : np.random.uniform(k, k + 100 , size=N),
'Event' : ['AA', 'nan', 'BB', 'nan', 'nan', 'CC', 'nan','CC', 'DD','nan'],
'Person' : ['nan','nan','John Doe','John Doe','nan','Kevin Lee','nan','Liam Smith','John Doe','John Doe']})
df['X'] = df.apply(lambda row: row.get(row['Person']+'_X') if pd.notnull(row['Person']) else np.nan, axis=1)
df['Y'] = df.apply(lambda row: row.get(row['Person']+'_Y') if pd.notnull(row['Person']) else np.nan, axis=1)
Output:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan NaN NaN
1 33.206289 nan NaN NaN
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I'm now hoping to use the 'Event' column to refine the new ['X','Y'] columns. Specifically, I want to return the coordinates of AA (10, 20) when the value 'AA' is in the 'Event' column. Furthermore, I'd like to keep those same coordinates until the next coordinates appear.
So the output would look like:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 75.047164 19.281168 28.064313 87.184248 -76.148559
1 nan 50.642782 68.308319 46.088057 64.132263 -83.109383
2 BB 9.965115 77.950894 48.864693 8.613132 0.106708
3 nan 44.726136 58.751520 69.904076 40.818433 -87.656064
4 nan 101.501119 99.156872 101.976300 93.539749 -57.026015
5 CC 87.778446 65.814911 7.302116 40.577156 -28.703879
6 nan 99.682139 91.715231 88.029451 82.309191 -66.444582
7 CC 38.248267 38.648960 76.065297 67.322639 -34.754868
8 DD 69.429353 61.252800 83.024358 58.038962 -62.001353
9 nan 9.522023 73.009883 41.873986 8.677565 -20.389939
Liam Smith_Y Person X Y
0 18.420494 nan 10 20
1 33.206289 nan 10 20
2 73.833204 John Doe 9.965115 77.950894
3 39.652071 John Doe 44.726136 58.751520
4 88.176561 nan NaN NaN
5 53.776995 Kevin Lee 7.302116 40.577156
6 95.025923 nan NaN NaN
7 26.851864 Liam Smith -34.754868 26.851864
8 102.771046 John Doe 69.429353 61.252800
9 28.633231 John Doe 9.522023 73.009883
I have tried to write something like this:
for value in df['Event']:
    if value == 'AA':
        df['X', 'Y'] = AA
But I get a ValueError: Length of values does not match length of index.
If you want to iterate through rows you can try:
# iterate through rows
for index, row in df.iterrows():
    # check the Event value for the row
    if row['Event'] == 'AA':
        # update the dataframe
        df.loc[index, ('X', 'Y')] = AA
print(df)
Result:
Event John Doe_X John Doe_Y Kevin Lee_X Kevin Lee_Y Liam Smith_X \
0 AA 12.603084 81.636376 25.997186 76.733337 -17.683132
1 nan 104.652839 104.064767 56.762357 83.599629 -34.714117
2 BB 69.724434 33.324135 98.452840 57.407782 -8.479175
3 nan 16.361719 51.290716 41.929234 46.494053 -81.882100
4 nan 30.874579 34.683986 95.434111 80.343098 -62.448286
5 CC 77.619875 70.164773 7.385376 40.142712 -55.590472
6 nan 31.214066 54.081010 36.249414 34.218611 -21.754019
7 CC 91.487647 28.307019 71.235864 48.915612 -37.196812
8 DD 45.036216 61.655465 50.231592 29.511502 -4.583804
9 nan 95.249002 25.649100 31.959114 10.234085 -93.106746
X NaN NaN NaN NaN NaN NaN
Liam Smith_Y Person X Y
0 86.267909 nan 10.000000 20.000000
1 43.090388 nan NaN NaN
2 56.330139 John Doe 69.724434 33.324135
3 65.648633 John Doe 16.361719 51.290716
4 16.349304 nan NaN NaN
5 5.528887 Kevin Lee 7.385376 40.142712
6 75.717007 nan NaN NaN
7 100.925457 Liam Smith -37.196812 100.925457
8 87.256541 John Doe 45.036216 61.655465
9 35.361163 John Doe 95.249002 25.649100
X NaN NaN NaN NaN
Your code has some errors (Person is mixed up with Player, among other things); I assume this is a paste error.
Your problem, however, is easily solved using a boolean mask and assigning the tuple AA to the masked subset with df.loc:
m = df['Event'] == 'AA'
df.loc[m, ['X','Y']] = AA
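To also carry the AA coordinates forward until the next coordinates appear (rows 0 and 1 in the sample), a hedged sketch: forward-fill the Event labels, then fill only the still-missing X/Y rows from a coordinate lookup. The coords dict and the ffill step are my assumptions, not from the question.

import numpy as np

coords = {'AA': (10, 20)}  # hypothetical event -> coordinates lookup
ev = df['Event'].replace('nan', np.nan).ffill()  # 'AA' persists until 'BB'
fill = ev.map(coords)  # (10, 20) where the carried event is AA, else NaN
m = df['X'].isna() & fill.notna()
df.loc[m, ['X', 'Y']] = fill[m].tolist()  # rows 0 and 1 get (10, 20)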