I have a situation where I am trying to join df_a to df_b
In reality, these dataframes have shapes: (389944, 121) and (1098118, 60)
I need to conditionally join these two dataframes if any of the below conditions are true. If multiple conditions match, the row only needs to be joined once:
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
For an example...
df_a:
player          website                merch
michael jordan  www.michaeljordan.com  Y
Lebron James    www.kingjames.com      Y
Kobe Bryant     www.mamba.com          Y
Larry Bird      www.larrybird.com      Y
luka Doncic     www.77.com             N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                NaN                          luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com            nh           0          0
Twitter   NaN                              https://www.michaeljordansworld.com/      www.michaeljordan.com        nh           0          0
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com            NaN                                       NaN                          nh           0          0
Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields
player          website                merch  handle       followers  following
michael jordan  www.michaeljordan.com  Y      nh           0          0
Lebron James    www.kingjames.com      Y      null         null       null
Kobe Bryant     www.mamba.com          Y      Kobe Bryant  14900000   514
Larry Bird      www.larrybird.com      Y      nh           0          0
luka Doncic     www.77.com             N      luka7doncic  1500000    347
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
Use:
# for the same input, drop the expected-output columns from df_a first
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
# melt df_b so the cols_to_join columns become a single website column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# because of duplicates, remove dupes by website (keeping the row with the most followers)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
# merge twice; because both results share the same index, missing values in the website-based merge are filled from the handle-based merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
You can pass left_on and right_on with lists -
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
I have a dataframe like the one shown below
stud_name act_qtr year yr_qty qtr mov_avg_full mov_avg_2qtr_min_period
0 ABC Q2 2014 2014Q2 NaN NaN NaN
1 ABC Q1 2016 2016Q1 Q1 13.0 14.5
2 ABC Q4 2016 2016Q4 NaN NaN NaN
3 ABC Q4 2017 2017Q4 NaN NaN NaN
4 ABC Q4 2020 2020Q4 NaN NaN NaN
OP = pd.read_clipboard()
stud_name qtr year t_score p_score yr_qty mov_avg_full mov_avg_2qtr_min_period
0 ABC Q1 2014 10 11 2014Q1 10.000000 10.0
1 ABC Q1 2015 11 32 2015Q1 10.500000 10.5
2 ABC Q2 2015 13 45 2015Q2 11.333333 12.0
3 ABC Q3 2015 15 32 2015Q3 12.250000 14.0
4 ABC Q4 2015 17 21 2015Q4 13.200000 16.0
5 ABC Q1 2016 12 56 2016Q1 13.000000 14.5
6 ABC Q2 2017 312 87 2017Q2 55.714286 162.0
7 ABC Q3 2018 24 90 2018Q3 51.750000 168.0
df = pd.read_clipboard()
I would like to fillna() based on the logic below:
For example, let's take stud_name = ABC. He has multiple NA records. Take his NA for 2020Q4: to fill it, we pick the latest record from df for stud_name = ABC before 2020Q4 (which is 2018Q3). Similarly, another of his NA records is for 2014Q2; we pick the latest (prior) record from df for stud_name = ABC before 2014Q2 (which is 2014Q1). We need to sort based on the yr_qty values to get the latest (prior) record correctly.
We need to do this for each stud_name, on a big dataset.
So, we fillna in mov_avg_full and mov_avg_2qtr_min_period.
If there are no previous records to look at in the df dataframe, leave the NA as it is.
I was trying something like below, but it doesn't work and gives incorrect results:
filled = OP.merge(df, on=['stud_name'], how='left')
filled.sort_values(['year', 'Qty'], inplace=True)
filled['mov_avg_full'].fillna(filled.groupby('stud_name')['mov_avg_full'].shift())
filled['mov_avg_2qtr_min_period'].fillna(filled.groupby('stud_name')['mov_avg_2qtr_min_period'].shift())
I expect my output to be as shown below
In this case, you might want to use append instead of merge. In other words, you want to concatenate vertically instead of horizontally. Then, after sorting the DataFrame by stud_name and yr_qty, you can use the groupby and fillna methods on it.
Code:
import pandas as pd
# Create the sample dataframes
import numpy as np
op = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'act_qtr': {0: 'Q2', 1: 'Q1', 2: 'Q4', 3: 'Q4', 4: 'Q4'}, 'year': {0: 2014, 1: 2016, 2: 2016, 3: 2017, 4: 2020}, 'yr_qty': {0: '2014Q2', 1: '2016Q1', 2: '2016Q4', 3: '2017Q4', 4: '2020Q4'}, 'qtr': {0: np.NaN, 1: 'Q1', 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_full': {0: np.NaN, 1: 13.0, 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_2qtr_min_period': {0: np.NaN, 1: 14.5, 2: np.NaN, 3: np.NaN, 4: np.NaN}})
df = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC'}, 'qtr': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4', 5: 'Q1', 6: 'Q2', 7: 'Q3'}, 'year': {0: 2014, 1: 2015, 2: 2015, 3: 2015, 4: 2015, 5: 2016, 6: 2017, 7: 2018}, 't_score': {0: 10, 1: 11, 2: 13, 3: 15, 4: 17, 5: 12, 6: 312, 7: 24}, 'p_score': {0: 11, 1: 32, 2: 45, 3: 32, 4: 21, 5: 56, 6: 87, 7: 90}, 'yr_qty': {0: '2014Q1', 1: '2015Q1', 2: '2015Q2', 3: '2015Q3', 4: '2015Q4', 5: '2016Q1', 6: '2017Q2', 7: '2018Q3'}, 'mov_avg_full': {0: 10.0, 1: 10.5, 2: 11.333333, 3: 12.25, 4: 13.2, 5: 13.0, 6: 55.714286, 7: 51.75}, 'mov_avg_2qtr_min_period': {0: 10.0, 1: 10.5, 2: 12.0, 3: 14.0, 4: 16.0, 5: 14.5, 6: 162.0, 7: 168.0}})
# Append df to op
dfa = op.append(df[['stud_name', 'yr_qty', 'mov_avg_full', 'mov_avg_2qtr_min_period']])
# Sort before applying fillna
dfa = dfa.sort_values(['stud_name', 'yr_qty'])
# Group by stud_name and apply ffill
dfa[['mov_avg_full', 'mov_avg_2qtr_min_period']] = dfa.groupby('stud_name')[['mov_avg_full', 'mov_avg_2qtr_min_period']].fillna(method='ffill')
# Extract the original rows from op and drop the leftover qtr column
dfa = dfa[dfa.act_qtr.notna()].drop('qtr', axis=1)
print(dfa)
Output:
stud_name  act_qtr  year  yr_qty  mov_avg_full  mov_avg_2qtr_min_period
ABC        Q2       2014  2014Q2  10            10
ABC        Q1       2016  2016Q1  13            14.5
ABC        Q4       2016  2016Q4  13            14.5
ABC        Q4       2017  2017Q4  55.7143       162
ABC        Q4       2020  2020Q4  51.75         168
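An alternative way to sketch the same "latest prior record per stud_name" lookup is pd.merge_asof. This is only a rough sketch assuming the op and df frames defined above; the _qtr_ts helper column and the _prior suffix are illustrative names, and the quarter strings are converted to timestamps because merge_asof needs an ordered key.
# convert the quarter strings to timestamps so merge_asof has an ordered key
op['_qtr_ts'] = pd.PeriodIndex(op['yr_qty'], freq='Q').to_timestamp()
df['_qtr_ts'] = pd.PeriodIndex(df['yr_qty'], freq='Q').to_timestamp()
# for every op row, pull the latest df row per stud_name at or before that quarter
filled = pd.merge_asof(
    op.sort_values('_qtr_ts'),
    df.sort_values('_qtr_ts')[['stud_name', '_qtr_ts', 'mov_avg_full', 'mov_avg_2qtr_min_period']],
    on='_qtr_ts',
    by='stud_name',
    suffixes=('', '_prior'),
)
# keep the original values where present, otherwise use the latest prior ones
for c in ['mov_avg_full', 'mov_avg_2qtr_min_period']:
    filled[c] = filled[c].fillna(filled[c + '_prior'])
filled = filled.drop(columns=['_qtr_ts', 'mov_avg_full_prior', 'mov_avg_2qtr_min_period_prior'])
print(filled)
For the sample frames this should reproduce the same filled values as the output above.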
I am working on calculating some football stats.
I have the following dataframe:
{'Player': {8: 'Darrel Williams', 2: 'Mark Ingram', 3: 'Michael Carter', 4: 'Najee Harris', 10: 'James Conner', 0: 'Buffalo Bills', 15: 'Davante Adams', 1: 'Aaron Rodgers', 5: 'Tyler Bass', 11: 'Corey Davis', 6: 'Van Jefferson', 14: 'Matt Ryan', 7: 'T.J. Hockenson', 9: 'Antonio Brown', 12: 'Alvin Kamara', 13: 'Tyler Boyd'}, 'Position': {8: 'RB', 2: 'RB', 3: 'RB', 4: 'RB', 10: 'RB', 0: 'DEF', 15: 'WR', 1: 'QB', 5: 'K', 11: 'WR', 6: 'WR', 14: 'QB', 7: 'TE', 9: 'WR', 12: 'RB', 13: 'WR'}, 'Score': {8: 24.9, 2: 18.8, 3: 16.2, 4: 15.3, 10: 13.9, 0: 12.0, 15: 11.3, 1: 10.48, 5: 9.0, 11: 8.8, 6: 6.9, 14: 1.68, 7: 0.0, 9: 0.0, 12: 0.0, 13: 0.0}}
Player           Position  Score
Darrel Williams  RB        24.9
Mark Ingram      RB        18.8
Michael Carter   RB        16.2
Najee Harris     RB        15.3
James Conner     RB        13.9
Buffalo Bills    DEF       12
Davante Adams    WR        11.3
Aaron Rodgers    QB        10.48
Tyler Bass       K         9
Corey Davis      WR        8.8
Van Jefferson    WR        6.9
Matt Ryan        QB        1.68
T.J. Hockenson   TE        0
Antonio Brown    WR        0
Alvin Kamara     RB        0
Tyler Boyd       WR        0
What I am looking to do, given the following requirements_dictionary, is to extract the top-scoring rows (Score in the dataframe) for each key (Position in the dataframe), taking as many rows per position as the dictionary value specifies:
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
What makes this challenging is the final key, FLEX, which matches no position in the dataframe, because that slot can be filled by a player from any of the RB, WR, or TE positions.
Final output should look like:
Player           Position  Score
Darrel Williams  RB        24.9
Mark Ingram      RB        18.8
Michael Carter   RB        16.2
Najee Harris     RB        15.3
Buffalo Bills    DEF       12
Davante Adams    WR        11.3
Aaron Rodgers    QB        10.48
Tyler Bass       K         9
Corey Davis      WR        8.8
T.J. Hockenson   TE        0
Since that is the top 2 RB, 1 QB, 2 WR, 1 TE, 1 K, 1 DEF and 2 FLEX.
I have tried the following code which gets me close:
all_points.groupby('Position')['Score'].nlargest(2)
Position
DEF 0 12.00
K 5 9.00
QB 1 10.48
14 1.68
RB 8 24.90
2 18.80
TE 7 0.00
WR 15 11.30
11 8.80
Name: Score, dtype: float64
However, that does not account for the FLEX "position"
I could alternatively loop through the dataframe and do this manually, but that seems very intensive.
How can I achieve the intended result?
Create a custom function that selects the required number of players for each group according to your requirements, and keep this index as idx_best. Then exclude all already selected players and select the FLEX best remaining players as idx_flex. Finally, take the union of these two indexes.
FLEX = requirements_dictionary['FLEX']
select_players = lambda x: x.nlargest(requirements_dictionary[x.name])
idx_best = df.groupby('Position')['Score'].apply(select_players).index.levels[1]
idx_flex = df.loc[df.index.difference(idx_best), 'Score'].nlargest(FLEX).index
out = df.loc[idx_best.union(idx_flex)].sort_values('Score', ascending=False)
Output:
>>> out
Player Position Score
8 Darrel Williams RB 24.90
2 Mark Ingram RB 18.80
3 Michael Carter RB 16.20
4 Najee Harris RB 15.30
0 Buffalo Bills DEF 12.00
15 Davante Adams WR 11.30
1 Aaron Rodgers QB 10.48
5 Tyler Bass K 9.00
11 Corey Davis WR 8.80
7 T.J. Hockenson TE 0.00
Use the requirements dictionary to get the rows matching each position, then sort by score and take the head equal to the dictionary value for that position. FLEX is the top 2 across the RB, WR, and TE positions; I concatenate the flex results. In my opinion, this solution is more intuitive and easier to follow.
import io
import pandas as pd

txt="""Player,Position,Score
Darrel Williams,RB,24.9
Mark Ingram,RB,18.8
Michael Carter,RB,16.2
Najee Harris,RB,15.3
Buffalo Bills,DEF,12
Davante Adams,WR,11.3
Aaron Rodgers,QB,10.48
Tyler Bass,K,9
Corey Davis,WR,8.8
T.J. Hockenson,TE,0"""
df = pd.read_csv(io.StringIO(txt),sep=',')
requirements_dictionary = {'QB': 1, 'RB': 2, 'WR': 2, 'TE': 1, 'K': 1, 'DEF': 1, 'FLEX': 2}
#print(df)
df_top_rows = pd.DataFrame()
for position in requirements_dictionary.keys():
    df_top_rows = df_top_rows.append(df[df['Position'] == position].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
print(df_top_rows)
position='FLEX'
df_flex_rows = df_top_rows.append(df[df['Position'].isin(['RB','WR','TE'])].sort_values(by='Score', ascending=False).head(requirements_dictionary[position]))
#print(df_flex_rows)
df_result=pd.concat([df_top_rows,df_flex_rows],axis=0)
df_result.drop_duplicates(inplace=True)
print(df_result)
output
Player Position Score
6 Aaron Rodgers QB 10.48
0 Darrel Williams RB 24.90
1 Mark Ingram RB 18.80
5 Davante Adams WR 11.30
8 Corey Davis WR 8.80
9 T.J. Hockenson TE 0.00
7 Tyler Bass K 9.00
4 Buffalo Bills DEF 12.00
So, I'm working with Python 3.7 in Jupyter Notebooks. I'm currently exploring some survey data in the form of a Pandas DataFrame imported from a .CSV file. I would like to explore further with some Seaborn visualisations; however, the numerical data has been gathered in the form of age bins, using string values.
Is there a way I could go about converting these columns (Age and Approximate Household Income) into numerical values, which could then be used with Seaborn? I've attempted searches, but my wording seems to only be returning methods for creating age bins from columns with numerical values. I'm really looking for how I'd convert string values into numerical age-bin values.
Also, does anybody have some tips on how I could improve my search method? What would have been the ideal wording for searching for a solution to something like this?
Here is a sample from the dataframe, using df.head(5).to_dict(), with values changed for anonymity purposes.
'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'Approximate Household Income': {0: '$175,000 - $199,999',
1: '$75,000 - $99,999',
2: '$25,000 - $49,999',
3: '$50,000 - $74,999',
4: nan},
'Highest Level of Education Completed': {0: 'Four Year College Degree',
1: 'Four Year College Degree',
2: 'Jr College/Associates Degree',
3: 'Jr College/Associates Degree',
4: 'Four Year College Degree'},
'2020 Candidate Choice': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Donald Trump',
3: 'Joe Biden',
4: 'Donald Trump'},
'2016 Candidate Choice': {0: 'Hillary Clinton',
1: 'Third Party',
2: 'Donald Trump',
3: 'Hillary Clinton',
4: 'Third Party'},
'Party Registration 2020': {0: 'Independent',
1: 'No Party',
2: 'No Party',
3: 'Independent',
4: 'Independent'},
'Registered State for Voting': {0: 'Colorado',
1: 'Virginia',
2: 'California',
3: 'North Carolina',
4: 'Oregon'}
You can use some of pandas Series.str methods.
Smaller example dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
"Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
"Approximate Household Income": {
0: "$175,000 - $199,999",
1: "$75,000 - $99,999",
2: "$25,000 - $49,999",
3: "$50,000 - $74,999",
4: np.nan,
},
}
)
# Age Ethnicity Approximate Household Income
# 0 45-54 White $175,000 - $199,999
# 1 35-44 White $75,000 - $99,999
# 2 45-54 White $25,000 - $49,999
# 3 45-54 White $50,000 - $74,999
# 4 55-64 White NaN
We can iterate through a list of columns and chain these methods to parse the ranges, all within the pandas DataFrame:
Methods we will use in order:
Series.str.replace - replace commas with nothing
Series.str.extract - extract the numbers from the Series, regex explained here
Series.astype - convert the extracted numbers to floats
DataFrame.rename - rename the new columns
DataFrame.join - add the extracted numbers back on to the original DataFrame
for col in ["Age", "Approximate Household Income"]:
df = df.join(
df[col]
.str.replace(",", "", regex=False)
.str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
.astype("float")
.rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
)
# Age Ethnicity Approximate Household Income Age_lower Age_upper \
# 0 45-54 White $175,000 - $199,999 45.0 54.0
# 1 35-44 White $75,000 - $99,999 35.0 44.0
# 2 45-54 White $25,000 - $49,999 45.0 54.0
# 3 45-54 White $50,000 - $74,999 45.0 54.0
# 4 55-64 White NaN 55.0 64.0
#
# Approximate Household Income_lower Approximate Household Income_upper
# 0 175000.0 199999.0
# 1 75000.0 99999.0
# 2 25000.0 49999.0
# 3 50000.0 74999.0
# 4 NaN NaN
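From there, the parsed lower/upper columns can be fed to Seaborn directly. A minimal sketch, assuming seaborn and matplotlib are installed and using the df built above; income_mid is just an illustrative name for the midpoint of each income bin:
import seaborn as sns
import matplotlib.pyplot as plt

# midpoint of each income bin as a single numeric value per respondent
df["income_mid"] = df[
    ["Approximate Household Income_lower", "Approximate Household Income_upper"]
].mean(axis=1)

# one box of income midpoints per age bin
sns.boxplot(data=df, x="Age", y="income_mid")
plt.show()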
In this case, I'd suggest setting up the conversion 'by hand' for each type of category based on the format of the strings. For example, for the age bins:
age = {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}
age_bins = {key: [int(age[key].split('-')[0]), int(age[key].split('-')[1])] for key in age}
{0: [45, 54], 1: [35, 44], 2: [45, 54], 3: [45, 54], 4: [55, 64]}
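To apply the same idea to the DataFrame column itself rather than a plain dict, a rough sketch, assuming a DataFrame df containing the Age column from the question; Age_lower and Age_upper are illustrative names:
# split "45-54" style strings into two numeric columns on the DataFrame
age_parts = df['Age'].str.split('-', expand=True).astype(float)
df['Age_lower'] = age_parts[0]
df['Age_upper'] = age_parts[1]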
I have a dictionary I call 'test_dict'
test_dict = {'OBJECTID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'Country': {0: 'Vietnam',
1: 'Vietnam',
2: 'Vietnam',
3: 'Vietnam',
4: 'Vietnam'},
'Location': {0: 'Nha Trang',
1: 'Hue',
2: 'Phu Quoc',
3: 'Chu Lai',
4: 'Lao Bao'},
'Lat': {0: 12.250000000000057,
1: 16.401000000000067,
2: 10.227000000000032,
3: 15.406000000000063,
4: 16.627300000000048},
'Long': {0: 109.18333300000006,
1: 107.70300000000009,
2: 103.96700000000004,
3: 108.70600000000007,
4: 106.59970000000004}}
That I convert to a DataFrame
test_df = pd.DataFrame(test_dict)
and I get this:
OBJECTID Country Location Lat Long
0 1 Vietnam Nha Trang 12.2500 109.183333
1 2 Vietnam Hue 16.4010 107.703000
2 3 Vietnam Phu Quoc 10.2270 103.967000
3 4 Vietnam Chu Lai 15.4060 108.706000
4 5 Vietnam Lao Bao 16.6273 106.599700
I want to construct a Series with the location names, and I would like the column "OBJECTID" to be the index. When I try it, I lose the first row.
pd.Series(test_df.Location, index=test_df.OBJECTID)
I get this:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object
What I was hoping to get was this:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
What am I doing wrong here? Why is the process of converting into a Series losing the first row?
You can fix your code via
pd.Series(test_df.Location.values, index=test_df.OBJECTID)
because the problem is that test_df.Location has an index itself that starts at 0.
Edit - my preferred alternative:
test_df.set_index('OBJECTID')['Location']
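For the sample data above, the set_index form should give the OBJECTID-indexed Series the question asks for:
>>> test_df.set_index('OBJECTID')['Location']
OBJECTID
1    Nha Trang
2          Hue
3     Phu Quoc
4      Chu Lai
5      Lao Bao
Name: Location, dtype: object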
You can use:
pd.Series(test_df.Location).reindex(test_df.OBJECTID)
Result:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object