I have a situation where I am trying to join df_a to df_b
In reality, these dataframes have shapes: (389944, 121) and (1098118, 60)
I need to conditionally join these two dataframes if any of the below conditions are true. If more than one condition matches, the pair should still only be joined once:
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
For example:
df_a:
player          website                merch
michael jordan  www.michaeljordan.com  Y
Lebron James    www.kingjames.com      Y
Kobe Bryant     www.mamba.com          Y
Larry Bird      www.larrybird.com      Y
luka Doncic     www.77.com             N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                NaN                          luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com            nh           0          0
Twitter   NaN                              https://www.michaeljordansworld.com/      www.michaeljordan.com        nh           0          0
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com            NaN                                       NaN                          nh           0          0
Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields
player          website                merch  handle       followers  following
michael jordan  www.michaeljordan.com  Y      nh           0          0
Lebron James    www.kingjames.com      Y      null         null       null
Kobe Bryant     www.mamba.com          Y      Kobe Bryant  14900000   514
Larry Bird      www.larrybird.com      Y      nh           0          0
luka Doncic     www.77.com             N      luka7doncic  1500000    347
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
Use:
# for the same input, drop the expected-output columns that were included in df_a
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
# melt df_b so the cols_to_join columns become a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
# there are duplicate websites, so keep one row per website, preferring the highest follower count
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
# merge twice; both results keep df_a's row order, so missing values from the website merge can be filled from the handle merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
You can pass lists to left_on and right_on:
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
Related
I would like to match one column in df1 (name) against either of two columns in df2 (name1 or name2) to bring over the points.
df1
name area
Cody California
Billy Connecticut
Jeniffer Indiana
Franc Georgia
Mark Illinois
Tamis Connecticut
Danye Illinois
Leesa Indiana
Hector Illinois
Coy California
df2
name1   name2     points
Billy   NA        20
Cody    NA        27.5
Coy     NA        25
Danye   NA        21
Franc   NA        19
NA      Hector    40
NA      Jeniffer  30
NA      Leesa     20
NA      Mark      50
NA      Tamis     90
Output
name area points
Cody California 27.5
Billy Connecticut 20
Jeniffer Indiana 30
Franc Georgia 19
Mark Illinois 50
Tamis Connecticut 90
Danye Illinois 21
Leesa Indiana 20
Hector Illinois 40
Coy California 25
You could try as follows:
import pandas as pd
import numpy as np
data = {'name': {0: 'Cody', 1: 'Billy', 2: 'Jeniffer', 3: 'Franc', 4: 'Mark',
5: 'Tamis', 6: 'Danye', 7: 'Leesa', 8: 'Hector', 9: 'Coy'},
'area': {0: 'California', 1: 'Connecticut', 2: 'Indiana', 3: 'Georgia',
4: 'Illinois', 5: 'Connecticut', 6: 'Illinois', 7: 'Indiana',
8: 'Illinois', 9: 'California'}}
df = pd.DataFrame(data)
data2 = {'name1': {0: 'Billy', 1: 'Cody', 2: 'Coy', 3: 'Danye', 4: 'Franc',
5: np.nan, 6: np.nan, 7: np.nan, 8: np.nan, 9: np.nan},
'name2': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan, 5: 'Hector',
6: 'Jeniffer', 7: 'Leesa', 8: 'Mark', 9: 'Tamis'},
'points': {0: 20.0, 1: 27.5, 2: 25.0, 3: 21.0, 4: 19.0, 5: 40.0,
6: 30.0, 7: 20.0, 8: 50.0, 9: 90.0}}
df2 = pd.DataFrame(data2)
# fill NaNs in `name2` based on `name1`
df2['name2'] = df2['name2'].fillna(df2['name1'])
# merge dfs
df_new = df.merge(df2[['name2','points']], left_on='name', right_on='name2')
print(df_new)
name area points
0 Cody California 27.5
1 Billy Connecticut 20.0
2 Jeniffer Indiana 30.0
3 Franc Georgia 19.0
4 Mark Illinois 50.0
5 Tamis Connecticut 90.0
6 Danye Illinois 21.0
7 Leesa Indiana 20.0
8 Hector Illinois 40.0
9 Coy California 25.0
Alternatively, instead of merge you could use map to add column points to your first df:
df['points'] = df['name'].map(df2.set_index('name2')['points'])
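If you prefer not to modify df2 in place, a small variation of the same idea builds the lookup on the fly (a sketch; the lookup name is just illustrative):
lookup = df2.set_index(df2['name2'].fillna(df2['name1']))['points']  # one unique name per row
df['points'] = df['name'].map(lookup)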
I have a dataframe as shown below
stud_name act_qtr year yr_qty qtr mov_avg_full mov_avg_2qtr_min_period
0 ABC Q2 2014 2014Q2 NaN NaN NaN
1 ABC Q1 2016 2016Q1 Q1 13.0 14.5
2 ABC Q4 2016 2016Q4 NaN NaN NaN
3 ABC Q4 2017 2017Q4 NaN NaN NaN
4 ABC Q4 2020 2020Q4 NaN NaN NaN
OP = pd.read_clipboard()
stud_name qtr year t_score p_score yr_qty mov_avg_full mov_avg_2qtr_min_period
0 ABC Q1 2014 10 11 2014Q1 10.000000 10.0
1 ABC Q1 2015 11 32 2015Q1 10.500000 10.5
2 ABC Q2 2015 13 45 2015Q2 11.333333 12.0
3 ABC Q3 2015 15 32 2015Q3 12.250000 14.0
4 ABC Q4 2015 17 21 2015Q4 13.200000 16.0
5 ABC Q1 2016 12 56 2016Q1 13.000000 14.5
6 ABC Q2 2017 312 87 2017Q2 55.714286 162.0
7 ABC Q3 2018 24 90 2018Q3 51.750000 168.0
df = pd.read_clipboard()
I would like to fillna() based on the logic below:
For example, take stud_name = ABC, which has multiple NA records. For its NA at 2020Q4, we pick the latest record in df for stud_name=ABC before 2020Q4 (which is 2018Q3). Similarly, for its NA at 2014Q2, we pick the latest (prior) record in df for stud_name=ABC before 2014Q2 (which is 2014Q1). We need to sort by the yr_qty values to pick the latest (prior) record correctly.
We need to do this for each stud_name and for a big dataset
So, we fillna in mov_avg_full and mov_avg_2qtr_min_period
If there are no previous records to look at in df dataframe, leave NA as it is
I was trying something like below, but it doesn't work and gives incorrect results:
filled = OP.merge(df, on=['stud_name'], how='left')
filled.sort_values(['year', 'Qty'], inplace=True)
filled['mov_avg_full'].fillna(filled.groupby('stud_name')['mov_avg_full'].shift())
filled['mov_avg_2qtr_min_period'].fillna(filled.groupby('stud_name')['mov_avg_2qtr_min_period'].shift())
I expect my output to be as shown below.
In this case, you might want to use append instead of merge. In other words, you want to concatenate vertically instead of horizontally. Then after sorting the DataFrame by stud_name and yr_qtr, you can use groupby and fillna methods on it.
Code:
import pandas as pd
# Create the sample dataframes
import numpy as np
op = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'act_qtr': {0: 'Q2', 1: 'Q1', 2: 'Q4', 3: 'Q4', 4: 'Q4'}, 'year': {0: 2014, 1: 2016, 2: 2016, 3: 2017, 4: 2020}, 'yr_qty': {0: '2014Q2', 1: '2016Q1', 2: '2016Q4', 3: '2017Q4', 4: '2020Q4'}, 'qtr': {0: np.NaN, 1: 'Q1', 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_full': {0: np.NaN, 1: 13.0, 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_2qtr_min_period': {0: np.NaN, 1: 14.5, 2: np.NaN, 3: np.NaN, 4: np.NaN}})
df = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC'}, 'qtr': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4', 5: 'Q1', 6: 'Q2', 7: 'Q3'}, 'year': {0: 2014, 1: 2015, 2: 2015, 3: 2015, 4: 2015, 5: 2016, 6: 2017, 7: 2018}, 't_score': {0: 10, 1: 11, 2: 13, 3: 15, 4: 17, 5: 12, 6: 312, 7: 24}, 'p_score': {0: 11, 1: 32, 2: 45, 3: 32, 4: 21, 5: 56, 6: 87, 7: 90}, 'yr_qty': {0: '2014Q1', 1: '2015Q1', 2: '2015Q2', 3: '2015Q3', 4: '2015Q4', 5: '2016Q1', 6: '2017Q2', 7: '2018Q3'}, 'mov_avg_full': {0: 10.0, 1: 10.5, 2: 11.333333, 3: 12.25, 4: 13.2, 5: 13.0, 6: 55.714286, 7: 51.75}, 'mov_avg_2qtr_min_period': {0: 10.0, 1: 10.5, 2: 12.0, 3: 14.0, 4: 16.0, 5: 14.5, 6: 162.0, 7: 168.0}})
# Append df to op
dfa = op.append(df[['stud_name', 'yr_qty', 'mov_avg_full', 'mov_avg_2qtr_min_period']])
# Sort before applying fillna
dfa = dfa.sort_values(['stud_name', 'yr_qty'])
# Group by stud_name and apply ffill
dfa[['mov_avg_full', 'mov_avg_2qtr_min_period']] = dfa.groupby('stud_name')[['mov_avg_full', 'mov_avg_2qtr_min_period']].fillna(method='ffill')
# Extract the original rows from op and drop the leftover qtr column
dfa = dfa[dfa.act_qtr.notna()].drop('qtr', axis=1)
print(dfa)
Output:
stud_name  act_qtr  year  yr_qty  mov_avg_full  mov_avg_2qtr_min_period
ABC        Q2       2014  2014Q2  10            10
ABC        Q1       2016  2016Q1  13            14.5
ABC        Q4       2016  2016Q4  13            14.5
ABC        Q4       2017  2017Q4  55.7143       162
ABC        Q4       2020  2020Q4  51.75         168
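Note: DataFrame.append was removed in pandas 2.0 and fillna(method='ffill') is deprecated, so on current pandas the append and fill steps above can be sketched as follows (same op and df frames, same result):
dfa = pd.concat([op, df[['stud_name', 'yr_qty', 'mov_avg_full', 'mov_avg_2qtr_min_period']]])
dfa = dfa.sort_values(['stud_name', 'yr_qty'])
dfa[['mov_avg_full', 'mov_avg_2qtr_min_period']] = dfa.groupby('stud_name')[['mov_avg_full', 'mov_avg_2qtr_min_period']].ffill()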
I am new to Python and learning as I code. I have a DataFrame (df1) that I read from Excel. From df1 I take a column ("Product_ID"), convert it to a list, and pass each value in the list to a SQL query. The results are stored in another DataFrame (df2), which I then merge with df1 on "Product_ID" and write to Excel. But in Excel I am seeing only one row, probably because each iteration of the SQL loop creates a new DataFrame per product.
How can I write all rows to Excel, and when I merge df2 with df1, how can I change the position of the df2 column?
Below is my code
file = 'path to excel'  # placeholder path
df1 = pd.read_excel(file)
prod_list = df1['Product_ID'].tolist()  # list of product_ids
for x in prod_list:
    SQL = pd.read_sql_query('''SELECT Product_ID, Amount from table where Product_ID = '{x}'
                            '''.format(x=x), cnxn)
    df2 = pd.DataFrame(SQL)
    merge = pd.merge(df1, df2, on='Product_ID')
writer = pd.ExcelWriter('output.xlsx')
merge.to_excel(writer, 'data')
writer.save()
df1 output is
Name Product_ID IND INN FAM INN
0 Allen 0072 1400 4200
1 Radio 0068 1500 2400
2 COMP 0430 3500 7000
df2 output:
Product_ID AMOUNT
0 0072 1400.0
Product_ID AMOUNT
0 0068 2400.0
Product_ID AMOUNT
0 0430 3500.0
merge output:
Name Product_ID IND INN FAM INN AMOUNT
0 Allen 0072 1400 4200 1400
Name Product_ID IND INN FAM INN AMOUNT
0 Radio 0068 1500 2400 2400
Name Product_ID IND INN FAM INN AMOUNT
0 COMP 0430 3500 7000 3500
In Excel I am seeing only one row. I want my merged dataframe to be as below:
Name Product_ID IND INN AMOUNT FAM INN
0 Allen 0072 1400 1400 4200
1 Radio 0068 1500 2400 2400
2 COMP 0430 3500 3500 7000
df1.to_dict() output:
{'Unnamed: 0': {0: 0, 1: 1, 2: 2}, 'Group Name': {0: 'Allen, Inc.', 1:
'American.', 2: 'COM'}, 'Product_ID': {0: '0072', 1: '0068', 2: '0430'},
'IND INN': {0: 1400, 1: 1500, 2: 3500}, 'FAM INN': {0:4200, 1: 2400,
2:7000}}
df2.to_dict() output:
{'Product_ID': {0: '0072'}, 'AMOUNT': {0: 1400.0}}
{'Product_ID': {0: '0068'}, 'AMOUNT': {0: 2400.0}}
{'Product_ID': {0: '0430'}, 'AMOUNT': {0: 3500.0}}
You can build DataFrames from the dicts (use stack + str.get + unstack to build df2), then merge:
df1 = pd.DataFrame({'Unnamed: 0': {0: 0, 1: 1, 2: 2},
                    'Group Name': {0: 'Allen, Inc.', 1: 'American.', 2: 'COM'},
                    'Product_ID': {0: '0072', 1: '0068', 2: '0430'},
                    'IND INN': {0: 1400, 1: 1500, 2: 3500},
                    'FAM INN': {0: 4200, 1: 2400, 2: 7000}})
df2 = pd.DataFrame([{'Product_ID': {0: '0072'}, 'AMOUNT': {0: 1400.0}},
                    {'Product_ID': {0: '0068'}, 'AMOUNT': {0: 2400.0}},
                    {'Product_ID': {0: '0430'}, 'AMOUNT': {0: 3500.0}}])
df2 = df2.stack().str.get(0).unstack()
merged = df1.merge(df2, on='Product_ID').drop(columns='Unnamed: 0')
Output:
Group Name Product_ID IND INN FAM INN AMOUNT
0 Allen, Inc. 0072 1400 4200 1400.0
1 American. 0068 1500 2400 2400.0
2 COM 0430 3500 7000 3500.0
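For reference, a sketch of what the stack + str.get + unstack step is doing (the intermediate names here are only for illustration):
stacked = df2.stack()        # MultiIndex Series; each value is still a one-item dict like {0: '0072'}
values = stacked.str.get(0)  # .str.get(0) pulls the value stored under key 0 out of each dict
df2 = values.unstack()       # pivot the column level back out, leaving one scalar per cell
# equivalent to the single df2 = df2.stack().str.get(0).unstack() line above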
So, I'm working with Python 3.7 in Jupyter Notebooks. I'm currently exploring some survey data in the form of a pandas DataFrame imported from a .CSV file. I would like to explore further with some Seaborn visualisations; however, the numerical data has been gathered in the form of age bins, using string values.
Is there a way I could go about converting these columns (Age and Approximate Household Income) into numerical values, which could then be used with Seaborn? I've attempted searches but my wording seems to only be returning methods on creating age bins for columns with numerical values. I'm really looking for how I'd convert string values into numerical age bin values.
Also, does anybody have some tips on how I could improve my search method. What would have been the ideal wording for searching up a solution for something like this?
Here is a sample from the dataframe, using df.head(5).to_dict(), with values changed for anonymity purposes.
{'Age': {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'},
'Ethnicity': {0: 'White', 1: 'White', 2: 'White', 3: 'White', 4: 'White'},
'Approximate Household Income': {0: '$175,000 - $199,999',
1: '$75,000 - $99,999',
2: '$25,000 - $49,999',
3: '$50,000 - $74,999',
4: nan},
'Highest Level of Education Completed': {0: 'Four Year College Degree',
1: 'Four Year College Degree',
2: 'Jr College/Associates Degree',
3: 'Jr College/Associates Degree',
4: 'Four Year College Degree'},
'2020 Candidate Choice': {0: 'Joe Biden',
1: 'Joe Biden',
2: 'Donald Trump',
3: 'Joe Biden',
4: 'Donald Trump'},
'2016 Candidate Choice': {0: 'Hillary Clinton',
1: 'Third Party',
2: 'Donald Trump',
3: 'Hillary Clinton',
4: 'Third Party'},
'Party Registration 2020': {0: 'Independent',
1: 'No Party',
2: 'No Party',
3: 'Independent',
4: 'Independent'},
'Registered State for Voting': {0: 'Colorado',
1: 'Virginia',
2: 'California',
3: 'North Carolina',
4: 'Oregon'}}
You can use some of pandas Series.str methods.
Smaller example dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "Age": {0: "45-54", 1: "35-44", 2: "45-54", 3: "45-54", 4: "55-64"},
        "Ethnicity": {0: "White", 1: "White", 2: "White", 3: "White", 4: "White"},
        "Approximate Household Income": {
            0: "$175,000 - $199,999",
            1: "$75,000 - $99,999",
            2: "$25,000 - $49,999",
            3: "$50,000 - $74,999",
            4: np.nan,
        },
    }
)
# Age Ethnicity Approximate Household Income
# 0 45-54 White $175,000 - $199,999
# 1 35-44 White $75,000 - $99,999
# 2 45-54 White $25,000 - $49,999
# 3 45-54 White $50,000 - $74,999
# 4 55-64 White NaN
We can iterate through a list of columns and chain these methods to parse the ranges, all within the pandas DataFrame:
Methods we will use in order:
Series.str.replace - replace commas with nothing
Series.str.extract - extract the numbers from the Series, regex explained here
Series.astype - convert the extracted numbers to floats
DataFrame.rename - rename the new columns
DataFrame.join - add the extracted numbers back on to the original DataFrame
for col in ["Age", "Approximate Household Income"]:
df = df.join(
df[col]
.str.replace(",", "", regex=False)
.str.extract(pat=r"^[$]*(\d+)[-\s$]*(\d+)$")
.astype("float")
.rename({0: f"{col}_lower", 1: f"{col}_upper"}, axis="columns")
)
# Age Ethnicity Approximate Household Income Age_lower Age_upper \
# 0 45-54 White $175,000 - $199,999 45.0 54.0
# 1 35-44 White $75,000 - $99,999 35.0 44.0
# 2 45-54 White $25,000 - $49,999 45.0 54.0
# 3 45-54 White $50,000 - $74,999 45.0 54.0
# 4 55-64 White NaN 55.0 64.0
#
# Approximate Household Income_lower Approximate Household Income_upper
# 0 175000.0 199999.0
# 1 75000.0 99999.0
# 2 25000.0 49999.0
# 3 50000.0 74999.0
# 4 NaN NaN
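Once the _lower/_upper columns exist, you may also want a single number per row for plotting; taking the bin midpoint is one option (a sketch, not part of the extraction above):
for col in ["Age", "Approximate Household Income"]:
    df[f"{col}_mid"] = df[[f"{col}_lower", f"{col}_upper"]].mean(axis=1)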
In this case, I'd suggest setting up the conversion 'by hand' for each type of category based on the format of the strings. For example, for the age bins:
age = {0: '45-54', 1: '35-44', 2: '45-54', 3: '45-54', 4: '55-64'}
age_bins = {key: [int(age[key].split('-')[0]), int(age[key].split('-')[1])] for key in age}
{0: [45, 54], 1: [35, 44], 2: [45, 54], 3: [45, 54], 4: [55, 64]}
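The income strings can be handled the same way by hand, stripping the '$' and ',' characters first; a quick sketch on the sample values (skipping the NaN entry):
income = {0: '$175,000 - $199,999', 1: '$75,000 - $99,999', 2: '$25,000 - $49,999', 3: '$50,000 - $74,999'}
income_bins = {key: [int(part.replace('$', '').replace(',', '')) for part in income[key].split(' - ')]
               for key in income}
# {0: [175000, 199999], 1: [75000, 99999], 2: [25000, 49999], 3: [50000, 74999]}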
I have a dictionary I call 'test_dict'
test_dict = {'OBJECTID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'Country': {0: 'Vietnam',
1: 'Vietnam',
2: 'Vietnam',
3: 'Vietnam',
4: 'Vietnam'},
'Location': {0: 'Nha Trang',
1: 'Hue',
2: 'Phu Quoc',
3: 'Chu Lai',
4: 'Lao Bao'},
'Lat': {0: 12.250000000000057,
1: 16.401000000000067,
2: 10.227000000000032,
3: 15.406000000000063,
4: 16.627300000000048},
'Long': {0: 109.18333300000006,
1: 107.70300000000009,
2: 103.96700000000004,
3: 108.70600000000007,
4: 106.59970000000004}}
That I convert to a DataFrame
test_df = pd.DataFrame(test_dict)
and I get this:
OBJECTID Country Location Lat Long
0 1 Vietnam Nha Trang 12.2500 109.183333
1 2 Vietnam Hue 16.4010 107.703000
2 3 Vietnam Phu Quoc 10.2270 103.967000
3 4 Vietnam Chu Lai 15.4060 108.706000
4 5 Vietnam Lao Bao 16.6273 106.599700
I want to construct a series with the location names and I would like the column "OBJECTID" to be the index. When I try it, I lose the first row.
pd.Series(test_df.Location, index=test_df.OBJECTID)
I get this:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object
What I was hoping to get was this:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
What am I doing wrong here? Why is the process of converting into a Series losing the first row?
You can fix your code via
pd.Series(test_df.Location.values, index=test_df.OBJECTID)
because the problem is that test_df.Location has an index itself that starts at 0.
Edit - my preferred alternative:
test_df.set_index('OBJECTID')['Location']
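A small sketch of the alignment behaviour described above, using the same Location values:
import pandas as pd
loc = pd.Series(['Nha Trang', 'Hue', 'Phu Quoc', 'Chu Lai', 'Lao Bao'])   # default index 0..4
print(pd.Series(loc, index=[1, 2, 3, 4, 5]))         # aligns by label: 1 -> 'Hue', ..., 5 -> NaN
print(pd.Series(loc.values, index=[1, 2, 3, 4, 5]))  # raw values: no alignment, order preserved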
You can use:
pd.Series(test_df.Location).reindex(test_df.OBJECTID)
Result:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object