Missing first row when constructing a Series from a DataFrame - python

I have a dictionary I call 'test_dict'
test_dict = {'OBJECTID': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
'Country': {0: 'Vietnam',
1: 'Vietnam',
2: 'Vietnam',
3: 'Vietnam',
4: 'Vietnam'},
'Location': {0: 'Nha Trang',
1: 'Hue',
2: 'Phu Quoc',
3: 'Chu Lai',
4: 'Lao Bao'},
'Lat': {0: 12.250000000000057,
1: 16.401000000000067,
2: 10.227000000000032,
3: 15.406000000000063,
4: 16.627300000000048},
'Long': {0: 109.18333300000006,
1: 107.70300000000009,
2: 103.96700000000004,
3: 108.70600000000007,
4: 106.59970000000004}}
which I convert to a DataFrame
test_df = pd.DataFrame(test_dict)
and I get this:
OBJECTID Country Location Lat Long
0 1 Vietnam Nha Trang 12.2500 109.183333
1 2 Vietnam Hue 16.4010 107.703000
2 3 Vietnam Phu Quoc 10.2270 103.967000
3 4 Vietnam Chu Lai 15.4060 108.706000
4 5 Vietnam Lao Bao 16.6273 106.599700
I want to construct a Series with the location names, and I would like the column "OBJECTID" to be the index. When I try it, I lose the first row.
pd.Series(test_df.Location, index=test_df.OBJECTID)
I get this:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object
What I was hoping to get was this:
OBJECTID
1 Nha Trang
2 Hue
3 Phu Quoc
4 Chu Lai
5 Lao Bao
What am I doing wrong here? Why is the process of converting into a Series losing the first row?

You can fix your code via
pd.Series(test_df.Location.values, index=test_df.OBJECTID)
because the problem is that test_df.Location has an index itself that starts at 0.
Edit - my preferred alternative:
test_df.set_index('OBJECTID')['Location']
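For illustration, a minimal sketch (reusing test_df from above) of the index alignment that drops the row:
# Location is labelled 0-4, while OBJECTID holds 1-5
print(test_df.Location.index.tolist())    # [0, 1, 2, 3, 4]
print(test_df.OBJECTID.tolist())          # [1, 2, 3, 4, 5]
# Building a Series from an existing Series aligns on those labels,
# so label 0 ('Nha Trang') is dropped and label 5 has no match (NaN).
# Stripping the index with .values (or using set_index) avoids the alignment:
s = pd.Series(test_df.Location.values, index=test_df.OBJECTID)
print(s.loc[1])                           # 'Nha Trang'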

You can use:
pd.Series(test_df.Location).reindex(test_df.OBJECTID)
Result:
OBJECTID
1 Hue
2 Phu Quoc
3 Chu Lai
4 Lao Bao
5 NaN
Name: Location, dtype: object

Related

Optimal conditional joining of pandas dataframe

I have a situation where I am trying to join df_a to df_b.
In reality, these dataframes have shapes: (389944, 121) and (1098118, 60)
I need to conditionally join these two dataframes if any of the below conditions are true. If multiple, it only needs to be joined once:
df_a.player == df_b.handle
df_a.website == df_b.url
df_a.website == df_b.web_addr
df_a.website == df_b.notes
For an example...
df_a:
player          website                merch
michael jordan  www.michaeljordan.com  Y
Lebron James    www.kingjames.com      Y
Kobe Bryant     www.mamba.com          Y
Larry Bird      www.larrybird.com      Y
luka Doncic     www.77.com             N
df_b:
platform  url                              web_addr                                  notes                        handle       followers  following
Twitter   https://twitter.com/luka7doncic  www.77.com                                                             luka7doncic  1500000    347
Twitter   www.larrybird.com                https://en.wikipedia.org/wiki/Larry_Bird  www.larrybird.com
Twitter                                    https://www.michaeljordansworld.com/      www.michaeljordan.com
Twitter   https://twitter.com/kobebryant   https://granitystudios.com/               https://granitystudios.com/  Kobe Bryant  14900000   514
Twitter   fooman.com                       thefoo.com                                foobar                       foobarman    1          1
Twitter   www.stackoverflow.com
Ideally, df_a gets left joined to df_b to bring in the handle, followers, and following fields
player          website                merch  handle       followers  following
michael jordan  www.michaeljordan.com  Y      nh           0          0
Lebron James    www.kingjames.com      Y      null         null       null
Kobe Bryant     www.mamba.com          Y      Kobe Bryant  14900000   514
Larry Bird      www.larrybird.com      Y      nh           0          0
luka Doncic     www.77.com             N      luka7doncic  1500000    347
A minimal, reproducible example is below:
import pandas as pd, numpy as np
df_a = pd.DataFrame.from_dict({'player': {0: 'michael jordan', 1: 'Lebron James', 2: 'Kobe Bryant', 3: 'Larry Bird', 4: 'luka Doncic'}, 'website': {0: 'www.michaeljordan.com', 1: 'www.kingjames.com', 2: 'www.mamba.com', 3: 'www.larrybird.com', 4: 'www.77.com'}, 'merch': {0: 'Y', 1: 'Y', 2: 'Y', 3: 'Y', 4: 'N'}, 'handle': {0: 'nh', 1: np.nan, 2: 'Kobe Bryant', 3: 'nh', 4: 'luka7doncic'}, 'followers': {0: 0.0, 1: np.nan, 2: 14900000.0, 3: 0.0, 4: 1500000.0}, 'following': {0: 0.0, 1: np.nan, 2: 514.0, 3: 0.0, 4: 347.0}})
df_b = pd.DataFrame.from_dict({'platform': {0: 'Twitter', 1: 'Twitter', 2: 'Twitter', 3: 'Twitter', 4: 'Twitter', 5: 'Twitter'}, 'url': {0: 'https://twitter.com/luka7doncic', 1: 'www.larrybird.com', 2: np.nan, 3: 'https://twitter.com/kobebryant', 4: 'fooman.com', 5: 'www.stackoverflow.com'}, 'web_addr': {0: 'www.77.com', 1: 'https://en.wikipedia.org/wiki/Larry_Bird', 2: 'https://www.michaeljordansworld.com/', 3: 'https://granitystudios.com/', 4: 'thefoo.com', 5: np.nan}, 'notes': {0: np.nan, 1: 'www.larrybird.com', 2: 'www.michaeljordan.com', 3: 'https://granitystudios.com/', 4: 'foobar', 5: np.nan}, 'handle': {0: 'luka7doncic', 1: 'nh', 2: 'nh', 3: 'Kobe Bryant', 4: 'foobarman', 5: 'nh'}, 'followers': {0: 1500000, 1: 0, 2: 0, 3: 14900000, 4: 1, 5: 0}, 'following': {0: 347, 1: 0, 2: 0, 3: 514, 4: 1, 5: 0}})
cols_to_join = ['url', 'web_addr', 'notes']
on_handle = df_a.merge(right=df_b, left_on='player', right_on='handle', how='left')
res_df = []
res_df.append(on_handle)
for right_col in cols_to_join:
    try:
        temp = df_a.merge(right=df_b, left_on='website', right_on=right_col, how='left')
    except:
        temp = None
    if temp is not None:
        res_df.append(temp)
final = pd.concat(res_df, ignore_index=True)
final.drop_duplicates(inplace=True)
final
However, this produces erroneous results with duplicate columns.
How can I do this more efficiently and with correct results?
Use:
#for the same input, drop the columns that should come from df_b
df_a = df_a.drop(['handle','followers','following'], axis=1)
# print (df_a)
#melt df_b so the cols_to_join columns become a single 'website' column
cols_to_join = ['url', 'web_addr', 'notes']
df2 = df_b.melt(id_vars=df_b.columns.difference(cols_to_join), value_name='website')
#there are duplicates, so keep one row per website (highest followers first)
df2 = df2.sort_values('followers', ascending=False).drop_duplicates('website')
print (df2)
followers following handle platform variable \
9 14900000 514 Kobe Bryant Twitter web_addr
3 14900000 514 Kobe Bryant Twitter url
6 1500000 347 luka7doncic Twitter web_addr
12 1500000 347 luka7doncic Twitter notes
0 1500000 347 luka7doncic Twitter url
10 1 1 foobarman Twitter web_addr
4 1 1 foobarman Twitter url
16 1 1 foobarman Twitter notes
5 0 0 nh Twitter url
7 0 0 nh Twitter web_addr
8 0 0 nh Twitter web_addr
1 0 0 nh Twitter url
14 0 0 nh Twitter notes
website
9 https://granitystudios.com/
3 https://twitter.com/kobebryant
6 www.77.com
12 NaN
0 https://twitter.com/luka7doncic
10 thefoo.com
4 fooman.com
16 foobar
5 www.stackoverflow.com
7 https://en.wikipedia.org/wiki/Larry_Bird
8 https://www.michaeljordansworld.com/
1 www.larrybird.com
14 www.michaeljordan.com
#merge twice; both results share the same index, so missing values can be filled from the handle-based merge
dffin1 = df_a.merge(df_b.drop(cols_to_join + ['platform'], axis=1), left_on='player', right_on='handle', how='left')
dffin2 = df_a.merge(df2.drop(['platform','variable'], axis=1), on='website', how='left')
dffin = dffin2.fillna(dffin1)
print (dffin)
player website merch followers following \
0 michael jordan www.michaeljordan.com Y 0.0 0.0
1 Lebron James www.kingjames.com Y NaN NaN
2 Kobe Bryant www.mamba.com Y 14900000.0 514.0
3 Larry Bird www.larrybird.com Y 0.0 0.0
4 luka Doncic www.77.com N 1500000.0 347.0
handle
0 nh
1 NaN
2 Kobe Bryant
3 nh
4 luka7doncic
You can pass left_on and right_on with lists -
final = df_a.merge(
    right=df_b,
    left_on=['player', 'website', 'website', 'website'],
    right_on=['handle', 'url', 'web_addr', 'notes'],
    how='left'
)
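For reference, another way to express the "any one of these keys" requirement on the reproducible example above is to merge once per candidate column and coalesce the results. A sketch, assuming each candidate key is unique in df_b so every merge keeps exactly one row per df_a row:
df_a_small = df_a[['player', 'website', 'merch']]
pieces = []
# one merge per candidate key; each produces NaN where that key did not match
pieces.append(df_a_small.merge(df_b[['handle', 'followers', 'following']],
                               left_on='player', right_on='handle', how='left'))
for col in ['url', 'web_addr', 'notes']:
    pieces.append(df_a_small.merge(df_b[[col, 'handle', 'followers', 'following']],
                                   left_on='website', right_on=col, how='left')
                            .drop(columns=col))
# combine_first fills the NaNs from the next merge, so each row is matched at most once
result = pieces[0]
for piece in pieces[1:]:
    result = result.combine_first(piece)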

Add rows in Pandas based on condition (grouping)

I've googled quite a bit about this and could not find an answer that applies to my problem. The issue I'm having is that I've got a dataframe in which each row has a variable, and I want to insert rows with a variable C whose value is the sum of variables A and B. Example:
TOWN YEAR Var Value
Amsterdam 2019 A 1
Amsterdam 2019 B 2
Amsterdam 2020 A 1
Amsterdam 2020 B 3
Rotterdam 2019 A 4
Rotterdam 2019 B 4
Rotterdam 2020 A 5
Rotterdam 2020 B 2
The desired output would insert a row that sums A and B for rows that are identical in the other columns. My attempt backfired: I used groupby and sum, converted the result into a list, and then tried to append it as a separate column (var_C). Because I had to duplicate each value to match the length of the original dataset, the length of the list ended up not matching the length of the original dataset.
data_current = data[data['var'].isin(['A', 'B'])]
data_var_c = data_current.groupby(['TOWN', 'year'])['value'].sum()
values = data_var_c.tolist()
values_dup = [val for val in values for _ in (0, 1)]
len(values_dup)
Any feedback would be appreciated!
You can use groupby and pd.concat:
result = (
    pd.concat([
        df,
        df.groupby(['TOWN', 'YEAR'], as_index=False)
          .agg(sum)
          .assign(Var='C')
    ])
)
result = result.sort_values(['TOWN', 'YEAR', 'Var'])
OUTPUT:
TOWN YEAR Var Value
0 Amsterdam 2019 A 1
1 Amsterdam 2019 B 2
0 Amsterdam 2019 C 3
2 Amsterdam 2020 A 1
3 Amsterdam 2020 B 3
1 Amsterdam 2020 C 4
4 Rotterdam 2019 A 4
5 Rotterdam 2019 B 4
2 Rotterdam 2019 C 8
6 Rotterdam 2020 A 5
7 Rotterdam 2020 B 2
3 Rotterdam 2020 C 7
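If the repeated index labels left over from the concat (0, 1, 0, 2, ...) are unwanted, a final reset will renumber the rows:
result = result.reset_index(drop=True)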
A pivot stack option:
import pandas as pd
df = pd.DataFrame({
    'TOWN': {0: 'Amsterdam', 1: 'Amsterdam', 2: 'Amsterdam', 3: 'Amsterdam',
             4: 'Rotterdam', 5: 'Rotterdam', 6: 'Rotterdam', 7: 'Rotterdam'},
    'YEAR': {0: 2019, 1: 2019, 2: 2020, 3: 2020, 4: 2019, 5: 2019, 6: 2020,
             7: 2020},
    'Var': {0: 'A', 1: 'B', 2: 'A', 3: 'B', 4: 'A', 5: 'B', 6: 'A', 7: 'B'},
    'Value': {0: 1, 1: 2, 2: 1, 3: 3, 4: 4, 5: 4, 6: 5, 7: 2}
})
new_df = df.pivot(index=['TOWN', 'YEAR'], columns='Var')['Value'] \
    .assign(C=lambda x: x.agg('sum', axis=1)) \
    .stack() \
    .rename('Value') \
    .reset_index()
print(new_df)
new_df:
TOWN YEAR Var Value
0 Amsterdam 2019 A 1
1 Amsterdam 2019 B 2
2 Amsterdam 2019 C 3
3 Amsterdam 2020 A 1
4 Amsterdam 2020 B 3
5 Amsterdam 2020 C 4
6 Rotterdam 2019 A 4
7 Rotterdam 2019 B 4
8 Rotterdam 2019 C 8
9 Rotterdam 2020 A 5
10 Rotterdam 2020 B 2
11 Rotterdam 2020 C 7
I overcomplicated it. It is as simple as grouping by TOWN and year, taking the value column, and applying a sum transform to get the overall sum:
data['c'] = data_current.groupby(['TOWN', 'year'])['value'].transform('sum')
This, however, is not the desired output, as it adds the summation as another column, whereas Nk03's answer adds the summation as a separate row.

Get a specific row value by comparison of two DataFrames with multiple conditions

I have a first DataFrame df1 that looks something like this:
Name   Location
Bob    Paris
Bob    Berlin
Alice  Paris
Alice  Miami
Toto   NYC
Bob    NYC
Mark   Berlin
Joe    Paris
...    ...
Then I have a second DataFrame df2 that looks like this:
Name   Location  Value
Alice  Paris     0.3
Bob    Paris     0.2
Bob    Berlin    0.4
Alice  Miami     0.1
Lucas  NYC       0.0
...    ...       ...
I would like to make a function searchedValue() that fills a new column ["SEARCHEDVALUE"] for each row of df1 with the corresponding df2["VALUE"], under these two conditions:
if df1["NAME"] is in df2 and df1["LOCATION"] is in df2, return the df2 VALUE of the matching row... else return no match found
I know that I can use something like this to fill my column with the function:
df1["SearchedValue"] = df2.apply(searchedValue)
But I haven't found a way to build my function.
Thanks for your help
Data:
df = pd.DataFrame({'Name': {0: 'Bob', 1: 'Bob', 2: 'Alice', 3: 'Alice', 4: 'Toto'},
'Location': {0: 'Paris', 1: 'Berlin', 2: 'Paris', 3: 'Miami', 4: 'NYC'}})
df:
Name Location
0 Bob Paris
1 Bob Berlin
2 Alice Paris
3 Alice Miami
4 Toto NYC
df2 = pd.DataFrame({'Name': {0: 'Bob', 1: 'Bob', 2: 'Alice', 3: 'Alice', 4: 'Lucas'},
'Location': {0: 'Paris', 1: 'Berlin', 2: 'Paris', 3: 'Miami', 4: 'NYC'},
'Value': {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.1, 4: 0.0}})
df2:
Name Location Value
0 Bob Paris 0.3
1 Bob Berlin 0.2
2 Alice Paris 0.4
3 Alice Miami 0.1
4 Lucas NYC 0.0
def searchedValue(Name, Location):
    merged = df.merge(df2, on=["Name", "Location"])
    result = merged[(merged.Name == Name) & (merged.Location == Location)]
    if not result.size:
        return "No match found"
    return f"The Value is: {result['Value'].iloc[0]}"
print(searchedValue("Alice", "Paris"))
print(searchedValue("Alice", "Miami"))
print(searchedValue("Alice", "NOOOOOOOOO"))
The Value is: 0.4
The Value is: 0.1
No match found
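If the goal is to fill a whole new column (as the df1["SearchedValue"] = ... line in the question suggests), a left merge avoids calling the function row by row. A minimal sketch, reusing the df and df2 from this answer and assuming every (Name, Location) pair appears at most once in df2:
# Left merge keeps every row of df in its original order; unmatched rows get NaN in 'Value'
looked_up = df.merge(df2, on=['Name', 'Location'], how='left')['Value']
# Replace the NaNs with the requested message
df['SearchedValue'] = looked_up.fillna('No match found').values
print(df)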

Operations on multiple data frames in pandas

I have several tables that look like this:
ID YY ZZ
2 97 826
2 78 489
4 47 751
4 110 322
6 67 554
6 88 714
code:
raw = {'ID': [2, 2, 4, 4, 6, 6,],
'YY': [97,78,47,110,67,88],
'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
For each of these dfs, I have to perform a number of operations.
First, group by ID, extract the length and the mean of the column ZZ, and put the results in a new df that looks like this:
Cities length mean
Paris 0 0
Madrid 0 0
Berlin 0 0
Warsaw 0 0
London 0 0
code:
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2)
I pulled out the average and the size of the individual groups:
df_grouped = df.groupby('ID').ZZ.size()
df_grouped2 = df.groupby('ID').ZZ.mean()
The problem occurs when trying to transfer the results to the new table, because the grouped result does not contain all the cities and the results must be matched according to the appropriate key.
I tried to use a dictionary:
dic_cities = {"Paris":df_grouped.loc[2],
"Madrid":df_grouped.loc[4],
"Warsaw":df_grouped.loc[6],
"Berlin":df_grouped.loc[8],
"London":df_grouped.loc[10]}
Unfortunately, I'm receiving KeyError: 8
I have 19 df's from which I have to extract this data and the final tables have to look like this:
Cities length mean
Paris 2 657.5
Madrid 2 536.5
Berlin 0 0.0
Warsaw 2 634.0
London 0 0.0
Does anyone know how to deal with it using groupby and the dictionary or knows a better way to do it?
First, you should index df2 on 'Cities':
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'],
'length': 0,
'mean': 0}
df2 = pd.DataFrame(raw2).set_index('Cities')
Then you should reverse your dictionary:
dic_cities = {2: "Paris",
4: "Madrid",
6: "Warsaw",
8: "Berlin",
10: "London"}
Once this is done, the processing is as simple as a groupby:
import numpy as np

for i, sub in df.groupby('ID'):
    df2.loc[dic_cities[i]] = sub.ZZ.agg([len, np.mean]).tolist()
Which gives for df2:
length mean
Cities
Paris 2.0 657.5
Madrid 2.0 536.5
Berlin 0.0 0.0
Warsaw 2.0 634.0
London 0.0 0.0
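An equivalent sketch without the explicit loop (an illustration under the same assumptions, reusing dic_cities and the raw2 city list from above): aggregate once per ID, map the IDs to city names, then reindex so the missing cities fall back to 0:
stats = df.groupby('ID')['ZZ'].agg(['size', 'mean'])      # length and mean per ID
stats.index = stats.index.map(dic_cities)                 # 2 -> 'Paris', 4 -> 'Madrid', 6 -> 'Warsaw'
df2 = (stats.rename(columns={'size': 'length'})
            .reindex(index=raw2['Cities'], fill_value=0)  # Berlin and London become 0 rows
            .rename_axis('Cities'))
print(df2)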
See this:
import pandas as pd
# setup raw data
raw = {'ID': [2, 2, 4, 4, 6, 6,], 'YY': [97,78,47,110,67,88], 'ZZ':[826,489,751,322,554,714]}
df = pd.DataFrame(raw)
# get mean values
mean_values = df.groupby('ID').mean()
# drop column
mean_values = mean_values.drop(['YY'], axis=1)
# get occurrence number
occurrence = df.groupby('ID').size()
# save data
result = pd.concat([occurrence, mean_values], axis=1, sort=False)
# rename columns
result.rename(columns={0:'length', 'ZZ':'mean'}, inplace=True)
# city data
raw2 = {'Cities': ['Paris', 'Madrid', 'Berlin', 'Warsaw', 'London'], 'length': 0, 'mean': 0}
df2 = pd.DataFrame(raw2)
# rename indexes
df2 = df2.rename(index={0: 2, 1: 4, 2: 8, 3: 6, 4: 10})
# merge data
df2['length'] = result['length']
df2['mean'] = result['mean']
Output:
Cities length mean
2 Paris 2.0 657.5
4 Madrid 2.0 536.5
8 Berlin NaN NaN
6 Warsaw 2.0 634.0
10 London NaN NaN

Pandas label encoding column with default label for invalid row values

For a data frame I replaced a set of items in a column with a range of values as follows:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
The issue is that I want to replace all the remaining elements in 'Borough', those not mentioned above, with the value 0.
I also need to use regex, because there are values that look like e.g. '07 BRONX', which should also be replaced by 5, not 0.
To replace all other values by 0, you can do:
# create maps
new_values = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
maps = dict(zip(new_values, [1, 2, 3, 4, 5]))
# map the values
df['borough_num'] = df['Borough'].apply(lambda x: maps.get(x, 0))
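For the '07 BRONX' style values mentioned in the question, a rough sketch (assuming the same df['Borough'] column and the 1-5 coding from the question) that extracts the borough name with a regex before mapping, so anything without one of the names falls back to 0:
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
codes = dict(zip(keys, [1, 2, 3, 4, 5]))
# str.extract grabs the first borough name found anywhere in the string ('07 BRONX' -> 'BRONX');
# rows with no match become NaN and are then filled with 0
extracted = df['Borough'].str.extract('(' + '|'.join(keys) + ')', expand=False)
df['borough_num'] = extracted.map(codes).fillna(0).astype(int)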
Using map with fillna: all the values not in the map dict return NaN, then we just fillna
df.Borough.map(dict(zip(['QUEENS', 'BRONX'],[1,2]))).fillna(0).astype(int)
0 1
1 2
2 2
3 0
Name: Borough, dtype: int32
I see you want to perform category encoding with some imposed order. I would recommend using pd.Categorical with ordered=True:
df = pd.DataFrame({
'Borough': ['QUEENS', 'BRONX', 'MANHATTAN', 'BROOKLYN', 'INVALID']})
df
Borough
0 QUEENS
1 BRONX
2 MANHATTAN
3 BROOKLYN
4 INVALID
keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX']
df['borough_num'] = pd.Categorical(
df['Borough'], categories=keys, ordered=True).codes+1
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
pd.Categorical returns invalid strings as -1:
pd.Categorical(
df['Borough'], categories=keys, ordered=True).codes
array([ 2, 4, 0, 1, -1], dtype=int8)
This should be much faster than using replace. For reference, you could do the same with map and a defaultdict (the default 0 covers anything not in keys):
from collections import defaultdict
d = defaultdict(int)
d.update(dict(zip(keys, range(1, len(keys) + 1))))
df['borough_num'] = df['Borough'].map(d)
df
Borough borough_num
0 QUEENS 3
1 BRONX 5
2 MANHATTAN 1
3 BROOKLYN 2
4 INVALID 0
You can also use np.where:
Creating a dummy DataFrame
df = pd.DataFrame({'Borough': ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX', 'TEST']})
df
Borough
0 MANHATTAN
1 BROOKLYN
2 QUEENS
3 STATEN ISLAND
4 BRONX
5 TEST
Your Operation:
df['borough_num'] = df['Borough'].replace(regex=['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND','BRONX'], value=[1, 2, 3, 4,5])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 TEST TEST
Replacing values of column Borough not in keys with 0 using np.where:
import numpy as np

keys = ['MANHATTAN', 'BROOKLYN', 'QUEENS', 'STATEN ISLAND', 'BRONX']
df['Borough'] = np.where(~df['Borough'].isin(keys), 0, df['Borough'])
df
Borough borough_num
0 MANHATTAN 1
1 BROOKLYN 2
2 QUEENS 3
3 STATEN ISLAND 4
4 BRONX 5
5 0 TEST
