Merge object with pandas dataframe - python

Below I have an object called westCountries, and right below it a dataframe called countryDF.
westCountries = {'West': ['US', 'CA', 'PR']}
# countryDF
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
I am wondering how I can include the westCountries object in my dataframe as a new column called Location. I have tried merging, but that doesn't really work because I need the value in this column to be the name of the key in my object, as seen below. NOTE: This output is only an example; I understand there are missing correlations between the data I provided and my desired output.
Country Location
0 US West
1 CA West
I was thinking of doing a few things such as:
using .isin() and then doing a few more transformations/computations on that dataframe to populate mine, but this route seems a bit foggy to me.
using df.loc[...] to compare my dataframe with the values in this list, and then creating my own column with the value of my choice.
converting my object into a dataframe, creating a new column in this temporary dataframe, and then merging by country so that the Location column is included in my countryDF dataframe.
However, I feel like there might be a more elegant solution than the approaches listed above, which is why I'm reaching out for help.

Use pandas.DataFrame.explode to pull the values out of the lists
Use a list comprehension to match values against the westCountries value lists and return the key
For the example, the sample dataframe column values are created as strings and need to be converted to list type with ast.literal_eval
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
westCountries = {'West': ['US', 'CA', 'PR']}
# remove the values from lists, with explode
df = df.explode('Country')
# create the Loc column using apply
df['Loc'] = df.Country.apply(lambda x: [k if x in v else None for k, v in westCountries.items()][0])
# drop rows with None
df = df.dropna()
# display(df)
Country Loc
0 US West
1 PR West
2 CA West
Option 2 (Better):
In the first option, for every row, .apply has to iterate through every key-value pair in westCountries using [k if x in v else None for k, v in westCountries.items()], which is slow.
It's better to reshape westCountries into a flat dict, with the region as the value and each country code as the key, using a dict comprehension.
Use pandas.Series.map to map the dict values into the new column
import pandas as pd
from ast import literal_eval # only for setting up the test dataframe
# setup the test dataframe
data = {'Country': ["['US']", "['PR']", "['CA']", "['HK']"]}
df = pd.DataFrame(data)
df.Country = df.Country.apply(literal_eval) # only for the test data
# remove the values from lists, with explode
df = df.explode('Country')
# given
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
# invert westCountries: each country code becomes a key, and its region the value
mapped = {x: k for k, v in westCountries.items() for x in v}
# print(mapped)
{'US': 'West', 'CA': 'West', 'PR': 'West', 'NY': 'East', 'NC': 'East'}
# map the dict to the column
df['Loc'] = df.Country.map(mapped)
# dropna
df = df.dropna()
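
To put rough numbers on the speed claim, here is a minimal, hypothetical timing sketch (not part of the original answer); it uses a next-based variant of the same per-row scan so it also handles the two-region dict:
import timeit
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
mapped = {x: k for k, v in westCountries.items() for x in v}
s = pd.Series(['US', 'PR', 'CA', 'HK'] * 10_000)  # a larger column so timings are visible

t_apply = timeit.timeit(
    lambda: s.apply(lambda x: next((k for k, v in westCountries.items() if x in v), None)),
    number=10)
t_map = timeit.timeit(lambda: s.map(mapped), number=10)
print(f'apply: {t_apply:.3f}s  map: {t_map:.3f}s')  # map should be considerably faster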

You can use pd.melt to reshape westCountries, then explode the df using df.explode and join with df.merge
westCountries = {'West': ['US', 'CA', 'PR']}
west = pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
df.explode('Country').merge(west, on='Country')
Country Loc
0 US West
1 PR West
2 CA West
Details
pd.DataFrame(westCountries)
# West
#0 US
#1 CA
#2 PR
# Now melt the above dataframe
pd.melt(pd.DataFrame(westCountries), var_name='Loc', value_name='Country')
# Loc Country
#0 West US
#1 West CA
#2 West PR
# Now, merge `df` after exploding with `west` on `Country`
df.explode('Country').merge(west, on='Country') # how='inner' by default in merge
# Country Loc
#0 US West
#1 PR West
#2 CA West
EDIT:
If the value lists in your westCountries dict have unequal lengths, then try this (note that numpy is imported for the fill value):
from itertools import zip_longest
import numpy as np  # needed for the fill value

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
west = pd.DataFrame(zip_longest(*westCountries.values(), fillvalue=np.nan),
                    columns=westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')
Example of the above:
df
Country
0 [US]
1 [PR]
2 [CA]
3 [HK]
4 [NY] #--> added `NY` from `East`.
westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
west = pd.DataFrame(zip_longest(*westCountries.values(), fillvalue=np.nan),
                    columns=westCountries.keys())
west = west.melt(var_name='Loc', value_name='Country').dropna()
df.explode('Country').merge(west, on='Country')
# Country Loc
#0 US West
#1 PR West
#2 CA West
#3 NY East
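As a side note, if the NaN padding with zip_longest feels roundabout, the long-form lookup frame can also be built straight from the dict items; a small sketch (not from the original answer), after which the merge works exactly as above:
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR'], 'East': ['NY', 'NC']}
# one (Loc, Country) row per dict entry, so unequal list lengths need no padding
west = pd.DataFrame(
    [(loc, country) for loc, countries in westCountries.items() for country in countries],
    columns=['Loc', 'Country'])
# df.explode('Country').merge(west, on='Country')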

This is probably not the fastest approach in terms of run time, but it works.
import pandas as pd
westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]","[PR]", "[CA]", "[HK]"], columns=["Country"])
df = df.assign(Location="")
for index, row in df.iterrows():
if any([True for country in westCountries.get('West') if country in row['Country']]):
row.Location='West'
west_df = df[df['Location'] != ""]
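
For comparison, a vectorized sketch of the same idea (assuming the Country values are the bracketed strings from the setup above) that avoids iterrows entirely:
import pandas as pd

westCountries = {'West': ['US', 'CA', 'PR']}
df = pd.DataFrame(["[US]", "[PR]", "[CA]", "[HK]"], columns=["Country"])

# strip the literal brackets, then test membership in one vectorized pass
in_west = df['Country'].str.strip('[]').isin(westCountries['West'])
df['Location'] = in_west.map({True: 'West', False: ''})
west_df = df[df['Location'] != ""]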

Related

Transform multiple rows of data into one based on multiple keys in pandas

I have a large CSV file of sports data, and I need to transform it so that teams with the same game_id are on the same row, creating new columns based on the homeAway column and the existing columns. Is there a way to do this with Pandas?
Existing format:
game_id school conference homeAway points
332410041 Connecticut American Athletic home 18
332410041 Towson CAA away 33
Desired format:
game_id home_school home_conference home_points away_school away_conference away_points
332410041 Connecticut American Athletic 18 Towson CAA 33
One way to solve this is to convert the table into a Pandas dataframe and filter it by 'homeAway' to create 'home' and 'away' dataframes. The columns of the 'away' table are relabelled, with the original key column preserved. The two are then merged to produce the desired output.
import pandas as pd
data = {'game_id': [332410041, 332410041],
        'school': ['Connecticut', 'Towson'],
        'conference': ['American Athletic', 'CAA'],
        'homeAway': ['home', 'away'],
        'points': [18, 33]}
df = pd.DataFrame(data)
home = df[df['homeAway'] == 'home']
del home['homeAway']
away = df[df['homeAway'] == 'away']
del away['homeAway']
away.columns = ['game_id', 'away_school', 'away_conference', 'away_points']
home.merge(away)
Create two dataframes selected by the unique values in the 'homeAway' column, 'home' and 'away', using Boolean indexing.
Drop the obsolete 'homeAway' column
Rename the appropriate columns with a 'home_', and 'away_' prefix.
This can be done in a for-loop, with each dataframe added to a list, which can be consolidated into a simple list-comprehension.
Use pd.merge to combine the two dataframes on the common 'game_id' column.
See Merge, join, concatenate and compare and Pandas Merging 101 for additional details.
import pandas as pd
# test dataframe
data = {'game_id': [332410041, 332410041, 662410041, 662410041, 772410041, 772410041],
        'school': ['Connecticut', 'Towson', 'NY', 'CA', 'FL', 'AL'],
        'conference': ['American Athletic', 'CAA', 'a', 'b', 'c', 'd'],
        'homeAway': ['home', 'away', 'home', 'away', 'home', 'away'],
        'points': [18, 33, 1, 2, 3, 4]}
df = pd.DataFrame(data)
# create list of dataframes
dfl = [(df[df.homeAway.eq(loc)]
        .drop('homeAway', axis=1)
        .rename({'school': f'{loc}_school',
                 'conference': f'{loc}_conference',
                 'points': f'{loc}_points'}, axis=1))
       for loc in df.homeAway.unique()]
# combine the dataframes
df_new = pd.merge(dfl[0], dfl[1])
# display(df_new)
game_id home_school home_conference home_points away_school away_conference away_points
0 332410041 Connecticut American Athletic 18 Towson CAA 33
1 662410041 NY a 1 CA b 2
2 772410041 FL c 3 AL d 4

Pandas to lookup and return corresponding values from many dataframes

I have a list of names, and I want to retrieve each name's corresponding information from several different dataframes to form a new dataframe.
I converted the list into a one-column dataframe, and then looked up its corresponding values in the different dataframes.
I have tried:
import pandas as pd
data = {'Name': ["David","Mike","Lucy"]}
data_h = {'Name': ["David", "Mike", "Peter", "Lucy"],
          'Hobby': ['Music', 'Sports', 'Cooking', 'Reading'],
          'Member': ['Yes', 'Yes', 'Yes', 'No']}
data_s = {'Name': ["David", "Lancy", "Mike", "Lucy"],
          'Speed': [56, 42, 35, 66],
          'Location': ['East', 'East', 'West', 'West']}
df = pd.DataFrame(data)
df_hobby = pd.DataFrame(data_h)
df_speed = pd.DataFrame(data_s)
df['Hobby'] = df.lookup(df['Name'], df_hobby['Hobby'])
print (df)
But it returns the error message as:
ValueError: Row labels must have same size as column labels
I have also tried:
df = pd.merge(df, df_hobby, on='Name')
It works but it includes unnecessary columns.
What would be a smart and efficient way to do this, especially when there are many dataframes to look up?
Thank you.
Filter each dataframe down to the key column and the column you want to append, then merge:
df = (pd.merge(df, df_hobby[['Name','Hobby']], on='Name')
.merge(df_speed[['Name','Location']], on='Name'))
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West
If you want to work with a list of dataframes, use functools.reduce with the filtered columns:
from functools import reduce

dfList = [df,
          df_hobby[['Name', 'Hobby']],
          df_speed[['Name', 'Location']]]
df = reduce(lambda df1, df2: pd.merge(df1, df2, on='Name'), dfList)
print(df)
Name Hobby Location
0 David Music East
1 Mike Sports West
2 Lucy Reading West

Trying to match values in one data frame to values in another data frame (python)

I currently have a dataframe A consisting of a column (code1) of country codes such as CA, RU, US, etc. I have another dataframe B with 3 columns: the first has all possible country codes, the second a longitude value, and the third a latitude value. I'm trying to loop through A, get each country code in the first column, match it to the country code in the first column of B, and then get the associated longitude and latitude of that country. I plan to create a new dataframe containing the codes from A (the first column) and the newly extracted longitude and latitude values.
So far my function looks as follows
def get_coords():
    for i in range(len(A["code1"])):
        for j in range(len(B["code"])):
            if A["code1"][i] == B["code"][j]:  # if the country codes match
                latitude = B["lat"][j]    # gets the latitude of the matched country code
                longitude = B["long"][j]  # gets the longitude
However, this seems to be inefficient and I'm not sure if it is even matching the codes from the dataframes correctly. Is there a better method of going about what I am trying to achieve?
For reference, len(A["code1"]) = 581 and len(B["code"]) = 5142.
Here is a sample input of data:
A = pd.DataFrame({'code1': ['US', 'RU', 'AO', 'ZW']})
B = pd.DataFrame({'code': ['US', 'ZW', 'RU', 'YE', 'AO'],
                  'long': [65.216000, 65.216000, 18.500000, -63.032000, 19.952000],
                  'lat': [12.500000, 33.677000, -12.500000, 18.237000, 60.198000]})
I am trying to have the output look like
A = pd.DataFrame({'code1': ['US', 'RU', 'AO', 'ZW'],
                  'longitude': [65.216000, 18.500000, 19.952000, 65.216000],
                  'latitude': [12.500000, -12.500000, 60.198000, 33.677000]})
Use pd.merge and specify the left_on column as well as the right_on column, since the two columns you want to merge on have different names. Then .drop the excess column that you don't need.
A = pd.merge(A,B,how='left',left_on='code1',right_on='code').drop(['code'], axis=1)
output:
code1 long lat
0 US 65.216 12.500
1 RU 18.500 -12.500
2 AO 19.952 60.198
3 ZW 65.216 33.677
In [108]: A = pd.DataFrame({'code1': ['US', 'RU', 'AO', 'ZW']})

In [109]: B = pd.DataFrame({'code': ['US', 'ZW', 'RU', 'YE', 'AO'],
     ...:                   'long': [65.216000, 65.216000, 18.500000, -63.032000, 19.952000],
     ...:                   'lat': [12.500000, 33.677000, -12.500000, 18.237000, 60.198000]})
In [110]: A.rename({"code1":"code"},axis=1,inplace=True)
In [111]: A = pd.merge(A,B, on="code").rename({"code":"code1"},axis=1)
In [112]: A
Out[112]:
code1 long lat
0 US 65.216 12.500
1 RU 18.500 -12.500
2 AO 19.952 60.198
3 ZW 65.216 33.677

Merging Two Dataframes without a Key Column

I have a requirement where I want to merge two data frames without any key column.
From the input table, I am treating the first three columns as one data frame and the last column as another one. My plan is to sort the second data frame and then merge it to the first one, without any key column, so that it looks like the output above.
Is it possible to merge in this way or if there are any alternatives?
One way is to use pd.DataFrame.join after filtering out null values.
Data from #ALollz.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
res = df1.join(pd.DataFrame(list(filter(None, df2.values)), columns=['comments']))
Result:
Country comments
0 USA X
1 UK Y
2 Finland Z
3 Spain NaN
4 Australia NaN
If by "sort the second dataframe" you mean move the NULL values to the end of the list and keep the rest of the order in tact, then this will get the job done.
import pandas as pd
df1 = pd.DataFrame({'Country': ['USA', 'UK', 'Finland', 'Spain', 'Australia'],
                    'Name': ['Sam', 'Chris', 'Jeff', 'Kartik', 'Mavenn']})
df2 = pd.DataFrame({'Comments': ['X', None, 'Y', None, 'Z']})
df1['Comments'] = df2[df2.Comments.notnull()].reset_index().drop(columns='index')
Country Name Comments
0 USA Sam X
1 UK Chris Y
2 Finland Jeff Z
3 Spain Kartik NaN
4 Australia Mavenn NaN
IIUC (this works because sort_values puts NaN values last by default):
input['Comments'] = input.Comments.sort_values().values
Output:
Comments Country Name
1 X USA Sam
2 Y UK Chris
3 Z Finland Jeff
4 NaN Spain Kartik
5 NaN Australia Maven

Applying Conditional Exclusions to Pandas DataFrame using Counts

I have the following DataFrame in pandas:
import pandas as pd
example_data = [{'ticker': 'aapl', 'loc': 'us'},
                {'ticker': 'mstf', 'loc': 'us'},
                {'ticker': 'baba', 'loc': 'china'},
                {'ticker': 'ibm', 'loc': 'us'},
                {'ticker': 'db', 'loc': 'germany'}]
df = pd.DataFrame(example_data)
print(df)
loc ticker
0 us aapl
1 us mstf
2 china baba
3 us ibm
4 germany db
I want to create a new DataFrame such that each row is created from the original df but rows with loc counts greater than 2 are excluded. That is, the new df is created by looping through the old df, counting the number of loc rows that have come before, and including / excluding the row based on this count.
The following code gives the desired output.
country_counts = {}
output = []
for row in df.values:
    if row[0] not in country_counts:
        country_counts[row[0]] = 1
    else:
        country_counts[row[0]] += 1
    if country_counts[row[0]] <= 2:
        output.append({'loc': row[0], 'ticker': row[1]})
new_df = pd.DataFrame(output)
print(new_df)
loc ticker
0 us aapl
1 us mstf
2 china baba
3 germany db
The output excludes the 4th row in the original df because its loc count is greater than 2 (i.e. 3).
Does there exist a better method to perform this type of operation? Any help is greatly appreciated.
How about groupby and .head:
In [90]: df.groupby('loc').head(2)
Out[90]:
loc ticker
0 us aapl
1 us mstf
2 china baba
4 germany db
Also, be careful with your column names, since loc clashes with the .loc method.
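For example, a minimal illustration of the clash (not from the original answer):
import pandas as pd

df = pd.DataFrame({'loc': ['us', 'china'], 'ticker': ['aapl', 'baba']})
print(type(df.loc))  # <class 'pandas.core.indexing._LocIndexer'> -- the indexer, not the column
print(df['loc'])     # bracket access still reaches the 'loc' column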