Pandas merge on variable columns

I have a table of sites with a land cover class and a state. I have another table with values linked to class and state. In the second table, however, some of the rows are linked only to class:
sites = pd.DataFrame({'id': ['a', 'b', 'c'],
'class': [1, 2, 23],
'state': ['al', 'ar', 'wy']})
values = pd.DataFrame({'class': [1, 1, 2, 2, 23],
'state': ['al', 'ar', 'al', 'ar', None],
'val': [10, 11, 12, 13, 16]})
I'd like to link the tables by class and state, except for those rows in the value table for which state is None, in which case they would be linked only by class.
A merge has the following result:
combined = sites.merge(values, how='left', on=['class', 'state'])
id class state val
0 a 1 al 10.0
1 b 2 ar 13.0
2 c 23 wy NaN
But I'd like val in the last row to be 16. Is there an inexpensive way to do this short of breaking up both tables, performing separate merges, and then concatenating the result?

How about merging them separately:
pd.concat([sites.merge(values, on=['class','state']),
sites.merge(values[values['state'].isna()].drop('state',axis=1),
on=['class'])
])
Output:
id class state val
0 a 1 al 10
1 b 2 ar 13
0 c 23 wy 16
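If you would rather keep a single left merge and fill in the gaps afterwards, something like the following might work. This is only a sketch under the assumption that each class has at most one state-less row in values; the names class_only and combined are just illustrative.

```python
import pandas as pd

sites = pd.DataFrame({'id': ['a', 'b', 'c'],
                      'class': [1, 2, 23],
                      'state': ['al', 'ar', 'wy']})
values = pd.DataFrame({'class': [1, 1, 2, 2, 23],
                       'state': ['al', 'ar', 'al', 'ar', None],
                       'val': [10, 11, 12, 13, 16]})

# Step 1: exact match on both keys; sites with no (class, state) match get NaN
combined = sites.merge(values, how='left', on=['class', 'state'])

# Step 2: class-only fallback values, indexed by class
# (assumes at most one state-less row per class)
class_only = values[values['state'].isna()].set_index('class')['val']

# Step 3: fill the remaining gaps from the class-only lookup
combined['val'] = combined['val'].fillna(combined['class'].map(class_only))
print(combined)
```

Row order and index of sites are preserved, which the concat approach does not give you without an extra sort.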

We can use combine_first here:
(sites.set_index(['class','state'])
.combine_first(values.set_index(['class','state']))
.dropna().reset_index())
class state id val
0 1 al a 10.0
1 2 ar b 13.0
2 23 wy c 16.0

Related

How to convert efficiently dictionary to dataframe [duplicate]

How can I convert a list of dictionaries into a DataFrame? Given:
[{'points': 50, 'time': '5:00', 'year': 2010},
{'points': 25, 'time': '6:00', 'month': "february"},
{'points':90, 'time': '9:00', 'month': 'january'},
{'points_h1':20, 'month': 'june'}]
I want to turn the above into a DataFrame:
month points points_h1 time year
0 NaN 50 NaN 5:00 2010
1 february 25 NaN 6:00 NaN
2 january 90 NaN 9:00 NaN
3 june NaN 20 NaN NaN
Note: Order of the columns does not matter.

If ds is a list of dicts:
df = pd.DataFrame(ds)
Note: this does not work with nested data.

How do I convert a list of dictionaries to a pandas DataFrame?
The other answers are correct, but not much has been explained in terms of advantages and limitations of these methods. The aim of this post will be to show examples of these methods under different situations, discuss when to use (and when not to use), and suggest alternatives.
DataFrame(), DataFrame.from_records(), and .from_dict()
Depending on the structure and format of your data, there are situations where either all three methods work, or some work better than others, or some don't work at all.
Consider a very contrived example.
np.random.seed(0)
data = pd.DataFrame(
    np.random.choice(10, (3, 4)), columns=list('ABCD')).to_dict('records')
print(data)
[{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
This list consists of "records" with every key present. This is the simplest case you could encounter.
# The following methods all produce the same output.
pd.DataFrame(data)
pd.DataFrame.from_dict(data)
pd.DataFrame.from_records(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Word on Dictionary Orientations: orient='index'/'columns'
Before continuing, it is important to distinguish between the different dictionary orientations and how pandas supports them. There are two primary types: "columns" and "index".
orient='columns'
Dictionaries with the "columns" orientation will have their keys correspond to columns in the equivalent DataFrame.
For example, data above is in the "columns" orient.
data_c = [
{'A': 5, 'B': 0, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'C': 3, 'D': 5},
{'A': 2, 'B': 4, 'C': 7, 'D': 6}]
pd.DataFrame.from_dict(data_c, orient='columns')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
Note: If you are using pd.DataFrame.from_records, the orientation is assumed to be "columns" (you cannot specify otherwise), and the dictionaries will be loaded accordingly.
orient='index'
With this orient, keys are assumed to correspond to index values. This kind of data is best suited for pd.DataFrame.from_dict.
data_i ={
0: {'A': 5, 'B': 0, 'C': 3, 'D': 3},
1: {'A': 7, 'B': 9, 'C': 3, 'D': 5},
2: {'A': 2, 'B': 4, 'C': 7, 'D': 6}}
pd.DataFrame.from_dict(data_i, orient='index')
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
This case is not considered in the OP, but is still useful to know.
Setting Custom Index
If you need a custom index on the resultant DataFrame, you can set it using the index=... argument.
pd.DataFrame(data, index=['a', 'b', 'c'])
# pd.DataFrame.from_records(data, index=['a', 'b', 'c'])
A B C D
a 5 0 3 3
b 7 9 3 5
c 2 4 7 6
This is not supported by pd.DataFrame.from_dict.
Dealing with Missing Keys/Columns
All methods work out-of-the-box when handling dictionaries with missing keys/column values. For example,
data2 = [
{'A': 5, 'C': 3, 'D': 3},
{'A': 7, 'B': 9, 'F': 5},
{'B': 4, 'C': 7, 'E': 6}]
# The methods below all produce the same output.
pd.DataFrame(data2)
pd.DataFrame.from_dict(data2)
pd.DataFrame.from_records(data2)
A B C D E F
0 5.0 NaN 3.0 3.0 NaN NaN
1 7.0 9.0 NaN NaN NaN 5.0
2 NaN 4.0 7.0 NaN 6.0 NaN
Reading Subset of Columns
"What if I don't want to read in every single column"? You can easily specify this using the columns=... parameter.
For example, from the example dictionary of data2 above, if you wanted to read only columns 'A', 'D', and 'F', you can do so by passing a list:
pd.DataFrame(data2, columns=['A', 'D', 'F'])
# pd.DataFrame.from_records(data2, columns=['A', 'D', 'F'])
A D F
0 5.0 3.0 NaN
1 7.0 NaN 5.0
2 NaN NaN NaN
This is not supported by pd.DataFrame.from_dict with the default orient "columns".
pd.DataFrame.from_dict(data2, orient='columns', columns=['A', 'B'])
ValueError: cannot use columns parameter with orient='columns'
Reading Subset of Rows
Not supported by any of these methods directly. You will have to iterate over your data and perform a reverse delete in-place as you iterate. For example, to extract only the 0th and 2nd rows from data2 above, you can use:
rows_to_select = {0, 2}
for i in reversed(range(len(data2))):
    if i not in rows_to_select:
        del data2[i]
pd.DataFrame(data2)
# pd.DataFrame.from_dict(data2)
# pd.DataFrame.from_records(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
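Note that the reverse delete above modifies data2 in place. If the original list should stay intact, a list comprehension is one possible alternative. This is a minimal sketch; the name selected is just illustrative.

```python
import pandas as pd

data2 = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6}]

rows_to_select = {0, 2}

# Build a new list with only the wanted records; data2 itself is left untouched
selected = [row for i, row in enumerate(data2) if i in rows_to_select]
df = pd.DataFrame(selected)
print(df)
```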
The Panacea: json_normalize for Nested Data
A strong, robust alternative to the methods outlined above is the json_normalize function which works with lists of dictionaries (records), and in addition can also handle nested dictionaries.
pd.json_normalize(data)
A B C D
0 5 0 3 3
1 7 9 3 5
2 2 4 7 6
pd.json_normalize(data2)
A B C D E
0 5.0 NaN 3 3.0 NaN
1 NaN 4.0 7 NaN 6.0
Again, keep in mind that the data passed to json_normalize needs to be in the list-of-dictionaries (records) format.
As mentioned, json_normalize can also handle nested dictionaries. Here's an example taken from the documentation.
data_nested = [
{'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}],
'info': {'governor': 'Rick Scott'},
'shortname': 'FL',
'state': 'Florida'},
{'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}],
'info': {'governor': 'John Kasich'},
'shortname': 'OH',
'state': 'Ohio'}
]
pd.json_normalize(data_nested,
record_path='counties',
meta=['state', 'shortname', ['info', 'governor']])
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
For more information on the meta and record_path arguments, check out the documentation.
Summarising
Here's a table of all the methods discussed above, along with supported features/functionality.
* Use orient='columns' and then transpose to get the same effect as orient='index'.

In pandas 0.16.2, I had to do pd.DataFrame.from_records(d) to get this to work.

You can also use pd.DataFrame.from_dict(d) as follows:
In [8]: d = [{'points': 50, 'time': '5:00', 'year': 2010},
...: {'points': 25, 'time': '6:00', 'month': "february"},
...: {'points':90, 'time': '9:00', 'month': 'january'},
...: {'points_h1':20, 'month': 'june'}]
In [12]: pd.DataFrame.from_dict(d)
Out[12]:
month points points_h1 time year
0 NaN 50.0 NaN 5:00 2010.0
1 february 25.0 NaN 6:00 NaN
2 january 90.0 NaN 9:00 NaN
3 june NaN 20.0 NaN NaN

Python 3:
Most of the solutions listed previously work. However, there are cases where the DataFrame's row index is not needed and each row (record) has to be written out individually.
The following method is useful in that case.
import csv
# raw string so the backslashes in the Windows path are not treated as escapes
myfile = r'C:\Users\John\Desktop\export_dataframe.csv'
records_to_save = data2  # used as in the thread
colnames = list(records_to_save[0].keys())
# remember colnames is a list of the keys of the first record. All values are written
# corresponding to those keys and "None" is specified in case of a missing value
with open(myfile, 'w', newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(colnames)
    for d in records_to_save:
        writer.writerow([d.get(r, "None") for r in colnames])
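When the records do not all share the same keys, taking the column names from the first record alone will silently drop fields. A possible variation uses csv.DictWriter from the standard library and collects the keys from every record. This is only a sketch: the io.StringIO buffer stands in for a real file, and the names records and fieldnames are illustrative.

```python
import csv
import io

records = [
    {'A': 5, 'C': 3, 'D': 3},
    {'A': 7, 'B': 9, 'F': 5},
    {'B': 4, 'C': 7, 'E': 6}]

# Collect every key that appears in any record, keeping first-seen order
fieldnames = list(dict.fromkeys(key for record in records for key in record))

buffer = io.StringIO()  # stand-in for an open file (assumption for this sketch)
# restval is the value written for keys missing from a record
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='None')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with a file opened with newline="" as in the answer above.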

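The easiest way I have found to do it is like this:
dict_count = len(dict_list)
df = pd.DataFrame(dict_list[0], index=[0])
# range() excludes its end value, so no "- 1" is needed to reach the last dict
for i in range(1, dict_count):
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame(dict_list[i], index=[0])], ignore_index=True)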

I have the following list of dicts with datetime keys and int values:
import datetime
# named date_counts rather than list, to avoid shadowing the built-in
date_counts = [{datetime.date(2022, 2, 10): 7}, {datetime.date(2022, 2, 11): 1}, {datetime.date(2022, 2, 11): 1}]
I had trouble converting it to a DataFrame with the methods above, because they created a DataFrame with the dates as columns.
My solution:
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = [pd.DataFrame.from_dict(d, orient='index') for d in date_counts]
df = pd.concat(frames)

Iterate through two data frames and update a column of the first data frame with a column of the second data frame in pandas

I am converting a piece of code written in R to Python. The following code is in R. df1 and df2 are the dataframes. id, case, feature, feature_value are column names. The code in R is
for(i in 1:dim(df1)[1]){
  temp = subset(df2,df2$id == df1$case[i],select = df1$feature[i])
  df1$feature_value[i] = temp[,df1$feature[i]]
}
My code in Python is as follows.
for i in range(0,len(df1)):
    temp=np.where(df1['case'].iloc[i]==df2['id']),df1['feature'].iloc[i]
    df1['feature_value'].iloc[i]=temp[:,df1['feature'].iloc[i]]
but it gives
TypeError: tuple indices must be integers or slices, not tuple
How to rectify this error? Appreciate any help.

Unfortunately, R and Pandas handle dataframes pretty differently. If you'll be using Pandas a lot, it would probably be worth going through a tutorial on it.
I'm not too familiar with R so this is what I think you want to do:
Find rows in df1 where the 'case' matches an 'id' in df2. If such a row is found, add the "feature" in df1 to a new df1 column called "feature_value."
If so, you can do this with the following:
#create a sample df1 and df2
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
>>> df1
case feature
0 1 3
1 2 4
2 3 5
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})
>>> df2
id age
0 1 45
1 3 63
2 7 39
#create a list with all the "id" values of df2
>>> df2_list = df2['id'].to_list()
>>> df2_list
[1, 3, 7]
#lambda allows small functions; in this case, the value of df1['feature_value']
#for each row is assigned df1['feature'] if df1['case'] is in df2_list,
#and otherwise it is assigned np.nan.
>>> df1['feature_value'] = df1.apply(lambda x: x['feature'] if x['case'] in df2_list else np.nan, axis=1)
>>> df1
case feature feature_value
0 1 3 3.0
1 2 4 NaN
2 3 5 5.0
Instead of a lambda, a full function can be created, which may be easier to understand:
def get_feature_values(df, id_list):
    if df['case'] in id_list:
        feature_value = df['feature']
    else:
        feature_value = np.nan
    return feature_value
df1['feature_value'] = df1.apply(get_feature_values, id_list=df2_list, axis=1)
Another way of going about this would involve merging df1 and df2 to find rows where the "case" value in df1 matches an "id" value in df2 (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)
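A rough sketch of what that merge-based route could look like, using the same sample df1 and df2 as above. It assumes the id values in df2 are unique, so the left merge does not duplicate rows of df1. The indicator=True option adds a _merge column that records whether each row found a match.

```python
import pandas as pd

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39]})

# Left merge keeps every row of df1; "_merge" is 'both' where an id matched
merged = df1.merge(df2[['id']], how='left', left_on='case', right_on='id',
                   indicator=True)

# Keep "feature" where a match exists, NaN elsewhere
merged['feature_value'] = merged['feature'].where(merged['_merge'] == 'both')

result = merged[['case', 'feature', 'feature_value']]
print(result)
```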
===================
To address the follow-up question in the comments:
You can do this by merging the dataframes and then creating a function.
#create example dataframes
>>> df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
>>> df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39], 'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})
#merge the dataframes
>>> df1 = df1.merge(df2, how='left', left_on='case', right_on='id')
>>> df1
case feature names id age a b c
0 1 3 a 1.0 45.0 30.0 40.0 50.0
1 2 4 b NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0
Then you can create the following function:
def get_feature_values_2(df):
    if pd.notnull(df['id']):
        feature_value = df['feature']
        column_of_interest = df['names']
        feature_extended_value = df[column_of_interest]
    else:
        feature_value = np.nan
        feature_extended_value = np.nan
    return feature_value, feature_extended_value
# "result_type='expand'" allows multiple values to be returned from the function
df1[['feature_value', 'feature_extended_value']] = df1.apply(get_feature_values_2, result_type='expand', axis=1)
#This results in the following dataframe:
case feature names id age a b c feature_value \
0 1 3 a 1.0 45.0 30.0 40.0 50.0 3.0
1 2 4 b NaN NaN NaN NaN NaN NaN
2 3 5 c 3.0 63.0 31.0 41.0 51.0 5.0
feature_extended_value
0 30.0
1 NaN
2 51.0
#To keep only a subset of the columns:
#First create a copy-pasteable list of the column names
list(df1.columns)
['case', 'feature', 'names', 'id', 'age', 'a', 'b', 'c', 'feature_value', 'feature_extended_value']
#Choose the subset of columns you would like to keep
df1 = df1[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']]
df1
case feature names feature_value feature_extended_value
0 1 3 a 3.0 30.0
1 2 4 b NaN NaN
2 3 5 c 5.0 51.0
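For larger frames, the row-wise apply can get slow. One possible vectorized variation is to pick one cell per row with NumPy indexing. This is only a sketch: it assumes the set of columns that names can point at is known up front (lookup_cols below is an illustrative name) and that every entry in names is one of them, since get_indexer returns -1 for unknown labels.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'case': [1, 2, 3], 'feature': [3, 4, 5], 'names': ['a', 'b', 'c']})
df2 = pd.DataFrame({'id': [1, 3, 7], 'age': [45, 63, 39],
                    'a': [30, 31, 32], 'b': [40, 41, 42], 'c': [50, 51, 52]})

merged = df1.merge(df2, how='left', left_on='case', right_on='id')

# Columns that "names" may refer to (assumed known in advance)
lookup_cols = ['a', 'b', 'c']
# Position of each row's requested column within lookup_cols
col_pos = pd.Index(lookup_cols).get_indexer(merged['names'])
row_pos = np.arange(len(merged))

# Pick one cell per row; unmatched rows are already NaN after the left merge
merged['feature_extended_value'] = merged[lookup_cols].to_numpy()[row_pos, col_pos]
# "feature" is kept only where the merge found an id
merged['feature_value'] = merged['feature'].where(merged['id'].notna())
print(merged[['case', 'feature', 'names', 'feature_value', 'feature_extended_value']])
```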

Pandas : Merge 2 dataframe based on common column which contains dictionary

How to compare and merge two dataframes based on common columns that have different dictionaries?
I have the following two dataframes,
df1 = pd.DataFrame({'name':['tom','keith','sam','joe'],'assets':[{'laptop':1,'scanner':2},{'laptop':1,'printer':3}, {'car':12,'keys':34},{'power-cables':24}]})
df2 = pd.DataFrame({'place':['ca','bal-vm'],'default_assets':[{'laptop':4,'printer':3,'scanner':2,'bag':8},{'car':12,'keys':34,'mat':24,'holder':45}]})
df1:
name assets
0 tom {'laptop':1,'scanner':2}
1 keith {'laptop':1,'printer':3}
2 sam {'car':12,'keys':34}
3 joe {'power-cables':24}
df2:
place default_assets
0 ca {'laptop':4,'printer':3,'scanner':2,'bag':8}
1 bal-vm {'car':12,'keys':34,'mat':24,'holder':45}
df2 is supposed to be merged with df1 when all the keys of df1.assets are in df2.default_assets, else None should be filled.
So the resultant df should be,
df:
name place assets default_assets
0 tom ca {'laptop':1,'scanner':2} {'laptop':4,'printer':3,'scanner':2,'bag':8}
1 keith ca {'laptop':1,'printer':3} {'laptop':4,'printer':3,'scanner':2,'bag':8}
2 sam bal-vm {'car':12,'keys':34} {'car':12,'keys':34,'mat':24,'holder':45}
3 joe None {'power-cables':24} None

You could do the following:
Do a cross join (cross product) of every row of df1 with df2.
Then keep only the rows where all the keys of df1.assets are in df2.default_assets.
Add back the rows of df1 that had no match, with pandas.concat.
For example:
# cross join
merged = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
# mask to filter
mask = [asset.keys() <= default.keys() for asset, default in zip(merged['assets'], merged['default_assets'])]
# add those not in the mask
result = pd.concat([merged.loc[mask], df1], sort=True).drop_duplicates('name')
# print in full
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(result)
Output
assets \
0 {'laptop': 1, 'scanner': 2}
2 {'laptop': 1, 'printer': 3}
5 {'car': 12, 'keys': 34}
3 {'power-cables': 24}
default_assets name place
0 {'laptop': 4, 'printer': 3, 'scanner': 2, 'bag... tom ca
2 {'laptop': 4, 'printer': 3, 'scanner': 2, 'bag... keith ca
5 {'car': 12, 'keys': 34, 'mat': 24, 'holder': 45} sam bal-vm
3 NaN joe NaN
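On pandas 1.2 or later, the helper key column is not needed because merge accepts how='cross'. A possible variation that also avoids concat and drop_duplicates is sketched below. It assumes the values in name are unique and that each person matches at most one place; the names pairs, is_subset, and matched are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['tom', 'keith', 'sam', 'joe'],
                    'assets': [{'laptop': 1, 'scanner': 2}, {'laptop': 1, 'printer': 3},
                               {'car': 12, 'keys': 34}, {'power-cables': 24}]})
df2 = pd.DataFrame({'place': ['ca', 'bal-vm'],
                    'default_assets': [{'laptop': 4, 'printer': 3, 'scanner': 2, 'bag': 8},
                                       {'car': 12, 'keys': 34, 'mat': 24, 'holder': 45}]})

# how='cross' pairs every row of df1 with every row of df2
pairs = df1.merge(df2, how='cross')

# dict key views compare like sets; <= means "subset or equal"
is_subset = [a.keys() <= d.keys()
             for a, d in zip(pairs['assets'], pairs['default_assets'])]
matched = pairs.loc[is_subset, ['name', 'place', 'default_assets']]

# Left merge back onto df1 so names without a match keep a row of missing values
result = df1.merge(matched, on='name', how='left')
print(result[['name', 'place']])
```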

Map three rows of data into a matrix

I have a dataset of movie ratings that looks as follows:
I want to map this into a matrix where the index is the user id, the columns are the movie ids, and the values are the ratings.
What I have done so far is:
movies = df['movieId'].unique()
users = df['userId'].unique()
data_set = pd.DataFrame({'userId':users})
data_set = data_set.set_index('userId')
for movie in movies:
    data_set[movie] = 0
So now I need to fill those cells with the corresponding ratings, but this is a messy and slow process.

Consider the dataframe df
df = pd.DataFrame([
[1, 11, 1],
[1, 12, 5],
[2, 11, 3],
[2, 13, 4]
], columns=['userid', 'movieid', 'rating'])
option 1
pivot
df.pivot(index='userid', columns='movieid', values='rating')
option 2
set_index + unstack
df.set_index(['userid', 'movieid']).rating.unstack()
Both yield
movieid 11 12 13
userid
1 1.0 5.0 NaN
2 3.0 NaN 4.0
However, the unstack method has a fill_value parameter that lets you keep the integer dtype
df.set_index(['userid', 'movieid']).rating.unstack(fill_value=0)
movieid 11 12 13
userid
1 1 5 0
2 3 0 4
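For reference, the two options can be run side by side as below. This is just a sketch on the same toy frame, assuming each (userid, movieid) pair appears at most once, since both pivot and unstack raise an error on duplicate pairs. The names wide_pivot and wide_unstack are illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    [1, 11, 1],
    [1, 12, 5],
    [2, 11, 3],
    [2, 13, 4]
], columns=['userid', 'movieid', 'rating'])

# Option 1: pivot leaves NaN where a user has not rated a movie
wide_pivot = df.pivot(index='userid', columns='movieid', values='rating')

# Option 2: unstack with fill_value=0 fills the gaps and keeps integers
wide_unstack = df.set_index(['userid', 'movieid'])['rating'].unstack(fill_value=0)

print(wide_pivot)
print(wide_unstack)
```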

Pandas, Pivoting DataFrame By Multiple Hierarchical Columns

I'm following
https://nikolaygrozev.wordpress.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/
but am facing a different scenario for pivoting DataFrames.
The basic pivot command is like this:
d.pivot(index='Item', columns='CType', values='USD')
Now suppose my 'Item' belongs to two categories, 'Area' and 'Region', held in two other data columns. I want the pivoted result to contain those three levels (Region, Area, Item). How can I do that?
I had been looking for answers everywhere, and had been trying methods like 'unstack', 'droplevel', 'reset_index', etc, but wasn't able to make them work myself.
Please help.
Thanks

First, you probably want to use pd.pivot_table. Second, when you want to have multiple columns along a dimension, you need to pass them as a list (e.g. index=['Item', 'Area', 'Region']).
# Random data.
np.random.seed(0)
df = pd.DataFrame({'Area': ['A', 'A', 'A', 'B', 'B', 'B'],
'Region': ['r', 's', 'r', 's', 'r', 'r'],
'Item': ['car' ,'car', 'car', 'truck', 'bus', 'bus'],
'CType': [3, 4, 3, 3, 5, 5],
'USD': np.random.rand(6) * 100})
>>> df
Area CType Item Region USD
0 A 3 car r 54.881350
1 A 4 car s 71.518937
2 A 3 car r 60.276338
3 B 3 truck s 54.488318
4 B 5 bus r 42.365480
5 B 5 bus r 64.589411
>>> pd.pivot_table(df,
                   index=['Item', 'Area', 'Region'],
                   columns='CType',
                   values='USD',
                   aggfunc='sum')
CType 3 4 5
Item Area Region
bus B r NaN NaN 106.954891
car A r 115.157688 NaN NaN
s NaN 71.518937 NaN
truck B s 54.488318 NaN NaN
