Merge two dataframes with overlapping index, keeping column values from left DataFrame

Merge two dataframes with overlapping index, keeping column values from left DataFrame - python

How can I join/merge two Pandas DataFrames with partially overlapping indexes, where I wish the resulting joined DataFrame to retain the column values in the first DataFrame i.e. dropping the duplicates in df2?
import pandas as pd
import io
df1 = """
date; count
'2020-01-01'; 210
'2020-01-02'; 189
'2020-01-03'; 612
'2020-01-04'; 492
'2020-01-05'; 185
'2020-01-06'; 492
'2020-01-07'; 155
'2020-01-08'; 62
'2020-01-09'; 15
"""
df2 = """
date; count
'2020-01-04'; 21
'2020-01-05'; 516
'2020-01-06'; 121
'2020-01-07'; 116
'2020-01-08'; 82
'2020-01-09'; 121
'2020-01-10'; 116
'2020-01-11'; 82
'2020-01-12'; 116
'2020-01-13'; 82
"""
df1 = pd.read_csv(io.StringIO(df1), sep=";")
df2 = pd.read_csv(io.StringIO(df2), sep=";")
print(df1)
print(df2)
I have tried using
df1.reset_index().merge(df2, how='outer').set_index('date')
however, this drops the joined df2 values. Is there a method to keep the duplicated rows of the first dataframe?
Desired outcome:
print(df3)
date count
'2020-01-01' 210
'2020-01-02' 189
'2020-01-03' 612
'2020-01-04' 492
'2020-01-05' 185
'2020-01-06' 492
'2020-01-07' 155
'2020-01-08' 62
'2020-01-09' 15
'2020-01-10' 116
'2020-01-11' 82
'2020-01-12' 116
'2020-01-13' 82
Any help greatly appreciated, thank you.

Use combine_first:
df3 = (df1.set_index('date')
.combine_first(df2.set_index('date'))
.reset_index()
)
Output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82

here is another way usingconcat and drop_duplicates:
df3=pd.concat([df1, df2]).drop_duplicates(["date"], keep="first", ignore_index=True)
output:
date count
0 '2020-01-01' 210
1 '2020-01-02' 189
2 '2020-01-03' 612
3 '2020-01-04' 492
4 '2020-01-05' 185
5 '2020-01-06' 492
6 '2020-01-07' 155
7 '2020-01-08' 62
8 '2020-01-09' 15
9 '2020-01-10' 116
10 '2020-01-11' 82
11 '2020-01-12' 116
12 '2020-01-13' 82

Related

Perform operation on columns based on values of another columns in pandas

I have a dataframe
df = pd.DataFrame([["A",1,98,88,"",567,453,545,656,323,756], ["B",1,99,"","",231,232,234,943,474,345], ["C",1,97,67,23,543,458,456,876,935,876], ["B",1,"",79,84,895,237,678,452,545,453], ["A",1,45,"",58,334,778,234,983,858,657], ["C",1,23,55,"",183,565,953,565,234,234]], columns=["id","date","col1","col2","col3","col1_num","col1_deno","col3_num","col3_deno","col2_num","col2_deno"])
I need to make Nan/blank values for respective _num and _deno for column name. Ex: Make values Nan/blank for "col1_num" and "col1_deno" if particular row of "col1" is blank. Repeat the same process for "col2_num" and "col2_deno" based on "col2", and for "col3_num" and "col3_deno" based on "col3".
Expected Output:
df_out = pd.DataFrame([["A",1,98,88,"",567,453,"","",323,756], ["B",1,99,"","",231,232,"","","",""], ["C",1,97,67,23,543,458,456,876,935,876], ["B",1,"",79,84,"","",678,452,545,453], ["A",1,45,"",58,334,778,234,983,"",""], ["C",1,23,55,"",183,565,"","",234,234]], columns=["id","date","col1","col2","col3","col1_num","col1_deno","col3_num","col3_deno","col2_num","col2_deno"])
How to do it?

Let us try with boolean masking:
# select the columns
c = pd.Index(['col1', 'col2', 'col3'])
# create boolean mask
m = df[c].eq('').to_numpy()
# mask the values in `_num` and `_deno` like columns
df[c + '_num'] = df[c + '_num'].mask(m, '')
df[c + '_deno'] = df[c + '_deno'].mask(m, '')
>>> df
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num col2_deno
0 A 1 98 88 567 453 323 756
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935 876
3 B 1 79 84 678 452 545 453
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234 234

#shubham's answer is simple and to the point and I believe faster as well; this is just an option, where you may not be able to (or want to) list all the columns
Get the list of columns that need to be changed:
cols = [col for col in df if col.startswith('col')]
['col1',
'col2',
'col3',
'col1_num',
'col1_deno',
'col3_num',
'col3_deno',
'col2_num',
'col2_deno']
Create a dictionary pairing col1 to the columns to be changed, same for col2 and so on:
from collections import defaultdict
d = defaultdict(list)
for col in cols:
if "_" in col:
d[col.split("_")[0]].append(col)
d
defaultdict(list,
{'col1': ['col1_num', 'col1_deno'],
'col3': ['col3_num', 'col3_deno'],
'col2': ['col2_num', 'col2_deno']})
Iterate through the dict to assign the new values:
for key, val in d.items():
df.loc[df[key].eq(""), val] = ""
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num col2_deno
0 A 1 98 88 567 453 323 756
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935 876
3 B 1 79 84 678 452 545 453
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234 234

Solution with MultiIndex:
#first convert not processing and testing columns to index
df1 = df.set_index(['id','date'])
cols = df1.columns
#split columns by _ for MultiIndex
df1.columns = df1.columns.str.split('_', expand=True)
#compare columns without _ (with NaN in second level) by empty string
m = df1.xs(np.nan, axis=1, level=1).eq('')
#create mask by all columns
mask = m.reindex(df1.columns, axis=1, level=0)
#set new values by mask, overwrite columns names
df1 = df1.mask(mask, '').set_axis(cols, axis=1).reset_index()
print (df1)
id date col1 col2 col3 col1_num col1_deno col3_num col3_deno col2_num \
0 A 1 98 88 567 453 323
1 B 1 99 231 232
2 C 1 97 67 23 543 458 456 876 935
3 B 1 79 84 678 452 545
4 A 1 45 58 334 778 234 983
5 C 1 23 55 183 565 234
col2_deno
0 756
1
2 876
3 453
4
5 234

Pandas Dataframe

I have a dataframe containing a number of columns and rows, in all of the columns except for the leftmost two, there is data of the form "integer-integer". I would like to split all of these columns into two columns, with each integer in its own cell, and remove the dash.
I have tried to follow the answers in Pandas Dataframe: Split multiple columns each into two columns, but it seems that they are splitting after one element, while I would like to split on the "-".
By way of example, suppose I have a dataframe of the form:
I would like to split the columns labelled 2 through to 22, to have them called 2F, 2A, 3F, 3A, ..., 6A with the data in the first row being R1, Hawthorn, 229, 225, 91, 81, ..., 12.
Thank you for any help.

You can use DataFrame.set_index with DataFrame.stack for Series, then split to new 2 columns by Series.str.split, convert to integers, create new columns names by DataFrame.set_axis, reshape by DataFrame.unstack, sorting columns by DataFrame.sort_index and last flatten MultiIndexwith convert index to columns by DataFrame.reset_index:
#first replace columns names to default values
df.columns = range(len(df.columns))
df = (df.set_index([0,1])
.stack()
.str.split('-', expand=True)
.astype(int)
.set_axis(['F','A'], axis=1, inplace=False)
.unstack()
.sort_index(axis=1, level=[1,0], ascending=[True, False]))
df.columns = df.columns.map(lambda x: f'{x[1]}{x[0]}')
df = df.reset_index()
print (df)
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 216 142 439 367 7 12
1 R2 Sydney 226 214 93 92 151 167 377 381 12 8
2 R3 Geelong 216 228 91 166 159 121 369 349 16 14
3 R4 North Melbourne 213 239 169 126 142 155 355 394 8 9
4 R5 Gold Coast 248 226 166 94 267 169 455 389 18 6
5 R6 St Kilda 242 197 118 161 158 156 466 353 15 16
6 R7 Fremantle 225 219 72 84 224 185 449 464 7 5

For Input:
df = pd.DataFrame({0: ['R1'], 1: ['Hawthorn'], 2: ['229-225'], 3: ['91-81'], 4:['210-142'], 5:['439-367'], 6:['7-12']})
0 1 2 3 4 5 6
0 R1 Hawthorn 229-225 91-81 210-142 439-367 7-12
Trying the code:
for i in df.columns[2::]:
df[[str(i)+'F', str(i)+'A']] =pd.DataFrame(df[i].str.split('-').tolist(), index= df.index)
del df[i]
Prints (1st row):
0 1 2F 2A 3F 3A 4F 4A 5F 5A 6F 6A
0 R1 Hawthorn 229 225 91 81 210 142 439 367 7 12

you can use lambda function for split a series
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
>>> data
0 12-24
1 13-26
2 14-28
3 15-30
df["d1"] = df["data"].apply(lambda x: x.split("-")[0])
df["d2"] = df["data"].apply(lambda x: x.split("-")[1])
df.head()
>>>
data d1 d2
0 12-24 12 24
1 13-26 13 26
2 14-28 14 28
3 15-30 15 30

Pythonic way for do math among columns within a MultiIndex dataframe and concat the result as a seperate column in original dataframe

I was trying to figure out
cols = pd.MultiIndex.from_product([['Company A','Company B'],['VWAL','Volumn']],names=[u'Entity',u'Indicator'])
rows = pd.date_range(start='2018-01-01',periods=6,freq='D')
df = pd.DataFrame(np.random.random_integers(1,100,(6,4)),index=rows,columns=cols)
In [245]: df
Out[245]:
Entity Company A Company B
Indicator VWAL Volumn VWAL Volumn
2018-01-01 92 3 22 59
2018-01-02 90 67 52 69
2018-01-03 12 10 35 11
2018-01-04 83 7 62 5
2018-01-05 35 74 27 19
2018-01-06 97 50 93 39
if i want to calculate 3rd-Column = VWAL+Volumn, 4th-Column=VWAL-Volumn for each company respectively, and concatenate them as separate columns under respective company, what is the most efficient / pythonic way to do this? (note: there could be thousands of Companies, and rows for several years, I am thinking about using generator to iterate over "company" label to save memory and speed up process)
I tried my way as below, but was stuck on dealing with the MultiIndex when concatenating the results.
temp = df.loc(axis=1)[:,'VWAL'].values+df.loc(axis=1)[:,'Volumn'].values
df2 = pd.concat([df,temp],axis=1,join='inner',keys=?????)

You can use:
groupby by first level of MultiIndex in columns and aggregate sum
create MultiIndex by MultiIndex.from_product for align data
concat and sort columns names by sort_index
df1 = df.groupby(axis=1, level=0).sum()
df1.columns = pd.MultiIndex.from_product([df1.columns, ['new']])
print (df1)
Company A Company B
new new
2018-01-01 160 117
2018-01-02 142 185
2018-01-03 145 107
2018-01-04 144 110
2018-01-05 116 178
2018-01-06 119 124
df = pd.concat([df, df1], axis=1).sort_index(axis=1)
print (df)
Entity Company A Company B
Indicator VWAL Volumn new VWAL Volumn new
2018-01-01 67 93 160 99 18 117
2018-01-02 84 58 142 87 98 185
2018-01-03 97 48 145 74 33 107
2018-01-04 47 97 144 26 84 110
2018-01-05 79 37 116 97 81 178
2018-01-06 69 50 119 56 68 124

How to get all fields for only a specic user_id from a pivot dataframe indexed by two fields 'timestamps' and 'user_id'?

I have the table below contained in the DataFrame pivoted below :
cost cost cost val1 val1 val1
user_id 1 2 3 1 2 3
timestamp
01/01/2011 1 100 3 5
01/02/2011 20 8
01/07/2012 19 57
01/11/2012 3100 49
21/12/2012 240 30
14/09/2013 21 63
01/12/2013 3200 51
I would like to know how I obtain another dataframe containing only fields associated to a specific user-id, i.e (based on my example) to be able to obtain something like df_by_user_id = pivoted ['user_id'=1] or df_by_user_id = pivoted ['user_id'=2] or df_by_user_id = pivoted ['user_id'=3] (knowing that the table above is grouped by 'timestamp' and 'user_id). (My final purpose being to be able to make a plot for each user_id).
The code use in order to obtain the above table is :
import pandas as pd
newnames = ['timestamp','user_id', 'cost', 'val1','val2', 'val3','code']
df = pd.read_csv('mytest.csv', names = newnames, header = False, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
pivoted = df.pivot('timestamp', 'user_id')
Thanks in advance for your help.

So let's start out with this reproducible dataframe:
import numpy as np
import pandas
np.random.seed(0)
N = 6
data = np.random.random_integers(low=0, high=200, size=(N, N))
cols = pandas.MultiIndex.from_product([('cost', 'value'), (1, 2, 3)], names=['quantity', 'user_id'])
dates = pandas.DatetimeIndex(freq='1M', start='2010-01-01', periods=N, name='date')
df = pandas.DataFrame(data, columns=cols, index=dates)
which is:
quantity cost value
user_id 1 2 3 1 2 3
date
2010-01-31 172 47 117 192 67 195
2010-02-28 103 9 21 36 87 70
2010-03-31 88 140 58 193 39 87
2010-04-30 174 88 81 165 25 77
2010-05-31 72 9 148 115 197 79
2010-06-30 175 192 82 99 177 29
Take a cross-section (xs) along axis 1 of the dataframe
df.xs(1, level='user_id', axis=1)
Which gives:
quantity cost value
date
2010-01-31 172 192
2010-02-28 103 36
2010-03-31 88 193
2010-04-30 174 165
2010-05-31 72 115
2010-06-30 175 99
Alternatively, you could pick out all of the costs with:
df.xs('cost', level='quantity', axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
Since that level of the columns isn't named in your dataframe, you can access it with it's index:
df.xs('cost', level=0, axis=1)
user_id 1 2 3
date
2010-01-31 172 47 117
2010-02-28 103 9 21
2010-03-31 88 140 58
2010-04-30 174 88 81
2010-05-31 72 9 148
2010-06-30 175 192 82
If you had a multi-level index on rows, you could use axis=0 to select items base on row labels. But since you're concerned with columns right now, use axis=1

2d list in python - accessing through column names

I'm parsing two files which has data as shown below
File1:
UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67
file2:
UID X Y A Z B
------ ---------- ---------- ---------- ---------- ---------
456 536 1 148 304 234
1071 908 1 128 243 12
1118 4 8 52 162 123
249 4 8 68 154 987
1072 296 416 68 114 45
118 180 528 68 67 6
I will be comparing two such files, however the number of columns might vary and the columns names. For every unique UID, I need to match the column names, compare and find the difference.
Questions
1. Is there a way to access columns by column names instead of index?
2. Dynamically give column names based on the file data?
I'm able to load the file into list, and compare using indexes, but thats not a proper solutions.
Thanks in advance.

You might consider using csv.DictReader. It allows you both to address columns by names, and a variable list of columns for each file opened. Consider removing the ------ separating header from actual data as it might be read wrong.
Example:
import csv
with open('File1', 'r', newline='') as f:
# If you don't pass field names
# they are taken from the first row.
reader = csv.DictReader(f)
for line in reader:
# `line` is a dict {'UID': val, 'A': val, ... }
print line
If your input format has no clear delimiter (multiple whitespaces), you can wrap the file with a generator that will compress continous whitespaces into e.g. a comma:
import csv
import re
r = re.compile(r'[ ]+')
def trim_whitespaces(f):
for line in f:
yield r.sub(',', line)
with open('test.txt', 'r', newline='') as f:
reader = csv.DictReader(trim_whitespaces(f))
for line in reader:
print line

This is an good use case for pandas, loading data is as simple as:
import pandas as pd
from StringIO import StringIO
data = """ UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
df = pd.read_csv(StringIO(data),skiprows=[1],delimiter=r'\s+')
Let's inspect results:
>>> df
UID A B C D
0 456 536 1 148 304
1 1071 908 1 128 243
2 1118 4 8 52 162
3 249 4 8 68 154
4 1072 296 416 68 114
5 118 180 528 68 67
After obtaining df2 with similar means we can merge results:
>>> df.merge(df2, on=['UID'])
UID A_x B_x C D X Y A_y Z B_y
0 456 536 1 148 304 536 1 148 304 234
1 1071 908 1 128 243 908 1 128 243 12
2 1118 4 8 52 162 4 8 52 162 123
3 249 4 8 68 154 4 8 68 154 987
4 1072 296 416 68 114 296 416 68 114 45
5 118 180 528 68 67 180 528 68 67 6
Resulting pandas.DataFrame has a very profound API, and all SQL-like analisis operations such as joining, filtering, grouping, aggregating etc are easy to perform. Look for examples on this site or in the documentation.

my_text = """UID A B C D
------ ---------- ---------- ---------- ----------
456 536 1 148 304
1071 908 1 128 243
1118 4 8 52 162
249 4 8 68 154
1072 296 416 68 114
118 180 528 68 67 """
lines = my_text.splitlines() #split your text into lines
keys= lines[0].split() #headers is your first line
table = [line.split() for line in lines[1:]] #the data is the rest
columns = zip(*table) #transpose the rows array to a columns array
my_dict = dict(zip(keys,columns)) #create a dict using your keys from earlier and matching them with columns
print my_dict['A'] #access
obviously you would need to change it if you had to read from a file say
alternatively this is what packages like pandas were made for
import pandas
table = pandas.read_csv('foo.csv', index_col=0)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Merge two dataframes with overlapping index, keeping column values from left DataFrame - python

Related

Perform operation on columns based on values of another columns in pandas

Pandas Dataframe

Pythonic way for do math among columns within a MultiIndex dataframe and concat the result as a seperate column in original dataframe

How to get all fields for only a specic user_id from a pivot dataframe indexed by two fields 'timestamps' and 'user_id'?

2d list in python - accessing through column names

Categories

Resources