Pandas dataframe: merge files by common columns - python

I have a collection of files that share some common columns that I want to join on. In my real problem there are several unique columns as well as common ones. In this toy example, I have a set of a files and a set of b files that each have their own unique column and share a common c column.
$ for ii in $(ls *.dat) ; do echo " "; echo $ii ; cat $ii ; done
a1.dat
a,c
4,8
1,10
2,3
a2.dat
a,c
1,2
3,4
b1.dat
b,c
2,8
2,10
1,3
b2.dat
b,c
.2,2
.8,4
I want to sweep through these files and merge them into a single dataframe. Here's what I've tried so far: I concat the first files to make sure I have all of the column names collected, then merge the remaining files. When I merge with how='inner', an empty dataframe is returned.
$ cat s.py
import pandas as pd

dat = pd.DataFrame()
for ii in [1, 2]:
    for jj in ['a', 'b']:
        d = pd.read_csv('%s%i.dat' % (jj, ii))
        if ii == 1: dat = pd.concat([dat, d])
        else: dat = pd.merge(dat, d, how='outer')
print(dat)
$ python s.py
a b c
0 4.0 NaN 8
1 1.0 NaN 10
2 2.0 NaN 3
3 NaN 2.0 8
4 NaN 2.0 10
5 NaN 1.0 3
6 1.0 NaN 2
7 3.0 NaN 4
8 NaN 0.2 2
9 NaN 0.8 4
This is not my desired output, and I don't understand how to make this work better. The desired output is:
a b c
0 4.0 2.0 8
1 1.0 2.0 10
2 2.0 1.0 3
3 1.0 0.2 2
4 3.0 0.8 4

There are two steps:
First, concatenate all files of the same type into one DataFrame each:
df = {}
for k in ['a', 'b']:
    df[k] = pd.concat([
        pd.read_csv('%s%d.dat' % (k, i)) for i in [1, 2]
    ], axis=0)
Then merge on the shared column 'c':
result = df['a'].merge(df['b'], on='c')[['a', 'b', 'c']]
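If the number of prefixes or numbered files is not fixed, the same two-step idea can be generalized with glob and functools.reduce. This is only a minimal sketch, assuming the files follow the <prefix><n>.dat naming from the question and sit in the current directory:
import glob
from functools import reduce
import pandas as pd

prefixes = ['a', 'b']  # assumed prefixes; extend as needed

# Step 1: concatenate every file that shares a prefix
frames = {
    p: pd.concat([pd.read_csv(f) for f in sorted(glob.glob(p + '*.dat'))],
                 ignore_index=True)
    for p in prefixes
}

# Step 2: merge all per-prefix frames on the shared column 'c'
result = reduce(lambda left, right: left.merge(right, on='c'), frames.values())
print(result)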

First concat all the a files and all the b files, then merge the two results on column c:
import numpy as np
import pandas as pd
a1 = pd.DataFrame({
    'a': [4, 1, 2],
    'c': [8, 10, 3],
})
a2 = pd.DataFrame({
    'a': [1, 3],
    'c': [2, 4],
})
b1 = pd.DataFrame({
    'b': [2, 2, 1],
    'c': [8, 10, 3],
})
b2 = pd.DataFrame({
    'b': [0.2, 0.8],
    'c': [2, 4],
})
concat_df_a = pd.concat([a1, a2])
concat_df_b = pd.concat([b1, b2])
print(concat_df_b.merge(concat_df_a,on='c')[['a','b','c']])
a b c
0 4 2.0 8
1 1 2.0 10
2 2 1.0 3
3 1 0.2 2
4 3 0.8 4

python pandas dataframe multiply columns matching index or row name

I have two dataframes,
df1:
hash a b c
ABC 1 2 3
def 5 3 4
Xyz 3 2 -1
df2:
hash v
Xyz 3
def 5
I want to make
df:
hash a b c
ABC 1 2 3 (= as is, because no matching 'ABC' in df2)
def 25 15 20 (= 5*5 3*5 4*5)
Xyz 9 6 -3 (= 3*3 2*3 -1*3)
As shown above, I want to build a dataframe whose values are df1 multiplied by df2 wherever their index (the first column, 'hash') matches.
Since df2 only has one value column (v), all of df1's columns except the index should be affected.
Is there a neat, Pythonic pandas way to achieve this?
df1.set_index(['hash']).mul(df2.set_index(['hash'])) and similar attempts do not seem to work.
One approach:
df1 = df1.set_index("hash")
df2 = df2.set_index("hash")["v"]
res = df1.mul(df2, axis=0).combine_first(df1)
print(res)
Output
a b c
hash
ABC 1.0 2.0 3.0
Xyz 9.0 6.0 -3.0
def 25.0 15.0 20.0
One Method:
# We'll make this for convenience
cols = ['a', 'b', 'c']
# Merge the DataFrames, keeping everything from df
df = df1.merge(df2, 'left').fillna(1)
# We'll make the v column integers again since it's been filled.
df.v = df.v.astype(int)
# Broadcast the multiplication across axis 0
df[cols] = df[cols].mul(df.v, axis=0)
# Drop the no-longer needed column:
df = df.drop('v', axis=1)
print(df)
Output:
hash a b c
0 ABC 1 2 3
1 def 25 15 20
2 Xyz 9 6 -3
Alternative Method:
# Set indices
df1 = df1.set_index('hash')
df2 = df2.set_index('hash')
# Apply multiplication and fill values
df = (df1.mul(df2.v, axis=0)
         .fillna(df1)
         .astype(int)
         .reset_index())
Output:
hash a b c
0 ABC 1 2 3
1 Xyz 9 6 -3
2 def 25 15 20
The function you are looking for is actually multiply.
Here's how I have done it:
>>> df
hash a b
0 ABC 1 2
1 DEF 5 3
2 XYZ 3 -1
>>> df2
hash v
0 XYZ 4
1 ABC 8
df = df.merge(df2, on='hash', how='left').fillna(1)
>>> df
hash a b v
0 ABC 1 2 8.0
1 DEF 5 3 1.0
2 XYZ 3 -1 4.0
df[['a','b']] = df[['a','b']].multiply(df['v'], axis='index')
>>> df
hash a b v
0 ABC 8.0 16.0 8.0
1 DEF 5.0 3.0 1.0
2 XYZ 12.0 -4.0 4.0
You can actually drop v at the end if you don't need it.
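For example, a minimal follow-up to the snippet above:
df = df.drop(columns='v')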

Pandas : How to drop a specific number of duplicates rows?

I hope you're doing well.
So I want to drop a specific number of duplicate rows. Let me explain with an example:
A B C
0 foo 2 3
1 foo nan 9
2 foo 1 4
3 bar 8 nan
4 xxx 9 10
5 xxx 4 4
6 xxx 9 6
So we have duplicated rows based on column A; for 'foo' I want to drop two duplicate rows, for example, and for 'xxx' I want to drop just one row.
The method drop_duplicates can only keep either 0 or 1 rows per group, so it didn't help me.
Thanks in advance.
Probably not the optimal solution, but this one works:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})
nb_drops = {'foo': 2, 'xxx': 1}
df2 = pd.DataFrame()
for k, v in nb_drops.items():
    # DataFrame.append was removed in pandas 2.0; pd.concat([df2, ...]) is the modern equivalent
    df2 = df2.append(df[df['A'] == k].head(v))
df = df.drop_duplicates(subset=['A'])
df = df.merge(df2, how='outer')
df
Gives
A B C
0 foo 2.0 3.0
1 bar 8.0 NaN
2 xxx 9.0 10.0
3 foo NaN 9.0
I made this code and it works...
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})
nb_drops = {'foo': 2, 'xxx': 1}
rows_to_delete = []
for item in nb_drops:
    indices_item = list(df[df['A'] == item].index)
    # delete the last nb_drops[item] rows of this group
    rows_to_delete += range(indices_item[-1] - nb_drops[item] + 1, indices_item[-1] + 1)
df.drop(rows_to_delete, inplace=True)
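As a further sketch (not from the answers above), assuming the intent is to drop the last v rows of each duplicated key, as the code above does, a groupby/cumcount mask avoids growing a DataFrame in a loop:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'xxx', 'xxx', 'xxx'],
    'B': [2, np.nan, 1, 8, 9, 4, 9],
    'C': [3, 9, 4, np.nan, 10, 4, 6]
})
nb_drops = {'foo': 2, 'xxx': 1}

# Number each row within its 'A' group, counting from the end (0 = last row of the group)
pos_from_end = df.groupby('A').cumcount(ascending=False)
# Keep a row unless it is among the last nb_drops[key] rows of its group
keep = pos_from_end >= df['A'].map(nb_drops).fillna(0)
print(df[keep])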

How to create a data frame from lists of random lengths using Python?

I want to create a pandas data frame from multiple lists of different lengths. Below is my Python code.
import random
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
lenA = len(A)
lenB = len(B)
lenC = len(C)
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i, v1 in enumerate(A):
    for j, v2 in enumerate(B):
        for k, v3 in enumerate(C):
            if i < random.randint(0, lenA):
                if j < random.randint(0, lenB):
                    if k < random.randint(0, lenC):
                        df = df.append({'A': v1, 'B': v2, 'C': v3}, ignore_index=True)
print(df)
My lists are as below:
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6,7]
In each run I get a different output, which is correct, but not all list items are covered in each run. In one run I got the output below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
In the above output, all items (1, 2) of list 'A' are present, but list 'B' has only items (1, 2); item 3 is missing. Likewise, list 'C' has only items (1, 2, 3, 5); items (4, 6, 7) are missing. My expectation is that every item of each list appears in the data frame at least once, and that each item of list 'C' appears only once. My expected sample output is below:
A B C
0 1 1 3
1 1 2 1
2 1 2 2
3 2 2 5
4 2 3 4
5 1 1 7
6 2 3 6
Guide me to get my expected output. Thanks in advance.
You can pad each list with random values drawn from itself up to the maximum length, then shuffle the rows with DataFrame.sample:
import numpy as np
import pandas as pd

A = [1, 2]
B = [1, 2, 3]
C = [1, 2, 3, 4, 5, 6]
L = [A, B, C]
m = max(len(x) for x in L)
print (m)
6
a = [np.hstack((np.random.choice(x, m - len(x)), x)) for x in L]
df = pd.DataFrame(a, index=['A', 'B', 'C']).T.sample(frac=1)
print (df)
A B C
2 2 2 3
0 2 1 1
3 1 1 4
4 1 2 5
5 2 3 6
1 2 2 2
You can use transpose to achieve the same.
EDIT: Used random to randomize the output as requested.
import pandas as pd
from random import shuffle, choice
A=[1,2]
B=[1,2,3]
C=[1,2,3,4,5,6]
shuffle(A)
shuffle(B)
shuffle(C)
data = [A,B,C]
df = pd.DataFrame(data)
df = df.transpose()
df.columns = ['A', 'B', 'C']
df.loc[:,'A'].fillna(choice(A), inplace=True)
df.loc[:,'B'].fillna(choice(B), inplace=True)
This should give the below output
A B C
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 NaN 3.0 3.0
3 NaN 4.0 4.0
4 NaN NaN 5.0
5 NaN NaN 6.0
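As an aside (my own assumption, not part of the answer): fillna(..., inplace=True) on a .loc column slice can act on a temporary copy and never reach df, so assigning the result back is the safer pattern:
df['A'] = df['A'].fillna(choice(A))
df['B'] = df['B'].fillna(choice(B))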

What pandas operation will help me do a groupby and aggregate by column combinations?

Here's an example of what I am trying to do:
bar foo o1 o2 thing
0 1 1 0.0 3.3 a
1 1 1 1.1 4.4 a
2 3 2 2.2 5.5 b
foo_1_bar_3_o1 foo_1_bar_3_o2 foo_2_bar_3_o1 foo_2_bar_3_o2 \
0 NaN NaN NaN NaN
1 NaN NaN 2.2 5.5
foo_1_bar_1_o1 foo_1_bar_1_o2 foo_2_bar_1_o1 foo_2_bar_1_o2 thing
0 1.1 7.7 NaN NaN a
1 NaN NaN NaN NaN b
The first is my input DataFrame and the second is my desired output DataFrame (NaNs could be substituted with 0's).
This should be some sort of a groupby (on column thing) and then some kind of aggregating function on the values in columns o1 and o2 that aggregates over all possible combinations of the values of foo and bar. Notice that foo_1_bar_1_o2 is 7.7 because it is the sum over the column o2 when foo == 1 && bar == 1 for the group 'a'.
I've tried researching dcast, crosstab, and pivot in pandas but none seem to satisfy what I am trying to do.
I wrote base Python code that does what I want, but, again, I would like to translate it to a more friendly format using already existing functions. I don't believe my use-case is obscure enough for this to not be possible.
Below is the base Python code for this operation.
import pandas as pd
import numpy as np
import itertools

df = pd.DataFrame({'thing': ['a', 'a', 'b'],
                   'foo': [1, 1, 2],
                   'bar': [1, 1, 3],
                   'o1': [0.0, 1.1, 2.2],
                   'o2': [3.3, 4.4, 5.5]})
key_columns = ['foo', 'bar']
key_value_pairs = [df[key].values.tolist() for key in key_columns]
key_value_pairs = list(set(itertools.product(*key_value_pairs)))
output_columns = ['o1', 'o2']

def aggregate(df):
    new_columns = []
    for pair in key_value_pairs:
        pair = list(zip(key_columns, pair))
        new_column = '_'.join(['%s_%d' % (key, value) for key, value in pair])
        for o in output_columns:
            criteria = list()
            for key, value in pair:
                criterion = (df[key] == value)
                criteria.append(criterion)
            new_columns.append('%s_%s' % (new_column, o))
            df[new_columns[-1]] = df[np.logical_and.reduce(criteria)][o].sum()
    return df.head(1)[new_columns + ['thing']]

things = df['thing'].value_counts().index.tolist()
groups = df.groupby('thing')
dfs = []
for thing in things:
    dfs.append(aggregate(groups.get_group(thing).reset_index()))
    #print(aggregate(groups.get_group(thing).reset_index(drop=True)))

print(df)
print(pd.concat(dfs).reset_index(drop=True))
I tried to create a dynamic solution:
key_columns = ['foo', 'bar']
output_columns = ['o1', 'o2']
First, prefix the values with the key_columns column names using radd:
df[key_columns] = (df[key_columns].astype(str)
                   .radd(pd.Series(key_columns, index=key_columns) + '_'))
print (df)
bar foo o1 o2 thing
0 bar_1 foo_1 0.0 3.3 a
1 bar_1 foo_1 1.1 4.4 a
2 bar_3 foo_2 2.2 5.5 b
Then aggregate with sum and reshape with unstack, which produces a MultiIndex in the columns:
df = df.groupby(['thing'] + key_columns)[output_columns].sum().unstack(key_columns)
print (df)
o1 o2
bar bar_1 bar_3 bar_1 bar_3
foo foo_1 foo_2 foo_1 foo_2
thing
a 1.1 NaN 7.7 NaN
b NaN 2.2 NaN 5.5
Create all possible combinations with MultiIndex.from_product for reindex, then apply reorder_levels and sort_index:
mux = pd.MultiIndex.from_product(df.columns.levels, names=df.columns.names)
print (mux)
MultiIndex(levels=[['o1', 'o2'], ['foo_1', 'foo_2'], ['bar_1', 'bar_3']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1],
[0, 1, 0, 1, 0, 1, 0, 1]],
names=[None, 'foo', 'bar'])
df = df.reindex(columns=mux).reorder_levels(key_columns + [None], axis=1).sort_index(axis=1)
Last, remove the MultiIndex by mapping '_'.join over the columns:
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
thing foo_1_bar_1_o1 foo_1_bar_1_o2 foo_1_bar_3_o1 foo_1_bar_3_o2 \
0 a 1.1 7.7 NaN NaN
1 b NaN NaN NaN NaN
foo_2_bar_1_o1 foo_2_bar_1_o2 foo_2_bar_3_o1 foo_2_bar_3_o2
0 NaN NaN NaN NaN
1 NaN NaN 2.2 5.5
I think you'll still have to use itertools.product(), because Pandas isn't designed to think about data that don't exist. But once you've got those extra combinations defined, you can use groupby() and unstack() to get the output you're looking for.
Using the key_value_pairs you defined:
for k, v in key_value_pairs:
    if not len(df.loc[df.foo.eq(k) & df.bar.eq(v)]):
        df = df.append({"foo": k, "bar": v, "o1": np.nan, "o2": np.nan, "thing": "a"}, ignore_index=True)
        df = df.append({"foo": k, "bar": v, "o1": np.nan, "o2": np.nan, "thing": "b"}, ignore_index=True)
df
bar foo o1 o2 thing
0 1 1 0.0 3.3 a
1 1 1 1.1 4.4 a
2 3 2 2.2 5.5 b
3 3 1 NaN NaN a
4 3 1 NaN NaN b
5 1 2 NaN NaN a
6 1 2 NaN NaN b
Now groupby and unstack:
gb = df.groupby(["thing", "foo", "bar"]).sum().unstack(level=[1,2])
gb.columns = [f"foo_{b}_bar_{c}_{a}" for a,b,c in gb.columns]
Output:
foo_1_bar_1_o1 foo_1_bar_3_o1 foo_2_bar_1_o1 foo_2_bar_3_o1 \
thing
a 1.1 NaN NaN NaN
b NaN NaN NaN 2.2
foo_1_bar_1_o2 foo_1_bar_3_o2 foo_2_bar_1_o2 foo_2_bar_3_o2
thing
a 7.7 NaN NaN NaN
b NaN NaN NaN 5.5
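One caveat worth noting: DataFrame.append was removed in pandas 2.0, so on newer versions the padding loop above could be rewritten with pd.concat. A minimal sketch, reusing key_value_pairs from the question's code:
import numpy as np
import pandas as pd

missing = []
for k, v in key_value_pairs:
    if df.loc[df.foo.eq(k) & df.bar.eq(v)].empty:
        for thing in ("a", "b"):
            missing.append({"foo": k, "bar": v, "o1": np.nan, "o2": np.nan, "thing": thing})

if missing:
    df = pd.concat([df, pd.DataFrame(missing)], ignore_index=True)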

Creating dataframe from a dictionary where entries have different lengths

Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the length of the array is not the same for all of them.
How can I create a dataframe where each column holds a different entry?
When I try:
pd.DataFrame(my_dict)
I get:
ValueError: arrays must all be the same length
Any way to overcome this? I am happy to have Pandas use NaN to pad those columns for the shorter entries.
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x:
replace d.items() with d.iteritems().
Here's a simple way to do that:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
A way of tidying up your syntax, but still do essentially the same thing as these other answers, is below:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists, too:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Another syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may additionally have to transpose the result and/or change the column data types (float, integer, etc).
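For instance, a small sketch of that cleanup on the dict_df built above (Int64 is pandas' nullable integer dtype, so the NaN padding survives the cast):
dict_df['one'] = dict_df['one'].astype('Int64')  # float -> nullable integer, NaN preserved
dict_df = dict_df.T                              # transpose if rows and columns should swap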
Use pandas.DataFrame and pandas.concat
The following code builds a list of DataFrames with pandas.DataFrame, from a dict of uneven arrays, and then concats them together in a list comprehension.
This is a way to create a DataFrame from arrays that are not equal in length.
For equal length arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
import pandas as pd
import numpy as np
# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)
data = {'x1': x1, 'x2': x2, 'x3': x3}
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
Use pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills missing values with the fillvalue.
The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
plot
df.plot(marker='o', figsize=[10, 5])
dataframe
x1 x2 x3
0 232.06900 235.92577 173.19476
1 176.94349 209.26802 186.09590
2 194.18474 168.36006 194.36712
3 196.55705 238.79899 218.33316
4 249.25695 167.91326 191.62559
5 215.25377 214.85430 230.95119
6 232.68784 240.30358 196.72593
7 212.43409 201.15896 187.96484
8 188.97014 187.59007 164.78436
9 196.82937 252.67682 196.47132
10 NaN 223.32571 208.43823
11 NaN 209.50658 209.83761
12 NaN 215.27461 249.06087
13 NaN 210.52486 158.65781
14 NaN 193.53504 199.10456
15 NaN NaN 186.19700
16 NaN NaN 223.02479
17 NaN NaN 185.68525
18 NaN NaN 213.41414
19 NaN NaN 271.75376
While this does not directly answer the OP's question, I found it to be an excellent solution for my case, when I had unequal arrays, and I'd like to share it:
from pandas documentation
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Both of the following lines work perfectly:
pd.DataFrame.from_dict(df, orient='index').transpose() #A
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in df.items() ])) #B (Better)
But with %timeit on Jupyter, I've got a ratio of 4x speed for B vs A, which is quite impressive especially when working with a huge data set (mainly with a big number of columns/features).
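A rough way to reproduce such a comparison (a sketch only; the exact ratio will depend on the data shape, pandas version, and machine):
import timeit
import numpy as np
import pandas as pd

# hypothetical uneven data: 200 columns of varying length
d = {f'col{i}': np.random.randn(np.random.randint(50, 150)) for i in range(200)}

t_a = timeit.timeit(lambda: pd.DataFrame.from_dict(d, orient='index').transpose(), number=20)  # A
t_b = timeit.timeit(lambda: pd.DataFrame({k: pd.Series(v) for k, v in d.items()}), number=20)  # B
print(f'A: {t_a:.3f}s  B: {t_b:.3f}s')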
If you don't want it to show NaN and you have two particular lengths, adding a 'space' in each remaining cell would also work.
import pandas as pd

long = [6, 4, 7, 3]
short = [5, 6]
for n in range(len(long) - len(short)):
    short.append(' ')
df = pd.DataFrame({'A': long, 'B': short})
# Make sure Excel file exists in the working directory
datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
df.to_excel(datatoexcel, sheet_name='Sheet1')
datatoexcel.save()
A B
0 6 5
1 4 6
2 7
3 3
If you have more than 2 lengths of entries, it is advisable to make a function which uses a similar method.
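A hedged sketch of such a helper (pad_to_longest is an illustrative name, not from the answer above):
import pandas as pd

def pad_to_longest(columns, fill=' '):
    # Pad every list to the length of the longest one with `fill`
    longest = max(len(v) for v in columns.values())
    return pd.DataFrame({k: list(v) + [fill] * (longest - len(v))
                         for k, v in columns.items()})

df = pad_to_longest({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [1]})
print(df)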
