Alternative to pandas concat - python

I am trying to concatenate two data frames with
pd.concat([df1.set_index(["t", "tc"]), df2.set_index(["t", "tc"])], axis=1)
It can happen that in df1 the index is not unique. In that case, I want the corresponding entry in df2 to be inserted into all the rows with that index. Unfortunately, instead of doing that, concat gives me an error. I thought ignore_index=True might help, but I still get the error ValueError: cannot handle a non-unique multi-index!
Is there an alternative to concat that does what I want?
For example:
df1
t tc a
a 1 5
b 1 6
a 1 7
df2:
t tc b
a 1 8
b 1 10
result(after resetting the index):
t tc a b
a 1 5 8
b 1 6 10
a 1 7 8

Using .merge you can get what you need:
df1.merge(df2, on=['t', 'tc'])
#result
t tc a b
0 a 1 5 8
1 a 1 7 8
2 b 1 6 10
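By default merge performs an inner join, which is fine here because every key in df1 also appears in df2. If df1 could contain keys that are missing from df2 and you still want to keep those rows (with NaN in b), a left join is a small variant of the same call:
# keep every row of df1; rows with no match in df2 get NaN in column b
df1.merge(df2, on=['t', 'tc'], how='left')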

How to compare two data frames and get the unmatched rows using python? [duplicate]

This question already has answers here: Pandas: Find rows which don't exist in another DataFrame by multiple columns.
I have two data frames, df1 and df2. df1 contains 6 records and df2 contains 4 records. I want to get the unmatched records. I tried comparing them but get the error ValueError: Can only compare identically-labelled DataFrame objects. I guess this is due to the lengths, as df1 has 6 rows and df2 has 4, but how do I compare them both and get the unmatched rows?
code
df1=
a b c
0 1 2 3
1 4 5 6
2 3 5 5
3 5 6 7
4 6 7 8
5 6 6 6
df2 =
a b c
0 3 5 5
1 5 6 7
2 6 7 8
3 6 6 6
index = (df1 != df2).any(axis=1)
df3 = df1.loc[index]
which gives:
ValueError: Can only compare identically-labelled DataFrame objects
Expected output:
a b c
0 1 2 3
1 4 5 6
I know that the error is due to the differing lengths, but is there any way to compare two data frames and get the unmatched records?
Use merge with indicator=True and keep all rows except those marked "both":
In [173]: df = df1.merge(df2, indicator=True, how='outer').query('_merge != "both"').drop('_merge', axis=1)
In [174]: df
Out[174]:
a b c
0 1 2 3
1 4 5 6
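If you only want the rows of df1 that have no match in df2 (rather than the full symmetric difference), the same indicator trick works with a left join, keeping only the rows marked "left_only":
# rows present in df1 but not in df2
df1.merge(df2, indicator=True, how='left').query('_merge == "left_only"').drop('_merge', axis=1)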
MultiIndex.from_frame + isin
We can use MultiIndex.from_frame on both df1 and df2 to create the corresponding multi-indices. Testing the membership of the index built from df1 in the index built from df2 with isin gives a boolean mask which, once negated, selects the rows of df1 that do not appear in df2.
i1 = pd.MultiIndex.from_frame(df1)
i2 = pd.MultiIndex.from_frame(df2)
df1[~i1.isin(i2)]
Result
a b c
0 1 2 3
1 4 5 6
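The same pair of indices also gives the unmatched rows in the other direction, should you need them:
# rows of df2 that do not appear in df1 (empty for this example)
df2[~i2.isin(i1)]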

create pandas pivottable with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
also tried the following to see if I can get sum but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
any ideas greatly appreciated
I found a thread online about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multi-index as long as it contains NaNs (link to GitHub in the comments). My workaround was to do
df.fillna('empty', inplace=True)
first; then the solution below,
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael works as intended, hence the accepted answer.
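A minimal end-to-end illustration of that workaround (toy column names, just for demonstration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['p', 'p', np.nan],
                   'y': ['q', 'r', 'r'],
                   'id': ['old', 'new', 'old']})
cols = ['x', 'y']

# replace the NaN before pivoting, otherwise the row is dropped
# (or, reportedly, triggers the 0.23.0 bug)
df.fillna('empty', inplace=True)
out = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
# id       new  old
# x     y
# empty r    0    1
# p     q    0    1
#       r    1    0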
I believe you need to convert the column names to a list and then aggregate the size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()   # every column except 'id'
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
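If you prefer a dedicated function, pd.crosstab can produce the same count table from the same df and cols (just another way to spell the solution above):
df1 = pd.crosstab(index=[df[c] for c in cols], columns=df['id'])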

Modifying DataFrames in loop

Given this data frame:
import pandas as pd
df=pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
I'd like to create 3 new data frames; one from each column.
I can do this one at a time like this:
a=pd.DataFrame(df[['A']])
a
A
0 1
1 2
2 3
But instead of doing this for each column, I'd like to do it in a loop.
Here's what I've tried:
a=b=c=df.copy()
dfs=[a,b,c]
fields=['A','B','C']
for d,f in zip(dfs,fields):
    d=pd.DataFrame(d[[f]])
...but when I then print each one, I get the whole original data frame as opposed to just the column of interest.
a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Update:
My actual data frame will have some columns that I do not need and the columns will not be in any sort of order, so I need to be able to get the columns by name.
Thanks in advance!
A simple list comprehension should be enough.
In [68]: df_list = [df[[x]] for x in df.columns]
Printing out the list, this is what you get:
In [69]: for d in df_list:
...: print(d)
...: print('-' * 5)
...:
A
0 1
1 2
2 3
-----
B
0 4
1 5
2 6
-----
C
0 7
1 8
2 9
-----
Each element in df_list is its own data frame, one per column of the original. Furthermore, you don't even need fields; use df.columns instead.
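If you only need a subset of the columns, selected by name (as in the update), a dict comprehension keyed on the column name is a handy variant of the same idea (a sketch; fields is assumed to hold the names you care about):
fields = ['A', 'C']                      # only the columns you actually need
df_dict = {c: df[[c]] for c in fields}   # one single-column DataFrame per name
df_dict['A']
#    A
# 0  1
# 1  2
# 2  3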
Or you can try this: instead of copying df, assign each single-column DataFrame to its own variable rather than a list. However, I think saving the DataFrames in a list is better:
dfs=['a','b','c']
fields=['A','B','C']
variables = locals()
for d,f in zip(dfs,fields):
    variables["{0}".format(d)] = df[[f]]
a
Out[743]:
A
0 1
1 2
2 3
b
Out[744]:
B
0 4
1 5
2 6
c
Out[745]:
C
0 7
1 8
2 9
You should use iloc, since the columns are labelled 'A', 'B', 'C' rather than 0, 1, 2:
a = df.iloc[:, [0]]
and then loop through like
dfs = []
for i in range(df.columns.size):
    dfs.append(df.iloc[:, [i]])

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not an integer with one of those values.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for invalid values, and then filter by notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7
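Combining the two ideas, you can also coerce the columns to numeric first (turning entries such as 1abc into NaN) and then apply the isin filters in one pass; a small sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'a': ['1', '4', '2abc', '3'],
                   'b': ['6', '7', '4', '7']})

# coerce to numeric; invalid strings like '2abc' become NaN
df['a'] = pd.to_numeric(df['a'], errors='coerce')
df['b'] = pd.to_numeric(df['b'], errors='coerce')

# keep only rows whose values are in the allowed sets (NaN never matches)
df1 = df[df['a'].isin([1, 3]) & df['b'].isin([6, 7])]
#      a    b
# 0  1.0  6.0
# 3  3.0  7.0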

Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.
>df_may
id quantity attr_1 attr_2
0 1 20 0 1
1 2 23 1 1
2 3 19 1 1
3 4 19 0 0
>df_jun
id quantity attr_1 attr_3
0 5 8 1 0
1 6 13 0 1
2 7 20 1 1
3 8 25 1 1
I've tried joining with an outer join:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")
But that yields:
Left data columns not unique: Index([....
I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except id like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to on:
mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))
Which yields:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.
I think in this case concat is what you want:
In [12]:
pd.concat([df_may, df_jun], axis=0, ignore_index=True)
Out[12]:
attr_1 attr_2 attr_3 id quantity
0 0 1 NaN 1 20
1 1 1 NaN 2 23
2 1 1 NaN 3 19
3 0 0 NaN 4 19
4 1 NaN 0 5 8
5 0 NaN 1 6 13
6 1 NaN 1 7 20
7 1 NaN 1 8 25
By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want; NaN values are produced where a column is absent from the respective df.
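Depending on your pandas version, the columns of the result may be sorted alphabetically as shown above; if you want to keep the original column order, pass sort=False (the default in recent pandas):
pd.concat([df_may, df_jun], axis=0, ignore_index=True, sort=False)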
The accepted answer will break if there are duplicate headers:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
For example, here A has 3x trial columns, which prevents concat:
A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
# id trial trial trial
# 0 3 1 4 1
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
# id trial
# 0 5 9
# 1 2 6
pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
To fix this, deduplicate the column names before concat:
parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})
for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)
pd.concat([A, B], ignore_index=True)
# id trial trial.1 trial.2
# 0 3 1 4 1
# 1 5 9 NaN NaN
# 2 2 6 NaN NaN
Or as a one-liner but less readable:
pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)
Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
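If you would rather not rely on pandas internals at all, a tiny helper that renames duplicates the same way is easy to write (a sketch; dedup_columns is just an illustrative name, not a pandas API):
def dedup_columns(cols):
    # append .1, .2, ... to repeated names, mimicking the ParserBase behaviour
    seen = {}
    out = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            out.append(f"{c}.{seen[c]}")
        else:
            seen[c] = 0
            out.append(c)
    return out

for df in [A, B]:
    df.columns = dedup_columns(df.columns)
pd.concat([A, B], ignore_index=True)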
I had this problem today using any of concat, append or merge; I got around it by adding a helper column, numbered sequentially, and then doing an outer join:
helper=1
for i in df1.index:
    df1.loc[i,'helper']=helper
    helper=helper+1
for i in df2.index:
    df2.loc[i,'helper']=helper
    helper=helper+1
df1.merge(df2,on='helper',how='outer')
