python 2 vs python 3 pd.merge command

The following command works fine when I run it under Python 2:
df5b = pd.merge(df5a, df5bb, how='outer')
However, when I run the same command with the same DataFrames under Python 3, I get the following error:
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
My DataFrames are very large, so I hope someone can help me out without me having to post samples of them. The command is fine under Python 2, so I assume the problem is not the DataFrames themselves, but perhaps a change in this command's behavior in the Python 3 environment?
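One way to test that hypothesis: this ValueError comes from the merge-key dtype validation that newer pandas versions perform, so the two environments are most likely running different pandas versions rather than behaving differently because of Python itself. A quick check, assuming both environments can import pandas:
import pandas as pd

# Print this in both environments; a newer pandas under Python 3
# would explain the stricter merge-key dtype check.
print(pd.__version__)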

The problem is that some columns are integers in one DataFrame and strings in the other, despite having the same names.
The simplest solution is to cast all columns to strings:
df5b = pd.merge(df5a.astype(str), df5bb.astype(str), how='outer')
Another is to inspect the dtypes:
print (df5a.dtypes)
print (df5bb.dtypes)
And then convert the mismatched columns to a common type, e.g. convert a list of string columns to integers:
cols = ['col1','col12','col3']
df5a[cols] = df5a[cols].astype(int)
Sample:
df5a = pd.DataFrame({
    'B': [4, 5, 4, 5],
    'C': [7, 8, 9, 4],
    'F': list('aaab')
})
df5bb = pd.DataFrame({
    'B': ['4', '5', '5'],
    'F': list('aab')
})
df5b = pd.merge(df5a.astype(str), df5bb.astype(str), how='outer')
print (df5b)
   B  C  F
0  4  7  a
1  4  9  a
2  5  8  a
3  5  4  b
print (df5a.dtypes)
B int64
C int64
F object
dtype: object
print (df5bb.dtypes)
B object
F object
dtype: object
cols = ['B']
df5bb[cols] = df5bb[cols].astype(int)
df5b = pd.merge(df5a, df5bb, how='outer')
print (df5b)
   B  C  F
0  4  7  a
1  4  9  a
2  5  8  a
3  5  4  b
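As a follow-on, a minimal sketch (reusing the sample frames above) that programmatically spots which shared columns disagree on dtype, so only those need converting before the merge:
# columns present in both DataFrames
common = df5a.columns.intersection(df5bb.columns)
# keep only the shared columns whose dtypes differ between the frames
mismatched = [c for c in common if df5a[c].dtype != df5bb[c].dtype]
print(mismatched)   # for the sample above: ['B']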

As I stated in the comments, coercion does not happen across mixed types (int, str, or float), so you can either fall back to concat, or convert the columns to str and then merge, as jezrael showed.
Just to determine the types, you can inspect:
>>> pd.concat([df5a, df5bb]).dtypes
B object
C float64
F object
dtype: object
>>> pd.concat([df5a, df5bb])
   B    C  F
0  4  7.0  a
1  5  8.0  a
2  4  9.0  a
3  5  4.0  b
0  4  NaN  a
1  5  NaN  a
2  5  NaN  b

Related

groupby count same values in two columns in pandas?

I have the following Pandas dataframe:
name1 name2
A B
A A
A C
A A
B B
B A
I want to add a column named new that counts each value's occurrences across name1 OR name2, keeping the merged set of distinct values from both columns. Hence, the expected output is the following dataframe:
name new
A 7
B 4
C 1
I've tried
df.groupby(["name1"]).count().groupby(["name2"]).count()
among many other things... but although that last one seems to give me the correct results, I can't get the joined datasets.
You can use value_counts with df.stack():
df[['name1','name2']].stack().value_counts()
#df.stack().value_counts() for all cols
A 7
B 4
C 1
Specifically:
(df[['name1','name2']].stack().value_counts()
   .to_frame('new').rename_axis('name').reset_index())
  name  new
0    A    7
1    B    4
2    C    1
Let us try melt
df.melt().value.value_counts()
Out[17]:
A 7
B 4
C 1
Name: value, dtype: int64
Alternatively,
df.name1.value_counts().add(df.name2.value_counts(), fill_value=0).astype(int)
gives you
A 7
B 4
C 1
dtype: int64
Using Series.append with Series.value_counts:
df['name1'].append(df['name2']).value_counts()
A 7
B 4
C 1
dtype: int64
value_counts converts the aggregated column to index. To get your desired output, use rename_axis with reset_index:
df['name1'].append(df['name2']).value_counts().rename_axis('name').reset_index(name='new')
  name  new
0    A    7
1    B    4
2    C    1
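Note that Series.append was deprecated and then removed in pandas 2.0, so on modern pandas the equivalent combine-then-count would be:
import pandas as pd

# pd.concat replaces the removed Series.append
pd.concat([df['name1'], df['name2']]).value_counts()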
Python's Counter is another solution:
from collections import Counter
s = pd.Series(Counter(df.to_numpy().flatten()))
In [1325]: s
Out[1325]:
A 7
B 4
C 1
dtype: int64

How to reshape a python vector when some elements are empty

I have a df with values:
A B C D
0 1 2 3 2
1 2 3 3 9
2 5 3 6 6
3 3 6 7
4 6 7
5 2
df.shape is 6x4, say
df.iloc[:,1] pulls out the B column, but len(df.iloc[:,1]) is also = 6
How do I "reshape" df.iloc[:,1]? Which function can I use so that the output is the length of the actual values in the column?
My expected output in this case is 3.
You can use last_valid_index. Just note that since your series originally contained NaN values, and NaN is a float, the filtered series will still be float. You may wish to convert to int as a separate step (see the sketch after the output below).
# first convert dataframe to numeric
df = df.apply(pd.to_numeric, errors='coerce')
# extract column
B = df.iloc[:, 1]
# filter to the last valid value
B_filtered = B[:B.last_valid_index()]
print(B_filtered)
0 2.0
1 3.0
2 3.0
3 6.0
Name: B, dtype: float64
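For that separate int conversion, a minimal sketch reusing B_filtered from the snippet above:
# the NaN rows are gone after filtering, so casting back to int is safe
B_filtered = B_filtered.astype(int)
print(B_filtered)
0    2
1    3
2    3
3    6
Name: B, dtype: int64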
You can use a list comprehension like this:
len([x for x in df.iloc[:,1] if x != ''])

Replace pandas Series with Series of different length but same indices

Suppose two pandas Series A and B:
A:
1 4
2 4
3 4
4 1
5 3
B:
3 4
4 4
5 2
A is larger than B, and B uses a subset of A's indices with different values. I'm trying to replace the values of A with those of B at those indices.
A.replace(to_replace=B) seems obvious but does not work. What am I missing here?
I think you can use combine_first:
C = B.combine_first(A).astype(int)
print (C)
1 4
2 4
3 4
4 4
5 2
dtype: int32
An alternative solution using more basic pandas operations:
A.loc[B.index] = B.values
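For completeness, a minimal run of the .loc approach on the Series from the question; note it mutates A in place, unlike combine_first:
import pandas as pd

A = pd.Series([4, 4, 4, 1, 3], index=[1, 2, 3, 4, 5])
B = pd.Series([4, 4, 2], index=[3, 4, 5])
A.loc[B.index] = B.values   # overwrite A's values at B's indices
print(A)
1    4
2    4
3    4
4    4
5    2
dtype: int64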

Outer join in python Pandas

I have two data sets as following
A B
IDs IDs
1 1
2 2
3 5
4 7
How, in Pandas or NumPy, can we apply a join that gives me all the data from B that is not present in A?
Something like Following
B
Ids
5
7
I know it can be done with a for loop, but I don't want that, since my real data runs into the millions of rows, and I am really not sure how to use Pandas or NumPy here; something like the following:
pd.merge(A, B, on='ids', how='right')
Thanks
You can use NumPy's setdiff1d, like so -
np.setdiff1d(B['IDs'],A['IDs'])
Also, np.in1d could be used for the same effect, like so -
B[~np.in1d(B['IDs'],A['IDs'])]
Please note that np.setdiff1d would give us a sorted NumPy array as output.
Sample run -
>>> A = pd.DataFrame([1,2,3,4],columns=['IDs'])
>>> B = pd.DataFrame([1,7,5,2],columns=['IDs'])
>>> np.setdiff1d(B['IDs'],A['IDs'])
array([5, 7])
>>> B[~np.in1d(B['IDs'],A['IDs'])]
IDs
1 7
2 5
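Note that recent NumPy (2.0+) deprecates np.in1d in favor of np.isin; the equivalent filter would be:
import numpy as np

# same filter as above, with the non-deprecated spelling
B[~np.isin(B['IDs'], A['IDs'])]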
You can use merge with the indicator parameter and then boolean indexing. Finally, you can drop the _merge column:
A = pd.DataFrame({'IDs': [1, 2, 3, 4],
                  'B': [4, 5, 6, 7],
                  'C': [1, 8, 9, 4]})
print (A)
   B  C  IDs
0  4  1    1
1  5  8    2
2  6  9    3
3  7  4    4
B = pd.DataFrame({'IDs': [1, 2, 5, 7],
                  'A': [1, 8, 3, 7],
                  'D': [1, 8, 9, 4]})
print (B)
   A  D  IDs
0  1  1    1
1  8  8    2
2  3  9    5
3  7  4    7
df = pd.merge(A, B, on='IDs', how='outer', indicator=True)
df = df[df._merge == 'right_only']
df = df.drop('_merge', axis=1)
print (df)
     B    C  IDs    A    D
4  NaN  NaN  5.0  3.0  9.0
5  NaN  NaN  7.0  7.0  4.0
You could convert the data series to sets and take the difference:
import pandas as pd
df=pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
A=set(df['A'])
B=set(df['B'])
C=pd.DataFrame({'C' : list(B-A)}) # Take difference and convert back to DataFrame
The variable "C" then yields
C
0 5
1 7
You can simply use pandas' .isin() method:
df = pd.DataFrame({'A' : [1,2,3,4], 'B' : [1,2,5,7]})
df[~df['B'].isin(df['A'])]
If these are separate DataFrames:
a = pd.DataFrame({'IDs' : [1,2,3,4]})
b = pd.DataFrame({'IDs' : [1,2,5,7]})
b[~b['IDs'].isin(a['IDs'])]
Output:
IDs
2 5
3 7

Python Pandas working with dataframes in functions

I have a DataFrame which I want to pass to a function, derive some information from and then return that information. Originally I set up my code like:
df = pd.DataFrame({
    'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    'B': [5, 5, 6, 7, 5, 6, 6, 7, 7, 6, 7, 7],
    'C': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
})
def test_function(df):
    df['D'] = 0
    df.D = np.random.rand(len(df))
    grouped = df.groupby('A')
    df = grouped.first()
    df = df['D']
    return df
Ds = test_function(df)
print(df)
print(Ds)
Which returns:
A B C D
0 1 5 1 0.582319
1 1 5 1 0.269779
2 1 6 1 0.421593
3 1 7 1 0.797121
4 2 5 1 0.366410
5 2 6 1 0.486445
6 2 6 1 0.001217
7 3 7 1 0.262586
8 3 7 1 0.146543
9 4 6 1 0.985894
10 4 7 1 0.312070
11 4 7 1 0.498103
A
1 0.582319
2 0.366410
3 0.262586
4 0.985894
Name: D, dtype: float64
My thinking was along the lines of: I don't want to copy my large dataframe, so I will add a working column to it and then return only the information I want, without affecting the original dataframe. This of course doesn't work: because I didn't copy the dataframe, adding a column to the parameter adds a column to the original. Currently I'm doing something like:
add column
results = Derive information
delete column
return results
which feels a bit kludgy to me, but I can't think of a better way to do it without copying the dataframe. Any suggestions?
If you do not want to add a column to your original DataFrame, you could create an independent Series and apply the groupby method to the Series instead:
def test_function(df):
    ser = pd.Series(np.random.rand(len(df)))
    grouped = ser.groupby(df['A'])
    return grouped.first()
Ds = test_function(df)
yields
A
1 0.017537
2 0.392849
3 0.451406
4 0.234016
dtype: float64
Thus, test_function does not modify df at all. Notice that ser.groupby can be passed a sequence of values (such as df['A']) by which to group, instead of just the name of a column.
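To illustrate that point, a small standalone sketch with hypothetical data:
import pandas as pd

# group one Series by the values of another; no column names involved
values = pd.Series([10, 20, 30, 40])
keys = pd.Series(['x', 'x', 'y', 'y'])
print(values.groupby(keys).sum())
x    30
y    70
dtype: int64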
