I have this table1:
A B C D
0 1 2 k l
1 3 4 e r
df.dtypes gets me this:
A int64
B int64
C object
D object
Now, I want to create a table2 which only includes the object columns (C and D), using table2 = df.select_dtypes(include=[object]).
Then, I want to encode table2 using pd.get_dummies(table2).
It gives me this table2:
C D
0 0 1
1 1 0
The last thing I want to do is append both tables together (table 1 + table 2), so that the final table looks like this:
A B C D
0 1 2 0 1
1 3 4 1 0
Can somebody help?
This should do it:
table2=df.select_dtypes(include=[object])
table1.select_dtypes(include=[int]).join(table2.apply(lambda x:pd.factorize(x, sort=True)[0]))
It first factorizes the object-typed columns of table2 (instead of using the dummies generator) and then merges them back onto the int-typed columns of the original dataframe.
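For reference, a self-contained sketch of this approach (the frame below mirrors the question's data; note that factorize assigns codes by sort order, which may differ from the 0/1 coding shown in the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})

# Factorize the object columns and join them back onto the int columns
table2 = df.select_dtypes(include=[object])
result = df.select_dtypes(include=[int]).join(
    table2.apply(lambda x: pd.factorize(x, sort=True)[0])
)
print(result)
```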
Assuming what you're trying to do is keep a single column for C in which 1 replaces the value e, and a single column for D in which 1 replaces the value l. Otherwise, as mentioned elsewhere, there will be one column per response possibility.
df = pd.DataFrame({'A': [1,2], 'B': [2,4], 'C': ['k','e'], 'D': ['l','r']})
df
A B C D
0 1 2 k l
1 2 4 e r
df.dtypes
A int64
B int64
C object
D object
dtype: object
Now, if you want to drop one level per column (here e and l) so that you end up with k-1 dummy columns, you can use the drop_first argument.
df = pd.get_dummies(df, drop_first = True)
df
A B C_k D_r
0 1 2 1 0
1 2 4 0 1
Note that the dtypes are not int64 like columns A and B.
df.dtypes
A int64
B int64
C_k uint8
D_r uint8
dtype: object
If it's important that they are the same type, you can of course change them as appropriate. In the general case, you may want to keep names like C_k and D_r so you know what the dummies correspond to. If not, you can rename based on the '_' (the default get_dummies prefix separator): build the rename dictionary by splitting each column name on '_' and keeping the part before it. Or, for a simple case like this:
df.rename({'C_k': 'C', 'D_r': 'D'}, axis = 1, inplace = True)
df
A B C D
0 1 2 1 0
1 2 4 0 1
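To build that rename dictionary generically rather than by hand, one sketch (assuming the default '_' prefix separator):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [2, 4], 'C': ['k', 'e'], 'D': ['l', 'r']})
df = pd.get_dummies(df, drop_first=True)  # columns: A, B, C_k, D_r

# Map each dummy column back to its prefix: 'C_k' -> 'C', 'D_r' -> 'D'
mapping = {c: c.split('_')[0] for c in df.columns if '_' in c}
df = df.rename(columns=mapping)
print(df.columns.tolist())
```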
This question already has answers here:
Get the row(s) which have the max value in groups using groupby
Simple DataFrame:
df = pd.DataFrame({'A': [1,1,2,2], 'B': [0,1,2,3], 'C': ['a','b','c','d']})
df
A B C
0 1 0 a
1 1 1 b
2 2 2 c
3 2 3 d
I wish, for every value of column A (a groupby), to get the value of column C for which column B is maximal. For example, for group 1 of column A, the maximum of column B is 1, so I want the value "b" of column C:
A C
0 1 b
1 2 d
There is no need to assume column B is sorted. Performance is the top priority, then elegance.
Check with sort_values + drop_duplicates:
df.sort_values('B').drop_duplicates(['A'],keep='last')
Out[127]:
A B C
1 1 1 b
3 2 3 d
df.groupby('A').apply(lambda x: x.loc[x['B'].idxmax(), 'C'])
# A
#1 b
#2 d
Use idxmax to find the index where B is maximal, then select column C within that group (using a lambda function).
Here's a little fun with groupby and nlargest:
(df.set_index('C')
   .groupby('A')['B']
   .nlargest(1)
   .index
   .to_frame()
   .reset_index(drop=True))
A C
0 1 b
1 2 d
Or, sort_values, groupby, and last:
df.sort_values('B').groupby('A')['C'].last().reset_index()
A C
0 1 b
1 2 d
Similar solution to @Jondiedoop's, but avoids the apply:
u = df.groupby('A')['B'].idxmax()
df.loc[u, ['A', 'C']].reset_index(drop=1)
A C
0 1 b
1 2 d
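Another common idiom for this, sketched below, is a boolean filter against groupby + transform('max'); unlike the idxmax variants it keeps all rows on ties:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [0, 1, 2, 3], 'C': ['a', 'b', 'c', 'd']})

# Keep rows where B equals its group-wise maximum, then select A and C
out = df[df['B'] == df.groupby('A')['B'].transform('max')]
out = out[['A', 'C']].reset_index(drop=True)
print(out)
```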
I have a dataset that includes different types of tags. Each tag column holds a string containing a comma-separated list of tags.
How am I supposed to explode the selected columns at the same time?
Unnamed: id Tag1 Tag2
0 A a,b,c d,e
1 B m,n x
to this:
Unnamed: id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
First, split the string values of each Tag column into lists, using Series.str.split inside DataFrame.apply. I'm using DataFrame.filter to select only the columns whose names start with 'Tag'.
Then, use DataFrame.explode in a loop to explode sequentially each Tag column of the df, turning the values of each list into new rows.
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
df = df.explode(col, ignore_index=True)
print(df)
Output:
id Tag1 Tag2
0 A a d
1 A a e
2 A b d
3 A b e
4 A c d
5 A c e
6 B m x
7 B n x
Note that using just df.apply(lambda col: col.str.split(',').explode()) won't work in this case because some rows have strings/lists with a different number of elements. Therefore the rows can't be correctly aligned after exploding them, and apply will complain.
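Putting the whole answer together as a runnable sketch (the frame below reconstructs the question's data; column names are taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B'],
                   'Tag1': ['a,b,c', 'm,n'],
                   'Tag2': ['d,e', 'x']})

# Split each Tag column into lists, then explode them one at a time
tag_cols = df.filter(like='Tag').columns
df[tag_cols] = df[tag_cols].apply(lambda col: col.str.split(','))
for col in tag_cols:
    df = df.explode(col, ignore_index=True)
print(df)
```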
I'm a complete newbie at pandas so a simpler (though maybe not the most efficient or elegant) solution is appreciated. I don't mind a bit of brute force if I can understand the answer better.
If I have the following Dataframe:
A B C
0 0 1
0 1 1
I want to loop through columns A, B and C in that order. During each iteration, I want to select all the rows for which the current column is 1 and none of the previous columns are, save the result, and also use it in the next iteration.
So when looking at column A, I wouldn't select anything. Then when looking at column B I would select the second row because B==1 and A==0. Then when looking at column C I would select the first row because A==0 and B==0.
Create a boolean mask:
m = (df == 1) & (df.cumsum(axis=1) == 1)
d = {col: df[m[col]].index.tolist() for col in df.columns if m[col].sum()}
Output:
>>> m
A B C
0 False False True
1 False True False
2 False False True
>>> d
{'B': [1], 'C': [0, 2]}
I slightly modified your dataframe:
>>> df
A B C
0 0 0 1
1 0 1 1
2 0 0 1
Update
For the expected output on my sample:
for col in df.columns:
    if m[col].sum():
        print(f"\n=== {col} ===")
        print(df[m[col]])
Output:
=== B ===
A B C
1 0 1 1
=== C ===
A B C
0 0 0 1
2 0 0 1
Seems like you need a direct use of idxmax
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
>>> df.idxmax()
A 0
B 1
C 0
dtype: int64
The values above are the indexes for which your constraints are met. 1 for B means that the second row was "selected". 0 for C, same. The only issue is that, if nothing is found, it'll also return 0.
To address that, you can use where
>>> df.idxmax().where(~df.eq(0).all())
This will make sure that NaNs are returned for all-zero columns.
A NaN
B 1.0
C 0.0
dtype: float64
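If you then want only the columns that actually contain a 1, with integer row positions, a small follow-up sketch (data assumed from the modified frame above):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0], 'B': [0, 1, 0], 'C': [1, 1, 1]})

# Index of the first 1 per column; NaN for all-zero columns
first_hit = df.idxmax().where(~df.eq(0).all())

# Drop all-zero columns and cast back to integer row positions
result = first_hit.dropna().astype(int)
print(result)
```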
I am using matrix multiplication on a dataframe and its transpose with df @ df.T.
So if I have a df which looks like this (below, 1 indicates that the object has the property, whereas 0 indicates not having it):
Object Property1 Property2 Property3
A 1 1 1
B 0 1 1
C 1 0 0
Using df @ df.T gives me:
A B C
A 3 2 1
B 2 2 0
C 1 0 1
This can be thought of as a matrix showing how many properties each object has in common with another.
I now want to modify the problem so that, instead of a binary indication of whether an object has a property, the property columns show levels of that property. So the new df looks like this (below, the values 1, 2, 3 of a property indicate its level, while 0 still indicates not having the property):
Object Property1 Property2 Property3
A 3 2 1
B 0 2 3
C 2 0 0
I want to apply matrix multiplication, but with an altered definition of 'common' properties: two objects share a common property only if their levels of that property are within ±1 of each other.
Below is what the result will look like:
A B C
A 3 1 1
B 1 2 0
C 1 0 1
Note that the number of properties in common between A and B has changed from 2 to 1. This is because property 3 of A and B is not within the ±1 range. Also, 0 still means that the object does not have the property, so A and C still have 1 property in common (property 3 being 0 for C).
How can I achieve this in Python?
This can be done by modifying matrix multiplication for two DataFrames
Code
# DataFrame matrix multiplication
# i.e. equivalent to df1 @ df2
def df_multiply(df_a, df_b):
    '''
    Matrix multiplication of the values in two DataFrames.
    Returns a DataFrame whose index and columns are taken from df_a.
    '''
    a = df_a.values
    b = df_b.values
    zip_b = list(zip(*b))  # columns of b
    result = [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
               for col_b in zip_b] for row_a in a]
    return pd.DataFrame(data=result, index=df_a.index, columns=df_a.index)
# Modify df_multiply for desired result
def df_multiply_modified(df_a, df_b):
    '''
    Modified matrix multiplication of the values in two DataFrames
    to create the desired result.
    Returns a DataFrame whose index and columns are taken from df_a.
    '''
    a = df_a.values
    b = df_b.values
    zip_b = list(zip(*b))  # columns of b
    # Count 1 when both values are non-zero and their difference is <= 1,
    # i.e. ele_a and ele_b and abs(ele_a - ele_b) <= 1
    result = [[sum(1 if ele_a and ele_b and abs(ele_a - ele_b) <= 1 else 0
                   for ele_a, ele_b in zip(row_a, col_b))
               for col_b in zip_b] for row_a in a]
    return pd.DataFrame(data=result, index=df_a.index, columns=df_a.index)
Usage
Original Multiplication
df = pd.DataFrame({'Object': ['A', 'B', 'C'],
                   'Property1': [1, 0, 1],
                   'Property2': [1, 1, 0],
                   'Property3': [1, 1, 0]})
df.set_index('Object', inplace=True)
print(df_multiply(df, df.T))
# Output (same as df @ df.T):
Object A B C
Object
A 3 2 1
B 2 2 0
C 1 0 1
Modified Multiplication
# Use df_multiply_modified
df = pd.DataFrame({'Object': ['A', 'B', 'C'],
                   'Property1': [3, 0, 2],
                   'Property2': [2, 2, 0],
                   'Property3': [1, 3, 0]})
df.set_index('Object', inplace=True)
print(df_multiply_modified(df, df.T))
# Output (same as desired):
Object A B C
Object
A 3 1 1
B 1 2 0
C 1 0 1
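As an alternative to the explicit double loop, the same modified count can be computed with NumPy broadcasting; a sketch using the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Property1': [3, 0, 2],
                   'Property2': [2, 2, 0],
                   'Property3': [1, 3, 0]},
                  index=['A', 'B', 'C'])

a = df.to_numpy()
# Pairwise comparisons via broadcasting: shape (n_objects, n_objects, n_properties)
close = np.abs(a[:, None, :] - a[None, :, :]) <= 1
nonzero = (a[:, None, :] != 0) & (a[None, :, :] != 0)
result = pd.DataFrame((close & nonzero).sum(axis=2),
                      index=df.index, columns=df.index)
print(result)
```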
I want to find the length of the list in each row of a column of the dataframe.
dataframe name: df
sample data:
a b c
1 d ['as','the','is','are','we']
2 v ['a','an']
3 t ['we','will','pull','this','together','.']
expected result:
a b c len
1 d ['as','the','is','are','we'] 5
2 v ['a','an'] 2
3 t ['we','will','pull','this','together','.'] 6
So far, I have tried:
df.loc[:, 'len'] = len(df.c)
but this gives me the total number of rows in the dataframe. How can I get the number of elements in each row of a specific column?
One way is to use apply and calculate len:
In [100]: dff
Out[100]:
a b c
0 1 d [as, the, is, are, we]
1 2 v [a, an]
2 3 t [we, will, pull, this, together, .]
In [101]: dff['len'] = dff['c'].apply(len)
In [102]: dff
Out[102]:
a b c len
0 1 d [as, the, is, are, we] 5
1 2 v [a, an] 2
2 3 t [we, will, pull, this, together, .] 6
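An equivalent without apply, since Series.str.len also works element-wise on lists:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['d', 'v', 't'],
                   'c': [['as', 'the', 'is', 'are', 'we'],
                         ['a', 'an'],
                         ['we', 'will', 'pull', 'this', 'together', '.']]})

# str.len returns the length of each list (or string) in the column
df['len'] = df['c'].str.len()
print(df)
```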