I have two data frames that I am trying to merge.
Dataframe A:
col1 col2 sub grade
0 1 34.32 x a
1 1 34.32 x b
2 1 34.33 y c
3 2 10.14 z b
4 3 33.01 z a
Dataframe B:
col1 col2 group ID
0 1 34.32 t z
1 1 54.32 s w
2 1 34.33 r z
3 2 10.14 q z
4 3 33.01 q e
I want to merge on col1 and col2. I've been using pd.merge with the following syntax:
pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
However, I think I am running into issues joining on the float values of col2 since many rows are being dropped. Is there any way to use np.isclose to match the values of col2? When I reference the index of a particular value of col2 in either dataframe, the value has many more decimal places than what is displayed in the dataframe.
I would like the result to be:
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b s w
2 1 54.32 NaN NaN s w
3 1 34.33 y c r z
4 2 10.14 z b q z
5 3 33.01 z a q e
You can use a little hack: multiply the float columns by some constant like 100, 1000, ..., convert the column to int, merge, and finally divide by the constant:
import numpy as np
import pandas as pd

N = 100
# thank you koalo for the comment
A.col2 = np.round(A.col2*N).astype(int)
B.col2 = np.round(B.col2*N).astype(int)

df = pd.merge(A, B, how = 'outer', on = ['col1', 'col2'])
df.col2 = df.col2 / N
print (df)
col1 col2 sub grade group ID
0 1 34.32 x a t z
1 1 34.32 x b t z
2 1 34.33 y c r z
3 2 10.14 z b q z
4 3 33.01 z a q e
5 1 54.32 NaN NaN s w
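A variant of the same idea, if you prefer not to overwrite col2 in place: round into a temporary integer key column, merge on that, and drop the key afterwards. This is just a sketch; the helper column name key2 is made up here:
import numpy as np
import pandas as pd

N = 100  # two decimal places in the data

# build integer join keys instead of mutating col2
A2 = A.assign(key2=np.round(A['col2'] * N).astype(int))
B2 = B.assign(key2=np.round(B['col2'] * N).astype(int)).drop(columns='col2')

df = (pd.merge(A2, B2, how='outer', on=['col1', 'key2'])
        .assign(col2=lambda d: d['key2'] / N)  # recover col2, also for B-only rows
        .drop(columns='key2'))
print (df)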
I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.
In this case, I used scipy.spatial.distance.cosine to get the cosine similarity between rows.
from scipy import spatial

threshold = 0.99999
# row1 and row2 are 1-D arrays holding the float columns of the two rows being compared
similarity = 1 - spatial.distance.cosine(row1, row2)
if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair
This won't work if you have duplicate or very similar rows, but when you have a large number of float columns and not too many rows, it works well.
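If both frames are small enough to compare all row pairs at once, the loop can be replaced by a single scipy.spatial.distance.cdist call. A sketch, assuming df_a and df_b are dataframes holding only the float columns:
from scipy.spatial.distance import cdist

threshold = 0.99999
# cosine similarity between every row of df_a and every row of df_b
sim = 1 - cdist(df_a.to_numpy(), df_b.to_numpy(), metric='cosine')

# for each row of df_a: index of its best match in df_b, or -1 if below the threshold
best = sim.argmax(axis=1)
best[sim.max(axis=1) < threshold] = -1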
Assuming that the column (col2) has n decimal places:
A.col2 = np.round(A.col2, decimals=n)
B.col2 = np.round(B.col2, decimals=n)
df = A.merge(B, left_on=['col1', 'col2'], right_on=['col1', 'col2'])
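For the data in the question col2 has two decimal places, so (assuming the frames are named A and B as above) this would be:
import numpy as np

A.col2 = np.round(A.col2, decimals=2)
B.col2 = np.round(B.col2, decimals=2)
df = A.merge(B, on=['col1', 'col2'], how='outer')  # outer keeps the unmatched row, as in the question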
I have data with a large number of columns:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
4 a z 1 0
...
98 a z 1 1
100 a x 1 0
I want to fill in the missing ID values with a default row that indicates the data is missing there. For example, here the missing IDs would be 3 and 99, and, hypothetically speaking, let's say the missing row data should look like the row for ID 100:
ID col1 col2 col3 ... col100
3 a x 1 0
99 a x 1 0
Expected output:
df:
ID col1 col2 col3 ... col100
1 a x 0 1
1 a x 1 1
2 a y 1 1
3 a x 1 0
4 a z 1 0
...
98 a z 1 1
99 a x 1 0
100 a x 1 0
I'm also ok with the 3 and 99 being at the bottom.
I have tried several ways of appending new rows:
noresponse = df[filterfornoresponse].head(1).copy() #assume that this will net us row 100
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        df.append(temp, ignore_index = True)
This method doesn't seem to append anything.
I have also tried
pd.concat([df, temp], ignore_index = True)
instead of df.append
I have also tried adding the rows to a list noresponserows with the intention of concatenating the list with df:
noresponserows = []
for i in range(1, maxID):
    if len(df[df['ID'] == i]) == 0: #IDs with no rows ie missing data
        temp = noresponse.copy()
        temp['ID'] = i
        noresponserows.append(temp)
But here the list always ends up with only one row, when I know from my data that more than one row needs to be appended.
I'm not sure why I am having trouble appending more than one instance of noresponse into the list, and why I can't directly append to a dataframe. I feel like I am missing something here.
I think it might have to do with me taking a copy of a row in the df vs constructing a new one. The reason I take a copy of a row to get noresponse is that there are a large number of columns, so it is easier to just reuse an existing row.
Say you have a dataframe like this:
>>> df
col1 col2 col100 ID
0 a x 0 1
1 a y 3 2
2 a z 1 4
First, set the ID column to be the index:
>>> df = df.set_index('ID')
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
Now you can use df.loc to easily add rows.
Let's select the last row as the default row:
>>> default_row = df.iloc[-1]
>>> default_row
col1 a
col2 z
col100 1
Name: 4, dtype: object
We can add it right into the dataframe at ID 3:
>>> df.loc[3] = default_row
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
4 a z 1
3 a z 1
Then use sort_index to sort the rows by the index:
>>> df = df.sort_index()
>>> df
col1 col2 col100
ID
1 a x 0
2 a y 3
3 a z 1
4 a z 1
And, optionally, reset the index:
>>> df = df.reset_index()
>>> df
ID col1 col2 col100
0 1 a x 0
1 2 a y 3
2 3 a z 1
3 4 a z 1
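The same pattern extends to many missing IDs at once: while ID is still the index (i.e. before the final reset_index), assign the default row at every absent ID and then sort. A sketch, assuming the IDs should run from 1 to max_id:
max_id = 100
missing = set(range(1, max_id + 1)) - set(df.index)

for i in missing:
    df.loc[i] = default_row  # same default row as above

df = df.sort_index().reset_index()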
I have a dataframe which I am representing in a tabular format below. The original dataframe is a lot bigger in size and therefore I cannot afford to loop on each row.
col1 | col2 | col3
a x 1
b y 1
c z 0
d k 1
e l 1
What I want is to split it into sub-dataframes, one for each run of consecutive 1s in the column col3.
So ideally I want the above dataframe to return two dataframes, df1 and df2:
df1
col1 | col2 | col3
a x 1
b y 1
df2
col1 | col2 | col3
d k 1
e l 1
Is there an approach like groupby to do this?
If I use groupby, it returns all 4 rows with col3 == 1 in a single dataframe.
I do not want that, as I need two dataframes, each consisting of consecutively occurring 1s.
One method is obviously to loop over the rows and return a dataframe whenever I find a 0, but that is not efficient. Any kind of help is appreciated.
First compare the values with 1, then create consecutive-group labels with shift and a cumulative sum, and finally collect all groups in a list comprehension over groupby:
m1 = df['col3'].eq(1)                    # mask of rows where col3 == 1
g = m1.ne(m1.shift()).cumsum()           # label of each consecutive run
dfs = [x for i, x in df[m1].groupby(g)]  # one dataframe per run of 1s
print (dfs)
[ col1 col2 col3
0 a x 1
1 b y 1, col1 col2 col3
3 d k 1
4 e l 1]
print (dfs[0])
col1 col2 col3
0 a x 1
1 b y 1
If it is also necessary to remove runs consisting of a single 1, add Series.duplicated with keep=False:
print (df)
col1 col2 col3
0 a x 1
1 b y 1
2 c z 0
3 d k 1
4 e l 1
5 f m 0
6 g n 1 <- removed
m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
g = g[g.duplicated(keep=False)]
print (g)
0 1
1 1
3 3
4 3
Name: col3, dtype: int32
dfs = [x for i, x in df[m1].groupby(g)]
print (dfs)
[ col1 col2 col3
0 a x 1
1 b y 1, col1 col2 col3
3 d k 1
4 e l 1]
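If you ever need a minimum run length other than 2, the duplicated trick generalizes to counting run sizes. A sketch, where min_len is a made-up parameter:
min_len = 2

m1 = df['col3'].eq(1)
g = m1.ne(m1.shift()).cumsum()
run_len = g.groupby(g).transform('size')  # length of each consecutive run

dfs = [x for _, x in df[m1 & run_len.ge(min_len)].groupby(g)]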
Given df1:
A B C
0 a 7 x
1 b 3 x
2 a 5 y
3 b 4 y
4 a 5 z
5 b 3 z
How do I get df2 where, for each value in C of df1, a new column D holds the difference between the df1 values in col B where col A == a and where col A == b:
C D
0 x 4
1 y 1
2 z 2
I'd use a pivot table:
df = df1.pivot_table(columns = ['A'],values = 'B', index = 'C')
df2 = pd.DataFrame({'D': df['a'] - df['b']})
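Note that pivot_table leaves C as the index; if df2 should have C as a regular column, as in the expected output, finish with a reset_index:
df2 = df2.reset_index()  # columns: C, D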
The risk in the answer given by @YOBEN_S is that it will fail if b appears before a for a given value of C.
I have two dataframes:
Dataframe A:
Col1 Col2 Value
A X 1
A Y 2
B X 3
B Y 2
C X 5
C Y 4
Dataframe B:
Col1
A
B
C
What I need is to add to Dataframe B one column for each value in Col2 of Dataframe A (in this case, X and Y), filling them with the values in column "Value" after merging the two dataframes on Col1. Here it is:
Col1 X Y
A 1 2
B 3 2
C 5 4
Thank you very much for your help!
B['X'] = A.loc[A['Col2'] == 'X', 'Value'].reset_index(drop = True)
B['Y'] = A.loc[A['Col2'] == 'Y', 'Value'].reset_index(drop = True)
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
If you are going to have hundreds of distinct values in Col2, you can do the same assignment in a loop over the unique values, like this:
for t in A['Col2'].unique():
    B[t] = A.loc[A['Col2'] == t, 'Value'].reset_index(drop = True)
B
You get the same output:
Col1 X Y
0 A 1 2
1 B 3 2
2 C 5 4
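This approach relies on the rows for each Col2 value appearing in the same Col1 order in A. If that assumption worries you, an order-independent alternative (just a sketch) is to pivot A to wide form and merge it onto B on Col1:
import pandas as pd

wide = A.pivot(index='Col1', columns='Col2', values='Value').reset_index()
wide.columns.name = None  # drop the leftover 'Col2' axis name

B = B.merge(wide, on='Col1', how='left')
print(B)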
I have a huge dataset with more than 100 columns that contain non-null values I want to replace (leaving all the null values as is). Some columns, however, should stay untouched.
I am planning to do the following:
1) find unique values in these columns
2) replace these values with 1
Problem:
1) something like this is barely usable for 100+ columns:
np.unique(df[['Col1', 'Col2']].values)
2) how do I then apply loc to all these columns? The code below does not work:
df_2.loc[df_2[['col1','col2','col3']] !=0, ['col1','col2','col3']] = 1
Maybe there is a more reasonable and elegant way to solve the problem. Thanks!
Use DataFrame.mask:
c = ['col1','col2','col3']
df_2[c] = df_2[c].mask(df_2[c] != 0, 1)
Or compare with not-equal using DataFrame.ne and cast the mask to integers with DataFrame.astype:
df_2 = pd.DataFrame({
'A':list('abcdef'),
'col1':[0,5,0,5,5,0],
'col2':[7,8,9,0,2,0],
'col3':[0,0,5,7,0,0],
'E':[5,0,6,9,2,0],
})
c = ['col1','col2','col3']
df_2[c] = df_2[c].ne(0).astype(int)
print (df_2)
A col1 col2 col3 E
0 a 0 1 0 5
1 b 1 1 0 0
2 c 0 1 1 6
3 d 1 0 1 9
4 e 1 1 0 2
5 f 0 0 0 0
EDIT: To select columns by position, use DataFrame.iloc:
idx = np.r_[6:71,82]
df_2.iloc[:, idx] = df_2.iloc[:, idx].ne(0).astype(int)
Or first solution:
df_2.iloc[:, idx] = df_2.iloc[:, idx].mask(df_2.iloc[:, idx] != 0, 1)
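np.r_[6:71, 82] simply builds the integer positions 6 through 70 plus 82. A quick way to check which columns that actually selects (a sketch, assuming df_2 here is the real 100+ column frame rather than the small example above):
import numpy as np

idx = np.r_[6:71, 82]      # array([ 6,  7, ..., 70, 82])
print(df_2.columns[idx])   # the column labels at those positions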