copy row values based on result from isin() in pandas - python

I have 2 dataframes:
df_a = pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7],'ID':['a','b','c','d']})
df_a
A B ID
0 1 4 a
1 2 5 b
2 3 6 c
3 4 7 d
df_b = pd.DataFrame({'A':[1,2,3],'ID':['b','a','c']})
df_b['CopyB'] = ""
A ID CopyB
0 1 b
1 2 a
2 3 c
Now I want to match the ID columns in both dataframes and, upon a successful match, copy the respective value of B from df_a into df_b['CopyB']. I tried
df_b.loc[df_b['ID'].isin(df_a['ID']), 'CopyB'] = df_a['B']
but that is not correct. Then I tried comparing the IDs using '==' but got an error since the ID series are not the same length. Any help? Sorry if it is a very trivial query.

use join
df_b.join(df_a.set_index('ID').B, on='ID')
A ID B
0 1 b 5
1 2 a 4
2 3 c 6
join works on indices. So I set the index of df_a to be the ID column and access the B column so that df_a.set_index('ID').B is a series with the column I want to add as values and the merge column as the index. Then I use join. I wouldn't have to specify an on parameter if ID was the index of df_b but it isn't, so I do.
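If you specifically want the values to land in the CopyB column of df_b (as in the question) rather than producing a new joined frame, a map-based variant is a minimal sketch of the same idea; it assumes the IDs in df_a are unique, and any ID missing from df_a would come out as NaN:
# Build a lookup Series indexed by ID, then align it onto df_b's ID column
df_b['CopyB'] = df_b['ID'].map(df_a.set_index('ID')['B'])
print(df_b)
A ID CopyB
0 1 b 5
1 2 a 4
2 3 c 6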

You could use merge
In [247]: df_b.merge(df_a, on=['ID'])
Out[247]:
A_x ID A_y B
0 1 b 2 5
1 2 a 1 4
2 3 c 3 6
In [248]: df_b.merge(df_a, on=['ID'])[['A_x', 'ID', 'B']].rename(columns={'A_x': 'A'})
Out[248]:
A ID B
0 1 b 5
1 2 a 4
2 3 c 6
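Since only the B column from df_a is needed, a slightly leaner variant of the same merge (a sketch) avoids the A_x/A_y suffixes by selecting the columns up front:
df_b.merge(df_a[['ID', 'B']], on='ID')
A ID B
0 1 b 5
1 2 a 4
2 3 c 6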

If you set the ID column as a common index for both dataframes, you can easily add the column:
df_a = df_a.set_index('ID')
df_b = df_b.set_index('ID')
df_b['CopyB'] = df_a['B']
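This works through index alignment: pandas matches on the shared ID index, and any ID in df_b that is missing from df_a would end up as NaN in CopyB. A quick check with the sample data (a sketch; reset_index just restores ID as a regular column):
print(df_b.reset_index())
ID A CopyB
0 b 1 5
1 a 2 4
2 c 3 6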

Related

Is there a way to get the data after a specific condition in Pandas?

I want to know if there is a way to take data from a dataframe after a specific condition is met, and keep taking that data until another condition applies.
I have the following dataframe:
column_1 column_2
0 1 a
1 1 a
2 1 b
3 4 b
4 4 c
5 4 c
6 0 d
7 0 d
8 0 e
9 4 e
10 4 f
11 4 f
12 1 g
13 1 g
I want to select from this dataframe only the rows where column_1 changes from 1 -> 4 and stays 4 until it changes to another value, as follows:
column_1 column_2
3 4 b
4 4 c
5 4 c
Is there a way to do this in Pandas without converting the columns to lists?
Another option is to find the cut-off points using shift + eq, then use groupby.cummax to create a boolean filter:
df[(df['column_1'].shift().eq(1) & df['column_1'].eq(4)).groupby(df['column_1'].diff().ne(0).cumsum()).cummax()]
Output:
column_1 column_2
3 4 b
4 4 c
5 4 c
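Broken into named steps, the same filter reads like this (a sketch; the intermediate variable names are just illustrative):
# True on rows where the previous value was 1 and the current value is 4
start = df['column_1'].shift().eq(1) & df['column_1'].eq(4)
# Label consecutive runs of equal values in column_1
runs = df['column_1'].diff().ne(0).cumsum()
# Within each run, carry a True forward so the whole 4-run is kept
mask = start.groupby(runs).cummax()
print(df[mask])
column_1 column_2
3 4 b
4 4 c
5 4 c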
You can first create a helper column new that labels groups of consecutive duplicated values, then test whether the shifted value is 1 while the actual row is 4 and take the new labels for those rows. Last, filter the new column by those labels so that all duplicated 4 rows of the matching runs are kept:
df['new'] = df['column_1'].ne(df['column_1'].shift()).cumsum()
s = df.loc[df['column_1'].shift().eq(1) & df['column_1'].eq(4), 'new']
df = df[df['new'].isin(s)]
print (df)
column_1 column_2 new
3 4 b 2
4 4 c 2
5 4 c 2

Pandas Counting Each Column with its Specific Thresholds

If I have a following dataframe:
A B C D E
1 1 2 0 1 0
2 0 0 0 1 -1
3 1 1 3 -5 2
4 -3 4 2 6 0
5 2 4 1 9 -1
T 1 2 2 4 1
The last row contains my threshold values for each column. I want to count, for each column, how many values are lower than that column's threshold, in Python pandas.
Desired Output is;
A B C D E
Count 2 2 3 3 4
But I need a general solution rather than one written for these specific columns, because I have a large dataset and cannot spell out each column name in the code.
Could you please help me with this?
Select all rows except the last by indexing, compare them with the last row using DataFrame.lt, then sum and convert the resulting Series to a one-row DataFrame with Series.to_frame and transpose with DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print (df)
A B C D E
count 2 2 3 3 4
Numpy alternative with DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print (df)
A B C D E
count 2 2 3 3 4
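For completeness, a self-contained sketch reproducing the example above (the row label 'T' for the thresholds and the 'Count' label are taken from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, -3, 2, 1],
                   'B': [2, 0, 1, 4, 4, 2],
                   'C': [0, 0, 3, 2, 1, 2],
                   'D': [1, 1, -5, 6, 9, 4],
                   'E': [0, -1, 2, 0, -1, 1]},
                  index=[1, 2, 3, 4, 5, 'T'])

# Compare every row except the threshold row against it, then count per column
out = df.drop('T').lt(df.loc['T']).sum().to_frame('Count').T
print(out)
A B C D E
Count 2 2 3 3 4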

create pandas pivottable with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object out of columns labels that I want as index named cols.
I tried the following:
df.pivot(index=cols, columns='id')['id']
This gives an error: 'all arrays must be same length'.
I also tried the following to see if I can get a sum, but no luck either:
pd.pivot_table(df, index=cols, values=['id'], aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online talking about a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multiindex as long as it contains NaNs (link to GitHub in comments). My workaround was to do
df.fillna('empty', inplace=True)
and then the solution below,
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael, works as intended, hence that answer is accepted.
I believe you need to convert the column names to a list and then aggregate size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
'C':[1,1,9,4,2,3],
'D':[1,1,5,7,1,0],
'E':[0,0,6,9,2,4],
'id':list('aaabbb')})
print (df)
B C D E id
0 4 1 1 0 a
1 4 1 1 0 a
2 4 9 5 6 a
3 5 4 7 9 b
4 5 2 1 2 b
5 4 3 0 4 b
cols = df.columns[:-1].tolist()   # all columns except the 'id' column
df1 = df.groupby(cols + ['id']).size().unstack(fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1
Solution with pivot_table:
df1 = pd.pivot_table(df, index=cols,columns='id',aggfunc='size', fill_value=0)
print (df1)
id a b
B C D E
4 1 1 0 2 0
3 0 4 0 1
9 5 6 1 0
5 2 1 2 0 1
4 7 9 0 1

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if one of its column values does not match a set of allowed values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working because some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not exactly one of those integer values.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for values that cannot be parsed, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
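Putting both ideas together for the original question (coerce to numeric first, then keep only the allowed values), a minimal sketch assuming the columns are named a and b as in the example:
# Coerce to numeric; anything like '1abc' becomes NaN
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
# Keep only rows whose cleaned values are among the allowed integers
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]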
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Selecting columns from two dataframes according to another column

I have 2 dataframes: one contains some general information about football players, and the second contains other information like winning matches for each player. They both have an "id" column; however, they are not the same length.
What I want to do is create a new dataframe which contains 2 columns: "x" from the first dataframe and "y" from the second, ONLY where the "id" column contains the same value in both dataframes. That way I can match the "x" and "y" values which belong to the same person.
I tried to do it using concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't work out how to apply the condition that "id" must be equal in both dataframes.
It seems you need merge with the default inner join; note that this assumes the values in the id columns are unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
id y
0 1 7
1 2 0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 2 3 0
A solution with concat is possible, but a bit more complicated, because you need to join on the indexes with an inner join:
df = pd.concat([df1.set_index('id')['x'],
                df2.set_index('id')['y']],
               axis=1, join='inner').reset_index()
print (df)
id x y
0 1 4 7
1 2 3 0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
id x
0 1 4
1 2 3
2 3 8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
id y
0 1 7
1 2 0
2 1 4
3 1 2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
id x y
0 1 4 7
1 1 4 4
2 1 4 2
3 2 3 0
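If you would rather get an error when duplicate ids appear than have the result silently expand, merge's validate argument can help; a minimal sketch (validate is available in reasonably recent pandas versions):
# Raises a MergeError if 'id' is duplicated on either side
df = pd.merge(df1[['id', 'x']], df2[['id', 'y']], on='id', validate='one_to_one')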
