I have two dataframes, df and df2:
df
Name A B
t1 3 4
t5 2 2
fry 4 5
net 3 3
df2
Name A B
t1 3 4
t5 2 2
fry 4 5
net 3 3
I want to make sure that the 'Name' columns of the two dataframes match: not only the same names (t1, t5, etc.) but also in the same order. I've tried chekS = (df.index == df2.index).all(axis=1).astype(str) with no luck.
Assuming that Name is your index, either change your axis to 0 (or drop the axis argument entirely), or use chekS = sum(df.index != df2.index). If it's not the index, then chekS = sum(df.Name != df2.Name) will work.
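A minimal sketch of both checks on the sample data (the frames are reconstructed here only to make it runnable; a result of 0 means every position matches):
import pandas as pd

df = pd.DataFrame({'Name': ['t1', 't5', 'fry', 'net'], 'A': [3, 2, 4, 3], 'B': [4, 2, 5, 3]})
df2 = df.copy()

# Name as a plain column
chekS = sum(df.Name != df2.Name)          # 0 -> same names in the same order

# Name as the index: same idea after set_index
chekS_idx = sum(df.set_index('Name').index != df2.set_index('Name').index)   # 0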
If Name is a column rather than the index, as your sample dataframe suggests, you can compare the two columns directly:
(df1['Name'] == df2['Name']).all()
It returns True in this case.
Let's say your df2 is
Name A B
0 t1 3 4
1 t5 2 2
2 net 3 3
3 fry 4 5
I just swapped the rows at index 2 and 3, keeping the values the same, so
(df1['Name'] == df2['Name']).all()
will return False
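A minimal reproduction of both cases, building the frames inline just for illustration:
import pandas as pd

df1 = pd.DataFrame({'Name': ['t1', 't5', 'fry', 'net'], 'A': [3, 2, 4, 3], 'B': [4, 2, 5, 3]})
df2 = df1.copy()
print((df1['Name'] == df2['Name']).all())   # True: same names, same order

# same names but with the last two rows swapped
df2 = pd.DataFrame({'Name': ['t1', 't5', 'net', 'fry'], 'A': [3, 2, 3, 4], 'B': [4, 2, 3, 5]})
print((df1['Name'] == df2['Name']).all())   # False: order differs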
I have a multi-indexed dataframe which looks roughly like this:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
Output:
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
4 5 1 1 5
In this dataframe, the row at index 0 and the row at index 4 are symmetric in the sense that if the entire A and B blocks of the row at index 0 are swapped, it becomes identical to the row at index 4. Similarly, the row at index 2 is symmetric with itself.
I am planning to remove these rows from my original dataframe, thus making it 'non-symmetric'. The specific plans are as follow:
If a row with a higher index is symmetric with a row with a lower index, keep the lower one and remove the higher one. For example, from the above dataframe, keep the row at index 0 and remove the row at index 4.
If a row is symmetric with itself, remove it. For example, from the above dataframe, remove the row at index 2.
My attempt was to first zip the four columns into a list of tuples, remove the symmetric tuples with a simple if-statement, unzip them, and merge them back into a dataframe. However, this turned out to be inefficient, making it unscalable for large dataframes.
How can I achieve this in an efficient manner? I guess utilizing several built-in pandas methods is necessary, but it seems quite complicated.
Namudon'tdie,
Try this solution:
import pandas as pd
test = pd.DataFrame({('A', 'a'):[1,2,3,4,5], ('A', 'b'):[5,4,3,2,1], ('B', 'a'):[5,2,3,4,1], ('B','b'):[1,4,3,2,5]})
test['idx'] = test.index * 2 # adding auxiliary column 'idx' (all even)
test2 = test.iloc[:, [2,3,0,1,4]].copy() # creating flipped DF (copy avoids a SettingWithCopyWarning)
test2.columns = test.columns # fixing column names
test2['idx'] = test2.index * 2 + 1 # for flipped DF column 'idx' is all odd
df = pd.concat([test, test2])
df = df.sort_values(by='idx')
df = df.set_index('idx')
print(df)
A B
a b a b
idx
0 1 5 5 1
1 5 1 1 5
2 2 4 2 4
3 2 4 2 4
4 3 3 3 3
5 3 3 3 3
6 4 2 4 2
7 4 2 4 2
8 5 1 1 5
9 1 5 5 1
df = df.drop_duplicates() # remove rows with duplicates
df = df[df.index%2 == 0] # remove rows with odd idx (flipped)
df = df.reset_index()[['A', 'B']]
print(df)
A B
a b a b
0 1 5 5 1
1 2 4 2 4
2 3 3 3 3
3 4 2 4 2
The idea is to create a flipped copy of each row with an odd idx, so that after sorting by idx each flipped copy is placed directly under its original row. Then drop duplicates, which keeps the rows with lower idx. For cleanup, simply delete the remaining rows with odd idx (the flipped copies).
Note that the row [3,3,3,3] stayed. There should be a separate filter to take care of self-symmetric rows. Since your definition of self-symmetric is unclear (other rows have a certain degree of symmetry too), I leave that part to you; it should be straightforward.
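For example, if self-symmetric simply means the A block equals the B block within the same row (which is what makes [3,3,3,3] its own flip), a small sketch of that extra filter on the cleaned frame could be:
# assumption: a row is self-symmetric when its A block equals its B block
self_sym = (df['A'].values == df['B'].values).all(axis=1)
df = df[~self_sym]   # drops [3,3,3,3] here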
I have two data frames which look like:
DF1:
x_id y_id
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
DF2:
x_id y_id
1 1
2 1
3 1
4 2
5 2
6 2
1 3
3 3
: :
: :
3 y(i)
So, I want to merge/insert y_id from DF2 into y_id in DF1 in each iteration of the loop.
What I have so far:
count = df2['y_id'].unique()
for i in count:
    new_df = df1.merge(df2[df2['y_id'] == i], how='inner', left_on='x_id', right_on='x_id')
While this creates a new dataframe for each iteration of the loop, I think there should be a better way of doing this.
I want my final data frame to look like:
DF3:
x_id y_id
1 3
2 1
3 y(i)
4 2
5 2
6 2
Essentially what I want to do is group DF2 by y_id and merge in sorted order. In DF2, the x_id values 1 and 3 first have y_id = 1, and further down the column they have y_id = 3. Since 3 > 1, I would like to use that value (i.e. the greatest, or the most recent if we were working with dates, etc.).
What I want to do is similar to an update statement in SQL where we update the column and set the row = y_id, taking the most recent value.
Hope I have explained sufficiently, any questions just ask.
Thanks
You can drop_duplicates before the merge:
df1 = df1.drop('y_id', axis=1).merge(df2.drop_duplicates('x_id', keep='last'), on='x_id')
df1
Out[469]:
x_id y_id
0 1 3
1 2 1
2 3 3
3 4 2
4 5 2
5 6 2
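For reference, a self-contained version of the same idea; the sample frames are reconstructed here just to make it runnable:
import pandas as pd

df1 = pd.DataFrame({'x_id': [1, 2, 3, 4, 5, 6], 'y_id': [None] * 6})
df2 = pd.DataFrame({'x_id': [1, 2, 3, 4, 5, 6, 1, 3],
                    'y_id': [1, 1, 1, 2, 2, 2, 3, 3]})

# keep only the last y_id seen for each x_id, then merge it onto df1
df1 = df1.drop('y_id', axis=1).merge(df2.drop_duplicates('x_id', keep='last'), on='x_id')
print(df1)   # x_id 1 and 3 end up with y_id 3, as in the output above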
I'm trying to group rows by multiple columns.
What I want to achieve can be illustrated by this small example:
import pandas as pd
col_index = pd.MultiIndex.from_arrays([['A','A','B','B'],['a','b','c','d']])
df = pd.DataFrame([ [1,2,3,3],
[4,2,2,2],
[6,4,2,2],
[1,2,4,4],
[3,8,4,4],
[1,2,3,3]], columns = col_index)
DataFrame created by this looks like this:
A B
a b c d
0 1 2 3 3
1 4 2 2 2
2 6 4 2 2
3 1 2 4 4
4 3 8 4 4
5 1 2 3 3
I would like to group by 'c' and 'd', actually by the whole 'B' block.
This gives me "KeyError: 'c'":
#something like this
df.groupby(['c','d'], axis = 1, level = 1)
#or like this
df.groupby('B', axis = 1, level = 0)
I tried searching for an answer but I can't seem to find any.
Can somebody tell me what I'm doing wrong?
This is one way of doing it by resetting the columns first:
df.set_axis(df.columns.droplevel(0), axis=1,inplace=False).groupby(['c','d']).sum()
Out[531]:
a b
c d
2 2 10 6
3 3 2 4
4 4 4 10
You can also specify the 2-level multi-indices explicitly.
df.groupby([("B","c"), ("B", "d")])
I need some help with cleaning a DataFrame that has a MultiIndex.
It looks something like this:
                         cost
location        season
Thorp park      autumn    £12
                spring    £13
                summer    £22
Sea life centre summer    £34
                spring    £43
Alton towers    ...and so on
location and season are index columns. I want to go through the data and remove any location that doesn't have "season" values for all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also, another question: my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
col
a 1 0
2 1
b 1 3
2 4
3 5
c 2 7
3 8
v = df.groupby(level=0)['col'].transform('count').values
df = df[v == 3]
df
col
b 1 3
2 4
3 5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.col.count() == 3)
col
b 1 3
2 4
3 5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
col
b 1 3
2 4
3 5
Option 2
Robustified for the general case with an undetermined number of seasons.
This uses the groupby.pipe method from pandas version 0.21.
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
col
b 1 3
2 4
3 5
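Applied to the location/season frame from the question, the same filter idea would look roughly like this; the column name 'cost' and the numeric values are assumed, since the question's groupby output left the column unnamed:
import pandas as pd

# hypothetical reconstruction of the frame in the question
idx = pd.MultiIndex.from_tuples(
    [('Thorp park', 'autumn'), ('Thorp park', 'spring'), ('Thorp park', 'summer'),
     ('Sea life centre', 'summer'), ('Sea life centre', 'spring')],
    names=['location', 'season'])
df = pd.DataFrame({'cost': [12, 13, 22, 34, 43]}, index=idx)

# keep only locations that have rows for all three seasons
df = df.groupby(level='location').filter(lambda g: len(g) == 3)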
I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it is possible to remove a row if an element of one of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working, as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence I would like to remove anything that is not exactly that integer value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with isin on both columns, combined with &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for values that cannot be parsed, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = (pd.to_numeric(df['a'], errors='coerce').notnull() &
        pd.to_numeric(df['b'], errors='coerce').notnull())
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
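Putting the two steps together for the original goal (strip the non-numeric junk first, then keep only the allowed values), a rough sketch:
a = pd.to_numeric(df['a'], errors='coerce')
b = pd.to_numeric(df['b'], errors='coerce')
# NaN (i.e. unparseable values like '1abc') is never in the allowed set, so those rows drop out too
df1 = df[a.isin([1, 3]) & b.isin([6, 7])]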
You can use pandas isin():
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7