I have a pandas dataframe, e.g.
one two three four five
0 1 2 3 4 5
1 1 1 1 1 1
What I would like is to be able to convert only a select number of columns to a list, such that we obtain:
[[1,2],[1,1]]
These are rows 0 and 1, restricted to columns one and two.
Similarly if we selected columns one, two, four:
[[1,2,4],[1,1,1]]
Ideally I would like to avoid iteration of rows as it is slow!
You can select just those columns with:
In [11]: df[['one', 'two']]
Out[11]:
one two
0 1 2
1 1 1
and get the list of lists from the underlying numpy array using tolist:
In [12]: df[['one', 'two']].values.tolist()
Out[12]: [[1, 2], [1, 1]]
In [13]: df[['one', 'two', 'four']].values.tolist()
Out[13]: [[1, 2, 4], [1, 1, 1]]
Note: this should never really be necessary unless this is your end game... it's going to be much more efficient to do the work inside pandas or numpy.
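As a minimal sketch of the same idea on current pandas, the column selection and conversion can be wrapped in a small helper. The helper name columns_to_lists is hypothetical (it is not part of pandas), and to_numpy() is used here in place of .values:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5], [1, 1, 1, 1, 1]],
    columns=["one", "two", "three", "four", "five"],
)

def columns_to_lists(frame, cols):
    """Return the chosen columns as a list of row lists (hypothetical helper)."""
    # Select the columns, drop to the underlying NumPy array, then convert.
    return frame[cols].to_numpy().tolist()

pairs = columns_to_lists(df, ["one", "two"])
triples = columns_to_lists(df, ["one", "two", "four"])
print(pairs)    # [[1, 2], [1, 1]]
print(triples)  # [[1, 2, 4], [1, 1, 1]]
```

No Python-level loop over rows is involved; the conversion happens on the whole array at once.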
So I worked out how to do it.
Firstly we select the columns we would like the values from:
y = x[['one','two']]
This gives us a subset df.
Now we can choose the values:
> y.values
array([[1, 2],
[1, 1]])
I want to filter my df down to only those rows whose value in column A appears at least some threshold number of times, dropping the rows whose value appears less frequently. I currently use a trick with two value_counts() calls. To explain what I mean:
df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])
'''
A B C
0 1 2 3
1 1 4 5
2 6 7 8
'''
I want to remove any row whose value in column A appears fewer than 2 times in that column. I currently do this:
df = df[df['A'].isin(df.A.value_counts()[df.A.value_counts() >= 2].index)]
Does Pandas have a method to do this which is cleaner than having to call value_counts() twice?
It's probably easiest to group on column A and filter by group size.
df.groupby('A').filter(lambda x: len(x) >=2)
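Two other ways to get the same rows, sketched here as possible alternatives rather than as the recommended method: broadcast each group's size back onto the rows with transform, or map a single value_counts() result onto the column. The name threshold is an assumption introduced for the example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=["A", "B", "C"])
threshold = 2  # assumed minimum number of occurrences to keep a row

# Size of the group each row belongs to, aligned with the original index.
group_sizes = df.groupby("A")["A"].transform("size")
filtered = df[group_sizes >= threshold]

# The same mask built from a single value_counts() call.
counts_per_row = df["A"].map(df["A"].value_counts())
filtered_alt = df[counts_per_row >= threshold]

print(filtered)
```

Both variants build a boolean mask instead of calling a Python lambda once per group, which may matter when column A has many distinct values.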
I am comparing two dataframes. .equals() gives me False, but if I append one to the other and call drop_duplicates(), it gives me nothing. Can someone explain this?
TL;DR
These are completely different operations and I'd have never expected them to produce the same results.
pandas.DataFrame.equals
Will return a boolean value depending on whether Pandas determines that the dataframes being compared are the "same". That means that the index of one is the "same" as the index of the other, the columns of one are the "same" as the columns of the other, and the data of one is the "same" as the data of the other.
See docs
It is NOT the same as pandas.DataFrame.eq which will return a dataframe of boolean values.
Setup
Consider these three dataframes
df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])
df2 = pd.DataFrame([[0, 1], [2, 3]], ['foo', 'bar'], ['A', 'B'])
df0        df1        df2
   A  B       B  A         A  B
0  0  1    0  1  0    foo  0  1
1  2  3    1  3  2    bar  2  3
If we check whether df1 equals df0, we get
df0.equals(df1)
False
Even though all elements are the same
df0.eq(df1).all().all()
True
And that is because the columns are not aligned. If I sort the columns then ...
df0.equals(df1.sort_index(axis=1))
True
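If the goal is an equality check that ignores row and column order, one possible sketch is to realign one frame to the other's labels before calling equals. The function equal_ignoring_label_order is a hypothetical helper, not a pandas API, and it assumes the index and column labels are unique.

```python
import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], index=[0, 1], columns=["A", "B"])
df1 = pd.DataFrame([[1, 0], [3, 2]], index=[0, 1], columns=["B", "A"])
df2 = pd.DataFrame([[0, 1], [2, 3]], index=["foo", "bar"], columns=["A", "B"])

def equal_ignoring_label_order(left, right):
    """Compare two frames after aligning right to left's labels (hypothetical helper)."""
    # Different label sets can never match; checking first avoids NaN-filled reindexing.
    if set(left.columns) != set(right.columns) or set(left.index) != set(right.index):
        return False
    return left.equals(right.reindex_like(left))

print(df0.equals(df1))                       # False: column order differs
print(equal_ignoring_label_order(df0, df1))  # True: same data once aligned
print(equal_ignoring_label_order(df0, df2))  # False: the index labels differ
```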
pandas.DataFrame.drop_duplicates
Compares the values in rows and doesn't care about the index.
So, both of these produce the same looking results
pd.concat([df0, df2]).drop_duplicates()
and
pd.concat([df0, df1], sort=True).drop_duplicates()
A B
0 0 1
1 2 3
When I append (DataFrame.append in older pandas, pandas.concat in current versions), Pandas aligns the columns and adds the appended dataframe as new rows. Then drop_duplicates does its thing. But it is the inherent aligning of the columns that does what I did above with sort_index and axis=1.
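The question's "gives me nothing" result can be reproduced under one assumption: that drop_duplicates was called with keep=False, which discards every row that has a duplicate. The question does not show the call, so this is a guess at what happened rather than a confirmed explanation.

```python
import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], index=[0, 1], columns=["A", "B"])
df1 = pd.DataFrame([[1, 0], [3, 2]], index=[0, 1], columns=["B", "A"])

# concat aligns columns by label, so df1's rows line up under A and B.
stacked = pd.concat([df0, df1], sort=True)

# Default keep="first": one copy of each distinct row survives.
deduped = stacked.drop_duplicates()

# keep=False (assumed to be what the question used): every duplicated row is dropped.
leftover = stacked.drop_duplicates(keep=False)

print(df0.equals(df1))  # False, because the column order differs
print(len(deduped))     # 2
print(leftover.empty)   # True, because every row had a twin
```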
Maybe the rows in the two dataframes are not ordered the same way? Dataframes will be equal when the rows corresponding to the same index are the same.
I'm trying to union several pd.DataFrames along the column axis, using the index to remove duplicates (A and B are from the same source "table" filtered by different predicates and I'm trying to recombine them).
A = pd.DataFrame({"values": [1, 2]}, pd.MultiIndex.from_tuples([(1,1),(1,2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]}, pd.MultiIndex.from_tuples([(1,2),(2,1),(2,2)], names=('l1', 'l2')))
pd.concat([A,B]).drop_duplicates() fails because it ignores the index and de-duplicates on the values, so it removes index item (2,2)
pd.concat([A.reset_index(),B.reset_index()]).drop_duplicates(subset=('l1', 'l2')).set_index(['l1', 'l2']) does what I want, but I feel like there should be a better way.
You can do a simple concat and filter out the duplicates using index.duplicated
df1 = pd.concat([A,B])
df1[~df1.index.duplicated()]
Out[123]:
       values
l1 l2
1  1        1
   2        2
2  1        3
   2        2
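For comparison, here is a self-contained sketch that runs the index-based filter next to the value-based drop_duplicates from the question. The names combined, union and by_value are introduced only for this example.

```python
import pandas as pd

idx_a = pd.MultiIndex.from_tuples([(1, 1), (1, 2)], names=("l1", "l2"))
idx_b = pd.MultiIndex.from_tuples([(1, 2), (2, 1), (2, 2)], names=("l1", "l2"))
A = pd.DataFrame({"values": [1, 2]}, index=idx_a)
B = pd.DataFrame({"values": [2, 3, 2]}, index=idx_b)

combined = pd.concat([A, B])

# De-duplicate on the index: keep the first row seen for each (l1, l2) key.
union = combined[~combined.index.duplicated(keep="first")]

# De-duplicate on the values: this also discards (2, 2), whose value repeats.
by_value = combined.drop_duplicates()

print(union)
print(by_value)
```

Passing keep="last" instead would prefer the row from B whenever a key appears in both frames.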
I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1,2,3], 'a_neg': [1, 1, 1],
'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
'b_neg': [1, 1, 1], 'b_zero': [2,1,1], 'b_pos': [1,2,1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a        b
   -1  0  1 -1  0  1
id
1   1  1  2  1  2  1
2   1  2  1  1  1  2
3   1  2  0  1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to replace the apply/value_counts/fillna with something cleaner and more efficient, but at the moment it eludes me...
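One candidate for such a trick, offered as a sketch and not necessarily the one the answer had in mind: melt the frame to long format, count each (id, column, value) combination with groupby and size, and unstack. The names long_form, counts, suffix and flat are introduced for this example only.

```python
import pandas as pd

data = {
    "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "a": [-1, 1, 1, 0, 0, 0, -1, 1, -1, 0, 0],
    "b": [1, 0, 0, -1, 0, 1, 1, -1, -1, 1, 0],
}
df = pd.DataFrame(data)

# Long format: one row per (id, original column, value) observation.
long_form = df.melt(id_vars="id", var_name="column", value_name="value")

# Count each combination, then spread the column/value pairs across the columns.
counts = (
    long_form.groupby(["id", "column", "value"])
    .size()
    .unstack(["column", "value"], fill_value=0)
    .sort_index(axis=1)
)

# Optional: flatten the MultiIndex columns into the a_neg / a_zero / a_pos names.
suffix = {-1: "neg", 0: "zero", 1: "pos"}
flat = counts.copy()
flat.columns = [f"{col}_{suffix[val]}" for col, val in counts.columns]
flat = flat.reset_index()

print(flat)
```

This scales to any number of value columns without naming them, since melt picks up every column other than id.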
Basically I am trying to do the opposite of How to generate a list from a pandas DataFrame with the column name and column values?
To borrow that example, I want to go from the form:
data = [['Name','Rank','Complete'],
['one', 1, 1],
['two', 2, 1],
['three', 3, 1],
['four', 4, 1],
['five', 5, 1]]
which should output:
Name Rank Complete
One 1 1
Two 2 1
Three 3 1
Four 4 1
Five 5 1
However when I do something like:
pd.DataFrame(data)
I get a dataframe in which the first list is treated as an ordinary row. Instead, the first list should supply my column names, and the first element of each remaining list should be the row name.
EDIT:
To clarify, I want the first element of each list to be the row name. I am scraping data, so it is formatted this way...
One way to do this is to take the column names as a separate list and pass only the rows from index 1 onward to pd.DataFrame -
In [8]: data = [['Name','Rank','Complete'],
...: ['one', 1, 1],
...: ['two', 2, 1],
...: ['three', 3, 1],
...: ['four', 4, 1],
...: ['five', 5, 1]]
In [10]: df = pd.DataFrame(data[1:],columns=data[0])
In [11]: df
Out[11]:
Name Rank Complete
0 one 1 1
1 two 2 1
2 three 3 1
3 four 4 1
4 five 5 1
If you want to set the first column, Name, as the index, use the .set_index() method and pass in the column to use as the index. Example -
In [16]: df = pd.DataFrame(data[1:],columns=data[0]).set_index('Name')
In [17]: df
Out[17]:
       Rank  Complete
Name
one       1         1
two       2         1
three     3         1
four      4         1
five      5         1
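The same steps can be written without hard-coding the name of the index column, assuming the first list is always the header and the first field of each row is the row name. This is a sketch under those assumptions, using Python's starred unpacking to split the header from the rows:

```python
import pandas as pd

data = [['Name', 'Rank', 'Complete'],
        ['one', 1, 1],
        ['two', 2, 1],
        ['three', 3, 1],
        ['four', 4, 1],
        ['five', 5, 1]]

# Split the header row from the data rows.
header, *rows = data

# Build the frame, then promote the first column to the index.
df = pd.DataFrame(rows, columns=header).set_index(header[0])

print(df)
print(df.loc['three', 'Rank'])  # 3
```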