This question already has answers here:
Pandas DENSE RANK
(4 answers)
Closed 4 years ago.
I am trying to generate a unique index column in my dataset.
I have a column that looks like this:
665678, 665678, 665678, 665682, 665682, 665682, 665690, 665690
And I would like to generate a separately indexed column looking like this:
1, 1, 1, 2, 2, 2, 3, 3
I came across the post How to index columns uniquely?? which describes exactly what I am trying to do, but since its solutions are for R, I wanted to know how I can implement the same in Python using pandas.
Thanks
Use groupby with ngroup():
df.groupby('col').ngroup() + 1
Output:
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
dtype: int64
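As a runnable sketch of the same idea (the column name 'col' is just a placeholder, and rank(method='dense') from the linked duplicate should give the same numbering):

import pandas as pd

df = pd.DataFrame({'col': [665678, 665678, 665678, 665682, 665682, 665682, 665690, 665690]})

# ngroup() labels the groups 0, 1, 2, ...; adding 1 starts the ids at 1
df['group_id'] = df.groupby('col').ngroup() + 1

# Dense rank, as in the linked duplicate, gives the same 1, 1, 1, 2, 2, 2, 3, 3
df['dense_rank'] = df['col'].rank(method='dense').astype(int)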
This question already has answers here:
set difference for pandas
(12 answers)
Closed 11 months ago.
I have two data frames.
first_dataframe
id
9
8
6
5
7
4
second_dataframe
id
6
4
1
5
2
3
Note: my dataframes have many other columns, but I need to compare them based only on id.
I need to find:
ids that are in the first dataframe and not in the second: [7, 8, 9]
ids that are in the second dataframe and not in the first: [1, 2, 3]
I have searched for an answer, but none of the solutions I've found work for me, because they compare rows based on the index.
Use set subtraction:
inDF1_notinDF2 = set(df1['id']) - set(df2['id'])  # ids present in df1 but not in df2
inDF2_notinDF1 = set(df2['id']) - set(df1['id'])  # ids present in df2 but not in df1
Output:
>>> inDF1_notinDF2
{7, 8, 9}
>>> inDF2_notinDF1
{1, 2, 3}
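Since the real dataframes have many other columns, an isin-based variant keeps the full rows rather than just the id values (a sketch, assuming the column is literally named 'id'):

import pandas as pd

df1 = pd.DataFrame({'id': [9, 8, 6, 5, 7, 4]})
df2 = pd.DataFrame({'id': [6, 4, 1, 5, 2, 3]})

# Rows of df1 whose id does not appear in df2, and vice versa
only_in_df1 = df1[~df1['id'].isin(df2['id'])]   # ids 9, 8, 7
only_in_df2 = df2[~df2['id'].isin(df1['id'])]   # ids 1, 2, 3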
This question already has answers here:
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
(11 answers)
Closed 3 years ago.
For the following data frame I want to filter rows based on the count value. I want to keep the rows where count is 10, 5, 2, or "tab":
index count
1 10
2 5
3 2
4 1
5 6
6 7
7 "tab"
I know that I can write the code as
df[(df.count==10) | (df.count==5) | (df.count==2) | (df.count=="tab")]
Is there any simpler way to do it? I have more than 20 values. I tried the following, but it did not work:
df[df.count == [10, 5, 2, "tab"]]
Thank you.
Use isin:
df[df['count'].isin([10, 5, 2, "tab"])]
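Bracket notation matters here, because df.count resolves to the DataFrame.count method rather than the column. A minimal runnable sketch on the question's data:

import pandas as pd

df = pd.DataFrame({'count': [10, 5, 2, 1, 6, 7, "tab"]}, index=range(1, 8))

# Keep only the rows whose count value is in the wanted list
filtered = df[df['count'].isin([10, 5, 2, "tab"])]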
This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
I am coming from an R background and need some elementary help with pandas.
If I have a dataframe like this:
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
I want to subset the dataframe by selecting a fixed column and selecting rows with a boolean condition.
For example
df.iloc[df.2 > 4][2]
Then I want to set the selected cells to a new value, something like
df.iloc[df.2 > 4][2] = 7
This seems valid to me, but pandas appears to handle boolean indexing more strictly than R does.
Here you want .loc:
df.loc[df[2] > 4, 2]
1 6
Name: 2, dtype: int64
df.loc[df[2] > 4, 2] = 7
df
0 1 2
0 1 2 3
1 4 5 7
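Put together as a runnable sketch (the column labels are the integers 0-2 produced by the constructor in the question); a single .loc call with the row mask and the column label also avoids the chained indexing that triggers SettingWithCopyWarning:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

# One .loc call: boolean row mask, then the column label 2
df.loc[df[2] > 4, 2] = 7

# df now holds 7 in row 1, column 2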
This question already has answers here:
how to sort pandas dataframe from one column
(13 answers)
Closed 5 years ago.
I would like to sort the values in my groupby result by Actual Cost in ascending order, but I keep getting the wrong result.
This is my code:
D16 = Dec16.groupby('PRACTICE', sort=False)["Actual Cost"].sum()
D16
And it returns this:
PRACTICE
1 19585.09
3 144741.12
5 32622.69
6 138969.68
10 33973.04
Does anyone know how I can sort this correctly?
If your dataframe D16 looks like this, with one column PRACTICE:
PRACTICE
0 19585.09
1 144741.12
2 32622.69
3 138969.68
4 33973.04
D16.sort_values(by='PRACTICE', ascending=True)
will yield:
PRACTICE
0 19585.09
2 32622.69
4 33973.04
3 138969.68
1 144741.12
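Since the D16 in the question is actually a Series (the result of groupby(...).sum(), indexed by PRACTICE), it can also be sorted by its values directly with Series.sort_values(). A sketch, using a hypothetical Dec16 built from the figures shown in the question:

import pandas as pd

# Hypothetical stand-in for Dec16, one row per practice
Dec16 = pd.DataFrame({
    'PRACTICE': [1, 3, 5, 6, 10],
    'Actual Cost': [19585.09, 144741.12, 32622.69, 138969.68, 33973.04],
})

# groupby().sum() returns a Series; sort it by its values, not its index
D16 = Dec16.groupby('PRACTICE')['Actual Cost'].sum().sort_values(ascending=True)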
This question already has answers here:
selecting across multiple columns with pandas
(3 answers)
Closed 9 years ago.
Despite there being at least two good tutorials on how to index a DataFrame in Python's pandas library, I still can't work out an elegant way of SELECTing on more than one column.
>>> d = pd.DataFrame({'x':[1, 2, 3, 4, 5], 'y':[4, 5, 6, 7, 8]})
>>> d
x y
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
>>> d[d['x']>2] # This works fine
x y
2 3 6
3 4 7
4 5 8
>>> d[d['x']>2 & d['y']>7] # I had expected this to work, but it doesn't
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have found (what I think is) a rather inelegant way of doing it, like this
>>> d[d['x']>2][d['y']>7]
But it's not pretty, and it scores fairly low for readability (I think).
Is there a better, more Python-tastic way?
It is an operator precedence issue.
You should add extra parentheses to make your multi-condition test work:
d[(d['x']>2) & (d['y']>7)]
This section of the tutorial you mentioned shows an example with several boolean conditions, where the parentheses are used.
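For completeness, a runnable sketch of the fix on the question's data:

import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [4, 5, 6, 7, 8]})

# & is element-wise AND and binds tighter than >, so each comparison needs its own parentheses
result = d[(d['x'] > 2) & (d['y'] > 7)]   # only row 4 (x=5, y=8) satisfies both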
There may still be a better way, but note that
In [56]: d[d['x'] > 2] and d[d['y'] > 7]
Out[56]:
x y
4 5 8
only appears to work: Python's and is not element-wise, it evaluates the truth value of the first frame and simply returns the second, so the result matches the intended conjunction only by coincidence here. On current pandas this raises the same "truth value ... is ambiguous" ValueError, so the parenthesized & expression above is the one to use.