I've been struggling to sort all of the columns of my df. However, my code seems to sort only the first column ('Name') and rearranges the remaining columns to follow it, as shown here:
Index Name Age Education Country
0 W 2 BS C
1 V 1 PhD F
2 R 9 MA A
3 A 8 MA A
4 D 7 PhD B
5 C 4 BS C
df.sort_values(by=['Name', 'Age', 'Education', 'Country'],ascending=[True,True, True, True])
Here's what I'm hoping to get:
Index Name Age Education Country
0 A 1 BS A
1 C 2 BS A
2 D 4 MA B
3 R 7 MA C
4 V 8 PhD C
5 W 9 PhD F
Instead, I'm getting the following:
Index Name Age Education Country
3 A 8 MA A
5 C 4 BS C
4 D 7 PhD B
2 R 9 MA A
1 V 1 PhD F
0 W 2 BS C
Could you please shed some light on this issue? Many thanks in advance.
Cheers,
R.
Your code is sorting by name, then age, then education, then country; the later columns only break ties in the earlier ones.
To get what you want, you can sort each column separately, column by column. For example,
for col in df.columns:
    df[col] = sorted(df[col])
But are you sure that's what you want to do? A DataFrame is designed so that each row corresponds to a single entry, e.g. a person, and the columns correspond to attributes such as 'name' and 'age'. So you normally don't want to sort name and age separately, because each person's name and age would get mismatched.
You can use np.sort along the 0th axis:
import numpy as np
df[:] = np.sort(df.values, axis=0)
df
Index Name Age Education Country
0 0 A 1 BS A
1 1 C 2 BS A
2 2 D 4 MA B
3 3 R 7 MA C
4 4 V 8 PhD C
5 5 W 9 PhD F
Of course, you should beware that sorting columns independently will scramble the values within each row relative to one another and can render your data meaningless.
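To make the difference between the two behaviours concrete, here is a minimal sketch built on the data from the question. The variable names `by_rows` and `independent` are illustrative only and do not come from the original posts.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Name": ["W", "V", "R", "A", "D", "C"],
    "Age": [2, 1, 9, 8, 7, 4],
    "Education": ["BS", "PhD", "MA", "MA", "PhD", "BS"],
    "Country": ["C", "F", "A", "A", "B", "C"],
})

# Row-wise sort: whole rows move together, so each person's
# attributes stay attached to their name.
by_rows = df.sort_values(by=["Name", "Age"]).reset_index(drop=True)

# Independent sort: every column is sorted on its own, which
# breaks the link between the values in a row.
independent = pd.DataFrame({col: sorted(df[col]) for col in df.columns})

print(by_rows)
print(independent)
```

In `by_rows`, "A" still has age 8, as in the original data. In `independent`, "A" ends up next to age 1, which belonged to "V".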
I have a very large dataset that I'm trying to shrink.
My idea is to keep 100 rows per neighborhood.
Here's an overview of my data :
index  name     neighborhood
0      name 1   neighborhood A
1      name 2   neighborhood A
2      name 3   neighborhood B
3      name 4   neighborhood B
4      name 5   neighborhood C
5      name 6   neighborhood C
6      name 7   neighborhood D
7      name 8   neighborhood D
8      name 9   neighborhood E
9      name 10  neighborhood E
What is the most efficient way to do so?
Thanks in advance
I'm expecting to create something that looks like :
index  name    neighborhood
0      name 1  neighborhood A
1      name 3  neighborhood B
2      name 5  neighborhood C
3      name 7  neighborhood D
4      name 9  neighborhood E
I think you can use groupby and nth (the index notation on nth requires pandas 1.4 or later):
dfx = df.groupby('neighborhood').nth[:100]
It depends on how you want to select the rows.
First n rows per group with groupby.head:
n = 100
out = df.groupby('neighborhood').head(n)
Random n rows per group with groupby.sample:
n = 100
out = df.groupby('neighborhood').sample(n=n)
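One caveat worth sketching: `groupby.sample(n=n)` raises a ValueError when a group has fewer than n rows (unless `replace=True`), whereas `groupby.head(n)` simply returns the smaller group. A possible workaround, shown here as an assumption rather than as the answerer's method, is to shuffle the whole frame first and then take `head(n)` per group. The `random_state=0` seed and the small `n` are illustrative choices.

```python
import pandas as pd

# Toy data shaped like the question's example.
df = pd.DataFrame({
    "name": [f"name {i}" for i in range(1, 11)],
    "neighborhood": [f"neighborhood {c}" for c in "AABBCCDDEE"],
})

n = 1  # the question uses 100; 1 keeps the toy output readable

# First n rows of each neighborhood, in original order.
first_n = df.groupby("neighborhood").head(n)

# Up to n random rows per neighborhood: shuffle all rows, then take
# the first n of each group. Groups smaller than n are kept whole.
random_n = (
    df.sample(frac=1, random_state=0)
      .groupby("neighborhood")
      .head(n)
      .sort_index()
)

print(first_n)
print(random_n)
```

With `n = 1`, `first_n` reproduces the expected output from the question (name 1, 3, 5, 7 and 9).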
I have a pandas DataFrame to which I want to add a column (col_new) whose values depend on a comparison of values in an existing column (col_exist).
The existing column (dtype object) contains As and Bs.
New column should count, starting with 1.
If an A follows an A, the count should rise by one.
If an A follows a B, the count should rise by one.
If a B follows an A, the count should not rise.
If a B follows a B, the count should not rise.
col_exist col_new
A 1
A 2
A 3
B 3
A 4
B 4
B 4
A 5
B 5
I am completely new to programming, so thank you in advance for your adequate answer.
Use eq and cumsum:
df['col_new'] = df['col_exist'].eq('A').cumsum()
output:
col_exist col_new
0 A 1
1 A 2
2 A 3
3 B 3
4 A 4
5 B 4
6 B 4
7 A 5
8 B 5
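The one-liner works because all four rules in the question reduce to "the count rises by one on every A, regardless of what came before". A small sketch that exposes the intermediate boolean column (the name `is_a` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col_exist": list("AAABABBAB")})

# True wherever the value is "A", False wherever it is "B".
is_a = df["col_exist"].eq("A")

# cumsum treats True as 1 and False as 0, so the running total
# increases on each A and stays flat on each B.
df["col_new"] = is_a.cumsum()

print(df)
```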
I need to drop some rows from a DataFrame in Python, based on multiple values
Code Names Country
1 a France
2 b France
3 c USA
4 d Canada
5 e TOTO
6 f TITI
7 g Corona
I need to have this
Code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
I do this :
df.drop(df[('f','b','c')in df['names']].index)
But it doesn't work: KeyError: False
It works for only one key, like this: df.drop(df['f' in df['names']].index)
Do you have any idea?
To remove rows of certain values:
indexNames = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(indexNames, inplace=True)
print(df)
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
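As a rough illustration of why the attempt in the question fails: the `in` operator on a Series tests the index labels and returns a single Python bool, so `df[False]` is looked up as a column name, which is where `KeyError: False` comes from. `isin` instead returns one boolean per row. The sketch below rebuilds the question's data; the names `mask` and `kept` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Code": [1, 2, 3, 4, 5, 6, 7],
    "Names": list("abcdefg"),
    "Country": ["France", "France", "USA", "Canada", "TOTO", "TITI", "Corona"],
})

# `in` checks the index labels (0..6), not the values, and yields one bool.
print("f" in df["Names"])

# `isin` checks the values and yields one bool per row.
mask = df["Names"].isin(["f", "b", "c"])

# Keep every row whose name is not in the list.
kept = df[~mask]
print(kept)
```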
Based on your example, I think this may be what you are looking for.
new_df = df.loc[~df.Names.isin(['f','b','c'])].copy()
new_df
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
In pandas, we can use the .drop() method to drop columns and rows.
To drop specific rows, use axis=0.
So the required output can be achieved with the following line of code (1, 2 and 5 are the index labels of rows b, c and f):
df4.drop([1,2,5], axis=0)
The output will be :
code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
In pandas and Python:
I have a large dataset of health records in which patients have records of diagnoses.
How can I display the most frequent diagnoses, but count only 1 occurrence of the same diagnosis per patient?
Example ('pid' is patient id. 'code' is the code of a diagnosis):
IN:
pid code
1 A
1 B
1 A
1 A
2 A
2 A
2 B
2 A
3 B
3 C
3 D
4 A
4 A
4 A
4 B
OUT:
B 4
A 3
C 1
D 1
I would like to be able to use .isin .index if possible.
Example:
Remove all rows whose value in column 'code' has a frequency count of less than 3:
s = df['code'].value_counts().ge(3)
df = df[df['code'].isin(s[s].index)]
You can use groupby + nunique:
df.groupby(by='code').pid.nunique().sort_values(ascending=False)
Out[60]:
code
B 4
A 3
D 1
C 1
Name: pid, dtype: int64
To remove all rows with less than 3 in frequency count in column 'code'
df.groupby(by='code').filter(lambda x: x.pid.nunique()>=3)
Out[55]:
pid code
0 1 A
1 1 B
2 1 A
3 1 A
4 2 A
5 2 A
6 2 B
7 2 A
8 3 B
11 4 A
12 4 A
13 4 A
14 4 B
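Since you mention value_counts (count(level=0) was removed in pandas 2.0, so the equivalent groupby(level=0).count() is used here):
df.groupby('code').pid.value_counts().groupby(level=0).count()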
Out[42]:
code
A 3
B 4
C 1
D 1
Name: pid, dtype: int64
You should be able to use the groupby and nunique() functions to obtain a distinct count of patients that had each diagnosis. This should give you the result you need:
df[['pid', 'code']].groupby(['code']).nunique()
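Since the question asks for something that combines with `.isin` and `.index`, here is one possible sketch: drop repeated (pid, code) pairs first, so that `value_counts` counts each diagnosis once per patient, then reuse the index of the frequent codes. This is an alternative to the groupby answers above, not a restatement of them; the names `counts`, `frequent` and `filtered` are illustrative.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "pid": [1] * 4 + [2] * 4 + [3] * 3 + [4] * 4,
    "code": list("ABAA" + "AABA" + "BCD" + "AAAB"),
})

# One row per (patient, diagnosis) pair, then count patients per code.
counts = df.drop_duplicates(["pid", "code"])["code"].value_counts()
print(counts)

# Codes seen in at least 3 distinct patients.
frequent = counts[counts.ge(3)].index

# Keep only the rows of the original frame with those codes.
filtered = df[df["code"].isin(frequent)]
print(filtered)
```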
I need some help with cleaning a DataFrame that has a MultiIndex.
It looks something like this:
                          cost
location         season
Thorp park       autumn   £12
                 spring   £13
                 summer   £22
Sea life centre  summer   £34
                 spring   £43
Alton towers     and so on.............
location and season are index columns. I want to go through the data and remove any locations that don't have "season" values of all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also another question, my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df
     col
a 1    0
  2    1
b 1    3
  2    4
  3    5
c 2    7
  3    8
v = df.groupby(level=0).transform('count').values
df = df[v == 3]
df
     col
b 1    3
  2    4
  3    5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: g.count() == 3)
     col
b 1    3
  2    4
  3    5
Option 1
Thinking outside the box...
df.drop(df.groupby(level=0).count().col[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
The same thing, but a little more robust because it doesn't depend on the values in a column.
df.drop(df.index.to_series().groupby(level=0).count().loc[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Option 2
Robustify for the general case with an undetermined number of seasons.
This uses the groupby.pipe method added in pandas 0.21.
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
     col
b 1    3
  2    4
  3    5
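Applied to the data from the question, the same groupby + filter idea can also test for the specific season names instead of a row count. This is a minimal sketch under assumptions: the set `required` and the variable `kept` are illustrative, the "srping" entry in the question is taken to be a typo for "spring", and the index levels are assumed to be named "location" and "season".

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [
        ("Thorp park", "autumn"),
        ("Thorp park", "spring"),
        ("Thorp park", "summer"),
        ("Sea life centre", "summer"),
        ("Sea life centre", "spring"),
    ],
    names=["location", "season"],
)
df = pd.DataFrame({"cost": ["£12", "£13", "£22", "£34", "£43"]}, index=index)

# Seasons every location must have (assumed from the question).
required = {"autumn", "spring", "summer"}

# Keep a location only if its season labels cover all required seasons.
kept = df.groupby(level="location").filter(
    lambda g: required.issubset(g.index.get_level_values("season"))
)

print(kept)
```

Checking the labels rather than the count avoids keeping a location that has three rows but, say, a duplicated season.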