I have a dataframe df:
PID  AID  Ethnicity
1    A    Asian
1    B    Asian
1    C    Arab
1    D    African
2    A    Asian
2    D    African
2    E    Caucasian
2    F    African
2    B    Asian
I want to generate a frame that tells me, for each PID, how many AIDs it has and how many ethnic groups:
So for the above the resulting newdf would be:
PID  numAID  numEthnicities
1    4       3
2    5       3
I know how to find numAID:
newdf = (df[['PID', 'AID']]
         .groupby('PID', as_index=False)
         .count()
         .rename(columns={'AID': 'numAID'}))
I'm not sure how to add the third column to the dataframe.
This will work:
df.groupby('PID').agg({'AID': 'count', 'Ethnicity': 'nunique'}).add_prefix('num')

     numAID  numEthnicity
PID
1         4             3
2         5             3
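If you want the exact column names from the question (numAID, numEthnicities) and PID kept as a regular column, named aggregation (available since pandas 0.25) is a tidy alternative; a minimal sketch:

newdf = df.groupby('PID', as_index=False).agg(
    numAID=('AID', 'count'),
    numEthnicities=('Ethnicity', 'nunique'),
)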
Since you have already found newdf, you could use the join function:
df = df.set_index('PID')
newdf = newdf.set_index('PID')
result = df.join(newdf, lsuffix='df', rsuffix='newdf')
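Equivalently, starting from the original df and newdf (before the set_index calls), a plain merge on PID does the same thing without touching the index; this is an assumed variant, not the answerer's code:

result = df.merge(newdf, on='PID')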
You can add the third column like this:

# nunique counts distinct ethnicities per PID; .values strips the PID index
# so the result aligns with newdf's default RangeIndex
newdf['numEthnicities'] = df.groupby('PID')['Ethnicity'].nunique().values
I need to slice a long format DataFrame by every x unique values for the purpose of visualizing. My actual dataset has ~ 90 variables for 20 individuals so I would like to split into 9 separate df's containing the entries for all 20 individuals for each variable.
I have created this simple example to help explain:
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4],
'Period':[1,2,3,1,2,3,1,2,3,1,2,3,],
'Food':['Ham','Ham','Ham','Cheese','Cheese','Cheese','Egg','Egg','Egg','Bacon','Bacon','Bacon',]})
df
# ******* PSEUDOCODE *******
# df1 = unique entries [:2]
# df2 = unique entries [2:4]
# desired outcome:
df1 = pd.DataFrame({'ID':[1,1,1,2,2,2,],
'Period':[1,2,3,1,2,3,],
'Food':['Ham','Ham','Ham','Cheese','Cheese','Cheese',]})
df2 = pd.DataFrame({'ID':[3,3,3,4,4,4],
'Period':[1,2,3,1,2,3,],
'Food':['Egg','Egg','Egg','Bacon','Bacon','Bacon',]})
print(df1)
print(df2)
In this case, the DataFrame would be split at the end of every 2 sets of unique entries in the df['Food'] column to create df1 and df2. Best case scenario would be a loop that creates a new DataFrame for every x unique entries. Given the lack of info I can find, I'm unfortunately struggling to write even good pseudocode for that.
Let us try factorize with groupby:

n = 2
d = {x: y for x, y in df.groupby(df.Food.factorize()[0] // n)}
d[0]
Out[132]:
ID Period Food
0 1 1 Ham
1 1 2 Ham
2 1 3 Ham
3 2 1 Cheese
4 2 2 Cheese
5 2 3 Cheese
d[1]
Out[133]:
ID Period Food
6 3 1 Egg
7 3 2 Egg
8 3 3 Egg
9 4 1 Bacon
10 4 2 Bacon
11 4 3 Bacon
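To see why this works: factorize assigns each unique Food an integer code in order of appearance, and integer division by n buckets every n consecutive unique values together. A small illustration:

codes = df.Food.factorize()[0]
print(codes)        # [0 0 0 1 1 1 2 2 2 3 3 3]
print(codes // 2)   # [0 0 0 0 0 0 1 1 1 1 1 1]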
A possible solution is the following:
# pip install pandas
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3,4,4,4],
'Period':[1,2,3,1,2,3,1,2,3,1,2,3,],
'Food':['Ham','Ham','Ham','Cheese','Cheese','Cheese','Egg','Egg','Egg','Bacon','Bacon','Bacon',]})
# sort=False keeps the groups in order of appearance; the default sorts
# group keys alphabetically, which would put Bacon first
dfs = [y for x, y in df.groupby('Food', sort=False)]
Note this creates one DataFrame per unique Food value, not per pair. The separated DataFrames can be accessed by list index (see below) or in a loop (a sketch follows):
dfs[0]
dfs[1]
and so on.
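For example, iterating instead of indexing (a minimal sketch using the same groupby):

for food, group in df.groupby('Food', sort=False):
    print(food)
    print(group, '\n')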
We could use groupby + ngroup + floordiv to create groups; then use another groupby to separate:
out = [x for _, x in df.groupby(df.groupby('Food', sort=False).ngroup().floordiv(2))]
Output:
[ ID Period Food
0 1 1 Ham
1 1 2 Ham
2 1 3 Ham
3 2 1 Cheese
4 2 2 Cheese
5 2 3 Cheese,
ID Period Food
6 3 1 Egg
7 3 2 Egg
8 3 3 Egg
9 4 1 Bacon
10 4 2 Bacon
11 4 3 Bacon]
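Wrapping this up as a reusable helper for any chunk size x (a sketch; split_every is a hypothetical name):

def split_every(df, col, x):
    """Return a list of DataFrames, one per x consecutive unique values of col."""
    buckets = df.groupby(col, sort=False).ngroup() // x
    return [chunk for _, chunk in df.groupby(buckets)]

dfs = split_every(df, 'Food', 2)  # dfs[0] has Ham/Cheese, dfs[1] has Egg/Bacon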
From what I understand, this may help:
for x in df['ID'].unique():
    print(df[df['ID'] == x], '\n')

for x in df['Food'].unique():
    print(df[df['Food'] == x], '\n')
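If you want to keep the pieces rather than print them, a dict comprehension collects them (a small sketch):

chunks = {x: df[df['Food'] == x] for x in df['Food'].unique()}
chunks['Ham']  # rows where Food == 'Ham'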
I have a dataframe which is basically a table that looks like this:
A  B  C  D
1  1  2  3
2  2  1  2
3  1  3  3
Then I have a file, which is a single-column table containing column and value labels:
A
Variable A
1
Red
2
Blue
3
Green
B
Variable B
1
Dog
2
Cat
3
Mouse
C
Variable C
1
Car
2
House
3
Tree
D
Variable D
1
Football
2
Basketball
3
Hockey
My goal is to connect these two files into one dataframe that will look like this:

A. Variable A  B. Variable B  C. Variable C  D. Variable D
Red            Dog            House          Hockey
Blue           Cat            Car            Basketball
Green          Dog            Tree           Hockey
Do you have any idea how to do it? Thanks!
The key is to convert that second file into a dictionary and then map those new values. It's tough to do it with what you provided (as I'd need to see exactly what that 2nd file looks like), but this is the general idea:
import pandas as pd
data1 = {'A':[1,2,3],
'B':[1,2,1],
'C':[2,1,3],
'D':[3,2,3]}
df1 = pd.DataFrame(data1)
data2 = ['A','Variable A',1,'Red',2,'Blue',3,'Green',
'B','Variable B',1,'Dog',2,'Cat',3,'Mouse',
'C','Variable C',1,'Car',2,'House',3,'Tree',
'D','Variable D',1,'Football',2,'Basketball',3,'Hockey']
df2 = pd.DataFrame(data2)
# Convert that second file into something like this
remapDict = {
'A':{'Variable A':{1:'Red',2:'Blue',3:'Green'}},
'B':{'Variable B':{1:'Dog',2:'Cat',3:'Mouse'}},
'C':{'Variable C':{1:'Car',2:'House',3:'Tree'}},
'D':{'Variable D':{1:'Football',2:'Basketball',3:'Hockey'}}}
for col in df1.columns:
    remapData = remapDict[col]
    varName = list(remapData.keys())[0]
    newColName = f'{col}. {varName}'
    df1 = df1.rename(columns={col: newColName})
    df1[newColName] = df1[newColName].map(remapData[varName])
Before:
print(df1)
A B C D
0 1 1 2 3
1 2 2 1 2
2 3 1 3 3
After:
print(df1)
A. Variable A B. Variable B C. Variable C D. Variable D
0 Red Dog House Hockey
1 Blue Cat Car Basketball
2 Green Dog Tree Hockey
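To avoid writing remapDict by hand, here is a minimal sketch of deriving it from the flat single-column data, assuming the file parses into a list like data2 above and always follows the pattern shown: a column letter, then its variable label, then alternating numeric code / text label pairs:

remapDict = {}
i = 0
while i < len(data2):
    col = data2[i]            # e.g. 'A'
    varName = data2[i + 1]    # e.g. 'Variable A'
    mapping = {}
    i += 2
    # consume (code, label) pairs until the next column letter (a string)
    while i < len(data2) and not isinstance(data2[i], str):
        mapping[data2[i]] = data2[i + 1]
        i += 2
    remapDict[col] = {varName: mapping}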
I need to drop some rows from a DataFrame with Python, based on multiple values:
Code Names Country
1 a France
2 b France
3 c USA
4 d Canada
5 e TOTO
6 f TITI
7 g Corona
I need to have this
Code Names Country
1 a France
4 d Canada
5 e TOTO
7 g Corona
I do this :
df.drop(df[('f','b','c')in df['names']].index)
But it doesn't work: KeyError: False
It works for only one key, like this: df.drop(df['f' in df['names']].index)
Do you have any idea ?
To remove rows with certain values:
indexNames = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(indexNames, inplace=True)
print(df)
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
Based on your example, I think this may be what you are looking for.
new_df = df.loc[~df.Names.isin(['f','b','c'])].copy()
new_df
Output:
Code Names Country
0 1 a France
3 4 d Canada
4 5 e TOTO
6 7 g Corona
In pandas, we can use the .drop() function to drop columns and rows.
For dropping specific rows, we need axis=0 (the default).
So your required output can be achieved with the following line of code:
df.drop([1, 2, 5], axis=0)
The output will be:
   Code Names Country
0     1     a  France
3     4     d  Canada
4     5     e    TOTO
6     7     g  Corona
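Note the hard-coded labels [1, 2, 5] assume the default RangeIndex; deriving them from the data keeps the drop robust (a sketch, same idea as the isin answers above):

labels = df[df['Names'].isin(['f', 'b', 'c'])].index
df.drop(labels, axis=0)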
I've been struggling to sort all the columns of my df independently; however, my code seems to sort only by the first column ('Name') and shuffles the rest of the columns along with it, as shown here:
Index Name Age Education Country
0 W 2 BS C
1 V 1 PhD F
2 R 9 MA A
3 A 8 MA A
4 D 7 PhD B
5 C 4 BS C
df.sort_values(by=['Name', 'Age', 'Education', 'Country'],ascending=[True,True, True, True])
Here's what I'm hoping to get:
Index Name Age Education Country
0 A 1 BS A
1 C 2 BS A
2 D 4 MA B
3 R 7 MA C
4 V 8 PhD C
5 W 9 PhD F
Instead, I'm getting the following:
Index Name Age Education Country
3 A 8 MA A
5 C 4 BS C
4 D 7 PhD B
2 R 9 MA A
1 V 1 PhD F
0 W 2 BS C
Could you please shed some light on this issue? Many thanks in advance.
Cheers,
R.
Your code is sorting rows by name, breaking ties by age, then education, and so on.
To get what you want, you can sort each column independently. For example:

for col in df.columns:
    df[col] = sorted(df[col])

But are you sure that's what you want to do? A DataFrame is designed so that each row corresponds to a single entry, e.g. a person, and the columns correspond to attributes like 'name' and 'age'. Sorting the name and age columns separately means people's names and ages get mismatched.
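If what you actually want is the conventional sort, where each row moves as a unit, sort_values on a single key keeps the attributes together:

df.sort_values('Name')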
You can use np.sort along the 0th axis:

import numpy as np

df[:] = np.sort(df.values, axis=0)
df
Index Name Age Education Country
0 0 A 1 BS A
1 1 C 2 BS A
2 2 D 4 MA B
3 3 R 7 MA C
4 4 V 8 PhD C
5 5 W 9 PhD F
Of course, you should beware that sorting columns independently breaks the correspondence between values in the same row and can render your data meaningless.
I need some help with cleaning a DataFrame that has a MultiIndex.
It looks something like this:
                         cost
location        season
Thorp park      autumn   £12
                spring   £13
                summer   £22
Sea life centre summer   £34
                spring   £43
Alton towers    ...      ...
location and season are index columns. I want to go through the data and remove any location that doesn't have values for all three seasons. So "Sea life centre" should be removed.
Can anyone help me with this?
Also another question, my dataframe was created from a groupby command and doesn't have a column name for the "cost" column. Is this normal? There are values in the column, just no header.
Option 1
groupby + count. You can use the result to index your dataframe.
df

     col
a 1    0
  2    1
b 1    3
  2    4
  3    5
c 2    7
  3    8
# transform('count') broadcasts each group's size back to every row;
# selecting the column first gives a Series usable as a boolean mask
v = df.groupby(level=0)['col'].transform('count')
df = df[v == 3]
df

     col
b 1    3
  2    4
  3    5
Option 2
groupby + filter. This is Paul H's idea, will remove if he wants to post.
df.groupby(level=0).filter(lambda g: len(g) == 3)

     col
b 1    3
  2    4
  3    5
Option 1
Thinking outside the box...
df.drop(df.count(level=0).col[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
Same thing with a little more robustness because I'm not depending on values in a column.
df.drop(df.index.to_series().count(level=0).loc[lambda x: x < 3].index)
     col
b 1    3
  2    4
  3    5
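A note for current pandas: count(level=...) was deprecated and later removed (in pandas 2.0), so on recent versions the same idea can be written with groupby; a sketch:

df.drop(df.groupby(level=0).size().loc[lambda x: x < 3].index)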
Option 2
Robust for the general case with an undetermined number of seasons.
This uses Pandas version 0.21's groupby.pipe method
df.groupby(level=0).pipe(lambda g: g.filter(lambda d: len(d) == g.size().max()))
     col
b 1    3
  2    4
  3    5
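pipe hands the whole GroupBy object to the lambda, which is what makes g.size().max() available inside the filter. The same idea without pipe, computing the target size first:

full = df.groupby(level=0).size().max()
df.groupby(level=0).filter(lambda d: len(d) == full)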