Pandas concatenate levels in multiindex - python

I have the following Excel file:
{0: {0: nan, 1: nan, 2: nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: nan, 1: nan, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
I would like to create the following dataframe:
level_0  AA1      AA2    BB1    BB2   CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1      1        1      1      1     1      1
  2      1        1      1      1     1      1
B 1      1        1      1      1     1      1
  2      1        1      1      1     1      1
C 1      1        1      1      1     1      1
  2      1        1      1      1     1      1
What I tried:
import pandas as pd

# read the data and the column index separately to avoid pandas inserting
# "Unnamed: ..." for the NaNs
df = pd.read_excel(file_path, skiprows=3, index_col=None, header=None)
df.set_index([0, 1], inplace=True)
# the column index
cols = pd.read_excel(file_path, nrows=3, index_col=None, header=None).loc[:, 2:]
cols = cols.fillna('N/A')
idx = pd.MultiIndex.from_arrays(cols.values)
df.columns = idx
The new dataframe:
    AA1    AA2  BB1  BB2  CC1  CC2
    a      a    b    b    c    c
    ng/mL  N/A  N/A  mL   N/A  N/A
0 1
A 1 1      1    1    1    1    1
  2 1      1    1    1    1    1
B 1 1      1    1    1    1    1
  2 1      1    1    1    1    1
C 1 1      1    1    1    1    1
  2 1      1    1    1    1    1
This approach works but is kind of tedious:
df1 = df.T.reset_index()
df1['new'] = df1.loc[:, 'level_1'] + ' ' + df1.loc[:, 'level_2']
df1.set_index(['level_0', 'new']).drop(['level_1', 'level_2'], axis=1).T
Which gives me:
level_0  AA1      AA2    BB1    BB2   CC1    CC2
new      a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
0 1
A 1 1 1 1 1 1 1
2 1 1 1 1 1 1
B 1 1 1 1 1 1 1
2 1 1 1 1 1 1
C 1 1 1 1 1 1 1
2 1 1 1 1 1 1
Is there a simpler solution available?

Use:
import numpy as np
import pandas as pd

# recreate the file from the sample data
d = {0: {0: np.nan, 1: np.nan, 2: np.nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
df = pd.DataFrame(d)
df.to_excel('file.xlsx', header=False, index=False)
First create a MultiIndex DataFrame with header=[0,1,2], then build the row MultiIndex from the first 2 columns with DataFrame.set_index and remove the index names with DataFrame.rename_axis:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
Then loop over each column tuple in a list comprehension and join the second and third levels, substituting 'N/A' where the third level is an 'Unnamed: ...' placeholder; last, use MultiIndex.from_tuples:
tuples = [(a, f'{b} N/A') if c.startswith('Unnamed')
          else (a, f'{b} {c}')
          for a, b, c in df.columns]
print (tuples)
[('AA1', 'a ng/mL'), ('AA2', 'a N/A'),
('BB1', 'b N/A'), ('BB2', 'b mL'),
('CC1', 'c N/A'), ('CC2', 'c N/A')]
df.columns = pd.MultiIndex.from_tuples(tuples)
print (df)
    AA1      AA2    BB1    BB2   CC1    CC2
    a ng/mL  a N/A  b N/A  b mL  c N/A  c N/A
A 1 1        1      1      1     1      1
  2 1        1      1      1     1      1
B 1 1        1      1      1     1      1
  2 1        1      1      1     1      1
C 1 1        1      1      1     1      1
  2 1        1      1      1     1      1
Another idea is to use:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
lv1 = df.columns.get_level_values(0)
lv2 = df.columns.get_level_values(1)
lv3 = df.columns.get_level_values(2)
lv3 = lv3.where(~lv3.str.startswith('Unnamed'),'N/A')
df.columns = [lv1, lv2.to_series() + ' ' + lv3]
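The same rewrite can also be expressed with Index.map; a minimal sketch, assuming df was freshly read with header=[0,1,2] and indexed as above (pandas rebuilds a MultiIndex when the mapped function returns tuples):
# map over the 3-level column tuples and return 2-level tuples
df.columns = df.columns.map(
    lambda t: (t[0], f"{t[1]} {'N/A' if str(t[2]).startswith('Unnamed') else t[2]}")
)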


Dataframe Groupby ID to JSON

I'm trying to convert the following dataframe into a JSON file:
id email surveyname question answer
1 lol#gmail s 1 apple
1 lol#gmail s 3 apple/juice
1 lol#gmail s 2 apple-pie
1 lol#gmail s 4 apple-pie
1 lol#gmail s 5 apple|pie|yes
1 lol#gmail s 6 apple
1 lol#gmail s 8 apple
1 lol#gmail s 7 apple
1 lol#gmail s 9 apple
1 lol#gmail s 12 apple
1 lol#gmail s 11 apple
1 lol#gmail s 10 apple_sauce
2 ll#gmail s 1 orange
2 ll#gmail s 3 juice
.
.
To:
{
"df":[
{
"id":"1",
"email:"lol#gmail"
"surveyname":"s",
"1":"apple",
"2":"apple-pie",
"3":"apple/juice",
"4":"apple-pie",
"5":"apple|pie|yes",
"6":"apple",
"7":"apple",
"8":"apple",
"9":"apple",
"10":"apple_sauce",
"11":"apple",
"12":"apple"
},
{
"id": "vid",
"email:"llgmail"
"surveyname: "s"
"1":"orange",
"2":"", # empty
"3":"juice",
.
.
.
}
]
}
It should map all the ids in the df and skip the numbers if they're empty.
Below is a sample for the df I used above. If the whole df for id = 2 needs to be constructed, please let me know and I can edit that in. However, some entries don't have completed values inside the actual df.
import pandas as pd

d = {'id': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 1,
7: 1,
8: 1,
9: 1,
10: 1,
11: 1,
12: 2,
13: 2},
'email': {0: 'lol#gmail',
1: 'lol#gmail',
2: 'lol#gmail',
3: 'lol#gmail',
4: 'lol#gmail',
5: 'lol#gmail',
6: 'lol#gmail',
7: 'lol#gmail',
8: 'lol#gmail',
9: 'lol#gmail',
10: 'lol#gmail',
11: 'lol#gmail',
12: 'll#gmail',
13: 'll#gmail'},
'surveyname': {0: 's',
1: 's',
2: 's',
3: 's',
4: 's',
5: 's',
6: 's',
7: 's',
8: 's',
9: 's',
10: 's',
11: 's',
12: 's',
13: 's'},
'question': {0: 1,
1: 3,
2: 2,
3: 4,
4: 5,
5: 6,
6: 8,
7: 7,
8: 9,
9: 12,
10: 11,
11: 10,
12: 1,
13: 3},
'answer': {0: 'apple',
1: 'apple/juice',
2: 'apple-pie',
3: 'apple-pie',
4: 'apple|pie|yes',
5: 'apple',
6: 'apple',
7: 'apple',
8: 'apple',
9: 'apple',
10: 'apple',
11: 'apple_sauce',
12: 'orange',
13: 'juice'}}
df = pd.DataFrame.from_dict(d)
You can pivot the dataframe before exporting to JSON:
import numpy as np

(
    df.pivot_table(
        index=["id", "email", "surveyname"],
        columns="question",
        values="answer",
        aggfunc="first",
    )
    .reindex(columns=np.arange(1, 13))  # make sure questions 1-12 all appear
    .fillna("")                         # empty string for unanswered questions
    .reset_index()
    .to_json("data.json", orient="records")
)
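Note that orient="records" writes a bare JSON list; if the wrapping {"df": [...]} object from the desired output is required, one sketch is to build the records with to_dict and dump them yourself (json.dump converts the integer question keys to strings):
import json

records = (
    df.pivot_table(
        index=["id", "email", "surveyname"],
        columns="question",
        values="answer",
        aggfunc="first",
    )
    .reindex(columns=np.arange(1, 13))
    .fillna("")
    .reset_index()
    .to_dict(orient="records")   # list of one dict per (id, email, surveyname)
)
with open("data.json", "w") as fh:
    json.dump({"df": records}, fh, indent=2)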

How to convert two columns to list of values?

I have a dataframe like the one below:
A B C D
0 A1 Egypt 10 Yes
1 A1 Morocco 5 No
2 A2 Algeria 4 Yes
3 A3 Egypt 45 No
4 A3 Egypt 17 Yes
5 A3 Tunisia 4 Yes
6 A3 Algeria 32 No
7 A4 Tunisia 7 No
8 A5 Egypt 6 No
9 A5 Morocco 1 No
I want to get the count of Yes and No from column D with respect to column B. The expected output needs to be lists like those below, which can help to plot a multivariable chart.
Expected output:
yes = [1,2,0,1]
no = [1,2,2,1]
country = ['Algeria', 'Egypt', 'Morocco','Tunisia']
I am not sure how to achieve this from the above dataframe. Any help will be appreciated.
Here is a minimal reproducible dataframe sample:
import pandas as pd
df = pd.DataFrame({'A': {0: 'A1',
1: 'A1',
2: 'A2',
3: 'A3',
4: 'A3',
5: 'A3',
6: 'A3',
7: 'A4',
8: 'A5',
9: 'A5'},
'B': {0: 'Egypt',
1: 'Morocco',
2: 'Algeria',
3: 'Egypt',
4: 'Egypt',
5: 'Tunisia',
6: 'Algeria',
7: 'Tunisia',
8: 'Egypt',
9: 'Morocco'},
'C': {0: 10, 1: 5, 2: 4, 3: 45, 4: 17, 5: 4, 6: 32, 7: 7, 8: 6, 9: 1},
'D': {0: 'Yes',
1: 'No',
2: 'Yes',
3: 'No',
4: 'Yes',
5: 'Yes',
6: 'No',
7: 'No',
8: 'No',
9: 'No'}}
)
Use crosstab:
df1 = pd.crosstab(df.B, df.D)
print (df1)
D No Yes
B
Algeria 1 1
Egypt 2 2
Morocco 2 0
Tunisia 1 1
Then, for the plot, use DataFrame.plot.bar:
df1.plot.bar()
If you need lists:
yes = df1['Yes'].tolist()
no = df1['No'].tolist()
country = df1.index.tolist()
Create new columns by counting "yes", "no"; then groupby "B" and use sum on the newly created columns:
country, yes, no = (
    df.assign(Yes=df['D'] == 'Yes', No=df['D'] == 'No')
      .groupby('B')[['Yes', 'No']].sum()
      .reset_index()
      .T.to_numpy().tolist()
)
Output:
['Algeria', 'Egypt', 'Morocco', 'Tunisia']
[1, 2, 0, 1]
[1, 2, 2, 1]
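Either way, the three lists feed straight into a grouped bar chart; a minimal matplotlib sketch (assuming matplotlib is installed):
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(len(country))            # one slot per country
width = 0.35                           # width of each bar
fig, ax = plt.subplots()
ax.bar(x - width / 2, yes, width, label='Yes')
ax.bar(x + width / 2, no, width, label='No')
ax.set_xticks(x)
ax.set_xticklabels(country)
ax.set_ylabel('count')
ax.legend()
plt.show()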

Adding missing rows and setting column value to zero based on current dataframe

import pandas as pd

dic = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb'},
'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1},
'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}}
tasks = pd.DataFrame(dic)
I'm trying to make every id/name pair have a row for every unique activity. For example, I want "Joe" to have rows where the activity column is "Run" and "Climb", but with a 0 in the tasks_completed column (those rows not being present already means he hasn't done those activity tasks). I have tried using df.iterrows(), making a list of the unique ids and activity names, and checking whether both are present, but it didn't work. Any help is very appreciated!
This is what I am hoping to have:
tasks_new = {'distinct_id': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 1,
6: 1,
7: 2,
8: 2,
9: 3,
10: 3,
11: 4,
12: 4,
13: 5,
14: 5},
'first_name': {0: 'Joe',
1: 'Barry',
2: 'David',
3: 'Marcus',
4: 'Anthony',
5: 'Joe',
6: 'Joe',
7: 'Barry',
8: 'Barry',
9: 'David',
10: 'David',
11: 'Marcus',
12: 'Marcus',
13: 'Anthony',
14: 'Anthony'},
'activity': {0: 'Jump',
1: 'Jump',
2: 'Run',
3: 'Run',
4: 'Climb',
5: 'Run',
6: 'Climb',
7: 'Run',
8: 'Climb',
9: 'Jump',
10: 'Climb',
11: 'Climb',
12: 'Jump',
13: 'Run',
14: 'Jump'},
'tasks_completed': {0: 3,
1: 3,
2: 3,
3: 3,
4: 1,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 0,
13: 0,
14: 0},
'tasks_available': {0: 3,
1: 3,
2: 3,
3: 3,
4: 3,
5: 3,
6: 3,
7: 3,
8: 3,
9: 3,
10: 3,
11: 3,
12: 3,
13: 3,
14: 3}}
pd.DataFrame(tasks_new)
Set all three columns as the index, unstack the activity level with fill_value=0, then stack it back:
idx_cols = ['distinct_id', 'first_name', 'activity']
tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
distinct_id first_name activity tasks_completed tasks_available
0 1 Joe Climb 0 0
1 1 Joe Jump 3 3
2 1 Joe Run 0 0
3 2 Barry Climb 0 0
4 2 Barry Jump 3 3
5 2 Barry Run 0 0
6 3 David Climb 0 0
7 3 David Jump 0 0
8 3 David Run 3 3
9 4 Marcus Climb 0 0
10 4 Marcus Jump 0 0
11 4 Marcus Run 3 3
12 5 Anthony Climb 1 3
13 5 Anthony Jump 0 0
14 5 Anthony Run 0 0
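A merge-based sketch that stays in long form (assuming pandas >= 1.2 for how='cross'): build the full person x activity grid, left-join the real data, then fill the gaps with 0:
pairs = tasks[['distinct_id', 'first_name']].drop_duplicates()
acts = tasks[['activity']].drop_duplicates()
grid = pairs.merge(acts, how='cross')       # every person x every activity
out = (
    grid.merge(tasks, on=['distinct_id', 'first_name', 'activity'], how='left')
        .fillna({'tasks_completed': 0, 'tasks_available': 0})
        .astype({'tasks_completed': int, 'tasks_available': int})
)
This reproduces the unstack/stack output above; if tasks_available should instead stay at 3 for the added rows, fill that column from a per-activity lookup rather than with 0.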

How to count occurrences based on multiple criteria in a DataFrame

I'm trying to figure out how to count the number of occurrences in a DataFrame using multiple criteria.
In this particular example, I'd like to know the number of female passengers in Pclass 3.
PassengerId Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 female 47.0 1 0 363272 7.0000 NaN S
2 894 2 male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 male 27.0 0 0 315154 8.6625 NaN S
4 896 3 female 22.0 1 1 3101298 12.2875 NaN S
Here are my few failed attempts:
len(test[test["Sex"] == "female", test["Pclass"] == 3])
sum(test.Pclass == 3 & test.Sex == "female")
test.[test["Sex"] == "female", test["Pclass"] == 3].count()
None of them seem to be working.
In the end, I created my own function, but there must be a simpler way to calculate this.
def countif(sex, pclass):
    x = 0
    for i in range(0, len(test)):
        s = test.iloc[i]['Sex']
        c = test.iloc[i]['Pclass']
        if s == sex and c == pclass:
            x = x + 1
    return x
Thank you in advance
There are a few ways to do this:
import numpy as np
import pandas as pd

test = pd.DataFrame({'PassengerId': {0: 892, 1: 893, 2: 894, 3: 895, 4: 896},
'Pclass': {0: 3, 1: 3, 2: 2, 3: 3, 4: 3},
'Sex': {0: 'male', 1: 'female', 2: 'male', 3: 'male', 4: 'female'},
'Age': {0: 34.5, 1: 47.0, 2: 62.0, 3: 27.0, 4: 22.0},
'SibSp': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1},
'Parch': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'Ticket': {0: 330911, 1: 363272, 2: 240276, 3: 315154, 4: 3101298},
'Fare': {0: 7.8292, 1: 7.0, 2: 9.6875, 3: 8.6625, 4: 12.2875},
'Cabin': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'Embarked': {0: 'Q', 1: 'S', 2: 'Q', 3: 'S', 4: 'S'}})
You need to put each boolean condition in parentheses and join them with &:
sum((test.Pclass == 3) & (test.Sex == "female"))
len(test[(test.Pclass == 3) & (test.Sex == "female")])
test[(test["Sex"] == "female") & (test["Pclass"] == 3)].shape[0]
Or you can do:
tab = pd.crosstab(test.Pclass, test.Sex)
Sex female male
Pclass
2 0 1
3 2 2
tab.loc[3, 'female']
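Two more idioms worth knowing; sketches, assuming pandas >= 1.1 for DataFrame.value_counts:
# DataFrame.query keeps the condition readable as one expression
len(test.query('Sex == "female" and Pclass == 3'))

# value_counts over both columns gives every combination at once
test.value_counts(['Pclass', 'Sex']).get((3, 'female'), 0)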

groupby + apply results in a series appearing both in index and column - how to prevent it?

I've got the following data frame:
import pandas as pd

dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
I wanted to sort within groups according to var1, so I wrote the following:
ex.groupby('var3').apply(lambda x: x.sort_values('var1'))
However, it results in a data frame which has var3 both in index and in column. How to prevent that and leave it only in a column?
Add the optional parameter as_index=False to groupby:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1'))
Or, if you don't want a MultiIndex:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1')) \
.reset_index(level=0, drop=True)
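A related sketch: groupby also accepts group_keys=False, which avoids adding the group keys to the index in the first place:
ex.groupby('var3', group_keys=False).apply(lambda x: x.sort_values('var1'))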
You could use:
df_sorted = ex.groupby('var3').apply(lambda x: x.sort_values('var1')).reset_index(drop=True)
print(df_sorted)
var1 var2 var3
0 20.272109 27.731092 1
1 20.680272 55.672269 1
2 21.088435 43.907563 1
3 21.496599 83.403361 1
4 22.857143 71.638655 1
5 23.945578 62.815126 1
6 34.965986 67.226891 2
7 36.462585 59.243697 2
8 39.183673 43.487395 2
But you only need DataFrame.sort_values, sorting first by var3 and then by var1:
df_sort=ex.sort_values(['var3','var1'])
print(df_sort)
var1 var2 var3
id
11 20.272109 27.731092 1
13 20.680272 55.672269 1
12 21.088435 43.907563 1
16 21.496599 83.403361 1
15 22.857143 71.638655 1
14 23.945578 62.815126 1
17 34.965986 67.226891 2
18 36.462585 59.243697 2
19 39.183673 43.487395 2
