How to convert two columns to list of values? - python

I have a dataframe like this below,
A B C D
0 A1 Egypt 10 Yes
1 A1 Morocco 5 No
2 A2 Algeria 4 Yes
3 A3 Egypt 45 No
4 A3 Egypt 17 Yes
5 A3 Tunisia 4 Yes
6 A3 Algeria 32 No
7 A4 Tunisia 7 No
8 A5 Egypt 6 No
9 A5 Morocco 1 No
I want to get the count of yes and no from the column D wrt column B. The expected output needs to be in the lists like this below which can help to plot the multivariable chart.
Expected output:
yes = [1,2,0,1]
no = [1,2,2,1]
country = ['Algeria', 'Egypt', 'Morocco','Tunisia']
I am not sure how to achieve this from the above dataframe. Any help will be appreciated.
Here is the minimum reproducible dataframe sample:
import pandas as pd
df = pd.DataFrame({'A': {0: 'A1',
1: 'A1',
2: 'A2',
3: 'A3',
4: 'A3',
5: 'A3',
6: 'A3',
7: 'A4',
8: 'A5',
9: 'A5'},
'B': {0: 'Egypt',
1: 'Morocco',
2: 'Algeria',
3: 'Egypt',
4: 'Egypt',
5: 'Tunisia',
6: 'Algeria',
7: 'Tunisia',
8: 'Egypt',
9: 'Morocco'},
'C': {0: 10, 1: 5, 2: 4, 3: 45, 4: 17, 5: 4, 6: 32, 7: 7, 8: 6, 9: 1},
'D': {0: 'Yes',
1: 'No',
2: 'Yes',
3: 'No',
4: 'Yes',
5: 'Yes',
6: 'No',
7: 'No',
8: 'No',
9: 'No'}}
)

Use crosstab:
df1 = pd.crosstab(df.B, df.D)
print (df1)
D No Yes
B
Algeria 1 1
Egypt 2 2
Morocco 2 0
Tunisia 1 1
Then, to plot, use DataFrame.plot.bar:
df1.plot.bar()
If you need lists:
yes = df1['Yes'].tolist()
no = df1['No'].tolist()
country = df1.index.tolist()

Create boolean columns flagging "Yes" and "No", then group by "B" and sum the newly created columns (True counts as 1):
country, yes, no = df.assign(Yes=df['D']=='Yes', No=df['D']=='No').groupby('B')[['Yes','No']].sum().reset_index().T.to_numpy().tolist()
Output:
['Algeria', 'Egypt', 'Morocco', 'Tunisia']
[1, 2, 0, 1]
[1, 2, 2, 1]
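The one-liner above can be unpacked into separate steps. This is only a sketch of the same idea on the sample data; the intermediate names flags and counts are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

# Only columns B and D from the question are needed for the counts.
df = pd.DataFrame({
    "B": ["Egypt", "Morocco", "Algeria", "Egypt", "Egypt",
          "Tunisia", "Algeria", "Tunisia", "Egypt", "Morocco"],
    "D": ["Yes", "No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No"],
})

# Step 1: boolean flag columns; True counts as 1 when summed.
flags = df.assign(Yes=df["D"].eq("Yes"), No=df["D"].eq("No"))

# Step 2: sum the flags per country (groupby sorts the keys by default).
counts = flags.groupby("B")[["Yes", "No"]].sum()

# Step 3: pull out plain Python lists for plotting.
country = counts.index.tolist()
yes = counts["Yes"].tolist()
no = counts["No"].tolist()
```

Keeping counts as a dataframe also leaves the option of calling counts.plot.bar() directly instead of going through lists.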


How to add column value based on condition in another dataframe?

I have a dataframe with the main fixed location data:
id name
1 BEL
2 BEL
3 BEL
4 NYC
5 NYC
6 NYC
7 BER
8 BER
I also have second dataframe where I get values for each id and city like this (notice, this dataframe is longer than the main dataframe):
id name value
1 BEL 9
2 BEL 7
3 BEL 3
4 NYC 76
5 NYC 76
6 NYC 23
7 BER 76
8 BER 2
3 BEL 7
4 NYC 5
5 NYC 4
6 NYC 2
I want to check whether the values in the second dataframe are greater than 10. If a value is greater than 10, I want to add a column ['not_ok'] to the first dataframe, with 1 meaning not ok. How can I do this?
I can filter the second dataframe with dff['not_ok'] = np.where(dff['value'] > 10, '1', '0') but since the dff is much longer I don't know how to get that information in the first dataframe.
My goal looks something like this:
id name is_ok
1 BEL 1
2 BEL 1
3 BEL 1
4 NYC 0
5 NYC 0
6 NYC 0
7 BER 0
8 BER 1
To reach the desired output you could try as follows:
import pandas as pd
data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC',
5: 'NYC', 6: 'BER', 7: 'BER'}
}
df = pd.DataFrame(data)
data2 = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7,
7: 8, 8: 3, 9: 4, 10: 5, 11: 6},
'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC',
5: 'NYC', 6: 'BER', 7: 'BER', 8: 'BEL', 9: 'NYC',
10: 'NYC', 11: 'NYC'},
'value': {0: 9, 1: 7, 2: 3, 3: 76, 4: 76, 5: 23, 6: 76,
7: 2, 8: 7, 9: 5, 10: 4, 11: 2}
}
df2 = pd.DataFrame(data2)
df = df.merge(df2[df2['value'].gt(10)], on=['id', 'name'], how='left')\
.rename(columns={'value':'is_ok'})
df['is_ok'] = df['is_ok'].isna().astype(int)
print(df)
id name is_ok
0 1 BEL 1
1 2 BEL 1
2 3 BEL 1
3 4 NYC 0
4 5 NYC 0
5 6 NYC 0
6 7 BER 0
7 8 BER 1
Explanation:
Use Series.gt to get a boolean pd.Series, which we use to select from df2 only the rows that meet the condition value > 10.
Use df.merge to merge this slice from df2 with df and rename column value to is_ok (df.rename).
We now have a column with NaN values where there is no match on id, name, and values > 10 where there is. Use Series.isna to turn this column into booleans.
Finally, we can chain .astype(int) to change True | False into 1 | 0.
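If an id could have more than one value above 10 in df2, the left merge above would repeat that row in df. A possible variant, sketched here under that assumption, deduplicates the matching (id, name) pairs first and relies on merge's indicator column; the names bad and out are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC",
             "BER", "BER", "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# Unique (id, name) pairs that have at least one value above 10.
bad = df2.loc[df2["value"].gt(10), ["id", "name"]].drop_duplicates()

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for rows of df with no offending value.
out = df.merge(bad, on=["id", "name"], how="left", indicator=True)
out["is_ok"] = out["_merge"].eq("left_only").astype(int)
out = out.drop(columns="_merge")
```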
Suppose your first (shorter) dataframe is called 'df_v1' and the second (longer) is called 'df_v2'.
On 'df_v2' prepare the column like this:
df_v2["not_ok"] = df_v2["value"].apply(lambda x: x > 10)
Then, do a join on 'id' & 'name' like this:
df_v1.merge(df_v2[["id", "name", "not_ok"]], on=["id", "name"], how="left")
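Because 'df_v2' holds several rows for some ids, the join above returns one row per match rather than one row per id. One way to get a single flag per id, sketched here as an assumption about the intended result, is to collapse 'df_v2' with groupby and any() before merging; the names flags and result are illustrative only.

```python
import pandas as pd

df_v1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC", "BER", "BER"],
})
df_v2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
    "name": ["BEL", "BEL", "BEL", "NYC", "NYC", "NYC",
             "BER", "BER", "BEL", "NYC", "NYC", "NYC"],
    "value": [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2],
})

# One row per (id, name): not_ok is True if any value exceeds 10.
flags = (
    df_v2.assign(not_ok=df_v2["value"] > 10)
    .groupby(["id", "name"], as_index=False)["not_ok"]
    .any()
)

# The left join now keeps exactly one row per row of df_v1.
result = df_v1.merge(flags, on=["id", "name"], how="left")
```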
You can use the .lt(10) method to flag values less than 10 (labeling values < 10 as 1 and all other values as 0; use .le(10) instead if a value of exactly 10 should also count as ok). Then group by id and take min(), which keeps the minimum flag (0 here) when an id appears more than once in the second DataFrame. Here is the code:
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER', 'BEL', 'NYC', 'NYC', 'NYC'],
'value': [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2]})
df2['is_ok'] = df2['value'].lt(10).astype(int)
df3 = df2[['id', 'name', 'is_ok']].groupby('id').min().reset_index()
print(df3)
# If you want to merge it with the first DataFrame
# df1 = df1.merge(df3[["id", "is_ok"]], on=["id"])
# print(df1)
Output :
id name is_ok
0 1 BEL 1
1 2 BEL 1
2 3 BEL 1
3 4 NYC 0
4 5 NYC 0
5 6 NYC 0
6 7 BER 0
7 8 BER 1

Cannot manipulate Dataframe to calculate zscore using simple looping

I'm following a Datacamp course on "efficient data manipulation" in pandas. In their videos, by way of example, they demonstrate the naive method of looping over the dataframe to calculate the zscore.
I have found this specific course strange with what seem to be errors in the code and I'm wondering if it was done for an older version of Python, but it is more likely just me not getting it.
The Dataframe is basically something like this:
df = pd.DataFrame({'total_bill': {0: 16.99, 1: 10.34, 2: 21.01, 3: 23.68, 4: 24.59, 5: 25.29, 6: 8.77, 7: 26.88, 8: 15.04, 9: 14.78}, 'tip': {0: 1.01, 1: 1.66, 2: 3.5, 3: 3.31, 4: 3.61, 5: 4.71, 6: 2.0, 7: 3.12, 8: 1.96, 9: 3.23}, 'sex': {0: 'Female', 1: 'Male', 2: 'Male', 3: 'Male', 4: 'Female', 5: 'Male', 6: 'Male', 7: 'Male', 8: 'Male', 9: 'Male'}, 'smoker': {0: 'No', 1: 'No', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'No', 7: 'No', 8: 'No', 9: 'No'}, 'day': {0: 'Sun', 1: 'Sun', 2: 'Sun', 3: 'Sun', 4: 'Sun', 5: 'Sun', 6: 'Sun', 7: 'Sun', 8: 'Sun', 9: 'Sun'}, 'time': {0: 'Dinner', 1: 'Dinner', 2: 'Dinner', 3: 'Dinner', 4: 'Dinner', 5: 'Dinner', 6: 'Dinner', 7: 'Dinner', 8: 'Dinner', 9: 'Dinner'}, 'size': {0: 2, 1: 3, 2: 3, 3: 2, 4: 4, 5: 4, 6: 2, 7: 4, 8: 2, 9: 2}})
So the code on the slides is as follows:
mean_female = df.groupby("sex").mean()["total_bill"]["Female"]
mean_male = df.groupby("sex").mean()["total_bill"]["Male"]
std_female = df.groupby("sex").std()["total_bill"]["Female"]
std_male = df.groupby("sex").std()["total_bill"]["Male"]
Followed by this...
for i in range(len(df)):
    if df.iloc[i,2] == "Female":
        df.iloc[i][0] = (df.iloc[i,0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i,0] - mean_male) / std_male
When I run the code (which is from datacamp not mine) I get the usual copy of a slice warning, but (more importantly) NOTHING happens to the data frame.
I assume the objective is to have something like this:
zscore = lambda x: (x - x.mean()) / x.std()
dfsex = df.groupby('sex')
dfzscore = dfsex["total_bill"].transform(zscore)
dfzscore
I'm a little confused so any help figuring this out is much appreciated.
Cheers!
.iloc[i,0] should be used instead of .iloc[i][0]. The dataframe will be updated correctly after fixing this bug. Evidence:
df
Out[58]:
total_bill tip sex smoker day time size
0 -0.707107 1.01 Female No Sun Dinner 2
1 -1.138059 1.66 Male No Sun Dinner 3
2 0.402209 3.50 Male No Sun Dinner 3
3 0.787637 3.31 Male No Sun Dinner 2
4 0.707107 3.61 Female No Sun Dinner 4
5 1.020048 4.71 Male No Sun Dinner 4
6 -1.364696 2.00 Male No Sun Dinner 2
7 1.249573 3.12 Male No Sun Dinner 4
8 -0.459590 1.96 Male No Sun Dinner 2
9 -0.497122 3.23 Male No Sun Dinner 2
Explanation: Let's take a close look at df.iloc[i][0] = value. The first step, df.iloc[i], returns a new Series; because the row holds mixed dtypes, that Series is a copy of the row rather than a view into df. The second step, [0] = value, then assigns into that temporary copy, so df itself is never updated. This is also what the copy-of-a-slice warning is reporting.
In short, every index must be placed inside a single .iloc[] call (or arguably better, .iat[] in this case) for the assignment to reach the dataframe itself.
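As a rough illustration of the fix, the loop can be written with .iat and checked against a groupby transform. This is a sketch on the question's sample values, reduced to two columns; the names stats, expected and looped are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78],
    "sex": ["Female", "Male", "Male", "Male", "Female",
            "Male", "Male", "Male", "Male", "Male"],
})

# Per-group mean and standard deviation, computed once before the loop.
stats = df.groupby("sex")["total_bill"].agg(["mean", "std"])

# Reference result from the vectorized approach.
expected = df.groupby("sex")["total_bill"].transform(
    lambda x: (x - x.mean()) / x.std()
)

looped = df.copy()
for i in range(len(looped)):
    sex = looped.iat[i, 1]
    # Row and column positions go in ONE indexer, so the write hits looped.
    looped.iat[i, 0] = (looped.iat[i, 0] - stats.loc[sex, "mean"]) / stats.loc[sex, "std"]
```

The loop is shown only to mirror the course slides; the transform call gives the same numbers without iterating row by row.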
Use:
df.assign(column0_name=lambda x: np.where(x['column2_name'] == 'Female',
                                          (x['column0_name'] - mean_female) / std_female,
                                          (x['column0_name'] - mean_male) / std_male))
Instead of:
for i in range(len(df)):
    if df.iloc[i,2] == "Female":
        df.iloc[i][0] = (df.iloc[i,0] - mean_female) / std_female
    else:
        df.iloc[i][0] = (df.iloc[i,0] - mean_male) / std_male
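It operates on whole Series at once and is faster than the for loop. Note that assign returns a new dataframe, so assign the result back to df if you want to keep it.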
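With the placeholder column names filled in for the question's data, the np.where approach might look like the sketch below. It assumes the columns are named total_bill and sex as in the sample dataframe; zscored and stats are names chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59,
                   25.29, 8.77, 26.88, 15.04, 14.78],
    "sex": ["Female", "Male", "Male", "Male", "Female",
            "Male", "Male", "Male", "Male", "Male"],
})

# Select the numeric column before aggregating so text columns are ignored.
stats = df.groupby("sex")["total_bill"].agg(["mean", "std"])
mean_female, std_female = stats.loc["Female", "mean"], stats.loc["Female", "std"]
mean_male, std_male = stats.loc["Male", "mean"], stats.loc["Male", "std"]

# np.where picks the female or male z-score element-wise;
# assign returns a new dataframe and leaves df untouched.
zscored = df.assign(
    total_bill=lambda x: np.where(
        x["sex"] == "Female",
        (x["total_bill"] - mean_female) / std_female,
        (x["total_bill"] - mean_male) / std_male,
    )
)
```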

How to count occurrences based on multiple criteria in a DataFrame

I'm trying to figure out how to count a number of occurrences in the DataFrame using multiple criteria.
In this particular example, I'd like to know the number of female passengers in Pclass 3.
PassengerId Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 female 47.0 1 0 363272 7.0000 NaN S
2 894 2 male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 male 27.0 0 0 315154 8.6625 NaN S
4 896 3 female 22.0 1 1 3101298 12.2875 NaN S
Here are my few failed attempts:
len(test[test["Sex"] == "female", test["Pclass"] == 3])
sum(test.Pclass == 3 & test.Sex == "female")
test.[test["Sex"] == "female", test["Pclass"] == 3].count()
None of them seem to be working.
In the end I created my own function, but there must be a simpler way to calculate that.
def countif(sex, pclass):
    x = 0
    for i in range(0,len(test)):
        s = test.iloc[i]['Sex']
        c = test.iloc[i]['Pclass']
        if s == sex and c == pclass:
            x = x + 1
    return x
Thank you in advance
There are a few ways to do this:
import numpy as np
import pandas as pd
test = pd.DataFrame({'PassengerId': {0: 892, 1: 893, 2: 894, 3: 895, 4: 896},
'Pclass': {0: 3, 1: 3, 2: 2, 3: 3, 4: 3},
'Sex': {0: 'male', 1: 'female', 2: 'male', 3: 'male', 4: 'female'},
'Age': {0: 34.5, 1: 47.0, 2: 62.0, 3: 27.0, 4: 22.0},
'SibSp': {0: 0, 1: 1, 2: 0, 3: 0, 4: 1},
'Parch': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1},
'Ticket': {0: 330911, 1: 363272, 2: 240276, 3: 315154, 4: 3101298},
'Fare': {0: 7.8292, 1: 7.0, 2: 9.6875, 3: 8.6625, 4: 12.2875},
'Cabin': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
'Embarked': {0: 'Q', 1: 'S', 2: 'Q', 3: 'S', 4: 'S'}})
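You need to wrap each boolean condition in parentheses and join them with &. Without the parentheses, & binds more tightly than ==, so Python evaluates 3 & test.Sex first and the comparison fails.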
sum((test.Pclass == 3) & (test.Sex == "female"))
len(test[(test.Pclass == 3) & (test.Sex == "female")])
test[(test["Sex"] == "female") & (test["Pclass"] == 3)].shape[0]
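A small sketch of the same count, reduced to the two relevant columns. It builds the combined mask once and, as an alternative not mentioned above, also uses DataFrame.query, where the conditions are written as a string and joined with and. The names mask, count_mask and count_query are illustrative.

```python
import pandas as pd

test = pd.DataFrame({
    "Pclass": [3, 3, 2, 3, 3],
    "Sex": ["male", "female", "male", "male", "female"],
})

# Each comparison is parenthesized so that & combines two boolean Series.
mask = (test["Pclass"] == 3) & (test["Sex"] == "female")

# True counts as 1, so summing the mask gives the number of matching rows.
count_mask = int(mask.sum())

# Same filter expressed as a query string.
count_query = len(test.query("Pclass == 3 and Sex == 'female'"))
```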
Or you can do:
tab = pd.crosstab(test.Pclass, test.Sex)
Sex female male
Pclass
2 0 1
3 2 2
tab.loc[3, 'female']  # scalar count for Pclass 3, female

groupby + apply results in a series appearing both in index and column - how to prevent it?

I've got the following data frame:
dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
'var1': {0: 20.272108843537413,
1: 21.088435374149658,
2: 20.68027210884354,
3: 23.945578231292515,
4: 22.857142857142854,
5: 21.496598639455787,
6: 39.18367346938776,
7: 36.46258503401361,
8: 34.965986394557824},
'var2': {0: 27.731092436974773,
1: 43.907563025210074,
2: 55.67226890756303,
3: 62.81512605042017,
4: 71.63865546218487,
5: 83.40336134453781,
6: 43.48739495798319,
7: 59.243697478991606,
8: 67.22689075630252},
'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
I wanted to sort within groups according to var1, so I wrote the following:
ex.groupby('var3').apply(lambda x: x.sort_values('var1'))
However, it results in a data frame which has var3 both in index and in column. How to prevent that and leave it only in a column?
Pass the optional parameter as_index=False to groupby:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1'))
Or, if you don't want a MultiIndex:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1')) \
.reset_index(level=0, drop=True)
You could use:
df_sorted=ex.groupby('var3').apply(lambda x: x.sort_values('var1')).reset_index(drop=True)
print(df_sorted)
var1 var2 var3
0 20.272109 27.731092 1
1 20.680272 55.672269 1
2 21.088435 43.907563 1
3 21.496599 83.403361 1
4 22.857143 71.638655 1
5 23.945578 62.815126 1
6 34.965986 67.226891 2
7 36.462585 59.243697 2
8 39.183673 43.487395 2
But you only need DataFrame.sort_values, sorting first by var3 and then by var1:
df_sort=ex.sort_values(['var3','var1'])
print(df_sort)
var1 var2 var3
id
11 20.272109 27.731092 1
13 20.680272 55.672269 1
12 21.088435 43.907563 1
16 21.496599 83.403361 1
15 22.857143 71.638655 1
14 23.945578 62.815126 1
17 34.965986 67.226891 2
18 36.462585 59.243697 2
19 39.183673 43.487395 2

Pandas concatenate levels in multiindex

I have the following Excel file:
{0: {0: nan, 1: nan, 2: nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: nan, 1: nan, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
I would like to create following dataframe:
level_0 AA1 AA2 BB1 BB2 CC1 CC2
new a ng/mL a N/A b N/A b mL c N/A c N/A
0 1
A 1 1 1 1 1 1 1
2 1 1 1 1 1 1
B 1 1 1 1 1 1 1
2 1 1 1 1 1 1
C 1 1 1 1 1 1 1
2 1 1 1 1 1 1
What I tried:
# read the column index separately to avoid pandas inputting "Unnamed: ..."
# for the nans
df = pd.read_excel(file_path, skiprows=3, index_col=None, header=None)
df.set_index([0, 1], inplace=True)
# the column index
cols = pd.read_excel(file_path, nrows=3, index_col=None, header=None).loc[:, 2:]
cols = cols.fillna('N/A')
idx = pd.MultiIndex.from_arrays(cols.values)
df.columns = idx
The new dataframe:
AA1 AA2 BB1 BB2 CC1 CC2
a a b b c c
ng/mL N/A N/A mL N/A N/A
0 1
A 1 1 1 1 1 1 1
2 1 1 1 1 1 1
B 1 1 1 1 1 1 1
2 1 1 1 1 1 1
C 1 1 1 1 1 1 1
2 1 1 1 1 1 1
This approach works but is kind of tedious:
df1 = df.T.reset_index()
df1['new'] = df1.loc[:, 'level_1'] + ' ' + df1.loc[:, 'level_2']
df1.set_index(['level_0', 'new']).drop(['level_1', 'level_2'], axis=1).T
Which gives me:
level_0 AA1 AA2 BB1 BB2 CC1 CC2
new a ng/mL a N/A b N/A b mL c N/A c N/A
0 1
A 1 1 1 1 1 1 1
2 1 1 1 1 1 1
B 1 1 1 1 1 1 1
2 1 1 1 1 1 1
C 1 1 1 1 1 1 1
2 1 1 1 1 1 1
Is there a simpler solution available?
Use:
#file from sample data
d = {0: {0: np.nan, 1: np.nan, 2: np.nan, 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'C', 8: 'C'},
1: {0: np.nan, 1: np.nan, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 2.0, 7: 1.0, 8: 2.0},
2: {0: 'AA1', 1: 'a', 2: 'ng/mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
3: {0: 'AA2', 1: 'a', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
4: {0: 'BB1', 1: 'b', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
5: {0: 'BB2', 1: 'b', 2: 'mL', 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
6: {0: 'CC1', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
7: {0: 'CC2', 1: 'c', 2: np.nan, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}}
df = pd.DataFrame(d)
df.to_excel('file.xlsx', header=False, index=False)
First read the file with header=[0,1,2] to get a DataFrame with MultiIndex columns, then build the row MultiIndex from the first 2 columns with DataFrame.set_index and remove the index names with DataFrame.rename_axis:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
Then loop over the column tuples in a list comprehension, joining the second level with the third (or with 'N/A' if the third starts with Unnamed), and finally use MultiIndex.from_tuples:
tuples = [(a, f'{b} N/A') if c.startswith('Unnamed')
else (a, f'{b} {c}')
for a, b, c in df.columns]
print (tuples)
[('AA1', 'a ng/mL'), ('AA2', 'a N/A'),
('BB1', 'b N/A'), ('BB2', 'b mL'),
('CC1', 'c N/A'), ('CC2', 'c N/A')]
df.columns = pd.MultiIndex.from_tuples(tuples)
print (df)
AA1 AA2 BB1 BB2 CC1 CC2
a ng/mL a N/A b N/A b mL c N/A c N/A
A 1 1 1 1 1 1 1
2 1 1 1 1 1 1
B 1 1 1 1 1 1 1
2 1 1 1 1 1 1
C 1 1 1 1 1 1 1
2 1 1 1 1 1 1
Another idea is to use:
df = pd.read_excel('file.xlsx', header=[0,1,2])
df = df.set_index(df.columns[:2].tolist()).rename_axis((None, None))
lv1 = df.columns.get_level_values(0)
lv2 = df.columns.get_level_values(1)
lv3 = df.columns.get_level_values(2)
lv3 = lv3.where(~lv3.str.startswith('Unnamed'),'N/A')
df.columns = [lv1, lv2.to_series() + ' ' + lv3]
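If the three-level column index has already been built with 'N/A' placeholders, as in the question's own attempt, the second and third levels can be joined without re-reading the file. The following is a minimal sketch on an in-memory frame that stands in for the Excel data; the level names level_0 and new are taken from the desired output in the question.

```python
import pandas as pd

# Three-level columns as produced by the question's fillna('N/A') step.
cols = pd.MultiIndex.from_arrays([
    ["AA1", "AA2", "BB1", "BB2", "CC1", "CC2"],
    ["a", "a", "b", "b", "c", "c"],
    ["ng/mL", "N/A", "N/A", "mL", "N/A", "N/A"],
])
rows = pd.MultiIndex.from_product([["A", "B", "C"], [1, 2]])
df = pd.DataFrame(1, index=rows, columns=cols)

# Iterating a MultiIndex yields one tuple per column; keep the first level
# and join the other two with a space.
df.columns = pd.MultiIndex.from_tuples(
    [(a, f"{b} {c}") for a, b, c in df.columns],
    names=["level_0", "new"],
)
```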
