How to add a column value based on a condition in another dataframe? - python

I have a dataframe with the main fixed location data:
id name
1 BEL
2 BEL
3 BEL
4 NYC
5 NYC
6 NYC
7 BER
8 BER
I also have a second dataframe where I get values for each id and city, like this (notice that this dataframe is longer than the main one):
id name value
1 BEL 9
2 BEL 7
3 BEL 3
4 NYC 76
5 NYC 76
6 NYC 23
7 BER 76
8 BER 2
3 BEL 7
4 NYC 5
5 NYC 4
6 NYC 2
My goal is to check whether the values in the second dataframe are greater than 10. If a value is greater than 10, I want to add a column ['not_ok'] to the first dataframe, with 1 meaning not ok. How can I do this?
I can filter the second dataframe with dff['not_ok'] = np.where(dff['value'] > 10, '1', '0'), but since dff is much longer, I don't know how to get that information into the first dataframe.
My goal looks something like this:
id name is_ok
1 BEL 1
2 BEL 1
3 BEL 1
4 NYC 0
5 NYC 0
6 NYC 0
7 BER 0
8 BER 1

To reach the desired output, you could try the following:
import pandas as pd

data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8},
        'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC',
                 5: 'NYC', 6: 'BER', 7: 'BER'}}
df = pd.DataFrame(data)

data2 = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7,
                7: 8, 8: 3, 9: 4, 10: 5, 11: 6},
         'name': {0: 'BEL', 1: 'BEL', 2: 'BEL', 3: 'NYC', 4: 'NYC',
                  5: 'NYC', 6: 'BER', 7: 'BER', 8: 'BEL', 9: 'NYC',
                  10: 'NYC', 11: 'NYC'},
         'value': {0: 9, 1: 7, 2: 3, 3: 76, 4: 76, 5: 23, 6: 76,
                   7: 2, 8: 7, 9: 5, 10: 4, 11: 2}}
df2 = pd.DataFrame(data2)

# Keep only df2 rows with value > 10, left-merge onto df, then flag
# ids with no match (i.e. all values <= 10) as ok.
df = df.merge(df2[df2['value'].gt(10)], on=['id', 'name'], how='left') \
       .rename(columns={'value': 'is_ok'})
df['is_ok'] = df['is_ok'].isna().astype(int)
print(df)
id name is_ok
0 1 BEL 1
1 2 BEL 1
2 3 BEL 1
3 4 NYC 0
4 5 NYC 0
5 6 NYC 0
6 7 BER 0
7 8 BER 1
Explanation:
Use Series.gt to get a boolean pd.Series, which we use to select from df2 only the rows that meet the condition value > 10.
Use df.merge to merge this slice of df2 with df, and rename the column value to is_ok (df.rename).
We now have a column holding NaN where there is no match on id and name, and the original value (> 10) where there is. Use Series.isna to turn this column into booleans.
Finally, chain .astype(int) to turn True | False into 1 | 0.
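One caveat: if df2 contained more than one row with value > 10 for the same (id, name) pair, the left merge would duplicate rows in df. A minimal sketch that guards against this by de-duplicating the slice first (not needed for the sample data above; it rebuilds the original df from data to avoid the mutation in the snippet above):
# Keep at most one value > 10 row per (id, name), so the merge
# cannot multiply rows in the left dataframe.
over = df2[df2['value'].gt(10)].drop_duplicates(subset=['id', 'name'])
safe = pd.DataFrame(data).merge(over, on=['id', 'name'], how='left') \
                         .rename(columns={'value': 'is_ok'})
safe['is_ok'] = safe['is_ok'].isna().astype(int)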

Suppose your first (shorter) dataframe is called 'df_v1' and the second (longer) one is called 'df_v2'.
On 'df_v2', prepare the column like this:
df_v2["not_ok"] = df_v2["value"].apply(lambda x: x > 10)
Then, do a join on 'id' & 'name' like this:
df_v1.merge(df_v2[["id", "name", "not_ok"]], on=["id", "name"], how="left")
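Since df_v2 contains several rows per (id, name) pair, this left join can yield more than one row per pair in the result. A minimal sketch that collapses df_v2 to one flag per pair first, assuming a pair counts as not ok when any of its values exceeds 10:
# One flag per (id, name): True if any value > 10 for that pair.
flags = (df_v2.assign(not_ok=df_v2["value"] > 10)
              .groupby(["id", "name"], as_index=False)["not_ok"].any())
result = df_v1.merge(flags, on=["id", "name"], how="left")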

You can use the .lt(10) method to get the values less than 10 (labeling values < 10 as 1 and values >= 10 as 0). Then you group by id using the min() function to keep the minimum value (0 here) in case of duplicate ids in the second DataFrame. Here is the code:
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8],
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER']})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 3, 4, 5, 6],
                    'name': ['BEL', 'BEL', 'BEL', 'NYC', 'NYC', 'NYC', 'BER', 'BER',
                             'BEL', 'NYC', 'NYC', 'NYC'],
                    'value': [9, 7, 3, 76, 76, 23, 76, 2, 7, 5, 4, 2]})

df2['is_ok'] = df2['value'].lt(10).astype(int)
# min() also runs on 'name' (alphabetically), which is safe here because
# each id always carries the same name.
df3 = df2[['id', 'name', 'is_ok']].groupby('id').min().reset_index()
print(df3)

# If you want to merge it with the first DataFrame:
# df1 = df1.merge(df3[["id", "is_ok"]], on=["id"])
# print(df1)
Output:
id name is_ok
0 1 BEL 1
1 2 BEL 1
2 3 BEL 1
3 4 NYC 0
4 5 NYC 0
5 6 NYC 0
6 7 BER 0
7 8 BER 1

Related

How to convert two columns to list of values?

I have a dataframe like the one below:
A B C D
0 A1 Egypt 10 Yes
1 A1 Morocco 5 No
2 A2 Algeria 4 Yes
3 A3 Egypt 45 No
4 A3 Egypt 17 Yes
5 A3 Tunisia 4 Yes
6 A3 Algeria 32 No
7 A4 Tunisia 7 No
8 A5 Egypt 6 No
9 A5 Morocco 1 No
I want to get the counts of Yes and No in column D with respect to column B. The expected output needs to be in lists like those below, which can help to plot a multivariable chart.
Expected output:
yes = [1,2,0,1]
no = [1,2,2,1]
country = ['Algeria', 'Egypt', 'Morocco','Tunisia']
I am not sure how to achieve this from the above dataframe. Any help will be appreciated.
Here is a minimal reproducible dataframe sample:
import pandas as pd

df = pd.DataFrame({'A': {0: 'A1', 1: 'A1', 2: 'A2', 3: 'A3', 4: 'A3',
                         5: 'A3', 6: 'A3', 7: 'A4', 8: 'A5', 9: 'A5'},
                   'B': {0: 'Egypt', 1: 'Morocco', 2: 'Algeria', 3: 'Egypt',
                         4: 'Egypt', 5: 'Tunisia', 6: 'Algeria', 7: 'Tunisia',
                         8: 'Egypt', 9: 'Morocco'},
                   'C': {0: 10, 1: 5, 2: 4, 3: 45, 4: 17, 5: 4, 6: 32, 7: 7,
                         8: 6, 9: 1},
                   'D': {0: 'Yes', 1: 'No', 2: 'Yes', 3: 'No', 4: 'Yes',
                         5: 'Yes', 6: 'No', 7: 'No', 8: 'No', 9: 'No'}})
Use crosstab:
df1 = pd.crosstab(df.B, df.D)
print(df1)
D No Yes
B
Algeria 1 1
Egypt 2 2
Morocco 2 0
Tunisia 1 1
Then, to plot, use DataFrame.plot.bar:
df1.plot.bar()
If you need lists:
yes = df1['Yes'].tolist()
no = df1['No'].tolist()
country = df1.index.tolist()
Create new columns by counting "Yes" and "No"; then group by "B" and use sum on the newly created columns:
country, yes, no = (df.assign(Yes=df['D'] == 'Yes', No=df['D'] == 'No')
                      .groupby('B')[['Yes', 'No']].sum()
                      .reset_index()
                      .T.to_numpy().tolist())
Output:
['Algeria', 'Egypt', 'Morocco', 'Tunisia']
[1, 2, 0, 1]
[1, 2, 2, 1]
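An equivalent, more explicit sketch of the same steps, which avoids relying on row order after transposing:
counts = (df.assign(Yes=df['D'] == 'Yes', No=df['D'] == 'No')
            .groupby('B')[['Yes', 'No']].sum())
country = counts.index.tolist()  # group labels, sorted by groupby
yes = counts['Yes'].tolist()
no = counts['No'].tolist()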

Python Pandas group by mean() for a certain count of rows

I need to group by category and take the mean() of only the first 2 values of each category. How do I define that?
df looks like
category value
-> a 2
-> a 5
a 4
a 8
-> b 6
-> b 3
b 1
-> c 2
-> c 2
c 7
Reading only the arrowed rows, the output should look like
category mean
a 3.5
b 4.5
c 2
How can I do this?
I am trying the following, but I don't know where to specify that only the first 2 observations of each category should be used:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated, thanks in advance.
You can also do this via groupby() and agg():
out = df.groupby('category', as_index=False)['value'].agg(lambda x: x.head(2).mean())
Try apply on each group of values and use head(2) to get just the first 2 values, then take the mean:
import pandas as pd

df = pd.DataFrame({
    'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
                 6: 'b', 7: 'c', 8: 'c', 9: 'c'},
    'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
              8: 2, 9: 7}
})

output = df.groupby('category', as_index=False)['value'] \
           .apply(lambda a: a.head(2).mean())
print(output)
print(output)
Output:
category value
0 a 3.5
1 b 4.5
2 c 2.0
Or create a boolean index to filter df with:
m = df.groupby('category').cumcount().lt(2)
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)
category value
0 a 3.5
1 b 4.5
2 c 2.0
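An equivalent sketch that takes the first two rows per group up front with GroupBy.head, then aggregates (assuming the original row order defines which values come first):
# head(2) keeps the first 2 rows of each category, preserving row order.
output = (df.groupby('category').head(2)
            .groupby('category', as_index=False)['value'].mean())
print(output)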

Adding missing rows and setting column value to zero based on current dataframe

import pandas as pd

dic = {'distinct_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
       'first_name': {0: 'Joe', 1: 'Barry', 2: 'David', 3: 'Marcus', 4: 'Anthony'},
       'activity': {0: 'Jump', 1: 'Jump', 2: 'Run', 3: 'Run', 4: 'Climb'},
       'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1},
       'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3}}
tasks = pd.DataFrame(dic)
I'm trying to make every id/name pair have a row for every unique activity. For example, I want "Joe" to have rows where the activity column is "Run" and "Climb", but with 0 in the tasks_completed column (those rows not being present already means he hasn't done those activity tasks). I have tried using df.iterrows(), making a list of the unique ids and activity names and checking whether both are present, but it didn't work. Any help is very appreciated!
This is what I am hoping to have:
tasks_new = {'distinct_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 1, 7: 2,
                             8: 2, 9: 3, 10: 3, 11: 4, 12: 4, 13: 5, 14: 5},
             'first_name': {0: 'Joe', 1: 'Barry', 2: 'David', 3: 'Marcus',
                            4: 'Anthony', 5: 'Joe', 6: 'Joe', 7: 'Barry',
                            8: 'Barry', 9: 'David', 10: 'David', 11: 'Marcus',
                            12: 'Marcus', 13: 'Anthony', 14: 'Anthony'},
             'activity': {0: 'Jump', 1: 'Jump', 2: 'Run', 3: 'Run', 4: 'Climb',
                          5: 'Run', 6: 'Climb', 7: 'Run', 8: 'Climb', 9: 'Jump',
                          10: 'Climb', 11: 'Climb', 12: 'Jump', 13: 'Run', 14: 'Jump'},
             'tasks_completed': {0: 3, 1: 3, 2: 3, 3: 3, 4: 1, 5: 0, 6: 0, 7: 0,
                                 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0},
             'tasks_available': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3, 7: 3,
                                 8: 3, 9: 3, 10: 3, 11: 3, 12: 3, 13: 3, 14: 3}}
pd.DataFrame(tasks_new)
idx_cols = ['distinct_id', 'first_name', 'activity']

# Move 'activity' into the columns, which creates every missing
# id/activity combination (filled with 0), then stack it back into rows.
tasks.set_index(idx_cols).unstack(fill_value=0).stack().reset_index()
distinct_id first_name activity tasks_completed tasks_available
0 1 Joe Climb 0 0
1 1 Joe Jump 3 3
2 1 Joe Run 0 0
3 2 Barry Climb 0 0
4 2 Barry Jump 3 3
5 2 Barry Run 0 0
6 3 David Climb 0 0
7 3 David Jump 0 0
8 3 David Run 3 3
9 4 Marcus Climb 0 0
10 4 Marcus Jump 0 0
11 4 Marcus Run 3 3
12 5 Anthony Climb 1 3
13 5 Anthony Jump 0 0
14 5 Anthony Run 0 0
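Note that unstack(fill_value=0) zero-fills tasks_available as well, whereas the desired output keeps it at 3. A sketch that builds the full id/activity product explicitly and fills the two columns differently (how='cross' needs pandas >= 1.2, and the constant 3 is assumed from the sample data):
# Every (distinct_id, first_name) pair crossed with every unique activity.
pairs = tasks[['distinct_id', 'first_name']].drop_duplicates()
full = pairs.merge(tasks[['activity']].drop_duplicates(), how='cross')

out = full.merge(tasks, on=['distinct_id', 'first_name', 'activity'], how='left')
out['tasks_completed'] = out['tasks_completed'].fillna(0).astype(int)
out['tasks_available'] = out['tasks_available'].fillna(3).astype(int)  # assumed constant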

groupby + apply results in a series appearing both in index and column - how to prevent it?

I've got the following data frame:
import pandas as pd

dict1 = {'id': {0: 11, 1: 12, 2: 13, 3: 14, 4: 15, 5: 16, 6: 19, 7: 18, 8: 17},
         'var1': {0: 20.272108843537413, 1: 21.088435374149658,
                  2: 20.68027210884354, 3: 23.945578231292515,
                  4: 22.857142857142854, 5: 21.496598639455787,
                  6: 39.18367346938776, 7: 36.46258503401361,
                  8: 34.965986394557824},
         'var2': {0: 27.731092436974773, 1: 43.907563025210074,
                  2: 55.67226890756303, 3: 62.81512605042017,
                  4: 71.63865546218487, 5: 83.40336134453781,
                  6: 43.48739495798319, 7: 59.243697478991606,
                  8: 67.22689075630252},
         'var3': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}}
ex = pd.DataFrame(dict1).set_index('id')
I wanted to sort within groups according to var1, so I wrote the following:
ex.groupby('var3').apply(lambda x: x.sort_values('var1'))
However, it results in a data frame which has var3 both in index and in column. How to prevent that and leave it only in a column?
Add the optional parameter as_index=False to groupby:
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1'))
Or, if you don't want multiIndex
ex.groupby('var3', as_index=False) \
.apply(lambda x: x.sort_values('var1')) \
.reset_index(level=0, drop=True)
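Another option (a sketch using the group_keys parameter of groupby) is to keep the group keys out of the index altogether, which also preserves the original id index:
# group_keys=False stops pandas from adding 'var3' to the result's index.
ex.groupby('var3', group_keys=False).apply(lambda x: x.sort_values('var1'))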
You could use:
df_sorted = ex.groupby('var3').apply(lambda x: x.sort_values('var1')).reset_index(drop=True)
print(df_sorted)
var1 var2 var3
0 20.272109 27.731092 1
1 20.680272 55.672269 1
2 21.088435 43.907563 1
3 21.496599 83.403361 1
4 22.857143 71.638655 1
5 23.945578 62.815126 1
6 34.965986 67.226891 2
7 36.462585 59.243697 2
8 39.183673 43.487395 2
But you only need DataFrame.sort_values, sorting first by var3 and then by var1:
df_sort = ex.sort_values(['var3', 'var1'])
print(df_sort)
var1 var2 var3
id
11 20.272109 27.731092 1
13 20.680272 55.672269 1
12 21.088435 43.907563 1
16 21.496599 83.403361 1
15 22.857143 71.638655 1
14 23.945578 62.815126 1
17 34.965986 67.226891 2
18 36.462585 59.243697 2
19 39.183673 43.487395 2

Converting to Pandas MultiIndex

I have a dataframe of the form:
SpeciesName 0
0 A [[Year: 1, Quantity: 2],[Year: 3, Quantity: 4...]]
1 B [[Year: 1, Quantity: 7],[Year: 2, Quantity: 15...]]
2 C [[Year: 2, Quantity: 9],[Year: 4, Quantity: 13...]]
I'm attempting to create a MultiIndex that uses the SpeciesName and the year as the index:
SpeciesName Year
A 1 Data
2 Data
B 1 Data
2 Data
I have not been able to get pandas.MultiIndex(..) to work and my attempts at iterating through the dataset and manually creating a new object have not been very fruitful. Any insights would be greatly appreciated!
I'm going to assume your data is a list of dictionaries... because if I don't, what you've written makes no sense unless the entries are strings, and I don't want to parse strings.
import pandas as pd

df = pd.DataFrame([
    ['A', [dict(Year=1, Quantity=2), dict(Year=3, Quantity=4)]],
    ['B', [dict(Year=1, Quantity=7), dict(Year=2, Quantity=15)]],
    ['C', [dict(Year=2, Quantity=9), dict(Year=4, Quantity=13)]]
], columns=['SpeciesName', 0])
df
SpeciesName 0
0 A [{'Year': 1, 'Quantity': 2}, {'Year': 3, 'Quantity': 4}]
1 B [{'Year': 1, 'Quantity': 7}, {'Year': 2, 'Quantity': 15}]
2 C [{'Year': 2, 'Quantity': 9}, {'Year': 4, 'Quantity': 13}]
Then the solution is obvious:
pd.DataFrame.from_records(
    *zip(*(
        [d, s]  # (record dict, species name) pairs
        for s, l in zip(df['SpeciesName'], df[0].values.tolist())
        for d in l
    ))  # zip(*...) splits the pairs into (records, index) for from_records
).set_index('Year', append=True)
Quantity
Year
A 1 2
3 4
B 1 7
2 15
C 2 9
4 13
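For comparison, a more explicit sketch of the same reshaping using DataFrame.explode (available from pandas 0.25), which may be easier to follow:
# One row per dict, then expand the dicts into Year/Quantity columns.
exploded = df.explode(0)
records = pd.DataFrame(exploded[0].tolist(), index=exploded['SpeciesName'])
result = records.set_index('Year', append=True)
print(result)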
