Conditional filling of column based on string - python

I have a dataset in which I need to either conditionally fill certain cells or drop the matching rows, but I have been unsuccessful so far.
Idx  Fruits  Days   Name
0    60      20
1    15      85.5
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John
Now, I have some empty cells. I can fill all of them with fillna or a regex, or drop every empty row. But I only want to handle the leading empty cells, before the first string appears, by either filling them with "." or dropping those rows, like below:
Idx  Fruits  Days   Name
0    60      20     .
1    15      85.5   .
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John
and
Idx  Fruits  Days   Name
2    10      62     Peter
3    40      90     Maria
4    5       10.2
5    92      66
6    65      87     John
7    50      1      Eric
8    50      0      Maria
9    80      87     John
Is this possible using pandas, or with a loop?

You can try this:
import numpy as np

df['Name'] = df['Name'].replace('', np.nan)  # treat empty strings as missing
# ffill() leaves only the leading NaNs unfilled, so only those rows get '.'
df['Name'] = df['Name'].where(df['Name'].ffill().notna(), '.')
print(df)
Idx Fruits Days Name
0 0 60 20.0 .
1 1 15 85.5 .
2 2 10 62.0 Peter
3 3 40 90.0 Maria
4 4 5 10.2
5 5 92 66.0
6 6 65 87.0 John
7 7 50 1.0 Eric
8 8 50 0.0 Maria
9 9 80 87.0 John
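For the second desired output (dropping the leading rows instead of filling them), the same ffill trick works as a boolean mask. A minimal sketch, assuming the blanks have already been converted to NaN as above:
# Rows before the first non-missing name stay NaN even after ffill(),
# so the mask drops exactly the leading blanks and keeps the later ones
df2 = df[df['Name'].ffill().notna()]
print(df2)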

Related

Swap or transpose data for stacked bar chart in Matplotlib

I'm trying to generate some stacked bar charts. I'm using this data:
index 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 No 94 123 96 108 122 106.0 95.0 124 104 118 73 82 106 124 109 70 59
1 Yes 34 4 33 21 5 25.0 34.0 5 21 9 55 46 21 3 19 59 41
2 Dont know 1 2 1 1 2 NaN NaN 1 4 2 2 2 2 2 2 1 7
Basically I want to use the column names as the x values and the Yes / No / Don't know rows as the y values. Here is my code and the result I have at the moment:
ax = dfu.plot.bar(x='index', stacked=True)
UPDATE:
Here is an example:
import pandas as pd
import matplotlib.pyplot as plt

data = [{0: 1, 1: 2, 2: 3}, {0: 3, 1: 2, 2: 1}, {0: 1, 1: 1, 2: 1}]
index = ["yes", "no", "dont know"]
df = pd.DataFrame(data, index=index)
df.T.plot.bar(stacked=True)  # .T transposes so the old columns become the x-axis
plt.show()
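Applied to the question's data, where 'index' is a regular column holding the row labels, the same idea would look like this (a sketch, assuming dfu is the frame shown in the question):
# Move the 'index' column into the real index, then transpose so the
# original column names (1..17) become the x-axis categories
dfu.set_index('index').T.plot.bar(stacked=True)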

Efficient mean and total aggregation over multiple Pandas DataFrame columns

Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Where each id belongs to a grade and lives in a country, and spends a certain amount on various fruits (category). Let's say the data covers a whole year. (Dataframe reproducible using the code below.)
import pandas as pd

df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (i.e., total) amount spent on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math

combos = [[i, j, k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0]) & (df["grade"]==c[1]) & (df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos, columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping, on=["country","grade","category"], how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52.0          1.0
1    5      fr     b   banana      68        167.0          2.0
2    7      fr     a   banana      73         73.0          1.0
3    4      it     c    mango      70         74.5          2.0
4    5      fr     b   banana      99        167.0          2.0
5    9      uk     a    apple      29         52.5          1.0
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59         59.0          1.0
8    2      it     c    mango      11         74.5          2.0
9    9      uk     a   banana      91        108.0          2.0
10   0      uk     b    mango      95         95.0          1.0
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78         78.0          1.0
14   3      uk     a    apple      76         52.5          1.0
15   6      it     c    apple      76         76.0          1.0
16   2      it     c    mango      10         74.5          2.0
17   1      it     b    mango      30         30.0          1.0
18   9      uk     a   banana      17        108.0          2.0
19   2      it     c    mango      58         74.5          2.0
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
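The question also asks for mean_counts, which the snippet above doesn't produce; a minimal sketch of the analogous computation under the same approach (rows per group divided by distinct ids):
# purchases per group / number of distinct ids = mean count per id
mean_count_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: len(x) / x.id.nunique())
df['mean_counts'] = df.apply(lambda x: mean_count_df.loc[x.country, x.category, x.grade], axis=1)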
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
       .groupby(['country', 'grade', 'category', 'id']).sum()
       .groupby(['country', 'grade', 'category']).mean())
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
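To attach these values back onto the original frame as a mean_totals column, one option is to merge on the group keys; a sketch, assuming out as computed above:
# out is indexed by (country, grade, category), so merge on those levels
df = df.merge(out.rename(columns={'amount': 'mean_totals'}),
              left_on=['country', 'grade', 'category'],
              right_index=True, how='left')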
I hope this will be fast enough. First group and compute the required values, then merge the result back onto the existing df.
import pandas as pd
df = pd.DataFrame({
    "id": [7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
    "country": ["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
    "grade": ["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
    "category": ["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
    "amount": [52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})

intermediate_df = (df
    .groupby(by=['country','grade','category','id'], as_index=False)
    .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
         int_counts=pd.NamedAgg(column='id', aggfunc='count'))
    .groupby(by=['country','grade','category'], as_index=False)
    .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
         mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean')))

output_df = pd.merge(df, intermediate_df, on=['country','grade','category'], how='left')
print(output_df)

How to get the sum of the most recent N values (or all available values) of each item before a given cut-off date?

To be honest, I am very new to programming and have learned only some of pandas' basic functionality.
I am able to group by item and sum the price, but not to apply the cut-off date before doing the summation.
Below are my input data and the expected result. Requesting help on how to achieve this using pandas.
In the data below, N=5 (the number of values to consider before the cut-off date). The expected result for item Grape is 88, i.e., the sum of entries 7, 6, 5, 4, and 3. For Orange it is 90 (entries 13, 12, 11, 10); only 4 entries are available before the cut-off, so all of them are considered.
Entry Date Itemname Price Cut off date Expected result
1 3/9/2020 Grape 16 3/15/2020 88
2 3/10/2020 Grape 15 3/15/2020 88
3 3/11/2020 Grape 12 3/15/2020 88
4 3/12/2020 Grape 18 3/15/2020 88
5 3/13/2020 Grape 20 3/15/2020 88
6 3/13/2020 Grape 18 3/15/2020 88
7 3/14/2020 Grape 20 3/15/2020 88
8 3/15/2020 Grape 12 3/15/2020 88
9 3/16/2020 Grape 19 3/15/2020 88
10 2/10/2020 Orange 22 2/17/2020 90
11 2/11/2020 Orange 21 2/17/2020 90
12 2/12/2020 Orange 26 2/17/2020 90
13 2/13/2020 Orange 21 2/17/2020 90
14 2/20/2020 Orange 26 2/17/2020 90
First convert both date columns to datetimes, then keep only rows before the cut-off date using Series.lt in boolean indexing, aggregate the sum of the last N values per item with Series.tail inside a lambda, and finally map the per-item sums back into a new column with Series.map:
N = 5
df['Date'] = pd.to_datetime(df['Date'])
df['Cut off date'] = pd.to_datetime(df['Cut off date'])

s = (df[df['Date'].lt(df['Cut off date'])]
       .groupby('Itemname')['Price']
       .agg(lambda x: x.tail(N).sum()))

df['new'] = df['Itemname'].map(s)
print(df)
Entry Date Itemname Price Cut off date Expected result new
0 1 2020-03-09 Grape 16 2020-03-15 88 88
1 2 2020-03-10 Grape 15 2020-03-15 88 88
2 3 2020-03-11 Grape 12 2020-03-15 88 88
3 4 2020-03-12 Grape 18 2020-03-15 88 88
4 5 2020-03-13 Grape 20 2020-03-15 88 88
5 6 2020-03-13 Grape 18 2020-03-15 88 88
6 7 2020-03-14 Grape 20 2020-03-15 88 88
7 8 2020-03-15 Grape 12 2020-03-15 88 88
8 9 2020-03-16 Grape 19 2020-03-15 88 88
9 10 2020-02-10 Orange 22 2020-02-17 90 90
10 11 2020-02-11 Orange 21 2020-02-17 90 90
11 12 2020-02-12 Orange 26 2020-02-17 90 90
12 13 2020-02-13 Orange 21 2020-02-17 90 90
13 14 2020-02-20 Orange 26 2020-02-17 90 90
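One caveat: tail(N) takes the last N rows in frame order, not necessarily the N most recent dates, so if the input is not already sorted chronologically it is safer to sort first. A sketch of the same aggregation with an explicit sort:
s = (df[df['Date'].lt(df['Cut off date'])]
       .sort_values('Date')  # ensure chronological order within each item
       .groupby('Itemname')['Price']
       .agg(lambda x: x.tail(N).sum()))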

Comparing Two Data Frames in python

I have two data frames. I have to compare them and get the positions of the unmatched data using Python.
Note:
The first column will not always be unique.
Data Frame 1:
0 1 2 3 4
0 1 Dhoni 24 Kota 60000.0
1 2 Raina 90 Delhi 41500.0
2 3 Kholi 67 Ahmedabad 20000.0
3 4 Ashwin 45 Bhopal 8500.0
4 5 Watson 64 Mumbai 6500.0
5 6 KL Rahul 19 Indore 4500.0
6 7 Hardik 24 Bengaluru 1000.0
Data Frame 2
0 1 2 3 4
0 3 Kholi 67 Ahmedabad 20000.0
1 7 Hardik 24 Bengaluru 1000.0
2 4 Ashwin 45 Bhopal 8500.0
3 2 Raina 90 Delhi 41500.0
4 6 KL Rahul 19 Chennai 4500.0
5 1 Dhoni 24 Kota 60000.0
6 5 Watson 64 Mumbai 6500.0
I expect the output (3, 5) - (Indore - Chennai): the position of the mismatch and the two differing values.
import pandas as pd

df1 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'], 'B': [24, 90, 67], 'C': ['Kota', 'Delhi', 'Indore'], 'D': [6000.0, 41500.0, 4500.0]})
df2 = pd.DataFrame({'A': ['Dhoni', 'Raina', 'KL Rahul'], 'B': [24, 90, 67], 'C': ['Kota', 'Delhi', 'Chennai'], 'D': [6000.0, 41500.0, 4500.0]})

df1['df'] = 'df1'  # tag each frame so we can see where a row came from
df2['df'] = 'df2'
# keep=False drops every duplicated row, leaving only the mismatches
df = pd.concat([df1, df2], sort=False).drop_duplicates(subset=['A', 'B', 'C', 'D'], keep=False)
print(df)
A B C D df
2 KL Rahul 67 Indore 4500.0 df1
2 KL Rahul 67 Chennai 4500.0 df2
I have added a df column to show which DataFrame each difference comes from.
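If the rows can be aligned (here, by sorting on column 'A'), pandas 1.1+ also offers DataFrame.compare, which reports only the differing cells. A minimal sketch, assuming df1 and df2 as defined above, with the helper 'df' column dropped first:
# align rows, then compare cell-by-cell; output shows 'self' vs 'other'
a = df1.drop(columns='df').sort_values('A').reset_index(drop=True)
b = df2.drop(columns='df').sort_values('A').reset_index(drop=True)
print(a.compare(b))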

Python Pandas self merge on previous data

I have a DataFrame that contains many years' worth of data. I want to add a couple of columns containing previous years' data from the same DataFrame. Here's an example:
df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 4, 5, 3, 3, 3, 4],
                   'yr': [87, 88, 89, 54, 55, 53, 87, 87, 89, 90, 91, 92, 86],
                   'data': '1-87 1-88 1-89 2-54 2-55 2-53 3-87 4-87 5-89 3-90 3-91 3-92 4-86'.split()})
data id yr
0 1-87 1 87
1 1-88 1 88
2 1-89 1 89
3 2-54 2 54
4 2-55 2 55
5 2-53 2 53
6 3-87 3 87
7 4-87 4 87
8 5-89 5 89
9 3-90 3 90
10 3-91 3 91
11 3-92 3 92
12 4-86 4 86
I'd like to add another column that shows the previous year's data for that id number, like this:
data id yr last_year_data
0 1-87 1 87 NaN
1 1-88 1 88 1-87
2 1-89 1 89 1-88
3 2-54 2 54 2-53
4 2-55 2 55 2-54
5 2-53 2 53 NaN
6 3-87 3 87 NaN
7 4-87 4 87 4-86
8 5-89 5 89 NaN
9 3-90 3 90 NaN
10 3-91 3 91 3-90
11 3-92 3 92 3-91
12 4-86 4 86 NaN
I tried to do this with a merge, but I got NaNs all the way down in the second half of the merged columns. Here's my code for that:
df['last_year'] = df['yr'].apply(lambda x: x-1 if x > 0 else None)
df_test = df.merge(df, how='left',indicator=False,left_on=['id','yr'],right_on=['id','last_year'])
I know there's a better way to do this, but I'm not sure what it is. can you help?
You can use shift:
df['New']=df.sort_values(['id','yr']).groupby('id').data.shift()
df
Out[793]:
data id yr New
0 1-87 1 87 NaN
1 1-88 1 88 1-87
2 1-89 1 89 1-88
3 2-54 2 54 2-53
4 2-55 2 55 2-54
5 2-53 2 53 NaN
6 3-87 3 87 NaN
7 4-87 4 87 4-86
8 5-89 5 89 NaN
9 3-90 3 90 3-87
10 3-91 3 91 3-90
11 3-92 3 92 3-91
12 4-86 4 86 NaN
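Note that shift takes the previous row within each id regardless of year gaps, which is why row 9 gets 3-87 here while the desired output has NaN (id 3 has no yr 89). If a strict previous-year match is required, a merge-based sketch (assuming at most one row per (id, yr) pair):
prev = df[['id', 'yr', 'data']].copy()
prev['yr'] += 1  # each row now belongs to the following year
df = df.merge(prev.rename(columns={'data': 'last_year_data'}),
              on=['id', 'yr'], how='left')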
