How to fill missing values with conditions? - python

I have a pandas DataFrame like this:
import numpy as np
import pandas as pd

year = [2015, 2016, 2009, 2000, 1998, 2017, 1980, 2016, 2015, 2015]
mode = ['automatic', 'automatic', 'manual', 'manual', np.nan, 'automatic', np.nan, 'automatic', np.nan, np.nan]
X = pd.DataFrame({'year': year, 'mode': mode})
print(X)
year mode
0 2015 automatic
1 2016 automatic
2 2009 manual
3 2000 manual
4 1998 NaN
5 2017 automatic
6 1980 NaN
7 2016 automatic
8 2015 NaN
9 2015 NaN
I want to fill the missing values like this: if year is < 2010, fill NaN with 'manual'; if year is >= 2010, fill NaN with 'automatic'.
I thought about combining .groupby with these conditions, but honestly I do not know how to do it :(
I would be grateful for any help.

Similar approach to my answer on your other question:
cond = X['year'] < 2010
X['mode'] = X['mode'].fillna(cond.map({True:'manual', False: 'automatic'}))

With np.where and fillna
s = pd.Series(np.where(X.year < 2010, 'manual', 'automatic'), index=X.index)
X['mode'] = X['mode'].fillna(s)
X
Out[192]:
year mode
0 2015 automatic
1 2016 automatic
2 2009 manual
3 2000 manual
4 1998 manual
5 2017 automatic
6 1980 manual
7 2016 automatic
8 2015 automatic
9 2015 automatic

You can use np.where
X['mode'] = X['mode'].fillna(pd.Series(np.where(X['year'] >= 2010, 'automatic', 'manual')))
Output
year mode
0 2015 automatic
1 2016 automatic
2 2009 manual
3 2000 manual
4 1998 manual
5 2017 automatic
6 1980 manual
7 2016 automatic
8 2015 automatic
9 2015 automatic
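One caveat with the np.where variants: fillna aligns the filler Series on index labels, so a pd.Series built without index=X.index only lines up when X has the default 0..n-1 index. A minimal sketch that passes the index explicitly, using the question's data with a made-up letter index (the index labels and the name `default` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Same data as in the question, but with a hypothetical non-default index
year = [2015, 2016, 2009, 2000, 1998, 2017, 1980, 2016, 2015, 2015]
mode = ["automatic", "automatic", "manual", "manual", np.nan,
        "automatic", np.nan, "automatic", np.nan, np.nan]
X = pd.DataFrame({"year": year, "mode": mode}, index=list("abcdefghij"))

# Build the per-row default and reuse X's index so fillna can align on labels
default = pd.Series(np.where(X["year"] < 2010, "manual", "automatic"),
                    index=X.index)

# Only the NaN entries are replaced; existing values are left untouched
X["mode"] = X["mode"].fillna(default)

print(X)
```

Without index=X.index the filler would carry labels 0..9, none of which match the letter labels, and the NaN values would stay in place.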

Related

Python Pandas multiindex

I'm trying to create a table like the one in this example:
Example_picture
My code:
data = list(range(39)) # mockup for 39 values
columns = pd.MultiIndex.from_product([['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)
print(df)
But I get this error:
Shape of passed values is (39, 1), indices imply (1, 39)
What did I do wrong?
You need to wrap the data in a list to force the DataFrame constructor to interpret the list as a row:
data = list(range(39))
columns = pd.MultiIndex.from_product([['1', '2', '6'],
[str(year) for year in range(2007, 2020)]],
names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)
output:
Factor 1 2 6
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
World 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
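The error comes from a flat list of 39 values being read as 39 rows of one column. As an alternative to wrapping the list (a sketch, not the method from the answer above), the values can be reshaped into a single row with NumPy:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["1", "2", "6"], [str(year) for year in range(2007, 2020)]],
    names=["Factor", "Year"],
)

# reshape(1, -1) turns the 39 mock values into one row of 39 columns
data = np.arange(39).reshape(1, -1)
df = pd.DataFrame(data, index=["World"], columns=columns)

print(df.shape)
# A single cell is addressed with a (Factor, Year) tuple
print(df.loc["World", ("6", "2019")])
```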

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
columns="Year",
values=["Number of Apples","Number of People"],
aggfunc= lambda x: len(x.unique()),
margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning to a new column name, which does not work with a MultiIndex on the columns: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Using the other assignment method, tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]), which raised this error: SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
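For reference, a self-contained sketch of the same idea on the sample numbers from the question. The frame is rebuilt by hand here rather than through pivot_table, and passing a dict to pd.concat is used as one way to name the new top-level column:

```python
import pandas as pd

years = [2017, 2018, 2019, 2020]
columns = pd.MultiIndex.from_product(
    [["Number of Apples", "Number of People"], years]
)
tdf = pd.DataFrame(
    [[10, 18, 20, 25, 2, 3, 4, 5],
     [8, 35, 25, 12, 2, 5, 5, 4]],
    index=["California", "West Virginia"],
    columns=columns,
)

# Selecting a top-level key returns a sub-frame whose columns are the years,
# so the division aligns year by year
ratio = tdf["Number of Apples"] / tdf["Number of People"]

# The dict key becomes the new top-level column label
per_person = pd.concat({"Number of Apples per Person": ratio}, axis=1)
result = pd.concat([tdf, per_person], axis=1)

print(result)
```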

I need help plotting a bar graph from a dataframe

I have the following dataframe:
AQI Year City
0 349.407407 2015 'Patna'
1 297.024658 2015 'Delhi'
2 283.007605 2015 'Ahmedabad'
3 297.619178 2016 'Delhi'
4 282.717949 2016 'Ahmedabad'
5 250.528701 2016 'Patna'
6 379.753623 2017 'Ahmedabad'
7 325.652778 2017 'Patna'
8 281.401216 2017 'Gurugram'
9 443.053221 2018 'Ahmedabad'
10 248.367123 2018 'Delhi'
11 233.772603 2018 'Lucknow'
12 412.781250 2019 'Ahmedabad'
13 230.720548 2019 'Delhi'
14 217.626741 2019 'Patna'
15 214.681818 2020 'Ahmedabad'
16 181.672131 2020 'Delhi'
17 162.251366 2020 'Patna'
I would like to group data for each year, i.e. 2015, 2016, 2017 2018...2020 on the x axis, with AQI on the y axis. I am a newbie and please excuse the lack of depth in my question.
You can "pivot" your data to support your desired plotting output. Here we set the rows as Year, columns as City, and values as AQI.
pivot = pd.pivot_table(
data=df,
index='Year',
columns='City',
values='AQI',
)
Year   Ahmedabad       Delhi    Gurugram     Lucknow       Patna
2015  283.007605  297.024658         NaN         NaN  349.407407
2016  282.717949  297.619178         NaN         NaN  250.528701
2017  379.753623         NaN  281.401216         NaN  325.652778
2018  443.053221  248.367123         NaN  233.772603         NaN
2019  412.781250  230.720548         NaN         NaN  217.626741
2020  214.681818  181.672131         NaN         NaN  162.251366
Then you can plot this pivot table directly:
pivot.plot.bar(xlabel='Year', ylabel='AQI')
Old answer
Are you looking for the mean AQI per year? If so, you can do some pandas chaining, assuming your data is in a DataFrame df. Selecting the AQI column first keeps the string City column out of the mean:
df.groupby('Year')['AQI'].mean().plot.bar(xlabel='Year', ylabel='AQI')
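Because City holds strings, recent pandas versions raise a TypeError when mean() is asked to average every column, so picking out AQI before aggregating is the safer form. A small sketch on the 2015 and 2020 rows from the question, with the plotting call left as a comment since it needs matplotlib:

```python
import pandas as pd

# Subset of the question's data: the 2015 and 2020 rows
df = pd.DataFrame({
    "AQI": [349.407407, 297.024658, 283.007605,
            214.681818, 181.672131, 162.251366],
    "Year": [2015, 2015, 2015, 2020, 2020, 2020],
    "City": ["Patna", "Delhi", "Ahmedabad", "Ahmedabad", "Delhi", "Patna"],
})

# Select the numeric column before aggregating, so City is never averaged
mean_aqi = df.groupby("Year")["AQI"].mean()
print(mean_aqi)

# mean_aqi.plot.bar(xlabel="Year", ylabel="AQI")  # requires matplotlib
```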

Information matrix from pandas dataframe

I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix says that there were 11 customers in total in 2017, that three of them also appeared in 2018, and so on. In practice I have 7 years of data, so it would be a 7x7 matrix. I have been struggling with this for a while but can't get it right.
merge + crosstab:
m = df.merge(df, left_on='Customer Id', right_on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3
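A runnable sketch of the merge + crosstab approach on the question's data. The drop_duplicates call is an added assumption: it guards against a customer appearing more than once in the same year, which would otherwise inflate the counts.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer Id": [1510220024, 1510270013, 1511160047, 1512100014,
                    1603180006, 1605030030, 1605160013, 1606060008,
                    1510220024, 1606270014, 1608080011, 1608090002,
                    1511160047, 1606270014],
    "year": [2017] * 8 + [2018] + [2017] * 3 + [2018] * 2,
})

# Keep one row per (customer, year) pair so each customer counts once per year
pairs = df.drop_duplicates()

# The self-merge pairs every year a customer appears in with every year
# that same customer appears in (including the same year)
m = pairs.merge(pairs, on="Customer Id")

# Counting the (year_x, year_y) pairs gives the overlap matrix
matrix = pd.crosstab(m["year_x"], m["year_y"])
print(matrix)
```

The diagonal holds the number of customers per year, and each off-diagonal cell holds the number of customers seen in both years.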

Pandas dataframe: how to find missing years in a timeseries?

I have a DataFrame with a timestamp index and some 100,000 rows. Via
df['year'] = df.index.year
it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,
grouped = df.groupby('year')
grouped.groups.keys()
will give me the years which are present in my dataset. I could now build a complete year vector with
pd.date_range(df.index.min(), df.index.max(), freq='AS')
and through reindex I should then be able to find the missing years as those years which have NaN values.
However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while, presumably because it doesn't only look for unique keys but also builds the index lists of rows that belong to each key, which is a feature that I don't need here.
Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?
One method would be to construct a series of the years of interest and then use isin to see the missing values:
In [89]:
year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
6 1999
7 2000
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
20 2013
21 2014
dtype: int32
In [88]:
df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
year
0 1999
1 2000
2 2013
In [91]:
year_s[~year_s.isin(df['year'])]
Out[91]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
21 2014
dtype: int32
So in your case you can generate the year series as above, then for your df you can get the years using:
df.index.year.unique()
which will be much quicker than performing a groupby.
Take care that the last value passed to arange is not included in the range.
If all you want is a list of missing years, you can first take the unique years from your column and then build the list of missing years with a list comprehension:
years = df['year'].unique()
missing_years = [y for y in range(min(years), max(years)+1) if y not in years]
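The same idea also works straight from a timestamp index, without creating a year column. A sketch using a set difference instead of the list comprehension (the sample dates and the `value` column are invented for illustration):

```python
import pandas as pd

# Hypothetical timestamp-indexed frame with gaps between 2000 and 2013
idx = pd.to_datetime(["1999-03-01", "2000-07-15", "2000-11-02", "2013-01-20"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Unique years present, taken directly from the DatetimeIndex
years = df.index.year.unique()

# Every year between the first and last one seen (range excludes its end,
# hence the + 1)
all_years = set(range(int(years.min()), int(years.max()) + 1))

# Set difference leaves only the years with no rows at all
missing_years = sorted(all_years - set(years.tolist()))
print(missing_years)
```

Set membership is checked in constant time, so this avoids scanning the array of years once per candidate year.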
