I have a DataFrame with a timestamp index and some 100,000 rows. Via
df['year'] = df.index.year
it is easy to create a new column which contains the year of each row. Now I want to find out which years are missing from my timeseries. So far, I understand that I can use groupby to obtain "something" which allows me to find the unique values. Thus,
grouped = df.groupby('year')
grouped.groups.keys()
will give me the years which are present in my dataset. I could now build a complete year vector with
pd.date_range(df.index.min(), df.index.max(), freq='YS')
and through reindex I should then be able to find the missing years as those years which have NaN values.
However, this sounds awfully complicated for such a seemingly simple task, and the grouped.groups operation actually takes quite a while, presumably because it doesn't only look for unique keys but also builds the lists of row indices that belong to each key, which is a feature I don't need here.
Is there any way to obtain the unique elements of a dataframe column more directly/efficiently?
One method would be to construct a series of the years of interest and then use isin to find the ones that are missing:
In [89]:
year_s = pd.Series(np.arange(1993, 2015))
year_s
Out[89]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
6 1999
7 2000
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
20 2013
21 2014
dtype: int32
In [88]:
df = pd.DataFrame({'year':[1999, 2000, 2013]})
df
Out[88]:
year
0 1999
1 2000
2 2013
In [91]:
year_s[~year_s.isin(df['year'])]
Out[91]:
0 1993
1 1994
2 1995
3 1996
4 1997
5 1998
8 2001
9 2002
10 2003
11 2004
12 2005
13 2006
14 2007
15 2008
16 2009
17 2010
18 2011
19 2012
21 2014
dtype: int32
So in your case you can generate the year series as above, then for your df you can get the years using:
df.index.year.unique()
which will be much quicker than performing a groupby.
Take care that the last value passed to arange is not included in the range, so pass the final year plus one.
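Putting the pieces together for a timestamp index, a minimal sketch could look like the following. The sample dates and the `value` column are made up for illustration, and `np.setdiff1d` is used here in place of the `isin` mask shown above; both give the same years.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a timestamp index with data in 1999, 2000 and 2013 only.
idx = pd.to_datetime(["1999-03-01", "1999-07-15", "2000-01-10", "2013-06-30"])
df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Unique years present, taken straight from the index (no groupby needed).
present = np.unique(df.index.year)

# Every year between the first and last observation; the +1 is needed
# because np.arange excludes its stop value.
expected = np.arange(present.min(), present.max() + 1)

# Years that are expected but absent from the data.
missing = np.setdiff1d(expected, present)
print(missing.tolist())
```

This avoids both the helper column and the groupby, since only the unique years are computed.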
If all you want is a list of missing years, you can first get the unique years from the column and then build the list of missing years with a list comprehension:
years = df['year'].unique()
missing_years = [y for y in range(min(years), max(years)+1) if y not in years]
I'm trying to add a new line after the Gross Profit line in an income statement, with values taken from an array.
I tried to append it at that location, but nothing changed.
income_statement.loc[["Gross Profit"]].append(gross)
The only way I have managed something similar is by making it a separate dataframe and concatenating it to the end of income_statement.
I'm trying to make it look like this (the 'gross' line in yellow):
How can I do it?
I created a sample df that tried to look similar to yours (see below).
df
Unnamed: 0 2010 2011 2012 2013 ... 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 ... 16 17 18 19 300
1 total revenue 1 2 3 4 ... 7 8 9 10 400
The aim now would be to add a row between them ('gross'), with the values you have listed in the picture.
One way to add the row is numpy.insert, which returns an array, so you have to convert the result back to a pd.DataFrame:
# Store the columns of your df
cols = df.columns
# Insert the row at position 1 (the second row, since Python indexing starts at 0)
new = pd.DataFrame(
    np.insert(df.values, 1, values=['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133], axis=0),
    columns=cols,
)
Which gets back:
new
Unnamed: 0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 14 15 16 17 18 19 300
1 gross 22 45 65 87 108 130 151 152 156 135 133
2 total revenue 1 2 3 4 5 6 7 8 9 10 400
Hopefully this will work for you. Let me know if you run into issues.
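If you would rather stay in pandas, a rough alternative is to slice the frame at the insert position and concatenate the pieces. This is only a sketch: `insert_row` is a hypothetical helper (not a pandas function), and the small frame with an `item` column stands in for the real income statement.

```python
import pandas as pd

def insert_row(df, position, row):
    """Return a new frame with `row` (a dict keyed by column name)
    inserted at integer `position`. Hypothetical helper for illustration."""
    new_row = pd.DataFrame([row], columns=df.columns)
    # Rows before the position, the new row, then the remaining rows.
    return pd.concat([df.iloc[:position], new_row, df.iloc[position:]],
                     ignore_index=True)

# Made-up stand-in for the income statement.
df = pd.DataFrame({
    "item": ["gross profit", "total revenue"],
    "2010": [10, 1],
    "2011": [11, 2],
})

new = insert_row(df, 1, {"item": "gross", "2010": 22, "2011": 45})
print(new)
```

Unlike the round trip through df.values, which produces object columns when the frame mixes strings and numbers, this keeps the numeric columns numeric.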
I have the following dataframe:
AQI Year City
0 349.407407 2015 'Patna'
1 297.024658 2015 'Delhi'
2 283.007605 2015 'Ahmedabad'
3 297.619178 2016 'Delhi'
4 282.717949 2016 'Ahmedabad'
5 250.528701 2016 'Patna'
6 379.753623 2017 'Ahmedabad'
7 325.652778 2017 'Patna'
8 281.401216 2017 'Gurugram'
9 443.053221 2018 'Ahmedabad'
10 248.367123 2018 'Delhi'
11 233.772603 2018 'Lucknow'
12 412.781250 2019 'Ahmedabad'
13 230.720548 2019 'Delhi'
14 217.626741 2019 'Patna'
15 214.681818 2020 'Ahmedabad'
16 181.672131 2020 'Delhi'
17 162.251366 2020 'Patna'
I would like to group the data by year, i.e. 2015, 2016, 2017, 2018, ..., 2020 on the x axis, with AQI on the y axis. I am a newbie, so please excuse the lack of depth in my question.
You can "pivot" your data to support your desired plotting output. Here we set the rows as Year, columns as City, and values as AQI.
pivot = pd.pivot_table(
data=df,
index='Year',
columns='City',
values='AQI',
)
Year   Ahmedabad       Delhi    Gurugram     Lucknow       Patna
2015  283.007605  297.024658         NaN         NaN  349.407407
2016  282.717949  297.619178         NaN         NaN  250.528701
2017  379.753623         NaN  281.401216         NaN  325.652778
2018  443.053221  248.367123         NaN  233.772603         NaN
2019  412.781250  230.720548         NaN         NaN  217.626741
2020  214.681818  181.672131         NaN         NaN  162.251366
Then you can plot this pivot table directly:
pivot.plot.bar(xlabel='Year', ylabel='AQI')
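As a quick check of what the pivot produces, here is a small sketch on four rows copied from the question (city names written without the quote characters). The plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Four rows taken from the question's data.
df = pd.DataFrame({
    "AQI": [349.407407, 297.024658, 325.652778, 281.401216],
    "Year": [2015, 2015, 2017, 2017],
    "City": ["Patna", "Delhi", "Patna", "Gurugram"],
})

# One row per year, one column per city; city/year pairs that do not
# occur in the data become NaN.
pivot = pd.pivot_table(data=df, index="Year", columns="City", values="AQI")
print(pivot)

# With matplotlib available, each city becomes one bar per year:
# pivot.plot.bar(xlabel="Year", ylabel="AQI")
```

Keep in mind that pivot_table aggregates with the mean by default, so if a city had several rows for the same year they would be averaged.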
Old answer
Are you looking for the mean AQI per year? If so, you can do some pandas chaining, assuming your data is in a DataFrame df:
df.groupby('Year')['AQI'].mean().plot.bar(xlabel='Year', ylabel='AQI')
I was studying python and came across something I'm not sure about. Here is the data frame.
year totalprod
0 1998 5.105093e+06
1 1999 4.706674e+06
2 2000 5.106000e+06
3 2001 4.221545e+06
4 2002 3.892386e+06
5 2003 4.122091e+06
6 2004 4.456805e+06
7 2005 4.243146e+06
8 2006 3.761902e+06
9 2007 3.600512e+06
10 2008 3.974927e+06
11 2009 3.626700e+06
12 2010 4.382350e+06
13 2011 3.680025e+06
14 2012 3.522675e+06
Before performing the scatter plot, the course was telling me to reshape x values, which are years.
Here is the code
X = prod_per_year['year']
X = X.values.reshape(-1,1)
y = prod_per_year['totalprod']
plt.scatter(X,y)
plt.show()
Why do we have to reshape before plotting? Aren't the values the same?
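To see what the reshape actually changes, here is a small NumPy-only sketch using the first three years from the table. The values stay the same and only the shape differs. A 2-D column like this is the layout that model-fitting libraries such as scikit-learn expect for their feature matrix, which may be why the course asks for it; that is an assumption, since the rest of the course code is not shown.

```python
import numpy as np

# First three years from the table above.
years = np.array([1998, 1999, 2000])

# -1 lets NumPy infer the number of rows; 1 asks for a single column.
column = years.reshape(-1, 1)

print(years.shape)   # one-dimensional: a flat sequence of 3 values
print(column.shape)  # two-dimensional: 3 rows, 1 column
print(column)
```

For the scatter plot alone, either shape should work, because matplotlib flattens the inputs before drawing.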
My data, loaded from Excel, holds information by country and year. The problem is that each year is a separate column header. For example:
Country Indicator 1950 1951 1952
Australia x 10 27 20
Australia y 7 11 8
Australia z 40 32 37
I want to turn each Indicator into a column header and collect the years into a single column. Like this:
Country year x y z
Australia 1950 10 7 40
Australia 1951 27 11 32
Australia 1952 20 8 37
And I don't know how many countries are in the column. Years = 1950 to 2019
We can reshape this with stack and unstack:
df.set_index(['Country','Indicator']).stack().unstack(level=1).reset_index()
Indicator Country level_1 x y z
0 Australia 1950 10 7 40
1 Australia 1951 27 11 32
2 Australia 1952 20 8 37
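Since the number of countries is unknown, here is a sketch of the same chain on two countries, with the auto-generated level_1 column renamed to year. The Brazil rows are invented purely for illustration.

```python
import pandas as pd

# Australia values are from the question; Brazil values are made up.
df = pd.DataFrame({
    "Country": ["Australia"] * 3 + ["Brazil"] * 3,
    "Indicator": ["x", "y", "z"] * 2,
    "1950": [10, 7, 40, 1, 2, 3],
    "1951": [27, 11, 32, 4, 5, 6],
})

out = (
    df.set_index(["Country", "Indicator"])
      .stack()                 # year columns move into the row index
      .unstack(level=1)        # Indicator values become the columns
      .reset_index()
      .rename(columns={"level_1": "year"})
)
out.columns.name = None        # drop the leftover "Indicator" label
print(out)
```

The chain does not depend on how many countries there are, because Country stays in the index throughout.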
This is just an exploration. Yoben's solution is the proper way to do it via Pandas; I'm just seeing what other possibilities there are:
# create a dictionary holding the year labels (the columns whose names contain digits)
years = {'Year': df.filter(regex=r'\d').columns}
# get the data under the year columns, one row per indicator
year_data = df.filter(regex=r'\d').to_numpy()
# create a dictionary pairing each indicator with its row of yearly values
reshaped = dict(zip(df.Indicator, year_data))
reshaped.update(years)
# create a new dataframe
pd.DataFrame(reshaped, index=df.Country)
x y z Year
Country
Australia 10 7 40 1950
Australia 27 11 32 1951
Australia 20 8 37 1952
You should never have to do this, as you can easily work within the dataframe without creating a new one. The only time you might consider it is for speed. Beyond that, it is just something to explore.
It's not exactly what you are looking for, but if your dataframe is the variable df, you can use the transpose method to invert the dataframe.
In [7]: df
Out[7]:
col1 col2 col3
0 1 True 10
1 2 False 10
2 3 False 100
3 4 True 100
Transpose
In [8]: df.T
Out[8]:
0 1 2 3
col1 1 2 3 4
col2 True False False True
col3 10 10 100 100
I think you have a multi-index dataframe so you may want to check the documentation on that.
I'm trying to turn the following dataframe (with values for county and year)
county region 2012 2013 ... 2035
A 101 10 15 ... 7
B 101 13 8 ... 11
...
into a dataframe that looks like this:
county region year sum
A 101 2012 10
A 101 2013 15
... ... ... ...
A 101 2035 7
B 101 2012 13
B 101 2013 8
B 101 2035 11
My current dataframe has 400 rows (different counties) with values for the years 2012-2035.
My manual approach would be to slice the year columns off and put each of them below the last row of the preceding year. But of course there has to be a pythonic way.
I guess I'm missing a basic pandas concept here, probably I just couldn't find the right answer to this problem because I simply didn't know how to ask the right question. Please be gentle with the newcomer.
You can use melt from pandas:
In [26]: df
Out[26]:
county region 2012 2013
0 A 101 10 15
1 B 101 13 8
In [27]: pd.melt(df, id_vars=['county','region'], var_name='year', value_name='sum')
Out[27]:
county region year sum
0 A 101 2012 10
1 B 101 2012 13
2 A 101 2013 15
3 B 101 2013 8
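If you want the rows ordered by county and then year, as in the desired output, a sketch of one way to follow up the melt is shown below. The year labels are assumed to be strings such as '2012'; if they are already integers the astype step is harmless.

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["A", "B"],
    "region": [101, 101],
    "2012": [10, 13],
    "2013": [15, 8],
})

long = (
    pd.melt(df, id_vars=["county", "region"], var_name="year", value_name="sum")
      .astype({"year": int})   # year labels become integers
      .sort_values(["county", "year"], ignore_index=True)
)
print(long)
```

Sorting is optional; melt itself already produces every county/year pair, just grouped by year first.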