pandas grouping aggregation across multiple columns in a dataframe - python

I would like to derive the min and max for each year, region, and weathertype from a pandas dataframe. The dataframe looks like this:
year jan feb mar apr may jun aug sept oct nov dec region weathertype
1862 42.0 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 Anglia Rain
1863 58.3 15.7 24.0 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32.0 Anglia Rain
1864 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 Anglia Rain
What is needed are two new columns giving the min and max for each region and year, effectively grouping across rows, with the result added to the existing dataframe as two new columns:
year min max
1862 42.0 81.9
1863 15.7 77.5
1864 12.3 81.5
My approach has been to use this code:
weather_data['max_value'] = weather_data.groupby(['year','region','weathertype'])['jan','feb','mar','apr','may','jun','jul','aug','sep','oct', 'nov','dec'].transform(np.min)
However, this produces a non-aggregated subset of the data, which is a duplication of the existing frame, resulting in the following error:
Wrong number of items passed 12, placement implies 1
I then melted the dataframe into a long, rather than wide format:
year region Option_1 variable value
1862 Anglia Rain jan 42.0
1863 Anglia Rain jan 58.3
1864 Anglia Rain jan 20.5
I used this code to produce what I needed:
weather_data['min_value'] = weather_data['value'].groupby(weather_data['region','Option_1']).transform(np.min)
but this produces a KeyError, because weather_data['region','Option_1'] looks up a single tuple key. Passing a list of columns instead:
weather_data[['region','Option_1']]
produces:
Grouper for <class 'pandas.core.frame.DataFrame'> not 1-dimensional
Any suggestions at this point are gratefully received.

I would do:
(df.set_index(['year','region','weathertype'])
   .assign(min=lambda x: x.min(axis=1),
           max=lambda x: x.max(axis=1))
   .reset_index())
Output:
year region weathertype jan feb mar apr may jun aug sept oct nov dec min max
-- ------ -------- ------------- ----- ----- ----- ----- ----- ----- ----- ------ ----- ----- ----- ----- -----
0 1862 Anglia Rain 42 8.2 82.7 46.7 72.7 61.6 81.9 45.9 76.8 34.9 44.8 8.2 82.7
1 1863 Anglia Rain 58.3 15.7 24 17.5 27.9 75.2 38.5 71.5 71.7 77.5 32 15.7 77.5
2 1864 Anglia Rain 20.5 30.3 81.5 13.8 59.5 26.5 12.3 19.2 42.1 25.5 79.9 12.3 81.5
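If a (year, region, weathertype) combination could ever span more than one row, the row-wise reduction can be combined with groupby.transform so that every row carries its group's extremes. A minimal sketch, assuming the month columns are named exactly as in the frame above:
month_cols = ['jan','feb','mar','apr','may','jun','aug','sept','oct','nov','dec']
keys = ['year','region','weathertype']

# reduce across the month columns per row, then broadcast the
# group-wide min/max back onto every row of each group
row_min = df[month_cols].min(axis=1)
row_max = df[month_cols].max(axis=1)
df['min'] = row_min.groupby([df[k] for k in keys]).transform('min')
df['max'] = row_max.groupby([df[k] for k in keys]).transform('max')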

Related

Transforming yearwise data using pandas

I have a dataframe that looks like this:
Temp
Date
1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6
1981-01-05 15.8
... ...
1981-12-27 15.5
1981-12-28 13.3
1981-12-29 15.6
1981-12-30 15.2
1981-12-31 17.4
365 rows × 1 columns
And I want to transform it so that it looks like:
1981 1982 1983 1984 1985 1986 1987 1988 1989 1990
0 20.7 17.0 18.4 19.5 13.3 12.9 12.3 15.3 14.3 14.8
1 17.9 15.0 15.0 17.1 15.2 13.8 13.8 14.3 17.4 13.3
2 18.8 13.5 10.9 17.1 13.1 10.6 15.3 13.5 18.5 15.6
3 14.6 15.2 11.4 12.0 12.7 12.6 15.6 15.0 16.8 14.5
4 15.8 13.0 14.8 11.0 14.6 13.7 16.2 13.6 11.5 14.3
... ... ... ... ... ... ... ... ... ... ...
360 15.5 15.3 13.9 12.2 11.5 14.6 16.2 9.5 13.3 14.0
361 13.3 16.3 11.1 12.0 10.8 14.2 14.2 12.9 11.7 13.6
362 15.6 15.8 16.1 12.6 12.0 13.2 14.3 12.9 10.4 13.5
363 15.2 17.7 20.4 16.0 16.3 11.7 13.3 14.8 14.4 15.7
364 17.4 16.3 18.0 16.4 14.4 17.2 16.7 14.1 12.7 13.0
My attempt:
groups=df.groupby(df.index.year)
keys=groups.groups.keys()
years=pd.DataFrame()
for key in keys:
    years[key]=groups.get_group(key)['Temp'].values
Question:
The above code gives me my desired output, but is there a more efficient way of transforming this?
I can't post the whole dataset because there are 3650 rows in the dataframe, so you can download the csv file (60.6 kb) for testing from here
Try grabbing the year and dayofyear from the index then pivoting:
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1982-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])
# Get Year and Day of Year
df['year'] = df.index.year
df['day'] = df.index.dayofyear
# Pivot
p = df.pivot(index='day', columns='year', values='Temp')
print(p)
p:
year 1981 1982
day
1 38 85
2 51 70
3 76 61
4 71 47
5 44 76
.. ... ...
361 23 22
362 42 64
363 84 22
364 26 56
365 67 73
Run-Time via Timeit
import timeit
setup = '''
import pandas as pd
import numpy as np
# Create Random Data
dr = pd.date_range(pd.to_datetime("1981-01-01"), pd.to_datetime("1983-12-31"))
df = pd.DataFrame(np.random.randint(1, 100, size=dr.shape),
                  index=dr,
                  columns=['Temp'])'''
pivot = '''
df['year'] = df.index.year
df['day'] = df.index.dayofyear
p = df.pivot(index='day', columns='year', values='Temp')'''
groupby_for = '''
groups=df.groupby(df.index.year)
keys=groups.groups.keys()
years=pd.DataFrame()
for key in keys:
    years[key]=groups.get_group(key)['Temp'].values'''
if __name__ == '__main__':
print("Pivot")
print(timeit.timeit(setup=setup, stmt=pivot, number=1000))
print("Groupby For")
print(timeit.timeit(setup=setup, stmt=groupby_for, number=1000))
Pivot
1.598973
Groupby For
2.3967995999999996
*Additional note: the groupby/for option will not work across leap years, as it cannot handle 1984 having 366 days instead of 365. Pivot will work regardless.
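A leap-year-tolerant variant of the groupby approach, as a rough sketch (assuming shorter years should simply pad with NaN): build one Series per year and let concat align them by position.
# each year becomes a column; 365-day years get NaN in row 365
years = pd.concat(
    {year: grp['Temp'].reset_index(drop=True)
     for year, grp in df.groupby(df.index.year)},
    axis=1,
)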

30 Day distance between dates in datetime64[ns] column

I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30 day difference between each other. I know how to add a specific day or month to a datetime object with something like
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains on the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
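Note the sample above is sorted newest-first; if the 30-day spacing should run backwards from the first row instead, date_range also accepts a negative frequency. A sketch under that assumption:
df['date'] = pd.date_range(start=df['date'].iloc[0], periods=len(df), freq='-30D')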
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this, so I'm not sure it works as smoothly as I thought it would =).
You can convert your first day to an ordinal, add 30*i to it, and then convert back:
import datetime

first_day = df.iloc[0]['date_column'].toordinal()
df['date'] = [datetime.date.fromordinal(first_day + 30 * i) for i in range(len(df))]
Note that fromordinal returns a datetime.date, not a pandas Timestamp.

Select rows that lie within datetime intervals

I'm trying to compare two dataframes and drop rows from the first dataframe that aren't between the dates in the second dataframe (or...selecting those rows that are between the dates in the 2nd dataframe). The selections should be inclusive. This might be really simple but it's just not clicking for me right now.
Example data is below. Dataframe 1 can be generated using daily data starting July 1 2018 and ending November 30 2018, with random numbers in the 'number' column. The ... in dataframe 1 are used to show skipped data; the data is there in the real dataframe.
Dataframe 1:
Number
Date
2018-07-01 15.2
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-07-05 19.1
...
2018-09-15 30.4
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-09-19 23.4
...
2018-11-01 30.8
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
2018-11-05 43.8
Dataframe 2:
Change
Date
2018-07-02 Start
2018-07-04 End
2018-09-16 Start
2018-09-18 End
2018-11-02 Start
2018-11-04 End
With the example above, the output should be:
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can try this; it assumes the Start and End rows alternate and the frame is sorted. Label slicing df[i:j] on a DatetimeIndex is inclusive of both endpoints, which gives the inclusive selection you want.
df3 = pd.concat([df[i:j] for i, j in zip(df2.loc[df2['Change']=='Start'].index, df2.loc[df2['Change']=='End'].index)])
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can build an IntervalIndex from df2's index and search in logarithmic time.
df2.index = pd.to_datetime(df2.index)
idx = pd.IntervalIndex.from_arrays(df2.index[df2.Change == 'Start'],
                                   df2.index[df2.Change == 'End'],
                                   closed='both')  # inclusive on both ends
# get_indexer returns -1 for dates that fall inside no interval
df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
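One caveat: get_indexer assumes the intervals do not overlap and will raise an error otherwise. If the Start/End pairs could ever overlap, a hedged fallback is an explicit mask per interval:
import numpy as np

# OR together one inclusive boolean mask per interval
target = pd.to_datetime(df1.index)
mask = np.zeros(len(target), dtype=bool)
for iv in idx:
    mask |= (target >= iv.left) & (target <= iv.right)
df1[mask]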

Reshaping Pandas DataFrame: switch columns to indices and repeated values as columns

I've had a really tough time figuring out how to reshape this DataFrame. Sorry about the wording of the question, this problem seems a bit specific.
I have data on several countries along with a column of 6 repeating features and the year this data was recorded. It looks something like this (minus some features and columns):
Country Feature 2005 2006 2007 2008 2009
0 Afghanistan Age Dependency 99.0 99.5 100.0 100.2 100.1
1 Afghanistan Birth Rate 44.9 43.9 42.8 41.6 40.3
2 Afghanistan Death Rate 10.7 10.4 10.1 9.8 9.5
3 Albania Age Dependency 53.5 52.2 50.9 49.7 48.7
4 Albania Birth Rate 12.3 11.9 11.6 11.5 11.6
5 Albania Death Rate 5.95 6.13 6.32 6.51 6.68
There doesn't seem to be any way to make pivot_table() work in this situation and I'm having trouble finding what other steps I can take to make it look how I want:
Age Dependency Birth Rate Death Rate
Afghanistan 2005 99.0 44.9 10.7
2006 99.5 43.9 10.4
2007 100.0 42.8 10.1
2008 100.2 41.6 9.8
2009 100.1 40.3 9.5
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
Where the unique values of the 'Feature' column each become a column and the year columns each become part of a MultiIndex with the country. Any help is appreciated, thank you!
EDIT: I checked the "duplicate" but I don't see how that question is the same as this one. How would I place the repeated values within my Feature column as unique columns while at the same time moving the years to become a MultiIndex with the countries? Sorry if I'm just not getting something.
Use melt, then reshape with set_index and unstack:
df = (df.melt(['Country','Feature'], var_name='year')
        .set_index(['Country','year','Feature'])['value']
        .unstack())
print (df)
Feature Age Dependency Birth Rate Death Rate
Country year
Afghanistan 2005 99.0 44.9 10.70
2006 99.5 43.9 10.40
2007 100.0 42.8 10.10
2008 100.2 41.6 9.80
2009 100.1 40.3 9.50
Albania 2005 53.5 12.3 5.95
2006 52.2 11.9 6.13
2007 50.9 11.6 6.32
2008 49.7 11.5 6.51
2009 48.7 11.6 6.68
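pivot_table can get there too once the years are melted into a single column; a sketch, assuming there are no duplicate Country/Feature/year triples:
df = (df.melt(['Country','Feature'], var_name='year')
        .pivot_table(index=['Country','year'], columns='Feature',
                     values='value', aggfunc='first'))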

Read excel row by row and do transpose, Python 3.6

I have an Excel file with the data below, and I want to read each block where the first column contains 'Area', transpose it, then move on to the next block whose first column contains 'Area' and transpose that as well.
The data holds three tables in total; I want to split them and then transpose each one. The first column contains an area code and the other column names are years.
Area 1980 1981 1982 1983
AU 33.7 38.8 40.2 42.5
BE 54.6 51.6 49.7 48.9
FI 43.2 49.6 58.8 71.1
Area 1979 1980 1981 1982
AU 29.8 33.7 38.8 40.2
BE 54.2 54.6 51.6 49.7
CA 39.4 44.3 50.6 48
Area 1978 1979 1980 1981
DK 58 57.2 54.5 53.2
FI 37.7 43.2 49.6 58.8
FR 41.6 49.9 55.4 58.5
Final Result expected:
Area variable value
AU 1980 33.7
other values
How to achieve this?
Assuming that we have the following list of DataFrames:
In [106]: dfs
Out[106]:
[ Area 1980 1981 1982 1983
0 AU 33.7 38.8 40.2 42.5
1 BE 54.6 51.6 49.7 48.9
2 FI 43.2 49.6 58.8 71.1, Area 1979 1980 1981 1982
0 AU 29.8 33.7 38.8 40.2
1 BE 54.2 54.6 51.6 49.7
2 CA 39.4 44.3 50.6 48.0, Area 1978 1979 1980 1981
0 DK 58.0 57.2 54.5 53.2
1 FI 37.7 43.2 49.6 58.8
2 FR 41.6 49.9 55.4 58.5]
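If the three tables actually sit in a single sheet, a rough sketch for building dfs (assuming every block starts with a row whose first cell is 'Area'; 'data.xlsx' is a stand-in for the real filename):
raw = pd.read_excel('data.xlsx', header=None)   # read everything, no header row
starts = raw.index[raw[0] == 'Area'].tolist()   # each 'Area' row opens a block
bounds = starts + [len(raw)]
dfs = []
for i, j in zip(bounds, bounds[1:]):
    block = raw.iloc[i:j].copy()
    block.columns = block.iloc[0]               # promote the 'Area ...' row to headers
    dfs.append(block.iloc[1:].reset_index(drop=True))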
first we concatenate them horizontally:
In [107]: df = pd.concat([x.set_index('Area') for x in dfs], axis=1)
In [108]: df
Out[108]:
1980 1981 1982 1983 1979 1980 1981 1982 1978 1979 1980 1981
AU 33.7 38.8 40.2 42.5 29.8 33.7 38.8 40.2 NaN NaN NaN NaN
BE 54.6 51.6 49.7 48.9 54.2 54.6 51.6 49.7 NaN NaN NaN NaN
CA NaN NaN NaN NaN 39.4 44.3 50.6 48.0 NaN NaN NaN NaN
DK NaN NaN NaN NaN NaN NaN NaN NaN 58.0 57.2 54.5 53.2
FI 43.2 49.6 58.8 71.1 NaN NaN NaN NaN 37.7 43.2 49.6 58.8
FR NaN NaN NaN NaN NaN NaN NaN NaN 41.6 49.9 55.4 58.5
now we can stack DF and rename columns:
In [109]: df.stack().reset_index() \
              .rename(columns={'level_0':'Area','level_1':'variable',0:'value'})
Out[109]:
Area variable value
0 AU 1980 33.7
1 AU 1981 38.8
2 AU 1982 40.2
3 AU 1983 42.5
4 AU 1979 29.8
5 AU 1980 33.7
6 AU 1981 38.8
7 AU 1982 40.2
8 BE 1980 54.6
9 BE 1981 51.6
10 BE 1982 49.7
11 BE 1983 48.9
12 BE 1979 54.2
13 BE 1980 54.6
14 BE 1981 51.6
15 BE 1982 49.7
16 CA 1979 39.4
17 CA 1980 44.3
18 CA 1981 50.6
19 CA 1982 48.0
20 DK 1978 58.0
21 DK 1979 57.2
22 DK 1980 54.5
23 DK 1981 53.2
24 FI 1980 43.2
25 FI 1981 49.6
26 FI 1982 58.8
27 FI 1983 71.1
28 FI 1978 37.7
29 FI 1979 43.2
30 FI 1980 49.6
31 FI 1981 58.8
32 FR 1978 41.6
33 FR 1979 49.9
34 FR 1980 55.4
35 FR 1981 58.5
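Melting each block gets to the expected Area/variable/value layout directly; a sketch, assuming dfs as above:
long = pd.concat(
    [d.melt(id_vars='Area', var_name='variable', value_name='value') for d in dfs],
    ignore_index=True,
)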
What have you tried thus far?
Pandas is a really good library to use for data parsing, etc.
You could implement something along the lines of the following:
import pandas as pd

df = pd.read_csv(csv_filename, header=None)  # DataFrame.from_csv is deprecated

def create_new_tables(df):
    # assumes each block is a header row ('Area', years, ...) plus three data rows
    tables = []
    start = 0
    while start < len(df):
        end = start + 4
        # create a new dataframe from the relevant rows and transpose it
        block = df.iloc[start:end]
        tables.append(block.set_index(0).transpose())
        start = end
    return tables
