The task is to transform the below table
import pandas as pd
import numpy as np
index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])
df.groupby(by=[df.index.year, df.index.month]).sum()
In[1]: df
Out[1]:
values
2000 1 1.181000
2 -8.005783
3 6.590623
4 -6.266232
5 1.266315
6 0.384050
7 -1.418357
8 -3.132253
9 0.005496
10 -6.646101
11 9.616482
12 3.960872
2001 1 -0.989869
2 -2.845278
3 -1.518746
4 2.984735
5 -2.616795
6 8.360319
7 5.659576
8 0.279863
9 -5.220678
10 5.077400
11 1.332519
such that it looks like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3
Additionally I need to add an extra column which sums the yearly values like this
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Year
2000 1.2 -8.0 6.6 -6.3 1.2 0.4 -1.4 -3.1 0.0 -6.6 9.6 3.9 -2.5
2001 -0.9 -2.8 -1.5 3.0 -2.6 8.3 5.7 0.3 -5.2 5.1 1.3 10.7
Is there a quick pandas pivot-style way to solve this?
use strftime('%b') in your groupby
df['values'].groupby([df.index.year, df.index.strftime('%b')]).sum().unstack()
To preserve order of months
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum().unstack()
With 'Year' at the end:
df['values'].groupby([df.index.year, df.index.strftime('%b')], sort=False).sum() \
    .unstack().assign(Year=df['values'].groupby(df.index.year).sum())
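Since the question asks for a pivot-style approach, here is a pivot_table sketch that should give the same table; the explicit months list and the helper year/month column names are my own choices, not from the answer above:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
out = (df.assign(year=df.index.year, month=df.index.strftime('%b'))
         .pivot_table(index='year', columns='month', values='values', aggfunc='sum')
         .reindex(columns=months))   # force calendar order on the columns
out['Year'] = out.sum(axis=1)        # yearly total; a missing month (e.g. Dec 2001) is skipped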
You can do something like this:
import pandas as pd
import numpy as np

index = pd.date_range('2000-1-1', periods=700, freq='D')
df = pd.DataFrame(np.random.randn(700), index=index, columns=["values"])

# replace the DatetimeIndex with a (year, month, day) string MultiIndex
df.index = pd.MultiIndex.from_arrays(
    [df.index.strftime("%Y"), df.index.strftime("%b"), df.index.strftime("%d")])

# sum per (year, month) and pivot the month level into columns;
# sort=False keeps the months in calendar order
df = df['values'].groupby(level=[0, 1], sort=False).sum().unstack(-1)
df['Year'] = df.sum(axis=1)
df
The only change needed is to unstack the DataFrame to convert it into wide format. Once you have the integer month numbers, you can convert them to datetimes by passing the %m directive as the format, and then retrieve their abbreviated string representation with strftime.
Calculate the yearly total by taking the sum across columns with axis=1.
np.random.seed(314)  # for reproducibility; set before generating df
fr = df.groupby([df.index.year, df.index.month]).sum().unstack(fill_value=0)
fr.columns = pd.to_datetime(fr.columns.droplevel(0), format='%m').strftime('%b')
fr['Year'] = fr.sum(axis=1)
You can add the extra Year column with
df['Year'] = df.sum(axis=1)
This sums the dataframe row-wise (because of axis=1) and stores the result in a new column.
Related
I have two datasets. After merging them horizontally and sorting the columns with the following code, I get the dataset below:
df =
    X    Y
  5.2  6.5
  3.3  7.6
df_year =
     X     Y
  2014  2014
  2015  2015
df_all_cols = pd.concat([df, df_year], axis = 1)
sorted_columns = sorted(df_all_cols.columns)
df_all_cols_sort = df_all_cols[sorted_columns]
    X     X    Y     Y
  5.2  2014  6.5  2014
  3.3  2015  7.6  2015
I am trying to make my data look like this, by stacking the dataset every 2 columns.
name  year  Variable
 5.2  2014  X
 3.3  2015  X
 6.5  2014  Y
 7.6  2015  Y
One approach could be as follows:
Apply df.stack to both dfs before feeding them to pd.concat. The result at this stage is:
0 1
0 X 5.2 2014
Y 6.5 2014
1 X 3.3 2015
Y 7.6 2015
Next, use df.sort_index to sort on the original column names (i.e. "X, Y", now appearing as index level 1), and get rid of index level 0 (df.droplevel).
Finally, use df.reset_index with drop=False to insert index as a column and rename all the columns with df.rename.
res = (pd.concat([df.stack(),df_year.stack()], axis=1)
.sort_index(level=1)
.droplevel(0)
.reset_index(drop=False)
.rename(columns={'index':'Variable',0:'name',1:'year'})
)
# change the order of cols
res = res.iloc[:, [1,2,0]]
print(res)
name year Variable
0 5.2 2014 X
1 3.3 2015 X
2 6.5 2014 Y
3 7.6 2015 Y
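As a side note, a melt-based sketch (my own alternative, using the same df and df_year as above) reaches the same shape without stacking:
res = pd.concat(
    [df.melt(var_name='Variable', value_name='name'),
     df_year.melt(value_name='year')['year']],
    axis=1,
)[['name', 'year', 'Variable']]
print(res)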
I'm trying to replicate some of R4DS's dplyr exercises using Python's pandas, with the nycflights13.flights dataset. What I want to do is select, from that dataset:
Columns from year through day (inclusive);
All columns that end with "delay";
The distance and air_time columns
In the book, Hadley uses the following syntax:
library("tidyverse")
library("nycflights13")
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
In pandas, I came up with the following "solution":
import pandas as pd
from nycflights13 import flights
flights_sml = pd.concat([
flights.loc[:, 'year':'day'],
flights.loc[:, flights.columns.str.endswith("delay")],
flights.distance,
flights.air_time,
], axis=1)
Another possible implementation:
flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)
But I'm sure this is not the idiomatic way to write such a DataFrame operation. I dug around, but haven't found anything in the pandas API that fits this situation.
You are correct. This will create multiple dataframes/series and then concatenate them together, resulting in a lot of extra work. Instead, you can create a list of the columns you want to use and then simply select those.
For example (keeping the same column order):
cols = ['year', 'month', 'day'] + [col for col in flights.columns if col.endswith('delay')] + ['distance', 'air_time']
flights_sml = flights[cols]
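For reference, this keeps the same column order as the dplyr version; a quick check of the selection (the expected names below match the outputs shown in the other answers):
print(flights_sml.columns.tolist())
# expected: ['year', 'month', 'day', 'dep_delay', 'arr_delay', 'distance', 'air_time']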
According to the dataset's column info, we can use str.contains:
df.loc[:, df.columns.str.contains('year|month|day|delay|distance|air_time')]
year month day dep_delay arr_delay air_time distance
0 2013 1 1 2.0 11.0 227.0 1400
1 2013 1 1 4.0 20.0 227.0 1416
2 2013 1 1 2.0 33.0 160.0 1089
3 2013 1 1 -1.0 -18.0 183.0 1576
4 2013 1 1 -6.0 -25.0 116.0 762
One option is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
from nycflights13 import flights
flights_sml = flights.select_columns(
slice('year', 'day'),
'*delay',
'distance',
'air_time'
)
flights_sml.head()
year month day dep_delay arr_delay distance air_time
0 2013 1 1 2.0 11.0 1400 227.0
1 2013 1 1 4.0 20.0 1416 227.0
2 2013 1 1 2.0 33.0 1089 160.0
3 2013 1 1 -1.0 -18.0 1576 183.0
4 2013 1 1 -6.0 -25.0 762 116.0
I have two pd.dataframes:
df1:
Year Replaced Not_replaced
2015 1.5 0.1
2016 1.6 0.3
2017 2.1 0.1
2018 2.6 0.5
df2:
Year HI LO RF
2015 3.2 2.9 3.0
2016 3.0 2.8 2.9
2017 2.7 2.5 2.6
2018 2.6 2.2 2.3
I need to create a third df3 by using the following equations:
df3['column1'] = df1['Replaced'] - df1['Not_replaced'] + df2['HI']
df3['column2'] = df1['Replaced'] - df1['Not_replaced'] + df2['LO']
df3['column3'] = df1['Replaced'] - df1['Not_replaced'] + df2['RF']
I can merge the two dataframes and manually create the 3 new columns one by one, but I can't figure out how to use a loop to create the results.
You can create an empty dataframe & fill it with values while looping
(Note: col_names & df3.columns must be of the same length)
df3 = pd.DataFrame(columns = ['column1','column2','column3'])
col_names = ["HI", "LO","RF"]
for incol,df3column in zip(col_names,df3.columns):
df3[df3column] = df1['Replaced']-df1['Not_replaced']+df2[incol]
print(df3)
output
column1 column2 column3
0 4.6 4.3 4.4
1 4.3 4.1 4.2
2 4.7 4.5 4.6
3 4.7 4.3 4.4
For the for loop, I would first merge df1 and df2 to create a new df, called df3. Then I would create a list of the names of the columns you want to iterate through:
col_names = ["HI", "LO", "RF"]
for col in col_names:
    df3[f"column_{col}"] = df3['Replaced'] - df3['Not_replaced'] + df3[col]
I've got a very simple problem, but I can't seem to get it right.
Consider this dataframe
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B'],
                   'time': [20, 21, 22, 20, 21],
                   'price': [3.1, 3.5, 3.0, 2.3, 2.1]})
group price time
0 A 3.1 20
1 A 3.5 21
2 A 3.0 22
3 B 2.3 20
4 B 2.1 21
Now I want to take the standard deviation of the price of each group, but conditional on it being before time 22 (let's call it early_std). I want to then create a variable with that information.
The expected result is
group price time early_std
A 3.1 20 0.282843
A 3.5 21 0.282843
A 3.0 22 0.282843
B 2.3 20 0.141421
B 2.1 21 0.141421
This is what I tried:
df['early_std'] = df[df.time < 22].groupby('group').\
price.transform(lambda x : x.std())
This almost works but it gives a missing value on time = 22:
group price time early_std
0 A 3.1 20 0.282843
1 A 3.5 21 0.282843
2 A 3.0 22 NaN
3 B 2.3 20 0.141421
4 B 2.1 21 0.141421
I also tried with apply and I think it works, but I need to reset the index, which is something I'd rather avoid (I have a large dataset and I need to do this repeatedly)
early_std2 = df[df.time < 22].groupby('group').price.std()
df.set_index('group', inplace=True)
df['early_std2'] = early_std2
price time early_std early_std2
group
A 3.1 20 0.282843 0.282843
A 3.5 21 0.282843 0.282843
A 3.0 22 NaN 0.282843
B 2.3 20 0.141421 0.141421
B 2.1 21 0.141421 0.141421
Thanks!
It looks like you only need to add fillna() to your first code to expand the std values:
df['early_std'] = df[df.time < 22].groupby('group')['price'].transform(pd.Series.std)
df['early_std'] = df.groupby('group')['early_std'].apply(lambda x: x.fillna(x.max()))
df
To get:
group price time early_std
0 A 3.1 20 0.283
1 A 3.5 21 0.283
2 A 3.0 22 0.283
3 B 2.3 20 0.141
4 B 2.1 21 0.141
EDIT: I have changed ffill to a more general fillna, but you could also use chained .bfill().ffill() to achieve the same result.
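As a further one-step variation (my own sketch, not from the answer above): mask the prices at or after time 22 with where, then let the grouped std skip the NaNs while transform broadcasts the result to every row:
df['early_std'] = (df['price'].where(df['time'] < 22)
                     .groupby(df['group'])
                     .transform('std'))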
Your second approach is very close to what you are trying to achieve.
This may not be the most efficient method but it worked for me:
df['early_std'] = 0
for index, value in early_std2.items():
    df.loc[df.group == index, 'early_std'] = value
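A slightly simpler variant of the same idea (my sketch, reusing the early_std2 Series from the question) maps each group label to its std directly, so neither a loop nor set_index is needed:
df['early_std2'] = df['group'].map(early_std2)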
I'm reading a csv file with Pandas. The format is:
Date Time x1 x2 x3 x4 x5
3/7/2012 11:09:22 13.5 2.3 0.4 7.3 6.4
12.6 3.4 9.0 3.0 7.0
3.6 4.4 8.0 6.0 5.0
10.6 3.5 1.0 3.0 8.0
...
3/7/2012 11:09:23 10.5 23.2 0.3 7.8 4.4
11.6 13.4 19.0 13.0 17.0
...
As you can see, not every row has a timestamp. Every row without a timestamp is from the same 1-second interval as the closest row above it that does have a timestamp.
I am trying to do 3 things:
1. combine the Date and Time columns to get a single timestamp column.
2. convert that column to have units of seconds.
3. fill empty cells to have the appropriate timestamp.
The desired end result is an array with the timestamp, in seconds, at each row.
I am not sure how to quickly convert the timestamps into units of seconds, other than with a slow for loop and the Python built-in time.mktime method.
Then, when I fill in the missing timestamp values, the cells in the Date and Time columns that did not have a timestamp each get a "nan" value, and when merged they produce a cell with the value "nan nan". When I then use the fillna() method, it doesn't interpret "nan nan" as a NaN.
I am using the following code to get the problem result (not including the part of trying to convert to seconds):
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',', parse_dates={'CorrectTime':[0,1]}, usecols=[0,1,2,4,6], names=['Date','Time','x1','x3','x5'])
df.fillna(method='ffill', axis=0, inplace=True)
Thanks for your help.
Assuming you want seconds since Jan 1, 1900...
import pandas
from io import StringIO
import datetime
data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pandas.read_csv(data, parse_dates=['Date']).fillna(method='ffill')
def dealwithdates(row):
datestring = row['Date'].strftime('%Y-%m-%d')
dtstring = '{} {}'.format(datestring, row['Time'])
date = datetime.datetime.strptime(dtstring, '%Y-%m-%d %H:%M:%S')
refdate = datetime.datetime(1900, 1, 1)
return (date - refdate).total_seconds()
df['ordinal'] = df.apply(dealwithdates, axis=1)
print(df)
Date Time x1 x2 x3 x4 x5 ordinal
0 2012-03-07 11:09:22 13.5 2.3 0.4 7.3 6.4 3540107362
1 2012-03-07 11:09:22 12.6 3.4 9.0 3.0 7.0 3540107362
2 2012-03-07 11:09:22 3.6 4.4 8.0 6.0 5.0 3540107362
3 2012-03-07 11:09:22 10.6 3.5 1.0 3.0 8.0 3540107362
4 2012-03-07 11:09:23 10.5 23.2 0.3 7.8 4.4 3540107363
5 2012-03-07 11:09:23 11.6 13.4 19.0 13.0 17.0 3540107363
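A vectorized alternative sketch (my own variation, not part of the answer above): build a single datetime column and take the offset from the 1900-01-01 reference directly, avoiding the row-wise apply:
# assumes df is the frame produced by read_csv above (Date parsed, gaps forward-filled)
stamps = pandas.to_datetime(df['Date'].dt.strftime('%Y-%m-%d') + ' ' + df['Time'])
df['ordinal'] = (stamps - pandas.Timestamp('1900-01-01')).dt.total_seconds()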