I have a dataframe with 3 columns, one for hour, one for minute, and one for second, like this:
df = pd.DataFrame({'hour': [9.0, 9.0, 9.0, 10.0],
                   'min': [12.0, 13.0, 55.0, 2.0],
                   'sec': [42.0, 30.0, 12.0, 5.0]})
>>> df
hour min sec
0 9.0 12.0 42.0
1 9.0 13.0 30.0
2 9.0 55.0 12.0
3 10.0 2.0 5.0
I'm trying to combine the three columns into a new column made up of a datetime series. The goal would be to have this dataframe:
hour min sec time
0 9.0 12.0 42.0 9:12:42
1 9.0 13.0 30.0 9:13:30
2 9.0 55.0 12.0 9:55:12
3 10.0 2.0 5.0 10:02:05
So far I'm trying to use pd.to_datetime, as such:
df['time'] = pd.to_datetime(df[['hour', 'min', 'sec']],
                            format='%H:%M:%S')
But I'm getting the following ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing.
I was trying to avoid this by passing a format argument with only hour, minute, and second, but apparently that doesn't work.
A similar question was asked here, but the solutions proposed do not seem to work in this case; I'm still getting the same ValueError.
Any ideas to solve this would be appreciated!
Thanks!
[EDIT]: I also needed the method to be able to deal with NaNs, so a dataframe such as this:
df = pd.DataFrame({'hour': [9.0, 9.0, 9.0, 10.0, np.nan],
                   'min': [12.0, 13.0, 55.0, 2.0, np.nan],
                   'sec': [42.0, 30.0, 12.0, 5.0, np.nan]})
The solution proposed by @piRSquared works.
Not sure if there is a more direct way, but this works:
df['time'] = pd.to_datetime(df['hour'].astype(int).astype(str) + ':'
                            + df['min'].astype(int).astype(str) + ':'
                            + df['sec'].astype(int).astype(str),
                            format='%H:%M:%S').dt.time
hour min sec time
0 9.0 12.0 42.0 09:12:42
1 9.0 13.0 30.0 09:13:30
2 9.0 55.0 12.0 09:55:12
3 10.0 2.0 5.0 10:02:05
We can use pd.to_datetime on a dataframe with the requisite column names to create a series of datetimes.
However, the OP's initial dataframe has a 'min' column that needs to be renamed to 'minute' and a 'sec' column that needs to be renamed to 'second'.
In addition, I'll add the missing columns 'year', 'month', and 'day' using pd.DataFrame.assign.
Finally, I'll add the 'time' column with pd.DataFrame.assign again.
new = dict(year=2017, month=1, day=1)
rnm = dict(min='minute', sec='second')
df.assign(
    time=pd.to_datetime(
        df.rename(columns=rnm).assign(**new)
    ).dt.time
)
hour min sec time
0 9.0 12.0 42.0 09:12:42
1 9.0 13.0 30.0 09:13:30
2 9.0 55.0 12.0 09:55:12
3 10.0 2.0 5.0 10:02:05
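For the NaN rows added in the edit: the string-concatenation answer above raises on missing values, because astype(int) cannot represent NaN. A minimal sketch that lets NaNs propagate to NaT instead, built from timedeltas (the anchor date 2017-01-01 is an arbitrary choice; alternatively, the assign approach above should tolerate NaNs with errors='coerce' in pd.to_datetime):
import numpy as np
import pandas as pd

df = pd.DataFrame({'hour': [9.0, 9.0, 9.0, 10.0, np.nan],
                   'min': [12.0, 13.0, 55.0, 2.0, np.nan],
                   'sec': [42.0, 30.0, 12.0, 5.0, np.nan]})

# Combine the three columns as a timedelta; NaN rows become NaT.
td = (pd.to_timedelta(df['hour'], unit='h')
      + pd.to_timedelta(df['min'], unit='m')
      + pd.to_timedelta(df['sec'], unit='s'))

# Anchor on an arbitrary date and keep only the time component.
df['time'] = (pd.Timestamp('2017-01-01') + td).dt.time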
Please note that a similar question was asked a while back but never answered (see Winsorizing does not change the max value).
I am trying to winsorize a column in a dataframe using winsorize from scipy.stats.mstats. If there are no NaN values in the column then the process works correctly.
However, NaN values seem to prevent the process from working on the top (but not the bottom) of the distribution. Regardless of what value I set for nan_policy, the NaN values are set to the maximum value in the distribution. I feel like I must be setting the option incorrectly somehow.
Below is an example that can be used to reproduce both the correct winsorizing when there are no NaN values and the problem behavior I am experiencing when NaN values are present. Any help on sorting this out would be appreciated.
#Import
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
# initialise data of lists.
data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J',
                 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T'],
        'Age': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0,
                11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]}
# Create 2 DataFrames
df = pd.DataFrame(data)
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# Winsorize Age in both DataFrames
winsorize(df['Age'], limits=[0.1, 0.1], inplace=True, nan_policy='omit')
winsorize(df2['Age'], limits=[0.1, 0.1], inplace=True, nan_policy='omit')
# Check min and max values of Age in both DataFrames
print('Max/min value of Age from dataframe without NaN values')
print(df['Age'].max())
print(df['Age'].min())
print()
print('Max/min value of Age from dataframe with NaN values')
print(df2['Age'].max())
print(df2['Age'].min())
It looks like the nan_policy is being ignored. But winsorization is just clipping, so you can handle this with pandas.
def winsorize_with_pandas(s, limits):
    """
    s : pd.Series
        Series to winsorize
    limits : tuple of float
        Tuple of the percentages to cut on each side of the array,
        with respect to the number of unmasked data, as floats between 0 and 1
    """
    return s.clip(lower=s.quantile(limits[0], interpolation='lower'),
                  upper=s.quantile(1 - limits[1], interpolation='higher'))
winsorize_with_pandas(df['Age'], limits=(0.1, 0.1))
0 3.0
1 3.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 18.0
19 18.0
Name: Age, dtype: float64
winsorize_with_pandas(df2['Age'], limits=(0.1, 0.1))
0 2.0
1 2.0
2 3.0
3 4.0
4 5.0
5 NaN
6 7.0
7 8.0
8 NaN
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 19.0
Name: Age, dtype: float64
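Since the follow-up answer below needs per-group winsorizing, the same helper also drops straight into groupby().transform. A sketch, assuming a hypothetical 'Group' column:
# Winsorize within each group; quantile skips NaNs and clip leaves them untouched.
df2['Group'] = ['X'] * 10 + ['Y'] * 10   # hypothetical grouping
df2['Age_w'] = (df2.groupby('Group')['Age']
                   .transform(lambda s: winsorize_with_pandas(s, (0.1, 0.1))))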
You can consider filling the missing values with the column mean, then winsorizing and selecting only the originally non-NaN values:
df2 = pd.DataFrame(data)
# Replace two values in 2nd DataFrame with np.nan
df2.loc[5,'Age'] = np.nan
df2.loc[8,'Age'] = np.nan
# mask of non nan
_m = df2['Age'].notna()
df2.loc[_m, 'Age'] = winsorize(df2['Age'].fillna(df2['Age'].mean()), limits=[0.1, 0.1])[_m]
print(df2['Age'].max())
print(df2['Age'].min())
# 18.0
# 3.0
Or, as the other option, remove the NaNs before winsorizing:
df2.loc[_m, 'Age'] = winsorize(df2['Age'].loc[_m], limits=[0.1, 0.1])
print(df2['Age'].max())
print(df2['Age'].min())
# 19.0
# 2.0
I used the following code snippet as the basis for my problem (I needed to winsorize on a yearly basis, so I introduced two categories (A, B) in my toy data).
I got the same issue of the max values not being replaced by p99 because of the NaNs.
import pandas as pd
import numpy as np
# Getting the toy data
# To see all columns and 100 rows
pd.options.display.max_columns = None
pd.set_option('display.max_rows', 100)
df = pd.DataFrame({"Zahl":np.arange(100),"Group":[i for i in "A"*50+"B"*50]})
# Getting NaN Values for first 4 rows
df.loc[0:3,"Zahl"] = np.NaN
# Defining grouped upper/lower cutoffs (the .9/.1 quantiles, named p99/p1 here)
p99 = df.groupby("Group")["Zahl"].quantile(.9).rename("99%-Quantile")
p1 = df.groupby("Group")["Zahl"].quantile(.1).rename("1%-Quantile")
# Defining the winsorize function
def winsor(value, p99, p1):
    if (value < p99) & (value > p1):
        return value     # inside both cutoffs: keep as-is
    elif (value > p99) & (value > p1):
        return p99       # above the upper cutoff: clip down
    elif (value < p99) & (value < p1):
        return p1        # below the lower cutoff: clip up
    else:
        return value     # NaN fails every comparison and lands here
df["New"] = df.apply(lambda row: winsor(row["Zahl"],p99[row["Group"]],p1[row["Group"]]),axis=1)
The nice thing about the winsor function is that it naturally ignores NaN values!
Hope this idea helps with your problem.
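A quick check of that pass-through behaviour, with hypothetical cutoff values: NaN fails every comparison, so winsor falls through to the final else and returns NaN unchanged.
print(winsor(np.nan, p99=18.0, p1=3.0))   # nan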
I have a dataframe, which looks like
import pandas as pd
df = pd.DataFrame({'group': ['bmw', 'bmw', 'audi', 'audi', 'mb'],
                   'date': ['01/20', '02/20', '01/20', '02/20', '01/20'],
                   'value1': [1, 2, 3, 4, 5],
                   'value2': [6, 7, 8, 9, 10]})
I want to make it wider, so that there is one column per group for each of the value columns.
I tried to find a solution here, but did not find one. Could you help to create the new table?
Use pivot:
out = df.pivot(index='date', columns='group', values=['value1', 'value2'])
out.columns = out.swaplevel(axis='columns').columns.to_flat_index().map('_'.join)
>>> out.reset_index()
date audi_value1 bmw_value1 mb_value1 audi_value2 bmw_value2 mb_value2
0 01/20 3.0 1.0 5.0 8.0 6.0 10.0
1 02/20 4.0 2.0 NaN 9.0 7.0 NaN
Use DataFrame.pivot without the values argument, so that all remaining columns are pivoted, then flatten the MultiIndex:
df = df.pivot(index='date', columns='group')
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
print (df)
audi_value1 bmw_value1 mb_value1 audi_value2 bmw_value2 mb_value2
date
01/20 3.0 1.0 5.0 8.0 6.0 10.0
02/20 4.0 2.0 NaN 9.0 7.0 NaN
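As a side note, DataFrame.pivot raises a ValueError if a (date, group) pair appears more than once; in that case pivot_table, which aggregates the duplicates (mean by default), is the usual fallback:
out = df.pivot_table(index='date', columns='group',
                     values=['value1', 'value2'], aggfunc='mean')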
I'm using sort_values() to sort values in a pandas Series from largest to smallest. I wonder if there is an easy way of randomizing the order of ties. It appears that the indexes of tied values come out in descending order in this case:
s = pd.Series([3.0, 15.0, 1.0, 22.0, 11.0, 12.0, 2.0, 5.0, 3.0, 12.0, 2.0, 3.0])
s.sort_values(ascending=False)
Out:
3 22.0
1 15.0
9 12.0
5 12.0
4 11.0
7 5.0
11 3.0
8 3.0
0 3.0
10 2.0
6 2.0
2 1.0
I have read in the documentation that different sort kinds can be given as an argument: 'quicksort', 'mergesort', 'heapsort'. However, as far as I understand, none of them will randomize the order of ties.
I guess I could write a custom function, but I'm curious to know if something already exists.
You can first shuffle the series by taking a random 100% sample and then sort it:
s.sample(frac=1).sort_values(ascending=False)
You can also pass the random_state argument to start off the random number generator as needed.
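A sketch of that, with one caveat: the default quicksort is not stable, so pass kind='mergesort' if the shuffled order of ties must be preserved exactly.
result = (s.sample(frac=1, random_state=42)                   # reproducible shuffle
           .sort_values(ascending=False, kind='mergesort'))   # stable sort keeps tie order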
I have some features that I want to write to some csv files. I want to use pandas for this approach if possible.
I am following the instructions here and have created some dummy data to check it out. Basically there are some activities with a random number of features belonging to them.
import io
import pandas as pd
data = io.StringIO('''Activity,id,value,value,value,value,value,value,value,value,value
Run,1,1,2,2,5,6,4,3,2,1
Run,1,2,4,4,10,12,8,6,4,2
Stand,2,1.5,3.,3.,7.5,9.,6.,4.5,3.,1.5
Sit,3,0.5,1.,1.,2.5,3.,2.,1.5,1.,0.5
Sit,3,0.6,1.2,1.2,3.,3.6,2.4,1.8,1.2,0.6
Run, 2, 0.8, 1.6, 1.6, 4. , 4.8, 3.2, 2.4, 1.6, 0.8
''')
df_unindexed = pd.read_csv(data)
df = df_unindexed.set_index(['Activity', 'id'])
When I run:
df.xs('Run')
I get
value value.1 value.2 value.3 value.4 value.5 value.6 value.7 \
id
1 1.0 2.0 2.0 5.0 6.0 4.0 3.0 2.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0 4.0
2 0.8 1.6 1.6 4.0 4.8 3.2 2.4 1.6
value.8
id
1 1.0
1 2.0
2 0.8
which is almost what I want, that is, all Run activities. I want to remove the first row and first column, i.e. the header and the id column. How do I achieve this?
Also, a second question: when I want only one activity, how do I get it?
When using
idx = pd.IndexSlice
df.loc[idx['Run', 1], :]
gives
value value.1 value.2 value.3 value.4 value.5 value.6 \
Activity id
Run 1 1.0 2.0 2.0 5.0 6.0 4.0 3.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0
value.7 value.8
Activity id
Run 1 2.0 1.0
1 4.0 2.0
but slicing does not work as I would expect. For example, trying
df.loc[idx['Run', 1], 2:11]
instead produces an error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>
So, how do I get my features in this case?
P.S. If it's not clear, I am new to Pandas, so be gentle. Also, the id column can be changed to be unique to each activity or to the whole dataset if that makes things easier, etc.
You can use a little hack: get the column names by position, because iloc for MultiIndex is not yet supported:
print (df.columns[2:11])
Index(['value.2', 'value.3', 'value.4', 'value.5', 'value.6', 'value.7',
'value.8'],
dtype='object')
idx = pd.IndexSlice
print (df.loc[idx['Run', 1], df.columns[2:11]])
value.2 value.3 value.4 value.5 value.6 value.7 value.8
Activity id
Run 1 2.0 5.0 6.0 4.0 3.0 2.0 1.0
1 4.0 10.0 12.0 8.0 6.0 4.0 2.0
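Alternatively, a sketch of a two-step selection: select the rows by label first, then slice the columns by position on the result.
idx = pd.IndexSlice
subset = df.loc[idx['Run', 1], :].iloc[:, 2:11]   # rows by label, then columns 2-10 by position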
If you want to save the file to CSV without the index and header:
df.xs('Run').to_csv(file, index=False, header=None)
I mostly look at https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer when I'm stuck with these kinds of issues.
Without any testing I think you can remove rows and columns like
df = df.drop(['rowindex'], axis=0)
df = df.drop(['colname'], axis=1)
Avoid the problem by recognizing the index columns at CSV read time:
df = pd.read_csv(data,
                 header=0,           # read the first row in as the header row
                 index_col=['id'])   # or index_col=0, to pick the index column