Pandas Idxmax on a date-value DataFrame - python

Given this DataFrame:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({'Date': ['20/03/17 10:30:34', '20/03/17 10:31:24', '20/03/17 10:34:34'],
                   'Value': [4, 7, 5]})
df['Date'] = pd.to_datetime(df.Date)
df
Out[53]:
                 Date  Value
0 2017-03-20 10:30:34      4
1 2017-03-20 10:31:24      7
2 2017-03-20 10:34:34      5
I am trying to extract the max value and its index. I can get the max value with df.Value.max(), but when I use df.idxmax() to get the index of that value I get a TypeError:
TypeError: float() argument must be a string or a number
Is there any other way to get the index of the max value of a DataFrame? (Or any way to correct this one?)

That is because it should be:
df.Value.idxmax()
This then returns 1.

If you only care about the Value column, you can use:
df.Value.idxmax()
>>> 1
However, it is strange that df.idxmax() fails on both columns, since the following works too:
df.Date.idxmax()
>>> 2
df.idxmax() also works for some other dummy data:
import numpy as np
dummy = pd.DataFrame(np.random.random(size=(5,2)))
print(dummy)
          0         1
0  0.944017  0.365198
1  0.541003  0.447632
2  0.583375  0.081192
3  0.492935  0.570310
4  0.832320  0.542983
print(dummy.idxmax())
0 0
1 3
dtype: int64
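If the failure does come from the datetime column (which is what the traceback suggests in older pandas versions), one workaround is to restrict idxmax to the numeric columns. A minimal sketch, rebuilding the question's frame:
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['20/03/17 10:30:34',
                                           '20/03/17 10:31:24',
                                           '20/03/17 10:34:34'], dayfirst=True),
                   'Value': [4, 7, 5]})

# Only numeric columns take part, so the datetime column can no longer trip it up.
print(df.select_dtypes(include='number').idxmax())
# Value    1
# dtype: int64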

You have to specify from which column you want the index of the maximum value.
To get the index of the maximum Value use:
df.Value.idxmax()
If you want the index of the maximum Date use:
df.Date.idxmax()
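And if you want the entire row at the position of the maximum Value, a short sketch (assuming the df built in the question):
max_row = df.loc[df.Value.idxmax()]
print(max_row['Date'], max_row['Value'])  # 2017-03-20 10:31:24  7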

Related

How do I subtract an odd row value from an even row value?

I have the dataframe below.
I want to subtract each even-row value from the odd-row value that follows it,
and make a new dataframe from the result.
How can I do it?
import pandas as pd
import numpy as np
raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
                     974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)
data
The dataframe:
         Time
0   281.54385
1   436.55295
2   441.74910
3   528.36445
4   974.48405
5   980.67895
6   986.65435
7  1026.02485
Wanted result:
     ON_TIME
0  155.00910
1   86.61535
2    6.19490
3   39.37050
You can use NumPy indexing:
res = pd.DataFrame(data.values[1::2] - data.values[::2], columns=['Time'])
print(res)
        Time
0  155.00910
1   86.61535
2    6.19490
3   39.37050
You can use shift for the subtraction, and then pick every 2nd element, starting with the 2nd element (index = 1):
(data.Time - data.Time.shift())[1::2].rename('On Time').reset_index(drop=True)
outputs:
0 155.00910
1 86.61535
2 6.19490
3 39.37050
Name: On Time, dtype: float64
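Another self-contained sketch of the same idea, using positional slices aligned with reset_index (same toy data as above):
import pandas as pd

raw_data = {'Time': [281.54385, 436.55295, 441.74910, 528.36445,
                     974.48405, 980.67895, 986.65435, 1026.02485]}
data = pd.DataFrame(raw_data)

# Odd-position values minus the even-position values directly above them.
odd = data['Time'].iloc[1::2].reset_index(drop=True)
even = data['Time'].iloc[::2].reset_index(drop=True)
res = (odd - even).to_frame('ON_TIME')
print(res)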

pandas DataFrame sum method works counterintuitively

import numpy as np
from pandas import DataFrame

my_df = DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
my_df.sum(axis="rows")
The output is:
a 22
b 26
c 30
I expect it to sum along the rows, thereby giving:
0 6
1 15
2 24
3 33
my_df.sum(axis="columns")  # this achieves it
Why does it work counterintuitively?
In a similar context, the drop method works as I expect, i.e. when I write
my_df.drop(['a'], axis="columns")
it drops column "a".
Am I missing something? Please enlighten.
Short version
It is a naming convention: the axis you name is the one that gets collapsed. Summing along the columns gives a row-wise sum. You are looking for axis='columns'.
Long version
OK, that was interesting. In pandas, axis 0 refers to the index (it runs down the rows) and axis 1 refers to the columns.
Looking in the docs we find that the allowed values are:
axis : {index (0), columns (1)}
The string you are passing is not one of the documented options, so it is either treated as an alias for axis 0 or falls back to the default; either way you get the axis 0 behaviour. Read it as: summing along the index collapses the rows and returns one value per column; summing along the columns collapses the columns and returns one value per row. What you want is axis=1 or axis='columns', which gives your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0 6
1 15
2 24
3 33
dtype: int64
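A minimal sketch showing both axes side by side, for comparison:
import numpy as np
import pandas as pd

my_df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list('abc'))

print(my_df.sum(axis=0))  # same as axis='index': collapses the rows, one sum per column
print(my_df.sum(axis=1))  # same as axis='columns': collapses the columns, one sum per row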

shifting pandas series for only some entries

I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you intend to work with this as time series data, I would recommend converting it to a proper datetime series and then forward filling the blanks:
import pandas as pd

data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

df.Time = pd.to_datetime(df.Time, errors='coerce')  # 'unknown' becomes NaT
df = df.fillna(method='ffill')
However, if you are getting this data from a CSV file (or anything else read with a pandas.read_* function), you should use the na_values argument to have 'unknown' treated as an NA value:
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
You can also pass a list instead of a single string; the values you pass are added to the existing set of NA markers.
However, if you want to keep the column as strings, I would recommend just doing a find and replace:
import numpy as np
df.Time = np.where(df.Time == 'unknown', df.Time.shift(), df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()
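A compact alternative sketch: use mask to turn the 'unknown' markers into NaN and then forward fill, which also keeps the column as strings (same toy data as the question):
import pandas as pd

data = {'Time': ['0' + str(i) + ':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

# mask() turns the 'unknown' markers into NaN, ffill() copies the value from the row above.
df['Time'] = df['Time'].mask(df['Time'] == 'unknown').ffill()
print(df)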

Set max string length in pandas

I want my dataframe to auto-truncate strings which are longer than a certain length.
basically:
pd.set_option('auto_truncate_string_exceeding_this_length', 255)
Any ideas? I have hundreds of columns and don't want to iterate over every data point. If this can be achieved during import that would also be fine (e.g. pd.read_csv())
Thanks.
pd.set_option('display.max_colwidth', 255)
Note that this only changes how many characters are displayed, not the stored values.
You can use read_csv converters. Let's say you want to truncate column abc; you can pass a dictionary with a function, like:
def auto_truncate(val):
    return val[:255]

df = pd.read_csv('file.csv', converters={'abc': auto_truncate})
If you have columns that need different lengths:
df = pd.read_csv('file.csv', converters={'abc': lambda x: x[:255], 'xyz': lambda x: x[:512]})
Make sure the column type is string. A column index can also be used instead of the name in the converters dict.
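A minimal self-contained sketch of the converters approach, reading an in-memory CSV (the column names abc and xyz are just the hypothetical ones used above):
import io
import pandas as pd

# In-memory stand-in for file.csv with one over-long value per column.
csv = io.StringIO('abc,xyz\n' + 'a' * 300 + ',' + 'b' * 600 + '\n')
df = pd.read_csv(csv, converters={'abc': lambda x: x[:255], 'xyz': lambda x: x[:512]})
print(df['abc'].str.len().iloc[0], df['xyz'].str.len().iloc[0])  # 255 512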
I'm not sure you can do this on the whole df, but the following would work after loading:
In [21]:
import numpy as np
df = pd.DataFrame({"a": ['jasjdhadasd']*5, "b": np.arange(5)})
df
Out[21]:
             a  b
0  jasjdhadasd  0
1  jasjdhadasd  1
2  jasjdhadasd  2
3  jasjdhadasd  3
4  jasjdhadasd  4
In [22]:
for col in df:
    if df[col].dtype == object:  # string-like column
        df[col] = df[col].str.slice(0, 5)
df
Out[22]:
       a  b
0  jasjd  0
1  jasjd  1
2  jasjd  2
3  jasjd  3
4  jasjd  4
EDIT
I think if you specified the dtypes in the args to read_csv then you could set the max length:
df = pd.read_csv('file.csv', dtype=(np.str, maxlen))
I will try this and confirm shortly
UPDATE
Sadly you cannot specify the length; an error is raised if you try this:
NotImplementedError: the dtype <U5 is not supported for parsing
when attempting to pass the arg dtype=(str,5)
You can also simply truncate a single column with
df['A'] = df['A'].str[:255]
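If you want to apply that kind of truncation to every string column after loading, a minimal sketch (assuming a 255-character limit and a toy frame):
import pandas as pd

df = pd.DataFrame({'a': ['x' * 300, 'y' * 10], 'b': [1, 2]})

# Truncate every string (object-dtype) column to 255 characters.
str_cols = df.select_dtypes(include='object').columns
df[str_cols] = df[str_cols].apply(lambda s: s.str.slice(0, 255))
print(df['a'].str.len().tolist())  # [255, 10]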

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df[0].loc[nan_index]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
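Another option that sidesteps the temporary column is to map and then fill the gaps with the original values; a minimal sketch, with the same caveat that the column dtype becomes object:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 5, size=(100, 1)))
map_array = {1: 'one', 2: 'two', 4: 'four'}

# map() leaves unmapped values (here 3) as NaN; fillna() restores the originals.
df[0] = df[0].map(map_array).fillna(df[0])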
