Using 2 pandas columns as arguments for np.timedelta - python

Simple question:
In [1]:
df = DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
df
Out[1]:
  unit  value
0    D      4
1    W      4
2    Y      4
I can create timedeltas this way (of course):
In [2]:
timedelta64(4, 'D')
Out[2]:
numpy.timedelta64(4,'D')
But I'm not able to iterate through the DataFrame columns to get a resulting Series of timedeltas:
def f(x):
    return timedelta64(x['value'], x['unit'])
df.apply(f, axis=1)
Instead, I'm getting:
TypeError: don't know how to convert scalar number to float
EDIT:
This also does not work, and returns the same error:
df['arg'] = zip(df.value, df.unit)
df.arg.apply(lambda x: timedelta64(x[0], x[1]))

Your code works for me:
df = pd.DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
df.apply(f, axis=1)
0    4 days
1    4 weeks
2    4 years
dtype: object
Here are my versions:
numpy.__version__
'1.8.0'
pandas.__version__
'0.13.0rc1-32-g81053f9'
I did notice a bug that is perhaps related to your issue. You might check whether you have numpy 1.7; if so, upgrade to 1.8 and see if that fixes the issue. Good luck :)
https://github.com/pydata/pandas/issues/5689

In 0.13 this is supported using the new pd.to_timedelta:
In [24]: df = DataFrame({'value':[4,4,4],'unit':['D','W','Y']})
In [25]: pd.to_timedelta(df.apply(lambda x: np.timedelta64(x['value'],x['unit']), axis=1))
Out[25]:
0       4 days, 00:00:00
1      28 days, 00:00:00
2    1460 days, 23:16:48
dtype: timedelta64[ns]
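As an aside for newer stacks: recent pandas versions reject the calendar-ambiguous 'Y' and 'M' units in timedelta conversions, so a minimal sketch of the same per-row idea, assuming only fixed-width units such as 'D', 'W', and 'h', would look like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [4, 4, 4], 'unit': ['D', 'W', 'h']})

# Build one timedelta64 scalar per row, then let pd.to_timedelta
# normalize the mixed units down to timedelta64[ns].
deltas = pd.to_timedelta(
    [np.timedelta64(int(v), u) for v, u in zip(df['value'], df['unit'])]
)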

Related

Using result_type with pandas apply function

I want to use apply on a pandas.DataFrame that I created, and return for each row a list of values, where each value should become a column of its own.
I wrote the following code:
import numpy as np
import pandas as pd
def get_list(row):
    return [i for i in range(5)]
df = pd.DataFrame(0, index=np.arange(100), columns=['col'])
df.apply(lambda row: get_list(row), axis=1, result_type='expand')
When I add result_type='expand' in order to change the returned array into separate columns I get the following error:
TypeError: ("<lambda>() got an unexpected keyword argument 'result_type'", 'occurred at index 0')
However if I drop the result_type field it runs fine (returns a column of arrays), what might be the problem?
I'm using Colab to run my code.
This code works in pandas version 0.23.3; you probably just need to run pip install --upgrade pandas in your terminal.
Or
You can accomplish it without the result_type as follows:
def get_list(row):
    return pd.Series([i for i in range(5)])
df = pd.DataFrame(0, index=np.arange(100), columns=['col'])
pd.concat([df, df.apply(get_list, axis=1)], axis=1)
   col  0  1  2  3  4
0    0  0  1  2  3  4
1    0  0  1  2  3  4
2    0  0  1  2  3  4
3    0  0  1  2  3  4
4    0  0  1  2  3  4
...
BTW, you don't need a lambda for it, you can just:
df.apply(get_list, axis=1, result_type='expand')
Update
The result_type keyword was announced in the release notes of pandas 0.23 (https://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0230), so I am afraid you will have to update.
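As a small usage sketch (assuming pandas >= 0.23): result_type='expand' pairs naturally with renaming the generated columns afterwards. The item_ prefix below is just an illustrative choice, not anything pandas produces:

import numpy as np
import pandas as pd

def get_list(row):
    return [i for i in range(5)]

df = pd.DataFrame(0, index=np.arange(100), columns=['col'])
expanded = df.apply(get_list, axis=1, result_type='expand')
# The expanded columns arrive named 0..4; give them readable names.
expanded.columns = ['item_%d' % i for i in expanded.columns]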

Call a function on an object and assign the return to the same object at the same time

Let's say I want to convert a series of date strings to datetime using the following:
>>> import pandas as pd
>>> dataframe.loc[:, 'DATE'] = pd.to_datetime(dataframe.loc[:, 'DATE'])
Now, I see dataframe.loc[:, 'DATE'] as redundant. Is it possible in python that I call a function on an object and assign the return to the same object at the same time?
Something that looks like:
>>> pd.to_datetime(dataframe.loc[:,'DATE'], +)
or
dataframe.loc[:,'DATE'] += pd.to_datetime()
where + (or whatever) assigns the return of the function to its first argument
This question might be due to my lack of understanding on how programming languages are written/function, so please be gentle.
There is no such thing. But you can achieve the same with:
name = 'DATE'
dataframe[name] = pd.to_datetime(dataframe[name])
No need for .loc
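If you want the "call a function and assign back" pattern as a reusable one-liner, a tiny helper will do. This is just a sketch, and apply_inplace is a hypothetical name, not a pandas API:

import pandas as pd

def apply_inplace(df, col, func):
    # Hypothetical helper: overwrite a column with func(column).
    df[col] = func(df[col])
    return df

df = pd.DataFrame({'DATE': ['2021-01-01', '2021-06-15']})
df = apply_inplace(df, 'DATE', pd.to_datetime)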
Some methods support an inplace=True keyword argument.
For example, sorting a dataframe gives you new one:
>>> df = pd.DataFrame({'DATE': [10, 7, 1, 2, 3]})
>>> df.sort_values('DATE')
   DATE
2     1
3     2
4     3
1     7
0    10
The original remains unchanged:
>>> df
   DATE
0    10
1     7
2     1
3     2
4     3
Setting inplace=True modifies the original df:
>>> df.sort_values('DATE', inplace=True)
>>> df
   DATE
2     1
3     2
4     3
1     7
0    10
The closest pandas gets to this is the ad-hoc inplace keyword argument that exists for a good portion of DataFrame methods.
For example, an in-place datetime operation happens to be hidden in the new set_index functionality:
df.set_index(df['Date'], inplace=True)

Combine arbitrary number of columns into one in pandas

This question is a general version of a specific case asked about here.
I have a pandas dataframe with columns that contain integers. I'd like to concatenate all of those integers into a string in one column.
Given this answer, for particular columns, this works:
(dl['ungrd_dum'].map(str) +
 dl['mba_dum'].map(str) +
 dl['jd_dum'].map(str) +
 dl['ma_phd_dum'].map(str))
But suppose I have many (hundreds) of such columns, whose names are in a list dummies. I'm certain there's some cool pythonic way of doing this with one magical line that will do it all. I've tried using map with dummies, but haven't yet been able to figure it out.
IIUC you should be able to do
df[dummies].astype(str).apply(lambda x: ''.join(x), axis=1)
Example:
In [12]:
df = pd.DataFrame({'a': np.random.randint(0, 100, 5), 'b': np.arange(5), 'c': np.random.randint(0, 10, 5)})
df
Out[12]:
    a  b  c
0   5  0  2
1  46  1  3
2  86  2  4
3  85  3  9
4  60  4  4
In [15]:
cols=['a','c']
df[cols].astype(str).apply(''.join, axis=1)
Out[15]:
0     52
1    463
2    864
3    859
4    604
dtype: object
EDIT
As @JohnE has pointed out, you could call sum instead, which will be faster:
df[cols].astype(str).sum(axis=1)
However, that will implicitly convert the dtype to float64 so you'd have to cast back to str again and slice the decimal point off if necessary:
df[cols].astype(str).sum(axis=1).astype(str).str[:-2]
from functools import reduce  # reduce moved to functools in Python 3
from operator import add
reduce(add, (df[c].astype(str) for c in cols), "")
For example:
df = pd.DataFrame({'a': np.random.randint(0, 100, 5),
                   'b': np.arange(5),
                   'c': np.random.randint(0, 10, 5)})
cols = ['a', 'c']
In [19]: df
Out[19]:
    a  b  c
0   6  0  4
1  59  1  9
2  13  2  5
3  44  3  1
4  79  4  4
In [20]: reduce(add, (df[c].astype(str) for c in cols), "")
Out[20]:
0     64
1    599
2    135
3    441
4    794
dtype: object
The first thing you need to do is convert your DataFrame of numbers into a DataFrame of strings, as efficiently as possible:
dl = dl.astype(str)
Then, you're in the same situation as this other question, and can use the same Series.str accessor techniques as in this answer:
.str.cat()
Using str.cat() you could do:
dl['result'] = dl[dl.columns[0]].str.cat([dl[c] for c in dl.columns[1:]], sep=' ')
str.join()
To use .str.join() you need a series of iterables, say tuples.
dl['result'] = dl[dl.columns[1:]].apply(tuple, axis=1).str.join(' ')
Don't try the above with list instead of tuple, or the apply() method will return a DataFrame, and DataFrames don't have the .str accessor that Series do.
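For completeness, newer pandas versions also support row-wise aggregation with a callable, so the whole concatenation fits in one agg call. A sketch, assuming the selected columns are first cast to str:

import pandas as pd

dl = pd.DataFrame({'a': [5, 46], 'b': [0, 1], 'c': [2, 3]})
cols = ['a', 'c']

# agg with axis=1 applies ''.join to each row of strings,
# mirroring the apply(''.join, axis=1) approach above.
joined = dl[cols].astype(str).agg(''.join, axis=1)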

Pandas Filter function returned a Series, but expected a scalar bool

I am attempting to use filter on a pandas dataframe to filter out all rows that match a duplicate value (I need to remove ALL the rows when there are duplicates, not just the first or last).
This is what I have, and it works in the editor:
df = df.groupby("student_id").filter(lambda x: x.count() == 1)
But when I run my script with this code in it I get the error:
TypeError: filter function returned a Series, but expected a scalar bool
I am creating the dataframe by concatenating two other frames immediately before trying to apply the filter.
It should be:
In [32]: grouped = df.groupby("student_id")
In [33]: grouped.filter(lambda x: x["student_id"].count()==1)
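A side note: since each x handed to filter is a sub-DataFrame, the same condition can also be written without naming a column, e.g.:

grouped.filter(lambda x: len(x) == 1)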
Updates:
I'm not sure about the issue you mentioned regarding the interactive console. Technically speaking, in this particular case the console (such as IPython) should behave the same as any other environment (the plain Python interpreter, or one embedded in an IDE); there are other situations, such as the intricate import machinery, in which different environments may behave differently.
An intuitive way to understand pandas groupby is to treat the object returned by DataFrame.groupby() as a list of DataFrames. So when you use filter to apply the lambda function to x, x is actually one of those DataFrames:
In[24]: data = [[0, 1], [2, 3], [4, 5], [6, 7]] * 2
In[25]: df = pd.DataFrame(data, columns=[2013, 2014])
In[26]: df
Out[26]:
   2013  2014
0     0     1
1     2     3
2     4     5
3     6     7
4     0     1
5     2     3
6     4     5
7     6     7
In[27]: grouped = df.groupby(2013)
In[28]: grouped.count()
Out[28]:
      2014
2013
0        2
2        2
4        2
6        2
In this example, the first DataFrame in the grouped object would be:
In[33]: df1 = df.loc[[0, 4]]
In[34]: df1
Out[34]:
   2013  2014
0     0     1
4     0     1
How about using the pd.DataFrame.drop_duplicates() method? See the documentation.
Are you sure you really want to remove ALL rows, and not n-1?
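For the "remove ALL rows with a duplicated key" goal specifically, here is a sketch using drop_duplicates with keep=False, which drops every member of a duplicated group rather than keeping one representative (the column names are illustrative):

import pandas as pd

df = pd.DataFrame({'student_id': [1, 2, 2, 3], 'score': [90, 80, 70, 60]})

# keep=False discards every row whose student_id occurs more than once,
# matching the groupby/filter approach above.
unique_only = df.drop_duplicates(subset='student_id', keep=False)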

Python - pandas - Append Series into Blank DataFrame

Say I have two pandas Series in python:
import pandas as pd
h = pd.Series(['g',4,2,1,1])
g = pd.Series([1,6,5,4,"abc"])
I can create a DataFrame with just h and then append g to it:
df = pd.DataFrame([h])
df1 = df.append(g, ignore_index=True)
I get:
>>> df1
   0  1  2  3    4
0  g  4  2  1    1
1  1  6  5  4  abc
But now suppose that I have an empty DataFrame and I try to append h to it:
df2 = pd.DataFrame([])
df3 = df2.append(h, ignore_index=True)
This does not work. I think the problem is in the second-to-last line of code. I need to somehow define the blank DataFrame to have the proper number of columns.
By the way, the reason I am trying to do this is that I am scraping text from the internet using requests+BeautifulSoup and I am processing it and trying to write it to a DataFrame one row at a time.
So if you don't pass an empty list to the DataFrame constructor then it works:
In [16]:
df = pd.DataFrame()
h = pd.Series(['g',4,2,1,1])
df = df.append(h,ignore_index=True)
df
Out[16]:
   0  1  2  3  4
0  g  4  2  1  1
[1 rows x 5 columns]
The difference between the two constructor approaches appears to be that the index dtypes are set differently: with an empty list it is int64, with nothing it is object:
In [21]:
df = pd.DataFrame()
print(df.index.dtype)
df = pd.DataFrame([])
print(df.index.dtype)
object
int64
Unclear to me why the above should affect the behaviour (I'm guessing here).
UPDATE
After revisiting this I can confirm that this looks to me to be a bug in pandas version 0.12.0 as your original code works fine:
In [13]:
import pandas as pd
df = pd.DataFrame([])
h = pd.Series(['g',4,2,1,1])
df.append(h,ignore_index=True)
Out[13]:
   0  1  2  3  4
0  g  4  2  1  1
[1 rows x 5 columns]
I am running pandas 0.13.1 and numpy 1.8.1 64-bit using Python 3.3.5. I think the problem is in pandas, but I would upgrade both pandas and numpy to be safe; I don't think this is a 32- versus 64-bit Python issue.
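A final hedged note for readers on current pandas: DataFrame.append was deprecated in 1.4 and removed in 2.0, so the row-by-row pattern is now written with pd.concat. A minimal sketch:

import pandas as pd

h = pd.Series(['g', 4, 2, 1, 1])
g = pd.Series([1, 6, 5, 4, 'abc'])

# Each Series becomes a one-row DataFrame; ignore_index renumbers the rows.
df = pd.concat([h.to_frame().T, g.to_frame().T], ignore_index=True)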
