Why is pandas eval not working anymore with where? - python

I was using pandas eval within a where that sits inside a function in order to create a column in a data frame. While it was working in the past, not it doesn't. There was a recent move to Python 3 within our dataiku software. Could that be the reason for it?
Below will be the code that is now in place
import pandas as pd, numpy as np
from numpy import where, nan
d = {'ASSET': ['X','X','A','X','B'], 'PRODUCT': ['Z','Y','Z','C','Y']}
MAIN_df = pd.DataFrame(data=d)
def val_per(ASSET, PRODUCT):
return(
where(pd.eval("ASSET== 'X' & PRODUCT == 'Z'"),0.04,
where(pd.eval("PRODUCT == 'Y'"),0.08,1.5)
)
)
MAIN_2_df = (MAIN_df.eval("PCT = #val_per(ASSET, PRODUCT)"))
The error received now is <class 'TypeError'>: unhashable type: 'numpy.ndarray'

You can change the last two lines with:
MAIN_2_df = MAIN_df.copy()
MAIN_2_df = val_per(MAIN_2_df.ASSET, MAIN_2_df.PRODUCT)
This approach will work faster for large dataframes. You can use a vectorized aproach to faster results.

Related

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it. If I print the resulting df, I see they are (almost) identical. The freq attribute of the datetimeindex is not preserved though.
My code looks like this
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
idx = pd.date_range(start=datetime.datetime.now(),
end=(datetime.datetime.now()
+ datetime.timedelta(hours=3)),
freq='10min')
a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
columns=['first', 2])
a.to_csv('test_df')
b = load_pandas_dataframe('test_df')
os.remove('test_df')
assert np.all(b == a)
def load_pandas_dataframe(filename):
'''Correcty loads dataframe but freq is not maintained'''
df = pd.read_csv(filename, index_col=0,
parse_dates=True)
return df
if __name__ == '__main__':
test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change line 13 to:
columns=['first', '2'])
To complemente #jfaccioni answer, freq attribute is not preserved, there are two options here
Fast a simple, use pickle which will preserver everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute from a DatetimeIndex:
a.to_csv('test_df')
b.read_csv('test_df')
b.index.freq = b.index.inferred_freq
print(b.index.freq) #<10 * Minutes>

How to solve the FunctionError and MapError

Python 3.6 pycharm
import prettytable as pt
import numpy as np
import pandas as pd
a=np.random.randn(30,2)
b=a.round(2)
df=pd.DataFrame(b)
df.columns=['data1','data2']
tb = pt.PrettyTable()
def func1(columns):
def func2(column):
return tb.add_column(column,df[column])
return map(func2,columns)
column1=['data1','data2']
print(column1)
print(func1(column1))
I want to get the results are:
tb.add_column('data1',df['data1'])
tb.add_column('data2',df['data2'])
As a matter of fact,the results are:
<map object at 0x000001E527357828>
I am trying find the answer in Stack Overflow for a long time, some answer tell me can use list(func1(column1)), but the result is [None, None].
Based on the tutorial at https://ptable.readthedocs.io/en/latest/tutorial.html, PrettyTable.add_column modifies the PrettyTable in-place. Such functions generally return None, not the modified object.
You're also overcomplicating the problem by trying to use map and a fancy wrapper function. The below code is much simpler, but produces the desired result.
import prettytable as pt
import numpy as np
import pandas as pd
column_names = ['data1', 'data2']
a = np.random.randn(30, 2)
b = a.round(2)
df = pd.DataFrame(b)
df.columns = column_names
tb = pt.PrettyTable()
for col in column_names:
tb.add_column(col, df[col])
print(tb)
If you're still interesting in learning about the thing that map returns, I suggest reading about iterables and iterators. map returns an iterator over the results of calling the function, and does not actually do any work until you iterate over it.

Problems transforming data in a dataframe

I've written the function (tested and working) below:
import pandas as pd
def ConvertStrDateToWeekId(strDate):
dateformat = '2016-7-15 22:44:09'
aDate = pd.to_datetime(strDate)
wk = aDate.isocalendar()[1]
yr = aDate.isocalendar()[0]
Format_4_5_4_date = str(yr) + str(wk)
return Format_4_5_4_date'
and from what I have seen on line I should be able to use it this way:
ml_poLines = result.value.select('PURCHASEORDERNUMBER', 'ITEMNUMBER', PRODUCTCOLORID', 'RECEIVINGWAREHOUSEID', ConvertStrDateToWeekId('CONFIRMEDDELIVERYDATE'))
However when I "show" my dataframe the "CONFIRMEDDELIVERYDATE" column is the original datetime string! NO errors are given.
I've also tried this:
ml_poLines['WeekId'] = (ConvertStrDateToWeekId(ml_poLines['CONFIRMEDDELIVERYDATE']))
and get the following error:
"ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions." which makes no sense to me.
I've also tried this with no success.
x = ml_poLines.toPandas();
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
ml_poLines2 = spark.createDataFrame(x)
ml_poLines2.show()
The above generates the following error:
AttributeError: 'Series' object has no attribute 'isocalendar'
What have I done wrong?
Your function ConvertStrDateToWeekId takes a string. But in the following line the argument of the function call is a series of strings:
x['testDates'] = ConvertStrDateToWeekId(x['CONFIRMEDDELIVERYDATE'])
A possible workaround for this error is to use the apply-function of pandas:
x['testDates'] = x['CONFIRMEDDELIVERYDATE'].apply(ConvertStrDateToWeekId)
But without more information about the kind of data you are processing it is hard to provide further help.
This was the work-around that I got to work:
`# convert the confirimedDeliveryDate to a WeekId
x= ml_poLines.toPandas();
x['WeekId'] = x[['ITEMNUMBER', 'CONFIRMEDDELIVERYDATE']].apply(lambda y:ConvertStrDateToWeekId(y[1]), axis=1)
ml_poLines = spark.createDataFrame(x)
ml_poLines.show()`
Not quite as clean as I would like.
Maybe someone else cam propose a cleaner solution.

Python PANDAS: Converting from pandas/numpy to dask dataframe/array

I am working to try to convert a program to be parallelizable/multithreaded with the excellent dask library. Here is the program I am working on converting:
Python PANDAS: Stack by Enumerated Date to Create Records Vectorized
import pandas as pd
import numpy as np
import dask.dataframe as dd
import dask.array as da
from io import StringIO
test_data = '''id,transaction_dt,units,measures
1,2018-01-01,4,30.5
1,2018-01-03,4,26.3
2,2018-01-01,3,12.7
2,2018-01-03,3,8.8'''
df_test = pd.read_csv(StringIO(test_data), sep=',')
df_test['transaction_dt'] = pd.to_datetime(df_test['transaction_dt'])
df_test = df_test.loc[np.repeat(df_test.index, df_test['units'])]
df_test['transaction_dt'] += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='d')
df_test = df_test.reset_index(drop=True)
expected results:
id,transaction_dt,measures
1,2018-01-01,30.5
1,2018-01-02,30.5
1,2018-01-03,30.5
1,2018-01-04,30.5
1,2018-01-03,26.3
1,2018-01-04,26.3
1,2018-01-05,26.3
1,2018-01-06,26.3
2,2018-01-01,12.7
2,2018-01-02,12.7
2,2018-01-03,12.7
2,2018-01-03,8.8
2,2018-01-04,8.8
2,2018-01-05,8.8
It occurred to me that this might be a good candidate to try to parallelize because the separate dask partitions should not need to know anything about each other to accomplish the required operations. Here is a naive representation of how I thought it might work:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] += dd_test.to_timedelta(dd.groupby(level=0).cumcount(), unit='d')
dd_test = dd_test.reset_index(drop=True)
So far I have been trying to work through the following errors or idiomatic differences:
"NotImplementedError: Only integer valued repeats supported."
I have tried to convert the index into a int column/array to try as well but still run into the issue.
2. dask does not support the mutating operator: "+="
3. No dask .to_timedelta() argument
4. No dask .cumcount() (but I think .cumsum() is interchangable?!)
If there are any dask experts out there who might be able let me know if there are fundamental impediments to preclude me from trying this or any tips on implementation, that would be a great help!
Edit:
I think I have made a bit of progress on this since posting the question:
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[da.repeat(dd_test.index, dd_test['units'])]
dd_test['transaction_dt'] = dd_test['transaction_dt'] + (dd.test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
However, I am still stuck on the dask array repeats error. Any tips still welcome.
Not sure if this is exactly what you are looking for, but I replaced the da.repeat with using np.repeat, along with explicity casting dd_test.index and dd_test['units'] to numpy arrays, and finally adding dd_test['transaction_dt'].astype('M8[us]') to your timedelta calculation.
df_test = pd.read_csv(StringIO(test_data), sep=',')
dd_test = dd.from_pandas(df_test, npartitions=3)
dd_test['helper'] = 1
dd_test = dd_test.loc[np.repeat(np.array(dd_test.index),
np.array(dd_test['units']))]
dd_test['transaction_dt'] = dd_test['transaction_dt'].astype('M8[us]') + (dd_test.groupby('id')['helper'].cumsum()).astype('timedelta64[D]')
dd_test = dd_test.reset_index(drop=True)
df_expected = dd_test.compute()

How do I search for a tuple of values in pandas?

I'm trying to write a function to swap a dictionary of targets with results in a pandas dataframe. I'd like to match a tuple of values and swap out new values. I tried building it as follows, but the the row select isn't working. I feel like I'm missing some critical function here.
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
target=("Mammals","Birds")
swapVals={("Cats","Parrots"):("Rats","Canaries")}
for x in swapVals:
#Attempt 1:
#testData.loc[x,target]=swapVals[x]
#Attempt 2:
testData[testData.loc[:,target]==x,target]=swapVals[x]
This was written in Python 2, but the basic idea should work for you. It uses the apply function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries")}
target=["Mammals","Birds"]
def swapper(in_row):
temp =tuple(in_row.values)
if temp in swapVals:
return list(swapVals[temp])
else:
return in_row
testData[target] = testData[target].apply(swapper, axis=1)
testData
Note that if you loaded the other keys into the dict, you could do the apply without the swapper function:
import pandas
testData=pandas.DataFrame([["Cats","Parrots","Sandstone"],["Dogs","Cockatiels","Marble"]],columns=["Mammals","Birds","Rocks"])
swapVals={("Cats","Parrots"):("Rats","Canaries"), ("Dogs","Cockatiels"):("Dogs","Cockatiels")}
target=["Mammals","Birds"]
testData[target] = testData[target].apply(lambda x: list(swapVals[tuple(x.values)]), axis=1)
testData

Categories