Dataframe datetime column as argument to function - python

When passing multiple columns of a dataframe as arguments to a function, the datetime column does not want to do the formatting function below. I can manage with the inline solution shown ... But it would be nice to know the reason ... Should I have e.g. used a different date-like data type? Thanks (p.s. Pandas = great)
import pandas as pd
import numpy as np
import datetime as dt
def fmtfn(arg_dttm, arg_int):
retstr = arg_dttm.strftime(':%Y-%m-%d') + '{:0>3}'.format(arg_int)
# bombs with: 'numpy.datetime64' object has no attribute 'strftime'
# retstr = '{:%Y-%m-%d}~{:0>3}'.format(arg_dttm, arg_int)
# bombs with: invalid format specifier
return retstr
def fmtfn2(arg_dtstr, arg_int):
retstr = '{}~{:0>3}'.format(arg_dtstr, arg_int)
return retstr
# The source data.
# I want to add a 3rd column newhg that carries e.g. 2017-06-25~066
# i.e. a concatenation of the other two columns.
df1 = pd.DataFrame({'mydt': ['2017-05-07', '2017-06-25', '2015-08-25'],
'myint': [66, 201, 100]})
df1['mydt'] = pd.to_datetime(df1['mydt'], errors='raise')
# THIS WORKS (without calling a function)
print('\nInline solution')
df1['newhg'] = df1[['mydt', 'myint']].apply(lambda x: '{:%Y-%m-%d}~{:0>3}'.format(x[0], x[1]), axis=1)
print(df1)
# THIS WORKS
print('\nConvert to string first')
df1['mydt2'] = df1['mydt'].apply(lambda x: x.strftime('%Y-%m-%d'))
df1['newhg'] = np.vectorize(fmtfn2)(df1['mydt2'], df1['myint'])
print(df1)
# Bombs in the function - see above
print('\nPass a datetime')
df1['newhg'] = np.vectorize(fmtfn)(df1['mydt'], df1['myint'])
print(df1)

You could have also used the builtin functions from pandas which make it a bit easier to read:
df1['newhg'] = df1.mydt.dt.strftime('%Y-%m-%d') + '~' + df1.myint.astype(str).str.zfill(3)

Related

TypeError:Invalid arguement, not a string or column

I am trying to do some source to target testing in pyspark. First part I am trying to do is a count of columns using a Lean Six Sigma method making sure there are less than 3/1000000 discrepancies in the columns. When I run this though, the if statement throws a:
TypeError: Invalid arguement, not a string or column: -276244 of type <class 'int'>. For columns literals, use 'lit', 'array','struct' or 'create_map' function.
Could anyone help?
import pyspark.sql.functions as f
from pyspark.sql.types import *
good_fields = []
bad_fields = {}
count_issues = {}
columns = list(spark.sql('show columns from tu_historical').toPandas()['col_name'])
for col in columns:
print(col)
df = spark.sql(f'select pid,fnum,{col} from historical_clean')
df1 = spark.sql(f'select pid,fnum,{col} from historical1')
#count issue testing
if abs(df1.count()-df.count()) > df1.count()*.000003:
count_issues[col] = df1.count()-df.count()
test_df = df.join(df1,(df.num == df1.file) & (df1.pid == df.pid),'left').filter(df1[col]!=df[col])
Seems like your columns has a funky value in it.
You may want to use this to get the columns names:
columns = spark.sql('select * from tu_historical limit 0').columns

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it. If I print the resulting df, I see they are (almost) identical. The freq attribute of the datetimeindex is not preserved though.
My code looks like this
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
idx = pd.date_range(start=datetime.datetime.now(),
end=(datetime.datetime.now()
+ datetime.timedelta(hours=3)),
freq='10min')
a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
columns=['first', 2])
a.to_csv('test_df')
b = load_pandas_dataframe('test_df')
os.remove('test_df')
assert np.all(b == a)
def load_pandas_dataframe(filename):
'''Correcty loads dataframe but freq is not maintained'''
df = pd.read_csv(filename, index_col=0,
parse_dates=True)
return df
if __name__ == '__main__':
test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change line 13 to:
columns=['first', '2'])
To complemente #jfaccioni answer, freq attribute is not preserved, there are two options here
Fast a simple, use pickle which will preserver everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute from a DatetimeIndex:
a.to_csv('test_df')
b.read_csv('test_df')
b.index.freq = b.index.inferred_freq
print(b.index.freq) #<10 * Minutes>

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert it into data-frame. I have converted through pandas library but I am having issues in two columns where one is an array and the other one is key-pair value.
Pito Value
{"pito-key": "Number"} [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How to break columns into the data-frames.
As far as I understood your question, you can apply regular expressions to do that.
import pandas as pd
import re
data = {'pito':['{"pito-key": "Number"}'], 'value':['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)
def get_value(s):
s = s[1]
v = re.findall(r'VALUE\":\".*\"', s)
return int(v[0][8:-1])
def get_pito(s):
s = s[0]
v = re.findall(r'key\": \".*\"', s)
return v[0][7:-1]
df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create 2 functions that transform your scary strings to values you want them to have
Let me know if that's not what you meant

passing a variable into apply() in pandas

I am having trouble getting the syntax right for applying a function to a dataframe. I am trying to create a new column in a dataframe by joining the strings in two other columns, passing in a separator. I get the error
TypeError: ("apply_join() missing 1 required positional argument: 'sep'", 'occurred at index cases')
If I add sep to the apply_join() function call, that also fails:
File "unite.py", line 37, in unite
tibble_extra = df[cols].apply(apply_join, sep)
NameError: name 'sep' is not defined
import pandas as pd
from io import StringIO
tibble3_csv = """country,year,cases,population
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583"""
with StringIO(tibble3_csv) as fp:
tibble3 = pd.read_csv(fp)
print(tibble3)
def str_join_elements(x, sep=""):
assert type(sep) is str
return sep.join((str(xi) for xi in x))
def unite(df, cols, new_var, combine=str_join_elements):
def apply_join(x, sep):
joinstr = str_join(x, sep)
return pd.Series({new_var[i]:s for i, s in enumerate(joinstr)})
fixed_vars = df.columns.difference(cols)
tibble = df[fixed_vars].copy()
tibble_extra = df[cols].apply(apply_join)
return pd.concat([tibble, tibble_extra], axis=1)
table3_again = unite(tibble3, ['cases', 'population'], 'rate', combine=lambda x: str_join_elements(x, "/"))
print(table3_again)
Use lambda when you have multiple parameters i.e
df[cols].apply(lambda x: apply_join(x,sep),axis=1)
Or pass parameters with the help of args parameter i.e
df[cols].apply(apply_join,args=[sep],axis=1)
You just add it into the apply statement:
tibble_extra = df[cols].apply(apply_join, sep=...)
Also, you should specify the axis. It may work without it, but its a good habit to prevent errors:
tibble_extra = df[cols].apply(apply_join, sep=..., axis=1(columns) or 0(rows|default))

Split and Join Series in Pandas

I have two series in the dataframe below. The first is a string which will appear in the second, which will be a url string. What I want to do is change the first series by concatenating on extra characters, and have that change applied onto the second string.
import pandas as pd
#import urlparse
d = {'OrigWord' : ['bunny', 'bear', 'bull'], 'WordinUrl' : ['http://www.animal.com/bunny/ear.html', 'http://www.animal.com/bear/ear.html', 'http://www.animal.com/bull/ear.html'] }
df = pd.DataFrame(d)
def trial(source_col, dest_col):
splitter = dest_col.str.split(str(source_col))
print type(splitter)
print splitter
res = 'angry_' + str(source_col).join(splitter)
return res
df['Final'] = df.applymap(trial(df.OrigWord, df.WordinUrl))
I'm trying to find the string from the source_col, then split on that string in the dest_col, then effect that change on the string in dest_col. Here I have it as a new series called Final but I would rather inplace. I think the main issue are the splitter variable, which isn't working and the application of the function.
Here's how result should look:
OrigWord WordinUrl
angry_bunny http://www.animal.com/angry_bunny/ear.html
angry_bear http://www.animal.com/angry_bear/ear.html
angry_bull http://www.animal.com/angry_bull/ear.html
apply isn't really designed to apply to multiple columns in the same row. What you can do is to change your function so that it takes in a series instead and then assigns source_col, dest_col to the appropriate value in the series. One way of doing it is as below:
def trial(x):
source_col = x["OrigWord"]
dest_col = x['WordinUrl' ]
splitter = str(dest_col).split(str(source_col))
res = splitter[0] + 'angry_' + source_col + splitter[1]
return res
df['Final'] = df.apply(trial,axis = 1 )
here is an alternative approach:
df['WordinUrl'] = (df.apply(lambda x: x.WordinUrl.replace(x.OrigWord,
'angry_' + x.OrigWord), axis=1))
In [25]: df
Out[25]:
OrigWord WordinUrl
0 bunny http://www.animal.com/angry_bunny/ear.html
1 bear http://www.animal.com/angry_bear/ear.html
2 bull http://www.animal.com/angry_bull/ear.html
Instead of using split, you can use the replace method to prepend the angry_ to the corresponding source:
def trial(row):
row.WordinUrl = row.WordinUrl.replace(row.OrigWord, "angry_" + row.OrigWord)
row.OrigWord = "angry_" + row.OrigWord
return row
df.apply(trial, axis = 1)
OrigWord WordinUrl
0 angry_bunny http://www.animal.com/angry_bunny/ear.html
1 angry_bear http://www.animal.com/angry_bear/ear.html
2 angry_bull http://www.animal.com/angry_bull/ear.html

Categories