How do I return multiple values from a function applied on a Dask Series?
I am trying to return a series from each iteration of dask.Series.apply and for the final result to be a dask.DataFrame.
The following code tells me that the meta is wrong; the all-pandas version, however, works. What's wrong here?
Update: I think that I am not specifying the meta/schema correctly. How do I do it correctly?
Now it works when I drop the meta argument. However, it raises a warning. I would like to use dask "correctly".
import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
def transformMyCol(x):
    # Minimal example function
    return pd.Series(['Tom - ' + str(x), 'Deskflip - ' + str(x / 8), ''])
#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol,1)
#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked = df.target.apply(transformMyCol)
unpacked.head()
#
## Dask Version (first attempt) - Raises an exception
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked_dask_schema = {"name" : str, "action" : str, "comments" : str}
unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()
This is the error that I get:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
I have also tried the following, and it also does not work.
meta_df = pd.DataFrame(dtype='str',columns=list(unpacked_dask_schema.keys()))
unpacked = df.FILEDATA.apply(transformMyCol, meta=meta_df)
unpacked.head()
Same error:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
You're right: the problem is that you're not specifying the meta correctly. More specifically, as the error message says, the metadata columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2). You should either:
Change the metadata columns to 0, 1, 2:
unpacked_dask_schema = dict.fromkeys(range(3), str)
df.target.apply(transformMyCol, meta=unpacked_dask_schema)
or
Change transformMyCol to use the named columns:
def transformMyCol(x):
    return pd.Series({
        'name': 'Tom - ' + str(x),
        'action': 'Deskflip - ' + str(x / 8),
        'comments': '',
    })
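For completeness, a minimal sketch of the second option in use (assuming transformMyCol now returns the named Series shown above): the original string-keyed meta then matches the computed columns.
# Sketch only: with the named-Series version of transformMyCol,
# the original meta dict lines up with the computed columns.
unpacked_dask_schema = {"name": str, "action": str, "comments": str}
unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()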
The following Python code loads a csv file into dataframe df and sends a string value from one or more columns of df to the UDF testFunction(...). The code works fine if I send a single column value. But if I send the value df.address + " " + df.city built from two columns of df, I get the following error:
Question: What may I be doing wrong, and how can we fix the issue? All the columns in df are NOT NULL, so a null or empty string should not be an issue. For example, if I send the single column value df.address, that value contains blank spaces (e.g. 123 Main Street). So why the error when two columns' concatenated values are sent to the UDF?
Error:
PythonException: An exception was thrown from a UDF: 'AttributeError: 'NoneType' object has no attribute 'upper''
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="true")
def testFunction(value):
    mystr = value.upper().replace(".", " ").replace(",", " ").replace("  ", " ").strip()
    return mystr
newFunction = F.udf(testFunction, StringType())
df2 = df.withColumn("myNewCol", newFunction(df.address + " " + df.city))
df2.show()
In PySpark you cannot concatenate StringType columns using +; it returns null, which breaks your UDF. You can use concat instead.
df2 = df.withColumn("myNewCol", newFunction(F.concat(df.address, F.lit(" "), df.city)))
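As an aside, a small sketch of an alternative, in case any of the columns could still be null: concat_ws joins with a separator and skips null inputs instead of propagating them.
# Alternative sketch: concat_ws(" ", ...) ignores null inputs rather than
# turning the whole expression into null.
df2 = df.withColumn("myNewCol", newFunction(F.concat_ws(" ", df.address, df.city)))
df2.show()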
I was trying to save a DataFrame and load it. If I print the resulting df, I see they are (almost) identical. The freq attribute of the datetimeindex is not preserved though.
My code looks like this
import datetime
import os
import numpy as np
import pandas as pd
def test_load_pandas_dataframe():
    idx = pd.date_range(start=datetime.datetime.now(),
                        end=(datetime.datetime.now()
                             + datetime.timedelta(hours=3)),
                        freq='10min')
    a = pd.DataFrame(np.arange(2*len(idx)).reshape((len(idx), 2)), index=idx,
                     columns=['first', 2])
    a.to_csv('test_df')
    b = load_pandas_dataframe('test_df')
    os.remove('test_df')
    assert np.all(b == a)

def load_pandas_dataframe(filename):
    '''Correctly loads the dataframe, but freq is not maintained'''
    df = pd.read_csv(filename, index_col=0,
                     parse_dates=True)
    return df

if __name__ == '__main__':
    test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change the columns argument in the DataFrame construction to:
columns=['first', '2'])
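Alternatively, a small sketch that reuses the objects from your test function: keep the integer label in the source frame and restore the in-memory column labels after the round trip. This should pass the same assertion.
# Sketch: restore the original labels after loading, so 2 stays an int
b = load_pandas_dataframe('test_df')
b.columns = a.columns
assert np.all(b == a)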
To complement @jfaccioni's answer: since the freq attribute is not preserved, there are two options here.
Fast and simple: use pickle, which will preserve everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute from a DatetimeIndex:
a.to_csv('test_df')
b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq
print(b.index.freq) #<10 * Minutes>
I have a .fits file with some data, on which I have done some manipulations, and I would like to store the new data (not the entire .fits file) as a pd.DataFrame. The data comes from a file called pabdatazcut.fits.
#Sorted by descending Paschen Beta flux
sortedpab = sorted(pabdatazcut[1].data , key = lambda data: data['PAB_FLUX'] , reverse = True )
unsorteddf = pd.DataFrame(pabdatazcut[1].data)
sortedpabdf = pd.DataFrame({'FIELD' : sortedpab['FIELD'],
                            'ID' : sortedpab['ID'],
                            'Z_50' : sortedpab['Z_50'],
                            'Z_ERR' : ((sortedpab['Z_84'] - sortedpab['Z_50']) + (sortedpab['Z_50'] - sortedpab['Z_16'])) / (2 * sortedpab['Z_50']),
                            '$\lambda Pa\beta$' : 12820 * (1 + sortedpab['Z_50']),
                            '$Pa\beta$ FLUX' : sortedpab['PAB_FLUX'],
                            '$Pa\beta$ FLUX ERR' : sortedpab['PAB_FLUX_ERR']})
I have received the 'TypeError: list indices must be integers or slices, not str' error message when I try to run this.
You get this because of accesses like sortedpab['ID'], I guess. According to the docs, sorted returns a sorted list. Lists do not accept strings as indices; their elements can only be accessed by integer positions or slices. That's what the error is trying to tell you.
Unfortunately I can't test this on my machine, because I don't have your data, but I guess, what you really want to do is something like this:
data_dict = dict()
for obj in sortedpab:
    for key in ['FIELD', 'ID', 'Z_16', 'Z_50', 'Z_84', 'PAB_FLUX', 'PAB_FLUX_ERR']:
        data_dict.setdefault(key, list()).append(obj[key])
sortedpabdf = pd.DataFrame(data_dict)
# maybe you don't even need to create the data_dict but
# can pass sortedpab directly to your data frame
# have you tried that already?
#
# then I would calculate the columns which are not just copied
# in the dataframe directly, as this is more convenient
# like this:
sortedpabdf['Z_ERR'] = ((sortedpabdf['Z_84'] - sortedpabdf['Z_50']) + (sortedpabdf['Z_50'] - sortedpabdf['Z_16'])) / (2 * sortedpabdf['Z_50'])
sortedpabdf['$\lambda Pa\beta$'] = 12820 * (1 + sortedpabdf['Z_50'])
sortedpabdf.rename({
    'PAB_FLUX': '$Pa\beta$ FLUX',
    'PAB_FLUX_ERR': '$Pa\beta$ FLUX ERR'
}, axis='columns', inplace=True)
cols_to_delete = [col for col in sortedpabdf.columns
                  if col not in ['FIELD', 'ID', 'Z_50', 'Z_ERR', '$\lambda Pa\beta$',
                                 '$Pa\beta$ FLUX', '$Pa\beta$ FLUX ERR']]
sortedpabdf.drop(cols_to_delete, axis='columns', inplace=True)
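As an aside (untested, since I don't have your data): because pd.DataFrame(pabdatazcut[1].data) already works for you as unsorteddf, it may be simpler to skip sorted entirely and do both the sorting and the derived columns in pandas, along these lines:
# Sketch: sort in pandas instead of sorting the FITS records first
sortedpabdf = unsorteddf.sort_values('PAB_FLUX', ascending=False).reset_index(drop=True)
sortedpabdf['Z_ERR'] = ((sortedpabdf['Z_84'] - sortedpabdf['Z_50'])
                        + (sortedpabdf['Z_50'] - sortedpabdf['Z_16'])) / (2 * sortedpabdf['Z_50'])
sortedpabdf['$\lambda Pa\beta$'] = 12820 * (1 + sortedpabdf['Z_50'])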
I refer to this question - dask dataframe read parquet schema difference
But the metadata returned by Dask does not indicate any differences between the different dataframes. Here is my code, which parses the exception details to find mismatched dtypes. It finds none. There are up to 100 dataframes with 717 columns (each is ~100MB in size).
try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve), engine='pyarrow')
except Exception as ex:
    # Process the ex message to find the diff, this will break if dask change their error message
    msgs = str(ex).split('\nvs\n')
    cols1 = msgs[0].split('metadata')[0]
    cols1 = cols1.split('was different. \n')[1]
    cols2 = msgs[1].split('metadata')[0]
    df1_err = pd.DataFrame([sub.split(":") for sub in cols1.splitlines()])
    df1_err = df1_err.dropna()
    df2_err = pd.DataFrame([sub.split(":") for sub in cols2.splitlines()])
    df2_err = df2_err.dropna()
    df_err = pd.concat([df1_err, df2_err]).drop_duplicates(keep=False)
    raise Exception('Mismatch dataframes - ' + str(df_err))
The exception I get back is:
'Mismatch dataframes - Empty DataFrame Columns: [0, 1] Index: []'
This error does not occur with fastparquet, but it is so slow that it is unusable.
I added this to the creation of the dataframes (using pandas to_parquet to save them) in an attempt to unify the dtypes by column
df_float = df.select_dtypes(include=['float16', 'float64'])
df = df.drop(df_float.columns, axis=1)
for col in df_float.columns:
    df_float[col] = df_float.loc[:, col].astype('float32')
df = pd.concat([df, df_float], axis=1)

df_int = df.select_dtypes(include=['int8', 'int16', 'int32'])
try:
    for col in df_int.columns:
        df_int[col] = df_int.loc[:, col].astype('int64')
    df = df.drop(df_int.columns, axis=1)
    df = pd.concat([df, df_int], axis=1)
except ValueError as ve:
    print('Error with upcasting - ' + str(ve))
This appears to work, according to my exception handler above. But I cannot figure out how the dataframes differ, since the exception thrown by dask's read_parquet does not tell me. Any ideas on how to determine what it considers different?
You could use the fastparquet function merge to create a metadata file from the many data files (this will take some time, since it scans all the files). Thereafter, pyarrow will use this metadata file, and that might be enough to get rid of the problem for you.
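A rough sketch of what that could look like (the directory path is made up; fastparquet.writer.merge scans the files and writes a consolidated _metadata file next to them):
# Sketch, assuming all parquet files sit in one directory
import glob
import fastparquet

file_list = sorted(glob.glob('/path/to/parquet_dir/*.parquet'))  # hypothetical path
fastparquet.writer.merge(file_list)  # writes _metadata alongside the data files

# dask/pyarrow can then read the directory and pick up that metadata
df = dd.read_parquet('/path/to/parquet_dir/', columns=list(cols_to_retrieve), engine='pyarrow')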
The HTTP log files I'm trying to analyze with pandas sometimes have unexpected lines. Here's how I load my data:
df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})
It works fine for most of the logs I have (which come from the same server). However, upon loading some logs, an exception is raised:
either
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
or
ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'
For the sake of the example, here's the line that triggers the second exception:
22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
To find the number of the offending line, I used the following (terribly slow) function:
def search_error_dichotomy(path):
    borne_inf = 0
    log = open(path)
    borne_sup = len(log.readlines())
    log.close()
    while borne_sup - borne_inf > 1:
        exceded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path,...,...,nrows=search_index)
        except:
            exceded = True
        if exceded:
            borne_sup = search_index
        else:
            borne_inf = search_index
    return search_index
What I'd like to have is something like this :
try:
    pd.read_csv(..........................)
except MyError as e:
    print(e.row_number)
where e.row_number is the number of the messy line.
Thank you in advance.
SOLUTION
All credits to devssh, whose suggestion not only makes the process quicker but also lets me get all the unexpected lines at once. Here's what I did with it:
Load the dataframe without converters.
df = pd.read_csv(path,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'])
Add an 'index' column using .reset_index().
df = df.reset_index()
Write a custom function (to be used with apply) that converts to int if possible, and otherwise saves the entry and its 'index' in a dictionary wrong_lines.
wrong_lines = {}

def convert_int_feedback_index(row, col):
    try:
        ans = int(row[col])
    except:
        wrong_lines[row['index']] = row[col]
        ans = float('nan')
    return ans
Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time'):
df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
Did you try pd.read_csv(..., nrows=10) to see if it works on even 10 lines?
Perhaps you should not use converters to specify the dtypes.
Load the DataFrame then apply the dtype to columns like df["column"] = df["column"].astype(np.int64) or a custom function like df["column"]=df["column"].apply(lambda x: convert_type(x)) and handle the errors yourself in the function convert_type.
Finally, update the csv by calling df.to_csv("preprocessed.csv", header=True, index=False).
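A quick sketch of what that convert_type helper could look like (convert_type is only named above, so this is my guess at it):
# Sketch of a convert_type helper: coerce to int, fall back to NaN
# (or record the offending value) instead of letting read_csv blow up.
def convert_type(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return float('nan')

df["status"] = df["status"].apply(lambda x: convert_type(x))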
I don't think you can get the line number from the pd.read_csv itself. That separator itself looks too complex.
Or you can try just reading the csv as a single column DataFrame and use df["column"].str.extract to use regex to extract the columns. That way you control how the exception is to be raised or the default value to handle the error.
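For what it's worth, a rough sketch of that single-column approach (the regex below is a simplified stand-in for your separator, not a drop-in replacement):
# Sketch: read whole lines, extract fields with a regex; rows where the
# extraction fails (NaN) are exactly the malformed lines.
with open('mylog.log') as f:
    raw = pd.DataFrame({'line': f.read().splitlines()})
parts = raw['line'].str.extract(r'"(?P<request>[^"]*)"\s+(?P<status>\d+)\s')
bad_rows = parts[parts['status'].isna()].index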
df.reset_index() will give you the row numbers as a column, so if you apply over two columns you will get the row number as well. Combine that with apply over multiple columns and you can customize everything.