The HTTP log files I'm trying to analyze with pandas sometimes contain unexpected lines. Here's how I load my data:
import pandas as pd

df = pd.read_csv('mylog.log',
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python', na_values=['-'], header=None,
                 usecols=[0, 3, 4, 5, 6, 7, 8, 10],
                 names=['ip', 'time', 'request', 'status', 'size',
                        'referer', 'user_agent', 'req_time'],
                 converters={'status': int, 'size': int, 'req_time': int})
It works fine for most of the logs I have (which come from the same server). However, loading some logs raises an exception:
either
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
or
ValueError: invalid literal for int() with base 10: '"GET /agent/10577/bdl HTTP/1.1"'
For the sake of the example, here's the line that triggers the second exception:
22.111.117.229, 22.111.117.229 - - [19/Sep/2018:22:17:40 +0200] "GET /agent/10577/bdl HTTP/1.1" 204 - "-" "okhttp/3.8.0" apibackend.site.fr 429282
To find the number of the offending line, I used the following (terribly slow) bisection function:
def search_error_dichotomy(path):
    borne_inf = 0
    log = open(path)
    borne_sup = len(log.readlines())
    log.close()
    while borne_sup - borne_inf > 1:
        exceded = False
        search_index = (borne_inf + borne_sup) // 2
        try:
            pd.read_csv(path, ..., ..., nrows=search_index)
        except:
            exceded = True
        if exceded:
            borne_sup = search_index
        else:
            borne_inf = search_index
    return search_index
What I'd like to have is something like this:
try:
pd.read_csv(..........................)
except MyError as e:
print(e.row_number)
where e.row_number is the number of the messy line.
Thank you in advance.
SOLUTION
All credit to devssh, whose suggestion not only makes the process quicker but also lets me get all the unexpected lines at once. Here's what I did with it:
Load the dataframe without converters.
df = pd.read_csv(path,
sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
engine='python', na_values=['-'], header=None,
usecols=[0, 3, 4, 5, 6, 7, 8,10],
names=['ip', 'time', 'request', 'status', 'size',
'referer', 'user_agent', 'req_time'])
Add an 'index' column using .reset_index().
df = df.reset_index()
Write a custom function (to be used with apply) that converts to int if possible; otherwise it saves the entry and its 'index' in a dictionary wrong_lines.
import numpy as np

wrong_lines = {}

def convert_int_feedback_index(row, col):
    try:
        ans = int(row[col])
    except (ValueError, TypeError):
        # Remember the original line number and the offending value.
        wrong_lines[row['index']] = row[col]
        ans = np.nan
    return ans
Use apply on the columns I want to convert (e.g. col = 'status', 'size', or 'req_time'):
df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)
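After those apply calls, the offending rows can be inspected directly. A small usage sketch (the loop over the three columns is just one way to drive it):

for col in ['status', 'size', 'req_time']:
    df[col] = df.apply(convert_int_feedback_index, axis=1, col=col)

# wrong_lines now maps the original line number (the 'index' column) to the raw
# value that could not be converted, e.g. the stray '"GET /agent/10577/bdl HTTP/1.1"'.
print(wrong_lines)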
Did you try pd.read_csv(..., nrows=10) to see if it works on even 10 lines?
Perhaps you should not use converters to specify the dtypes.
Load the DataFrame, then apply the dtype to columns like df["column"] = df["column"].astype(np.int64), or use a custom function like df["column"] = df["column"].apply(lambda x: convert_type(x)) and handle the errors yourself in convert_type.
Finally, update the csv by calling df.to_csv("preprocessed.csv", header=True, index=False).
I don't think you can get the line number from pd.read_csv itself. That separator looks too complex.
Or you can try reading the csv as a single-column DataFrame and use df["column"].str.extract with a regex to extract the columns. That way you control how the exception is raised or what default value handles the error.
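A minimal sketch of that single-column idea (the regex below is illustrative only and would need to mirror the real log format):

import pandas as pd

# Read each raw log line into one column.
with open('mylog.log') as f:
    raw = pd.DataFrame({'line': f.read().splitlines()})

# Pull out a few fields with named groups; rows that don't match become NaN.
extracted = raw['line'].str.extract(r'"\s+(?P<status>\d{3})\s+(?P<size>\d+|-)')
bad_rows = raw[extracted['status'].isna()]
print(bad_rows.index.tolist())  # line numbers that didn't fit the expected pattern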
df.reset_index() will give you the row numbers as an index column. That way, if you apply over multiple columns, you will get the row number as well, and you can customize everything.
Related
I am trying to read a compressed csv file from S3 using pandas.
The dataset has 2 date columns, or at least they should be parsed as dates. I have read the pandas docs, and using parse_dates=[col1, col2] should work fine. Indeed it parsed one column as a date but not the second one, which is weird because they have the same formatting (YYYYmmdd.0) and both contain NaN values as shown below.
I read the file as follows:
date_columns = ['PRESENCE_UO_DT_FIN_PREVUE', 'PERSONNE_DT_MODIF']
df = s3manager_in.read_csv(object_key='Folder1/file1', sep=';', encoding = 'ISO-8859-1', compression = 'gzip', parse_dates=date_columns, engine='python')
Is there any explanation why one column gets parsed as a date and the second one does not?
Thanks
The column 'PRESENCE_UO_DT_FIN_PREVUE' seems to carry some "bad" values (that are not formatted as 'YYYYmmdd.0'). That's probably the reason why pandas.read_csv can't parse this column as a date, even when it is passed as an argument to the parse_dates parameter.
Try this:
df = s3manager_in.read_csv(object_key='Folder1/file1', sep=';', encoding = 'ISO-8859-1', compression = 'gzip', engine='python')
date_columns = ['PRESENCE_UO_DT_FIN_PREVUE', 'PERSONNE_DT_MODIF']
df[date_columns] = df[date_columns].apply(pd.to_datetime, errors='coerce')
Note that errors='coerce' in pandas.to_datetime will put NaT in place of every bad value in the column 'PRESENCE_UO_DT_FIN_PREVUE'. From the to_datetime docs:
errors : {'ignore', 'raise', 'coerce'}, default 'raise'
    If 'coerce', then invalid parsing will be set as NaT.
Found out what was wrong with the column: in fact, as my colleague pointed out, the column has some values that exceed the maximum value pandas can parse as a date.
The pandas maximum datetime is Timestamp.max = Timestamp('2262-04-11 23:47:16.854775807'),
and it happened that values in that column can be as large as df.loc[df['PRESENCE_UO_DT_FIN_PREVUE'].idxmax()]['PRESENCE_UO_DT_FIN_PREVUE'] = 99991231.0.
The problem is that read_csv with parse_dates does not generate any error or warning, so it is difficult to find out what is wrong.
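One way to surface the offending values (a sketch, assuming the column was loaded as plain floats such as 20180919.0) is to parse it separately with errors='coerce' and look at the rows that come back as NaT:

import pandas as pd

raw = df['PRESENCE_UO_DT_FIN_PREVUE'].dropna()
parsed = pd.to_datetime(raw.astype(int).astype(str), format='%Y%m%d', errors='coerce')
# Values like 99991231.0 overflow Timestamp.max and show up here as NaT.
print(raw[parsed.isna()])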
So to work around this problem, I manually convert the column:
def date_time_processing(var):
    #if var == 99991231.0 or var == 99991230.0 or var == 29991231.0 or var == 99991212.0 or var == 29220331.0 or var == 30000131.0 or var == 30001231.0:
    if var > 21000000.0:
        # Anything past this threshold would overflow pandas' maximum timestamp.
        return (pd.Timestamp.max).strftime('%Y%m%d')
    elif pd.isna(var):
        # 'var is np.nan' is unreliable for values coming out of a float column.
        return pd.NaT
    else:
        return pd.to_datetime(var, format='%Y%m%d')
and then apply it with a lambda function:
df['PRESENCE_UO_DT_FIN_PREVUE'] = df['PRESENCE_UO_DT_FIN_PREVUE'].apply(lambda x: date_time_processing(x))
I would like to read in an Excel file and, using method chaining, convert the column names to lower case and replace any white space with _. The following code runs fine:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
            .rename(columns=str.lower))
    return df
But the code below does not
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
            .rename(columns=str.lower)
            .rename(columns=str.replace(old=" ", new="_")))
    return df
After adding the str.replace line I get the following error: No value for argument 'self' in unbound method call. Can someone shed some light on what I can do to fix this error and why the above does not work?
In addition, when I use str.lower() I get the same error. Why does str.lower work but not str.lower()?
Here's a different syntax which I frequently use:
def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = pd.read_excel(filename, skiprows=5)
    # Chain the vectorised string methods on the columns Index.
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df
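As for why str.lower works as a rename argument but str.lower() and str.replace(old=" ", new="_") do not: rename(columns=...) expects a callable that receives each column name, and the unbound str.lower is such a callable, whereas str.lower() and str.replace(...) try to invoke the method without a string instance, hence the "No value for argument 'self'" error. If the goal is to keep everything in one chain, a rename with a lambda also works (a sketch using the same filename pattern as the question):

import pandas as pd

def supp_read(number):
    filename = f"supplemental-table{number}.xlsx"
    df = (pd.read_excel(filename, skiprows=5)
            .rename(columns=lambda c: c.lower().replace(" ", "_")))
    return df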
How do I return multiple values from a function applied on a Dask Series?
I am trying to return a series from each iteration of dask.Series.apply and for the final result to be a dask.DataFrame.
The following code tells me that the meta is wrong. The all-pandas version however works. What's wrong here?
Update: I think that I am not specifying the meta/schema correctly. How do I do it correctly?
Now it works when I drop the meta argument. However, it raises a warning. I would like to use dask "correctly".
import dask.dataframe as dd
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
def transformMyCol(x):
    # Minimal example function
    return pd.Series(['Tom - ' + str(x), 'Deskflip - ' + str(x / 8), ''])
#
## Pandas Version - Works as expected.
#
pandas_df = pd.DataFrame(data= np.c_[iris['data'], iris['target']], columns= iris['feature_names'] + ['target'])
pandas_df.target.apply(transformMyCol,1)
#
## Dask Version (second attempt) - Raises a warning
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked = df.target.apply(transformMyCol)
unpacked.head()
#
## Dask Version (first attempt) - Raises an exception
#
df = dd.from_pandas(pandas_df, npartitions=10)
unpacked_dask_schema = {"name" : str, "action" : str, "comments" : str}
unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()
This is the error that I get:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
I have also tried the following and it also does not work.
meta_df = pd.DataFrame(dtype='str',columns=list(unpacked_dask_schema.keys()))
unpacked = df.FILEDATA.apply(transformMyCol, meta=meta_df)
unpacked.head()
Same error:
File "/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 3693, in apply_and_enforce
raise ValueError("The columns in the computed data do not match"
ValueError: The columns in the computed data do not match the columns in the provided metadata
You're right, the problem is you're not specifying the meta correctly; more specifically and as the error message says, the metadata columns ("name", "action", "comments") do not match the columns in the computed data (0, 1, 2). You should either:
Change the metadata columns to 0, 1, 2:
unpacked_dask_schema = dict.fromkeys(range(3), str)
df.target.apply(transformMyCol, meta=unpacked_dask_schema)
or
Change transformMyCol to use the named columns:
def transformMyCol(x):
    return pd.Series({
        'name': 'Tom - ' + str(x),
        'action': 'Deskflip - ' + str(x / 8),
        'comments': '',
    })
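With the named-columns version, the original meta mapping should then line up with the computed data (a brief usage sketch):

unpacked = df.target.apply(transformMyCol, meta=unpacked_dask_schema)
unpacked.head()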
I refer to this question - dask dataframe read parquet schema difference
But the metadata returned by Dask does not indicate any differences between the different dataframes. Here is my code, which parses the exception details to find mismatched dtypes. It finds none. There are up to 100 dataframes with 717 columns (each is ~100MB in size).
try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve), engine='pyarrow')
except Exception as ex:
    # Process the ex message to find the diff; this will break if dask changes its error message
    msgs = str(ex).split('\nvs\n')
    cols1 = msgs[0].split('metadata')[0]
    cols1 = cols1.split('was different. \n')[1]
    cols2 = msgs[1].split('metadata')[0]
    df1_err = pd.DataFrame([sub.split(":") for sub in cols1.splitlines()])
    df1_err = df1_err.dropna()
    df2_err = pd.DataFrame([sub.split(":") for sub in cols2.splitlines()])
    df2_err = df2_err.dropna()
    df_err = pd.concat([df1_err, df2_err]).drop_duplicates(keep=False)
    raise Exception('Mismatch dataframes - ' + str(df_err))
The exception I get back is:
'Mismatch dataframes - Empty DataFrame Columns: [0, 1] Index: []'
This error does not occur with fastparquet, but it is so slow that it is unusable.
I added this to the creation of the dataframes (using pandas to_parquet to save them) in an attempt to unify the dtypes by column:
df_float = df.select_dtypes(include=['float16', 'float64'])
df = df.drop(df_float.columns, axis=1)
for col in df_float.columns:
    df_float[col] = df_float.loc[:, col].astype('float32')
df = pd.concat([df, df_float], axis=1)

df_int = df.select_dtypes(include=['int8', 'int16', 'int32'])
try:
    for col in df_int.columns:
        df_int[col] = df_int.loc[:, col].astype('int64')
    df = df.drop(df_int.columns, axis=1)
    df = pd.concat([df, df_int], axis=1)
except ValueError as ve:
    print('Error with upcasting - ' + str(ve))
This appears to work, judging by my exception handler above. But I cannot find out how the dataframes differ, as the exception thrown by dask's read_parquet does not tell me. Any ideas on how to determine what it considers different?
You could use the fastparquet function merge to create a metadata file from the many data files (this will take some time, since it scans all the files). Thereafter, pyarrow will use this metadata file, and that might be enough to get rid of the problem for you.
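A rough sketch of that suggestion (the paths are hypothetical; check the fastparquet docs for the exact merge signature):

import glob
import dask.dataframe as dd
from fastparquet import writer

data_filenames = sorted(glob.glob('output/*.parquet'))
# Scans every file and writes a consolidated _metadata file alongside them.
writer.merge(data_filenames)

# dask/pyarrow can then read the directory using that common metadata.
df = dd.read_parquet('output/', engine='pyarrow')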
I'm attempting to create a raw string variable from a pandas dataframe, which will eventually be written to a .cfg file, by first joining two columns together as shown below while avoiding None:
Section of df:
     command                      value
...
439  sensitivity                  "0.9"
440  cl_teamid_overhead_always    1
441  host_writeconfig             None
...
code:
...
df = df['value'].replace('None', np.nan, inplace=True)
print df
df = df['command'].astype(str)+' '+df['value'].astype(str)
print df
cfg_output = '\n'.join(df.tolist())
print cfg_output
I've attempted to replace all the None values with NaN first so that no lines in cfg_output contain "None" as part of the string. However, by doing so I seem to get a few undesired results. I used print statements to see what is going on.
It seems that df = df['value'].replace('None', np.nan, inplace=True) simply outputs None.
It also seems that df = df['command'].astype(str)+' '+df['value'].astype(str) and cfg_output = '\n'.join(df.tolist()) cause the following error:
TypeError: 'NoneType' object has no attribute '__getitem__'
Therefore, I was thinking that by ignoring any occurrences of NaN, the code might run smoothly, although I'm unsure how to do so using pandas.
Ultimately, my desired output would be as follows:
sensitivity "0.9"
cl_teamid_overhead_always 1
host_writeconfig
First of all, df['value'].replace('None', np.nan, inplace=True) returns None because you're calling the method with the inplace=True argument. This argument tells replace not to return anything but instead modify the original Series in place, similar to how append works on lists.
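A tiny illustration of that behaviour (made-up values):

import numpy as np
import pandas as pd

s = pd.Series(['0.9', '1', 'None'])
print(s.replace('None', np.nan, inplace=True))  # prints None: nothing is returned
print(s)                                        # the Series itself was modified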
That being said, you can also get the desired output by calling fillna with an empty string:
import pandas as pd
import numpy as np
d = {
'command': ['sensitivity', 'cl_teamid_overhead_always', 'host_writeconfig'],
'value': ['0.9', 1, None]
}
df = pd.DataFrame(d)
# df['value'].replace('None', np.nan, inplace=True)
df = df['command'].astype(str) + ' ' + df['value'].fillna('').astype(str)
cfg_output = '\n'.join(df.tolist())
>>> print(cfg_output)
sensitivity 0.9
cl_teamid_overhead_always 1
host_writeconfig
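To actually write the joined text out to the .cfg file mentioned in the question (the filename here is just an example):

with open('settings.cfg', 'w') as f:
    f.write(cfg_output + '\n')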
You can also replace None with '':
df=df.replace('None','')
df['command'].astype(str)+' '+df['value'].astype(str)
Out[436]:
439 sensitivity 0.9
440 cl_teamid_overhead_always 1
441 host_writeconfig
dtype: object