Is there a way to automate data cleaning for pandas DataFrames? - python

I am cleaning my data for a machine learning project by replacing missing values with zeros in the 'Age' column and with the column mean in the 'Fare' column. The code for this is given below:
train_data['Age'] = train_data['Age'].fillna(0)
mean = train_data['Fare'].mean()
train_data['Fare'] = train_data['Fare'].fillna(mean)
Since I would have to do this multiple times for other sets of data, I want to automate the process by creating a generic function that takes a DataFrame as input, performs the operations that modify it, and returns the modified DataFrame. The code for that is given below:
def data_cleaning(df):
    df['Age'] = df['Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df['Fare'].fillna()
    return df
However when I pass the training data DataFrame:
train_data = data_cleaning(train_data)
I get the following error:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_42/1440633985.py in <module>
1 #print(train_data)
----> 2 train_data = data_cleaning(train_data)
3 cross_val_data = data_cleaning(cross_val_data)
/tmp/ipykernel_42/3053068338.py in data_cleaning(df)
2 df['Age'] = df['Age'].fillna(0)
3 fare_mean = df['Fare'].mean()
----> 4 df['Fare'] = df['Fare'].fillna()
5 return df
/opt/conda/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args,
**kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
/opt/conda/lib/python3.7/site-packages/pandas/core/series.py in fillna(self, value,
method, axis, inplace, limit, downcast)
4820 inplace=inplace,
4821 limit=limit,
-> 4822 downcast=downcast,
4823 )
4824
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in fillna(self, value,
method, axis, inplace, limit, downcast)
6311 """
6312 inplace = validate_bool_kwarg(inplace, "inplace")
-> 6313 value, method = validate_fillna_kwargs(value, method)
6314
6315 self._consolidate_inplace()
/opt/conda/lib/python3.7/site-packages/pandas/util/_validators.py in
validate_fillna_kwargs(value, method, validate_scalar_dict_value)
368
369 if value is None and method is None:
--> 370 raise ValueError("Must specify a fill 'value' or 'method'.")
371 elif value is None and method is not None:
372 method = clean_fill_method(method)
ValueError: Must specify a fill 'value' or 'method'.
After some research, I found that I would have to use the apply() and map() functions instead, but I am not sure how to pass in the mean value of the column. Furthermore, this does not scale well, as I would have to calculate all the fillna values before passing them into the function, which is cumbersome. Therefore I want to ask: is there a better way to automate data cleaning?

In the line df['Fare'] = df['Fare'].fillna() in your function, you did not pass anything to fill the NaNs with, which is why it raises an error. You should change it to df['Fare'] = df['Fare'].fillna(fare_mean).
If you intend to make this reusable from another file in the same directory, you can simply import it in that file with:
from file_that_contain_function import function_name
And if you intend to make it reusable across your workspace/virtual environment, you may need to create your own Python package.
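For example (a minimal sketch; the module name cleaning.py here is hypothetical):
# cleaning.py lives in the same directory and defines data_cleaning()
from cleaning import data_cleaning

train_data = data_cleaning(train_data)
cross_val_data = data_cleaning(cross_val_data)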

So yes, the other answer explains where the error is coming from.
However, the warning at the beginning has nothing to do with filling NaNs. The warning is telling you that you may be setting values on a copy of a slice of your DataFrame. Change your code to
def data_cleaning(df):
    df['Age'] = df.loc[:, 'Age'].fillna(0)
    fare_mean = df['Fare'].mean()
    df['Fare'] = df.loc[:, 'Fare'].fillna(fare_mean)  # <- and also fix this error
    return df
I suggest also searching that specific warning here, as there are hundreds of posts detailing this warning and how to deal with it. Here's a good one.
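As for automating this further (a minimal sketch, not from the original answers; the function and parameter names are made up): pass the columns and fill strategies in as arguments, so the fill values are computed inside the function and the same cleaner works for every dataset.
def clean_missing(df, zero_cols=(), mean_cols=()):
    # work on a copy to avoid SettingWithCopyWarning if a slice is passed in
    df = df.copy()
    for col in zero_cols:
        df[col] = df[col].fillna(0)
    for col in mean_cols:
        df[col] = df[col].fillna(df[col].mean())
    return df

train_data = clean_missing(train_data, zero_cols=['Age'], mean_cols=['Fare'])
cross_val_data = clean_missing(cross_val_data, zero_cols=['Age'], mean_cols=['Fare'])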

Related

find which row in dataframe causes groupby or transform to fail

I have a dataframe of shape (2061, 5) and the following line:
df[6] = df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
..causes the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-19-27721ddd8064> in <module>
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in transform(self, func, *args, **kwargs)
463
464 if not isinstance(func, str):
--> 465 return self._transform_general(func, *args, **kwargs)
466
467 elif func not in base.transform_kernel_whitelist:
~/.pyenv/versions/miniconda3-latest/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _transform_general(self, func, *args, **kwargs)
487 for name, group in self:
488 object.__setattr__(group, "name", name)
--> 489 res = func(group, *args, **kwargs)
490
491 if isinstance(res, (ABCDataFrame, ABCSeries)):
<ipython-input-19-27721ddd8064> in <lambda>(x)
----> 1 df.groupby(df.index)[6].transform(lambda x: ' '.join(x))
TypeError: sequence item 0: expected str instance, float found
I developed that code on a subset of the dataframe and it seemed to be doing exactly what I wanted to the data. So now if I for example do this:
df = df.head(50)
..and run the code, the error message goes away again.
I think somewhere a type cast is happening, except at one of the rows it decides to do something else. How can I efficiently find which row in the df is causing this, without manually reading through the whole two-thousand-row column or doing trial and error with .head() of different sizes?
EDIT: Mask the column in question to keep only rows where the column holds a float value, then check the first index, i.e.:
mask = df['column_in_q'].apply(lambda x: type(x) == float)
# This returns a Boolean Series that can be used to keep only the True rows
float_df = df[mask]  # subset of the df that meets the condition
print(float_df.head())
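A minimal sketch (not from the original thread) applying that idea to the column used in the failing transform, assuming column 6 is supposed to contain only strings:
bad = df[df[6].apply(lambda x: not isinstance(x, str))]
print(bad.index.unique())   # row labels holding non-string values
df[6] = df[6].astype(str)   # or drop/repair those rows before rerunning the transform
df[6] = df.groupby(df.index)[6].transform(' '.join)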
I think this is because the groupby method returns a GroupBy object, not a DataFrame. You have to specify an aggregation method, which you can then subset. That is:
df[6] = df.groupby(df.index).sum()[6].transform(lambda x: ' '.join(x))
See here for more: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Python Pandas Reshape Data Frame

It seems to be very basic knowledge, but I got stuck despite having some theoretical background in data processing (via other software). Worth mentioning that I'm new to Python and the pandas library.
So. I've got a data frame:
My task is to put the values of the 'Series Name' column into separate columns (transform from long to wide). I've spent ages trying different methods, but got only errors.
For example:
mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')
And I got an error:
... a lot of text...
ValueError: Length of passed values is 2487175, index implies 2
Could anybody guide me through that process please? Thanks.
This is for the code:
'mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')'
Error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-8169d6d374c7> in <module>
----> 1 mydata = mydata.pivot(index=['Country', 'Year'], columns='Series Name', values='Value')
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
5192 """
5193 from pandas.core.reshape.reshape import pivot
-> 5194 return pivot(self, index=index, columns=columns, values=values)
5195
5196 _shared_docs['pivot_table'] = """
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
412 else:
413 indexed = self._constructor_sliced(self[values].values,
--> 414 index=index)
415 return indexed.unstack(columns)
416
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
260 'Length of passed values is {val}, '
261 'index implies {ind}'
--> 262 .format(val=len(data), ind=len(index)))
263 except TypeError:
264 pass
ValueError: Length of passed values is 2487175, index implies 2
Try maybe:
mydata = mydata.pivot_table(index=['Country', 'Year'], columns='Series Name', values='Value', aggfunc='sum')
(if you want to sum your Value); it seems that you need to somehow aggregate your data explicitly.
Although it would be good if you shared the full error message.
I managed to reproduce your error. Like I said, you need to provide an aggregating function:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqr"), "b": list("abbbaa"), "c": [4,3,6,2,7,5], "d": list("pqqppp")})
df2=df.pivot(index=["b", "d"], columns="a", values="c")
#ValueError: Length of passed values is 6, index implies 2
df2=df.pivot_table(index=["b", "d"], columns="a", values="c", aggfunc=set)
#works fine - you need aggregation function e.g. list/set to collect all/unique values or e.g. sum/max to do some numeric operation
Nearly there. The resulting table is
How is it possible to put 'Country' and 'Year' on the same level as the other column names, to be able to export it normally to Excel? If I export it as it is now, 'Country' and 'Year' are not included in the table.
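One way to get there (a sketch using the toy frame from the answer above, not from the original thread): reset_index() moves the index levels back into ordinary columns before exporting.
df2 = df.pivot_table(index=["b", "d"], columns="a", values="c", aggfunc="sum")
df2 = df2.reset_index()   # "b" and "d" become regular columns again
df2.columns.name = None   # drop the leftover "a" axis label
df2.to_excel("output.xlsx", index=False)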

Issues with creating column name for DataFrame in Python3

I do not understand why "u" has NaN values. What am I doing wrong here?
>>> z=pd.DataFrame([['abcb','asasa'],['sdsd','aeio']])
>>> z
0 1
0 abcb asasa
1 sdsd aeio
>>> u=pd.DataFrame(z,columns=['hello','ajsajs'])
>>> u
hello ajsajs
0 NaN NaN
1 NaN NaN
Alternate construction calls
You can use the underlying NumPy array:
u = pd.DataFrame(z.values, columns=['hello','ajsajs'])
hello ajsajs
0 abcb asasa
1 sdsd aeio
Alternately, you could use:
u = z.rename(columns={0: 'hello',1: 'ajsajs'})
And lastly as suggested by #Dark:
u = z.set_axis(['hello','ajsajs'], axis=1, inplace=False)
A small note on inplace in set_axis -
WARNING: inplace=None currently falls back to True, but in a future
version, will default to False. Use inplace=True explicitly rather
than relying on the default.
In pandas 0.20.3 the syntax would be just:
u = z.set_axis(axis=1, labels=['hello','ajsajs'])
#Dark's solution appears fastest here.
Why the current method doesn't work
I believe the issue here is that there's a .reindex being called when the DataFrame is constructed in this way. Here's some source code where ellipses denote irrelevant stuff I'm leaving out:
from pandas.core.internals import BlockManager

# pandas.core.frame.DataFrame
class DataFrame(NDFrame):
    def __init__(self, data=None, index=None, columns=None, dtype=None,
                 copy=False):
        # ...
        if isinstance(data, DataFrame):
            data = data._data
        if isinstance(data, BlockManager):
            mgr = self._init_mgr(data, axes=dict(index=index, columns=columns),
                                 dtype=dtype, copy=copy)
        # ... a bunch of other if statements irrelevant to your case
        NDFrame.__init__(self, mgr, fastpath=True)
        # ...
What's happening here:
DataFrame inherits from a more generic base class which in turn has multiple inheritance. (Pandas is great, but its source can be like trying to backtrack through a spider's web.)
In u = pd.DataFrame(z, columns=['hello','ajsajs']), z is a DataFrame. Therefore, the first if statement in the snippet above is True and data = data._data. What's _data? It's a BlockManager.* (To be continued below...)
Because we just converted what you passed into its BlockManager, the next if statement also evaluates True. Then mgr gets assigned to the result of the _init_mgr method, and the parent class's __init__ gets called with mgr.
* confirm with isinstance(z._data, BlockManager).
Now on to part 2...
# pandas.core.generic.NDFrame
class NDFrame(PandasObject, SelectionMixin):
    def __init__(self, data, axes=None, copy=False, dtype=None,
                 fastpath=False):
        # ...

    def _init_mgr(self, mgr, axes=None, dtype=None, copy=False):
        """ passed a manager and a axes dict """
        for a, axe in axes.items():
            if axe is not None:
                mgr = mgr.reindex_axis(axe,
                                       axis=self._get_block_manager_axis(a),
                                       copy=False)
        # ...
        return mgr
Here is where _init_mgr is defined, which gets called above. Essentially in your case you have:
columns=['hello','ajsajs']
axes=dict(index=None, columns=columns)
# ...
When you reindex an axis and specify new labels, none of which are present in the old object, you get all NaNs. This seems like a deliberate design decision. Consider this related example to prove the point, where one new column is present in the old object and one is not:
pd.DataFrame(z, columns=[0, 'ajsajs'])
0 ajsajs
0 abcb NaN
1 sdsd NaN
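You can reproduce the same effect directly with reindex (a small sketch using z from the question):
z.reindex(columns=['hello', 'ajsajs'])   # all NaN: neither label exists in z.columns
z.reindex(columns=[0, 'ajsajs'])         # column 0 is kept, 'ajsajs' is all NaN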

simple dask map_partitions example

I read the following SO thread and now am trying to understand it. Here is my example:
import dask.dataframe as dd
import pandas as pd
from dask.multiprocessing import get
import random
df = pd.DataFrame({'col_1':random.sample(range(10000), 10000), 'col_2': random.sample(range(10000), 10000) })
def test_f(col_1, col_2):
    return col_1 * col_2
ddf = dd.from_pandas(df, npartitions=8)
ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)
It generates the error below. What am I doing wrong? Also, I am not clear on how to pass additional parameters to the function in map_partitions.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
136 try:
--> 137 yield
138 except Exception as e:
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
3130 with raise_on_meta_error(funcname(func)):
-> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
3132
TypeError: test_f() got an unexpected keyword argument 'columns'
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-9-913789c7326c> in <module>()
----> 1 ddf['result'] = ddf.map_partitions(test_f, columns=['col_1', 'col_2']).compute(get=get)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(self, func, *args, **kwargs)
469 >>> ddf.map_partitions(func).clear_divisions() # doctest: +SKIP
470 """
--> 471 return map_partitions(func, self, *args, **kwargs)
472
473 #insert_meta_param_description(pad=12)
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in map_partitions(func, *args, **kwargs)
3163
3164 if meta is no_default:
-> 3165 meta = _emulate(func, *args, **kwargs)
3166
3167 if all(isinstance(arg, Scalar) for arg in args):
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py in _emulate(func, *args, **kwargs)
3129 """
3130 with raise_on_meta_error(funcname(func)):
-> 3131 return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
3132
3133
~\AppData\Local\conda\conda\envs\tensorflow\lib\contextlib.py in __exit__(self, type, value, traceback)
75 value = type()
76 try:
---> 77 self.gen.throw(type, value, traceback)
78 except StopIteration as exc:
79 # Suppress StopIteration *unless* it's the same exception that
~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py in raise_on_meta_error(funcname)
148 ).format(" in `{0}`".format(funcname) if funcname else "",
149 repr(e), tb)
--> 150 raise ValueError(msg)
151
152
ValueError: Metadata inference failed in `test_f`.
Original error is below:
------------------------
TypeError("test_f() got an unexpected keyword argument 'columns'",)
Traceback:
---------
File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\utils.py", line 137, in raise_on_meta_error
yield
File "C:\Users\some_user\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\dask\dataframe\core.py", line 3131, in _emulate
return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
There is an example in the map_partitions docs that achieves exactly what you are trying to do:
ddf.map_partitions(lambda df: df.assign(z=df.x * df.y))
When you call map_partitions (just like when you call .apply() on a pandas.DataFrame), the function that you try to map (or apply) will be given the dataframe as its first argument.
In the case of dask.dataframe.map_partitions this first argument will be a partition, and in the case of pandas.DataFrame.apply it will be the whole dataframe.
This means that your function has to accept the dataframe (partition) as its first argument, and in your case it could look like this:
def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])
Note that assignment of a new column in this case happens (i.e. gets scheduled to happen) BEFORE you call .compute().
In your example you assign column AFTER you call .compute(), which kind of defeats the purpose of using dask. I.e. after you call .compute() the results of that operation are loaded into memory if there is enough space for those results (if not you just get MemoryError).
So for your example to work you could:
1) Use a function (with column names as arguments):
def test_f(df, col_1, col_2):
    return df.assign(result=df[col_1] * df[col_2])

ddf_out = ddf.map_partitions(test_f, 'col_1', 'col_2')
# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory
2) Use a lambda (with column names hardcoded in the function):
ddf_out = ddf.map_partitions(lambda df: df.assign(result=df.col_1 * df.col_2))
# Here is a good place to do something with the BIG ddf_out dataframe before calling .compute()
result = ddf_out.compute(get=get)  # Will load the whole dataframe into memory
Update:
To apply a function on a row-by-row basis, here is a quote from the post you linked:
map / apply
You can map a function row-wise across a series with map
df.mycolumn.map(func)
You can map a function row-wise across a dataframe with apply
df.apply(func, axis=1)
I.e. for the example function in your question, it might look like this:
def test_f(dds, col_1, col_2):
    return dds[col_1] * dds[col_2]
Since you will be applying it on a row-by-row basis the function's first argument will be a series (i.e. each row of a dataframe is a series).
To apply this function then you might call it like this:
dds_out = ddf.apply(
    test_f,
    args=('col_1', 'col_2'),
    axis=1,
    meta=('result', int)
).compute(get=get)
This will return a series named 'result'.
I guess you could also call .apply on each partition with a function, but it does not look to be any more efficient than calling .apply on the dataframe directly. But maybe your tests will prove otherwise.
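If you want to try that anyway, a rough sketch (keeping the same column names) could look like this:
# apply row-wise inside each partition; meta describes the resulting series
ddf_out = ddf.map_partitions(
    lambda part: part.apply(lambda row: row['col_1'] * row['col_2'], axis=1),
    meta=('result', 'int64'),
)
result = ddf_out.compute(get=get)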
Your test_f takes two arguments: col_1 and col_2. You pass a single argument, ddf.
Try something like
In [5]: dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
Out[5]:
Dask Series Structure:
npartitions=8
0 int64
1250 ...
...
8750 ...
9999 ...
dtype: int64
Dask Name: test_f, 32 tasks
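To attach that result back onto the frame before computing (a small sketch along the same lines):
ddf['result'] = dd.map_partitions(test_f, ddf['col_1'], ddf['col_2'])
result = ddf.compute(get=get)   # now loads the frame with the new column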

Exception during groupby pandas

I am just beginning to learn analytics with Python for network analysis, using the Python for Data Analysis book, and I'm getting confused by an exception I get while doing some groupbys... here's my situation.
I have a CSV of NetFlow data that I've imported to pandas. The data looks something like:
dt, srcIP, srcPort, dstIP, dstPort, bytes
2013-06-06 00:00:01.123, 123.123.1.1, 12345, 234.234.1.1, 80, 75
I've imported and indexed the data as follows:
df = pd.read_csv('mycsv.csv')
df.index = pd.to_datetime(df.pop('dt'))
What I want is a count of unique srcIPs which visit my servers per time period (I have data over several days and I'd like the time period broken down by date and hour). I can obtain an overall traffic graph by grouping and plotting as follows:
df.groupby([lambda t: t.date(), lambda t: t.hour]).srcIP.nunique().plot()
However, I want to know how that overall traffic is split amongst my servers. My intuition was to additionally group by the 'dstIP' column (which only has 5 unique values), but I get errors when I try to aggregate on srcIP.
grouped = df.groupby([lambda t: t.date(), lambda t: t.hour, 'dstIP'])
grouped.sip.nunique()
...
Exception: Reindexing only valid with uniquely valued Index objects
So, my specific question is: How can I avoid this exception in order to create a plot where traffic is aggregated over 1 hour blocks and there is a different series for each server.
More generally, please let me know what newb errors I'm making.
Also, the data does not have regular frequency timestamps and I don't want sampled data in case that makes any difference in your answer.
EDIT 1
This is my ipython session exactly as input. Output omitted except for the deepest few calls in the error.
EDIT 2
Upgrading pandas from 0.8.0 to 0.12.0 has yielded the more descriptive exception shown below.
import numpy as np
import pandas as pd
import time
import datetime
full_set = pd.read_csv('june.csv', parse_dates=True, index_col=0)
full_set.sort_index(inplace=True)
gp = full_set.groupby(lambda t: (t.date(), t.hour, full_set['dip'][t]))
gp['sip'].nunique()
...
/usr/local/lib/python2.7/dist-packages/pandas/core/groupby.pyc in _make_labels(self)
1239 raise Exception('Should not call this method grouping by level')
1240 else:
-> 1241 labs, uniques = algos.factorize(self.grouper, sort=self.sort)
1242 uniques = Index(uniques, name=self.name)
1243 self._labels = labs
/usr/local/lib/python2.7/dist-packages/pandas/core/algorithms.pyc in factorize(values, sort, order, na_sentinel)
123 table = hash_klass(len(vals))
124 uniques = vec_klass()
--> 125 labels = table.get_labels(vals, uniques, 0, na_sentinel)
126
127 labels = com._ensure_platform_int(labels)
/usr/local/lib/python2.7/dist-packages/pandas/hashtable.so in pandas.hashtable.PyObjectHashTable.get_labels (pandas/hashtable.c:12229)()
/usr/local/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
52 def __hash__(self):
53 raise TypeError('{0!r} objects are mutable, thus they cannot be'
---> 54 ' hashed'.format(self.__class__.__name__))
55
56 def __unicode__(self):
TypeError: 'TimeSeries' objects are mutable, thus they cannot be hashed
So I'm not 100 percent sure why that exception was raised.. but a few suggestions:
You can read in your data and parse the datetime and index by the datetime all at once with read_csv:
df = pd.read_csv('mycsv.csv', parse_dates=True, index_col=0)
Then you can form your groups by using a lambda function that returns a tuple of values:
gp = df.groupby( lambda t: ( t.date(), t.hour, df['dstIP'][t] ) )
The input to this lambda function is the index; we can use this index to go into the dataframe in the outer scope, retrieve the dstIP value at that index, and thus factor it into the grouping.
Now that we have the grouping, we can apply the aggregator:
gp['srcIP'].nunique()
I ended up solving my problem by adding a new column of hour-truncated datetimes to the original dataframe as follows:
f = lambda i: i.strftime('%Y-%m-%d %H:00:00')
full_set['hours'] = full_set.index.map(f)
Then I can groupby('dip') and loop through each destIP creating an hourly grouped plot as I go...
dipgroup = full_set.groupby('dip')
for d, g in dipgroup:
    g.groupby('hours').sip.nunique().plot()
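For completeness, a sketch of an alternative (not from the original answers; it assumes a recent pandas and the 'dip'/'sip' column names from the edit): group by the hour-floored index and the destination IP, then unstack so each server becomes its own series.
hourly = full_set.groupby([full_set.index.floor('H'), 'dip'])['sip'].nunique()
hourly.unstack('dip').plot()   # one line per destination IP: hourly unique source IPs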
