How to remove garbage values or missing values in my dataset? - python

I have a dataset of about 1.4m rows x 16 columns. There are no missing values such as empty cells or 'NaN'; instead, some entries are strings like "+AC0-5.3" where a plain number should be.
The numeric part after the '+AC0-' prefix differs from entry to entry, but they all follow the same '+***-***' pattern. How should I deal with this? I guess dropping every row that contains such an entry would be a good idea.
I tried a solution from Stack Overflow:
dataset = dataset[~dataset['total+AF8-amount'].astype(str).str.startswith('+')]
This removes all the rows that have such an entry, but it only checks the target column, 'total+AF8-amount'.
I want to remove every row that has this weird value in any column. How can I accomplish this?
This is the training set .head().
Sorry for the bad formatting; every space marks the start of the next column.
ID vendor+AF8-id pickup+AF8-loc drop+AF8-loc driver+AF8-tip mta+AF8-tax distance pickup+AF8-time drop+AF8-time num+AF8-passengers toll+AF8-amount payment+AF8-method rate+AF8-code stored+AF8-flag extra+AF8-charges improvement+AF8-charge total+AF8-amount
0 1 170 233 1.83 0.5 0.7 04-04-2017 17.59 04-04-2017 18.05 1 0 1 1 N 1 0.3 9.13
1 2 151 243 3.56 0.5 4.64 04-03-2017 19.03 04-03-2017 19.20 1 0 1 1 N 1 0.3 21.36
2 2 68 90 1.5 0.5 1.29 04-03-2017 15.06 04-03-2017 15.12 2 0 1 1 N 0 0.3 8.8
3 2 142 234 1.5 0.5 2.74 04-04-2017 8.10 04-04-2017 8.27 1 0 1 1 N 0 0.3 14.8
1656 2 114 255 3.96 0.5 3.92 04-05-2017 22.57 04-05-2017 23.22 2 0 1 1 N 0.5 0.3 23.76
1657 2 230 100 0 **+AC0-0.5** 0.51 04-06-2017 8.14 04-06-2017 8.18 1 0 3 1 N 0 **+AC0-0.3 +AC0-5.3**
1658 2 163 226 0 0.5 3.93 04-07-2017 4.06 04-07-2017 4.20 1 0 2 1 N 0.5 0.3 15.8
1659 2 229 90 2.56 0.5 2.61 04-07-2017 13.49 04-07-2017 14.06 2 0 1 1 N 0 0.3 15.36
For example, the row with ID 1657 has those bad entries, and there are other such rows. This is what I have done:
dataset = pd.read_csv('chh-OLA-Kaggle.csv', index_col = 'ID')
testset = pd.read_csv('test.csv', index_col = 'ID')
dataset.dropna(axis = 0, subset = ['total+AF8-amount'], inplace = True)
dataset = dataset[~dataset['total+AF8-amount'].astype(str).str.startswith('+')]
X = dataset.iloc[:, :15].values
y = dataset['total+AF8-amount'].values
One more problem: after this, all the values are still of type 'str'. How do I convert the numerical columns to 'float64' so that I can fit the data to a model?
Are all datasets like this?

This has already been answered, but I'd like to add an approach I used a few years back (from SO itself) that I find more intuitive.
In a regex, + is a quantifier, so str.contains('+') will not match a literal plus sign (it errors out, since there is nothing to repeat). Use str.contains(r'\+') to match the literal + character, or pass regex=False.
So we can loop with for col in df over every column, call str.contains on each, stack the resulting boolean Series with np.column_stack() into a single mask, and then filter the rows with DataFrame.loc using mask.any(axis=1).
Example DataFrame:
>>> df
col1 col2 col3
0 32.1 33.2 +232
1 34.2 3.4 3.4
2 32.44 +232 32.44
3 +232 1.32 +234
4 1.312 131.23 131.23
Solution:
>>> mask = np.column_stack([df[col].str.contains(r"\+", na=False) for col in df])
>>> df.loc[ ~ mask.any(axis=1)]
col1 col2 col3
1 34.2 3.4 3.4
4 1.312 131.23 131.23
Solution 2:
Purely with pandas, without np.column_stack. Note that this returns a DataFrame, whereas the NumPy version above returns a numpy.ndarray, which is a natural fit for boolean masking.
>>> mask = df.apply(lambda x: x.str.contains(r'\+', na=False))
>>> df.loc[ ~ mask.any(axis=1)]
col1 col2 col3
1 34.2 3.4 3.4
4 1.312 131.23 131.23
In case you also need to convert to float, try the following:
df.loc[ ~ mask.any(axis=1)].astype(float)
Note:
Since you asked: ~ negates a boolean vector, which is how the data is filtered here. The other boolean operators are | for or, & for and, and ~ for not; each condition must be grouped with parentheses.
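A tiny illustration of the grouping (my own toy frame, not the question's data):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})

# Each condition gets its own parentheses; combine them with & (and), | (or), ~ (not).
print(df[(df['a'] > 1) & ~(df['b'] == 40)])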

First of all, count how many rows contain this type of entry (values starting with '+***-***'). If there are only a few, you can simply drop those rows.
If not, you should clean each row instead; for that you can use the apply() method with axis=1.
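A minimal sketch of that counting step (assuming df is the training DataFrame loaded from the CSV):
bad_rows = df.astype(str).apply(lambda col: col.str.startswith('+')).any(axis=1)
print(bad_rows.sum())     # number of affected rows
df_clean = df[~bad_rows]  # drop them if the count is small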

You can do it this way.
Example dataframe:
Col1 Col2 Col3
0 32.1 33.2 +232
1 34.2 3.4 3.4
2 32.44 +232 32.44
3 +232 1.32 +232
4 1.3123 131.23 131.23
You want to remove every row where any column's value starts with +, and also convert the columns back to float64 to pass them to your model. Then do:
for x in df.columns:
    df = df[~df[x].astype(str).str.startswith('+')]
    df[x] = df[x].astype(float)
If you do not want to loop over all columns, you can just pass the column names in a list instead of df.columns.
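For instance, restricting the cleanup to a few of the numeric columns from the question (adjust the list to your data):
num_cols = ['driver+AF8-tip', 'mta+AF8-tax', 'toll+AF8-amount', 'total+AF8-amount']
for x in num_cols:
    df = df[~df[x].astype(str).str.startswith('+')]
    df[x] = df[x].astype(float)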
Final output of the loop over all columns on the example frame, after getting rid of rows having entries starting with +:
Col1 Col2 Col3
1 34.2000 3.40 3.40
4 1.3123 131.23 131.23
Output of df.info() showing they are now in float64 type:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 4
Data columns (total 3 columns):
Col1 2 non-null float64
Col2 2 non-null float64
Col3 2 non-null float64
dtypes: float64(3)
memory usage: 64.0 bytes

Related

Efficient way to count NaN columns across all rows of a large pandas dataset?

I'm currently counting the number of missing columns across my full dataset with:
missing_cols = X.apply(lambda x: x.shape[0] - x.dropna().shape[0], axis=1).value_counts().to_frame()
When I run this, my RAM usage dramatically increases. In Kaggle, it's enough to crash the machine. After the operation and a gc.collect(), I don't seem to get all of the memory back, hinting at some sort of leak.
I'm trying to get a feel for the number of rows missing 1 column of data, 2 columns of data, 3 columns of data, etc.
Is there a more efficient way to perform this calculation?
To get the same output as your code, you could use:
df.isnull().sum(axis=1).value_counts().to_frame()
This is an example:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['col1'] = [np.nan, 1, 3, 5, np.nan]
df['col2'] = [2, np.nan, np.nan, 3, 6]
df['col3'] = [1, 3, np.nan, 4, np.nan]
print(df)
print(df.isnull().sum(axis=1))
print(df.isnull().sum(axis=0))
col1 col2 col3
0 NaN 2.0 1.0
1 1.0 NaN 3.0
2 3.0 NaN NaN
3 5.0 3.0 4.0
4 NaN 6.0 NaN
0 1
1 1
2 2
3 0
4 2
dtype: int64
col1 2
col2 2
col3 2
dtype: int64
As you can see, you get the count of NaN values by row and by column.
Now doing:
df.isnull().sum(axis=1).value_counts().to_frame()
0
2 2
1 2
0 1
You can also count NaN values per row with:
df.isna().sum(axis=1)
If this is causing your machine to crash, I would suggest iterating chunk-wise.
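A minimal sketch of the chunk-wise idea (the file name and chunk size are hypothetical): accumulate the per-row NaN counts chunk by chunk and combine the tallies at the end.
import pandas as pd

counts = pd.Series(dtype='float64')
for chunk in pd.read_csv('train.csv', chunksize=100_000):
    counts = counts.add(chunk.isnull().sum(axis=1).value_counts(), fill_value=0)
print(counts.astype(int).sort_index())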

Calculating only the current and previous rows in a Pandas data series

New to python. I'm sure there is a very simple solution to this but I'm struggling to find it.
I have a series of positive and negative numbers, and I want to know what percentage of the numbers are positive. I've accomplished that for the whole dataset, but I would like the calculation to run cumulatively on every row.
The dataset I'm working with is quite large but here is an example:
import pandas as pd
data = {'numbers': [100, 300, 150, -150, -75, -100]}
df = pd.DataFrame(data)
df['count'] = df['numbers'].count()
df['pct_positive'] = df.numbers[df.numbers > 0].count() / df['count']
print(df)
Here is the actual result:
numbers count pct_positive
0 100 6 0.5
1 300 6 0.5
2 150 6 0.5
3 -150 6 0.5
4 -75 6 0.5
5 -100 6 0.5
Here is my desired result:
numbers count pct_positive
0 100 1 1.0
1 300 2 1.0
2 150 3 1.0
3 -150 4 0.75
4 -75 5 0.60
5 -100 6 0.5
Note how 'count' and 'pct_positive' are calculated row by row in the desired result, while they are simply overall totals in the actual result.
In this case 'count' is redundant with your index, so you can create that column from the index (or just keep using the index). Then take the cumulative sum (.cumsum) of a boolean Series that checks > 0 and divide by 'Count' to get the running percent positive.
df['Count'] = df.index+1
df['pct_pos'] = df.numbers.gt(0).cumsum()/df.Count
numbers Count pct_pos
0 100 1 1.00
1 300 2 1.00
2 150 3 1.00
3 -150 4 0.75
4 -75 5 0.60
5 -100 6 0.50
Also, avoid naming a column 'count' as it is a method.
Try:
df.numbers.gt(0).cumsum().div(df.numbers.notnull().cumsum())
Output:
0 1.00
1 1.00
2 1.00
3 0.75
4 0.60
5 0.50
Name: numbers, dtype: float64
Details:
Check whether each value of df.numbers is greater than 0 (positive), then take the cumulative sum of that boolean column.
Count the numbers seen so far by converting with notnull to boolean and taking its cumulative sum.
Divide the positive count by the total count.
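As an aside (my own variant, not part of the answer above), the expanding mean of the boolean Series gives the same running fraction in one step, since the mean of booleans is the share of True values:
df['pct_positive'] = df['numbers'].gt(0).expanding().mean()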

Pandas: update a column based on index in another dataframe

I want to update a couple of columns in a dataframe using a multiplying factor from another df (both dfs have a 'KEY' column). Though I was able to achieve this, it takes a lot of processing time since I have a few million records. I am looking for a faster solution, if there is one.
Let me explain my scenario using dummy dfs. I have a dataframe df1 as below
In [8]: df1
Out[8]:
KEY col2 col3 col4
0 1 1 10 5
1 2 7 13 8
2 1 12 15 12
3 4 3 23 1
4 3 14 5 6
Now I want to change col2 and col3 by a factor that I fetch from the below df2 dataframe based on the KEY.
In [11]: df2
Out[11]:
FACTOR
KEY
1 100
2 3000
3 1000
4 200
5 50
I'm using the below for loop to achieve what I need.
In [12]: for index, row in df2.iterrows():
    ...:     df1.loc[df1['KEY'] == index, ['col2', 'col3']] *= df2.loc[index]['FACTOR']
In [13]: df1
Out[13]:
KEY col2 col3 col4
0 1 100 1000 5
1 2 21000 39000 8
2 1 1200 1500 12
3 4 600 4600 1
4 3 14000 5000 6
This does the job, but my actual data has a few million records arriving in real time, and the loop takes about 15 seconds for each batch of incoming data. I am looking for a better solution, since iterating over df2 row by row seems to be the bottleneck.
You should use a merge:
c = df1.merge(df2, on="KEY")
The c dataframe will now contain the "FACTOR" column, which you can then multiply into col2 and col3 in one vectorized step.
If one of the fields to merge on is the index, as with df2 here, use:
c = df1.merge(df2, left_on="KEY", right_index=True)
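A minimal sketch of the whole update under the question's setup (df2 indexed by KEY); Series.map is used here instead of the merge, just to show the vectorized multiplication:
import pandas as pd

df1 = pd.DataFrame({'KEY':  [1, 2, 1, 4, 3],
                    'col2': [1, 7, 12, 3, 14],
                    'col3': [10, 13, 15, 23, 5],
                    'col4': [5, 8, 12, 1, 6]})
df2 = pd.DataFrame({'FACTOR': [100, 3000, 1000, 200, 50]},
                   index=pd.Index([1, 2, 3, 4, 5], name='KEY'))

# Look up each row's factor by KEY, then scale both columns at once.
factor = df1['KEY'].map(df2['FACTOR'])
df1[['col2', 'col3']] = df1[['col2', 'col3']].mul(factor, axis=0)
print(df1)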

keep only lowest value per row in a Python Pandas dataset

In a Pandas dataset I only want to keep the lowest value per line. All other values should be deleted.
I need the original dataset intact. Just remove all values (replace by NaN) which are not the minimum.
What is the best way to do this - speed/performance wise.
I can also transpose the dataset if the operation is easier per column.
Thanks
Robert
Since the operation you are contemplating does not rely on the columns or index, it might be easier (and faster) to do this using NumPy rather than Pandas.
You can find the location (i.e. column index) of the minimums for each row using
idx = np.argmin(arr, axis=1)
You could then make a new array filled with NaNs and copy the minimum values
to the new array.
import numpy as np
import pandas as pd
def nan_all_but_min(df):
    arr = df.values
    idx = np.argmin(arr, axis=1)
    newarr = np.full_like(arr, np.nan, dtype='float')
    newarr[np.arange(arr.shape[0]), idx] = arr[np.arange(arr.shape[0]), idx]
    df = pd.DataFrame(newarr, columns=df.columns, index=df.index)
    return df
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
df = nan_all_but_min(df)
print(df)
yields
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
Here is a benchmark comparing nan_all_but_min vs using_where:
def using_where(df):
    return df.where(df.values == df.min(axis=1)[:, None])
In [73]: df = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [74]: %timeit using_where(df)
1000 loops, best of 3: 701 µs per loop
In [75]: %timeit nan_all_but_min(df)
10000 loops, best of 3: 105 µs per loop
Note that using_where and nan_all_but_min behave differently if a row contains the same min value more than once. using_where will preserve all the mins, nan_all_but_min will preserve only one min. For example:
In [76]: using_where(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[76]:
0 1 2
0 0 0 NaN
1 1 NaN 1
In [77]: nan_all_but_min(pd.DataFrame([(0,0,1), (1,2,1)]))
Out[77]:
0 1 2
0 0 NaN NaN
1 1 NaN NaN
Piggybacking off @unutbu's excellent answer, the following minor change should accommodate your modified question.
The where method
In [26]: df2 = df.copy()
In [27]: df2
Out[27]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
In [28]: df2.where(df2.values == df2.min(axis=1)[:,None])
Out[28]:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN NaN
3 9 NaN NaN
Mandatory speed test.
In [29]: df3 = pd.DataFrame(np.random.random(100*100).reshape(100,100))
In [30]: %timeit df3.where(df3.values == df3.min(axis=1)[:,None])
1000 loops, best of 3: 723 µs per loop
If your data frame already contains NaN values, use NumPy's nanmin so that the NaNs are ignored when taking the per-row minimum:
df2.where(df2.values == np.nanmin(df2.values, axis=1)[:, None])
I just found and tried out the answer by unutbu.
The .where approach triggered a deprecation warning for me:
FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version. Convert to a numpy array before indexing instead.
(The warning is about the [:, None] indexing on the Series returned by df.min(axis=1), not about .where itself.) I got the following working instead, though it goes through apply with a lambda and is most likely slower:
df = pd.DataFrame(np.random.random((4,3)))
print(df)
# 0 1 2
# 0 0.542924 0.499702 0.058555
# 1 0.682663 0.162582 0.885756
# 2 0.389789 0.648591 0.513351
# 3 0.629413 0.843302 0.862828
mask = df.apply(lambda d:(d == df.min(axis=1)))
print (df[mask])
Should yield:
0 1 2
0 NaN NaN 0.058555
1 NaN 0.162582 NaN
2 0.389789 NaN NaN
3 0.629413 NaN NaN
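A further pandas-only variant (my own sketch, not from the answers above) avoids the [:, None] indexing that triggers the FutureWarning by broadcasting the row minimum with DataFrame.eq; like the apply version, it keeps every cell tied for the minimum:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 3)))

# Compare each column against the Series of row minima (axis=0 aligns on the index),
# then keep only the matching cells; everything else becomes NaN.
print(df.where(df.eq(df.min(axis=1), axis=0)))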

Updating a DataFrame based on another DataFrame

Given DataFrame df:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
and update:
Id Sex Group Time
0 21 M 2 2.36
1 2 F 2 2.09
2 3 F 1 1.79
I want to match on Id, Sex and Group and either update Time! with the Time value (from the update df) when there is a match, or insert a new record when there isn't.
Here is how I do it:
df = df.set_index(['Id', 'Sex', 'Group'])
update = update.set_index(['Id', 'Sex', 'Group'])
for i, row in update.iterrows():
    if i in df.index:  # update
        df.ix[i, 'Time!'] = row['Time']
    else:  # insert new record
        cols = update.columns.values
        row = np.array(row).reshape(1, len(row))
        _ = pd.DataFrame(row, index=[i], columns=cols)
        df = df.append(_)
print df
Time Time!
Id Sex Group
21 M 2 2.31 2.36
2 F 2 2.29 2.09
3 F 1 1.79 NaN
The code seems to work and matches my desired result above. However, I have noticed it behaving incorrectly on a big data set, with the conditional
if i in df.index:
...
else:
...
working incorrectly (it takes the else branch, and vice versa, where it shouldn't; I guess the MultiIndex may be the cause somehow).
So my question is, do you know any other way, or a more robust version of mine, to update one df based on another df?
I think I would do this with a merge, and then update the columns with a where. First remove the Time column from up:
In [11]: times = up.pop('Time') # up = the update DataFrame
In [12]: df1 = df.merge(up, how='outer')
In [13]: df1
Out[13]:
Id Sex Group Time Time!
0 21 M 2 2.31 NaN
1 2 F 2 2.29 NaN
2 3 F 1 NaN NaN
Fill Time! from the update where Time is already present (an existing row), and fill Time from the update where it is NaN (a new record):
In [14]: df1['Time!'] = df1['Time'].where(df1['Time'].isnull(), times)
In [15]: df1['Time'] = df1['Time'].where(df1['Time'].notnull(), times)
In [16]: df1
Out[16]:
Id Sex Group Time Time!
0 21 M 2 2.31 2.36
1 2 F 2 2.29 2.09
2 3 F 1 1.79 NaN
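A merge-only sketch of the same idea (my own variant, with the two frames rebuilt by hand), in case the pop/where step feels opaque:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [21, 2], 'Sex': ['M', 'F'], 'Group': [2, 2],
                   'Time': [2.31, 2.29], 'Time!': [np.nan, np.nan]})
update = pd.DataFrame({'Id': [21, 2, 3], 'Sex': ['M', 'F', 'F'],
                       'Group': [2, 2, 1], 'Time': [2.36, 2.09, 1.79]})

keys = ['Id', 'Sex', 'Group']
merged = df.merge(update, on=keys, how='outer', suffixes=('', '_new'))

# Existing rows get the new Time in 'Time!'; brand-new rows get it as 'Time'.
existing = merged['Time'].notna()
merged.loc[existing, 'Time!'] = merged.loc[existing, 'Time_new']
merged.loc[~existing, 'Time'] = merged.loc[~existing, 'Time_new']
merged = merged.drop(columns='Time_new')
print(merged)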
