.dropna() increases memory usage - python

First I import the whole file and get a memory consumption of 1002.0+ KB
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None
Then I drop the NaN rows, run the script again, and get a memory consumption of 1.2+ MB:
df = pd.read_csv(
filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None
Since I'm dropping one row, I would expect memory consumption to go down or at least stay the same, not go up.
Does anybody know why this is happening, how to fix it, or whether this is a bug?
EDIT: chicago.csv

The change comes from the fact that your index changed from a RangeIndex to an Int64Index, which takes more memory.
You can "fix" this by resetting the index after the dropna(), but this will have the side effect of changing the row index (which you may not care about).
Here is an illustrative example:
First create a sample DataFrame:
df = pd.DataFrame({"a": range(10000)})
df.loc[1000, "a"] = None
Print the info:
print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
Drop the na values:
print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB
Reset (and drop) the index:
df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
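To see where the extra memory goes, the index can be inspected on its own. The following is a minimal sketch, assuming a recent pandas where the integer index is reported as a plain Index with dtype int64 rather than Int64Index; the variable names are illustrative only. A RangeIndex only stores start, stop and step, while an integer index stores every label as an 8-byte value.

```python
import numpy as np
import pandas as pd

# Same shape as the example above, built as float64 up front so that
# inserting a NaN does not change the column dtype.
df = pd.DataFrame({"a": np.arange(10000, dtype="float64")})
df.loc[1000, "a"] = np.nan

dropped = df.dropna()                    # index keeps the surviving labels
reset = dropped.reset_index(drop=True)   # index is rebuilt as 0..n-1

for label, frame in [("original", df),
                     ("after dropna", dropped),
                     ("after reset_index", reset)]:
    idx = frame.index
    print(f"{label}: {type(idx).__name__}, index uses {idx.memory_usage()} bytes")
```

The column data is the same size in all three frames; only the index storage differs.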

This is not a bug; it is working as intended. Loading the file takes the same amount of memory as before, but dropna() returns a new DataFrame whose index stores the label of every remaining row explicitly, so the reported memory usage goes up even though there is one row fewer.

Related

How to parse a date column as datetimes, not objects in Pandas?

I'd like to create a DataFrame from a csv with one datetime-typed column.
Following the article, this code should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried several ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column as datetime type (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format = True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
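One situation where utc=True helps is a column whose values carry different UTC offsets. Here is a minimal sketch of that case; the inline CSV text and its values are made up for illustration, and whether this matches the asker's file is an assumption.

```python
import io
import pandas as pd

# Hypothetical sample: two timestamps with different UTC offsets.
csv_text = (
    "name,published_at\n"
    "a,2021-01-01T10:00:00+03:00\n"
    "b,2021-01-02T11:00:00+07:00\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Convert after loading; utc=True normalizes every value to UTC so the
# column gets a single timezone-aware datetime dtype.
df["published_at"] = pd.to_datetime(df["published_at"], utc=True)

print(df.dtypes)
```

If some values cannot be parsed at all, passing errors="coerce" to pd.to_datetime turns them into NaT instead of raising.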

Pandas dataframe adding zero-padding before the datetime

I'm using pandas and I have a DataFrame df like the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no zero padding before, and I want to convert it to 05:13:40. So my expected df would be like:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this problem? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
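If the goal is a zero-padded string rather than a timedelta column (newer pandas versions may display timedeltas as "0 days 05:13:40"), the conversion can be followed by explicit formatting. This is a minimal sketch; the helper format_hms and the column time_str are hypothetical names introduced here, and it assumes all values are non-negative.

```python
import pandas as pd

df = pd.DataFrame({"time": ["5:13:40", "16:20:59"], "id": [1, 2]})

# Parse the strings into timedeltas first.
td = pd.to_timedelta(df["time"])

def format_hms(value):
    """Format a non-negative Timedelta as zero-padded HH:MM:SS."""
    total = int(value.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

df["time_str"] = td.map(format_hms)
print(df)
```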

Wrong Data Type while Reading a large Text File

I'm trying to read the following file using pandas. The code that I'm using is the following:
df = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';', nrows=5)
The df.info() is giving the correct output.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 9 columns):
Date 5 non-null object
Time 5 non-null object
Global_active_power 5 non-null float64
Global_reactive_power 5 non-null float64
Voltage 5 non-null float64
Global_intensity 5 non-null float64
Sub_metering_1 5 non-null float64
Sub_metering_2 5 non-null float64
Sub_metering_3 5 non-null float64
dtypes: float64(7), object(2)
memory usage: 440.0+ bytes
But when I try to read the entire data set using the same code without nrows:
df_all = pd.read_csv("household_power_consumption.txt", header=0, delimiter=';')
the column types become object.
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2075259 entries, 2006-12-16 17:24:00 to 2010-11-26 21:02:00
Data columns (total 7 columns):
Global_active_power object
Global_reactive_power object
Voltage object
Global_intensity object
Sub_metering_1 object
Sub_metering_2 object
Sub_metering_3 float64
dtypes: float64(1), object(6)
memory usage: 126.7+ MB
Can anyone please tell me why this is happening? And how to resolve it?
Thanks!
My guess would be that when you read the full data set, some of the additional rows contain values that cannot be parsed as numbers (for example, a text placeholder for missing data), so pandas falls back to the object dtype for those columns. You can specify the data types explicitly using the dtype argument of read_csv; see the read_csv documentation.
Alternatively you could try to force the data types after loading the data; e.g. like so:
df["Global_active_power"] = df["Global_active_power"].astype(float)
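A plain astype(float) raises if the column contains strings that are not numbers. Below is a minimal sketch of two ways around that, assuming the offending value is a "?" placeholder; that placeholder and the tiny inline sample are assumptions made for illustration, not something confirmed by the question.

```python
import io
import pandas as pd

# Hypothetical sample with one non-numeric placeholder in Voltage.
raw = (
    "Date;Time;Voltage\n"
    "16/12/2006;17:24:00;234.84\n"
    "16/12/2006;17:25:00;?\n"
)

# Read as-is: the placeholder keeps Voltage from being parsed as numeric.
as_read = pd.read_csv(io.StringIO(raw), delimiter=";")

# Option 1: declare the placeholder as missing while reading.
fixed = pd.read_csv(io.StringIO(raw), delimiter=";", na_values=["?"])

# Option 2: convert after loading, turning unparseable values into NaN.
coerced = pd.to_numeric(as_read["Voltage"], errors="coerce")

print(as_read.dtypes["Voltage"], fixed.dtypes["Voltage"], coerced.dtype)
```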

Python Pandas: Got an Error from trying to bin data

Python is returning this error message. Would anyone happen to know how to fix it?
TypeError: putmask() argument 1 must be numpy.ndarray, not numpy.int64
I got this error when trying to use pd.cut to classify my data into different bins, as seen below:
df['Credit Score'].fillna(0, inplace=True)
df['Credit Score'] = df['Credit Score'].astype(int)
bins = [0, 500, 550, 600, 650]
Summary_Scores = df.groupby(pd.cut('Credit Score', bins))
The data is being read from a csv file:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5729 entries, 0 to 252032
Data columns (total 5 columns):
Loan # 5729 non-null int64
Amount 5729 non-null float64
Issue Date 5729 non-null datetime64[ns]
Purposes 5661 non-null object
Credit Score 5729 non-null int32
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1), object(1)
memory usage: 246.2+ KB
None
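A likely cause, judging only from the snippet above, is that pd.cut receives the string 'Credit Score' instead of the column itself. The following is a minimal sketch of the binning with the column passed in; the sample rows and the choice to sum Amount are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the csv.
df = pd.DataFrame({
    "Credit Score": [480.0, 510.0, None, 560.0, 640.0],
    "Amount": [1000.0, 2000.0, 1500.0, 3000.0, 2500.0],
})

df["Credit Score"] = df["Credit Score"].fillna(0).astype(int)
bins = [0, 500, 550, 600, 650]

# Pass the Series, not the column name, to pd.cut.
score_bins = pd.cut(df["Credit Score"], bins)

# Rows whose score falls outside every bin (here the filled-in 0) get a
# missing bin label and are left out of the groups.
summary = df.groupby(score_bins, observed=False)["Amount"].sum()
print(summary)
```

Because the bins are open on the left by default, a score of exactly 0 is not in the first bin; passing include_lowest=True to pd.cut would include it.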

sum columns in dataframe with pandas

I have a dataframe df_F1
df_F1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 7 columns):
class_energy 2 non-null object
ACT_TIME_AERATEUR_1_F1 2 non-null float64
ACT_TIME_AERATEUR_1_F3 2 non-null float64
dtypes: float64(6), object(1)
memory usage: 128.0+ bytes
df_F1.head()
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550 431
medium 856.666667 856
I am trying to create a dataframe Ratio which contains, for each class_energy, the energy value of each ACT_TIME_AERATEUR_1_Fx column divided by the sum of that column over all class_energy values. For example:
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550/(5.875550 + 856.666667) 431/(431+856)
medium 856.666667/(5.875550+856.666667) 856/(431+856)
Can you please help me resolve this?
Thank you in advance.
Best regards
You can do this:
In [20]: df.set_index('class_energy').apply(lambda x: x/x.sum()).reset_index()
Out[20]:
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
0 low 0.006812 0.334887
1 medium 0.993188 0.665113
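The same ratios can also be computed without apply, by dividing the frame by its column sums directly. This is a self-contained sketch that rebuilds the two columns shown in the question; the name values is introduced here for illustration.

```python
import pandas as pd

df_F1 = pd.DataFrame({
    "class_energy": ["low", "medium"],
    "ACT_TIME_AERATEUR_1_F1": [5.875550, 856.666667],
    "ACT_TIME_AERATEUR_1_F3": [431.0, 856.0],
})

# Move the label column into the index so only numeric columns remain.
values = df_F1.set_index("class_energy")

# values.sum() is a Series of per-column totals; the division aligns it
# with the columns, so each cell is divided by its own column's total.
Ratio = (values / values.sum()).reset_index()
print(Ratio)
```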
