sum columns in dataframe with pandas - python

I have a dataframe df_F1
df_F1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 7 columns):
class_energy 2 non-null object
ACT_TIME_AERATEUR_1_F1 2 non-null float64
ACT_TIME_AERATEUR_1_F3 2 non-null float64
dtypes: float64(6), object(1)
memory usage: 128.0+ bytes
df_F1.head()
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550 431
medium 856.666667 856
I am trying to create a dataframe Ratio which contains, for each class_energy, the energy value of each ACT_TIME_AERATEUR_1_Fx divided by the sum of energy over all class_energy values for that ACT_TIME_AERATEUR_1_Fx. For example:
ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
low 5.875550/(5.875550 + 856.666667) 431/(431+856)
medium 856.666667/(5.875550+856.666667) 856/(431+856)
Can you please help me resolve this?
Thank you in advance.
Best regards

You can do this:
In [20]: df.set_index('class_energy').apply(lambda x: x/x.sum()).reset_index()
Out[20]:
class_energy ACT_TIME_AERATEUR_1_F1 ACT_TIME_AERATEUR_1_F3
0 low 0.006812 0.334887
1 medium 0.993188 0.665113
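Equivalently, you can let pandas broadcast the column sums with div instead of apply. A minimal sketch, assuming the same df as above:
ratio = df.set_index('class_energy')
ratio = ratio.div(ratio.sum()).reset_index()  # each column divided by its own sum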

Related

How to parse a date column as datetimes, not objects in Pandas?

I'd like to create a DataFrame from a csv with one datetime-typed column.
Following the article, the code should create the needed DataFrame:
df = pd.read_csv('data/data_3.csv', parse_dates=['date'])
df.info()
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 3 non-null datetime64[ns]
1 product 3 non-null object
2 price 3 non-null int64
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 200.0+ bytes
But when I do exactly the same steps, I get an object-typed date column:
df = pd.read_csv(path, parse_dates=['published_at'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 100000 non-null object
1 salary_from 48041 non-null float64
2 salary_to 53029 non-null float64
3 salary_currency 64733 non-null object
4 area_name 100000 non-null object
5 published_at 100000 non-null object
dtypes: float64(2), object(4)
memory usage: 4.6+ MB
I have tried a couple of different ways to parse the datetime column and still can't get a DataFrame with a datetime dtype. So how do I parse a column with datetime type (not object)?
When loading the csv, have you tried:
df = pd.read_csv(path, parse_dates=['published_at'], infer_datetime_format=True)
And/or when converting to datetime:
pd.to_datetime(df.published_at, utc=True)
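Note that read_csv silently falls back to object dtype when some values in a parse_dates column fail to parse; mixed formats or timezone offsets are common causes. A sketch that converts explicitly and surfaces the rows that fail, assuming the same path and column as above:
df = pd.read_csv(path)
# errors='coerce' turns unparseable values into NaT instead of raising
df['published_at'] = pd.to_datetime(df['published_at'], utc=True, errors='coerce')
print(df[df['published_at'].isna()])  # inspect the rows that failed to parse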

.dropna() increases memory usage

First I import the whole file and get a memory consumption of 1002.0+ KB:
df = pd.read_csv(
    filepath_or_buffer="./dataset/chicago.csv"
)
print(df.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 32063 entries, 0 to 32062
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1002.0+ KB
# None
Then I drop the NaN rows, run the script again, and get a memory consumption of 1.2+ MB:
df = pd.read_csv(
    filepath_or_buffer="./dataset/chicago.csv"
).dropna(how="all")
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 32062 entries, 0 to 32061
# Data columns (total 4 columns):
# Name 32062 non-null object
# Position Title 32062 non-null object
# Department 32062 non-null object
# Employee Annual Salary 32062 non-null object
# dtypes: object(4)
# memory usage: 1.2+ MB
# None
Since I'm only dropping one row, I would expect memory consumption to go down, or at least remain the same, but not this.
Does anybody know why this is happening, how to fix it, or whether this is a bug?
EDIT: chicago.csv
The change comes from the fact that your index changed from a RangeIndex to an Int64Index, which takes more memory.
You can "fix" this by resetting the index after the dropna(), but this will have the side effect of changing the row index (which you may not care about).
Here is an illustrative example:
First create a sample DataFrame:
df = pd.DataFrame({"a": range(10000)})
df.loc[1000, "a"] = None
Print the info:
print(df.info())
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 10000 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
Drop the na values:
print(df.dropna().info())
#<class 'pandas.core.frame.DataFrame'>
#Int64Index: 9999 entries, 0 to 9999
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 156.2 KB
Reset (and drop) the index:
df.dropna().reset_index(drop=True).info()
#<class 'pandas.core.frame.DataFrame'>
#RangeIndex: 9999 entries, 0 to 9998
#Data columns (total 1 columns):
#a 9999 non-null float64
#dtypes: float64(1)
#memory usage: 78.2 KB
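To confirm that the index alone accounts for the difference, you can measure it directly. A quick sketch using Index.memory_usage:
print(df.index.memory_usage())           # RangeIndex: a small constant (start/stop/step)
print(df.dropna().index.memory_usage())  # Int64Index: 8 bytes per remaining label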
This is not a bug; it is working as intended. You still load the whole file, so the data takes the same amount of memory as before, but dropna() replaces the lightweight RangeIndex with an explicit Int64Index that stores every remaining row label, which adds memory usage.

Pandas dataframe adding zero-padding before the datetime

I'm using a Pandas DataFrame, and I have a DataFrame df like the following:
time id
-------------
5:13:40 1
16:20:59 2
...
For the first row, the time 5:13:40 has no zero padding in front, and I want to convert it to 05:13:40. So my expected df would be:
time id
-------------
05:13:40 1
16:20:59 2
...
The type of time is <class 'datetime.timedelta'>. Could anyone give me some hints on how to handle this problem? Thanks so much!
Use pd.to_timedelta:
df['time'] = pd.to_timedelta(df['time'])
Before:
print(df)
time id
1 5:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null object
id 2 non-null float64
dtypes: float64(1), object(1)
memory usage: 48.0+ bytes
After:
print(df)
time id
1 05:13:40 1.0
2 16:20:59 2.0
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 1 to 2
Data columns (total 2 columns):
time 2 non-null timedelta64[ns]
id 2 non-null float64
dtypes: float64(1), timedelta64[ns](1)
memory usage: 48.0 bytes
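If all you need is the zero-padded string representation rather than a timedelta dtype, str.zfill on the original object column also works. A sketch, assuming every value is an H:MM:SS or HH:MM:SS string:
df['time'] = df['time'].str.zfill(8)  # pads '5:13:40' to '05:13:40'; 8-character values are unchanged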

Python Pandas: Got an Error from trying to bin data

Python is returning this error message. Would anyone happen to know how to fix it?
TypeError: putmask() argument 1 must be numpy.ndarray, not numpy.int64
I got this error when trying to use pd.cut to classify my data into different bins, as seen below:
df['Credit Score'].fillna(0, inplace=True)
df['Credit Score'] = df['Credit Score'].astype(int)
bins = [0, 500, 550, 600, 650]
Summary_Scores = df.groupby(pd.cut('Credit Score', bins))
The data is being read from a csv file:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5729 entries, 0 to 252032
Data columns (total 5 columns):
Loan # 5729 non-null int64
Amount 5729 non-null float64
Issue Date 5729 non-null datetime64[ns]
Purposes 5661 non-null object
Credit Score 5729 non-null int32
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1), object(1)
memory usage: 246.2+ KB
None
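The likely cause is visible in the snippet above: pd.cut is given the column name 'Credit Score' as a string rather than the data itself. pd.cut needs the Series, so a minimal sketch of the fix would be:
Summary_Scores = df.groupby(pd.cut(df['Credit Score'], bins))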

Using set_index within a custom function

I would like to convert the date observations from a column into the index for my dataframe. I am able to do this with the code below:
Sample data:
test = pd.DataFrame({'Values':[1,2,3], 'Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
Indexing code:
test['Date Index'] = pd.to_datetime(test['Date'])
test = test.set_index('Date Index')
test['Index'] = test.index.date
However, when I try to include this code in a function, I am able to create the 'Date Index' column, but set_index does not seem to work as expected.
def date_index(df):
    df['Date Index'] = pd.to_datetime(df['Date'])
    df = df.set_index('Date Index')
    df['Index'] = df.index.date
If I inspect the output without using a function, info() returns:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
If I inspect the output of the function, info() returns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
Date 3 non-null object
Values 3 non-null int64
dtypes: int64(1), object(1)
memory usage: 120.0+ bytes
I would like the DatetimeIndex.
How can set_index be used within a function? Am I using it incorrectly?
IIUC, return df is missing. Inside the function, df = df.set_index(...) rebinds only the local name df, so the caller never sees the new index unless the modified frame is returned:
df1 = pd.DataFrame({'Values':[1,2,3], 'Exam Completed Date':["1/1/2016 17:49","1/2/2016 7:10","1/3/2016 15:19"]})
def date_index(df):
    df['Exam Completed Date Index'] = pd.to_datetime(df['Exam Completed Date'])
    df = df.set_index('Exam Completed Date Index')
    df['Index'] = df.index.date
    return df
print (date_index(df1))
Exam Completed Date Values Index
Exam Completed Date Index
2016-01-01 17:49:00 1/1/2016 17:49 1 2016-01-01
2016-01-02 07:10:00 1/2/2016 7:10 2 2016-01-02
2016-01-03 15:19:00 1/3/2016 15:19 3 2016-01-03
print (date_index(df1).info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3 entries, 2016-01-01 17:49:00 to 2016-01-03 15:19:00
Data columns (total 3 columns):
Exam Completed Date 3 non-null object
Values 3 non-null int64
Index 3 non-null object
dtypes: int64(1), object(2)
memory usage: 96.0+ bytes
None
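If you would rather mutate the caller's frame than return a new one, set_index also accepts inplace=True. A sketch, where date_index_inplace is a hypothetical name:
def date_index_inplace(df):
    df['Date Index'] = pd.to_datetime(df['Date'])
    df.set_index('Date Index', inplace=True)  # modifies the passed frame, no rebinding
    df['Index'] = df.index.date
After calling date_index_inplace(test), test itself carries the DatetimeIndex, so no assignment of a return value is needed.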
