Floats in a csv are automatically rounded in Pandas

When I open a csv file in Python using Pandas, the float values of the last column are rounded. The data should read as:
ColA Timestamp Value
1 06:40:00 33.48
2 06:29:00 38.65
3 07:05:00 56.77
4 06:43:00 80.81
ColA,"Timestamp","Value"
1,"06:40:00","33.48"
2,"06:29:00","38.65"
3,"07:05:00","56.77"
4,"06:43:00","80.81"
Instead I get:
ColA Timestamp Value
1 06:40:00 33
2 06:29:00 39
3 07:05:00 57
4 06:43:00 81
I simply used:
df = pd.read_csv("C://Users//me//Downloads//file.csv", delimiter = ",", float_precision="high")
I tried removing the last parameter, but nothing changes.

You can just use read_csv directly, then explicitly typecast the column you want to float.
df = pd.read_csv("C://Users//me//Downloads//file.csv")
df['Value'] = df['Value'].astype(float)
>>> df
ColA Timestamp Value
0 1 06:40:00 33.48
1 2 06:29:00 38.65
2 3 07:05:00 56.77
3 4 06:43:00 80.81
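If you prefer to handle the conversion at read time instead, read_csv also accepts a dtype mapping; a minimal sketch, reusing the path and column name from the question:
# Ask the parser to treat the Value column as float from the start.
df = pd.read_csv("C://Users//me//Downloads//file.csv", dtype={'Value': float})
Either way, check df.dtypes afterwards to confirm Value is float64.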

Related

Pandas dataframe custom formatting string to time

I have a dataframe that looks like this:
DEP_TIME
0 1851
1 1146
2 2016
3 1350
4 916
...
607341 554
607342 633
607343 657
607344 705
607345 628
I need to get every value in this column DEP_TIME to have the format hh:mm.
All cells are of type string and can remain that type.
Some cells are only missing the colon (rows 0 to 3), others are also missing the leading 0 (rows 4+).
Some cells are empty and should ideally get the string value '0'.
I need to do it in an efficient way since I have a few million records. How do I do it?
Use to_datetime with Series.dt.strftime:
df['DEP_TIME'] = (pd.to_datetime(df['DEP_TIME'], format='%H%M', errors='coerce')
                    .dt.strftime('%H:%M')
                    .fillna('00:00'))
print (df)
DEP_TIME
0 18:51
1 11:46
2 20:16
3 13:50
4 09:16
...
607341 05:54
607342 06:33
607343 06:57
607344 07:05
607345 06:28
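Here errors='coerce' turns empty or unparseable cells into NaT, Series.dt.strftime then yields NaN for those rows, and the final fillna('00:00') replaces them. Note this fills missing values with '00:00' rather than the '0' the question asks for, so adjust the fill value if that distinction matters.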
An alternative with a regular expression:
import re
import numpy as np
import pandas as pd

d = [['1851'],
     ['1146'],
     ['2016'],
     ['916'],
     ['814'],
     [''],
     [np.nan]]
df = pd.DataFrame(d, columns=['DEP_TIME'])

# Replace NaN with '0', then zero-pad the hour and insert the colon.
df['DEP_TIME'] = df['DEP_TIME'].fillna('0')
df['DEP_TIME'] = df['DEP_TIME'].apply(
    lambda y: '0' if y == '' else re.sub(r'(\d{1,2})(\d{2})$',
                                         lambda x: x[1].zfill(2) + ':' + x[2], y))
df
DEP_TIME
0 18:51
1 11:46
2 20:16
3 09:16
4 08:14
5 0
6 0
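Since the question stresses efficiency over a few million records, a fully vectorized string approach may be faster than apply. A minimal sketch, assuming the valid cells are 3- or 4-digit strings and numpy is imported as np:
s = df['DEP_TIME'].fillna('').astype(str)
padded = s.str.zfill(4)
# Rows that are not 3-4 digits (empty or malformed) become '0'.
df['DEP_TIME'] = np.where(s.str.fullmatch(r'\d{3,4}'),
                          padded.str[:2] + ':' + padded.str[2:],
                          '0')
Series.str.fullmatch requires pandas 1.1+; on older versions use s.str.match(r'\d{3,4}$') instead.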

Preserving id columns in dataframe after applying assign and groupby

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns = 'gestationalAgeInWeeks')  # don't need this
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']  # select the one column
   .max()
   .unstack()
   .add_prefix('abdomCirc_')  # prefix the new columns
   .reset_index()  # bring the id columns back
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                values='abdomCirc', aggfunc='max')
   .add_prefix('abdomCirc_')  # remove this if you don't want the prefix
   .reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 0 0 NaN 200.0 NaN
1 1 1 NaN 315.0 350.0
2 2 2 180.0 NaN NaN
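With the ids restored as ordinary columns, the merge the question is ultimately after works directly. A small sketch, where result is the dataframe produced by either chain above and other_df is a hypothetical second dataframe keyed on the same ids:
merged = result.merge(other_df, on=['MotherID', 'PregnancyID'])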

How to delete rows having bad error lines and read the remaining csv file using pandas or numpy?

I am not able to read my dataset.csv file due to the following ParserError:
Error tokenizing data. C error: Expected 1 fields in line 8, saw 4
The CSV file is generated through another program.
Basically, I want to skip the text rows, which repeat at regular intervals, and keep only the integer and float values in my dataset.
I tried this:
df = pd.read_csv('Dataset.csv')
I also tried the following, but I am only getting the bad lines as output; I want to skip all these bad lines and keep only the remaining values in my dataset.
df = pd.read_csv('Dataset.csv', error_bad_lines=False, engine='python')
Dataset:
The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06
Expected dataset:
*Even the blank rows may be deleted if possible.
The 1st column should be set as the index, and the final dataset must contain the 1st and 3rd columns only, as shown. The column label must be set as '1'.
You can pass the names parameter to read_csv for the new column names; the text rows then parse with missing values, so remove them with DataFrame.dropna:
import pandas as pd
from io import StringIO
temp="""The pch2csv utility program
This file contains the pch2csv
$TITLE =
$SUBTITLE=
$LABEL = FX
1,0.000000E+00,3.792830E-06,-1.063093E-06
2,0.000000E+00,-1.441319E-06,4.711234E-06
3,0.000000E+00,2.950290E-06,-5.669502E-07
4,0.000000E+00,3.706791E-06,-1.094726E-06
5,0.000000E+00,3.689831E-06,-1.107476E-06
$TITLE =
$SUBTITLE=
$LABEL = FY
1,0.000000E+00,-5.878803E-06,1.127179E-06
2,0.000000E+00,2.782207E-06,-8.840886E-06
3,0.000000E+00,-1.574296E-06,3.867732E-07
4,0.000000E+00,-6.227912E-06,1.864081E-06
5,0.000000E+00,-3.113227E-05,9.339538E-06"""
#after testing, replace StringIO(temp) with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
                 error_bad_lines=False,
                 engine='python',
                 names=['a','b','c','d'])
df = df.dropna(subset=['b','c','d'])
print (df)
a b c d
0 1 0.0 0.000004 -1.063093e-06
1 2 0.0 -0.000001 4.711234e-06
2 3 0.0 0.000003 -5.669502e-07
3 4 0.0 0.000004 -1.094726e-06
4 5 0.0 0.000004 -1.107476e-06
8 1 0.0 -0.000006 1.127179e-06
9 2 0.0 0.000003 -8.840886e-06
10 3 0.0 -0.000002 3.867732e-07
11 4 0.0 -0.000006 1.864081e-06
12 5 0.0 -0.000031 9.339538e-06
EDIT:
To set the first column as the index and use different column names:
#after testing, replace StringIO(temp) with 'Dataset.csv'
df = pd.read_csv(StringIO(temp),
                 error_bad_lines=False,
                 engine='python',
                 index_col=[0],
                 names=['idx','col1','col2','col3'])
#check all columns; the first column was set as the index, so it is not tested
df = df.dropna()
#if you need to drop only rows where all values are NaN
#df = df.dropna(how='all')
print (df)
col1 col2 col3
idx
1 0.0 0.000004 -1.063093e-06
2 0.0 -0.000001 4.711234e-06
3 0.0 0.000003 -5.669502e-07
4 0.0 0.000004 -1.094726e-06
5 0.0 0.000004 -1.107476e-06
1 0.0 -0.000006 1.127179e-06
2 0.0 0.000003 -8.840886e-06
3 0.0 -0.000002 3.867732e-07
4 0.0 -0.000006 1.864081e-06
5 0.0 -0.000031 9.339538e-06
EDIT1:
If you need to remove columns filled only with 0:
df = df.loc[:, df.ne(0).any()]
print (df)
col2 col3
idx
1 0.000004 -1.063093e-06
2 -0.000001 4.711234e-06
3 0.000003 -5.669502e-07
4 0.000004 -1.094726e-06
5 0.000004 -1.107476e-06
1 -0.000006 1.127179e-06
2 0.000003 -8.840886e-06
3 -0.000002 3.867732e-07
4 -0.000006 1.864081e-06
5 -0.000031 9.339538e-06
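Note that error_bad_lines has since been deprecated; in pandas 1.3+ the equivalent is on_bad_lines='skip'. If you would rather not rely on the parser's bad-line handling at all, you can also pre-filter the file down to the numeric rows before parsing. A minimal sketch, assuming the data rows are exactly those that start with a digit:
import re
import pandas as pd
from io import StringIO

# Keep only lines whose first non-space character is a digit.
with open('Dataset.csv') as f:
    numeric_lines = [line for line in f if re.match(r'\s*\d', line)]

df = pd.read_csv(StringIO(''.join(numeric_lines)), header=None,
                 index_col=0, names=['idx', 'col1', 'col2', 'col3'])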

Forward filling missing dates into Python Pandas Dataframe

I have a Pandas dataframe that is filled as follows:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
8/31/2010 1
9/30/2010 4
12/31/2010 2
Note how there are missing months (i.e. 7, 10, 11) in the data. I want to fill in the missing data through a forward filling method so that it looks like this:
ref_date tag
1/29/2010 1
2/26/2010 3
3/31/2010 4
4/30/2010 4
5/31/2010 1
6/30/2010 3
7/30/2010 3
8/31/2010 1
9/30/2010 4
10/29/2010 4
11/30/2010 4
12/31/2010 2
Each missing date should take the tag of the previous date. All dates represent the last business day of the month.
This is what I tried to do:
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df.ref_date.index = pd.to_datetime(df.ref_date.index)
df = df.reindex(index=[idx], columns=[ref_date], method='ffill')
It's giving me the error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
where pd is pandas and df is the dataframe.
I'm new to Pandas Dataframe, so any help would be appreciated!
You were very close. You just need to set the dataframe's index to ref_date, reindex it to the business-month-end index while specifying ffill as the method, then reset the index and rename back to the original:
# First ensure the dates are Pandas Timestamps.
df['ref_date'] = pd.to_datetime(df['ref_date'])
# Create a monthly index.
idx_monthly = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
# Reindex to the monthly index, forward filling the missing months.
>>> (df
     .set_index('ref_date')
     .reindex(idx_monthly, method='ffill')
     .reset_index()
     .rename(columns={'index': 'ref_date'}))
ref_date tag
0 2010-01-29 1.0
1 2010-02-26 3.0
2 2010-03-31 4.0
3 2010-04-30 4.0
4 2010-05-31 1.0
5 2010-06-30 3.0
6 2010-07-30 3.0
7 2010-08-31 1.0
8 2010-09-30 4.0
9 2010-10-29 4.0
10 2010-11-30 4.0
11 2010-12-31 2.0
Thanks to the previous person who answered this question but then deleted their answer, I got the solution:
df['ref_date'] = pd.to_datetime(df['ref_date'])
idx = pd.date_range(start='1/29/2010', end='12/31/2010', freq='BM')
df = df.set_index('ref_date').reindex(idx).ffill().reset_index().rename(columns={'index': 'ref_date'})
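For this data there is also a shorter route with resample, since every date is a business month end; a minimal sketch, assuming ref_date has already been parsed with pd.to_datetime:
# Resample to business-month-end frequency and forward fill the gaps.
df = (df.set_index('ref_date')
        .resample('BM')
        .ffill()
        .reset_index())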

Weighted Means for columns in Pandas DataFrame including Nan

I am trying to get the weighted mean for each column (A-F) of a pandas DataFrame, with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for normal means would be
df.mean()
Notice the df has NaN both in columns A-F and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays, after dropping the rows where the Value column itself has NaN:
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
              np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
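For comparison, an equivalent pandas-only sketch that computes each column's weighted mean while skipping NaNs; note the weights are re-summed per column over only the non-NaN rows:
dff = df.dropna(subset=['Value'])
# (col * weights).sum() skips NaN products; the denominator uses only rows where col is present.
weighted = dff.drop(columns='Value').apply(
    lambda col: (col * dff['Value']).sum() / dff['Value'][col.notna()].sum())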
