index_col parameter in datatable fread function - python

I want to accomplish the same result as:
import pandas as pd
pd.read_csv("data.csv", index_col=0)
But faster, using the datatable library in Python and then converting the result to a pandas DataFrame. This is what I am currently doing:
import datatable as dt
datatable = dt.fread("data.csv")
dataframe = datatable.to_pandas().set_index('C0')
Is there any faster way to do it?
I would like to have a parameter that lets me use a column as the row labels of the DataTable, like index_col=0 in pandas.read_csv(). Also, why does datatable.fread create a 'C0' column?
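One possible explanation for the extra column, sketched with pandas only: if data.csv was written by DataFrame.to_csv with its default index=True (an assumption, since the question does not say how the file was produced), the first header field is empty, so a reader has to either invent a name for that column or treat it as the index. The sample frame below is hypothetical.

```python
import io
import pandas as pd

# Hypothetical sample data; the real data.csv is not shown in the question.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# to_csv writes the index by default, under an empty header field.
buf = io.StringIO()
df.to_csv(buf)
csv_text = buf.getvalue()
header = csv_text.splitlines()[0]

# Without index_col, pandas invents a name for the unnamed first column.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the same column becomes the row labels instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Here pandas falls back to the name 'Unnamed: 0'. A 'C0' name coming from fread would be consistent with the same situation, though that is an inference rather than something the question confirms.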

Related

Mix of columns in Excel to one column with Pandas

I have to import this Excel file in code, and I would like to collapse the multi-index header into a single level of column names. I would like to drop the unnamed columns and merge everything into one header. I don't know if it's possible.
I have tried the following. It imports, but the output is not as expected:
import pandas as pd
import numpy as np
macro = pd.read_excel(nameExcel, sheet_name=nameSheet, skiprows=3, header=[1,3,4])
macro = macro[macro.columns[1:]]
macro
One way to solve it is to assign a new header with the same length as the existing one:
cols = [...]
if len(df1.columns) == len(cols):
    df1.columns = cols
else:
    print("error")
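As a rough sketch of how such a replacement header could be derived instead of typed by hand: the example below joins the levels of each column tuple and skips the "Unnamed: ..." placeholders that pandas generates for empty header cells. The column labels and the helper flatten are hypothetical, since the actual spreadsheet is not shown.

```python
import pandas as pd

# Hypothetical three-level header, similar to what header=[1, 3, 4] can produce.
columns = pd.MultiIndex.from_tuples([
    ("GDP", "Unnamed: 1_level_1", "2020"),
    ("GDP", "Real", "2021"),
    ("Unnamed: 2_level_0", "Unnamed: 2_level_1", "Rate"),
])
macro = pd.DataFrame([[1.0, 2.0, 3.0]], columns=columns)


def flatten(col):
    """Join the header levels of one column, skipping 'Unnamed' placeholders."""
    parts = [str(p) for p in col if not str(p).startswith("Unnamed")]
    return "_".join(parts)


# Replace the MultiIndex with a flat list of names of the same length.
macro.columns = [flatten(col) for col in macro.columns]
```

After this, macro.columns is a flat Index, so the length check in the answer above is satisfied by construction.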

Latex formula in Pandas DataFrame

I want to make a Pandas DataFrame in which some columns are of Latex formula (IPython.core.display.Latex) type. When I display the DataFrame in Jupyter Notebook, the formulae are not displayed, instead I see only the type name. Is there any way to show the formulae when they are some elements of a Pandas DF?
import pandas as pd
from IPython.display import Latex
my_dict = {'Case1':{'Formula1':Latex('$$x^{-1}$$'), 'Formula2':Latex('$$x^2$$')},
'Case2':{'Formula1':Latex('$$x^{-2}$$'), 'Formula2':Latex('$$x^4$$')}}
df = pd.DataFrame(my_dict)
display(df.transpose())
Interestingly, if you output only one field of the dataframe it works:
df['Case1']['Formula1']
x^{-1} (rendered as a formula)
However, Latex objects are not rendered when they are elements of a displayed DataFrame. It is enough to store the LaTeX source as plain strings between $$ delimiters:
my_dict = {'Case1':{'Formula1':'$$x^{-1}$$', 'Formula2':'$$x^2$$'},
'Case2':{'Formula1':'$$x^{-2}$$', 'Formula2':'$$x^4$$'}}
df = pd.DataFrame(my_dict)
display(df.transpose())

How to export cleaned data from a jupyter notebook, not the original data

I have just started to learn to use Jupyter notebook. I have a data file called 'Diseases'.
Opening data file
import pandas as pd
df = pd.read_csv('Diseases.csv')
Choosing data from a column named 'DIABETES', i.e. choosing the subject IDs that have diabetes (yes is 1 and no is 0):
df[df.DIABETES == 1]
Now I want to export this cleaned data (that has fewer rows)
df.to_csv('diabetes-filtered.csv')
This exports the original data file, not the filtered df with fewer rows.
I saw in another question that the inplace argument needs to be used. But I don't know how.
You forgot to assign the filtered DataFrame to a variable, here df1:
import pandas as pd
df = pd.read_csv('Diseases.csv')
# 1 means "yes" in the DIABETES column
df1 = df[df.DIABETES == 1]
df1.to_csv('diabetes-filtered.csv')
Or you can chain the filtering and the export:
import pandas as pd
df = pd.read_csv('Diseases.csv')
df[df.DIABETES == 1].to_csv('diabetes-filtered.csv')
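A small self-contained check of the point made here: boolean indexing returns a new DataFrame and leaves the original untouched, so no inplace argument is involved. The sample rows are made up, and the output goes to an in-memory buffer instead of a file so the example has no side effects.

```python
import io
import pandas as pd

# Hypothetical stand-in for Diseases.csv; 1 means the subject has diabetes.
df = pd.DataFrame({"ID": [1, 2, 3], "DIABETES": [1, 0, 1]})

# Filtering builds a new DataFrame; df itself still has all three rows.
filtered = df[df["DIABETES"] == 1]

# Export the filtered frame (index=False drops the row labels from the file).
buf = io.StringIO()
filtered.to_csv(buf, index=False)
exported_lines = buf.getvalue().splitlines()
```

Passing a file name such as 'diabetes-filtered.csv' in place of buf writes the same rows to disk.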

Pandas Read_Excel Datetime Converter

Using Python 3.6 and Pandas 0.19.2: How do you read in an Excel file and change a column to datetime straight from read_excel? This is similar to another question about converters and dtypes, but I want to read in a certain column as datetime.
I want to change this:
import pandas as pd
import datetime
import numpy as np
file = 'PATH_HERE'
df1 = pd.read_excel(file)
df1['COLUMN'] = pd.to_datetime(df1['COLUMN']) # <--- Line to get rid of
into something like:
df1 = pd.read_excel(file, dtypes= {'COLUMN': datetime})
The code does not error, but in my example, COLUMN is still a dtype of int64 after calling print(df1['COLUMN'].dtype)
I have tried using np.datetime64 instead of datetime. I have also tried using converters= instead of dtypes=, but to no avail. This may be nitpicky, but it would be a nice feature to have in my code.
Typically, reading Excel sheets uses the dtypes defined in the sheet itself, and you cannot specify the dtypes as you can in read_csv, for example. You can, however, pass a converters argument: a dict that maps each column to the function used to convert it:
df1 = pd.read_excel(file, converters= {'COLUMN': pd.to_datetime})
Another way to read in an Excel file and change a column to datetime straight from read_excel is as follows:
import pandas as pd
file = 'PATH_HERE'
df1 = pd.read_excel(file, parse_dates=['COLUMN'])
For reference, I am using python 3.8.3
read_excel supports dtype, just as read_csv does, as of this writing:
import datetime
import pandas as pd
xlsx = pd.ExcelFile('path...')
df = pd.read_excel(xlsx, dtype={'column_name': datetime.datetime})
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html

Pandas filtering - between_time on a non-index column

I need to filter out data with specific hours. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index of the DataFrame, and I need to keep the data in its original format (e.g. pivot tables expect the datetime column under its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
Which implies that there are two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable, not the DataFrame.
Then call its indexer_between_time method. This returns an integer array, which can then be used to select rows from df using iloc:
import pandas as pd
import numpy as np
N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00','21:00')]
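For comparison, here is a sketch of a different technique that also avoids touching the index: a boolean mask built from the .dt accessor. It is not the method from the answer above, and it assumes timestamps without fractional seconds. The frame is made up (two days of hourly rows), and both selections are computed so they can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 hourly timestamps covering two full days.
df = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=48, freq="60min"),
    "value": np.arange(48),
})

# Seconds since midnight for each timestamp (fractional seconds are ignored).
seconds = (df["date"].dt.hour * 3600
           + df["date"].dt.minute * 60
           + df["date"].dt.second)

# Keep rows from 08:00:00 through 21:00:00, both ends included.
by_mask = df[seconds.between(8 * 3600, 21 * 3600)]

# The indexer-based selection from the answer, for comparison.
index = pd.DatetimeIndex(df["date"])
by_indexer = df.iloc[index.indexer_between_time("8:00", "21:00")]
```

On this data both selections pick the same rows. The mask version needs no DatetimeIndex, while the indexer version also handles windows that wrap past midnight.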
