I have a Dask dataframe created with dd.read_csv("./*/file.csv") where the * glob is a folder for each date. In the concatenated dataframe I want to filter out subsets of time like how I would with a pd.between_time("09:30", "16:00"), say.
Because Dask's internal representation of the index does not have the nice features of pandas's DatetimeIndex, I haven't had any success filtering the way I normally would in pandas. Short of resorting to a naive mapping function/loop, I am unable to get this to work in Dask.
Since the partitions are by date, perhaps that could be exploited by converting to a Pandas dataframe and then back to a Dask partition, but it seems like there should be a better way.
Updating with the example used in Angus' answer.
I guess I don't understand the logic of the queries in the answers/comments. Is pandas smart enough not to interpret the string in the boolean mask literally, and instead do the correct datetime comparisons?
Filtering in Dask works just like pandas with a few convenience functions removed.
For example if you had the following data:
time,A,B
6/18/2020 09:00,29,0.330799201
6/18/2020 10:15,30,0.518081116
6/18/2020 18:25,31,0.790506469
The following code:
import dask.dataframe as dd
df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
df.loc[(df.index > "09:30") & (df.index < "16:00")].compute()
(If run on 18th June 2020) this would return:
time,A,B
2020-06-18 10:15:00,30,0.518081
EDIT:
The above answer filters for the current date only; pandas interprets the time string as a datetime value with the current date. If you'd like to filter values for all days between specific times there's a workaround to strip the date from the datetime column:
import dask.dataframe as dd
df = dd.read_csv('*.csv', parse_dates=['time'])
df["time_of_day"] = dd.to_datetime(df["time"].dt.time.astype(str))
df.loc[(df.time_of_day > "09:30") & (df.time_of_day < "16:00")].compute()
Bear in mind there might be a speed penalty to this method, possibly a concern for larger datasets.
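Another option, closer to the per-partition conversion the question hints at, is to run pandas' between_time on each partition via map_partitions. A minimal sketch, assuming the index has already been parsed into a DatetimeIndex as above:

import dask.dataframe as dd
df = dd.read_csv('*.csv', parse_dates=['time']).set_index('time')
# each partition is a pandas DataFrame with a DatetimeIndex,
# so between_time can run per partition
filtered = df.map_partitions(lambda part: part.between_time("09:30", "16:00"))
filtered.compute()

This keeps the comparison as a true time-of-day filter rather than a string comparison, at the cost of a per-partition call.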
I am working on a calculator to determine what to feed your fish as a fun project to learn python, pandas, and numpy.
My data is organized so that my fish are rows and the different foods are columns.
What I am hoping to do, is have the user (me) input a food, and have the program output to me all those values which are not nan.
The reason why I would prefer to leave them as nan rather than 0, is that I use different numbers in different spots to indicate preference. 1 is natural diet, 2 is ok but not ideal, 3 is live only.
Is there any way to do this using pandas? Everywhere I look online helps me filter rows based on column values, but it is quite difficult to find info on filtering columns based on row values.
Currently, my code looks like this:
import pandas as pd
import numpy as np
df = pd.read_excel(r'C:\Users\Daniel\OneDrive\Documents\AquariumAiMVP.xlsx')
clownfish = df[0:1]
angelfish = df[1:2]
damselfish = df[2:3]
So, as you can see, I haven't really gotten anywhere yet. I tried filtering out the nulls using the following idea:
clownfish_wild_diet = pd.isnull(df.clownfish)
But it results in an error, saying:
AttributeError: 'DataFrame' object has no attribute 'clownfish'
Thanks for the help guys. I'm a total pandas noob so it is much appreciated.
You can use masks in pandas:
food = 'Amphipods'
mask = df[food].notnull()
result_set = df[mask]
df[food].notnull() returns a mask (a Series of boolean values indicating if the condition is met for each row), and you can use that mask to filter the real DF using df[mask].
Usually you can combine these two rows to have a more pythonic code, but that's up to you:
result_set = df[df[food].notnull()]
This returns a new DF with the subset of rows that meet the condition (including all columns from the original DF), so you can use other operations on this new DF (e.g. selecting a subset of columns, dropping other missing values, etc.).
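As a small usage sketch (assuming, hypothetically, that the first column of your sheet is named 'Fish'), you could then show only the fish names and their preference codes for the chosen food:

food = 'Amphipods'
result_set = df[df[food].notnull()]
# keep only the fish name and the preference code for that food
print(result_set[['Fish', food]])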
See more about .notnull(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.notnull.html
I am trying to load data from Delta into a pyspark dataframe.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
Now the underlying data is partitioned by date as
s3://mybucket/daily_data/
dt=2020-06-12
dt=2020-06-13
...
dt=2020-06-22
Is there a way to optimize the read into a DataFrame, given that:
Only a certain date range is needed
Only a subset of columns is needed
The current way I tried is:
df.registerTempTable("my_table")
new_df = spark.sql("select col1,col2 from my_table where dt_col > '2020-06-20' ")
# dt_col is column in dataframe of timestamp dtype.
In the above state, does Spark need to load the whole data, filter the data based on the date range, and then filter the columns needed? Is there any optimization that can be done in the pyspark read to load only the needed data, since it is already partitioned?
Something along the lines of:
df = spark.read.format("delta").load(path_to_data,cols_to_read=['col1','col2'])
or
df = spark.read.format("delta").load(path_to_data,partitions=[...])
In your case, there is no extra step needed; the optimization is taken care of by Spark. Since you have already partitioned the dataset on the column dt, when you query the dataset with the partition column dt as the filter condition, Spark loads only the subset of the data from the source that matches the filter condition, in your case dt > '2020-06-20'.
Spark does this internally through partition pruning.
To do this without SQL:
from pyspark.sql import functions as F
df = spark.read.format("delta").load(path_to_data).filter(F.col("dt_col") > F.lit('2020-06-20'))
Though for this example you may have some work to do with comparing dates.
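For the column-subset part of the question, you can also add a select so only the needed columns are carried through; a rough sketch (since Delta stores data as Parquet, Spark should be able to prune the unused columns as well):

from pyspark.sql import functions as F
df = (spark.read.format("delta")
      .load(path_to_data)
      .filter(F.col("dt_col") > F.lit("2020-06-20"))
      .select("col1", "col2"))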
I have a pandas dataframe df with an index of type DatetimeIndex with parameters: dtype='datetime64[ns]', name='DateTime', length=324336, freq=None. The dataframe has 22 columns, all numerical.
I want to create a new column Date with only the date-part of DateTime (to be used for grouping later).
My first attempt
df['Date'] = df.apply(lambda row: row.name.date(), axis=1)
takes ca. 13.5 seconds.
But when I make DateTime a regular column, it goes faster, even including the index operations:
df.reset_index(inplace=True)
df['Date'] = df.apply(lambda row: row['DateTime'].date(), axis=1)
df.set_index('DateTime', inplace=True)
This takes ca. 6.3 s, i.e., it is twice as fast. Furthermore, applying apply directly on the series (since it depends only on one column) is faster still:
df.reset_index(inplace=True)
df['Date'] = df['DateTime'].apply(lambda dt: dt.date())
df.set_index('DateTime', inplace=True)
takes ca. 1.1 s, more than 10 times faster than the original solution.
This brings me to my questions:
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
More generally: what is the advantage of keeping a column as an index? Or, conversely, what would I lose by resetting the index before doing any operations?
Finally, is there an even better/faster way of adding the column?
Use DatetimeIndex.date, which should be a faster solution:
df['Date'] = df.index.date
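Since the question mentions using the new column for grouping later, a short usage sketch (the mean aggregation is just an example):

df['Date'] = df.index.date
# group all rows belonging to the same calendar day
daily = df.groupby('Date').mean()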
Is applying apply on a series generally faster than doing it on the dataframe?
Is using apply on an index generally slower than on columns?
I think apply is a loop under the hood, so it is obviously slower than vectorized pandas methods.
More general: what is the advantage of keeping a column as an index? Or, conversely, what would I loose by resetting the index before doing any operations?
You can check this:
Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display.
Enables automatic and explicit data alignment.
Allows intuitive getting and setting of subsets of the data set.
Also, when working with time series there are many methods like resample that work with a DatetimeIndex, and it is also possible to use indexing with a DatetimeIndex.
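For example, a small sketch of what a DatetimeIndex gives you directly (assuming df keeps its DatetimeIndex):

# resample to daily means straight from the index
daily = df.resample('D').mean()
# partial string indexing: e.g. all rows from June 2018
june = df.loc['2018-06']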
I am working on automating a process with python using pandas. Previously I would use Excel PowerQuery to combine files and manipulate data but PowerQuery is not as versatile as I need so I am now using pandas. I have the process working up to a point where I can loop through files, select the columns that I need in the correct order, dependent on each workbook, and insert that into a dataframe. Once each dataframe is created, I then concatenate them into a single dataframe and write to csv. Before writing, I need to apply some validation to certain columns.
For example, I have a Stock Number column that will always need to be exactly 11 characters long. Sometimes, dependent on the workbook, the data will be missing the leading zeros or will have more than 11 characters (but those extra characters should be removed). I know that what I need to do is something along the lines of:
STOCK_NUM.zfill(13)[:13]
but I'm not sure how to actually modify the existing dataframe values. Do I actually need to loop through the dataframe or is there a way to apply formatting to an entire column?
e.g.
import pandas as pd

dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018']]
df = pd.DataFrame(dataset, columns=['STOCK_NUM', 'Date'])
for x in df['STOCK_NUM']:
    print(x.zfill(13)[:13])
I would like to know the most optimal way to apply that format to the existing values and only if those values are present (i.e. not touching it if there are null values).
Also, I need to ensure that the date columns are truly date values. Sometimes the dates are formatted as MM-DD-YYYY or MM/DD/YY, etc., and any of those are fine, but what is not fine is if the actual value in the date column is an Excel serial number that Excel can format as a date. Is there some way to apply validation logic to an entire dataframe column to ensure that there is a valid date instead of a serial number?
I honestly have no idea how to approach this date issue.
Any and all advice, insight would be greatly appreciated!
Not an expert, but from things I could gather here and there, you could try:
df['STOCK_NUM']=df['STOCK_NUM'].str.zfill(13)
followed by:
df['STOCK_NUM'] = df['STOCK_NUM'].str.slice(0,13)
For the first part.
For dates you can do a try-except on:
df['Date'] = pd.to_datetime(df['Date'])
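Putting both suggestions together, a rough sketch of how that could look (the try-except shown is just one way to surface bad values):

import pandas as pd
# .str operations skip NaN values, so missing stock numbers are left untouched
df['STOCK_NUM'] = df['STOCK_NUM'].str.zfill(13).str.slice(0, 13)
try:
    df['Date'] = pd.to_datetime(df['Date'])
except (ValueError, TypeError):
    # parsing failed, e.g. because some cells hold Excel serial numbers instead of date strings
    print("Date column contains values that could not be parsed as dates")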
For your STOCK_NUM question, you could potentially apply a function to the column, but the way I approach this is with list comprehensions. The first thing I would do is replace all the NAs in your STOCK_NUM column with a unique string and then apply the list comprehension, as you can see in the code below:
import pandas as pd
dataset = [['51346812942315.01', '01-15-2018'], ['13415678', '01-15-2018'], ['5134687155546628', '01/15/2018'], [None,42139]]
df = pd.DataFrame(dataset, columns = ['STOCK_NUM', 'Date'])
#replace NAs with a string
df.STOCK_NUM.fillna('IS_NA',inplace=True)
#use list comprehension to reformat the STOCK_NUM column
df['STOCK_NUM'] = [None if i=='IS_NA' else i.zfill(13)[:13] for i in df.STOCK_NUM]
Then, for your question about converting an Excel serial number to a date, I looked at an already answered question. I am assuming that the serial number in your dataframe is an integer type:
import datetime
import pandas as pd

def xldate_to_datetime(xldate):
    # Excel serial dates count from 1900-01-01; the 2-day offset corrects for
    # Excel's 1-based counting and its nonexistent 1900-02-29
    temp = datetime.datetime(1900, 1, 1)
    delta = datetime.timedelta(days=xldate) - datetime.timedelta(days=2)
    return pd.to_datetime(temp + delta)

df['Date'] = [xldate_to_datetime(i) if type(i) == int else pd.to_datetime(i) for i in df.Date]
Hopefully this works for you! Accept this answer if it does, otherwise reply with whatever remains an issue.
I have created a DataFrame in order to process some data, and I want to find the difference in time between each pair of data in the DataFrame. Prior to using pandas, I was using two numpy arrays, one describing the data and the other describing time (an array of datetime.datetimes). With the data in arrays, I could do timearray[1:] - timearray[:-1] which resulted in an array (of n-1 elements) describing the gap in time between each pair of data.
In pandas, doing DataFrame.index[1] - DataFrame.index[0] gives me the result I want – the difference in time between the two indices I've picked out. However, doing DataFrame.index[1:] - DataFrame.index[:-1] does not yield an array of similar results, instead simply being equal to DataFrame.index[-1]. Why is this, and how can I replicate the numpy behaviour in pandas?
Alternatively, what is the best way to find datagaps in a DataFrame in pandas?
You can use shift to offset the date and use it to calculate the difference between rows.
# create dummy data
import pandas as pd
rng = pd.date_range('1/1/2011', periods=90, freq='h')
# shift a copy of the date column and subtract from the original date
df = pd.DataFrame({'value':range(1,91),'date':rng})
df['time_gap'] = df['date'] - df['date'].shift(1)
To use this when your dates are in the index, temporarily move the index to a column with .reset_index(), and use .set_index('date') afterwards to return the date column to the index if required.
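If your dates live in the index, as in the question (index name 'DateTime'), that workflow looks roughly like this:

df = df.reset_index()
# shift(1) moves every timestamp down one row, so subtracting gives the gap
# to the previous observation; the first row's gap is NaT
df['time_gap'] = df['DateTime'] - df['DateTime'].shift(1)
df = df.set_index('DateTime')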