I have an Excel sheet of time series price data where each day consists of 6 hourly periods. I am trying to use Python and pandas, which I have set up and working: I import a CSV and create a df from it. That part is fine; it is just the sorting code I am struggling with. In Excel I can do this with a SUM(SUMIFS) array function, but I would like it to work in Python/pandas.
I am looking to produce a new time series from the data where, for each day, I get the average price for periods 3 to 5 inclusive only, excluding the others.
An example raw data excerpt and the result I am looking for are below:
You need to filter with between and boolean indexing, then aggregate with mean:
df = df[df['Period'].between(3,5)].groupby('Date', as_index=False)['Price'].mean()
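As a rough check of the one-liner above, here is a minimal sketch on made-up data. The column names Date, Period and Price come from the answer; the dates and prices are invented purely for illustration.

```python
import pandas as pd

# Hypothetical sample: two days, six periods each (values are made up).
df = pd.DataFrame({
    'Date': ['2017-01-01'] * 6 + ['2017-01-02'] * 6,
    'Period': list(range(1, 7)) * 2,
    'Price': [10, 20, 30, 40, 50, 60, 5, 10, 15, 20, 25, 30],
})

# between() is inclusive on both ends by default, so this keeps periods 3, 4 and 5.
mask = df['Period'].between(3, 5)

# One row per day holding the mean price of the kept periods.
daily_mean = df[mask].groupby('Date', as_index=False)['Price'].mean()
print(daily_mean)
```

Splitting the mask out of the one-liner makes it easy to inspect which rows survive the filter before aggregating.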
Related
I have a small dataframe that I would like to sort in Python. When sorting in Python, I get a different result than when sorting in Excel. I would like to sort from A to Z and have the pandas result match the Excel output.
Here is the code I used:
import pandas as pd
df = pd.DataFrame({"Col": ['0123A', '0123B', '01-AB']})
df = df.sort_values('Col', ascending=True)
Here is the output in Python:
Col
2 01-AB
0 0123A
1 0123B
My output in Excel is this:
Col
0 0123A
1 0123B
2 01-AB
Is there a reason why the pandas and Excel results don't match?
Yes, there is a reason the pandas sort and the Excel sort differ. pandas sorts strings by character code (ASCII/Unicode code point order), which is probably what you expected, whereas Excel treats the hyphen (minus sign) differently when sorting. If you want Excel to sort the same way pandas does, you need some additional definitions and processing, as explained at this link:
Excel Hyphen Sorting
You might need to read through the solution a few times for it to register, but sorting columns where cell contents can contain a hyphen requires a bit of extra work in Excel.
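Going the other way (making pandas match Excel, which is what the question asked for) might look like the sketch below. It assumes that ignoring hyphens is the only difference that matters for this data; Excel's real collation has further rules (case-insensitivity, for example) that this does not try to reproduce. The key argument of sort_values needs pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({"Col": ['0123A', '0123B', '01-AB']})

# Assumption: approximating Excel by comparing the strings with hyphens removed.
# The key function only affects the comparison; the stored values are unchanged.
excel_like = df.sort_values(
    'Col',
    key=lambda col: col.str.replace('-', '', regex=False),
)
print(excel_like)
```

With the hyphen stripped, '01-AB' is compared as '01AB', which sorts after '0123A' and '0123B' because digits come before letters.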
I hope that helps.
Regards.
I am having trouble interpolating my missing values. I am using the following code to interpolate
df=pd.read_csv(filename, delimiter=',')
#Interpolating the nan values
df.set_index(df['Date'],inplace=True)
df2=df.interpolate(method='time')
Water=(df2['Water'])
Oil=(df2['Oil'])
Gas=(df2['Gas'])
Whenever I run my code I get the following message: "time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex"
My data consists of several columns with a header. The first column is named Date, and all of its rows look similar to this: 12/31/2009. I am new to Python and time series in general. Any tips will help.
Sample of CSV file
Try this, assuming the first column of your csv is the one with date strings:
df = pd.read_csv(filename, index_col=0, parse_dates=[0], infer_datetime_format=True)
df2 = df.interpolate(method='time', limit_direction='both')
In theory this should 1) convert your first column into actual datetime objects and 2) set the dataframe's index to that datetime column, all in one step. The infer_datetime_format=True argument is optional; if your dates are in a standard format, it can speed up parsing considerably. (In pandas 2.0 and later this argument is deprecated, because format inference is the default behaviour.)
The limit_direction='both' should back fill any NaNs in the first row, but because you haven't provided a copy-paste-able sample of your data, I cannot confirm on my end.
Reading the documentation can be incredibly helpful and can usually answer questions faster than you'll get answers from Stack Overflow!
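The fix described above can be tried on a small inline sample. This is only a sketch: the CSV text and its values are invented, and the date format %m/%d/%Y is assumed from the 12/31/2009 example in the question. It converts the index explicitly with pd.to_datetime rather than relying on infer_datetime_format.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (values are made up).
csv_text = """Date,Water,Oil
12/31/2009,10.0,1.0
01/01/2010,,2.0
01/03/2010,40.0,4.0
"""

df = pd.read_csv(io.StringIO(csv_text), index_col='Date')

# method='time' requires a DatetimeIndex, so convert the string index first.
df.index = pd.to_datetime(df.index, format='%m/%d/%Y')

# The gap is filled in proportion to elapsed time, not row position.
df2 = df.interpolate(method='time')
print(df2)
```

Here the missing Water value sits one day into a three-day gap between 10 and 40, so the time-weighted result is 20 rather than the midpoint 25.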
I'm trying to read a column containing dates and times from a CSV file and want to plot the frequency of data points per day.
I don't actually know how to read them, though.
You'll need to convert that column to datetime first.
df['created'] = pd.to_datetime(df['created'])
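Once the column is a real datetime, counting rows per day could be sketched as follows. The column name created comes from the answer above; the timestamps are invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps standing in for the CSV column.
df = pd.DataFrame({'created': [
    '2020-01-01 08:15:00',
    '2020-01-01 17:40:00',
    '2020-01-02 09:05:00',
]})
df['created'] = pd.to_datetime(df['created'])

# Group by the calendar date of each timestamp and count the rows per day.
per_day = df.groupby(df['created'].dt.date).size()
print(per_day)
```

If matplotlib is installed, per_day.plot(kind='bar') should draw the daily counts as a bar chart.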
I am working on a dataframe of 50 million rows in pandas. I need to run through a column and extract specific parts of the text. The column has string values that follow 4 or 5 patterns. I need to extract the text and replace the original string. I am using the apply function and regex for this, and it takes close to a day to execute. I feel this is inefficient. Or is this normal? Is there an approach I am missing that would make it faster?
Here are the docs:
http://pandas.pydata.org/pandas-docs/stable/indexing.html
http://pandas.pydata.org/pandas-docs/stable/text.html#extracting-substrings
Replacing text is easy. No, a day isn't normal. Get rid of all the lists you had in an earlier version of this post; you don't need them. Add columns to the dataframe if you need more space for data. Learn the data types to make the data smaller.
import pandas as pd
df = pd.DataFrame() #import your data at this step
df['column'].str.extract(regex_thingy_here)
I'd write more but you took the code down.
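Since the original code is gone, the following is only a guess at the shape of the problem: a sketch that folds a few hypothetical prefixes into one regex with a single capture group and lets str.extract handle the whole column at once. The sample strings and the pattern are invented. Whether this is fast enough on 50 million rows would need measuring.

```python
import pandas as pd

# Hypothetical values following a handful of patterns.
df = pd.DataFrame({'column': ['order-123', 'ref:456', 'ID 789', 'no match']})

# One alternation covers the prefixes; the single capture group is the part to keep.
pattern = r'(?:order-|ref:|ID )(\d+)'

# expand=False returns a Series when the pattern has one group;
# rows that do not match become NaN.
df['column'] = df['column'].str.extract(pattern, expand=False)
print(df)
```

Assigning the result back to the same column replaces the original strings, which is what the question asked for.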
I am running a script (script 1) to create an empty data frame which is populated by another script (script 2). The index in this empty data frame is a time series of 30 minute intervals across 365 days, beginning 1st October 2016. To create this time series index, Script 1 contains the following code:
time_period_start = dt.datetime(2016,01,10).strftime("%Y-%d-%m")
index = pd.date_range(time_period_start, freq='30min', periods=17520)
Script 2 pulls data out of a CSV file, containing values across a time series. The plan is for this script to put this data into a dataframe, and then merge this with the dataframe created in Script 1.
The problem I am having is that the format of the dates in the dataframe created in Script 2 is Y-D-M, which is what comes out of the CSV files. However, the format of the dates in the dataframe created in Script 1 is Y-M-D, which causes incorrect results when I try to merge. This is even despite my use of ".strftime("%Y-%d-%m")" in the first line of code above. Is there any way of amending the second line of code so that the output dataframe is in Y-D-M?
.strftime() doesn't change the format of the final index, since pd.date_range parses the string back into datetime values anyway. A DatetimeIndex has no Y-D-M or Y-M-D format of its own; that is only how it is displayed. Instead of trying to match on strings, you should convert the dates in the second dataframe (the one created by Script 2) to datetime as well.
df2.date = pd.to_datetime(df2.date, format='%Y-%d-%m')
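Put together, the approach might look like the sketch below. The column names date and value, the sample rows, and the time-of-day part of the format string are assumptions made for illustration; only the %Y-%d-%m date order comes from the question.

```python
import pandas as pd

# Script 1: empty frame indexed by 30-minute steps from 1 October 2016
# (shortened to 4 periods here).
index = pd.date_range('2016-10-01', freq='30min', periods=4)
df1 = pd.DataFrame(index=index)

# Script 2: hypothetical CSV contents with dates written as year-day-month.
df2 = pd.DataFrame({
    'date': ['2016-01-10 00:00', '2016-01-10 00:30', '2016-01-10 01:00'],
    'value': [1.0, 2.0, 3.0],
})

# Parse the strings explicitly as Y-D-M so both frames hold real datetimes.
df2['date'] = pd.to_datetime(df2['date'], format='%Y-%d-%m %H:%M')

# Align on the datetime values; timestamps missing from df2 become NaN.
merged = df1.join(df2.set_index('date'), how='left')
print(merged)
```

Because both sides are now datetimes, the join compares actual points in time, and the display format no longer matters.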