After some consideration, I have decided to change the parameters of my question.
I have a really big CSV file (over 300 thousand lines) which spans 3 years.
The file is set up as shown below, where all of the companies are under one column and their electrical usage is on the same line but in a different column.
Date and time Company Usage
2020-01-01 00:00:00 Company1 300
2020-01-01 00:00:00 Company2 20
2020-01-01 00:00:00 Company3 120
2020-01-01 00:00:00 Company4 600
2020-01-01 01:00:00 Company1 450
2020-01-01 01:00:00 Company3 80
2020-01-01 01:00:00 Company4 650
2020-01-01 02:00:00 Company1 350
2020-01-01 02:00:00 Company2 35
2020-01-01 02:00:00 Company3 150
2020-01-01 02:00:00 Company4 550
Note: Fabricated numbers
I am wondering how I can change the data so that all of the companies are displayed as columns, like this:
date and time Company1 Company2 Company3
2020-01-01 00:00:00 300 20 120
2020-01-01 01:00:00 450 80 80
2020-01-01 02:00:00 350 35 150
I have tried df.groupby('Company')['Usage'], which really didn't do much. I also tried to write a for loop, which didn't do anything (probably due to my very limited experience), similar to this:
for i in df:
    for j in i:
        if df[][] = "date and time":
            newDf.append()
This is probably much easier than I think, but I haven't stumbled on the right answer for a few days now.
After a long search, I found a solution. There are probably more, but for now this one works for me, up to a point:
new_df = dict(tuple(df.groupby('Company')))
At least this is a step in the right direction.
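If the goal is exactly the wide layout shown above, a pivot may get you the rest of the way. Here is a minimal sketch, assuming the column names from the sample and a hypothetical file name (not tested against your full file):
import pandas as pd

df = pd.read_csv('usage.csv')  # hypothetical file name
# one row per timestamp, one column per company, usage values as the cells
wide = df.pivot_table(index='Date and time', columns='Company', values='Usage')
print(wide.head())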
I have attached an example of a dataframe sampled quarter-hourly (every 15 minutes). I wish to resample it to per-minute resolution without any aggregation.
Input dataframe:
Date (CET)          Price
2020-01-01 11:00    50
2020-01-01 11:15    60
2020-01-01 11:15    100
The output I want is this:
Date (CET)          Price
2020-01-01 11:00    50
2020-01-01 11:01    50
2020-01-01 11:02    50
2020-01-01 11:03    50
2020-01-01 11:04    50
2020-01-01 11:05    50
2020-01-01 11:06    50
2020-01-01 11:07    50
2020-01-01 11:08    50
2020-01-01 11:09    50
2020-01-01 11:10    50
2020-01-01 11:11    50
2020-01-01 11:12    50
2020-01-01 11:13    50
2020-01-01 11:14    50
2020-01-01 11:15    60
I tried using df.resample, but it requires me to aggregate with mean() or sum(), which I don't want. I want the values to stay the same within a given quarter hour; in the output table, for example, the price remains 50 from 11:00 to 11:14.
Use:
#convert to DatetimeIndex
df['Date (CET)'] = pd.to_datetime(df['Date (CET)'])
#remove duplicates
df = df.drop_duplicates('Date (CET)')
df = df.set_index('Date (CET)')
#forward filling values - upsample
df.resample('Min').ffill()
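If you want 'Date (CET)' back as an ordinary column afterwards, one small addition to the same answer is to reset the index at the end:
df = df.resample('Min').ffill().reset_index()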
I have a DataFrame with relevant stock information that looks like this.
Screenshot of my dataframe
I need it so that if the 'close' of one row is different from the 'open' of the next row, a new dataframe is created storing the rows that fulfil this criterion. I would like all of the values from those rows to be saved in the new dataframe. To clarify, I would like both of the rows where this happens to be stored in the new dataframe.
DataFrame as text as requested:
timestamp open high low close volume
0 2020-01-01 00:00:00 129.16 130.98 128.68 130.24 4.714333e+04
1 2020-01-01 08:00:00 130.24 132.40 129.87 132.08 5.183323e+04
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 4.579396e+04
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 6.606601e+04
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 4.849893e+04
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 9.919212e+04
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 1.276414e+05
This can be accomplished using Series.shift:
>>> df['close'] != df['open'].shift(-1)
0    False
1    False
2     True
3     True
4     True
5    False
6     True
dtype: bool
This compares the close value in one row to the open value of the next row ("shifted" one row ahead).
You can then select the rows for which the condition is True.
>>> df[df['close'] != df['open'].shift(-1)]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
This only returns the first of each pair of rows; to also get the row that follows, shift the condition forward by one (filling the resulting gap with False so the mask stays boolean) and combine the two.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_after = row_condition.shift(1, fill_value=False)
>>> df[row_condition | row_after]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 99192.12
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
Providing a textual sample of the DataFrame is useful because this can be copied directly into a Python session; I would have had to manually type the content of your screenshot otherwise.
I have two features of datetime type; a few records in each column have NaT values. These NaT values are meaningful, and I can fill them with 0.
What I want to do is find the difference between the two dates to create a new feature, "Time Spent". Because of the NaT values my code is throwing an error, which makes sense as I am trying to subtract NaT values. I was wondering if there is an efficient way to do this in Python. Thanks in advance.
Example:
DateTime_Min         DateTime_Max         Process
2020-01-01 11:30:00  2020-01-01 11:30:30  A
2020-01-01 11:30:00  2020-01-01 11:30:20  B
NaT                  NaT                  C
2020-01-01 11:30:00  2020-01-01 11:30:30  D
What I want:
DateTime_Min         DateTime_Max         Process  Time_Spent(seconds)
2020-01-01 11:30:00  2020-01-01 11:30:30  A        30
2020-01-01 11:30:00  2020-01-01 11:30:20  B        20
NaT                  NaT                  C        0
2020-01-01 11:30:00  2020-01-01 11:30:30  D        30
Code
#calculating time spent on each process in seconds
def calculate_seconds(df):
    if (df['DateTime_Max'] == 0 | df['DateTime_Min'] == 0):
        df['Time_Spent'] = 0
    else:
        df['Time_Spent'] = (df['DateTime_Max'] - df['DateTime_Min']) / np.timedelta64(1, 's')
Here's a solution using fillna:
df.DateTime_Min = pd.to_datetime(df.DateTime_Min)
df.DateTime_Max = pd.to_datetime(df.DateTime_Max)
df["Time_Spent"] = (df.DateTime_Max - df.DateTime_Min).fillna(pd.Timedelta(seconds=0))
The result is:
DateTime_Min DateTime_Max Process Time_Spent
0 2020-01-01 11:30:00 2020-01-01 11:30:30 A 00:00:30
1 2020-01-01 11:30:00 2020-01-01 11:30:20 B 00:00:20
2 NaT NaT C 00:00:00
3 2020-01-01 11:30:00 2020-01-01 11:30:30 D 00:00:30
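If you need plain seconds, as in the Time_Spent(seconds) column of the desired output, one small extension of the same idea (a sketch, not tested on your real data) is to convert the timedeltas to seconds and then fill:
#timedelta -> float seconds; NaT differences become NaN, then 0
df["Time_Spent"] = (df.DateTime_Max - df.DateTime_Min).dt.total_seconds().fillna(0)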
Is there a way to filter rows if column2 has all zeroes for 10 minutes ahead of the current value in column1? How can I do this while keeping the datetime index?
2020-01-01 00:01:00 60 0
2020-01-01 00:02:00 70 0
2020-01-01 00:03:00 80 0
2020-01-01 00:04:00 70 0
2020-01-01 00:05:00 60 0
2020-01-01 00:06:00 60 0
2020-01-01 00:07:00 70 0
2020-01-01 00:08:00 80 0
2020-01-01 00:09:00 80 2
2020-01-01 00:10:00 80 0
2020-01-01 00:11:00 70 0
2020-01-01 00:12:00 70 0
2020-01-01 00:13:00 50 0
2020-01-01 00:14:00 50 0
2020-01-01 00:15:00 60 0
2020-01-01 00:16:00 60 0
2020-01-01 00:17:00 70 0
2020-01-01 00:18:00 70 0
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
2020-01-01 00:21:00 80 1
2020-01-01 00:22:00 90 2
Expected output
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
I figured it out. It's actually simple:
# rolling 10-minute sum of column2; it is 0 only when the last 10 values are all zero
input['col3'] = input['col2'].rolling(10).sum()
# keep only the rows where that rolling sum is 0 (the datetime index is preserved)
output = input.loc[input['col3'] == 0]
Just a guess, because I do not know pandas, but assuming it is a bit like SQL, LINQ, or linkable datasets in C#: what about joining your table (A) with itself (B) over all 12 minutes, grouping by each row of A, summing column2 of B (if there are only positive values there), and filtering (SQL HAVING) for the ones whose sum is 0?
As a result, report A.column0, A.column1 and SUM(B.column2).
Using pandas.DataFrame.query (see the pandas.DataFrame.query documentation):
df.query(f'column_1 == {0} and column_2 == {value} or column_3 == {another_value}')
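For this question specifically, once a helper column such as col3 above holds the rolling sum, the same filter could be expressed with query (assuming that column name):
output = input.query('col3 == 0')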
I'm fairly new to python, especially the data libraries, so please excuse any idiocy.
I'm trying to practise with a made-up data set of monthly observations over 12 months; the data looks like this...
print(data)
2017-04-17 156
2017-05-09 216
2017-06-11 300
2017-07-29 184
2017-08-31 162
2017-09-24 91
2017-10-15 225
2017-11-03 245
2017-12-26 492
2018-01-26 485
2018-02-18 401
2018-03-09 215
2018-04-30 258
These monthly observations are irregular (there is exactly one in each month, but nowhere near the same day each time).
Now, I want to use linear interpolation to get the values at the start of each month.
I've tried a bunch of methods and was able to do it 'manually', but I'm trying to get to grips with pandas and numpy, and I know it can be done with them. Here's what I had so far: I make a Series holding the data, and then I do:
resampled1 = data.resample('MS')
interp1 = resampled1.interpolate()
print(interp1)
This prints:
2017-04-01 NaN
2017-05-01 NaN
2017-06-01 NaN
2017-07-01 NaN
2017-08-01 NaN
2017-09-01 NaN
2017-10-01 NaN
2017-11-01 NaN
2017-12-01 NaN
2018-01-01 NaN
2018-02-01 NaN
2018-03-01 NaN
2018-04-01 NaN
Now, I know that the first one, 2017-04-01, should be NaN, as linear interpolation (which I believe is the default) interpolates between the two points before and after, which is not possible since I don't have a data point before April 1st. As for the others, I'm not certain what I'm doing wrong; probably I'm just struggling to wrap my head around exactly what resample is doing.
You probably want to resample('D') to interpolate, e.g.:
In []:
data.resample('D').interpolate().asfreq('MS')
Out[]:
2017-05-01 194.181818
2017-06-01 274.545455
2017-07-01 251.666667
2017-08-01 182.000000
2017-09-01 159.041667
2017-10-01 135.666667
2017-11-01 242.894737
2017-12-01 375.490566
2018-01-01 490.645161
2018-02-01 463.086957
2018-03-01 293.315789
2018-04-01 234.019231
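If you also want a 2017-04-01 row (which has to stay NaN, since there is nothing before 2017-04-17 to interpolate from), one option is to reindex explicitly instead of using asfreq. A sketch, assuming data is the Series above:
import pandas as pd

month_starts = pd.date_range('2017-04-01', '2018-04-01', freq='MS')
data.resample('D').interpolate().reindex(month_starts)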
Try to use RedBlackPy.
from datetime import datetime
import redblackpy as rb
index = [datetime(2017,4,17), datetime(2017,5,9), datetime(2017,6, 11)]
values = [156, 216, 300]
series = rb.Series(index=index, values=values, interpolate='linear')
# Now you can access by any key with no insertion, using interpolation.
print(series[datetime(2017, 5, 1)]) # prints 194.18182373046875