Intersection of a DataFrame slice and a list - python

I'm trying to select a subsection of a dataframe using .loc as such:
for date in months.index:
    labels = list(df.index.values)
    X = df.loc[(date - relativedelta(months=+3)):date.intersection(labels), ['A', 'B']]
    Y = df.loc[(date - relativedelta(months=+3)):date.intersection(labels), 'C']
    months.at[date, 'Prediction'] = forest.fit(X, Y)
I am following the method suggested at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike but am running into the error
AttributeError: 'Timestamp' object has no attribute 'intersection'
Is this an issue because I am using a time-indexed dataframe, because I am intersecting with a slice of the dataframe and not the whole index, or another issue? I tried to convert the timestamps to datetime objects to no avail.

# import libraries
import pandas as pd
import numpy as np
import datetime
# get today's date
today = datetime.datetime.now()
print(today)
output:
2020-08-26 20:25:40.480870
Then work out the date you want to start from; using 90 days is easier than three calendar months.
# get the date 90 days back
date_from = today - pd.to_timedelta(90, "D")
# create a mock DataFrame
dates_this_year = pd.date_range("2020-01-01", datetime.datetime.now().strftime("%Y-%m-%d"))
mock_values = np.arange(0, len(dates_this_year))
df = pd.DataFrame({"date": dates_this_year, "A": mock_values, "B": mock_values, "C": mock_values})
date_df = df.set_index("date")
date_df
output:
date A B C
2020-01-01 0 0 0
2020-01-02 1 1 1
2020-01-03 2 2 2
2020-01-04 3 3 3
2020-01-05 4 4 4
... ... ... ...
2020-08-22 234 234 234
2020-08-23 235 235 235
2020-08-24 236 236 236
2020-08-25 237 237 237
2020-08-26 238 238 238
Then, to slice it, just use the from and to dates:
date_df.loc[date_from:today,["A","B"]]
output:
date A B
2020-05-29 149 149
2020-05-30 150 150
2020-05-31 151 151
2020-06-01 152 152
2020-06-02 153 153
... ... ...
2020-08-22 234 234
2020-08-23 235 235
2020-08-24 236 236
2020-08-25 237 237
2020-08-26 238 238
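As an aside, the AttributeError in the question comes from calling .intersection on date: intersection is an Index method, not a Timestamp method. With a sorted DatetimeIndex you don't need it at all for a range like this; a minimal sketch of folding the slice above back into the original loop (assuming df has a DatetimeIndex and columns 'A', 'B', 'C'):

from dateutil.relativedelta import relativedelta

for date in months.index:
    # label-based slice: rows from three months before `date` up to `date`
    window = df.loc[(date - relativedelta(months=3)):date]
    X = window[['A', 'B']]
    Y = window['C']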

Related

Aggregate functions on a 3-level pandas groupby object

I want to make a new df with simple metrics like mean, sum, min, max calculated on the Value column in the df visible below, grouped by ID, Date and Key.
index   ID  Key        Date  Value    x    y       z
    0  655  321  2021-01-01     50  546  235  252345
    1  675  321  2021-01-01     50  345  345   34545
    2  654  356  2021-02-02     70  345  346     543
I am doing it like this:
final = df.groupby(['ID','Date','Key'])['Value'].first().mean(level=[0,1]).reset_index().rename(columns={'Value':'Value_Mean'})
I use .first() because one Key can occur multiple times in the df but they all have the same Value. I want to aggregate on ID and Date so I am using level=[0,1].
and then I add the next metric with a pandas merge:
final = final.merge(df.groupby(['ID','Date','Key'])['Value'].first().max(level=[0,1]).reset_index().rename(columns={'Value':'Value_Max'}), on=['ID','Date'])
And I go on like that for the other metrics. I wonder if there is a more elegant way to do it than repeating this over multiple lines. I know you can use .agg() and pass a dict of functions, but it seems that way you cannot specify the level, which is important here.
Use DataFrame.drop_duplicates with named aggregation:
df = pd.DataFrame({'ID': [655, 655, 655, 675, 654],
                   'Key': [321, 321, 333, 321, 356],
                   'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01', '2021-02-02'],
                   'Value': [50, 30, 10, 50, 70]})
print (df)
ID Key Date Value
0 655 321 2021-01-01 50
1 655 321 2021-01-01 30
2 655 333 2021-01-01 10
3 675 321 2021-01-01 50
4 654 356 2021-02-02 70
final = (df.drop_duplicates(['ID', 'Date', 'Key'])
           .groupby(['ID', 'Date'], as_index=False)
           .agg(Value_Mean=('Value', 'mean'),
                Value_Max=('Value', 'max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
Or, reducing the duplicates with first as in your code and then aggregating:
final = (df.groupby(['ID', 'Date', 'Key'], as_index=False)
           .first()
           .groupby(['ID', 'Date'], as_index=False)
           .agg(Value_Mean=('Value', 'mean'),
                Value_Max=('Value', 'max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
Or aggregate with a list of functions and add a prefix:
df = (df.groupby(['ID', 'Date', 'Key'], as_index=False)
        .first()
        .groupby(['ID', 'Date'])['Value']
        .agg(['mean', 'max'])
        .add_prefix('Value_')
        .reset_index())
print (df)
ID Date Value_mean Value_max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
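On the level= point from the question: Series.mean(level=...) is deprecated in newer pandas (and removed in 2.0); the equivalent spelling groups on the index levels instead, e.g.:

# equivalent of the question's .mean(level=[0, 1]) on the intermediate Series
df.groupby(['ID', 'Date', 'Key'])['Value'].first().groupby(level=[0, 1]).mean()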

python pandas how to read csv file by block

I'm trying to read a CSV file, block by block.
CSV looks like:
No.,time,00:00:00,00:00:01,00:00:02,00:00:03,00:00:04,00:00:05,00:00:06,00:00:07,00:00:08,00:00:09,00:00:0A,...
1,2021/09/12 02:16,235,610,345,997,446,130,129,94,555,274,4,
2,2021/09/12 02:17,364,210,371,341,294,87,179,106,425,262,3,
1434,2021/09/12 02:28,269,135,372,262,307,73,86,93,512,283,4,
1435,2021/09/12 02:29,281,207,688,322,233,75,69,85,663,276,2,
No.,time,00:00:10,00:00:11,00:00:12,00:00:13,00:00:14,00:00:15,00:00:16,00:00:17,00:00:18,00:00:19,00:00:1A,...
1,2021/09/12 02:16,255,619,200,100,453,456,4,19,56,23,4,
2,2021/09/12 02:17,368,21,37,31,24,8,19,1006,4205,2062,30,
1434,2021/09/12 02:28,2689,1835,3782,2682,307,743,256,741,52,23,6,
1435,2021/09/12 02:29,2281,2047,6848,3522,2353,755,659,885,6863,26,36,
Blocks start with No., and data rows follow.
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print(zf.filename)
        print("csv name: ", f)
        df = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5])  # nrows=1435? (but what about the next blocks?)
        print(df, '\n')
        date_pattern = '%Y/%m/%d %H:%M'
        # create epoch as a column
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
        tuples = []  # data will be saved in a list
        formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                # print(f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e)
                tuples.append((f"perf.type.serial.LDEV.{each_column}.TOTAL_IOPS", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Many thanks for any help.
Load your file with pd.read_csv and start a new block each time the value in the first column is 'No.'. Then use groupby to iterate over the blocks and build a new dataframe from each one.
data = pd.read_csv('data.csv', header=None)

dfs = []
for _, df in data.groupby(data[0].eq('No.').cumsum()):
    df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
    dfs.append(df.rename_axis(columns=None))
Output:
# First block
>>> dfs[0]
No. time 00:00:00 00:00:01 00:00:02 00:00:03 00:00:04 00:00:05 00:00:06 00:00:07 00:00:08 00:00:09 00:00:0A ...
0 1 2021/09/12 02:16 235 610 345 997 446 130 129 94 555 274 4 NaN
1 2 2021/09/12 02:17 364 210 371 341 294 87 179 106 425 262 3 NaN
2 1434 2021/09/12 02:28 269 135 372 262 307 73 86 93 512 283 4 NaN
3 1435 2021/09/12 02:29 281 207 688 322 233 75 69 85 663 276 2 NaN
# Second block
>>> dfs[1]
No. time 00:00:10 00:00:11 00:00:12 00:00:13 00:00:14 00:00:15 00:00:16 00:00:17 00:00:18 00:00:19 00:00:1A ...
0 1 2021/09/12 02:16 255 619 200 100 453 456 4 19 56 23 4 NaN
1 2 2021/09/12 02:17 368 21 37 31 24 8 19 1006 4205 2062 30 NaN
2 1434 2021/09/12 02:28 2689 1835 3782 2682 307 743 256 741 52 23 6 NaN
3 1435 2021/09/12 02:29 2281 2047 6848 3522 2353 755 659 885 6863 26 36 NaN
and so on.
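Note that because each block is rebuilt from raw .values, every column comes out as object dtype (strings). If numeric analysis is wanted (an assumption on my part), the data columns can be coerced afterwards:

# coerce everything after the 'No.' and 'time' columns to numbers
for i, block in enumerate(dfs):
    num_cols = block.columns[2:]
    block[num_cols] = block[num_cols].apply(pd.to_numeric, errors='coerce')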
Sorry, I can't find a correct way to integrate this with your code:
def run(sock, delay, zipobj):
    zf = zipfile.ZipFile(zipobj)
    for f in zf.namelist():
        print("using zip :", zf.filename)
        myobject = re.search(r'(^[a-zA-Z]{4})_.*', f)
        Objects = myobject.group(1)
        if Objects == 'LDEV':
            metric = re.search('.*LDEV_(.*)/.*', f).group(1)
        elif Objects == 'Port':
            metric = re.search('.*/(Port_.*).csv', f).group(1)
        else:
            print("None")
        print("using csv : ", f)
        # df = pd.read_csv(zf.open(f), skiprows=[0, 1, 2, 3, 4, 5])
        data = pd.read_csv(zf.open(f), header=None, skiprows=[0, 1, 2, 3, 4, 5])
        dfs = []
        for _, df in data.groupby(data[0].eq('No.').cumsum()):
            df = pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0])
            dfs.append(df.rename_axis(columns=None))
        print("here")
        date_pattern = '%Y/%m/%d %H:%M'
        # create epoch as a column
        df['epoch'] = df.apply(lambda row: int(time.mktime(time.strptime(row.time, date_pattern))), axis=1)
        tuples = []  # data will be saved in a list
        # formated_str = 'perf.type.serial.object.00.00.00.TOTAL_IOPS'
        for each_column in list(df.columns)[2:-1]:
            for e in zip(list(df['epoch']), list(df[each_column])):
                each_column = each_column.replace("X", '')
                tuples.append((f"perf.type.serial.{Objects}.{each_column}.{metric}", e))
        package = pickle.dumps(tuples, 1)
        size = struct.pack('!L', len(package))
        sock.sendall(size)
        sock.sendall(package)
        time.sleep(delay)
Thanks for your help.
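For reference, one visible problem in the snippet above is that the epoch/tuples section runs on df, which after the groupby loop only holds the last block. A minimal, untested sketch of processing every block from dfs instead (my assumption about the intent):

date_pattern = '%Y/%m/%d %H:%M'
for block in dfs:
    # build the epoch column per block, then emit tuples as before
    block['epoch'] = block['time'].map(
        lambda t: int(time.mktime(time.strptime(t, date_pattern))))
    tuples = []
    for each_column in list(block.columns)[2:-1]:
        name = each_column.replace("X", '')
        for e in zip(list(block['epoch']), list(block[each_column])):
            tuples.append((f"perf.type.serial.{Objects}.{name}.{metric}", e))
    # ...then pickle and sock.sendall exactly as in the original code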

Organize DataFrame into columns by year and index by day-month - PYTHON - PANDAS

I have a dataframe that I would like to plot by day and month, year over year. To do that, I understand that I have to put the years into their own columns and index by day and month, but I am not sure how to go about it.
Here is a sample data frame
date count
2012-11-12 219
2013-11-12 188
2014-11-12 215
2015-11-12 232
2012-11-13 210
2013-11-13 234
2014-11-13 220
2015-11-13 203
2012-11-14 224
2013-11-14 196
2014-11-14 213
2015-11-14 228
which should look something like this
day-month 2012 2013 2014 2015
11-12 219 188 215 232
11-13 210 234 220 203
11-14 224 196 213 228
Thanks
Use DataFrame.pivot_table:
dates = pd.to_datetime(df['date'])
new_df = df.pivot_table(index=[dates.dt.month, dates.dt.day],
                        columns=dates.dt.year,
                        values='count')
new_df = (new_df.set_axis([f'{month}-{day}' for month, day in new_df.index])
                .rename_axis(index='month-day', columns=None)
          # .reset_index()  # if you want a month-day column
          )
print(new_df)
Output
2012 2013 2014 2015
month-day
11-12 219 188 215 232
11-13 210 234 220 203
11-14 224 196 213 228
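As a variant, the same reshape can be written with DataFrame.pivot, using strftime to build the day-month labels directly (zero-padded '%m-%d' matches the question's 11-12 style; 'year' is a helper column name introduced here for illustration):

dates = pd.to_datetime(df['date'])
out = (df.assign(**{'month-day': dates.dt.strftime('%m-%d'), 'year': dates.dt.year})
         .pivot(index='month-day', columns='year', values='count'))
print(out)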

pandas transform data into multi step time series based on a condition

I have a dataframe like the one below, and I'm creating a multi-step sequence of data using the for loop below, but I want to apply the logic at a customer level.
Dataframe :
Date Customer Price
1/1/2019 A 142
1/2/2019 A 123
1/3/2019 A 342
1/4/2019 A 232
1/5/2019 A 657
1/6/2019 B 875
1/7/2019 B 999
1/8/2019 B 434
1/9/2019 B 564
1/10/2019 B 345
1/10/2019 B 798
The for loop below creates sequences of data with a rolling window of 1.
data = np.array(data)
X_data, y_data = [], []
for i in range(2, len(data) - 2):
    X_data.append(data[i-2:i])
    y_data.append(data[i:i+2])
The output of the X_data and y_data arrays should look like this:
          X_data (independent variables)   y_data (target)
customer        0     1                        0     1
A             142   123                      342   232
A             123   342                      232   657
B             875   999                      434   564
B             999   434                      564   345
B             434   564                      345   798
Please advise. Thanks in advance.
Use DataFrame.shift() to shift the rows by the desired number of periods to build the rolling data:
def get_rolling_data(group):
    # `group` is the per-customer sub-DataFrame passed in by groupby.apply
    n = 2
    for i in range(n):
        group[f'x_data.{i}'] = group.shift(0 - i).Price
        group[f'y_data.{i}'] = group.shift(0 - n - i).Price
    return group

df_res = df.groupby(['Customer']).apply(get_rolling_data)
print(df_res.dropna())
Date Customer Price x_data.0 x_data.1 y_data.0 y_data.1
0 1/1/2019 A 142 142 123.0 342.0 232.0
1 1/2/2019 A 123 123 342.0 232.0 657.0
5 1/6/2019 B 875 875 999.0 434.0 564.0
6 1/7/2019 B 999 999 434.0 564.0 345.0
7 1/8/2019 B 434 434 564.0 345.0 798.0
The result above is a DataFrame; you can extract NumPy arrays from it if that is what your model needs.
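For instance, a short sketch pulling the arrays out (column names taken from the output above):

valid = df_res.dropna()
X_data = valid[['x_data.0', 'x_data.1']].to_numpy()
y_data = valid[['y_data.0', 'y_data.1']].to_numpy()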

Pivoting data with date as a row in Python

I have data that I've left in a format that will allow me to pivot on dates that look like:
Region 0 1 2 3
Date 2005-01-01 2005-02-01 2005-03-01 ....
East South Central 400 500 600
Pacific 100 200 150
.
.
Mountain 500 600 450
I need to pivot this table so it looks like:
0 Date Region value
1 2005-01-01 East South Central 400
2 2005-02-01 East South Central 500
3 2005-03-01 East South Central 600
.
.
4 2005-01-01 Pacific 100
5 2005-02-01 Pacific 200
6 2005-03-01 Pacific 150
.
.
Since both Date and Region are under one another I'm not sure how to melt or pivot around these strings so that I can get my desired output.
How can I go about this?
I think this is the solution you are looking for. Shown by example.
import pandas as pd
import numpy as np

N = 100
regions = list('abcdef')
df = pd.DataFrame([[i for i in range(N)],
                   ['2016-{}'.format(i) for i in range(N)],
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N)),
                   list(np.random.randint(0, 500, N))])
df.index = ['Region', 'Date', 'a', 'b', 'c', 'd']
print(df)
print(df)
This gives
0 1 2 3 4 5 6 7 \
Region 0 1 2 3 4 5 6 7
Date 2016-0 2016-1 2016-2 2016-3 2016-4 2016-5 2016-6 2016-7
a 96 432 181 64 87 355 339 314
b 360 23 162 98 450 78 114 109
c 143 375 420 493 321 277 208 317
d 371 144 207 108 163 67 465 130
And the solution to pivot this into the form you want is
df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'])
which gives
Date variable value
0 2016-0 a 96
1 2016-1 a 432
2 2016-2 a 181
3 2016-3 a 64
4 2016-4 a 87
5 2016-5 a 355
6 2016-6 a 339
7 2016-7 a 314
8 2016-8 a 111
9 2016-9 a 121
10 2016-10 a 124
11 2016-11 a 383
12 2016-12 a 424
13 2016-13 a 453
...
393 2016-93 d 176
394 2016-94 d 277
395 2016-95 d 256
396 2016-96 d 174
397 2016-97 d 349
398 2016-98 d 414
399 2016-99 d 132
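To land exactly on the Date / Region / value column names from the question, melt's var_name and value_name parameters rename the output columns directly:

df.transpose().melt(id_vars=['Date'], value_vars=['a', 'b', 'c', 'd'],
                    var_name='Region', value_name='value')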
