Grouping columns of pandas dataframe in datetime format - python

I have two questions:
1) Is there something like pandas groupby but applicable on columns (df.columns, not the data within)?
2) How can I extract the "date" from a datetime object?
I have lots of pandas dataframes (or csv files) that have a position column (which I use as index) and then columns of values measured at each position at different times. The column headers are datetime objects (produced with pd.to_datetime).
I would like to extract data from the same date and save them into a new file.
Here is a simple example of two such dataframes.
df1:
2015-03-13 14:37:00 2015-03-13 14:38:00 2015-03-13 14:38:15 \
0.0 24.49393 24.56345 24.50552
0.5 24.45346 24.54904 24.60773
1.0 24.46216 24.55267 24.74365
1.5 24.55414 24.63812 24.80463
2.0 24.68079 24.76758 24.78552
2.5 24.79236 24.83005 24.72879
3.0 24.83691 24.78308 24.66727
3.5 24.78452 24.73071 24.65085
4.0 24.65857 24.79398 24.72290
4.5 24.56390 24.93515 24.83267
5.0 24.62161 24.96939 24.87366
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
and df2:
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30 \
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
2015-05-23 12:25:00 2015-05-23 12:26:00 2015-05-23 12:26:30
0.0 10.31052 10.132660 10.176910
0.5 10.26834 10.086910 10.252720
1.0 10.27393 10.165890 10.276670
1.5 10.29330 10.219090 10.335910
2.0 10.24432 10.193940 10.406430
2.5 10.11618 10.157470 10.323120
3.0 10.02454 10.110720 10.115360
3.5 10.08716 10.010680 9.997345
4.0 10.23868 9.905670 10.008090
4.5 10.27216 9.879425 9.979645
5.0 10.10693 9.919800 9.870361
df1 has data from 13 March and 19 May, df2 has data from 19 May and 23 May. From these two dataframes containing data from 3 days, I would like to get 3 dataframes (or csv files or any other object), one for each day.
(And for a real-life example, multiply the number of lines, columns and files by some hundred.)
In the worst case I can specify the dates in a separate list, but I am still failing to extract these dates from the dataframes.
I did have an idea of a nested loop:
for df in dataframes:
    for d in dates:
        new_df = df[d]
but I can't get the date from the datetime.

First concatenate all the DataFrames along the columns, then group the columns by their date part (via strftime) and turn the groupby object into a dictionary of DataFrames keyed by date string:
df = pd.concat([df1, df2, dfN], axis=1)
dfs = dict(tuple(df.groupby(df.columns.strftime('%Y-%m-%d'), axis=1)))
# select one day's DataFrame
print(dfs['2015-03-13'])
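Since the goal is also to write each day's slice out to its own file, the dictionary can be built with a plain comprehension as well. This is a minimal sketch with tiny stand-in frames; it avoids groupby(axis=1), which is deprecated in recent pandas, and the filename pattern is just illustrative:

```python
import pandas as pd

# small stand-ins for df1/df2 above: position index, datetime column labels
df1 = pd.DataFrame([[24.49, 8.83], [24.45, 8.73]], index=[0.0, 0.5],
                   columns=pd.to_datetime(["2015-03-13 14:37:00",
                                           "2015-05-19 11:33:00"]))
df2 = pd.DataFrame([[8.72, 10.31], [8.74, 10.26]], index=[0.0, 0.5],
                   columns=pd.to_datetime(["2015-05-19 11:33:15",
                                           "2015-05-23 12:25:00"]))

df = pd.concat([df1, df2], axis=1)
days = df.columns.strftime("%Y-%m-%d")        # one date string per column
dfs = {day: df.loc[:, days == day] for day in pd.unique(days)}

for day, sub in dfs.items():
    sub.to_csv(f"{day}.csv")                  # one CSV per day
```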

Related

Merging 2 dataframes and sorting by datetime Pandas Python

I want to produce code that creates an additional table from the dataframe data. The new dataframe data2 will have the following changes:
label will be New instead of Old
col1's last index will be deleted
col2's first index will be deleted
date's first index will be deleted and all date values will have 1 minute subtracted
Then I want to concatenate the two data frames into one data frame called merge and sort it by dates. Since the first index of data2 is dropped, the order of merge should be label: New, Old, New, Old. How can I subtract 1 minute from date_mod and merge the two data frames in order of dates?
import pandas as pd
d = {'col1': [4, 5, 2, 2, 3, 5, 1, 1, 6],
     'col2': [6, 2, 1, 7, 3, 5, 3, 3, 9],
     'label': ['Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old', 'Old'],
     'date': ['2022-01-24 10:07:02', '2022-01-27 01:55:03', '2022-01-30 19:09:03',
              '2022-02-02 14:34:06', '2022-02-08 12:37:03', '2022-02-10 03:07:02',
              '2022-02-10 14:02:03', '2022-02-11 00:32:25', '2022-02-12 21:42:03']}
data = pd.DataFrame(d)
'''
Additional Dataframe
label will have New
'col1'`s last index will be deleted
'col2'`s first index will be deleted
'date'`s first index will be deleted and all date values will have 1 minute subtracted
'''
a = data['col1'].drop(data['col1'].index[-1])
b = data['col2'].drop(data['col2'].index[0])
# subtract date_mod by 1 minute
date_mod = pd.to_datetime(data['date'][1:])
data2 = pd.DataFrame({'col1': a, 'col2': b,
                      'label': ['New', 'New', 'New', 'New', 'New', 'New', 'New', 'New'],
                      'date': date_mod})
'''
Merging data and data2
Sort by 'date'
Should go in order as Old, New, Old, New ...
The length of the columns is 1 less than that of data because of the dropped indexes
'''
merge=pd.merge(data,displayer)
The simplest way I can think of: place all the adjustments into a function and apply it to a copy of the original dataframe, then simply concat and sort:
data.date = pd.to_datetime(data.date)  # convert 'date' strings to datetime so 1 minute can be subtracted later

def adjust_data(df):
    df['col1'] = df['col1'].drop(df['col1'].index[-1])
    df['col2'] = df['col2'].drop(df['col2'].index[0])
    df.date = df.date - pd.Timedelta(minutes=1)  # subtract 1 minute from the datetime
    df.label = df.label.replace('Old', 'New')    # change values in the column "label"

data2 = data.copy()
adjust_data(data2)  # apply function to data2 (mutates it in place)
# concat both dataframes and sort by column "date"
merge = pd.concat([data, data2], axis=0).sort_values(by=['date']).reset_index(drop=True)
print(merge)
out:
col1 col2 label date
0 4.0 NaN New 2022-01-24 10:06:02
1 4.0 6.0 Old 2022-01-24 10:07:02
2 5.0 2.0 New 2022-01-27 01:54:03
3 5.0 2.0 Old 2022-01-27 01:55:03
4 2.0 1.0 New 2022-01-30 19:08:03
5 2.0 1.0 Old 2022-01-30 19:09:03
6 2.0 7.0 New 2022-02-02 14:33:06
7 2.0 7.0 Old 2022-02-02 14:34:06
8 3.0 3.0 New 2022-02-08 12:36:03
9 3.0 3.0 Old 2022-02-08 12:37:03
10 5.0 5.0 New 2022-02-10 03:06:02
11 5.0 5.0 Old 2022-02-10 03:07:02
12 1.0 3.0 New 2022-02-10 14:01:03
13 1.0 3.0 Old 2022-02-10 14:02:03
14 1.0 3.0 New 2022-02-11 00:31:25
15 1.0 3.0 Old 2022-02-11 00:32:25
16 NaN 9.0 New 2022-02-12 21:41:03
17 6.0 9.0 Old 2022-02-12 21:42:03

How to create a dataframe from series object when iterating

I am iterating and as a result of a single iteration I acquire a pandas series object which looks like this:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_SK 315.07
dtype: float64
Sometimes I might get different names and lengths of this series for example I might get:
DE_AT 118.55
DE_CZ 62.73
PL_DE 263.36
PL_NL 315.07
PL_UK 420
dtype: float64
Now I want to create a dataframe from these series objects when iterating such that I will have all names as the index, from these two series objects I would like to get:
index 1 2
DE_AT 118.55 118.55
DE_CZ 62.73 62.73
PL_DE 263.36 263.36
PL_SK 315.07 NaN
PL_NL NaN 315.07
PL_UK NaN 420
Or maybe I can store them in a list and later create a dataframe?
Basic outer join of two series:
s1 = pd.Series(index=["DE_AT", "DE_CZ", "PL_DE", "PL_SK"], data=[1, 2, 3, 4]).to_frame()
s2 = pd.Series(index=["DE_AT", "DE_CZ", "PL_DE", "PL_NL", "PL_UK"], data=[1, 2, 3, 4, 5]).to_frame()
s1.join(s2, how="outer", lsuffix="1", rsuffix="2")
Output:
        01   02
DE_AT  1.0  1.0
DE_CZ  2.0  2.0
PL_DE  3.0  3.0
PL_NL  NaN  4.0
PL_SK  4.0  NaN
PL_UK  NaN  5.0
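To answer the "store them in a list" idea directly: collecting the series during the iteration and concatenating once at the end also works, and avoids repeated joins inside the loop. A minimal sketch, with the loop standing in for the real iteration and the numeric column keys chosen only as an example:

```python
import pandas as pd

s1 = pd.Series({"DE_AT": 118.55, "DE_CZ": 62.73, "PL_DE": 263.36, "PL_SK": 315.07})
s2 = pd.Series({"DE_AT": 118.55, "DE_CZ": 62.73, "PL_DE": 263.36,
                "PL_NL": 315.07, "PL_UK": 420.0})

collected = []
for s in (s1, s2):      # stand-in for the real iteration
    collected.append(s)

# axis=1 concat outer-joins on the index by default; keys number the columns
result = pd.concat(collected, axis=1, keys=range(1, len(collected) + 1))
print(result)
```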

Pandas Dataframe: grouping by index keeping only notnan value in each column

I have dataframes similar to the following ones:
,A,B
2020-01-15,1,
2020-01-15,,2
2020-01-16,3,
2020-01-16,,4
2020-01-17,5,
2020-01-17,,6
,A,B,C
2020-01-15,1,
2020-01-15,,2
2020-01-15,,,3
2020-01-16,4,
2020-01-16,,5
2020-01-16,,,6
2020-01-17,7,
2020-01-17,,8
2020-01-17,,,9
I need to transform them to the following:
,A,B
2020-01-15,1,2
2020-01-16,3,4
2020-01-17,5,6
,A,B,C
2020-01-15,1,2,3
2020-01-16,4,5,6
2020-01-17,7,8,9
I have tried with groupby().first() without success
Let us do groupby + first:
s = df.groupby(level=0).first()
              A    B
2020-01-15  1.0  2.0
2020-01-16  3.0  4.0
2020-01-17  5.0  6.0
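A self-contained version of that answer, reading the first sample from an in-memory CSV (the io.StringIO wrapper stands in for a real file):

```python
import io
import pandas as pd

csv_text = """,A,B
2020-01-15,1,
2020-01-15,,2
2020-01-16,3,
2020-01-16,,4
2020-01-17,5,
2020-01-17,,6
"""
df = pd.read_csv(io.StringIO(csv_text), index_col=0)
# first() returns the first non-NaN value per column within each index group,
# which collapses the staggered rows into one row per date
out = df.groupby(level=0).first()
print(out)
```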

Pandas slicing data with MultiIndex

I have some features that I want to write to some csv files. I want to use pandas for this approach if possible.
I am following the instruction in here and have created some dummy data to check it out. Basically there are some activities with a random number of features belonging to them.
import io
data = io.StringIO('''Activity,id,value,value,value,value,value,value,value,value,value
Run,1,1,2,2,5,6,4,3,2,1
Run,1,2,4,4,10,12,8,6,4,2
Stand,2,1.5,3.,3.,7.5,9.,6.,4.5,3.,1.5
Sit,3,0.5,1.,1.,2.5,3.,2.,1.5,1.,0.5
Sit,3,0.6,1.2,1.2,3.,3.6,2.4,1.8,1.2,0.6
Run, 2, 0.8, 1.6, 1.6, 4. , 4.8, 3.2, 2.4, 1.6, 0.8
''')
df_unindexed = pd.read_csv(data)
df = df_unindexed.set_index(['Activity', 'id'])
When I run:
df.xs('Run')
I get
value value.1 value.2 value.3 value.4 value.5 value.6 value.7 \
id
1 1.0 2.0 2.0 5.0 6.0 4.0 3.0 2.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0 4.0
2 0.8 1.6 1.6 4.0 4.8 3.2 2.4 1.6
value.8
id
1 1.0
1 2.0
2 0.8
which is almost what I want, that is, all Run activities. I want to remove the first row and first column, i.e. the header and the id column. How do I achieve this?
A second question: when I want only one activity, how do I get it?
When using
idx = pd.IndexSlice
df.loc[idx['Run', 1], :]
gives
value value.1 value.2 value.3 value.4 value.5 value.6 \
Activity id
Run 1 1.0 2.0 2.0 5.0 6.0 4.0 3.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0
value.7 value.8
Activity id
Run 1 2.0 1.0
1 4.0 2.0
but slicing does not work as I would expect. For example trying
df.loc[idx['Run', 1], 2:11]
instead produces an error:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [2] of <class 'int'>
So, how do I get my features in this place?
P.S. If it's not clear, I am new to Pandas, so be gentle. Also, the column id can be edited to be unique to each activity, or to the whole dataset, if that makes things easier.
You can use a little hack - get column names by position, because iloc for MultiIndex is not yet supported:
print (df.columns[2:11])
Index(['value.2', 'value.3', 'value.4', 'value.5', 'value.6', 'value.7',
'value.8'],
dtype='object')
idx = pd.IndexSlice
print (df.loc[idx['Run', 1], df.columns[2:11]])
value.2 value.3 value.4 value.5 value.6 value.7 value.8
Activity id
Run 1 2.0 5.0 6.0 4.0 3.0 2.0 1.0
1 4.0 10.0 12.0 8.0 6.0 4.0 2.0
If you want to save the file to CSV without the index and header:
df.xs('Run').to_csv(file, index=False, header=False)
I mostly look at https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer when I'm stuck with these kind of issues.
Without any testing I think you can remove rows and columns like
df = df.drop(['rowindex'], axis=0)
df = df.drop(['colname'], axis=1)
Avoid the problem by recognizing the index columns at CSV read-time:
pd.read_csv(data,
            header=0,          # read the first row in as the header row
            index_col=['id'])  # or index_col=0, to pick the index column
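A runnable sketch of that suggestion, using a trimmed copy of the question's data and passing both label columns as the index, so the ('Activity', 'id') MultiIndex is built at read time and no set_index step is needed afterwards (the StringIO is a stand-in for the real CSV file):

```python
import io
import pandas as pd

data = io.StringIO("""Activity,id,value,value,value
Run,1,1,2,2
Run,1,2,4,4
Stand,2,1.5,3.0,3.0
""")
# header=0 reads the first row as column names; index_col=[0, 1]
# makes Activity and id the MultiIndex directly at read time
df = pd.read_csv(data, header=0, index_col=[0, 1])
print(df.xs("Run"))
```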

Is it possible to write and read multiple DataFrames to/from one single file?

I'm currently dealing with a set of similar DataFrames having a double Header.
They have the following structure:
age height weight shoe_size
RHS height weight shoe_size
0 8.0 6.0 2.0 1.0
1 8.0 NaN 2.0 1.0
2 6.0 1.0 4.0 NaN
3 5.0 1.0 NaN 0.0
4 5.0 NaN 1.0 NaN
5 3.0 0.0 1.0 0.0
height weight shoe_size age
RHS weight shoe_size age
0 1.0 1.0 NaN NaN
1 1.0 2.0 0.0 2.0
2 1.0 NaN 0.0 5.0
3 1.0 2.0 0.0 NaN
4 0.0 1.0 0.0 3.0
Actually the main differences are the sorting of the first Header row, which could be made the same for all of them, and the position of the RHS header column in the second Header row. I'm currently wondering if there is an easy way of saving/reading all these DataFrames into/from a single CSV file instead of having a different CSV file for each of them.
Unfortunately, there isn't any reasonable way to store multiple dataframes in a single CSV such that retrieving each one would not be excessively cumbersome, but you can use pd.ExcelWriter and save to separate sheets in a single .xlsx file:
import pandas as pd

writer = pd.ExcelWriter('file.xlsx')
for i, df in enumerate(df_list):
    df.to_excel(writer, 'sheet{}'.format(i))
writer.save()  # in newer pandas, use writer.close() or a `with` block instead
Taking your example (with random numbers instead of your values):
import pandas as pd
import numpy as np
h1 = [['age', 'height', 'weight', 'shoe_size'],['RHS','height','weight','shoe_size']]
df1 = pd.DataFrame(np.random.randn(3, 4), columns=h1)
h2 = [['height', 'weight', 'shoe_size','age'],['RHS','weight','shoe_size','age']]
df2 = pd.DataFrame(np.random.randn(3, 4), columns=h2)
First, reorder your columns (see How to change the order of DataFrame columns?):
df3 = df2[h1[0]]
Then, concatenate the two dataframes (see Merge, join, and concatenate):
df4 = pd.concat([df1,df3])
I don't know how you want to deal with the second row of your header (for now, it's just using two sub-columns, which is not very elegant). If, from your point of view, this row is meaningless, just reset your headers as you want before concatenating:
df1.columns=h1[0]
df3.columns=h1[0]
df5 = pd.concat([df1,df3])
Finally, save it in CSV format (pandas.DataFrame.to_csv):
df4.to_csv('file_name.csv',sep=',')
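As an aside beyond the answers above: if the frames only ever need to be read back by pandas (not opened in a spreadsheet), pickling a dict of DataFrames gives a true single-file round trip, double header included. A minimal sketch; the file name and dict keys are arbitrary:

```python
import pandas as pd
import numpy as np

# two frames with a two-level column header, as in the question
h1 = [['age', 'height'], ['RHS', 'height']]
frames = {'first': pd.DataFrame(np.random.randn(3, 2), columns=h1),
          'second': pd.DataFrame(np.random.randn(2, 2), columns=h1)}

pd.to_pickle(frames, 'frames.pkl')   # any picklable object, one file
loaded = pd.read_pickle('frames.pkl')
```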
