Python Pandas combining timestamp columns and fillna in read_csv

I'm reading a csv file with Pandas. The format is:
Date Time x1 x2 x3 x4 x5
3/7/2012 11:09:22 13.5 2.3 0.4 7.3 6.4
12.6 3.4 9.0 3.0 7.0
3.6 4.4 8.0 6.0 5.0
10.6 3.5 1.0 3.0 8.0
...
3/7/2012 11:09:23 10.5 23.2 0.3 7.8 4.4
11.6 13.4 19.0 13.0 17.0
...
As you can see, not every row has a timestamp. Every row without a timestamp is from the same 1-second interval as the closest row above it that does have a timestamp.
I am trying to do 3 things:
1. combine the Date and Time columns to get a single timestamp column.
2. convert that column to have units of seconds.
3. fill empty cells to have the appropriate timestamp.
The desired end result is an array with the timestamp, in seconds, at each row.
I am not sure how to quickly convert the timestamps into units of seconds, other than with a slow for loop and the Python built-in time.mktime method.
When I fill in the missing timestamp values, the problem is that the Date and Time cells in rows without a timestamp each get a "nan" value, and when merged they produce a cell with the string "nan nan". The fillna() method then doesn't interpret "nan nan" as a NaN.
The following code produces the problem result (it does not include the attempted conversion to seconds):
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',', parse_dates={'CorrectTime':[0,1]}, usecols=[0,1,2,4,6], names=['Date','Time','x1','x3','x5'])
df.fillna(method='ffill', axis=0, inplace=True)
Thanks for your help.

Assuming you want seconds since Jan 1, 1900...
import pandas
from io import StringIO
import datetime
data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pandas.read_csv(data, parse_dates=['Date']).fillna(method='ffill')
def dealwithdates(row):
    datestring = row['Date'].strftime('%Y-%m-%d')
    dtstring = '{} {}'.format(datestring, row['Time'])
    date = datetime.datetime.strptime(dtstring, '%Y-%m-%d %H:%M:%S')
    refdate = datetime.datetime(1900, 1, 1)
    return (date - refdate).total_seconds()
df['ordinal'] = df.apply(dealwithdates, axis=1)
print(df)
Date Time x1 x2 x3 x4 x5 ordinal
0 2012-03-07 11:09:22 13.5 2.3 0.4 7.3 6.4 3540107362
1 2012-03-07 11:09:22 12.6 3.4 9.0 3.0 7.0 3540107362
2 2012-03-07 11:09:22 3.6 4.4 8.0 6.0 5.0 3540107362
3 2012-03-07 11:09:22 10.6 3.5 1.0 3.0 8.0 3540107362
4 2012-03-07 11:09:23 10.5 23.2 0.3 7.8 4.4 3540107363
5 2012-03-07 11:09:23 11.6 13.4 19.0 13.0 17.0 3540107363
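If the row-wise apply proves too slow on a large file, the whole conversion can be vectorized: forward-fill the raw Date and Time strings before parsing, then subtract the reference date from the entire column at once. A minimal sketch, assuming the file layout from the question and the same 1900-01-01 reference:
import pandas as pd
# Read the raw strings, forward-fill the gaps, then parse once.
df = pd.read_csv('file.csv')
df[['Date', 'Time']] = df[['Date', 'Time']].ffill()
ts = pd.to_datetime(df['Date'] + ' ' + df['Time'], format='%m/%d/%Y %H:%M:%S')
# Whole-column subtraction yields seconds since the reference date.
df['ordinal'] = (ts - pd.Timestamp('1900-01-01')).dt.total_seconds()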

Related

Python: Rolling Minimum by date interval

Thanks for taking the time to read this question.
I am using time series data which is reported weekly. I am trying to calculate, for each row, the minimum value over the preceding 3 years, which I have done using the code below. Since the data is reported weekly, that minimum is taken over the previous 156 rows. The Spec_Min column holds that 3-year minimum for each row.
However, halfway through my data it begins to be reported twice a month, but I still need the minimum over 3 years, which is then no longer 156 rows. Is there a simpler way of doing this?
Perhaps by date rather than by row count, but I am not sure how to do that.
df1['Spec_Min']=df1['Spec_NET'].rolling(156).min()
df1
Date Spec_NET Hed_NET Spec_Min
1995-10-31 9.0 -13.5 -49.7
1995-11-07 11.9 -23.5 -49.7
1995-11-14 9.8 -19.4 -49.7
1995-11-21 9.7 -25.4 -49.7
1995-11-28 10.4 -20.3 -49.7
1995-12-05 1.6 -15.3 -49.7
1995-12-12 -17.0 14.2 -49.7
1995-12-19 -16.6 15.2 -49.7
1995-12-26 4.7 -15.2 -49.7
1996-01-02 5.3 -22.7 -49.7
1996-01-16 7.3 -21.0 -49.7
1996-01-23 1.3 -20.4 -49.7
Pandas lets you use a datetime-aware rolling window. You just need to express the window length in days (365 * 3 for 3 years).
Using your provided sample DataFrame:
df['Spec_Min'] = df.rolling(f'{365 * 3}D', on='Date')['Spec_NET'].min()
print(df)
Date Spec_NET Hed_NET Spec_Min
0 1995-10-31 9.0 -13.5 9.0
1 1995-11-07 11.9 -23.5 9.0
2 1995-11-14 9.8 -19.4 9.0
3 1995-11-21 9.7 -25.4 9.0
4 1995-11-28 10.4 -20.3 9.0
5 1995-12-05 1.6 -15.3 1.6
6 1995-12-12 -17.0 14.2 -17.0
7 1995-12-19 -16.6 15.2 -17.0
8 1995-12-26 4.7 -15.2 -17.0
9 1996-01-02 5.3 -22.7 -17.0
10 1996-01-16 7.3 -21.0 -17.0
11 1996-01-23 1.3 -20.4 -17.0
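One caveat: a time-based window only works when the on-column (or the index) has a datetime dtype, so if Date was read from a CSV as strings, convert it first. A minimal sketch:
import pandas as pd
# Time-based rolling requires a datetime column or index.
df['Date'] = pd.to_datetime(df['Date'])
df['Spec_Min'] = df.rolling('1095D', on='Date')['Spec_NET'].min()  # 365 * 3 days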
Try something like this:
(if your index is already a DatetimeIndex, skip the first two lines)
df.set_index('Date', inplace=True, drop=True)
df.index = pd.to_datetime(df.index)
# resample your dataframe to weekly frequency, and interpolate missing values
conformed = df.resample('W-MON').mean().interpolate(method='nearest')
n_weeks = 156  # the length of the rolling window in weeks (3 years of weekly data)
result = conformed.rolling(n_weeks).min()
Note that you say you want the minimum of each row, but what you are computing is the rolling minimum of a column.

summing the values row wise

I have a three column of data as arranged below:
Input file:
>>>>>
1.0 2.0 3.0
2.0 2.0 4.0
3.0 4.5 8.0
>>>>>
1.0 2.5 6.8
2.0 3.5 6.8
3.0 1.2 1.9
>>>>>
1.0 1.2 1.3
2.0 2.7 1.8
3.0 4.5 8.5
In the input file the first-column values repeat across the blocks, so I want to keep each first-column value only once, sum the third-column values row-wise across the blocks, and not take any second-column values.
I also want to append a third column with the fixed value 1.0.
Finally, I want to save the result to another text file called output.txt.
Output:
1.0 11.1 1.0
2.0 12.6 1.0
3.0 18.4 1.0
The second-column values in the output come from the following sums:
3.0+6.8+1.3
4.0+6.8+1.8
8.0+1.9+8.5
I tried with numpy but I am getting an error:
import numpy as np
import pandas as pd
import glob
data=np.loadtxt("input.txt")
Read your input file with pandas.read_csv, setting the delimiter to " ", no header, and ">" as the comment character.
Then perform the groupby/sum operation and export without a header using DataFrame.to_csv:
import pandas as pd
# input
df = pd.read_csv('filename.csv', delimiter=' ', header=None, comment='>')
# output
(df.groupby(0)[[2]].sum()
   .assign(col=1.0)
   .to_csv('output.txt', header=False, sep=' ', float_format='%.2f')
)
output.txt:
1.00 11.10 1.00
2.00 12.60 1.00
3.00 18.40 1.00
Try:
df[2].groupby(np.arange(len(df)) % 3).sum()
# or df.iloc[:, 2].groupby(np.arange(len(df)) % 3).sum()
# np.arange(len(df)) % 3 labels the rows 0,1,2,0,1,2,... so rows at the same
# position within each 3-row block share a group key (assumes 3-row blocks).
0 11.1
1 12.6
2 18.4
Name: 2, dtype: float64
Use groupby with reset_index:
dfNew = df.groupby(0)[2].sum().reset_index()
dfNew.to_csv('output.txt', index=False)
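For completeness, the numpy route the question started down also works once the '>>>>>' separator lines are skipped. A sketch, assuming every block lists the same first-column keys in the same order:
import numpy as np
# loadtxt chokes on the '>>>>>' lines; treat '>' as a comment character.
data = np.loadtxt('input.txt', comments='>')
# Reshape into (blocks, rows_per_block, columns) and sum the third column
# across blocks; this assumes equal-sized blocks with identical key order.
n_keys = len(np.unique(data[:, 0]))
blocks = data.reshape(-1, n_keys, 3)
out = np.column_stack([blocks[0, :, 0],               # the repeated first column
                       blocks[:, :, 2].sum(axis=0),   # summed third column
                       np.ones(n_keys)])              # fixed 1.0 column
np.savetxt('output.txt', out, fmt='%.1f')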

Pandas data manipulation - multiple measurements per line to one per line [duplicate]

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 4 years ago.
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
    pd.concat([row]*3, ignore_index=True)
Thank you!
It's a wide-to-long problem:
pd.wide_to_long(df, 'Meas', i=['Location', 'Nominal'], j='drop').reset_index().drop('drop', axis=1)
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1
Another solution, using melt:
new_df = (df.melt(['Location', 'Nominal'],
                  ['Meas1', 'Meas2', 'Meas3'],
                  value_name='Meas')
            .drop('variable', axis=1)
            .sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1
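Since the question asks for the measured value in the first column, either result can be reordered afterwards with plain column selection:
new_df = new_df[['Meas', 'Location', 'Nominal']]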

T.Test with pandas dataframes

I have 2 data frames pd and pd2:
pd
Name A B Mean
t1 1.0 2.0 1.5
t2 2.0 3.0 2.5
t3 9.4 3.3 6.35
pd2
Name A B Mean
t1 1.1 2.7 1.9
t2 3.7 3.0 3.35
t3 10.4 4.3 7.35
I would like to do the t-test calculation for column 'A' across the two dataframes and for column 'B' across the two dataframes. The result can be added to one of the dataframes or to a new dataframe. The output should have the columns:
ttestA ttestB ttestC ...etc
Using a for loop:
from scipy import stats
l = []
listofname = ['A', 'B']
for x in listofname:
    l.append(stats.ttest_ind(df[x], df2[x], equal_var=False))
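To get the ttestA/ttestB column layout the question asks for, the loop's results can be collected straight into a DataFrame. A sketch under the same assumptions (df and df2 stand for the question's pd and pd2, renamed so they don't shadow the pandas import; the layout is illustrative):
from scipy import stats
import pandas as pd
# One column per tested variable, with rows for the statistic and p-value.
cols = ['A', 'B']
results = {'ttest' + c: stats.ttest_ind(df[c], df2[c], equal_var=False)
           for c in cols}
ttests = pd.DataFrame(results, index=['statistic', 'pvalue'])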

Grouping columns of pandas dataframe in datetime format

I have two questions:
1) Is there something like pandas groupby but applicable on columns (df.columns, not the data within)?
2) How can I extract the "date" from a datetime object?
I have lots of pandas dataframes (or csv files) that have a position column (which I use as the index) and then columns of values measured at each position at different times. Each column header is a datetime object (or is converted with pd.to_datetime).
I would like to extract data from the same date and save them into a new file.
Here is a simple example of two such dataframes.
df1:
2015-03-13 14:37:00 2015-03-13 14:38:00 2015-03-13 14:38:15 \
0.0 24.49393 24.56345 24.50552
0.5 24.45346 24.54904 24.60773
1.0 24.46216 24.55267 24.74365
1.5 24.55414 24.63812 24.80463
2.0 24.68079 24.76758 24.78552
2.5 24.79236 24.83005 24.72879
3.0 24.83691 24.78308 24.66727
3.5 24.78452 24.73071 24.65085
4.0 24.65857 24.79398 24.72290
4.5 24.56390 24.93515 24.83267
5.0 24.62161 24.96939 24.87366
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
and df2:
2015-05-19 11:33:00 2015-05-19 11:33:15 2015-05-19 11:33:30 \
0.0 8.836121 8.726685 8.710449
0.5 8.732880 8.742462 8.687408
1.0 8.881165 8.935120 8.925903
1.5 9.043396 9.092651 9.204041
2.0 9.080902 9.153839 9.329681
2.5 9.128815 9.183777 9.296509
3.0 9.191254 9.121643 9.207397
3.5 9.131866 8.975372 9.160248
4.0 8.966003 8.951813 9.195221
4.5 8.846924 9.074982 9.264099
5.0 8.848663 9.101593 9.283081
2015-05-23 12:25:00 2015-05-23 12:26:00 2015-05-23 12:26:30
0.0 10.31052 10.132660 10.176910
0.5 10.26834 10.086910 10.252720
1.0 10.27393 10.165890 10.276670
1.5 10.29330 10.219090 10.335910
2.0 10.24432 10.193940 10.406430
2.5 10.11618 10.157470 10.323120
3.0 10.02454 10.110720 10.115360
3.5 10.08716 10.010680 9.997345
4.0 10.23868 9.905670 10.008090
4.5 10.27216 9.879425 9.979645
5.0 10.10693 9.919800 9.870361
df1 has data from 13 March and 19 May, df2 has data from 19 May and 23 May. From these two dataframes containing data from 3 days, I would like to get 3 dataframes (or csv files or any other object), one for each day.
(And for a real-life example, multiply the number of lines, columns and files by some hundred.)
In the worst case I can specify the dates in a separate list, but I am still failing to extract these dates from the dataframes.
I did have an idea of a nested loop:
for df in dataframes:
    for d in dates:
        new_df = df[d]
but I can't get the date from the datetime.
First concat all the DataFrames along the columns, then group the columns by their strftime-formatted dates and build a dictionary of per-day DataFrames keyed by those date strings:
df = pd.concat([df1,df2, dfN], axis=1)
dfs = dict(tuple(df.groupby(df.columns.strftime('%Y-%m-%d'), axis=1)))
#select DataFrame
print (dfs['2015-03-13'])
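This also answers the second question: strftime on the DatetimeIndex (or .date() on a single Timestamp) extracts just the date part. From there, writing one file per day is a plain loop over the dictionary, since the keys are already 'YYYY-MM-DD' strings (the file naming is illustrative):
# Write one CSV per day.
for day, frame in dfs.items():
    frame.to_csv('{}.csv'.format(day))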
