Pandas data manipulation - multiple measurements per line to one per line [duplicate]

Pandas data manipulation - multiple measurements per line to one per line [duplicate] - python

This question already has answers here:
Reshape wide to long in pandas
(2 answers)
Closed 4 years ago.
I am manipulating a data frame using Pandas in Python to match a specific format.
I currently have a data frame with a row for each measurement location (A or B). Each row has a nominal target and multiple measured data points.
This is the format I currently have:
df=
Location Nominal Meas1 Meas2 Meas3
A 4.0 3.8 4.1 4.3
B 9.0 8.7 8.9 9.1
I need to manipulate this data so there is only one measured data point per row, and copy the Location and Nominal values from the source rows to the new rows. The measured data also needs to be put in the first column.
This is the format I need:
df =
Meas Location Nominal
3.8 A 4.0
4.1 A 4.0
4.3 A 4.0
8.7 B 9.0
8.9 B 9.0
9.1 B 9.0
I have tried concat and append functions with and without transpose() with no success.
This is the most similar example I was able to find, but it did not get me there:
for index, row in df.iterrows():
pd.concat([row]*3, ignore_index=True)
Thank you!

Its' a wide to long problem
pd.wide_to_long(df,'Meas',i=['Location','Nominal'],j='drop').reset_index().drop('drop',1)
Out[637]:
Location Nominal Meas
0 A 4.0 3.8
1 A 4.0 4.1
2 A 4.0 4.3
3 B 9.0 8.7
4 B 9.0 8.9
5 B 9.0 9.1

Another solution, using melt:
new_df = (df.melt(['Location','Nominal'],
['Meas1', 'Meas2', 'Meas3'],
value_name = 'Meas')
.drop('variable', axis=1)
.sort_values('Location'))
>>> new_df
Location Nominal Meas
0 A 4.0 3.8
2 A 4.0 4.1
4 A 4.0 4.3
1 B 9.0 8.7
3 B 9.0 8.9
5 B 9.0 9.1

Related

Python: Rolling Minimum by date interval

thanks for taking the time to read this question.
I am using time series data which is reported weekly. I am trying to calculate the minimum value of each row over 3 years which I have done using the code below. Since the data is reported weekly for each row it would be the minimum value of 156 rows (3yrs before). The column Spec_Min details the minimum value for each row over 3 years.
However, halfway through my data, it begins to be reported twice a month but I still need to have the minimum values over 3 years therefore no longer 156 rows later. I was wondering if there was a more simple way of doing this?
Perhaps doing it via date rather than rows but I am not sure how to do that.
df1['Spec_Min']=df1['Spec_NET'].rolling(156).min()
df1
Date Spec_NET Hed_NET Spec_Min
1995-10-31 9.0 -13.5 -49.7
1995-11-07 11.9 -23.5 -49.7
1995-11-14 9.8 -19.4 -49.7
1995-11-21 9.7 -25.4 -49.7
1995-11-28 10.4 -20.3 -49.7
1995-12-05 1.6 -15.3 -49.7
1995-12-12 -17.0 14.2 -49.7
1995-12-19 -16.6 15.2 -49.7
1995-12-26 4.7 -15.2 -49.7
1996-01-02 5.3 -22.7 -49.7
1996-01-16 7.3 -21.0 -49.7
1996-01-23 1.3 -20.4 -49.7

Pandas allows you to operate with a datetime aware rolling window. You'll need to structure your code to operate in terms of the number of days (365 * 3 for 3 years).
I used your provided sample DataFrame
df['Spec_Min'] = df.rolling(f'{365 * 3}D', on='Date')['Spec_NET'].min()
print(df)
Date Spec_NET Hed_NET Spec_Min
0 1995-10-31 9.0 -13.5 9.0
1 1995-11-07 11.9 -23.5 9.0
2 1995-11-14 9.8 -19.4 9.0
3 1995-11-21 9.7 -25.4 9.0
4 1995-11-28 10.4 -20.3 9.0
5 1995-12-05 1.6 -15.3 1.6
6 1995-12-12 -17.0 14.2 -17.0
7 1995-12-19 -16.6 15.2 -17.0
8 1995-12-26 4.7 -15.2 -17.0
9 1996-01-02 5.3 -22.7 -17.0
10 1996-01-16 7.3 -21.0 -17.0
11 1996-01-23 1.3 -20.4 -17.0

Try something like this:
(if your index is already a datetimeindex, skip the first two rows)
df.set_index('Date',inplace = True,drop = True)
df.index = pd.to_datetime(df.index)
# resample your dataframe in weekly frequency, and interpolate missing values
conformed = df.resample('W-MON').mean().interpolate(method = 'nearest')
n_weeks = 3 # the length of the rolling window (in weeks)
result = conformed.rolling(n_weeks).min()
Note that, you mention that you want the minimum of each row. But it seems like you are calculating the rolling minimum of each column...

summing the values row wise

I have a three column of data as arranged below:
Input file:
>>>>>
1.0 2.0 3.0
2.0 2.0 4.0
3.0 4.5 8.0
>>>>>
1.0 2.5 6.8
2.0 3.5 6.8
3.0 1.2 1.9
>>>>>
1.0 1.2 1.3
2.0 2.7 1.8
3.0 4.5 8.5
In the above input file the first column values are repeated so I want to take only once that value and want to sum the third column values row wise and do not want to take any second column values.
I also want to append a third column with the fixed value 1.0
Finally want to save the result on another test file called output.txt.
Output:
1.0 11.1 1.0
2.0 12.6 1.0
3.0 18.4 1.0
In the output second column values resulted from is following:
3.0+6.8+1.3
4.0+6.8+1.8
8.0+1.9+8.5
I tried with numpy but getting error:
import numpy as np
import pandas as pd
import glob
data=np.loadtxt("input.txt")

You need to read your input file using pandas.read_csv, you need to set the delimiter to " ", specify no header and ">" as comment lines.
Then perform the groupby/sum operation, and export without header using pandas.to_csv
import pandas as pd
# input
df = pd.read_csv('filename.csv', delimiter=' ', header=None, comment='>')
# output
(df.groupby(0)[[2]].sum()
.assign(col=1.0)
.to_csv('output.txt', header=False, sep=' ', float_format='%.2f')
)
output.txt:
1.00 11.10 1.00
2.00 12.60 1.00
3.00 18.40 1.00

Try:
df[2].groupby(np.arange(len(df)) % 3).sum()
# or df.iloc[:, 2].groupby(np.arange(len(df)) % 3).sum()
0 11.1
1 12.6
2 18.4
Name: 2, dtype: float64

Use groupby with reset index
dfNew = df.groupby(0)[2].sum().reset_index()
dfNew.to_csv('output.txt', index= False)

Pandas slicing data with MultiIndex

I have some features that I want to write to some csv files. I want to use pandas for this approach if possible.
I am following the instruction in here and have created some dummy data to check it out. Basically there are some activities with a random number of features belonging to them.
import io
data = io.StringIO('''Activity,id,value,value,value,value,value,value,value,value,value
Run,1,1,2,2,5,6,4,3,2,1
Run,1,2,4,4,10,12,8,6,4,2
Stand,2,1.5,3.,3.,7.5,9.,6.,4.5,3.,1.5
Sit,3,0.5,1.,1.,2.5,3.,2.,1.5,1.,0.5
Sit,3,0.6,1.2,1.2,3.,3.6,2.4,1.8,1.2,0.6
Run, 2, 0.8, 1.6, 1.6, 4. , 4.8, 3.2, 2.4, 1.6, 0.8
''')
df_unindexed = pd.read_csv(data)
df = df_unindexed.set_index(['Activity', 'id'])
When I run:
df.xs('Run')
I get
value value.1 value.2 value.3 value.4 value.5 value.6 value.7 \
id
1 1.0 2.0 2.0 5.0 6.0 4.0 3.0 2.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0 4.0
2 0.8 1.6 1.6 4.0 4.8 3.2 2.4 1.6
value.8
id
1 1.0
1 2.0
2 0.8
which almost what I want, that is all run activities. I want to remove the 1st row and 1st column, i.e. the header and the id column. How do I achieve this?
Also a second question is when I want only one activity, how do I get it.
When using
idx = pd.IndexSlice
df.loc[idx['Run', 1], :]
gives
value value.1 value.2 value.3 value.4 value.5 value.6 \
Activity id
Run 1 1.0 2.0 2.0 5.0 6.0 4.0 3.0
1 2.0 4.0 4.0 10.0 12.0 8.0 6.0
value.7 value.8
Activity id
Run 1 2.0 1.0
1 4.0 2.0
but slicing does not work as I would expect. For example trying
df.loc[idx['Run', 1], 2:11]
instead produces an error:
TypeError: cannot do slice indexing on with these indexers [2] of 'int'>
So, how do I get my features in this place?
P.S. If not clear I am new to Pandas so be gentle. Also the column id is editable to be unique to each activity or to whole dataset if this makes things easier etc

You can use a little hack - get columns names by positions, because iloc for MultiIndex is not yet supported:
print (df.columns[2:11])
Index(['value.2', 'value.3', 'value.4', 'value.5', 'value.6', 'value.7',
'value.8'],
dtype='object')
idx = pd.IndexSlice
print (df.loc[idx['Run', 1], df.columns[2:11]])
value.2 value.3 value.4 value.5 value.6 value.7 value.8
Activity id
Run 1 2.0 5.0 6.0 4.0 3.0 2.0 1.0
1 4.0 10.0 12.0 8.0 6.0 4.0 2.0
If want save file to csv without index and columns:
df.xs('Run').to_csv(file, index=False, header=None)

I mostly look at https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-integer when I'm stuck with these kind of issues.
Without any testing I think you can remove rows and columns like
df = df.drop(['rowindex'], axis=0)
df = df.drop(['colname'], axis=1)

Avoid the problem by recognizing the index columns at CSV read-time:
pd.read_csv(header=0, # to read in the header row as a header row, and
... index_col=['id'] or index_col=0 to pick the index column.

Rolling Linear Fit with Python DataFrame

I want to perform a moving window linear fit to the columns in my dataframe.
n =5
df = pd.DataFrame(index=pd.date_range('1/1/2000', periods=n))
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
B A
2000-01-01 1.9 3.2
2000-01-02 2.3 1.3
2000-01-03 4.4 5.6
2000-01-04 5.6 9.4
2000-01-05 7.3 10.4
For, say, column B, I want to perform a linear fit using the first two rows, then another linear fit using the second and third rown and so on. And the same for column A. I am only interested in the slope of the fit so at the end, I want a new dataframe with the entries above replaced by the different rolling slopes.
After doing
df.reset_index()
I try something like
model = pd.ols(y=df['A'], x=df['index'], window_type='rolling',window=3)
But I get
KeyError: 'index'
EDIT:
I aded a new column
df['i'] = range(0,len(df))
and I can now run
pd.ols(y=df['A'], x=df.i, window_type='rolling',window=3)
(it gives an error for window=2)
I am not understaing this well because I was expecting a string of numbers but I get just one result
-------------------------Summary of Regression Analysis---------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 3
Number of Degrees of Freedom: 2
R-squared: 0.8981
Adj R-squared: 0.7963
Rmse: 1.1431
F-stat (1, 1): 8.8163, p-value: 0.2068
Degrees of Freedom: model 1, resid 1
-----------------------Summary of Estimated Coefficients--------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
x 2.4000 0.8083 2.97 0.2068 0.8158 3.9842
intercept 1.2667 2.5131 0.50 0.7028 -3.6590 6.1923
---------------------------------End of Summary---------------------------------
EDIT 2:
Now I understand better what is going on. I can acces the different values of the fits using
model.beta

I havent tried it out, but I don't think you need to specify the window_type='rolling', if you specify the window to something, window will automatically be set to rolling.
Source.

I have problems doing this with the DatetimeIndex you created with pd.date_range, and find datetimes a confusing pain to work with in general due to the number of types out there and apparent incompatibility between APIs. Here's how I would do it if the date were an integer (e.g. days since 12/31/99, or years) or float in your example. It won't help your datetime problem, but hopefully it helps with the rolling linear fit part.
Generating your date with integers instead of datetimes:
df = pd.DataFrame()
df['date'] = range(1,6)
df['B'] = [1.9,2.3,4.4,5.6,7.3]
df['A'] = [3.2,1.3,5.6,9.4,10.4]
date B A
0 1 1.9 3.2
1 2 2.3 1.3
2 3 4.4 5.6
3 4 5.6 9.4
4 5 7.3 10.4
Since you want to group by 2 dates every time, then fit a linear model on each group, let's duplicate the records and number each group with the index:
df_dbl = pd.concat([df,df], names = ['date', 'B', 'A']).sort()
df_dbl = df_dbl.iloc[1:-1] # removes the first and last row
date B A
0 1 1.9 3.2 # this record is removed
0 1 1.9 3.2
1 2 2.3 1.3
1 2 2.3 1.3
2 3 4.4 5.6
2 3 4.4 5.6
3 4 5.6 9.4
3 4 5.6 9.4
4 5 7.3 10.4
4 5 7.3 10.4 # this record is removed
c = df_dbl.index[1:len(df_dbl.index)].tolist()
c.append(max(df_dbl.index))
df_dbl.index = c
date B A
1 1 1.9 3.2
1 2 2.3 1.3
2 2 2.3 1.3
2 3 4.4 5.6
3 3 4.4 5.6
3 4 5.6 9.4
4 4 5.6 9.4
4 5 7.3 10.4
Now it's ready to group by index to run linear models on B vs. date, which I learned from Using Pandas groupby to calculate many slopes. I use scipy.stats.linregress since I got weird results with pd.ols and couldn't find good documentation to understand why (perhaps because it's geared toward datetime).
1 0.4
2 2.1
3 1.2
4 1.7

Python Pandas combining timestamp columns and fillna in read_csv

I'm reading a csv file with Pandas. The format is:
Date Time x1 x2 x3 x4 x5
3/7/2012 11:09:22 13.5 2.3 0.4 7.3 6.4
12.6 3.4 9.0 3.0 7.0
3.6 4.4 8.0 6.0 5.0
10.6 3.5 1.0 3.0 8.0
...
3/7/2012 11:09:23 10.5 23.2 0.3 7.8 4.4
11.6 13.4 19.0 13.0 17.0
...
As you can see, not every row has a timestamp. Every row without a timestamp is from the same 1-second interval as the closest row above it that does have a timestamp.
I am trying to do 3 things:
1. combine the Date and Time columns to get a single timestamp column.
2. convert that column to have units of seconds.
3. fill empty cells to have the appropriate timestamp.
The desired end result is an array with the timestamp, in seconds, at each row.
I am not sure how to quickly convert the timestamps into units of seconds, other then to do a slow for loop and use the Python builtin time.mktime method.
Then when I fill in missing timestamp values, the problem is that the cells in the Date and Time columns which did not have a timestamp each get a "nan" value and when merged give a cell with the value "nan nan". Then when I use the fillna() method, it doesn't interpret "nan nan" as being a nan.
I am using the following code to get the problem result (not including the part of trying to convert to seconds):
import pandas as pd
df = pd.read_csv('file.csv', delimiter=',', parse_dates={'CorrectTime':[0,1]}, usecols=[0,1,2,4,6], names=['Date','Time','x1','x3','x5'])
df.fillna(method='ffill', axis=0, inplace=True)
Thanks for your help.

Assuming you want seconds since Jan 1, 1900...
import pandas
from io import StringIO
import datetime
data = StringIO("""\
Date,Time,x1,x2,x3,x4,x5
3/7/2012,11:09:22,13.5,2.3,0.4,7.3,6.4
,,12.6,3.4,9.0,3.0,7.0
,,3.6,4.4,8.0,6.0,5.0
,,10.6,3.5,1.0,3.0,8.0
3/7/2012,11:09:23,10.5,23.2,0.3,7.8,4.4
,,11.6,13.4,19.0,13.0,17.0
""")
df = pandas.read_csv(data, parse_dates=['Date']).fillna(method='ffill')
def dealwithdates(row):
datestring = row['Date'].strftime('%Y-%m-%d')
dtstring = '{} {}'.format(datestring, row['Time'])
date = datetime.datetime.strptime(dtstring, '%Y-%m-%d %H:%M:%S')
refdate = datetime.datetime(1900, 1, 1)
return (date - refdate).total_seconds()
df['ordinal'] = df.apply(dealwithdates, axis=1)
print(df)
Date Time x1 x2 x3 x4 x5 ordinal
0 2012-03-07 11:09:22 13.5 2.3 0.4 7.3 6.4 3540107362
1 2012-03-07 11:09:22 12.6 3.4 9.0 3.0 7.0 3540107362
2 2012-03-07 11:09:22 3.6 4.4 8.0 6.0 5.0 3540107362
3 2012-03-07 11:09:22 10.6 3.5 1.0 3.0 8.0 3540107362
4 2012-03-07 11:09:23 10.5 23.2 0.3 7.8 4.4 3540107363
5 2012-03-07 11:09:23 11.6 13.4 19.0 13.0 17.0 3540107363

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas data manipulation - multiple measurements per line to one per line [duplicate] - python

Its' a wide to long problem pd.wide_to_long(df,'Meas',i=['Location','Nominal'],j='drop').reset_index().drop('drop',1) Out[637]: Location Nominal Meas 0 A 4.0 3.8 1 A 4.0 4.1 2 A 4.0 4.3 3 B 9.0 8.7 4 B 9.0 8.9 5 B 9.0 9.1

Another solution, using melt: new_df = (df.melt(['Location','Nominal'], ['Meas1', 'Meas2', 'Meas3'], value_name = 'Meas') .drop('variable', axis=1) .sort_values('Location')) >>> new_df Location Nominal Meas 0 A 4.0 3.8 2 A 4.0 4.1 4 A 4.0 4.3 1 B 9.0 8.7 3 B 9.0 8.9 5 B 9.0 9.1

Related

Python: Rolling Minimum by date interval

summing the values row wise

Pandas slicing data with MultiIndex

Rolling Linear Fit with Python DataFrame

Python Pandas combining timestamp columns and fillna in read_csv

Categories

Resources