I have sensor data captured at different frequencies (the data below is invented to keep the example simple). I want to resample the voltage data by increasing the number of data points and interpolating, so that I have 16 samples instead of 12.
Pandas has resample/upsample functionality, but I can only find examples where people go from weekly data to daily data (adding six daily data points by interpolation between two weekly data points).
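For reference, the weekly-to-daily pattern I keep finding looks roughly like this (a minimal sketch with made-up values; note it relies on a DatetimeIndex, which my sensor data does not have):
import pandas as pd

# two weekly values upsampled to daily, gaps filled by interpolation
weekly = pd.Series([1.0, 2.0], index=pd.date_range("2021-01-03", periods=2, freq="W"))
daily = weekly.resample("D").interpolate()  # 8 points: the 6 new ones interpolated
Here is my data: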
time (pressure)    pressure
0.05               1
0.1                1.1
0.15               1.2
0.2                1.3
0.25               1.4
0.3                1.5
0.35               1.6
0.4                1.7
0.45               1.8
0.5                1.9
0.55               2
0.6                2.1
0.65               2.2
0.7                2.3
0.75               2.4
0.8                2.5
time (voltage)     voltage
0.07               2.2
0.14               2.5
0.21               2.8
0.28               3.1
0.35               3.4
0.42               3.7
0.49               4
0.56               4.3
0.63               4.6
0.7                4.9
0.77               5.2
0.84               5.5
I would like my voltage to have 16 samples instead of 12, with the missing values interpolated. Thanks!
Let's assume two Series, "pressure" and "voltage":
pressure = pd.Series({0.05: 1.0, 0.1: 1.1, 0.15: 1.2, 0.2: 1.3, 0.25: 1.4, 0.3: 1.5, 0.35: 1.6, 0.4: 1.7, 0.45: 1.8,
                      0.5: 1.9, 0.55: 2.0, 0.6: 2.1, 0.65: 2.2, 0.7: 2.3, 0.75: 2.4, 0.8: 2.5}, name='pressure')
voltage = pd.Series({0.07: 2.2, 0.14: 2.5, 0.21: 2.8, 0.28: 3.1, 0.35: 3.4, 0.42: 3.7,
                     0.49: 4.0, 0.56: 4.3, 0.63: 4.6, 0.7: 4.9, 0.77: 5.2, 0.84: 5.5}, name='voltage')
You can either use pandas.merge_asof:
pd.merge_asof(pressure, voltage, left_index=True, right_index=True)
output:
      pressure  voltage
0.05       1.0      NaN
0.10       1.1      2.2
0.15       1.2      2.5
0.20       1.3      2.5
0.25       1.4      2.8
0.30       1.5      3.1
0.35       1.6      3.4
0.40       1.7      3.4
0.45       1.8      3.7
0.50       1.9      4.0
0.55       2.0      4.0
0.60       2.1      4.3
0.65       2.2      4.6
0.70       2.3      4.9
0.75       2.4      4.9
0.80       2.5      5.2
or pandas.concat+interpolate:
(pd.concat([pressure, voltage], axis=1)
   .sort_index()
   .apply(pd.Series.interpolate)
   #.plot(x='pressure', y='voltage', marker='o') # uncomment to plot
)
output:
      pressure  voltage
0.05      1.00      NaN
0.07      1.05     2.20
0.10      1.10     2.35
0.14      1.15     2.50
0.15      1.20     2.60
0.20      1.30     2.70
0.21      1.35     2.80
0.25      1.40     2.95
0.28      1.45     3.10
0.30      1.50     3.25
0.35      1.60     3.40
0.40      1.70     3.55
0.42      1.75     3.70
0.45      1.80     3.85
0.49      1.85     4.00
0.50      1.90     4.10
0.55      2.00     4.20
0.56      2.05     4.30
0.60      2.10     4.45
0.63      2.15     4.60
0.65      2.20     4.75
0.70      2.30     4.90
0.75      2.40     5.05
0.77      2.45     5.20
0.80      2.50     5.35
0.84      2.50     5.50
Finally, to interpolate only on voltage, drop NAs on pressure first:
(pd.concat([pressure, voltage], axis=1)
   .sort_index()
   .dropna(subset=['pressure'])
   .apply(pd.Series.interpolate)
)
output:
      pressure   voltage
0.05       1.0       NaN
0.10       1.1       NaN
0.15       1.2       NaN
0.20       1.3       NaN
0.25       1.4       NaN
0.30       1.5       NaN
0.35       1.6  3.400000
0.40       1.7  3.614286
0.45       1.8  3.828571
0.50       1.9  4.042857
0.55       2.0  4.257143
0.60       2.1  4.471429
0.65       2.2  4.685714
0.70       2.3  4.900000
0.75       2.4  4.900000
0.80       2.5  4.900000
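Note: Series.interpolate defaults to 'linear', which treats the points as equally spaced and ignores the index. If the interpolation should be weighted by the actual timestamps instead, a minimal sketch (my addition, assuming the same pressure and voltage Series as above) uses method='index':
new_times = pressure.index  # the 16 target sample times
voltage_16 = (voltage.reindex(voltage.index.union(new_times))
                     .interpolate(method='index')  # interpolate on the time values
                     .loc[new_times])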
I have a DataFrame called "DataExample" and a list called "normalsizes", sorted in ascending order.
import pandas as pd

if __name__ == "__main__":

    DataExample = [[0.6, 0.36, 0.00],
                   [0.6, 0.36, 0.00],
                   [0.9, 0.81, 0.85],
                   [0.8, 0.64, 0.91],
                   [1.0, 1.00, 0.92],
                   [1.0, 1.00, 0.95],
                   [0.9, 0.81, 0.97],
                   [1.2, 1.44, 0.97],
                   [1.0, 1.00, 0.97],
                   [1.0, 1.00, 0.99],
                   [1.2, 1.44, 0.99],
                   [1.1, 1.21, 0.99]]

    DataExample = pd.DataFrame(data=DataExample, columns=['Lx', 'A', 'Ratio'])

    normalsizes = [0, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.4, 2.5, 2.75, 3,
                   3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]

    # for i in DataExample.index:
    #
    #     numb = DataExample['Lx'][i]
What I am looking for is that each value of DataExample['Lx'] is located within an interval of normalsizes, for example:
For DataExample['Lx'][0] = 0.6, it falls in the interval (0, 0.75] (0.6 > 0 and 0.6 <= 0.75), so I take the largest value of that interval, i.e. 0.75. This for each row.
With this I should have the following result:
Lx    A     Ratio
0.75  0.36  0
0.75  0.36  0
1     0.81  0.85
1     0.64  0.91
1.25  1     0.92
1.25  1     0.95
1     0.81  0.97
1.25  1.44  0.97
1.25  1     0.97
1.25  1     0.99
1.25  1.44  0.99
1.25  1.21  0.99
numpy.searchsorted will get you what you want:
import numpy as np

normalsizes = np.array(normalsizes)  # convert to a numpy array
# for each Lx, searchsorted (default side='left') returns the index of the
# first normalsize >= Lx, i.e. the upper bound of the interval it falls in
DataExample["Lx"] = normalsizes[np.searchsorted(normalsizes, DataExample["Lx"])]
I'm trying to form sentences from single words in a dataframe (a sentence may end with ., ? or !), while recognizing that abbreviations like U. or S. do not end a sentence.
import pandas as pd

data = {
    "start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3],
    "end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
    "word": [
        "WHERE",
        "ARE",
        "YOU?",
        "I",
        "AM",
        "U.",
        "S.",
        "OK,",
        "COOL!",
        "YES",
        "IT",
        "IS.",
    ],
}
df = pd.DataFrame(data, columns=["start_time", "end_time", "word"])
The dataframe looks like:
s_time e_time word
0.1 0.2 WHERE
0.3 0.4 ARE
0.5 0.6 YOU?
0.7 0.8 I
0.9 1.0 AM
1.1 1.2 U.
1.3 1.4 S.
1.5 1.6 OK,
1.7 1.8 COOL!
1.9 2.0 YES
2.1 2.2 IT
2.3 2.4 IS.
The result I want to get looks like:
s_time e_time sentence
0.1 0.6 WHERE ARE YOU?
0.7 1.4 I AM U. S.
1.5 1.8 OK, COOL!
1.9 2.4 YES IT IS.
I am stuck on how to get U. S. into one sentence.
Any suggestion would be much appreciated; thanks for any help!
You could try this:
# Initialize variables
new_data = {"start_time": [], "end_time": [], "sentence": []}
sentence = []
start_time = None

# Iterate over the dataframe
for i, row in df.iterrows():
    # Initialize start_time at the beginning of each sentence
    if start_time is None:
        start_time = row["start_time"]
    if (
        not row["word"].endswith("?")
        and not row["word"].endswith("!")
        and not row["word"].endswith("S.")
    ):
        # If the word does not end a sentence, accumulate it
        sentence.append(row["word"])
    else:
        # The word ends the sentence: update new_data with start_time,
        # end_time and the completed sentence
        new_data["start_time"].append(start_time)
        new_data["end_time"].append(row["end_time"])
        sentence.append(row["word"])
        new_data["sentence"].append(" ".join(sentence))
        # Reset variables for the next sentence
        start_time = None
        sentence = []

new_df = pd.DataFrame(new_data, columns=["start_time", "end_time", "sentence"])
print(new_df)
# Outputs
   start_time  end_time        sentence
0         0.1       0.6  WHERE ARE YOU?
1         0.7       1.4      I AM U. S.
2         1.5       1.8       OK, COOL!
3         1.9       2.4      YES IT IS.
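As a hedged alternative (my own sketch, not part of the answer above): the same sentence-boundary rule can be vectorized by building a running sentence id with cumsum and aggregating per group:
import pandas as pd

# flag words that end a sentence under the same rule as above
is_end = (df["word"].str.endswith("?")
          | df["word"].str.endswith("!")
          | df["word"].str.endswith("S."))
# a word starts a new sentence right after a sentence-ending word
sentence_id = is_end.shift(fill_value=False).cumsum()

new_df = df.groupby(sentence_id).agg(
    start_time=("start_time", "first"),
    end_time=("end_time", "last"),
    sentence=("word", " ".join),
).reset_index(drop=True)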
DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({"sym": ["a", "a", "aa", "aa", "aa", "a", "ab", "ab", "ab"],
                   "id_h": [2.1, 2.2, 2.5, 3.1, 2.5, 3.8, 2.5, 5, 6],
                   "pm_h": [np.nan, 2.3, np.nan, 2.8, 2.7, 3.7, 2.4, 4.9, np.nan]})
I want to fill the pm_h NaN values with the max id_h value of each "sym" group, i.e. (a, aa, ab).
Required output:
df1 = pd.DataFrame({"sym": ["a", "a", "aa", "aa", "aa", "a", "ab", "ab", "ab"],
                    "id_h": [2.1, 2.2, 2.5, 3.1, 2.5, 3.8, 2.5, 5, 6],
                    "pm_h": [3.8, 2.3, 3.1, 2.8, 2.7, 3.7, 2.4, 4.9, 6]})
Use Series.fillna with GroupBy.transform('max'), which returns the group maxima as a new Series aligned to the original index:
df['pm_h'] = df['pm_h'].fillna(df.groupby('sym')['id_h'].transform('max'))
print(df)
sym id_h pm_h
0 a 2.1 3.8
1 a 2.2 2.3
2 aa 2.5 3.1
3 aa 3.1 2.8
4 aa 2.5 2.7
5 a 3.8 3.7
6 ab 2.5 2.4
7 ab 5.0 4.9
8 ab 6.0 6.0
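To see why transform is needed here (a brief aside, not from the original answer): GroupBy.max() would return one row per group, while transform('max') broadcasts each group's maximum back onto the original rows, so the result aligns index-for-index with df['pm_h']:
print(df.groupby('sym')['id_h'].transform('max'))
# 0    3.8
# 1    3.8
# 2    3.1
# 3    3.1
# 4    3.1
# 5    3.8
# 6    6.0
# 7    6.0
# 8    6.0
# Name: id_h, dtype: float64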
In Python's pandas library, there is a method data.info(). This method gives you information about a DataFrame such as datatypes, memory usage, number of entries, etc.
You can look up more about the .info() method in the pandas documentation.
Is there also a function in R that gives me this kind of information?
So here we have a few options.
Base R
Within base R there are a few options for getting this kind of information about your data:
str
You can use str to see the structure of a data frame
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary
Additionally, there is the summary function, which computes a five-number summary plus the mean for each numeric column, and counts for factors:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
dplyr
dplyr provides glimpse, which is similar to str and shows the data types:
library(dplyr)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, ...
skimr
Finally, the skimr package provides an enhanced summary including little histograms
library(skimr)
skim(iris)
-- Data Summary ------------------------
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
-- Variable type: factor -------------------------------------------------------
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50
-- Variable type: numeric ------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Between those functions you can get a pretty good look at your data!
It's not a single function, but the first three things I always do are
library(tidyverse)
# Shows top 6 rows
iris %>% head()
# Gives dimensions of data.frame
iris %>% dim()
# Gives the classes of the data in each column (e.g. numeric, character etc)
iris %>% sapply(class)
The best package I use, which I haven't seen mentioned above, is inspectdf (mentioned by Niels in a comment above). inspectdf does much of the summary you see in skimr in @MDEWITT's answer via specific function calls; for instance, inspect_cat and inspect_num for categorical and numerical variable summaries, respectively.
The contribution of my answer is that inspectdf has two additional functions, inspect_imb and inspect_cor, which respectively look at the most common value per column and the correlation between numerical columns. I find these tremendously useful for data cleaning/pre-processing.
I am creating a distance matrix in numpy, with output such as:
['H', 'B', 'D', 'A', 'I', 'C', 'F']
[[ 0. 2.4 6.1 3.2 5.2 3.9 7.1]
[ 2.4 0. 4.1 1.2 3.2 1.9 5.1]
[ 6.1 4.1 0. 3.1 6.9 2.8 5.2]
[ 3.2 1.2 3.1 0. 4. 0.9 4.1]
[ 5.2 3.2 6.9 4. 0. 4.7 7.9]
[ 3.9 1.9 2.8 0.9 4.7 0. 3.8]
[ 7.1 5.1 5.2 4.1 7.9 3.8 0. ]]
I am printing that x axis by just printing the list of names before I print the actual matrix, a (Python 2 syntax):
print " ", names
print a
I need the axis in that order, as the list 'names' properly orders the variables with their values in the matrix. But how would I be able to get a similar y axis in numpy?
It is not so pretty, but this pretty-table printer works:
import numpy as np

names = np.array(['H', 'B', 'D', 'A', 'I', 'C', 'F'])
a = np.array([[0.0, 2.4, 6.1, 3.2, 5.2, 3.9, 7.1],
              [2.4, 0.0, 4.1, 1.2, 3.2, 1.9, 5.1],
              [6.1, 4.1, 0.0, 3.1, 6.9, 2.8, 5.2],
              [3.2, 1.2, 3.1, 0.0, 4.0, 0.9, 4.1],
              [5.2, 3.2, 6.9, 4.0, 0.0, 4.7, 7.9],
              [3.9, 1.9, 2.8, 0.9, 4.7, 0.0, 3.8],
              [7.1, 5.1, 5.2, 4.1, 7.9, 3.8, 0.0]])

def pptable(x_axis, y_axis, table):
    def format_field(field, fmt='{:,.2f}'):
        if type(field) is str: return field
        if type(field) is tuple: return field[1].format(field[0])
        return fmt.format(field)

    def get_max_col_w(table, index):
        return max([len(format_field(row[index])) for row in table])

    # prepend the y-axis labels to each row and the x-axis labels as a header row
    for i, l in enumerate(table):
        l.insert(0, y_axis[i])
    x_axis.insert(0, ' ')
    table.insert(0, x_axis)
    col_paddings = [get_max_col_w(table, i) for i in range(len(table[0]))]
    for i, row in enumerate(table):
        # left col
        row_tab = [str(row[0]).ljust(col_paddings[0])]
        # rest of the cols
        row_tab += [format_field(row[j]).rjust(col_paddings[j])
                    for j in range(1, len(row))]
        print(' '.join(row_tab))

x_axis = ['x{}'.format(c) for c in names]
y_axis = ['y{}'.format(c) for c in names]
pptable(x_axis, y_axis, a.tolist())
Prints:
     xH   xB   xD   xA   xI   xC   xF
yH 0.00 2.40 6.10 3.20 5.20 3.90 7.10
yB 2.40 0.00 4.10 1.20 3.20 1.90 5.10
yD 6.10 4.10 0.00 3.10 6.90 2.80 5.20
yA 3.20 1.20 3.10 0.00 4.00 0.90 4.10
yI 5.20 3.20 6.90 4.00 0.00 4.70 7.90
yC 3.90 1.90 2.80 0.90 4.70 0.00 3.80
yF 7.10 5.10 5.20 4.10 7.90 3.80 0.00
If you want the X and Y axis to be the same, just call it with two lists of the same labels.
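If pulling in pandas is an option, a shorter route (my suggestion, not part of the answer above) is to let a labeled DataFrame do the aligning:
import pandas as pd

# wrap the matrix in a DataFrame so both axes carry the names
print(pd.DataFrame(a, index=names, columns=names))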