How to form sentences from single words in a dataframe? - python

I'm trying to form sentences from single words in a dataframe (words can end with ., ? or !), and to recognize that the period in "U." or "S." is not the end of a sentence.
import pandas as pd

data = {
    "start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9, 2.1, 2.3],
    "end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
    "word": [
        "WHERE",
        "ARE",
        "YOU?",
        "I",
        "AM",
        "U.",
        "S.",
        "OK,",
        "COOL!",
        "YES",
        "IT",
        "IS.",
    ],
}
df = pd.DataFrame(data, columns=["start_time", "end_time", "word"])
The dataframe looks like:
s_time e_time word
0.1 0.2 WHERE
0.3 0.4 ARE
0.5 0.6 YOU?
0.7 0.8 I
0.9 1.0 AM
1.1 1.2 U.
1.3 1.4 S.
1.5 1.6 OK,
1.7 1.8 COOL!
1.9 2.0 YES
2.1 2.2 IT
2.3 2.4 IS.
The result I want to get looks like:
s_time e_time sentence
0.1 0.6 WHERE ARE YOU?
0.7 1.4 I AM U. S.
1.5 1.8 OK, COOL!
1.9 2.4 YES IT IS.
I am stuck with how to get U. S. in one sentence.
Any suggestion would be much appreciated, and thanks in advance for any help!

You could try this:
# Initialize variables
new_data = {"start_time": [], "end_time": [], "sentence": []}
sentence = []
start_time = None

# Iterate over the dataframe
for i, row in df.iterrows():
    # Remember the start time of the current sentence
    if start_time is None:
        start_time = row["start_time"]
    if (
        not row["word"].endswith("?")
        and not row["word"].endswith("!")
        and not row["word"].endswith("S.")
    ):
        # The word does not end a sentence, so just collect it
        sentence.append(row["word"])
    else:
        # The word ends a sentence: record start_time, end_time
        # and the completed sentence
        new_data["start_time"].append(start_time)
        new_data["end_time"].append(row["end_time"])
        sentence.append(row["word"])
        new_data["sentence"].append(" ".join(sentence))
        # Reset variables
        start_time = None
        sentence = []

new_df = pd.DataFrame(new_data, columns=["start_time", "end_time", "sentence"])
print(new_df)
# Outputs
   start_time  end_time        sentence
0         0.1       0.6  WHERE ARE YOU?
1         0.7       1.4      I AM U. S.
2         1.5       1.8       OK, COOL!
3         1.9       2.4      YES IT IS.
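A more vectorized alternative (a sketch, not part of the original answer) is to flag sentence-ending words, build a running sentence id with a cumulative sum, and aggregate with groupby. The end-of-sentence rule below reuses the same "?", "!" or "S." check as the loop above, so it is just as specific to this example data:
import pandas as pd

# True for words that close a sentence ("?", "!", or the "S." that ends "U. S." and "IS.")
is_end = df["word"].str.contains(r"[?!]$|S\.$")

# Sentence id: the number of sentence endings seen before each word
sentence_id = is_end.cumsum().shift(fill_value=0)

new_df = (
    df.groupby(sentence_id)
      .agg(
          start_time=("start_time", "first"),
          end_time=("end_time", "last"),
          sentence=("word", " ".join),
      )
      .reset_index(drop=True)
)
print(new_df)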

Related

Sensor Data Sampling Frequency Mismatch

I have sensor data captured at different frequencies (this is data I've invented to simplify the operation). I want to resample the voltage data by increasing the number of data points and interpolate them so I have 16 instead of 12.
Pandas has a resample/upsample function but I can only find examples where people have gone from weekly data to daily data (adding 6 daily data points by interpolation between two weekly data points).
time (pressure)    pressure
0.05               1
0.1                1.1
0.15               1.2
0.2                1.3
0.25               1.4
0.3                1.5
0.35               1.6
0.4                1.7
0.45               1.8
0.5                1.9
0.55               2
0.6                2.1
0.65               2.2
0.7                2.3
0.75               2.4
0.8                2.5

time (voltage)     voltage
0.07               2.2
0.14               2.5
0.21               2.8
0.28               3.1
0.35               3.4
0.42               3.7
0.49               4
0.56               4.3
0.63               4.6
0.7                4.9
0.77               5.2
0.84               5.5
I would like my voltage to have 16 samples instead of 12 with the missing values interpolated. Thanks!
Let's assume two Series, "pressure" and "voltage":
pressure = pd.Series({0.05: 1.0, 0.1: 1.1, 0.15: 1.2, 0.2: 1.3, 0.25: 1.4, 0.3: 1.5, 0.35: 1.6, 0.4: 1.7, 0.45: 1.8,
0.5: 1.9, 0.55: 2.0, 0.6: 2.1, 0.65: 2.2, 0.7: 2.3, 0.75: 2.4, 0.8: 2.5}, name='pressure')
voltage = pd.Series({0.07: 2.2, 0.14: 2.5, 0.21: 2.8, 0.28: 3.1, 0.35: 3.4, 0.42: 3.7,
0.49: 4.0, 0.56: 4.3, 0.63: 4.6, 0.7: 4.9, 0.77: 5.2, 0.84: 5.5}, name='voltage')
You can either use pandas.merge_asof:
pd.merge_asof(pressure, voltage, left_index=True, right_index=True)
output:
or pandas.concat+interpolate:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.apply(pd.Series.interpolate)
#.plot(x='pressure', y='voltage', marker='o') # uncomment to plot
)
output:
Finally, to interpolate only on voltage, drop NAs on pressure first:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.dropna(subset=['pressure'])
.apply(pd.Series.interpolate)
)
output:
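If the goal is specifically 16 voltage samples, one per pressure timestamp, a minimal sketch with numpy.interp (not part of the original answer; it assumes linear interpolation is acceptable, and it clamps to the end values outside the voltage time range, e.g. at t=0.05):
import numpy as np
import pandas as pd

# Interpolate voltage onto the 16 pressure timestamps
voltage_16 = pd.Series(
    np.interp(pressure.index, voltage.index, voltage.values),
    index=pressure.index,
    name="voltage",
)
print(pd.concat([pressure, voltage_16], axis=1))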

Is there a function that gives practical information about a dataframe?

In Python (pandas), there is the method data.info(). It gives you information about a dataset such as the datatypes, memory usage, number of entries, etc.
You can look up more information about the .info() method in the pandas documentation.
Is there also a function in R that gives me this kind of information?
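For reference, a minimal pandas example of the kind of summary meant here (the toy DataFrame is just an illustration):
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# Prints the index range, column names, non-null counts, dtypes and memory usage
df.info()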
So here we have a few options
Base R
Within Base R there are a few options for getting this kind of information about your data:
str
You can use str to see the structure of a data frame
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary
Additionally, there is the summary function, which computes a five-number summary plus the mean for each numeric column and counts for factors:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
dplyr
dplyr provides glimpse, which is similar to str and also shows the data types:
library(dplyr)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, ...
skimr
Finally, the skimr package provides an enhanced summary including little histograms
library(skimr)
skim(iris)
-- Data Summary ------------------------
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
-- Variable type: factor -------------------------------------------------------
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50
-- Variable type: numeric ------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Between those functions you can get a pretty good look at your data!
It's not a single function, but the first three things I always do are
library(tidyverse)
# Shows top 6 rows
iris %>% head()
# Gives dimensions of data.frame
iris %>% dim()
# Gives the classes of the data in each column (e.g. numeric, character etc)
iris %>% sapply(class)
The best package I use that I haven't seen above is inspectdf (mentioned by Niels in a comment above). inspectdf does much of the summarising you see in skimr in MDEWITT's answer via specific function calls; for instance, inspect_cat and inspect_num give categorical and numerical variable summaries, respectively.
What this answer adds is that inspectdf has two additional functions, inspect_imb and inspect_cor, which respectively look at the most common value per column and the correlation between numerical columns. I find these tremendously useful for data cleaning/pre-processing.

How to normalize data in a text file while preserving the first variable

I have a text file with this format:
1 10.0e+08 1.0e+04 1.0
2 9.0e+07 9.0e+03 0.9
2 8.0e+07 8.0e+03 0.8
3 7.0e+07 7.0e+03 0.7
I would like to preserve the first variable of every line and then normalize the data on all lines by the data on the first line. The end result would look something like:
1 1.0 1.0 1.0
2 0.9 0.9 0.9
2 0.8 0.8 0.8
3 0.7 0.7 0.7
so essentially, we are doing the following:
1 10.0e+08/10.0e+08 1.0e+04/1.0e+04 1.0/1.0
2 9.0e+07/10.0e+08 9.0e+03/1.0e+04 0.9/1.0
2 8.0e+07/10.0e+08 8.0e+03/1.0e+04 0.8/1.0
3 7.0e+07/10.0e+08 7.0e+03/1.0e+04 0.7/1.0
I'm still researching and reading up on how to do this, and I'll upload my attempt shortly. Also, can anyone point me to a place where I can learn more about manipulating data files?
Read your file into a NumPy array and use NumPy's broadcasting:
import numpy as np
data = np.loadtxt('foo.txt')
data = data / data[0]
#array([[ 1. , 1. , 1. , 1. ],
# [ 2. , 0.09, 0.9 , 0.9 ],
# [ 2. , 0.08, 0.8 , 0.8 ],
# [ 3. , 0.07, 0.7 , 0.7 ]])
np.savetxt('new.txt', data)
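Note that data / data[0] also divides the first column by its own first value; the IDs stay intact here only because the first ID happens to be 1. A sketch (an adjustment, not part of the original answer) that normalizes only the data columns and leaves the first column untouched regardless of its value:
import numpy as np

data = np.loadtxt('foo.txt')
# Divide every column except the first by the corresponding value on the first line,
# so the identifier column is preserved no matter what its first value is
data[:, 1:] = data[:, 1:] / data[0, 1:]
np.savetxt('new.txt', data)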

Match two numpy arrays to find the same elements

I have a task that is a bit like an SQL join. I have a "table" which contains the following 1D arrays (about 1 million elements) identified by ID1:
ID1, z, e, PA, n
Another "table" which contains the following 1D arrays (about 1.5 million elements) identified by ID2:
ID2, RA, DEC
I want to match ID1 and ID2 to find the common ones and form another "table" which contains ID, z, e, PA, n, RA, DEC. Most elements in ID1 can be found in ID2, but not all; otherwise I could use numpy.in1d(ID1, ID2) to accomplish it. Does anyone have a fast way to accomplish this task?
For example:
ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8
ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6
The output should be
ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6
You can use np.in1d on the first columns of the two arrays/tables in both directions, giving two masks to index into the arrays for selection. Then simply stack the results -
mask1 = np.in1d(a[:,0], b[:,0])
mask2 = np.in1d(b[:,0], a[:,0])
out = np.column_stack(( a[mask1], b[mask2,1:] ))
Sample run -
In [44]: a
Out[44]:
array([[ 101. , 1. , 1.2, 1.5, 1.8],
[ 104. , 1.5, 1.8, 2.2, 3.1],
[ 105. , 1.4, 2. , 3.3, 2.8]])
In [45]: b
Out[45]:
array([[ 101. , 4.5, 10.5],
[ 102. , 30.5, 3.3],
[ 103. , 60.1, 40.6],
[ 104. , 10.8, 5.6],
[ 107. , 90.1, 55.5]])
In [46]: mask1 = np.in1d(a[:,0], b[:,0])
In [47]: mask2 = np.in1d(b[:,0], a[:,0])
In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
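As a side note (not part of the original answer), the positional pairing above relies on the common IDs appearing in the same relative order in both arrays, which holds here because both tables are sorted by ID. A sketch using np.intersect1d with return_indices=True, which also copes with unsorted IDs, assuming the IDs are unique within each array:
import numpy as np

# Indices of the matching IDs in each array, aligned on the sorted common IDs
common_ids, idx_a, idx_b = np.intersect1d(a[:, 0], b[:, 0], return_indices=True)
out = np.column_stack((a[idx_a], b[idx_b, 1:]))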
Assuming your second table, table B, is sorted, you can do a sorted lookup, then check if the indexed element is actually found:
idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
np.hstack((A[found, :], B[idx[found], 1:]))
Result:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
The last element of B is excluded from the lookup to simplify the case where an item in A is beyond the final element in B. Without this, the returned index could equal the length of B and cause an indexing error.
Use pandas:
import pandas as pd
id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)
Produces:
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
With large datasets you may need to do things in place:
# [Optional] sort locations and drop duplicates
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)
# columns that you are merging must have the same name
id2.rename(columns={'ID2':'ID1'}, inplace=True)
# perform the merge
df = id1.merge(id2)
Without drop_duplicates you get one output row for every matching row in id2 (note the duplicated IDs below):
df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)
Giving:
ID2 RA DEC
0 101 4.5 10.5
1 107 90.1 55.5
2 102 30.5 3.3
3 103 60.1 40.6
4 104 10.8 5.6
5 103 60.1 40.6
6 104 10.9 5.6
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
2 104 1.5 1.8 2.2 3.1 10.9 5.6
Note that this solution preserves the different types for the columns:
>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')
Since there are spaces after the commas in the header row, those spaces become part of the column names, hence the need to refer to the second column as id1[' z']. By modifying the read statement, this is no longer necessary:
>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')

Meshgrid of z values that match x and y meshgrid values

Edit: The original question was flawed, but I am leaving it here for transparency.
Original:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y).
>>> import numpy as np
>>> # Dummy example data
>>> x = np.arange(0.0, 5.0, 0.5)
>>> y = np.arange(1.0, 2.0, 0.1)
>>> z = np.sin(x)**2 + np.cos(y)**2
>>> print "x = ", x, "\n", "y = ", y, "\n", "z = ", z
x = [ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
y = [ 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
z = [ 0.29192658 0.43559829 0.83937656 1.06655187 0.85571064 0.36317266
0.02076747 0.13964978 0.62437081 1.06008127]
Using xx, yy = np.meshgrid(x, y) I can get two grids containing x and y values corresponding to each grid position.
>>> xx, yy = np.meshgrid(x, y)
>>> print xx
[[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5]]
>>> print yy
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
[ 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1]
[ 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2 1.2]
[ 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3 1.3]
[ 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4 1.4]
[ 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5 1.5]
[ 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6]
[ 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7]
[ 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8]
[ 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9]]
Now I want an array of the same shape for z, where the grid values correspond to the matching x and y values in the original data! But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it.
I have tried following this solution (with my real data, not this simple example data, but it should have the same result) but my final grid was not fully populated.
Please help!
Corrected question:
As was pointed out by commenters, my original dummy data was unsuitable for the question I am asking. Here is an improved version of the question:
I have some x, y, z data where x and y are coordinates of a 2D grid and z is a scalar value corresponding to (x, y). The data is read from a text file "data.txt":
#x y z
1.4 0.2 1.93164166734
1.4 0.3 1.88377897779
1.4 0.4 1.81946452501
1.6 0.2 1.9596778849
1.6 0.3 1.91181519535
1.6 0.4 1.84750074257
1.8 0.2 1.90890970517
1.8 0.3 1.86104701562
1.8 0.4 1.79673256284
2.0 0.2 1.78735230743
2.0 0.3 1.73948961789
2.0 0.4 1.67517516511
Loading the text:
>>> import numpy as np
>>> inFile = 'C:\data.txt'
>>> x, y, z = np.loadtxt(inFile, unpack=True, usecols=(0, 1, 2), comments='#', dtype=float)
>>> print x
[ 1.4 1.4 1.4 1.6 1.6 1.6 1.8 1.8 1.8 2. 2. 2. ]
>>> print y
[ 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4 0.2 0.3 0.4]
>>> print z
[ 1.93164167 1.88377898 1.81946453 1.95967788 1.9118152 1.84750074
1.90890971 1.86104702 1.79673256 1.78735231 1.73948962 1.67517517]
Using xx, yy = np.meshgrid(np.unique(x), np.unique(y)) I can get two grids containing x and y values corresponding to each grid position.
>>> xx, yy = np.meshgrid(np.unique(x), np.unique(y))
>>> print xx
[[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]
[ 1.4 1.6 1.8 2. ]]
>>> print yy
[[ 0.2 0.2 0.2 0.2]
[ 0.3 0.3 0.3 0.3]
[ 0.4 0.4 0.4 0.4]]
Now each corresponding cell position in both xx and yy correspond to one of the original grid point locations.
I simply need an equivalent array where the grid values correspond to the matching z values in the original data!
"""e.g.
[[ 1.93164166734 1.9596778849 1.90890970517 1.78735230743]
[ 1.88377897779 1.91181519535 1.86104701562 1.73948961789]
[ 1.81946452501 1.84750074257 1.79673256284 1.67517516511]]"""
But I cannot find an elegant, built-in solution where I do not need to re-grid the data, and I think I am missing some understanding of how I should approach it. For example, using xx, yy, zz = np.meshgrid(x, y, z) returns three 3D arrays that I don't think I can use.
Please help!
Edit:
I managed to make this example work thanks to the solution from Jaime: Fill 2D numpy array from three 1D numpy arrays
>>> x_vals, x_idx = np.unique(x, return_inverse=True)
>>> y_vals, y_idx = np.unique(y, return_inverse=True)
>>> vals_array = np.empty(x_vals.shape + y_vals.shape)
>>> vals_array.fill(np.nan) # or whatever your desired missing data flag is
>>> vals_array[x_idx, y_idx] = z
>>> zz = vals_array.T
>>> print zz
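For completeness: since in this example every (x, y) combination occurs exactly once and the file is sorted by x and then y, a plain reshape would also work (a sketch relying on those assumptions, which the file format does not guarantee):
# Rows of the reshaped array correspond to unique x values and columns to unique y values;
# transposing matches the (y, x) layout of xx and yy from np.meshgrid
zz = z.reshape(np.unique(x).size, np.unique(y).size).T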
But the code (with my real input data) that led me down this path was still failing, and I have now found the problem: I had been using scipy.ndimage.zoom to resample my gridded data to a higher resolution before generating zz.
>>> import scipy.ndimage
>>> zoom = 2
>>> x = scipy.ndimage.zoom(x, zoom)
>>> y = scipy.ndimage.zoom(y, zoom)
>>> z = scipy.ndimage.zoom(z, zoom)
This produced an array containing many nan entries:
array([[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
...,
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan],
[ nan, nan, nan, ..., nan, nan, nan]])
When I skip the zoom stage, the correct array is produced:
array([[-22365.93400183, -22092.31794674, -22074.21420168, ...,
-14513.89091599, -12311.97437017, -12088.07062786],
[-29264.34039242, -28775.79743097, -29021.31886353, ...,
-21354.6799064 , -21150.76555669, -21046.41225097],
[-39792.93758344, -39253.50249278, -38859.2562673 , ...,
-24253.36838785, -25714.71895023, -29237.74277727],
...,
[ 44829.24733543, 44779.37084337, 44770.32987311, ...,
21041.42652441, 20777.00408692, 20512.58162671],
[ 44067.26616067, 44054.5398901 , 44007.62587598, ...,
21415.90416488, 21151.48168444, 20887.05918082],
[ 43265.35371973, 43332.5983711 , 43332.21743471, ...,
21780.32283309, 21529.39770759, 21278.47255848]])
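Zooming the 1D x, y and z arrays independently breaks the pairing between coordinates and values, so most of the zoomed (x, y) combinations never receive a z value and the corresponding cells stay nan. A sketch of the order that avoids this, assuming the aim is simply a higher-resolution version of the gridded values (an inference, not something stated explicitly above):
import numpy as np
import scipy.ndimage

# Build the complete 2D grid of z values first (Jaime's approach from above),
# then upsample that 2D grid as a whole so coordinates and values stay paired
x_vals, x_idx = np.unique(x, return_inverse=True)
y_vals, y_idx = np.unique(y, return_inverse=True)
zz = np.full((y_vals.size, x_vals.size), np.nan)
zz[y_idx, x_idx] = z

zoom = 2
zz_hi = scipy.ndimage.zoom(zz, zoom)  # higher-resolution z grid
# Matching coordinate grids for the upsampled values
xx_hi, yy_hi = np.meshgrid(
    np.linspace(x_vals[0], x_vals[-1], zz_hi.shape[1]),
    np.linspace(y_vals[0], y_vals[-1], zz_hi.shape[0]),
)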
