In Python, there is a function data.info(). This function gives you all the information about a dataset such as datatypes, memory, number of entries, etc.
Here you can look up for more information about the .info() function in Python.
Is there also a function in R that gives me this kind of information?
So here we have a few options
Base R
Within Base R there are a few options for getting these kind of data regarding your data:
str
You can use str to see the structure of a data frame
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary
Additionally, there is the summary function which completes a five number summary for each column and then counts for factors:
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
dplyr
dplyr provides something similar to str which shows some of the data types
library(dplyr)
glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5...
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0, 3.0, 4.0, 4.4, 3.9, 3.5, 3.8, 3...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4, 1.1, 1.2, 1.5, 1.3, 1.4, 1.7, 1...
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1, 0.1, 0.2, 0.4, 0.4, 0.3, 0.3, 0...
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa, ...
skimr
Finally, the skimr package provides an enhanced summary including little histograms
library(skimr)
skim(iris)
-- Data Summary ------------------------
Values
Name iris
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
-- Variable type: factor -------------------------------------------------------
skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species 0 1 FALSE 3 set: 50, ver: 50, vir: 50
-- Variable type: numeric ------------------------------------------------------
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 Sepal.Length 0 1 5.84 0.828 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 Sepal.Width 0 1 3.06 0.436 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 Petal.Length 0 1 3.76 1.77 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 Petal.Width 0 1 1.20 0.762 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃
Between those functions you can get a pretty good look at your data!
It's not a single function, but the first three things I always do are
library(tidyverse)
# Shows top 6 rows
iris %>% head()
# Gives dimensions of data.frame
iris %>% dim()
# Gives the classes of the data in each column (e.g. numeric, character etc)
iris %>% sapply(class)
The best package I use, that I haven't seen above, is inspectdf (mentioned by Niels in a comment above). inspectdf does much of the summary you see in skimr in #MDEWITT via specific function calls; for instance, inspect_cat and inspect_num for categorical and numerical variable summaries, respectively.
The contribution of my comment is that inspectdf has two additional functions inspect_imb and inspect_cor which, respectively, look at the most common value per column and the correlation between numerical cols. I find these tremendously useful for data cleaning/pre-processing.
Related
I have sensor data captured at different frequencies (this is data I've invented to simplify the operation). I want to resample the voltage data by increasing the number of data points and interpolate them so I have 16 instead of 12.
Pandas has a resample/upsample function but I can only find examples where people have gone from weekly data to daily data (adding 6 daily data points by interpolation between two weekly data points).
time (pressure)
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.55
0.6
0.65
0.7
0.75
0.8
pressure
1
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2
2.1
2.2
2.3
2.4
2.5
time (voltage)
0.07
0.14
0.21
0.28
0.35
0.42
0.49
0.56
0.63
0.7
0.77
0.84
voltage
2.2
2.5
2.8
3.1
3.4
3.7
4
4.3
4.6
4.9
5.2
5.5
I would like my voltage to have 16 samples instead of 12 with the missing values interpolated. Thanks!
Let's assume two Series, "pressure" and "voltage":
pressure = pd.Series({0.05: 1.0, 0.1: 1.1, 0.15: 1.2, 0.2: 1.3, 0.25: 1.4, 0.3: 1.5, 0.35: 1.6, 0.4: 1.7, 0.45: 1.8,
0.5: 1.9, 0.55: 2.0, 0.6: 2.1, 0.65: 2.2, 0.7: 2.3, 0.75: 2.4, 0.8: 2.5}, name='pressure')
voltage = pd.Series({0.07: 2.2, 0.14: 2.5, 0.21: 2.8, 0.28: 3.1, 0.35: 3.4, 0.42: 3.7,
0.49: 4.0, 0.56: 4.3, 0.63: 4.6, 0.7: 4.9, 0.77: 5.2, 0.84: 5.5}, name='voltage')
You can either use pandas.merge_asof:
pd.merge_asof(pressure, voltage, left_index=True, right_index=True)
output:
or pandas.concat+interpolate:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.apply(pd.Series.interpolate)
#.plot(x='pressure', y='voltage', marker='o') # uncomment to plot
)
output:
Finally, to interpolate only on voltage, drop NAs on pressure first:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.dropna(subset=['pressure'])
.apply(pd.Series.interpolate)
)
output:
I'm trying to form sentences from single words in a dataframe (sometimes ending with .?!), and recognize that U. or S. is not the end of the sentence.
data = {
"start_time": [0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.9, 2.1, 2.3, 2.5],
"end_time": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4],
"word": [
"WHERE",
"ARE",
"YOU?",
"I",
"AM",
"U.",
"S.",
"OK,",
"COOL!",
"YES",
"IT",
"IS.",
],
}
df = pd.DataFrame(data, columns=["start_time", "end_time", "word"])
The dataframe looks like:
s_time e_time word
0.1 0.2 WHERE
0.3 0.4 ARE
0.5 0.6 YOU?
0.7 0.8 I
0.9 1.0 AM
1.1 1.2 U.
1.3 1.4 S.
1.5 1.6 OK,
1.7 1.8 COOL!
1.9 2.0 YES
2.1 2.2 IT
2.3 2.4 IS.
The result I want to get looks like:
s_time e_time sentence
0.1 0.6 WHERE ARE YOU?
0.7 1.4 I AM U. S.
1.5 1.8 OK, COOL!
1.9 2.4 YES IT IS.
I am stuck with how to get U. S. in one sentence.
Any suggestion would be much appreciated and really thanks for anyone help!
You could try this:
# Initialize variables
new_data = {"start_time": [], "end_time": [], "sentence": []}
sentence = []
start_time = None
# Iterate on the dataframe
for i, row in df.iterrows():
# Initialize start_time
if not start_time:
start_time = row["start_time"]
if (
not row["word"].endswith("?")
and not row["word"].endswith("!")
and not row["word"].endswith("S.")
):
# If word is not ending a phrase, get it
sentence.append(row["word"])
else:
# Pause iteration and update new_data with start_time, end_time
# and completed sentence
new_data["start_time"].append(start_time)
new_data["end_time"].append(row["end_time"])
sentence.append(row["word"])
new_data["sentence"].append(" ".join(sentence))
# Reset variables
start_time = None
sentence = []
new_df = pd.DataFrame(new_data, columns=["start_time", "end_time", "sentence"])
print(new_df)
# Outputs
start_time end_time sentence
0 0.1 0.6 WHERE ARE YOU?
1 0.7 1.4 I AM U. S.
2 1.5 1.8 OK, COOL!
3 2.1 2.4 YES IT IS.
Dataframe
df=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[np.nan, 2.3, np.nan , 2.8, 2.7, 3.7, 2.4, 4.9,np.nan]})
want to fill pm_h nan values with max id_h value of each "sys" group i.e. (a, aa, ab)
Required output:
df1=pd.DataFrame({"sym":["a","a","aa","aa","aa","a","ab","ab","ab"],
"id_h":[2.1, 2.2 , 2.5 , 3.1 , 2.5, 3.8 , 2.5, 5,6],
"pm_h":[3.8, 2.3, 3.1 , 2.8, 2.7, 3.7, 2.4, 4.9, 6})
Use Series.fillna with GroupBy.transform by maximal values for new Series with same index like original:
df['pm_h'] = df['pm_h'].fillna(df.groupby('sym')['id_h'].transform('max'))
print (df)
sym id_h pm_h
0 a 2.1 3.8
1 a 2.2 2.3
2 aa 2.5 3.1
3 aa 3.1 2.8
4 aa 2.5 2.7
5 a 3.8 3.7
6 ab 2.5 2.4
7 ab 5.0 4.9
8 ab 6.0 6.0
I want to increase the number of elements in a list by following same distribution.
My code:
# Presently I have 5 elements
x_now = [4,4.5,4.6,5.4,6]
# I want to produce 13 elements. My expected output
x_exp = [4,4.5,4.6,5.4,6,4,4.5,4.6,5.4,6,4,4.5] % I got it by copy and pasting existing list elements
# Is it possible to randomly sample between min and max here and produce n elements here:
x_exp1 =[4 4.2 4.6 4.9 5.5 5.9 4.3 4.7 4.8 5.6 6 4.1 4.6]
Option 1
(x * 2)+(x[:13%len(x)])
Option 2
[x[i%len(x)] for i in range(13)]
Something like this:
In [1431]: l = x_now * 3
In [1432]: l[:len(l)-(13 // len(x_now))]
Out[1432]: [4, 4.5, 4.6, 5.4, 6, 4, 4.5, 4.6, 5.4, 6, 4, 4.5, 4.6]
I have a task kind of like SQL search. I have a "table" which contains the following 1D arrays (about 1 million elements) identified by ID1:
ID1, z, e, PA, n
Another "table" which contains the following 1D arrays (about 1.5 million elements) identified by ID2:
ID2, RA, DEC
I want to match ID1 and ID2 to find the common ones to form another "table" which contains ID, z, e, PA, n, RA, DEC. Most elements in ID1 can be found in ID2 but not all, otherwise I can use numpy.in1d(ID1,ID2) to accomplish it. Anyone has fast way to accomplish this task?
For example:
ID1, z, e, PA, n
101, 1.0, 1.2, 1.5, 1.8
104, 1.5, 1.8, 2.2, 3.1
105, 1.4, 2.0, 3.3, 2.8
ID2, RA, DEC
101, 4.5, 10.5
107, 90.1, 55.5
102, 30.5, 3.3
103, 60.1, 40.6
104, 10.8, 5.6
The output should be
ID, z, e, PA, n, RA, DEC
101, 1.0, 1.2, 1.5, 1.8, 4.5, 10.5
104, 1.5, 1.8, 2.2, 3.1, 10.8, 5.6
Well you can use np.in1d with swapped places for the first columns of the two arrays/tables, such that we would have two masks to index into the arrays for selection. Then, simply stack the results -
mask1 = np.in1d(a[:,0], b[:,0])
mask2 = np.in1d(b[:,0], a[:,0])
out = np.column_stack(( a[mask1], b[mask2,1:] ))
Sample run -
In [44]: a
Out[44]:
array([[ 101. , 1. , 1.2, 1.5, 1.8],
[ 104. , 1.5, 1.8, 2.2, 3.1],
[ 105. , 1.4, 2. , 3.3, 2.8]])
In [45]: b
Out[45]:
array([[ 101. , 4.5, 10.5],
[ 102. , 30.5, 3.3],
[ 103. , 60.1, 40.6],
[ 104. , 10.8, 5.6],
[ 107. , 90.1, 55.5]])
In [46]: mask1 = np.in1d(a[:,0], b[:,0])
In [47]: mask2 = np.in1d(b[:,0], a[:,0])
In [48]: np.column_stack(( a[mask1], b[mask2,1:] ))
Out[48]:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
Assuming your second table, table B, is sorted, you can do a sorted lookup, then check if the indexed element is actually found:
idx = np.searchsorted(B[:-1, 0], A[:, 0])
found = A[:, 0] == B[idx, 0]
np.hstack((A[found, :], B[idx[found], 1:]))
Result:
array([[ 101. , 1. , 1.2, 1.5, 1.8, 4.5, 10.5],
[ 104. , 1.5, 1.8, 2.2, 3.1, 10.8, 5.6]])
The last element of the B indices is excluded to simplify the case where the item in A is beyond the final element in B. Without it, it is possible that the returned index would be greater than the length of B and cause indexing errors.
Use pandas:
import pandas as pd
id1 = pd.read_csv('id1.txt')
id2 = pd.read_csv('id2.txt')
df = id1.merge(id2.sort_values(by='ID2').drop_duplicates('ID2').rename(columns={'ID2':'ID1'}))
print(df)
Produces:
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
With large datasets you may need to do things in place:
# [Optional] sort locations and drop duplicates
id2.sort_values(by='ID2', inplace=True)
id2.drop_duplicates('ID2', inplace=True)
# columns that you are merging must have the same name
id2.rename(columns={'ID2':'ID1'}, inplace=True)
# perform the merge
df = id1.merge(id2)
Without drop_duplicates you get one row for each item:
df = id1.merge(id2.rename(columns={'ID2':'ID1'}))
print(id2)
print(df)
Giving:
ID2 RA DEC
0 101 4.5 10.5
1 107 90.1 55.5
2 102 30.5 3.3
3 103 60.1 40.6
4 104 10.8 5.6
5 103 60.1 40.6
6 104 10.9 5.6
ID1 z e PA n RA DEC
0 101 1.0 1.2 1.5 1.8 4.5 10.5
1 104 1.5 1.8 2.2 3.1 10.8 5.6
2 104 1.5 1.8 2.2 3.1 10.9 5.6
Note that this solution preserves the different types for the columns:
>>> id1.ID1.dtype
dtype('int64')
>>> id1[' z'].dtype
dtype('float64')
Since you have spaces after the comma in the header row those spaces became part of the column name, hence need to refer to the second column using id1[' z']. By modifying the read statement, this is no longer necessary:
>>> id1 = pd.read_csv('id1.txt', skipinitialspace=True)
>>> id1.z.dtype
dtype('float64')