I would like to fill gaps in a column in my DataFrame using a cubic spline. If I were to export to a list then I could use SciPy's interp1d function and apply this to the missing values.
Is there a way to use this function inside pandas?
Most numpy/scipy functions only require their arguments to be "array_like", and interp1d is no exception. Fortunately both Series and DataFrame are "array_like", so we don't need to leave pandas:
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.DataFrame([np.arange(1, 6), [1, 8, 27, np.nan, 125]]).T
In [5]: df
Out[5]:
0 1
0 1 1
1 2 8
2 3 27
3 4 NaN
4 5 125
df2 = df.dropna() # interpolate on the non nan
f = interp1d(df2[0], df2[1], kind='cubic')
#f(4) == array(63.9999999999992)
df[1] = df[0].apply(f)
In [10]: df
Out[10]:
0 1
0 1 1
1 2 8
2 3 27
3 4 64
4 5 125
Note: I couldn't think of an example off the top of my head to pass in a DataFrame into the second argument (y)... but this ought to work too.
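The example above overwrites the whole column with interpolated values. A minimal sketch of filling only the gaps, assuming the columns are named "x" and "y" (names chosen here for illustration, the original uses 0 and 1):

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Same data as above, with hypothetical column names "x" and "y".
df = pd.DataFrame({"x": np.arange(1, 6), "y": [1, 8, 27, np.nan, 125]})

# Build the spline from the rows where y is known.
known = df.dropna()
f = interp1d(known["x"], known["y"], kind="cubic")

# Evaluate the spline only where y is missing; known values stay untouched.
mask = df["y"].isna()
df.loc[mask, "y"] = f(df.loc[mask, "x"])
print(df)
```

Note that interp1d raises an error by default if asked for an x outside the range of the known points, so gaps at the very start or end of the column need separate handling.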
I'm trying to get the correlation between a single column and the rest of the numerical columns of the dataframe, but I'm stuck.
I'm trying with this:
corr = IM['imdb_score'].corr(IM)
But I get the error
operands could not be broadcast together with shapes
which I assume is because I'm trying to find a correlation between a vector (my imdb_score column) with the dataframe of several columns.
How can this be fixed?
The most efficient method is to use corrwith.
Example:
df.corrwith(df['A'])
Setup of example data:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(5, 5)), columns=list('ABCDE'))
# A B C D E
# 0 7 2 0 0 0
# 1 4 4 1 7 2
# 2 6 2 0 6 6
# 3 9 8 0 2 1
# 4 6 0 9 7 7
output:
A 1.000000
B 0.526317
C -0.209734
D -0.720400
E -0.326986
dtype: float64
I think you can just use .corr, which returns the correlations between all columns, and then select just the column you are interested in.
So, something like
IM.corr()['imdb_score']
should work.
Rather than calculating all correlations and keeping the ones of interest, it can be computationally more efficient to compute the subset of interesting correlations:
import pandas as pd
df = pd.DataFrame()
df['a'] = range(10)
df['b'] = range(10)
df['c'] = range(10)
pd.DataFrame([[c, df['a'].corr(df[c])] for c in df.columns if c!='a'], columns=['var', 'corr'])
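As a quick sanity check, the corrwith approach and the "full matrix, then select a column" approach should give the same numbers. A small sketch on random data (the column names and the seed are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# Reproducible random data with hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=list("ABCD"))

# Correlate every other column with A directly.
via_corrwith = df.drop(columns="A").corrwith(df["A"])

# Compute the full correlation matrix, then keep only the A column.
via_matrix = df.corr()["A"].drop("A")

print(via_corrwith)
print(via_matrix)
```

Dropping "A" beforehand just avoids the trivial self-correlation of 1.0 in the result.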
I have a dataframe with 2 columns in python. I want to enter the dataframe with one column and obtain the value of the 2nd column. Sometimes the values can be exact, but they can also be values between 2 rows.
I have this example dataframe:
x y
0 0 0
1 10 100
2 20 200
I want to find the value of y if I check the dataframe with the value of x. For example, if I enter in the dataframe with the value of 10, I obtain the value of 100. But if I check with 15, I need to interpolate between the two values of y. Is there any function to do it?
numpy.interp is probably the simplest way here for linear interpolation:
def interpolate(xval, df, xcol, ycol):
    # Return the y value at xval by linear interpolation, where df is a dataframe,
    # df[xcol] holds the x coordinates and df[ycol] the y coordinates.
    # df[xcol] is expected to be sorted in increasing order.
    return float(np.interp(xval, df[xcol], df[ycol]))
With your example data it gives:
>>> interpolate(10, df, 'x', 'y')
100.0
>>> interpolate(15, df, 'x', 'y')
150.0
You can even directly do:
>>> np.interp([10, 15], df.x, df.y)
array([100., 150.])
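One thing to keep in mind with np.interp: for x values outside the range of the data it does not extrapolate, it returns the first or last y value. The left and right parameters let you choose a different fill value. A short sketch with the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [0, 10, 20], "y": [0, 100, 200]})

# Inside the range: ordinary linear interpolation.
inside = np.interp(15, df["x"], df["y"])

# Outside the range: by default the result is clamped to the last y value.
clamped = np.interp(25, df["x"], df["y"])

# Ask for NaN instead, so out-of-range lookups are easy to spot.
flagged = np.interp(25, df["x"], df["y"], left=np.nan, right=np.nan)

print(inside, clamped, flagged)
```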
You can have a look at the interpolate method provided in Pandas module (doc). But I'm not sure that answers your question.
You can do it with interp1d from the scipy.interpolate module. Several types of interpolation are possible: ‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’... You can find the list on the (doc page).
The interpolation process can be summarised in three steps:
Split your data into missing and non-missing values. I use isna (doc)
Create the interpolation function using the data without missing values. I use interp1d (doc)
Interpolate (predict the missing values). Just call the function created in step 2 on the missing data (column x).
Here is the code:
# Import modules
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
# Data
df = pd.DataFrame(
[[0, 0],
[10, 100],
[11, np.nan],
[15, np.nan],
[17, np.nan],
[20, 200]],
columns=["x", "y"])
print(df)
# x y
# 0 0 0.0
# 1 10 100.0
# 2 11 NaN
# 3 15 NaN
# 4 17 NaN
# 5 20 200.0
# Split data in training (not NaN values) and missing (NaN values)
missing = df.isna().any(axis=1)
df_training = df[~missing]
df_missing = df[missing].reset_index(drop=True)
# Create function that interpolate missing value (from our training values)
f = interp1d(df_training.x, df_training.y)
# Interpolate the missing values
df_missing["y"] = f(df_missing.x)
print(df_missing)
# x y
# 0 11 110.0
# 1 15 150.0
# 2 17 170.0
You can find other work on the topic at this link.
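For the linear case, the pandas interpolate method mentioned earlier can probably do the same job without SciPy. A minimal sketch, assuming x is moved into the index so that method="index" interpolates against the actual x values rather than the row positions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": [0, 10, 11, 15, 17, 20],
     "y": [0, 100, np.nan, np.nan, np.nan, 200]})

# With x as the index, method="index" uses the x values as coordinates.
filled = df.set_index("x")["y"].interpolate(method="index")
print(filled)
```

With the default method="linear", pandas ignores the index and treats the rows as equally spaced, which would give different results here because the x values are not evenly spaced.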
I have a non-standard CSV file that looks something like this:
x,y
1,"(5, 27, 4)"
2,"(3, 1, 6, 2)"
3,"(4, 5)"
Using pd.read_csv() leads to something that's not all that useful, because the tuples are not parsed. There are existing answers that address this (1, 2), but because these tuples have heterogeneous lengths, those answers aren't entirely useful for the problem I'm having.
What I'd like to do is plot x vs y using the pandas plotting routines. The naive approach leads to an error because the tuples are stored as strings:
>>> # df = pd.read_csv('data.csv')
>>> df = pd.DataFrame({'x': [1, 2, 3],
'y': ["(5, 27, 4)","(3, 1, 6, 2)","(4, 5)"]})
>>> df.plot.scatter('x', 'y')
[...]
ValueError: scatter requires y column to be numeric
The result I'd hope for is something like this:
import numpy as np
import matplotlib.pyplot as plt
for x, y in zip(df['x'], df['y']):
    y = eval(y)
    plt.scatter(x * np.ones_like(y), y, color='blue')
Is there a straightforward way to create this plot directly from Pandas, by transforming the dataframe and using df.plot.scatter() (and preferably without using eval())?
You could explode the df and plot (ast is in the standard library, so run import ast first):
In [3129]: s = df.y.map(ast.literal_eval)
In [3130]: dff = pd.DataFrame({'x': df.x.repeat(s.str.len()).values,
'y': np.concatenate(s.values)})
In [3131]: dff
Out[3131]:
x y
0 1 5
1 1 27
2 1 4
3 2 3
4 2 1
5 2 6
6 2 2
7 3 4
8 3 5
And, plot
dff.plot.scatter('x', 'y')
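On pandas 0.25 or later, the same reshaping can likely be written with the built-in DataFrame.explode. A sketch under that version assumption (the name long_df is made up for this example):

```python
import ast
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3],
                   'y': ["(5, 27, 4)", "(3, 1, 6, 2)", "(4, 5)"]})

long_df = (
    df.assign(y=df['y'].map(ast.literal_eval))  # parse strings into tuples, without eval()
      .explode('y')                             # one row per tuple element, x is repeated
      .astype({'y': 'int64'})                   # explode leaves y as object dtype
      .reset_index(drop=True)
)
print(long_df)
```

From there long_df.plot.scatter('x', 'y') should work as in the answer above, since y is now numeric.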
You can use the .str accessor to extract integers, specifically .str.extractall:
# Index by 'x' to retain its values once we extract from 'y'
df = df.set_index('x')
# Extract integers from 'y'
df = df['y'].str.extractall(r'(\d+)')[0].astype('int64')
# Rename and reset the index (remove 'match' level, get 'x' as column)
df = df.rename('y').reset_index(level='match', drop=True).reset_index()
If you have floats instead of ints, just modify the regex and astype as appropriate.
This gives a DataFrame that looks like:
x y
0 1 5
1 1 27
2 1 4
3 2 3
4 2 1
5 2 6
6 2 2
7 3 4
8 3 5
And from there df.plot.scatter('x', 'y') should produce the expected plot.
I am playing around with data and need to look at differences across columns (as well as rows) in a fairly large dataframe.
The easiest way for rows is clearly the diff() method, but I cannot find the equivalent for columns?
My current solution to obtain a dataframe with the columns differenced is
df.transpose().diff().transpose()
Is there a more efficient alternative? Or is this such odd usage of pandas that this was just never requested/ considered useful? :)
Thanks,
Pandas DataFrames are excellent for manipulating table-like data whose columns have different dtypes.
If subtracting across columns and rows both make sense, then it means all the values are the same kind of quantity. That might be an indication that you should be using a NumPy array instead of a Pandas DataFrame.
In any case, you can use arr = df.values to extract a NumPy array of the underlying data from the DataFrame. If all the columns share the same dtype, then the NumPy array will have the same dtype. (When the columns have different dtypes, df.values is upcast to a common dtype, which is often object.)
Then you can compute the differences along rows or columns using np.diff(arr, axis=...):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(12).reshape(3,4), columns=list('ABCD'))
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
np.diff(df.values, axis=0) # difference of the rows
# array([[4, 4, 4, 4],
# [4, 4, 4, 4]])
np.diff(df.values, axis=1) # difference of the columns
# array([[1, 1, 1],
# [1, 1, 1],
# [1, 1, 1]])
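np.diff returns a plain array, so the row and column labels are lost. If you want them back, one possible way is to wrap the result in a DataFrame again. A small sketch (the choice to label each difference with the right-hand column is just one convention):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))

# Difference across columns on the raw array, then restore the labels.
# The result has one column fewer, so the first column label is dropped.
col_diff = pd.DataFrame(
    np.diff(df.to_numpy(), axis=1),
    index=df.index,
    columns=df.columns[1:],
)
print(col_diff)
```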
Just difference the columns, e.g.
df['new_col'] = df['a'] - df['b']
For multiple columns, I believe unutbu's answer is the best (although it returns a np.ndarray object instead of a dataframe, it is still faster even after converting the result back to a dataframe).
# Create a large dataframe (randn needs integer dimensions, so 1e6 must be converted).
df = pd.DataFrame(np.random.randn(int(1e6), 100))
%%timeit
np.diff(df.values, axis=1)
1 loops, best of 3: 450 ms per loop
%%timeit
df - df.shift(axis=1)
1 loops, best of 3: 727 ms per loop
%%timeit
df.T.diff().T
1 loops, best of 3: 1.52 s per loop
Use the axis parameter in diff:
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list('ABCD'))
# A B C D
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
df.diff(axis=1) # subtracting column wise
# A B C D
# 0 NaN 1 1 1
# 1 NaN 1 1 1
# 2 NaN 1 1 1
df.diff() # subtracting row wise
# A B C D
# 0 NaN NaN NaN NaN
# 1 4 4 4 4
# 2 4 4 4 4
Imagine I have a Series that looks like this:
Out[64]:
2 0
3 1
80 1
83 1
84 2
85 2
How can I append an item at the very beginning of this Series? The native pandas.Series.append function only appends at the end.
Thanks a lot
There is a pandas.concat function...
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
See the Merge, Join, and Concatenate documentation.
Using concat, or append, the resulting series will have duplicate indices:
for concat():
import pandas as pd
a = pd.Series([2,3,4])
pd.concat([pd.Series([1]), a])
Out[143]:
0 1
0 2
1 3
2 4
and for append() (deprecated in pandas 1.4 and removed in 2.0, so this only runs on older versions):
import pandas as pd
a = pd.Series([2,3,4])
a.append(pd.Series([1]))
Out[149]:
0 2
1 3
2 4
0 1
This could be a problem later on, since a[0] (if you assign the result to a) will return two values in either case.
My solutions in this case are:
import pandas as pd
a = pd.Series([2,3,4])
b = [1]
b[1:] = a
pd.Series(b)
Out[199]:
0 1
1 2
2 3
3 4
or, by reindexing with concat():
import pandas as pd
a = pd.Series([2,3,4])
a.index = a.index + 1
pd.concat([pd.Series([1]), a])
Out[208]:
0 1
1 2
2 3
3 4
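Another option that avoids shifting the index by hand is to let concat renumber everything with ignore_index=True. If the existing labels matter (as in the question, where the index is 2, 3, 80, ...), you can instead give the new item its own label. A sketch of both, where the label 1 and the value 9 are made-up examples:

```python
import pandas as pd

a = pd.Series([2, 3, 4])

# Option 1: discard the old labels and renumber from 0.
renumbered = pd.concat([pd.Series([1]), a], ignore_index=True)

# Option 2: keep the existing labels and pick an unused label for the new item.
s = pd.Series([0, 1, 1], index=[2, 3, 80])
prepended = pd.concat([pd.Series([9], index=[1]), s])

print(renumbered)
print(prepended)
```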
In case you need to prepend a single value from a different Series b, say its last value, this is what works for me:
import pandas as pd
a = pd.Series([2, 3, 4])
b = pd.Series([0, 1])
pd.concat([b[-1:], a])
Similarly, you can call append on a Series with a list or tuple of Series (so long as you're using pandas version 0.13 or greater; append was deprecated in 1.4 and removed in 2.0, so use concat on current versions):
import pandas as pd
a = pd.Series([2,3,4])
pd.Series([1]).append([a])