Pandas: plot a dataframe containing a column of tuples - python

I have a non-standard CSV file that looks something like this:
x,y
1,"(5, 27, 4)"
2,"(3, 1, 6, 2)"
3,"(4, 5)"
Using pd.read_csv() leads to something that's not all that useful, because the tuples are not parsed. There are existing answers that address this (1, 2), but because these tuples have heterogeneous lengths, those answers aren't entirely applicable to the problem I'm having.
What I'd like to do is plot x vs y using the pandas plotting routines. The naive approach leads to an error because the tuples are stored as strings:
>>> # df = pd.read_csv('data.csv')
>>> df = pd.DataFrame({'x': [1, 2, 3],
...                    'y': ["(5, 27, 4)", "(3, 1, 6, 2)", "(4, 5)"]})
>>> df.plot.scatter('x', 'y')
[...]
ValueError: scatter requires y column to be numeric
The result I'd hope for is something like this:
import numpy as np
import matplotlib.pyplot as plt
for x, y in zip(df['x'], df['y']):
    y = eval(y)
    plt.scatter(x * np.ones_like(y), y, color='blue')
Is there a straightforward way to create this plot directly from Pandas, by transforming the dataframe and using df.plot.scatter() (and preferably without using eval())?

You could explode the df and plot
In [3128]: import ast
In [3129]: s = df.y.map(ast.literal_eval)
In [3130]: dff = pd.DataFrame({'x': df.x.repeat(s.str.len()).values,
      ...:                     'y': np.concatenate(s.values)})
In [3131]: dff
Out[3131]:
x y
0 1 5
1 1 27
2 1 4
3 2 3
4 2 1
5 2 6
6 2 2
7 3 4
8 3 5
And, plot
dff.plot.scatter('x', 'y')

You can use the .str accessor to extract integers, specifically .str.extractall:
# Index by 'x' to retain its values once we extract from 'y'
df = df.set_index('x')
# Extract integers from 'y'
df = df['y'].str.extractall(r'(\d+)')[0].astype('int64')
# Rename and reset the index (remove 'match' level, get 'x' as column)
df = df.rename('y').reset_index(level='match', drop=True).reset_index()
If you have floats instead of ints, just modify the regex and astype as appropriate.
This gives a DataFrame that looks like:
x y
0 1 5
1 1 27
2 1 4
3 2 3
4 2 1
5 2 6
6 2 2
7 3 4
8 3 5
And from there df.plot.scatter('x', 'y') should produce the expected plot.

Related

Idiomatic way to create pandas dataframe as concatenation of function of another's rows

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
Also I have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose but nothing seemed to work.
x y
0 2 1
1 3 4
0 6 4
1 5 9
You can use DataFrame.agg to calculate the prod and sum, numpy.ndarray.reshape to flatten the result, and df.pow(2)/np.square to calculate the squares.
out = pd.DataFrame({'x': df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(df).to_numpy().reshape(-1)})
out
x y
0 2 1
1 3 4
2 6 4
3 5 9
You should avoid iterating over rows (How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = df.assign(product=df.a*df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then, selecting just the new columns (df still also contains a and b):
out = [[[p, s], [asq, bsq]] for p, s, asq, bsq
       in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]
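Since the question treats f as a black box, another hedged sketch is to lean on GroupBy.apply, which concatenates the per-row frames automatically when each "group" is a single row:

```python
import pandas as pd

input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))

def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))

# Grouping on the index makes each group a one-row frame; g.iloc[0] hands
# f the row as a Series, and GroupBy.apply stitches the results together
output_df = (input_df.groupby(level=0, group_keys=False)
                     .apply(lambda g: f(g.iloc[0]))
                     .reset_index(drop=True))
```

This keeps f fully general at the cost of a slightly odd-looking groupby.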

Hue two pandas series

I have two pandas series for which I want to compare them visually by plotting them on top of each other. I already tried the following
>>> s1 = pd.Series([1,2,3,4,5])
>>> s2 = pd.Series([3,3,3,3,3])
>>> df = pd.concat([s1, s2], axis=1)
>>> sns.stripplot(data = df)
which yields the following picture:
Now, I am aware of the hue keyword of sns.stripplot but trying to apply it requires me to use the keywords x and y. I already tried to transform my data into a different dataframe like that
>>> df = pd.concat([pd.DataFrame({'data':s1, 'type':'s1'}), pd.DataFrame({'data':s2, 'type':'s2'})])
so I can "hue over" type; but even then I have no idea what to put for the keyword x (assuming y = 'data'). Ignoring the keyword x like that
>>> sns.stripplot(y='data', data=df, hue='type')
fails to hue anything:
seaborn generally works best with long-form data, so you might need to rearrange your dataframe slightly. The hue keyword is expecting a column, so we'll use .melt() to get one.
long_form = df.melt()
long_form['X'] = 1
sns.stripplot(data=long_form, x='X', y='value', hue='variable')
Will give you a plot that roughly reflects your requirements:
When we call .melt(), we change the frame from having multiple columns of values to having a single column of values, with a "variable" column to identify which of our original columns they came from. We add an 'X' column because stripplot needs both x and hue to work properly in this case. Our long_form dataframe, then, looks like this:
variable value X
0 0 1 1
1 0 2 1
2 0 3 1
3 0 4 1
4 0 5 1
5 1 3 1
6 1 3 1
7 1 3 1
8 1 3 1
9 1 3 1
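Alternatively, the melted 'variable' column can itself go on x, so no dummy 'X' column is needed at all — a sketch of that variant, assuming the same s1/s2 as above:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import pandas as pd
import seaborn as sns

s1 = pd.Series([1, 2, 3, 4, 5])
s2 = pd.Series([3, 3, 3, 3, 3])
long_form = pd.concat([s1, s2], axis=1).melt()  # columns: 'variable', 'value'

# One strip per original series, coloured by hue, with no dummy column
ax = sns.stripplot(data=long_form, x='variable', y='value', hue='variable')
```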

Having columns with subscripts/indices in pandas

Let's say that I have some data from a file where some columns are "of the same kind", only with different subscripts of some mathematical variable, say x:
n A B C x[0] x[1] x[2]
0 1 2 3 4 5 6
1 2 3 4 5 6 7
Is there some way I can load this into a pandas dataframe df and somehow treat the three x-columns as an indexable, array-like entity (I'm new to pandas)? I believe it would be convenient, because I could do operations on the data-series contained in x such as sum(df.x).
Kind regards.
EDIT:
Admittedly, my original post was not clear enough. I'm not just interested in getting the sum of three columns. That was just an example. I'm looking for a generally applicable abstraction that I hope is built into pandas.
I'd like to have multiple columns accessible through (sub-)indices of one entity, e.g. df.x[0], such that I (or any other user of the data) can do whichever operation he/she wants (sum/max/min/avg/standard deviation, you name it). You can consider the x's as an ensemble of time-dependent measurements if you like.
Kind regards.
Consider that you define your dataframe like this:
df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]], columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])
Then with
x = ['x0', 'x1', 'x2']
you can use the following notation, which allows a quite general definition of x:
>>> df[x].sum(axis=1)
0 15
1 18
dtype: int64
Look for columns whose names start with 'x' and perform the operations you need:
column_num = [col for col in df.columns if col.startswith('x')]
df[column_num].sum(axis=1)
I'll give you another answer which differs from your initial data structure in exchange for addressing the values of the dataframe by df.x[0] etc.
Consider that you have defined your dataframe like this:
>>> dv = pd.DataFrame(np.random.randint(10, size=20),
...                   index=pd.MultiIndex.from_product([range(4), range(5)]),
...                   columns=['x'])
>>> dv
x
0 0 8
1 3
2 4
3 6
4 1
1 0 8
1 9
2 1
3 8
4 8
[...]
Then you can exactly do this
dv.x[1]
0 8
1 9
2 1
3 8
4 8
Name: x, dtype: int64
which is your desired notation. Requires some changes to your initial set-up but will give you exactly what you want.
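A middle ground that keeps the original row layout is a column MultiIndex, which makes the x columns sub-indexable as df['x'][0], df['x'][1], etc. — a sketch built from the example data above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                   [2, 3, 4, 5, 6, 7]],
                  columns=['A', 'B', 'C', 'x0', 'x1', 'x2'])

# Split names like 'x0' into a two-level header ('x', 0);
# plain columns get an empty second level
df.columns = pd.MultiIndex.from_tuples(
    [(c[0], int(c[1:])) if c.startswith('x') else (c, '') for c in df.columns])

x_block = df['x']            # DataFrame with sub-columns 0, 1, 2
first = df['x'][0]           # the original x0 column
row_sums = df['x'].sum(axis=1)  # row-wise sum over all x columns
```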

How to parse out array from column inside a dataframe?

I have a data frame that looks like this:
Index Values Digits
1 [1.0,0.13,0.52...] 3
2 [1.0,0.13,0.32...] 3
3 [1.0,0.31,0.12...] 1
4 [1.0,0.30,0.20...] 2
5 [1.0,0.30,0.20...] 3
My output should be:
Index Values Digits
1 [0.33,0.04,0.17...] 3
2 [0.33,0.04,0.11...] 3
3 [0.33,0.10,0.40...] 1
4 [0.33,0.10,0.07...] 2
5 [0.33,0.10,0.07...] 3
I believe that the Values column has an np.array within the cells? Is this technically an array?
I wish to parse out the Values column and divide all values within the array by 3.
My attempts have stopped at the parsing out of the values:
a = df(df['Values'].values.tolist())
IIUC, apply the list calculation
df.Values.apply(lambda x : [y/3 for y in x])
Out[1095]:
0 [0.3333333333333333, 0.043333333333333335, 0.1...
1 [0.3333333333333333, 0.043333333333333335, 0.1...
Name: Values, dtype: object
#df.Values=df.Values.apply(lambda x : [y/3 for y in x])
Created dataframe:
import pandas as pd
d = {'col1': [[1,10], [2,20]], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
created function:
def divide_by_3(lst):
    output = []
    for i in lst:
        output.append(i / 3.0)
    return output
apply function:
df.col1.apply(divide_by_3)
result:
0 [0.333333333333, 3.33333333333]
1 [0.666666666667, 6.66666666667]
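If the cells really do hold numpy arrays (or lists) of numbers, the inner Python loop can also be replaced by numpy broadcasting — a sketch, with hypothetical data shaped like the question's:

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question's, with array-valued cells
df = pd.DataFrame({'Values': [np.array([1.0, 0.13, 0.52]),
                              np.array([1.0, 0.31, 0.12])],
                   'Digits': [3, 1]})

# np.asarray + broadcasting divides every element of each cell at once
df['Values'] = df['Values'].map(lambda v: np.asarray(v) / 3)
```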

Interpolating time series in Pandas using Cubic spline

I would like to fill gaps in a column in my DataFrame using a cubic spline. If I were to export to a list then I could use scipy's interp1d function and apply this to the missing values.
Is there a way to use this function inside pandas?
Most numpy/scipy functions require the arguments only to be "array_like", and interp1d is no exception. Fortunately both Series and DataFrame are "array_like", so we don't need to leave pandas:
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.DataFrame([np.arange(1, 6), [1, 8, 27, np.nan, 125]]).T
In [5]: df
Out[5]:
0 1
0 1 1
1 2 8
2 3 27
3 4 NaN
4 5 125
df2 = df.dropna() # interpolate on the non nan
f = interp1d(df2[0], df2[1], kind='cubic')
#f(4) == array(63.9999999999992)
df[1] = df[0].apply(f)
In [10]: df
Out[10]:
0 1
0 1 1
1 2 8
2 3 27
3 4 64
4 5 125
Note: I couldn't think of an example off the top of my head of passing a DataFrame as the second argument (y)... but this ought to work too.
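pandas can also drive scipy's spline fitting itself through Series.interpolate, which avoids the manual dropna/interp1d/apply round-trip — a sketch, assuming scipy is installed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(1, 6), [1, 8, 27, np.nan, 125]]).T

# With the x values on the index, method='cubic' hands the spline
# fitting to scipy under the hood and fills only the NaN gaps
s = df.set_index(0)[1].interpolate(method='cubic')
```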
