I have three pandas Series, called Col_data, C_PV_data and C_elec_data. Each one has these values:
Col_data:
0 625814.205486
1 782267.756857
2 938721.308229
Name: 7, dtype: object
C_PV_data:
0 2039032.206909
1 2548790.258636
2 3058548.310363
Name: 3, dtype: object
C_elec_data:
0 1337523.743009
1 1671904.678761
2 2006285.614513
Name: 0, dtype: object
I would like to aggregate them into a single DataFrame, export that DataFrame to a .xlsx file, and have each column named after its variable. For instance:
Col_data         C_PV_data        C_elec_data
625814.205486    2039032.206909   1337523.743009
782267.756857    2548790.258636   1671904.678761
938721.308229    3058548.310363   2006285.614513
Finally, I would like to represent each column with a graph in which the central value is a line, with two dots over that line for the lowest and highest values.
Sure, here you go:
Init
import pandas as pd

Col_data = pd.Series([
    625814.205486,
    782267.756857,
    938721.308229])
C_PV_data = pd.Series([
    2039032.206909,
    2548790.258636,
    3058548.310363])
C_elec_data = pd.Series([
    1337523.743009,
    1671904.678761,
    2006285.614513])
As a df
df = pd.concat(
    [Col_data, C_PV_data, C_elec_data], axis=1,
    keys=['Col_data', 'C_PV_data', 'C_elec_data'])
>>> df
Col_data C_PV_data C_elec_data
0 625814.205486 2.039032e+06 1.337524e+06
1 782267.756857 2.548790e+06 1.671905e+06
2 938721.308229 3.058548e+06 2.006286e+06
Side note: I always dislike repeats. The following alternative to the above is DRY (Don't Repeat Yourself), though perhaps less clear:
keys = ['Col_data', 'C_PV_data', 'C_elec_data']
d = locals() # just for DRY...
df = pd.concat([d[k] for k in keys], axis=1, keys=keys)
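If reaching into locals() feels too magic, an explicit dict avoids the introspection entirely; here is a minimal sketch, assuming the three Series defined above (pd.DataFrame aligns them on the index just like concat does):

import pandas as pd

# explicit name -> Series mapping; no locals() lookup needed
series = {
    'Col_data': Col_data,
    'C_PV_data': C_PV_data,
    'C_elec_data': C_elec_data,
}
df = pd.DataFrame(series)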
To xlsx
Assuming you have openpyxl installed:
df.to_excel('foo.xlsx', index=False)
Box plot
Edit: (and save as PNG)
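# Repeating the middle row collapses the box (IQR = 0), so the median
# draws as a line at the central value and the min/max fall outside the
# 1.5*IQR whiskers, rendering as two flier dots -- the requested picture.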
ax = df.loc[[0,1,1,1,2]].plot.box()
ax.figure.savefig('costs.png')
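If you would rather draw the "line plus two dots" picture directly instead of coaxing a box plot into that shape, a minimal matplotlib sketch (assuming the df built above, with the central value in row 1) could look like this:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for i, col in enumerate(df.columns):
    ax.hlines(df[col].iloc[1], i - 0.2, i + 0.2)          # central value as a short line
    ax.plot([i, i], [df[col].min(), df[col].max()], 'o')  # lowest and highest as dots
ax.set_xticks(range(len(df.columns)))
ax.set_xticklabels(df.columns)
fig.savefig('costs_custom.png')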
I used this code to create a product id column in a dataframe:
df = df.assign(id=(df['PROD_NAME']).astype('category').cat.codes)
This code works fine with pandas; the line creates an id for each PROD_NAME value.
My issue is that I want to use Dask, which allows me to manage several clients and handle memory issues.
I get the following error message:
NotImplementedError: `df.column.cat.codes` with unknown categories is not supported. Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories
How can I create this new column, then?
This is an old post, but being the first that comes up when searching for this error, it could use an answer:
TL;DR:
Run this sequence on your Dask dataframe:
ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
out_df = ddf.compute()
Per Dask's documentation, you can convert categorical data types in Dask between "known categoricals" and "unknown categoricals". In this situation, it needs "known" categories, because it will need to pull category mapping from column metadata.
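The error message also offers df.categorize() as an option; that makes one pass over the data and turns every requested column's categories "known" in a single step. A minimal sketch of that route, on a small hypothetical frame:

import pandas as pd
from dask import dataframe as dd

pdf = pd.DataFrame({"PROD_NAME": ["A", "B", "D"]})
ddf = dd.from_pandas(pdf, npartitions=1)
ddf["PROD_NAME"] = ddf["PROD_NAME"].astype("category")  # Dask leaves these categories "unknown"
ddf = ddf.categorize(columns=["PROD_NAME"])             # scans the data; categories now "known"
ddf = ddf.assign(id=ddf["PROD_NAME"].cat.codes)
print(ddf.compute())

The session below walks through the as_known()/as_unknown() mechanics step by step.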
import pandas as pd
from dask import dataframe as dd
# Show the pandas workflow
>>> d = pd.Series(['A','B','D'], dtype='category').to_frame(name="PROD_NAME")
>>> d = d.assign(id=(d["PROD_NAME"]).astype('category').cat.codes)
>>> d
PROD_NAME id
0 A 0
1 B 1
2 D 2
# Now, in Dask:
>>> ddf = dd.from_pandas(d, npartitions=1)
>>> ddf
Dask DataFrame Structure:
PROD_NAME
npartitions=1
0 category[known]
2 ...
Dask Name: from_pandas, 1 tasks
# The conversion to Dask dataframe already created a "known categorical", but
# let's convert it to "unknown" (notice the .compute() is not used):
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_unknown()
>>> ddf
Dask DataFrame Structure:
PROD_NAME
npartitions=1
0 category[unknown]
2 ...
Dask Name: assign, 3 tasks
# Now, let's convert it back to "known", then create the new column using .assign()
# and call .compute() to create output dataframe:
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
>>> ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
>>> out_df = ddf.compute()
>>> out_df
PROD_NAME id
0 A 0
1 B 1
2 D 2
I have a DataFrame, sparkline, whose 'prices' column holds single-element lists, and I want to pull each value out into its own column. My current solution is below:
prices_real = []
for item in sparkline['prices']:
    prices_real.append(item[0])
sparkline['prices_real'] = prices_real
But I'm wondering if there is an easier way or a method I don't know about?
There are two aspects to your problem:
1. Extracting the first (and only) element of each list within your series.
2. Converting your series to numeric.
So you can use the str accessor followed by pd.to_numeric:
import pandas as pd

df = pd.DataFrame({'x': [['0.12312'], ['-5.32454'], ['0.563412'], ['-3.918324']]})
df['x'] = pd.to_numeric(df['x'].str[0])  # .str[0] extracts the first element of each list
print(df, df.dtypes, sep='\n'*2)
x
0 0.123120
1 -5.324540
2 0.563412
3 -3.918324
x float64
dtype: object
You can use pandas.Series.apply:
import pandas as pd

sparkline = pd.DataFrame({"prices": [[1], [4]]})
sparkline
# prices
# 0 [1]
# 1 [4]
sparkline["prices"] = sparkline["prices"].apply(lambda x: x[0])
sparkline
# prices
# 0 1
# 1 4
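For completeness, a plain list comprehension (essentially a tidier version of the loop in the question) is also a reasonable option; a small sketch on the same hypothetical frame:

import pandas as pd

sparkline = pd.DataFrame({"prices": [[1], [4]]})
sparkline["prices_real"] = [p[0] for p in sparkline["prices"]]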
I have a DataFrame that consists of one column ('Vals'), where each entry is a dictionary. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
Vals
0 {u'TradeId': u'JP32767', u'TradeSourceNam...
1 {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2 {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
u'Measures': [{u'AssetName': u'Ie0',
u'DefinitionId': u'6dbb',
u'MeasureValues': [{u'Amount': -18.64}],
u'ReportingCurrency': u'USD',
u'ValuationId': u'669bb'}],
u'SnapshotId': 12739,
u'TradeId': u'17304M',
u'TradeLegId': u'31827',
u'TradeSourceName': u'xxxeee',
u'TradeVersion': 1}
How can I split the column and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
try this:

import pandas as pd

l = []
for idx, row in df['Vals'].items():  # .iteritems() was removed in pandas 2.0; .items() replaces it
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
result = pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using your sample row twice to illustrate the iteration):
new_df = pd.DataFrame()
for idx, data in fff.iterrows():
    # .ix was removed from pandas; .iloc does the positional lookup here
    d = {'TradeId': data.iloc[0]['TradeId']}
    d.update(data.iloc[0]['Measures'][0]['MeasureValues'][0])
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
Amount TradeId
0 -18.64 17304M
0 -18.64 17304M
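On a reasonably recent pandas, pd.json_normalize can flatten this kind of nesting in one call; a sketch using a hypothetical record shaped like the sample above:

import pandas as pd

record = {
    "TradeId": "17304M",
    "Measures": [{"MeasureValues": [{"Amount": -18.64}]}],
}
# record_path descends into the nested lists; meta keeps top-level keys
flat = pd.json_normalize([record, record],
                         record_path=["Measures", "MeasureValues"],
                         meta=["TradeId"])
print(flat)
#    Amount TradeId
# 0  -18.64  17304M
# 1  -18.64  17304M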
I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
cycles 40 38.02 35.98 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
I would like this DataFrame to look like this:
cycles 40 38 36 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
The .csv files won't always have exactly the same column names; the numbers could differ slightly from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
data = {'cycles': [1, 2, 3],
        '40': [1.1e-8, 2.2e-8, 3.3e-8],
        '38.02': [4.4e-8, 5.5e-8, 6.6e-8],
        '35.98': [7.7e-8, 8.8e-8, 9.9e-8],
        'P4': [8.8e-7, 8.7e-7, 8.6e-7]}
df = pd.DataFrame(data, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
cycles 40 38 36 P4
0 1 1.100000e-08 4.400000e-08 7.700000e-08 8.800000e-07
1 2 2.200000e-08 5.500000e-08 8.800000e-08 8.700000e-07
2 3 3.300000e-08 6.600000e-08 9.900000e-08 8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
import pandas as pd

df = pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
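A quick round-trip check (a sketch to confirm nothing is lost along the way):

import pandas as pd

df = pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])
df2 = pd.DataFrame(df.values, columns=df.columns, index=df.index)
assert df2.equals(df)  # data, column labels and index all survive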
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if isinstance(df.index, MultiIndex):
        index_cmd += '.from_tuples({0}, names={1})'.format(list(df.index), df.index.names)
    else:
        # {1!r} renders a missing index name as None rather than the string 'None'
        index_cmd += '({0}, name={1!r})'.format(list(df.index), df.index.name)
    # prefix DataFrame itself with pandas_as too, so the generated command
    # is valid under a plain "import pandas as pd"
    return '{1}DataFrame({0}, index={1}{2}, columns={3})'.format(
        [list(x) for x in df.values],
        pandas_as,
        index_cmd,
        list(df.columns))

DataFrame.command = _gencmd  # plain assignment binds it as a method in Python 3
I have only tested it on a few cases so far and would love a more general solution.
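A quick usage sketch, assuming the snippet above has been run (the exact string varies with your pandas/numpy versions):

from pandas import DataFrame

df = DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])
print(df.command())
# roughly: pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]],
#                       index=pd.Index([1, 2, 3, 4], name=None), columns=['B', 'A'])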