If I have a CSV file where one column contains OrderedDicts, how do I create a new column that extracts a single element from each OrderedDict, using Python (3.+) / pandas (0.18)?
Here's an example. My column, attributes, has BillingPostalCode values hidden inside OrderedDicts. All I care about is creating a column with the BillingPostalCodes.
Here's what my data looks like now:
import pandas as pd
from datetime import datetime
import csv
from collections import OrderedDict
df = pd.read_csv('sf_account_sites.csv')
print(df)
yields:
id attributes
1 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
2 OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'55555')])
...
I know on an individual level if I do this:
d = OrderedDict([(u'attributes', OrderedDict([(u'type', u'Account'), (u'url', u'/services/data/v29.0/sobjects/Account/001d000001tKZmWAAW')])), (u'BillingPostalCode', u'85020')])
print(d['BillingPostalCode'])
I'll get 85020 back as a result.
What do I have to do to get it to look like this?
id zip_codes
1 85020
2 55555
...
Do I have to use an apply function? A for loop? I've tried a lot of different things but I can't get anything to work on the dataframe.
Thanks in advance, and let me know if I need to be more specific.
This took me a while to work out, but the problem is resolved by doing the following:
df.apply(lambda row: row["attributes"]["BillingPostalCode"], axis = 1)
The trick here is to note that axis = 1 forces pandas to iterate through every row, rather than each column (which is the default setting, as seen in the docs).
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None,
args=(), **kwds)
Applies function along input axis of DataFrame.
Objects passed to functions are Series objects having index either the
DataFrame’s index (axis=0) or the columns (axis=1). Return type
depends on whether passed function aggregates, or the reduce argument
if the DataFrame is empty.
Parameters:
func : function Function to apply to each column/row
axis : {0 or ‘index’, 1 or ‘columns’}, default 0
0 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
From there, it is a simple matter to first access the relevant column - in this case attributes - and then extract only the BillingPostalCode.
You'll need to format the resulting DataFrame to have the correct column names.
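Putting the pieces together, here is a minimal sketch (it assumes the attributes column really holds OrderedDict objects; a plain read_csv would give you their string representations, which would need to be parsed first):
df['zip_codes'] = df.apply(lambda row: row['attributes']['BillingPostalCode'], axis=1)
result = df[['id', 'zip_codes']]
print(result)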
Related
Consider a function (apply_this_function) that will be applied to a dataframe:
# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
# create a throw-away dataframe
df_throwaway = df.copy()
def apply_this_function(passed_row):
passed_row['new_col'] = True
passed_row['added'] = datetime.datetime.now()
return passed_row
df_throwaway.apply(apply_this_function, axis=1) # axis=1 is important to use the row itself
In df_throwaway.apply(...), where does the function get the passed_row parameter? Or what value is this function taking? My assumption is that, by the structure of apply(), the function takes values from row i starting at 1?
I am referring to the information obtained here
When you apply a function to a DataFrame with axis=1, then
this function is called for each row from the source DataFrame
and by convention its parameter is called row.
In your case this function returns (from each call) the original
row (actually a Series object), with 2 new elements added.
Then apply method collects these rows, concatenates them
and the result is a DataFrame with 2 new columns.
You wrote "takes values from row i starting at 1". I would change it to
"takes values from each row".
Writing starting at 1 can lead to misunderstandings, since when your
DataFrame has a default index, its values start from 0 (not from 1).
In addition, I would like to propose 2 corrections to your code:
Create your DataFrame passing data (your code sample does not
contain creation of df):
df_throwaway = pd.DataFrame(data)
Define your function as:
def apply_this_function(row):
row['new_col'] = True
row['added'] = pd.Timestamp.now()
return row
i.e.:
name the parameter as just row (everybody knows that this row
has been passed by apply method),
instead of datetime.datetime.now() use pd.Timestamp.now()
i.e. a native pandasonic type and its method.
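For reference, here is a complete runnable sketch with both corrections applied (just my assembly of the pieces above, nothing new):
import pandas as pd

# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
df_throwaway = pd.DataFrame(data)

def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row

result = df_throwaway.apply(apply_this_function, axis=1)  # axis=1 passes each row
print(result)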
A custom dask GroupBy Aggregation is very handy, but I am having trouble defining one that returns the most frequent value in a column.
What I have so far:
So from the example here, we can define custom aggregate functions like this:
custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
'A': custom_sum,
'B': custom_most_often_value, ### <<< This is the goal.
'C': ['max','min','mean'],
'D': ['max','min','mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()
While this works for custom_sum (as on the example page), the adaptation to the most frequent value could look like this (from the example here):
custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])
but it yields
ValueError: Metadata inference failed in `_agg_finalize`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
Then I tried to find the meta keyword in the dd.Aggregation implementation to define it, but could not find it. And the fact that it is not needed in the custom_sum example makes me think that the error is somewhere else.
So my question is: how do I get the most frequently occurring value of a column in a df.groupby(..).agg(..)? Thanks!
A quick clarification rather than an answer: the meta parameter is used in the .agg() method, to specify the column data types you expect, best expressed as a zero-length pandas dataframe. Dask will supply dummy data to your function otherwise, to try to guess those types, but this doesn't always work.
The issue that you're running into is that the separate stages of the aggregation can't be the same function applied recursively, as in the custom_sum example that you're looking at.
I've modified code from this answer, keeping the comments from user8570642 because they are very helpful. Note that this method will work for a list of groupby keys:
https://stackoverflow.com/a/46082075/3968619
def chunk(s):
    # for the comments, assume only a single grouping column; the
    # implementation can handle multiple group columns.
    #
    # s is a grouped series. value_counts creates a multi-index series like
    # (group, value): count
    return s.value_counts()

def agg(s):
    # print('agg', s.apply(lambda s: s.groupby(level=-1).sum()))
    # s is a grouped multi-index series. In .apply the full sub-df will be
    # passed, multi-index and all. Group on the value level and sum the
    # counts. The result of the lambda is a series, so the result of the
    # apply is a multi-index series like (group, value): count
    return s.apply(lambda s: s.groupby(level=-1).sum())

    # faster alternative using pandas internals (unreachable as written;
    # swap it with the return above to use it):
    # s = s._selected_obj
    # return s.groupby(level=list(range(s.index.nlevels))).sum()

def finalize(s):
    # s is a multi-index series of the form (group, value): count. First
    # manually group on the group part of the index. The lambda will receive a
    # sub-series with multi index. Next, drop the group part from the index.
    # Finally, determine the index with the maximum value, i.e. the mode.
    level = list(range(s.index.nlevels - 1))
    return (
        s.groupby(level=level)
        .apply(lambda s: s.reset_index(level=level, drop=True).idxmax())
    )

mode = dd.Aggregation('mode', chunk, agg, finalize)
chunk counts the values for the groupby object in each partition. agg takes the results from chunk, groups by the original groupby key(s) and sums the value counts, so that we have the value counts for every group. finalize takes the multi-index series provided by agg and returns the most frequently occurring value of B for each group from Z.
Here's a test case:
df = dd.from_pandas(
pd.DataFrame({"A":[1,1,1,1,2,2,3]*10,"B":[5,5,5,5,1,1,1]*10,
'Z':['mike','amy','amy','amy','chris','chris','sandra']*10}), npartitions=10)
res = df.groupby(['Z']).agg({'B': mode}).compute()
print(res)
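As a quick sanity check (my own addition, not part of the dask answer), the same per-group mode can be computed directly in plain pandas on the small test frame:
import pandas as pd

pdf = pd.DataFrame({"A":[1,1,1,1,2,2,3]*10,"B":[5,5,5,5,1,1,1]*10,
                    'Z':['mike','amy','amy','amy','chris','chris','sandra']*10})
print(pdf.groupby('Z')['B'].agg(lambda s: s.value_counts().index[0]))
# expected: amy -> 5, chris -> 1, mike -> 5, sandra -> 1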
I have an Excel file:
Test_Case Value
Case_1 0.988532846
Case_2 0.829241525
Case_3 0.257209267
Case_4 0.871698313
Case_5 0.63913665
With pandas I have seen that we can get a column like this:
import pandas as pd
myExcelFile = "data.xlsx"
readExcelFile = pd.read_excel(myExcelFile, sheet_name=0)
testCaseColumn = readExcelFile.Test_Case
Result:
0 Case_1
1 Case_2
2 Case_3
3 Case_4
4 Case_5
The name of the column can change, and I would like to create a function with two arguments to get the column I want:
def getColumn(readExceFile, columnName):
return readExceFile.columnName
I would like to know how I can apply the column name to my readExceFile parameter?
Thanks for your help
You can use getattr.
def getColumn(readExceFile, columnName):
return getattr(readExceFile, columnName)
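For example, a hypothetical call using the Test_Case column from the question:
testCaseColumn = getColumn(readExcelFile, 'Test_Case')
print(testCaseColumn)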
Since your_dataframe.column_name works only with column names without spaces, and you've mentioned that the column name can change, you can select the column with your_dataframe.loc[:, 'column_name'] (see Alexander Cécile's comment).
On the other hand, if your dataset always has the same structure (n columns, first one with some categorical data, second one with values, etc.), then you can also select it directly with your_dataframe.iloc[:, 0], with 0 being the position of your column of interest in the set.
Finally, if you really need a separate function (besides at least the two approaches I've mentioned) which returns exactly the same output, then you may use this:
def get_column(your_dataframe, column_name):
return your_dataframe.loc[:,column_name]
... which is a highly non-Pythonic way of writing the code (see the Zen of Python)
I have a dataframe of corporate actions for a specific equity. It looks something like this:
0 Declared Date Ex-Date Record Date
BAR_DATE
2018-01-17 2017-02-21 2017-08-09 2017-08-11
2018-01-16 2017-02-21 2017-05-10 2017-06-05
except that it has hundreds of rows, but that is unimportant. I created the index "BAR_DATE" from one of the columns, which is why the 0 appears above BAR_DATE.
What I want to do is reference a specific element of the dataframe and return the index value, or BAR_DATE. I think it would go something like this:
index_value = cacs.iloc[5, :].index.get_values()
except that index_value becomes the column names, not the index. Now, this may stem from a poor understanding of indexing in pandas dataframes, so it may or may not be really easy for someone else to solve.
I have looked at a number of other questions including this one, but it returns column values as well.
Your code is really close, but you took it just one step further than you needed to.
# creates a slice of the dataframe where the row is at iloc 5 (row number 5) and where the slice includes all columns
slice_of_df = cacs.iloc[5, :]
# returns the index of the slice
# this will be an Index() object
index_of_slice = slice_of_df.index
From here we can use the documentation on the Index object: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.html
# turns the index into a list of values in the index
index_list = index_of_slice.to_list()
# gets the first index value
first_value = index_list[0]
The most important thing to remember about the Index is that it is an object of its own, and thus we need to change it to the type we expect to work with if we want something other than an index. This is where documentation can be a huge help.
EDIT: It turns out that the iloc in this case is returning a Series object which is why the solution is returning the wrong value. Knowing this, the new solution would be:
# creates a Series object from row 5 (technically the 6th row)
row_as_series = cacs.iloc[5, :]
# the name of the Series is its label in the DataFrame's index (here, the BAR_DATE)
index_of_series = row_as_series.name
This would be the approach for single-row indexing. You would use the former approach with multi-row indexing where the return value is a DataFrame and not a Series.
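To make the distinction concrete, here is a small self-contained sketch built from the two example rows in the question (the .iloc[[1], :] list-indexer variant is my own addition, not part of the answer above):
import pandas as pd

cacs = pd.DataFrame(
    {"Declared Date": ["2017-02-21", "2017-02-21"],
     "Ex-Date": ["2017-08-09", "2017-05-10"],
     "Record Date": ["2017-08-11", "2017-06-05"]},
    index=pd.Index(["2018-01-17", "2018-01-16"], name="BAR_DATE"))

# single-row indexing returns a Series; its .name is the row's index label
row_as_series = cacs.iloc[1, :]
print(row_as_series.name)  # '2018-01-16'

# a list indexer keeps the result a DataFrame, so .index still works
rows_as_frame = cacs.iloc[[1], :]
print(rows_as_frame.index[0])  # '2018-01-16'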
Unfortunately, I don't know how to coerce the Series into a DataFrame for single-row slicing beyond explicit conversion:
row_as_df = pd.DataFrame(cacs.iloc[5, :])  # note: this yields a one-column frame (the row transposed)
While this will work, and the first approach will happily take this and return the index, there is likely a reason why Pandas doesn't return a DataFrame for single-row slicing so I am hesitant to offer this as a solution.
I would like to create a pandas SparseDataFrame with dimensions 250,000 x 250,000. In the end my aim is to come up with a big adjacency matrix.
So far, creating that data frame is no problem:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
"""
Put single value at passed column and index
Parameters
----------
index : row label
col : column label
value : scalar value
Notes
-----
This method *always* returns a new object. It is currently not
particularly efficient (and potentially very expensive) but is provided
for API compatibility with DataFrame
...
The last sentence describes the problem I am running into with pandas. I would really like to keep using pandas here, but as it stands that is totally impossible!
Does someone have an idea, how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
Thanks for your help!
Do it this way
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries (you can do this yourself to see what they look like with s.to_sparse()). The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparse handling will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
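Building on that, here is a small sketch of my own (not from the original answer) that batches several updates to one column, so the dense/sparse round trip happens only once:
import numpy as np
import pandas as pd

df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))

col = 2000
updates = {1000: 1, 1500: 1, 2499: 1}  # hypothetical edges ending in column 2000

s = df[col].to_dense()           # densify the column once
for row, value in updates.items():
    s[row] = value
df[col] = s                      # assigned back; converted to a SparseSeries once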