efficiently mapping values in pandas from a 2nd dataframe - python

I'm looking to understand the best way to use a second file/dataframe to efficiently map values when the values come encoded and there is a label I want to map onto each of them. Think of this second file as a data dictionary that translates the values in the first dataframe.
For example
import pandas as pd
dataset = pd.read_csv('https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv')
data_dictionary = pd.DataFrame({'columnname' : ['vs','vs', 'am','am'], 'code' : [0,1,0,1], 'label':['vs_is_0','vs_is_1','am_is_0','am_is_1'] })
Now, I want to be able to replace the values in the first dataset, column by column according to 'columnname', mapping each 'code' to its 'label'. If a value is found in one and not the other, nothing happens.
Currently my approach is as follows, but I feel it is very inefficient and suboptimal. Keep in mind I could have 30-40 columns, each with 2-200 values I'd want replaced with this VLOOKUP-like mapping:
for each_colname in dataset.columns.tolist():
    lookup_values = data_dictionary.query("columnname == '{}'".format(each_colname))
    # ...and then doing a merge for each column
Any help is much appreciated!

First, you can create a mapper dict and then apply it to your dataset.
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].values.tolist()))
    .to_dict()
)
for e in mapper:
    dataset[e] = dataset[e].map(mapper[e]).combine_first(dataset[e])
Update to handle mismatched datatypes:
mapper = (
    data_dictionary.groupby('columnname')
    .apply(lambda x: dict(x[['code', 'label']].astype(str).values.tolist()))
    .to_dict()
)
for e in mapper:
    dataset[e] = dataset[e].astype(str).map(mapper[e]).combine_first(dataset[e])
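Another option along the same lines (a sketch, not part of the original answer) is to build a nested {column: {code: label}} dict and hand it to DataFrame.replace, which leaves unmapped values untouched:

# nested mapping: {'vs': {0: 'vs_is_0', 1: 'vs_is_1'}, 'am': {0: 'am_is_0', 1: 'am_is_1'}}
nested_map = {
    col: dict(zip(grp['code'], grp['label']))
    for col, grp in data_dictionary.groupby('columnname')
}
# replace() with a nested dict only touches the listed columns and codes
dataset = dataset.replace(nested_map)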

Related

Adding column to pandas dataframe using group name in function when iterating through groupby

I have a set of data which I fitted using a function; this yielded a dict of fitting parameters whose keys correspond to the possible group names.
Imagine I have another dataframe with some of those groups and some corresponding x-values. What I would like to do is get the y-values for the x-values in the second dataset using the fitting parameters from the dict, without merging the parameters onto the second dataset.
Here is a simplified example of what I would like to do. First I have a function using fitting parameters (not the real one):
def func(x, p):
    y = 0
    for i in range(len(p)):
        y += p[i] * x**i
    return y
A DataFrame with the second dataset consisting of two columns to group on and some corresponding x-values:
import numpy as np

df = pd.DataFrame({'a': np.random.randint(3, size=20),
                   'b': np.random.randint(3, size=20),
                   'x': np.random.randint(10, high=20, size=20)})
A dict with fitting parameters (groups of df are typically a sample of the dict keys):
params = {key: np.random.randint(5,size=3) for key in df.groupby(['a','b']).groups.keys()}
Now I want to calculate a new column 'ycalc', using the group names as a selector for params and applying the function. In my head this would look something like:
for name, group in df.groupby(['a', 'b']):
    df['ycalc'] = func(group['x'], params[name])
But then the whole column is overwritten for each group, yielding NaN for all members outside the group. Another logical solution would be to use transform, but then I cannot use the group name as input (regardless of possible other syntax mistakes):
df['ycalc'] = df.groupby(['a','b'])['x'].transform(func, args=(params[name]))
What would be the best approach to get column ycalc?
Use a lambda function:
df['ycalc'] = df.groupby(['a', 'b'])['x'].transform(lambda x: func(x, params[x.name]))
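For reference, a minimal end-to-end sketch of how this slots into the example setup above; when transform falls back to a general Python function, the Series handed to the lambda for each group carries the group key as its .name, which is what the line above relies on:

import numpy as np
import pandas as pd

def func(x, p):
    # polynomial sum(p[i] * x**i); works element-wise on a Series
    y = 0
    for i in range(len(p)):
        y += p[i] * x**i
    return y

df = pd.DataFrame({'a': np.random.randint(3, size=20),
                   'b': np.random.randint(3, size=20),
                   'x': np.random.randint(10, high=20, size=20)})
params = {key: np.random.randint(5, size=3)
          for key in df.groupby(['a', 'b']).groups.keys()}

df['ycalc'] = df.groupby(['a', 'b'])['x'].transform(lambda x: func(x, params[x.name]))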
From the discussion under the accepted answer, I share the solution that I finally used, proposed by jezrael as well:
def f(x):
    x['ycalc'] = func(x['x'], params[x.name])
    return x

df = df.groupby(['a', 'b']).apply(f)
For me this is more readable than using melt and pivoting (another suggestion), and it adds the extra flexibility of using multiple columns to construct df['ycalc']. This came in handy because, in my real problem, I have columns df['d'] and df['e'] in addition to df['x'] that are used as input for func; a hypothetical variant is sketched below.
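A sketch of that multi-column variant (columns 'd' and 'e' are hypothetical and exist only in the real data, not in the toy example above):

def f(g):
    # combine the fitted polynomial with two extra (hypothetical) input columns
    g['ycalc'] = func(g['x'], params[g.name]) + g['d'] * g['e']
    return g

df = df.groupby(['a', 'b']).apply(f)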

Using pd.DataFrame.replace with an apply function as the replace value

I have several dataframes that have, mixed into some columns, dates in this ASP.NET format: "/Date(1239018869048)/". I've figured out how to parse this into Python's datetime format for a given column. However, I would like to put this logic into a function so that I can pass it any dataframe and have it replace all the dates it finds matching a regex, using pd.DataFrame.replace.
Something like:
def pretty_dates(df):
    # messy parsing logic here
    ...

df.replace(to_replace=r'/Date\((\d+)\)/', value=pretty_dates(df), regex=True)
The problem with this is that what gets passed to pretty_dates is the whole dataframe, not just the cell that needs to be replaced.
So the concept I'm trying to figure out is whether the replacement value used with df.replace can be a function instead of a static value.
Thank you so much in advance
EDIT
To add some clarity: I have many columns in the dataframe, over a hundred, that contain this date format, and I would like not to list out every single column that has a date. Is there a way to apply the function to clean my dates across all the columns in my dataset? I do not want to clean one column but all the hundreds of columns of my dataframe.
I'm sure you can use regex to do this in one step, but here is how to apply it to the whole column at once:
s = pd.Series(['/Date(1239018869048)/',
               '/Date(1239018869048)/'], dtype=str)
s = s.str.replace(r'/Date\(', '', regex=True)
s = s.str.replace(r'\)/', '', regex=True)
print(s)
0    1239018869048
1    1239018869048
dtype: object
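If the goal is an actual datetime rather than the bare millisecond string, one option (a sketch, not part of the original answer) is to extract the digits and let pd.to_datetime interpret them as milliseconds since the epoch:

import pandas as pd

s = pd.Series(['/Date(1239018869048)/', '/Date(1239018869048)/'], dtype=str)
ms = s.str.extract(r'/Date\((\d+)\)/', expand=False)   # pull out the millisecond digits
dates = pd.to_datetime(ms.astype('int64'), unit='ms')  # epoch milliseconds -> Timestamp
print(dates)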
As far as I understand, you need to apply a custom function to selected cells in a specified column. I hope the following example helps:
import pandas as pd
df = pd.DataFrame({'x': ['one', 'two', 'three']})
selection = df.x.str.contains('t', regex=True) # put your regexp here
df.loc[selection, 'x'] = df.loc[selection, 'x'].map(lambda x: x+x) # do some logic instead
You can apply this procedure to all columns of the df in a loop:
for col in df.columns:
    selection = df.loc[:, col].str.contains('t', regex=True)  # put your regexp here
    df.loc[selection, col] = df.loc[selection, col].map(lambda x: x + x)  # do some logic instead
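Putting the pieces together for the EDIT above, a sketch (the helper name and regex are illustrative, not from the original answers) that converts every /Date(ms)/ cell in every object column to a pandas Timestamp and leaves all other cells alone:

import pandas as pd

DATE_RE = r'/Date\((\d+)\)/'

def clean_dates(df):
    # only string (object) columns can hold the ASP.NET date format
    for col in df.select_dtypes(include='object').columns:
        mask = df[col].str.match(DATE_RE, na=False)
        if mask.any():
            ms = df.loc[mask, col].str.extract(DATE_RE, expand=False).astype('int64')
            df.loc[mask, col] = pd.to_datetime(ms, unit='ms')
    return df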

How to get all groups from a Dask DataFrameGroupBy if I have more than one group-by field?

How can I get all unique groups in Dask from a grouped data frame?
Let's say, we have the following code:
g = df.groupby(['Year', 'Month', 'Day'])
I have to iterate through all groups and process the data within the groups.
My idea was to get all the unique value combinations and then iterate through the collection and call, e.g.,
g.get_group((2018, 1, 12)).compute()
for each of them... which is not going to be fast, but hopefully will work..
In Spark/Scala I can achieve something like this using the following approach:
val res = myDataFrame.groupByKey(x => groupFunctionWithX(x)).mapGroups((key, iter) => {
    // process the group with all its child records
})
I am wondering, what is the best way to implement something like this using Dask/Python?
Any assistance would be greatly appreciated!
Best, Michael
UPDATE
I have tried the following in python with pandas:
df = pd.read_parquet(path, engine='pyarrow')
g = df.groupby(('Year', 'Month', 'Day'))
g.apply(lambda x: print(x.Year[0], x.Month[0], x.Day[0], x.count()[0]))
And this was working perfectly fine. Afterwards, I tried the same with Dask:
df2 = dd.read_parquet(path, engine='pyarrow')
g2 = df2.groupby(('Year', 'Month', 'Day'))
g2.apply(lambda x: print(x.Year[0], x.Month[0], x.Day[0], x.count()[0]))
This has led me to the following error:
ValueError: Metadata inference failed in `groupby.apply(lambda)`.
Any ideas what went wrong?
Computing one group at a time is likely to be slow. Instead, I recommend using groupby-apply:
df.groupby([...]).apply(func)
As with Pandas, the user-defined function func should expect a Pandas dataframe holding all rows corresponding to that group, and should return either a Pandas dataframe, a Pandas Series, or a scalar.
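Regarding the ValueError in the UPDATE: Dask tries to infer the output schema of the lambda and fails; passing meta explicitly skips that inference. A sketch assuming the same Year/Month/Day parquet layout as above (the summarize helper and the dtypes are illustrative assumptions):

import dask.dataframe as dd

def summarize(pdf):
    # pdf is a plain pandas DataFrame holding one (Year, Month, Day) group
    return pdf[['Year', 'Month', 'Day']].head(1).assign(rows=len(pdf))

df2 = dd.read_parquet(path, engine='pyarrow')
result = (
    df2.groupby(['Year', 'Month', 'Day'])
       .apply(summarize, meta={'Year': 'i8', 'Month': 'i8', 'Day': 'i8', 'rows': 'i8'})
       .compute()
)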
Getting one group at a time can be cheap if your data is indexed by the grouping column:
df = df.set_index('date')
part = df.loc['2018-05-01'].compute()
Given that you're grouping by a few columns, though, I'm not sure how well this will work.

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, so I want it to calculate the value only once for each unique value.
The only solution I've been able to come up with is as follows:
This step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series:
new_vals = pd.Series(data['column'].unique()).apply(function)
This step because .merge has to be used on DataFrames:
new_dataframe = pd.DataFrame(index=data['column'].unique(), data=new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines:
data['calculated_column'] = data.groupby('column').index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do it, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to my library of common tools that I hedonistically pull in with from me_tools import *:
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values,
                                 index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                    .drop_duplicates()
                    .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We join the original DataFrame (calling) to another DataFrame (passed) that contains two columns: the original column and the new column transformed from the original column.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
Edit:
As an aside: it is best to drop new_col if it already exists; otherwise the merge will append suffixes to each new_col:
if new_col in df:
    df = df.drop(new_col, axis='columns')
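For illustration, a usage sketch with hypothetical data and function names, followed by an even shorter alternative that computes the function once per unique value and maps the results back (not from the original answer):

import pandas as pd

data = pd.DataFrame({'city': ['NYC', 'LA', 'NYC', 'NYC', 'LA']})

def expensive(v):
    # stand-in for the genuinely expensive computation
    return v.lower()

# using the helper above
data = apply_unique(data, 'city', 'city_clean', expensive)

# alternative: build a lookup over the unique values, then map it back onto the column
lookup = {v: expensive(v) for v in data['city'].unique()}
data['city_clean2'] = data['city'].map(lookup)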

Change CSV numerical values based on header name in file using Python

I have a .csv file filled with observations from sensors in the field. The sensors write data as millimeters, and I need it in meters to import into another application. My idea was to use Python, and possibly pandas, to:
1. Read in the .csv as dataframe
2. Find the headers of the data I need to modify (divide each actual value by 1000)
3. Divide each value in the chosen column by 1000 to convert it to meters
4. Write the resulting updated file to disk
Objective: I need to modify all the values except those with a header that contains "rad" in it.
This is what the data looks like (shown as a screenshot in the original post, not reproduced here):
Here is what I have done so far:
Read data into a dataframe:
import pandas as pd
import numpy as np
delta_df = pd.read_csv('SAAF_121581_67_500.dat',index_col=False)
Filter out all the data that I don't want to touch:
delta_df.filter(like='rad', axis=1)
Here is where I got stuck, as I couldn't work out how to filter the dataframe for columns that are not like 'rad'.
How can I do this?
It's easier if you post the dataframe rather than an image, as the image is not reproducible.
You can use DataFrame.select to keep all the columns containing 'rad' (note: DataFrame.select has since been removed from pandas; a modern equivalent is sketched after the EDIT below):
import re

delta_df = delta_df.select(lambda x: re.search('rad', x), axis=1)
In case you are trying to remove all the columns containing 'rad', use
delta_df = delta_df.select(lambda x: not re.search('rad', x), axis=1)
Alternate solution without regex:
delta_df.filter(like='rad', axis=1)
EDIT:
Given the dataframes containing 'rad' and not containing 'rad':
df_norad = df.select(lambda x: not re.search('rad', x), axis=1)
df_rad = df.select(lambda x: re.search('rad', x), axis=1)
you can convert the values of df_norad to meters (divide by 1000) and then put the two back together with df_rad:
merged = pd.concat([df_norad, df_rad], axis=1)
You can then write the merged dataframe to CSV using to_csv:
merged.to_csv('yourfilename.csv')
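Since DataFrame.select is gone from current pandas versions, here is a hedged sketch of the whole workflow on a modern install (the input filename comes from the question, the output filename is made up, and the non-'rad' columns are assumed to be numeric):

import pandas as pd

delta_df = pd.read_csv('SAAF_121581_67_500.dat', index_col=False)

# columns whose header does not contain 'rad' hold millimetre values
mm_cols = delta_df.columns[~delta_df.columns.str.contains('rad')]
delta_df[mm_cols] = delta_df[mm_cols] / 1000  # millimetres -> metres

delta_df.to_csv('SAAF_121581_67_500_m.csv', index=False)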
Off the top of my head, I believe you can do something like this:
delta_df.filter(regex='^(?!.*rad)', axis=1)
Here we use the regex parameter instead of the like parameter (note that regex and like are mutually exclusive).
The negative lookahead (?!.*rad) anchored at the start keeps only the columns whose names do not contain 'rad'.
Again, I don't have an environment set up to test this, but I hope this motivates the idea well enough.
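For completeness, the two selections side by side with current pandas (keeping the question's delta_df name):

rad_cols = delta_df.filter(like='rad', axis=1)              # only the 'rad' columns
non_rad_cols = delta_df.filter(regex='^(?!.*rad)', axis=1)  # only the non-'rad' columns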
