Map each value in pandas series/dataframe to n>1 dimensions - python

I have a pandas series, and a function that takes a value in the series and returns a dataframe. Is there a way to apply the function to the series and collate the results in a natural way?
What I am really trying to do is to use pandas series/multiindex to keep track of the results in each step of my data analysis pipeline, where the multiindex holds the parameters used to get the values. For example, the series (s below) is the result of step 0 in my data analysis pipeline. In step 1, I want to try x more dimensions (2 below, thus the dataframe) and collate the results into another series.
Can we do better than the code below? The stack() calls seem a bit excessive. Would the xarray library be a good fit for my use case?
In [112]: s
Out[112]:
a 0
b 1
c 2
dtype: int64
In [113]: d = s.apply(lambda x: pd.DataFrame([[x,x*2],[x*3,x*4]]).stack()).stack().stack()
In [114]: d
Out[114]:
a  0  0    0
      1    0
   1  0    0
      1    0
b  0  0    1
      1    3
   1  0    2
      1    4
c  0  0    2
      1    6
   1  0    4
      1    8
dtype: int64

This should give you a Dataset of 2-D arrays and align them for you. You may want to set the dimensions beforehand if you want them to be named a certain way or be a certain size.
xr.Dataset({k: func(v) for k, v in series.items()})
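To make that concrete, here is a minimal sketch assuming func is the step-1 function from the question (mapping each scalar to a 2x2 DataFrame); the dimension names row and col are hypothetical and only serve to make the arrays align:
import pandas as pd
import xarray as xr

s = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

def func(x):
    # hypothetical step-1 function: scalar -> 2x2 DataFrame
    return pd.DataFrame([[x, x * 2], [x * 3, x * 4]])

# wrap each result in a DataArray so both new dimensions get shared names
ds = xr.Dataset({k: xr.DataArray(func(v).values, dims=('row', 'col'))
                 for k, v in s.items()})
print(ds)  # one 2-D variable per label of s, aligned on the row/col dimensions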

Related

Pandas: Dataframe itertuples boolean series groupby optimization

I'm new to Python.
I have a data frame (DF), for example:
id  type
1   A
1   B
2   C
2   B
I would like to add a column A_flag, grouped by id.
In the end I want a data frame (DF) like this:
id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0
I can do this in two steps:
DF['A_flag_tmp'] = [1 if x.type=='A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it's very slow for a big data frame.
Is there any way to optimize this case ?
Thanks for the help.
Replace your slow iterative code with fast vectorized code: for the first step, generate the boolean series with a pandas built-in method, e.g.
df['type'].eq('A')
Then you can chain it into the groupby for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
   id type  A_flag
0   1    A       1
1   1    B       1
2   2    C       0
3   2    B       0
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
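For reference, a self-contained sketch of the vectorized version on the sample frame above (column names as in the question):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2], 'type': ['A', 'B', 'C', 'B']})

# boolean series: is this row of type 'A'?
is_a = df['type'].eq('A')

# per-id maximum of the flag, broadcast back to every row
df['A_flag'] = is_a.groupby(df['id']).transform('max').astype(int)
print(df)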

Pandas - How to extract values from a large DF without any 'keys' using another DF's values?

I've got one large matrix as a pandas DF without any 'keys', just plain numbers on top. A smaller version of it, just to demonstrate the problem here, would be this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is to use another given DF as a reference, with a structure like this,
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the 'B' values to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different versions of something like this
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
...but it keeps failing with an error saying
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DFs into numpy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
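Since the question also asks about converting to numpy arrays: as an alternative sketch, a positional lookup like this can also be vectorized with numpy fancy indexing (assuming every A/B pair in N is a valid row/column position of M; to_numpy() needs pandas 0.24+, .values works on older versions):
import numpy as np
import pandas as pd

M = pd.DataFrame(np.random.rand(4, 5))
N = pd.DataFrame({'A': [2, 2, 2], 'B': [2, 3, 4]})

# pick M[row, col] for each (A, B) pair in one shot
N['extracted'] = M.to_numpy()[N['A'].to_numpy(), N['B'].to_numpy()]
print(N)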

Create a column based on multiple column distinct count pandas [duplicate]

I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
my df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, 1 more column.
something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
this last one works with some aggregating functions but not others. the following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
in R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that refers to some function, which there isn't, so it fails. transform will take either a function or a string that it maps to a known method, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
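To get the extra column described in the question, the same expression can simply be assigned back:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform('nunique')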

What's the best way to transform Array values in one column to columns of the original DataFrame?

I have a table where one of the columns is an array of binary features; a feature appears in the list when it is present.
I'd like to train a logistic model on these rows, but can't get the data into the required format, where each feature value is its own column with a 1 or 0 value.
Example:
id feature values
1 ['HasPaws', 'DoesBark', 'CanFetch']
2 ['HasPaws', 'CanClimb', 'DoesMeow']
I'd like to get it to the format of
id HasPaws DoesBark CanFetch CanClimb DoesMeow
1 1 1 1 0 0
2 1 0 0 1 0
It seems like there would be some functionality built in to accomplish this, but I can't think of what this transformation is called to do a better search on my own.
You can first convert the lists to columns and then use the get_dummies() method:
In [12]: df
Out[12]:
id feature_values
0 1 [HasPaws, DoesBark, CanFetch]
1 2 [HasPaws, CanClimb, DoesMeow]
In [13]: (pd.get_dummies(df.set_index('id').feature_values.apply(pd.Series),
...: prefix='', prefix_sep='')
...: .reset_index()
...: )
Out[13]:
id HasPaws CanClimb DoesBark CanFetch DoesMeow
0 1 1 0 1 1 0
1 2 1 1 0 0 1
Another option is to loop through the feature values column and construct a series from each cell, with the list values as the index. That way, pandas will expand the series into a data frame with the index as headers:
pd.concat([df['id'],
           (df['feature values'].apply(lambda lst: pd.Series([1]*len(lst), index=lst))
              .fillna(0))], axis=1)
method 1
pd.concat([df['id'], df['feature values'].apply(pd.value_counts)], axis=1).fillna(0)
method 2
df.set_index('id').squeeze().apply(pd.value_counts).reset_index().fillna(0)
method 3
pd.concat([pd.Series(1, f, name=i) for _, (i, f) in df.iterrows()],
          axis=1).T.fillna(0).rename_axis('id').reset_index()
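On newer pandas versions (0.25+), a similar 0/1 table can also be built with explode and crosstab; a minimal sketch, assuming the column is named 'feature values' as in the question:
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'feature values': [['HasPaws', 'DoesBark', 'CanFetch'],
                                      ['HasPaws', 'CanClimb', 'DoesMeow']]})

# one row per (id, feature) pair, then count occurrences into a 0/1 table
long = df.explode('feature values')
print(pd.crosstab(long['id'], long['feature values']).reset_index())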

GroupBy object in which entries can belong to several groups

Judging by the methods it supports, it looks like nothing speaks against labels of the original data frame/series occurring multiple times in a derived GroupBy object. Is it actually possible to, for example, construct a GroupBy object g from an iterable column like a in
>>> x
a b
0 [0, 1] 1
1 [1, 2] 2
such that g will represent a GroupBy object with one entry for each of the entries in a's values? That is, I get results like
>>> x.iterable_groupby('a').size()
a
0 1
1 2
2 1
>>> x.iterable_groupby('a').mean()
b
0 1.0
1 1.5
2 2.0
You should reshape your DataFrame into a tidy dataset. The reshaping part is asked about frequently (1, 2, 3).
In a tidy dataset, each row should represent a single record. For that, you can create a 'grouper' column like this:
x['a'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('grouper')
Out:
grouper
0 0
0 1
1 1
1 2
If you join this with the original DataFrame, it can be grouped as you like:
x['a'].apply(pd.Series).stack().reset_index(level=1, drop=True).to_frame('grouper').join(x).groupby('grouper').mean()
Out:
b
grouper
0 1.0
1 1.5
2 2.0
The reshaping part is not very efficient, but as far as I know pandas does not offer a better method for it yet.
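As a side note, on pandas 0.25+ the reshaping step can be written with explode; a minimal sketch of the same groupby, assuming x is built as in the question:
import pandas as pd

x = pd.DataFrame({'a': [[0, 1], [1, 2]], 'b': [1, 2]})

# one row per element of 'a', keeping 'b' aligned, then group as usual
print(x.explode('a').groupby('a')['b'].mean())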
