Pandas dataframe apply list to columns auto-matching behaviour - python

Take a look at this minimal example:
import io
import pandas as pd
s1 = """A,B
a,1
b,1
"""
df = pd.read_csv(io.StringIO(s1))
print(df.apply(lambda x: 1 * [x.A],axis=1))
print("==================")
print(df.apply(lambda x: 2 * [x.A],axis=1))
print("==================")
print(df.apply(lambda x: 3 * [x.A],axis=1))
The print statements yield:
0 [a]
1 [b]
dtype: object
==================
A B
0 a a
1 b b
==================
0 [a, a, a]
1 [b, b, b]
dtype: object
As you can see, when the number of list elements equals the number of columns in the initial dataframe, the list gets matched to those columns; in all other cases the result is just a series containing the lists as elements.
I can work around this by checking the dimensions of the dataframe and, if necessary, adding an empty dummy column so the number of columns no longer matches the length of the lists, but I would like to know whether there is a direct way of controlling that matching behaviour.
EDIT: The specific way I'm creating the lists in my example is only for simplicity, the lists could also be created as an output e.g. from a numpy function such as linreg or polyfit.
EDIT 2: I want my 2nd case to look like this:
0 [a, a]
1 [b, b]
EDIT 3: My real application is to have two columns, each containing an array or list, and then run numpy's polyfit on them, which yields an array whose length depends on the degree of the polynomial.
df["polyfit"] = df.apply(lambda x: list(np.polyfit(x[x_name],x[y_name],degree)),axis=1)

Try a different approach:
In [9]: df['A'].apply(list) * 1
Out[9]:
0 [a]
1 [b]
Name: A, dtype: object
In [10]: df['A'].apply(list) * 2
Out[10]:
0 [a, a]
1 [b, b]
Name: A, dtype: object
In [11]: df['A'].apply(list) * 3
Out[11]:
0 [a, a, a]
1 [b, b, b]
Name: A, dtype: object
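As an aside, apply itself exposes a way to control this matching: its result_type argument (available since pandas 0.23), where 'reduce' asks for a Series of objects instead of column expansion. A minimal sketch with the question's data:
import io
import pandas as pd

s1 = """A,B
a,1
b,1
"""
df = pd.read_csv(io.StringIO(s1))

# result_type='reduce' keeps the 2-element lists from being matched
# to the two columns of the frame
print(df.apply(lambda x: 2 * [x.A], axis=1, result_type='reduce'))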

Related

replacing strings in pandas dataframes with lists of strings

I have a pandas dataframe read from file, some of whose columns contain strings, some of which in turn contain substrings separated by semicolons. My goal is to turn the semicolon-separated substrings into lists of strings and put those back in the dataframe.
When I use df.iloc[-1][-1] = df.iloc[-1][-1].split(';'); on a cell that contains a string with semicolons, there's no error but the value df.iloc[-1][-1] is not changed.
When I use
newval = df.iloc[-1,-1]; newval
newval = df.iloc[-1,-1].split( ';' ); newval
df.iloc[-1][-1] = newval; df.iloc[-1][-1]
It shows the original string for the first line and the list of substrings for the second, but then the original string again for the third. It looks as if nothing has been assigned -- but there was no error message either.
My first guess was that it was not allowed to put a list of strings in a cell that contains strings but a quick test showed me that that is OK:
>>> df = pd.DataFrame([["a", "a;b"], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B']);
>>> df
A B
1 a a;b
2 a;A a;b;A;B
>>> for row in range ( df.shape [ 0 ] ):
...     for col in range ( df.shape [ 1 ] ):
...         value = df.iloc[row][col];
...         if ( type ( value ) == str ):
...             value = value.split( ';' );
...         df.iloc[row][col] = value;
>>> df
A B
1 [a] [a, b]
2 [a, A] [a, b, A, B]
So I'm puzzled why (i) the assignment works in the example but not for my CSV-imported dataframe, and (ii) why python does not give an error message?
Honestly, you can simplify your code and avoid the loops with a simple applymap; loops should generally be avoided with pandas. Here applymap won't necessarily be faster, but it's definitely much easier to write and understand.
out = df.applymap(lambda x: x.split(';'))
output:
A B
1 [a] [a, b]
2 [a, A] [a, b, A, B]
Why your approach failed
You're using df.iloc[row][col] = value, which is chained indexing and can end up setting the value on a copy; use df.iloc[row, col] = value instead. Did you get a SettingWithCopyWarning?
If not all values are strings, guard the split:
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)
Example:
df = pd.DataFrame([["a", 2], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B'])
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)
A B
1 [a] 2
2 [a, A] [a, b, A, B]
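If you do want to keep an explicit loop, here is a minimal sketch of a corrected version (assuming the all-string toy frame above, whose columns are object dtype so a list can be stored in a cell; .iat is pandas' positional single-cell accessor):
import pandas as pd

df = pd.DataFrame([["a", "a;b"], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B'])

for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        value = df.iat[row, col]
        if isinstance(value, str):
            # one positional single-cell write into the original frame,
            # no chained indexing and hence no hidden copy
            df.iat[row, col] = value.split(';')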

Groupby pandas dataframe keeping unique values for some columns and list other columns

I want to group the following output by material_id keeping the unique values of material_description and MPN, but list the plant_id. picture for reference
def search_output(materials):
    df = pd.DataFrame(materials)
    df_ref = df.loc[:, df.columns!='#search.score'].groupby('material_id').agg({lambda x: list(x)})
    return df_ref
This currently groups by material_id and lists the other columns.
I use the following code to keep unique values grouped by material_id, but then I am missing the plant_id list column.
df_t = df.loc[:, df.columns!='#search.score'].groupby('material_id')[['material_description','MPN']].agg(['unique'])
picture for reference#2
I'm looking for a way to combine the two. A way to group by a column, keep unique values of specific columns and list other columns at the same time.
Hope you can help - and sorry for the pictures, but can't figure out how to add output otherwise :)
You can build a dictionary of aggregations, mapping the columns whose unique values you want to keep to 'unique' and all the other columns to list, and pass it to GroupBy.agg:
print (df)
material_id material_description MPN A B
0 1 descr1 a b c
1 1 descr2 a d e
2 1 descr1 b b c
3 2 descr3 a b c
4 2 descr4 a b c
5 2 descr4 a b c
u_cols = ['material_description','MPN']
d = {c: 'unique' if c in u_cols else list for c in df.columns.drop('material_id')}
df_ref = df.loc[:, df.columns!='#search.score'].groupby('material_id').agg(d)
print (df_ref)
material_description MPN A B
material_id
1 [descr1, descr2] [a, b] [b, d, b] [c, e, c]
2 [descr3, descr4] [a] [b, b, b] [c, c, c]
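If the frame returned by the search still carries the '#search.score' column, it has to be left out of the aggregation dict as well; a minimal sketch of that adjustment (the column name is taken from the question):
# build the dict only from the columns that take part in the aggregation
cols = df.columns.drop(['material_id', '#search.score'], errors='ignore')
d = {c: 'unique' if c in u_cols else list for c in cols}
df_ref = df.groupby('material_id').agg(d)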

How to join a column of lists in one dataframe with a column of strings in another dataframe?

I have two dataframes. The first one (let's call it A) has a column (let's call it 'col1') whose elements are lists of strings. The other one (let's call it B) has a column (let's call it 'col2') whose elements are strings. I want to do a join between these two dataframes where B.col2 is in the list in A.col1. This is one-to-many join.
Also, I need the solution to be scalable since I wanna join two dataframes with hundreds of thousands of rows.
I have tried concatenating the values in A.col1 and creating a new column (let's call it 'col3') and joining with this condition: A.col3.contains(B.col2). However, my understanding is that this condition triggers a cartesian product between the two dataframes which I cannot afford considering the size of the dataframes.
from pyspark.sql.functions import udf

def joinIds(IdList):
    return "__".join(IdList)

joinIds_udf = udf(joinIds)
pnr_corr = pnr_corr.withColumn('joinedIds', joinIds_udf(pnr_corr.pnrCorrelations.correlationPnrSchedule.scheduleIds))
pnr_corr_skd = pnr_corr.join(skd, pnr_corr.joinedIds.contains(skd.id), how='inner')
This is a sample of the join that I have in mind:
dataframe A:
listColumn
["a","b","c"]
["a","b"]
["d","e"]
dataframe B:
valueColumn
a
b
d
output:
listColumn valueColumn
["a","b","c"] a
["a","b","c"] b
["a","b"] a
["a","b"] b
["d","e"] d
I don't know if there is an efficient way to do it, but this gives the correct output:
import pandas as pd
from itertools import chain
df1 = pd.Series([["a","b","c"],["a","b"],["d","e"]])
df2 = pd.Series(["a","b","d"])
result = [[[el2, list1] for el2 in df2.values if el2 in list1]
          for list1 in df1.values]
result_flat = list(chain(*result))
result_df = pd.DataFrame(result_flat)
You get:
In [26]: result_df
Out[26]:
0 1
0 a [a, b, c]
1 b [a, b, c]
2 a [a, b]
3 b [a, b]
4 d [d, e]
Another approach is to use the new explode() method from pandas>=0.25 and merge like this:
import pandas as pd
df1 = pd.DataFrame({'col1': [["a","b","c"],["a","b"],["d","e"]]})
df2 = pd.DataFrame({'col2': ["a","b","d"]})
df1_flat = df1.col1.explode().reset_index()
df_merged = pd.merge(df1_flat,df2,left_on='col1',right_on='col2')
df_merged['col2'] = df1.loc[df_merged['index'], 'col1'].values
df_merged.drop('index',axis=1, inplace=True)
This gives the same result:
col1 col2
0 a [a, b, c]
1 a [a, b]
2 b [a, b, c]
3 b [a, b]
4 d [d, e]
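A variant of the explode approach (also assuming pandas >= 0.25) keeps the original lists in a temporary helper column so no index lookup is needed after the merge; 'value' is a column name introduced only for this sketch:
import pandas as pd

df1 = pd.DataFrame({'col1': [["a","b","c"], ["a","b"], ["d","e"]]})
df2 = pd.DataFrame({'col2': ["a","b","d"]})

# duplicate the list column, explode the duplicate, then join on the scalar values
exploded = df1.assign(value=df1['col1']).explode('value')
out = exploded.merge(df2, left_on='value', right_on='col2').drop(columns='value')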
How about:
df['col1'] = [df['col1'].values[i] + [df['col2'].values[i]] for i in range(len(df))]
Where 'col1' is the column holding the lists of strings and 'col2' is the column holding the strings.
You can also drop 'col2' if you don't want it anymore with:
df = df.drop('col2',axis=1)

For every row in Pandas dataframe determine if a column value exists in another column

I have a pandas data frame like this:
df = pd.DataFrame({'category' : ['A', 'B', 'C', 'A'], 'category_pred' : [['A'], ['B','D'], ['A','B','C'], ['D']]})
print(df)
category category_pred
0 A [A]
1 B [B, D]
2 C [A, B, C]
3 A [D]
I would like to have an output like this:
category category_pred count
0 A [A] 1
1 B [B, D] 1
2 C [A, B, C] 1
3 A [D] 0
That is, for every row, determine if the value in 'category' appears in 'category_pred'. Note that 'category_pred' can contain multiple values.
I can do a for-loop like this one, but it is really slow.
for i in df.index:
    if df.category[i] in df.category_pred[i]:
        df['count'][i] = 1
I am looking for an efficient way to do this operation. Thanks!
You can make use of the DataFrame's apply method.
df['count'] = df.apply(lambda x: 1 if x.category in x.category_pred else 0, axis = 1)
This will add the new column you want.
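If apply turns out to be too slow on a large frame, a plain list comprehension over the two columns is a common alternative; a minimal sketch, assuming the df from the question:
# zip pairs each category with its prediction list; membership testing
# yields the 0/1 count column directly
df['count'] = [int(cat in preds)
               for cat, preds in zip(df['category'], df['category_pred'])]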

Count size of rolling intersection in pandas

I have a dataframe that consists of group labels ('B') and elements of each group ('A'). The group labels are ordered, and I want to know how many elements of group i show up in group i+1.
An example:
df= pd.DataFrame({ 'A': ['a','b','c','a','c','a','d'], 'B' : [1,1,1,2,2,3,3]})
A B
0 a 1
1 b 1
2 c 1
3 a 2
4 c 2
5 a 3
6 d 3
The desired output would be something like:
B
1 NaN
2 2
3 1
One way to go about this would be to compute the number of distinct elements in the union of group i and group i+1 and then subtract the number of distinct elements in each group. I've tried:
pd.rolling_apply(grp['A'], lambda x: len(x.unique()),2)
but this produces an error:
AttributeError: 'Series' object has no attribute 'type'
How do I get this to work with rolling_apply or is there a better way to attack this problem?
An approach with using sets and shifting the result:
First grouping the dataframe and then converting column A of each group into a set:
In [86]: grp = df.groupby('B')
In [87]: s = grp.apply(lambda x : set(x['A']))
In [88]: s
Out[88]:
B
1 set([a, c, b])
2 set([a, c])
3 set([a, d])
dtype: object
To calculate the intersection between consecutive sets, make a shifted version (I replace the NaN with an empty set for the next step):
In [89]: s2 = s.shift(1).fillna(set([]))
In [90]: s2
Out[90]:
B
1 set([])
2 set([a, c, b])
3 set([a, c])
dtype: object
Combine both series and calculate the length of the intersection:
In [91]: s.combine(s2, lambda x, y: len(x.intersection(y)))
Out[91]:
B
1 0
2 2
3 1
dtype: object
Another way to do the last step (for sets, & means intersection):
df = pd.concat([s, s2], axis=1)
df.apply(lambda x: len(x[0] & x[1]), axis=1)
The reason rolling_apply does not work is that 1) you passed it a GroupBy object and not a Series, and 2) it only works with numerical values.
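Note that pd.rolling_apply has since been removed from pandas, so for completeness here is a minimal sketch of the same set-intersection idea on current versions, avoiding both rolling_apply and the fillna(set([])) trick (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'A': ['a','b','c','a','c','a','d'], 'B': [1,1,1,2,2,3,3]})

# one set of elements per group label
sets = df.groupby('B')['A'].apply(set)

# size of each group's intersection with the previous group
# (the first group has no predecessor, so it is simply left out here)
overlap = pd.Series(
    [len(cur & prev) for prev, cur in zip(sets.iloc[:-1], sets.iloc[1:])],
    index=sets.index[1:],
)
print(overlap)  # B=2 -> 2, B=3 -> 1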
