How to lowercase all the elements in a pandas dataframe? - python

Just a quick question guys, I have a pandas dataframe:
In [11]: df = pd.DataFrame([['A', 'B', 'D'], ['C', 'E', 'C']], columns=['X', 'Y', 'Z'])
In [12]: df
Out[12]:
   X  Y  Z
0  A  B  D
1  C  E  C
How can I convert all the elements of df to lowercase:
Out[12]:
   X  Y  Z
0  a  b  d
1  c  e  c
I looked over the documentation and tried the following:
df = [[col.lower() for col in [df["X"],df["Y"], df["Z"]]]]
df
Nevertheless, it doesn't work. How can I lowercase all the elements inside a pandas dataframe?

Either
df.applymap(str.lower)
Out:
   X  Y  Z
0  a  b  d
1  c  e  c
Or
df.apply(lambda col: col.str.lower())
Out:
   X  Y  Z
0  a  b  d
1  c  e  c
The first one is faster and looks nicer, but the second one can handle NaNs.

Using applymap with a lambda works even if df contains a mix of NaN values and strings:
import pandas as pd
df = df.applymap(lambda x: x.lower() if pd.notnull(x) else x)
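As an aside (my note, not from the original answers): in pandas 2.1+, applymap is deprecated in favor of DataFrame.map, so on newer versions the NaN-safe variant might look like this minimal sketch:
import pandas as pd

df = pd.DataFrame([['A', 'B', 'D'], ['C', None, 'E']], columns=['X', 'Y', 'Z'])

# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap;
# the isinstance check skips NaN/None and any non-string cells
df = df.map(lambda x: x.lower() if isinstance(x, str) else x)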

Related

List Comprehension with Mapping to Get a New Dataframe When Its Column Value Contains At Least One Element in A Given List

I have a dataframe df. I eventually want a new dataframe containing only the rows where column 'question' contains at least one element of the list answer.
answer = ['a','b','c','d','e']
df = pd.DataFrame({'question': ['a,b', 'b,c', 'z', 'f,e', 'x', 'd']})
>>> df
  question
0      a,b
1      b,c
2        z
3      f,e
4        x
5        d
I want the desired output dataframe to be:
>>> new_df
  question
0      a,b
1      b,c
3      f,e
5        d
And this is what I have tried so far:
for y in answer:
    new_df = df[df['question'].map(lambda x: y in x)]
I came up with something like this and got the following error:
new_df = df[df['question'].map(lambda x: y in x for y in answer)]
TypeError: 'generator' object is not callable
How can I get a new dataframe that satisfies the condition in one line of code by using list comprehension?
You can use Series.str.contains, pandas.concat and DataFrame.sort_index with a list comprehension:
df_result = pd.concat([df[df['question'].str.contains(a)] for a in answer]).drop_duplicates().sort_index()
But if you ask me, the above is not very readable, so here is the same code without the list comprehension for better understanding:
list_dfs = []
for a in answer:
    # df_match will be a tiny dataframe with the matching rows.
    # For instance, in the first iteration it will be:
    #   question
    # 0      a,b
    df_match = df[df['question'].str.contains(a)]
    list_dfs.append(df_match)
df_result = pd.concat(list_dfs).drop_duplicates().sort_index()
print(df_result)
For both versions, the output is the same:
  question
0      a,b
1      b,c
3      f,e
5        d
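As a side note (not in the original answer): when every entry in answer is a plain character with no special regex meaning, as here, the same result can be obtained in a single pass by joining the tokens into one pattern; a sketch under that assumption:
new_df = df[df['question'].str.contains('|'.join(answer))]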
Split the strings and test set intersection, rather than using a list comprehension:
df = pd.DataFrame({'question': ['a,b', 'b,c', 'z', 'f,e', 'x', 'd']})
>>> df['question'].str.split(',') \
...     .apply(lambda x: len(set(x).intersection(answer)) != 0)
0     True
1     True
2    False
3     True
4    False
5     True
Name: question, dtype: bool
New dataframe:
new_df = df[df['question'].str.split(',') \
            .apply(lambda x: len(set(x).intersection(answer)) != 0)]
>>> new_df
  question
0      a,b
1      b,c
3      f,e
5        d
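For what it's worth, here is a variant that genuinely uses isin, sketched under the assumption of pandas >= 0.25 (for Series.explode):
# split each cell into tokens, expand to one row per token, test
# membership, then collapse back to one boolean per original row
mask = df['question'].str.split(',').explode().isin(answer).groupby(level=0).any()
new_df = df[mask]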

ffill not filling data in pandas dataframe

I have a dataframe like this:
A B C E D
---------------
0 a r g g
1 x
2 x f f r
3 t
3 y
I am trying to forward fill using ffill, but it is not working:
cols = df.columns[:4].tolist()
df[cols] = df[cols].ffill()
I also tried :
df[cols] = df[cols].fillna(method='ffill')
But it is not getting filled.
Is it the empty values in the data causing this issue?
The data is mocked; the exact data is different (it contains strings, numbers and empty values).
Desired output:
A B C E D
---------------
0 a r g g
1 a r g x
2 x f f r
3 x f f t
3 x f f y
Replace empty values in the subset of columns with NaN:
import numpy as np
df[cols] = df[cols].replace('', np.nan).ffill()
You should replace the empty strings with np.nan first:
df = df.replace('', np.nan)
df[cols] = df[cols].ffill()
Replace '' with np.nan first:
df[df == ''] = np.nan
df[cols] = df[cols].ffill()
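To make the failure and the fix concrete, here is a minimal self-contained sketch with mocked-up column names (the real data will differ):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', '', 'x'], 'B': ['r', '', 'f']})

print(df.ffill())                      # unchanged: '' is a real string, not missing
print(df.replace('', np.nan).ffill())  # gaps now fill as expected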

pandas: Extracting the index of the maximum value in an expanding window

In a pandas DataFrame, I can create a Series B with the maximum value of another Series A, from the first row to the current one, by using an expanding window:
df['B'] = df['A'].expanding().max()
I can also extract the index of the overall maximum of Series A:
idx_max_A = df['A'].idxmax()
What I want is an efficient way to combine both; that is, to create a Series B that holds the index of the maximum value of Series A from the first row up to the current one. Ideally, something like this...
df['B'] = df['A'].expanding().idxmax()
...but, of course, the above fails because the Expanding object does not have idxmax. Is there a straightforward way to do this?
EDIT: For illustration purposes, for the following DataFrame...
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
...I'd like to create an additional column B so that the DataFrame contains the following:
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
I believe you can use expanding + max + groupby:
# v holds the running maximum of A
v = df.expanding().max().A
# within each block of equal running max, idxmax gives the label of the
# row where that maximum first appeared
df['B'] = v.groupby(v).transform('idxmax')
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
It seems idxmax is a function in the latest version of pandas, which I don't have yet. Here's a solution not involving groupby or idxmax:
import pandas as pd
import numpy as np
df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])
temp = df.A.expanding().max()
# for each running max, look up the first index where it occurs
# (note: this scans temp once per row, so it is O(n^2))
df['B'] = temp.apply(lambda x: temp[temp == x].index[0])
df
   A  B
a  1  a
b  2  b
c  1  b
d  3  d
e  0  d
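For completeness, here is a sketch (not from either answer) that avoids both groupby and the quadratic lookup: mark the rows where a new running maximum is set, keep their index labels, and forward-fill the rest:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 1, 3, 0], index=['a', 'b', 'c', 'd', 'e'], columns=['A'])

# a row sets a new running maximum iff it strictly exceeds the previous one
improved = df['A'] > df['A'].cummax().shift(fill_value=-np.inf)

# keep the index label where a new maximum appears, carry it forward otherwise
df['B'] = pd.Series(np.where(improved, df.index, None), index=df.index).ffill()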

Pandas - Interleave / Zip two DataFrames by row

Suppose I have two dataframes:
>> df1
   0  1  2
0  a  b  c
1  d  e  f
>> df2
   0  1  2
0  A  B  C
1  D  E  F
How can I interleave the rows? i.e. get this:
>> interleaved_df
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F
(Note my real DFs have identical columns, but not the same number of rows).
What I've tried
inspired by this question (very similar, but asks about columns):
import pandas as pd
from itertools import chain, zip_longest
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2])
new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices
interleaved_df = concat_df.reindex(new_index)
ValueError: cannot reindex from a duplicate axis
The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).
Any ideas?
You can sort the index after concatenating and then reset the index, i.e.:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)
Output:
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F
EDIT (OmerB): in case you want to keep the order regardless of the index values:
import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()
concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
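One caveat worth adding (my note, not the answerer's): sort_index defaults to quicksort, which does not promise to keep rows with equal index values in their original order. To guarantee that each df1 row lands before its df2 counterpart, a stable sort can be requested explicitly:
concat_df = pd.concat([df1, df2]).sort_index(kind='stable').reset_index(drop=True)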
Use toolz.interleave
In [1024]: from toolz import interleave
In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F
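Note that building the result from .values discards the original column labels and dtypes; if the real frames have meaningful columns, they can presumably be passed back in:
pd.DataFrame(interleave([df1.values, df2.values]), columns=df1.columns)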
Here's an extension of @Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.
Define DataFrames with the full set of column/index labels and names:
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'
Add DataFrame ID to MultiIndex:
df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])
Then use @Bharath's concat() and sort_index():
data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)
Output:
cols       col_a col_b col_c
df_id rows
1     one      a     b     c
2     one      A     B     C
1     two      d     e     f
2     two      D     E     F
You could also preallocate a new DataFrame, and then fill it using a slice.
import numpy as np
import pandas as pd

def interleave(dfs):
    data = np.transpose(np.array([np.empty(dfs[0].shape[0] * len(dfs), dtype=dt)
                                  for dt in dfs[0].dtypes]))
    out = pd.DataFrame(data, columns=dfs[0].columns)
    for ix, df in enumerate(dfs):
        out.iloc[ix::len(dfs), :] = df.values
    return out
The preallocation code is taken from this question.
While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.
Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.
You can try this way:
In [31]: import pandas as pd
...: from itertools import chain, zip_longest
...:
...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])
...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])
In [32]: concat_df = pd.concat([df1,df2]).sort_index()
...:
In [33]: interleaved_df = concat_df.reset_index(drop=1)
In [34]: interleaved_df
Out[34]:
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F

Python Pandas: Apply method to column if condition met in another column

I have a pandas dataframe that looks something like the following:
>>> df = pd.DataFrame([["B","X"],["C","Y"],["D","X"]])
>>> df.columns = ["A","B"]
>>> df
   A  B
0  B  X
1  C  Y
2  D  X
How can I apply a method to change the values of column A only if the value in column B is "X"? The desired outcome for example might be:
>>> df
    A  B
0  Bx  X
1   C  Y
2  Dx  X
I thought of combining the two columns (df['C'] = df['A'] + df['B']), but there's probably a better way to perform such a simple operation.
One approach is using loc:
df.loc[df.B == 'X', 'A']+='x'
    A  B
0  Bx  X
1   C  Y
2  Dx  X
EDIT: Based on the question in the comment, is this what you are looking for?
df.loc[df.B == 'X', 'A'] = df.A.str.lower()+'x'
    A  B
0  bx  X
1   C  Y
2  dx  X
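A vectorized alternative (a sketch, equivalent for this example) uses numpy.where to pick between the modified and original values:
import numpy as np

# append 'x' where B equals 'X', keep A unchanged elsewhere
df['A'] = np.where(df['B'] == 'X', df['A'] + 'x', df['A'])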
