Use ast.literal_eval on all columns of a Pandas Dataframe - python

I have a data frame that looks like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
org1 C
['org3', 'org4'] A
org2 A
['org2', 'org4'] B
...
When I read in this data using Pandas, the lists are read in as strings (e.g., dat['Category'][0][0] returns [ rather than returning org1). I have several columns like this. I want every categorical column that already contains at least one list to have all records be a list. For example, the above data frame should look like the following:
Category Class
==========================
['org1', 'org2'] A
['org2', 'org3'] B
['org1'] C
['org3', 'org4'] A
['org2'] A
['org2', 'org4'] B
...
Notice how the singular values in the Category column are now contained in lists. When I reference dat['Category'][0][0], I'd like org1 to be returned.
What is the best way to accomplish this? I was thinking of using ast.literal_eval with an apply and lambda function, but I'd like to follow best practices if possible. Thanks in advance!

You could create a boolean mask of the values that need to be changed. If there are no lists, no change is needed. If there are lists, you can apply literal_eval or a list-creation lambda to the relevant subsets of the data.
import ast
import pandas as pd

def normalize_category(df):
    # Mark rows whose Category value is a stringified list.
    is_list = df['Category'].str.startswith('[')
    if is_list.any():
        df.loc[is_list, 'Category'] = df.loc[is_list, 'Category'].apply(ast.literal_eval)
        df.loc[~is_list, 'Category'] = df.loc[~is_list, 'Category'].apply(lambda val: [val])

df = pd.DataFrame({"Category": ["['org1', 'org2']", "org1"], "Class": ["A", "B"]})
normalize_category(df)
print(df)

df = pd.DataFrame({"Category": ["org2", "org1"], "Class": ["A", "B"]})
normalize_category(df)
print(df)

You can do it like this (with literal_eval imported from ast):
from ast import literal_eval
df['Category'] = df['Category'].apply(lambda x: literal_eval(x) if x.startswith('[') else [x])
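A quick self-contained check of that one-liner (the sample frame below is invented for illustration):
import pandas as pd
from ast import literal_eval

df = pd.DataFrame({"Category": ["['org1', 'org2']", "org1"], "Class": ["A", "C"]})
# Parse stringified lists; wrap bare strings in a one-element list.
df['Category'] = df['Category'].apply(lambda x: literal_eval(x) if x.startswith('[') else [x])
print(df['Category'][0][0])  # org1
print(df['Category'][1])     # ['org1']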

Related

pandas filter series with lists of strings as values

So I'm trying to make a simple filter that will take in the dataframe and filter out all rows that don't have the target genre. It'll be easier to explain with the code:
import pandas as pd
test = [{"genre": ["RPG", "Shooter"]},
        {"genre": ["RPG"]},
        {"genre": ["Shooter"]}]
data = pd.DataFrame(test)
fil = data.genre.isin(['RPG'])
I want the filter to return a dataframe with the following elements:
[{"genre":["RPG"]},
{"genre":["RPG", "Shooter"]}]
This is the error I'm getting when I try my code:
SystemError: <built-in method view of numpy.ndarray object at 0x00000180D1DF2760> returned a result with an error set
The problem is that the elements of genre are lists, so isin does not work. Use:
mask = data['genre'].apply(frozenset(['RPG']).issubset)
print(data[mask])
Output
genre
0 [RPG, Shooter]
1 [RPG]
The expression:
frozenset(['RPG']).issubset
checks, for each row, whether the set is contained in that row's list. From the documentation:
Test whether every element in the set is in other.
So you could also check for multiple values easily, for example:
mask = data['genre'].apply(frozenset(['RPG', "Shooter"]).issubset)
print(data[mask])
Output
genre
0 [RPG, Shooter]
You want:
data[data.genre.apply(lambda x: 'RPG' in x)]
Or:
data[data.genre.explode().eq('RPG').any(level=0)]
Output:
genre
0 [RPG, Shooter]
1 [RPG]
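Note that Series.any(level=...) was deprecated and later removed in pandas 2.x; a sketch of the equivalent under current pandas, using the question's data:
import pandas as pd

data = pd.DataFrame({"genre": [["RPG", "Shooter"], ["RPG"], ["Shooter"]]})
# explode() emits one row per list element, keeping the original row label,
# so grouping on that label replaces the removed any(level=0).
mask = data['genre'].explode().eq('RPG').groupby(level=0).any()
print(data[mask])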

How to lowercase an entire Data Frame?

I'm trying to build a function to do the job because my data frames are in a list. This is the function that I am working on:
def lower(x):
    '''
    This function lowercases the entire Data Frame.
    '''
    for x in clean_lst:
        for x.columns in x:
            x.columns['i'].map(lambda i: i.lower())
It's not working like that!
This is the list of data frames:
clean_lst = [pop_movies, trash_movies]
I am planning to access the list like this:
lower = [pd.DataFrame(lower(x)) for x in clean_list]
pop_movies = lower[0]
trash_movies = lower[1]
HELP!!!
You can use the apply function from the pandas package, which works on both DataFrames and Series.
clean_lst = [i.apply(lambda x: x.str.lower()) for i in clean_lst]
You should use a vectorized string method for every column in the dataframe:
x["column_i"].str.lower()

How to unpack the columns of a pandas DataFrame to multiple variables

Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array, the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured that the following works, which is already close to what I'm trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores its columns as separate Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns into a single ndarray, which likely involves copying and type conversion. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data likely will trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code. But to avoid typing (e.g., in interactive shell sessions), it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The dataframe.values method shown is indeed a good solution, but it involves building a numpy array.
In case you want to access pandas Series methods after unpacking, I personally use a different approach.
For people like me who use a lot of chained methods, I have a solution: adding a custom unpack method to pandas. Note that this may not be very good for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator on a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )

pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or it can be used in method chaining.
Imagine a geo function that takes a latitude Series as its first argument and a longitude Series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    .(...some method chaining...)
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
    )
    .(...some method chaining...)
)
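A concrete, runnable version of that chain (do_something_geographical is a hypothetical stand-in, and the elided chaining steps are dropped):
import pandas as pd

def unpack(self: pd.DataFrame):
    # Yield each column as a pandas Series.
    return (self[col] for col in self.columns)

pd.DataFrame.unpack = unpack

def do_something_geographical(lat, lon):
    # Hypothetical placeholder: combine the two Series element-wise.
    return lat + lon

df = pd.DataFrame({"lat": [30, 40], "lon": [0, 1]})
df_result = df.assign(
    geographic_result=lambda d: do_something_geographical(*d[["lat", "lon"]].unpack())
)
print(df_result)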

How to define a variable amount of columns in python pandas apply

I am trying to add columns to a python pandas df using the apply function.
However the number of columns to be added depend on the output of the function
used in the apply function.
example code:
number_of_columns_to_be_added = 2

def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
I'm guessing that the output of zip is a tuple, therefore you could try this:
temp = zip(*df['input'].apply(lambda x : do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries, and you then iterate over temp to assign the values to your data frame under the generated keys. Hope this matches your original idea.
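A self-contained sketch of that loop in action (do_something below is a made-up stand-in that returns a tuple of derived values):
import pandas as pd

number_of_columns_to_be_added = 2

def do_something(x, n):
    # Hypothetical: return n derived values per input value.
    return tuple(x * (i + 1) for i in range(n))

df = pd.DataFrame({"input": [1, 2, 3]})
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    df['n' + str(i)] = value
print(df)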

How do I extract the list inside a string in Python?

I imported a CSV using Pandas and one column was read in with string entries. Examining the entries for this Series (column), I see that they should actually be lists. For example:
df['A'] = pd.Series(['["entry11"]', '["entry21","entry22"]', '["entry31","entry32"]'])
I would like to extract the list elements from the strings. So far, I've tried the following chain:
df['A'] = (df['A'].replace("'", '', regex=True)
                  .replace(r'\[', '', regex=True)
                  .replace(r'\]', '', regex=True)
                  .str.split(","))
and this gives me back my desired list elements in one column.
['"entry11"']
['"entry21", "entry22"']
['"entry31", "entry32"']
My question: Is there a more efficient way of doing this? This seems like a lot of strain for something that should be a little easier.
You can "apply" the ast.literal_eval() to the series:
In [8]: from ast import literal_eval
In [9]: df['A'] = df['A'].apply(literal_eval)
In [10]: df
Out[10]:
A
0 [entry11]
1 [entry21, entry22]
2 [entry31, entry32]
There is also map() and applymap() - here is a topic where the differences are discussed:
Difference between map, applymap and apply methods in Pandas
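If some entries might not be valid Python literals, a hedged variant (my sketch, not part of the original answer) falls back to the raw value instead of raising:
from ast import literal_eval
import pandas as pd

def safe_literal_eval(s):
    # Return the parsed literal, or the original string when parsing fails.
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return s

df = pd.DataFrame({"A": ['["entry11"]', 'not a list']})
df['A'] = df['A'].apply(safe_literal_eval)
print(df)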
