So I'm trying to make a simple filter that takes in a DataFrame and filters out all rows that don't contain the target genre. It'll be easier to explain with code:
import pandas as pd

test = [
    {"genre": ["RPG", "Shooter"]},
    {"genre": ["RPG"]},
    {"genre": ["Shooter"]},
]
data = pd.DataFrame(test)
fil = data.genre.isin(['RPG'])
I want the filter to return a dataframe with the following elements:
[{"genre":["RPG"]},
{"genre":["RPG", "Shooter"]}]
This is the error I'm getting when I try my code:
SystemError: <built-in method view of numpy.ndarray object at 0x00000180D1DF2760> returned a result with an error set
The problem is that the elements of genre are lists, and isin compares whole cells against the values you pass (it cannot handle unhashable list cells, hence the cryptic SystemError). Use:
mask = data['genre'].apply(frozenset(['RPG']).issubset)
print(data[mask])
Output
genre
0 [RPG, Shooter]
1 [RPG]
The expression:
frozenset(['RPG']).issubset
checks, for each row, whether every element of the set appears in that row's list. From the documentation:
Test whether every element in the set is in other.
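To see what this does on a single row's list (a quick standalone check, independent of pandas):
frozenset(['RPG']).issubset(['RPG', 'Shooter'])   # True: 'RPG' appears in the list
frozenset(['RPG', 'Shooter']).issubset(['RPG'])   # False: 'Shooter' is missing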
So you could also check for multiple values easily, for example:
mask = data['genre'].apply(frozenset(['RPG', "Shooter"]).issubset)
print(data[mask])
Output
genre
0 [RPG, Shooter]
You want:
data[data.genre.apply(lambda x: 'RPG' in x)]
Or, by exploding the lists (note: Series.any(level=0) was deprecated and removed in pandas 2.0, so group by the index instead):
data[data.genre.explode().eq('RPG').groupby(level=0).any()]
Output:
genre
0 [RPG, Shooter]
1 [RPG]
I'm trying to build a function to do the job because my data frames are in a list. This is the function that I am working on:
def lower(x):
    '''
    This function lowercases the entire DataFrame.
    '''
    for x in clean_lst:
        for x.columns in x:
            x.columns['i'].map(lambda i: i.lower())
It's not working like that!
This is the list of data frames:
clean_lst = [pop_movies, trash_movies]
I am planning to access the list like this:
lower = [pd.DataFrame(lower(x)) for x in clean_lst]
pop_movies = lower[0]
trash_movies = lower[1]
HELP!!!
You can use the apply function from the pandas package, which works on a DataFrame or Series.
clean_lst = [i.apply(lambda x: x.str.lower()) for i in clean_lst]
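You can then unpack the list back into the original names:
pop_movies, trash_movies = clean_lst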
You should use a vectorized string method for every column in the DataFrame:
x["column_i"].str.lower()
Lists or numpy arrays can be unpacked into multiple variables if the dimensions match. For a 2xN array (one variable per row), the following will work:
import numpy as np
a,b = [[1,2,3],[4,5,6]]
a,b = np.array([[1,2,3],[4,5,6]])
# result: a=[1,2,3], b=[4,5,6]
How can I achieve a similar behaviour for the columns of a pandas DataFrame? Extending the above example:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6]])
df.columns = ['A','B','C'] # Rename cols and
df.index = ['i', 'ii'] # rows for clarity
The following does not work as expected:
a,b = df.T
# result: a='i', b='ii'
a,b,c = df
# result: a='A', b='B', c='C'
However, what I would like to get is the following:
a,b,c = unpack(df)
# result: a=df['A'], b=df['B'], c=df['C']
Is the function unpack already available in pandas? Or can it be mimicked in an easy way?
I just figured out that the following works, which is already close to what I am trying to achieve:
a,b,c = df.T.values # Common
a,b,c = df.T.to_numpy() # Recommended
# a,b,c = df.T.as_matrix() # Deprecated
Details: As always, things are a little more complicated than one thinks. Note that a pd.DataFrame stores columns separately in Series. Calling df.values (or better: df.to_numpy()) is potentially expensive, as it combines the columns in a single ndarray, which likely involves copying actions and type conversions. Also, the resulting container has a single dtype able to accommodate all data in the data frame.
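A quick illustration of the dtype flattening (a small sketch with a hypothetical df2; the promotion to a single dtype follows numpy's casting rules):
df2 = pd.DataFrame({'x': [1, 2], 'y': [0.5, 1.5]})
print(df2.dtypes)            # x: int64, y: float64
print(df2.to_numpy().dtype)  # float64 -- both columns promoted to one dtype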
In summary, the above approach loses the per-column dtype information and is potentially expensive. It is technically cleaner to iterate the columns in one of the following ways (there are more options):
# The following alternatives create VIEWS!
a,b,c = (v for _,v in df.items()) # returns pd.Series
a,b,c = (df[c] for c in df) # returns pd.Series
Note that the above creates views! Modifying the data will likely trigger a SettingWithCopyWarning.
a.iloc[0] = "blabla" # raises SettingWithCopyWarning
If you want to modify the unpacked variables, you have to copy the columns.
# The following alternatives create COPIES!
a,b,c = (v.copy() for _,v in df.items()) # returns pd.Series
a,b,c = (df[c].copy() for c in df) # returns pd.Series
a,b,c = (df[c].to_numpy() for c in df) # returns np.ndarray
While this is cleaner, it requires more characters. I personally do not recommend the above approach for production code, but for saving keystrokes (e.g., in interactive shell sessions) it is still a fair option...
# More verbose and explicit alternatives
a,b,c = df["the first col"], df["the second col"], df["the third col"]
a,b,c = df.iloc[:,0], df.iloc[:,1], df.iloc[:,2]
The df.values method shown above is indeed a good solution, but it involves building a numpy array. In case you want access to pandas Series methods after unpacking, I personally use a different approach. For people like me who use a lot of chained methods, the solution is to add a custom unpacking method to pandas. Note that this may not be great for production pipelines, but it is very handy in ad-hoc data analyses.
df = pd.DataFrame({
    "lat": [30, 40],
    "lon": [0, 1],
})
This approach involves returning a generator from a .unpack() call.
from typing import Iterator

def unpack(self: pd.DataFrame) -> Iterator[pd.Series]:
    return (
        self[col]
        for col in self.columns
    )
pd.DataFrame.unpack = unpack
This can be used in two major ways.
Either directly as a solution to your problem:
lat, lon = df.unpack()
Or it can be used in method chaining.
Imagine a geo function that takes a latitude Series as its first argument and a longitude Series as its second, named do_something_geographical(lat, lon):
df_result = (
    df
    .(...some method chaining...)
    .assign(
        geographic_result=lambda dataframe: do_something_geographical(*dataframe[["lat", "lon"]].unpack())
    )
    .(...some method chaining...)
)
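For completeness, here is a runnable reduction of the chain above with the method-chaining placeholders dropped (do_something_geographical is hypothetical; substitute your real geo function):

def do_something_geographical(lat, lon):
    return lat * 1000 + lon  # placeholder computation

df_result = df.assign(
    geographic_result=lambda d: do_something_geographical(*d[['lat', 'lon']].unpack())
)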
I am trying to add columns to a pandas DataFrame using the apply function. However, the number of columns to be added depends on the output of the function used inside apply.
example code:
number_of_columns_to_be_added = 2
def add_columns(number_of_columns_to_be_added):
    df['n1'], df['n2'] = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
Any idea on how to define the ugly column part (df['n1'], ..., df['n696969']) before the = zip( ... part programmatically?
I'm guessing that zip yields one tuple per new column, so you could try this:
temp = zip(*df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added)))
for i, value in enumerate(temp, 1):
    key = 'n' + str(i)
    df[key] = value
temp will hold all the entries, and then you iterate over temp to assign the values to your DataFrame with your specific keys. Hope this matches your original idea.
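Another sketch that sidesteps zip entirely, assuming do_something returns a tuple of length number_of_columns_to_be_added (do_something is the asker's placeholder function):

results = df['input'].apply(lambda x: do_something(x, number_of_columns_to_be_added))
new_cols = ['n' + str(i) for i in range(1, number_of_columns_to_be_added + 1)]
df = pd.concat([df, pd.DataFrame(results.tolist(), index=df.index, columns=new_cols)], axis=1)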
I imported a CSV using Pandas and one column was read in with string entries. Examining the entries for this Series (column), I see that they should actually be lists. For example:
df['A'] = pd.Series(['["entry11"]', '["entry21","entry22"]', '["entry31","entry32"]'])
I would like to extract the list elements from the strings. So far, I've tried the following chain:
df['A'] = (df['A'].replace("'", '', regex=True)
                  .replace(r'\[', '', regex=True)
                  .replace(r'\]', '', regex=True)
                  .str.split(","))
and this gives me back my desired list elements in one column.
['"entry11"']
['"entry21", "entry22"']
['"entry31", "entry32"']
My question: Is there a more efficient way of doing this? This seems like a lot of strain for something that should be a little easier.
You can "apply" the ast.literal_eval() to the series:
In [8]: from ast import literal_eval
In [9]: df['A'] = df['A'].apply(literal_eval)
In [10]: df
Out[10]:
A
0 [entry11]
1 [entry21, entry22]
2 [entry31, entry32]
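If some cells might not parse cleanly, a guarded variant can fall back to a default (safe_eval is my name for a hypothetical helper, not a pandas function):

from ast import literal_eval

def safe_eval(s):
    try:
        return literal_eval(s)
    except (ValueError, SyntaxError):
        return []  # assumption: treat unparsable cells as empty lists

df['A'] = df['A'].apply(safe_eval)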
There is also map() and applymap() - here is a topic where the differences are discussed:
Difference between map, applymap and apply methods in Pandas