only keep part of a list in pandas series - python

I have a Pandas series containing a list of strings like so:
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
2 ['hello','howdy','my','name']
3 ['hello','mate','my','name']
4 ['hello','hello','my','name']
type(series_of_list)
pandas.core.series.Series
I would like to only keep the first two entries of each list, like so:
series_of_list.head()
0 ['hello','there']
1 ['hello','hi']
2 ['hello','howdy']
3 ['hello','mate']
4 ['hello','hello']
I have tried slicing it, series_of_list = series_of_list[:2], but that just returns the first two rows of the series:
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
I have also tried .drop and other slicing but the outcome is not what I want.
How can I only keep the first two items of the list for the entire pandas series?
Thank you!

Use pandas.Series.apply() to apply the slicing function to each element:
series_of_list = series_of_list.apply(lambda x: x[:2])
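If you prefer not to write a lambda, the .str accessor also slices element-wise and in practice works on list elements too, even though it is documented for strings (a minimal sketch; adjust to your own data):

import pandas as pd

series_of_list = pd.Series([
    ['hello', 'there', 'my', 'name'],
    ['hello', 'hi', 'my', 'name'],
])

# .str[:2] slices each element, keeping only the first two items of every list
first_two = series_of_list.str[:2]
print(first_two)
# 0    [hello, there]
# 1       [hello, hi]
# dtype: object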

Related

column of list values to one flat list in Python

I have a pandas column with multiple string values in it; I want to convert them into one flat list so that I can get a count of each value.
df.columnX
Row 1 ['A','B','A','C']
Row 2 ['A','C']
Row 3 ['D','A']
I want output like
Tag Count
A 4
B 1
C 2
D 1
I am trying to pull them into one list, but the values come through as quoted strings:
df.columnX.values = ["'A','B',,,,,,,,,'A'"]
Thanks in advance
What about this?
df.explode('columnX').columnX.value_counts().to_frame()
Note that you need pandas > 0.25.0 for explode to work.
If your lists are in fact strings, you can first convert them to lists (as suggested by @Jon Clements):
import ast
df.columnX = df.columnX.map(ast.literal_eval)
I got it
flatList = [item for sublist in list(df.ColumnX.map(ast.literal_eval)) for item in sublist]
dict((x,flatList.count(x)) for x in set(flatList))
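For what it's worth, collections.Counter does the same counting in one pass instead of calling count once per unique tag (a sketch reusing the question's columnX and the string-to-list conversion from above; the example data is made up):

import ast
from collections import Counter
import pandas as pd

df = pd.DataFrame({'columnX': ["['A','B','A','C']", "['A','C']", "['D','A']"]})

# parse the string representations into real lists, then flatten and count
flat = [item for sublist in df.columnX.map(ast.literal_eval) for item in sublist]
print(Counter(flat))
# Counter({'A': 4, 'C': 2, 'B': 1, 'D': 1})  (display order may vary by Python version)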

How to collapse the values of a Series where the values are a list into a unique list

Given a Pandas Series like the one below:
0 [ID01]
1 [ID02]
2 [ID05, ID08]
3 [ID09, ID56, ID32]
4 [ID03]
The objective is to get a single list like the one below:
[ID01, ID02, ID05, ID08, ID09, ID56, ID32, ID03]
How do you achieve that in a pythonic way in Python?
Assuming that is a pandas.Series object
Option 1
Full list
np.concatenate(s).tolist()
Option 1.1
Unique list
np.unique(np.concatenate(s)).tolist()
Option 2
Works if elements are lists. Doesn't work if they are numpy arrays.
Full list
s.sum()
Option 2.1
Unique list
pd.unique(s.sum()).tolist()
Option 3
Full list
[x for y in s for x in y]
Option 3.1
Unique list (thanks @pault)
list({x for y in s for x in y})
@Wen's Option
list(set.union(*map(set, s)))
Setup
s = pd.Series([
['ID01'],
['ID02'],
['ID05', 'ID08'],
['ID09', 'ID56', 'ID32'],
['ID03']
])
s
0 [ID01]
1 [ID02]
2 [ID05, ID08]
3 [ID09, ID56, ID32]
4 [ID03]
dtype: object
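For completeness, itertools.chain.from_iterable is another common way to flatten, and it works whether the elements are lists or numpy arrays (a small sketch using the s defined in the Setup above):

from itertools import chain

full_list = list(chain.from_iterable(s))
# ['ID01', 'ID02', 'ID05', 'ID08', 'ID09', 'ID56', 'ID32', 'ID03']

# order-preserving unique list (Python 3.7+ dict ordering)
unique_list = list(dict.fromkeys(chain.from_iterable(s)))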

string manipulations in Pandas

I am trying to manipulate a large list of strings, so I cannot do this manually. I am new to Python, so I am having trouble figuring this out.
I have a dataframe with columns:
df = pd.read_csv('filename.csv')
df
A B
0 big_apples
1 big_oranges
2 small_pears
3 medium_grapes
and I need it to look more like:
A B
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
I was thinking of using a startswith() function and .replace()/concatenation for everything. But then I would have to create columns for each of these, and I need it to recognize the unique prefixes. Is there a more efficient method?
You can do some string formatting and apply it to the Series:
df.B.apply(lambda x: '{}({})'.format(*x.split('_')[::-1]))
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
Here apply is applying the formatting to each item of the series. The lambda splits each value on '_', uses [::-1] to reverse the order of the resulting parts, and uses * to unpack them into the format arguments.
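If you'd rather stay vectorized and avoid a Python-level lambda, the same reordering can be done with str.split(expand=True) and plain string concatenation (a sketch assuming each value contains exactly one underscore between the size and the fruit):

import pandas as pd

df = pd.DataFrame({'B': ['big_apples', 'big_oranges', 'small_pears', 'medium_grapes']})

# split into two columns: 0 = size prefix, 1 = fruit
parts = df.B.str.split('_', n=1, expand=True)
df['B'] = parts[1] + '(' + parts[0] + ')'
print(df.B.tolist())
# ['apples(big)', 'oranges(big)', 'pears(small)', 'grapes(medium)']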

how do you pass multiple variables to pandas dataframe to use them with .map to create a new column

To pass multiple variables to a normal python function you can just write something like:
def a_function(date, string, float):
    # do something...
    # convert string to int
    date = date + (float * int) days
    return date
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three

df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c': [4,5,6]})
df['d'] = df['a','b','c'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column? For example, input would be 1, 2, 3 and output would be 1*2*3.
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
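For the 1*2*3 example from the question, the vectorized form is just element-wise multiplication of the columns, or DataFrame.prod across the row axis (a quick sketch with the same toy frame):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [4, 5, 6]})

df['d'] = df['a'] * df['b'] * df['c']          # element-wise product of the three columns
# equivalently: df['d'] = df[['a', 'b', 'c']].prod(axis=1)
print(df['d'].tolist())
# [8, 30, 72]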
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
In this situation apply is the most general approach; recent pandas versions also expose the individual date components through a vectorized .dt accessor (see the aside at the end of this answer). In some cases, however, you can use vectorized operations directly:
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
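As an aside on the date example above: recent pandas versions provide a vectorized .dt accessor on datetime columns, so the year/month/day split no longer needs apply (a brief sketch using the same toy data):

import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015/05/01', '2015/05/03')})

components = pd.DataFrame({
    'Year': df.date.dt.year,    # vectorized access to each date component
    'Month': df.date.dt.month,
    'Day': df.date.dt.day,
})
print(components)
#    Year  Month  Day
# 0  2015      5    1
# 1  2015      5    2
# 2  2015      5    3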
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a * row.b * row.c, axis=1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72

Remove empty lists in pandas series

I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, indexing works just like above:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. ideally, I'd like to split the tuples in the series into a DataFrame of two columns also.
You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
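Putting that together, the filter itself is plain boolean indexing on the length (a minimal sketch with the series from the question):

import pandas as pd

series = pd.Series([[(1, 2)], [(3, 5)], [], [(3, 5)]])

# keep only the rows whose list has at least one element
non_empty = series[series.str.len() > 0]
print(non_empty)
# 0    [(1, 2)]
# 1    [(3, 5)]
# 3    [(3, 5)]
# dtype: object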
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
Using the built-in apply, you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]
Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series()).dropna()
which yields
   0  1
0  1  2
1  3  5
3  3  5
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
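which yields the same two columns, this time with a clean 0..2 index:
   0  1
0  1  2
1  3  5
2  3  5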
