I am trying to manipulate a large list of strings, so I cannot do this manually. I am new to Python and am having trouble figuring this out.
I have a dataframe with columns:
df = pd.read_csv('filename.csv')
df
A B
0 big_apples
1 big_oranges
2 small_pears
3 medium_grapes
and I need it to look more like:
A B
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
I was thinking of using startswith() together with .replace() and concatenating everything. But then I would have to create a column for each prefix, and I need it to recognize the unique prefixes automatically. Is there a more efficient method?
You can do some string formatting and apply it to the Series:
df.B.apply(lambda x: '{}({})'.format(*x.split('_')[::-1]))
0 apples(big)
1 oranges(big)
2 pears(small)
3 grapes(medium)
Here apply calls the lambda on each item of the Series. Inside it, x.split('_') splits the string into a list of parts, [::-1] reverses that list, and * unpacks the reversed list into the arguments of format. This assumes each value contains exactly one underscore.
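As an alternative to apply, the same swap can be sketched with the vectorized .str.replace and a regular expression. The sample data and the column name B_swapped are assumptions made for illustration; the pattern splits on the first underscore only.

```python
import pandas as pd

# Sample data mirroring column B from the question
df = pd.DataFrame({"B": ["big_apples", "big_oranges", "small_pears", "medium_grapes"]})

# Group 1 is the prefix before the first underscore, group 2 is the rest.
# The replacement writes them back in swapped order as "rest(prefix)".
df["B_swapped"] = df["B"].str.replace(r"^([^_]+)_(.+)$", r"\2(\1)", regex=True)

print(df["B_swapped"].tolist())
```

Values without an underscore do not match the pattern and are left unchanged.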
I have a Pandas series containing a list of strings like so:
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
2 ['hello','howdy','my','name']
3 ['hello','mate','my','name']
4 ['hello','hello','my','name']
type(series_of_list)
pandas.core.series.Series
I would like to keep only the first two entries of each list, like so:
series_of_list.head()
0 ['hello','there']
1 ['hello','hi']
2 ['hello','howdy']
3 ['hello','mate']
4 ['hello','hello']
I have tried slicing it, series_of_list=series_of_list[:2], but doing so just returns the first two indexes of the series...
series_of_list.head()
0 ['hello','there','my','name']
1 ['hello','hi','my','name']
I have also tried .drop and other slicing but the outcome is not what I want.
How can I only keep the first two items of the list for the entire pandas series?
Thank you!
Use pandas.Series.apply() to call a function on each element, so the slice is taken from each list rather than from the Series:
series_of_list = series_of_list.apply(lambda x: x[:2])
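To make the difference concrete, here is a small sketch with made-up data. It contrasts slicing the Series itself with slicing each element, and also shows the .str accessor, which can slice list elements in an object-dtype Series as well. The variable names are illustrative only.

```python
import pandas as pd

series_of_list = pd.Series([
    ["hello", "there", "my", "name"],
    ["hello", "hi", "my", "name"],
    ["hello", "howdy", "my", "name"],
])

# Slicing the Series selects rows: this keeps the first two lists, unshortened
first_two_rows = series_of_list[:2]

# apply runs the slice on every list, keeping all rows
first_two_items = series_of_list.apply(lambda items: items[:2])

# The .str accessor slices each element too, even though they are lists
first_two_items_str = series_of_list.str[:2]

print(first_two_items.tolist())
```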
I am having an issue with copying a dataframe. Basically, I want to replicate a dataframe under another variable name, but with a numerical index instead of a categorical one. Below I have a function that returns the dataframe mean_df; when I print it I see that the row labels are categorical letters. I then create a new dataframe (mean_df_num) by setting it equal to mean_df, and replace the row labels of mean_df_num with integer positions. However, when I print mean_df afterwards, its index has also become numerical. Why does this happen and is there a way around it?
mean_df = mean_funct(train_df_cat)
print(mean_df)
mean_df_num = mean_df
mean_df_num.index = range(len(mean_df_num)) #Convert df to numerical indices
print(mean_df)
Output:
m00 mu02 mu11
a 1.00162 0.357137 -0.245608
c 0.766659 0.354217 0.244405
e 0.929145 0.422447 0.0602329
m 1.61799 2.85194 -1.80078
n 1.03976 0.700674 -1.0011
o 0.97873 0.754065 0.172753
r 0.623244 0.11065 1.52705
s 0.789545 0.177259 -0.154744
x 1.0039 0.404982 -1.51634
z 0.919228 0.3578 0.42973
m00 mu02 mu11
0 1.00162 0.357137 -0.245608
1 0.766659 0.354217 0.244405
2 0.929145 0.422447 0.0602329
3 1.61799 2.85194 -1.80078
4 1.03976 0.700674 -1.0011
5 0.97873 0.754065 0.172753
6 0.623244 0.11065 1.52705
7 0.789545 0.177259 -0.154744
8 1.0039 0.404982 -1.51634
9 0.919228 0.3578 0.42973
A variable holding a pandas DataFrame is a reference to the object, not a copy of it. That means when you do mean_df_num = mean_df, both names point to the same object, so a change made through one is visible through the other. The way around this is .copy(), i.e. mean_df_num = mean_df.copy().
Actually, for your purpose, it's better to just do mean_df_num = mean_df.reset_index(drop=True). It does both at once: it returns a new DataFrame and replaces the index with a default integer range index.
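The three options can be compared in a short sketch. The small frame below is a stand-in for the question's mean_df, and the names alias, independent and renumbered are made up for illustration.

```python
import pandas as pd

mean_df = pd.DataFrame({"m00": [1.0, 0.77, 0.93]}, index=["a", "c", "e"])

alias = mean_df                              # second name for the same object
independent = mean_df.copy()                 # separate object with its own index
renumbered = mean_df.reset_index(drop=True)  # new frame with a 0..n-1 index

# Changing the copy's index leaves the original untouched
independent.index = range(len(independent))

print(alias is mean_df)        # True
print(list(mean_df.index))     # ['a', 'c', 'e']
print(list(renumbered.index))  # [0, 1, 2]
```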
Hi, is there a way to get a substring of one column based on the value in another column?
import pandas as pd
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x
digit name
0 2 bernard
1 3 brenden
2 3 bern
What i would expect is something like:
for row in x.itertuples():
    print(row.name[:row.digit])
be
bre
ber
where the result is the substring of name based on digit.
I know that if I really want to I can build a list from itertuples, but that does not seem right, and I always try to find a vectorized method.
Appreciate any feedback.
Use apply with axis=1 for row-wise with a lambda so you access each column for slicing:
In [68]:
x = pd.DataFrame({'name':['bernard','brenden','bern'],'digit':[2,3,3]})
x.apply(lambda x: x['name'][:x['digit']], axis=1)
Out[68]:
0 be
1 bre
2 ber
dtype: object
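Since each row needs a different slice length, there is no single vectorized string call for this. One commonly used alternative to a row-wise apply is a list comprehension over the two columns; this is only a sketch, and the column name prefix is made up for illustration.

```python
import pandas as pd

x = pd.DataFrame({"name": ["bernard", "brenden", "bern"], "digit": [2, 3, 3]})

# Pair each name with its slice length and cut the string in plain Python
x["prefix"] = [name[:n] for name, n in zip(x["name"], x["digit"])]

print(x["prefix"].tolist())  # ['be', 'bre', 'ber']
```

This avoids building a Series object for every row, which is what apply with axis=1 does.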
To pass multiple variables to a normal python function you can just write something like:
import datetime

def a_function(date, string, number):
    # Convert the string to an int, then shift the date by number * int days
    days = number * int(string)
    return date + datetime.timedelta(days=days)
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one, two, three):
    return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','c'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one, two, three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
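For the Year/Month/Day case raised in the question, pandas can assemble a date column directly: pd.to_datetime accepts a DataFrame whose columns are named year, month and day. The frame and values below are assumptions made for illustration.

```python
import pandas as pd

dates = pd.DataFrame({"year": [2015, 2015], "month": [5, 6], "day": [1, 15]})

# to_datetime builds one timestamp per row from the year/month/day columns
dates["whole_date"] = pd.to_datetime(dates[["year", "month", "day"]])

print(dates["whole_date"])
```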
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
Year Month Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
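The apply approach is the general fallback when no vectorized accessor exists. For datetime columns specifically, the components are also available in vectorized form through the .dt accessor (df.date.dt.year, df.date.dt.month, df.date.dt.day). String columns have similar vectorized operations through .str: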
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
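The same pattern carries over to datetime columns, where pandas provides the .dt accessor. The sketch below rebuilds the Year/Month/Day frame from the earlier date example without apply; the name parts is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2015-05-01", "2015-05-03")})

# Each .dt attribute returns a whole Series of components at once
parts = pd.DataFrame({
    "Year": df["date"].dt.year,
    "Month": df["date"].dt.month,
    "Day": df["date"].dt.day,
})

print(parts)
```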
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis=1)
or
df.apply(np.prod, axis=1)  # assumes numpy is imported as np
0 8
1 30
2 72
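For a plain product, the row-wise apply can usually be skipped in favour of column arithmetic or DataFrame.prod. A minimal sketch using the question's test frame, with the names product and same chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4], "c": [4, 5, 6]})

# Multiply the columns element by element
df["product"] = df["a"] * df["b"] * df["c"]

# Equivalent: take the product across the selected columns of each row
same = df[["a", "b", "c"]].prod(axis=1)

print(df["product"].tolist())  # [8, 30, 72]
```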
I'm having a problem trying to get a character count column of the string values in another column, and haven't figured out how to do it efficiently.
for index in range(len(df)):
    df['char_length'][index] = len(df['string'][index])
This apparently involves first creating a column of nulls and then rewriting it, and it takes a really long time on my data set. So what's the most effective way of getting something like
'string' 'char_length'
abcd 4
abcde 5
I've checked around quite a bit, but I haven't been able to figure it out.
Pandas has a vectorised string method for this: str.len(). To create the new column you can write:
df['char_length'] = df['string'].str.len()
For example:
>>> df
string
0 abcd
1 abcde
>>> df['char_length'] = df['string'].str.len()
>>> df
string char_length
0 abcd 4
1 abcde 5
This should be considerably faster than looping over the DataFrame with a Python for loop.
Many other familiar string methods from Python have been introduced to Pandas. For example, lower (for converting to lowercase letters), count for counting occurrences of a particular substring, and replace for swapping one substring with another.
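A quick sketch of the three methods just mentioned, on made-up sample strings. All of them are reached through the .str accessor and operate on the whole Series at once.

```python
import pandas as pd

s = pd.Series(["Abcd", "abcde"])

lowered = s.str.lower()                            # lowercase every string
b_counts = s.str.count("b")                        # occurrences of "b" per string
replaced = s.str.replace("cd", "XY", regex=False)  # literal substring swap

print(lowered.tolist())   # ['abcd', 'abcde']
print(b_counts.tolist())  # [1, 1]
print(replaced.tolist())  # ['AbXY', 'abXYe']
```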
Here's one way to do it.
In [3]: df
Out[3]:
string
0 abcd
1 abcde
In [4]: df['len'] = df['string'].str.len()
In [5]: df
Out[5]:
string len
0 abcd 4
1 abcde 5