Is there a way to apply a list of functions to each column in a DataFrame like the DataFrameGroupBy.agg function does? I found an ugly way to do it like this:
df=pd.DataFrame(dict(one=np.random.uniform(0,10,100), two=np.random.uniform(0,10,100)))
df.groupby(np.ones(len(df))).agg(['mean','std'])
        one                 two
       mean       std      mean       std
1  4.802849  2.729528  5.487576  2.890371
For Pandas 0.20.0 or newer, use df.agg (thanks to ayhan for pointing this out):
In [11]: df.agg(['mean', 'std'])
Out[11]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
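If different columns need different reducers, df.agg also accepts a dict that maps column names to lists of function names. The following is only a sketch: the seeded random data and the choice of min/max for the second column are assumptions made for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Seeded data so the example is reproducible (assumption: any numeric frame works).
rng = np.random.default_rng(0)
df = pd.DataFrame({"one": rng.uniform(0, 10, 100), "two": rng.uniform(0, 10, 100)})

# Each column gets its own list of reducers; cells for reducers that were
# not requested for a column are filled with NaN.
summary = df.agg({"one": ["mean", "std"], "two": ["min", "max"]})
print(summary)
```

The result has one row per function name and one column per original column, the same layout as df.agg(['mean', 'std']).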
For older versions, you could use
In [61]: df.groupby(lambda idx: 0).agg(['mean','std'])
Out[61]:
        one                 two
       mean       std      mean       std
0  5.147471  2.971106    4.9641  2.753578
Another way would be:
In [68]: pd.DataFrame({col: [getattr(df[col], func)() for func in ('mean', 'std')] for col in df}, index=('mean', 'std'))
Out[68]:
one two
mean 5.147471 4.964100
std 2.971106 2.753578
In the general case where you have arbitrary functions and column names, you could do this:
df.apply(lambda r: pd.Series({'mean': r.mean(), 'std': r.std()})).transpose()
mean std
one 5.366303 2.612738
two 4.858691 2.986567
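The same idea can be wrapped in a small helper so the function list is passed in as data. This is a sketch under assumptions: summarize and the "spread" reducer are hypothetical names introduced here, and the random data is seeded only for reproducibility.

```python
import numpy as np
import pandas as pd

def summarize(frame, funcs):
    """Hypothetical helper: apply each named function to every column.

    Returns one row per original column and one column per function name.
    """
    per_column = frame.apply(
        lambda col: pd.Series({name: func(col) for name, func in funcs.items()})
    )
    return per_column.transpose()

rng = np.random.default_rng(0)
df = pd.DataFrame({"one": rng.uniform(0, 10, 100), "two": rng.uniform(0, 10, 100)})

stats = summarize(df, {
    "mean": lambda col: col.mean(),
    "std": lambda col: col.std(),
    "spread": lambda col: col.max() - col.min(),  # arbitrary custom reducer
})
print(stats)
```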
I tried applying three functions to a column in sequence, and it works:
import re
# replace newline characters with spaces
rem_newline = lambda x : re.sub('\n',' ',x).strip()
# lowercase and strip surrounding whitespace
lower_strip = lambda x : x.lower().strip()
df = df['users_name'].apply(lower_strip).apply(rem_newline).str.split('(',n=1,expand=True)
I am using pandas to analyze Chilean legislation drafts. In my dataframe, the list of authors is stored as a string. The answer above did not work for me (using pandas 0.20.3), so I used my own logic and came up with this:
df.authors.apply(eval).apply(len).sum()
Concatenated applies! A pipeline!! The first apply transforms
"['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']"
into the obvious list, the second apply counts the number of lawmakers involved in the project. I want the size of every pair (lawmaker, project number) (so I can presize an array where I will study which parties work on what).
Interestingly, this works! Even more interestingly, that last call fails if one gets too ambitious and does this instead:
df.authors.apply(eval).apply(len).apply(sum)
with an error:
TypeError: 'int' object is not iterable
coming from deep within /site-packages/pandas/core/series.py in apply
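The failure has a simple cause: .apply(len) already produces one integer per row, so .apply(sum) ends up calling sum() on each single integer, whereas .sum() reduces the whole Series. A minimal sketch follows, with assumptions stated: the second row is made up for illustration, and ast.literal_eval is swapped in for eval as a safer way to parse list literals.

```python
import ast
import pandas as pd

df = pd.DataFrame({
    "authors": [
        "['Barros Montero: Ramón', 'Bellolio Avaria: Jaime', 'Gahona Salazar: Sergio']",
        "['Gahona Salazar: Sergio']",  # made-up second row for illustration
    ]
})

# First apply: string -> list. Second apply: list -> number of authors.
counts = df["authors"].apply(ast.literal_eval).apply(len)

# Series.sum() adds up the per-row counts.
total = counts.sum()

# apply(sum) calls the built-in sum() on each integer, which is not iterable.
error_message = None
try:
    counts.apply(sum)
except TypeError as exc:
    error_message = str(exc)

print(total, error_message)
```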
I have a Pandas DataFrame, with columns 'time' and 'current'. It also has lots of other columns, but I don't want to use them for this operation. All values are floats.
df[['time','current']].head()
time current
1 0.0 9.6
2 300.0 9.3
3 600.0 9.6
4 900.0 9.5
5 1200.0 9.5
I'd like to calculate the rolling integral of current over time, such that at each point in time, I get the integral up to that point of the current over the time. (I realize that this particular operation is simple, but it's an example. I'm not really looking for this function, but the method as a whole)
Ideally, I'd be able to do something like this:
df[['time','current']].expanding().apply(scipy.integrate.trapezoid)
or
df[['time','current']].expanding(method = 'table').apply(scipy.integrate.trapezoid)
but neither of these works, as I'd like to take the 'time' column as the function's first argument, and the 'current' as the second. The function does work with one column (current alone), but I don't like dividing by timesteps separately afterwards.
It seems DataFrame columns can't be accessed within expanding().apply().
I've heard that internally the expanding is treated as an array, so I've also tried this:
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x[0], x[1]))
df[['time','current']].expanding(method = 'table').apply(lambda x:scipy.integrate.trapezoid(x['time'], x['current']))
and variations, but I can never access the columns in expanding().
As a matter of fact, even using apply() on a plain DataFrame disallows using columns simultaneously, as each one is treated sequentially as a Series.
df[['time','current']].apply(lambda x:scipy.integrate.trapezoid(x.time,x.current))
...
AttributeError: 'Series' object has no attribute 'time'
This answer mentions the method 'table' for expanding(), but it wasn't out at the time, and I can't seem to figure out what it needs to work here. Their solution was simply to do it manually.
I've also tried defining the function first, but this returns an error too:
def func(x,y):
    return scipy.integrate.trapezoid(x,y)
df[['time','current']].expanding().apply(func)
...
DataError: No numeric types to aggregate
Is what I'm asking even possible with expanding().apply()? Should I just do it another way? Can I apply expanding inside the apply()?
Thanks, and good luck.
Overview
This is not yet fully implemented in pandas, but there are workarounds. expanding() and rolling() combined with .agg() or .apply() operate column by column unless you specify method='table' (see Method 2).
Method 1
There is a workaround that gets you what you want as long as the output is a single column. The trick is to move columns into the index and then reset the index inside the function. (Don't do this with scipy.integrate.trapezoid, though: as @ALollz said, scipy.integrate.cumtrapz is already a cumulative (expanding) calculation.)
def custom_func(serie):
    subDf = serie.reset_index()
    # work with the sub dataframe as you would do in a groupby
    # you have access to subDf.x and subDf.y
    return scipy.integrate.trapezoid(subDf.x, subDf.y)

df.set_index(['y']).expanding().agg(custom_func)
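For the integral in the question specifically, the remark about a cumulative trapezoid rule can be sketched with plain pandas, with no expanding().apply() at all. This is a sketch under assumptions: it reuses the five sample rows from the question and introduces a hypothetical output column named integral. scipy.integrate.cumulative_trapezoid(y, x, initial=0), the current name of cumtrapz, should give the same numbers, but SciPy is left out here to keep the example dependency-free.

```python
import pandas as pd

# The five sample rows shown in the question.
df = pd.DataFrame({
    "time": [0.0, 300.0, 600.0, 900.0, 1200.0],
    "current": [9.6, 9.3, 9.6, 9.5, 9.5],
})

# Trapezoid rule per step: width of the step times the mean of its two endpoints.
dt = df["time"].diff()
mean_current = (df["current"] + df["current"].shift()) / 2

# The running total of the step areas is the integral up to each point in time.
# The first row has no previous point, so its area is set to 0.
df["integral"] = (dt * mean_current).fillna(0.0).cumsum()
print(df)  # integral is approximately 0, 2835, 5670, 8535, 11385
```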
Method 2
You can use method='table' (available from pandas==1.3.0) in expanding() and rolling(). In that case you need to call .apply(custom_func, raw=True, engine='numba') and write custom_func in Numba-compatible Python (beware of types); it receives the NumPy array representation of your dataframe. Your custom_func must return an array of the same length as the input arrays, so you might have to add dummy columns to the input to get around this and rename your columns afterward.
min_periods = 100

def custom_func(table):
    rep = np.zeros(len(table))
    # You need something like this if you want to use the min_periods argument
    if len(table) < min_periods:
        return rep
    # Do something with your numpy arrays
    return rep

df.expanding(min_periods, method='table').apply(custom_func, raw=True, engine='numba')
# Rename
df.columns = ...
I have a dataframe extracted with Pandas in which one of the columns looks something like this:
What I want to do is to extract the numerical values (floats) in this column, which by itself I could do. The issue comes because I have some cells, like the cell 20 in the image, in which I have more than one number, so I would like to make an average of these values. I think that for that I would first need to recognize the different groups of numerical values in the string (each float number) and then extract them as floats to then operate with them. I don't know how to do this.
Edit: I have found a solution to this using the re.findall command from regex. It is based on the answer to a question in this thread: Find all floats or ints in a given string.
for index, value in z.items():
    z[index] = statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', value)])
Note that I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.
However, I get a warning with this approach, due to the loop (there is no warning when I do it only for one element of the series):
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
Although I don't see any issue happening with my data, is this warning important?
I think you can benefit from the Pandas vectorized operations here. Use findall over the original dataframe and apply in sequence the pd.Series to transform from list to columns and pd.to_numeric to convert from string to numeric type (default return dtype is float64). Then calculate the average of the values on each row with .mean(axis=1).
import pandas as pd
d = {0: {0: '2.469 (VLT: emission host)',
1: '1.942 (VLT: absorption)',
2: '1.1715 (VLT: absorption)',
3: '0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)',
4: '3.3765 (VLT: absorption)',
5: '1.86 (Xinglong: absorption)| 1.86 (GMG: absorption)|1.859 (VLT: absorption)',
6: '<2.4 (NOT: inferred)'}}
df = pd.DataFrame(d)
print(df)
s_mean = df[0].str.findall(r'(?:\b\d{1,2}\b(?:\.\d*))')\
.apply(pd.Series)\
.apply(pd.to_numeric)\
.mean(axis=1)
print(s_mean)
Output from s_mean
0 2.469000
1 1.942000
2 1.171500
3 0.423167
4 3.376500
5 1.859667
6 2.400000
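A variant that avoids widening the lists into columns is str.extractall, which returns one row per match, followed by a groupby on the original row index. This is a sketch under assumptions: it uses three of the sample strings above and the same pattern wrapped in a capture group. Rows with no match at all would simply be absent from the result.

```python
import pandas as pd

s = pd.Series([
    "2.469 (VLT: emission host)",
    "0.42 (NOT: absorption)|0.4245 (GTC)|0.4250 (ESO-VLT UT2: absorption & emission)",
    "<2.4 (NOT: inferred)",
])

# extractall needs a capture group; the result has a MultiIndex of
# (original row, match number) and a single column named 0.
matches = s.str.extractall(r"(\b\d{1,2}\b\.\d*)")

# Convert the matched text to floats and average per original row.
s_mean = matches[0].astype(float).groupby(level=0).mean()
print(s_mean)
```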
I have found a solution based on what I wrote previously in the Edit of the original post:
It consists of using the re.findall() command with regex, as posted in this thread Find all floats or ints in a given string:
statistics.mean([float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))',string)])
Then, to loop over the dataframe column, just use the lambda x: method with the pandas apply command (df.apply). For this, I have defined a function (redshift_to_num) executing the operation above, and then apply this function to each element in the dataframe column:
import re
import pandas as pd
import statistics
def redshift_to_num(string):
    measures = [float(h) for h in re.findall(r'(?:\b\d{1,2}\b(?:\.\d*))', string)]
    mean = statistics.mean(measures)
    return mean

df.Redshift = df.Redshift.apply(redshift_to_num)
Notes:
The data of interest in my case is stored in the dataframe column df.Redshift.
In the re.findall command I haven't included match for integers, and only account for values up to 99, just due to the type of data that I have.
Question
I have an email_alias column and I'd like to find the number of integers in that column (per row) in another column using Python. So far I can only count the total number of numbers in the entire column.
Attempt
I tried: df['count_numbers'] = sum(c.isdigit() for c in df['email_alias'])
Example:
email_alias count_numbers
thisisatest111 3
testnumber2 1
I believe this might be the simplest solution.
df['count_numbers'] = df['email_alias'].str.count(r'\d')
You can apply a custom python function to the column. I don't think there's a vectorized way. sum() here takes advantage of the fact that bools are a subclass of ints so all True values are equal to 1.
import pandas as pd
def count_digits(string):
    return sum(item.isdigit() for item in string)
df = pd.DataFrame({'a': ['thisisatest111', 'testnumber2']})
df['counts'] = df['a'].apply(count_digits)
Your approach of:
df['count_numbers'] = sum(c.isdigit() for c in df['email_alias'])
could not work because iterating over df['email_alias'] yields each whole string rather than its characters, so the sum is a single number, and df['count_numbers'] = assigns that one value to every row of the column. Here, apply implicitly iterates over the rows (but at Python speed, so it's not vectorized). Then again, most of the .str accessor methods of Pandas do too, despite the syntax suggesting they will be faster than a for loop.
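A short sketch of the difference, using the two sample aliases from the question. The variable name scalar is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({"email_alias": ["thisisatest111", "testnumber2"]})

# Iterating over a Series yields whole strings. Neither alias consists
# only of digits, so every isdigit() call is False and the sum is 0.
scalar = sum(c.isdigit() for c in df["email_alias"])

# Per-row count of digit characters via the .str accessor.
df["count_numbers"] = df["email_alias"].str.count(r"\d")
print(scalar)
print(df)
```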
I want to sum different columns in a spark dataframe.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why aren't approaches #2 and #3 working?
I am on Spark 2.2
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using Python's built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
#2. Doesnt work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the PySpark sum function, which is an aggregate: it takes a single column and sums it down the rows, whereas you are passing it a list of columns and want a row-level sum.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
#3. Doesnt work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think you have to iterate row-wise and apply sum over it.
TL;DR builtins.sum is just fine.
Following your comments:
Using native python sum() is not benefitting from spark optimization. so whats the spark way of doing it
and
its not a pypark function so it wont be really be completely benefiting from spark right.
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable.
sum reduces output by taking elements of this list and calling __add__ method (+). Imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
accum = accum + element
This gives Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when the expression is evaluated it benefits from all Spark optimizations.
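The mechanism can be imitated without Spark. The sketch below uses a hypothetical Expr class as a stand-in for a lazy column expression; it is not PySpark's Column and makes no claim about PySpark's exact output. It only shows that the built-in sum does nothing but call + on the objects it is given, so the result is an expression and no data is read. One detail the imperative loop above glosses over: the built-in sum starts from 0, so the first addition goes through __radd__.

```python
import functools
import operator

class Expr:
    """Hypothetical stand-in for a lazy column expression (not PySpark's Column)."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        # Build a new expression instead of computing anything.
        return Expr(f"({self.text} + {_text(other)})")

    def __radd__(self, other):
        # Called for `0 + Expr`, which is how the built-in sum() starts.
        return Expr(f"({_text(other)} + {self.text})")

    def __repr__(self):
        return f"Expr<{self.text}>"

def _text(value):
    """Render an operand: nested expressions by their text, literals by repr."""
    return value.text if isinstance(value, Expr) else repr(value)

columns = [Expr("A.p1"), Expr("B.p1")]

total = sum(columns)                             # starts from 0
reduced = functools.reduce(operator.add, columns)  # starts from the first element

print(total)    # Expr<((0 + A.p1) + B.p1)>
print(reduced)  # Expr<(A.p1 + B.p1)>
```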
Addition of multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (Pyspark version 2.3.1)
Built-in python's sum function is working for some folks but giving error for others (might be because of conflict in names)
In your 3rd approach, the expression (inside python's sum function) is returning a PySpark DataFrame.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
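One caveat when building the expression string: column names that contain dots, such as the A.p1 and B.p1 from the question, need backticks, just as in the question's own df[...] lookups. A minimal sketch of the string construction only follows; passing the result to expr is not executed here, and names that themselves contain backticks are not handled.

```python
# Column names taken from the question; both contain a dot.
cols_list = ["A.p1", "B.p1"]

# Wrap every name in backticks so the dot is read as part of the name.
expression = " + ".join(f"`{name}`" for name in cols_list)

print(expression)  # `A.p1` + `B.p1`
```

The resulting string would then be used as in the snippet above: df.withColumn('sum_cols', expr(expression)).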
In the following, male_trips is a big pandas data frame and stations is a small pandas data frame. For each station id I'd like to know how many male trips took place. The following does the job, but takes a long time:
mc = [ sum( male_trips['start_station_id'] == id ) for id in stations['id'] ]
how should I go about this instead?
Update! So there were two main approaches: groupby() followed by size(), and the simpler .value_counts(). I did a quick timeit, and the groupby approach wins by quite a large margin! Here is the code:
from timeit import Timer
setup = "import pandas; male_trips=pandas.read_pickle('maletrips')"
a = "male_trips.start_station_id.value_counts()"
b = "male_trips.groupby('start_station_id').size()"
Timer(a,setup).timeit(100)
Timer(b,setup).timeit(100)
and here is the result:
In [4]: Timer(a,setup).timeit(100) # <- this is value_counts
Out[4]: 9.709594964981079
In [5]: Timer(b,setup).timeit(100) # <- this is groupby / size
Out[5]: 1.5574288368225098
Note that, at these speeds, value_counts is still marginally quicker to type when exploring data, and there is less to remember!
I'd do it like Vishal, but use size() instead of sum() to get the number of rows in each 'start_station_id' group. So:
df = male_trips.groupby('start_station_id').size()
My answer below works in Pandas 0.7.3. Not sure about the new releases.
This is what the pandas.Series.value_counts method is for:
count_series = male_trips.start_station_id.value_counts()
It should be straight-forward to then inspect count_series based on the values in stations['id']. However, if you insist on only considering those values, you could do the following:
count_series = (
male_trips[male_trips.start_station_id.isin(stations.id.values)]
.start_station_id
.value_counts()
)
and this will only give counts for station IDs actually found in stations.id.
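To get one count per entry of stations['id'], in the same order as the original list comprehension and with 0 for stations that have no trips, the counts can be reindexed. This is a sketch with made-up toy data; only the column names come from the question.

```python
import pandas as pd

# Toy data (assumption): six trips, three known stations, one of them unused.
male_trips = pd.DataFrame({"start_station_id": [3, 3, 5, 7, 3, 5]})
stations = pd.DataFrame({"id": [3, 5, 9]})

# One pass over the big frame instead of one comparison per station.
counts = male_trips.groupby("start_station_id").size()

# Align to the station list; stations without trips get 0, unknown ids are dropped.
mc = counts.reindex(stations["id"], fill_value=0)

# Same numbers as the slow version from the question.
slow = [sum(male_trips["start_station_id"] == i) for i in stations["id"]]
print(mc.tolist(), slow)
```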
male_trips.count()
doesn't work?
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html
how long would this take:
df = male_trips.groupby('start_station_id').sum()
edit: after seeing in the answer above that isin and value_counts exist (and value_counts even comes with its own entry in pandas.core.algorithm and also isin isn't simply np.in1d) I updated the three methods below
male_trips.start_station_id[male_trips.start_station_id.isin(station.id)].value_counts()
You could also do an inner join on stations.id:
pd.merge(male_trips, station, left_on='start_station_id', right_on='id') followed by value_counts.
Or:
male_trips.set_index('start_station_id', inplace=True)
station.set_index('id', inplace=True)
male_trips.loc[male_trips.index.intersection(station.index)].reset_index().start_station_id.value_counts()
If you have the time I'd be interested how this performs differently with a huge DataFrame.