I am attempting to write a function that will sum a set of specified columns in a pandas DataFrame.
First, some background. The columns share a base name (e.g., "var") with a sequential number appended to it (e.g., var1, var2). I know I can sum, say, 5 columns together with the following code:
import pandas as pd
data = pd.read_csv('data_file.csv')
data['var_total'] = data.var1 + data.var2 + data.var3 + data.var4 + data.var5
However, this can be repetitive when you have var1-var30 to sum. I figured there must be some elegant solution to summing them more quickly, since the column names are predictable and uniform. Is there a function I can write or a built-in pandas function that will let me sum these more quickly?
You could do something like this:
data['var_total'] = data.filter(regex='var[0-9]+').sum(axis=1)
This first filters your dataframe down to the columns whose names contain var followed by one or more digits, then sums across the resulting filtered DataFrame row by row.
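For example, on a small made-up frame (the covar1 column is only there to show the edge case), filter's regex is applied with re.search, so anchoring the pattern is safer if other column names could contain a var<number> substring:
import pandas as pd

df = pd.DataFrame({'var1': [1, 2], 'var2': [3, 4], 'covar1': [10, 20]})

# 'var[0-9]+' would also match 'covar1' (re.search finds 'var1' inside it);
# anchoring keeps only names that start with 'var' and end in digits.
df['var_total'] = df.filter(regex='^var[0-9]+$').sum(axis=1)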
I think you're looking for the filter method of DataFrame; you can pass it either a string or a regular expression, and it will just return the columns whose names match it. Then you can just call sum or whatever else you want on the resulting columns:
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']})
othercol var1 var2
0 abc 1 2
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var')
var1 var2
0 1 2
pd.DataFrame({'var1':[1], 'var2':[2],'othercol':['abc']}).filter(like='var').sum(axis=1)
0 3
By the way, note that I've called sum(axis=1) to get the row-wise sums; by default, sum returns the column-wise sums.
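A tiny made-up frame makes the difference visible:
import pandas as pd

df = pd.DataFrame({'var1': [1], 'var2': [2]})
df.sum()         # column-wise (the default): var1 -> 1, var2 -> 2
df.sum(axis=1)   # row-wise: 0 -> 3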
Even if you are writing out all the column names, there are a couple of ways to do the sum a bit more elegantly:
import pandas as pd
import numpy as np
df = pd.DataFrame({'var1': np.random.randint(1, 10, 10),
'var2': np.random.randint(1, 10, 10),
'var3': np.random.randint(1, 10, 10)})
# Use the sum method:
df[['var1', 'var2', 'var3']].sum(axis='columns')
# Use eval
df.eval('var1 + var2 + var3')
Then you can always use the standard Python tools for manipulating strings to put together the list of column names:
cols = ['var' + str(n) for n in range(1, 3 + 1)]
cols
Out[9]: ['var1', 'var2', 'var3']
df[cols].sum(axis='columns')
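If this comes up often, the two steps fold naturally into a small helper; the function below is just an illustration (sum_numbered_columns is not a pandas built-in):
def sum_numbered_columns(df, base, start, stop):
    # Sum the columns named base{start}..base{stop}, row-wise.
    cols = [f'{base}{n}' for n in range(start, stop + 1)]
    return df[cols].sum(axis='columns')

df['var_total'] = sum_numbered_columns(df, 'var', 1, 3)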
I am pretty new to Python and pandas, and I want to sort through two existing dataframes by certain columns and create a third dataframe that contains only the value matches within a tolerance. In other words, I have df1 and df2, and I want df3 to contain the rows and columns of df2 that are within the tolerance of the values in df1:
Two dataframes:
df1=pd.DataFrame([[0.221,2.233,7.84554,10.222],[0.222,2.000,7.8666,10.000],
[0.220,2.230,7.8500,10.005]],columns=('rt','mz','mz2','abundance'))
[Dataframe 1]
df2=pd.DataFrame([[0.219,2.233,7.84500,10.221],[0.220,2.002,7.8669,10.003],[0.229,2.238,7.8508,10.009]],columns=('rt','mz','mz2','abundance'))
[Dataframe 2]
Expected Output:
df3=pd.DataFrame([[0.219,2.233,7.84500,10.221],[0.220,2.002,7.8669,10.003]],columns=('Rt','mz','mz2','abundance'))
[Dataframe 3]
I have tried for loops and filters, but as I am a newbie nothing is really working for me. Here is what I'm trying now:
import pandas as pd
import numpy as np
p = []
d = np.array(p)
# print(d.dtype)

def count(df2, l, r):
    l = [(df1['Rt'] - 0.001)]
    r = [(df1['Rt'] + 0.001)]
    for x in df2['Rt']:
        # condition check
        if x >= l and x <= r:
            print(x)
            d.append(x)
where p and d are the corresponding list and array (if making an array is even necessary?) that will be populated. I bet the problem lies in the fact that the function shouldn't contain the for loop.
Ideally, this would work to filter roughly 13,000 rows of one dataframe using the 180 column values of the other dataframe.
Thank you in advance!
Is this what you're looking for?:
rt_min = df1.rt.min() - 0.001
rt_max = df1.rt.max() + 0.001
df3 = df2[(df2.rt >= rt_min) & (df2.rt <= rt_max)]
>>> df3
I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a list comprehension, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P
columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222),(866, 888),(152, 158)]
INTERVAL= P.Interval(*[P.closed(x, y) for x, y in RANGE])
def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df
z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the dataframe from different intervals, so the number of rows appended varies each time.
I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to do it for a bunch of lines.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
If you want to use append and add several elements at once, you can create a second DataFrame and simply append it to the first one. It looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, which specializes in classical and full interval arithmetic, is used here. To define an interval or intervals, use the Interval function, where the first argument is the left endpoint and the second is the right endpoint of the intervals.
The ignore_index parameter lets the indexing continue from the first table.
In case you want to add one line at a time, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)

print(df.head(10))
I purposely did it with a loop to show that you can get by without creating a second table if you only want to add a few rows.
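One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same steps can be written with pd.concat (a sketch reusing the dataframes from above):
# Equivalent of df.append(df2, ignore_index=True) on pandas >= 2.0
df = pd.concat([df, df2], ignore_index=True)
print(df.head(10))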
I have N Dataframes, named data1,data2...etc
Each dataframe has two columns, 'X' and 'Y'. The length of each dataframe is not the same.
I need a new dataframe consisting of the sum of the 'X' columns.
I just tried something like:
dataframesum = pd.DataFrame(0, index=np.arange(Some_number), columns=['X'])

for i in range(N):
    dataframesum.add(globals()['Data%s' % i]['X'], fill_value=0)
but it doesn't work (I'm not sure what the value of Some_number should be) and I am getting the following error:
NotImplementedError: fill_value 0 not supported
You should use a dictionary to store an arbitrary number of variables.
So let's assume you have dataframes stored in dfs = {1: df1, 2: df2, 3: df3...}.
You can then concatenate them via pd.concat:
df_concat = pd.concat(list(dfs.values()))
Finally, you can sum columns via pd.DataFrame.sum:
sums = df_concat.sum()
To take advantage of vectorised operations, you should avoid a manual for loop. In addition, use of globals() is poor practice, and can be avoided by using dict or list to store your dataframes.
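Put together, a minimal sketch (the two dataframes here are made-up stand-ins for data1, data2, ...):
import pandas as pd

dfs = {
    1: pd.DataFrame({'X': [1, 2, 3], 'Y': [7, 8, 9]}),
    2: pd.DataFrame({'X': [10, 20], 'Y': [70, 80]}),
}

df_concat = pd.concat(list(dfs.values()))
sums = df_concat.sum()           # per-column totals across all frames
x_total = df_concat['X'].sum()   # just the 'X' column: 36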
When calling a function using groupby + apply, I want to go from a DataFrame to a Series groupby object, apply a function to each group that takes a Series as input and returns a Series as output, and then assign the output from the groupby + apply call as a field in the DataFrame.
The default behavior is to have the output from groupby + apply indexed by the grouping fields, which prevents me from assigning it back to the DataFrame cleanly. I'd prefer to have the function I call with apply take a Series as input and return a Series as output; I think it's a bit cleaner than DataFrame to DataFrame. (This isn't the best way of getting to the result for this example; the real application is pretty different.)
import pandas as pd
df = pd.DataFrame({
    'A': [999, 999, 111, 111],
    'B': [1, 2, 3, 4],
    'C': [1, 3, 1, 3]
})
def less_than_two(series):
    # Intended for series of length 1 in this case,
    # but not intended for many-to-one generally
    return series.iloc[0] < 2
output = df.groupby(['A', 'B'])['C'].apply(less_than_two)
I want the index on output to be the same as df; otherwise I can't assign it to df (cleanly):
df['Less_Than_Two'] = output
Something like output.index = df.index seems too ugly, and using the group_keys argument doesn't seem to work:
output = df.groupby(['A', 'B'], group_keys = False)['C'].apply(less_than_two)
df['Less_Than_Two'] = output
transform returns the results with the original index, just as you've asked for. It will broadcast the same result across all elements of a group. One caveat: the dtype may be inferred to be something else, so you may have to cast it yourself.
In this case, in order to add another column, I'd use assign
df.assign(
    Less_Than_Two=df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool))
A B C Less_Than_Two
0 999 1 1 True
1 999 2 3 False
2 111 3 1 True
3 111 4 3 False
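Equivalently, since transform preserves the original index, the result can be assigned directly as a new column:
df['Less_Than_Two'] = df.groupby(['A', 'B'])['C'].transform(less_than_two).astype(bool)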
Assuming your groupby is necessary (and the resulting groupby object would have fewer rows than your DataFrame -- which isn't the case with the example data), assigning the resulting Series to a new column will produce NaN values, since the index of output is shorter than the index of df.
Instead, based on the example data, the simplest approach is to merge output -- as a DataFrame -- back onto df, like so:
output = df.groupby(['A', 'B'])['C'].agg(less_than_two).reset_index()  # reset_index restores 'A' and 'B' from the index to columns
output.columns = ['A', 'B', 'Less_Than_Two']  # rename the target column prior to merging
df.merge(output, how='left', on=['A', 'B'])  # supports a many-to-one relationship between combinations of 'A' & 'B' and 'Less_Than_Two',
                                             # and thus properly maps aggregated values back to the unaggregated rows
Also, note that you're better off using underscores than dots in column names; unlike in R, the dot in Python is the attribute-access operator, so dots in names block attribute-style access and create confusion.
This question already has an answer here: Grouping a dataframe by X columns (1 answer). Closed 5 years ago.
I have a DataFrame with 40 columns (columns 0 through 39) and I want to group them four at a time:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.binomial(1, 0.2, (100, 40)))
new_df["0-3"] = df[0] + df[1] + df[2] + df[3]
new_df["4-7"] = df[4] + df[5] + df[6] + df[7]
...
new_df["36-39"] = df[36] + df[37] + df[38] + df[39]
Can I do this in a single statement (or in a better way than summing them separately)? The column names in the new DataFrame are not important.
You could select out the columns and sum on the row axis, like this.
df['0-3'] = df.loc[:, 0:3].sum(axis=1)
A couple of things to note:
Summing like this will ignore missing data, while df[0] + df[1] ... propagates it. Pass skipna=False if you want that behavior.
There isn't necessarily any performance benefit; it may actually be a little slower.
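To build all ten groups this way, one possible sketch (the new column labels are only illustrative):
new_df = pd.DataFrame({
    f'{i}-{i + 3}': df.loc[:, i:i + 3].sum(axis=1)
    for i in range(0, 40, 4)
})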
Here's another way to do it:
new_df = df.transpose()
new_df['Group'] = new_df.index // 4
new_df = new_df.groupby('Group').sum().transpose()
Note that the grouping key must be computed with integer (floor) division, //, so that rows 0-3 map to group 0, 4-7 to group 1, and so on; ordinary / would produce fractional group keys.
I don't know if it is the best way to go but I ended up using MultiIndex:
df.columns = pd.MultiIndex.from_product((range(10), range(4)))
new_df = df.groupby(level=0, axis=1).sum()
Update: Probably because of the index, this was faster than the alternatives. The same can be done with df.groupby(df.columns // 4, axis=1).sum(), which is faster if you take the time for constructing the index into account. However, the index change is a one-time operation, and I update the df and take the sum thousands of times, so using the MultiIndex was faster for me.
Consider a list comprehension:
df = # your data
df_slices = [df.iloc[:, 4 * x:4 * x + 4] for x in range(10)]
Or, more generally:
df_slices = [df.iloc[:, 4 * x:4 * x + 4] for x in range(len(df.columns) // 4)]
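If the goal is the sums from the question, each slice can then be reduced and the pieces concatenated back together, for example:
import pandas as pd

new_df = pd.concat([chunk.sum(axis=1) for chunk in df_slices], axis=1)
new_df.columns = [f'{4 * i}-{4 * i + 3}' for i in range(len(df_slices))]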