I have a few functions that make new columns in a pandas dataframe, as a function of existing columns in the dataframe. I have two different scenarios that occur here: (1) the dataframe is NOT MultiIndex and has a set of columns, say [a, b], and (2) the dataframe is MultiIndex and now has the same set of column headers repeated N times, say [(a,1), (b,1), (a,2), (b,2), ..., (a,N), (b,N)].
I've been making the aforementioned functions in the style shown below:
def f(df):
    if isinstance(df.columns, pd.MultiIndex):
        for s in df['a'].columns:
            df[('c', s)] = someFunction(df[('a', s)], df[('b', s)])
    else:
        df['c'] = someFunction(df['a'], df['b'])
Is there another way to do this, without having these if-multiindex/else statements everywhere and duplicating the someFunction code? I'd prefer NOT to split the multi-indexed frame into N smaller dataframes (I often need to filter data or do things and keep the rows consistent across all the 1, 2, ..., N frames, and keeping them together in one frame seems to be the best way to do that).
You may still have to test whether columns is a MultiIndex, but this should be cleaner and more efficient. Caveat: this will not work if your function uses summary statistics on the column. For example, if someFunction divides by the average of column 'a'.
Solution
def someFunction(a, b):
    return a + b

def f(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
    df['c'] = someFunction(df['a'], df['b'])
    if ismi:
        df = df.unstack()
    return df
Setup
import pandas as pd
import numpy as np
setup_tuples = []
for c in ['a', 'b']:
    for i in ['one', 'two', 'three']:
        setup_tuples.append((c, i))
columns = pd.MultiIndex.from_tuples(setup_tuples)
rand_array = np.random.rand(10, len(setup_tuples))
df = pd.DataFrame(rand_array, columns=columns)
df looks like this
a b
one two three one two three
0 0.282834 0.490313 0.201300 0.140157 0.467710 0.352555
1 0.838527 0.707131 0.763369 0.265170 0.452397 0.968125
2 0.822786 0.785226 0.434637 0.146397 0.056220 0.003197
3 0.314795 0.414096 0.230474 0.595133 0.060608 0.900934
4 0.334733 0.118689 0.054299 0.237786 0.658538 0.057256
5 0.993753 0.552942 0.665615 0.336948 0.788817 0.320329
6 0.310809 0.199921 0.158675 0.059406 0.801491 0.134779
7 0.971043 0.183953 0.723950 0.909778 0.103679 0.695661
8 0.755384 0.728327 0.029720 0.408389 0.808295 0.677195
9 0.276158 0.978232 0.623972 0.897015 0.253178 0.093772
I constructed df to have MultiIndex columns. What I'd do is use the .stack() method to push the second level of the column index to be the second level of the row index.
df.stack() looks like this
a b
0 one 0.282834 0.140157
three 0.201300 0.352555
two 0.490313 0.467710
1 one 0.838527 0.265170
three 0.763369 0.968125
two 0.707131 0.452397
2 one 0.822786 0.146397
three 0.434637 0.003197
two 0.785226 0.056220
3 one 0.314795 0.595133
three 0.230474 0.900934
two 0.414096 0.060608
4 one 0.334733 0.237786
three 0.054299 0.057256
two 0.118689 0.658538
5 one 0.993753 0.336948
three 0.665615 0.320329
two 0.552942 0.788817
6 one 0.310809 0.059406
three 0.158675 0.134779
two 0.199921 0.801491
7 one 0.971043 0.909778
three 0.723950 0.695661
two 0.183953 0.103679
8 one 0.755384 0.408389
three 0.029720 0.677195
two 0.728327 0.808295
9 one 0.276158 0.897015
three 0.623972 0.093772
two 0.978232 0.253178
Now you can operate on df.stack() as if the columns were not a MultiIndex
Demonstration
print(f(df))
will give you what you want
a b c \
one three two one three two one
0 0.282834 0.201300 0.490313 0.140157 0.352555 0.467710 0.422991
1 0.838527 0.763369 0.707131 0.265170 0.968125 0.452397 1.103697
2 0.822786 0.434637 0.785226 0.146397 0.003197 0.056220 0.969183
3 0.314795 0.230474 0.414096 0.595133 0.900934 0.060608 0.909928
4 0.334733 0.054299 0.118689 0.237786 0.057256 0.658538 0.572519
5 0.993753 0.665615 0.552942 0.336948 0.320329 0.788817 1.330701
6 0.310809 0.158675 0.199921 0.059406 0.134779 0.801491 0.370215
7 0.971043 0.723950 0.183953 0.909778 0.695661 0.103679 1.880821
8 0.755384 0.029720 0.728327 0.408389 0.677195 0.808295 1.163773
9 0.276158 0.623972 0.978232 0.897015 0.093772 0.253178 1.173173
three two
0 0.553855 0.958023
1 1.731494 1.159528
2 0.437834 0.841446
3 1.131408 0.474704
4 0.111555 0.777227
5 0.985944 1.341759
6 0.293454 1.001412
7 1.419611 0.287632
8 0.706915 1.536622
9 0.717744 1.231410
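To address the caveat mentioned at the top of this answer: if someFunction genuinely needs a summary statistic of a column (say, dividing by the mean of 'a'), that statistic has to be computed per original sub-frame rather than over the whole stacked column. One way to do that, sketched below under the same setup (someFunctionWithStats and f_with_stats are illustrative names, not part of the original code), is to group the stacked frame by the second row index level, which holds the old column sub-keys:
def someFunctionWithStats(a, b, a_mean):
    return (a + b) / a_mean

def f_with_stats(df):
    df = df.copy()
    ismi = isinstance(df.columns, pd.MultiIndex)
    if ismi:
        df = df.stack()
        # mean of 'a' within each sub-frame ('one', 'two', 'three'), aligned to the stacked rows
        a_mean = df.groupby(level=1)['a'].transform('mean')
    else:
        a_mean = df['a'].mean()
    df['c'] = someFunctionWithStats(df['a'], df['b'], a_mean)
    if ismi:
        df = df.unstack()
    return df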
Related
I am unable to properly explain my requirement, but I can show the expected result.
I have a dataframe that looks like so:
Series1  Series2
1370307  1370306
927092   927091
925392   925391
925390   925389
2344089  2344088
1827855  1827854
1715793  1715792
2356467  2356466
1463264  1463263
1712684  1712683
actual dataframe size: 902811 rows × 2 columns
Then I build another dataframe of the unique values of Series2 and their counts, which I've done using value_counts:
df2 = df['Series2'].value_counts().rename_axis('Series2').to_frame('counts').reset_index()
Then I need a list of matching Series1 values for each Series2 value:
The expected result is:
Series2  counts  Series1_List
2543113       6  [2543114, 2547568, 2559207, 2563778, 2564330, 2675803]
2557212       6  [2557213, 2557301, 2559192, 2576080, 2675693, 2712790]
2432032       5  [2432033, 2444169, 2490928, 2491392, 2528056]
2559269       5  [2559270, 2576222, 2588034, 2677710, 2713207]
2439554       5  [2439555, 2441882, 2442272, 2443590, 2443983]
2335180       5  [2335181, 2398282, 2527060, 2527321, 2565487]
2494111       4  [2494112, 2495321, 2526026, 2528492]
2559195       4  [2559196, 2570172, 2634537, 2675718]
2408775       4  [2408776, 2409117, 2563765, 2564320]
2408773       4  [2408774, 2409116, 2563764, 2564319]
I achieve this (although only for a subset of 50 rows) using the following code:
df2.loc[:50,'Series1_List'] = df2.loc[:50,'Series2'].apply(lambda x: df[df['Series2']==x]['Series1'].tolist())
If I do this for the whole dataframe, it doesn't complete even in 20 minutes.
So the question is whether there is a faster and efficient method of achieving the result?
IIUC, use:
df2 = (df.groupby('Series2', as_index=False)
         .agg(counts=('Series1', 'count'), Series1_List=('Series1', list)))
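If you also want the rows ordered by the counts, as in the expected result shown above, an optional follow-up (assuming the column names created by the .agg call) is:
# Optional: sort by counts, largest first, to mirror the expected output.
df2 = df2.sort_values('counts', ascending=False, ignore_index=True)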
I have many different tables that all have different column names, and each refers to an outcome, like glucose, insulin, leptin, etc. (keep in mind that the tables are all gigantic and messy, with tons of other columns in them as well).
I am trying to generate a report that starts empty but then adds columns based on functions applied to each of the glucose, insulin, and leptin tables.
I have included a very simple example; ignore that the function makes little sense. The code below works, but instead of copy-pasting the final_report[...] = ... line over and over again, I would like to just run the find_result function over each of glucose, insulin, and leptin and add the "glucose_result", "insulin_result", and "leptin_result" columns to final_report in one or a few lines.
Thanks in advance.
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
glucose = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
insulin = pd.DataFrame({'id': ids,
                        'timepoint': timepoint,
                        'outcome': outcome})
leptin = pd.DataFrame({'id': ids,
                       'timepoint': timepoint,
                       'outcome': outcome})
ids = [1,2,3,4]
start = [1,1,1,1]
end = [6,6,6,6]
final_report = pd.DataFrame({'id': ids,
                             'start': start,
                             'end': end})
def find_result(subject, start, end, df):
    df = df.loc[(df["id"] == subject) & (df["timepoint"] >= start) & (df["timepoint"] <= end)].sort_values(by="timepoint")
    return df["timepoint"].nunique()
final_report['glucose_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], glucose), axis=1)
final_report['insulin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], insulin), axis=1)
final_report['leptin_result'] = final_report.apply(lambda x: find_result(x['id'], x['start'], x['end'], leptin), axis=1)
If you have to use this code structure, you can create a simple dictionary with your dataframes and their names and loop through them, creating new columns with programmatically assigned names:
input_dfs = {"glucose": glucose, "insulin": insulin, "leptin": leptin}
for name, df in input_dfs.items():
    final_report[f"{name}_result"] = final_report.apply(
        lambda x: find_result(x['id'], x['start'], x['end'], df),
        axis=1
    )
Output:
id start end glucose_result insulin_result leptin_result
0 1 1 6 6 6 6
1 2 1 6 6 6 6
2 3 1 6 3 3 3
3 4 1 6 6 6 6
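A small variant of the same idea, sketched under the same assumptions (input_dfs and find_result as defined above): compute all the result columns first and attach them to final_report in a single assign call.
# Build every "<name>_result" column with the same apply, then add them all at once.
results = {
    f"{name}_result": final_report.apply(
        lambda row, d=df: find_result(row['id'], row['start'], row['end'], d),
        axis=1,
    )
    for name, df in input_dfs.items()
}
final_report = final_report.assign(**results)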
I have two dataframes (A and B). I want to compare each string in A and find either an exact match or containment in a string in B, then count the number of times each entry of A was matched or contained in B.
Dataframe A
0 "4012, 4065, 4682"
1 "4712, 2339, 5652, 10007"
2 "4618, 8987"
3 "7447, 4615, 4012"
4 "6515"
5 "4065, 2339, 4012"
Dataframe B
0 "6515, 4012, 4618, 8987" <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1 "4065, 5116, 2339, 8757, 4012" <- matches (DF A, Index 5) (4065, 2339, 4012)
2 "1101"
3 "6515" <- matches (DF A, Index 4) (6515)
4 "4012, 4615, 7447" <- matches (DF A, Index 3) (7447, 4615, 4012)
5 "7447, 6515, 4012, 4615" <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)
Desired Output:
Itemset Count
2 4618, 8987 1
3 7447, 4615, 4012 2
4 6515 3
5 4065, 2339, 4012 1
Basically, I want to count when there is a direct match of A in B (either in order or not) or if A is partially contained in B (in order or not). My goal is to count how many times A is being validated by B. These are all strings by the way.
EDIT Need for speed edition:
This is a redo question from my previous post:
Compare two dataframe columns for matching strings or are substrings then count in pandas
I have millions of rows in both dfA and dfB to make these comparisons against.
In my previous post, the following code got the job done:
import pandas as pd
dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012"],
                   columns=['values'])
dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                   columns=['values'])
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]
However, with the total number of rows to check, I am experiencing a performance issue and need another way to compute the frequencies/counts. It seems like NumPy is needed in this case, but the snippet below is about the extent of my NumPy knowledge, as I work primarily in pandas. Does anyone have suggestions to make this faster?
dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()
Give this a try. Your algorithm is O(N²K): the square of the row count times the words per line. The version below should improve that to O(NK):
from collections import defaultdict
from functools import reduce

# Build an inverted index: each item maps to the set of dfB row numbers containing it.
d = defaultdict(set)
for i, t in enumerate(dfB['values']):
    for s in t.split(', '):
        d[s].add(i)

# An itemset in dfA matches every dfB row present in ALL of its items' sets,
# so the count is the size of the intersection of those sets.
dfA['count'] = dfA['values'].apply(
    lambda x: len(reduce(lambda a, b: a.intersection(b), [d[s] for s in x.split(', ')])))
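As a quick sanity check against the sample frames from the question (assuming the dfA/dfB definitions above), the counts this produces should line up with the desired output: index 2 -> 1, index 3 -> 2, index 4 -> 3, index 5 -> 1, and 0 for the two rows that have no match.
# Rows 0 and 1 contain items that never appear in dfB, so their count is 0.
print(dfA[['values', 'count']])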
I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of files (the 25 file and the 26 file). Each pair has a speed threshold (speed_0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0), which is labeled in the file name. The files all have the same data structure:
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, a simple concatenation is enough to join the two datasets. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way to do it. Is there a good way to combine each pair automatically and save the results at the same time?
The idea is to create a DataFrame from the list of files and add two new columns by splitting on the first _ with Series.str.split:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then loop over the groups formed by the names column, iterate over each group's rows with DataFrame.itertuples, create a new DataFrame with read_csv for each file (adding a new column filled with the value from g if necessary), append it to a list, concat the list, and finally save to a new file named after the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print (dfbig)
    dfbig.to_csv(g['names'].iat[0])
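The code above assumes a files list already exists. If you need to build it from disk, something like the following could work; the glob pattern here is only an assumption based on the example file names, so adjust it to your folder:
import glob

# Collect every CSV that follows the <id>_<date>_speed_<threshold>.csv naming shown above.
files = sorted(glob.glob('*_speed_*.csv'))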
I've been looking around for ways to select columns through the Python documentation and the forums, but every example on indexing columns is too simplistic.
Suppose I have a 10 x 10 dataframe
from pandas import DataFrame
from numpy.random import randn

df = DataFrame(randn(10, 10), index=range(0, 10),
               columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
So far, all the documentations gives is just a simple example of indexing like
subset = df.loc[:,'A':'C']
or
subset = df.loc[:,'C':]
But I get an error when I try to index multiple, non-sequential columns, like this:
subset = df.loc[:,('A':'C', 'E')]
How would I index in pandas if I wanted to select columns A to C, E, and G to I? It appears that this logic will not work:
subset = df.loc[:,('A':'C', 'E', 'G':'I')]
I feel that the solution is pretty simple, but I can't get around this error. Thanks!
Name- or Label-Based (using regular expression syntax)
df.filter(regex='[A-CEG-I]') # does NOT depend on the column order
Note that any regular expression is allowed here, so this approach can be very general. E.g. if you wanted all columns starting with a capital or lowercase "A" you could use: df.filter(regex='^[Aa]')
Location-Based (depends on column order)
df[ list(df.loc[:,'A':'C']) + ['E'] + list(df.loc[:,'G':'I']) ]
Note that unlike the label-based method, this only works if your columns are alphabetically sorted. This is not necessarily a problem, however. For example, if your columns go ['A','C','B'], then you could replace 'A':'C' above with 'A':'B'.
The Long Way
And for completeness, you always have the option shown by @Magdalena of simply listing each column individually, although it can be much more verbose as the number of columns increases:
df[['A','B','C','E','G','H','I']] # does NOT depend on the column order
Results for any of the above methods
A B C E G H I
0 -0.814688 -1.060864 -0.008088 2.697203 -0.763874 1.793213 -0.019520
1 0.549824 0.269340 0.405570 -0.406695 -0.536304 -1.231051 0.058018
2 0.879230 -0.666814 1.305835 0.167621 -1.100355 0.391133 0.317467
Just pick the columns you want directly....
df[['A','E','I','C']]
How do I select multiple columns by labels in pandas?
Multiple label-based range slicing is not easily supported with pandas, but position-based slicing is, so let's try that instead:
import numpy as np

loc = df.columns.get_loc
df.iloc[:, np.r_[loc('A'):loc('C')+1, loc('E'), loc('G'):loc('I')+1]]
A B C E G H I
0 -1.666330 0.321260 -1.768185 -0.034774 0.023294 0.533451 -0.241990
1 0.911498 3.408758 0.419618 -0.462590 0.739092 1.103940 0.116119
2 1.243001 -0.867370 1.058194 0.314196 0.887469 0.471137 -1.361059
3 -0.525165 0.676371 0.325831 -1.152202 0.606079 1.002880 2.032663
4 0.706609 -0.424726 0.308808 1.994626 0.626522 -0.033057 1.725315
5 0.879802 -1.961398 0.131694 -0.931951 -0.242822 -1.056038 0.550346
6 0.199072 0.969283 0.347008 -2.611489 0.282920 -0.334618 0.243583
7 1.234059 1.000687 0.863572 0.412544 0.569687 -0.684413 -0.357968
8 -0.299185 0.566009 -0.859453 -0.564557 -0.562524 0.233489 -0.039145
9 0.937637 -2.171174 -1.940916 -1.553634 0.619965 -0.664284 -0.151388
Note that the +1 is added because when using iloc the rightmost index is exclusive.
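To make the position array concrete: assuming the 10-column df from the question (columns 'A' through 'J' at positions 0 through 9) and the loc shortcut defined above, the np.r_ expression works out to the positions of A, B, C, E, G, H and I:
print(np.r_[loc('A'):loc('C') + 1, loc('E'), loc('G'):loc('I') + 1])
# [0 1 2 4 6 7 8]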
Comments on Other Solutions
filter is a nice and simple method for OP's headers, but this might not generalise well to arbitrary column names.
The "location-based" solution with loc is a little closer to the ideal, but you cannot avoid creating intermediate DataFrames (that are eventually thrown out and garbage collected) to compute the final result range -- something that we would ideally like to avoid.
Lastly, "pick your columns directly" is good advice as long as you have a manageably small number of columns to pick. It will, however, not be applicable in some cases where ranges span dozens (or possibly hundreds) of columns.
One option for selecting multiple slices is with select_columns from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
from numpy import random
random.seed(3)
df = pd.DataFrame(
    random.randn(10, 10),
    index=range(0, 10),
    columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']
)
df.select_columns(slice('A', 'C'), 'E', slice('G', 'I'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
The caveat here is that you have to explicitly use python's builtin slice.
Just like the excellent accepted answer, you can use regular expressions; again, they have to be passed explicitly (using Python's re):
import re
df.select_columns(re.compile('[A-CEG-I]'))
A B C E G H I
0 1.788628 0.436510 0.096497 -0.277388 -0.082741 -0.627001 -0.043818
1 -1.313865 0.884622 0.881318 0.050034 -0.545360 -1.546477 0.982367
2 -1.185047 -0.205650 1.486148 -1.023785 0.625245 -0.160513 -0.768836
3 0.745056 1.976111 -1.244123 -0.803766 -0.923792 -1.023876 1.123978
4 -1.623285 0.646675 -0.356271 -0.596650 -0.873882 0.029714 -2.248258
5 1.013183 0.852798 1.108187 1.487543 0.845833 -1.860890 -0.602885
6 1.048148 1.333738 -0.197415 -0.674728 0.152946 -1.064195 0.437947
7 -1.024931 0.899338 -0.154507 0.483788 0.643163 0.249087 -1.395764
8 -1.370669 0.238563 0.614077 0.145063 -0.024104 -0.888657 -2.915738
9 -0.591079 -0.516417 -0.959996 -0.574708 0.679072 -0.855437 -0.300206
You can go crazy and combine different selection options within the select_columns method.
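For example, with the same df and imports as above, label slices and a compiled regex can be mixed in a single call (a small illustrative combination, not from the original answer):
# Combine label slices with a regex in one selection.
df.select_columns(slice('A', 'C'), re.compile('E'), slice('G', 'I'))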