(update: added desired data frame)
Let me start by saying that I'm reasonably confident that I found a solution to this problem several years ago, but I have not been able to re-find that solution.
Questions that address similar problems, but don't solve my particular problem include:
Efficiently select rows that match one of several values in Pandas DataFrame
Efficiently adding calculated rows based on index values to a pandas DataFrame
Compare Python Pandas DataFrames for matching rows
The Question
Let's say I have a data frame with many columns that I am working on:
big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
big
match_1
0 11
1 12
2 51
3 52
I also have a smaller data frame that, in theory, maps some conditional statement to a desired value:
# A smaller dataframe that we use to map values into the larger dataframe
small = pd.DataFrame({'is_even': [True, False], 'score': [10, 200]})
small
is_even score
0 True 10
1 False 200
The goal here is to use a conditional statement to match each row in big to a single row in small. Assume that small is constructed such that there will always be one, and only one, match for each row in big. (If multiple rows in small happen to match, just pick the first one.)
The desired output would be something like:
desired = pd.DataFrame({'match_1': [11, 12, 51, 52], 'metric': [200, 10, 200, 10]})
desired
match_1 metric
0 11 200
1 12 10
2 51 200
3 52 10
I'm pretty sure that the syntax would look similar to:
big['score'] = small.loc[small['is_even'] == ((big['match_1'] % 2) == 0), 'score']
This won't work because small['is_even'] is a Series of length 2, while ((big['match_1'] % 2) == 0) is a Series of length 4. What I'm looking to do is, for each row in big, find the one row in small that matches based on a conditional.
If I can get a sequence that contains the correct row in small that matches each row in big, then I could do something like:
big['score'] = small.loc[matching_rows, 'score']
The question I have is: how do I generate the sequence matching_rows?
Things that (I think) aren't quite what I want:
If the columns in big and small matched simply on constant values, this would be a straightforward use of either big.merge() or big.groupby(); however, in my case, the mapping can be an arbitrarily complex boolean conditional, for example:
(big['val1'] > small['threshold']) & (big['val2'] == small['val2']) & (big['val3'] > small['min_val']) & (big['val3'] < small['max_val'])
Solutions that rely on isin(), any(), etc., don't work, because the conditional check can be arbitrarily complex.
I could certainly create a function to apply() to the bigger DataFrame, but again, I'm pretty sure there was a simpler solution.
The answer may come down to 'calculate some intermediate columns until you can do a simple merge' or 'just use apply()', but I could swear that there was a way to do what I've described above.
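For the toy example above, that intermediate-column approach would look something like this (a sketch using the example columns; result is just a name I'm introducing here):
# compute the matching key as an intermediate column, then do an ordinary merge
big['is_even'] = big['match_1'] % 2 == 0
result = big.merge(small, on='is_even', how='left').drop('is_even', axis=1)
It works for the toy case, but an equality merge like this can't express the range conditions in the more complex example above.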
One approach is to use a merge in which left_on is not a column but a vector of keys. It's made simpler by setting the index of small to be is_even:
>>> small.set_index('is_even', inplace=True)
>>> condition = big['match_1'] % 2 == 0
>>> pd.merge(big, small, left_on=condition, right_index=True, how='left')
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10
You can index small with True and False and just do a straight .ix lookup on it. Not sure it's all that much tidier than the intermediate column/merge:
In [127]: big = pd.DataFrame({'match_1': [11, 12, 51, 52]})
In [128]: small = pd.DataFrame({'score': [10, 200]}, index=[True, False])
In [129]: big['score'] = small.ix[pd.Index(list(big.match_1 % 2 == 0))].score.values
In [130]: big
Out[130]:
match_1 score
0 11 200
1 12 10
2 51 200
3 52 10
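Note that .ix has since been deprecated and removed from pandas; an equivalent label lookup on the same True/False-indexed small (a sketch) can be done with Series.map:
big['score'] = (big['match_1'] % 2 == 0).map(small['score'])
Here the boolean condition is used as the label to look up in small's index, which gives the same score column as above.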
I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline building (in this example, upgrade_name == 'b' is the baseline case) and each upgrade, for each building. I have an arbitrary number of building_ids and an arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this to the full dataset and am stuck. I will have tens of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
df = pd.DataFrame({'building_id': [1,2,1,2,1], 'upgrade_name': ['a', 'a', 'b', 'b', 'c'], 'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
building_id upgrade_name energy_use
0 1 a 100.4
1 2 a 150.8
2 1 b 145.1
3 2 b 136.7
4 1 c 120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
upgrade_name energy_use diff
0 a 100.4 -44.7
2 b 145.1 0.0
4 c 120.3 -24.8
How do I write this for an arbitrary number of building_ids, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
building_id upgrade_name energy_use ideal
0 1 a 100.4 -44.7
1 2 a 150.8 14.1
2 1 b 145.1 0.0
3 2 b 136.7 0.0
4 1 c 120.3 -24.8
Define a function that computes the difference in energy use for a group of rows (those for the current building) as follows:
def euDiff(grp):
    euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
    return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
              .reset_index(level=0, drop=True)
The result is just as you expected.
Thanks for sharing that example data! It made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains the baseline energy use for each building.
2. Apply a lambda function to your dataframe to subtract each energy use value from the baseline value associated with that building.
# set the index to building_id, select energy_use, and turn it into a dictionary
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply a lambda to the dataframe; axis=1 gives the lambda access to whole rows
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']], axis=1)
You could also write a function to do this. You don't necessarily need the dictionary, either; it just makes things easier. If you're curious about these alternative solutions, let me know and I can add them for you.
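For example, one vectorized variant without apply (a sketch; baseline is a name introduced here) maps each building_id to its baseline row and subtracts:
# baseline energy use per building, taken from the 'b' rows
baseline = df.loc[df['upgrade_name'] == 'b'].set_index('building_id')['energy_use']
# align by building_id and subtract row-wise
df['diff'] = df['energy_use'] - df['building_id'].map(baseline)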
Instead of relying on queries from SQL, I am trying to find ways that I can use pandas to do the same work but in a more time-efficient manner.
The problem I am trying to solve is best illustrated through the following simplified example:
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],
                   'value': [12, 8, 31, 14, 45, 12]})
Based on the data, I would like to change the values of the "value" column to 32, 15, and 14 where the id is 1, 2, and 4, respectively.
I managed to do this for one value with the following code:
df.loc[df['id'] == 1, 'value'] = 32
However, the problem is that the above approach is very time-inefficient when repeated for each id. So I wonder if anyone could help come up with a solution where I can update 20-30 values as quickly as possible in a pandas DataFrame.
Thanks in advance
Use isin:
df.loc[df['id'].isin([1, 2, 4]), 'value'] = [32, 15, 14]
df
id value
0 1 32
1 2 15
2 3 31
3 4 14
4 5 45
5 6 12
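Note that this relies on the matching rows appearing in the same order as the replacement list. If that ordering isn't guaranteed, a mapping-based variant (a sketch; mapping and mask are names introduced here) avoids the assumption:
mapping = {1: 32, 2: 15, 4: 14}
mask = df['id'].isin(mapping.keys())
df.loc[mask, 'value'] = df.loc[mask, 'id'].map(mapping)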
Problem
I have a dataframe df whose index is not monotonically increasing across its 4 partitions, meaning every partition is indexed with [0..N].
I need to select rows based on an index list [0..M] where M > N.
Using loc would yield inconsistent output, as there are multiple rows indexed by 0 (see example).
In other words, I need to overcome the difference between Dask's and Pandas' reset_index, as that would easily solve my issue.
Example
print df.loc[0].compute() results in:
Unnamed: 0 best_answer thread_id ty_avc ty_ber ty_cjr ty_cpc \
0 0 1 1 1 0.052174 9 18
0 0 1 5284 12 0.039663 34 60
0 0 1 18132 2 0.042254 7 20
0 0 1 44211 4 0.025000 5 5
Possible solutions
repartition df to a single partition and reset_index; I don't like this, as it won't fit in memory;
add a column with [0..M] indexes and use set_index; this is discouraged in the performance tips;
the solution to this question solves a different problem, as that df has unique indexes;
split the index list into npartitions parts, apply an offset computation, and use map_partitions.
I cannot think of other solutions... the last one is probably the most efficient, although I'm not sure whether it's actually feasible.
Generally Dask.dataframe does not track the lengths of the pandas dataframes that make up the dask.dataframe. I suspect that your option 4 is best. You might also consider using dask.delayed.
See also http://dask.pydata.org/en/latest/delayed-collections.html
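A rough sketch of option 4 via dask.delayed (untested; lengths, shift_index, and df_global are names introduced here, not part of the question): compute per-partition lengths, turn them into offsets, and rebuild the dataframe with a globally increasing index and known divisions so that .loc behaves:
import numpy as np
import dask.dataframe as dd
from dask import delayed

# one small compute to learn the length of each partition
lengths = list(df.map_partitions(len).compute())
starts = np.cumsum([0] + lengths)  # starts[i] = global index of partition i's first row

def shift_index(part, offset):
    # give this partition its slice of the global [0..M) index
    part = part.copy()
    part.index = part.index + offset
    return part

parts = [delayed(shift_index)(p, o) for p, o in zip(df.to_delayed(), starts[:-1])]
divisions = [int(s) for s in starts[:-1]] + [int(starts[-1]) - 1]
df_global = dd.from_delayed(parts, divisions=divisions)
# df_global.loc[0] should now select a single row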
I'm converting a financial spreadsheet into Pandas, and this is a frequent challenge that comes up.
In Excel, suppose you have a calculation where, for columns 0:n, the value depends on the previous column [shown in the format Cell(row, column)]: Cell(1,n) = (Cell(1,n-1)^2)*5.
Obviously, for n=2, you could create a calculated column in Pandas (note that exponentiation is ** in Python, not ^):
df[2] = (df[1] ** 2) * 5
But for a chain of, say, 30 columns, that doesn't work. So currently, I am using a for loop:
total_columns_needed = list(range(1, 100))
for i in total_columns_needed:
    df[i] = (df[i - 1] ** 2) * 5
That loop works fine, but I am trying to see how I could use map and apply to make this look cleaner. From what I've read, apply is a loop underneath, so I'm not sure whether I will gain any speed from doing this. But it could shrink the code by a lot.
The problem that I've had with:
df.apply()
is that 1) there could be other columns not involved in the calculation (which arguably shouldn't be there if the data is properly normalised), and 2) the columns don't exist yet. Part 2 could possibly be solved by creating the dataframe with all the needed columns, but I'm trying to avoid that for other reasons.
Any help in solving this greatly appreciated!
To automatically generate a bunch of columns, without a loop:
In [433]:
df = pd.DataFrame({'Val': [0,1,2,3,4]})
In [434]:
print df.Val.apply(lambda x: pd.Series(x+np.arange(0,25,5)))
0 1 2 3 4
0 0 5 10 15 20
1 1 6 11 16 21
2 2 7 12 17 22
3 3 8 13 18 23
4 4 9 14 19 24
numpy.arange(0,25,5) gives you array([ 0, 5, 10, 15, 20]). For each of the values in Val, we will add that value to array([ 0, 5, 10, 15, 20]), creating a new Series.
And finally, put the new Series back together into a new DataFrame.
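For instance, the generated Series can be joined back onto the original frame with concat (a sketch; new_cols and result are names introduced here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Val': [0, 1, 2, 3, 4]})
new_cols = df.Val.apply(lambda x: pd.Series(x + np.arange(0, 25, 5)))
result = pd.concat([df, new_cols], axis=1)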
I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate on all elements in a particular column, but only if a different column with the same index meets a certain criterion.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in int_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map, apply, or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2
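Another option (a sketch, arguably no tidier) is numpy.where, which builds the whole column in one expression:
import numpy as np
df['int_col'] = np.where(df['float_col'] > 5, df['int_col'] * 2, df['int_col'])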