I have two dataframes in pandas. DF "A" contains the start and end indexes of zone names. DF "B" contains the start and end indexes of subzones. The goal is to extract all subzones of all zones.
Example:
A:
start index | end index | zone name
-----------------------------------
1 | 10 | X
B:
start index | end index | subzone name
-----------------------------------
2 | 3 | Y
In the above example, Y is a subzone of X since its indexes fall within X's indexes.
The way I'm currently doing this is using iterrows to go through every row in A, and for every row (zone) I find the slice in B (subzone).
This solution is extremely slow in pandas since iterrows is not fast. How can I do this task without using iterrows in pandas?
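For reference, a minimal sketch of the setup and of the iterrows approach described above; the column names start_index, end_index, zone and subzone are assumptions, since the question only shows a schematic table:

import pandas as pd

# Toy versions of the two frames from the example above
A = pd.DataFrame({'start_index': [1], 'end_index': [10], 'zone': ['X']})
B = pd.DataFrame({'start_index': [2], 'end_index': [3], 'subzone': ['Y']})

# The slow approach: one boolean slice of B per row of A
results = []
for _, zone in A.iterrows():
    mask = (B['start_index'] >= zone['start_index']) & (B['end_index'] <= zone['end_index'])
    sub = B.loc[mask].copy()
    sub['zone'] = zone['zone']
    results.append(sub)

subzones = pd.concat(results, ignore_index=True)
print(subzones)  # Y listed as a subzone of X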
Grouping with dicts and Series is possible.
Grouping information may exist in a form other than an array. Let's consider another example DataFrame (since your DataFrames don't contain data, I'm making up my own: DF A = mapping, DF B = people) with values that have real-world interpretations:
import numpy as np
import pandas as pd

people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # add a few NA values
Now, suppose I have a group correspondence for the columns and want to sum
together the columns by group:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
# mapping is a dictionary playing the role of DF A (the zones)
You could construct an array from this dict to pass to groupby, but instead we can just pass the dict itself. (I'm sure you can convert a dictionary to a DataFrame and a DataFrame to a dictionary, so I'm skipping that step; otherwise you're welcome to ask in the comments.)
by_column = people.groupby(mapping, axis=1)
I'm using the sum() aggregation here, but you can use whatever aggregation you want. (If you want to combine subzones with their parent zones, you can do that by concatenation; that is out of scope here, otherwise I would have gone into detail.)
by_column.sum()
The same functionality holds for Series, which can be viewed as a fixed-size mapping:
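As a small sketch of the Series variant, building the Series directly from the mapping dict above:

map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).count()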
Note: using functions with arrays, dicts, or Series is not a problem as everything gets converted to arrays internally.
I want to group a pandas dataframe by column 'A' and get the last n elements from each group, but with an offset. For example, after grouping by column 'A', a certain value of 'A' has the values (1,2,3,4,5,6,7) in column 'B', and I want to take the last 10 elements excluding the most recent one or two. How can I do it?
I've tried to use tail(), df.groupby('A').tail(10), but that's not my case.
Input: 'A': [1,1,1,1,1,1,1,1,1], 'B': [1,2,3,4,5,6,7,8,9]. Output (last 3 excluding the most recent 2): 'A': [1], 'B': [5,6,7]
First of all, this is an unusual task, since all your "A" values are the same, so it is a bit odd to group by such a column.
This leads to 2 solutions that came to my mind...
1]
import pandas as pd

data = {'A': [1,2,3,4,5,6,7,8,9], 'B': [1,2,3,4,5,6,7,8,9]}
df_dict = pd.DataFrame.from_dict(data)
no_of_unwanted_values = 2
df_dict.groupby('A').agg(lambda a: a).head(-no_of_unwanted_values)  # .tail(1)
This solution works if the values you group by in column A are row-specific, so each value forms its own group. head(-x) selects all the values top down except the last x.
I think what you are looking for is the second solution:
2]
data = {'A': [1,2,1,3,1,2,1,2,3], 'B': [1,2,3,4,5,6,7,8,9]}
df_dict = pd.DataFrame.from_dict(data)
no_of_unwanted_values = 2
df_dict.groupby('A').sum().head(-no_of_unwanted_values)#.tail(1)
Here you have 3 values to group by, and then you apply some operation to those groups (in this case sum). Lastly you again select all but the last rows with head(-x). Optionally, if you would also like to drop some of the top rows of that result, you can append .tail() to the query and again specify the number of rows to retrieve. The last line could also be rewritten using len(df_dict) - no_of_unwanted_values (but in that case the number of unwanted values would have to be x + 1). You could apply the len(x) - 1 logic to selecting from lists as well, for example.
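As a hedged sketch, the same head/tail combination can be applied per group to reproduce the example from the question (last 3 values excluding the most recent 2), here via a groupby-apply variant rather than the aggregation above:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 1, 1, 1],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})

# per group: drop the 2 most recent rows, then take the last 3 of what remains
out = df.groupby('A')['B'].apply(lambda s: s.head(-2).tail(3))
print(out)  # 5, 6, 7 for the group A == 1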
PS.:
beware when using sort_values, for example:
data.sort_values(['col_1', 'col_2']).groupby(['col_3', 'col_2']).head(x)
Here head(x) corresponds to col_1 values; that is, if you want all but the last value and len(data.col_1.unique()) = 100, use head(99).
I have a data frame that has the following format:
d = {'id1': ['a', 'a', 'b', 'b',], 'id2': ['a', 'b', 'b', 'c'], 'score': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
  id1 id2 score
0   a   a     1
1   a   b     2
2   b   b     3
3   b   c     4
The data frame has over 1 billion rows. It represents pairwise distance scores between the objects in columns id1 and id2. I do not need all object-pair combinations: for each object in id1 (there are about 40k unique ids) I only want to keep the top 100 closest (smallest) distance scores.
The code I'm running to do this is the following:
df = df.groupby(['id1'])['score'].nsmallest(100)
The issue with this code is that I run into a memory error each time I try to run it:
MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64
I'm assuming it is because, in the background, pandas creates a new data frame for the result of the groupby while the existing data frame is still held in memory.
The reason I am only taking the top 100 for each id is to reduce the size of the data frame, but it seems that while doing so I am actually taking up more space.
Is there a way I can filter this data down without taking up more memory?
The desired output would be something like this (assuming top 1 instead of top 100)
id1 id2 score
0 a a 1
1 b b 3
Some additional info about the original df:
df.count()
permid_1    1144468900
permid_2    1144468900
distance    1144468900
dtype: int64

df.dtypes
permid_1      int64
permid_2      int64
distance    float64
dtype: object

df.shape
(1144468900, 3)
id1 & id2 unique value counts: 33,830
I can't test this code without your data, but perhaps try something like this:

import numpy as np

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:100]  # numpy works on raw positions only
    min_indices = scores.iloc[min_subindices].index   # convert back to pandas index labels
    indices.extend(min_indices)

df = df.loc[indices]
Descriptively: for each unique ID (the_id), extract the matching scores. Then find the raw positions of the smallest 100 scores, select them, and map those raw positions back to the pandas index. Save the pandas index labels to your list. At the end, subset the frame on the collected index.
iloc does take a list input, and some_series.iloc aligns with some_series.values, which is what allows this to work. Storing the indices indirectly like this should make it substantially more memory-efficient.
df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']: instead of taking the whole data frame and masking it, it takes only the score column and masks it for matching IDs. You may also want to del scores at the end of each loop iteration to free memory sooner.
You can try the following:
df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])
df = df.loc[df["dummy_key"]]
1. You sort ascending (smallest on top), first by the grouping column, then by score.
2. You add a column indicating whether the current id1 differs from the id1 100 rows back (if it doesn't, the row is 101st or later within its group).
3. You filter by the column from step 2.
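To see the trick at a small scale, here is a sketch on a made-up toy frame, using shift(2) to keep the 2 smallest scores per id instead of 100:

import pandas as pd

toy = pd.DataFrame({'id1': ['a', 'a', 'a', 'b', 'b', 'b'],
                    'score': [3, 1, 2, 6, 5, 4]})
toy.sort_values(['id1', 'score'], inplace=True)
toy['dummy_key'] = toy['id1'].shift(2).ne(toy['id1'])
print(toy.loc[toy['dummy_key']])  # the 2 smallest scores for each id1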
As Aryerez outlined in a comment, you can do something along the lines of:
closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])
You could also do
def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']  # the 100th smallest score (0-indexed)

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]
Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take
low_scores = df.loc[df['score']<threshold]
Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
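A hedged sketch of the full two-step idea; the threshold value here is an assumption you would need to tune against your own score distribution:

threshold = 0.5  # made-up value; pick one that still leaves >= 100 rows per id1
low_scores = df.loc[df['score'] < threshold]
closest = low_scores.groupby('id1')['score'].nsmallest(100)  # the original groupby, now on a much smaller frame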
You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
For the given shape (1144468900, 3) with 33,830 unique values, the id1 and id2 columns are good candidates for the categorical data type: each distinct value is stored only once and referenced by a compact integer code per row (each value repeats roughly 1144468900 / 33,830 = 33,830 times), which reduces the memory requirement for these two columns considerably. Convert them and then perform any aggregation you want.
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
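To check the saving on your own data, you could compare total memory usage before and after the conversion; a minimal sketch:

before = df.memory_usage(deep=True).sum()
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
after = df.memory_usage(deep=True).sum()
print(before, after)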
I have a dataframe of unique strings and I want to find the row and column for a given string. I want these values because I'll eventually be exporting this dataframe to an Excel spreadsheet. The easiest way I've found so far to get these values is the following:

jnames = list(df.iloc[0].to_frame().index)
for i in jnames:
    for k in df[i]:
        if 'searchstring' in str(k):
            print('Column: {}'.format(jnames.index(i) + 1))
            print('Row: {}'.format(list(df[i]).index('searchstring')))
            break
Can anyone advise a solution that takes better advantage of the inherent capabilities of pandas?
Without reproducible code / data, I'm going to make up a dataframe and show one simple way:
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'b']])
The dataframe looks like this:
0 1 2
0 a b c
1 d e f
2 g h b
Solution
result = list(zip(*np.where(df.values == 'b')))
Result
[(0, 1), (2, 2)]
Explanation
df.values accesses the numpy array underlying the dataframe.
np.where creates an array of coordinates satisfying the provided condition.
zip(*...) transforms [x-coords-array, y-coords-array] into (x, y) coordinate pairs.
Try using str.contains. This will return a dataframe of the rows that contain the slice you are looking for.
df[df['<my_col>'].str.contains('<my_string_slice>')]
Similarly, you can use match for a direct match.
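For example, a sketch with the same placeholder names; note that str.match anchors the pattern at the start of each value, and a plain equality test also works for a whole-string match:

df[df['<my_col>'].str.match('<my_string_slice>')]
df[df['<my_col>'] == '<my_string_slice>']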
This is my approach, without writing nested for loops:

value_to_search = "c"
matching_cols = [x for x in df.columns if value_to_search in df[x].unique()]
col = matching_cols[0]
print(df.index[df[col] == value_to_search][0])  # row index
print(col)                                      # column name
The first print returns the row index and the second returns the column name. Combined, you get the index-column pair. Since you mentioned that all values in the df are unique, each will return exactly one value.
You might need a try-except in case value_to_search is not in the data frame.
By using stack (data from jpp):
df[df=='b'].stack()
Out[211]:
0 1 b
2 2 b
dtype: object
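If you want the coordinates as plain (row, column) tuples, the stacked result's MultiIndex already holds them; a small follow-up sketch:

list(df[df == 'b'].stack().index)
# [(0, 1), (2, 2)]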
I'm having trouble understanding how looping through a dataframe works.
I found somewhere that if you write:
for row in df.iterrows()
you won't be able to access row['column1']; instead you'll have to use
for row, index in df.iterrows() and then it works.
Now I want to create a collection of the signals I found in the loop by adding each row to a new dataframe with newdf.append(row). This works, but the result loses the ability to be referenced by a column name. How do I have to add those rows to my dataframe for that to work?
Detailed code:
dataframe1 = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
dataframe2 = DataFrame()

for index, row in dataframe1.iterrows():
    if row['a'] == 5:
        dataframe2.append(row)

print dataframe2['b']
This doesn't work, because it won't accept strings inside the brackets for dataframe2.
Yes, this could be done more easily, but for the sake of argument let's say it couldn't (the real logic is more complex than one if).
In my real code there are about ten different ifs and elses determining what to do with each specific row (plus other work done from within the loop). I'm not talking about filtering, but just about adding the row to a new dataframe in a way that preserves the column labels so I can reference them by name.
In pandas, it is pretty straightforward to filter and pass the results, if needed, to a new dataframe, just as @smci suggests for R.
import numpy as np
import pandas as pd
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
dataframe1.head()
a b c d e
0 -2.824391 -0.143400 -0.936304 0.056744 -1.958325
1 -1.116849 0.010941 -1.146384 0.034521 -3.239772
2 -2.026315 0.600607 0.071682 -0.925031 0.575723
3 0.088351 0.912125 0.770396 1.148878 0.230025
4 -0.954288 -0.526195 0.811891 0.558740 -2.025363
Then, to filter, you can do it like so (.ix is deprecated in newer pandas; .loc behaves the same way here):

dataframe2 = dataframe1.loc[dataframe1.a > .5]
dataframe2.head()
a b c d e
0 0.708511 0.282347 0.831361 0.331655 -2.328759
1 1.646602 -0.090472 -0.074580 -0.272876 -0.647686
8 2.728552 -0.481700 0.338771 0.848957 -0.118124
EDIT
OP didn't want to use a filter, so here is an example iterating through rows instead:
np.random.seed(123)
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
## I declare the second df with the same structure
dataframe2 = pd.DataFrame(columns=['a','b','c', 'd', 'e'])
For the loop I use iterrows, and instead of appending to an empty dataframe, I use the index from the iterator to place each row at the same index position in the empty frame. Notice that I test > .5 instead of == 5, otherwise the resulting dataframe would almost surely be empty.
for index, row in dataframe1.iterrows():
    if row['a'] > .5:
        dataframe2.loc[index] = row
dataframe2
a b c d e
1 1.651437 -2.426679 -0.428913 1.265936 -0.866740
4 0.737369 1.490732 -0.935834 1.175829 -1.253881
UPDATE:
Don't. Solution is:
dataframe1[dataframe1.a > .5]
# or, if you only want the 'b' column
dataframe1[dataframe1.a > .5]['b']
You only want to filter for rows where a==5 (and then select the b column?)
You have still shown zero reason whatsoever why you need to append to a second dataframe. In fact you don't need to append anything; you just directly generate your filtered version.
ORIGINAL VERSION:
Don't.
If all you want to do is compute aggregations or summaries and they don't really belong in the parent dataframe, do a filter. Assign the result to a separate dataframe.
If you really insist on using iterate-and-append instead of filtering, even knowing all the caveats, then create an empty summary dataframe and append to that as you iterate. Only after you're finished iterating, and only if you really need to, append it back to the parent dataframe.
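If you do go the iterate-and-collect route, one common pattern (a sketch, not the only option) is to gather the matching rows in a plain Python list and build the summary frame once at the end, which avoids repeatedly appending to a DataFrame:

rows = []
for index, row in dataframe1.iterrows():
    if row['a'] > .5:             # whatever per-row logic you need
        rows.append(row)
summary = pd.DataFrame(rows)      # built once; keeps column names and the original index
print(summary['b'])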
I have a pandas data frame that I want to subdivide by row, BUT into 32 different slices (think of a large data set chopped by row into 32 smaller data sets). I can manually divide the data frame in this way:
df_a = df[df['Type']=='BROKEN PELVIS']
df_b = df[df['Type']=='ABDOMINAL STRAIN']
I'm assuming there is a much more Pythonic expression someone might like to share. I'm looking for something along the lines of:
for i in new1:
df_%s= df[df['#RIC']=='%s'] , %i
Hope that makes sense.
In these kinds of situations I think it's more pythonic to store the DataFrames in a Python dictionary:
injuries = {injury: df[df['Type'] == injury] for injury in df['Type'].unique()}
injuries['BROKEN PELVIS'] # is the same as df_a above
Most of the time you don't need to create a new DataFrame but can use a groupby (it depends what you're doing next), see http://pandas.pydata.org/pandas-docs/stable/groupby.html:
g = df.groupby('Type')
Update: in fact there is a method get_group to access these:
In [21]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [22]: g = df.groupby(0)
In [23]: g.get_group('A')
Out[23]:
0 1
0 A 2
1 A 4
Note: most of the time you don't need to do this, apply, aggregate and transform are your friends!
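For example, a couple of hedged one-liners on the question's original df (the 'Type' column name is taken from the question):

g = df.groupby('Type')
print(g.size())            # number of rows per injury type
for injury, sub_df in g:   # iterate over the per-type sub-frames
    print(injury, len(sub_df))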