I am fairly new to pandas, coming from a statistics background, and I am struggling with a conceptual problem:
Pandas has columns, which contain values. But sometimes values have a special meaning - what a statistical program like SPSS or R calls "value labels".
Imagine a column rain with two values, 0 (meaning: no rain) and 1 (meaning: raining). Is there a way to assign these labels to those values?
Is there a way to do this in pandas, too? Mainly for plotting and visualisation purposes.
There's no need to use a map anymore. Since version 0.15, Pandas allows a categorical data type for its columns.
The stored data takes less space, operations on it are faster and you can use labels.
I'm taking an example from the pandas docs:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
#Recast grade as a categorical variable
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
#Gives this:
Out[124]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
You can also rename categories and add missing categories.
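For example, via the .cat accessor (the label names here are just illustrative):
# rename the existing categories (one new name per existing category a, b, e)
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
# declare the full set of valid categories, adding ones not present in the data
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])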
You could have a separate dictionary which maps values to labels:
d={0:"no rain",1:"raining"}
and then you could access the labelled data by doing
df.rain_column.apply(lambda x:d[x])
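Equivalently, and a bit more idiomatic, Series.map accepts the dictionary directly, which is handy for plotting (rain_column is the placeholder column name from above):
labels = df.rain_column.map(d)            # same result as the apply above
labels.value_counts().plot(kind="bar")    # e.g. a bar chart with readable labels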
I have a data frame that has the following format:
d = {'id1': ['a', 'a', 'b', 'b',], 'id2': ['a', 'b', 'b', 'c'], 'score': ['1', '2', '3', '4']}
df = pd.DataFrame(data=d)
print(df)
id1 id2 score
0 a a 1
1 a b 2
2 b b 3
3 b c 4
The data frame has over 1 billion rows; it represents pairwise distance scores between objects in columns id1 and id2. I do not need all object pair combinations: for each object in id1 (there are about 40k unique ids) I only want to keep the top 100 closest (smallest) distance scores.
The code I'm running to do this is the following:
df = df.groupby(['id1'])['score'].nsmallest(100)
The issue with this code is that I run into a memory error each time I try to run it
MemoryError: Unable to allocate 8.53 GiB for an array with shape (1144468900,) and data type float64
I'm assuming it is because in the background pandas is now creating a new data frame for the result of the group by, but the existing data frame is still held in memory.
The reason I am only taking the top 100 of each id is to reduce the size of the data frame, but it seems that while doing so I actually end up using more memory.
Is there a way I can go about filtering this data down but not taking up more memory?
The desired output would be something like this (assuming top 1 instead of top 100)
id1 id2 score
0 a a 1
1 b b 3
Some additional info about the original df:
df.count()
permid_1 1144468900
permid_2 1144468900
distance 1144468900
dtype: int64
df.dtypes
permid_1 int64
permid_2 int64
distance float64
dtype: object

df.shape
(1144468900, 3)
id1 & id2 unique value counts: 33,830
I can't test this code, lacking your data, but perhaps try something like this:
import numpy as np

indices = []
for the_id in df['id1'].unique():
    scores = df['score'][df['id1'] == the_id]
    min_subindices = np.argsort(scores.values)[:100]  # numpy works on raw positions only
    min_indices = scores.iloc[min_subindices].index   # convert positions to pandas index labels
    indices.extend(min_indices)
df = df.loc[indices]
Descriptively: for each unique ID (the_id), extract the matching scores. Find the raw positions of the smallest 100, select them, and map those raw positions back to the pandas index labels. Save those labels to your list, and at the end subset the frame on the collected index.
iloc does take a list input, and some_series.iloc aligns with some_series.values, which is what allows this to work. Storing only the index labels like this, instead of materialising sub-frames, should make this substantially more memory-efficient.
df['score'][df['id1'] == the_id] should work more efficiently than df.loc[df['id1'] == the_id, 'score']. Instead of taking the whole data frame and masking it, it takes only the score column of the data frame and masks it for matching IDs. You may want to del scores at the end of each loop if you want to immediately free more memory.
You can try the following:
df.sort_values(["id1", "score"], inplace=True)
df["dummy_key"] = df["id1"].shift(100).ne(df["id1"])
df = df.loc[df["dummy_key"]]
1. You sort ascending (smallest on top), first by id1, then by score.
2. You add a column indicating whether the current id1 differs from the id1 100 rows back (if it doesn't, your row is at position 101 or later within its group).
3. You filter on the column from step 2.
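A minimal sketch of the idea on a toy frame (keeping the top 2 per group instead of 100):
import pandas as pd

toy = pd.DataFrame({"id1": ["a", "a", "a", "b", "b"],
                    "score": [3, 1, 2, 5, 4]})
toy.sort_values(["id1", "score"], inplace=True)
toy["dummy_key"] = toy["id1"].shift(2).ne(toy["id1"])  # True only for the first 2 rows of each id1
print(toy.loc[toy["dummy_key"]])
This keeps scores 1 and 2 for a and scores 4 and 5 for b.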
As Aryerez outlined in a comment, you can do something along the lines of:
closest = pd.concat([df.loc[df['id1'] == id1].sort_values(by='score').head(100)
                     for id1 in set(df['id1'])])
You could also do:
def get_hundredth(id1):
    sub_df = df.loc[df['id1'] == id1].sort_values(by='score')
    return sub_df.iloc[99]['score']   # the 100th-smallest score for this id1

hundredth_dict = {id1: get_hundredth(id1) for id1 in set(df['id1'])}

def check_distance(row):
    return row['score'] <= hundredth_dict[row['id1']]

closest = df.loc[df.apply(check_distance, axis=1)]
(Ties at the 100th score will also be kept.)
Another strategy would be to look at how filtering out distances past a threshold affects the dataframe. That is, take
low_scores = df.loc[df['score']<threshold]
Does this significantly decrease the size of the dataframe for some reasonable threshold? You'd need a threshold that makes the dataframe small enough to work with, but leaves the lowest 100 scores for each id1.
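One rough way to pick a starting threshold (a sketch; it only approximates the cutoff, since small scores are not spread evenly across ids):
# you must keep at least 100 rows per id1, so start from the matching score quantile
n_keep = 100 * df['id1'].nunique()
threshold = df['score'].quantile(n_keep / len(df))
low_scores = df.loc[df['score'] <= threshold]
# some ids will have more than 100 rows below this threshold and a few may have fewer,
# so you may still need to adjust it up or down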
You also might want to look into what sort of optimization you can do given your distance metric. There are probably algorithms out there specifically for cosine similarity.
For the given shape (1144468900, 3) with only 33,830 unique values, id1 and id2 are good candidates for the categorical data type. A categorical column stores one copy of each unique value plus a compact integer code per row, so each of these two int64 columns drops from 8 bytes per row to roughly 4 bytes of codes per row (the saving is far larger if the ids are strings). Convert them first, then perform any aggregation you want.
df[['id1', 'id2']] = df[['id1', 'id2']].astype('category')
out = df.groupby(['id1'])['score'].nsmallest(100)
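You can check the effect on memory yourself with memory_usage (a quick sketch; compare the output before and after the astype call, and note the gain is much larger when the ids are strings):
print(df[['id1', 'id2']].memory_usage(deep=True))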
How can I create a new column in a Pandas DataFrame that compresses/collapses multiple values at once from another column? Also, is it possible to use a default value so that you don't have to explicitly write out all the value mappings?
I'm referring to a process that is often called "variable recoding" in statistical software such as SPSS and Stata.
Example
Suppose I have a DataFrame with 1,000 observations. The only column in the DataFrame is called col1 and it has 26 unique values (the letters A through Z). Here's a reproducible example of my starting point:
import pandas as pd
import numpy as np
import string
np.random.seed(666)
df = pd.DataFrame({'col1':np.random.choice(list(string.ascii_uppercase),size=1000)})
I want to create a new column called col2 according to the following mapping:
If col1 is equal to either A, B or C, col2 should receive AA
If col1 is equal to either D, E or F, col2 should receive MM
For all other values in col1, col2 should receive ZZ
I know I can partially do this using Pandas' replace function, but it has two problems. The first is that the replace function doesn't allow you to condense multiple input values into one single response value. This forces me to write out df['col1'].replace({'A':'AA','B':'AA','C':'AA'}) instead of something simpler like df['col1'].replace({['A','B','C']:'AA'}).
The second problem is that the replace function doesn't have an all_other_values keyword or anything like that. This forces me to manually write out the ENTIRE value mappings like this df['col1'].replace({'A':'AA','B':'AA',...,'G':'ZZ','H':'ZZ','I':'ZZ',...,'X':'ZZ','Y':'ZZ','Z':'ZZ'}) instead of something simpler like df['col1'].replace(dict_for_abcdef, all_other_values='ZZ')
Is there another way to use the replace function that I'm missing that would allow me to do what I'm asking? Or is there another Pandas function that enables you to do similar things to what I describe above?
Dirty implementation
Here is a "dirty" implementation of what I'm looking for using loc:
df['col2'] = 'ZZ' # Initiate the column with the default "all_others" value
df.loc[df['col1'].isin(['A','B','C']),'col2'] = 'AA' # Mapping from "A","B","C" to "AA"
df.loc[df['col1'].isin(['D','E','F']),'col2'] = 'MM' # Mapping from "D","E","F" to "MM"
I find this solution a bit messy and was hoping something a bit cleaner existed.
You can try np.select, which takes a list of conditions, a list of values, and also a default:
conds = [df['col1'].isin(['A', 'B', 'C']),
df['col1'].isin(['D', 'E', 'F'])]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
You can also use between instead of isin:
conds = [df['col1'].between('A', 'C'),
df['col1'].between('D', 'F')]
values = ['AA', 'MM']
df['col2'] = np.select(conds, values, default='ZZ')
Sample Input and Output:
import string
import numpy as np
import pandas as pd
letters = string.ascii_uppercase
df = pd.DataFrame({'col1': list(letters)[:10]})
df after applying the np.select above:
col1 col2
0 A AA
1 B AA
2 C AA
3 D MM
4 E MM
5 F MM
6 G ZZ
7 H ZZ
8 I ZZ
9 J ZZ
np.select(conditions, choices, default). For the conditions, check whether the letters fall between a defined range:
c = [df['col1'].between('A', 'C'), df['col1'].between('D', 'F')]
CH = ['AA', 'MM']
df = df.assign(col2=np.select(c, CH, 'ZZ'))
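As a footnote on the "default value" part of the question: a plain dictionary lookup with Series.map also gives you a default, since unmapped values come back as NaN and can be filled in one step (a sketch, not using replace):
mapping = {'A': 'AA', 'B': 'AA', 'C': 'AA', 'D': 'MM', 'E': 'MM', 'F': 'MM'}
df['col2'] = df['col1'].map(mapping).fillna('ZZ')
Unlike np.select this still spells out every key, so it is mainly convenient when the mapping already exists as a dict.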
I'm looking for a method to add a column of float values to a matrix of string values.
Mymatrix =
[["a","b"],
["c","d"]]
I need to have a matrix like this =
[["a","b",0.4],
["c","d",0.6]]
I would suggest using a pandas DataFrame instead:
import pandas as pd
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]])
print(df)
0 1 2
0 a b 0.4
1 c d 0.6
You can also specify column (Series) names:
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]], columns=['A', 'B', 'C'])
df
A B C
0 a b 0.4
1 c d 0.6
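If you already have the strings and the floats as separate objects, you can also build the frame from the string matrix and then assign the float column (a small sketch reusing the example data):
Mymatrix = [["a", "b"], ["c", "d"]]
Mycol = [0.4, 0.6]

df = pd.DataFrame(Mymatrix, columns=['A', 'B'])
df['C'] = Mycol   # the float column keeps its own dtype alongside the string columns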
As noted, you can't mix data types in an ndarray, but you can do so in a structured or record array. They are similar in that you can mix data types as defined by the dtype= argument (it defines the data types and field names). Record arrays additionally allow access to the fields of structured arrays by attribute instead of only by index. You don't need for loops when you want to copy the entire contents between arrays. See my example below (using your data):
import numpy as np

Mymatrix = np.array([["a", "b"], ["c", "d"]])
Mycol = np.array([0.4, 0.6])
dt = np.dtype([('col0', 'U1'), ('col1', 'U1'), ('col2', float)])
new_recarr = np.empty((2,), dtype=dt)
new_recarr['col0'] = Mymatrix[:,0]
new_recarr['col1'] = Mymatrix[:,1]
new_recarr['col2'] = Mycol[:]
print (new_recarr)
Resulting output looks like this:
[('a', 'b', 0.4) ('c', 'd', 0.6)]
From there, use formatted strings to print.
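For instance (a minimal sketch):
for row in new_recarr:
    print(f"{row['col0']} {row['col1']} {row['col2']:.1f}")
# a b 0.4
# c d 0.6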
You can also copy from a recarray to an ndarray if you reverse assignment order in my example.
Note: I discovered there can be a significant performance penalty when using recarrays. See answer in this thread:
is ndarray faster than recarray access?
You need to understand why you want to do this. NumPy is efficient because data are aligned in memory, so mixing types is generally a source of bad performance. But in your case you can preserve alignment, since all your strings have the same length. Since the types are not homogeneous, you can use a structured array:
import numpy as np

raw = [["a", "b", 0.4],
       ["c", "d", 0.6]]
dt = np.dtype([('col0', 'U1'), ('col1', 'U1'), ('col2', float)])
aligned = np.empty(len(raw), dt)
for i in range(len(raw)):
    aligned[i] = tuple(raw[i])   # each record is assigned as a (str, str, float) tuple
You can also use pandas, but you often lose some performance.
I'm having trouble understanding how looping through a dataframe works.
I found somewhere that if you write:
for row in df.iterrows()
you won't be able to access row['column1']; instead you'll have to use
for index, row in df.iterrows() and then it works.
Now I want to create a collection of the signals I find in the loop by adding row to a new dataframe with newdf.append(row). This works, but the result loses the ability to be referenced by a column name. How do I have to add those rows to my dataframe for that to work?
Detailed code:
import numpy as np
from pandas import DataFrame

dataframe1 = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
dataframe2 = DataFrame()
for index, row in dataframe1.iterrows():
    if row['a'] == 5:
        dataframe2.append(row)
print(dataframe2['b'])
This doesn't work, because it won't accept strings inside the brackets for dataframe2.
Yes, this could be done more simply, but for the sake of argument let's say it couldn't (more complex logic than one if).
In my real code there are about ten different ifs and elses determining what to do with that specific row (plus other things done from within the loop). I'm not talking about filtering, just adding the row to a new dataframe in a way that preserves the index so I can reference columns by name.
In pandas, it is pretty straightforward to filter and pass the results, if needed, to a new dataframe, just as #smci suggests for R.
import numpy as np
import pandas as pd
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
dataframe1.head()
a b c d e
0 -2.824391 -0.143400 -0.936304 0.056744 -1.958325
1 -1.116849 0.010941 -1.146384 0.034521 -3.239772
2 -2.026315 0.600607 0.071682 -0.925031 0.575723
3 0.088351 0.912125 0.770396 1.148878 0.230025
4 -0.954288 -0.526195 0.811891 0.558740 -2.025363
Then, to filter, you can do like so:
dataframe2 = dataframe1.loc[dataframe1.a > .5]
dataframe2.head()
a b c d e
0 0.708511 0.282347 0.831361 0.331655 -2.328759
1 1.646602 -0.090472 -0.074580 -0.272876 -0.647686
8 2.728552 -0.481700 0.338771 0.848957 -0.118124
EDIT
OP didn't want to use a filter, so here is an example iterating through rows instead:
np.random.seed(123)
dataframe1 = pd.DataFrame(np.random.randn(10, 5), columns=['a','b','c', 'd', 'e'])
## I declare the second df with the same structure
dataframe2 = pd.DataFrame(columns=['a','b','c', 'd', 'e'])
For the loop I use iterrows, and instead of appending to an empty dataframe, I use the index from the iterator to place the row at the same index position in the empty frame. Notice that I used > .5 instead of == 5, or else the resulting dataframe would almost surely be empty.
for index, row in dataframe1.iterrows():
    if row['a'] > .5:
        dataframe2.loc[index] = row

dataframe2
a b c d e
1 1.651437 -2.426679 -0.428913 1.265936 -0.866740
4 0.737369 1.490732 -0.935834 1.175829 -1.253881
UPDATE:
Don't. Solution is:
dataframe1[dataframe1.a > .5]
# or, if you only want the 'b' column
dataframe1[dataframe1.a > .5]['b']
You only want to filter for rows where a==5 (and then select the b column?)
You have still shown no reason why you need to append to a dataframe at all. In fact you don't need to append anything; you can directly generate your filtered version.
ORIGINAL VERSION:
Don't.
If all you want to do is compute aggregations or summaries and they don't really belong in the parent dataframe, do a filter. Assign the result to a separate dataframe.
If you really insist on using iterate+append instead of a filter, even knowing all the caveats, then create an empty summary dataframe, append to that as you iterate, and only after you're finished iterating (and only if you really need to) append it back to the parent dataframe.
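If the per-row logic really can't be vectorised, one pattern along those lines is to collect the kept rows in a plain Python list and build the summary frame once at the end (a sketch; note that DataFrame.append itself was removed in pandas 2.0):
kept = []
for index, row in dataframe1.iterrows():
    if row['a'] > .5:            # stand-in for the more complex branching logic
        kept.append(row)

dataframe2 = pd.DataFrame(kept)  # column names and the original index are preserved
dataframe2['b']                  # columns are addressable by name again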
I have a pandas data frame that I want to subdivide by row, but into 32 different slices (think of a large data set chopped by row into 32 smaller data sets). I can manually divide the data frames in this way:
df_a = df[df['Type']=='BROKEN PELVIS']
df_b = df[df['Type']=='ABDOMINAL STRAIN']
I'm assuming there is a much more Pythonic expression someone might like to share. I'm looking for something along the lines of:
for i in new1:
df_%s= df[df['#RIC']=='%s'] , %i
Hope that makes sense.
In these kinds of situations I think it's more Pythonic to store the DataFrames in a Python dictionary:
injuries = {injury: df[df['Type'] == injury] for injury in df['Type'].unique()}
injuries['BROKEN PELVIS'] # is the same as df_a above
Most of the time you don't need to create a new DataFrame but can use a groupby (it depends what you're doing next), see http://pandas.pydata.org/pandas-docs/stable/groupby.html:
g = df.groupby('Type')
Update: in fact there is a method get_group to access these:
In [21]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [22]: g = df.groupby(0)
In [23]: g.get_group('A')
Out[23]:
0 1
0 A 2
1 A 4
Note: most of the time you don't need to do this, apply, aggregate and transform are your friends!
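For example (a sketch, assuming the Type column from the question):
g = df.groupby('Type')

frames = dict(tuple(g))              # {'BROKEN PELVIS': sub-frame, 'ABDOMINAL STRAIN': sub-frame, ...}
counts = g.size()                    # number of rows per Type
means = g.mean(numeric_only=True)    # per-Type means of any numeric columns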