Is there a general, efficient way to assign values to a subset of a DataFrame in pandas? I have hundreds of rows and columns that I can access directly, but I haven't managed to figure out how to edit their values without iterating over each (row, col) pair. For example:
In [1]: import pandas, numpy
In [2]: array = numpy.arange(30).reshape(3,10)
In [3]: df = pandas.DataFrame(array, index=list("ABC"))
In [4]: df
Out[4]:
0 1 2 3 4 5 6 7 8 9
A 0 1 2 3 4 5 6 7 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 21 22 23 24 25 26 27 28 29
In [5]: rows = ['A','C']
In [6]: columns = [1,4,7]
In [7]: df[columns].ix[rows]
Out[7]:
1 4 7
A 1 4 7
C 21 24 27
In [8]: df[columns].ix[rows] = 900
In [9]: df
Out[9]:
0 1 2 3 4 5 6 7 8 9
A 0 1 2 3 4 5 6 7 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 21 22 23 24 25 26 27 28 29
I believe what is happening here is that I'm getting a copy rather than a view, meaning I can't assign to the original DataFrame. Is that my problem? What's the most efficient way to edit those rows x columns (preferably in-place, as the DataFrame may take up a lot of memory)?
Also, what if I want to replace those values with a correctly shaped DataFrame?
Use loc in an assignment expression (with the =, it doesn't matter whether loc would return a view or a copy, because the assignment is applied directly to the original DataFrame):
In [11]: df.loc[rows, columns] = 99
In [12]: df
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 0 99 2 3 99 5 6 99 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 99 22 23 99 25 26 99 28 29
If you're using a version prior to 0.11 you can use .ix.
As @Jeff comments:
This is an assignment expression (see 'advanced indexing with ix' section of the docs) and doesn't return anything (although there are assignment expressions which do return things, e.g. .at and .iat).
df.loc[rows,columns] can return a view, but usually it's a copy. Confusing, but done for efficiency.
Bottom line: use ix, loc, iloc to set (as above), and don't modify copies.
See 'view versus copy' section of the docs.
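To answer the follow-up: the same .loc assignment also accepts a correctly shaped array or DataFrame on the right-hand side. A minimal sketch, reusing the imports above (note that when the right-hand side is a DataFrame, pandas aligns it on index and columns, so its labels must match the target selection):
# assign a correctly shaped ndarray (len(rows) x len(columns))
df.loc[rows, columns] = numpy.full((len(rows), len(columns)), 900)
# assign a DataFrame: its labels must align with the target selection
block = pandas.DataFrame(numpy.arange(6).reshape(2, 3), index=rows, columns=columns)
df.loc[rows, columns] = block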
I have a large dataset with millions of rows. One of the columns is ID.
I also have another (hash) table that maps ranges of indices to specific groups, each meeting certain criteria.
What is an efficient way to map these index ranges onto my dataset as an additional column in pandas?
As an example, let's say the dataset looks like this:
In [18]: print(df_test)
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]: print(df_hash)
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except:
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
This works, but it is really slow because it loops over the full length of the hash table dataframe (hundreds of thousands of rows). It produces the following result (which is what I want):
In [23]: print(df_test)
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
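For reference, a minimal setup reproducing the two frames above (the answers below assume these names, with the usual import alias):
import pandas as pd

df_test = pd.DataFrame({'ID': range(13, 33)})
df_hash = pd.DataFrame({'ID_first': [0, 2, 10]})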
You can map the index of df_test to the index of df_hash via ID_first, and then ffill. You need to construct a Series because the pd.Index class doesn't have an ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                      .ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
You can use Series.isin with Series.cumsum: every row whose ID appears in ID_first starts a new group, and the cumulative sum labels the groups (uncomment .sub(1) for zero-based labels). Note that this matches on the ID values, so the sample output below uses IDs 0-19; for the original data above, where ID_first holds index positions, match on the index instead (see the sketch after the output).
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum() #.sub(1)
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3
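To apply the same idea to the original data above, where ID_first refers to index positions, a sketch that matches on the index and shifts to zero-based group labels:
# Index.isin returns a boolean array; cumsum labels the groups 1..n, so subtract 1
df_test['Group'] = df_test.index.isin(df_hash['ID_first']).cumsum() - 1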
I have a dataframe df as shown:
1-1 1-2 1-3 2-1 2-2 3-1 3-2 4-1 5-1
10 3 9 1 3 9 33 10 11
21 31 3 22 21 13 11 7 13
33 22 61 31 35 34 8 10 16
6 9 32 5 4 8 9 6 8
The columns are named as follows: the first digit is the group number and the second is the subgroup within it. In this example there are groups 1, 2, 3, 4, 5, and group 1 consists of 1-1, 1-2, 1-3.
I would like to create a new dataframe that has only the groups 1, 2, 3, 4, 5 without subgroups, taking for each row the maximum value across the subgroups, and the approach should stay flexible if groups or subgroups are added later.
The new dataframe I need looks like this:
1 2 3 4 5
10 3 33 10 11
31 22 13 7 13
61 35 34 10 16
32 5 9 6 8
You can aggregate by columns with DataFrame.groupby and axis=1, using a lambda function that splits each column name and selects the group part, then reduce with max:
This works correctly even if group numbers have 2 or more digits.
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).max()
An alternative is to pass the split column names:
df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).max()
print (df1)
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
You can use .str[] or .str.get here. Note that .str[0] takes only the first character of each column name, so this assumes single-digit group numbers.
df.groupby(df.columns.str[0], axis=1).max()
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
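A side note: in recent pandas versions (2.1 and later), groupby with axis=1 is deprecated; an equivalent via transposing, as a sketch:
# df.T's index is the original columns, so the same keys group the rows
df1 = df.T.groupby(df.columns.str.split('-').str[0]).max().T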
I have a data sheet with about 1700 columns and 100 rows, each row keyed by a unique identifier. It is survey data: every employee of an organization answers the same 9 questions, but the answers are compiled into one row per organization. Is there a way in python/pandas to stack this data vertically, as opposed to the elongated wide format it is in now? I am currently cutting and pasting.
You can reshape the underlying numpy array and rebuild the index so each block of 9 answers keeps its company identifier:
# sample data, assuming index is the company
df = pd.DataFrame(np.arange(36).reshape(2,-1))
# new index
idx = df.index.repeat(df.shape[1]//9)
# new data:
new_df = pd.DataFrame(df.values.reshape(-1,9), index=idx)
Output:
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
0 9 10 11 12 13 14 15 16 17
1 18 19 20 21 22 23 24 25 26
1 27 28 29 30 31 32 33 34 35
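If you want the 9 questions as labeled columns, you could name them explicitly; a small sketch (the names Q1-Q9 are assumed, not taken from the original data):
new_df.columns = [f'Q{i}' for i in range(1, 10)]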
I have created a days-difference column in a pandas dataframe, and I'm looking to add a column holding the sum of a value over a given window of days looking backwards.
Note that I can supply a date column for each row if needed; the diff column was created as the difference in days from the first day of the data.
Example
df = pd.DataFrame.from_dict({'diff': [0,0,1,2,2,2,2,10,11,15,18],
                             'value': [10,11,15,2,5,7,8,9,23,14,15]})
df
Out[12]:
diff value
0 0 10
1 0 11
2 1 15
3 2 2
4 2 5
5 2 7
6 2 8
7 10 9
8 11 23
9 15 14
10 18 15
I want to add a 5_days_back_sum column that sums value over the past 5 days, including the same day, so the result would look like this:
Out[15]:
5_days_back_sum diff value
0 21 0 10
1 21 0 11
2 36 1 15
3 58 2 2
4 58 2 5
5 58 2 7
6 58 2 8
7 9 10 9
8 32 11 23
9 46 15 14
10 29 18 15
How can I achieve that? Originally I had a date column that was used to create the diff column; it is still available if that helps.
Use a custom function with boolean indexing to filter the window, then sum:
def f(x):
    return df.loc[(df['diff'] >= x - 5) & (df['diff'] <= x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
A similar solution with between:
def f(x):
    return df.loc[df['diff'].between(x - 5, x), 'value'].sum()

df['5_days_back_sum'] = df['diff'].apply(f)
print (df)
diff value 5_days_back_sum
0 0 10 21
1 0 11 21
2 1 15 36
3 2 2 58
4 2 5 58
5 2 7 58
6 2 8 58
7 10 9 9
8 11 23 32
9 15 14 46
10 18 15 29
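Both versions re-scan the whole frame once per row. A small refinement, as a sketch: rows sharing a diff value get the same sum, so you can compute the window sum once per unique diff and map it back:
# compute each inclusive [d-5, d] window sum once per unique diff value
sums = {d: df.loc[df['diff'].between(d - 5, d), 'value'].sum()
        for d in df['diff'].unique()}
df['5_days_back_sum'] = df['diff'].map(sums)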
dfa = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,7,6]})
Expected output
a b
0 1 4
1 2 5
I could achieve this in the following way:
>>> dfa[(dfa.a == 1) | (dfa.a == 2)]
a b
0 1 4
1 2 5
But this is not really scalable, since what I want is something like this pseudocode:
dfa[(dfa.a has-any range(5,50))]  # pseudocode
I think you need boolean indexing with isin, using np.arange or range:
print (np.arange(5,51))
[ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50]
print (dfa[dfa.a.isin(np.arange(5,51))])
Or:
print (dfa[dfa.a.isin(range(5,51))])
A solution with between:
print (dfa[dfa['a'].between(5, 50)])
Sample (one value is changed to 8):
dfa = pd.DataFrame({'a':[1,2,3,8],'b':[4,5,7,6]})
print (dfa)
a b
0 1 4
1 2 5
2 3 7
3 8 6
print (dfa[dfa.a.isin(np.arange(5,51))])
a b
3 8 6
print (dfa[dfa.a.isin(range(5,51))])
a b
3 8 6
print (dfa[dfa['a'].between(5, 50)])
a b
3 8 6
This will also do (checking whether any value falls in the range):
import pandas as pd

dfa = pd.DataFrame({'a':[1,2,3,4],'b':[4,5,7,6]})
print(dfa['a'].between(5, 50).any())
# False
print(dfa['b'].between(5, 50).any())
# True
print(((5 <= dfa) & (dfa <= 50)).any())  # all columns together
# a    False
# b     True
# dtype: bool
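For completeness, the same row filter can be written with DataFrame.query, which accepts chained comparisons:
print(dfa.query('5 <= a <= 50'))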