Python Pandas: create column based on value of index

I have a dataframe that looks like the one below. What I'd like to do is create another column based on the VALUE of the index (so anything with an index less than 10 would be labeled "small" in that column). I can do something like lengthDF[lengthDF.index < 10] to get the values I want, but I'm not sure how to create the additional column. I've tried this Create Column with ELIF in Pandas but can't get it to read the index...
LengthFirst LengthOthers
0 1 NaN
4 NaN 1
9 NaN 1
13 NaN 1
17 1 1
18 NaN 1
19 NaN 1
20 1 NaN
21 1 1
22 3 4
23 1 NaN
24 7 6
25 1 2
26 16 19
27 1 2
28 24 8
29 9 12
30 73 65
31 15 12
32 55 60
33 28 21
34 29 31

Something like this? (Use .loc for the conditional assignment; chained indexing like lengthDF['size'][mask] = ... can silently write to a copy.)
lengthDF['size'] = 'large'
lengthDF.loc[lengthDF.index < 10, 'size'] = 'small'
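A vectorized alternative (a sketch using numpy, which the original answer doesn't use) builds the whole column in one pass with np.where:
import numpy as np

# 'small' wherever the index condition holds, 'large' everywhere else,
# so no separate initialization step is needed
lengthDF['size'] = np.where(lengthDF.index < 10, 'small', 'large')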

For a column in pandas dataframe, calculate mean of column values in previous 4th, 8th and 12th row from the present row?

In a pandas dataframe, I want to create a new column that holds the average of the column values 4, 8 and 12 rows before the present row.
As shown in the table below, for row number 13:
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
The average of 4, 6 and 2 is 4, hence New column = 4 at row number 13. For the remaining rows 1-12, New column = NaN.
I have more rows in my df, but I added only first 13 rows here for illustration.
Row number  Existing column  New column
1           2                NaN
2           4                NaN
3           3                NaN
4           1                NaN
5           6                NaN
6           4                NaN
7           8                NaN
8           2                NaN
9           4                NaN
10          9                NaN
11          2                NaN
12          4                NaN
13          3                3
.shift() is the missing piece: it lets you access earlier rows relative to the current row in a pandas dataframe.
Let's use .groupby(), .apply() and .shift() as follows:
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
Here, rows are partitioned into groups of 13 by the group number (df['Row number'] - 1) // 13.
Then, within each group, we use .apply() on the column Existing column and .shift() to get the entries 4, 8 and 12 rows back within the group; the result is non-NaN only on the 13th row of each group, where all three shifted values exist.
Test Run
import numpy as np
import pandas as pd

data = {'Row number': np.arange(1, 40), 'Existing column': np.arange(11, 50)}
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
You can use rolling with .apply to apply a custom aggregation function.
The average of (4, 6, 2) is 4, not 3 as the question's table shows at row 13:
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down:
df["Existing column"]: select "Existing column" from the dataframe.
.rolling(13): move a sliding window of 13 rows across the data. First we see rows 0-12, then rows 1-13, then 2-14, and so forth.
.apply(...): for each of those rolling sections, apply a function that works on the section (in this case, the lambda).
lambda x: x.iloc[[0, 4, 8]].mean(): from each rolling section, extract the 0th, 4th and 8th values (corresponding to rows 1, 5 and 9 in the first window) and return their mean.
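As a side note (a performance sketch, not part of the original answer): rolling .apply accepts raw=True, which hands each window to the function as a plain NumPy array instead of a Series. That is usually faster, and the selection becomes ordinary fancy indexing:
# same computation, but each window arrives as an ndarray
df["New column"] = df["Existing column"].rolling(13).apply(lambda a: a[[0, 4, 8]].mean(), raw=True)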
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13  # defines groups as chunks of 13 rows
>>> averages = (
...     df.groupby(groups)["Existing column"]
...     .apply(lambda x: x.iloc[[0, 4, 8]].mean())
... )
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down now:
groups = np.arange(len(df)) // 13: creates an array used to chunk our dataframe into groups. This array is essentially 13 0s, followed by 13 1s, followed by 13 2s... until it is the same length as the dataframe. In this single-chunk example it is just 13 0s.
df.groupby(groups)["Existing column"]: group the dataframe according to the groups defined above and select "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): conceptually the same as before, except we apply to each group instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but it ensures our selected averages line up with the original dataframe. We want the average from group 0 (index value 0 in the averages Series) to align to row 12; if we had another group (group 1), we would want it to align to row 25. A little arithmetic does that transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of aligning the new values under the hood for us.

Hash table mapping in Pandas

I have a large dataset with millions of rows of data. One of the data columns is ID.
I also have another (hash) table that maps ranges of indices to a specific group that meets a certain criterion.
What is an efficient way to map those index ranges onto my dataset as an additional column in pandas?
As an example, let's say that the dataset looks like this:
In [18]:
print(df_test)
Out [19]:
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]:
print(df_hash)
Out [21]:
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except:
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
This works, but it is really slow because it loops over the full length of the hash table dataframe (hundreds of thousands of rows). It produces the following answer (which is what I want):
In [23]:
print(df_test)
Out [24]:
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
You can map the index of df_test to the index of df_hash via ID_first, and then ffill. We need to construct a Series first, since the pd.Index class doesn't have an ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                    .ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
You can also use Series.isin with Series.cumsum: each row whose value starts a new range bumps the running group counter. Note that the demo below uses a df_test whose ID column equals its index; with the question's data, where ID_first holds index values rather than ID values, you would test df_test.index.isin(df_hash['ID_first']) instead. The commented .sub(1) makes the group labels start at 0.
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum()  # .sub(1) to start at 0
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3
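For completeness (an alternative not in the original answers): pd.merge_asof is built for exactly this kind of range lookup, matching each row to the most recent boundary at or below its key. A minimal sketch, assuming the values in df_hash.ID_first refer to positions in df_test's index and both keys are sorted:
import pandas as pd

df_test = pd.DataFrame({'ID': range(13, 33)})
df_hash = pd.DataFrame({'ID_first': [0, 2, 10]})

# turn the hash table into (Group, ID_first) boundary rows
boundaries = df_hash.reset_index().rename(columns={'index': 'Group'})
df_test = (pd.merge_asof(df_test.reset_index(), boundaries,
                         left_on='index', right_on='ID_first',
                         direction='backward')      # last boundary <= row position
           .set_index('index')
           .drop(columns='ID_first'))
print(df_test)  # rows 0-1 -> Group 0, rows 2-9 -> Group 1, rows 10-19 -> Group 2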

Adding a column to a dataframe through a mapping between 2 dataframes in Python?

I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
import pandas as pd

location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
            'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
            'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id: whenever a location_id appears in an island's list_of_locations, that island_id should be appended to a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
                   'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
                   'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
                   'island_id': [10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample; the dataframe I actually have holds about 13,000,000 rows and 4 columns. How can this be achieved efficiently? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (note the unmatched location_id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
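An alternative sketch (not from the original answer) that skips the merge: explode once, turn the result into a location-to-island mapping, and use Series.map, which preserves df_location's row order and leaves unmatched locations as NaN automatically:
island_map = (df_islands.explode('list_of_locations')
              .set_index('list_of_locations')['island_id'])
df_location['island_id'] = df_location['location_id'].map(island_map)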

get only previous three values from the dataframe

I am new to Python and pandas. What I have is a dataframe like this:
Id Offset Feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this, I am trying to get the previous three values (with their offsets) that come before each 0 in the Feature column.
So the output would be like:
offset Feature
11 2
21 22
28 22
(these are the three rows before the 0 at offset 32)
and, in the same dataframe, for the next places where Feature is 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way I can get this? Thanks.
This should be done on the basis of the document Id.
Even though I am quite new to pandas, I have attempted to answer your question.
I populated your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd
df = pd.read_csv('./data.csv')
for index in (df.loc[df['Feature'] == 0]).index:
    print(df.loc[index-3:index-1])
The output looks like this; the leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note: there might be a more pythonic way to do this.
You can take the 3 rows before each 0 value in the column using loc. Follow the code:
import pandas as pd

df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3: each_zero_index]
    print(df1)  # This dataframe has 4 records: your previous three plus the zero record.
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
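If you'd rather collect everything into one dataframe than print each slice, a small sketch built on the variables above (overlapping windows will repeat rows; here the zero row itself is excluded, as in the first answer):
windows = [df.loc[i - 3: i - 1] for i in zero_indexes]  # the 3 rows before each zero
result = pd.concat(windows)
print(result[['Offset', 'Feature']])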

Calculate the delta between entries in Pandas using partitions

I'm using Dataframe in Pandas, and I would like to calculate the delta between each adjacent rows, using a partition.
For example, this is my initial set after sorting it by A and B:
A B
1 12 40
2 12 50
3 12 65
4 23 30
5 23 45
6 23 60
I want to calculate the delta between adjacent B values, partitioned by A. If we define C as the result, the final table should look like this:
A B C
1 12 40 NaN
2 12 50 10
3 12 65 15
4 23 30 NaN
5 23 45 15
6 23 60 15
The reason for the NaN is that we cannot calculate delta for the minimum number in each partition.
You can group by column A and take the difference:
df['C'] = df.groupby('A')['B'].diff()
df
Out:
A B C
1 12 40 NaN
2 12 50 10.0
3 12 65 15.0
4 23 30 NaN
5 23 45 15.0
6 23 60 15.0
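Equivalently (a sketch, not part of the original answer), diff(1) is the same as subtracting a group-wise shifted copy of B, which generalizes nicely if you ever need a different reference row:
# shift(1) moves each group's B values down one row, so the first row of
# each partition has no predecessor and the subtraction yields NaN there
df['C'] = df['B'] - df.groupby('A')['B'].shift(1)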
