In a pandas DataFrame, I want to create a new column that averages the values of an existing column at the 4th, 8th, and 12th rows before the current row.
As shown in the table below, for row number 13:
Value in Existing column that is 4 rows before row 13 (row 9) = 4
Value in Existing column that is 8 rows before row 13 (row 5) = 6
Value in Existing column that is 12 rows before row 13 (row 1) = 2
Average of 4, 6, 2 is 4. Hence New Column = 4 at row number 13; for the remaining rows 1-12, New Column = NaN.
I have more rows in my df, but I added only the first 13 rows here for illustration.
Row number    Existing column    New column
 1            2                  NaN
 2            4                  NaN
 3            3                  NaN
 4            1                  NaN
 5            6                  NaN
 6            4                  NaN
 7            8                  NaN
 8            2                  NaN
 9            4                  NaN
10            9                  NaN
11            2                  NaN
12            4                  NaN
13            3                  3
.shift() is your missing piece. We can use it to access earlier rows relative to the current row in a pandas DataFrame.
Let's use .groupby(), .apply(), and .shift() as follows:
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
Here, rows are partitioned into groups of 13 by assigning each row a group number computed as (df['Row number'] - 1) // 13.
Then, within each group, we use .apply() on the column Existing column together with .shift() to fetch the entries 4, 8, and 12 rows earlier within the group.
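As a quick sanity check on that grouping key (a minimal sketch with a few hypothetical row numbers):

```python
import pandas as pd

# (row_number - 1) // 13 maps rows 1-13 to group 0,
# rows 14-26 to group 1, and so on.
row_numbers = pd.Series([1, 13, 14, 26, 27])
print(((row_numbers - 1) // 13).tolist())  # [0, 0, 1, 1, 2]
```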
Test Run
import numpy as np
import pandas as pd

data = {'Row number': np.arange(1, 40), 'Existing column': np.arange(11, 50)}
df = pd.DataFrame(data)
print(df)
Row number Existing column
0 1 11
1 2 12
2 3 13
3 4 14
4 5 15
5 6 16
6 7 17
7 8 18
8 9 19
9 10 20
10 11 21
11 12 22
12 13 23
13 14 24
14 15 25
15 16 26
16 17 27
17 18 28
18 19 29
19 20 30
20 21 31
21 22 32
22 23 33
23 24 34
24 25 35
25 26 36
26 27 37
27 28 38
28 29 39
29 30 40
30 31 41
31 32 42
32 33 43
33 34 44
34 35 45
35 36 46
36 37 47
37 38 48
38 39 49
df['New column'] = df.groupby((df['Row number'] - 1) // 13)['Existing column'].apply(lambda x: (x.shift(4) + x.shift(8) + x.shift(12)) / 3)
print(df)
Row number Existing column New column
0 1 11 NaN
1 2 12 NaN
2 3 13 NaN
3 4 14 NaN
4 5 15 NaN
5 6 16 NaN
6 7 17 NaN
7 8 18 NaN
8 9 19 NaN
9 10 20 NaN
10 11 21 NaN
11 12 22 NaN
12 13 23 15.0
13 14 24 NaN
14 15 25 NaN
15 16 26 NaN
16 17 27 NaN
17 18 28 NaN
18 19 29 NaN
19 20 30 NaN
20 21 31 NaN
21 22 32 NaN
22 23 33 NaN
23 24 34 NaN
24 25 35 NaN
25 26 36 28.0
26 27 37 NaN
27 28 38 NaN
28 29 39 NaN
29 30 40 NaN
30 31 41 NaN
31 32 42 NaN
32 33 43 NaN
33 34 44 NaN
34 35 45 NaN
35 36 46 NaN
36 37 47 NaN
37 38 48 NaN
38 39 49 41.0
You can use rolling with .apply to apply a custom aggregation function.
First, note that the average of (4, 6, 2) is 4, not 3 as shown in your table:
>>> (2 + 6 + 4) / 3
4.0
>>> df["New column"] = df["Existing column"].rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down:
df["Existing column"]: select "Existing column" from the dataframe
.rolling(13): starting with the first 13 rows, we move a sliding window across the data. So first we encounter rows 0-12, then rows 1-13, then 2-14, and so on.
.apply(...): for each of those rolling sections, we apply a function to the section (in this case, the function we're applying is the lambda).
lambda x: x.iloc[[0, 4, 8]].mean(): from each rolling section, extract the 0th, 4th, and 8th values (corresponding to rows 1, 5, and 9) and return their mean.
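One way to convince yourself this matches a shift-based formulation (a small sketch using the question's 13 sample values): positions 0, 4, and 8 of a 13-row window are exactly 12, 8, and 4 rows before the window's last row.

```python
import pandas as pd

s = pd.Series([2, 4, 3, 1, 6, 4, 8, 2, 4, 9, 2, 4, 3])

# Rolling version: positions 0, 4, 8 within each 13-row window
rolled = s.rolling(13).apply(lambda x: x.iloc[[0, 4, 8]].mean())
# Shift version: values 12, 8 and 4 rows before the current row
shifted = (s.shift(12) + s.shift(8) + s.shift(4)) / 3

print(rolled.equals(shifted))  # True
print(rolled.iloc[12])         # 4.0
```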
In order to work on your dataframe in chunks (or groups) instead of a sliding window, you can apply the same logic with the .groupby method (instead of .rolling).
>>> groups = np.arange(len(df)) // 13 # defines groups as chunks of 13 rows
>>> averages = (
df.groupby(groups)["Existing column"]
.apply(lambda x: x.iloc[[0, 4, 8]].mean())
)
>>> averages.index = (averages.index + 1) * 13 - 1
>>> df["New column"] = averages
>>> df
Row number Existing column New column
0 1 2 NaN
1 2 4 NaN
2 3 3 NaN
3 4 1 NaN
4 5 6 NaN
5 6 4 NaN
6 7 8 NaN
7 8 2 NaN
8 9 4 NaN
9 10 9 NaN
10 11 2 NaN
11 12 4 NaN
12 13 3 4.0
breaking it down now:
groups = np.arange(len(df)) // 13: creates an array used to chunk our dataframe into groups. This array is essentially 13 0s, followed by 13 1s, followed by 13 2s... until it is the same length as the dataframe. In this single-chunk example, it is just an array of 13 0s.
df.groupby(groups)["Existing column"]: group the dataframe according to the groups defined above and select "Existing column".
.apply(lambda x: x.iloc[[0, 4, 8]].mean()): Conceptually the same as before, except we're applying to each grouping instead of a sliding window.
averages.index = (averages.index + 1) * 13 - 1: this part may seem a little odd, but we're ensuring that our selected averages line up with the original dataframe correctly. We want the average from group 0 (index value 0 in the averages Series) to align with row 12; if we had another group (group 1), we would want it to align with row 25 in the original dataframe. A little index math does this transformation.
df["New column"] = averages: since we already matched up our indices, pandas takes care of the actual alignment of these new values under the hood for us.
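The index math above can be verified directly (a small sketch): group numbers map to the last row of each 13-row chunk.

```python
import pandas as pd

# Group 0 -> row 12, group 1 -> row 25, group 2 -> row 38, ...
group_index = pd.Index([0, 1, 2])
print(((group_index + 1) * 13 - 1).tolist())  # [12, 25, 38]
```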
I asked something similar yesterday, but I had to rephrase the question and change the dataframes I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value':[20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id. If a location id appears in an island's list, then that island_id should be added in a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
'island_id':[10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample from the dataframe I have. The real dataframe has 13,000,000 rows and 4 columns. How can this be achieved efficiently? Is there a Pythonic way to do it? I tried using for loops, but it took too long and still didn't work. I would really appreciate a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
left_on="location_id",
right_on="location",
how = "left").drop("location", axis=1)
The result would be (note location_id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
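Given the size of the dataframe (13,000,000 rows), a merge-free alternative sketch is to turn the exploded lookup into a Series and use .map, which keeps the original row order and duplicated ids as they are. This assumes each location_id belongs to exactly one island; the small frames below are trimmed-down stand-ins for the question's data.

```python
import pandas as pd

df_location = pd.DataFrame({
    'location_id': [1, 1, 2, 3, 4],
    'temperature_value': [20, 21, 24, 28, 31],
})
df_islands = pd.DataFrame({
    'island_id': [10, 20, 30],
    'list_of_locations': [[1], [2, 3], [4]],
})

# location -> island lookup Series, indexed by location id
lookup = (df_islands.explode('list_of_locations')
                    .set_index('list_of_locations')['island_id'])
df_location['island_id'] = df_location['location_id'].map(lookup)
print(df_location['island_id'].tolist())  # [10, 10, 20, 20, 30]
```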
I have a data sheet with about 1700 columns and 100 rows, each row keyed by a unique identifier. It is survey data: every employee of an organization answers the same 9 questions, but the answers are compiled into one wide row per organization. Is there a way in Python/pandas to stack this data vertically instead of the elongated format along the x-axis it is in now? I am currently cutting and pasting.
You can reshape the underlying numpy array and reindex with the proper companies:
import numpy as np
import pandas as pd

# sample data, assuming the index is the company
df = pd.DataFrame(np.arange(36).reshape(2, -1))
# new index: repeat each company once per block of 9 answers
idx = df.index.repeat(df.shape[1] // 9)
# new data: one row of 9 answers per block
new_df = pd.DataFrame(df.values.reshape(-1, 9), index=idx)
print(new_df)
Output:
0 1 2 3 4 5 6 7 8
0 0 1 2 3 4 5 6 7 8
0 9 10 11 12 13 14 15 16 17
1 18 19 20 21 22 23 24 25 26
1 27 28 29 30 31 32 33 34 35
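An equivalent pandas-only sketch, assuming the wide columns are just repeated blocks of the same 9 questions: relabel the columns with a (block, question) MultiIndex, then stack the block level into the row index.

```python
import numpy as np
import pandas as pd

# Same sample data: 2 companies, 18 columns = 2 blocks of 9 answers
df = pd.DataFrame(np.arange(36).reshape(2, -1))
n_blocks = df.shape[1] // 9

# Columns become (block, question) pairs: (0,0)...(0,8), (1,0)...(1,8)
df.columns = pd.MultiIndex.from_product([range(n_blocks), range(9)])
stacked = df.stack(level=0)
print(stacked.shape)  # (4, 9)
```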
I am trying to add an empty column after the 3rd column of my dataframe, which contains 5 columns. Example:
Fname,Lname,city,state,zip
mike,smith,new york,ny,11101
This is what I have and below I am going to show what I want it to look like.
Fname,Lname,new column,city,state,zip
mike,smith,,new york,ny,11101
I don't want to populate that column with data; all I want to do is add the extra column to the header, so the data will show the blank column (i.e. ',,').
I've seen examples where a new column is added to the end of a dataframe, but not at a specific position.
You should use
df.insert(loc, column, value)
with loc being the insertion index, column the column name, and value its value(s).
For an empty column:
df.insert(loc=2, column='new col', value=['' for i in range(df.shape[0])])
Use reindex or column filtering
df = pd.DataFrame(np.arange(50).reshape(10,-1), columns=[*'ABCDE'])
df['z']= np.nan
df[['A','z','B','C','D','E']]
OR
df.reindex(['A','z','B','C','D','E'], axis=1)
Output:
A z B C D E
0 0 NaN 1 2 3 4
1 5 NaN 6 7 8 9
2 10 NaN 11 12 13 14
3 15 NaN 16 17 18 19
4 20 NaN 21 22 23 24
5 25 NaN 26 27 28 29
6 30 NaN 31 32 33 34
7 35 NaN 36 37 38 39
8 40 NaN 41 42 43 44
9 45 NaN 46 47 48 49
You can simply go for df.insert():
import pandas as pd
data = {'Fname': ['mike'],
'Lname': ['smith'],
'city': ['new york'],
'state': ['ny'],
'zip': [11101]}
df = pd.DataFrame(data)
df.insert(1, "Address", '', True)
print(df)
Output:
Fname Address Lname city state zip
0 mike smith new york ny 11101