I am new to Python and pandas. What I have is a dataframe like this:
Id Offset Feature
0 0 2
0 5 2
0 11 0
0 21 22
0 28 22
1 32 0
1 38 21
1 42 21
1 52 21
1 55 0
1 58 0
1 62 1
1 66 1
1 70 1
2 73 0
2 78 1
2 79 1
From this I am trying to get, for each zero in the Feature column, the previous three values of that column along with their offsets.
So the output would be like:
Offset Feature
11 2
21 22
28 22
// These three values precede the 0 that is at offset 32.
In the same dataframe, for the next place where there is a 0:
38 21
42 21
52 21
58 0
62 1
66 1
Is there any way through which I can get this?
Thanks.
This should be done on the basis of the document Id.
I am also quite new to pandas, but I have attempted to answer your question.
I populated your data as comma-separated values in data.csv and then used slicing to get the previous 3 rows.
import pandas as pd

df = pd.read_csv('./data.csv')
for index in df.loc[df['Feature'] == 0].index:
    print(df.loc[index - 3:index - 1])
The output looks like this. The leftmost column is the index, which you can discard if you don't want it. Is this what you were looking for?
Offset Feature
2 11 2
3 21 22
4 28 22
Offset Feature
6 38 21
7 42 21
8 52 21
Offset Feature
7 42 21
8 52 21
9 55 0
Offset Feature
11 62 1
12 66 1
13 70 1
Note: There might be a more pythonic way to do this.
You can take the 3 rows preceding each 0 value in the column using loc.
Follow the code:
import pandas as pd

df = pd.read_csv("<path_of_the_file>")
zero_indexes = list(df[df['Feature'] == 0].index)
for each_zero_index in zero_indexes:
    df1 = df.loc[each_zero_index - 3:each_zero_index]
    print(df1)  # This dataframe has 4 records: your previous three plus the zero record.
Output:
Offset Feature
2 11 2
3 21 22
4 28 22
5 32 0
Offset Feature
6 38 21
7 42 21
8 52 21
9 55 0
Offset Feature
7 42 21
8 52 21
9 55 0
10 58 0
Offset Feature
11 62 1
12 66 1
13 70 1
14 73 0
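Note that both answers above slice by position alone, so a window can leak across document boundaries. A small sketch of one way to also respect the Id column (my own addition, not from either answer, using the column names from the question):

for doc_id, group in df.groupby('Id'):
    for idx in group.index[group['Feature'] == 0]:
        # up to three rows before the zero, staying within this document Id
        window = group.loc[:idx].iloc[-4:-1]
        print(window[['Offset', 'Feature']])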
I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value':[20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to the location_id. If the values match, then the island_id for that location should be appended to a new column in df_location.
(Note: I don't want to remove any duplicated id; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
'humidity_value':[60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
'island_id':[10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample from the dataframe that I have. What I actually have is a dataframe of about 13,000,000 rows and 4 columns. How can this be achieved in an efficient way? Is there a pythonic way to do it? I tried using for loops, but it takes too long and still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns = {"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
left_on="location_id",
right_on="location",
how = "left").drop("location", axis=1)
The result would be (note location-id 12 on row 3):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
The data frame consists of signal values with 1700+ observations. I need to subset the values of each cycle into a list called templates. The signal values that I am interested in start from 0 -> reach a positive peak -> come back to zero -> reach a negative peak -> come back to zero again, similar to a sine wave.
Below is the sample dataframe:
df
ID S S_lag
1 33 0
2 33 0
3 33 0
4 33 0
5 33 0
6 34 1
7 33 -1
8 33 0
9 34 1
10 34 0
11 34 0
12 34 0
13 35 1
14 41 6
15 52 11
16 70 18
17 72 2
18 73 1
19 74 1
20 74 0
21 75 1
22 86 1
23 84 -2
24 64 -20
25 43 -21
26 35 -8
25 31 -4
27 29 -2
28 27 -2
29 26 -1
30 26 0
Based on the above example dataframe, my values of interest are from ID 12 to ID 30.
So, as required, templates[1] consists of the first cycle's values, templates[2] consists of the second cycle's values, and so on.
The plot of my S_lag values is below, and my values of interest are within the red box.
NOTE: I need the corresponding values of S to be subset, not S_lag.
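A minimal sketch of one possible approach (my own, assuming the ID, S, and S_lag columns shown above, and that a new cycle starts where S_lag swings from negative back to positive):

import numpy as np
import pandas as pd

# df is assumed to hold the ID, S, and S_lag columns shown above
sign = np.sign(df['S_lag'])
sign = sign.replace(0, np.nan).ffill().fillna(0)  # carry the last nonzero sign through flat stretches
new_cycle = (sign == 1) & (sign.shift() == -1)    # sign flips from negative back to positive
templates = [g['S'].tolist() for _, g in df.groupby(new_cycle.cumsum())]

Here templates[0] holds the leading flat segment before the first full cycle, so it may need to be discarded, and small jitter in S_lag may need smoothing first.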
I have a dataframe that looks like the following (I have already sorted it by the item column). For example, items 1-10, 11-20, ... (every 10 items) are in the same category, and I want to find the item in each category that has the highest score and return it.
What is the most efficient way to do that?
item score
1 1 10
3 4 1
4 6 6
39 11 2
8 12 1
9 13 1
10 15 24
11 17 9
12 18 12
13 20 7
14 22 1
59 25 3
18 28 3
19 29 2
22 34 2
23 37 1
24 38 3
25 39 2
26 40 2
27 42 3
29 45 1
31 48 1
32 53 4
33 58 4
Assuming your dataframe is stored in df:
import numpy as np

# extend the bins one step past the max so the last items are not dropped
g = df.groupby(pd.cut(df.item, np.arange(1, df.item.max() + 10, 10), right=False))
Get the row index of the max score in each category:
max_score_ids = g.score.agg('idxmax')
This gives you the indices of the rows that contain the max score in each category:
item
[1, 11) 1
[11, 21) 10
[21, 31) 59
[31, 41) 24
[41, 51) 27
[51, 61) 32
Then get the items associated with these indices:
df.loc[max_score_ids].item
1 1
10 15
59 25
24 38
27 42
32 53
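An equivalent sketch that avoids building explicit bin edges (my own alternative, using integer division to form the 1-10, 11-20, ... buckets):

max_score_ids = df.groupby((df.item - 1) // 10).score.idxmax()
df.loc[max_score_ids, 'item']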
I have a dataframe df
df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min': [10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day': [15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})
that looks like
id min day
0 a 10 15
1 a 17 15
2 a 21 15
3 a 30 15
4 a 50 15
5 a 57 17
6 a 58 17
7 b 15 41
8 b 17 41
9 b 19 41
10 b 19 41
11 b 19 41
12 b 19 41
13 b 19 41
14 b 25 57
15 b 26 57
16 b 26 57
I want a new column that categorizes the data based on the id and the relationship between consecutive rows, as follows: if the min value difference between consecutive rows is less than 8 and the day value is the same, I want to assign them to the same group. So my output would look like:
id min day category
0 a 10 15 1
1 a 17 15 1
2 a 21 15 1
3 a 30 15 2
4 a 50 15 3
5 a 57 17 4
6 a 58 17 4
7 b 15 41 5
8 b 17 41 5
9 b 19 41 5
10 b 19 41 5
11 b 19 41 5
12 b 19 41 5
13 b 19 41 5
14 b 25 57 6
15 b 26 57 6
16 b 26 57 6
Hope this helps; let me know your views.
All the best.
import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','a','a','a','a','b','b','b','b','b','b','b','b','b','b'],
                   'min': [10,17,21,30,50,57,58,15,17,19,19,19,19,19,25,26,26],
                   'day': [15,15,15,15,15,17,17,41,41,41,41,41,41,41,57,57,57]})

# initialize the category counter
cat = 1
# the first row always starts in category 1
new_series = [cat]
# start the loop from 1, not 0, because each row is compared with the previous one
for i in range(1, len(df)):
    if df.iloc[i]['day'] == df.iloc[i-1]['day']:
        if df.iloc[i]['min'] - df.iloc[i-1]['min'] > 8:
            cat += 1
    else:
        cat += 1
    new_series.append(cat)
df['category'] = new_series
print(df)
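For a large frame, a vectorized sketch of the same rule (my own alternative, not part of the answer above): start a new category wherever the day changes or the min jump exceeds 8, then take a cumulative sum of those break points.

breaks = (df['day'] != df['day'].shift()) | (df['min'].diff() > 8)
df['category'] = breaks.cumsum()

The first row compares against NaN, so it registers as a break and the numbering starts at 1, matching the expected output.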
I can use pandas replace to replace values in a dataframe using a dictionary:
prod_dict = {1:'Productive',2:'Moderate',3:'None'}
df['val'].replace(prod_dict,inplace=True)
What do I do if I want to replace a set of values in the dataframe with a single number? E.g., I want to map all values from 1 to 20 to 1, all values from 21 to 40 to 2, and all values from 41 to 100 to 3. How do I specify this in a dictionary and use it in pandas replace?
You can do that using apply to traverse and apply a function on every element, with a lambda that maps each value to its replacement.
I will go through a quick example here.
First, I will create a dataframe to showcase the algorithm:
import pandas as pd

df = pd.DataFrame(range(50), columns=list('B'))
This function generates a list of the values between i and j, inclusive:
def genValues(i, j):
    return [x for x in range(j+1) if x >= i]
I will create a lambda function to map the values:
df['E']= df['B'].apply(lambda x: 1 if x in genValues(0,20) else 2 if x in genValues(21,40) else 3 if x in genValues(41,100) else x)
print(df)
The output:
B E
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 1
13 13 1
14 14 1
15 15 1
16 16 1
17 17 1
18 18 1
19 19 1
20 20 1
21 21 2
22 22 2
23 23 2
24 24 2
25 25 2
26 26 2
27 27 2
28 28 2
29 29 2
30 30 2
31 31 2
32 32 2
33 33 2
34 34 2
35 35 2
36 36 2
37 37 2
38 38 2
39 39 2
40 40 2
41 41 3
42 42 3
43 43 3
44 44 3
45 45 3
46 46 3
47 47 3
48 48 3
49 49 3
You can overwrite the column by assigning the result back to it:
df['B'] = df['B'].apply(lambda x: 1 if x in genValues(0,20) else 2 if x in genValues(21,40) else 3 if x in genValues(41,100) else x)
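A more idiomatic alternative for range-to-value mapping is pd.cut, which bins the values directly instead of scanning membership lists (a sketch assuming the same 1-20/21-40/41-100 ranges as above):

df['E'] = pd.cut(df['B'], bins=[0, 20, 40, 100], labels=[1, 2, 3], include_lowest=True).astype(int)

Values outside 0-100 would become NaN before the cast, so widen the bins if the data can exceed that range.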