Merge rows with index ±1 of the current row - python
I have quite an interesting question. I am trying to merge rows that are too close to each other. What counts as "too close" is of course up to you, but here I want to merge rows whose indices are within ±1 of each other. I have this dataframe:
index  Händelse                             Time      Fuel level (%)  Km driven (km)  Difference (%)
61     Bränslenivåökning vid stillastående  20210601  100             1325217         73
124    Bränslenivåökning vid stillastående  20210601  93              1325708         63
125    Position                             20210601  97              1325708         4
126    Position                             20210601  100             1325720         3
176    Bränslenivåökning vid stillastående  20210602  100             1326038         46
234    Bränslenivåökning vid stillastående  20210603  90              1326528         56
235    Position                             20210603  96              1326528         6
236    Position                             20210603  100             1326540         4
301    Bränslenivåökning vid stillastående  20210603  100             1327019         77
360    Position                             20210603  42              1327510         9
361    Bränslenivåökning vid stillastående  20210603  92              1327510         50
362    Position                             20210604  100             1327513         8
436    Bränslenivåökning vid stillastående  20210604  100             1328013         72
499    Bränslenivåökning vid stillastående  20210606  87              1328504         57
500    Position                             20210606  98              1328506         11
501    Position                             20210606  100             1328516         2
...
As you can see from the index, there are multiple occurrences where a row is followed by another one with a very small time difference (I gather the data at a 10-minute interval, which is not shown in the Time column but can be seen from the index: for example, rows 124, 125 and 126 are close to each other). Because of this small time difference, I would like to sum the Difference column for these rows, but not "Km driven", "Fuel level" or "Time". In conclusion, taking 124, 125 and 126 as an example, I would like the output to be:
index  Händelse                             Time      Fuel level (%)  Km driven (km)  Difference (%)
126    Bränslenivåökning vid stillastående  20210601  100 (from 126)  1325720 (126)   70 (124+125+126)
To quickly explain what is happening in the data: each row is a time stamp at which a change in the fuel tank took place, which lets the analyst assume that a refueling process occurred. However, sometimes a refueling process takes longer than my sampling interval, so one refueling gets logged as several separate positive changes in the fuel tank (like rows 124, 125, 126). Also, I can't change the time interval.
Hopefully this is enough information. Thank you in advance!
CURRENT CODE
from tkinter import Tk # from tkinter import Tk for Python 3.x
from tkinter.filedialog import askopenfilename
import pandas as pd
Tk().withdraw()
filepathname1 = askopenfilename()
filepathname2 = askopenfilename()
print("You have chosen to mix", filepathname1, "and", filepathname2)
pd.set_option("display.max_rows", None, "display.max_columns", 10)
df1 = pd.read_excel(
    filepathname1, "CWA107 Event", na_values=["NA"], skiprows=1, usecols="A, B, D, E, F"
)
df2 = pd.read_excel(
    filepathname2,
    na_values=["NA"],
    skiprows=1,
    usecols=["Tankad mängd diesel", "Unnamed: 3"],
)
# Change in fuel level between consecutive readings
df1["Difference (%)"] = df1["Bränslenivå (%)"].diff()
# Renames time-column so that they match
df2.rename(columns={"Unnamed: 3": "Tid"}, inplace=True)
# Drop NaN
df2.dropna(inplace=True)
# Drop NaN
df1.dropna(inplace=True)
# Filters out the rows with a difference smaller than 2
df1filt = df1[df1["Difference (%)"] >= 2].copy()  # .copy() avoids SettingWithCopyWarning below
print(len(df1filt))
# Converts time-column to only year, month and date.
df1filt["Tid"] = pd.to_datetime(df1filt["Tid"]).dt.strftime("%Y%m%d").astype(str)
print(df1filt)
df1filt.reset_index(level=0, inplace=True)
filepathname3 = askopenfilename()
df1filt.to_excel(filepathname3, index=False)
input()
So I solved this problem by creating a new column, 'Match', based on the difference between consecutive values of my row column (formerly the index column in the dataframe above). If a row is more than 1 away from the previous one, its 'Match' value is set to 0. Those 0s and 1s tell me which rows to merge and which to leave alone.
If the value is 0, the actual amount refueled is just the current row's value (it is not merged with other rows) and the previous run is marked as summed. If the value is between 1 and 4, the differences are added up; this repeats until the loop hits a row with the value 0, at which point the accumulated sum is written back and the row is marked as "SUMMED".
Here is the code (feel free to make changes):
# Difference between consecutive row numbers; gaps larger than 1 become 0
df1filt["Match"] = df1filt["row"].diff()
df1filt.loc[df1filt["Match"] > 1, "Match"] = 0

thevalue = 0
for currentrow in range(len(df1filt)):
    if df1filt.loc[currentrow, "Match"] == 0.0:
        # A gap: write the accumulated sum back to the previous run and restart
        df1filt.loc[currentrow - 1, "Difference (%)"] = thevalue
        df1filt.loc[currentrow - 1, "Match"] = "SUMMED"
        thevalue = df1filt.loc[currentrow, "Difference (%)"]
    if df1filt.loc[currentrow, "Match"] >= 1.0 and df1filt.loc[currentrow, "Match"] <= 4:
        # Consecutive row: keep accumulating
        thevalue += df1filt.loc[currentrow, "Difference (%)"]
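For comparison, the same merge can be done without an explicit loop by labelling each run of consecutive row numbers with a cumulative sum over the gaps and then aggregating per run. This is a minimal sketch on made-up data mirroring the question (the column names, and a `row` column holding the original index, are assumed from the post):

```python
import pandas as pd

# Hypothetical data mirroring the question: 'row' holds the original index;
# consecutive values (124, 125, 126) belong to one refueling event.
df = pd.DataFrame({
    "row": [61, 124, 125, 126, 176],
    "Fuel level (%)": [100, 93, 97, 100, 100],
    "Km driven (km)": [1325217, 1325708, 1325708, 1325720, 1326038],
    "Difference (%)": [73, 63, 4, 3, 46],
})

# A new group starts whenever the gap to the previous row is more than 1.
group = df["row"].diff().ne(1).cumsum()

# Keep the last row of each group, but sum 'Difference (%)' over the group.
merged = df.groupby(group).agg({
    "row": "last",
    "Fuel level (%)": "last",
    "Km driven (km)": "last",
    "Difference (%)": "sum",
}).reset_index(drop=True)
print(merged)
```

Here rows 124–126 collapse into one row carrying 126's fuel level and odometer reading and the summed difference of 70, matching the desired output above.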
Related
Get the sum of absolutes of columns for a dataframe
If I have a dataframe and I want to sum the values of the columns, I can do something like:

import pandas as pd

studentdetails = {
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, 85, 70, 95, 100],
    "science": [85, 95, 80, 90, 75, 100],
    "english": [90, 85, 80, 70, 95, 100],
}
index_labels = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6']
df = pd.DataFrame(studentdetails, index=index_labels)
print(df)

df3 = df.sum()
print(df3)

col_list = ['studentname', 'mathantics', 'science']
print(df[col_list].sum())

How can I do something similar, but instead of the plain sum get the sum of absolute values (which in this particular case would be the same) of some columns? I tried abs in several ways but it did not work.

Edit:

    studentname  mathantics  science  english
r1          Ram          80       85       90
r2          Sam          90       95      -85
r3        Scott         -85       80       80
r4          Ann          70       90       70
r5         John          95      -75       95
r6         Bobo         100      100      100

Expected output:

mathantics    520
science       525
english       520

Edit2: The col_list cannot include string value columns.
You need numeric columns for absolute values:

col_list = df.columns.difference(['studentname'])
df[col_list].abs().sum()

df.set_index('studentname').abs().sum()

df.select_dtypes(np.number).abs().sum()
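Putting the answer together with the edited sample data (the negatives from the question's edit), a runnable version looks like this:

```python
import pandas as pd

# Sample data from the question's edit, including the negative values.
df = pd.DataFrame({
    "studentname": ["Ram", "Sam", "Scott", "Ann", "John", "Bobo"],
    "mathantics": [80, 90, -85, 70, 95, 100],
    "science": [85, 95, 80, 90, -75, 100],
    "english": [90, -85, 80, 70, 95, 100],
}, index=["r1", "r2", "r3", "r4", "r5", "r6"])

# Exclude the string column, then take absolute values before summing.
col_list = df.columns.difference(["studentname"])
totals = df[col_list].abs().sum()
print(totals)
```

This reproduces the expected output: mathantics 520, science 525, english 520.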
Extract information from an Excel file (by updating arrays) with Excel / Python
I have an Excel file with thousands of columns in the following format:

Member No.    X    Y    Z
1000         25   60  -30
            -69   38   68
             45    2   43
1001         24   55   79
              4   -7   89
             78   51   -2
1002         45  -55  149
             94   77 -985
             -2  559   56

I need a way to get a new table with the absolute maximum value from each column per member. In this example, something like:

Member No.    X    Y    Z
1000         69   60   68
1001         78   55   89
1002         94  559  985

I have tried it in Excel (using VLOOKUP to find the "Member Number" in the first row and then HLOOKUP to find the values in the rows thereafter), but the problem is that the HLOOKUP command is not automatically updated with the new array (the array in which member number 1001 is), so my solution works for member 1000 but not for 1001 and 1002; it always searches for the new value ONLY in the first row (the row with member number 1000). I also tried reading the file with Python, but I am not well-versed enough to make much headway: once the dataset has been read, how do I read the next 3 rows and get the (absolute) maximum in each column? Can someone please help? Solution required in Python 3 or Excel (ideally Excel 2014).
The below solution will get you your desired output using Python. I first ffill to fill in the blanks in your Member No. column (filling downward along the index). Then I convert the dataframe values to positive with abs (the original post had `df = abs(df)`, which referenced an undefined name and discarded the result). Lastly, I take the max per member. Assuming your dataframe is called data:

import pandas as pd

data['Member No.'] = data['Member No.'].ffill().astype(int)
data = data.abs()
res = data.groupby('Member No.').max().reset_index()

Which will print:

   Member No.    X    Y    Z    A    B    C
0        1000   69   60   68   60   74   69
1        1001   78   55   89   78   92   87
2        1002   94  559  985  985  971  976

Note that I added extra columns to your sample data to make sure that all the columns return their max() value.
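A self-contained sketch of the same approach, recreating the blank-cell layout from the question for the first two members (the data values are taken from the post; only the frame construction is added):

```python
import pandas as pd
import numpy as np

# Recreate the question's layout: 'Member No.' is blank (NaN) on the
# continuation rows, and each member spans three rows.
data = pd.DataFrame({
    "Member No.": [1000, np.nan, np.nan, 1001, np.nan, np.nan],
    "X": [25, -69, 45, 24, 4, 78],
    "Y": [60, 38, 2, 55, -7, 51],
    "Z": [-30, 68, 43, 79, 89, -2],
})

# Fill the member numbers downward, take absolute values, then max per member.
data["Member No."] = data["Member No."].ffill().astype(int)
res = data.set_index("Member No.").abs().groupby("Member No.").max().reset_index()
print(res)
```

For member 1000 this yields X=69, Y=60, Z=68, and for 1001 X=78, Y=55, Z=89, matching the desired table.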
pandas group by multiple columns and remove rows based on multiple conditions
I have a dataframe which is as follows:

imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130

It's a csv dump. From this I want to group by imagename and brandname. Wherever the values in xdiff and ydiff are less than 10, remove the second line. For example, from the first two lines I want to delete the second line, and similarly from lines 3 and 4 I want to delete line 4. I could do this quickly in R using dplyr's group by, lag and lead functions; however, I am not sure how to combine the corresponding functions in python. This is what I have tried so far:

df[df.groupby(['imagename','brandname']).xdiff.transform() <= 10]

I'm not sure what function I should call within transform, nor how to include ydiff too. The expected output is as follows:

imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
You can take the individual groupby frames and apply the conditions through the apply function:

#df.groupby(['imagename','brandname'],group_keys=False).apply(lambda x: x.iloc[range(0,len(x),2)] if x['xdiff'].lt(10).any() else x)

df.groupby(['imagename','brandname'],group_keys=False).apply(
    lambda x: x.iloc[range(0,len(x),2)]
    if (x['xdiff'].lt(10).any() and x['ydiff'].lt(10).any())
    else x
)

Out:

                             imagename locationName brandname    x    y    w    h  xdiff  ydiff
2  95-20180407-215120-235505-00050.jpg        Shirt      DHFL    3  450   94   45      2    -41
5  95-20180407-215120-235505-00050.jpg        Shirt      DHFL  446  349   99   90    279     30
7  95-20180407-215120-235505-00050.jpg        Shirt   GOIBIBO  559  212   70  106    104   -130
0  95-20180407-215120-235505-00050.jpg        Shirt   SAMSUNG    0  490  177   82      0      0
4  95-20180407-215120-235505-00050.jpg       DUGOUT      VIVO  167  319   36   38    162   -132
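An alternative sketch that also preserves the original row order: keep the first row of each (imagename, brandname) group, and after that drop any row whose xdiff and ydiff are both small in absolute value (the "both within 10" reading of the question is an assumption, inferred from the expected output):

```python
import pandas as pd
from io import StringIO

csv = """imagename,locationName,brandname,x,y,w,h,xdiff,ydiff
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,0,490,177,82,0,0
95-20180407-215120-235505-00050.jpg,Shirt,SAMSUNG,1,491,182,78,1,1
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,3,450,94,45,2,-41
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,5,451,95,48,2,1
95-20180407-215120-235505-00050.jpg,DUGOUT,VIVO,167,319,36,38,162,-132
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,446,349,99,90,279,30
95-20180407-215120-235505-00050.jpg,Shirt,DHFL,455,342,84,93,9,-7
95-20180407-215120-235505-00050.jpg,Shirt,GOIBIBO,559,212,70,106,104,-130
"""
df = pd.read_csv(StringIO(csv))

# Keep the first row of each (imagename, brandname) group; afterwards drop any
# row whose xdiff and ydiff are both within 10 of zero (a near-duplicate box).
first = df.groupby(["imagename", "brandname"]).cumcount() == 0
near_dup = df["xdiff"].abs().lt(10) & df["ydiff"].abs().lt(10)
out = df[first | ~near_dup]
print(out)
```

On this sample, the surviving rows are exactly those in the question's expected output.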
Iterating over pandas rows to get minimum
Here is my dataframe:

Date        cell  tumor_size(mm)
25/10/2015   113              51
22/10/2015   222              50
22/10/2015   883              45
20/10/2015   334              35
19/10/2015   564              47
19/10/2015   123              56
22/10/2014   345              36
13/12/2013   456              44

What I want to do is compare the sizes of tumors detected on different days. Let's consider cell 222 as an example: I want to compare its size to cells detected on earlier days. I will not compare it with cell 883, because they were detected on the same day, nor with cell 113, because it was detected later. As my dataset is too large, I have to iterate over the rows. Explained in a non-pythonic way, for cell 222:

get_size_distance (absolute value): (50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (50 - 44 = 6)
get_minimum = 3, obtained when comparing with 564, so I will name 564 as the pair for cell 222

Then do the same for cell 883, and so on. The resulting output should look like this:

Date        cell  tumor_size(mm)  pair  size_difference
25/10/2015   113              51   222                1
22/10/2015   222              50   123                6
22/10/2015   883              45   456                1
20/10/2015   334              35   345                1
19/10/2015   564              47   456                3
19/10/2015   123              56   456               12
22/10/2014   345              36   456                8
13/12/2013   456              44   NaN              NaN

I will really appreciate your help.
It's not pretty, but I believe it does the trick:

from datetime import datetime
import pandas as pd

a = pd.read_clipboard()

# Cut off last row since it was a faulty date. You can skip this.
df = a.copy().iloc[:-1]

# Convert to dates and order just in case (not really needed I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)

# Rename column
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})

# These will be our lists of pairs and size differences.
pairs = []
diffs = []

# Loop over all unique dates
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Otherwise take the lowest absolute difference and fill it in.
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])

df['pair'] = pairs
df['size_difference'] = diffs

returns:

        Date  cell  tumor_size   pair  size_difference
0 2015-10-25   113          51  222.0              1.0
1 2015-10-22   222          50  564.0              3.0
2 2015-10-22   883          45  564.0              2.0
3 2015-10-20   334          35  345.0              1.0
4 2015-10-19   564          47  345.0             11.0
5 2015-10-19   123          56  345.0             20.0
6 2014-10-22   345          36    NaN              NaN
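For modest dataset sizes, the same pairing can be expressed without the nested loops via a cross join: pair every cell with every cell from a strictly earlier date, then keep the smallest absolute size difference per cell. This is only a sketch of the idea on three rows (it needs pandas ≥ 1.2 for how="cross", and a cross join is quadratic in the number of rows, so it is not a fit for very large data):

```python
import pandas as pd

# Three rows of hypothetical data in the question's shape.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2015-10-22", "2015-10-19", "2014-10-22"]),
    "cell": [222, 564, 345],
    "tumor_size": [50, 47, 36],
})

# Pair every cell with every other cell, then keep only strictly earlier dates.
pairs = df.merge(df, how="cross", suffixes=("", "_other"))
pairs = pairs[pairs["Date_other"] < pairs["Date"]]

# Smallest absolute size difference per cell picks the partner.
pairs["size_difference"] = (pairs["tumor_size"] - pairs["tumor_size_other"]).abs()
best = pairs.sort_values("size_difference").groupby("cell").first()
print(best[["cell_other", "size_difference"]])
```

Cell 222 pairs with 564 (difference 3), 564 pairs with 345 (difference 11), and 345 has no earlier partner, so it simply does not appear in `best`.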
filter pandas dataframe based in another column
This might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this:

Volumes

Index  ProductA  ProductB  ProductC  ProductD  Limit
0           100       300       400        78    100
1           110       370        20        30    100
2            90       320       200       121    100
3           150       320       410        99    100
...

Prices

Index  ProductA  ProductB  ProductC  ProductD  Limit
0            50       110        30        90      0
1            51       110        29        99      0
2            49       120        25        88      0
3            51       110        22        96      0
...

I want to assign 0 to each cell of the Prices dataframe whose corresponding Volume is less than the value in the Limit column, so the ideal output would be:

Prices

Index  ProductA  ProductB  ProductC  ProductD  Limit
0            50       110        30         0      0
1            51       110         0         0      0
2             0       120        25        88      0
3            51       110        22         0      0
...

I tried:

import pandas as pd
import numpy as np

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96], 'Limit': [0]*4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99], 'Limit': [100]*4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit] = 0

but I do not obtain any changes to the Prices dataframe... obviously I'm having a hard time understanding boolean slicing; any help would be great.
The problem is in Prices[Volumes > Volumes.Limit]=0: comparing a whole DataFrame against a Series does not line the Limit up row by row, and the comparison also points the wrong way for what you want (you want to zero prices where the volume is below the limit). Since Limit varies per row, you can use, for example, apply like the following:

Prices[Volumes.apply(lambda x: x < x.Limit, axis=1)] = 0
You can use a mask to solve this problem; I am not an expert either, but this solution does what you want to do (using .loc and .ge with axis=0 so each row is compared against its own Limit; the older .ix indexer is deprecated):

test = Volumes.loc[:, 'ProductA':'ProductD'].ge(Volumes['Limit'], axis=0)
final = Prices[test].fillna(0)
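A runnable version of the row-wise comparison, built on the sample frames from the question (DataFrame.lt with axis=0 aligns the Limit series against the rows):

```python
import pandas as pd

# The question's sample frames.
Prices = pd.DataFrame({
    "ProductA": [50, 51, 49, 51], "ProductB": [110, 110, 120, 110],
    "ProductC": [30, 29, 25, 22], "ProductD": [90, 99, 88, 96],
    "Limit": [0] * 4,
})
Volumes = pd.DataFrame({
    "ProductA": [100, 110, 90, 150], "ProductB": [300, 370, 320, 320],
    "ProductC": [400, 20, 200, 410], "ProductD": [78, 30, 121, 99],
    "Limit": [100] * 4,
})

# Compare each row against that row's Limit (axis=0 aligns on the index),
# then zero the prices where the volume falls short.
mask = Volumes.lt(Volumes["Limit"], axis=0)
Prices[mask] = 0
print(Prices)
```

This reproduces the desired output: for example, row 0's ProductD price becomes 0 (volume 78 < 100) while its ProductA price stays 50 (volume 100 is not below the limit).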