Pandas Comparing Two Data Frames - python
I have two dataframes. I will explain my requirement in the form of a loop, because this is how I visualize the problem.
I realize there may be another solution, so if this can be done differently, please feel free to share! I am new to Pandas, so I'm struggling with this. Thank you in advance for looking at my question!!
I have 2 dataframes that have 3 columns: id, Odo, OdoLength. OdoLength is the running difference between consecutive Odo records, which I got using: abs(Df1['Odo'] - Df1['Odo'].shift(-1))
OldDataSet = {'id' : [10,20,30,40,50,60,70,80,90,100,110,120,130,140],'Odo': [-1.09,1.02,26.12,43.12,46.81,56.23,111.07,166.38,191.27,196.41,207.74,231.61,235.84,240.04], 'OdoLength':[2.11,25.1,17,3.69,9.42,54.84,55.31,24.89,5.14,11.33,23.87,4.23,4.2,4.09]}
NewDataSet = {'id' : [1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000,13000,14000],'Odo': [1.51,2.68,4.72,25.03,42,45.74,55.15,110.05,165.41,170.48,172.39,190.35,195.44,206.78], 'OdoLength':[1.17,2.04,20.31,16.97,3.74,9.41,54.9,55.36,5.07,1.91,17.96,5.09,11.34,23.89]}
FinalResultDataSet = {'DFOneId':[10,20,30,40,50,60,70,80,90,100,110], 'DFTwoID' : [1000,3000,4000,5000,6000,7000,8000,11000,12000,13000,14000], 'OdoDiff': [2.6,3.7,1.09,1.12,1.07,1.08,1.02,6.01,0.92,0.97,0.96], 'OdoLengthDiff':[0.94,4.79,0.03,0.05,0.01,0.06,0.05,6.93,0.05,0.01,0.02], 'OdoAndLengthDiff':[1.66,1.09,1.06,1.07,1.06,1.02,0.97,0.92,0.87,0.96,0.94]}
df1= pd.DataFrame(OldDataSet)
df2 = pd.DataFrame(NewDataSet)
FinalDf = pd.DataFrame(FinalResultDataSet)
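(As a quick check of the OdoLength values, assuming pandas is imported as pd, the column can be reproduced from Odo with the shift formula above; only the last row differs, since it has no following record and the formula yields NaN there:)
check = (df1['Odo'] - df1['Odo'].shift(-1)).abs()
print(check.head())  # 2.11, 25.10, 17.00, 3.69, 9.42 -- matches OdoLength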
The logic behind how to get the FinalDf is as follows: take Odo and OdoLength from df1 and compare them against the Odo and OdoLength columns in df2, taking the absolute difference of each column pair; the score for a pair of records is the difference between those two values. Match each df1 record to the df2 record with the lowest score, and for the next df1 record begin the comparison at the first df2 record that does not yet have a match. Any df2 record that is never the minimum for a df1 record is not included in the final dataset. For example, df1 ID 20 compared to df2 ID 2000 scores 21.4 (abs(1.02 - 2.68) = 1.66, abs(25.1 - 2.04) = 23.06, and 23.06 - 1.66 = 21.4), while df1 ID 20 compared to df2 ID 3000 scores 1.09 (abs(1.02 - 4.72) = 3.7, abs(25.1 - 20.31) = 4.79, and 4.79 - 3.7 = 1.09). In this case df2 ID 3000 is matched to df1 ID 20, and df2 ID 2000 is dropped because its difference was larger. At that point df2 ID 2000 is no longer considered for any other matches, so the next df1 comparison starts at df2 ID 4000, the next record without a match.
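In loop form, this is roughly what I mean (a minimal, unoptimized sketch; assuming df1 and df2 are built from the dictionaries above, it reproduces FinalResultDataSet on this sample data):
matches = []
start = 0  # index of the first df2 row that does not yet have a match
for _, r1 in df1.iterrows():
    best = None  # (score, df2 position, odo diff, length diff)
    for j in range(start, len(df2)):  # only unmatched df2 rows are candidates
        r2 = df2.iloc[j]
        odo_diff = abs(r1['Odo'] - r2['Odo'])
        len_diff = abs(r1['OdoLength'] - r2['OdoLength'])
        score = abs(odo_diff - len_diff)
        if best is None or score < best[0]:
            best = (score, j, odo_diff, len_diff)
    if best is not None:
        score, j, odo_diff, len_diff = best
        matches.append((r1['id'], df2.iloc[j]['id'],
                        round(odo_diff, 2), round(len_diff, 2), round(score, 2)))
        start = j + 1  # df2 rows skipped over are dropped for good
result = pd.DataFrame(matches, columns=['DFOneId', 'DFTwoID', 'OdoDiff',
                                        'OdoLengthDiff', 'OdoAndLengthDiff'])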
As I said, I am open to all suggestions!
Thanks!
You can use merge_asof.
Step 1: combine the dataframes
df1['match'] = df1.Odo + df1.OdoLength  # combined key: Odo + OdoLength
df2['match'] = df2.Odo + df2.OdoLength
out = pd.merge_asof(df1, df2, on='match', direction='nearest')  # both frames are already sorted by 'match'
out.drop_duplicates(['id_y'])  # keep only the first df1 row matched to each df2 row
Out[728]:
Odo_x OdoLength_x id_x match Odo_y OdoLength_y id_y
0 -1.09 2.11 10 1.02 1.51 1.17 1000
1 1.02 25.10 20 26.12 4.72 20.31 3000
2 26.12 17.00 30 43.12 25.03 16.97 4000
3 43.12 3.69 40 46.81 42.00 3.74 5000
4 46.81 9.42 50 56.23 45.74 9.41 6000
5 56.23 54.84 60 111.07 55.15 54.90 7000
6 111.07 55.31 70 166.38 110.05 55.36 8000
7 166.38 24.89 80 191.27 172.39 17.96 11000
8 191.27 5.14 90 196.41 190.35 5.09 12000
9 196.41 11.33 100 207.74 195.44 11.34 13000
10 207.74 23.87 110 231.61 206.78 23.89 14000
Step 2
Then you can do something like below to get your new column
out['OdoAndLengthDiff'] = out.OdoLength_x - out.OdoLength_y + out.Odo_x - out.Odo_y
BTW I did not drop the match column; once you have all the new values you need, you can drop it with out = out.drop(columns=['match'])
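If you want output shaped like the FinalResultDataSet in the question, the two steps can be chained and the diff columns added. Note the question's expected OdoAndLengthDiff is the absolute difference of the two absolute differences, which is not quite the signed expression above (a sketch; the column names mirror the question):
out = (pd.merge_asof(df1.assign(match=df1.Odo + df1.OdoLength),
                     df2.assign(match=df2.Odo + df2.OdoLength),
                     on='match', direction='nearest')
         .drop_duplicates('id_y'))
out['OdoDiff'] = (out.Odo_x - out.Odo_y).abs().round(2)
out['OdoLengthDiff'] = (out.OdoLength_x - out.OdoLength_y).abs().round(2)
out['OdoAndLengthDiff'] = (out.OdoDiff - out.OdoLengthDiff).abs().round(2)
FinalDf = (out.rename(columns={'id_x': 'DFOneId', 'id_y': 'DFTwoID'})
              [['DFOneId', 'DFTwoID', 'OdoDiff', 'OdoLengthDiff', 'OdoAndLengthDiff']])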
Related
How to add 2 columns from a dataframe to another while indexes do not match
I have 2 dataframes of different lengths: the first one has 1200 rows and the other only 1. The first one looks like this:
Date        Open   High   Low   Close  Adj Close  Volume
2012-01-09  70.40  50.20  9.40  71.5   1.8  1.8   9447.0
The second one looks like this:
Name  Marcet Cap.  Symbol  Symbol2  Boerse Info.  Periode  ISIN      WKN
Once  1            tpp.us  NaN      1             5Y       US010001  999000
I only want to add (append) 2 columns to the first one, which are ISIN and WKN:
Date        Open   High   Low   Close  Adj Close  Volume  ISIN    WKN
2012-01-09  70.40  50.20  9.40  71.5   1.8  1.8   944     US0101  999000
I already tried merge() and concat(), but I got a KeyError, and I also tried this, which doesn't work:
first['ISIN'] = second['ISIN'].values
How can I add the 2 columns to the other DF?
Assign the value with values[0] instead of values:
import pandas as pd
import io

data_string = """Date Open High Low Close Adj_Close Volume
2012-01-09 70.40 50.20 9.40 71.5 1.8 9447.0
2012-01-10 70.40 50.20 9.40 71.5 1.8 9447.0"""
first = pd.read_csv(io.StringIO(data_string), sep=r'\s+')

data_string = """Name Marcet_Cap. Symbol Symbol2 Boerse_Info. Periode ISIN WKN
Once 1 tpp.us NaN 1 5Y US010001 999000"""
second = pd.read_csv(io.StringIO(data_string), sep=r'\s+')

first['ISIN'] = second['ISIN'].values[0]  # works
first['WKN'] = second['WKN'].values[0]  # works
print(first)
Sample result:
         Date  Open  High  Low  Close  Adj_Close  Volume      ISIN     WKN
0  2012-01-09  70.4  50.2  9.4   71.5        1.8  9447.0  US010001  999000
1  2012-01-10  70.4  50.2  9.4   71.5        1.8  9447.0  US010001  999000
If I understood correctly, you have 2 dataframes: the first has 1200 rows and the second has 1 row, and you want to add 2 new columns to the first dataframe whose values are the values of the last 2 columns of the second dataframe. Because the dataframes have different numbers of rows, first create 2 lists with the same size as the first dataframe and then add them to it:
col1 = [seconddataframe['ISIN'][0] for i in range(len(firstdataframe))]
col2 = [seconddataframe['WKN'][0] for i in range(len(firstdataframe))]
firstdataframe['ISIN'] = col1
firstdataframe['WKN'] = col2
Just put the names of both dataframes in the correct places. I hope it helps. Best wishes
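A note on both answers above: assigning a scalar to a dataframe column broadcasts it to every row, so the list-building step is optional. A minimal sketch with hypothetical stand-in frames:
import pandas as pd

first = pd.DataFrame({'Open': [70.4, 70.4]})  # stands in for the 1200-row frame
second = pd.DataFrame({'ISIN': ['US010001'], 'WKN': ['999000']})

for col in ['ISIN', 'WKN']:
    first[col] = second[col].iloc[0]  # scalar is broadcast down the whole column
print(first)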
Merge different CSVs with different date ranges into a pandas dataframe indexed by date, with 0 for missing data
I have a problem I've been struggling with for a few days, and can't find my way out! I have a folder with many CSVs, each containing two columns: "date" (YYYY-MM-DD) and "value" (a float). The dates are usually a range of consecutive days (but some days might be missing), and each of these CSVs starts from a different date. I need to merge them into a single pandas dataframe with "date" as the index and then several columns like "csv1_value", "csv2_value", "csv3_value", etc. I've done it with the 'merge' command on 'date', which gives me a dataframe containing only the rows where the same "date" was found across all the CSVs. That is useful because some dates in the range might be missing from a file, and in that case I need that date to be deleted from the dataframe even if it's present in the other files. BUT I would also need the start of the range in the dataframe to be the oldest date I have, and if that date is missing in the others (because they start later), the value for those files should be 0. AND any date that is missing from one file's range should be filled with the latest value (useful to keep 0.00 in any file starting later until it actually has some value). A bit complex, so I'll try an example:
csv1:
"2020-01-01","1.01"
"2020-01-02","2.01"
"2020-01-03","3.01"
"2020-01-04","4.01"
"2020-01-05","5.01"
"2020-01-06","6.01"
"2020-01-07","7.01"
"2020-01-08","8.01"
"2020-01-09","9.01"
"2020-01-10","10.01"
csv2:
"2020-01-04","4.02"
"2020-01-05","5.02"
"2020-01-06","6.02"
"2020-01-08","8.02"
"2020-01-09","9.02"
"2020-01-10","10.02"
csv3:
"2020-01-03","3.03"
"2020-01-04","4.03"
"2020-01-05","5.03"
"2020-01-06","6.03"
"2020-01-07","7.03"
"2020-01-09","9.03"
"2020-01-10","10.03"
The resulting pandas dataframe should be:
"2020-01-01","1.01","0.00","0.00"
"2020-01-02","2.01","0.00","0.00"
"2020-01-03","3.01","0.00","3.03"
"2020-01-04","4.01","4.02","4.03"
"2020-01-05","5.01","5.02","5.03"
"2020-01-06","6.01","6.02","6.03"
"2020-01-07","7.01","6.02","7.03"
"2020-01-08","8.01","8.02","7.03"
"2020-01-09","9.01","9.02","9.03"
"2020-01-10","10.01","10.02","10.03"
Does anyone have an idea how I could achieve all this? My head is exploding...
You can do this using two outer joins, then fill NA with zeros:
df1 = pd.read_csv('csv1')
df2 = pd.read_csv('csv2')
df3 = pd.read_csv('csv3')
DF = pd.merge(df1, df2, how='outer', on='date')
DF = pd.merge(DF, df3, how='outer', on='date')
DF.fillna(0, inplace=True)
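Note: plain fillna(0) zeroes every gap, while the question also asks for interior gaps to be forward-filled. A hedged sketch generalizing the same merge to any number of files and adding the forward fill (the file names and the date/value headers are assumptions):
import functools
import pandas as pd

# rename each value column up front to avoid _x/_y suffix collisions
dfs = [pd.read_csv(f'csv{i}').rename(columns={'value': f'csv{i}_value'})
       for i in (1, 2, 3)]
DF = functools.reduce(lambda l, r: pd.merge(l, r, how='outer', on='date'), dfs)
DF = DF.sort_values('date').ffill().fillna(0)  # forward-fill interior gaps, zero leading NaNs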
My solution is designed to cope with an arbitrary number of input files (not only 3, as in the other solution).
Start with reading your input files, creating a list of DataFrames with proper names for the second column:
import glob
frames = []
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    frames.append(pd.read_csv(fn, parse_dates=[0], names=['Date', f'csv{i}_value']))
Then join them into a single DataFrame:
df = frames.pop(0)
while len(frames) > 0:
    df2 = frames.pop(0)
    df = df.join(df2.set_index('Date'), on='Date')
For now, from your sample files, you have:
        Date  csv1_value  csv2_value  csv3_value
0 2020-01-01        1.01         NaN         NaN
1 2020-01-02        2.01         NaN         NaN
2 2020-01-03        3.01         NaN        3.03
3 2020-01-04        4.01        4.02        4.03
4 2020-01-05        5.01        5.02        5.03
5 2020-01-06        6.01        6.02        6.03
6 2020-01-07        7.01         NaN        7.03
7 2020-01-08        8.01        8.02         NaN
8 2020-01-09        9.01        9.02        9.03
9 2020-01-10       10.01       10.02       10.03
And to get the result, run:
df = df.ffill().fillna(0.0)
The result is:
        Date  csv1_value  csv2_value  csv3_value
0 2020-01-01        1.01        0.00        0.00
1 2020-01-02        2.01        0.00        0.00
2 2020-01-03        3.01        0.00        3.03
3 2020-01-04        4.01        4.02        4.03
4 2020-01-05        5.01        5.02        5.03
5 2020-01-06        6.01        6.02        6.03
6 2020-01-07        7.01        6.02        7.03
7 2020-01-08        8.01        8.02        7.03
8 2020-01-09        9.01        9.02        9.03
9 2020-01-10       10.01       10.02       10.03
How to find possible errors:
One thing to check is whether the program finds the expected CSV files. To check it, run:
for i, fn in enumerate(glob.glob('Input*.csv'), start=1):
    print(i, fn)
and you should get a list of the files found. Another detail to check is whether your file names start with Input and have a csv extension; maybe you should change Input*.csv to some other pattern. Also attempt to run my code partially: first the loop creating the list of DataFrames, then check the size of this list, print some of the DataFrames and invoke info() on them (make test printouts). After that, run the second part of my code (the while loop). If some error occurs, state in which instruction it occurred.
Python dataframe remove top n rows and move up remaining
I have a dataframe of 2500 rows. I am trying to remove the top n rows and move the remaining rows up without changing the index. I am giving an example of my problem and what I want:
df =
       A
10  10.5
11  20.5
12  30.5
13  40.5
14  50.5
15  60.5
16  70.5
In the above, I would like to remove the top two rows and move up the remaining rows without disturbing the index.
My code and present output:
idx = df.index
df.drop(df.index[:2], inplace=True)
df.set_index(idx[:len(df)], inplace=True)
df =
       A
10  30.5
11  40.5
12  50.5
13  60.5
14  70.5
I got the output that I wanted. Is there a better way to do it, like a one-line solution?
You can use iloc to remove the rows and set the index to the original index without the last 2 values:
df = df.iloc[2:].set_index(df.index[:-2])
You can also use shift() and drop the resulting NaN rows to create the dataframe:
df = pd.DataFrame(df.A.shift(-2).dropna(how='all'))
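Another one-liner in the same spirit, rebuilding the frame from the raw values against the truncated index (a sketch for n = 2; generalize by replacing 2 with n):
df = pd.DataFrame(df.values[2:], index=df.index[:-2], columns=df.columns)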
Filtering a dataframe by a list
I have the following dataframe:
Date        Name  Value  Rank  Mean
01/02/2019  A     10     100   8.2
02/03/2019  A     9      120   7.9
01/03/2019  B     3      40    6.4
03/02/2019  B     1      39    5.9
...
And the following list:
date = ['01/02/2019', '03/02/2019', ...]
I would like to filter the df by the list, but as a date range: for each value in the list I would like to bring back the data between that date and the date minus 30 days.
I am using numpy broadcasting here. Notice this method is O(n*m), which means that if both the df and the date list are huge, it will exceed the memory limit.
s = pd.to_datetime(date).values
df.Date = pd.to_datetime(df.Date)
s1 = df.Date.values
t = (s - s1[:, None]).astype('timedelta64[D]').astype(int)
df[np.any((t >= 0) & (t <= 30), 1)]
Out[120]:
        Date Name  Value  Rank  Mean
0 2019-01-02    A     10   100   8.2
1 2019-02-03    A      9   120   7.9
3 2019-03-02    B      1    39   5.9
If your date is a string, just do:
df[df.date.isin(list_of_dates)]
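Note: isin only matches the exact dates, not the 30-day ranges the question asks for. If memory is the concern raised in the broadcasting answer, a loop of between() masks OR-ed together avoids building the full n*m matrix (a sketch, reusing the date list from the question):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
mask = pd.Series(False, index=df.index)
for d in pd.to_datetime(date):  # 'date' is the list from the question
    mask |= df['Date'].between(d - pd.Timedelta(days=30), d)  # d - 30 days .. d
result = df[mask]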
Python pandas merge data by sorting and filling data by distance
I am able to create a bunch of loops to resolve this problem and group them by family, but I think this problem could be resolved a lot faster using pandas. I need to merge both dataframes and interpolate the distance by family, where the distance is .5~1 max. Any help would be appreciated.
print Df1
  family  distance
0      A      2.18
1      A      3.31
3      B      4.31
4      A      7.21
print Df2
  family.1  distance.1
1        B          4.
2        A        3.05
3        A       11.03
Desired output df3
  family  distance family.1  distance.1
0      A      2.18
1      A      3.31        A        3.05
3      B      4.31        B         4.2
4      A      7.21
5      A                  A       11.03
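This last question has no answer here, but given the merge_asof approach shown at the top, one possible direction is matching within each family with a distance tolerance. A hedged sketch (the 1.0 tolerance and the 4.2 value, which appears truncated to "4." in the question, are assumptions; note merge_asof can reuse the same Df2 row for several Df1 rows, so it only approximates the one-to-one pairing shown, and unmatched Df2 rows like 11.03 would need an extra outer step):
import pandas as pd

Df1 = pd.DataFrame({'family': ['A', 'A', 'B', 'A'],
                    'distance': [2.18, 3.31, 4.31, 7.21]})
Df2 = pd.DataFrame({'family.1': ['B', 'A', 'A'],
                    'distance.1': [4.2, 3.05, 11.03]})

# merge_asof needs both frames sorted by their merge keys
out = pd.merge_asof(Df1.sort_values('distance'),
                    Df2.sort_values('distance.1'),
                    left_on='distance', right_on='distance.1',
                    left_by='family', right_by='family.1',
                    direction='nearest', tolerance=1.0)
print(out)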