I want to merge two dataframes to create a single time-series with two variables.
I have a function that does this by iterating over each dataframe using iterrows(), which is terribly slow and doesn't take advantage of the vectorization that pandas and numpy provide.
Would you be able to help?
This code illustrates what I am trying to do:
a = pd.DataFrame(data={'timestamp':[1,2,5,6,10],'x':[2,6,3,4,2]})
b = pd.DataFrame(data={'timestamp':[2,3,4,10],'y':[3,1,2,1]})
#z = Magical line of code/function call here
#z output: {'timestamp':[1,2,3,4,5,6,10],'x':[2,6,6,6,3,4,2], 'y':[NaN,3,1,2,2,2,1] }
This can be broken down into two steps:
The first step is the equivalent of an outer join in SQL, where we create a table containing the keys of both source tables. This is done with merge(..., how="outer").
The second is filling the NaNs with the previous non-NaN values, which can be done with ffill:
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
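Putting it together with the sample frames from the question, as a minimal runnable sketch (the printed output matches the expected z above):

import pandas as pd

a = pd.DataFrame({'timestamp': [1, 2, 5, 6, 10], 'x': [2, 6, 3, 4, 2]})
b = pd.DataFrame({'timestamp': [2, 3, 4, 10], 'y': [3, 1, 2, 1]})

# outer join on the key, sort by time, then forward-fill the gaps
z = a.merge(b, on="timestamp", how="outer").sort_values("timestamp").ffill()
print(z.to_string(index=False))
#  timestamp    x    y
#          1  2.0  NaN
#          2  6.0  3.0
#          3  6.0  1.0
#          4  6.0  2.0
#          5  3.0  2.0
#          6  4.0  2.0
#         10  2.0  1.0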
I have the following column MultiIndex dataframe.
I would like to select (or get a subset of) the dataframe with different columns for each level_0 index (i.e. x_mm and y_mm from virtual, and z_mm, rx_deg, ry_deg, rz_deg from actual). From what I have read, I think I might be able to use pandas IndexSlice, but I'm not entirely sure how to use it in this context.
So far my workaround is to use pd.concat, selecting the two sets of columns independently. I have the feeling that this can be done neatly with slicing.
You can programmatically generate the tuples to slice your MultiIndex:
from itertools import product

cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

# product(*x) expands each (level_0, level_1 names) pair into full column tuples
out = df[[t for x in cols for t in product(*x)]]
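For reference, a minimal runnable sketch with a hypothetical frame shaped like the one in the question (column names taken from the post, values invented):

import pandas as pd
from itertools import product

# hypothetical frame with the question's two-level column index
columns = pd.MultiIndex.from_product(
    [['virtual', 'actual'],
     ['x_mm', 'y_mm', 'z_mm', 'rx_deg', 'ry_deg', 'rz_deg']])
df = pd.DataFrame([range(12), range(12, 24)], columns=columns)

cols = ((('virtual',), ('x_mm', 'y_mm')),
        (('actual',), ('z_mm', 'rx_deg', 'ry_deg', 'rz_deg')))

out = df[[t for x in cols for t in product(*x)]]
print(out.columns.tolist())
# [('virtual', 'x_mm'), ('virtual', 'y_mm'), ('actual', 'z_mm'),
#  ('actual', 'rx_deg'), ('actual', 'ry_deg'), ('actual', 'rz_deg')]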
So I have two tables, Table 1 and Table 2. Table 2 is sorted by date, from recent to old. In Excel, when I do a lookup from Table 1 into Table 2, it only picks the first matching value from Table 2 and does not keep searching for later occurrences of the same value.
So I tried replicating this in Python with the merge function, but found that it repeats the value as many times as it appears in the second table.
pd.merge(Table1, Table2, left_on='Country', right_on='Country', how='left', indicator='indicator_column')
[Images in the original post: TABLE1, TABLE2, the merge result, and the expected result (Excel VLOOKUP).]
Is there any way this could be achieved with the merge function or any other python function?
Typing this blind, as you included your data as images, not text.
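To make the snippet below runnable, here is a hypothetical stand-in for the two tables (the names and values are guesses, since the real data is only in the images):

import pandas as pd

table1 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG'],
    'Name': ['Dave', 'Mike', 'Pete'],
})
table2 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'LIB'],  # note the duplicate 'LIB'
    'Age': [35, 40, 61],
    'Date': ['7/10/2020', '7/9/2020', '7/4/2020'],
})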
# The index is a very important element in a DataFrame
# We will see that in a bit
result = table1.set_index('Country')
# For each country, only keep the first row
tmp = table2.drop_duplicates(subset='Country').set_index('Country')
# When you assign one or more columns of a DataFrame to one or more columns of
# another DataFrame, the assignment is aligned based on the index of the two
# frames. This is the equivalence of VLOOKUP
# plain [] assignment creates the new columns; .loc with a list of
# not-yet-existing labels raises a KeyError on recent pandas versions
result[['Age', 'Date']] = tmp[['Age', 'Date']]
result.reset_index(inplace=True)
Edit: Since you want a straight-up VLOOKUP, just use join. It appears to find the very first match.
table1.join(table2, rsuffix='r', lsuffix='l')
The docs suggest it performs similarly to a VLOOKUP: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
I'd recommend approaching this more like a SQL join than a VLOOKUP. VLOOKUP finds the first matching row, from top to bottom, which can be completely arbitrary depending on how you sort your table/array in Excel. "True" database systems and their related functions are more deliberate than this, for good reason.
In order to join only one row from the right table onto each row of the left table, you'll need some kind of aggregation or selection; in your case, that'd be either MAX or MIN.
The question is: which column is more important, the date or the age?
import pandas as pd

df1 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN'],
    'Name': ['Dave', 'Mike', 'Pete', 'Shirval', 'Kwasi', 'Delali']
})

df2 = pd.DataFrame({
    'Country': ['GERM', 'LIB', 'ARG', 'BNG', 'LITH', 'GHAN', 'LIB', 'ARG', 'BNG'],
    'Age': [35, 40, 27, 87, 90, 30, 61, 18, 45],
    'Date': ['7/10/2020', '7/9/2020', '7/8/2020', '7/7/2020', '7/6/2020',
             '7/5/2020', '7/4/2020', '7/3/2020', '7/2/2020']
})

# collapse df2 to one row per country first, then join onto df1
df1.set_index('Country').join(
    df2.groupby('Country').agg({'Age': 'max', 'Date': 'max'}),
    how='left'
)
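One caveat with the sketch above: 'Date' holds strings, so 'max' compares lexicographically ('7/9/2020' sorts after '7/10/2020'). Parsing the dates first gives a true chronological maximum:

# continuing from the df1/df2 defined above
df2['Date'] = pd.to_datetime(df2['Date'])

df1.set_index('Country').join(
    df2.groupby('Country').agg({'Age': 'max', 'Date': 'max'}),
    how='left'
)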
I have 3 datasets (.csv) that I need to merge together in order to make a motion chart.
They all have countries as the columns and the years as rows.
The other two datasets look the same, except one is population and the other is income. I've tried looking around to see what I can find to get the dataset out like I'd like, but can't seem to find anything.
I tried using pd.concat, but it just lists it all one after the other, not in separate columns.
# merge all 3 datasets in preparation for making a motion chart using pd.concat
mc_data = pd.concat([df2, pop_data3, income_data3], sort=True)
Any sort of help would be appreciated.
EDIT: I have used the code as suggested; however, I get a heap of NaN values that shouldn't be there:
mc_data = pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
EDIT2: When I run .info and .index on the frames, I get these results. Could it be due to the data types or the column entries?
From this answer:
You can do it with concat (the keys argument will create the hierarchical columns index):
pd.concat([df2, pop_data3, income_data3], axis=1, keys=['df2', 'pop_data3', 'income_data3'])
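As for the NaNs in your edit: concat aligns frames on their index, so the year labels must compare equal across all three frames. If one frame stores years as strings and another as integers, nothing lines up. A minimal sketch of that failure mode, with hypothetical data standing in for your frames:

import pandas as pd

df2 = pd.DataFrame({'AUS': [1.0, 2.0]}, index=[2000, 2001])
pop_data3 = pd.DataFrame({'AUS': [20.1, 20.4]}, index=['2000', '2001'])  # string years

# mismatched index dtypes: four rows, with NaN blocks
print(pd.concat([df2, pop_data3], axis=1, keys=['df2', 'pop_data3']))

# normalizing the index dtype makes the rows align
pop_data3.index = pop_data3.index.astype(int)
print(pd.concat([df2, pop_data3], axis=1, keys=['df2', 'pop_data3']))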
I have been trying to compare two dataframes, creating new dataframes for the ones which have the same entries in two columns. I thought I had cracked it, but the code I have now just looks at the two columns of interest, and if a string is found anywhere in that column it considers it a match. I need the two strings to be common in the same row across the columns. A sample of the code follows.
#produce table with common items
vto_in_jeff = df_vto[
    df_vto['source'].isin(df_jeff['source'])
    & df_vto['target'].isin(df_jeff['target'])
].dropna().reset_index(drop=True)
#vto_in_jeff.index = vto_in_jeff.index + 1
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)
vto_in_jeff.to_csv(output_path+'vto_in_'+f+'.csv', index=False)
So this code produces a table of the rows whose source and target strings each appear somewhere in the other frame, but not necessarily together in the same row. Can anyone help me compare row by row?
You can use the pandas merge method; passing both columns to on= requires source and target to match on the same row:
result = pd.merge(df_vto, df_jeff, on=['source', 'target'])
Here are more details:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#brief-primer-on-merge-methods-relational-algebra
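A minimal runnable sketch, with hypothetical stand-ins for df_vto and df_jeff (column names taken from the question, values invented):

import pandas as pd

df_vto = pd.DataFrame({'source': ['a', 'b', 'c'], 'target': ['x', 'y', 'z']})
df_jeff = pd.DataFrame({'source': ['a', 'b'], 'target': ['y', 'y']})

# inner merge on both columns: a row survives only if its (source, target)
# pair appears together on a row in both frames
vto_in_jeff = df_vto.merge(df_jeff, on=['source', 'target'], how='inner')
vto_in_jeff['compare'] = 'Common_terms'
print(vto_in_jeff)  # only ('b', 'y') matches row-wise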
I have two dataframes with the exact same index titles and the exact same column titles but different values within those tables. Also the number of rows and columns is exactly the same. Let's call them df1, and df2.
df1 = {'A':['a1','a2','a3','a4'],'B':['b1','b2','b3','b4'],'C':['c1','c2','c3','c4']}
df2 = {'A':['d1','d2','d3','d4'],'B':['e1','e2','e3','e4'],'C':['f1','f2','f3','f4']}
I want to perform several elementwise operations on these matrices, e.g.
Multiplication - create the following matrix:
df2 = {'A':['a1*d1','a2*d2','a3*d3','a4*d4'],'B':['b1*e1','b2*e2','b3*e3','b4*e4'],'C':['c1*f1','c2*f2','c3*f3','c4*f4']}
as well as Addition, Subtraction, and Division using the exact same logic.
Please note that the question is more about the generic code which can be replicated since the matrix which I am using has hundreds of rows and columns.
This is pretty trivial to achieve using the pandas library. The data type of the columns is unclear from the OP's question, but if they are numeric, then the code below will run.
Try:
import pandas as pd
pd.DataFrame(df1) * pd.DataFrame(df2)
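The same pattern covers all four operations, since elementwise arithmetic aligns on both the index and the column labels. A quick sketch with numeric stand-ins for df1 and df2:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

print(df1 * df2)  # multiplication
print(df1 + df2)  # addition
print(df1 - df2)  # subtraction
print(df1 / df2)  # division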
If you don't want to import pandas just for this operation, you can do it with the following code:
df1_2 = {key: [x*y for x,y in zip(df1[key],df2[key])] for key in df1.keys()}
NOTE: This works only if the values are numeric. If they are strings, use concatenation instead, e.g. x + '*' + y, swapping '*' for whichever operation you want to label.
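With the string-valued df1 and df2 from the question, the concatenation variant produces exactly the 'a1*d1' style output shown above:

df1 = {'A': ['a1', 'a2', 'a3', 'a4'], 'B': ['b1', 'b2', 'b3', 'b4'], 'C': ['c1', 'c2', 'c3', 'c4']}
df2 = {'A': ['d1', 'd2', 'd3', 'd4'], 'B': ['e1', 'e2', 'e3', 'e4'], 'C': ['f1', 'f2', 'f3', 'f4']}

# swap '*' for '+', '-' or '/' to label the other operations
df1_2 = {key: [x + '*' + y for x, y in zip(df1[key], df2[key])] for key in df1}
print(df1_2['A'])  # ['a1*d1', 'a2*d2', 'a3*d3', 'a4*d4']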