Insert one dataframe into another in Python

Hi, I have the following two DataFrames, each with a 2-level index:
df1 = pd.DataFrame()
df1["Index1"] = ["A", "AA"]
df1["Index2"] = ["B", "BB"]
df1 = df1.set_index(["Index1", "Index2"])
df1["Value1"] = 1
df1["Value2"] = 2
df1
df2 = pd.DataFrame()
df2["Index1"] = ["X", "XX"]
df2["Index2"] = ["Y", "YY"]
df2["Value1"] = 3
df2["Value2"] = 4
df2 = df2.set_index(["Index1", "Index2"])
df2
I would like to create the following DataFrame with a 3-level index, where the first level indicates which DataFrame the values are taken from. Note that all DataFrames have exactly the same columns.
How can I do this in the most automatic way? Ideally I would like a solution along these lines:
# start with empty dataframe
res = pd.DataFrame(index=pd.MultiIndex(levels=[[], [], []],
                                       codes=[[], [], []],
                                       names=["Df number", "Index1", "Index2"]),
                   columns=["Value1", "Value2"])
res = AddDataFrameAtIndex(index="DF1", level=0, dfToInsert=df1)
res = AddDataFrameAtIndex(index="DF2", level=0, dfToInsert=df2)

A possible solution, based on pandas.concat:
pd.concat([df1, df2], keys=['DF1', 'DF2'], names=['DF number'])
Output:
                         Value1  Value2
DF number Index1 Index2
DF1       A      B            1       2
          AA     BB           1       2
DF2       X      Y            3       4
          XX     YY           3       4
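If you later need one of the source frames back, you can select on the first index level; a minimal sketch, assuming the concatenated result was assigned to res:
res = pd.concat([df1, df2], keys=['DF1', 'DF2'], names=['DF number'])
# selecting on the first level drops it, returning df1's rows
df1_back = res.loc['DF1']
# keep all three index levels with a cross-section instead
df1_view = res.xs('DF1', level='DF number', drop_level=False)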

Related

Comparing two dataframes columns

I have two dataframes that are structurally identical.
Both have the following format:
file_name | country_name | country_code | .....
I want to compare the two and get a percentage of equality for each column.
The second dataframe is the test dataframe, which holds the true values. Some of those values are NaN, and they should be ignored. So far I have managed to compare the two and get the total number of equal samples for each column; my problem is dividing each of those counts by the number of relevant samples (those that aren't NaN in the second dataframe) in a "nice way".
For example:
df1
file_name | country_name
1 a
2 b
3 d
4 c
df2
file_name | country_name
1 a
2 b
3 nan
4 d
I expect an output of 66% for this column, because 2 of the 3 relevant samples have the same value; the 4th is NaN, so it is excluded from the calculation.
What I've done so far:
test_set = pd.read_excel(file_path)
test_set = test_set.astype(str)
a_set = pd.read_excel(file2_path)
a_set = a_set.astype(str)
merged_df = a_set.merge(test_set, on='file_name')
# fields: the list of column names shared by both files
for field in fields:
    if field == 'file_name':
        continue
    merged_df[field] = merged_df.apply(
        lambda x: 0 if x[field + '_y'] == 'nan'
        else 1 if x[field + '_x'].lower() == x[field + '_y'].lower()
        else 0,
        axis=1)
scores = merged_df.drop('file_name', axis=1).sum(axis=0)
This gives me these(correct) results:
country_name 14
country_code 0
state_name 4
state_code 59
city 74
...
But now I want to divide each of them by the number of samples that aren't NaN in the corresponding field of the test_set dataframe. I can think of naive ways to do this, like adding another column that holds the number of non-NaN values for each of these columns, but I'm looking for a cleaner solution.
Since you have unique filenames, I would use fully vectorized operations and take advantage of index alignment:
# set the filename as index
df1b = df1.set_index('file_name')
# set the filename as index
df2b = df2.set_index('file_name')
# compare and divide by the number of non-NA
out = df1b.eq(df2b).sum().div(df2b.notna().sum())
Output:
country_name 0.666667
dtype: float64
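If the comparison needs to be case-insensitive, as in the original apply-based code, a minimal sketch (assuming all compared columns are strings) is to lower-case both frames before comparing:
# lower-case every column on both sides; NaN stays NaN, so the
# division by df2b.notna().sum() is unaffected
df1_low = df1b.apply(lambda s: s.str.lower())
df2_low = df2b.apply(lambda s: s.str.lower())
out = df1_low.eq(df2_low).sum().div(df2b.notna().sum())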
If you don't have to merge, you could use:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([
    ["1", "a"],
    ["2", np.nan],
    ["3", "c"]
])
df2 = pd.DataFrame([
    ["1", "X"],
    ["100", "b"],
    ["3", "c"]
])

# expected:
# col 0: equal = 2, ratio: 2/3
# col 1: equal = 1, ratio: 1/2

df1 = df1.sort_index()
df2 = df2.sort_index()

def get_col_ratio(col):
    colA = df1[col]
    colB = df2[col]
    colA_ = colA[~(colA.isna() | colB.isna())]
    colB_ = colB[~(colA.isna() | colB.isna())]
    return (colA_.str.lower() == colB_.str.lower()).sum() / len(colA_)

ratios = pd.DataFrame([[get_col_ratio(i) for i in df1.columns]], columns=df1.columns)
print(ratios)
Or, using pd.merge:
fields = df1.columns
merged = pd.merge(df1, df2, left_index=True, right_index=True)

def get_ratio(col):
    cols = merged[[f"{col}_x", f"{col}_y"]]
    cols = cols.dropna()
    equal_rows = cols[cols.columns[0]].str.lower() == cols[cols.columns[1]].str.lower()
    return equal_rows.sum() / len(cols)

ratios = pd.DataFrame([[get_ratio(i) for i in fields]], columns=fields)
ratios

Merging DataFrames with Different Columns

Suppose I have two dataframes df1 and df2, as shown by the first two dataframes in the image below. I want to combine them to get df_desired, as shown by the final dataframe in the image. My current attempts result in the third dataframe in the image; as you can see, it ignores the fact that it has already seen a row with name a.
My code:
df1 = pd.DataFrame({'name':['a','b'], 'data1':[3,4]})
df2 = pd.DataFrame({'name':['a','c'], 'data2':[1,5]})
def collect_results(target_list, df_list):
    df = pd.DataFrame(columns=['name', 'data1', 'data2'])
    for i in range(2):
        target = target_list[i]
        df_target = df_list[i]
        smiles = list(df_target['name'])
        pxc50 = list(df_target[target])
        target_col_names = ['name', target]
        df_target_info = pd.DataFrame(columns=target_col_names)
        df_target_info['name'] = smiles
        df_target_info[target] = pxc50
        try:
            df = pd.merge(df, df_target_info, how="outer", on=["name", target])
        except IndexError:
            df = df.reindex_axis(df.columns.union(df_target_info.columns), axis=1)
    return df
How can I get the desired behaviour?
You can merge on name with an outer join using .merge():
df_desired = df1.merge(df2, on='name', how='outer')
Result:
print(df_desired)
  name  data1  data2
0    a    3.0    1.0
1    b    4.0    NaN
2    c    NaN    5.0
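If more than two frames have to be combined the same way, the pairwise merge can be folded over a list; a sketch, with df3 as a hypothetical extra frame:
from functools import reduce
df3 = pd.DataFrame({'name': ['b', 'c'], 'data3': [7, 8]})  # hypothetical extra frame
# outer-merge every frame on 'name', one pair at a time
df_desired = reduce(lambda l, r: l.merge(r, on='name', how='outer'), [df1, df2, df3])
print(df_desired)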

Python dictionary conversion

I have a Python dictionary:
adict = {
    'col1': [
        {'id': 1, 'tag': '#one#two'},
        {'id': 2, 'tag': '#two#'},
        {'id': 1, 'tag': '#one#three#'}
    ]
}
I want the result as follows:
Id tag
1 one,two,three
2 two
Could someone please tell me how to do this?
Try this:
import pandas as pd
d = {'col1': [{'id': 1, 'tag': '#one#two'},
              {'id': 2, 'tag': '#two#'},
              {'id': 1, 'tag': '#one#three#'}]}
df = pd.DataFrame()
for i in d:
    for k in d[i]:
        t = pd.DataFrame.from_dict(k, orient='index').T
        t["tag"] = t["tag"].str.replace("#", ",")
        df = pd.concat([df, t])
tf = df.groupby(["id"])["tag"].apply(lambda x: ",".join(set(''.join(list(x)).strip(",").split(","))))
Here is a simple approach:
import pandas as pd
d = {'col1':[{'id':1,'tag':'#one#two'},{'id':2,'tag':'#two#'},{'id':1,'tag':'#one#three#'}]}
df = pd.DataFrame(d)
df['Id'] = df.col1.apply(lambda x: x['id'])
df['tag'] = df.col1.apply(lambda x: ''.join(list(','.join(x['tag'].split('#')))[1:]))
df.drop(columns = 'col1', inplace = True)
Output:
Id Tag
1 one, two
2 two
1 one, three
If the order of tags is important, first strip the leading/trailing # and split by #, then remove duplicates per group and join:
df = pd.DataFrame(d['col1'])
df['tag'] = df['tag'].str.strip('#').str.split('#')
f = lambda x: ','.join(dict.fromkeys([z for y in x for z in y]).keys())
df = df.groupby('id')['tag'].apply(f).reset_index()
print (df)
id tag
0 1 one,two,three
1 2 two
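As an aside, dict.fromkeys does the deduplication here because dicts preserve insertion order on Python 3.7+, so the first occurrence of each tag wins:
tags = ['one', 'two', 'one', 'three']
print(list(dict.fromkeys(tags)))  # ['one', 'two', 'three']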
If the order of tags is not important, use sets to remove duplicates:
df = pd.DataFrame(d['col1'])
df['tag'] = df['tag'].str.strip('#').str.split('#')
f = lambda x: ','.join(set([z for y in x for z in y]))
df = df.groupby('id')['tag'].apply(f).reset_index()
print (df)
id tag
0 1 three,one,two
1 2 two
I tried it as below:
import pandas as pd
a = {'col1':[{'id':1, 'tag':'#one#two'},{'id':2, 'tag':'#two#'},{'id':1, 'tag':'#one#three#'}]}
df = pd.DataFrame(a)
df[["col1", "col2"]] = pd.DataFrame(df.col1.values.tolist(), index = df.index)
df['col1'] = df.col1.str.replace('#', ',')
df = df.groupby(["col2"])["col1"].apply(lambda x : ",".join(set(''.join(list(x)).strip(",").split(","))))
O/P:
col2
1 one,two,three
2 two
dic = [{'col1': [{'id': 1, 'tag': '#one#two'},
                 {'id': 2, 'tag': '#two#'},
                 {'id': 1, 'tag': '#one#three#'}]}]
row = []
for key in dic:
    data = key['col1']
    for rows in data:
        row.append(rows)
df = pd.DataFrame(row)
print(df)
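For completeness, on pandas 0.25+ the same result can be reached more compactly with explode; a sketch, reusing the d defined above:
df = pd.DataFrame(d['col1'])
out = (df.assign(tag=df['tag'].str.strip('#').str.split('#'))
         .explode('tag')            # one row per tag
         .drop_duplicates()         # drop repeated (id, tag) pairs
         .groupby('id')['tag'].agg(','.join)
         .reset_index())
print(out)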

Split Column containing lists into different rows in pandas [duplicate]

This question already has answers here:
How to explode a list inside a Dataframe cell into separate rows
(12 answers)
Closed 3 years ago.
I have a dataframe in pandas like this:
id info
1 [1,2]
2 [3]
3 []
And I want to split it into different rows like this:
id info
1 1
1 2
2 3
3 NaN
How can I do this?
You can try this out:
>>> import pandas as pd
>>> df = pd.DataFrame({'id': [1,2,3], 'info': [[1,2],[3],[]]})
>>> s = df.apply(lambda x: pd.Series(x['info']), axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'info'
>>> df2 = df.drop('info', axis=1).join(s)
>>> df2['info'] = pd.Series(df2['info'], dtype=object)
>>> df2
   id info
0   1    1
0   1    2
1   2    3
2   3  NaN
A similar question is posted here.
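Note that on pandas 0.25 and later, DataFrame.explode produces the desired result directly, and empty lists become NaN rows:
>>> df = pd.DataFrame({'id': [1, 2, 3], 'info': [[1, 2], [3], []]})
>>> df.explode('info')
   id info
0   1    1
0   1    2
1   2    3
2   3  NaN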
This is a rather convoluted way, which drops empty cells:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'info': [[1, 2], [3], []]})
unstack_df = df.set_index(['id'])['info'].apply(pd.Series)\
               .stack()\
               .reset_index(level=1, drop=True)
unstack_df = unstack_df.reset_index()
unstack_df.columns = ['id', 'info']
unstack_df
>>
   id  info
0   1   1.0
1   1   2.0
2   2   3.0
Here's one way using np.repeat and itertools.chain. Converting empty lists to {np.nan} is a trick to fool Pandas into accepting an iterable as a value. This allows chain.from_iterable to work error-free.
import numpy as np
from itertools import chain
df.loc[~df['info'].apply(bool), 'info'] = {np.nan}
res = pd.DataFrame({'id': np.repeat(df['id'], df['info'].map(len).values),
                    'info': list(chain.from_iterable(df['info']))})
print(res)
   id  info
0   1   1.0
0   1   2.0
1   2   3.0
2   3   NaN
Try these methods too...
Method 1
def split_dataframe_rows(df, column_selectors):
    # we need to keep track of the ordering of the columns
    def _split_list_to_rows(row, row_accumulator, column_selectors):
        split_rows = {}
        max_split = 0
        for column_selector in column_selectors:
            split_row = row[column_selector]
            split_rows[column_selector] = split_row
            if len(split_row) > max_split:
                max_split = len(split_row)
        for i in range(max_split):
            new_row = row.to_dict()
            for column_selector in column_selectors:
                try:
                    new_row[column_selector] = split_rows[column_selector].pop(0)
                except IndexError:
                    new_row[column_selector] = ''
            row_accumulator.append(new_row)

    new_rows = []
    df.apply(_split_list_to_rows, axis=1, args=(new_rows, column_selectors))
    new_df = pd.DataFrame(new_rows, columns=df.columns)
    return new_df
Method 2
# json_normalize moved to the top-level pandas namespace in 1.0;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize
import pandas as pd

def flatten_data(json=None):
    df = pd.DataFrame(json)
    list_cols = [col for col in df.columns if type(df.loc[0, col]) == list]
    for i in range(len(list_cols)):
        col = list_cols[i]
        meta_cols = [c for c in df.columns if type(df.loc[0, c]) != list] + list_cols[i+1:]
        json_data = df.to_dict('records')
        df = json_normalize(data=json_data, record_path=col, meta=meta_cols,
                            record_prefix=col + '_', sep='_')
    return json_normalize(df.to_dict('records'))

Multilevel Slicing Pandas DataFrame

I have 3 DataFrames:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df2 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df3 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
I concatenate them, creating a DataFrame with a MultiIndex:
df_c = pd.concat([df1, df2, df3], axis = 1, keys = ["df1", "df2", "df3"])
Swap levels and sort:
df_c.columns = df_c.columns.swaplevel(0,1)
df_c = df_c.sort_index(axis=1)  # reindex_axis was removed in modern pandas; this sorts the columns the same way
ipdb> df_c
  2010-01-01                      2010-01-02
         df1       df2       df3        df1       df2       df3
A  -0.798407  0.124091  0.271089   0.754759 -0.575769  1.501942
B   0.602091 -0.415828  0.152780   0.530525  0.118447  0.057240
C  -0.440619 -1.074837 -0.618084   0.627520 -1.298814  1.029443
D  -0.242851 -0.738948 -1.312393   0.559021  0.196936 -1.074277
I would like to slice it to get values for individual rows, but so far I have only achieved this degree of slicing:
cols = df_c.T.index.get_level_values(0)
ipdb> df_c.xs(cols[0], axis=1, level=0)
        df1       df2       df3
A -0.798407  0.124091  0.271089
B  0.602091 -0.415828  0.152780
C -0.440619 -1.074837 -0.618084
D -0.242851 -0.738948 -1.312393
The only way I found to get the values for each row is to define a new dataframe:
slcd_df = df_c.xs(cols[0], axis=1, level=0)
and then select rows using the usual procedure:
ipdb> slcd_df.loc["A", :]
df1 -0.798407
df2 0.124091
df3 0.271089
But I was wondering whether there is a better (meaning faster and more elegant) way to slice multilevel DataFrames.
You can use pd.IndexSlice:
idx = pd.IndexSlice
sliced = df_c.loc["A", idx["2010-01-01", :]]
print(sliced)
2010-01-01  df1    0.199332
            df2    0.887018
            df3   -0.346778
Name: A, dtype: float64
Or you may also use slice(None):
print(df_c.loc["A", ("2010-01-01", slice(None))])
2010-01-01  df1    0.199332
            df2    0.887018
            df3   -0.346778
Name: A, dtype: float64
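pd.IndexSlice also composes across both axes at once; for instance, a sketch that takes rows "A" and "B" while keeping only the df1 columns for every date, using the same df_c as above:
idx = pd.IndexSlice
# rows "A" and "B"; all dates on the first column level, only "df1" on the second
print(df_c.loc[["A", "B"], idx[:, "df1"]])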
