I am trying to explode a list column in my DataFrame and merge it back into the original df, but I get a memory error while merging the flattened column with the initial DataFrame. I would like to know if I can merge it in chunks, so that I can overcome the memory issue.
import pandas as pd

def flatten_column_with_list(df, column, reset_index=False):
    # one row per list element, keyed by the original index
    # (iteritems() was removed in pandas 2.0; items() is the replacement)
    column_to_flatten = pd.DataFrame(
        [[i, x] for i, y in df[column].apply(list).items() for x in y],
        columns=['I', column])
    column_to_flatten = column_to_flatten.set_index('I')
    df = df.drop(column, axis=1)
    df = df.merge(column_to_flatten, left_index=True, right_index=True)
    if reset_index:
        df = df.reset_index(drop=True)
    return df
I would appreciate any support.
Regarding this, you can simply use the built-in explode method:
df = df.explode(column, ignore_index=True)
Setting ignore_index=True resets the resulting index to 0, 1, 2, ... (the parameter was added in pandas 1.1).
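For example, a minimal sketch with toy data:
df = pd.DataFrame({'id': [1, 2], 'values': [[10, 20], [30]]})
flat = df.explode('values', ignore_index=True)
# flat:
#    id  values
# 0   1      10
# 1   1      20
# 2   2      30
Because explode works on the whole frame at once, there is no separate merge step that can blow up memory the way the manual flatten-and-merge approach does.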
I have an initial DataFrame D. I extract two DataFrames from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0 and removed in pandas 2.0; use pd.concat instead.
Use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
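As a quick sketch of the difference between the two directions (toy frames, hypothetical data):
df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'x': [3, 4]})
pd.concat([df_a, df_b], ignore_index=True)  # 4 rows stacked, index 0..3
pd.concat([df_a, df_b], axis=1)             # 2 rows, the two 'x' columns side by side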
If you're working with big data and need to concatenate multiple datasets, calling concat many times can become a real performance hit.
If you don't want to create a new DataFrame on each call, collect the pieces first and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs, under "Concatenating objects", at the bottom of the section:
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
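A minimal sketch of the collect-then-concat pattern (chunk_source and process are hypothetical placeholders):
frames = []
for chunk in chunk_source:         # e.g. files or query batches (hypothetical)
    frames.append(process(chunk))  # each iteration produces a DataFrame (hypothetical)
result = pd.concat(frames, ignore_index=True)
This way the data is copied once at the end instead of once per iteration.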
If you want to update/replace the values of the first dataframe df1 with the values of the second dataframe df2, you can do it with the following steps:
Step 1: Set the index of the first dataframe (df1), assigning the result back (set_index returns a new frame rather than modifying in place):
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2) the same way:
df2 = df2.set_index('id')
and finally update the first dataframe in place:
df1.update(df2)
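A small end-to-end sketch of the pattern, with hypothetical column names:
df1 = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']}).set_index('id')
df2 = pd.DataFrame({'id': [2], 'val': ['B']}).set_index('id')
df1.update(df2)  # df1.loc[2, 'val'] is now 'B'; rows missing from df2 are left untouched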
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
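For instance, a minimal sketch with toy frames sharing an index:
a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'y': [3, 4]}, index=['r1', 'r2'])
c = pd.DataFrame({'z': [5, 6]}, index=['r1', 'r2'])
everything = a.join([b, c])  # columns x, y, z aligned on the shared index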
# collect excel content into a list of dataframes
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate the dataframes horizontally
df = pd.concat(data, axis=1)
# save the combined data to excel (excelAutoNamed is the output path defined elsewhere)
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally. Hope this helps someone!
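If you still need to build the excel_files list itself, one hedged sketch (the folder pattern is hypothetical):
from glob import glob
excel_files = sorted(glob("reports/*.xlsx"))  # hypothetical folder of workbooks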
Use this code to attach two Pandas Data Frames horizontally:
df3 = pd.concat([df1, df2],axis=1, ignore_index=True, sort=False)
You must specify the axis along which you intend to concatenate the two frames. Note that with axis=1, ignore_index=True resets the column labels to 0, 1, 2, ....
What seems to be a simple function returns NaNs instead of the actual numbers. What am I missing here?
#Concatenate the dataframes:
dfcal = dfcal.astype(float)
dfmag = dfmag.astype(float)
print('dfcal\n-----',dfcal)
print('dfmag\n-----',dfmag)
df = pd.concat([dfcal,dfmag])
print('concatresult\n-----',df)
Cheers!
I guess you need axis=1 to append the new columns, and to select only the caliper column to avoid duplicated depth columns:
df = pd.concat([dfcal['caliper'], dfmag], axis=1)
Or:
df = pd.concat([dfcal.drop('depth', axis=1), dfmag], axis=1)
Alternatively, check the join and axis parameters, or use merge instead. From the concat signature:
join : {'inner', 'outer'}, default 'outer'
df = pd.concat([dfcal, dfmag], join='inner')
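A minimal sketch of why the plain concat produces NaN, using toy frames (the mag column name is hypothetical; caliper and depth come from the question):
dfcal = pd.DataFrame({'depth': [1.0, 2.0], 'caliper': [8.5, 8.6]})
dfmag = pd.DataFrame({'depth': [1.0, 2.0], 'mag': [0.1, 0.2]})
pd.concat([dfcal, dfmag])           # stacks rows: caliper is NaN on dfmag rows and vice versa
pd.concat([dfcal, dfmag], axis=1)   # places the frames side by side, no NaN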
I have a dictionary of dataframes. Each of these dataframes has a column 'defrost_temperature'. What I want to do is make one new dataframe that collects all those columns, keeping them as separate columns.
This is what I am doing right now:
merged_defrosts = pd.DataFrame()
for key in df_dict.keys():
merged_defrosts[key] = df_dict[key]["defrost_temperature"]
But unfortunately, only the first column is filled correctly; the other columns are filled entirely with NaN.
The different defrosts are not necessarily the same length. (the fourth dataframe is 108 rows, the others are 109 rows)
You can try pd.merge on the index of the larger frame, merging iteratively:
df_result = pd.DataFrame()
for i, df in enumerate(df_dict.values()):
    s1, s2 = f'_{i}', f'_{i+1}'
    m1, m2 = df_result.shape[0], df.shape[0]
    if m1 == 0:
        df_result = df
    elif m1 >= m2:
        df_result = df_result.merge(df, how='left', left_index=True, right_index=True, suffixes=(s1, s2))
    else:
        df_result = df.merge(df_result, how='left', left_index=True, right_index=True, suffixes=(s2, s1))
This does create some undesired column names, though; you can rename them manually afterwards.
You could try to concat the dataframes horizontally after making the common column the index:
merged_defrosts = pd.concat(
    [df.set_index("defrost_temperature") for df in df_dict.values()], axis=1
).reset_index()
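As an alternative sketch, you can pull just the one column out of each frame and line the Series up side by side; resetting each index sidesteps the 108-vs-109-row mismatch (the shorter column is padded with NaN, and the dict keys become the column names):
merged_defrosts = pd.concat(
    {key: df["defrost_temperature"].reset_index(drop=True) for key, df in df_dict.items()},
    axis=1)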
I have a DataFrame with multi-index ['timestamp', 'symbol'] that contains timeseries data. I am merging this data with other samples, and my apply function that uses asof is similar to:
df.apply(lambda x: df2.xs(x['symbol'], level='symbol').index.asof(x['timestamp']), axis=1)
I think the actual xs to filter on symbol is what is causing it to be so slow, so I am instead creating a dict of 'symbol' -> df where the values are already filtered so I can just call index.asof directly. Am I approaching this the wrong way?
Example:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("ts,symbol,bid,ask\n2014-03-03T09:30:00,A,54.00,55.00\n2014-03-03T09:30:05,B,34.00,35.00"), parse_dates=['ts'], index_col=['ts', 'symbol'])
df2 = pd.read_csv(StringIO("ts,eventId,symbol\n2014-03-03T09:32:00,1,A\n2014-03-03T09:33:05,2,B"), parse_dates=['ts'])
# find the ts to join with; xs filters on symbol so we can use index.asof
df2['event_ts'] = df2.apply(lambda x: df.xs(x['symbol'], level='symbol').index.asof(x['ts']), axis=1)
# merge in the fields
df2 = pd.merge(df2, df, left_on=['event_ts', 'symbol'], right_index=True)
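A sketch of the dict-based pre-filtering idea described above: split df by symbol once, so each row lookup becomes a dict access plus index.asof (by_symbol is a hypothetical name):
# build the symbol -> per-symbol frame mapping once
by_symbol = {sym: grp.droplevel('symbol') for sym, grp in df.groupby(level='symbol')}
# each lookup now avoids the repeated xs scan
df2['event_ts'] = df2.apply(lambda x: by_symbol[x['symbol']].index.asof(x['ts']), axis=1)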