Most efficient way to transform this data using Pandas? - python

I currently have several hundred .csv files in the format shown on the left below, and I need to transform them all into the format shown on the right. I tried to highlight the blocks of data to make it easier to see what I'm trying to do.
Is there an efficient way to do this using Pandas? I was trying to formulate something using df.iteritems() but couldn't think of a good way to do it.

Given:
Date-Time L-A
0 5/1/2022 0:00 1.4
1 5/1/2022 0:05 1.4
2 5/2/2022 0:10 1.4
Doing:
name = df.columns[1]
df['x'] = name
df = df.reindex(columns=['x', 'Date-Time', name])
print(df.values)
Output:
[['L-A VLX' '5/1/2022 0:00' 1.4]
['L-A VLX' '5/1/2022 0:05' 1.4]
['L-A VLX' '5/2/2022 0:10' 1.4]]
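Since there are several hundred files, a hedged sketch of wrapping that same reshape in a loop over the files might look like this (the folder names and glob pattern are assumptions, not from the original post):
# A minimal sketch, assuming the CSVs live in ./data and each has a
# Date-Time column followed by a single value column.
import glob
import os
import pandas as pd

os.makedirs('out', exist_ok=True)             # hypothetical output folder
for path in glob.glob('data/*.csv'):          # hypothetical input location
    df = pd.read_csv(path)
    name = df.columns[1]                      # the value column, e.g. 'L-A'
    df['x'] = name                            # tag each row with its source column
    df = df.reindex(columns=['x', 'Date-Time', name])
    df.to_csv(os.path.join('out', os.path.basename(path)), index=False)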

My beginner-level way: slicing by column index location, then adding the pieces back together using concat.
print(df) # initial data frame
DateTime Units DateTime Units
0 a 1 a111 10
1 b 2 b222 20
2 c 3 c333 30
Slicing by column index location as the initial DF has duplicated headers:
df2 = df.iloc[: , [0, 1]].copy()
df3 = df.iloc[: , [2, 3]].copy()
# adding all back into new DF
df_result = pd.concat([df2,df3]).reset_index(drop=True)
print(df_result)
output:
DateTime Units
0 a 1
1 b 2
2 c 3
3 a111 10
4 b222 20
5 c333 30
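If the real files have more than two duplicated header blocks, the same idea can be generalized by stepping across the columns in pairs; a hedged sketch, assuming every block is exactly two columns wide:
import pandas as pd

# Slice the frame into two-column blocks, normalise the headers,
# then stack the blocks vertically.
blocks = [df.iloc[:, i:i + 2].copy() for i in range(0, df.shape[1], 2)]
for block in blocks:
    block.columns = ['DateTime', 'Units']
df_result = pd.concat(blocks, ignore_index=True)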

Related

Conditional merge / join of two large Pandas DataFrames with duplicated keys based on values of multiple columns - Python

I come from R and honestly, this is the simplest thing to do in one line using R data.tables, and the operation is also quite fast for large data.tables. But I'm really struggling to implement it in Python. None of the use cases mentioned previously were suitable for my application. The major issue at hand is the memory usage of the Python solution, as I will explain below.
The problem: I've got two large DataFrames df1 and df2 (each around 50M-100M rows) and I need to merge two (or n) columns of df2 to df1 based on two conditions:
1) df1.id = df2.id (usual case of merge)
2) df2.value_2A <= df1.value_1 <= df2.value_2B
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
df1
Out[13]:
id value_1
0 1 2
1 1 5
2 1 7
3 2 1
4 2 3
5 3 4
df2
Out[14]:
id value_2A value_2B
0 1 0 1
1 1 3 5
2 1 7 9
3 1 12 15
4 2 0 1
5 2 2 4
6 2 3 6
7 3 1 3
desired_output
Out[15]:
id value_1 value_2A value_2B
0 1 2 NaN NaN
1 1 5 3.0 5.0
2 1 7 7.0 9.0
3 2 1 0.0 1.0
4 2 3 2.0 4.0
5 2 3 3.0 6.0
6 3 4 NaN NaN
Now, I know this can be done by first merging df1 and df2 the 'left' way and then filtering the data. But this is a horrendous solution in terms of scaling. I've got 50M x 50M rows with multiple duplicates of id. This would create some enormous dataframe which I would have to filter.
## This is NOT a solution because memory usage is just too large and
## there are too many operations, making it extremely inefficient and slow at large scale
output = pd.merge(df1, df2, on='id', how='left') ## output becomes very large in my case
output.loc[~((output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])), ['value_2A', 'value_2B']] = np.nan
output = output.loc[~ output['value_2A'].isnull()]
output = pd.merge(df1, output, on=['id', 'value_1'], how='left')
This is so inefficient. I'm merging a large dataset twice to get the desired output and creating massive dataframes while doing so. Yuck!
Think of this as two dataframes of events which I'm trying to match together. That is, tagging whether events of df1 have occurred within events of df2. There are multiple events for each id in both df1 and df2, and events of df2 are NOT mutually exclusive. The conditional join really needs to happen at the time of joining, not after.
This is done easily in R:
## in R realm ##
require(data.table)
desired_output <- df2[df1, on=.(id, value_2A <= value_1, value_2B >= value_1)] #fast and easy operation
Is there any way to do this in Python?
Interesting question!
Looks like pandasql might do what you want. Please see:
How to do a conditional join in python Pandas?
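For reference, a hedged sketch of what the pandasql route might look like (assuming the pandasql package is installed; it runs the query through an in-memory SQLite database, so it may not solve the memory problem at 50M+ rows):
from pandasql import sqldf
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 1, 2, 2, 3], 'value_1': [2, 5, 7, 1, 3, 4]})
df2 = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3],
                    'value_2A': [0, 3, 7, 12, 0, 2, 3, 1],
                    'value_2B': [1, 5, 9, 15, 1, 4, 6, 3]})

# The non-equi condition is expressed directly in the JOIN clause.
query = """
SELECT df1.id, df1.value_1, df2.value_2A, df2.value_2B
FROM df1
LEFT JOIN df2
  ON  df1.id = df2.id
  AND df1.value_1 BETWEEN df2.value_2A AND df2.value_2B
"""
output = sqldf(query, locals())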
Yeah. It's an annoying problem. I handled this by splitting the left DataFrame into chunks.
def merge_by_chunks(left, right, condition=None, **kwargs):
    chunk_size = 1000
    merged_chunks = []
    for chunk_start in range(0, len(left), chunk_size):
        print(f"Merged {chunk_start} ", end="\r")
        merged_chunk = pd.merge(left=left[chunk_start: chunk_start + chunk_size], right=right, **kwargs)
        if condition is not None:
            merged_chunk = merged_chunk[condition(merged_chunk)]
        merged_chunks.append(merged_chunk)
    return pd.concat(merged_chunks)
Then you can provide the condition as a function.
df1 = pd.DataFrame({'id': [1,1,1,2,2,3], 'value_1': [2,5,7,1,3,4]})
df2 = pd.DataFrame({'id': [1,1,1,1,2,2,2,3], 'value_2A': [0,3,7,12,0,2,3,1], 'value_2B': [1,5,9,15,1,4,6,3]})
def condition_func(output):
    return (output['value_1'] >= output['value_2A']) & (output['value_1'] <= output['value_2B'])
output = merge_by_chunks(df1, df2, condition=condition_func, on='id', how='left')
merge_by_chunks(df1, output, on=['id', 'value_1'], how='left')
It can be pretty slow depending on the size of the DataFrame, but it doesn't run out of memory.

How can I rename NaN columns in python pandas?

Good day everyone! I had trouble putting a nested dictionary as separate columns. However, I fixed it using the concat and json_normalize functions. But for some reason the code I used removed all the column names and returned NaN as values for the columns...
Does someone knows how to fix this?
Code I used:
import pandas as pd
c = ['photo.photo_replace', 'photo.photo_remove', 'photo.photo_add', 'photo.photo_effect', 'photo.photo_brightness',
'photo.background_color', 'photo.photo_resize', 'photo.photo_rotate', 'photo.photo_mirror', 'photo.photo_layer_rearrange',
'photo.photo_move', 'text.text_remove', 'text.text_add', 'text.text_edit', 'text.font_select', 'text.text_color', 'text.text_style',
'text.background_color', 'text.text_align', 'text.text_resize', 'text.text_rotate', 'text.text_move', 'text.text_layer_rearrange']
df_edit = pd.concat([pd.json_normalize(x)[c] for x in df['editables']], ignore_index=True)
df.columns = df.columns.str.split('.').str[1]
Current problem:
Result I want:
df= pd.DataFrame({
'A':[1,2,3],
'B':[3,3,3]
})
print(df)
A B
0 1 3
1 2 3
2 3 3
c=['new_name1','new_name2']
df.columns=c
print(df)
new_name1 new_name2
0 1 3
1 2 3
2 3 3
Remember, the length of the column-name list (c) should equal the number of columns.
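If you only want to rename some of the columns, or don't want to worry about matching the full length, a dictionary with df.rename is an alternative sketch (the new names here are purely illustrative):
# Rename selected columns by mapping old name -> new name;
# columns not mentioned keep their current names.
df = df.rename(columns={'A': 'new_name1', 'B': 'new_name2'})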

Merging content of two rows in Pandas

I have a data frame where I would like to merge the content of two rows, separated by an underscore, within the same cell.
If this is the original DF:
0 eye-right eye-right hand
1 location location position
2 12 27.7 2
3 14 27.6 2.2
I would like it to become:
0 eye-right_location eye-right_location hand_position
1 12 27.7 2
2 14 27.6 2.2
Eventually I would like to promote row 0 to become the header and reset the index for the entire df.
You can set your column labels, slice via iloc, then reset_index:
print(df)
# 0 1 2
# 0 eye-right eye-right hand
# 1 location location position
# 2 12 27.7 2
# 3 14 27.6 2.2
df.columns = (df.iloc[0] + '_' + df.iloc[1])
df = df.iloc[2:].reset_index(drop=True)
print(df)
# eye-right_location eye-right_location hand_position
# 0 12 27.7 2
# 1 14 27.6 2.2
I like jpp's answer a lot. Short and sweet. Perfect for quick analysis.
Just one quibble: The resulting DataFrame is generically typed. Because strings were in the first two rows, all columns are considered type object. You can see this with the info method.
For data analysis, it's often preferable that columns have specific numeric types. This can be tidied up with one more line:
df.columns = df.iloc[0] + '_' + df.iloc[1]
df = df.iloc[2:].reset_index(drop=True)
df = df.apply(pd.to_numeric)
The third line here applies pandas' to_numeric function to each column in turn, leaving a properly typed DataFrame.
While not essential for simple usage, as soon as you start performing math on DataFrames, or start using very large data sets, column types become something you'll need to pay attention to.
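To make the dtype point concrete, a small sketch comparing the column types before and after pd.to_numeric (continuing the example above):
# The header rows were strings, so every column starts out as dtype 'object'.
print(df.dtypes)   # all columns: object

df = df.apply(pd.to_numeric)
print(df.dtypes)   # now numeric dtypes such as int64 / float64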

Slice column in pandas dataframe and average results

If I have a pandas dataframe such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first it would be the average of 5 and 2 to get 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's but I'm not sure how to get an average of just the last two. I'm kinda new to python and coding so this might not be possible idk.
Edit: I should also mention this is not for a class or anything; it's just something I'm doing on my own, and it will be on a very large dataset. I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-two average, so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    # mean of the last two 'value' entries in each group
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of your results only, if you need them, plus a column with it merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; probably no reason to do it the longer way I showed here unless you also need to perform more operations like this that you could do inside the same loop.
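A more compact sketch of the same broadcast (the tail_mean name reuses the one above): compute the last-two mean per label once, then map it back onto every row.
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', 'a', 'b', 'a', 'b'],
                   'value': [1, 2, 5, 6, 2, 4]})

# Mean of the last two values per label, broadcast back to every row.
last_two_mean = df.groupby('label')['value'].apply(lambda s: s.tail(2).mean())
df['tail_mean'] = df['label'].map(last_two_mean)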

Pandas divide one row by another and output to another row in the same dataframe

For a Dataframe such as:
dt
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
What's the easiest way to append to the same data frame the result of dividing Row1 by Row2? i.e. the desired outcome is:
COL000 COL001
STK_ID
Rowname1 2 2
Rowname2 1 4
Rowname3 1 1
Newrow 2 0.5
Sorry if this is a simple question, I'm slowly getting to grips with pandas from an R background.
Thanks in advance!!!
The code below will create a new row with index d which is formed from dividing rows a and b.
import pandas as pd
df = pd.DataFrame(data={'x':[1,2,3], 'y':[4,5,6]}, index=['a', 'b', 'c'])
df.loc['d'] = df.loc['a'] / df.loc['b']
print(df)
# x y
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 6.0
# d 0.5 0.8
In order to access the first two rows without caring about the index, you can use:
df.loc['newrow'] = df.iloc[0] / df.iloc[1]
Then just follow @Ffisegydd's solution...
In addition, if you want to append multiple rows, use pd.concat (the older pd.DataFrame.append method has been removed from recent pandas versions).
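A minimal sketch of appending one or more derived rows with pd.concat, reusing the example above:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])

# Build the new row(s) as a small DataFrame, then concatenate vertically.
new_rows = pd.DataFrame([df.loc['a'] / df.loc['b']], index=['d'])
df = pd.concat([df, new_rows])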
pandas does all the work row by row; by assigning the result to a new name, it interprets that you want a new column:
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
