Pandas Melt Trouble in Python

I have a CSV file loaded into a dataframe that looks like the following:
My goal is to melt (transform) the dataframe into a refined dataframe that looks like the following:
This is my code so far:
import glob, pandas as pd

file = r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv"
dfTotal = pd.DataFrame()
for prd in glob.glob(file):
    df = pd.read_csv(prd)
    dfTotal = pd.concat([dfTotal, df])
dfTotal.shape

dfHDprd = pd.read_csv(r"C:\Users\jrivera\OneDrive - Accelerate Resources\Documents\Python\maverickAvgTCProductionInput.csv")
id_vars, dct = ["TCA", "MONTH"], {}
for x in ["OIL", "GAS"]:
    dct["value_vars_%s" % x] = ["NORM_%s" % x]

dfNew = pd.melt(frame = dfHDprd, id_vars = ["TCA", "MONTHS"], value_vars = ["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name = "OIL", var_value = "GAS")

I'm not really sure what your goal is; from the link it just seems like you want to limit the months to 0-3 and remove some columns. I would suggest explicitly explaining what you need.
pd.melt is used to convert a wide dataframe into a long dataframe by 'melting' columns, so that the variable names (NORM_OIL_1KFT, NORM_GAS_1KFT) go into the rows instead of serving as column headers. I don't think this is what you are looking for.
If you simply want to retain only the columns in your desired dataframe:
new_df = dfHDprd[['TCA','MONTH','NORM_OIL_1KFT','NORM_GAS_1KFT']]
new_df.columns = ['TCA','MONTH','OIL','GAS']
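As a side note, here is a sketch of the same selection using rename, so the mapping from old to new names is explicit rather than positional (column names assumed from your snippet):
# Select the needed columns, then map the NORM_* names to plain OIL/GAS
new_df = dfHDprd[['TCA', 'MONTH', 'NORM_OIL_1KFT', 'NORM_GAS_1KFT']].rename(
    columns={'NORM_OIL_1KFT': 'OIL', 'NORM_GAS_1KFT': 'GAS'})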
If you do want to melt the dataframe (which probably is not what you are looking to do), you would need to re-define your expression like this to understand the purpose of melting (note that the keyword is value_name, not var_value):
dfNew = pd.melt(frame = dfHDprd, id_vars = ["TCA", "MONTHS"], value_vars = ["NORM_OIL_1KFT", "NORM_GAS_1KFT"], var_name = "FUEL_TYPE", value_name = "QUANTITY")
Here var_name is the header of the column that distinguishes the variables which get melted into the rows, and value_name is the header of the column that holds their values.
Trivial example (as I can't copy any of your data):
df = pd.DataFrame({'id':['a','b','c'], 'C1':[1,2,3],'C2':[4,5,6],'C3':[5,6,7]})
>>>
id C1 C2 C3
0 a 1 4 5
1 b 2 5 6
2 c 3 6 7
pd.melt(frame=df, id_vars=['id'], value_vars=['C1','C2','C3'], value_name='value', var_name='variable')
>>>
id variable value
0 a C1 1
1 b C1 2
2 c C1 3
3 a C2 4
4 b C2 5
5 c C2 6
6 a C3 5
7 b C3 6
8 c C3 7

Related

Stack the columns into one column, keeping the ids

I have a DataFrame with 100 columns (only three are provided here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1,2,3]
df['c1'] = [1,5,1]
df['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id and put them in one column. For example, for id=1 I want to stack the values of columns c1 and c2 in one column. Here is the DataFrame that I want.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I have already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
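For what it's worth, melt can also keep the ids if you pass them as id_vars; here is a minimal sketch on the same df, dropping the generated variable column and sorting by id to match the output above:
# Melt keeps 'id' because it is passed as id_vars; the helper
# 'variable' column is dropped, and a stable sort restores id order.
out = (df.melt(id_vars='id', value_name='c')
         .drop(columns='variable')
         .sort_values('id', kind='mergesort', ignore_index=True))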

How to reverse the content of a specific dataframe column in pandas?

I have a pandas dataframe
df1 = {'A':['a','b','c','d','e'],'no.':[0,1,2,3,4]}
df1 = pd.DataFrame(df1, columns=['A','no.'])
where I would like to reverse in place the content of the second column, with the result being like this:
df2 = {'A':['a','b','c','d','e'],'no.':[4,3,2,1,0]}
df2 = pd.DataFrame(df2, columns=['A','no.'])
Convert the values to a NumPy array, then use indexing to reverse the order:
df1['no.'] = df1['no.'].to_numpy()[::-1]
print (df1)
A no.
0 a 4
1 b 3
2 c 2
3 d 1
4 e 0
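The round trip through NumPy is what makes this work: assigning a reversed Series directly is a no-op, because pandas realigns values by index on assignment. A minimal sketch of the pitfall:
# No-op: the reversed Series keeps its old index labels,
# so alignment puts every value right back where it was.
df1['no.'] = df1['no.'][::-1]

# Works: to_numpy() drops the index, so positions are reversed for real.
df1['no.'] = df1['no.'].to_numpy()[::-1]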

Pandas vectorization for a multiple data frame operation

I am looking to increase the speed of an operation within pandas and I have learned that it is generally best to do so via using vectorization. The problem I am looking for help with is vectorizing the following operation.
Setup:
df1 = a table with a date-time column, and city column
df2 = another (considerably larger) table with a date-time column, and city column
The Operation:
for i, row in df2.iterrows():
    for x, row2 in df1.iterrows():
        if row['date-time'] - row2['date-time'] > pd.Timedelta('8 hours') and row['city'] == row2['city']:
            df2.at[i, 'result'] = True
            break
As you might imagine, this operation is insanely slow on any dataset of a decent size. I am also just beginning to learn pandas vector operations and would like some help figuring out a more optimal way to solve this problem.
I think what you need is merge() with numpy.where() to achieve the same result.
Since you don't have a reproducible sample in your question, kindly consider this:
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame({'time':[24,20,15,10,5], 'city':['A','B','C','D','E']})
>>> df2 = pd.DataFrame({'time':[2,4,6,8,10,12,14], 'city':['A','B','C','F','G','H','D']})
>>> df1
time city
0 24 A
1 20 B
2 15 C
3 10 D
4 5 E
>>> df2
time city
0 2 A
1 4 B
2 6 C
3 8 F
4 10 G
5 12 H
6 14 D
From what I understand, you only need to get the rows in your df2 that have a matching value in the city column of df1, where the difference in the times is strictly greater than 8 hours.
To do that, we need to merge on your city column:
>>> new_df = df2.merge(df1, how = 'inner', left_on = 'city', right_on = 'city')
>>> new_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
3 14 D 10
time_x basically is the time in your df2 dataframe, and time_y is from your df1.
Now we need to check the difference of those times and retain the rows where it is greater than 8, using numpy.where() to flag them so we can filter later:
>>> new_df['flag'] = np.where(new_df['time_y'] - new_df['time_x'] > 8, 'Retain', 'Remove')
>>> new_df
time_x city time_y flag
0 2 A 24 Retain
1 4 B 20 Retain
2 6 C 15 Retain
3 14 D 10 Remove
Now that you have that, you can simply filter new_df by the flag column, dropping the flag column from the final output as such:
>>> final_df = new_df[new_df['flag'].isin(['Retain'])][['time_x', 'city', 'time_y']]
>>> final_df
time_x city time_y
0 2 A 24
1 4 B 20
2 6 C 15
And there you go, no looping needed. Hope this helps :D
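One more note: if you need to keep every row of df2 and just flag it (as the original loop does with df2['result']), a sketch on the same toy frames is a merge followed by an index lookup. The subtraction below follows the direction used above (df1's time minus df2's time); flip it if you need the original loop's df2-minus-df1:
# Pair each df2 row with the matching-city rows of df1, keeping
# df2's original row labels in the 'index' column.
merged = df2.reset_index().merge(df1, on='city', suffixes=('_2', '_1'))
# Row labels of df2 where at least one pairing passes the threshold.
ok = merged.loc[merged['time_1'] - merged['time_2'] > 8, 'index']
df2['result'] = df2.index.isin(ok)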

How to append data from different data frames in Python?

I have about 20 data frames, all having the same columns, and I would like to add their data into one empty data frame. But when I use my code
interested_freq
UPC CPC freq
0 136.0 B64G 2
1 136.0 H01L 1
2 136.0 H02S 1
3 244.0 B64G 1
4 244.0 H02S 1
5 257.0 B64G 1
6 257.0 H01L 1
7 312.0 B64G 1
8 312.0 H02S 1
list_of_lists = []
max_freq = df_interested_freq[df_interested_freq['freq'] == df_interested_freq['freq'].max()]
for row, cols in max_freq.iterrows():
    interested_freq = df_interested_freq[df_interested_freq['freq'] != 1]
    interested_freq
    list_of_lists.append(interested_freq)
list_of_lists
to append the first data frame, and then I change the names in that code, hoping that it will append more data:
list_of_lists = []
for row, cols in max_freq.iterrows():
    interested_freq_1 = df_interested_freq_1[df_interested_freq_1['freq'] != 1]
    interested_freq_1
    list_of_lists.append(interested_freq_1)
list_of_lists
but the first data frame disappears and only the most recently appended data shows. Have I done something wrong?
One way to create a new DataFrame from an existing DataFrame is to use df.copy() (see the pandas documentation for details).
df.copy() is very relevant here: without it, changing a subset of the data within the new dataframe will change the initial DataFrame, so you have a fair chance of losing your actual DataFrame.
Suppose the example DataFrame is df1:
>>> df1
col1 col2
1 11 12
2 21 22
Solution: you can use the df.copy() method as follows, which will carry the data along.
>>> df2 = df1.copy()
>>> df2
col1 col2
1 11 12
2 21 22
In case you need the new dataframe (df2) to be created with the same shape as df1 but don't want df1's values inserted into it, you have the option to use the reindex_like() method.
>>> df2 = pd.DataFrame().reindex_like(df1)
# df2 = pd.DataFrame(data=np.nan,columns=df1.columns, index=df1.index)
>>> df2
col1 col2
1 NaN NaN
2 NaN NaN
Why do you use append here? It's not a list. Once you have the first dataframe (called df1, for example), try:
new_df = df1
new_df = pd.concat([new_df, df2])
You can do the same thing for all 20 dataframes.
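With that many frames, the usual pattern is to collect them in a list and concatenate once at the end, which is much cheaper than growing a dataframe inside a loop. A sketch, assuming your frames are named df1 through df20:
# Gather all frames first, then concatenate in a single call.
frames = [df1, df2, df3]  # ...extend with the remaining dataframes
combined = pd.concat(frames, ignore_index=True)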

Stacking columns one below another when the column names are the same

I have a huge data set in a pandas data frame. It looks something like this:
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values are one below another; likewise for the last two. The dataframe should look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat to stack them. I also tried the stack function, apart from concat. What can I do?
Use groupby + cumcount to create a pd.MultiIndex. Reassign the columns with the new pd.MultiIndex and stack:
df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])

df1 = df.copy()
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
Or with a bit of creativity, in one line
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, then count entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd

df = pd.DataFrame(
    [[1,2,3,4],[31,14,13,11],[115,613,1313,1]],
    columns=['c1','c1','c2','c2'])

df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n','column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
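For completeness, a shorter sketch that produces the same column-stacked order, relying on the fact that selecting a duplicated label returns every column with that name:
# df[name] returns all columns named `name`; ravel(order='F')
# flattens them column by column, i.e. one below the other.
out = pd.DataFrame({name: df[name].to_numpy().ravel(order='F')
                    for name in df.columns.unique()})
print(out)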
