I am calculating heat decay from spent fuel rods using variable cooling times. How can I create multiple dataframes by varying the cooling-time column with a for loop, and then write each one to a file?
Using datetime objects, I create multiple columns of cooling-time values by subtracting the date each fuel rod was discharged from a future date.
I then tried to use a for loop to index these columns into a new dataframe, with the intent of streamlining multiple file writes by using the newly created dataframes in a new function.
import datetime as dt
import pandas as pd

df = pd.read_excel('data')
df.columns = ['ID', 'Enr', 'Dis', 'Mtu']

# Future dates to measure cooling time against
_0 = dt.datetime(2020, 12, 1)
_1 = dt.datetime(2021, 6, 1)
_2 = dt.datetime(2021, 12, 1)
_3 = dt.datetime(2022, 6, 1)

# Variable cooling-time columns: years elapsed between discharge and each date
df['Ct_0[Years]'] = df['Dis'].apply(lambda x: (_0 - x).days / 365)
df['Ct_1[Years]'] = df['Dis'].apply(lambda x: (_1 - x).days / 365)
df['Ct_2[Years]'] = df['Dis'].apply(lambda x: (_2 - x).days / 365)
df['Ct_3[Years]'] = df['Dis'].apply(lambda x: (_3 - x).days / 365)

# Attempting to index columns into a new dataframe
for i in range(4):
    df = df[['ID', 'Mtu', 'Enr', 'Ct_%i[Years]' % i]]
    tfile = open('Inventory_FA_%s.prn' % i, 'w')
    ### Apply conditions for flagging
    tfile.close()
I was expecting the created cooling-time columns to be indexed into the newly defined dataframe df. Instead, I received the following error:
KeyError: "['Ct_1[Years]'] not in index"
Thank you for the help.
You are overwriting your dataframe on each iteration of your loop with the line:
df = df[['ID','Mtu','Enr','Ct_%i[Years]'%i]]
which is why your first iteration is fine (the error says nothing about 'Ct_0[Years]' not being in the index) and your second iteration dies: you've dropped everything except the columns you selected on the first pass. Select your columns into a temporary dataframe instead:
for i in range(4):
    df_temp = df[['ID', 'Mtu', 'Enr', 'Ct_%i[Years]' % i]]
    tfile = open('Inventory_FA_%s.prn' % i, 'w')
    ### Apply conditions for flagging using df_temp
    tfile.close()
Depending on what your conditions are, there might be a better way to do this that doesn't require making a temporary view into the dataframe, but this should help.
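For example, a minimal sketch of the flag-and-write step (the five-year cutoff is a hypothetical stand-in for your actual flagging conditions, and to_string is just one way to produce a fixed-width .prn file):
for i in range(4):
    col = 'Ct_%i[Years]' % i
    df_temp = df[['ID', 'Mtu', 'Enr', col]]
    # hypothetical flag: keep only assemblies cooled at least 5 years
    flagged = df_temp[df_temp[col] >= 5]
    with open('Inventory_FA_%i.prn' % i, 'w') as tfile:
        tfile.write(flagged.to_string(index=False))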
Why are you creating a new dataframe? Is it only to reorganize/drop columns? Engineero is right: you are effectively rewriting df on each iteration.
Anyway, you could try:
dfnew = df[['ID', 'Mtu', 'Enr']]
for i in range(4):
    dftemp = df[['Ct_%i[Years]' % i]]
    dfnew = dfnew.join(dftemp)  # join returns a new frame, so reassign it
Related
I am trying to find the fastest way to shift a column based on a condition on another column.
For example, given an input of (Name, value) rows, the expected output adds a 'final' column holding each name's next value, with 'Exit' for the last row of each name (the example tables from the post are not reproduced here).
Any help is greatly appreciated.
I have tried running through each unique name with a for loop, performing shift(-1) on the dataframe and appending the result to a new dataframe; a code example is below. But there are over 1M rows and this takes a lot of time to compute.
Assuming df2 is our sorted dataframe:
df3 = pd.DataFrame()
for i in df2['Name'].unique():
    if df3.size == 0:
        # first group: start df3 from it
        df3 = df2.loc[df2['Name'] == i]
        df3['final'] = df3['value'].shift(-1)
        df3.fillna('Exit', inplace=True)
    else:
        # later groups: shift, then append onto df3
        df4 = df2.loc[df2['Name'] == i]
        df4['final'] = df4['value'].shift(-1)
        df4.fillna('Exit', inplace=True)
        df3 = df3.append(df4, ignore_index=True)
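A vectorized sketch of the same per-name shift (assuming df2 is already sorted, with columns 'Name' and 'value') avoids the Python-level loop entirely:
# shift within each Name group in one pass; no per-group loop or append
df2['final'] = df2.groupby('Name')['value'].shift(-1)
df2['final'] = df2['final'].fillna('Exit')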
I have a list of multiple dataframes of cryptocurrency data. I want to apply a function to all of these dataframes, which should trim each one so that I am only left with data from 2021.
The list and the function look like this:
dataframe_list = [bitcoin, aave, binance, cardano, chainlink, cosmos, crypto_com, dogecoin, eos, ethereum, iota, litecoin, monero, nem, polkadot, solana, stellar, tether, uniswap, usdcoin, wrapped, xrp]
def date_func(i):
    # parse dates, index and sort by them, then keep rows from 2021 onward
    i['Date'] = pd.to_datetime(i['Date'])
    i = i.set_index(i['Date'])
    i = i.sort_index()
    i = i['2021-01-01':]
    return i

for dataframe in dataframe_list:
    dataframe = date_func(dataframe)
However, I am only left with one data frame called 'dataframe', which only contains values of the xrp dataframe.
I would like to have a new dataframe from each dataframe, called aave21, bitcoin21 .... which only contains values from 2021 onwards.
What am I doing wrong?
Best regards and thanks in advance.
You are overwriting dataframe when iterating over dataframe_list, i.e. you only keep the latest dataframe.
You can either try:
dataframe = pd.DataFrame()
for df in dataframe_list:
    dataframe = dataframe.append(date_func(df))  # append returns a new frame
Or, shorter:
dataframe = pd.concat([date_func(df) for df in dataframe_list])
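If you really want one named frame per coin (aave21, bitcoin21, ...), a dict keyed by coin name is a common alternative to creating many variables; a sketch (the names list is illustrative and mirrors dataframe_list):
names = ['bitcoin', 'aave', 'binance']  # ...one name per entry in dataframe_list
frames21 = {name + '21': date_func(df) for name, df in zip(names, dataframe_list)}
frames21['bitcoin21'].head()  # each trimmed frame stays addressable by name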
You are overwriting the dataframe variable in your for loop when iterating over dataframe_list. You need to keep appending the results into a new variable:
final_df = pd.DataFrame()
for dataframe in dataframe_list:
    final_df = final_df.append(date_func(dataframe))
print(final_df)
I am trying to fill an existing dataframe in pandas by adding several rows at a time; the number of rows depends on a comprehension list, so it is variable. The initial dataframe is filled as follows:
import pandas as pd
import portion as P

columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)

RANGE = [(212, 222), (866, 888), (152, 158)]
INTERVAL = P.Interval(*[P.closed(a, b) for a, b in RANGE])

def fill_df(df, junction, chr, type):
    # one row per atomic interval in the junction
    df['Start'] = [i.lower for i in junction]
    df['End'] = [i.upper for i in junction]
    df['chr'] = chr
    df['type'] = type
    return df

z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the dataframe from different intervals, so a variable number of rows each time. Here I have found different ways to add several rows, but none of them is easy to apply unless I write a function to convert my data into tuples or lists, which I am not sure would be efficient. I have also tried pandas append, but I was not able to make it work for a bunch of lines.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
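If whole batches of rows need to be appended rather than one column, one sketch is to build a small frame per interval batch and concat it onto the existing one (new_interval, the chr value 2, and the 'DEL' type are illustrative):
def interval_rows(junction, chr, type):
    # one row per atomic interval, same columns as the existing frame
    return pd.DataFrame({'chr': chr,
                         'Start': [i.lower for i in junction],
                         'End': [i.upper for i in junction],
                         'type': type})

new_interval = P.closed(300, 350) | P.closed(400, 420)
z = pd.concat([z, interval_rows(new_interval, 2, 'DEL')], ignore_index=True)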
If you want to use append and add several elements at once, you can create a second DataFrame and simply append it to the first one. That looks like this:
import intvalpy as ip
import pandas as pd
inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10,20])
df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, which specializes in classical and full interval arithmetic, is used here. To create an interval or intervals, use the Interval function, where the first argument is the left endpoint and the second is the right endpoint of the intervals.
The ignore_index parameter lets the row index of the first table continue across the appended rows.
If you want to add rows one at a time instead, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can do without creating a second table if you only want to add a few rows.
This post mentions the symmetric difference, leveraging df1.except(df2).union(df2.except(df1)) and/or df1.unionAll(df2).except(df1.intersect(df2)), but I'm getting syntax errors when using except.
I'm trying to compare two dataframes that can have 50 or more columns. I have the working code below, but I need to avoid hard-coding the columns.
Sample code and example:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count, lit, when

# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])

# Tag each row with its source dataframe
df1 = df1.withColumn('FLAG', lit('DF1'))
df2 = df2.withColumn('FLAG', lit('DF2'))

# Concatenate the two dataframes into one big dataframe
df = df1.union(df2)

# If the count of identical rows over the window is more than 1, mark FLAG as
# SAME; otherwise keep the source tag. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1), 'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
You can get all column names from df and use that list as the parameter for the Window function:
cols = df.columns
cols.remove('FLAG')
my_window = Window.partitionBy(cols).rowsBetween(-sys.maxsize, sys.maxsize)
The remaining code stays unchanged.
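As a side note on the syntax errors mentioned in the question: except is a reserved keyword in Python, so the Scala-style df1.except(df2) cannot be called directly; PySpark exposes the same operation as subtract. A sketch of the symmetric difference (applied to the original frames, before the FLAG column is added):
# rows appearing in exactly one of the two dataframes (distinct rows only)
diff = df1.subtract(df2).union(df2.subtract(df1))
diff.show()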
I want to create a function which will give the output of df.describe for every dataframe passed to the function argument.
My idea was to store the names of all the dataframes I need to describe as columns in a separate dataframe (x), and then pass this to the function.
Here is what I have made, and the output. The problem is that it only shows the description of one dataframe:
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True,
                                        format='%d%m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
    return column.describe()

data = {'UGCAA': [], 'FAPG1': [], 'ACSO5': [], 'LGHF2': [], 'LGMP8': [], 'GGAF1': []}
df = pd.DataFrame(data)
df
des(df)
Sales
count 948.000000
mean 876.415612
std 874.373236
min 1.000000
25% 298.750000
50% 619.500000
75% 1148.500000
max 7345.000000
I believe you can create a list of DataFrames and concat them together at the end:
def des(df):
    dfs = []
    for column in df.columns:
        df1 = pd.read_csv('SKUs\\' + column + '.csv')
        df1['Date'] = pd.to_datetime(df1['Date'].astype(str),
                                     format='%d%m%y', infer_datetime_format=True)
        df1.dropna(inplace=True)
        dfs.append(df1.describe())
    return pd.concat(dfs, axis=1, keys=df.columns)
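Called on the same df of SKU names, this returns one describe() block per SKU side by side, keyed by the column names:
summary = des(df)
print(summary)  # stats grouped under 'UGCAA', 'FAPG1', ... one block per SKU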
It's because you are looping over and resetting column each time while only returning one result. To see it, you can print the describe in each loop iteration, or store the results together in one variable and handle them after the loop.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True,
                                        format='%d%m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
        print(column.describe())
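For instance, a sketch of the store-them-together variant (a dict keyed by SKU name, so nothing is lost between iterations):
def des(df):
    results = {}
    for column in df.columns:
        sku = pd.read_csv('SKUs\\' + column + '.csv')
        sku['Date'] = pd.to_datetime(sku['Date'].astype(str), dayfirst=True, format='%d%m%y')
        sku.dropna(inplace=True)
        results[column] = sku.describe()
    return results  # one describe() per SKU, keyed by name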