Appending a column to a list of DataFrames in Python

I have a list of DataFrames, and I would like to iterate over the list and insert a column into each DataFrame based on an array.
Below is a small example that I have created for illustrative purposes. I would do this manually if it were only 4 DataFrames, but my dataset is much larger:
import pandas as pd

# Create DataFrames
df1 = pd.DataFrame(list(range(0, 10)))
df2 = pd.DataFrame(list(range(10, 20)))
df3 = pd.DataFrame(list(range(20, 30)))
df4 = pd.DataFrame(list(range(30, 40)))
# Create list of DataFrames
listed_dfs = [df1, df2, df3, df4]
# Create list of dates
Dates = ['2015-05-15', '2015-02-17', '2014-11-14', '2014-08-14']
# Objective: sequentially append each instance of "Dates" to a new column in each DataFrame
# First, create a list of locations to iterate over
locations = [0, 1, 2, 3]
# Second, create a for loop to iterate over [Need help here]
# Example: for the 1st DataFrame in the list, add a column 'Date' that
# has the 1st instance of the 'Dates' list for every row,
# then for the 2nd DataFrame in the list, add the 2nd instance of 'Dates' for every row
for i in Dates:
    for a in locations:
        listed_dfs[a]['Date'] = i
print(listed_dfs)
The problem with the above for loop is that the inner loop writes the current date to every DataFrame on each pass, so earlier dates are overwritten and all four DataFrames end up with only the last date.
Desired Output from for loop:
listed_dfs[0]['Date'] = Dates[0]
listed_dfs[1]['Date'] = Dates[1]
listed_dfs[2]['Date'] = Dates[2]
listed_dfs[3]['Date'] = Dates[3]
pd.concat(listed_dfs)

Change your for loop to:
for i, j in zip(Dates, locations):
    listed_dfs[j]['Date'] = i
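As a quick check, this minimal, self-contained sketch reproduces the sample data above and shows that each DataFrame gets exactly one date:
import pandas as pd

listed_dfs = [pd.DataFrame(list(range(i, i + 10))) for i in (0, 10, 20, 30)]
Dates = ['2015-05-15', '2015-02-17', '2014-11-14', '2014-08-14']
locations = [0, 1, 2, 3]

for i, j in zip(Dates, locations):
    listed_dfs[j]['Date'] = i

# each DataFrame now carries a single, distinct date
print(pd.concat(listed_dfs)['Date'].unique())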

Going with your desired output:
listed_dfs[0]['Date'] = Dates[0]
listed_dfs[1]['Date'] = Dates[1]
listed_dfs[2]['Date'] = Dates[2]
listed_dfs[3]['Date'] = Dates[3]
pd.concat(listed_dfs)
Notice that the position is the same on both sides of each line: 0 and 0, 1 and 1, and so on. That's essentially what you need.
for i in range(len(Dates)):
    listed_dfs[i]['Date'] = Dates[i]
pd.concat(listed_dfs)
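One small follow-up: pd.concat(listed_dfs) keeps each DataFrame's original 0-9 index, so index values repeat in the combined frame; pass ignore_index=True if you want the result renumbered from 0.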

If I have understood it well, the problem is that you are overwriting the 'Date' column in all four DataFrames on each iteration over Dates. A solution is a single for loop like this:
for a in locations:
    listed_dfs[a]['Date'] = Dates[a]

If, as in your example, you loop through your dataframes sequentially, you can zip dataframes and dates as below.
for df, date in zip(listed_dfs, Dates):
    df['Date'] = date
This removes the need for the locations list.
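If you still want the positional index as well, enumerate (a small variant, not from the original answers) gives you both at once:
for i, df in enumerate(listed_dfs):
    df['Date'] = Dates[i]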

Related

What is the best way to combine dataframes that have been created through a for loop?

I am trying to combine dataframes with 2 columns into a single dataframe. The initial dataframes are generated through a for loop and stored in a list. I am having trouble getting the data from the list of dataframes into a single dataframe. Right now when I run my code, it treats each full dataframe as a row.
def linear_reg_function(category):
    df = pd.read_csv(file)
    df = df[df['category_column'] == category]
    df1 = df[['category_column', 'value_column']]
    df_export.append(df1)

df_export = []
for category in category_list:
    linear_reg_function(category)
When I run this block of code I get a list of DataFrames that each have 2 columns. When I try to convert df_export to a DataFrame, it ends up with 12 rows (the number of categories in category_list). I tried:
df_export = pd.DataFrame()
but the result was not what I wanted.
I would like to have a single DataFrame with 2 columns, [Category, Value], that includes the values of all 12 categories generated in the for loop.
You can use pd.concat to merge a list of DataFrames into a single big DataFrame.
import glob
import pandas as pd

appended_data = []
for infile in glob.glob("*.xlsx"):
    data = pd.read_excel(infile)
    # store each DataFrame in the list
    appended_data.append(data)

# see the pd.concat documentation for more info
appended_data = pd.concat(appended_data)
# write the combined DataFrame to an Excel sheet
appended_data.to_excel('appended.xlsx')
You can then adapt this to your own requirements.
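Applied to the function from the question, a minimal sketch (assuming file and category_list are defined as in the original code) returns each filtered frame and concatenates once at the end:
import pandas as pd

def linear_reg_function(category):
    # `file` and the column names are taken from the question
    df = pd.read_csv(file)
    df = df[df['category_column'] == category]
    return df[['category_column', 'value_column']]

# one DataFrame per category, combined row-wise into a single frame
df_export = pd.concat([linear_reg_function(c) for c in category_list], ignore_index=True)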

How to convert a list of lists to a DataFrame, keeping the grouping value of another column?

I have a df with a column in which each row is a list, and another column with the name of that list.
I want to convert the elements of each list into separate rows while keeping the list name in another column.
I have done this using the below code:
data = []
group = []
for i in df.index:
    for j in df.Data.loc[i]:
        data.append(j)
    for group_data in range(len(df.Data.loc[i])):
        group.append(df.Group.loc[i])
Is there a more elegant way or an inbuilt function for this in Python?
This should work:
df = df.explode('Data').reset_index(drop=True)
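As a quick illustration on toy data (assumed here, since the original example tables are not shown):
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'B'], 'Data': [[1, 2, 3], [4, 5]]})
df = df.explode('Data').reset_index(drop=True)
print(df)
#   Group Data
# 0     A    1
# 1     A    2
# 2     A    3
# 3     B    4
# 4     B    5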
The simplest way is to build a list of rows and then rebuild the DataFrame from it:
df = df.reset_index()  # make sure indexes pair with number of rows
newlist = []
for index, row in df.iterrows():
    # each entry in the 'Data' column is a list; emit one output row per element
    for value in row['Data']:
        newlist.append([row['Group'], value])
df = pd.DataFrame(newlist, columns=['Group', 'Data'])

How to append several rows to an existing pandas DataFrame, with the number of rows depending on a list comprehension

I am trying to fill an existing DataFrame in pandas by adding several rows at a time; the number of rows depends on a list comprehension, so it is variable. The initial DataFrame is filled as follows:
import pandas as pd
import portion as P

columns = ['chr', 'Start', 'End', 'type']
x = pd.DataFrame(columns=columns)
RANGE = [(212, 222), (866, 888), (152, 158)]
INTERVAL = P.Interval(*[P.closed(x, y) for x, y in RANGE])

def fill_df(df, junction, chr, type):
    df['Start'] = [x.lower for x in junction]
    df['End'] = [x.upper for x in junction]
    df['chr'] = chr
    df['type'] = type
    return df

z = fill_df(x, INTERVAL, 1, 'DUP')
The idea is to keep appending rows to the existing DataFrame from different intervals, so the number of rows is variable.
I have found different ways to add several rows, but none of them are easy to apply unless I write a function to convert my data into tuples or lists, and I am not sure that would be efficient. I have also tried pandas append, but I was not able to make it work for a batch of rows.
Is there any simple way to do this?
Thanks a lot!
Have you tried wrapping the list comprehension in pd.Series?
df['Start.pos'] = pd.Series([x.lower for x in junction])
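The reason this helps (a small sketch with toy data, not from the original answer): assigning a plain list whose length differs from the frame raises a ValueError, while a Series aligns on the index and pads the missing rows with NaN:
import pandas as pd

df = pd.DataFrame({'chr': [1, 1, 1, 1]})
# df['Start'] = [212, 866, 152]           # ValueError: length mismatch
df['Start'] = pd.Series([212, 866, 152])  # aligns on index; row 3 becomes NaN
print(df)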
If you want to use append and add several elements at once, you can create a second DataFrame and simply append it to the first one. It looks like this:
import intvalpy as ip
import pandas as pd

inf = [1, 2, 3]
sup = [4, 5, 6]
intervals = ip.Interval(inf, sup)
add_intervals = ip.Interval([-10, -20], [10, 20])

df = pd.DataFrame(data={'start': intervals.a, 'end': intervals.b})
df2 = pd.DataFrame(data={'start': add_intervals.a, 'end': add_intervals.b})
# note: DataFrame.append was removed in pandas 2.0; pd.concat([df, df2], ignore_index=True) is the modern equivalent
df = df.append(df2, ignore_index=True)
print(df.head(10))
The intvalpy library, which specializes in classical and full interval arithmetic, is used here. To create an interval or intervals, use the Interval function, where the first argument is the left end and the second is the right end of the intervals.
The ignore_index parameter lets the indexing of the first table continue across the appended rows.
In case you want to add one line, you can do it as follows:
for k in range(len(intervals)):
    df = df.append({'start': intervals[k].a, 'end': intervals[k].b}, ignore_index=True)
print(df.head(10))
I purposely did it with a loop to show that you can manage without creating a second table when you only want to add a few rows.
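Since DataFrame.append was removed in pandas 2.0, the same idea can be sketched with pd.concat, reusing the intervals and df variables from above:
import pandas as pd

# build all the new rows at once, then concatenate; works on pandas >= 2.0
rows = pd.DataFrame({'start': [intervals[k].a for k in range(len(intervals))],
                     'end': [intervals[k].b for k in range(len(intervals))]})
df = pd.concat([df, rows], ignore_index=True)
print(df.head(10))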

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 CSVs (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers to this, but they still do not give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work: it works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new])  # concatenate the DataFrames
df = df.reset_index(drop=True)  # reset the index
df_gpby = df.groupby(list(df.columns))  # group by all columns
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # indices of rows that appear only once
final = df.reindex(idx)
This takes specific columns as input and also outputs only those specific columns. I want the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
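A related variant (not from the answers above, so treat it as a sketch with hypothetical column names) uses merge's indicator flag to keep the whole rows of B that have no match in A on the chosen key columns:
import pandas as pd

columns = ['column1', 'column2']  # hypothetical key columns to compare on
merged = B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)
only_in_B = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')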

Select a specific slice of data from a main dataframe, conditional on a value in a main dataframe column

I have a main dataframe (df) with a Date column (non-index), a column 'VXX_Full' with values, and a 'signal' column.
I want to iterate through the signal column, and whenever it is 1, I want to capture a slice (20 rows before, 40 rows after) of the 'VXX_Full' column and create a new DataFrame with all the slices. I would like each column name of the new DataFrame to be the row number of the original DataFrame.
import numpy as np
import pandas as pd

VXX_signal = pd.DataFrame(np.zeros((60, 0)))
counter = 1
for row in df.index:
    if df.loc[row, 'signal'] == 1:
        add_row = df.loc[row - 20:row + 20, 'VXX_Full']
        VXX_signal[counter] = add_row
        counter += 1
VXX_signal
It just doesn't seem to be working. It creates a DataFrame, but the values are all NaN. The first slice at least appears to pull data from the main df, although the data doesn't correspond to the correct location. The remaining columns (there are 30 signals, so 30 columns are created) in the new df are all NaN.
Thanks in advance!
I'm not sure about your current code - but basically all you need is a list of ranges of indexes. If your index is linear, this would be something like:
indexes = list(df[df.signal == 1].index)
ranges = [(i, list(range(i - 20, i + 21))) for i in indexes]  # tuples of (original index, surrounding range)
dfs = [df.loc[i[1]].copy().rename(columns={'VXX_Full': i[0]}).reset_index(drop=True)
       for i in ranges]
# EDIT: for only the VXX_Full column:
dfs = [df.loc[i[1], ['VXX_Full']].copy().rename(columns={'VXX_Full': i[0]}).reset_index(drop=True)
       for i in ranges]
# Here we take the -20:+20 slice of df as a separate DataFrame,
# rename 'VXX_Full' to the original index value, and reset the index to 0:40.
# The new index is useful when putting all the columns next to each other.
So we made a list of indexes with signal == 1, turned it into a list of ranges and finally a list of dataframes with reset index.
Now we want to merge it all together:
from functools import reduce
merged_df = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True), dfs)
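Each range covers i-20 to i+20, so every column in merged_df holds 41 values, is named after its original signal row, and lines up with the others via the shared 0:40 index created by reset_index.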
I would build the resulting DataFrame from a dictionary of lists:
resul = pd.DataFrame({i: df.loc[i - 20 if i >= 20 else 0: i + 40 if i <= len(df) - 40 else len(df),
                                'VXX_FULL'].values
                      for i in df.loc[df.signal == 1].index})
The trick is that .values extracts a NumPy array with no associated index.
Beware: the code above assumes that the index of the original dataframe is just the row number. Use reset_index first if it is different.
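A toy run of the dictionary approach (synthetic data, with signals placed far enough from the edges that every window is complete):
import numpy as np
import pandas as pd

df = pd.DataFrame({'VXX_FULL': np.arange(100.0), 'signal': 0})
df.loc[[30, 50], 'signal'] = 1  # two hypothetical signal rows

resul = pd.DataFrame({i: df.loc[i - 20: i + 40, 'VXX_FULL'].values
                      for i in df.loc[df.signal == 1].index})
print(resul.shape)  # (61, 2): 20 rows before + the signal row + 40 rows after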
