Pandas create new column every time function runs - python

These are the tickers and stats I want to pull from data.csv:
tickers = ['ACOR', 'ACM', 'ACLS', 'ACND', 'ACMR']
stats = ['mkt_cap', 'price', 'change']
This code creates a csv file for each stat in the assets directory:
import datetime as dt
import pandas as pd

date = str(dt.date.today())
for stat in stats:
    df = pd.read_csv('data.csv')
    df.set_index('ticker', inplace=True)
    df = df.loc[tickers, [stat]]
    df.rename(columns={stat: date}, inplace=True)
    df.to_csv('assets/{}.csv'.format(stat))
Here is price.csv:
ticker  2019/07/04
ACOR    7.42
ACM     37.33
...     ...
The problem is that I need a new column to be created every time this script runs, with the current date as the header. data.csv is updated every day, and I would like to append the new data to mkt_cap.csv, price.csv and change.csv under the new date. The updated price.csv would look like:
ticker  2019/07/04  2019/07/05
ACOR    7.42        XXX
ACM     37.33       XXX
...     ...         ...
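For what it's worth, here is a minimal sketch of one way to get that behaviour. It assumes data.csv has a ticker column plus one column per stat, and that the assets directory already exists; it joins today's column onto whatever history is already on disk:

import datetime as dt
import pandas as pd

tickers = ['ACOR', 'ACM', 'ACLS', 'ACND', 'ACMR']
stats = ['mkt_cap', 'price', 'change']

date = str(dt.date.today())
data = pd.read_csv('data.csv').set_index('ticker')

for stat in stats:
    # Today's values for this stat, with the date as the column header
    new_col = data.loc[tickers, [stat]].rename(columns={stat: date})
    try:
        # Join today's column onto the existing history, aligned on ticker
        history = pd.read_csv('assets/{}.csv'.format(stat), index_col='ticker')
        history = history.join(new_col)
    except FileNotFoundError:
        # First run: today's column becomes the whole file
        history = new_col
    history.to_csv('assets/{}.csv'.format(stat))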
EDIT:
date = str(dt.date.today())
for stat in stats:
    df = pd.read_csv('data.csv')
    df.set_index('ticker', inplace=True)
    df = df.loc[tickers, [stat]]
    df.rename(columns={stat: date}, inplace=True)
    df.to_csv('assets/{}.csv'.format(stat))

for col in stats.columns:
    stats["{}-{}".format(dt.date.today(), col)] = stats[col]
dataframes = []
for datapoint in stats.columns[-5:-1]:
    dataframes.append(stats[[datapoint, "ticker"]])
for dff in dataframes:
    dff.to_csv('assets/{}.csv'.format(dff.columns[1]))

import datetime as dt
import pandas as pd

# Build a small example frame
list1 = list(range(10))
df = pd.DataFrame()
df["col1"] = list1
df['col2'] = df['col1'] + 5

def new_col(df):
    # Add a column keyed by the current timestamp
    df[dt.datetime.now()] = df['col1'] + df['col2']
    return df

new_col(df)
This will create a new column, named with the datetime at which the function is run, each time the function is called. I'm not entirely sure what arithmetic you want in the new column, but this should do the trick as far as creating the column goes.
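As a quick illustration (a self-contained sketch reusing the new_col function defined above), calling it twice adds two separately timestamped columns:

import datetime as dt
import time
import pandas as pd

df = pd.DataFrame({'col1': range(10)})
df['col2'] = df['col1'] + 5

new_col(df)
time.sleep(1)           # ensure the second timestamp differs from the first
new_col(df)
print(len(df.columns))  # 4: col1, col2 and two timestamped sum columns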

for col in acor.columns:  # or you could just use your stat list
    acor["{}-{}".format(dt.datetime.now(), col)] = acor[col]

# Separate into individual dataframes
dataframes = []
for datapoint in acor.columns[-5:-1]:
    dataframes.append(acor[[datapoint, "timestamp"]])  # you probably want to replace "timestamp" with "symbol" or "ticker"

# Finally, save the dataframes by date and stat
for dff in dataframes:
    dff.to_csv("{}.csv".format(dff.columns[1]))

Related

How to append a new record each day as a new row in a pandas dataframe?

I have a dataframe which contains an aggregated value up to today. My data stream updates every day, and I want to monitor the change each day. How can I append each new day's data to the dataframe?
The format will be like this:
Date        Agg_Value
2022-12-07  0.43
2022-12-08  0.44
2022-12-09  0.41
...         ...
You want to use pandas.concat to bring together two dataframes:
import pandas as pd
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
print(pd.concat([s1, s2]))
Assuming that you always have yesterday's dataframe available, you can simply execute a script that runs every day to get today's date and concatenate the result.
# Assuming you have today's value
today_value = 42

# Get today's date
from datetime import date
today = date.today()

# Create a new dataframe with today's value
df1 = pd.DataFrame(
    {
        'Date': [str(today)],
        'Agg_value': [today_value]
    }
)

# Update df by concatenating today's data
df = pd.concat([df, df1], axis=0)
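Putting it together, a sketch of the daily script with CSV persistence might look like this (agg_values.csv and today_value are placeholders for your own pipeline):

import pandas as pd
from datetime import date

CSV_PATH = 'agg_values.csv'  # hypothetical file holding the history
today_value = 42             # placeholder for however you compute today's value

df = pd.read_csv(CSV_PATH)   # yesterday's dataframe
df1 = pd.DataFrame({'Date': [str(date.today())], 'Agg_value': [today_value]})
df = pd.concat([df, df1], ignore_index=True)
df.to_csv(CSV_PATH, index=False)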
There are multiple ways.
Creating an example dataframe:
import pandas as pd
import numpy as np
Date = pd.date_range("2022-12-01", periods=4, freq="D")
Avg_value = np.random.rand(4)
df = pd.DataFrame(list(zip(Date, Avg_value)), columns=['Date','Avg_value'])
df
which displays a dataframe with four dated rows of random values.
Using .loc
Now add a single row with:
df.loc[len(df.index)] = [pd.Timestamp("2022-12-05"), 0.67]
which appends the new row at the end of the dataframe.
Using append:
new_row = {'Date': pd.Timestamp("2022-12-06"), 'Avg_value': 0.39}
df = df.append(new_row, ignore_index=True)
df
The concat example is explained in Arturo Sbr's answer.
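One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row is added with pd.concat, for example:

# Equivalent of the append call above, for pandas 2.0+
new_row = {'Date': pd.Timestamp("2022-12-06"), 'Avg_value': 0.39}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)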

Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers, cleaning the associated dataframes, then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code lets me loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop, the following code does what I'd like. But it doesn't work when looping and adding new ticker data on each successive pass (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame on every iteration via pandas.concat is slow, since you create a new DataFrame each time.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores all the ticker DataFrames. After the loop, create the master DataFrame from that list with pandas.concat:
import pandas as pd

def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
As a suggestion, here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix:
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you want, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (honestly, I'm not sure if those two parameters play well together, and I'm too lazy to test it, but you can try ;)).
import pandas as pd

def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)

master_df = pd.concat(dfs_list, axis=1)
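For the record, a quick self-contained check suggests the two parameters do play well together: the index comes back parsed as datetimes.

import io
import pandas as pd

csv = io.StringIO("Date,Col1,Col2\n2022-06-25,1,2\n2022-06-26,3,4\n")
df = pd.read_csv(csv, index_col='Date', parse_dates=['Date'])
print(df.index)  # DatetimeIndex(['2022-06-25', '2022-06-26'], dtype='datetime64[ns]', name='Date', freq=None)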
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are encapsulated by parentheses within the concat function.
Same answer as GC123, but here is a full example which mimics reading from separate files and concatenating them:
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
The output is the same table shown below.
This method is slightly more efficient, since it concatenates once at the end instead of rebuilding combined on every iteration:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01

Get affiliation information from multiple authors in a loop

Currently working with pybliometrics (Scopus), I want to create a loop that allows me to get affiliation information for multiple authors.
Basically, this is the idea of my loop. How do I do this for many authors?
from pybliometrics.scopus import AuthorRetrieval
import pandas as pd
import numpy as np
au = AuthorRetrieval(authorid)
au.affiliation_history
au.identifier
x = au.identifier
refs2 = au.affiliation_history
len(refs2)
refs2
df = pd.DataFrame(refs2)
df.columns
a_history = df
df['authorid'] = x
#moving authorid to 0
cols = list(df)
cols.insert(0, cols.pop(cols.index('authorid')))
df = df.loc[:, cols]
df.to_excel("af_historyfinal.xlsx")
Turning your code into a loop over multiple author IDs? Nothing easier than that. Let's say AUTHOR_IDS equals 7004212771 and 57209617104:
import pandas as pd
from pybliometrics.scopus import AuthorRetrieval

def retrieve_affiliations(auth_id):
    """Author's affiliation history from Scopus as DataFrame."""
    au = AuthorRetrieval(auth_id)
    df = pd.DataFrame(au.affiliation_history)
    df["auth_id"] = au.identifier
    return df

AUTHOR_IDS = [7004212771, 57209617104]

# Option 1, for few IDs
df = pd.concat([retrieve_affiliations(a) for a in AUTHOR_IDS])

# Option 2, for many IDs
df = pd.DataFrame()
for a in AUTHOR_IDS:
    df = pd.concat([df, retrieve_affiliations(a)])

# Have author ID as first column
df = df.set_index("auth_id").reset_index()
df.to_excel("af_historyfinal.xlsx", index=False)
If, say, your IDs are in a comma-separated file called "input.csv", with one column called "authors", then you start with
AUTHOR_IDS = pd.read_csv("input.csv")["authors"].unique()

How to create new csv from list of csv's in dataframe

So I know my code isn't close to right, but I am trying to loop through a list of csv's, line by line, to create a new csv where each line lists all the csv's that met a condition. The first column in all the csv's is "date"; I want to list the names of all csv's where data["entry"] > 3 on that date, with date still being the first column.
Update: What I'm trying to do is, for each csv, make a new list of each date the condition was met, and on those days append file_name to that row/rows in the new csv.
### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/')
### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/SentdexTutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
complete_string = ' is complete'
listdrs_confirmation = [x + complete_string for x in listdrs]
#print(listdrs_path)

### start loop; for each "file" in listdrs run the 2 functions below and overwrite the saved csv.
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)

    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        # Convert date to timestamp and make index
        data.index = data["date"].apply(lambda x: pd.Timestamp(x))
        data.drop("date", axis=1, inplace=True)
        return data

    ## create new table and append data
    data = data[data.Entry > 3]
    for date in data.date:
        new_table[date].append(file_path)
    new_table_data = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
    print(new_table_data)
I would do something like this. You need to modify the following snippet according to your needs.
import pandas as pd
from glob import glob
from collections import defaultdict
# create and save some random data
df1 = pd.DataFrame({'date':[1,2,3], 'entry':[4,3,2]})
df2 = pd.DataFrame({'date':[1,2,3], 'entry':[1,2,4]})
df3 = pd.DataFrame({'date':[1,2,3], 'entry':[3,1,5]})
df1.to_csv('table1.csv')
df2.to_csv('table2.csv')
df3.to_csv('table3.csv')
# read all the csv
tables = glob('*.csv')
new_table = defaultdict(list)
# create new table
for table in tables:
    df = pd.read_csv(table)
    df = df[df.entry > 2]
    for date in df.date:
        new_table[date].append(table)

new_table_df = pd.DataFrame([(k, ','.join(new_table[k])) for k in sorted(new_table.keys())], columns=['date', 'table names'])
print(new_table_df)
   date             table names
0     1   table3.csv,table1.csv
1     2              table1.csv
2     3   table2.csv,table3.csv
I had some issues with the other code; here is the final solution I was able to come up with.
if 'Entry' in data:
    ## create new table and append data
    data = data[data.Entry > 3]
    if 'date' in data:
        for date in data.date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.date == date].Entry)]}))
    elif 'Date' in data:
        for date in data.Date:
            if date not in new_table:
                new_table[date] = []
            new_table[date].append(
                pd.DataFrame({'FileName': [file_name], 'Entry': [int(data[data.Date == date].Entry)]}))

def find_max(tbl):
    new_table_data = {}
    for date in sorted(tbl.keys()):
        merged_dt = pd.concat(tbl[date])
        max_entry_v = max(list(merged_dt.Entry))
        tbl_names = list(merged_dt[merged_dt.Entry == max_entry_v].FileName)
        new_table_data[date] = tbl_names
    return new_table_data

new_table_data = find_max(tbl=new_table)
print(new_table_data)

My loop always skips the first index

Every time I create a loop, I have a problem with the first item. For example:
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
I only get the second dataframe's values.
Using this:
dfd = quandl.get("FRED/DEXBZUS")
df = [dfd]
dps = []
for i in df:
I got this:
Empty DataFrame
Columns: []
Index: []
And if I use this (repeating the first one):
dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfd, dfe]
dps = []
for i in df:
I get both dataframes correctly.
Example:
import quandl
import pandas as pd
import matplotlib.pyplot as plt

dfd = quandl.get("FRED/DEXBZUS")
dfe = quandl.get("ECB/EURBRL")
df = [dfd, dfe]
dps = []
for i in df:
    df1 = i.reset_index()
    results = pd.DataFrame(df1)
    results = results.rename(columns={'Date': 'ds', 'Value': 'y'})
    dps = pd.DataFrame(dps.append(results))
print(dps)
Empty DataFrame
Columns: []
Index: []
ds y
0 2008-01-02 2.6010
1 2008-01-03 2.5979
2 2008-01-04 2.5709
3 2008-01-07 2.6027
4 2008-01-08 2.5796
UPDATE
As Bruno suggested, the problem is related to this line:
dps = pd.DataFrame(dps.append(results))
How can I append all the datasets into one dataframe?
If you create a dataframe like result = pd.DataFrame(df1) and don't give columns, then by default it will take the first row as the columns, and later you are renaming the columns that were created by default. So please create it as pd.DataFrame(df1, columns=[column_list]).
Then the first row will not be skipped.
# this will print every element in df
for i in df:
    print(i)
Also,
for dfIndex, i in enumerate(df):
    print(i)
    print(dfIndex)  # this will print the index of i in df
Note that indexes start at 0, not 1.
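For completeness, the reason the first frame disappears: dps starts out as a plain Python list, so on the first pass dps.append(results) is list.append, which returns None, and pd.DataFrame(None) is the empty frame you see printed. On later passes dps is already a DataFrame, so only those frames survive. A minimal sketch of a fix is to collect the frames in a list and concatenate once after the loop:

import pandas as pd

frames = []
for i in df:  # df is the list [dfd, dfe] from above
    results = i.reset_index().rename(columns={'Date': 'ds', 'Value': 'y'})
    frames.append(results)  # plain list append, collecting every frame
dps = pd.concat(frames, ignore_index=True)
print(dps)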
