DataFrame to .csv is only writing the last value - Python/Pandas

I'm trying to write a DataFrame to a .csv using df.to_csv(). For some reason, it's only writing the last value (the data for the last ticker). It reads through a list of tickers (turtle.csv, all tickers are in the first column) and produces price data for each ticker. I can print all the data without a problem, but I can't seem to write it to .csv. Any idea why? Thanks
input_file = pd.read_csv("turtle.csv", header=None)
for ticker in input_file.iloc[:, 0].tolist():
    data = web.DataReader(ticker, "yahoo", datetime(2011, 6, 1), datetime(2016, 5, 31))
    data['ymd'] = data.index
    year_month = data.index.to_period('M')
    data['year_month'] = year_month
    first_day_of_months = data.groupby(["year_month"])["ymd"].min()
    first_day_of_months = first_day_of_months.to_frame().reset_index(level=0)
    last_day_of_months = data.groupby(["year_month"])["ymd"].max()
    last_day_of_months = last_day_of_months.to_frame().reset_index(level=0)
    fday_open = data.merge(first_day_of_months, on=['ymd'])
    fday_open = fday_open[['year_month_x', 'Open']]
    lday_open = data.merge(last_day_of_months, on=['ymd'])
    lday_open = lday_open[['year_month_x', 'Open']]
    fday_lday = fday_open.merge(lday_open, on=['year_month_x'])
    monthly_changes = {i: MonthlyChange(i) for i in range(1, 13)}
    for index, ym, openf, openl in fday_lday.itertuples():
        month = ym.strftime('%m')
        month = int(month)
        diff = (openf - openl) / openf
        monthly_changes[month].add_change(diff)
    changes_df = pd.DataFrame([monthly_changes[i].get_data() for i in monthly_changes], columns=["Month", "Avg Inc.", "Inc", "Avg.Dec", "Dec"])
    CSVdir = r"C:\Users\..."
    realCSVdir = os.path.realpath(CSVdir)
    if not os.path.exists(CSVdir):
        os.makedirs(CSVdir)
    new_file_name = os.path.join(realCSVdir, 'PriceData.csv')
    new_file = open(new_file_name, 'wb')
    new_file.write(ticker)
    changes_df.to_csv(new_file)

Use 'a' (append) instead of 'wb', because opening with 'wb' overwrites the data on every iteration of the loop. For the different modes of opening a file, see here.
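For example, a minimal sketch of the last few lines of the loop with append mode (the path is the placeholder from the question; ticker and changes_df come from the loop above):
CSVdir = r"C:\Users\..."
realCSVdir = os.path.realpath(CSVdir)
if not os.path.exists(CSVdir):
    os.makedirs(CSVdir)
new_file_name = os.path.join(realCSVdir, 'PriceData.csv')
# 'a' appends each ticker's block instead of overwriting the file
# on every iteration of the loop.
with open(new_file_name, 'a') as new_file:
    new_file.write(ticker + '\n')
    changes_df.to_csv(new_file)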

Related

Format Custom DateTime for Entire Column

I'm looking to concatenate a bunch of csv files in the same directory that this code is run in. I need the entire 'Date Time' column of these sheets to be in the format 'm/d/yyyy h:mm:ss.0', and I believe I just about have it.
Here is my current code (the format changing is at the very bottom):
import os
import pandas as pd
import glob

# returns all data after the header in a pandas DataFrame
def skip_to(fle, line):
    if os.stat(fle).st_size == 0:
        raise ValueError("File is empty")
    with open(fle, 'r') as f:
        if check(fle, line):
            pos = 0
            cur_line = f.readline()
            while not cur_line.startswith(line):
                pos = f.tell()
                # add current line to header dataframe
                cur_line = f.readline()
            f.seek(pos)
            return pd.read_csv(f, parse_dates=[line], na_values=['Unknown'])
        else:
            return ""

def check(myfile, myline):
    with open(myfile, 'r') as f:
        datafile = f.readlines()
        for line in datafile:
            if myline in line:
                return True
    return False  # finished the search without finding

# getting all csv files for concatenation
dir = os.getcwd()
files = [fn for fn in glob.glob(dir + '\**\\' + 'cdlog*.csv', recursive=True)]
df = pd.DataFrame()
for file in files:
    if file.endswith('.csv'):
        dp_data = skip_to(file, "Date Time")
        if type(dp_data) != str:
            dp_data.drop(0, axis=0, inplace=True)
            df = pd.concat([df, dp_data], ignore_index=True, axis=0)

df['Date Time'] = pd.to_datetime(df['Date Time'])
df['Date Time'] = df['Date Time'].dt.strftime('%m/%d/%Y %H:%M:%S.0')
print(df['Date Time'])

# export to .csv
df.to_csv("test_output.csv")
With the print statement, I can see that the column is in the exact format that I'm looking for. When I check the newly created file, it is setting the format to 'mm:ss.0' instead. If I remove the '.0' from the end of the format string, it sets the date correctly in the new sheet, but it only records up to the minutes; it completely cuts off the seconds and I can't figure out why.
Example with the '.0' at the end of the formatting: [screenshot omitted]
Example without the '.0' at the end of the formatting: [screenshot omitted]
pd.to_datetime() outputs a datetime column, which you can work with directly (filter, select ranges, and so on). If you want just a string-formatted field, you can apply dt.strftime() to that datetime column.
For example, once you have a datetime column from pd.to_datetime(), you can apply:
df['Date Time'] = df['Date Time'].dt.strftime('%m/%d/%Y %H:%M:%S.0')
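A minimal end-to-end sketch of that idea (the timestamps and the output filename below are placeholders, not data from the question):
import pandas as pd

df = pd.DataFrame({'Date Time': ['2017-08-04 13:05:22', '2017-08-04 13:05:23']})

# Parse to datetime first, then render back to a plain string column,
# so the CSV contains exactly the text that was formatted.
df['Date Time'] = pd.to_datetime(df['Date Time'])
df['Date Time'] = df['Date Time'].dt.strftime('%m/%d/%Y %H:%M:%S.0')

df.to_csv('test_output.csv', index=False)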

Reading a Specific JSON Column for Tokenization

I am planning to tokenize a column within a JSON file with NLTK. The code below reads the JSON file and slices it into different time intervals.
I am however struggling to have the 'Main Text' column (within the JSON file) read/tokenized in the final part of the code below. Is there any smart tweak to make this happen?
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])

# Create an empty dictionary
d = dict()

# Filtering by date
start_date = pd.to_datetime('2009-01-01')
end_date = pd.to_datetime('2009-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_dates = df[df.Date.between(start_date, end_date)]
    n = len(filtered_dates.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_dates)
    start_date += mnthBeg
    end_date += mnthEnd

# NLTK tokenizing
file_content = open('Main Text').read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
I have solved the issue with the following code, which runs smoothly. Many thanks again for everyone's input.
for index, row in filtered_dates.iterrows():
    line = row['Text Main']
    tokens = nltk.word_tokenize(line)
    print(tokens)
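As a follow-up, the same tokenization can be done without an explicit loop via Series.apply (only a sketch, reusing the filtered_dates frame and 'Text Main' column from the snippets above):
import nltk

# Tokenize every row of the column in one pass; the result is a
# Series of token lists, one entry per row.
filtered_dates['Tokens'] = filtered_dates['Text Main'].apply(nltk.word_tokenize)
print(filtered_dates['Tokens'].head())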

Function gives list, I need DataFrame so I can concatenate

I started off pulling all the files in the folder and concatenating them; this version works:
warranty_list = []
warranty_files = glob.glob(os.path.join(qms, '*.csv'))
for file_ in warranty_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    warranty_list.append(df)
warranty = pd.concat(warranty_list)
Then I had to write a function so I would only grab certain files and concatenate them, but this one is not working. I do not get an error but the last line is not being used, so I am not concatenating the files.
def get_warranty(years=5):
    warranty_list = []  # list for glob.glob()
    current_year = datetime.datetime.today().year  # current year
    last_n_years = [str(current_year - i) for i in range(0, years + 1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_, index_col=None, header=0)
                warranty_list.append(df)
    warranty_df = pd.concat(warranty_list)
The last line isn't working, presumably because pd.concat() is getting a list as input and won't do anything with that. I don't understand why it worked in the first set of code and not this one.
I don't know how to change the function to get a data frame or how to change what I get at the end into a data frame.
Any suggestions?
I would suggest using append directly, because it does the same thing as concat.
So basically you start with an empty DataFrame:
warranty_df = pd.DataFrame()
And then append the other DataFrames to it while reading the files.
So your function can stay the same, but you need to delete the following line:
warranty_df = pd.concat(warranty_list)
And after the loop, you return the warranty_df!
def get_warranty(years=5):
    warranty_df = pd.DataFrame()
    current_year = datetime.datetime.today().year  # current year
    last_n_years = [str(current_year - i) for i in range(0, years + 1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_, index_col=None, header=0)
                # accumulate each file into the growing DataFrame
                warranty_df = warranty_df.append(df)
    return warranty_df
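Alternatively, the original pd.concat approach from the question also works, as long as the function actually returns the concatenated frame. A sketch under that assumption (with qms defined as in the question):
def get_warranty(years=5):
    warranty_list = []
    current_year = datetime.datetime.today().year
    last_n_years = [str(current_year - i) for i in range(0, years + 1)]
    for year in last_n_years:
        for file_ in glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year)):
            warranty_list.append(pd.read_csv(file_, index_col=None, header=0))
    # concat once at the end and hand the result back to the caller
    return pd.concat(warranty_list) if warranty_list else pd.DataFrame()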

Pulling contents of div tags with beautifulsoup and creating a pandas dataframe

date = '2017-08-04'
writer = pd.ExcelWriter('MLB Daily Data.xlsx')
url_4 = 'http://www.baseballpress.com/lineups/' + date
resp_4 = requests.get(url_4)
soup_4 = BeautifulSoup(resp_4.text, "lxml")
lineups = soup_4.findAll('div', attrs={'class': 'players'}, limit=None)
row_lineup = 0
for lineup in lineups:
    lineup1 = lineup.prettify()
    lineup2 = lineup1.replace('>'&&'<', ',')
    df4 = pd.DataFrame(eval(lineup2))
    df4.to_excel(writer, sheet_name='Starting Lineups', startrow=row_lineups, startcol=0)
    row_lineups = row_lineups + len(df.index) + 3
writer.save()
I am trying to get the starting lineups from the webpage, convert them into a pandas DataFrame, and then save it to an Excel file. I'm having an issue with turning it into a DataFrame. I replaced the brackets with commas because I figured that would turn it into CSV format.
This may get you moving in the right direction, where each row is a lineup:
data = [[x.text for x in y.findAll('a')] for y in lineups]
df = pd.DataFrame(data)
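From there, one possible way to push the frame into the workbook, reusing the writer and sheet name from the question (just a sketch, not tested against that site):
# write the lineup table to the existing ExcelWriter and save the workbook
df.to_excel(writer, sheet_name='Starting Lineups', startrow=0, startcol=0)
writer.save()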

For loop after for loop produces wrong output Python

I am trying to use for loops to iterate through some Yahoo Finance data and calculate the returns of the papers. The problem is that I want to do this for different time periods, and I have a document containing the different start and end dates. This is the code I have been using:
import pandas as pd
import numpy as np
from pandas.io.data import DataReader
from datetime import datetime

# This function is just used to download the data I want and save
# it to a csv file.
def downloader():
    start = datetime(2005, 1, 1)
    end = datetime(2010, 1, 1)
    tickers = ['VIS', 'VFH', 'VPU']
    stock_data = DataReader(tickers, "yahoo", start, end)
    price = stock_data['Adj Close']
    price.to_csv('data.csv')

downloader()

# reads the data into a Pandas DataFrame.
price = pd.read_csv('data.csv', index_col='Date', parse_dates=True)

# Creates a Pandas DataFrame that holds multiple dates. The format of these is
# the same as the format of the dates when I load the full csv file of dates.
inp = [{'start': datetime(2005, 1, 3), 'end': datetime(2005, 12, 30)},
       {'start': datetime(2005, 2, 1), 'end': datetime(2006, 1, 31)},
       {'start': datetime(2005, 3, 1), 'end': datetime(2006, 2, 28)}]
df = pd.DataFrame(inp)

# Everything above this is not part of the original script, but is
# just used to replicate the problem I am having.
results = pd.DataFrame()
for index, row in df.iterrows():
    start = row['start']
    end = row['end']
    price_initial = price.ix[start:end]
    for column1 in price_initial:
        price1 = price_initial[column1]
        startprice = price1.ix[end]
        endprice = price1.ix[start]
        momentum_value = (startprice / endprice) - 1
        results = results.append({'Ticker': column1, 'Momentum': momentum_value}, ignore_index=True)
    results = results.sort(columns="Momentum", ascending=False).head(1)
    print(results.to_csv(sep='\t', index=False))
I am not sure what I am doing wrong here, but I suspect it has something to do with the way I iterate or the way I save the output from the script.
The output I get is this:
Momentum Ticker
0.16022263953253435 VPU
Momentum Ticker
0.16022263953253435 VPU
Momentum Ticker
0.16022263953253435 VPU
That is clearly not correct. Hope someone can help me get this right.
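For reference, one thing that stands out is that results is reassigned to its own sorted head(1) inside the outer loop, so the single row kept from the first window carries over into every later comparison. A minimal sketch of keeping each window separate (assuming the intent is the top ticker per date window and that each window contains at least one trading day):
top_rows = []
for _, row in df.iterrows():
    window = price.loc[row['start']:row['end']]
    # Per-window momentum: last price in the window over the first, minus 1.
    momentum = (window.iloc[-1] / window.iloc[0]) - 1
    window_results = pd.DataFrame({'Ticker': momentum.index, 'Momentum': momentum.values})
    top_rows.append(window_results.sort_values('Momentum', ascending=False).head(1))

print(pd.concat(top_rows, ignore_index=True).to_csv(sep='\t', index=False))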
