I am new to Python and have been searching for this but can't find any questions on it. I have stock price data for hundreds of stocks, all in .txt files. I am trying to load all of them into a Jupyter notebook to analyze them, ideally with charts and mathematical analysis (specifically mean reversion analysis).
I am wondering how I can load so many files at once. I need to be able to analyze each of them to see whether they are reverting to their mean price. Then I would like to create a chart of the top 5 biggest differences from the mean.
Also, should I convert them to .csv files and then load them into pandas? What are some good libraries to use? I know pandas, matplotlib, and the math library, and probably numpy as well.
Thank you.
Use glob to list the files in the directory and pandas to read them, then concatenate them all:
from glob import glob
import pandas as pd

dir_containing_files = 'path_to_txt_files'
df = pd.concat([pd.read_csv(f) for f in glob(dir_containing_files + '/*.txt')])
I'm guessing your text files contain columns of data separated by some delimiter, in which case you can use pd.read_csv (even without changing the file extension to .csv):
data = pd.read_csv('stock_data.txt', sep=",")
# change `sep` to whatever delimiter is in your files
You could put the line above into a loop to load many files at once. I can't say exactly how to loop through them without knowing the pattern in your file names, but a sketch is below.
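For illustration, assuming the files all live in one directory and end in .txt (both assumptions; the 'stock_data' folder below is a placeholder), a glob-based loop might look like this:

from glob import glob
import os
import pandas as pd

frames = {}
for path in glob('stock_data/*.txt'):                      # placeholder directory
    name = os.path.splitext(os.path.basename(path))[0]     # e.g. 'AAPL' from 'AAPL.txt'
    frames[name] = pd.read_csv(path, sep=",")              # change `sep` to match your delimiter

print(f"Loaded {len(frames)} files")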
In addition to Pandas, libraries that I would reach for to do mean reversion analysis are:
statsmodels for model fitting
matplotlib for drawing graphs
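Building on the loading sketch above, here is a rough illustration of the kind of deviation-from-mean check you could chart with matplotlib; the frames dict and the 'Close' column name are assumptions, not details from the question:

import pandas as pd
import matplotlib.pyplot as plt

# `frames`: dict of DataFrames keyed by ticker, each assumed to have a 'Close' price column
deviations = {}
for ticker, df in frames.items():
    rolling_mean = df['Close'].rolling(window=20).mean()        # 20-day window is an arbitrary choice
    deviations[ticker] = (df['Close'] - rolling_mean).iloc[-1]  # latest gap between price and its mean

# chart the 5 stocks currently furthest from their rolling mean
top5 = pd.Series(deviations).abs().nlargest(5)
top5.plot(kind='bar', title='Top 5 deviations from rolling mean')
plt.show()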
Related
I have two Excel files which are in general ledger format, and I am trying to open them as dataframes so I can do some analysis, specifically look for duplicates. I tried opening them using
pd.read_excel(r"Excelfile.xls") in pandas. The files are being read, but when I use df.head() I am getting NaNs for all the records and columns as well. Is there a way to load data in general ledger format into a DataFrame?
This is what the dataset looks like in the Jupyter notebook, and this is what it looks like in Excel.
I am new to Stack Overflow and haven't learnt how to upload part of a dataset yet, so I hope the images help describe my situation.
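One hedged experiment, assuming the NaNs come from title or blank rows above the real header in the ledger layout (an assumption, since the screenshots are not reproduced here), is to point read_excel at the actual header row:

import pandas as pd

# `skiprows=5` is a guess at how many title/blank rows precede the data; adjust to the real layout
df = pd.read_excel(r"Excelfile.xls", skiprows=5, header=0)
print(df.head())
print(df.dtypes)   # check whether the columns were parsed as expected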
I'm trying to implement an LSTM network using TensorFlow 2, but I'm having problems taking input. My dataset is in the form of multiple CSV files: I have more than 100 CSV files in a directory that I want to read and load in Python. They contain customer information, and each file has 50 columns and many rows, so they don't fit in memory!
I have read a lot of documentation and I know the best approach is to use a generator function and chunks, so I tried that but still have problems. I will show part of a CSV file.
(part of a CSV file)
How do I take these multiple CSVs as input?
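A minimal sketch of the generator-plus-chunks idea, assuming the files sit in a data/ directory and have a label column named 'label' (both placeholders, not details from the question):

from glob import glob
import numpy as np
import pandas as pd

def csv_batch_generator(pattern, chunksize=1024):
    """Yield (features, labels) one chunk at a time, never loading a whole file into memory."""
    for path in glob(pattern):
        for chunk in pd.read_csv(path, chunksize=chunksize):
            labels = chunk.pop('label').to_numpy()                 # 'label' is a placeholder column name
            yield chunk.to_numpy(dtype=np.float32), labels         # assumes the remaining columns are numeric

# iterate lazily over every file; each batch can be fed to the model,
# e.g. via model.train_on_batch or wrapped in tf.data.Dataset.from_generator
for features, labels in csv_batch_generator('data/*.csv'):
    pass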
I'm trying to use Dask to read a large number of CSV files, but I'm having issues since the number of columns varies between the files, as does the order of the columns.
I know that packages like d6tstack (as detailed here) can help handle this, but is there a way to fix it without installing additional libraries and without taking up more disk space?
If you use from_delayed, then you can make a function which pre-processes each of your input files as you might wish. This is totally arbitrary, so you can choose to solve the issue using your own code or any package you want to install across the cluster.
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def read_a_file(filename):
    df = pd.read_csv(filename)   # or a remote file
    # do_something_with_columns: whatever per-file pre-processing you need
    return df

df = dd.from_delayed([read_a_file(f) for f in filenames], meta=...)
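For the column-mismatch issue specifically, one hedged version of that pre-processing step is to reindex every file to an agreed master column list (the column names below are placeholders):

import dask
import pandas as pd

COLUMNS = ['date', 'ticker', 'price', 'volume']   # placeholder master column list

@dask.delayed
def read_and_align(filename):
    df = pd.read_csv(filename)
    # add any missing columns as NaN and enforce a single column order
    return df.reindex(columns=COLUMNS)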
I am using a third-party tool to extract vast amounts of time-series data (to be analysed within Python). The options are to save this as a text file or an Excel file. Which is the more efficient route speed-wise?
You can have a look here: Faster way to read Excel files to pandas dataframe
It is also mentioned there that CSV is faster, so the text file should be the better option.
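If you want to confirm it on your own data, a quick hedged benchmark (the file names are placeholders for the same data exported both ways) could time the two read paths directly:

import time
import pandas as pd

def time_read(reader, path):
    start = time.perf_counter()
    reader(path)
    return time.perf_counter() - start

print("csv:  ", time_read(pd.read_csv, 'timeseries.csv'))
print("excel:", time_read(pd.read_excel, 'timeseries.xlsx'))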
I am working with stock data which I download from a file every day. The file contains the same number of columns every day, but the rows change depending on which stocks are in or out of the list. I am looking to compare the files from two dates and find the difference in the total quantity column, as well as which stocks entered or left the list.
I have tried using a pandas dataframe and storing it in an HDF5 file, then using the merge function on the dataframes to find the differences between the two files. I am looking for a more elegant solution so that I can compare dataframes and find the differences the way I would with the INDEX and MATCH (or VLOOKUP) functions in Excel.
You should use the Python difflib library to compare the files.
From the documentation:
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs
Also, look at the answers to this similar question for some examples. One example that may be useful in your case is this one.
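A minimal sketch of the difflib approach, assuming the two daily exports are plain text files whose raw lines can be compared (file names are placeholders):

import difflib

with open('stocks_day1.csv') as f1, open('stocks_day2.csv') as f2:
    old_lines = f1.readlines()
    new_lines = f2.readlines()

# lines prefixed with '-' left the list, lines prefixed with '+' entered it
for line in difflib.unified_diff(old_lines, new_lines, fromfile='day1', tofile='day2'):
    print(line, end='')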