Reading multiple csv files for TensorFlow - python

I'm trying to implement an LSTM network using TensorFlow 2, but I'm having problems with taking input. My dataset is in the form of multiple CSV files: I have more than 100 CSV files in a directory that I want to read and load in Python. They contain customer information. Each file has 50 columns and many rows, so they don't fit in memory!
I have read a lot of documentation and I know the best approach is to use a generator function and chunks, so I tried that, but I still have problems... I will show part of a CSV file.
[image: part of a CSV file]
How do I take these multiple CSVs as input?
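One common approach is to let tf.data stream the files from disk instead of loading them yourself. Below is a minimal sketch, assuming all files share the same header and sit under a directory such as customer_csvs/ (the path, batch size, and label column name are placeholders, not taken from the question):

import tensorflow as tf

# Stream rows straight from disk instead of loading 100+ files into memory.
# "customer_csvs/*.csv" and label_name="label" are placeholders - adjust to your data.
dataset = tf.data.experimental.make_csv_dataset(
    file_pattern="customer_csvs/*.csv",
    batch_size=64,
    label_name="label",   # the column you want the network to predict
    num_epochs=1,
    shuffle=True,
)

# Each element is a (features_dict, label) batch that can be passed to model.fit().
for features, label in dataset.take(1):
    print({name: tensor.shape for name, tensor in features.items()}, label.shape)

For an LSTM you would still need to window consecutive rows into sequences (for example with a per-customer generator, as you planned); make_csv_dataset only handles the out-of-memory streaming part.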

Related

Correctly formatting my csv file for input into ML algorithm

I am having a lot of trouble formatting my CSV file in a way that makes it suitable for a machine learning algorithm in Python. I have followed various tutorials, but none give guidance on a column in my CSV file which holds huge arrays of data.
For context, I am collecting Channel State Information (CSI) data from various individuals, and the program collects this data into a big CSV file; the part I am interested in is presented as a huge array of numbers. I want the ML algorithm to identify individuals based on this data, and I am having trouble finding a way to format the CSV file.
Thanks
I have tried various tutorials, but no algorithm accepts my array of data; there's often a ValueError saying it can't convert string to float.
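That ValueError usually means the array column is still stored as a string. A minimal sketch of one way to parse it, assuming the file is called csi_data.csv, the array column is csi, and the label column is person_id (all placeholder names, not from the question):

import ast
import numpy as np
import pandas as pd

df = pd.read_csv("csi_data.csv")   # placeholder file name

# If the column is stored like "[0.1, 0.2, ...]", ast.literal_eval turns it into a list;
# for plain space-separated numbers, s.split() plus float() works instead.
df["csi"] = df["csi"].apply(lambda s: np.array(ast.literal_eval(s), dtype=float))

# Stack the per-row arrays into one 2-D feature matrix for the ML algorithm.
X = np.stack(df["csi"].to_numpy())
y = df["person_id"]                # placeholder label column
print(X.shape, y.shape)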

How to EFFICIENTLY upload a pyspark dataframe as a zipped csv or parquet file (similar to .gz format)

I have a 130 GB csv.gz file in S3 that was loaded using a parallel unload from Redshift to S3. Since it consists of multiple files, I wanted to reduce the number of files so that it's easier to read for my ML model (using sklearn).
I have managed to convert the multiple files from S3 into a Spark dataframe (called spark_df1) using:
spark_df1=spark.read.csv(path,header=False,schema=schema)
spark_df1 contains hundreds of columns (features) and is my time-series inference data for millions of customer IDs. Since it is time-series data, I want to make sure that all the data points of a 'customerID' are present in the same output file, as I will be reading each partition file as a chunk.
I want to unload this data back into S3. I don't mind smaller partitions of data, but each partitioned file SHOULD contain the entire time series of a single customer; in other words, one customer's data cannot be split across 2 files.
current code:
datasink3=spark_df1.repartition(1).write.format("parquet").save(destination_path)
However, this takes forever to run, and the output is a single file that is not even zipped. I also tried using ".coalesce(1)" instead of ".repartition(1)", but it was slower in my case.
You can partition it by customerID when writing:
spark_df1.write.partitionBy("customerID") \
    .format("parquet") \
    .save(destination_path)
You can read more about it here: https://sparkbyexamples.com/pyspark/pyspark-repartition-vs-partitionby/
This code worked, and the running time dropped to 1/5 of the original. The only thing to note is to make sure the load is split equally amongst the nodes (in my case I had to make sure that each customer ID had roughly the same number of rows):
spark_df1.repartition("customerID").write.partitionBy("customerID").format("csv").option("compression","gzip").save(destination_path)
Adding to manks' answer: you need to repartition the DataFrame by customerID and then write.partitionBy("customerID") to get one file per customer.
You can see a similar issue here.
Also, regarding your comment that parquet files are not zipped: the default compression is snappy, which has some pros & cons compared to gzip compression, but it's still much better than uncompressed.

Uploading stock price data in .txt files and analyzing in python

I am new to Python and have been searching for this but can't find any questions on it. I have stock price data for hundreds of stocks, all in .txt files. I am trying to upload all of them to a Jupyter notebook to analyze them, ideally with charts and mathematical analysis (specifically mean reversion analysis).
I am wondering how I can upload so many files at once. I need to be able to analyze each of them to see if they are reverting to their mean price. Then I would like to create a chart of the top 5 biggest deviations from the mean.
Also, should I convert them to .csv files and then load them into pandas? And what are some good libraries to use? I know pandas, matplotlib, and the math library, as well as probably numpy.
Thank you.
Use glob to read the directory and pandas to read the files, then concat them all:
from glob import glob
import pandas as pd

dir_containing_files = 'path_to_csv_files'
df = pd.concat([pd.read_csv(i) for i in glob(dir_containing_files + '/*.txt')])
I'm guessing your text files contain columns of data separated by some delimiter, in which case you can use pd.read_csv (even without changing the file extension to .csv):
data = pd.read_csv('stock_data.txt', sep=",")
# change `sep` to whatever delimiter is in your files
You could put the line above into a loop to load many files at once. Can't say exactly how to loop through it without knowing the pattern in your file names.
In addition to pandas, libraries that I would reach for to do mean reversion analysis are:
statsmodels for model fitting
matplotlib for drawing graphs
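As a rough sketch of how those pieces fit together (assuming each file has date and close columns and sits under stock_txt_files/, all placeholder names), you could compute each stock's deviation from a rolling mean and chart the five largest:

from glob import glob
import pandas as pd
import matplotlib.pyplot as plt

frames = {}
for path in glob("stock_txt_files/*.txt"):
    df = pd.read_csv(path, sep=",", parse_dates=["date"], index_col="date")
    df["rolling_mean"] = df["close"].rolling(30).mean()   # 30-day rolling mean
    df["deviation"] = df["close"] - df["rolling_mean"]    # distance from the mean
    frames[path] = df

# Rank the stocks by their latest absolute deviation and plot the top 5.
top5 = sorted(frames.items(),
              key=lambda kv: abs(kv[1]["deviation"].iloc[-1]),
              reverse=True)[:5]
for name, df in top5:
    df["deviation"].plot(label=name)
plt.legend()
plt.title("Largest current deviation from the 30-day mean")
plt.show()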

A CNN that takes several rows from a CSV file as a single input

I have extracted facial features from several videos in the form of facial action units (AUs) using OpenFace. These features span several seconds and hence take up several rows in a CSV file (each row containing AU data for one frame of the video). Originally, I had multiple CSV files as input for the CNN but, as advised by others, I have concatenated and condensed the data into a single file. My CSV columns look like this:
Filename | Label | the other columns contain AU-related data
Filename contains an individual "ID" that helps keep track of a single "example". The Label column contains 2 possible values, either "yes" or "no". I'm also considering adding a "Frames" column to keep track of the frame number for a certain "example".
The most likely scenario is that I will require some form of 3D CNN, but so far the only code or help that I have found for 3D CNNs is specific to videos, while I need code that works with one or more CSV files. I've been unable to find any code that can help me in this scenario. Can someone please help me out? I have no idea how or where to move forward.
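Not a full answer, but one way to get the data into a fixed-shape array per example is to group the rows by Filename and pad each group to the same number of frames. A minimal sketch, assuming a file named au_features.csv with the Filename and Label columns described above (MAX_FRAMES is an arbitrary placeholder):

import numpy as np
import pandas as pd

df = pd.read_csv("au_features.csv")                 # placeholder file name
feature_cols = [c for c in df.columns if c not in ("Filename", "Label")]

MAX_FRAMES = 150                                    # pad/truncate every example to this length

examples, labels = [], []
for _, group in df.groupby("Filename", sort=False):
    feats = group[feature_cols].to_numpy(dtype=np.float32)[:MAX_FRAMES]
    padded = np.zeros((MAX_FRAMES, len(feature_cols)), dtype=np.float32)
    padded[:len(feats)] = feats                     # zero-pad short examples
    examples.append(padded)
    labels.append(1 if group["Label"].iloc[0] == "yes" else 0)

X = np.stack(examples)    # shape: (num_examples, MAX_FRAMES, num_AU_features)
y = np.array(labels)
print(X.shape, y.shape)

An array shaped (examples, frames, features) like this can be fed to a Conv1D or LSTM over the time axis; a 3D CNN would only come into play if you add a spatial dimension.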

Is there a preferred format for Python to retrieve time-series data - between .txt or xlsx?

I am using a third-party tool to extract vast amounts of time-series data (to be analysed within Python). The options are to save this as a text file or an Excel file. Which is the more efficient route speed-wise?
You can have a look here: Faster way to read Excel files to pandas dataframe
It is also mentioned there that csv is faster, so the text file should be the better option.
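If you want to verify it on your own data, here is a quick timing sketch (assuming the same dataset has been saved as both data.csv and data.xlsx, which are placeholder names):

import time
import pandas as pd

start = time.perf_counter()
df_txt = pd.read_csv("data.csv")                      # text/csv route
print("csv/text:", time.perf_counter() - start, "s")

start = time.perf_counter()
df_xlsx = pd.read_excel("data.xlsx")                  # Excel route (needs openpyxl)
print("excel:   ", time.perf_counter() - start, "s")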
