Clustering days from CSV data in Python

I have CSV data that is already in a datetime format. I need to analyze the data by comparing days and finding the most similar ones. The data is organized in 10-minute intervals, which means 144 rows need to be grouped into one day. Ideally, each day would be copied into its own array that I could access with something like print(array_26.08.2022).
[CSV Screenshot](https://i.stack.imgur.com/bZEAR.png)
I searched online but couldn't find a solution.
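A minimal sketch of one way to do this with pandas, assuming the CSV has a datetime column named timestamp and a value column named value (both names are placeholders; adjust them to the real file):

import pandas as pd

# Read the file and derive a calendar-date key from the datetime column
df = pd.read_csv("data.csv", parse_dates=["timestamp"])
df["day"] = df["timestamp"].dt.date

# One DataFrame of 144 rows per day, looked up by date instead of trying to
# build variable names like array_26.08.2022 (which isn't valid Python)
days = dict(tuple(df.groupby("day")))
print(days[pd.Timestamp("2022-08-26").date()])

# To compare days, reshape to one row per day with 144 interval columns,
# then e.g. correlate the daily profiles against each other
df["slot"] = df.groupby("day").cumcount()
profile = df.pivot(index="day", columns="slot", values="value")
similarity = profile.T.corr()  # day-by-day correlation matrix

The most similar pair of days is then simply the largest off-diagonal entry in the similarity matrix.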

Related

How to replace mislabeled dates in a pandas dataframe?

I recently received a dataset with some dates and values that was scraped online. The problem is that some of the dates were not scraped properly; as a result, they turned up as hexes instead. One day is split into 48 intervals (as you can see in the second column), and the year (5th column), month (6th column), and name of the day are given.
Is there any way I can convert the hexes into properly labelled dates in pandas? (I want to process this CSV file into a pandas DataFrame for time series analysis.)
[Image of CSV file]
This issue isn't related to the CSV file's content; the viewer you're using (in this case I think it's Excel) is just displaying the cells that way because the values overflow the column width. Try expanding the column a little.
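If in doubt, reading the file directly with pandas shows the actual cell contents, independent of how Excel renders them; a quick check, assuming the file is named scraped.csv:

import pandas as pd

# Inspect the raw values and inferred types -- if the dates look fine here,
# the "hexes" were only a display artifact in the spreadsheet viewer
df = pd.read_csv("scraped.csv")
print(df.head())
print(df.dtypes)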

Stock data storage and calculation using Python and pandas

I am working with stock data which I download as a file every day. The file contains the same number of columns each day, but the rows change depending on which stocks are in or out of the list. I want to compare the files from two dates, find the difference in the total quantity column, and see which stocks entered or left the list.
I have tried using pandas DataFrames stored in an HDF5 file, then using the DataFrames' merge function to find the differences between the two files. I am looking for a more elegant solution, so that I can compare DataFrames and find the differences the way I would with Excel's INDEX and MATCH (or VLOOKUP) functions.
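For reference, a minimal sketch of the merge-based comparison described in the question, assuming each daily file has a symbol key column and a quantity column (file and column names are placeholders):

import pandas as pd

# Hypothetical file names -- adjust to the real daily exports
day1 = pd.read_csv("positions_2019-01-02.csv")
day2 = pd.read_csv("positions_2019-01-03.csv")

merged = day1.merge(day2, on="symbol", how="outer",
                    suffixes=("_old", "_new"), indicator=True)

# Stocks that entered or left the list between the two dates
new_stocks = merged[merged["_merge"] == "right_only"]
dropped = merged[merged["_merge"] == "left_only"]

# Change in total quantity for stocks present on both days
both = merged[merged["_merge"] == "both"].copy()
both["qty_diff"] = both["quantity_new"] - both["quantity_old"]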
You should use the Python difflib library to compare the files.
From the documentation:
This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs
Also, look at the answers to this similar question for some examples. One example that may be useful in your case is this one.
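For illustration, a minimal difflib sketch comparing two daily export files line by line (the file names are placeholders):

import difflib

# Read both files as lists of lines
with open("stocks_2019-01-02.csv") as f1, open("stocks_2019-01-03.csv") as f2:
    old_lines = f1.read().splitlines()
    new_lines = f2.read().splitlines()

# Lines prefixed with "-" exist only in the old file, "+" only in the new one,
# which shows which stock rows entered or left the list
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile="2019-01-02", tofile="2019-01-03",
                                 lineterm=""):
    print(line)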

Replacing big amount of documents from specific date in MongoDB

I'm storing business/statistical data by date in different collections.
Every day, thousands upon thousands of rows are inserted.
In some cases my application fetches or generates information that covers, say, the last 20 days with new values, so I need to update the old information in MongoDB with the new values for those dates.
The first option I thought of is removing all rows from 20 days ago until now (removing by date) and inserting the new data with insertMany().
The problem with this is that the number of rows is huge and it blocks the database, which sometimes makes my worker process die (it's a Python Celery task).
The second option I thought of is to split the incoming data into chunks per date (using pandas DataFrames) and perform a removal and then an insert for that date, iterating that process up to today. It's the same approach, just in smaller chunks.
Is the last option a good idea?
Is there any better approach for this type of problem?
Thanks a lot
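For what it's worth, a minimal sketch of that second, chunked approach with pymongo; the database, collection, and field names here are assumptions:

import pandas as pd
from datetime import datetime
from pymongo import MongoClient

client = MongoClient()
col = client["stats"]["daily_metrics"]

# Toy stand-in for the freshly generated data covering the last N days
new_data = pd.DataFrame({
    "date": [datetime(2019, 5, 1), datetime(2019, 5, 1), datetime(2019, 5, 2)],
    "value": [10, 20, 30],
})

for day, chunk in new_data.groupby("date"):
    # Replace one day's documents at a time so no single bulk operation
    # holds the collection for long
    col.delete_many({"date": day.to_pydatetime()})
    col.insert_many(chunk.to_dict("records"))

Keeping each delete/insert pair small also means a failed Celery task only leaves one day in an inconsistent state rather than the whole 20-day window.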

Transforming data in pandas

What would be the best way to approach this problem using python and pandas?
I have an excel file of electricity usage. It comes in an awkward structure and I want to transform it so that I can compare it to weather data based on date and time.
The structure looks like this (foo is a string and xx is a number):
100,foo,foo,foo,foo
200,foo,foo,foo,foo,foo,0000,kWh,15
300,20181101,xx,xx,xx,xx...(96 columns)xx,A
... several hundred more 300-type rows
The 100 and 200 rows identify the meter and provide a partial schema, i.e. the data is in kWh at 15-minute intervals. The 300 rows contain a date, 96 columns of 15-minute power consumption (96 = 24 hours × 4 blocks per hour), and one column with a data quality flag.
I have previously processed all the data in other tools, but I'm trying to learn how to do it in Python (a Jupyter notebook, to be precise) and tap into the far more advanced analysis, modelling and visualisation tools available.
I think the thing to do is transform the data into a series of datetime and power values. From there I can aggregate, filter and compare however I like.
I am at a loss even to know what question to ask or which resource to look up to tackle this problem. I could just import the 300 rows as-is and loop through the rows and columns to create a new series in the right structure, which is easy enough to do. However, I strongly suspect there is a built-in method for this sort of thing, and I would greatly appreciate any advice on the best strategy. Possibly I don't need to transform the data at all.
You can read the data easy enough into a DataFrame, you just have to step over the metadata rows, e.g.:
df = pd.read_csv(<file>, skiprows=[0,1], index_col=1, parse_dates=True, header=None)
This will read in the CSV, skip the first two lines, make the date column the index and try to parse it as a date type.
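To go from the 96 wide columns to the long series of datetime and power the question describes, one option is a stack-based reshape; a sketch, assuming a file named usage.csv laid out as above:

import pandas as pd

# Skip the 100/200 metadata rows and keep only the 300 data rows
df = pd.read_csv("usage.csv", header=None, skiprows=[0, 1])
df = df[df[0] == 300]

dates = pd.to_datetime(df[1].astype(str), format="%Y%m%d")
readings = df.iloc[:, 2:98].copy()  # the 96 consumption columns
readings.columns = range(96)        # 15-minute slot number within the day
readings.index = dates

# Reshape to one row per 15-minute interval: timestamp plus kWh value
long = readings.stack().reset_index()
long.columns = ["date", "slot", "kwh"]
long["timestamp"] = long["date"] + pd.to_timedelta(long["slot"] * 15, unit="m")
series = long.set_index("timestamp")["kwh"].sort_index()

From there the series can be resampled, filtered, or joined with the weather data on the timestamp.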

Efficient Python/Pandas Selection and Sorting

My brother and I are working on reproducing the findings of this paper, where they use daily historical stock prices to predict future prices with incredibly high accuracy.
They used data from the Center for Research in Security Prices, but we are using data from Quandl and are running into trouble with the runtime of our preprocessing. As in the paper, for each ticker and each day we want to create a vector of 32 z-scores comparing that ticker's momentum to the rest of the market's. As an intermediate step, they create a vector of 32 cumulative returns for each stock over the relevant time periods (12 for the previous 12 months and 20 for the previous 20 days). (This process is also briefly discussed in section 2.2 of the paper.)
Because of the large amount of data, we are having runtime problems simply creating these cumulative-return vectors. We imported the data from a 1.7 GB CSV and turned it into a pandas DataFrame. We then wrote a function that takes a given ticker and date and creates a cumulative-return vector. I suspect our trouble is in how we select data out of the DataFrame, specifically the following lines (each one doesn't take much time on its own, but both need to be run many times for each date):
prices = data[data.ticker == ticker_argument]
and
p = prices[prices.date == date_argument]
What would be the most efficient way of doing this, considering that the data is sorted by ticker and the data for each ticker is also sorted by date? I'd imagine you could do some sort of binary search to speed most of it up, but should I do that by hand in plain Python, or is there a better pandas way? Any help would be greatly appreciated.
The Quandl CSV is at http://www.jackmckeown.com/prices.csv if you want more info on how the data is formatted.
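One common way to avoid scanning the whole frame on every lookup is to build a sorted MultiIndex on (ticker, date) once and let pandas do the index search; a sketch, assuming columns named ticker and date as in the snippets above (the ticker symbol and date below are just examples):

import pandas as pd

data = pd.read_csv("prices.csv", parse_dates=["date"])

# Sort once and index by (ticker, date); .loc lookups then use the sorted
# index instead of a full boolean scan for every ticker/date combination
data = data.set_index(["ticker", "date"]).sort_index()

prices = data.loc["AAPL"]                            # all rows for one ticker
p = data.loc[("AAPL", pd.Timestamp("2016-03-01"))]   # a single ticker/date row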
