I recently received a dataset with some dates and values that was scraped online. The problem is that some of the dates were not scraped properly, so they turned up as hexes instead. One day is split into 48 intervals (as you can see in the second column), and the year (5th column), month (6th column), and name of the day are given.
Is there any way I can convert the hexes into properly labelled dates in pandas? (I want to process this CSV file into a pandas DataFrame for time series analysis.)
image of csv file
This issue isn't related to the CSV file content. The viewer you use (in this case I think it's Excel) is just displaying it that way due to overflow in the cell; try expanding the column a little.
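As a quick check, pandas reads the raw CSV values unaffected by how Excel renders them. A minimal sketch with hypothetical sample data (the epoch-seconds interpretation of the numbers is an assumption, not something confirmed by the screenshot):

```python
import pandas as pd
from io import StringIO

# Hypothetical sample mimicking the scraped file: Excel may render wide
# numbers unreadably in a narrow cell, but the underlying text is intact.
csv_text = "interval,timestamp\n1,1601510400\n2,1601512200\n"
df = pd.read_csv(StringIO(csv_text))

# If the values turn out to be epoch seconds (an assumption), they convert
# directly to labelled dates:
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
```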
I have CSV data that is already in a datetime format. I need to analyze the data by comparing days and finding the most similar ones. The data is organized in 10-minute intervals, which means 144 rows per day need to be clustered into one day. It would be ideal if every day were copied into an array and could be accessed by saying e.g. print(array_26.08.2022).
[CSV Screenshot](https://i.stack.imgur.com/bZEAR.png)
I searched online but couldn't find a solution.
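Python identifiers can't contain dots, so `array_26.08.2022` isn't possible as a variable name, but a dict keyed by date string gives the same access pattern. A minimal sketch, assuming a `timestamp` column at 10-minute resolution (column names are hypothetical):

```python
import pandas as pd

# Two full days of 10-minute data (144 rows per day)
idx = pd.date_range("2022-08-26", periods=288, freq="10min")
df = pd.DataFrame({"timestamp": idx, "value": range(288)})

# One array per calendar day, keyed by an ISO date string
days = {str(d): g["value"].to_numpy()
        for d, g in df.groupby(df["timestamp"].dt.date)}

print(days["2022-08-26"])  # the 144 values of that day
```

From here, comparing days is comparing equal-length arrays, e.g. with a distance metric of your choice.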
My goal is to append a dataframe to an existing Excel file with date as the index. Since I sometimes need to run the program several times a day, I want to overwrite that day's rows when doing so.
For example, if I have data from 02-02 to 02-19 and I add 02-20 data, nothing should be overwritten; but if I have 02-02 to 02-19 data and then get the whole day of 02-19, it should overwrite starting where the 02-19 data begins.
I already successfully write the dataframe to Excel; how can I set the startrow to fulfill my need?
Use xlwings. You can find the cell where your data ends in Excel by using range.end('down'), which you can use as the start point for writing the new dataframe.
So basically, here is a simplified version of my dataframe, and the second picture is what I want to get: https://imgur.com/a/44QgR44
An explanation: the 20201001-style values are dates in a number format, and I want to group up the values for each date for some Group and Name.
Here comes my issue: I tried using df.groupby(by=['Credit','Equity','Bond']).sum(), but it grouped everything up, not only the ones in the list (there are many more names in the original dataset which I don't want to group up).
The second issue is that there are two things (Stock and Option) that should be grouped together into a different row, and I'm not sure how I could do that with pandas.
In Excel I just got the result with a simple SUMIF function.
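A SUMIF-style equivalent in pandas (data and column names are hypothetical, reconstructed from the description) is to filter to the names of interest first rather than passing them to groupby, and to map Stock and Option to a shared label so they sum into one row:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [20201001, 20201001, 20201001, 20201001, 20201002],
    "Name": ["Credit", "Equity", "Stock", "Option", "Credit"],
    "Value": [10, 20, 5, 7, 4],
})

wanted = ["Credit", "Equity", "Bond"]
# SUMIF analogue: keep only the names in the list, then sum per date and name
sums = (df[df["Name"].isin(wanted)]
        .groupby(["Date", "Name"], as_index=False)["Value"].sum())

# Stock and Option combined into a single row via a shared label
merged = df.replace({"Name": {"Stock": "Stock+Option", "Option": "Stock+Option"}})
combo = merged.groupby(["Date", "Name"], as_index=False)["Value"].sum()
```

The key point: groupby's `by=` names the columns to group on, not the values to keep, which is why the original attempt summed everything.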
I have a seemingly complicated problem and I have a general idea of how I should solve it but I am not sure if it is the best way to go about it. I'll give the scenario and would appreciate any help on how to break this down. I'm fairly new with Pandas so please excuse my ignorance.
The Scenario
I have a CSV file that I import as a dataframe. The example I am working through contains 2742 rows × 136 columns. The rows are variable but the columns are set. I have a set of 23 lookup tables (also CSV files), named per year and quarter (range: 2020 3rd quarter to 2015 1st quarter). The lookup files are named like PPRRVU203.csv, which contains values from the 3rd quarter of 2020. The lookup tables are matched on two columns ('Code' and 'Mod'), and I use three values that are associated in the lookup.
I am trying to filter sections of my data frame, pull the correct values from the matching lookup file, merge back into the original subset, and then replace into the original dataframe.
Thoughts
I can probably abstract this and wrap it in a function, but I'm not sure how to place the results back in. My question, for those who understand Pandas better than I do: what is the best method to filter, replace the values, and write the file back out?
The straightforward solution would be to filter the original dataframe into 23 separate dataframes, do the merge on each individually, then concat into a new dataframe and output to CSV.
This seems highly inefficient?
I can post code, but I am looking more for high-level thoughts.
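The "filter, merge, concat" plan is actually quite close to idiomatic pandas. One way to sketch the high-level shape (all names and values are hypothetical): group the main frame by whichever column identifies the quarter, left-merge each group against its lookup on ['Code', 'Mod'], and concat the pieces back together:

```python
import pandas as pd

# Main frame with a column identifying which lookup file applies
main = pd.DataFrame({
    "Code": ["A", "B", "A"],
    "Mod": [1, 2, 1],
    "quarter": ["203", "203", "151"],
})
# One lookup per quarter, matched on Code and Mod (values are made up)
lookups = {
    "203": pd.DataFrame({"Code": ["A", "B"], "Mod": [1, 2], "val": [10, 20]}),
    "151": pd.DataFrame({"Code": ["A"], "Mod": [1], "val": [99]}),
}

# Merge each quarter's slice against its own lookup, then reassemble
parts = [grp.merge(lookups[q], on=["Code", "Mod"], how="left")
         for q, grp in main.groupby("quarter")]
result = pd.concat(parts, ignore_index=True)
```

Since each row is touched once and the merges are hash-based, this is not as inefficient as it looks; the main cost is reading the 23 lookup CSVs, which you can cache in a dict as above.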
Not sure exactly what your DataFrame looks like, but the Pandas query() method may prove useful for selecting the data:
name = df.query('columnname == "something"')
A large dataframe has a date column. By using pandas.read_csv(..., parse_dates=["date"]) to read the data, I assume the column has been converted to an efficient data type for representing dates.
The task is now to select all items that fall into a date range, e.g. ("2018-01-01", "2018-12-31"). This could be extremely fast by having the date column in sorted form and using binary search to locate the bounding indices.
But how do I tell this to pandas? Is it enough to sort by the column and perform a query on it? Should I make it a pandas.DatetimeIndex and use .loc?
One possible caveat is that the items already have a MultiIndex that needs to be kept intact. Also, I don't want more than one copy of the dataframe in memory.
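A sketch under those constraints (sample data is hypothetical): sort once by the date column, then use Series.searchsorted for the binary search. The rows' existing (Multi)Index travels with them, and slicing by position avoids building a boolean mask over the whole frame:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2017-06-01", "2018-03-15", "2018-11-30", "2019-01-02"]),
    "value": [1, 2, 3, 4],
})

# Sort once by the date column; a stable sort keeps tie order, and the row
# index (MultiIndex or not) stays attached to its rows
df.sort_values("date", kind="mergesort", inplace=True)

# Binary search for the slice bounds, then positional slicing
lo = df["date"].searchsorted(pd.Timestamp("2018-01-01"), side="left")
hi = df["date"].searchsorted(pd.Timestamp("2018-12-31"), side="right")
subset = df.iloc[lo:hi]
```

A sorted DatetimeIndex with .loc["2018-01-01":"2018-12-31"] gives the same log-time slicing, but that would displace the existing MultiIndex, which is why this sketch keeps the dates as a column.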