I have a df which has a duplicate index at '2020-10-25 02:00:00' with different values:
df
... 5
2020-10-25 02:00:00 10
2020-10-25 02:00:00 7
... 8
Because of the summer/winter time change I have this duplicate index. It is fine until I want to convert this df to a dictionary via df.to_dict(). When I do, one of the values of df['2020-10-25 02:00:00'] is removed, since a dictionary cannot have duplicate keys.
Instead of hardcoding, I am looking for something like the following, which could store these two values as a list when converting to a dictionary:
df.to_dict(preserve_duplicates=True)
Summary: Is there a way to preserve the duplicate index of a df when it is converted to a Python dictionary?
One thing you can do is group those values together (i.e. into a list) before you convert the df to a dict:
value
date
2020-10-25 01:00:00 5
2020-10-25 02:00:00 10
2020-10-25 02:00:00 7
df.groupby(df.index).agg(list).to_dict()
> {'value': {'2020-10-25 01:00:00': [5], '2020-10-25 02:00:00': [10, 7]}}
The agg function is flexible depending on your needs; you can also do a sum or any other operation.
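For instance, a minimal sketch of the same pattern with sum, using the df above (10 + 7 collapses to 17):
df.groupby(df.index).agg('sum').to_dict()
> {'value': {'2020-10-25 01:00:00': 5, '2020-10-25 02:00:00': 17}}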
A duplicate index in a Pandas DataFrame should be avoided, but in a Python dict duplicate keys are simply impossible.
IMHO, the simplest way is just to reset the index before building the dict:
df.reset_index().to_dict()
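With the sample df above, this yields column-oriented dicts keyed by row position, so both duplicate rows survive (a sketch, assuming the index column is named date):
df.reset_index().to_dict()
> {'date': {0: '2020-10-25 01:00:00', 1: '2020-10-25 02:00:00', 2: '2020-10-25 02:00:00'}, 'value': {0: 5, 1: 10, 2: 7}}
If a list of row dicts is easier to work with, df.reset_index().to_dict(orient='records') gives one dict per row instead.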
I have a series my_series that looks like this:
Index Date
12345 2019-01-03 14.0
2019-01-04 65.0
2019-01-05 81.0
23456 2019-12-14 21.0
2019-12-15 51.0
2019-12-16 55.0
and I want to go through its values by selecting both index levels, because I need to perform an operation on each value.
Currently what I'm doing is something like this:
a_dict = {
index : my_series[index,date] * 2 for index,date in my_series
}
but I keep getting this error:
'numpy.float64' object is not iterable
Use a groupby on level='Index':
df.groupby(level='Index').apply(function)
Or in a dictionary comprehension:
out = {k: function(g) for k, g in df.groupby(level='Index')}
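For example, a minimal sketch with a stand-in operation that doubles each group's values (using my_series from the question):
out = {k: g * 2 for k, g in my_series.groupby(level='Index')}
# out[12345] holds the doubled values for that group (14 -> 28, 65 -> 130, 81 -> 162)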
OK, I got what I needed, so I'm posting how I solved it in case anyone else needs it.
Basically, when iterating with for index,date in my_series,
the variable index included both levels, "Index" and "Date". I had to access the 'Index' value alone, so what I had to do was:
a_dict = {
index[0] : value * 2 for index,value in my_series.items()
}
In this case, index refers to both "Index" and "Date" in my series, and by looping over my_series.items() I get to access both the index and the values of interest. Not sure if it's clear what I mean, but I'm new to Python and Pandas haha
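To illustrate, a minimal sketch of what items() yields on such a series (the values mirror the question's data):
import pandas as pd

idx = pd.MultiIndex.from_tuples([(12345, '2019-01-03'), (12345, '2019-01-04')],
                                names=['Index', 'Date'])
my_series = pd.Series([14.0, 65.0], index=idx)
for index, value in my_series.items():
    print(index, index[0], value)  # e.g. (12345, '2019-01-03') 12345 14.0
Note that with index[0] as the dict key, later dates overwrite earlier ones within the same "Index", so a_dict keeps only the last value per group.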
I have a CSV file containing data like this:
DateTime, product_x, product_y, product_z
2018-01-02 00:00:00,945,1318,17.12
2018-01-03 00:00:00,958,1322,17.25
...
I want to use Python and Pandas to modify the values for product_x, product_y and product_z by some random amount - say adding a random value from -3 to +3 to each - and then write the result back to a CSV.
EDIT: I need each cell shifted by a different amount (except for random coincidences).
How do I do this please?
Use np.random.randint with the column names in a list to generate a 2D array, and add it to the original columns filtered by the same list:
cols = ['product_x','product_y','product_z']
# dynamic column names
# cols = df.filter(like='product').columns
# the upper bound of randint is exclusive, so use 4 to include +3
df[cols] += np.random.randint(-3, 4, size=(len(df.index), len(cols)))
print(df)
DateTime product_x product_y product_z
0 2018-01-02 00:00:00 947 1320 17.12
1 2018-01-03 00:00:00 958 1323 17.25
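For the full read-modify-write round trip, a minimal sketch (the file names are placeholders):
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)  # header has spaces after the commas
cols = df.filter(like='product').columns
df[cols] += np.random.randint(-3, 4, size=(len(df), len(cols)))  # upper bound is exclusive
df.to_csv('data_modified.csv', index=False)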
I have a pandas DataFrame called "x" like this:
            time
363108  05:01:00
363107 05:02:00
363106 05:03:00
363105 05:04:00
363104 05:05:00
...
4 16:57:00
3 16:58:00
2 16:59:00
1 17:00:00
0 17:01:00
The "time" column is string type.
I want to create a new DataFrame called "m" from all the rows in "x" such that the minute is "00".
I have tried m = x.loc[x["time"][3:5] == "00"] but I get "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)."
Does anybody know how to do this please?
You should use "apply" for the condition.
x.loc[x["time"].apply(lambda s: s[3:5] == "00")]
*In your code, x["time"][3:5] slices the Series itself (rows 3 to 5), not the characters of each string.
Another way is to create a new column in the existing DataFrame that holds the minutes field, sliced from the time column:
df['minutes'] = df['time'].str[3:5]  # .str slices each string; chars 3:5 are the minutes
other_df = df.loc[df['minutes'] == "00"]
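Alternatively, skip the helper column and build the mask directly with the .str accessor (a sketch, using x from the question):
m = x.loc[x["time"].str[3:5] == "00"]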
I have a dataset that has multiple values received per second - up to 100 DFS (no more, but not consistently 100). The challenge is that the date field did not capture time more granularly than second, so multiple rows have the same hh:mm:ss timestamp. These are fine, but I also have several seconds missing across the set, i.e., not showing at all.
Therefore my 2 initial columns might look like this, where I am missing the 54 sec step:
2020-08-24 03:36:53, 5
2020-08-24 03:36:53, 8
2020-08-24 03:36:53, 6
2020-08-24 03:36:55, 8
Because of the legit date "duplicates" and the information I need from them, I don't want to aggregate, but I do need to create the missing seconds, insert them, and fill them (NaN, etc.) so I can then manage them appropriately for aligning with other datasets.
The only way I can seem to do this is with a nested if loop that looks at the previous timestamp: if it equals the current cell (pt == ct), no action; if it is 1 second behind (pt == ct - 1), no action; but if it is 2 or more seconds behind (pt <= ct - 2), insert the missing seconds. This feels a bit cumbersome (though workable). Am I missing an easier way to do this?
I have checked a lot of "fill missing dates" threads on here, as well as various functions on pandas.pydata.org, but reindexing and the most common date fills all seem to rely on dates not having duplicates. Any advice would be fantastic.
This can be solved by creating a pandas Series containing all the timepoints you want to consider and then merging it with the original dataframe.
For example:
start, end = df['date'].min(), df['date'].max()
all_timepoints = pd.date_range(start, end, freq='s').to_series(name='date')
df.merge(all_timepoints, on='date', how='outer', sort=True).fillna(0)
Will give:
date value
0 2020-08-24 03:36:53 5.0
1 2020-08-24 03:36:53 8.0
2 2020-08-24 03:36:53 6.0
3 2020-08-24 03:36:54 0.0
4 2020-08-24 03:36:55 8.0
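If you would rather mark the gaps as missing instead of 0 (the question mentions NaN), just drop the fillna; the same merge leaves NaN in value for the inserted seconds:
df.merge(all_timepoints, on='date', how='outer', sort=True)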
I would like to store the summary of a local set of DataFrames into a "meta DataFrame" using pd.MultiIndex.
Basically, the row axis has two levels, and so does the column axis.
In the class managing the set of DataFrames, I define this "Meta DataFrame" as a class variable.
import pandas as pd
row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)
It seems to work.
MD.index
>>> MultiIndex([], names=['Data', 'Period'])
MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])
Now, each time I process an individual DataFrame id, I want to update this "Meta DataFrame" accordingly. id has a DatetimeIndex with a '5m' period.
id.index[0]
>>> Timestamp('2020-01-01 08:00:00')
id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')
I want to keep in MD its first and last index values for instance.
MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]
This doesn't work; I get the following error message:
TypeError: unhashable type: 'list'
In the end, I would like MD to hold the following type of info (I have other id DataFrames with different periods):
Timestamp
First Last
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00
10m 2020-01-05 08:00:00 2020-01-06 18:00:00
Ultimately, I will also keep min and max of some columns in id.
For instance if id has a column 'Temperature'.
Timestamp Temperature
First Last Min Max
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00 -2.5 10
10m 2020-01-05 08:00:00 2020-01-06 18:00:00 4 15
These values will be recorded when I record id.
I am aware initializing a DataFrame cell by cell is not time efficient, but it will not be done that often.
Besides, I don't see how I could manage this organization of information in a dict, which is why I am considering doing it with a multi-level DataFrame.
I will then dump it in a csv file to store these "meta data".
Please, what is the right way to initialize each of these values in MD?
Instead of filling an empty DataFrame, you can store the data in a dict of dicts. A MultiIndex uses tuples as its index values, so we make the keys of each dictionary tuples.
The outer dictionary uses the column MultiIndex tuples as keys, and each value is another dictionary with the row MultiIndex tuples as keys and the cell value as the value.
d = {('Score', 'Min'):       {('id1', '5m'): 72, ('id1', '10m'): -18},
     ('Timestamp', 'First'): {('id1', '5m'): 1,  ('id1', '10m'): 2},
     ('Timestamp', 'Last'):  {('id1', '5m'): 10, ('id1', '10m'): 20}}
#          |                         |                  |
#   column MultiIndex          row MultiIndex       cell value
#        label                     label
pd.DataFrame(d)
Score Timestamp
Min First Last
id1 5m 72 1 10
10m -18 2 20
Creating that dict will depend upon how you get the values. You can extend a dict with update:
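For example, a minimal sketch that grows the dict with update as each DataFrame is processed (the values are placeholders):
d = {('Timestamp', 'First'): {('id1', '5m'): 1}}
# after processing the next DataFrame, extend the inner dict:
d[('Timestamp', 'First')].update({('id1', '10m'): 2})
pd.DataFrame(d)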