Select rows from pandas DataFrame

I have a pandas DataFrame called "x" like this:
time
363108 05:01:00
363107 05:02:00
363106 05:03:00
363105 05:04:00
363104 05:05:00
...
4 16:57:00
3 16:58:00
2 16:59:00
1 17:00:00
0 17:01:00
The "time" column is string type.
I want to create a new DataFrame called "m" from all the rows in "x" such that the minute is "00".
I have tried m = x.loc[x["time"][3:5] == "00"] but I get "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)."
Does anybody know how to do this please?

You should use "apply" for the condition.
x.loc[x["time"].apply(lambda s: s[3:5] == "00")]
In your code, x["time"][3:5] slices the time Series itself (rows 3 to 5) rather than each string, so the resulting boolean Series does not align with x's index, hence the IndexingError.
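A vectorized alternative is the .str accessor, which applies the slice to each string rather than to the Series (a sketch, assuming the HH:MM:SS strings shown above):
m = x.loc[x["time"].str[3:5] == "00"]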

Another option is to create a new column in the existing dataframe holding the minutes field, sliced from the time column with the .str accessor (plain df['time'][-2:] would slice rows of the Series, not each string):
df['minutes'] = df['time'].str[-2:]
other_df = df.loc[df['minutes'] == "00"]
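If the strings are always well-formed times, another option is to parse them and test the minute component directly (a sketch; assumes the HH:MM:SS format from the question):
import pandas as pd

# parse the strings as times and keep the rows whose minute is zero
minutes = pd.to_datetime(x["time"], format="%H:%M:%S").dt.minute
m = x.loc[minutes == 0]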

Create column with date and float data types

I work with a dataframe named emails_visits:
pandas is imported as pd and numpy as np
Rep Doctor Date type
0 1 1 2021-01-25 email
1 1 1 2021-05-29 email
2 1 2 2021-03-15 email
3 1 2 2021-04-02 email
4 1 2 2021-04-29 email
30 1 2 2021-06-01 visit
5 1 3 2021-01-01 email
I want to create a column "date_after" based on the value in column "type": if it is equal to "visit" I would like to see the date from column "Date", otherwise empty.
I use this code:
emails_visits["date_after"]=np.where(emails_visits["type"]=="visit",emails_visits["Date"],np.nan)
However, it raises an error:
emails_visits["date_after"]=np.where(emails_visits["type"]=="visit",emails_visits["Date"],np.nan)
File "<__array_function__ internals>", line 5, in where
TypeError: The DType <class 'numpy.dtype[datetime64]'> could not be promoted by <class 'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtype[datetime64]'>, <class 'numpy.dtype[float64]'>)
How can I fix this?
You can do it like this if you want.
emails_visits['date_after'] = emails_visits.apply(lambda row: row['Date'] if row['type'] == 'visit' else '', axis=1)
The dtype datetime64 of the column Date of emails_visits is incompatible with that of np.nan, which is a np.float64. Since it seems you use pandas, you can use pd.NA instead, which marks missing values (while np.nan means the value is not a number and only applies to floating-point data). In fact, it is better not to use np.where here but a pandas function. Here is a simple solution:
emails_visits["date_after"] = emails_visits["Date"].where(emails_visits["type"]=="visit")
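For completeness, the promotion error can also be avoided by handing np.where a datetime-typed missing value such as np.datetime64("NaT") instead of np.nan (a sketch; the Series.where form above remains the more idiomatic fix):
import numpy as np

# NaT and datetime64 share a common dtype, so no float promotion is needed
emails_visits["date_after"] = np.where(emails_visits["type"] == "visit", emails_visits["Date"], np.datetime64("NaT"))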

Preserving duplicate index of a pandas df when converting to python dictionary

I have a df which has a duplicate index at '2020-10-25 02:00:00' with different values:
df
... 5
2020-10-25 02:00:00 10
2020-10-25 02:00:00 7
... 8
Because of the summer/winter time change I have this duplicate index. It is fine until I want to convert this df to a dictionary via df.to_dict(). When I do, one of the values at '2020-10-25 02:00:00' is removed, since a dictionary cannot have duplicate keys.
Instead of hardcoding, I am looking for something like following, which could maybe store these two values as a list when converting into a dictionary:
df.to_dict(preserve_duplicates=True)
Summary: Is there a way to preserve duplicate index of a df, when it is converted to python dictionary?
One thing you can do is to group those values together (i.e. to a list) before you convert the df to dict:
value
date
2020-10-25 01:00:00 5
2020-10-25 02:00:00 10
2020-10-25 02:00:00 7
df.groupby(df.index).agg(list).to_dict()
> {'value': {'2020-10-25 01:00:00': [5], '2020-10-25 02:00:00': [10, 7]}}
The agg function can be flexible depending on your need, you can also do a sum or any other operations.
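For example, summing the duplicated entries instead of collecting them (a sketch on the same frame):
df.groupby(df.index).agg('sum').to_dict()
> {'value': {'2020-10-25 01:00:00': 5, '2020-10-25 02:00:00': 17}}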
Duplicate indexes in a pandas DataFrame should be avoided, but in a Python dict they are simply impossible.
IMHO, the simplest way is just to reset the index before building the dict:
df.reset_index().to_dict()
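If you need the dates preserved per row, a hedged alternative is orient='records', which emits one dict per row so duplicates survive (a sketch, using the 'date'/'value' columns shown above):
df.reset_index().to_dict(orient='records')
> [{'date': '2020-10-25 01:00:00', 'value': 5}, {'date': '2020-10-25 02:00:00', 'value': 10}, {'date': '2020-10-25 02:00:00', 'value': 7}]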

Python / Pandas: How to create an empty multi-index DataFrame and then start filling it?

I would like to store the summary of a local set of DataFrames into a "meta DataFrame" using pd.MultiIndex.
Basically, row-axis has two levels, and column-axis also.
In the class managing the set of DataFrames, I define this "meta DataFrame" as a class variable.
import pandas as pd
row_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Period'])
column_axis = pd.MultiIndex(levels=[[],[]], codes=[[],[]], names=['Data', 'Extrema'])
MD = pd.DataFrame(index=row_axis, columns=column_axis)
It seems to work.
MD.index
>>> MultiIndex([], names=['Data', 'Period'])
MD.columns
>>> MultiIndex([], names=['Data', 'Extrema'])
Now, each time I process an individual DataFrame id, I want to update this "Meta DataFrame" accordingly. id has a DateTimeIndex with period '5m'.
id.index[0]
>>> Timestamp('2020-01-01 08:00:00')
id.index[-1]
>>> Timestamp('2020-01-02 08:00:00')
I want to keep in MD its first and last index values for instance.
MD.loc[[('id', '5m')],[('Timestamp', 'First')]] = id.index[0]
MD.loc[[('id', '5m')],[('Timestamp', 'Last')]] = id.index[-1]
This doesn't work, I get following error message:
TypeError: unhashable type: 'list'
In the end, I would like MD to hold the following type of info (I have other id DataFrames with different periods):
Timestamp
First Last
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00
10m 2020-01-05 08:00:00 2020-01-06 18:00:00
Ultimately, I will also keep min and max of some columns in id.
For instance if id has a column 'Temperature'.
Timestamp Temperature
First Last Min Max
id 5m 2020-01-01 08:00:00 2020-01-02 08:00:00 -2.5 10
10m 2020-01-05 08:00:00 2020-01-06 18:00:00 4 15
These values will be recorded when I record id.
I am aware that initializing a DataFrame cell by cell is not time efficient, but it will not be done that often.
Besides, I don't see how I could manage this organization of information in a dict, which is why I am considering doing it with a multi-level DataFrame.
I will then dump it in a csv file to store these "meta data".
Please, what is the right way to initialize each of these values in MD?
I thank you for your help!
Best,
Instead of filling an empty DataFrame you can store the data in a dict of dicts. A MultiIndex uses tuples as the index values so we make the keys of each dictionary tuples.
The outer dictionary uses the column MultiIndex tuples as keys, and each value is another dictionary with the row MultiIndex tuples as keys and the cell value as the value.
d = {('Score', 'Min'): {('id1', '5m'): 72, ('id1', '10m'): -18},
     ('Timestamp', 'First'): {('id1', '5m'): 1, ('id1', '10m'): 2},
     ('Timestamp', 'Last'): {('id1', '5m'): 10, ('id1', '10m'): 20}}
# outer keys:   column MultiIndex labels
# inner keys:   row MultiIndex labels
# inner values: cell values
pd.DataFrame(d)
Score Timestamp
Min First Last
id1 5m 72 1 10
10m -18 2 20
Creating that dict will depend upon how you get the values. You can extend a dict as you go with dict.update.
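A minimal sketch of that incremental pattern (the names id1 and MD are illustrative, id1 standing for one processed DataFrame):
import pandas as pd

d = {}
# after processing each individual DataFrame, record its summary cells
d.setdefault(('Timestamp', 'First'), {}).update({('id1', '5m'): id1.index[0]})
d.setdefault(('Timestamp', 'Last'), {}).update({('id1', '5m'): id1.index[-1]})
MD = pd.DataFrame(d)  # rebuild the meta DataFrame whenever it is needed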

What is happening when pandas.Series converts int64s into NaNs?

I have a csv with dates and integers (Headers: Date, Number), separated by a tab.
I'm trying to create a calendar heatmap with CalMap (demo on that page). The function that creates the chart takes data that's indexed by DateTime.
df = pd.read_csv("data.csv",delimiter="\t")
df['Date'] = df['Date'].astype('datetime64[ns]')
events = pd.Series(df['Date'],index = df['Number'])
calmap.yearplot(events)
But when I check events.head(5), it gives the date followed by NaN. I check df['Number'].head(5) and they appear as int64.
What am I doing wrong that is causing this conversion?
Edit: Data below
Date Number
7/9/2018 40
7/10/2018 40
7/11/2018 40
7/12/2018 70
7/13/2018 30
Edit: Output of events.head(5)
2018-07-09 NaN
2018-07-10 NaN
2018-07-11 NaN
2018-07-12 NaN
2018-07-13 NaN
dtype: float64
First of all, it is not NaN, it is NaT (Not a Time), which is unique to pandas, though pandas makes it compatible with NaN and uses it, like NaN in floating-point columns, to mark missing data.
What pd.Series(data, index=index) does apparently depends on the type of data. If data is a list, then index has to be of equal length, and a new Series will be constructed, with data being data, and index being index. However, if data is already a Series (such as df['Date']), it will instead take the rows corresponding to index and construct a new Series out of those rows. For example:
pd.Series(df['Date'], [1, 1, 4])
will give you
1 2018-07-10
1 2018-07-10
4 2018-07-13
Here 2018-07-10 comes from row #1, and 2018-07-13 from row #4 of df['Date']. However, there is no row with index 40, 70 or 30 in your sample input data, so missing data is presumed, and NaT is inserted instead.
In contrast, this is what you get when you use a list instead:
pd.Series(df['Date'].to_list(), index=df['Number'])
# => Number
# 40 2018-07-09
# 40 2018-07-10
# 40 2018-07-11
# 70 2018-07-12
# 30 2018-07-13
# dtype: datetime64[ns]
I was able to fix this by changing the series into lists via df['Date'].tolist() and df['Number'].tolist(). calmap.calendarplot(events) was able to accept these instead of the original series.
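Put together, the fix might look like this (a sketch; calmap expects numeric values indexed by dates, so the roles of the two columns are swapped relative to the snippet in the question):
import pandas as pd
import calmap

df = pd.read_csv("data.csv", delimiter="\t")
df['Date'] = pd.to_datetime(df['Date'])
# plain lists bypass the reindex-by-label behaviour described above
events = pd.Series(df['Number'].tolist(), index=df['Date'].tolist())
calmap.yearplot(events)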

Apply function to all columns in pd dataframe using variables in other dataframes

I have a series of dataframes: some hold static values, some are time series.
I want to transpose the values from one time series to a new time series, applying a function which draws values from both the original time series dataframe and the dataframe which holds static values.
A snip of the time series and static dataframes are below.
Time series dataframe (Irradiance)
Datetime Date Time GHI DIF flagR SE SA TEMP
2017-07-01 00:11:00 01.07.2017 00:11 0 0 0 -9.39 -179.97 11.1
2017-07-01 00:26:00 01.07.2017 00:26 0 0 0 -9.33 -176.47 11.0
2017-07-01 00:41:00 01.07.2017 00:41 0 0 0 -9.14 -172.98 10.9
2017-07-01 00:56:00 01.07.2017 00:56 0 0 0 -8.83 -169.51 10.9
2017-07-01 01:11:00 01.07.2017 01:11 0 0 0 -8.40 -166.04 11.0
Static dataframe (Properties)
Bearing (degrees) Inclination (degrees)
Home ID
151631 244 29
151632 244 29
151633 184 44
I have written a function which I want to use to populate a new dataframe using values from both of these.
def dif(DIF, Inclination, GHI):
    global Albedo
    return DIF * (1 + math.cos(math.radians(Inclination)) / 2) + (GHI * Albedo * (1 - math.cos(math.radians(Inclination)) / 2))
When I have done something similar within a single dataframe I have used the NumPy vectorize function, so I thought I would be able to iterate over each column of the new dataframe using the following code.
for column in DIF:
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], properties.iloc['Inclination (degrees)'][column], irradiance['GHI'])
Instead this throws the following error.
TypeError: cannot do positional indexing on <class 'pandas.core.indexes.numeric.Int64Index'> with these indexers [Inclination (degrees)] of <class 'str'>
I've checked the dtype of the 'Inclination (degrees)' values and it is returned as Int64, not str, so I'm not sure why this error is being generated.
I'm obviously missing something critical here. Are there alternative methods that would work better, or at all? Any help would be much appreciated.
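For what it's worth, the error likely comes from .iloc, which accepts only integer positions, while 'Inclination (degrees)' is a string label and therefore needs .loc. A hedged sketch, assuming each column name in DIF is a Home ID present in the index of properties:
import numpy as np

for column in DIF:
    # label-based lookup: row = Home ID, column = 'Inclination (degrees)'
    inclination = properties.loc[column, 'Inclination (degrees)']
    DIF[column] = np.vectorize(dif)(irradiance['DIF'], inclination, irradiance['GHI'])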
