I have several questions.
First, I want to parse datetimes in a pandas DataFrame that look like this: 2018/03/06 00:01:27:744. How can I convert them to datetime?
Second:
Time Sensor1 Sensor2 TimeCumsum
2018/03/06 00:01:27:744 0 1
2018/03/06 00:01:27:759 0 1
2018/03/06 00:01:27:806 0 1 0.15
2018/03/06 00:01:27:838 1 1
2018/03/06 00:01:28:009 1 1 0.2
2018/03/06 00:01:28:056 1 0 ...
I want a cumulative sum of seconds over the rows where Sensor1 is 0 and Sensor2 is 1.
How can I do this?
Thanks.
I believe you need:
df['Time'] = pd.to_datetime(df['Time'], format='%Y/%m/%d %H:%M:%S:%f')  # parse the custom timestamp format
m = df['Sensor1'].eq(0) & df['Sensor2'].eq(1)                           # mask: Sensor1 == 0 and Sensor2 == 1
df['col'] = df.loc[m, 'Time'].dt.microsecond.cumsum() // 10**3          # running sum of the microsecond component for masked rows, in milliseconds
print (df)
Time Sensor1 Sensor2 col
0 2018-03-06 00:01:27.744 0 1 744.0
1 2018-03-06 00:01:27.759 0 1 1503.0
2 2018-03-06 00:01:27.806 0 1 2309.0
3 2018-03-06 00:01:27.838 1 1 NaN
4 2018-03-06 00:01:28.009 1 1 NaN
5 2018-03-06 00:01:28.056 1 0 NaN
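If you actually need the cumulative elapsed time in seconds between qualifying rows (rather than a sum of the sub-second parts), a minimal sketch along these lines might help; the column name TimeCumsum and the use of consecutive differences are assumptions about the intent:
# hedged sketch: cumulative elapsed seconds over rows where Sensor1 == 0 and Sensor2 == 1
m = df['Sensor1'].eq(0) & df['Sensor2'].eq(1)
elapsed = df.loc[m, 'Time'].diff().dt.total_seconds()  # seconds between consecutive qualifying rows
df['TimeCumsum'] = elapsed.cumsum()                    # NaN for the first qualifying row and for non-qualifying rows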
I would like to transform the dictionary below
x = {
'John': ['1.12.2021','2.12.2021','3.02.2022','4.2.2022','5.07.2022','6.07.2022','7.12.2022','8.12.2022'],
'Andrew': ['1.12.2021','2.03.2022','3.03.2022','4.05.2022','5.05.2022','6.09.2022','7.09.2022','8.11.2022','9.12.2022','10.12.2022']}
into a DataFrame like this, with columns that will show counts per Month
Name    12.2021  02.2022  03.2022  05.2022  07.2022  09.2022  11.2022  12.2022
John          2        2        0        0        2        0        0        2
Andrew        1        0        2        2        0        2        1        2
I started with this, transforming the values into datetimes:
x = pd.DataFrame.from_dict(x, 'index').reset_index().fillna(value='0')
x.iloc[:,1:] = pd.to_datetime(x.iloc[:,1:], format='%d.%m.%Y')
print(x)
But getting this error: AttributeError: 'int' object has no attribute 'lower'
You could use a dictionary comprehension and pandas.concat+unstack:
day first
(pd.concat({k: pd.to_datetime(pd.Series(v), dayfirst=True)
.dt.strftime('%Y-%m')
.value_counts()
for k,v in x.items()})
.unstack(fill_value=0)
)
output:
2021-12 2022-02 2022-03 2022-05 2022-07 2022-09 2022-11 2022-12
John 2 2 0 0 2 0 0 2
Andrew 1 0 2 2 0 2 1 2
previous answer: month first (before data update)
(pd.concat({k: pd.Series(v, dtype='datetime64[ns]')
.dt.strftime('%Y-%m')
.value_counts()
for k,v in x.items()})
.unstack(fill_value=0)
)
output:
2021-01 2021-02 2021-03 2021-04 2021-05 2021-06 2021-07 2021-08 2021-09 2021-10
John 1 1 1 1 1 1 1 1 0 0
Andrew 1 1 1 1 1 1 1 1 1 1
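As an aside, the AttributeError in your attempt most likely comes from passing a 2-D slice to pd.to_datetime: when given a DataFrame, it expects year/month/day component columns and lower-cases the column names, which fails for integer labels. A column-wise apply is a possible fix (errors='coerce' is an assumption here, to turn the filled '0' placeholders into NaT):
# hedged sketch: convert each date column separately instead of passing the 2-D slice
x = pd.DataFrame.from_dict(x, 'index').reset_index().fillna(value='0')
x.iloc[:, 1:] = x.iloc[:, 1:].apply(pd.to_datetime, format='%d.%m.%Y', errors='coerce')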
I have a data frame
Count ID Date
1 1 2020-07-09
2 1 2020-07-11
1 1 2020-07-21
1 2 2020-07-04
2 2 2020-07-09
3 2 2020-07-18
1 3 2020-07-02
2 3 2020-07-05
1 3 2020-07-19
2 3 2020-07-22
For each ID group, I want to subtract from each date the previous date that has the same Count. Rows without a matching earlier Count get a value of zero.
Expected output
ID Date Days
1 2020-07-09 0
1 2020-07-11 0
1 2020-07-21 12 (2020-07-21 MINUS 2020-07-09)
2 2020-07-04 0
2 2020-07-09 0
2 2020-07-18 0
3 2020-07-02 0
3 2020-07-05 0
3 2020-07-19 17 (2020-07-19 MINUS 2020-07-02)
3 2020-07-22 17 (2020-07-22 MINUS 2020-07-05)
My initial thought was to filter out Count-ID pairs and then do the calculation. I was wondering if there is a better way to work around this?
You can use groupby() to group by the ID and Count columns, get the difference in days with .diff(), and fill NaN values with 0 using .fillna(), as follows:
df['Date'] = pd.to_datetime(df['Date']) # convert to datetime if not already in datetime format
df['Days'] = df.groupby(['ID', 'Count'])['Date'].diff().dt.days.fillna(0, downcast='infer')
Result:
print(df)
Count ID Date Days
0 1 1 2020-07-09 0
1 2 1 2020-07-11 0
2 1 1 2020-07-21 12
3 1 2 2020-07-04 0
4 2 2 2020-07-09 0
5 3 2 2020-07-18 0
6 1 3 2020-07-02 0
7 2 3 2020-07-05 0
8 1 3 2020-07-19 17
9 2 3 2020-07-22 17
I like SeaBean's answer, but here is what I was working on before I saw it:
df2 = df.sort_values(by=['ID', 'Count'])
df2['Date'] = pd.to_datetime(df2['Date'])
df2['shift1'] = df2.groupby(['ID', 'Count'])['Date'].shift(1)  # previous date within the same ID/Count pair
# combine_first replaces a missing shifted date with the row's own date, so the difference becomes 0
df2['diff'] = (df2.Date - df2.shift1.combine_first(df2.Date)).dt.days
I have a pandas.DataFrame with a column that contains integers, strings, dates...
I want to create indicator columns (containing 0/1) that tell whether the value in that column is a string, a number, or a date, in an efficient way.
A
0 Hello
1 Name
2 123
3 456
4 22/03/2019
And the output should be
A A_string A_number A_date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
Using pandas str methods to check for the string type could help:
df = pd.read_clipboard()                          # load the example column from the clipboard
df['A_string'] = df.A.str.isalpha().astype(int)   # purely alphabetic -> string
df['A_number'] = df.A.str.isdigit().astype(int)   # purely digits -> number
# naive assumption: anything that is not purely alphanumeric (e.g. contains '/') is a date
df['A_Date'] = (~df.A.str.isalnum()).astype(int)
df.filter(['A', 'A_string', 'A_number', 'A_Date'])
A A_string A_number A_Date
0 Hello 1 0 0
1 Name 1 0 0
2 123 0 1 0
3 456 0 1 0
4 22/03/2019 0 0 1
We can use the native pandas to_numeric and to_datetime to test for numbers and dates. Then we can use .loc for assignment and fillna to match your target df.
df.loc[~pd.to_datetime(df['A'], errors='coerce').isna(), 'A_Date'] = 1    # parseable as a date
df.loc[~pd.to_numeric(df['A'], errors='coerce').isna(), 'A_Number'] = 1   # parseable as a number
df.loc[pd.to_numeric(df['A'], errors='coerce').isna()
       & pd.to_datetime(df['A'], errors='coerce').isna(), 'A_String'] = 1  # neither -> string
df = df.fillna(0)
print(df)
A A_Date A_Number A_String
0 Hello 0.0 0.0 1.0
1 Name 0.0 0.0 1.0
2 123 0.0 1.0 0.0
3 456 0.0 1.0 0.0
4 22/03/2019 1.0 0.0 0.0
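If you prefer integer 0/1 flags directly (instead of the floats left by fillna), a small sketch along these lines should also work; treating anything that parses as a number as a number first is an assumption:
# hedged sketch: build the 0/1 indicator columns as integers in one pass
is_number = pd.to_numeric(df['A'], errors='coerce').notna()
is_date = pd.to_datetime(df['A'], errors='coerce').notna() & ~is_number  # assumption: numbers win over dates
df['A_number'] = is_number.astype(int)
df['A_date'] = is_date.astype(int)
df['A_string'] = (~is_number & ~is_date).astype(int)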
I have the following dataframe:
doc_id is_fulltext
1243 dok:1 1
3310 dok:1 1
4370 dok:1 1
14403 dok:1020 1
17252 dok:1020 1
15977 dok:1020 0
16480 dok:1020 1
16252 dok:1020 1
468 dok:103 1
128 dok:1030 0
1673 dok:1038 1
I would like to split the is_fulltext column into two columns and count the occurrences of the docs at the same time.
Desired Output:
doc_id fulltext non-fulltext
0 dok:1 3 0
1 dok:1020 4 1
2 dok:103 1 0
3 dok:1030 0 1
4 dok:1038 1 0
I followed the procedure of Pandas - Create columns from column value, and fill with count
That post shows several alternatives, suggesting Categorical or reindex. I tried the following:
cats = ['fulltext', 'non_fulltext']
df_sorted['is_fulltext'] = pd.Categorical(df_sorted['is_fulltext'], categories=cats)
new_df = df_sorted.groupby(['doc_id', 'is_fulltext']).size().unstack(fill_value=0)
Here I get a ValueError:
ValueError: Length of passed values is 17446, index implies 0
Then I tried this method
cats = ['fulltext', 'non_fulltext']
new_df = df_sorted.groupby(['doc_id','is_fulltext']).size().unstack(fill_value=0).reindex(columns=cats).reset_index()
While this seems to have worked fine in the original post, my counts are filled with NaNs (see below). I have read that this happens when combining reindex with a Categorical, but I wonder why it seems to have worked in the original post. How can I solve this? Can anyone help? Thank you!
doc_id fulltext non-fulltext
0 dok:1 NaN NaN
1 dok:1020 NaN NaN
2 dok:103 NaN NaN
3 dok:1030 NaN NaN
4 dok:1038 NaN NaN
You could GroupBy the doc_id, apply pd.value_counts to each group and unstack:
(df.groupby('doc_id').is_fulltext.apply(pd.value_counts)
.unstack()
.fillna(0)
.rename(columns={0:'non-fulltext', 1:'fulltext'})
.reset_index())
doc_id non-fulltext fulltext
0 dok:1 0.0 3.0
1 dok:1020 1.0 4.0
2 dok:103 0.0 1.0
3 dok:1030 1.0 0.0
4 dok:1038 0.0 1.0
Or similarly to your own method, if performance is an issue do instead:
(df.groupby(['doc_id', 'is_fulltext']).size()
   .unstack(fill_value=0)
   .rename(columns={0: 'non_fulltext', 1: 'fulltext'})
   .reset_index())
is_fulltext    doc_id  non_fulltext  fulltext
0               dok:1             0         3
1            dok:1020             1         4
2             dok:103             0         1
3            dok:1030             1         0
4            dok:1038             0         1
I don't know if it's the best approach, but this should work for you:
import pandas as pd

df = pd.DataFrame({"doc_id": ["id1", "id2", "id1", "id2"],
                   "is_fulltext": [1, 0, 1, 1]})
df_grouped = df.groupby("doc_id").sum().reset_index()  # sum of the 1s = fulltext count per doc
# total rows per doc minus fulltext rows = non-fulltext rows
df_grouped["non_fulltext"] = df.groupby("doc_id").count().reset_index()["is_fulltext"] - df_grouped["is_fulltext"]
df_grouped
And the output is:
doc_id is_fulltext non_fulltext
0 id1 2 0
1 id2 1 1
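A hedged alternative sketch: pd.crosstab can produce the same per-document counts directly (this assumes df has the doc_id and is_fulltext columns from the question):
counts = (pd.crosstab(df['doc_id'], df['is_fulltext'])
            .rename(columns={1: 'fulltext', 0: 'non_fulltext'})
            .reset_index())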
I have data grouped by two columns [CustomerID, cluster] like this:
CustomerIDClustered.groupby(['CustomerID','cluster']).count()
Count
CustomerID cluster
1893 0 1
1 2
2 5
3 1
2304 2 3
3 1
2655 0 1
2 1
2850 1 1
2 1
3 1
3648 0 1
I need to assign the most frequent cluster to each customer ID.
For Example:
1893->2 (cluster 2 appears more often than the other clusters)
2304->2
2655->1
Use sort_values, reset_index and finally drop_duplicates:
df = df.sort_values('Count', ascending=False).reset_index().drop_duplicates('CustomerID')
Similar solution, only filter by first level of MultiIndex:
df = df.sort_values('Count', ascending=False)
df = df[~df.index.get_level_values(0).duplicated()].reset_index()
print (df)
CustomerID cluster Count
0 1893 2 5
1 2304 2 3
2 2655 0 1
3 2850 1 1
4 3648 0 1
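A hedged alternative sketch using idxmax (this assumes df is the grouped frame with a (CustomerID, cluster) MultiIndex and a Count column):
# pick the index label of the largest Count within each CustomerID
top = df['Count'].groupby(level='CustomerID').idxmax()  # Series of (CustomerID, cluster) tuples
result = pd.DataFrame(top.tolist(), columns=['CustomerID', 'cluster'])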