Find difference between many datetime columns in pandas - python

I have a DataFrame with 282 columns and 14,000 rows. It looks as follows:
                                          0           1  ...  282
uref_fixed
0006d730f5aa8492a59150e35bca5cc6  3/26/2018    7/3/2018  ...
00076311c47c44c33ffb834b1cebf5db  5/13/2018   5/13/2018  ...
0009ba8a69924902a9692c5f3aacea7f  7/13/2018        None  ...
000dccb863b913226bca8ca636c9ddce  11/5/2017  11/10/2017  ...
I am trying to end up with a column at index 0 which, for each row, shows the average of the differences between consecutive date values in that row (i.e. the difference between the first and second dates, then the second and third, then the third and fourth, and so on, and finally the average of all of these).
Please note that there can be up to 282 date values in a row, but as you can see many rows have fewer.
Cheers

from datetime import datetime as dt
import pandas as pd

# df is your dataframe; df2 is a new, empty one that will hold the differences
df2 = pd.DataFrame(index=df.index)

def diffdate(row, col1, col2):
    # difference in days between two date strings in one row, or None if either is missing
    if row[col1] is None or row[col2] is None:
        return None
    date1 = [int(i) for i in row[col1].split('/')]  # [month, day, year]
    date2 = [int(i) for i in row[col2].split('/')]
    return (dt(date2[2], date2[0], date2[1]) - dt(date1[2], date1[0], date1[1])).days

for i in range(len(df.columns) - 1):
    df2[i] = df.apply(lambda x: diffdate(x, i, i + 1), axis=1)
df2 will hold all of the consecutive pair differences. Averaging the rows after this is pretty simple.
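For completeness, here is a minimal sketch of the averaging step, plus a vectorized alternative that skips the manual string parsing (assuming every cell is either a date string like '3/26/2018' or None; the variable names are illustrative):

# average of the consecutive-pair differences produced above
df2['avg_days'] = df2.apply(pd.to_numeric, errors='coerce').mean(axis=1)

# vectorized alternative: parse every cell to a datetime once (None becomes NaT),
# then average the consecutive gaps per row, in days
dates = df.apply(pd.to_datetime, errors='coerce')
avg_gap = dates.apply(lambda row: row.dropna().diff().dt.days.mean(), axis=1)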

Related

Drop rows of df with fewer entries

I'm currently working with pandas in Python.
I've got a dataset of customers (user_id in column 1) and the items they bought (column 2).
Example dataset:
ID_user  ID_item
      0        1
      0        2
      0        3
      1        2
      2        1
      3        3
    ...      ...
Now I want to focus only on customers who have bought more than 10 items. How can I create a new dataframe with pandas and drop all the other customers?
Thank you very much!
You could first group your dataframe by the column "ID_user" and aggregate with the .count() method. Afterwards, keep only those values that are bigger than 10 using a lambda function.
# Group by column ID_user and the method .count()
df = df.groupby('ID_user').count()
# Only show values for which the lambda function evaluates to True
df = df[lambda row: row["ID_item"] > 10]
Or just do it in one line:
df = df.groupby('ID_user').count()[lambda row: row["ID_item"] > 10]
You can try groupby with transform('count'), then filter on the result:
n = 10
cond = df.groupby('ID_user')['ID_item'].transform('count')
out = df[cond >= n].copy()
A simple groupby + filter will do the job:
>>> df.groupby('ID_user').filter(lambda g: len(g) > 10)
Empty DataFrame
Columns: [ID_user, ID_item]
Index: []
Now, in your example, there aren't actually any groups that do have more than 10 items, so it's showing an empty dataframe here. But in your real data, this would work.
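As a further sketch (not from the answers above, just a common alternative), the same filtering can be done without groupby by mapping each user's purchase count back onto the rows:

# count purchases per user, map the counts back onto each row,
# and keep only the rows whose user has more than 10 purchases
counts = df['ID_user'].map(df['ID_user'].value_counts())
big_buyers = df[counts > 10]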

python / pandas: How to count each cluster of unevenly distributed distinct values in each row

I am transitioning from Excel to Python and finding the process a little daunting. I have a pandas dataframe and cannot find how to count the total of each cluster of '1's per row, grouped by each ID (example data below).
       ID  20-21  19-20  18-19  17-18  16-17  15-16  14-15  13-14  12-13  11-12
0  335344      0      0      1      1      1      0      0      0      0      0
1  358213      1      1      0      1      1      1      1      0      1      0
2  358249      0      0      0      0      0      0      0      0      0      0
3  365663      0      0      0      1      1      1      1      1      0      0
The result of the above, in the format

ID
    last column heading where a '1' occurs in the cluster: count of '1's in that cluster

would be:

335344
    16-17: 3
358213
    19-20: 2
    14-15: 4
    12-13: 1
365663
    13-14: 5
There are more than 11,000 rows of data, and I would like to output the result to a txt file. I have been unable to find any examples of how to count clusters of identical values within a row, but I am probably not using the correct Python terminology. I would be grateful if someone could point me in the right direction. Thanks in advance.
The first step is to reshape with DataFrame.set_index and DataFrame.stack. Then create consecutive groups in a new column g by comparing the values with their Series.shift-ed neighbours for inequality and taking the cumulative sum with Series.cumsum. Finally, filter the rows equal to 1 and aggregate by named aggregation with GroupBy.agg, using GroupBy.last and GroupBy.size:
df = df.set_index('ID').stack().reset_index(name='value')
df['g'] = df['value'].ne(df['value'].shift()).cumsum()
df1 = (df[df['value'].eq(1)].groupby(['ID', 'g'])
         .agg(a=('level_1', 'last'), b=('level_1', 'size'))
         .reset_index(level=1, drop=True)
         .reset_index())
print (df1)
       ID      a  b
0  335344  16-17  3
1  358213  19-20  2
2  358213  14-15  4
3  358213  12-13  1
4  365663  13-14  5
Last, for writing to the txt file, use DataFrame.to_csv:
df1.to_csv('file.txt', index=False)
If you need your custom format in the text file, use:
with open("file.txt","w") as f:
for i, g in df1.groupby('ID'):
f.write(f"{i}\n")
for a, b in g[['a','b']].to_numpy():
f.write(f"\t{a}: {b}\n")
You just need to use the sum method and then specify which axis you would like to sum on. To get the sum of each row, create a new series equal to the sum of the row.
# create new series equal to sum of values in the index row
df['sum'] = df.sum(axis=1) # specifies index (row) axis
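Note that in the example data the ID column is numeric too, so it would be swept into that row sum; a small sketch that leaves it out (assuming the column is literally named 'ID'):

# exclude the ID column before summing across each row
df['sum'] = df.drop(columns='ID').sum(axis=1)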
The best method for getting the sum of each column is dependent on how you want to use that information but in general the core is just to use the sum method on the series and assign it to a variable.
# sum a column and assign result to variable
foo = df['20-21'].sum() # default axis=0
bar = df['16-17'].sum() # default axis=0
print(foo) # returns 1
print(bar) # returns 3
You can get the sum of each column using a for loop and add them to a dictionary. Here is a quick function I put together that should get the sum of each column and return a dictionary of the results, so you know which total belongs to which column. The two inputs are (1) the dataframe and (2) a list of any column names you would like to ignore:
import pandas as pd

def get_df_col_sum(frame: pd.DataFrame, ignore: list) -> dict:
    """Get the sum of each column in a dataframe in a dictionary"""
    # get list of headers in dataframe
    dfcols = frame.columns.tolist()
    # create a blank dictionary to store results
    dfsums = {}
    # loop through each column and add its sum to the dictionary
    for dfcol in dfcols:
        if dfcol not in ignore:
            dfsums.update({dfcol: frame[dfcol].sum()})
    return dfsums
I then ran the following code
# read excel to dataframe
df = pd.read_excel(test_file)
# ignore the ID column
ignore_list = ['ID']
# get sum for each column
res_dict = get_df_col_sum(df, ignore_list)
print(res_dict)
and got the following result.
{'20-21': 1, '19-20': 1, '18-19': 1, '17-18': 3, '16-17': 3, '15-16': 2, '14-15': 2, '13-14': 1, '12-13': 1, '11-12': 0}
Sources: Sum by row, Pandas Sum, Add pairs to dictionary

Iterate over dates pandas

I have a pandas dataframe:
  CITY        DT       Error%
1    A  1/1/2020   0.03436722
2    A  1/2/2020   0.03190177
3    B  1/9/2020  0.040218757
4    B  1/8/2020  0.098921665
I want to iterate through the dataframe and check whether a DT and the DT one week later both have an Error% of less than 0.05.
I want the result to be the rows

2    A  1/2/2020   0.03190177
3    B  1/9/2020  0.040218757
IIUC,
df['DT'] = pd.to_datetime(df['DT'])
idx = df[df['DT'].sub(df['DT'].shift()).gt('6 days')].index.tolist()
indices = []
for i in idx:
    indices.append(i-1)
    indices.append(i)
print(df.loc[df['Error%'] <= 0.05].loc[indices])
  CITY         DT    Error%
2    A 2020-01-02  0.031902
3    B 2020-01-09  0.040219
Not particularly elegant, but it gets the job done and maybe some of the professionals here can improve on it:
First, merge the information for each day with the info for the day a week later by performing a self-join on the time-shifted DT column. We can use an inner join since we're only interested in rows that have an entry for the following week:
tmp = df.set_index(df.DT.apply(lambda x: x + pd.Timedelta('7 days'))) \
        .join(df.set_index('DT'), lsuffix='_L', how='inner')
Then select the date column for those entries where both error margins are satisfied:
tmp = tmp.DT.loc[(tmp['Error%_L'] < 0.05) & (tmp['Error%'] < 0.05)]
tmp is now a pd.Series with information in the index (the shifted values) and the values (the first week). Since both dates are wanted in the output, compile the "index dates" by taking the unique values among all of them:
idx = list(set(tmp.tolist() + tmp.index.tolist()))
And finally, grab the corresponding rows from the original dataframe:
df.set_index('DT').loc[idx].reset_index()
This, however, loses the original row numbers. If those are needed, you'll have to save them to a column first and reset the index back to that column after selecting the relevant rows, as sketched below.
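A minimal sketch of that bookkeeping (assuming the same df and idx as above, with DT already parsed to datetimes; the variable names are illustrative):

# keep the original row numbers in a column before re-indexing on DT
df2 = df.reset_index().rename(columns={'index': 'row_no'})
out = df2.set_index('DT').loc[idx].reset_index().set_index('row_no')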

pandas drop rows on groupby level 2 sum or mean conditions

I want to drop a group (all rows in the group) if the sum of values in a group is equal to a certain value.
The following code provides an example:
>>> from numpy.random import randn
>>> df = pd.DataFrame(randn(10,10), index=pd.date_range('20130101', periods=10, freq='T'))
>>> df = pd.DataFrame(df.stack(), columns=['Values'])
>>> df.index.names = ['Time', 'Group']
>>> df.head(12)
                             Values
Time                Group
2013-01-01 00:00:00 0      0.541795
                    1      0.060798
                    2      0.074224
                    3     -0.006818
                    4      1.211791
                    5     -0.066994
                    6     -1.019984
                    7     -0.558134
                    8      2.006748
                    9      2.737199
2013-01-01 00:01:00 0      1.655502
                    1      0.376214
>>> df['Values'].groupby('Group').sum()
Group
0    3.754481
1   -5.234744
2   -2.000393
3    0.991431
4    3.930547
5   -3.137915
6   -1.260719
7    0.145757
8   -1.832132
9    4.258525
Name: Values, dtype: float64
So the question is: how can I, for instance, drop all rows of the groups whose grouped sum is negative? In my actual dataset I want to drop the groups where the sum or mean is zero.
Using GroupBy + transform with sum, followed by Boolean indexing:
res = df[df.groupby('Group')['Values'].transform('sum') > 0]
From the pandas documentation, filtration seems more suitable:
df2 = df.groupby('Group').filter(lambda g: g['Values'].sum() >= 0)
(Old answer):
This worked for me:
# Change the index to *just* the `Group` column
df.reset_index(inplace=True)
df.set_index('Group', inplace=True)
# Then create a filter using the groupby object
gb = df['Values'].groupby('Group')
gb_sum = gb.sum()
val_filter = gb_sum[gb_sum >= 0].index
# Print results
print(df.loc[val_filter])
The condition on which you filter can be changed accordingly.
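For the case mentioned in the question (dropping groups whose sum or mean is zero), a minimal sketch using the same filter idea might look like this (the variable name is illustrative; with floats you may want a tolerance rather than an exact-zero comparison):

# keep only groups whose Values neither sum nor average to zero
df_nonzero = df.groupby('Group').filter(
    lambda g: g['Values'].sum() != 0 and g['Values'].mean() != 0
)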

Pandas DataFrame: How to print single row horizontally?

A single row of a DataFrame prints one value per line, i.e. column_name then column_value on one line, and the next line contains the next column_name and column_value. For example, the code below
import pandas as pd
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    # other operations go here....
    print(row)
The output for the first row comes out as

0    100
1    200
2    300
Name: 0, dtype: int64

Is there a way to have each row printed horizontally, without the dtype and Name lines? Example for the first row:

    0    1    2
  100  200  300
Use the to_frame method, then transpose with T:
df = pd.DataFrame([[100,200,300],[400,500,600]])
for index, row in df.iterrows():
    print(row.to_frame().T)

     0    1    2
0  100  200  300
     0    1    2
1  400  500  600
Note:
This is similar to @JohnE's answer in that the method to_frame is syntactic sugar around pd.DataFrame.
In fact, if we follow the code
def to_frame(self, name=None):
    """
    Convert Series to DataFrame

    Parameters
    ----------
    name : object, default None
        The passed name should substitute for the series name (if it has
        one).

    Returns
    -------
    data_frame : DataFrame
    """
    if name is None:
        df = self._constructor_expanddim(self)
    else:
        df = self._constructor_expanddim({name: self})
    return df
which points to _constructor_expanddim:
@property
def _constructor_expanddim(self):
    from pandas.core.frame import DataFrame
    return DataFrame
which, as you can see, simply returns the callable DataFrame.
Use the transpose property:
df.T
0 1 2
0 100 200 300
It seems like there should be a simpler answer to this, but try turning it into another DataFrame with one row.
data = {x: y for x, y in zip(df.columns, df.iloc[0])}
sf = pd.DataFrame(data, index=[0])
print(sf.to_string())
Sorta combining the two previous answers, you could do:
for index, ser in df.iterrows():
    print( pd.DataFrame(ser).T )

     0    1    2
0  100  200  300
     0    1    2
1  400  500  600
Basically what happens is that if you extract a row or column from a dataframe, you get a Series, which displays as a column. And it doesn't matter if you do ser or ser.T, it "looks" like a column. I mean, Series are one dimensional, not two, but you get the point...
So anyway, you can convert the Series to a dataframe with one row. (I changed the name from "row" to "ser" to emphasize what is happening above.) The key is that you have to convert to a dataframe first (which will be a column by default) and then transpose it.
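If the leading row label in these printouts is also unwanted, a small sketch of one way to suppress it (building on the to_frame answer above) is:

for index, row in df.iterrows():
    # to_string(index=False) hides the row label, leaving only the
    # column headers and the values
    print(row.to_frame().T.to_string(index=False))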
