I have a dataset with three input columns plus a date and a time column. The data was not collected at regular intervals. What I want first is to treat my start time as 0 and convert the other timestamps into minutes.
my code is:
import pandas as pd

data = pd.read_csv('data6.csv', sep=',')
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')

lastday = data.loc[0, 'date']

def convert_time(x):
    # minutes elapsed since the first timestamp of the current day;
    # reset to 0 when a new day starts
    global lastday
    if x.date() == lastday.date():
        tm = x - lastday
        return tm.total_seconds() / 60
    else:
        lastday = x
        return 0

data['time'] = data['date'].apply(convert_time)
That produced a result, but what I expected is:
I want one row for every minute starting from time 0. If a column has no value at that minute, put 0 in it until values are available; when a value exists, put it at the corresponding minute.
When a new day starts, reset the start time to 0 and count the minutes again.
This is like grouping the data into one-minute bins.
Date        time in min   X1     X2    X3
10/3/2018   1             63     0     0
            2
            3
            ...
            13
10/4/2018   0             120    30    60
            1             0      0     0
(If no values exist at a given minute, put 0 in each column until values become available; when a value arrives, put it at that minute.)
Pandas has functions for this: resample on a datetime index. You have to supply an aggregation function in case your data has multiple values within one minute; the example below sums them, and it is easy to change this.
Please correct me if this is not what you want.
Code
import pandas as pd

# Read CSV
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')

# Resample to 1 minute ('T' is the minute alias)
df = (data.set_index('date')
          .resample('1T')
          .sum()
          .fillna(0))

# Optional one-liner to start the index at 0, one row per minute, restarting at each day start
df.index = ((df.index - pd.to_datetime(df.index.date)).total_seconds() / 60).astype(int)
Output
df.head()
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
2018-03-10 06:15:00 63 0 0 0.0 0.0 0.0
2018-03-10 06:16:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:17:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:18:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:19:00 0 0 0 0.0 0.0 0.0
Output 2
With the index one-liner applied
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
0 63 0 0 0.0 0.0 0.0
1 0 0 0 0.0 0.0 0.0
2 0 0 0 0.0 0.0 0.0
3 0 0 0 0.0 0.0 0.0
4 0 0 0 0.0 0.0 0.0
5 0 0 0 0.0 0.0 0.0
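Note that the one-liner above counts minutes from midnight. If the index should instead count from the first record of each day (so a day's first row is 0 even when the data starts mid-day), a grouped variant could be used; this is a sketch of my own, not part of the original answer:
# minutes relative to each day's first resampled row
idx = df.index.to_series()
first_per_day = idx.groupby(idx.dt.date).transform('min')   # first timestamp of each day
df.index = ((idx - first_per_day).dt.total_seconds() // 60).astype(int)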
You may create a DataFrame df2 containing a time column and the corresponding minutes of the day, and then merge:
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
df = pd.merge(data, df2, how='outer', on='time')
df = df.fillna(0)
df2 has one row per minute of the day (a time string plus the minute number); you can create it with a script or in Excel, for example as sketched below.
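A minimal sketch of such a df2, assuming the CSV's time strings are zero-padded %H:%M:%S so they match the merge key:
import pandas as pd

# one row per minute of the day: a 'time' string to merge on
# and the corresponding minute-of-day as an integer
minutes = pd.date_range('2018-01-01', periods=24 * 60, freq='1min')
df2 = pd.DataFrame({'time': minutes.strftime('%H:%M:%S'),
                    'minutes': range(24 * 60)})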
Related
I have df as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, np.nan, 0, 0, np.nan, 1, np.nan, 1, 0, np.nan],
                   "B": [0, 1, 0, 0, 1, 1, 1, 0, 0, 0]})
Now I need to replace NaN values in column A with the value from column B one row above. For example, the 2nd row of column A should be 0, the 7th row should be 1, etc.
I defined this function, but it doesn't work when I try to apply it to the DataFrame:
def impute_with_previous_B(df):
    for x in range(len(df)):
        if pd.isnull(df.loc[x, "A"]) == True:
            df.loc[x, "A"] = df.loc[x - 1, "B"]

df["A"] = df.apply(lambda x: impute_with_previous_B(x), axis=1)
Can you please tell me what is wrong with that function ?
The problem is that df.apply(..., axis=1) passes each row to your function as a Series, so inside the function df is a single row rather than the whole frame; the df.loc[x, "A"] lookups therefore fail, and the function returns nothing anyway. You don't need a loop at all: fill the NaNs from a shifted copy of column B.
df['A'] = df['A'].fillna(df['B'].shift())
A B
0 0.0 0
1 0.0 1
2 0.0 0
3 0.0 0
4 0.0 1
5 1.0 1
6 1.0 1
7 1.0 0
8 0.0 0
9 0.0 0
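For comparison, a corrected version of your loop would operate on the whole DataFrame instead of being passed through apply; this sketch assumes the default 0..n-1 integer index:
def impute_with_previous_B(df):
    # start at 1 so df.loc[x - 1, "B"] never wraps around to the last row
    for x in range(1, len(df)):
        if pd.isnull(df.loc[x, "A"]):
            df.loc[x, "A"] = df.loc[x - 1, "B"]
    return df

df = impute_with_previous_B(df)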
Given a simple dataframe:
import pandas as pd

df = pd.DataFrame({'user': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y'],
                   'Flag': [0, 1, 0, 0, 1, 0, 1, 0, 0],
                   'time': [10, 34, 40, 43, 44, 12, 20, 46, 51]})
I want to calculate the timedelta from the last flag == 1 for each user.
I did the diffs:
df.sort_values(['user', 'time']).groupby('user')['time'].diff().fillna(pd.Timedelta(10000000)).dt.total_seconds()/60
But it doesn't solve my issue: I need the time delta between the 1's, and if there wasn't any previous 1, fill with some number N.
Please advise
For example:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 NaN
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 NaN
7 y 0 46 26.0
8 y 0 51 31.0
I am not sure I understood correctly, but if you want to compute the time delta only between 1's per user, you can apply your computation to the DataFrame sliced to the 1's only, using groupby:
df['delta'] = (df[df['Flag'].eq(1)] # select 1's only
.groupby('user') # group by user
['time'].diff() # compute the diff
.dt.total_seconds()/60 # convert to minutes
)
output (computed on an earlier version of the question's sample data, where time was a timedelta):
user Flag time delta
0 x 0 0 days 10:30:00 NaN
1 x 1 0 days 11:34:00 NaN
2 x 0 0 days 11:43:00 NaN
3 y 0 0 days 13:43:00 NaN
4 y 1 0 days 14:40:00 NaN
5 y 0 0 days 15:32:00 NaN
6 y 1 0 days 18:30:00 230.0
7 w 0 0 days 19:30:00 NaN
8 w 0 0 days 20:11:00 NaN
Edit: here is a working solution for the updated question.
If I understand the update correctly, you want the difference to the last 1 per user and, when the flag is 1, the difference to the last valid value per user, if any.
In summary, the code creates a subgroup for each range starting with a 1, uses those groups to calculate the diffs, and finally masks the 1s with the diff to their previous value (if it exists).
(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from the last 1
           diff=lambda d: d.groupby(['user', 'group'])['time']
                           .apply(lambda g: g - (g.iloc[0] if g.name[1] > 0 else float('nan'))),
           )
   # mask the 1s with their own diff to the previous valid value
   .assign(diff=lambda d: d['diff'].mask(d['mask'],
                                         (d[d['mask'].groupby(d['user']).cumsum().eq(0) | d['mask']]
                                          .groupby('user')['time'].diff()))
           )
   .drop(['mask', 'group'], axis=1)  # clean up temporary columns
)
Output:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 24.0
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 8.0
7 y 0 46 26.0
8 y 0 51 31.0
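For reference, an alternative sketch of my own (not part of the answer above) that reproduces the same expected output using per-user shift/ffill:
import numpy as np

flag_time = df['time'].where(df['Flag'].eq(1))                                      # time where Flag is 1
last1_at_or_before = flag_time.groupby(df['user']).ffill()                          # last 1 at or before each row
last1_before = flag_time.groupby(df['user']).shift().groupby(df['user']).ffill()    # last 1 strictly before each row
prev_row_time = df.groupby('user')['time'].shift()                                  # previous row per user

df['diff'] = np.where(df['Flag'].eq(1),
                      df['time'] - last1_before.fillna(prev_row_time),
                      df['time'] - last1_at_or_before)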
I have two pandas DataFrame.
df1 looks like this:
Date A B
2020-03-01 12 15
2020-03-02 13 16
2020-03-03 14 17
while df2, like this:
Date C
2020-03-03 x
2020-03-01 w
2020-03-05 y
I want to merge df2 to df1 such that the values turn into columns. Kinda like a one-hot encoding:
Date A B w x y z
2020-03-01 12 15 1 0 0 0
2020-03-02 13 16 0 0 0 1
2020-03-03 14 17 0 1 0 0
So the first row has a 1 in column w because the row with the same date, "2020-03-01", in df2['C'] is "w". Column z is for those entries in df1 without corresponding dates in df2. (Sorry if I couldn't explain it better. Feel free to clarify.)
As a solution, I thought of merging df1 and df2 first, like this:
Date A B C
2020-03-01 12 15 w
2020-03-02 13 16 -
2020-03-03 14 17 x
Then doing one-hot encoding using:
df1['w'] = (df2['C'] == 'w')*1.0
df1['y'] = (df2['C'] == 'y')*1.0
...
But I'm still thinking of how to code the first part, and the whole solution may not even be efficient. So I'm asking in case you know a more efficient way, like some combination of DataFrame methods. Thank you.
You can do this with get_dummies and reindex to get the z values:
df1.merge(pd.get_dummies(df2['C'])
            .reindex(list('wxyz'), axis=1, fill_value=0)
            .assign(Date=df2.Date),
          on='Date',
          how='left'
          ).fillna(0)
Output:
Date A B w x y z
0 2020-03-01 12 15 1.0 0.0 0.0 0.0
1 2020-03-02 13 16 0.0 0.0 0.0 0.0
2 2020-03-03 14 17 0.0 1.0 0.0 0.0
You should first build a tmp dataframe by using get_dummies after merging df1 and df2 on Date. Use reindex to make sure all columns are present, filling missing ones with 0:
tmp = (pd.get_dummies(df1.merge(df2, 'left', on='Date')['C'])
         .reindex(df2['C'].values, axis=1, fill_value=0))
it gives:
x w y
0 0 1 0
1 0 0 0
2 1 0 0
We can now compute the z column to be 1 if no 1 is present on the row, and concatenate the result to df1:
tmp['z'] = 1 - tmp.aggregate('sum', axis=1)
resul = pd.concat([df1, tmp], axis=1)
to obtain:
Date A B x w y z
0 2020-03-01 12 15 0 1 0 0
1 2020-03-02 13 16 0 0 0 1
2 2020-03-03 14 17 1 0 0 0
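An equivalent, more compact sketch (assuming, as above, that z simply marks dates in df1 with no match in df2):
import pandas as pd

merged = df1.merge(df2, on='Date', how='left')
dummies = (pd.get_dummies(merged['C'].fillna('z'))      # dates missing from df2 become 'z'
             .reindex(columns=list('wxyz'), fill_value=0))
out = pd.concat([df1, dummies], axis=1)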
I have an existing dataframe which looks like:
id start_date end_date
0 1 20170601 20210531
1 2 20181001 20220930
2 3 20150101 20190228
3 4 20171101 20211031
I am trying to add 85 columns (one per month/year) to this dataframe, which are:
if the month/year (looping from start_date to end_date) lies between 20120101 and 20190101: 1
else: 0
I tried the following method:
from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None)
                               for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None)
                                    for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)
and the sample df looks like:
customer_id contract_start_date contract_end_date 01/12 02/12 03/12 04/12 05/12 06/12 07/12 ... 04/18 05/18 06/18 07/18 08/18 09/18 10/18 11/18 12/18 01/19
1 1 20181001 20220930 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 1 1
9 2 20160701 20200731 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 3 20171101 20211031 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 rows × 88 columns
It works fine for a small dataset, but for 160k rows it still hadn't finished after 3 hours. Can someone tell me a better way to do this?
I'm also facing problems when the dates overlap for the same customer.
First I'd cap the out-of-range dates, normalizing end_date to ensure it lies within the time range:
In [11]: df.end_date = df.end_date.where(df.end_date < '2019-02-01', pd.Timestamp('2019-01-31')) + pd.offsets.MonthBegin()
In [12]: df
Out[12]:
id start_date end_date
0 1 2017-06-01 2019-02-01
1 2 2018-10-01 2019-02-01
2 3 2015-01-01 2019-02-01
3 4 2017-11-01 2019-02-01
Note: you'll need to do the same trick for start_date if there are dates prior to 2012.
I'd create the resulting DataFrame from a date range for the columns, then fill it in: a 1 at each row's start month, a -1 at its end month, and forward-fill between them:
In [13]: m = pd.date_range('2012-01-01', '2019-02-01', freq='MS')
In [14]: res = pd.DataFrame(0., columns=m, index=df.index)
In [15]: res.update(pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.start_date).groupby(axis=1, level=0).sum())
In [16]: res.update(-pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.end_date).groupby(axis=1, level=0).sum())
The groupby sum is required if multiple rows start or end in the same month.
# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)
In [18]: res
Out[18]:
2012-01-01 2012-02-01 2012-03-01 2012-04-01 2012-05-01 ... 2018-09-01 2018-10-01 2018-11-01 2018-12-01 2019-01-01
0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
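For completeness, a vectorised sketch of my own using a single broadcast comparison. It assumes start_date/end_date are %Y%m%d strings in the question's df and flags a month whenever its first day falls inside the contract range, which may differ slightly from the original loop for mid-month start dates:
import pandas as pd

months = pd.date_range('2012-01-01', '2019-01-01', freq='MS')          # 85 month starts
start = pd.to_datetime(df['start_date'], format='%Y%m%d')
end = pd.to_datetime(df['end_date'], format='%Y%m%d')

# (n_rows, 85) boolean matrix: month start within [start, end]
mask = (months.values >= start.values[:, None]) & (months.values <= end.values[:, None])
res = pd.DataFrame(mask.astype(int), index=df.index, columns=months.strftime('%m/%y'))
out = pd.concat([df, res], axis=1)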
I have a CSV file containing a time series of daily precipitation. The problem is how the data is organized. Here is a small sample to illustrate:
date p01 p02 p03 p04 p05 p06
01-01-1941 33.6 7.1 22.3 0 0 0
01-02-1941 0 0 1.1 11.3 0 0
So there is a column for each day of the month (p01 is the precipitation on day 1, p02 corresponds to day 2, and so on). I'd like to have this structure instead: one column for the date and another for the precipitation values.
date p
01-01-1941 33.6
02-01-1941 7.1
03-01-1941 22.3
04-01-1941 0
05-01-1941 0
06-01-1941 0
01-02-1941 0
02-02-1941 0
03-02-1941 1.1
04-02-1941 11.3
05-02-1941 0
06-02-1941 0
I have found some code examples, but none worked for this specific problem. In general they suggest trying pandas or numpy. Does anyone have a recommendation to solve this issue, or good advice to guide my studies? Thanks. (I'm sorry for my terrible English.)
I think you can first use read_csv, then to_datetime, then stack to reshape the DataFrame, and finally convert the days column with to_timedelta and add it to the date column:
import pandas as pd
import io
temp=u"""date;p01;p02;p03;p04;p05;p06
01-01-1941;33.6;7.1;22.3;0;0;0
01-02-1941;0;0;1.1;11.3;0;0"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)
date p01 p02 p03 p04 p05 p06
0 01-01-1941 33.6 7.1 22.3 0.0 0 0
1 01-02-1941 0.0 0.0 1.1 11.3 0 0
#convert column date to datetime
df.date = pd.to_datetime(df.date, dayfirst=True)
print(df)
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0.0 0 0
1 1941-02-01 0.0 0.0 1.1 11.3 0 0
#stack, rename columns
df1 = df.set_index('date').stack().reset_index(name='p').rename(columns={'level_1':'days'})
print(df1)
date days p
0 1941-01-01 p01 33.6
1 1941-01-01 p02 7.1
2 1941-01-01 p03 22.3
3 1941-01-01 p04 0.0
4 1941-01-01 p05 0.0
5 1941-01-01 p06 0.0
6 1941-02-01 p01 0.0
7 1941-02-01 p02 0.0
8 1941-02-01 p03 1.1
9 1941-02-01 p04 11.3
10 1941-02-01 p05 0.0
11 1941-02-01 p06 0.0
#convert column to timedelta in days
df1.days = pd.to_timedelta(df1.days.str[1:].astype(int) - 1, unit='D')
print(df1.days)
0 0 days
1 1 days
2 2 days
3 3 days
4 4 days
5 5 days
6 0 days
7 1 days
8 2 days
9 3 days
10 4 days
11 5 days
Name: days, dtype: timedelta64[ns]
#add timedelta
df1['date'] = df1['date'] + df1['days']
#remove unnecessary column
df1 = df1.drop('days', axis=1)
print(df1)
date p
0 1941-01-01 33.6
1 1941-01-02 7.1
2 1941-01-03 22.3
3 1941-01-04 0.0
4 1941-01-05 0.0
5 1941-01-06 0.0
6 1941-02-01 0.0
7 1941-02-02 0.0
8 1941-02-03 1.1
9 1941-02-04 11.3
10 1941-02-05 0.0
11 1941-02-06 0.0
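For the record, the same reshape can be written more compactly with melt on a current pandas version; this sketch assumes df['date'] has already been parsed with to_datetime as above:
long_df = df.melt(id_vars='date', var_name='day', value_name='p')
# 'p01' -> day offset 0, 'p02' -> 1, etc.
long_df['date'] = long_df['date'] + pd.to_timedelta(long_df['day'].str[1:].astype(int) - 1, unit='D')
long_df = long_df.drop(columns='day').sort_values('date').reset_index(drop=True)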
EDIT: Sorry, the title of the question was a bit misleading. For the example output you gave (collapsing all p columns into a single column) you can do this:
# Opening the example file you gave
fid = open('csv.txt', 'r')
lines = fid.readlines()
fid.close()

fid = open('output2.txt', 'w')
fid.write('%15s %15s\n' % (lines[0].split()[0], 'p'))
for i in range(1, len(lines)):
    iline = lines[i].split()
    for j in range(1, len(iline)):
        fid.write('%15s %15s\n' % (iline[0], iline[j]))
fid.close()
This results in:
date p
01-01-1941 33.6
01-01-1941 7.1
01-01-1941 22.3
01-01-1941 0
01-01-1941 0
01-01-1941 0
01-02-1941 0
01-02-1941 0
01-02-1941 1.1
01-02-1941 11.3
01-02-1941 0
01-02-1941 0
ORIGINAL POST: This might still be relevant to someone.
There are indeed many ways to do this, but since you have no special preference (and if the file is not enormous) you may just want to use native Python.
def rows2columns(lines):
    ilines = []
    for i in lines:
        ilines.append(i.split())
    new = []
    for j in range(len(ilines[0])):
        local = []
        for i in range(len(ilines)):
            local.append(ilines[i][j])
        new.append(local)
    return new

def writefile(new, path='output.txt'):
    fid = open(path, 'w')
    for i in range(len(new)):
        for j in range(len(new[0])):
            fid.write('%15s' % new[i][j])
        fid.write('\n')
    fid.close()

# Opening the example file you gave
fid = open('csv.txt', 'r')
lines = fid.readlines()
fid.close()

# Putting the list of lines to be transposed
new = rows2columns(lines)

# Writing the result to a file
writefile(new, path='output.txt')
The output file is this:
date 01-01-1941 01-02-1941
p01 33.6 0
p02 7.1 0
p03 22.3 1.1
p04 0 11.3
p05 0 0
p06 0 0
This is probably about the simplest native Python recipe you can get. The csv module, numpy, or pandas offer other features you might want to take advantage of; this one in particular needs no imports.
Well, I got the answer, but not with just one command or any magic function. Here is how I did it; you can optimise this code further. Hope this helps!
import pandas as pd
from datetime import timedelta
df = pd.read_csv('myfile.csv')
df[u'date'] = pd.to_datetime(df[u'date'])
p1 = df[[u'date', u'p01']].copy()
p2 = df[[u'date', u'p02']].copy()
p3 = df[[u'date', u'p03']].copy()
p4 = df[[u'date', u'p04']].copy()
p5 = df[[u'date', u'p05']].copy()
# rename the columns of p1..p5 to ['date', 'val']
p1.columns = ['date','val']
p2.columns = ['date','val']
p3.columns = ['date','val']
p4.columns = ['date','val']
p5.columns = ['date','val']
p1['col'] = 'p01'
p2['col'] = 'p02'
p3['col'] = 'p03'
p4['col'] = 'p04'
p5['col'] = 'p05'
main = pd.concat([p1,p2,p3,p4,p5])
main['days2add'] = main['col'].apply(lambda x: int(x.strip('p')) -1 )
ff = lambda row : row[u'date'] + timedelta(row[u'days2add'])
main['new_date'] = main.apply(ff, axis=1)
In [209]: main[['new_date', u'val']]
Out[209]:
new_date val
0 1941-01-01 33.6
0 1941-01-02 7.1
0 1941-01-03 22.3
0 1941-01-04 0.0
0 1941-01-05 0.0
my csv file content:
In [210]: df
Out[210]:
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0 0 0
my output content:
In [209]: main[['new_date', u'val']]
Out[209]:
new_date val
0 1941-01-01 33.6
0 1941-01-02 7.1
0 1941-01-03 22.3
0 1941-01-04 0.0
0 1941-01-05 0.0
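For completeness, the same reshape can also be sketched more compactly with pd.wide_to_long, assuming the dates are unique and already parsed as datetimes:
tidy = (pd.wide_to_long(df, stubnames='p', i='date', j='day')   # p01..p06 -> one 'p' column keyed by day
          .reset_index()
          .rename(columns={'p': 'val'}))
tidy['new_date'] = tidy['date'] + pd.to_timedelta(tidy['day'] - 1, unit='D')
print(tidy[['new_date', 'val']].sort_values('new_date'))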