I have a CSV file containing a time series of daily precipitation. The problem is how the data is organized. Here is a small sample to illustrate:
date p01 p02 p03 p04 p05 p06
01-01-1941 33.6 7.1 22.3 0 0 0
01-02-1941 0 0 1.1 11.3 0 0
So, there is one column for each day of the month (p01 is the precipitation for day 1, p02 corresponds to day 2, and so on). I'd like to have this structure instead: one column for the date and another for the precipitation values.
date p
01-01-1941 33.6
02-01-1941 7.1
03-01-1941 22.3
04-01-1941 0
05-01-1941 0
06-01-1941 0
01-02-1941 0
02-02-1941 0
03-02-1941 1.1
04-02-1941 11.3
05-02-1941 0
06-02-1941 0
I have found some code examples, but none worked for this specific problem. In general they suggest trying pandas or numpy. Does anyone have a recommendation for solving this issue, or good advice to guide my studies? Thanks. (I'm sorry for my terrible English.)
I think you can first use read_csv, then to_datetime on the date column, reshape the DataFrame with stack, and finally convert the day column with to_timedelta and add it to the date column:
import pandas as pd
import io
temp=u"""date;p01;p02;p03;p04;p05;p06
01-01-1941;33.6;7.1;22.3;0;0;0
01-02-1941;0;0;1.1;11.3;0;0"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";")
print(df)
date p01 p02 p03 p04 p05 p06
0 01-01-1941 33.6 7.1 22.3 0.0 0 0
1 01-02-1941 0.0 0.0 1.1 11.3 0 0
# convert column date to datetime
df.date = pd.to_datetime(df.date, dayfirst=True)
print(df)
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0.0 0 0
1 1941-02-01 0.0 0.0 1.1 11.3 0 0
#stack, rename columns
df1 = df.set_index('date').stack().reset_index(name='p').rename(columns={'level_1':'days'})
print(df1)
date days p
0 1941-01-01 p01 33.6
1 1941-01-01 p02 7.1
2 1941-01-01 p03 22.3
3 1941-01-01 p04 0.0
4 1941-01-01 p05 0.0
5 1941-01-01 p06 0.0
6 1941-02-01 p01 0.0
7 1941-02-01 p02 0.0
8 1941-02-01 p03 1.1
9 1941-02-01 p04 11.3
10 1941-02-01 p05 0.0
11 1941-02-01 p06 0.0
#convert column to timedelta in days
df1.days = pd.to_timedelta(df1.days.str[1:].astype(int) - 1, unit='D')
print(df1.days)
0 0 days
1 1 days
2 2 days
3 3 days
4 4 days
5 5 days
6 0 days
7 1 days
8 2 days
9 3 days
10 4 days
11 5 days
Name: days, dtype: timedelta64[ns]
#add timedelta
df1['date'] = df1['date'] + df1['days']
#remove unnecessary column
df1 = df1.drop('days', axis=1)
print(df1)
date p
0 1941-01-01 33.6
1 1941-01-02 7.1
2 1941-01-03 22.3
3 1941-01-04 0.0
4 1941-01-05 0.0
5 1941-01-06 0.0
6 1941-02-01 0.0
7 1941-02-02 0.0
8 1941-02-03 1.1
9 1941-02-04 11.3
10 1941-02-05 0.0
11 1941-02-06 0.0
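For reference, the same reshape can also be done in one step with pd.melt instead of stack (a sketch using the df from above, after the to_datetime conversion):
df1 = pd.melt(df, id_vars='date', var_name='days', value_name='p')
df1['date'] = df1['date'] + pd.to_timedelta(df1['days'].str[1:].astype(int) - 1, unit='D')
df1 = df1.drop('days', axis=1).sort_values('date').reset_index(drop=True)
print(df1)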
EDIT: Sorry, the title of the question was a bit misleading. For the example output you gave (collapsing all the p columns into a single column) you can do this:
# Opening the example file you gave
fid = open('csv.txt', 'r')
lines = fid.readlines()
fid.close()

fid = open('output2.txt', 'w')
fid.write('%15s %15s\n' % (lines[0].split()[0], 'p'))
for i in range(1, len(lines)):
    iline = lines[i].split()
    for j in range(1, len(iline)):
        fid.write('%15s %15s\n' % (iline[0], iline[j]))
fid.close()
which results in this:
date p
01-01-1941 33.6
01-01-1941 7.1
01-01-1941 22.3
01-01-1941 0
01-01-1941 0
01-01-1941 0
01-02-1941 0
01-02-1941 0
01-02-1941 1.1
01-02-1941 11.3
01-02-1941 0
01-02-1941 0
ORIGINAL POST: Might be relevant to someone.
There are indeed many ways to do this. But considering you have no special preference (and if the file is not enormous) you may just want to use native Python.
def rows2columns(lines):
    ilines = []
    for i in lines:
        ilines.append(i.split())
    new = []
    for j in range(len(ilines[0])):
        local = []
        for i in range(len(ilines)):
            local.append(ilines[i][j])
        new.append(local)
    return new

def writefile(new, path='output.txt'):
    fid = open(path, 'w')
    for i in range(len(new)):
        for j in range(len(new[0])):
            fid.write('%15s' % new[i][j])
        fid.write('\n')
    fid.close()
# Opening the example file you gave
fid = open('csv.txt','r')
lines = fid.readlines()
fid.close()
# Transposing the list of lines (rows become columns)
new = rows2columns(lines)
# Writing the result to a file
writefile(new,path='output.txt')
The output file is this:
date 01-01-1941 01-02-1941
p01 33.6 0
p02 7.1 0
p03 22.3 1.1
p04 0 11.3
p05 0 0
p06 0 0
This is probably the simplest (or close to it) native Python recipe you can get. The csv module, numpy, and pandas offer other features you might want to take advantage of. This one in particular needs no imports.
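Since the csv module was mentioned, here is a minimal sketch of the same long-format reshaping with it, assuming a ';'-separated file like the sample in the pandas answer above ('precip.csv' and 'precip_long.csv' are placeholder file names; adjust the delimiter and names to your data):
import csv
from datetime import datetime, timedelta

with open('precip.csv', newline='') as src, open('precip_long.csv', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter=';')
    writer = csv.writer(dst, delimiter=';')
    next(reader)                                        # skip the header row
    writer.writerow(['date', 'p'])
    for row in reader:
        first = datetime.strptime(row[0], '%d-%m-%Y')   # first day of the month
        for offset, value in enumerate(row[1:]):        # p01, p02, ...
            day = first + timedelta(days=offset)
            writer.writerow([day.strftime('%d-%m-%Y'), value])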
Well, I got the answer, but not with just one command or any magic function. Here is how I got it. You can optimise this code further. Hope this helps!
import pandas as pd
from datetime import timedelta
df = pd.read_csv('myfile.csv')
df[u'date'] = pd.to_datetime(df[u'date'])
p1 = df[[u'date', u'p01']].copy()
p2 = df[[u'date', u'p02']].copy()
p3 = df[[u'date', u'p03']].copy()
p4 = df[[u'date', u'p04']].copy()
p5 = df[[u'date', u'p05']].copy()
# renaming cols of p1, p2, p3, p4, p5
p1.columns = ['date','val']
p2.columns = ['date','val']
p3.columns = ['date','val']
p4.columns = ['date','val']
p5.columns = ['date','val']
p1['col'] = 'p01'
p2['col'] = 'p02'
p3['col'] = 'p03'
p4['col'] = 'p04'
p5['col'] = 'p05'
main = pd.concat([p1,p2,p3,p4,p5])
main['days2add'] = main['col'].apply(lambda x: int(x.strip('p')) -1 )
ff = lambda row : row[u'date'] + timedelta(row[u'days2add'])
main['new_date'] = main.apply(ff, axis=1)
my csv file content:
In [210]: df
Out[210]:
date p01 p02 p03 p04 p05 p06
0 1941-01-01 33.6 7.1 22.3 0 0 0
my output content:
In [209]: main[['new_date', u'val']]
Out[209]:
new_date val
0 1941-01-01 33.6
0 1941-01-02 7.1
0 1941-01-03 22.3
0 1941-01-04 0.0
0 1941-01-05 0.0
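Note that the column list above is hard-coded and stops at p05, so a p06 column like the one in the question would be dropped. Below is a rough sketch of the same idea with a loop over every day column instead, so nothing is hard-coded ('myfile.csv' as above):
import pandas as pd
from datetime import timedelta

df = pd.read_csv('myfile.csv')
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

pieces = []
for col in [c for c in df.columns if c != 'date']:          # p01, p02, ..., p06
    piece = df[['date', col]].rename(columns={col: 'val'})
    # pNN means NN - 1 days after the first of the month
    piece['date'] = piece['date'] + timedelta(days=int(col.strip('p')) - 1)
    pieces.append(piece)

main = pd.concat(pieces).sort_values('date').reset_index(drop=True)
print(main)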
Related
I have a dataframe that contains two different ID lists.
df
ID1 ID2
0 0 35
1 0 35
2 1 33
3 2 27
Then I have two dataframes df1 and df2 that contain the coordinates of such IDs.
df1
ID1 x y
0 0 1.3 2.3
1 1 2.5 7.2
3 2 4.5 4.5
df2
ID2 x y
0 27 3.6 4.5
1 33 3.3 2.3
2 35 2.3 2.5
I would like to assign to df the coordinates of ID1 if it is repeated more than once, and the coordinates of ID2 if ID1 appears only once in df.
In the end I would like something like this:
df
ID1 ID2 x y
0 0 35 1.3 2.3
1 0 35 1.3 2.3
2 1 33 3.3 2.3
3 2 27 3.6 4.5
I think this will do the trick:
import numpy as np

df = df.merge(df1).merge(df2, on='ID2', suffixes=['_id1', '_id2'])
mask = df.groupby('ID1').transform('count')['ID2']
df['x'] = np.where(mask > 1, df['x_id1'], df['x_id2'])
df['y'] = np.where(mask > 1, df['y_id1'], df['y_id2'])
df[['ID1', 'ID2', 'x', 'y']]
ID1 ID2 x y
0 0 35 1.3 2.3
1 0 35 1.3 2.3
2 1 33 3.3 2.3
3 2 27 3.6 4.5
Try this one out
df3 = (df[df.duplicated(subset='ID1')]).merge(df1, how='left')
df4 = (df.drop_duplicates(subset='ID1')).merge(df2, on='ID2')
df5 = df3.merge(df4, how='outer').drop_duplicates(subset='ID1', keep='first')
df5.reindex(df.index, method='ffill')
I have a dataset with three inputs plus date and time. The data was not collected at regular time intervals. What I want first is to set my start time as 0 and convert the other times into minutes.
my code is:
import pandas as pd

data = pd.read_csv('data6.csv', sep=',')
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')
lastday = data.loc[0, 'date']
def convert_time(x):
    global lastday
    if x.date() == lastday.date():
        tm = x - lastday
        return tm.total_seconds() / 60
    else:
        lastday = x
        return 0
data['time'] = data['date'].apply(convert_time)
That code gave me results, but not what I expected. What I want is to index the time in one-minute steps starting from 0. If a column has no value at a given minute, put 0 until values become available; where values exist, put them at the corresponding minute. When a new day starts, the time starts again at 0 and continues in minutes. In other words, the data should be grouped into one-minute intervals:
Date        time in min   X1    X2   X3
10/3/2018   1             63    0    0
            2
            3
            4        <- if no values are available at that minute, put 0 in the
            5           columns until values appear, then use those column values
            6
            7
            8
            9
            10
            11
            12
            13
10/4/2018   0             120   30   60
            1             0     0    0
My CSV file: see the Google Sheets link used in the answer below.
Pandas has functions for this: resample on a datetime index. You have to give an aggregation function in case your data has multiple values within one minute. The example below sums these values; it is easy to change this.
Please correct me if this is not what you want.
Code
import pandas as pd

# Read CSV
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
data['date'] = pd.to_datetime(data['date'] + " " + data['time'], format='%d/%m/%Y %H:%M:%S')

# Resample to 1-minute bins ('T' means minutes) and sum the values in each bin
df = data.set_index('date') \
         .resample('1T') \
         .sum() \
         .fillna(0)

# Optional ugly one-liner: turn the index into minutes of the day (restarts at 0 at each new day)
df.index = ((df.index - pd.to_datetime(df.index.date)).total_seconds() / 60).astype(int)
Output
df.head()
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
2018-03-10 06:15:00 63 0 0 0.0 0.0 0.0
2018-03-10 06:16:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:17:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:18:00 0 0 0 0.0 0.0 0.0
2018-03-10 06:19:00 0 0 0 0.0 0.0 0.0
Output 2
With ugly-ass one-liner
x1 x2 x3 Unnamed: 5 Unnamed: 6 Unnamed: 7
date
0 63 0 0 0.0 0.0 0.0
1 0 0 0 0.0 0.0 0.0
2 0 0 0 0.0 0.0 0.0
3 0 0 0 0.0 0.0 0.0
4 0 0 0 0.0 0.0 0.0
5 0 0 0 0.0 0.0 0.0
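If summing is not the aggregation you want, that is the only part to swap; for example, taking the mean of the values that fall inside each minute (same pipeline as above, just a different aggregation):
df = data.set_index('date').resample('1T').mean().fillna(0)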
You may create a dataframe df2 containing a time column and the minute of the day, and then use:
csv_url = 'https://docs.google.com/spreadsheets/d/1WWq1qhqi4bGzNir_svQV7VstBkGbocToipPCY83Cclc/gviz/tq?tqx=out:csv&sheet=1512153575'
data = pd.read_csv(csv_url)
df = pd.merge(data,df2,how='outer',on='time')
df = df.fillna(0)
df2 looks like the picture (not reproduced here); you can create it by script or in Excel, for example as sketched below.
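A minimal sketch of building such a df2 by script, assuming one row per minute of the day and a time string formatted like the CSV's time column (the exact time format is an assumption here):
import pandas as pd

minutes = pd.date_range('00:00:00', '23:59:00', freq='1T')
df2 = pd.DataFrame({
    'time': minutes.strftime('%H:%M:%S'),   # must match the format of the CSV's time column
    'minute_of_day': range(len(minutes)),
})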
I am trying to calculate the # of days between failures. I'd like to know on each day in the series the # of days passed since the last failure where failure = 1. There may be anywhere from 1 to 1500 devices.
For Example, Id like my dataframe to look like this (please pull data from url in the second code block. This is just a short example of a larger dataframe.):
date device failure elapsed
10/01/2015 S1F0KYCR 1 0
10/07/2015 S1F0KYCR 1 7
10/08/2015 S1F0KYCR 0 0
10/09/2015 S1F0KYCR 0 0
10/17/2015 S1F0KYCR 1 11
10/31/2015 S1F0KYCR 0 0
10/01/2015 S8KLM011 1 0
10/02/2015 S8KLM011 1 2
10/07/2015 S8KLM011 0 0
10/09/2015 S8KLM011 0 0
10/11/2015 S8KLM011 0 0
10/21/2015 S8KLM011 1 20
Sample Code:
Edit: Please pull the actual data from the code block below. The above sample data is a short example. Thanks.
import pandas as pd

url = "https://raw.githubusercontent.com/dsdaveh/device-failure-analysis/master/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True) #Sort by date and device
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d') #format date to datetime
This is where I am running into obstacles. However the new column should contain the # of days since last failure, where failure = 1.
test['date'] = 0
for i in test.index[1:]:
    if not test['failure'][i]:
        test['elapsed'][i] = test['elapsed'][i-1] + 1
I have also tried
fails = df[df.failure==1]
fails.Dates = trues.index #need this because .diff() won't work on the index..
fails.Elapsed = trues.Dates.diff()
Using pandas.DataFrame.groupby with diff and numpy.where:
import pandas as pd
import numpy as np
df['date'] = pd.to_datetime(df['date'])
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
Output:
Date Device Failure Elapsed
0 2015-10-01 S1F0KYCR 1 0.0
1 2015-10-07 S1F0KYCR 1 7.0
2 2015-10-08 S1F0KYCR 0 0.0
3 2015-10-09 S1F0KYCR 0 0.0
4 2015-10-17 S1F0KYCR 1 11.0
5 2015-10-31 S1F0KYCR 0 0.0
6 2015-10-01 S8KLM011 1 0.0
7 2015-10-02 S8KLM011 1 2.0
8 2015-10-07 S8KLM011 0 0.0
9 2015-10-09 S8KLM011 0 0.0
10 2015-10-11 S8KLM011 0 0.0
11 2015-10-21 S8KLM011 1 20.0
Update:
Found out the actual data linked in the OP contains no device with more than one failure, making the final result all zeros (i.e. no second failure ever happened, so there is nothing to calculate for elapsed). Using OP's original snippet:
import pandas as pd
url = "http://aws-proserve-data-science.s3.amazonaws.com/device_failure.csv"
df = pd.read_csv(url, encoding = "ISO-8859-1")
df = df.sort_values(by = ['date', 'device'], ascending = True)
df['date'] = pd.to_datetime(df['date'],format='%Y/%m/%d')
Find if any device has more than 1 failure:
df.groupby(['device'])['failure'].sum().gt(1).any()
# False
Which confirms that the all-zeros df['elapsed'] is actually a correct answer :)
If you tweak your data a bit, it does yield elapsed just as expected.
df.loc[6879, 'device'] = 'S1F0RRB1'
# Making two occurrences of failure for device S1F0RRB1
s = df.groupby(['device', 'failure'])['date'].diff().dt.days.add(1)
s = s.fillna(0)
df['elapsed'] = np.where(df['failure'], s, 0)
df['elapsed'].value_counts()
# 0.0 124493
# 3.0 1
Here is one way
df['elapsed']=df[df.Failure.astype(bool)].groupby('Device').Date.diff().dt.days.add(1)
df.elapsed.fillna(0,inplace=True)
df
Out[225]:
Date Device Failure Elapsed elapsed
0 2015-10-01 S1F0KYCR 1 0 0.0
1 2015-10-07 S1F0KYCR 1 7 7.0
2 2015-10-08 S1F0KYCR 0 0 0.0
3 2015-10-09 S1F0KYCR 0 0 0.0
4 2015-10-17 S1F0KYCR 1 11 11.0
5 2015-10-31 S1F0KYCR 0 0 0.0
6 2015-10-01 S8KLM011 1 0 0.0
7 2015-10-02 S8KLM011 1 2 2.0
8 2015-10-07 S8KLM011 0 0 0.0
9 2015-10-09 S8KLM011 0 0 0.0
10 2015-10-11 S8KLM011 0 0 0.0
11 2015-10-21 S8KLM011 1 20 20.0
I have an existing dataframe which looks like:
id start_date end_date
0 1 20170601 20210531
1 2 20181001 20220930
2 3 20150101 20190228
3 4 20171101 20211031
I am trying to add 85 columns to this dataframe, one for each month/year between 20120101 and 20190101: a column is 1 if that month/year falls within the row's start_date to end_date range, else 0.
I tried the following method:
from collections import OrderedDict
from datetime import datetime, timedelta

import pandas as pd

start, end = [datetime.strptime(_, "%Y%m%d") for _ in ['20120101', '20190201']]
global_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())

def get_count(contract_start_date, contract_end_date):
    start, end = [datetime.strptime(_, "%Y%m%d") for _ in [contract_start_date, contract_end_date]]
    current_list = list(OrderedDict(((start + timedelta(_)).strftime(r"%m/%y"), None) for _ in range((end - start).days)).keys())
    temp_list = []
    for each in global_list:
        if each in current_list:
            temp_list.append(1)
        else:
            temp_list.append(0)
    return pd.Series(temp_list)

sample_df[global_list] = sample_df[['contract_start_date', 'contract_end_date']].apply(lambda x: get_count(*x), axis=1)
and the sample df looks like:
customer_id contract_start_date contract_end_date 01/12 02/12 03/12 04/12 05/12 06/12 07/12 ... 04/18 05/18 06/18 07/18 08/18 09/18 10/18 11/18 12/18 01/19
1 1 20181001 20220930 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 1 1
9 2 20160701 20200731 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 3 20171101 20211031 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1 1 1
3 rows × 88 columns
It works fine for a small dataset, but for 160k rows it had not finished even after 3 hours. Can someone tell me a better way to do this?
I am also facing problems when the dates overlap for the same customer.
First I'd cut off the dud dates, to normalize end_date (and ensure it's within the time range):
In [11]: df.end_date = df.end_date.where(df.end_date < '2019-02-01', pd.Timestamp('2019-01-31')) + pd.offsets.MonthBegin()
In [12]: df
Out[12]:
id start_date end_date
0 1 2017-06-01 2019-02-01
1 2 2018-10-01 2019-02-01
2 3 2015-01-01 2019-02-01
3 4 2017-11-01 2019-02-01
Note: you'll need to do the same trick for start_date if there are dates prior to 2012.
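For example, a one-liner for start_date along the same lines (a sketch, assuming 2012-01-01 as the lower bound, mirroring the end_date trick above):
df.start_date = df.start_date.where(df.start_date >= '2012-01-01', pd.Timestamp('2012-01-01'))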
I'd create the resulting DataFrame from a date range for the columns and then fill it in (with 1 at the start month and -1 at the end month):
In [13]: m = pd.date_range('2012-01-01', '2019-02-01', freq='MS')
In [14]: res = pd.DataFrame(0., columns=m, index=df.index)
In [15]: res.update(pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.start_date).groupby(axis=1, level=0).sum())
In [16]: res.update(-pd.DataFrame(np.diag(np.ones(len(df))), df.index, df.end_date).groupby(axis=1, level=0).sum())
The groupby sum is required if multiple rows start or end in the same month.
# -1 and NaN were really placeholders for zero
In [17]: res = res.replace(0, np.nan).ffill(axis=1).replace([np.nan, -1], 0)
In [18]: res
Out[18]:
2012-01-01 2012-02-01 2012-03-01 2012-04-01 2012-05-01 ... 2018-09-01 2018-10-01 2018-11-01 2018-12-01 2019-01-01
0 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
1 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
3 0.0 0.0 0.0 0.0 0.0 ... 1.0 1.0 1.0 1.0 1.0
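To get back to the mm/yy column labels from the question and attach the flags to the original frame, one possible follow-up (a sketch; the label format is taken from the question's sample output):
res.columns = res.columns.strftime('%m/%y')   # e.g. '01/12', '02/12', ...
out = df.join(res.astype(int))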
I have a pandas column like so:
index colA
1 10.2
2 10.8
3 11.6
4 10.7
5 9.5
6 6.2
7 12.9
8 10.6
9 6.4
10 20.5
I want to search the current row value and find matches from previous rows that are close. For example index4 (10.7) would return a match of 1 because it is close to index2 (10.8). Similarly index8 (10.6) would return a match of 2 because it is close to both index2 and 4.
Using a threshold of +/- 5% for this example would output the below:
index colA matches
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
With a large dataframe I would like to limit this to the previous X (300?) number of rows to search over rather than an entire dataframe.
Using triangle indices ensures we only look backwards; then np.bincount accumulates the matches.
import numpy as np

a = df.colA.values
i, j = np.tril_indices(len(a), -1)
mask = np.abs(a[i] - a[j]) / a[i] <= .05
df.assign(matches=np.bincount(i[mask], minlength=len(a)))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
If you are having resource issues, consider using good ol' fashioned loops. However, if you have access to numba you can make this considerably faster.
import numpy as np
from numba import njit

@njit
def counter(a):
    c = np.arange(len(a)) * 0
    for i, x in enumerate(a):
        for j, y in enumerate(a):
            if j < i:
                if abs(x - y) / x <= .05:
                    c[i] += 1
    return c
df.assign(matches=counter(a))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
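If you also need the previous-300-rows limit from the question, the loop version handles that easily; here is a sketch with a hypothetical counter_windowed function that takes the window size as a parameter:
@njit
def counter_windowed(a, window):
    c = np.zeros(len(a), dtype=np.int64)
    for i in range(len(a)):
        start = max(0, i - window)               # look back over at most `window` rows
        for j in range(start, i):
            if abs(a[i] - a[j]) / a[i] <= .05:
                c[i] += 1
    return c

df.assign(matches=counter_windowed(a, 300))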
Here's a numpy solution that leverages broadcasted comparison:
i = df.colA.values
j = np.arange(len(df))
df['matches'] = (
(np.abs(i - i[:, None]) < i * .05) & (j < j[:, None])
).sum(1)
df
index colA matches
0 1 10.2 0
1 2 10.8 0
2 3 11.6 0
3 4 10.7 2
4 5 9.5 0
5 6 6.2 0
6 7 12.9 0
7 8 10.6 3
8 9 6.4 1
9 10 20.5 0
Note: this is extremely fast, but it does not handle the 300-row limitation for large dataframes.
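If you do need that limit with the broadcasted approach, one option (a sketch reusing i and j from above) is to add a second condition on how far back a row may be; note it still builds the full n x n comparison, so it caps the count but not the memory use:
limit = 300   # look at most this many rows back
df['matches'] = (
    (np.abs(i - i[:, None]) < i * .05) & (j < j[:, None]) & (j >= j[:, None] - limit)
).sum(1)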
Rolling with apply; if speed matters, please look into cold's answer.
df.colA.rolling(window=len(df),min_periods=1).apply(lambda x : sum(abs((x-x[-1])/x[-1])<0.05)-1)
Out[113]:
index
1 0.0
2 0.0
3 0.0
4 2.0
5 0.0
6 0.0
7 0.0
8 3.0
9 1.0
10 0.0
Name: colA, dtype: float64
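To respect the 300-previous-rows limit from the question with this rolling approach, you could cap the window at 301 (the current row plus up to 300 previous ones), keeping min_periods=1 as above:
df.colA.rolling(window=301, min_periods=1).apply(lambda x : sum(abs((x-x[-1])/x[-1])<0.05)-1)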