Reading in specific date lines from a file with pandas python - python
I am attempting to read in many files. Each file is a daily data file with data every 10 minutes. The data in each file is "chunked up" like this:
2015-11-08 00:10:00 00:10:00
# z speed dir W sigW bck error
30 3.32 111.9 0.15 0.12 1.50E+05 0
40 3.85 108.2 0.07 0.14 7.75E+04 0
50 4.20 107.9 0.06 0.15 4.73E+04 0
60 4.16 108.5 0.03 0.19 2.73E+04 0
70 4.06 93.6 0.03 0.23 9.07E+04 0
80 4.06 93.8 0.07 0.28 1.36E+05 0
2015-11-08 00:20:00 00:10:00
# z speed dir W sigW bck error
30 3.79 120.9 0.15 0.11 7.79E+05 0
40 4.36 115.6 0.04 0.13 2.42E+05 0
50 4.71 113.6 0.07 0.14 6.84E+04 0
60 5.00 113.3 0.13 0.17 1.16E+04 0
70 4.29 94.2 0.22 0.20 1.38E+05 0
80 4.54 94.1 0.11 0.25 1.76E+05 0
2015-11-08 00:30:00 00:10:00
# z speed dir W sigW bck error
30 3.86 113.6 0.13 0.10 2.68E+05 0
40 4.34 116.1 0.09 0.11 1.41E+05 0
50 5.02 112.8 0.04 0.12 7.28E+04 0
60 5.36 110.5 0.01 0.14 5.81E+04 0
70 4.67 95.4 0.14 0.16 7.69E+04 0
80 4.56 95.0 0.15 0.21 9.84E+04 0
...
The file continues like this every 10 minutes for the whole day. This file is named 151108.mnd. I want my code to read in all files for November (1511??.mnd): for each day's file, grab every datetime line (for the partial file above, that would be 2015-11-08 00:10:00, 2015-11-08 00:20:00, 2015-11-08 00:30:00, etc.), store those as variables, then move on to the next day's file (151109.mnd), grab all of its datetime lines, and append them to the previously stored dates. And so on for the whole month. Here is the code I have so far:
import pandas as pd
import glob
import datetime

filename = glob.glob('1511??.mnd')

data_nov15_hereford = pd.DataFrame()
frames = []
dates = []
counter = 1

for i in filename:
    f_nov15_hereford = pd.read_csv(i, skiprows=32)
    for line in f_nov15_hereford:
        if line.startswith("20"):
            print line
            date_object = datetime.datetime.strptime(line[:-6], '%Y-%m-%d %H:%M:%S %f')
            dates.append(date_object)
            counter = 0
        else:
            counter += 1
    frames.append(f_nov15_hereford)

data_nov15_hereford = pd.concat(frames, ignore_index=True)
data_nov15_hereford = data_nov15_hereford.convert_objects(convert_numeric=True)

print dates
This code has a couple of problems: when I print dates it prints two copies of every date, and it only picks up the first date of every file (2015-11-08 00:10:00, 2015-11-09 00:10:00, etc.). It isn't going line by line through each file, storing all of that file's dates, and then moving on to the next file like I want; instead it just grabs the first date in each file. Any help with this code? Is there an easier way to do what I want? Thanks!
A few observations:
First: Why you are only getting the first date in a file:
f_nov15_hereford = pd.read_csv(i, skiprows=32)
for line in f_nov15_hereford:
    if line.startswith("20"):
The first line reads the file into a pandas dataframe. The second line iterates over the columns of a dataframe, not the rows. As a result, the last line checks whether each column label starts with "20", which matches at most once per file.
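You can see the column iteration for yourself with a toy example (the column names here are made up for illustration):

import pandas as pd

toy = pd.DataFrame({'speed': [3.32, 3.85], 'dir': [111.9, 108.2]})
for item in toy:
    print(item)   # prints the column labels, not the rows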
Second: counter is initialized and its value gets changed, but it is never used. I presume it was intended to be used to skip over lines in the files.
Third: It might be simpler to collect all the dates into a Python list and then convert that to a pandas dataframe if needed.
import pandas as pd
import glob
import datetime as dt

# number of lines to skip before the first date
offset = 32
# number of lines from one date to the next
recordlength = 9

pattern = '1511??.mnd'
dates = []

for filename in glob.iglob(pattern):
    with open(filename) as datafile:
        count = -offset
        for line in datafile:
            if count == 0:
                fmt = '%Y-%m-%d %H:%M:%S %f'
                date_object = dt.datetime.strptime(line[:-6], fmt)
                dates.append(date_object)
            count += 1
            if count == recordlength:
                count = 0

data_nov15_hereford = pd.DataFrame(dates, columns=['Dates'])
print dates
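If the 9-line record spacing ever varies, a variant of the same loop (just a sketch, assuming every timestamp line starts with "20" and no data line does) is to test each line directly instead of counting:

for line in datafile:
    if line.startswith('20'):
        dates.append(dt.datetime.strptime(line[:19], '%Y-%m-%d %H:%M:%S'))

Here line[:19] keeps only the timestamp (e.g. 2015-11-08 00:10:00) and drops the trailing averaging interval.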
Consider modifying the csv data line by line prior to reading it in as a dataframe. The script below opens each original file in the glob list and writes to a temp file, moving the dates over to a last column and removing the repeated headers and empty lines.
CSV Data (assuming the text view of the csv file looks like below; if it differs from the actual file, adjust the Python code)
2015-11-0800:10:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.32,111.9,0.15,0.12,1.50E+05,0
40,3.85,108.2,0.07,0.14,7.75E+04,0
50,4.2,107.9,0.06,0.15,4.73E+04,0
60,4.16,108.5,0.03,0.19,2.73E+04,0
70,4.06,93.6,0.03,0.23,9.07E+04,0
80,4.06,93.8,0.07,0.28,1.36E+05,0
,,,,,,
2015-11-0800:20:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.79,120.9,0.15,0.11,7.79E+05,0
40,4.36,115.6,0.04,0.13,2.42E+05,0
50,4.71,113.6,0.07,0.14,6.84E+04,0
60,5,113.3,0.13,0.17,1.16E+04,0
70,4.29,94.2,0.22,0.2,1.38E+05,0
80,4.54,94.1,0.11,0.25,1.76E+05,0
,,,,,,
2015-11-0800:30:0000:10:00,,,,,,
z,speed,dir,W,sigW,bck,error
30,3.86,113.6,0.13,0.1,2.68E+05,0
40,4.34,116.1,0.09,0.11,1.41E+05,0
50,5.02,112.8,0.04,0.12,7.28E+04,0
60,5.36,110.5,0.01,0.14,5.81E+04,0
70,4.67,95.4,0.14,0.16,7.69E+04,0
80,4.56,95,0.15,0.21,9.84E+04,0
Python Script
import glob, os
import pandas as pd

filenames = glob.glob('1511??.mnd')
temp = 'temp.csv'

# INITIATE EMPTY DATAFRAME
data_nov15_hereford = pd.DataFrame(columns=['z', 'speed', 'dir', 'W',
                                            'sigW', 'bck', 'error', 'date'])

# ITERATE THROUGH EACH FILE IN GLOB LIST
for file in filenames:
    # DELETE PRIOR TEMP VERSION
    if os.path.exists(temp): os.remove(temp)
    header = 0

    # READ IN ORIGINAL CSV
    with open(file, 'r') as txt1:
        for rline in txt1:
            # SAVE DATE VALUE THEN SKIP ROW
            if "2015-11" in rline: date = rline.replace(',', ''); continue
            # SKIP BLANK LINES (CHANGE IF NO COMMAS)
            if rline == ',,,,,,\n': continue
            # ADD NEW 'DATE' COLUMN AND SKIP OTHER HEADER LINES
            if 'z,speed,dir,W,sigW,bck,error' in rline:
                if header == 1: continue
                rline = rline.replace('\n', ',date\n')
                with open(temp, 'a') as txt2:
                    txt2.write(rline)
                continue
            header = 1
            # APPEND LINE TO TEMP CSV WITH DATE VALUE
            with open(temp, 'a') as txt2:
                txt2.write(rline.replace('\n', ',' + date))

    # APPEND TEMP FILE TO DATA FRAME
    data_nov15_hereford = data_nov15_hereford.append(pd.read_csv(temp))
Output
z speed dir W sigW bck error date
0 30 3.32 111.9 0.15 0.12 150000 0 2015-11-0800:10:0000:10:00
1 40 3.85 108.2 0.07 0.14 77500 0 2015-11-0800:10:0000:10:00
2 50 4.20 107.9 0.06 0.15 47300 0 2015-11-0800:10:0000:10:00
3 60 4.16 108.5 0.03 0.19 27300 0 2015-11-0800:10:0000:10:00
4 70 4.06 93.6 0.03 0.23 90700 0 2015-11-0800:10:0000:10:00
5 80 4.06 93.8 0.07 0.28 136000 0 2015-11-0800:10:0000:10:00
6 30 3.79 120.9 0.15 0.11 779000 0 2015-11-0800:20:0000:10:00
7 40 4.36 115.6 0.04 0.13 242000 0 2015-11-0800:20:0000:10:00
8 50 4.71 113.6 0.07 0.14 68400 0 2015-11-0800:20:0000:10:00
9 60 5.00 113.3 0.13 0.17 11600 0 2015-11-0800:20:0000:10:00
10 70 4.29 94.2 0.22 0.20 138000 0 2015-11-0800:20:0000:10:00
11 80 4.54 94.1 0.11 0.25 176000 0 2015-11-0800:20:0000:10:00
12 30 3.86 113.6 0.13 0.10 268000 0 2015-11-0800:30:0000:10:00
13 40 4.34 116.1 0.09 0.11 141000 0 2015-11-0800:30:0000:10:00
14 50 5.02 112.8 0.04 0.12 72800 0 2015-11-0800:30:0000:10:00
15 60 5.36 110.5 0.01 0.14 58100 0 2015-11-0800:30:0000:10:00
16 70 4.67 95.4 0.14 0.16 76900 0 2015-11-0800:30:0000:10:00
17 80 4.56 95.0 0.15 0.21 98400 0 2015-11-0800:30:0000:10:00
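If a proper datetime column is wanted afterward, the combined string can be parsed; a sketch, assuming the first 18 characters are always the date plus the record time as in the output above:

data_nov15_hereford['date'] = pd.to_datetime(
    data_nov15_hereford['date'].str[:18], format='%Y-%m-%d%H:%M:%S')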
Related
How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?
I have a dataset like below:

data = {'ReportingDate': ['2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31','2013/5/31',
                          '2013/6/28','2013/6/28','2013/6/28','2013/6/28','2013/6/28'],
        'MarketCap': [' ',0.35,0.7,0.875,0.7,0.35,' ',1,1.5,0.75,1.25],
        'AUM': [3.5,3.5,3.5,3.5,3.5,3.5,5,5,5,5,5],
        'weight': [' ',0.1,0.2,0.25,0.2,0.1,' ',0.2,0.3,0.15,0.25]}

# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)
df

This is just a sample of an 8000-row dataset. ReportingDate runs from 2013/5/31 to 2015/10/30 and includes every month in that period, but only the last day of each month. The first line of each month has two missing values. I know that the sum of weight for each month is equal to 1, and that weight*AUM is equal to MarketCap. I can use the lines below to get the answer I want, but only for one month:

a = (1 - df["2013-5"].iloc[1:]['weight'].sum())
b = a * AUM
df.iloc[1,0] = b
df.iloc[1,2] = a

How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:

# If whitespaces are indeed whitespaces, not nan
df = df.replace("\s+", np.nan, regex=True)

# If not already a datetime index
df.index = pd.to_datetime(df.index)

s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform(sum)
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])

Note: This assumes that dates are always only the last day, so that grouping by date is equivalent to grouping by year-month. If not so, try:

s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform(sum)

Output:

               MarketCap  AUM  weight
ReportingDate
2013-05-31         0.350  3.5    0.10
2013-05-31         0.525  3.5    0.15
2013-05-31         0.700  3.5    0.20
2013-05-31         0.875  3.5    0.25
2013-05-31         0.700  3.5    0.20
2013-05-31         0.350  3.5    0.10
2013-06-28         0.500  5.0    0.10
2013-06-28         1.000  5.0    0.20
2013-06-28         1.500  5.0    0.30
2013-06-28         0.750  5.0    0.15
2013-06-28         1.250  5.0    0.25
merge 2 csv files by columns error related to strings?
I am trying to merge 2 csv files by column. Both of my csv files end with '_4.csv' in the filename, and the final result of the merged csv should be something like below:

0-10 ,83.72,66.76,86.98 ,0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04 ,11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94 ,21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03 ,31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0 ,over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0 ,UHF case ,0.0,0.0,0.0

my code:

#combine 2 csv into 1 by columns
files_in_dir = [f for f in os.listdir(os.getcwd()) if f.endswith('_4.csv')]
temp_data = []
for filenames in files_in_dir:
    temp_data.append(np.loadtxt(filenames, dtype='str'))
temp_data = np.array(temp_data)
np.savetxt('_mix.csv', temp_data.transpose(), fmt='%s', delimiter=',')

however the error said:

temp_data.append(np.loadtxt(filenames,dtype='str'))
  for x in read_data(_loadtxt_chunksize):
raise ValueError("Wrong number of columns at line %d"
ValueError: Wrong number of columns at line 2

not sure if it is related to the first column being strings rather than values. Does anyone know how to fix it? much appreciation
I think you're looking for the join method. If we have two .csv files of the form:

0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0

Assuming they both have similar structure, we'll work with one of these named data.csv:

import pandas as pd

# Assumes there are no headers
df1 = pd.read_csv("data.csv", header=None)
df2 = pd.read_csv("data.csv", header=None)

# By default, DataFrame headers are assigned numbers 0, 1, 2, 3.
# In the second data frame, we will rename columns so they do not clash,
# meaning df2 will now have columns named: 4, 5, 6, 7
df2 = df2.rename(
    columns={
        x: y for x, y in zip(df1.columns, range(len(df2.columns), len(df2.columns) * 2))
    }
)

print(df1.join(df2))

Example output:

          0      1      2      3         4      5      6      7
0      0-10  83.72  66.76  86.98      0-10  83.72  66.76  86.98
1     11-20  15.01  31.12  12.04     11-20  15.01  31.12  12.04
2     21-30   1.14   2.05   0.94     21-30   1.14   2.05   0.94
3     31-40   0.13   0.07   0.03     31-40   0.13   0.07   0.03
4   over 40   0.00   0.00   0.00   over 40   0.00   0.00   0.00
5  UHF case   0.00   0.00   0.00  UHF case   0.00   0.00   0.00
Implemented the groupby and want to insert by output of groupby in my .csv file
I have around 8781 rows in my dataset. I have grouped the different items according to month and calculated the mean of a particular item for every month. Now I want to store the result for every month by inserting a new row after each month. Below is the code I have worked on for grouping the items and calculating the mean. Please, anyone, tell me how I can insert a new row after every month and store my groupby result in it.

a = pd.read_csv("data3.csv")
print(a)
df = pd.DataFrame(a, columns=['month','day','BedroomLights..kW.'])
print(df)
groupby_month = df['day'].groupby(df['month'])
print(groupby_month)
c = list(df['day'].groupby(df['month']))
print(c)
d = df['day'].groupby(df['month']).describe()
print(d)
#print(groupby_month.mean())
e = df['BedroomLights..kW.'].groupby(df['month']).mean()
print(e)

A sample of the csv file is:

Day  Month  Year  lights  Fan   temperature  windspeed
1    1      2016  0.003   0.12  39           8.95
2    1      2016  0.56    1.23  34           9.54
3    1      2016  1.43    0.32  32           10.32
4    1      2016  0.4     1.43  24           8.32
.................................................
1    12     2016  0.32    0.54  22           7.65
2    12     2016  1.32    0.43  21           6.54

The expected output I want is a new row added after every month holding the mean of that month's items:

Month  lights ......
1      0.32
1      0.43
...............
mean as a new row
...............
12     0.32
12     0.43
mean
.........

The output of the code I have shown is as follows:

month
1     0.006081
2     0.005993
3     0.005536
4     0.005729
5     0.005823
6     0.005587
7     0.006214
8     0.005509
9     0.005935
10    0.005821
11    0.006226
12    0.006056
Name: BedroomLights..kW., dtype: float64
If your indices are named 1mean, 2mean, 3mean, etc., sort_index should place them where you want:

e.index = [str(n) + 'mean' for n in range(1, 13)]
df = df.append(e)
df = df.sort_index()
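One caveat (not from the original answer): string indices sort lexicographically, so '10mean' would land before '2mean'. Zero-padding the month number keeps the intended order:

e.index = ['{:02d}mean'.format(n) for n in range(1, 13)]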
How do I apply a lambda function on pandas slices, and return the same format as the input data frame?
I want to apply a function to row slices of a dataframe in pandas, for each row, and return a dataframe with, for each row, the value and the number of slices that was calculated. So, for example:

df = pandas.DataFrame(numpy.round(numpy.random.normal(size=(2, 10)), 2))
f = lambda x: (x - x.mean())

What I want is to apply the lambda function f from column 0 to 5 and from column 5 to 10. I did this:

a = pandas.DataFrame(f(df.T.iloc[0:5,:]))

but this is only for the first slice. How can I include the second slice in the code, so that my resulting output frame looks exactly like the input frame, just with every data point changed to its value minus the mean of the corresponding slice? I hope it makes sense. What would be the right way to go with this? Thank you.
You can simply reassign the result to the original df, like this:

import pandas as pd
import numpy as np

# I'd rather use a function than a lambda here, preference I guess
def f(x):
    return x - x.mean()

df = pd.DataFrame(np.round(np.random.normal(size=(2, 10)), 2))
df.T

      0     1
0  0.92 -0.35
1  0.32 -1.37
2  0.86 -0.64
3 -0.65 -2.22
4 -1.03  0.63
5  0.68 -1.60
6 -0.80 -1.10
7 -0.69  0.05
8 -0.46 -0.74
9  0.02  1.54

# make a copy of df here
df1 = df

# just reassign the slices back to the copy
# edited, omit DataFrame part.
df1.T[:5], df1.T[5:] = f(df.T.iloc[0:5,:]), f(df.T.iloc[5:,:])
df1.T

       0     1
0  0.836  0.44
1  0.236 -0.58
2  0.776  0.15
3 -0.734 -1.43
4 -1.114  1.42
5  0.930 -1.23
6 -0.550 -0.73
7 -0.440  0.42
8 -0.210 -0.37
9  0.270  1.91
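One detail to be careful with: df1 = df binds a second name to the same object rather than making a copy, so modifying df1 also modifies df. The snippet happens to work because the right-hand side is evaluated before assignment, but for an actual independent copy use:

df1 = df.copy()   # real copy; modifying df1 no longer touches df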
Finding duration between events
I want to compute the duration (in weeks) between changes. For example, p is the same for weeks 1, 2, 3 and changes to 1.11 in period 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works, but it is slow. Any suggestion on how to improve this would be greatly appreciated.

raw['duration'] = np.nan
id = raw['unique_id'].unique()
for i in range(0, len(id)):
    pos1 = abs(raw['dp']) > 0
    pos2 = raw['unique_id'] == id[i]
    pos = np.where(pos1 & pos2)[0]
    raw['duration'][pos[0]] = raw['week'][pos[0]] - 1
    for j in range(1, len(pos)):
        raw['duration'][pos[j]] = raw['week'][pos[j]] - raw['week'][pos[j-1]]

The dataframe is raw, and the values for a particular unique_id look like this:

date        week  p     change  duration
2006-07-08  27    1.05  -0.07   1
2006-07-15  28    1.05   0.00   NaN
2006-07-22  29    1.05   0.00   NaN
2006-07-29  30    1.11   0.06   3
...         ...   ...    ...    ...
2010-06-05  231   1.61   0.09   1
2010-06-12  232   1.63   0.02   1
2010-06-19  233   1.57  -0.06   1
2010-06-26  234   1.41  -0.16   1
2010-07-03  235   1.35  -0.06   1
2010-07-10  236   1.43   0.08   1
2010-07-17  237   1.59   0.16   1
2010-07-24  238   1.59   0.00   NaN
2010-07-31  239   1.59   0.00   NaN
2010-08-07  240   1.59   0.00   NaN
2010-08-14  241   1.59   0.00   NaN
2010-08-21  242   1.61   0.02   5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough info for help with that.
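A minimal sketch of that idea (a hypothetical helper, assuming you already have the sorted list of week numbers at which p changed):

def gaps_between_changes(change_weeks):
    # change_weeks: sorted week numbers where p changed, e.g. [27, 30]
    # returns, for each change after the first, the weeks since the previous change
    return [week - prev for prev, week in zip(change_weeks, change_weeks[1:])]

gaps_between_changes([27, 30, 35])   # -> [3, 5]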
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame:

weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()
pd.merge(raw, weeks, on='week', how='left')
raw2 = raw.ix[raw['change'] != 0, ['week','unique_id']]
data2 = raw2.groupby('unique_id')
raw2['duration'] = data2['week'].transform(lambda x: x.diff())
raw2.drop('unique_id', 1)
raw = pd.merge(raw, raw2, on=['unique_id','week'], how='left')

Thank you all. I modified the suggestion and got this to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code seems more compact. I left "no change" as NaN because the duration seems undefined when no change is made, but zero would work too; with the above code, the NaN is put in automatically by the merge. In any case, I want to compute statistics for the non-change group separately.