I'm trying to filter data stored in a .csv file that contains time and angle values, and save the filtered data to an output .csv file. The filtering part works, but time is recorded in hh:mm:ss:msmsmsms format (e.g. 12:55:34:500) and I want to change that to hhmmss (125534), i.e. remove the colons and the millisecond part.
I tried using the .replace function, but I keep getting KeyError: 'time'.
Input data:
time,angle
12:45:55,56
12:45:56,89
12:45:57,112
12:45:58,189
12:45:59,122
12:46:00,123
Code:
import pandas as pd
#define min and max angle values
alpha_min = 110
alpha_max = 125
#read input .csv file
data = pd.read_csv('test_csv3.csv', index_col=0)
#filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]
#replace ":" with "" in time values
data['time'] = data['time'].replace(':','')
#display results
print(data)
#write results
data.to_csv('test_csv3_output.csv')
That's because 'time' is being used as the index (because of index_col=0), so data['time'] raises a KeyError. Drop the index_col=0 and read it as a regular column:
data = pd.read_csv('test_csv3.csv')
And change the replacement line (note that Series.replace(':', '') would not have worked anyway, since it matches whole cell values rather than substrings; for that you would need .str.replace):
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')
Output:
time angle
2 124557 112
4 124559 122
5 124600 123
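The question describes times with a trailing millisecond part (12:55:34:500) that the sample data doesn't show. As a sketch, assuming the times really do carry that fourth field, an explicit format string handles both the parsing and the trimming; the sample values here are made up:

```python
import pandas as pd

# hypothetical times in the hh:mm:ss:ms format described in the question
s = pd.Series(['12:55:34:500', '12:45:57:250'])

# parse with an explicit format (the last field is fractional seconds),
# then keep only hours/minutes/seconds
hhmmss = pd.to_datetime(s, format='%H:%M:%S:%f').dt.strftime('%H%M%S')
print(hhmmss.tolist())
```

The strftime step drops the millisecond part automatically, since '%H%M%S' simply doesn't include it.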
What would print(data.keys()) or print(data.head()) yield? It seems like you have a stray character before/after the time column name; that happens from time to time, depending on how the csv was created versus how it was read (see this question).
If it's not a bigger project and/or you just want the data, you could use a workaround like timeKeyString = list(data.columns.values)[0] (assuming time is the first column).
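If the stray character is just whitespace around the header names (a common cause of this KeyError), one sketch of a fix, shown here with a made-up CSV, is to normalise the column names right after reading:

```python
import io
import pandas as pd

# a made-up CSV whose header has a stray leading space before 'time'
raw = io.StringIO(' time,angle\n12:45:55,56\n12:45:56,89\n')
df = pd.read_csv(raw)

# strip whitespace from every column name so df['time'] works
df.columns = df.columns.str.strip()
print(df.columns.tolist())
```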
I was trying to modify each string in the column named Date_time in a dataframe. The values (string type) in that column look like:
"40 11-02-20 11:42:36"
I want to delete the characters up to the first space, leaving "11-02-20 11:42:36". I was able to split the value, but I am unable to write it back into the same cell of that column. Here is the code I have so far:
import numpy as np
import matplotlib as plt
import pandas as pd
dataset = pd.read_csv('20-02-11.csv')
for i in dataset.itertuples():
    print(type(i.Date_time))
    str = i.Date_time
    str1 = str.split(None,1)[1]
    i.Date_time = str1
    print(str1)
    print(i.Date_time)
    break
and it shows an AttributeError when I try to assign str1 to i.Date_time.
Please help.
The namedtuples that itertuples() returns cannot be used to set values in the original dataframe; they are copies, not views into the dataframe's data. You can try something like this:
for i in range(len(dataset)):
    your_string = dataset.loc[i, "Date_time"]
    adjusted_string = your_string.split(None, 1)[1]
    dataset.loc[i, "Date_time"] = adjusted_string
This will use the actual data stored in the dataframe.
Alternatively, using the df.at indexer:
for i, row in dataset.iterrows():
    your_string = row.Date_time  # or row['Date_time']
    adjusted_string = your_string.split(None, 1)[1]
    dataset.at[i, 'Date_time'] = adjusted_string
You can format the entire column at once. Starting with a dataframe like this:
df = pd.DataFrame({'date_time': ['40 11-02-20 11:42:36', '31 11-02-20 11:42:36']})
print(df)
returns
date_time
0 40 11-02-20 11:42:36
1 31 11-02-20 11:42:36
You can remove the first characters and space like this:
df['date_time'] = [i[1+len(i.split(' ')[0]):] for i in df['date_time']]
print(df)
returns
date_time
0 11-02-20 11:42:36
1 11-02-20 11:42:36
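The same one-shot clean-up can also be written with pandas' vectorised string methods, which avoids the Python-level loop entirely; a sketch on the same made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'date_time': ['40 11-02-20 11:42:36', '31 11-02-20 11:42:36']})

# split on the first run of whitespace (n=1) and keep everything after it
df['date_time'] = df['date_time'].str.split(n=1).str[1]
print(df['date_time'].tolist())
```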
I am attempting to roll up rows from a data set with similar measures into a consolidated row. Two conditions must be met for the roll-up:
The measures (measure1 through measure5) must remain the same across the rows for them to be rolled up into a single row.
The dates must be continuous (no gap between one row's end_date and the next row's begin_date).
If these conditions are not met, the code should generate a separate row.
This is the sample data that I am using:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,6/30/2019
ABC123XYZ789,1,1,1,1,1,7/1/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,3/31/2019
FGH589J6U88SW,1,1,1,1,1,4/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,9/30/2019
FGH589J6U88SW,1,1,1,2,1,10/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,3/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,4/1/2019,6/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,7/1/2019,9/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,10/1/2019,12/31/2019
The expected result should be:
id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,12/31/2019
I have implemented the code below which seems to address condition # 1, but I am looking for ideas on how to incorporate condition # 2 into the solution.
import pandas as pd
import time
startTime=time.time()
data=pd.read_csv('C:\\Users\\usertemp\\Data\\Rollup2.csv')
data['end_date']= pd.to_datetime(data['end_date'])
data['begin_date']= pd.to_datetime(data['begin_date'])
data = data.groupby(['id','measure1','measure2', 'measure3', 'measure4', 'measure5']) \
['begin_date', 'end_date'].agg({'begin_date': ['min'], 'end_date': ['max']}).reset_index()
print(data)
print("It took %s seconds for the collapse process" % (time.time() - startTime))
Any help is appreciated.
You can do the following.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# df is assumed to already hold the CSV data (e.g. df = pd.read_csv(...))
# Convert begin_date and end_date to datetime
df['begin_date'] = pd.to_datetime(df['begin_date'], format='%m/%d/%Y')
df['end_date'] = pd.to_datetime(df['end_date'], format='%m/%d/%Y')
# Create a new column that contains end_date + 1 day from the previous row
df['end_date_prev'] = df['end_date'].iloc[:-1] + timedelta(days=1)
df['end_date_prev'] = np.roll(df['end_date_prev'], 1)
# Cumulative counter that increments whenever begin_date doesn't match end_date_prev
df['cont'] = (~(df['begin_date'] == df['end_date_prev'])).astype(int).cumsum()
# Since all measures need to match, build a string column combining all measurements
df['comb_measure'] = df['measure1'].astype(str).str.cat(
    df[['measure{}'.format(i) for i in range(2, 6)]].astype(str))
# Get the final df
new_df = df.groupby(['id', 'comb_measure', 'cont']).agg(
    {'measure1': 'first', 'measure2': 'first', 'measure3': 'first',
     'measure4': 'first', 'measure5': 'first',
     'begin_date': 'first', 'end_date': 'last'})
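For reference, here is a self-contained sketch of the same idea using shift() instead of np.roll, run against the sample data from the question. The block counter restarts whenever the id or any measure changes as well, so no combined-measure string is needed:

```python
import io
import pandas as pd

csv = io.StringIO("""id,measure1,measure2,measure3,measure4,measure5,begin_date,end_date
ABC123XYZ789,1,1,1,1,1,1/1/2019,3/31/2019
ABC123XYZ789,1,1,1,1,1,4/23/2019,6/30/2019
ABC123XYZ789,1,1,1,1,1,7/1/2019,9/30/2019
ABC123XYZ789,1,1,1,1,1,10/12/2019,12/31/2019
FGH589J6U88SW,1,1,1,1,1,1/1/2019,3/31/2019
FGH589J6U88SW,1,1,1,1,1,4/1/2019,6/30/2019
FGH589J6U88SW,1,1,1,2,1,7/1/2019,9/30/2019
FGH589J6U88SW,1,1,1,2,1,10/1/2019,12/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,1/1/2019,3/31/2019
253DRWQ85AT2F334B,1,2,1,3,1,4/1/2019,6/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,7/1/2019,9/30/2019
253DRWQ85AT2F334B,1,2,1,3,1,10/1/2019,12/31/2019""")

df = pd.read_csv(csv, parse_dates=['begin_date', 'end_date'])
keys = ['id'] + ['measure{}'.format(i) for i in range(1, 6)]

# a new block starts when the date sequence breaks...
gap = df['begin_date'] != df['end_date'].shift() + pd.Timedelta(days=1)
# ...or when the id or any measure differs from the previous row
change = (df[keys] != df[keys].shift()).any(axis=1)
block = (gap | change).cumsum()

out = df.groupby(block).agg(
    {'id': 'first',
     'measure1': 'first', 'measure2': 'first', 'measure3': 'first',
     'measure4': 'first', 'measure5': 'first',
     'begin_date': 'first', 'end_date': 'last'}).reset_index(drop=True)
print(out)
```

This produces the six expected rows: three for ABC123XYZ789 (broken by the 4/23 and 10/12 gaps), two for FGH589J6U88SW (split at the measure4 change), and one fully-merged row for 253DRWQ85AT2F334B.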
I have a Python script that cleans up a .csv before I append it to another data set. The file is missing a couple of columns, so I have been trying to figure out how to use pandas to add a column and fill its rows.
I currently have a column DiscoveredDate in a format like 10/1/2017 12:49.
What I'm trying to do is take that column and, for anything in the date range 10/1/2016-10/1/2017, fill a FedFY column with 2017, and likewise for 2018.
Below is my current script, minus a few other column cleanups.
import os
import re
import pandas as pd
import Tkinter
import numpy as np
outpath = os.path.join(os.getcwd(), "CSV Altered")
# TK asks user what file to assimilate
from Tkinter import Tk
from tkFileDialog import askopenfilename
Tk().withdraw()
filepath = askopenfilename() # show an "Open" dialog box and return the path to the selected file
#Filepath is acknowledged and disseminated with the following totally human protocols
filenames = os.path.basename(filepath)
filename = [filenames]
for f in filename:
    name = f
    df = pd.read_csv(f)
    # Make Longitude values negative if they aren't already.
    df['Longitude'] = - df['Longitude'].abs()
    # Add Federal Fiscal Year Field (FedFY)
    df['FedFY'] = df['DiscoveredDate']
    df['FedFY'] = df['FedFY'].replace({df['FedFY'].date_range(10/1/2016 1:00,10/1/2017 1:00): "2017",df['FedFY'].date_range(10/1/2017 1:00, 10/1/2018 1:00): "2018"})
I also tried this but figured I was completely fudging it up.
for rows in df['FedFY']:
    if rows = df['FedFY'].date_range(10/1/2016 1:00, 10/1/2017 1:00):
        then df['FedFY'] = df['FedFY'].replace({rows : "2017"})
    elif df['FedFY'] = df['FedFY'].replace({rows : "2018"})
How should I go about this efficiently? Is it just my syntax messing me up? Or do I have it all wrong?
[Edited for clarity in title and throughout.]
Ok, thanks to DyZ I am making progress; however, I figured out a much simpler way that handles all years.
Building on his np.where, I did:
from datetime import datetime
df['Date'] = pd.to_datetime(df['DiscoveredDate'])
df['CalendarYear'] = df['Date'].dt.year
df['Month'] = df.Date.dt.month
c = pd.to_numeric(df['CalendarYear'])
And here is the magic line.
df['FedFY'] = np.where(df['Month'] >= 10, c+1, c)
To mop up, I added a line to get it back into date-time format from numeric.
df['FedFY'] = (pd.to_datetime(df['FedFY'], format = '%Y')).dt.year
What really got me across the bridge was Create a column based off a conditional with pandas.
Edit: forgot to mention the pd.to_datetime conversion needed for the .dt accessors.
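Put together as a runnable sketch (the sample DiscoveredDate values are made up), the fiscal-year logic boils down to:

```python
import numpy as np
import pandas as pd

# made-up DiscoveredDate values in the m/d/yyyy h:mm format from the question
df = pd.DataFrame({'DiscoveredDate':
                   ['10/1/2017 12:49', '9/30/2017 8:00', '11/15/2016 10:30']})

d = pd.to_datetime(df['DiscoveredDate'])
# federal fiscal year: October through December count toward the next calendar year
df['FedFY'] = np.where(d.dt.month >= 10, d.dt.year + 1, d.dt.year)
print(df['FedFY'].tolist())
```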
If you are concerned only with these two FYs, you can compare your date directly to the start/end dates:
df["FedFY"] = np.where((df.DiscoveredDate < pd.to_datetime("10/1/2017")) &\
(df.DiscoveredDate > pd.to_datetime("10/1/2016")),
2017, 2018)
Any date before 10/1/2016 will be labeled incorrectly! (You can fix this by adding another np.where).
Make sure that the start/end dates are correctly included or not included (change < and/or > to <= and >=, if necessary).
Sorry if this has been asked before -- I couldn't find this specific question.
In python, I'd like to subtract every even column from the previous odd column:
so go from:
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113
to
101.849 110.349 68.513
109.95 110.912 61.274
100.612 110.05 62.15
107.75 118.687 59.712
There will be an unknown number of columns. Should I use something in pandas or numpy?
Thanks in advance.
You can accomplish this using pandas. You can select the even- and odd-indexed columns separately and then subtract them.
#hiro protagonist, I didn't know you could do that StringIO magic. That's spicy.
import pandas as pd
import io
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep=r'\s+')
Note that the even/odd terms may be counterintuitive because Python is 0-indexed, meaning that the signal columns are actually even-indexed and the background columns odd-indexed. If I understand your question properly, this is contrary to your use of the even/odd terminology. Just pointing out the difference to avoid confusion.
# strip the columns into their appropriate signal or background groups
bg_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 1]]
signal_df = df.iloc[:, [i for i in range(len(df.columns)) if i%2 == 0]]
# subtract the values of the data frames and store the results in a new data frame
result_df = pd.DataFrame(signal_df.values - bg_df.values)
result_df contains columns which are the difference between the signal and background columns. You probably want to rename these column names, though.
>>> result_df
0 1 2
0 101.849 110.349 68.513
1 109.950 110.912 61.274
2 100.612 110.050 62.150
3 107.750 118.687 59.712
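The same subtraction can also be written with stride slicing, which additionally lets you keep the signal column names instead of renaming afterwards; a sketch on the same data:

```python
import io
import pandas as pd

data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
df = pd.read_csv(data, sep=r'\s+')

# even-indexed columns are the signal, odd-indexed ones the background;
# subtract the raw values and reuse the signal column names
result = pd.DataFrame(df.iloc[:, ::2].values - df.iloc[:, 1::2].values,
                      columns=df.columns[::2])
print(result)
```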
import io
# faking the data file
data = io.StringIO('''ROI121 ROI122 ROI124 ROI125 ROI126 ROI127
292.087 190.238 299.837 189.488 255.525 187.012
300.837 190.887 299.4 188.488 248.637 187.363
292.212 191.6 299.038 188.988 249.65 187.5
300.15 192.4 307.812 189.125 247.825 188.113''')
header = next(data) # read the first line from data
# print(header[:-1])
for line in data:
    # print(line)
    floats = [float(val) for val in line.split()]  # create a list of floats
    for prev, cur in zip(floats[::2], floats[1::2]):
        print('{:6.3f}'.format(prev-cur), end=' ')
    print()
with output:
101.849 110.349 68.513
109.950 110.912 61.274
100.612 110.050 62.150
107.750 118.687 59.712
If you know what data[start:stop:step] means and how zip works, this should be easy to understand.