I apologize if there is an obvious answer to this already.
I have a very large file that poses a few challenges for parsing. These files are delivered to me from outside my organization, so there is no chance I can change their format.
Firstly, the file is space-delimited, but the fields that represent a "column" of data can span multiple rows. For example, a row that is supposed to contain 25 columns of data may be written in the file as:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25
1 2 3 4 5 6 7 8 9 10 11 12 13
14 15 16 17 18
19 20 21 22 23 24 25
As you can see, I can't rely on each set of data being on the same line, but I can rely on there being the same number of columns per set.
To make matters worse, the file follows a definition:data format where the first 3 or so lines describe the data (including a field that tells me how many rows there are) and the next N rows are data. Then it goes back to the three description lines again for the next set of data. That means I can't just set up a reader for the N-column format and let it run to EOF.
I'm afraid the built-in Python file-reading functionality could get really ugly really fast, but I can't find anything in csv or numpy that works.
Any suggestions?
EDIT: Just as an example of a different solution:
We have an old tool in MATLAB that parses this file using textscan on an open file handle. We know the number of columns so we do something like:
data = textscan(fid, repmat('%f ',1,n_cols), n_rows, 'delimiter', {' ', '\r', '\n'}, 'multipledelimsasone', true);
This would read the data no matter how it wrapped while leaving a file handle open to process the next section later. This is done because the files are so large they can lead to excess RAM usage.
This is a sketch how you can proceed:
(EDIT: with some modifications)
file = open("testfile.txt", "r")
# store data for the different sections here
datasections = list()
while True:
    current_row = []
    # read three header lines
    l1 = file.readline()
    if l1 == '':  # or other end condition
        break
    l2 = file.readline()
    l3 = file.readline()
    # extract the following information from l1, l2, l3
    nrows = ...  # the number of rows in the next section
    ncols = ...  # the number of columns in the next section
    # loop while len(current_row) < nrows * ncols:
    #     read the next line, isolate the items using str.split()
    #     append the items to current_row
    # break current_row into rows after each ncols-th item
    # store the data in datasections as a new array
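For what it's worth, here is a slightly more complete sketch along the same lines. It is only an illustration: parse_header() is a hypothetical helper that you would have to write for your actual header format, and I'm assuming the values are floats. The inner loop keeps reading lines until nrows * ncols tokens have been collected, so it doesn't matter how the rows are wrapped, and the file handle stays open between sections, much like the MATLAB textscan approach.

import numpy as np

def parse_header(l1, l2, l3):
    # Hypothetical: pull the row and column counts out of the three
    # description lines; this depends entirely on your real file format.
    raise NotImplementedError

def read_sections(path):
    with open(path, "r") as f:
        while True:
            l1 = f.readline()
            if l1 == '':  # EOF
                break
            l2 = f.readline()
            l3 = f.readline()
            nrows, ncols = parse_header(l1, l2, l3)
            values = []
            # keep reading lines until one full section has been collected,
            # regardless of how the rows wrap
            while len(values) < nrows * ncols:
                values.extend(f.readline().split())
            yield np.array(values, dtype=float).reshape(nrows, ncols)

for section in read_sections("testfile.txt"):
    print(section.shape)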
In Python 3 with pandas, I have a dataframe with dozens of columns and rows describing food characteristics. Below is a summary:
alimentos = pd.read_csv("alimentos.csv",sep=',',encoding = 'utf-8')
alimentos.reset_index()
index alimento calorias
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
The column "alimento" (food) has the lines "iogurte", "sardinha", "manteiga", "maçã" and "milho", which are food names.
I need to create a new column in this dataframe, which will tell what kind of food is. I gave the name "classificacao"
alimentos['classificacao'] = ""
alimentos.reset_index()
index alimento calorias classificacao
0 0 iogurte 40
1 1 sardinha 30
2 2 manteiga 50
3 3 maçã 10
4 4 milho 10
Depending on the content found in the "alimento" column, I want to automatically fill the rows of the "classificacao" column.
For example, when finding "iogurte", fill in "laticinio"; when finding "sardinha", "peixe"; when finding "manteiga", "gordura animal"; when finding "maçã", "fruta"; and when finding "milho", "cereal".
Please, is there a way to automatically fill the rows when I find these strings?
If you have a mapping of all the possible values in the "alimento" column, you can just create a dictionary and use .map(d), as shown below:
df = pd.DataFrame({'alimento': ['iogurte','sardinha', 'manteiga', 'maçã', 'milho'],
'calorias':range(10,60,10)})
d = {"iogurte":"laticinio", "sardinha":"peixe", "manteiga":"gordura animal", "maçã":"fruta", "milho": "cereal"}
df['classificacao'] = df['alimento'].map(d)
However, in real life we often can't map everything in a dict (because of outliers that occur once in a blue moon, faulty inputs, etc.), in which case the above would return NaN in the "classificacao" column. This could cause some issues, so think about setting a default value, like "Other" or "Unknown". To do that, just append .fillna("Other") after .map(d).
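For example, a minimal sketch of that default-value variant, using the same df and d as above:

df['classificacao'] = df['alimento'].map(d).fillna("Other")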
This is my first post so please be gentle. I have searched across the world wide web looking for a solution, but I am yet to find one. The problem I'm trying to solve is as follows:
I have a dataset comprised of 500,000+ samples, with 6 features per sample.
I have put this dataset in a multi-indexed Pandas DataFrame.
The first level of my DataFrame is the timeseries index, the second level is the ID. It looks as follows:
Time id
2017-03-07 10:06:49.963241984 122.0 -7.024347
136.0 -11.664985
243.0 1.716150
2017-03-07 10:06:50.003462400 122.0 -7.025922
136.0 -11.671526
At every timestamp, a number of objects can be seen, each marked by the label 'id'. For my application, I want to add a temporal dependency by including information
from 5 seconds ago, i.e. in this example from timestamp 10:06:45.
But, importantly, I only want to add this information if the object already existed at that timestamp (so if the id is equal).
I wanted to use the function dataframe.shift, as mentioned here, and I want to do it per level, as indicated by user Unutbu in How do you shift Pandas DataFrame with a multiindex?
My question is as follows:
How do I append extra columns to the original dataframe X with information on what those objects were 5 s ago? I would expect something like the following:
X['x_location_shifted'] = X.groupby(level=1)['x_location'].shift(5*rate)
with the rate being 25 Hz, i.e. we shift 125 "DateTimeIndices", but only if an object with id='...' exists at that timestamp.
EDIT:
The timestamps are not synchronized 100%, so the time gap is not always exactly equal to 0.04. Previously, I used np.argmin(np.abs(time-index)) to find the closest index to the stamp.
For example, in my set, at timestamp 2017-03-07 10:36:03.605008640 there is an object with id == 175 and location_x = 54.323.
id = 175
X.ix['2017-03-07 10:36:03.605008640', id] = 54.323
At timestamp 2017-03-07 10:36:08.604962560 ..... this object with id=175 has a location_x = 67.165955
id = 175
old_time = pd.to_datetime('2017-03-07 10:36:03.605008640')
new_time = old_time + pd.Timedelta('5 seconds')
# Finding the new value of location
X.ix[np.argmin(np.abs(new_time - X.index.get_level_values(0))), id]
So, finally, at timestep 10:36:08 I want to add the information from timestamp 10:36:03 IF the object already existed at that timestamp.
EDIT2:
After trying Maarten Fabré's solution, I came up with my own implementation, which you can find below. If anyone can show me a more pythonic way to do this, please let me know.
for current_time in X.index.get_level_values(0)[125:]:
    # only do this if there are objects at the current time
    if len(X.ix[current_time].index):
        # Calculate the past time
        past_time = current_time - pd.Timedelta('5 seconds')
        # Find the index in X.index that is closest to this past time
        past_time_index = np.argmin(np.abs(past_time - X.index.get_level_values(0)))
        # translate the index back to a label
        past_time = X.index[past_time_index][0]
        # in that timestep, cycle over the objects
        for obj_id in X.ix[current_time].index:
            # Try looking up the values of obj obj_id 5 s ago
            try:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = X.ix[(past_time, obj_id), 'box_center.x']
                X.ix[(current_time, obj_id), 'box_center.y.shifted'] = X.ix[(past_time, obj_id), 'box_center.y']
                X.ix[(current_time, obj_id), 'relative_velocity.x.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.x']
                X.ix[(current_time, obj_id), 'relative_velocity.y.shifted'] = X.ix[(past_time, obj_id), 'relative_velocity.y']
            # If the key doesn't exist, the object didn't exist, ergo the field should be np.nan
            except KeyError:
                X.ix[(current_time, obj_id), 'box_center.x.shifted'] = np.nan
    print('Timestep {}'.format(current_time))
If this is not enough information, please say so and I can add it :)
Cheers and thanks!
Assuming that you have no gaps in the timestamps, one possible solution might be the following, which creates a new index with shifted timestamps and uses that to get the 5 seconds-ago values for each ID.
offset = 5 * rate
# Create a shallow copy of the multiindex levels for modification
modified_levels = list(X.index.levels)
# Shift them
modified_times = pd.Series(modified_levels[0]).shift(offset)
# Fill NaNs with dummy values to avoid duplicates in the new index
modified_times[modified_times.isnull()] = range(sum(modified_times.isnull()))
modified_levels[0] = modified_times
new_index = X.index.set_levels(modified_levels, inplace=False)
X['x_location_shifted'] = X.loc[new_index, 'x_location'].values
If the timestamps are not 100% regular, then you'll either have to round them to the nearest 1/x of a second (a sketch of that follows), or use a loop.
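A minimal sketch of the rounding idea, assuming the data really is sampled at 25 Hz (so a 40 ms grid), that the rounded (time, id) pairs are unique, and that the index levels are named 'Time' and 'id' as in the question; the (time - 5 s, id) lookup then becomes an exact reindex, with NaN for objects that did not exist:

import pandas as pd

# snap the time level onto the 40 ms grid (assumes 25 Hz sampling)
rounded = X.index.get_level_values('Time').round('40ms')
X.index = pd.MultiIndex.from_arrays(
    [rounded, X.index.get_level_values('id')], names=['Time', 'id'])

# look up each (time - 5 s, id) pair; pairs that don't exist come back as NaN
lookup = pd.MultiIndex.from_arrays(
    [X.index.get_level_values('Time') - pd.Timedelta('5s'),
     X.index.get_level_values('id')])
X['x_location_shifted'] = X['x_location'].reindex(lookup).values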
Alternatively, you could use a loop like the one below.
Data definition
import pandas as pd
import numpy as np
from io import StringIO
df_str = """
timestamp id location
10:00:00.005 1 a
10:00:00.005 2 b
10:00:00.005 3 c
10:00:05.006 2 a
10:00:05.006 3 b
10:00:05.006 4 c"""
df = pd.DataFrame.from_csv(StringIO(df_str), sep='\t').reset_index()
delta = pd.to_timedelta(5, unit='s')
margin = pd.to_timedelta(1/50, unit='s')
df['location_shifted'] = np.nan
Loop over the different id's
for label_id in set(df['id']):
    df_id = df[df['id'] == label_id].copy()  # copy to make sure we don't overwrite the original data. Might not be necessary
    df_id['time_shift'] = df['timestamp'] + delta
    for row in df_id.itertuples():
        idx = row.Index
        time_dif = abs(df['timestamp'] - row.time_shift)
        shifted_locs = df_id[time_dif < margin]
        l = len(shifted_locs)
        if l:
            print(shifted_locs)
            if l == 1:
                idx_shift = shifted_locs.index[0]
            else:
                idx_shift = shifted_locs['time_shift'].idxmin()
            df.loc[idx_shift, 'location_shifted'] = df_id.loc[idx, 'location']
Results
timestamp id location location_shifted
0 2017-05-09 10:00:00.005 1 a
1 2017-05-09 10:00:00.005 2 b
2 2017-05-09 10:00:00.005 3 c
3 2017-05-09 10:00:05.006 2 a b
4 2017-05-09 10:00:05.006 3 b c
5 2017-05-09 10:00:05.006 4 c
For anyone arriving here with the same question: I managed to solve it in a (minimally) vectorized way, but it required me to go back to a 3D panel.
3 steps:
- make the data into a 3D panel
- add the new columns
- fill those columns
From a multi-index 2D frame it's possible to change it to a pandas.Panel, where you convert the 2nd index to one of the axes in the panel.
After this I have a 3D panel with axes [time, objects, parameters]. Then, transpose the panel to have the PARAMETERS as items, in order to add columns to the data panel. So: transpose the panel, add the columns, transpose back.
dp_new = dp.transpose(2,0,1)
dp_new['shifted_box_center_x']=np.nan
dp_new['shifted_box_center_y']=np.nan
dp_new['shifted_relative_velocity_x']=np.nan
dp_new['shifted_relative_velocity_y']=np.nan
# transpose them back to their original form
dp_new = dp_new.transpose(1,2,0)
Now that we have added the new fields, we can get their names by
new_fields = dp_new.minor_axis[-4:]
The objective is to add information from 5 s ago, if that object existed. Therefore, we cycle through the time series starting from the moment that lies 5 s in. In my case, at a rate of 25 Hz, this is element 5*rate = 125.
Let's first set the time to start from 5 s into the data panel:
time = dp_new.items[125:]
Then, we iterate over an enumerated version of the time. The enumeration starts at 0, which is the index of the data panel at timestep = 0; the first timestep, however, is the timestep at time 0 + 5 seconds.
time = dp_new.items[125:]
for iloc, ts in enumerate(time):
    # Print progress
    print('{} out of {}'.format(ts, dp.items[-1]), end="\r", flush=True)
    # Generate new INDEX field, by taking the field ID and dropping the NaN values
    ids = dp_new.loc[ts].id.dropna().values
    # Drop the nan field from the frame
    dp_new[ts].dropna(thresh=5, inplace=True)
    # save the original indices
    original_index = {'index': dp_new.loc[ts].index, 'id': dp_new.loc[ts].id.values}
    # set the index to field id
    dp_new[ts].set_index(['id'], inplace=True)
    # Check if the vector ids does NOT contain ALL ZEROS
    if np.any(ids):  # Check for all zeros
        df_past = dp_new.iloc[iloc].copy()  # SCREENSHOT AT TS=5s --> ILOC = 0
        df_past.dropna(thresh=5, inplace=True)  # drop the nan rows
        df_past.set_index(['id'], inplace=True)  # set the index to field ID
        dp_new[ts].loc[original_index['id'], new_fields] = df_past[fields].values
This will only fill in the fields for the rows whose id is in ids.
This code was able to run on a 300 000 element file in about 5 minutes.
Note: I spent quite some time on this, mainly because of how one indexes a panel. At first, I thought indexing with all 3 dimensions would work, as stated in the pandas help, but it seems that this is not the case.
dp_new[ts, ids, new_fields] = values does NOT work.
I have a data set that looks like this:
CustomerID EventID EventType EventTime
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
6 5 Facebook 42377.40598
6 6 Web 42378.31245
CustomerID: the unique identifier associated with the particular customer.
EventID: a unique identifier for a particular online activity.
EventType: the type of online activity associated with this record (Web, Facebook, or Twitter).
EventTime: the date and time at which this online activity took place. This value is measured as the number of days since January 1, 1900, with fractions indicating particular times of day. So, for example, an event taking place at the stroke of midnight on January 1, 2016 would have an EventTime of 42370.00, while an event taking place at noon on January 1, 2016 would have an EventTime of 42370.50.
I've managed to import the CSV and turn it into a list of lists with the following code:
# Import Libraries & Set working directory
import csv
# STEP 1: READING THE DATA INTO A PYTHON LIST OF LISTS
f = open('test1000.csv', "r") # Import CSV as file type
a = f.read() # Convert file type into string
split_list = a.split("\r") # Removes \r
split_list[0:5] # Viewing the list
# Convert from lists to 'list of lists'
final_list = []
for row in split_list:
    split_list = row.split(',')  # Split list by comma delimiter
    final_list.append(split_list)
print(final_list[0:5])
#CREATING INITIAL BLANK LISTS FOR OUTPUTTING DATA
legit = []
fraud = []
What I need to do next is sort each record into the fraud or legit list of lists. A record is considered fraudulent under the following condition, in which case it goes to the fraud list.
Logic to assign a row to the fraud list: the CustomerID performs the same EventType within the last 4 hours.
For example, row 2 (event 2) in the sample data set above would be moved to the fraud list because event 1 happened within the last 4 hours. On the other hand, event 4 would go to the legit list because there are no Twitter records that happened in the last 4 hours.
The data set is in chronological order.
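For reference, 4 hours in these serial-day units is 4/24 ≈ 0.1667 days; a quick sanity check of the rule against the sample rows above:

FOUR_HOURS = 4.0 / 24.0        # ~0.1667 days
gap = 42373.316 - 42373.31586  # events 2 and 1, both Facebook for customer 6
print(gap < FOUR_HOURS)        # True -> event 2 belongs in the fraud list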
This solution groups by CustomerID and EventType and then checks if the previous event time occurred less than (lt) 4 hours ago (4. / 24).
df['possible_fraud'] = (
    df.groupby(['CustomerID', 'EventType'])
      .EventTime
      .transform(lambda group: group - group.shift())
      .lt(4. / 24))
>>> df
CustomerID EventID EventType EventTime possible_fraud
0 6 1 Facebook 42373.31586 False
1 6 2 Facebook 42373.31600 True
2 6 3 Web 42374.32921 False
3 6 4 Twitter 42377.14913 False
4 6 5 Facebook 42377.40598 False
5 6 6 Web 42378.31245 False
>>> df[df.possible_fraud]
CustomerID EventID EventType EventTime possible_fraud
1 6 2 Facebook 42373.316 True
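If you then need the two lists the question asks for, boolean indexing on that column is enough (a small sketch):

fraud = df[df.possible_fraud]
legit = df[~df.possible_fraud]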
Of course, a pandas-based solution is smarter, but here is an example using just a plain dictionary.
P.S. Try to handle the input and output yourself.
#!/usr/bin/python2.7
sample ="""
6 1 Facebook 42373.31586
6 2 Facebook 42373.316
6 3 Web 42374.32921
6 4 Twitter 42377.14913
5 5 Web 42377.3541
6 6 Facebook 42377.40598
6 7 Web 42378.31245
"""
last = {} # This dict will contain recent time
#values of events by client ID, for ex.:
#{"6": {"Facebook": 42373.31586, "Web": 42374.32921}}
legit = []
fraud = []
for row in sample.split('\n')[1:-1:]:
    Cid, Eid, Type, Time = row.split()
    if Cid not in last.keys():
        legit.append(row)
        last[Cid] = {Type: Time}
        row += '\tlegit'
    else:
        if Type not in last[Cid].keys():
            legit.append(row)
            last[Cid][Type] = Time
            row += '\tlegit'
        else:
            if float(Time) - float(last[Cid][Type]) > (4. / 24):
                legit.append(row)
                last[Cid][Type] = Time
                row += '\tlegit'
            else:
                fraud.append(row)
                row += '\tfraud'
    print row
I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:  # numeric placeholder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without a state change) at the same times, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion, or point me to another SO question like this that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
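One likely cause of that TypeError: np.array(zip(...)) upcasts the mixed string/number rows to strings, so the intensity column ends up as text. A minimal, hedged fix (assuming the column names above) is to coerce it back to numeric before the groupby:

allPeopleDf['intensity'] = allPeopleDf['intensity'].astype(float)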
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
[10,'B',7],
[13,'C',10],
[15,'A',15],
[20,'A',7],
[23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop('key',1)
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time").sum()["increment"].cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
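If you want those pairs as an explicit table or a quick step plot, here is a small sketch continuing from the df above (the step style is my assumption about how the line should be drawn):

import matplotlib.pyplot as plt

totals = df.groupby("time")["increment"].sum().cumsum()
print(list(totals.items()))          # [(10, 12), (13, 22), (15, 32), (20, 24), (23, 14)]
totals.plot(drawstyle="steps-post")  # intensity stays constant until the next change
plt.show()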
EDIT: applying the specific data presented in the question
Assuming the data comes as a list of values, starting with the element id (person/piano key), then a factor multiplying the measured weights/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried / intensity being emitted). Not sure if I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in xrange(len(states)):
        j = 2 + 2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
df = read_data(data1, known_states, DF_columns)
df = df.append(read_data(data2, known_states, DF_columns), ignore_index=True)
df = df.append(read_data(data3, known_states, DF_columns), ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
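Putting it together, the same two steps from the top of this answer then apply to the loaded df (a sketch, with 'id' in place of 'key'):

df["increment"] = df.groupby("id")["intensity"].transform(
    lambda x: x.sub(x.shift(), fill_value=0))
total_intensity = df.groupby("time")["increment"].sum().cumsum()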
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32