I have a CSV file that contains a record of a workflow: for each timestamp it stores the status. So I do have the time and day when something was done, but since I have already sorted the data in ascending order, which is sufficient for the next step, the timestamp is not included in this sample data. My sample data looks like this (the CSV files are attached, Example1.csv and Example2.csv; the preview in Google looks wrong because the decimal "," separator is not properly recognized):
As I said, these files are already sorted in ascending order, and the status can be imagined as steps of a workflow: work started, continued, finished, clean up. Like this:
Now I want to detect suspicious entries, for example someone who finished work without actually starting it, or other unusual "patterns". What I would like to have is an overview of all the different workflows.
1.
I would like to have the count / number of occurrences per unique workflow. I managed to implement this. My code is as follows:
import pandas as pd

df = pd.read_csv(r'C:\Users\PC\Desktop\Example2.csv', sep=";", decimal=",", encoding="utf-8-sig")
# fill missing statuses before the string conversion, otherwise NaN becomes the string 'nan'
df['Status'] = df['Status'].fillna('No')
df['Status'] = df['Status'].astype(str)
# build the workflow string per worker, e.g. "Started work|Clean up"
df = df.groupby(['Worker'])['Status'].apply('|'.join).reset_index()
# count how many workers share each workflow
df = df.groupby(['Status']).count()
df = df.rename(columns={'Worker': 'Count'})
#df['Sum']=df.groupby(['Amount']).sum()
df.to_csv(r'C:\Users\PC\Desktop\outtest.csv', sep=';', encoding="utf-8-sig")
Which works. I get the following output:
or in case of using numbers:
Which is exactly what I want. Here I can see for example that two workers started work and then directly cleaned up.
2.
Now I would like to have the sum of the amounts too. The amount is always the same for a given worker, so it does not vary between that worker's rows; for example, as shown in the sample data, worker 1 always has 2500,24. What I would like to have is this output:
I tried to implement it by adding a simple line:
df['Sum']=df.groupby(['Amount']).sum()
But this throws an error. The reason is that the Amount column is simply no longer available at this step. I could not figure out how to get this working.
How can I add the sum?
3.
I would like to "write the type of workflow which was counted for this worker" back to my original data file. So in my original data it should look like this (for simplicity reasons lets take the version where the status is represented with numbers):
How can I implement this?
(I thought about this, and it actually does not need to be combined with the results from my previous code. I basically just need to expand/transpose the status for each worker and write it to a new variable/column. The problem here is that I do not know in advance how many statuses/steps a worker has. So somehow I need to implement something like "if the next entry belongs to the same worker, then attach the value from Status with a "|" to an existing variable", and that becomes my new column. But maybe I am wrong here and there is another implementation.)
To calculate the sum of the amounts, we can first group by the Worker column to get the workflow and the amount for each worker (I'm taking 'first' for the amount since it's the same for all rows of the same worker). Then we group again on the workflow (which sits in the Status column after the first groupby) and calculate counts and sums:
import pandas as pd

df = pd.read_csv('Example2.csv', sep=';', decimal=',')
df['Status'] = df['Status'].astype(str)
z = df.groupby('Worker').agg({
    'Status': '|'.join,
    'Amount': 'first',
}).groupby('Status')['Amount'].agg(['count', 'sum']).reset_index()
# save and output
z.to_csv('outtest.csv', sep=';')
z
Output:
Status count sum
0 Started work 1 2900.00
1 Started work|Clean up 2 3600.18
2 Started work|Continued work|Finished|Clean up 2 6700.74
3 Started work|Continued work|Finished|Clean up|... 1 4200.98
To add workflow as a column, we can use transform:
df = pd.read_csv('Example1.csv', sep=';', decimal=',')
df['Status'] = df['Status'].astype(str)
# add workflow column
df['workflow'] = df.groupby('Worker')['Status'].transform('|'.join)
# save and output
df.to_csv('Example1_with_workflow.csv', sep=';', decimal=',')
df
Output (using the numeric Example1.csv here to make it more readable, but it will work with either of them, of course):
Worker Status Amount workflow
0 1 1 2500.24 1|2|3|4
1 1 2 2500.24 1|2|3|4
2 1 3 2500.24 1|2|3|4
3 1 4 2500.24 1|2|3|4
4 2 1 2400.00 1|4
5 2 4 2400.00 1|4
6 3 1 4200.98 1|2|3|4|5
7 3 2 4200.98 1|2|3|4|5
8 3 3 4200.98 1|2|3|4|5
9 3 4 4200.98 1|2|3|4|5
10 3 5 4200.98 1|2|3|4|5
11 4 1 1200.18 1|4
12 4 4 1200.18 1|4
13 5 1 4200.50 1|2|3|4
14 5 2 4200.50 1|2|3|4
15 5 3 4200.50 1|2|3|4
16 5 4 4200.50 1|2|3|4
17 6 1 2900.00 1
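Once the workflow column exists, the suspicious entries mentioned at the beginning can be flagged with plain string checks. This is only a minimal sketch, assuming the text labels of Example2.csv ('Started work', 'Finished'); the column name 'suspicious' is made up for illustration:
# flag rows whose workflow does not begin with "Started work" or never reaches "Finished"
df['suspicious'] = (~df['workflow'].str.startswith('Started work')
                    | ~df['workflow'].str.contains('Finished'))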
P.S. If I read correctly, in (1) there was no question as everything worked as expected, right?
I am wondering if it is possible to have multiple index levels, similar to the picture, in which one of them (the second level in my case) counts automatically.
My problem is that I have data which needs to be updated repeatedly, and each entry belongs to either the category "Math" or "English". I would like to keep track of the first entry, second entry and so on for each category.
The trick is that I would like the second-level index to count automatically within each category, so that every time I add a new entry to the category "Math", for example, the second-level index is updated automatically.
Thanks for the help.
You can set_index() using a column and a computed series. In your case cumcount() does what you need.
df = pd.DataFrame({"category":np.random.choice(["English","Math"],15), "data":np.random.uniform(2,5,15)})
df2 = df.sort_values("category").set_index(["category", df.sort_values("category").groupby("category").cumcount()+1])
df2
output
data
category
English 1 2.163213
2 4.292678
3 4.227062
4 3.255596
5 3.376833
6 2.477596
Math 1 3.436956
2 3.275532
3 2.720285
4 2.181704
5 3.667757
6 2.683818
7 2.069882
8 3.155550
9 4.155107
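To address the "updates automatically" part: the second-level index is recomputed from the data, so after appending a new entry you simply rebuild it the same way. A small sketch under that assumption (the value 3.14 is made up):
new_row = pd.DataFrame({"category": ["Math"], "data": [3.14]})
df = pd.concat([df, new_row], ignore_index=True)
# re-sort and recompute the per-category counter; the new Math row gets the next number
df_sorted = df.sort_values("category")
df2 = df_sorted.set_index(["category", df_sorted.groupby("category").cumcount() + 1])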
So I am dealing with a large data file which has 1.3 million rows.
What I'm trying to do is simple, I want to change values in some columns given some conditions.
for i in range(0, len(data2)):  # len(data2) is about 1.3 million
    if data2.loc[i, 'PPA'] == 0:
        data1.loc[i, 'LDU'] = 0  # data1 and data2 have the same number of rows
I will also need to reformat some other columns. For example, I want to encode gender as either 0 or 1.
The data looks as follows:
data['Gender']
Out[156]:
0 F
1 M
2 F
3 F
..
1290573 M
1290574 F
Name: Gender, Length: 1290575, dtype: object
# Format to 0 and 1
for i in range(0, len(data)):
    if data.loc[i, 'Gender'] == 'F':
        data.loc[i, 'Gender'] = 0
    else:
        data.loc[i, 'Gender'] = 1
By the way, regarding the processing time, I noticed something unusual.
I saved the first 5000 rows to a new CSV file, and when I tested my code on that sample data it performed well and fast, finishing in about 10 seconds.
But when I run it on my real data and only let it do
for i in range(0, 10000):  # instead of the full length of the data
it takes about 9 minutes.
Last time I formatted another column like this (assigning 0 and 1) on my full data, it took more than 10 hours in Python. So I'm just wondering: is there anything wrong with my code? Is there a more efficient way to read and rewrite the data faster? ...
Any help would be appreciated! :)
I'm fairly new to Python and this is my first question post; thank you everyone for your comments :)
Instead of loops, you can try np.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Gender': ['M', 'F']})
df['Numeric_Gender'] = np.where(df.Gender == 'M', 1, 0)
df
Gender Numeric_Gender
M 1
F 0
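The same vectorized idea replaces the first loop from the question as well. A sketch, assuming data1 and data2 are row-aligned as stated:
# set LDU to 0 wherever PPA is 0, in one vectorized operation
mask = data2['PPA'] == 0
data1.loc[mask, 'LDU'] = 0
For the gender column, data['Gender'].map({'F': 0, 'M': 1}) would be an equally fast alternative.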
This is a problem I've encountered in various contexts, and I'm curious if I'm doing something wrong, or if my whole approach is off. The particular data/functions are not important here, but I'll include a concrete example in any case.
It's not uncommon to want a groupby/apply that does various operations on each group, and returns a new dataframe. An example might be something like this:
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    return pd.DataFrame({'diversity': [diversity], 'start': [start]})
So, this is a grouping function that generates a new DataFrame with two columns, each derived from a different operation on the input data. Again, the specifics aren't too important here, but this is the issue:
When I look at the output, I get something like this:
result = df.groupby('patch_idx').apply(patch_stats)
print result
diversity start
patch_idx
0 0 0.876161 2007-02-24 22:54:28
1 0 0.588997 2007-02-25 01:55:39
2 0 0.655306 2007-02-25 04:27:05
3 0 0.986047 2007-02-25 05:37:58
4 0 0.997020 2007-02-25 06:27:08
5 0 0.639499 2007-02-25 17:40:56
6 0 0.687874 2007-02-26 05:24:11
7 0 0.003714 2007-02-26 07:07:20
8 0 0.065533 2007-02-26 09:01:11
9 0 0.000000 2007-02-26 19:23:52
10 0 0.068846 2007-02-26 20:43:03
...
It's all good, except I have an extraneous, unnamed index level that I don't want:
print result.index.names
FrozenList([u'patch_idx', None])
Now, this isn't a huge deal; I can always get rid of the extraneous index level with something like:
result = result.reset_index(level=1,drop=True)
But seeing how this comes up any time I have a grouping function that returns a DataFrame, I'm wondering if there's a better approach to how I'm doing this. Is it bad form to have a grouping function that returns a DataFrame? If so, what's the right method to get the same kind of result? (Again, this is a general question about problems of this type.)
In your grouping function, return a Series instead of a DataFrame. Specifically, replace the last line of patch_stats with:
return pd.Series({'diversity':diversity, 'start':start})
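For context, here is what the grouping function from the question looks like with that change (same logic, only the return type differs):
def patch_stats(df):
    first = df.iloc[0]
    diversity = (len(df['artist_id'].unique()) / float(len(df))) * df['dist'].mean()
    start = first['ts']
    # returning a Series yields one row per group, so groupby/apply
    # does not add a second, unnamed index level
    return pd.Series({'diversity': diversity, 'start': start})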
I've encountered this same issue.
Solution
result = df.groupby('patch_idx', group_keys=False).apply(patch_stats)
print result
I have a pandas dataframe and I'd like to add a new column that has the contents of an existing column, but shifted relative to the rest of the data frame. I'd also like the value that drops off the bottom to get rolled around to the top.
For example if this is my dataframe:
>>> myDF
coord coverage
0 1 1
1 2 10
2 3 50
I want to get this:
>>> myDF_shifted
coord coverage coverage_shifted
0 1 1 50
1 2 10 1
2 3 50 10
(This is just a simplified example - in real life, my dataframes are larger and I will need to shift by more than one unit)
This is what I've tried and what I get back:
>>> myDF['coverage_shifted'] = myDF.coverage.shift(1)
>>> myDF
coord coverage coverage_shifted
0 1 1 NaN
1 2 10 1
2 3 50 10
So I can create the shifted column, but I don't know how to roll the bottom value around to the top. From internet searches I think that numpy lets you do this with "numpy.roll". Is there a pandas equivalent?
Pandas probably doesn't provide an off-the-shelf method to do exactly what you described; however, if you can move a little bit out of the box, numpy has exactly that.
In your case it is:
import numpy as np
myDF['coverage_shifted'] = np.roll(myDF.coverage, 1)
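Since you mention needing to shift by more than one unit: the second argument of np.roll controls the shift distance, for example (the value 3 here is arbitrary):
myDF['coverage_shifted'] = np.roll(myDF.coverage, 3)  # the last three values wrap around to the top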
You can pass an additional argument to shift() to achieve what you want, although the previous answer is much more helpful in most cases:
last_value = myDF.iloc[-1]['coverage']
myDF['coverage_shifted'] = myDF.coverage.shift(1, fill_value=last_value)
You have to supply the value to fill_value manually. Note that this only covers a one-step shift; for larger shifts every vacated position would receive the same fill value, so np.roll is the simpler choice there.
The same can be applied for rolling in the other direction:
first_value = myDF.iloc[0]['coverage']
myDF['coverage_back_shifted'] = myDF.coverage.shift(-1, fill_value=first_value)
I've been reading a huge (5 GB) gzip file in the form:
User1 User2 W
0 11 12 1
1 12 11 2
2 13 14 1
3 14 13 2
which is basically a directed-graph representation of connections among users with a certain weight W. Since the file is so big, I tried to read it with networkx, building a directed graph and then converting it to an undirected one, but it took too much time. So I was thinking of doing the same thing by analysing a pandas dataframe. I would like to turn the previous dataframe into the form:
User1 User2 W
0 11 12 3
1 13 14 3
where the links common to both directions have been merged into one, with W being the sum of the individual weights. Any help would be appreciated.
There is probably a more concise way, but this works. The main trick is just to normalize the data so that User1 is always the lower-numbered ID. Then you can use groupby, since 11,12 and 12,11 are now recognized as representing the same thing.
In [330]: df = pd.DataFrame({"User1":[11,12,13,14],"User2":[12,11,14,13],"W":[1,2,1,2]})
In [331]: df['U1'] = df[['User1','User2']].min(axis=1)
In [332]: df['U2'] = df[['User1','User2']].max(axis=1)
In [333]: df = df.drop(['User1','User2'],axis=1)
In [334]: df.groupby(['U1','U2'])['W'].sum()
Out[334]:
U1 U2
11 12 3
13 14 3
Name: W, dtype: int64
For more concise code that avoids creating new variables, you could replace the middle 3 steps with:
In [400]: df.loc[df.User1>df.User2,['User1','User2']] = df.loc[df.User1>df.User2,['User2','User1']].values
Note that column switching can be trickier than you'd think, see here: What is correct syntax to swap column values for selected rows in a pandas data frame using just one line?
As far as making this code fast in general, it will depend on your data. I don't think the code above will matter as much as other things you might do. For example, your problem should be amenable to a chunking approach, where you iterate over sections of the file and gradually shrink the data on each pass. In that case, the main thing you need to think about is sorting the data before chunking, so as to minimize how many passes you need to make. Doing it that way should allow you to do all the work in memory.
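A rough sketch of that chunked approach, under a few assumptions: the file name edges.txt.gz is made up, and the columns are whitespace-separated and named User1, User2, W as in the example:
import pandas as pd

partials = []
for chunk in pd.read_csv('edges.txt.gz', sep=r'\s+', compression='gzip', chunksize=1_000_000):
    # normalize so the smaller ID always comes first
    u1 = chunk[['User1', 'User2']].min(axis=1)
    u2 = chunk[['User1', 'User2']].max(axis=1)
    partials.append(pd.DataFrame({'U1': u1, 'U2': u2, 'W': chunk['W']})
                    .groupby(['U1', 'U2'])['W'].sum())

# each pass shrinks the data; combine the partial sums at the end
result = pd.concat(partials).groupby(level=['U1', 'U2']).sum()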