I have a dataframe like this which is an application log:
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User | ReportingSubId | RecordLockTime | EndTime | Duration | DurationConverted | ActionType |
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User 5 | 21 | 06:19.6 | 06:50.5 | 31 | 00:00:31 | Edit |
| User 4 | 19 | 59:08.6 | 59:27.6 | 19 | 00:00:19 | Add |
| User 25 | 22 | 29:09.4 | 29:37.0 | 28 | 00:00:28 | Edit |
| User 10 | 19 | 28:36.9 | 33:37.0 | 300 | 00:05:00 | Add |
| User 27 | 22 | 13:27.7 | 16:54.9 | 207 | 00:03:27 | Edit |
| User 5 | 21 | 11:22.8 | 12:37.3 | 75 | 00:01:15 | Edit |
+---------+----------------+----------------+---------+----------+-------------------+------------+
I wanted to visualize the duration of adds and edits for each user, and a Gantt chart seemed ideal for me.
I was able to do it for a sample dataframe of 807 rows with the following code:
import plotly.plotly as py
import plotly.figure_factory as ff

# Build the task list expected by create_gantt
data = []
for row in df_temp.itertuples():
    data.append(dict(Task=str(row.User), Start=str(row.RecordLockTime),
                     Finish=str(row.EndTime), Resource=str(row.ActionType)))

colors = {'Add': 'rgb(110, 244, 65)',
          'Edit': 'rgb(244, 75, 66)'}

fig = ff.create_gantt(data, colors=colors, index_col='Resource',
                      show_colorbar=True, group_tasks=True)

# Attach custom hover text to each bar; the last two traces are helpers
# added by create_gantt, so skip them
for i in range(len(fig["data"]) - 2):
    text = "User: {}<br>Start: {}<br>Finish: {}<br>Duration: {}<br>Number of Adds: {}<br>Number of Edits: {}".format(
        df_temp["User"].loc[i],
        df_temp["RecordLockTime"].loc[i],
        df_temp["EndTime"].loc[i],
        df_temp["DurationConverted"].loc[i],
        counts[counts["User"] == df_temp["User"].loc[i]]["Add"].iloc[0],
        counts[counts["User"] == df_temp["User"].loc[i]]["Edit"].iloc[0])
    fig["data"][i].update(text=text, hoverinfo="text")

fig['layout'].update(autosize=True, margin=dict(l=150))
py.iplot(fig, filename='gantt-group-tasks-together', world_readable=True)
and I am more than happy with the result: https://plot.ly/~pawelty/90.embed
However, my original df has more users and 2500 rows in total. That seems to be too much for Plotly: I get a 502 error.
I am a huge fan of Plotly, but I might have reached its limit. Can I change something in order to visualize it with Plotly? Is there any other tool I could use?
I started using plotly.offline.plot(fig) to plot offline, and it worked much faster with fewer errors. I still have the problem that my graph sometimes doesn't get displayed, or only displays in fullscreen mode...
I import plotly instead of plotly.plotly for this, though, otherwise it doesn't work.
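For reference, a minimal sketch of the offline workflow (the output filename below is my own choice; fig is the figure built above):

import plotly.offline as pyo

# Render the figure to a standalone HTML file locally instead of uploading
# it to the Plotly cloud service, avoiding the server round-trip that
# produced the 502 for large figures
pyo.plot(fig, filename='gantt_offline.html', auto_open=True)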
I'm trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been transformed into index levels and can't be referenced with sum_gen_ab['Date'].
To avoid transforming them into index levels, you can use groupby(['Date', 'Activity'], as_index=False) instead.
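A minimal sketch of the whole flow with that fix (column names taken from the question):

import matplotlib.pyplot as plt

# as_index=False keeps Date and Activity as ordinary columns
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False).sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()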
I will typically use the pandasql library to manipulate my data frames into different datasets. This allows you to manipulate your pandas data frame with SQL code. Pandasql can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql
df = "will be your dataset"
new_dataset = psql.sqldf('''
SELECT DATE, ACTIVITY, SUM(HOURS) as SUM_OF_HOURS
FROM df
GROUP BY DATE, ACTIVITY''')
new_dataset.head() #Shows the first 5 rows of your dataset
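Applied to the attendance question above, that could look something like this (an untested sketch; pandasql will pick up att_df from the calling scope):

sum_gen_ab = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS Hours
    FROM att_df
    WHERE Activity = 'Gen Ab'
    GROUP BY Date, Activity''')
sum_gen_ab.head()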
I have this code that takes a CSV file, filters it by a column, and then plots the values of another column.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv(r'C:/Desktop/Plot/dataframe.csv', delimiter=";", encoding='unicode_escape')

# num_2 uses a comma as decimal separator; keep the integer part and cast it
df['num_1'] = df['num_2'].str.split(',').str[0]
df['num_1'] = df['num_1'].astype('int64', copy=False)

# Filter by the Describe column; num_1 holds the numeric version of num_2
X = df[df['Describe'] == 'The Start of Journey']['num_1'].values
dev_x = X

# Set figure size and plot a histogram
plt.figure(figsize=(10, 5))
plt.hist(dev_x, bins=5)
plt.title('Data')
plt.show()
This is the dataset
+-----+-------+-----------------------+--------+--------+
| | name | Describe | num_1 | num_2 |
+-----+-------+-----------------------+--------+--------+
| 0 | er | The Start of Journey | 17 | 249,5 |
| 1 | NaN | NaN | 58 | 51,0 |
| 2 | NaN | NaN | 14 | 66,5 |
| 3 | NaN | NaN | 526 | 84,0 |
| 4 | be | The end of journey | 3 | 13,0 |
| 5 | tg | Levels | 342 | 34,0 |
| 6 | NaN | NaN | 231 | 55,6 |
| 7 | NaN | NaN | 23 | 75,0 |
| 8 | tf | counts | 54 | 34,6 |
| 9 | sf | The Start of Journey | 52 | 4324,0 |
| 10 | gd | The Start of Journey | 352 | 54.0 |
+-----+-------+-----------------------+--------+--------+
I want to modify the code so it does the following:
Prompt the user to add the csv file
Prompt the user to add the name of the column we want to filter by (in this case the Describe column)
Prompt the user to add the string (in this case The Start of Journey)
Prompt the user to add the name of the column whose data we want to plot (in this case num_2)
I have checked other sources, but due to the structure of the code I am having trouble with this.
Use the input() function. You can have a variable like x and do x = input("Enter the CSV path>>> ") (or something similar), and x will be a string containing whatever the user typed. Then you can use x later; for example, instead of 'Describe' you could just put x.
x = input("Enter the csv path>>>") # returns the answer as a string
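Putting that together with the code from the question, a rough sketch might look like this (untested; the prompt texts and variable names are my own):

import pandas as pd
import matplotlib.pyplot as plt

csv_path = input("Enter the CSV path>>> ")
filter_col = input("Enter the column to filter by>>> ")
filter_val = input("Enter the value to filter for>>> ")
plot_col = input("Enter the column to plot>>> ")

df = pd.read_csv(csv_path, delimiter=";", encoding='unicode_escape')

# Same comma-decimal cleanup as in the question, applied to the chosen column
values = df.loc[df[filter_col] == filter_val, plot_col]
values = values.str.split(',').str[0].astype('int64')

plt.figure(figsize=(10, 5))
plt.hist(values, bins=5)
plt.title('Data')
plt.show()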
I have a dataframe that looks like this below with Date, Price and Serial.
+----------+--------+--------+
| Date | Price | Serial |
+----------+--------+--------+
| 2/1/1996 | 0.5909 | 1 |
| 2/1/1996 | 0.5711 | 2 |
| 2/1/1996 | 0.5845 | 3 |
| 3/1/1996 | 0.5874 | 1 |
| 3/1/1996 | 0.5695 | 2 |
| 3/1/1996 | 0.584 | 3 |
+----------+--------+--------+
I would like to make it look like this, where Serial becomes the column name and the data sorts itself into the correct Date row and Serial column.
+----------+--------+--------+--------+
| Date | 1 | 2 | 3 |
+----------+--------+--------+--------+
| 2/1/1996 | 0.5909 | 0.5711 | 0.5845 |
| 3/1/1996 | 0.5874 | 0.5695 | 0.584 |
+----------+--------+--------+--------+
I understand I can do this via a loop but just wondering if there is a more efficient way to do this?
Thanks for your kind help. Also curious if there is a better way to paste such tables rather than attaching images in my questions =x
You can use pandas.pivot_table:
import numpy as np

res = df.pivot_table(index='Date', columns='Serial', values='Price', aggfunc=np.sum)\
        .reset_index()
res.columns.name = ''
Date 1 2 3
0 2/1/1996 0.5909 0.5711 0.5845
1 3/1/1996 0.5874 0.5695 0.5840
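Since each (Date, Serial) pair appears only once here, plain DataFrame.pivot (no aggregation needed) would also work, e.g.:

res = df.pivot(index='Date', columns='Serial', values='Price').reset_index()
res.columns.name = ''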
I have an sql table with the below data:
select day, counter from table1 where day LIKE '%201601%';
+----------+---------+
| day | counter |
+----------+---------+
| 20160101 | 125777 |
| 20160102 | 31720 |
| 20160105 | 24981 |
| 20160106 | 240366 |
| 20160107 | 270560 |
| 20160108 | 268788 |
| 20160109 | 254286 |
| 20160110 | 218154 |
| 20160111 | 250186 |
| 20160112 | 94532 |
| 20160113 | 71437 |
| 20160114 | 71121 |
| 20160115 | 71135 |
| 20160116 | 71325 |
| 20160117 | 209762 |
| 20160118 | 210305 |
| 20160119 | 257627 |
| 20160120 | 306353 |
| 20160121 | 214687 |
| 20160122 | 214680 |
| 20160123 | 149844 |
| 20160124 | 133741 |
| 20160125 | 82404 |
| 20160126 | 71403 |
| 20160127 | 71437 |
| 20160128 | 72005 |
| 20160129 | 71417 |
| 20160130 | 0 |
| 20160131 | 69937 |
+----------+---------+
I have a Python script which I run with:
python myapp.py January
The January variable includes the query below:
January = """select day, counter from table1 where day LIKE '%201601%';"""
What I would like to do is run the script with different flags and have it calculate the sum for all of the days of the:
last month,
this month,
last week,
last two weeks
specific month.
At the moment I have different variables for the months of 2016; this way the script will become huge, and I am sure there is an easier way to do this.
cursor = db.cursor()
# eval() maps the flag name (e.g. "January") to the query variable of the same name
cursor.execute(eval(sys.argv[1]))
display = cursor.fetchall()
# Sum the counter column over all returned rows
MonthlyLogs = sum(int(row[1]) for row in display)
The point of this is that I want to see discrepancies in the data. My overall aim is to display this data in PHP at a later date, but for now I would like it to be written to a file, which it currently is.
What is the best way to achieve this?
For the first part, you can get the current month and year using datetime.
from datetime import datetime

currentMonth = datetime.now().month
currentYear = datetime.now().year
# Zero-pad the month so January gives "201601" rather than "20161"
parameter = '{}{:02d}'.format(currentYear, currentMonth)
Passing this parameter in your query should work.
For the second part, you may define a function with a flag that, when set to 1, will set the parameter as above; otherwise the parameter's value will be sys.argv[1].
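A rough sketch of how the flags could be mapped onto month prefixes without one query variable per month (untested; it reuses table1 and the cursor from the question, everything else is assumed):

import sys
from datetime import date, timedelta

def build_query(flag):
    today = date.today()
    if flag == 'this_month':
        prefix = today.strftime('%Y%m')
    elif flag == 'last_month':
        # Last day of the previous month, then format its year and month
        last_of_prev = today.replace(day=1) - timedelta(days=1)
        prefix = last_of_prev.strftime('%Y%m')
    else:
        prefix = flag  # a specific month such as "201601"
    return "select day, counter from table1 where day LIKE '%{}%';".format(prefix)

cursor.execute(build_query(sys.argv[1]))
MonthlyLogs = sum(int(row[1]) for row in cursor.fetchall())

The week-based ranges could be handled the same way by computing the start and end days with timedelta and using BETWEEN instead of LIKE.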
I come from a SPSS background and I want to declare missing values in a Pandas DataFrame.
Consider the following dataset from a Likert Scale:
SELECT COUNT(*),v_6 FROM datatable GROUP BY v_6;
+----------+------+
| COUNT(*) | v_6  |
+----------+------+
| 1268     | NULL |
| 2        | -77  |
| 3186     | 1    |
| 2700     | 2    |
| 512      | 3    |
| 71       | 4    |
| 17       | 5    |
| 14       | 6    |
+----------+------+
I have a DataFrame
pdf = psql.frame_query('SELECT * FROM datatable', con)
The null values are already declared as NaN - now I want -77 also to be a missing value.
In SPSS I am used to:
MISSING VALUES v_6 (-77).
Now I am looking for the pandas counterpart.
I have read:
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
but I honestly do not see how the proposed approach would apply in my case...
Use pandas.Series.replace():
import numpy as np

pdf['v_6'] = pdf['v_6'].replace(-77, np.NaN)
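If more than one code marks a missing value (common with SPSS exports), replace also accepts a list; a small sketch, assuming -77 and -99 were both declared missing:

import numpy as np

# Replace every sentinel code with NaN in one pass
pdf['v_6'] = pdf['v_6'].replace([-77, -99], np.NaN)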