I have this code that takes a CSV file, filters it by a column, and then plots the values of another column.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv(r'C:/Desktop/Plot/dataframe.csv', delimiter=";", encoding='unicode_escape')
df['num_1'] = df['num_2'].str.split(',').str[0]       # keep the integer part before the decimal comma
df['num_1'] = df['num_1'].astype('int64', copy=False)
# num_2 uses a decimal comma, so convert to float before plotting
X = df[df['Describe'] == 'The Start of Journey']['num_2'].str.replace(',', '.').astype(float).values
dev_x = X
# Set figure size
plt.figure(figsize=(10, 5))
plt.hist(dev_x, bins=5)
plt.title('Data')
plt.show()
This is the dataset:
+-----+-------+-----------------------+--------+--------+
| | name | Describe | num_1 | num_2 |
+-----+-------+-----------------------+--------+--------+
| 0 | er | The Start of Journey | 17 | 249,5 |
| 1 | NaN | NaN | 58 | 51,0 |
| 2 | NaN | NaN | 14 | 66,5 |
| 3 | NaN | NaN | 526 | 84,0 |
| 4 | be | The end of journey | 3 | 13,0 |
| 5 | tg | Levels | 342 | 34,0 |
| 6 | NaN | NaN | 231 | 55,6 |
| 7 | NaN | NaN | 23 | 75,0 |
| 8 | tf | counts | 54 | 34,6 |
| 9 | sf | The Start of Journey | 52 | 4324,0 |
| 10 | gd | The Start of Journey | 352 | 54.0 |
+-----+-------+-----------------------+--------+--------+
I want to modify the code so it does the following:
Prompt the user to add the CSV file.
Prompt the user to add the name of the column we want to filter by (in this case the Describe column).
Prompt the user to add the string to filter on (in this case The Start of Journey).
Prompt the user to add the name of the column whose data we want to plot (in this case num_2).
I have checked other sources, but due to the structure of the code I am having trouble with this.
Use the input() function. You can have a variable like x and do x = input("Enter the CSV path>>> ") (or something similar); x will then be a string containing whatever the user typed, and you can use it later. For example, instead of 'Describe' you could just put x.
x = input("Enter the csv path>>>") # returns answer in string form
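Putting that together with the original script, a minimal sketch might look like this (the prompt wording is just an example, and the decimal-comma conversion is an assumption based on the sample data):
import pandas as pd
import matplotlib.pyplot as plt

# Prompt the user for each piece of information
csv_path = input("Enter the CSV path >>> ")
filter_col = input("Enter the column to filter by >>> ")   # e.g. Describe
filter_val = input("Enter the string to filter on >>> ")   # e.g. The Start of Journey
plot_col = input("Enter the column to plot >>> ")          # e.g. num_2

df = pd.read_csv(csv_path, delimiter=";", encoding='unicode_escape')

# Filter the rows, then convert decimal commas to floats before plotting
values = df[df[filter_col] == filter_val][plot_col].str.replace(',', '.').astype(float)

plt.figure(figsize=(10, 5))
plt.hist(values, bins=5)
plt.title('Data')
plt.show()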
Related
I have this massive dataset and I need to subset the data by using criteria. This is for illustration:
| Group | Name    | Value |
|-------|---------|-------|
| A     | Bill    | 256   |
| A     | Jack    | 268   |
| A     | Melissa | 489   |
| B     | Amanda  | 787   |
| B     | Eric    | 485   |
| C     | Matt    | 1236  |
| C     | Lisa    | 1485  |
| D     | Ben     | 785   |
| D     | Andrew  | 985   |
| D     | Cathy   | 1025  |
| D     | Suzanne | 1256  |
| D     | Jim     | 1520  |
I know how to handle this problem manually, such as:
import pandas as pd

df = pd.read_csv('Test.csv')
# One NumPy array per group, written out by hand
A = df[df.Group == "A"].to_numpy()
B = df[df.Group == "B"].to_numpy()
C = df[df.Group == "C"].to_numpy()
D = df[df.Group == "D"].to_numpy()
But considering the size of the data, it would take a lot of time to handle it this way.
With that in mind, I would like to know if it is possible to build an iteration with an IF statement that looks at the values in the "Group" column (table above). I was thinking of an IF statement that checks whether the first value is the same as the one below it; if so, group them and create a new array/DataFrame.
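Rather than an explicit IF loop, pandas can split the frame by every distinct value in Group in one pass with groupby; a minimal sketch:
import pandas as pd

df = pd.read_csv('Test.csv')

# One array per distinct Group value, built in a single pass
arrays = {name: sub.to_numpy() for name, sub in df.groupby('Group')}

print(arrays['A'])  # same result as df[df.Group == "A"].to_numpy()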
My aim is to zero-pad my data so that all the subset datasets have equal length. I have data as follows:
|server| users | power | Throughput range | time |
|:----:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 1 | [8, 6,2,7] | -6.4528433 | [6.2343, 7.0974845] | 1 |
| 2 | [9,12,10,11] | -3.5322451 | [4.31240, 4.9073840]| 2 |
| 3 | [14,13,16,17]| -5.9752843 | [5.2243, 5.2974843] | 3 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 4 |
| 1 | [22,23,24,25]| -9.884843 | [8.00843, 8.0974843]| 5 |
| 2 | [27,26,28,29]| -2.3984843 | [7.23843, 8.2094845]| 6 |
| 3 | [30,32,31,33]| -4.5654566 | [3.1233, 4.2474643] | 7 |
| 1 | [36,34,37,35]| -1.2974652 | [3.12843, 4.2474643]| 8 |
| 2 | [40,41,38,39]| -3.5322451 | [4.31240, 4.9073840]| 9 |
| 1 | [42,43,45,44]| -5.9752843 | [6.31240, 6.9073840]| 10 |
The aim is to analyze individual servers by their respective data, which was done using the code below:
# grp is the full DataFrame shown above
c0 = grp['server'].values == 0
c0_new = grp[c0]
server0 = pd.DataFrame(c0_new)
c1 = grp['server'].values == 1
c1_new = grp[c1]
server1 = pd.DataFrame(c1_new)
c2 = grp['server'].values == 2
c2_new = grp[c2]
server2 = pd.DataFrame(c2_new)
c3 = grp['server'].values == 3
c3_new = grp[c3]
server3 = pd.DataFrame(c3_new)
The results of this code provide the different servers and their respective data features. For example, the server0 output becomes:
| server | users | power | Throughput range | time |
|:------:|:--------------:|:--------------:|:--------------------:|:-----:|
| 0 | [5, 3,4,1] | -4.2974843 | [5.23243, 5.2974843]| 0 |
| 0 | [22,18,19,21]| -1.2974652 | [3.12843, 4.2474643]| 1 |
The results obtained for individual servers have different lengths so I tried padding using the code below:
from keras.preprocessing.sequence import pad_sequences
man = [server0, server1, server2, server3]
new = pad_sequences(man)
The results show that the padding works and all the servers end up with equal length, but the output no longer contains the column names. I want the final data to keep the columns. Any suggestions?
The aim is to apply machine learning to the data, and I would like to have the subsets concatenated. This is what I later did, and it worked for my application.
from sklearn.preprocessing import MinMaxScaler

man = [server0, server1, server2, server3]
for cel in man:
    cel.set_index('time', inplace=True)        # use time as the index
    cel.drop(['users'], axis=1, inplace=True)  # drop the list-valued users column

scl = MinMaxScaler()
vals = [cel.values.reshape(cel.shape[0], 1) for cel in man]
I then applied pad_sequences and it worked as follows:
from keras.preprocessing.sequence import pad_sequences
new = pad_sequences(vals)
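If you need the column names back after padding, one option (a sketch, assuming all subsets share the columns of man[0]) is to rebuild a labelled DataFrame from each padded array:
import pandas as pd
from keras.preprocessing.sequence import pad_sequences

cols = man[0].columns  # the shared column names, saved before padding

# dtype='float32' keeps the float values; the default int32 would truncate them
padded = pad_sequences([cel.values for cel in man], dtype='float32')

# Rebuild one labelled DataFrame per server from the padded 2-D arrays
padded_frames = [pd.DataFrame(arr, columns=cols) for arr in padded]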
I'm trying to find a trend in attendance. I filtered my existing df down to this so I can look at one activity at a time.
+---+-----------+-------+----------+-------+---------+
| | Date | Org | Activity | Hours | Weekday |
+---+-----------+-------+----------+-------+---------+
| 0 | 8/3/2020 | Org 1 | Gen Ab | 10.5 | Monday |
| 1 | 8/25/2020 | Org 1 | Gen Ab | 2 | Tuesday |
| 3 | 8/31/2020 | Org 1 | Gen Ab | 8.5 | Monday |
| 7 | 8/10/2020 | Org 2 | Gen Ab | 1 | Monday |
| 8 | 8/14/2020 | Org 3 | Gen Ab | 3.5 | Friday |
+---+-----------+-------+----------+-------+---------+
This code:
gen_ab = att_df.loc[att_df['Activity'] == "Gen Ab"]
sum_gen_ab = gen_ab.groupby(['Date', 'Activity']).sum()
sum_gen_ab.head()
Returns this:
+------------+----------+------------+
| | | Hours |
+------------+----------+------------+
| Date | Activity | |
| 06/01/2020 | Gen Ab | 347.250000 |
| 06/02/2020 | Gen Ab | 286.266667 |
| 06/03/2020 | Gen Ab | 169.583333 |
| 06/04/2020 | Gen Ab | 312.633333 |
| 06/05/2020 | Gen Ab | 317.566667 |
+------------+----------+------------+
How do I make the summed column name 'Hours'? I still get the same result when I do this:
sum_gen_ab['Hours'] = gen_ab.groupby(['Date', 'Activity']).sum()
What I eventually want to do is have a line graph that shows the sum of hours for the activity over time. The time of course would be the dates in my df.
plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()
returns KeyError: Date
Once you've used groupby(['Date', 'Activity']), Date and Activity have been transformed into indices and can't be referenced with sum_gen_ab['Date'].
To avoid transforming them to indices you can use groupby(['Date', 'Activity'], as_index=False) instead.
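Applied to the code from the question, a minimal sketch:
# as_index=False keeps Date and Activity as ordinary columns
sum_gen_ab = gen_ab.groupby(['Date', 'Activity'], as_index=False).sum()

plt.plot(sum_gen_ab['Date'], sum_gen_ab['Hours'])
plt.show()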
I typically use the pandasql library to manipulate my data frames into different datasets. It allows you to manipulate a pandas data frame with SQL code and can be used alongside pandas.
EXAMPLE:
import pandas as pd
import pandasql as psql

df = ...  # your dataset

new_dataset = psql.sqldf('''
    SELECT Date, Activity, SUM(Hours) AS Sum_Of_Hours
    FROM df
    GROUP BY Date, Activity''', locals())  # locals() makes df visible to pandasql

new_dataset.head()  # shows the first 5 rows of your dataset
I have a dataframe like this, which is an application log:
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User | ReportingSubId | RecordLockTime | EndTime | Duration | DurationConverted | ActionType |
+---------+----------------+----------------+---------+----------+-------------------+------------+
| User 5 | 21 | 06:19.6 | 06:50.5 | 31 | 00:00:31 | Edit |
| User 4 | 19 | 59:08.6 | 59:27.6 | 19 | 00:00:19 | Add |
| User 25 | 22 | 29:09.4 | 29:37.0 | 28 | 00:00:28 | Edit |
| User 10 | 19 | 28:36.9 | 33:37.0 | 300 | 00:05:00 | Add |
| User 27 | 22 | 13:27.7 | 16:54.9 | 207 | 00:03:27 | Edit |
| User 5 | 21 | 11:22.8 | 12:37.3 | 75 | 00:01:15 | Edit |
+---------+----------------+----------------+---------+----------+-------------------+------------+
I wanted to visualize the duration of adds and edits for each user, and a Gantt chart seemed ideal for me.
I was able to do it for a sample dataframe of 807 rows with the following code:
import plotly.plotly as py
import plotly.figure_factory as ff

# counts is a precomputed per-user table with "Add" and "Edit" totals (built elsewhere)
data = []
for row in df_temp.itertuples():
    data.append(dict(Task=str(row.User), Start=str(row.RecordLockTime), Finish=str(row.EndTime), Resource=str(row.ActionType)))

colors = {'Add': 'rgb(110, 244, 65)',
          'Edit': 'rgb(244, 75, 66)'}

fig = ff.create_gantt(data, colors=colors, index_col='Resource', show_colorbar=True, group_tasks=True)

for i in range(len(fig["data"]) - 2):
    text = "User: {}<br>Start: {}<br>Finish: {}<br>Duration: {}<br>Number of Adds: {}<br>Number of Edits: {}".format(
        df_temp["User"].loc[i],
        df_temp["RecordLockTime"].loc[i],
        df_temp["EndTime"].loc[i],
        df_temp["DurationConverted"].loc[i],
        counts[counts["User"] == df_temp["User"].loc[i]]["Add"].iloc[0],
        counts[counts["User"] == df_temp["User"].loc[i]]["Edit"].iloc[0])
    fig["data"][i].update(text=text, hoverinfo="text")

fig['layout'].update(autosize=True, margin=dict(l=150))
py.iplot(fig, filename='gantt-group-tasks-together', world_readable=True)
and I am more than happy with the result: https://plot.ly/~pawelty/90.embed
However, my original df has more users and 2500 rows in total. That seems to be too much for Plotly; I get a 502 error.
I am a huge fan of Plotly, but I might have reached its limit. Can I change something in order to visualize it with Plotly? Is there any other tool I could use?
I started using plotly.offline.plot(fig) to plot offline; it worked much faster and I got fewer errors. I also have the problem that my graph doesn't get displayed, or sometimes only in fullscreen mode...
I import plotly instead of plotly.plotly, though; otherwise it doesn't work.
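A minimal sketch of the offline variant (using the fig built above; offline plotting avoids the upload that triggered the 502):
import plotly.offline as py

# Renders the chart to a local HTML file instead of uploading it to the Plotly cloud
py.plot(fig, filename='gantt-group-tasks-together.html')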
I come from an SPSS background and I want to declare missing values in a Pandas DataFrame.
Consider the following dataset from a Likert Scale:
SELECT COUNT(*),v_6 FROM datatable GROUP BY v_6;
+----------+------+
| COUNT(*) | v_6  |
+----------+------+
|     1268 | NULL |
|        2 |  -77 |
|     3186 |    1 |
|     2700 |    2 |
|      512 |    3 |
|       71 |    4 |
|       17 |    5 |
|       14 |    6 |
+----------+------+
I have a DataFrame
pdf = psql.frame_query('SELECT * FROM datatable', con)
The null values are already declared as NaN; now I want -77 also to be a missing value.
In SPSS I am used to:
MISSING VALUES v_6 (-77).
Now I am looking for the Pandas counterpart.
I have read:
http://pandas.pydata.org/pandas-docs/stable/missing_data.html
but I honestly do not see how the proposed approach applies to my case...
Use pandas.Series.replace():
import numpy as np

pdf['v_6'] = pdf['v_6'].replace(-77, np.nan)
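If several codes should count as missing (here -99 is a hypothetical second code), replace() also accepts a list:
import numpy as np

# Treat every listed sentinel code as missing
pdf['v_6'] = pdf['v_6'].replace([-77, -99], np.nan)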