What can I do to visualize my dataframe in a proper way? - python

I have a dataframe in python that consists of two columns, [Combinations] and [counts]; the dataframe has 16369 rows, so there are 16369 combinations.
The combinations column consists of different combinations of departments (there are 14 different departments) working together on projects, and the counts column is the number of times they worked together. There are about 8191 rows with 0 as counts.
I was wondering what the proper way would be to plot such a dataframe. I was thinking of a heatmap, but this won't work because of all the unique values within the combinations column. How can I properly plot this (preferably in something like plotly)?
Combinations  counts
A,B           68
C,A           64
F,C           63
F,L           63
E,A           60
B,A           57
Q,L           56
A,B,C         55
L,N           54
C,L,A,C       53
A,F,B         52
F,H           51
C,V           50
Q,F           50
Z,X           49
C,X           49
A,P           49
K,Q           49
R,S           49

Have you tried to explore plotting these as sociograms? It seems like social network analysis will be an apt way to visualise the different relationships the departments have with each other.
You can try looking at this for some inspiration. Coursera has some courses you can explore too.
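Since each combination is just a set of departments, one sketch of the network-analysis idea is to expand every combination into its pairwise department edges and aggregate the counts; the resulting edge list can then feed a graph library such as networkx, or a 14x14 matrix for a plotly heatmap. The sample frame below is a hypothetical stand-in for the real data:

```python
import itertools
import pandas as pd

# Hypothetical sample resembling the question's data
df = pd.DataFrame({
    "Combinations": ["A,B", "C,A", "A,B,C"],
    "counts": [68, 64, 55],
})

# Expand each combination into pairwise department edges,
# so e.g. "A,B,C" contributes A-B, A-C and B-C
edges = []
for combo, n in zip(df["Combinations"], df["counts"]):
    depts = sorted(set(combo.split(",")))
    for a, b in itertools.combinations(depts, 2):
        edges.append((a, b, n))

# Sum the counts per unique department pair
edge_df = (pd.DataFrame(edges, columns=["src", "dst", "counts"])
             .groupby(["src", "dst"], as_index=False)["counts"].sum())
print(edge_df)
```

With the pairwise edge list in hand, a heatmap becomes feasible again because there are at most 14x14 cells rather than one per unique combination.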

Related

Iterate over certain columns with unique values and generate plots python

New to pandas and much help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored through unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was "Iterate through columns to generate separate plots in python", but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10]
You can also pass the df.nunique() call directly if you wish:
unique_columns = df.columns[df.nunique() < 10]

Average for similar looking data in a column using Pandas

I'm working on a large data with more than 60K rows.
I have a continuous measurement of current in a column. Each code is measured for about a second, during which the equipment takes 14/15/16/17 readings depending on its speed; the measurement then moves to the next code and again takes 14/15/16/17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The top 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where each group of 14/15/16/17 similar readings is averaged into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the data to look like
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done.
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or if it's really just curr(mA) and index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes based on the indexes you just pulled. Then you can go ahead and average them in a list comprehension.
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To get it to match your desired output above (the same values as a one-column dataframe rather than a list), convert the list to a pd.Series and call reset_index():
pd.Series(
    [df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index()
   index         0
0      0  1.349073
1      1  1.545564
2      2  1.749863
3      3  1.960851
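An equivalent way to get the same averages without np.split is to turn each jump into a cumulative group id and use groupby. The sketch below uses a small synthetic trace (the full 60K-row data isn't reproduced here), with three plateaus separated by jumps larger than 0.15:

```python
import numpy as np
import pandas as pd

# Synthetic current trace: three plateaus with jumps > 0.15 between them
df = pd.DataFrame({"Curr(mA)": [1.36, 1.34, 1.35, 1.55, 1.54, 1.56, 1.75, 1.74]})

# A jump starts a new group; the cumulative sum of jump flags is a group id
group_id = (df["Curr(mA)"].diff() > 0.15).cumsum()

# Average each plateau
means = df.groupby(group_id)["Curr(mA)"].mean().reset_index(drop=True)
print(means)
```

The groupby version also scales naturally if you later want other statistics per code, since you can swap mean() for agg(['mean', 'std']) on the same group id.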

generating a scatter plot using two different dataset in python pandas

I have two datasets. Both have different numbers of observations. Is it possible to generate a scatter plot between features from different datasets?
For example, I want to generate a scatter plot between the submission_day column of dataset 1 and the score column of dataset 2.
I am not sure how to do that using python packages.
For example consider the following two datasets:
id_student submission_day
23hv 100
24hv 99
45hv 10
56hv 16
53hv 34
id_student score
23hv 59
25gf 20
24hv 56
45hv 76
I think you need to merge into one DataFrame and then use DataFrame.plot.scatter:
df = df1.merge(df2, on='id_student')
print (df)
id_student submission_day score
0 23hv 100 59
1 24hv 99 56
2 45hv 10 76
df.plot.scatter(x='submission_day', y='score')
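A self-contained version of that answer, built from the sample data in the question (the plotting call is commented out so the merge result itself is visible), might look like:

```python
import pandas as pd

df1 = pd.DataFrame({"id_student": ["23hv", "24hv", "45hv", "56hv", "53hv"],
                    "submission_day": [100, 99, 10, 16, 34]})
df2 = pd.DataFrame({"id_student": ["23hv", "25gf", "24hv", "45hv"],
                    "score": [59, 20, 56, 76]})

# The default inner join keeps only students present in both frames,
# which is what makes the two differently-sized datasets plottable together
df = df1.merge(df2, on="id_student")
# df.plot.scatter(x="submission_day", y="score")  # draws the scatter
print(df)
```

The inner join is what resolves the "different numbers of observations" problem: only rows with a matching id_student in both datasets survive, so x and y line up row by row.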

Pandas: Import feature vectors from list of dictionaries into dataframe

I have a list of dictionaries, and each dictionary consists of two key-value pairs. The first key-value pair is the name of a person and the second one is a feature vector consisting of the grades each person achieved in different courses. For example:
ListOfGrades=[{'Name':"Mike", 'grades':[98,86,90,72]},{'Name':"Sasha", 'grades':[92,79,85,94]},{'Name':"Beth", 'grades':[89,89,76,90]}]
I want to import this data into a pandas dataframe such that each row has the label of a person's name with each column filled with their grades. In short, I need to get something like this:
Mike 98 86 90 72
Sasha 92 79 85 94
Beth 89 89 76 90
I know I should use pd.DataFrame(ListOfGrades), but I'm not sure how to set it for my purpose. I have seen Convert list of dictionaries to Dataframe, but it's different from the way I want to order my data in the data frame.
I have tried this:
for i in ListOfGrades:
ListOfGrades[i]=str(ListOfGrades[i]['grades'])
# Convert to dataframe
df = pd.DataFrame.from_dict(ListOfGrades, orient='index').reset_index()
But, python throws me an error:
ListOfGrades[i]=str(ListOfGrades[i]['grades'])
TypeError: list indices must be integers, not dict
Also, I don't know how to add the names to each row, such that the first column of my data frame consists of the name of people, like the way I want my data frame look (as I showed above). Any help is appreciated!
Try this:
df = pd.DataFrame.from_records(ListOfGrades, index='Name')['grades'].apply(pd.Series)
df
# 0 1 2 3
# Name
# Mike 98 86 90 72
# Sasha 92 79 85 94
# Beth 89 89 76 90
Adding data to list:
ListOfGrades=[{'Name':"Mike", 'grades':[98,86,90,72, 34]},{'Name':"Sasha", 'grades':[92,79,85,94,78]},{'Name':"Beth", 'grades':[89,89,76,90]}]
# 0 1 2 3 4
# Name
# Mike 98.0 86.0 90.0 72.0 34.0
# Sasha 92.0 79.0 85.0 94.0 78.0
# Beth 89.0 89.0 76.0 90.0 NaN
The reason you are getting an error is that i is already an item (in this case a dictionary) from the list, not an index. To make this work you could change your loop as follows:
for i in range(len(ListOfGrades)):
This will have the effect of making i a proper index. However, as I mentioned in my previous comment, there may be more practical ways of solving this problem, such as having a single dictionary where the keys are names and the values are lists of grades. This would mean you don't need a list of dictionaries.
Ok, this approach is a bit of a hack, and it will quickly run into problems if each student doesn't have the same number of grades, but essentially, you need to build a new list and create the dictionary from that list. For python 3.5:
new_list = []
for student in ListOfGrades:
new_list.append({'Name': student['Name'], **{'grade_'+str(i+1): grade for i, grade in enumerate(student['grades'])}})
df = pd.DataFrame(new_list)
This is the dataframe I'm getting:
Name grade_1 grade_2 grade_3 grade_4
0 Mike 98 86 90 72
1 Sasha 92 79 85 94
2 Beth 89 89 76 90
If you don't have python 3.5 but have a version of python 3, this should work:
new_list = []
for student in ListOfGrades:
new_list.append(dict(Name = student['Name'], **{'grade_'+str(i+1): grade for i, grade in enumerate(student['grades'])}))
df = pd.DataFrame(new_list)
Edited to add: The above should also work for python 2.7
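For completeness, the same frame can also be built without from_records or an explicit loop over keys, by pulling the names and grade lists out separately; a minimal sketch:

```python
import pandas as pd

ListOfGrades = [{'Name': "Mike", 'grades': [98, 86, 90, 72]},
                {'Name': "Sasha", 'grades': [92, 79, 85, 94]},
                {'Name': "Beth", 'grades': [89, 89, 76, 90]}]

# One row per student: the grade lists become the columns,
# and the names become the row index
df = pd.DataFrame([d['grades'] for d in ListOfGrades],
                  index=[d['Name'] for d in ListOfGrades])
print(df)
```

Like the apply(pd.Series) approach above, this handles students with different numbers of grades by filling the short rows with NaN.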

Pandas: reshape data with duplicate row names to columns

I have a data set that's sort of like this (first lines shown):
Sample Detector Cq
P_1 106 23.53152
P_1 106 23.152458
P_1 106 23.685083
P_1 135 24.465698
P_1 135 23.86892
P_1 135 23.723469
P_1 17 22.524242
P_1 17 20.658733
P_1 17 21.146122
Both "Sample" and "Detector" columns contain duplicated values ("Cq" is unique): to be precise, each "Detector" appears 3 times for each sample, because it's a replicate in the data.
What I need to do is to:
Reshape the table so that the columns contain Samples and rows Detectors
Rename the duplicate columns so that I know which replicate is it
I thought that DataFrame.pivot would do the trick, but it fails because of the duplicate data. What would be the best approach? Rename the duplicates, then reshape, or is there a better option?
EDIT: I thought over it and I think it's better to state the purpose. I need to store for each "Sample" the mean and standard deviation of their "Detector".
It looks like what you may be looking for is a hierarchical indexed dataframe.
Would something like this work?
# build a sample dataframe
import numpy as np
import pandas as pd

a = ['P_1'] * 9
b = [106, 106, 106, 135, 135, 135, 17, 17, 17]
c = np.random.randint(1, 100, 9)
df = pd.DataFrame(data=list(zip(a, b, c)), columns=['sample', 'detector', 'cq'])
# add a repetition number column (integer division keeps the repeat count an int)
df['rep_num'] = [1, 2, 3] * (len(df) // 3)
# convert to a multi-indexed DF
df_multi = df.set_index(['sample', 'detector', 'rep_num'])
#--------------Resulting Dataframe---------------------
cq
sample detector rep_num
P_1 106 1 97
2 83
3 81
135 1 46
2 92
3 89
17 1 58
2 26
3 75
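Given the edit stating the actual goal (mean and standard deviation per Sample/Detector pair), a groupby aggregation may be all that's needed; a sketch using the first rows of the question's data:

```python
import pandas as pd

# First nine rows from the question
df = pd.DataFrame({
    "Sample": ["P_1"] * 9,
    "Detector": [106] * 3 + [135] * 3 + [17] * 3,
    "Cq": [23.531520, 23.152458, 23.685083,
           24.465698, 23.868920, 23.723469,
           22.524242, 20.658733, 21.146122],
})

# Mean and standard deviation of Cq for every Sample/Detector pair
stats = df.groupby(["Sample", "Detector"])["Cq"].agg(["mean", "std"])
print(stats)
```

This sidesteps the duplicate-data problem with pivot entirely, since the replicates are collapsed into the statistics you actually want to store.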
