Generating a scatter plot using two different datasets in Python pandas

I have two datasets. Both have different numbers of observations. Is it possible to generate a scatter plot between features from different datasets?
For example, I want to generate a scatter plot between the submission_day column of dataset 1 and the score column of dataset 2.
I am not sure how to do that using Python packages.
For example, consider the following two datasets:
id_student submission_day
23hv 100
24hv 99
45hv 10
56hv 16
53hv 34
id_student score
23hv 59
25gf 20
24hv 56
45hv 76

I think you need to merge them into one DataFrame and then use DataFrame.plot.scatter:
df = df1.merge(df2, on='id_student')
print(df)
id_student submission_day score
0 23hv 100 59
1 24hv 99 56
2 45hv 10 76
df.plot.scatter(x='submission_day', y='score')
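For completeness, here is a self-contained sketch built from the sample data in the question; matplotlib is assumed for rendering the plot:
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.DataFrame({
    'id_student': ['23hv', '24hv', '45hv', '56hv', '53hv'],
    'submission_day': [100, 99, 10, 16, 34],
})
df2 = pd.DataFrame({
    'id_student': ['23hv', '25gf', '24hv', '45hv'],
    'score': [59, 20, 56, 76],
})

# The default inner join keeps only students present in both datasets.
df = df1.merge(df2, on='id_student')
df.plot.scatter(x='submission_day', y='score')
plt.show()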

What can I do to visualize my dataframe in a proper way?

I have a dataframe in Python that consists of two columns, [Combinations] and [counts]. The dataframe has 16369 rows, so there are 16369 combinations.
The Combinations column holds different combinations of departments (there are 14 different departments) working together on projects, and the counts column records how often they worked together. There are about 8191 rows with 0 as the count.
I was wondering what the proper way would be to plot such a dataframe. I was thinking of a heatmap, but that won't work because of all the unique values in the Combinations column. How can I plot this properly (preferably in something like plotly)?
Combinations  counts
A,B           68
C,A           64
F,C           63
F,L           63
E,A           60
B,A           57
Q,L           56
A,B,C         55
L,N           54
C,L,A,C       53
A,F,B         52
F,H           51
C,V           50
Q,F           50
Z,X           49
C,X           49
A,P           49
K,Q           49
R,S           49
Have you tried exploring plotting these as sociograms? It seems like social network analysis would be an apt way to visualise the different relationships the departments have with each other.
You can try looking at this for some inspiration. Coursera has some courses you can explore too.
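Below is a minimal sketch of that sociogram idea using networkx and matplotlib (the question mentions plotly, but networkx keeps the example short; the toy rows stand in for the full 16369-row frame):
import itertools
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Combinations': ['A,B', 'C,A', 'F,C', 'A,B,C'],
    'counts': [68, 64, 63, 55],
})

G = nx.Graph()
for combo, count in zip(df['Combinations'], df['counts']):
    depts = sorted(set(combo.split(',')))  # dedupe rows like 'C,L,A,C'
    # Treat every pair inside a combination as one co-working relationship.
    for u, v in itertools.combinations(depts, 2):
        # Accumulate the weight if the same pair appears in several rows.
        w = count + (G.edges[u, v]['weight'] if G.has_edge(u, v) else 0)
        G.add_edge(u, v, weight=w)

pos = nx.spring_layout(G, seed=42)
edge_widths = [G.edges[u, v]['weight'] / 20 for u, v in G.edges]
nx.draw(G, pos, with_labels=True, width=edge_widths, node_color='lightblue')
plt.show()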

Iterate over certain columns with unique values and generate plots python

New to pandas and much help would be appreciated. I'm currently analyzing some Airbnb data and have over 50 different columns. Some of these columns have tens of thousands of unique values while some have very few unique values (categorical).
How do I loop over the columns that have less than 10 unique values to generate plots for them?
Count of unique values in each column:
id 38185
last_scraped 3
name 36774
description 34061
neighborhood_overview 18479
picture_url 37010
host_since 4316
host_location 1740
host_about 14178
host_response_time 4
host_response_rate 78
host_acceptance_rate 101
host_is_superhost 2
host_neighbourhood 486
host_total_listings_count 92
host_verifications 525
host_has_profile_pic 2
host_identity_verified 2
neighbourhood_cleansed 222
neighbourhood_group_cleansed 5
property_type 80
room_type 4
The above is stored through unique_vals = df.nunique()
Apologies if this is a repeat question; the closest answer I could find was Iterate through columns to generate separate plots in python, but it pertained to the entire data set.
Thanks!
You can filter the columns using df.columns[unique_vals < 10].
You can also pass the df.nunique() call in directly if you wish:
unique_columns = df.columns[df.nunique() < 10]
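To then actually generate the plots, here is a hedged sketch of the loop itself (the tiny frame below only makes the snippet runnable; on the real data, df.columns[df.nunique() < 10] picks out the low-cardinality columns such as room_type):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'room_type': ['Entire home', 'Private room', 'Entire home', 'Shared room'],
    'host_is_superhost': ['t', 'f', 't', 't'],
})

for col in df.columns[df.nunique() < 10]:
    # One bar chart of value counts per categorical column.
    df[col].value_counts().plot(kind='bar', title=col)
    plt.show()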

How to set a Seaborn scatterplot hue equal to features and not the values of a single feature?

I would like to create a scatterplot in Seaborn that sets the hue parameter so that values are coloured based on what feature they are from. (All features have values 0 to 100)
For example, supposing I have the following dataframe:
Happiness Kindness Sadness
1 100 70 0
2 60 50 1
3 34 32 10
4 23 65 54
5 43 54 87
When plotting the values, I would like all the values under Happiness to be red, Kindness blue, and Sadness green. However, Seaborn's scatterplot only accepts a single column for the hue parameter. Documentation here: http://man.hubwiz.com/docset/Seaborn.docset/Contents/Resources/Documents/generated/seaborn.scatterplot.html
I want to know if there are any workarounds to this single-column restriction so I can use the hue parameter to distinguish between the different variables.
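One common workaround (a sketch, not from this thread): reshape the frame to long form with melt so the former column names land in a single 'variable' column that hue accepts, then map colours through palette:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Happiness': [100, 60, 34, 23, 43],
    'Kindness': [70, 50, 32, 65, 54],
    'Sadness': [0, 1, 10, 54, 87],
})

# Keep the row index as the x position; 'variable' holds the feature name.
long_df = df.reset_index().melt(id_vars='index')
sns.scatterplot(
    data=long_df, x='index', y='value', hue='variable',
    palette={'Happiness': 'red', 'Kindness': 'blue', 'Sadness': 'green'},
)
plt.show()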

Average for similar looking data in a column using Pandas

I'm working on a large data with more than 60K rows.
I have a continuous measurement of current in one column. Each code is measured for about a second, during which the equipment takes 14, 15, 16, or 17 readings depending on its speed; the measurement then moves to the next code and again takes 14 to 17 readings, and so forth.
Every time the measurement moves from one code to another, there is a jump of more than 0.15 in the current measurement.
The first 48 rows of the data are as follows:
Index  Curr(mA)
0      1.362476
1      1.341721
2      1.362477
3      1.362477
4      1.355560
5      1.348642
6      1.327886
7      1.341721
8      1.334804
9      1.334804
10     1.348641
11     1.362474
12     1.348644
13     1.355558
14     1.334805
15     1.362477
16     1.556172
17     1.542336
18     1.549252
19     1.528503
20     1.549254
21     1.528501
22     1.556173
23     1.556172
24     1.542334
25     1.556172
26     1.542336
27     1.542334
28     1.556170
29     1.535415
30     1.542334
31     1.729109
32     1.749863
33     1.749861
34     1.749861
35     1.736024
36     1.770619
37     1.742946
38     1.763699
39     1.749861
40     1.749861
41     1.763703
42     1.756781
43     1.742946
44     1.736026
45     1.756781
46     1.964308
47     1.957395
I want to write a script where each block of 14 to 17 similar readings is averaged into a separate column, one value per code measurement. I have been thinking of doing this with pandas.
I want the output to look like:
Index  Curr(mA)
0      1.34907
1      1.54556
2      1.74986
I need some help to get this done.
First get the indexes of every row where there's a jump. Use Pandas' DataFrame.diff() to get the difference between the value in each row and the previous row, then check to see if it's greater than 0.15 with >. Use that to filter the dataframe index, and save the resulting indices (in the case of your sample data, three) in a variable.
indices = df.index[df['Curr(mA)'].diff() > 0.15]
The next steps depend on whether there are more columns in the source dataframe that you want in the output, or whether it is really just Curr(mA) and the index. In the latter case, you can use np.split() to cut the dataframe into a list of dataframes at the indexes you just pulled, then average each piece in a list comprehension.
import numpy as np
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
> [1.3490729374999997, 1.5455638666666667, 1.7498627333333332, 1.9608515]
To match your desired output above (the same values as a single column rather than a list), wrap the list in pd.Series and call reset_index(drop=True):
pd.Series(
[df['Curr(mA)'].mean() for df in np.split(df, indices)]
).reset_index(drop=True)
0    1.349073
1    1.545564
2    1.749863
3    1.960851
dtype: float64
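An equivalent route (my own variant, not from the answer above) labels each block by cumulatively counting the jumps and then takes a groupby mean, avoiding np.split entirely:
# Each jump increments the block label; rows between jumps share a label.
block_id = (df['Curr(mA)'].diff() > 0.15).cumsum()
averages = df.groupby(block_id)['Curr(mA)'].mean().reset_index(drop=True)
print(averages)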

Slicing a dataframe based on range of time

I am new to Pandas. I have a set of Excel data read into a dataframe as follows:
TimeReceived A B
08:00:01.010 70 40
08:00:01.050 80 50
08:00:01.100 50 20
08:00:01.150 40 30
I want to compute the average for columns A & B based on time intervals of 100 ms.
The output in this case would be :
TimeReceived A B
08:00:01.000 75 45
08:00:01.100 45 25
I have set the 'TimeReceived' as a Date-Time index:
df = df.set_index (['TimeReceived'])
I can select rows based on predefined time ranges but I cannot do computations on time intervals as shown above.
If you have a DatetimeIndex you can use resample to upsample or downsample your data to a new frequency. This will introduce NaN rows where there are gaps, but you can drop those using dropna:
df.resample('100ms').mean().dropna()
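An end-to-end sketch, assuming the TimeReceived strings still need parsing into a DatetimeIndex (pd.to_datetime fills in a default date, which is harmless for 100 ms binning):
import pandas as pd

df = pd.DataFrame({
    'TimeReceived': ['08:00:01.010', '08:00:01.050',
                     '08:00:01.100', '08:00:01.150'],
    'A': [70, 80, 50, 40],
    'B': [40, 50, 20, 30],
})
df['TimeReceived'] = pd.to_datetime(df['TimeReceived'], format='%H:%M:%S.%f')
df = df.set_index('TimeReceived')

print(df.resample('100ms').mean().dropna())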
