I have the following dataframe:
>>> mean_traf_tie
a     d     c
0.22  0.99  0.11    22
            0.23    21
            0.34    34
            0.46    45
0.44  0.99  0.11    45
            0.23    65
            0.34    66
            0.46    68
0.50  0.50  0.11    22
            0.23    12
            0.34    34
            0.46    37
...
I want to create a plot from this dataframe where c is the x axis, y is the mean velocity, and the lines are split by the a and d columns. So, for example, one line would be for a=0.22 and d=0.99 (x being c and y being mean velocity), the second line for a=0.44 and d=0.99, etc.
I have tried to do it like this:
df.plot()
(the values differ in the original dataframe).
As you can see, for some reason it plots the a, d values on the x axis and creates only one line.
I have tried to fix it like this:
df.unstack(level=0).plot(figsize=(10,6))
but then I got a very strange graph: the lines were correctly split by a and d, but the x axis was wrong.
As you can see, it again plotted the a, d values on the x axis, but that is not what I want. I want c on the x axis, with the lines split by the a, d columns, which should produce continuous lines.
I have tried this as well:
df['mean_traf_tie'].unstack(level=0).plot(figsize=(10,6))
plt.xlabel('C')
plt.ylabel('mean_traf_tie')
but again got the same result.
The desired output has the c column on the x axis, mean_traf_tie on the y axis, and one line per combination of the a and d columns (a line for 0.22 and 0.99, a line for 0.44 and 0.99, etc.).
Update:
I managed to work around it by combining the two index columns into one before plotting, like this:
df['a,d'] = list(zip(df.a, df.d))
df = df.groupby(['a,d', 'C']).mean()
df.unstack(level=0).plot(figsize=(10,6))
The legend is still not ideal, but I got the lines and axes I wanted.
If anyone has a better idea how to do it with the original columns, I'm still open to learning.
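One way to avoid the combined column entirely is to unstack both index levels at once. This is only a sketch, since I don't have the original data; it assumes df has a MultiIndex of (a, d, c) and a mean_traf_tie column:
import matplotlib.pyplot as plt

# Unstacking both the 'a' and 'd' levels leaves c as the index,
# giving one column (and therefore one line) per (a, d) combination.
ax = df['mean_traf_tie'].unstack(level=['a', 'd']).plot(figsize=(10, 6))
ax.set_xlabel('c')
ax.set_ylabel('mean_traf_tie')
plt.show()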
I have a big dataframe, here's a small subset:
key_words   prove  have  read  represent  lead  replace
be           0.58  0.49  0.48       0.17  0.23     0.89
represent    0.66  0.43  0.00       1.00  0.00     0.46
associate    0.88  0.23  0.12       0.43  0.11     0.67
induce       0.43  0.41  0.33       0.47  0.00     0.43
It shows how close each word in key_words is to each of the column words (based on their embedding distances).
I want to find a way to visualize this dataframe so that I see the clusters that are being formed among the words that are closest to each other.
Is there a simple way to do this, considering that the key_word column has string values?
One option is to set the key_words column as index and to use seaborn.clustermap to plot the clusters:
# pip install seaborn
import seaborn as sns

sns.clustermap(df.set_index('key_words'),  # data
               vmin=0, vmax=1,             # min/max values shown as white/black
               cmap='Greys',               # color palette
               figsize=(5, 5),             # plot size
               )
output:
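If you also want explicit cluster labels rather than just the visual dendrograms, one option (a sketch, assuming flat clusters cut from the row dendrogram are what you need) is to feed the linkage that clustermap computed to scipy:
from scipy.cluster.hierarchy import fcluster

g = sns.clustermap(df.set_index('key_words'), vmin=0, vmax=1, cmap='Greys')

# g.dendrogram_row.linkage holds the hierarchical linkage used for the rows;
# t=2 (the number of clusters to cut into) is an arbitrary choice here.
labels = fcluster(g.dendrogram_row.linkage, t=2, criterion='maxclust')
print(dict(zip(df['key_words'], labels)))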
I have a pandas DataFrame with 12 clusters in total; certain clusters don't appear in certain seasons.
I want to create a multi-line graph over the seasons showing, for each cluster, its share of that season. So if there are 30 teams in the 97-98 season and 10 of them are in Cluster 1, that value would be .33, since Cluster 1 holds one third of the total possible spots.
I want the dataset to hold, for each season, each cluster's percentage of that season's total. I've tried using pandas' groupby method and then value_counts(), but that doesn't work, since looping over df.groupby(['SEASON']) yields tuples, not a Series.
Thanks so much
Use .groupby combined with .value_counts and .unstack:
temp_df = df.groupby(['SEASON'])['Cluster'].value_counts(normalize=True).unstack().fillna(0.0)
temp_df.plot()
print(temp_df.round(2))
Cluster     0     1     2     4     5     6     7    10    11
SEASON
1996-97  0.10  0.21  0.17  0.21  0.07  0.10  0.03  0.07  0.03
1997-98  0.20  0.00  0.20  0.20  0.00  0.00  0.20  0.20  0.00
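Since some clusters never appear in some seasons, the unstacked frame may also be missing entire cluster columns (note only 9 of the 12 appear above). If you want all 12 clusters present, a small sketch (assuming the clusters are labeled 0 through 11):
# Add a 0.0 column for any cluster that never appears in any season
temp_df = temp_df.reindex(columns=range(12), fill_value=0.0)
temp_df.plot()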
I am trying to merge 2 csv files by column.
Both of my csv filenames end with '_4.csv', and the final merged csv should look something like this:
0-10 ,83.72,66.76,86.98 ,0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04 ,11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94 ,21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03 ,31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0 ,over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0 ,UHF case ,0.0,0.0,0.0
my code:
# combine 2 csv files into 1 by columns
import os
import numpy as np

files_in_dir = [f for f in os.listdir(os.getcwd()) if f.endswith('_4.csv')]
temp_data = []
for filenames in files_in_dir:
    temp_data.append(np.loadtxt(filenames, dtype='str'))
temp_data = np.array(temp_data)
np.savetxt('_mix.csv', temp_data.transpose(), fmt='%s', delimiter=',')
however the error says:
    temp_data.append(np.loadtxt(filenames,dtype='str'))
    for x in read_data(_loadtxt_chunksize):
    raise ValueError("Wrong number of columns at line %d"
ValueError: Wrong number of columns at line 2
I am not sure whether this is related to the first column containing strings rather than numbers.
Does anyone know how to fix it? Much appreciated.
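A guess about the loadtxt error itself (an assumption, since the exact input files aren't shown): np.loadtxt splits on whitespace by default, so comma-separated rows whose first field contains a space (like "over 40") yield inconsistent column counts. Passing an explicit delimiter may be all that's needed:
# split on commas instead of the default whitespace
temp_data.append(np.loadtxt(filenames, dtype='str', delimiter=','))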
I think you're looking for the join method. If we have two .csv files of the form:
0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0
Assuming they both have similar structure, we'll work with one of these named data.csv:
import pandas as pd

# Assumes there are no headers
df1 = pd.read_csv("data.csv", header=None)
df2 = pd.read_csv("data.csv", header=None)

# By default, DataFrame columns are assigned the numbers 0, 1, 2, 3.
# Rename the second frame's columns so they do not clash,
# meaning `df2` will now have columns named 4, 5, 6, 7.
df2 = df2.rename(
    columns={
        x: y for x, y in zip(df1.columns, range(len(df2.columns), len(df2.columns) * 2))
    }
)

print(df1.join(df2))
Example output:
          0      1      2      3         4      5      6      7
0     0-10   83.72  66.76  86.98     0-10   83.72  66.76  86.98
1    11-20   15.01  31.12  12.04    11-20   15.01  31.12  12.04
2    21-30    1.14   2.05   0.94    21-30    1.14   2.05   0.94
3    31-40    0.13   0.07   0.03    31-40    0.13   0.07   0.03
4  over 40    0.00   0.00   0.00  over 40    0.00   0.00   0.00
5  UHF case   0.00   0.00   0.00  UHF case   0.00   0.00   0.00
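A shorter route to the same side-by-side result (a sketch of an alternative, not the only way) is pd.concat along the column axis, letting ignore_index renumber the columns:
import pandas as pd

df1 = pd.read_csv("data.csv", header=None)
df2 = pd.read_csv("data.csv", header=None)

# axis=1 glues the frames side by side; ignore_index=True renumbers
# the result's columns 0..7 so the labels do not clash.
merged = pd.concat([df1, df2], axis=1, ignore_index=True)
merged.to_csv("_mix.csv", header=False, index=False)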
I need to plot the following columns as a bar chart:
     %_Var1  %_Var2  %_Val1  %_Val2  Class
2      0.00    0.00    0.10    0.01      1
3      0.01    0.01    0.07    0.05      0
17     0.00    0.00    0.02    0.01      0
24     0.00    0.00    0.11    0.04      0
27     0.00    0.00    0.02    0.03      1
44     0.00    0.00    0.05    0.02      0
53     0.00    0.00    0.03    0.01      1
67     0.00    0.00    0.06    0.02      0
87     0.00    0.00    0.22    0.01      1
115    0.00    0.00    0.03    0.02      0
comparing the values for Class 1 and Class 0 respectively (i.e. bars showing each column of the dataframe, with the Class 1 bar placed next to the Class 0 bar for each column).
So I should have 8 bars in total: 4 for Class 1 and the remaining 4 for Class 0.
Each column's Class 1 bar should sit beside the same column's Class 0 bar.
I tried as follows:
ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].plot(kind='bar')
but the output is completely wrong, and writing ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].Label.plot(kind='bar') fares no better.
I think I need a groupby to group by Class, but I do not know how to set it up (plots are not my strong suit).
If you want to try the seaborn way, melt the dataframe to long format and then use the class as the hue:
import seaborn as sns

data = df.melt(id_vars=['Class'], value_vars=['%_Var1', '%_Var2', '%_Val1', '%_Val2'])
sns.barplot(x='variable', y='value', hue='Class', data=data, ci=0)
gives:
Or, if you want the plot grouped by class instead, simply swap the hue and the x axis:
sns.barplot(x='Class', y='value', hue='variable', data=data, ci=0)
gives:
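(Note: in seaborn 0.12+ the ci parameter is deprecated; passing errorbar=None has the same effect of suppressing the error bars.)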
Using groupby:
df.groupby('Class').mean().plot.bar()
With the pivot_table method you can summarise the data per group as well.
df.pivot_table(index='Class').plot.bar()
# df.pivot_table(columns='Class').plot.bar() # invert order
By default it calculates the mean of your target columns, but you can specify another aggregation method via the aggfunc parameter.
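For example, a quick sketch using the median instead of the mean:
df.pivot_table(index='Class', aggfunc='median').plot.bar()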
I am trying to calculate the kurtosis and skewness of my data, and I managed to create a table, but for some reason the result covers only a few of the columns rather than all of the fields.
For example, as you can see, I have many fields (columns).
I calculate the skewness and kurtosis using the following code:
sk = pd.DataFrame(data.skew())
kr = pd.DataFrame(data.kurtosis())
sk['kr'] = kr
sk.rename(columns={0: 'sk'}, inplace=True)
but then I get a result that covers only about half of my data.
I have tried head(10), but it doesn't change the fact that some columns disappeared.
How can I calculate this for all the columns?
It is really hard to reproduce the error since you did not share the original data. Most likely your dataframe contains non-numeric values in the missing columns, which would cause this behavior.
import pandas as pd

dat = {"1": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "2": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "3": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "4": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "5": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 'po', 'lg4': 0.45}}

df = pd.DataFrame.from_dict(dat).T
print(df)
lg1 lg2 lg3 lg4
1 0.12 0.23 0.34 0.45
2 0.12 0.23 0.34 0.45
3 0.12 0.23 0.34 0.45
4 0.12 0.23 0.34 0.45
5 0.12 0.23 po 0.45
print(df.kurtosis())
lg1 0
lg2 0
lg4 0
The solution would be to preprocess the data.
One word of advice: check whether the error is consistent, i.e. are the same columns always the ones missing?
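A minimal preprocessing sketch (assuming stray values like 'po' should simply be treated as missing) is to coerce every column to numeric before computing the statistics:
# Coerce each column to numeric; unparseable entries become NaN,
# which skew()/kurtosis() skip instead of dropping the whole column.
clean = df.apply(pd.to_numeric, errors='coerce')
print(clean.skew())
print(clean.kurtosis())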