Line plot using matplotlib for a dataframe of 200 columns - python

I am struggling to plot a line graph for a DataFrame I have. The DataFrame consists of 200 columns and approximately 4900 rows.
The head of my DataFrame looks as follows:
Geneid pool16.1 pool18.13 pool14.11 pool15.6 pool15.2 pool17.1 pool14.16 pool14.9 pool15.10 ... pool3.13 pool2.3 pool4.7 pool1.16 pool3.14 pool1.14 pool2.14 pool8.7 pool9.15 pool10.11
0 ABL1.exon1 1073 594 901 1164 1117 1681 1516 914 1002 ... 1471 1086 1032 1600 1023 1203 1465 546 650 947
1 ABL1.exon2 974 549 738 1006 1057 1463 1463 783 1334 ... 1288 1095 967 1474 881 1134 1326 595 505 912
2 ABL1.exon3 701 619 471 748 732 1043 1145 531 935 ... 1031 871 702 1206 771 985 1236 301 301 710
3 ABL1.exon4 555 225 371 586 559 842 830 402 636 ... 831 621 555 887 575 726 936 359 238 556
4 ABL1.exon5 1063 524 817 1085 1086 1624 1448 843 1368 ... 1523 1234 1185 1883 1025 1387 1655 732 581 882
5 rows × 199 columns
I want to plot a line graph from the above DataFrame. Here is what I tried, using a small part of the DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
CDKN2A= All_pools[All_pools.Geneid.str.contains("CDKN2A") == True]
CDKN2A.plot.line(figsize=(15,10),x='Geneid',y='value')
Which gives a plot that looks like this,
where I have no information about the first column on the x axis and the plot is not informative. I am aiming to plot something which looks like this,
Still, the plot looks messy and is not very informative. Any suggestions to make it look better would be great.

If you want to see dates in the x axis you need a column with dates in your dataframe.
If you want to see several lines you need to plot several columns.
If you want to see an actual, distinct line, you need to plot fewer rows or more autocorrelated data or both.
You already have information about the first column on the x axis: the column name and some labelled values, as usual.
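To make the "plot several columns" point concrete, here is a minimal sketch, assuming the question's layout (a Geneid column followed by one numeric column per pool). The helper name gene_profile and the toy frame are made up for illustration; the ABL1 values are copied from the head shown in the question, and the OTHER.exon1 row is invented.

```python
import pandas as pd

def gene_profile(all_pools, gene):
    """Return the rows for one gene, indexed by exon name (hypothetical helper)."""
    # regex=False: treat the gene name as a plain substring
    rows = all_pools[all_pools["Geneid"].str.contains(gene, regex=False, na=False)]
    # Moving Geneid into the index leaves only the numeric pool columns,
    # so every remaining column can be drawn as its own line.
    return rows.set_index("Geneid")

# Toy stand-in for All_pools: two pool columns instead of ~199
toy = pd.DataFrame({
    "Geneid": ["ABL1.exon1", "ABL1.exon2", "ABL1.exon3", "OTHER.exon1"],
    "pool16.1": [1073, 974, 701, 10],    # last value is invented
    "pool18.13": [594, 549, 619, 20],    # last value is invented
})

profile = gene_profile(toy, "ABL1")
print(profile)
```

Calling profile.plot.line(figsize=(15, 10), legend=False) on the result should then draw one line per pool, with the exon names along the x axis. Turning the legend off is a judgment call for a frame with roughly 200 pool columns.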


Cannot open eps file after saving figure

Normally, opening an eps file is no problem but with this current code in Python that I am working on, the exported eps file is loading when opened but never appearing. I have tried exporting the same figure as a png and that works fine. Also I have tried exporting a really simple figure as eps and that opens without any flaws. I have included some of the relevant code concerning the plot/figure. Any help would be much appreciated.
#%% plot section
plt.close('all')
plt.figure()
plt.errorbar(r,omega,yerr=omega_err,fmt='mo')
plt.xlabel('xlabel')
plt.ylabel('ylabel')
plt.title('profile averaged from {} ms to {} ms \n shot {}'.format(tidsinterval[0],tidsinterval[1],skud_numre[0]),y=1.05)
plt.grid()
plt.axhline(y=2.45,color='Red')
plt.text(39,2.43,'txt block for horizontal line',backgroundcolor='white')
plt.axvline(x=37.5,color='Purple')
plt.text(37.5,1.2,'txt block for vertical line',ha='center',va="center",rotation='vertical',backgroundcolor='white')
plt.savefig('directory/plot.eps', format='eps')
plt.show()
The variables r, omega, omega_err are vectors of float of small sizes (6 perhaps).
Update: The program I use for opening eps-files is Evince, furthermore, one can download the eps file here https://filedropper.com/d/s/z7lxUCtANeox7tDMQ6dI6HZUpcTfHn. As far as I can see, it is fine sharing files over filedropper via community guidelines, but if I'm wrong please say so.
Found out that it is possible to open the file as long as there is no text contained in the plot (for example x-label,y-label, title and so on), so the problem has to be related to the text.
The short answer is it's your font. The /e glyph is throwing an error on setcachedevice (your PostScript interpreter should have told you this).
The actual problem is that the font program is careless (at least) about its use of function names. The program contains this:
/mpldict 11 dict def
mpldict begin
/d { bind def } bind def
That creates a new dictionary called mpldict, begins that dictionary (makes it the topmost entry in the dictionary stack), and defines a function called 'd' in that dictionary.
We then move on to the font definition. There's a lot of boilerplate in here, but each character shape is defined by an entry in the font's CharStrings dictionary. We'll pick that up with the definition of the function called 'd' in the font's CharStrings dictionary.
/d{1300 0 113 -29 1114 1556 sc
930 950 m
930 1556 l
ce} d
Notice that what this does is create a new definition of a function named 'd' in the current dictionary. That's not a problem in itself. We now have two functions named 'd'; one in the current dictionary (the font's CharStrings dictionary) and one in 'mpldict'.
Then we define the next character:
/e{1260 0 113 -29 1151 1147 sc
1151 606 m
1151 516 l
305 516 l
313 389 351 293 419 226 c
488 160 583 127 705 127 c
776 127 844 136 910 153 c
977 170 1043 196 1108 231 c
1108 57 l
1042 29 974 8 905 -7 c
836 -22 765 -29 694 -29 c
515 -29 374 23 269 127 c
165 231 113 372 113 549 c
113 732 162 878 261 985 c
360 1093 494 1147 662 1147 c
813 1147 932 1098 1019 1001 c
1107 904 1151 773 1151 606 c
967 660 m
966 761 937 841 882 901 c
827 961 755 991 664 991 c
561 991 479 962 417 904 c
356 846 320 764 311 659 c
967 660 l
ce} d
Now, the last thing we do at the end of defining that character shape (for the character named 'e') is call a function named 'd'. But there are two, so which one do we call? The answer is that we work backwards down the dictionary stack, looking in each dictionary to see if it has a function called 'd', and we use the first one we find. The current dictionary is the font's CharStrings dictionary, and it has a function called 'd' (which defines the 'd' character), so we call that.
And that function then tries to use setcachedevice. That operator is not legal except when executing a character description, which we are not doing, so it throws an undefined error.
Now your PostScript interpreter should tell you there is an error (Ghostscript, for example, does so). Because there is an error the interpreter stops and doesn't draw anything further, which is why you get a blank page.
What can you do about this? Well, you could raise a bug report with the creating application (apparently Matplotlib created the font too). This is not a good way to define a font!
Other than that, frankly, the only thing you can do is search and replace through the file. If you look for occurrences of ce} d and replace them with ce}bind def it'll probably work. This time.
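The search and replace could be scripted along these lines. This is only a sketch of the workaround described above, not a general fix: the function name patch_glyph_defs is made up, and it assumes every glyph procedure in the file ends with the exact text ce} d at the end of a line, as in the excerpts quoted here.

```python
import re

def patch_glyph_defs(eps_text):
    """Replace the ambiguous 'd' call that closes each glyph with 'bind def'."""
    # Match "ce} d" only when nothing but whitespace follows on the line,
    # so other uses of the letter d are left alone.
    return re.sub(r"ce\} d(?=[ \t\r]*$)", "ce}bind def", eps_text, flags=re.MULTILINE)

# Shortened stand-in for the font section of the EPS file
sample = (
    "/d{1300 0 113 -29 1114 1556 sc\n"
    "930 950 m\n"
    "930 1556 l\n"
    "ce} d\n"
    "/e{1260 0 113 -29 1151 1147 sc\n"
    "1151 606 m\n"
    "ce} d\n"
)

patched = patch_glyph_defs(sample)
print(patched)
```

To apply it to a real file, read the EPS as text, pass the contents through the function, and write the result to a new file so the original stays available for comparison.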

sklearn StandardScaler doesn't seem to be working properly

I am trying to normalise my data so that it is normally distributed, which I need for a later hypothesis test. The data I am trying to normalise, points, looks like this:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am using sklearn.preprocessing.StandardScaler() and my code is as follows:
import pandas as pd
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
scaler.fit(points)
points_norm = scaler.transform(points)
points_norm_df = pd.DataFrame(points_norm, columns=['P100m', 'Plj', 'Psp', 'Phj', 'P400m',
                                                    'P110h', 'Ppv', 'Pdt', 'Pjt', 'P1500'])
The strange part is that I am running an Anderson-Darling normality test from scipy.stats.anderson and the result is that it is very far from a normal distribution.
I am not the most proficient statistician. Am I misunderstanding what I am doing here or is it a problem with my code/data?
Any help would be greatly appreciated
The StandardScaler does not claim to make the data normally distributed; it standardizes each feature so that it has zero mean and unit variance.
From the documentation:
Standardize features by removing the mean and scaling to unit variance
The standard score of a sample x is calculated as z = (x - u) / s
where u is the mean of the training samples or zero if
with_mean=False, and s is the standard deviation of the training
samples or one if with_std=False.
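A small hand-rolled illustration of that formula, using only the standard library rather than scikit-learn itself: the z-score shifts and rescales the values but leaves the shape of the distribution untouched, so a skewed sample stays exactly as skewed. The sample values and the helper names standardize and skewness are made up for this sketch.

```python
import statistics

def standardize(values):
    """Apply z = (x - u) / s with the population standard deviation."""
    u = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(x - u) / s for x in values]

def skewness(values):
    """Population skewness; 0 for a symmetric distribution such as the normal."""
    u = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - u) / s) ** 3 for x in values) / len(values)

skewed = [1, 1, 1, 2, 2, 3, 5, 9, 20]  # made-up, right-skewed sample
z = standardize(skewed)

# Mean and standard deviation become 0 and 1 (up to rounding) ...
print(statistics.fmean(z), statistics.pstdev(z))
# ... but the skewness is the same before and after.
print(skewness(skewed), skewness(z))
```

A normality test applied to z would therefore reach the same verdict as one applied to the raw values.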
As gilad already pointed out the StandardScaler is standardizing your data.
You can find a list of methods here for preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html
Are you searching for:
6.3.2.1. Mapping to a Uniform distribution
QuantileTransformer and quantile_transform provide a non-parametric
transformation to map the data to a uniform distribution with values
between 0 and 1
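This would work roughly like this (QuantileTransformer also accepts output_distribution='normal' if a Gaussian-shaped output is what you are after):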
quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
points_norm = quantile_transformer.fit_transform(points)

How can I read a file having different column for each rows?

my data looks like this.
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in Python? I want to extract a specific column from each row. For example, if I want to extract the value in the second row, second column, how can I do that? I've tried 'loadtxt' with a string data type, but that requires string index slicing, and I could not proceed because the values have different numbers of digits. Moreover, each row has a different number of columns. Can you help me?
Thanks in advance.
Use something like this to split it
split2 = []
split1 = txt.split("\n")
for item in split1:
    split2.append(item.split(" "))
I have stored the given data in "data.txt". Try the code below.
res = []
with open('data.txt') as f:
    lines = f.read().split("\n")
for line in lines:
    res.append(line.split(" "))
# print every field, one input row per output line
for i in range(len(res)):
    for j in range(len(res[i])):
        print(res[i][j], sep=' ', end=' ', flush=True)
    print()
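For the specific request in the question (second row, second column), a list of lists is enough, because rows of different lengths are not a problem for plain Python lists. The sketch below assumes the fields are whitespace-separated integers and that the leading number on each line counts as the first column; the function name read_ragged is made up.

```python
import io

def read_ragged(lines):
    """Parse whitespace-separated integers; rows may differ in length."""
    rows = []
    for line in lines:
        fields = line.split()   # splits on any run of whitespace
        if fields:              # skip blank lines
            rows.append([int(f) for f in fields])
    return rows

# First three rows copied from the question, wrapped in a file-like object
sample = io.StringIO(
    "0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598\n"
    "1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411\n"
    "2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953\n"
)

rows = read_ragged(sample)
print(rows[1][1])     # second row, second column (indices start at 0)
print(len(rows[2]))   # number of fields in the third row
```

With a real file, passing the open file object works the same way, since iterating over it yields one line at a time.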

Selecting Column from pandas Series

I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot(), which looks like this :
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my series into x and y columns. As of right now, I can access the "Wavelength" column by using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?
You don't need to pass an x and a y to savgol_filter. You just need the y values, which get passed automatically when you pass graph to it. What you are missing is the window length parameter and the polynomial order parameter that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
Because you are using a Series instead of a DataFrame, some libraries cannot access the index as if it were a column. Use:
df = graph.reset_index()
It converts the index into an extra column that you can pass to savgol_filter or any other function.

Splitting a state name with a city name, return a list containing both?

So this is the simplified version of the data file:
Wichita, KS[3769,9734]279835
308 1002 1270 1068 1344 1360 1220 944 1192 748 1618 1774 416 1054
Wheeling, WV[4007,8072]43070
1017 1247 269 255 1513 327 589 203 1311 416 627 605 2442 998 85
West Palm Beach, FL[2672,8005]63305
1167 1550 1432 965 1249 2375 1160 718 1048 2175 760 1515 1459 3280 1794 1252
Wenatchee, WA[4742,12032]17257
3250 2390 1783 1948 2487 2595 1009 2697 2904 2589 1394 2690 1765 2912 117 1461
2358
Weed, CA[4142,12239]2879
622 3229 2678 1842 1850 2717 2898 1473 2981 3128 2880 1858 2935 2213 3213 505
1752 2659
Waycross, GA[3122,8235]19371
2947 2890 360 820 1192 1097 605 904 2015 828 386 703 1815 413 1155 1127
2920 1434 899
Wausau, WI[4496,8964]32426
1240 2198 1725 1600 708 841 1138 805 913 848 1015 1222 907 646 1008 111
1230 1777 509 676
Waukegan, IL[4236,8783]67653
244 1000 2260 1933 1360 468 757 1023 565 673 1056 775 982 667 854 768
170 990 1985 551 436
Watertown, SD[4490,9711]15649
601 393 1549 1824 1351 1909 1058 572 880 1155 1263 534 1365 1572 1257 394
1358 433 1580 1403 156 1026
And here is the code I have now. I can split the city and state name from each line, but how can I get the coordinates and the population of a city by looking up its name? For example, [x,y] holds the coordinates and the number right after the [] is the population.
fin = open("miles.txt", "r")
cities = []
for line in fin:
    A = line.split()
    if A[0][0] not in '0123456789':
        B = A[0] + A[1][0] + A[1][1]
        cities.append(B)
print(cities)
Thanks! Any help will be appreciated!
Well since the data you've posted is showing just a two letter postal code after a , separator, I'd:
city, state = line.split(', ')
state = state[:2]
return (city, state)
If you've got some data that isn't a two letter postal code, I'd look for the expected [ character:
city, state = line.split(', ')
state = state[:state.index('[')]
return (city, state)
To get the population, you'll need to make a dictionary of the stats you want to keep.
And yes, I know it's ugly:
def read_stats(path="miles.txt"):
    stats = {}
    with open(path) as fin:
        for line in fin:
            if line[0].isalpha():  # it's got a city, state, x, y and pop stat to keep
                city, state = line.split(', ')
                state = state[:state.index('[')]
                # get the two elements around the comma within the square brackets
                lat, lng = line[line.index('[') + 1:line.index(']')].split(',')
                # get the element after the right bracket
                pop = line[line.index(']') + 1:]
                stats[(city, state)] = (int(lat), int(lng), int(pop))
    return stats
From there, you'll be able to toy around with the stats from your text file.
Just make sure you don't have key collisions: the (city, state) tuple is the unique key for each entry. Keep in mind you wouldn't want to look up data by city name alone (there's more than one Springfield); instead, look up the key matching (city, state). The value returned will be the x, y and population stats you had on that line.
>>> stats = read_stats("miles.txt")
>>> stats.get(('Waukegan', 'IL'))
(4236, 8783, 67653)
>>> stats.get(('Waukegan', 'IL'))[-1]
67653
