Selecting Column from pandas Series - python

I have a Series named 'graph' in pandas that looks like this:
Wavelength
450 37
455 31
460 0
465 0
470 0
475 0
480 418
485 1103
490 1236
495 894
500 530
505 85
510 0
515 168
520 0
525 0
530 691
535 842
540 5263
545 4738
550 6237
555 1712
560 767
565 620
570 0
575 757
580 1324
585 1792
590 659
595 1001
600 601
605 823
610 0
615 134
620 3512
625 266
630 155
635 743
640 648
645 0
650 583
Name: A1, dtype: object
I am graphing the curve using graph.plot(), which looks like this:
The goal is to smooth the curve. I was trying to use savgol_filter, but to do that I need to separate my series into x and y columns. Right now I can access the "Wavelength" column by using graph.index, but I can't grab the next column to assign it as y.
I've tried using iloc and loc and haven't had any luck yet.
Any tips or new directions to try?

You don't need to pass an x and a y to savgol_filter. You only need the y values, which get passed automatically when you pass graph to it. What you are missing are the window length and polynomial order parameters that define the smoothing.
from scipy.signal import savgol_filter
import pandas as pd
# I passed `graph` but I could've passed `graph.values`
# It is `graph.values` that will get used in the filtering
pd.Series(savgol_filter(graph, 7, 3), graph.index).plot()
To address some other points of misunderstanding:
graph is a pandas.Series and NOT a pandas.DataFrame. A pandas.DataFrame can be thought of as a pandas.Series of pandas.Series.
So you access the index of the series with graph.index and the values with graph.values.
You could have also done
import matplotlib.pyplot as plt
plt.plot(graph.index, savgol_filter(graph.values, 7, 3))
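As a minimal, self-contained sketch of the approach above: the sample values are the first ten rows of the question's data, and the astype(float) step is an assumption added here because the Series is shown with dtype object, which savgol_filter may not accept directly.

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

# First ten rows of the question's data, stored with dtype object as in the post
wavelength = pd.Index(np.arange(450, 500, 5), name="Wavelength")
graph = pd.Series([37, 31, 0, 0, 0, 0, 418, 1103, 1236, 894],
                  index=wavelength, name="A1", dtype=object)

# Assumption: convert to float first, since the values are stored as objects
y = graph.astype(float)

# window_length=7 and polyorder=3 are the values used in the answer above
smoothed = pd.Series(
    savgol_filter(y.to_numpy(), window_length=7, polyorder=3),
    index=graph.index,
    name="A1_smoothed",
)
```

The result keeps the original Wavelength index, so smoothed.plot() would draw against the same x axis as graph.plot().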

Because you are using a Series instead of a DataFrame, some libraries cannot access the index to use it as a column. Use:
df = graph.reset_index()
This converts the index into an extra column that you can pass to savgol_filter or any other function.
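A small sketch of what reset_index produces for a Series like the one in the question (only three sample rows are used here). The column names come from the index name and the Series name.

```python
import pandas as pd

# Three sample rows from the question's Series
graph = pd.Series([37, 31, 0],
                  index=pd.Index([450, 455, 460], name="Wavelength"),
                  name="A1")

# reset_index turns the index into a regular column of a DataFrame
df = graph.reset_index()

x = df["Wavelength"]  # former index
y = df["A1"]          # former Series values
```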

Related

Which ML algorithm would be appropriate for clustering a combination of categorical and numerical dataframe?

I wish to cluster a DataFrame with a dimension of (120000 x 4).
It consists of two string-based "label" columns (str1 and str2), and two numerical columns which looks like the following:
Str1 Str2 Energy intensity
0 713 599 7678.159 5367.276014
1 715 598 7678.182 6576.100453
2 714 597 7678.183 5675.788001
3 684 587 7678.493 3040.650157
4 693 588 7678.585 5585.908164
5 695 586 7678.615 3184.001905
6 684 584 7678.674 4896.774505
7 799 509 7693.645 4907.484401
8 798 508 7693.754 4075.800912
9 797 507 7693.781 4407.800702
10 796 506 7694.043 3138.073328
11 794 505 7694.049 3653.699936
12 795 504 7694.077 3875.120022
13 675 277 7694.948 3081.797654
14 709 221 7698.216 3587.704908
15 708 220 7698.252 4070.050144
...........
What would be the best ML algorithm to cluster/categorize this data?
I have tried plotting the energy and intensity values belonging to one specific category (e.g. Str1 == "713"), which didn't give me much information. I need a somewhat more compact clustering, if possible.
You can try applying categorical (ordinal) encoding or one-hot encoding to Str1 and Str2. Ordinal encoding suits classes that have a magnitude relation, while one-hot encoding is more widely used. Either will convert the strings into numerical data, and then you can simply use any model that works on numerical features.
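The encoding step suggested above could be sketched as follows, using the first four rows from the question. Standardizing the numeric columns is an assumption added here so that Energy and intensity do not dominate the 0/1 indicator columns. With many distinct labels, one-hot encoding can produce a very wide matrix.

```python
import pandas as pd

# First four rows of the question's data; Str1 and Str2 are string labels
df = pd.DataFrame({
    "Str1": ["713", "715", "714", "684"],
    "Str2": ["599", "598", "597", "587"],
    "Energy": [7678.159, 7678.182, 7678.183, 7678.493],
    "intensity": [5367.276014, 6576.100453, 5675.788001, 3040.650157],
})

# One-hot encode the two label columns into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["Str1", "Str2"], dtype=float)

# Assumption: standardize the numeric columns to zero mean and unit variance
numeric_cols = ["Energy", "intensity"]
encoded[numeric_cols] = (
    encoded[numeric_cols] - encoded[numeric_cols].mean()
) / encoded[numeric_cols].std()

# Purely numerical feature matrix, ready to hand to a model
features = encoded.to_numpy()
```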

How to use certain rows of a dataframe in a formula

So I have multiple data frames and all need the same kind of formula applied to certain sets within this data frame. I got the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # might need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
#the formula(trendline)
def rechtzetten(x,y):
    a = (len(x)*sum(x*y)- sum(x)*sum(y))/(len(x)*sum(x**2)-sum(x)**2)
    b = (sum(y)-a*sum(x))/len(x)
    y1 = x*a+b
    print(y1)
METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere
if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
    append.l
    rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not used data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to apply a formula to each profile as shown above. I want to process the data between two consecutive numbers of the list indicatie.
For example:
Apply the function rechtzetten(x, y) to df.x[1:11] and df.y[1:11] (because [0] and [12] are in the list indicatie), then do the same for [14:25], and so on.
What I try to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I can't understand your question clearly, but if you want to replace a specific column of your pandas DataFrame with a NumPy array, you can simply assign it:
df['Column'] = numpy_array
Can you be more specific?
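One possible reading of the question is "fit the trendline separately for every run of consecutive <METING> rows". Under that assumption, a sketch could label each run with a block number and loop over the groups. The small DataFrame below is made-up illustration data, and rechtzetten is rewritten to return the fitted values instead of printing them.

```python
import numpy as np
import pandas as pd

def rechtzetten(x, y):
    """Least-squares trendline; returns the fitted y values for x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    a = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b = (np.sum(y) - a * np.sum(x)) / n
    return a * x + b

# Hypothetical stand-in for Dalfsen.csv: two profiles with 3 and 2 measurements
df = pd.DataFrame({
    "ID": ["<PROFIEL>a", "<METING>1</METING>", "<METING>2</METING>",
           "<METING>3</METING>", "</PROFIEL>", "<PROFIEL>b",
           "<METING>4</METING>", "<METING>5</METING>", "</PROFIEL>"],
    "x": [np.nan, 0, 1, 2, np.nan, np.nan, 0, 1, np.nan],
    "y": [np.nan, 1, 3, 5, np.nan, np.nan, 4, 3, np.nan],
})

is_meting = df["ID"].str.contains("<METING>", na=False)
# Every non-METING row bumps the counter, so each run of METING rows
# between two marker rows shares one block number
block_id = (~is_meting).cumsum()

fitted = {}
for key, block in df[is_meting].groupby(block_id[is_meting]):
    fitted[key] = rechtzetten(block["x"], block["y"])
```

Here fitted maps each block number to the trendline values for that profile, which avoids writing one slice per profile by hand.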

How can I read a file with a different number of columns in each row?

My data looks like this:
0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425
How can I read this file structure in Python? I want to extract a specific column from specific rows. For example, if I want to extract the value in the second row, second column, how can I do that? I've tried loadtxt with a string data type, but that requires slicing each string by index, which does not work because the values have different numbers of digits. Moreover, each row has a different number of columns. Can you help me?
Thanks in advance.
Use something like this to split it:
split2=[]
split1=txt.split("\n")
for item in split1:
    split2.append(item.split(" "))
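I have stored the given data in "data.txt". Try the code below.
res = []
# read the file and split it into lines
with open('data.txt') as f:
    lines = f.read().split("\n")
# split each line into its space-separated columns
for i in lines:
    res.append(i.split(" "))
# print every value, one input row per output line
for i in range(len(res)):
    for j in range(len(res[i])):
        print(res[i][j], sep=' ', end=' ', flush=True)
    print()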
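To pick out a single value, as the question asks, the same splitting idea can be sketched on the sample rows directly. The text is held in a string here instead of a file, and whether the first number on each line counts as a column or as a row label is an assumption (it is treated as column 0 below).

```python
# The four sample rows from the question, held in a string for illustration
text = """0 199 1028 251 1449 847 1483 1314 23 1066 604 398 225 552 1512 1598
1 1214 910 631 422 503 183 887 342 794 590 392 874 1223 314 276 1411
2 1199 700 1717 450 1043 540 552 101 359 219 64 781 953
10 1707 1019 463 827 675 874 470 943 667 237 1440 892 677 631 425"""

# split() with no argument splits on any whitespace and skips empty tokens;
# each row becomes its own list, so rows may have different lengths
rows = [[int(tok) for tok in line.split()]
        for line in text.splitlines() if line.strip()]

# Second row, second column (both counted from 1, so index 1 and 1)
value = rows[1][1]
```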

Sum of specific rows in a dataframe (Pandas)

I'm given a set of the following data:
week A B C D E
1 243 857 393 621 194
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
5 712 734 308 385 303
I’m asked to find the sum of each column for specified rows/a specified number of weeks, and then plot those numbers onto a bar chart to compare A-E.
Assuming I have the rows I need (e.g. df.iloc[2:4,:]), what should I do next? My assumption is that I need to create a mask with a single row that includes the sum of each column, but I'm not sure how I go about doing that.
I know how to do the final step (i.e. .plot(kind='bar')); I just need to know what the middle step is to obtain the sums I need.
To select by position, you can use iloc, then sum and Series.plot.bar:
df.iloc[2:4].sum().plot.bar()
Or, if you want to select by index labels (here the weeks), use loc:
df.loc[2:4].sum().plot.bar()
The difference is that iloc excludes the end position, while loc includes the end label:
print (df.loc[2:4])
A B C D E
week
2 644 576 534 792 207
3 946 252 453 547 436
4 560 100 864 663 949
print (df.iloc[2:4])
A B C D E
week
3 946 252 453 547 436
4 560 100 864 663 949
If you also need to filter columns by position:
df.iloc[2:4, :4].sum().plot.bar()
And by label (weeks for rows, names for columns):
df.loc[2:4, list('ABCD')].sum().plot.bar()
All you need to do is call .sum() on your subset of the data:
df.iloc[2:4,:].sum()
Returns:
week 7
A 1506
B 352
C 1317
D 1210
E 1385
dtype: int64
Furthermore, for plotting, I think you can probably get rid of the week column (as the sum of week numbers is unlikely to mean anything):
df.iloc[2:4,1:].sum().plot(kind='bar')
# or
df[list('ABCDE')].iloc[2:4].sum().plot(kind='bar')
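For reference, a self-contained sketch that rebuilds the question's table and computes the sums both by position and by week label. Here week is kept as an ordinary column, as in the second answer, and moved into the index only for the label-based variant.

```python
import pandas as pd

# The table from the question, with week as a regular column
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "A": [243, 644, 946, 560, 712],
    "B": [857, 576, 252, 100, 734],
    "C": [393, 534, 453, 864, 308],
    "D": [621, 792, 547, 663, 385],
    "E": [194, 207, 436, 949, 303],
})

# By position: rows at positions 2 and 3 (weeks 3 and 4), skipping the week column
by_position = df.iloc[2:4, 1:].sum()

# By label: make week the index, then slice weeks 3 to 4 inclusive
by_label = df.set_index("week").loc[3:4].sum()
```

Both results are Series indexed by A to E, so either one can be followed by .plot(kind='bar').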

Line plot using matplotlib for a dataframe of 200 columns

I am struggling to plot a line graph for a DataFrame I have. The DataFrame consists of 200 columns and approximately 4900 rows.
The head of my DataFrame looks as follows:
Geneid pool16.1 pool18.13 pool14.11 pool15.6 pool15.2 pool17.1 pool14.16 pool14.9 pool15.10 ... pool3.13 pool2.3 pool4.7 pool1.16 pool3.14 pool1.14 pool2.14 pool8.7 pool9.15 pool10.11
0 ABL1.exon1 1073 594 901 1164 1117 1681 1516 914 1002 ... 1471 1086 1032 1600 1023 1203 1465 546 650 947
1 ABL1.exon2 974 549 738 1006 1057 1463 1463 783 1334 ... 1288 1095 967 1474 881 1134 1326 595 505 912
2 ABL1.exon3 701 619 471 748 732 1043 1145 531 935 ... 1031 871 702 1206 771 985 1236 301 301 710
3 ABL1.exon4 555 225 371 586 559 842 830 402 636 ... 831 621 555 887 575 726 936 359 238 556
4 ABL1.exon5 1063 524 817 1085 1086 1624 1448 843 1368 ... 1523 1234 1185 1883 1025 1387 1655 732 581 882
5 rows × 199 columns
I want to plot a line graph from the above DataFrame. Here is what I tried, using a small part of the DataFrame:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
CDKN2A= All_pools[All_pools.Geneid.str.contains("CDKN2A") == True]
CDKN2A.plot.line(figsize=(15,10),x='Geneid',y='value')
This gives a plot that looks like this:
The x axis carries no information about the first column, and the plot is not informative. I am aiming to plot something that looks like this:
The plot still looks messy and uninformative. Any suggestions to make it look better would be great.
If you want to see dates on the x axis, you need a column with dates in your DataFrame.
If you want to see several lines, you need to plot several columns.
If you want to see an actual, distinct line, you need to plot fewer rows or more autocorrelated data, or both.
You already have information about the first column on the x axis: the column name and some labelled values, as usual.
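The point about plotting several columns could be sketched with a reshape. The frame below holds three rows and three pool columns taken from the head shown in the question. Transposing gives one column per exon, and melt produces the 'value' column that the question's plotting call refers to but the wide table lacks.

```python
import pandas as pd

# Small excerpt of the question's table: three exons, three pools
All_pools = pd.DataFrame({
    "Geneid": ["ABL1.exon1", "ABL1.exon2", "ABL1.exon3"],
    "pool16.1": [1073, 974, 701],
    "pool18.13": [594, 549, 619],
    "pool14.11": [901, 738, 471],
})

subset = All_pools[All_pools["Geneid"].str.contains("ABL1")]

# Wide form: pools as rows, one column (and therefore one line) per exon
wide = subset.set_index("Geneid").T

# Long form: one row per (exon, pool) pair with an explicit 'value' column
long = subset.melt(id_vars="Geneid", var_name="pool", value_name="value")
```

With the wide form, wide.plot.line(figsize=(15, 10)) would draw one line per exon with the pools along the x axis.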
