for i,r in data.iterrows():
print(r)
Each row is a Series object, and the print output looks like this:
QuantifierId
18 0.0
19 0.0
20 0.0
21 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
63 NaN
64 NaN
81 NaN
82 NaN
83 NaN
84 NaN
85 NaN
86 NaN
87 NaN
88 NaN
89 NaN
91 NaN
93 NaN
94 NaN
95 NaN
96 NaN
121 NaN
Name: 52466, dtype: float64
I want to:
remove all QuantifierIds with value == NaN or 0 (and retain all QuantifierIds with value == 1)
get the Name field from each row
How do I do that?
remove all QuantifierIds with value == NaN or 0 (and retain all QuantifierIds with value == 1)
data = data.loc[(data.QuantifierId.notnull()) & (data.QuantifierId != 0)]
get that Name field from each row
data.index.tolist()
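If instead the QuantifierIds are the columns and each row is one Name (as the printed Series suggests), a per-row version might look like this; the frame here is a made-up stand-in, not the asker's data:

```python
import numpy as np
import pandas as pd

# Made-up stand-in shaped like the question's output: the index holds the
# Name values and each column is a QuantifierId.
data = pd.DataFrame(
    {18: [0.0, 1.0], 21: [np.nan, 1.0], 23: [1.0, 0.0]},
    index=[52466, 52467],
)

names = data.index.tolist()  # the Name field from each row
for name, row in data.iterrows():
    kept = row[row == 1].index.tolist()  # QuantifierIds with value == 1
    print(name, kept)
# 52466 [23]
# 52467 [18, 21]
```

Note that `row == 1` is False for NaN, so NaN and 0 entries are dropped in one comparison.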
I would like to shift only specific rows in my DataFrame by 1 period on the columns axis.
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 31 21 NaN
6 07 29 27 NaN
To have something like this:
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 NaN 31 21
6 07 NaN 29 27
My code so far:
rows_to_shift = Df[Df['Year_2005'].notnull()].index
Df.iloc[rows_to_shift, 1] = Df.iloc[rows_to_shift,2].shift(1)
Try:
df = df.set_index("Month")
df[df["Year_2005"].notnull()] = df[df["Year_2005"].notnull()].shift(axis=1)
>>> df
Year_2005 Year_2006 Year_2007
Month
1 NaN 31.0 35.0
2 NaN 40.0 45.0
3 NaN 87.0 46.0
4 NaN 55.0 41.0
5 NaN 36.0 28.0
6 NaN 31.0 21.0
7 NaN 29.0 27.0
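For reference, this answer runs end-to-end against the frame from the question (rebuilt here by hand):

```python
import numpy as np
import pandas as pd

# The question's frame, rebuilt by hand.
df = pd.DataFrame({
    "Month": ["01", "02", "03", "04", "05", "06", "07"],
    "Year_2005": [np.nan, np.nan, np.nan, np.nan, np.nan, 31, 29],
    "Year_2006": [31, 40, 87, 55, 36, 21, 27],
    "Year_2007": [35, 45, 46, 41, 28, np.nan, np.nan],
}).set_index("Month")

# Shift only the rows where Year_2005 holds a value, one step to the right.
mask = df["Year_2005"].notnull()
df[mask] = df[mask].shift(axis=1)
print(df)
```

The boolean mask selects whole rows, so `shift(axis=1)` moves every value in those rows one column to the right and leaves the other rows untouched.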
You can also push each row's values to the right by sorting on pd.notna (NaNs sort first because False < True, and Python's sort is stable):
df1 = df.set_index('Month')
df1 = df1.apply(lambda x: pd.Series(sorted(x, key=pd.notna), index=x.index), axis=1)
df = df1.reset_index()
Result:
Month Year_2005 Year_2006 Year_2007
0 1 NaN 31.0 35.0
1 2 NaN 40.0 45.0
2 3 NaN 87.0 46.0
3 4 NaN 55.0 41.0
4 5 NaN 36.0 28.0
5 6 NaN 31.0 21.0
6 7 NaN 29.0 27.0
I have several files I've created (4 files: file 1 = a list of 1-100, file 2 = a list of 101-200, etc.), all of which are just lists of IDs. I am trying to create a single file with all the IDs in a single column (IDs are unique, so a single column works).
I can't quite get it, though; I keep getting close but not quite what I need.
I think I'm missing something related to assigning indexes? Any insights are greatly appreciated!
import pandas as pd
URL1 = "C:\\PATH\\" + fileName + "1" + ".csv"
URL2 = "C:\\PATH\\" + fileName + "2" + ".csv"
URL3 = "C:\\PATH\\" + fileName + "3" + ".csv"
URL4 = "C:\\PATH\\" + fileName + "4" + ".csv"
data1 = pd.read_csv(URL1)
data2 = pd.read_csv(URL2)
data3 = pd.read_csv(URL3)
data4 = pd.read_csv(URL4)
data_concat = pd.concat([data1, data2, data3, data4])
data_append = data1.append(data2)
data_append2 = data_append.append(data3)
data_append3 = data_append2.append(data4)
When I run:
print(data_append3)
# or
print(data_concat)
I get:
1 101 201 301
0 2.0 NaN NaN NaN
1 3.0 NaN NaN NaN
2 4.0 NaN NaN NaN
3 5.0 NaN NaN NaN
4 6.0 NaN NaN NaN
.. ... ... ... ...
94 NaN NaN NaN 396.0
95 NaN NaN NaN 397.0
96 NaN NaN NaN 398.0
97 NaN NaN NaN 399.0
98 NaN NaN NaN 400.0
[396 rows x 4 columns]
Edit:
print(data1)
1 101
0 2.0 NaN
1 3.0 NaN
2 4.0 NaN
3 5.0 NaN
4 6.0 NaN
.. ... ...
94 NaN 196.0
95 NaN 197.0
96 NaN 198.0
97 NaN 199.0
98 NaN 200.0
print(data2)
1 101 201
0 2.0 NaN NaN
1 3.0 NaN NaN
2 4.0 NaN NaN
3 5.0 NaN NaN
4 6.0 NaN NaN
.. ... ... ...
94 NaN NaN 296.0
95 NaN NaN 297.0
96 NaN NaN 298.0
97 NaN NaN 299.0
98 NaN NaN 300.0
I tried adding this and it worked! Thank you :)
data1 = pd.read_csv(URL1,header =None)
data2 = pd.read_csv(URL2,header=None)
data3 = pd.read_csv(URL3,header=None)
data4 = pd.read_csv(URL4,header=None)
print(data_append3)
0
0 1
1 2
2 3
3 4
4 5
.. ...
95 396
96 397
97 398
98 399
99 400
[400 rows x 1 columns]
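For the record, the reason `header=None` fixes it: each CSV is a bare column of IDs with no header row, so without it `read_csv` consumes the first ID of every file as that file's column label, the four frames end up with different column names (1, 101, 201, 301), and `concat` pads with NaN instead of stacking. A tiny self-contained illustration, using in-memory strings in place of the real files:

```python
import io
import pandas as pd

# Two tiny stand-ins for the CSV files: bare lists of IDs, no header row.
csv1 = "1\n2\n3\n"
csv2 = "4\n5\n6\n"

# Without header=None the first value becomes the column label, so the
# frames get columns named "1" and "4" and concat misaligns them.
bad = pd.concat([pd.read_csv(io.StringIO(csv1)),
                 pd.read_csv(io.StringIO(csv2))])

# With header=None both frames share column 0 and stack cleanly.
good = pd.concat(
    [pd.read_csv(io.StringIO(csv1), header=None),
     pd.read_csv(io.StringIO(csv2), header=None)],
    ignore_index=True,
)
print(bad.shape)   # NaN-padded, two columns
print(good.shape)  # one column, all IDs
```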
I have the following pandas series:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 0.378826
27 NaN
28 NaN
29 NaN
...
123 NaN
124 NaN
125 NaN
126 NaN
127 1.170094
128 NaN
129 NaN
130 NaN
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64
I interpolate it as follows:
ts.interpolate(method='cubic', limit_direction='both', limit=75)
I would have expected all NaNs to be filled by this, but in the output NaNs still remain. Why is that, and how can I fix it in the interpolate call? The output is as follows:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 2.291958
12 1.733142
13 1.255447
14 0.854370
15 0.525409
16 0.264062
17 0.065826
18 -0.073801
19 -0.159321
20 -0.195237
21 -0.186051
22 -0.136265
23 -0.050382
24 0.067095
25 0.211666
26 0.378826
27 0.564074
28 0.762908
29 0.970824
...
123 1.649933
124 1.579817
125 1.479152
126 1.343917
127 1.170094
128 0.953663
129 0.690605
130 0.376900
131 0.008531
132 NaN
133 NaN
134 NaN
135 NaN
136 NaN
137 NaN
138 NaN
139 NaN
140 NaN
141 NaN
142 NaN
143 NaN
144 NaN
145 NaN
146 NaN
147 NaN
148 NaN
149 NaN
150 NaN
151 NaN
152 NaN
Length: 153, dtype: float64
Cubic interpolation can only fill NaNs between known values; it does not extrapolate beyond the first and last valid points. If you change the method to linear, it will fill the leading and trailing NaNs too:
s.interpolate('linear',limit_direction='both', limit=75)
Out[62]:
0 2.291958
1 2.291958
2 2.291958
3 2.291958
4 2.291958
5 2.291958
6 2.291958
7 2.291958
8 2.291958
9 2.291958
10 2.291958
11 2.291958
12 2.164416
13 2.036874
14 1.909332
15 1.781789
16 1.654247
17 1.526705
18 1.399163
19 1.271621
20 1.144079
21 1.016537
22 0.888995
23 0.761452
24 0.633910
25 0.506368
26 0.378826
27 0.378826
28 0.378826
29 0.378826
Name: s, dtype: float64
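A small reproduction of the difference, assuming SciPy is installed (pandas delegates method='cubic' to scipy.interpolate): cubic fills only between the first and last valid points, while 'linear' with limit_direction='both' also pads the edges with the nearest valid value.

```python
import numpy as np
import pandas as pd

# Leading and trailing NaNs around four known points.
s = pd.Series([np.nan, 1.0, 2.0, np.nan, 4.0, 5.0, np.nan])

cubic = s.interpolate("cubic", limit_direction="both")    # requires SciPy
linear = s.interpolate("linear", limit_direction="both")

print(cubic.iloc[0], cubic.iloc[-1])    # still NaN: no extrapolation
print(linear.iloc[0], linear.iloc[-1])  # padded with 1.0 and 5.0
```

The interior NaN (index 3) is filled by both methods; only the edges differ.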
I've the following dataframe:
df =
c f V E
0 M 5 32 22
1 M 7 45 40
2 R 7 42 36
3 R 9 41 38
4 R 3 28 24
And I want a result like this, in which the values of column 'f' are my new columns, and my new indexes are a combination of column 'c' and the rest of columns in the dataframe (the order of rows doesn't matter):
df_result =
3 5 7 9
V(M) NaN 32 45 NaN
E(M) NaN 22 40 NaN
V(R) 28 NaN 42 41
E(R) 24 NaN 36 38
Currently, my code is:
df_result = pd.concat(
    [df.pivot('c', 'f', col)
       .rename(index={e: col + '(' + e + ')' for e in df.pivot('c', 'f', col).index})
     for col in [e for e in df.columns if e not in ['c', 'f']]]
)
With that code I'm getting:
df_result =
f 3 5 7 9
c
E(M) NaN 22 40 NaN
E(R) 24 NaN 36 38
V(M) NaN 32 45 NaN
V(R) 28 NaN 42 41
I think it's a valid result; however, I don't know if there is a way to get exactly my desired result or, at least, a better way to get what I am already getting.
Thank you very much in advance.
To get the table, combine .melt with .pivot_table:
df_result = df.melt(['f', 'c']).pivot_table(index=['variable', 'c'], columns='f')
Then we can clean up the naming:
df_result = df_result.rename_axis([None, None], axis=1)
df_result.columns = [y for _,y in df_result.columns]
df_result.index = [f'{x}({y})' for x,y in df_result.index]
# Python 2: ['{0}({1})'.format(*x) for x in df_result.index]
Output:
3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
You might consider keeping the MultiIndex instead of flattening to new strings, as it can be simpler for certain aggregations.
You can also do this with pivot_table:
s = pd.pivot_table(df, index='c', columns='f', values=['V', 'E']).stack(level=0).sort_index(level=1)
s.index=s.index.map('{0[1]}({0[0]})'.format)
s
Out[95]:
f 3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
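For completeness, the same pivot_table approach as a self-contained script against the question's frame:

```python
import pandas as pd

# The question's frame, rebuilt by hand.
df = pd.DataFrame({
    "c": ["M", "M", "R", "R", "R"],
    "f": [5, 7, 7, 9, 3],
    "V": [32, 45, 42, 41, 28],
    "E": [22, 40, 36, 38, 24],
})

# Pivot, then move the value names (V/E) from the columns into the index.
s = (pd.pivot_table(df, index="c", columns="f", values=["V", "E"])
       .stack(level=0)
       .sort_index(level=1))

# Flatten the (c, value) MultiIndex into labels like "V(M)".
s.index = s.index.map("{0[1]}({0[0]})".format)
print(s)
```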
I have a list of students in a CSV file. Using Python, I want to display four columns showing the male students who have the highest marks in Maths, Computer, and Physics.
I tried to use the pandas library.
marks = pd.concat([data['name'],
data.loc[data['students']==1, 'maths'].nlargest(n=10)], 'computer'].nlargest(n=10)], 'physics'].nlargest(n=10)])
I used 1 for male students and 0 for female students.
It gives me an error saying: Invalid syntax.
Here's a way to show the top 10 students in each of the disciplines. You could of course just sum the three scores and select the students with the highest total if you want the combined as opposed to the individual performance (see illustration below).
import random
import numpy as np
import pandas as pd

# Random sample data: 100 students with random names and a 0/1 gender flag.
df1 = pd.DataFrame(data={'name': [''.join(random.choice('abcdefgh') for _ in range(8)) for i in range(100)],
'students': np.random.randint(0, 2, size=100)})
df2 = pd.DataFrame(data=np.random.randint(0, 10, size=(100, 3)), columns=['math', 'physics', 'computers'])
data = pd.concat([df1, df2], axis=1)
data.info()
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
name 100 non-null object
students 100 non-null int64
math 100 non-null int64
physics 100 non-null int64
computers 100 non-null int64
dtypes: int64(4), object(1)
memory usage: 4.0+ KB
res = pd.concat(
    [data.loc[:, ['name']],
     data.loc[data['students'] == 1, 'math'].nlargest(n=10),
     data.loc[data['students'] == 1, 'physics'].nlargest(n=10),
     data.loc[data['students'] == 1, 'computers'].nlargest(n=10)],
    axis=1)
res.dropna(how='all', subset=['math', 'physics', 'computers'])
name math physics computers
0 geghhbce NaN 9.0 NaN
1 hbbdhcef NaN 7.0 NaN
4 ghgffgga NaN NaN 8.0
6 hfcaccgg 8.0 NaN NaN
14 feechdec NaN NaN 8.0
15 dfaabcgh 9.0 NaN NaN
16 ghbchgdg 9.0 NaN NaN
23 fbeggcha NaN NaN 9.0
27 agechbcf 8.0 NaN NaN
28 bcddedeg NaN NaN 9.0
30 hcdgbgdg NaN 8.0 NaN
38 fgdfeefd NaN NaN 9.0
39 fbcgbeda 9.0 NaN NaN
41 agbdaegg 8.0 NaN 9.0
49 adgbefgg NaN 8.0 NaN
50 dehdhhhh NaN NaN 9.0
55 ccbaaagc NaN 8.0 NaN
68 hhggfffe 8.0 9.0 NaN
71 bhggbheg NaN 9.0 NaN
84 aabcefhf NaN NaN 9.0
85 feeeefbd 9.0 NaN NaN
86 hgeecacc NaN 8.0 NaN
88 ggedgfeg 9.0 8.0 NaN
89 faafgbfe 9.0 NaN 9.0
94 degegegd NaN 8.0 NaN
99 beadccdb NaN NaN 9.0
data['total'] = data.loc[:, ['math', 'physics', 'computers']].sum(axis=1)
data[data.students==1].nlargest(10, 'total').sort_values('total', ascending=False)
name students math physics computers total
29 fahddafg 1 8 8 8 24
79 acchhcdb 1 8 9 7 24
9 ecacceff 1 7 9 7 23
16 dccefaeb 1 9 9 4 22
92 dhaechfb 1 4 9 9 22
47 eefbfeef 1 8 8 5 21
60 bbfaaada 1 4 7 9 20
82 fbbbehbf 1 9 3 8 20
18 dhhfgcbb 1 8 8 3 19
1 ehfdhegg 1 5 7 6 18