gd = df.groupby(['subID'])['Accuracy'].std()
print(gd)
subID
4 0.810423
5 0.841364
6 0.881007
8 0.763175
9 0.760102
10 0.905283
12 0.928358
14 0.779291
15 0.912377
1018 0.926683
It displays like this, and I assume it is a Series, not a DataFrame. I want to change the last index label from 1018 to 13.
Use rename with a dictionary, because the first column here is the index of the Series:
gd = gd.rename({1018: 13})
Equivalently, being explicit that the mapping applies to the index:
gd = gd.rename(index={1018: 13})
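A minimal runnable sketch of the rename, using made-up accuracy values on a small Series:

```python
import pandas as pd

# Made-up stand-in for the groupby result: a Series indexed by subID
gd = pd.Series([0.810423, 0.926683], index=[4, 1018], name='Accuracy')
gd.index.name = 'subID'

# rename with a dict maps old index labels to new ones; values are untouched
gd = gd.rename({1018: 13})
print(gd.index.tolist())  # [4, 13]
```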
I have a variable named Strike whose value changes regularly because it is updated inside a loop.
This is my DataFrame; code example:
for i in range(len(df.strike)):
    Strike = df.strike.iloc[i]
    list1 = ['0-50', '50-100', '100-150'.......]
    list2 = [2000, 132.4, 1467.40, ..........]
    df = []  # Here I have to create the DataFrame
Strike contains values like 33000, 33100, 33200, 33300, ... and holds at least 145 values, which I want to make the rows.
I also have two lists, which also change from time to time because they are inside a loop.
list1 = ['0-50', '50-100', '100-150'.......]
I want to make list1 the columns.
And list2 contains numeric values:
list2 = [2000, 132.4, 1467.40, ..........]
I need a DataFrame in this format: list1 should be the column names, list2 should be the values, and the Strike variable should be the rows.
But I don't understand how I can create this DataFrame.
IIUC you could use the DataFrame constructor directly with a reshaped numpy array as input:
import numpy as np
import pandas as pd

# dummy example
list2 = list(range(4*7))
list1 = ['0-50', '50-100', '100-150', '150-200']
# replace by df.strike
strike = [33000, 33100, 33200, 33300, 33400, 33500, 33600]

df = pd.DataFrame(np.array(list2).reshape((-1, len(list1))),
                  index=strike, columns=list1)
output:
0-50 50-100 100-150 150-200
33000 0 1 2 3
33100 4 5 6 7
33200 8 9 10 11
33300 12 13 14 15
33400 16 17 18 19
33500 20 21 22 23
33600 24 25 26 27
df = sample.groupby('id')['user_id'].apply(list).reset_index(name='new')
This gives me:
id new
0 429 [659500]
1 1676 [2281394]
2 2389 [3973559]
3 2810 [4382598]
4 3104 [4733375]
5 3447 [5519461]
6 3818 [4453354]
7 3846 [4514870]
8 4283 [6378476]
9 4626 [6670089]
10 5022 [1116244]
11 5213 [6913646]
12 5899 [8213945, 8210403]
13 5962 [8733646]
However, new is a Series; how can I get 'new' as a list of strings in a DataFrame?
I've tried df['new_id'] = df.loc[:, ['new']], thinking that this would at least solve my Series issue, since print(type(df.loc[:, ['new']])) returns a DataFrame.
Try this:
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
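A minimal sketch of this on made-up data (the ids and user_ids are invented for illustration):

```python
import pandas as pd

# Made-up sample data: each id maps to one or more user_ids
sample = pd.DataFrame({'id':      [429, 5899, 5899],
                       'user_id': [659500, 8213945, 8210403]})

# Build an id -> list-of-user_ids Series with agg(list), then map it back
# onto each row, so every row carries the full list for its id
sample['new_id'] = sample['id'].map(sample.groupby('id')['user_id'].agg(list))
print(sample)
```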
I am facing a weird scenario. I have a DataFrame holding the 3 largest scores for each unique id, like this:
id  rid  code  score
 1    9    67     43
 1    8    87     22
 1    4    32     20
 2    3    56     43
 3   10    22    100
 3    5    67     50
The id column repeats, but the rows are distinct.
I want to make my data frame like this:
id  first_code  second_code  third_code
 1      67           87           32
 2      56          none         none
 3      22           67          none
So I have built a DataFrame showing the top 3 scores per id. If there aren't 3 values, I take the top 2 or the single value available. Based on the score, I want to rearrange the code column into three separate columns: first_code holds the code with the highest score, second_code the second highest, and third_code the third highest. If a value is not found, the cell should be left blank.
Kindly help me to solve this.
Use GroupBy.cumcount as a counter, create a MultiIndex, and reshape with Series.unstack:
df = df.set_index(['id', df.groupby('id').cumcount()])['code'].unstack()
df.columns = ['first_code', 'second_code', 'third_code']
df = df.reset_index()
print(df)
id first_code second_code third_code
0 1.0 67.0 87.0 32.0
1 2.0 56.0 NaN NaN
2 3.0 22.0 67.0 NaN
Btw, cumcount should also be used in the previous code to filter the top 3 values per group.
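A small sketch of that filtering step on made-up data, assuming the rows are already sorted by score within each id:

```python
import pandas as pd

# Made-up data, already sorted by score within each id
df = pd.DataFrame({'id':    [1, 1, 1, 1, 2],
                   'score': [43, 22, 20, 5, 43]})

# cumcount numbers rows 0, 1, 2, ... within each group; keep the first three
top3 = df[df.groupby('id').cumcount() < 3]
print(top3)
```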
I am using topic_.set_value(each_topic, word, prob) to change the value of cells in a pandas DataFrame. Basically, I initialized a numpy array with a certain shape and converted it to a pandas DataFrame. I am then replacing these zeros by iterating over all the columns and rows using the code above. The problem is that the number of cells is around 50,000, and every time I set a value pandas prints the DataFrame to the console. I want to suppress this behavior. Any ideas?
EDIT
I have two dataframes: topic_, which is the target dataframe, and tw, which is the source dataframe. topic_ is a topic-by-word matrix, where each cell stores the probability of a word occurring in a particular topic. I have initialized the topic_ dataframe to zeros using numpy.zeros. A sample of the tw dataframe:
print(tw)
topic_id word_prob_pair
0 0 [(customer, 0.061703717964), (team, 0.01724444...
1 1 [(team, 0.0260560163563), (customer, 0.0247838...
2 2 [(customer, 0.0171786268847), (footfall, 0.012...
3 3 [(team, 0.0290787264225), (product, 0.01570401...
4 4 [(team, 0.0197917953222), (data, 0.01343226630...
5 5 [(customer, 0.0263740639141), (team, 0.0251677...
6 6 [(customer, 0.0289764173735), (team, 0.0249938...
7 7 [(client, 0.0265082412402), (want, 0.016477447...
8 8 [(customer, 0.0524006965405), (team, 0.0322975...
9 9 [(generic, 0.0373422774996), (product, 0.01834...
10 10 [(customer, 0.0305256248248), (team, 0.0241559...
11 11 [(customer, 0.0198707090364), (ad, 0.018516805...
12 12 [(team, 0.0159852971954), (customer, 0.0124540...
13 13 [(team, 0.033444510469), (store, 0.01961003290...
14 14 [(team, 0.0344793243818), (customer, 0.0210975...
15 15 [(team, 0.026416114692), (customer, 0.02041691...
16 16 [(campaign, 0.0486186973667), (team, 0.0236024...
17 17 [(customer, 0.0208270072145), (branch, 0.01757...
18 18 [(team, 0.0280889397541), (customer, 0.0127932...
19 19 [(team, 0.0297011415217), (customer, 0.0216007...
My topic_ dataframe is of size num_topics (which is 20) by number_of_unique_words (in the tw dataframe).
Following is the code I am using to replace each value in the topic_ dataframe:
for each_topic in range(num_topics):
    a = tw['word_prob_pair'].iloc[each_topic]
    for word, prob in a:
        topic_.set_value(each_topic, word, prob)
set_value returns the DataFrame, and the interactive shell echoes any unassigned return value, so just redirect the output into a variable:
>>> df.set_value(index=1,col=0,value=1)
0 1
0 0.621660 -0.400869
1 1.000000 1.585177
2 0.962754 1.725027
3 0.773112 -1.251182
4 -1.688159 2.372140
5 -0.203582 0.884673
6 -0.618678 -0.850109
>>> a=df.set_value(index=1,col=0,value=1)
>>>
To initialize the df, it's better to use this:
pd.DataFrame(np.zeros_like(pd_n), index=pd_n.index, columns=pd_n.columns)
If you do not wish to create a variable ('a' in the suggestion above), then use Python's throwaway variable _. So your statement becomes:
_ = df.set_value(index=1,col=0,value=1)
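As a side note, set_value was deprecated in later pandas versions and eventually removed; in current pandas the supported scalar setter is .at, which is a plain assignment statement and therefore never echoes the frame. A minimal sketch on a small dummy frame:

```python
import numpy as np
import pandas as pd

# Small dummy frame standing in for topic_
df = pd.DataFrame(np.zeros((2, 2)))

# .at[row, col] = value is an assignment, so nothing is printed to the console
df.at[1, 0] = 1.0
print(df)
```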
I have this code:
import pandas as pd
data = pd.read_csv("test.csv", sep=",")
The data array looks like this:
The problem is that I can't split it by columns, like this:
week = data[:,1]
It should split the second column into week, but it doesn't; instead it raises:
TypeError: unhashable type: 'slice'
How should I do this to make it work?
I'm also wondering what this code does exactly (I don't really understand the np.newaxis part):
week = data['1'][:, np.newaxis]
There are a few issues here.
First, read_csv uses a comma as a separator by default, so you don't need to specify that.
Second, the pandas csv reader by default uses the first row to get column headings. That doesn't appear to be what you want, so you need to use the header=None argument.
Third, it looks like your first column is the row number. You can use index_col=0 to use that column as the index.
Fourth, for pandas, the first index is the column, not the row. Further, using the standard data[ind] notation is indexing by column name, rather than column number. And you can't use a comma to index both row and column at the same time (you need to use data.loc[row, col] to do that).
So for your case, all you need to do to get the second columns is data[2], or if you use the first column as the row number then the second column becomes the first column, so you would do data[1]. This returns a pandas Series, which is the 1D equivalent of a 2D DataFrame.
So the whole thing should look like this:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0)
week = data[1]
data looks like this:
1 2 3 4
0
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 22 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 60
The '0' row doesn't exist; it is just the name pandas prints for the index column.
week looks like this:
0
1 10
2 15
3 25
4 22
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: 1, dtype: int64
However, you can give columns (and rows) meaningful names in pandas, and then access them by those names. I don't know the column names, so I just made some up:
import pandas as pd
data = pd.read_csv('test.csv', header=None, index_col=0, names=['week', 'spam', 'eggs', 'grail'])
week = data['week']
In this case, data looks like this:
week spam eggs grail
1 10 2 100 12
2 15 5 150 15
3 25 7 240 20
4 33 12 350 20
5 51 13 552 20
6 134 20 880 36
7 150 22 900 38
8 200 29 1020 44
9 212 31 1100 46
10 199 23 1089 45
11 220 32 1145 50
And week looks like this:
1 10
2 15
3 25
4 33
5 51
6 134
7 150
8 200
9 212
10 199
11 220
Name: week, dtype: int64
For np.newaxis, what that does is add one dimension to the array. So say you have a 1D array (a vector); using np.newaxis on it would turn it into a 2D array. It would turn a 2D array into a 3D array, 3D into 4D, and so on. Depending on where you put it (such as [:, np.newaxis] vs. [np.newaxis, :]), you can determine which dimension to add. So np.arange(10)[np.newaxis, :] (or just np.arange(10)[np.newaxis]) gives you a shape (1, 10) 2D array, while np.arange(10)[:, np.newaxis] gives you a shape (10, 1) 2D array.
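The shape changes described above can be checked directly:

```python
import numpy as np

a = np.arange(10)            # shape (10,): a plain 1D array
row = a[np.newaxis, :]       # shape (1, 10): new leading dimension
col = a[:, np.newaxis]       # shape (10, 1): new trailing dimension
print(row.shape, col.shape)
```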
In your case, what the line is doing is getting the column with the name 1, which is a 1D pandas Series, then adding a new dimension to it. However, instead of turning it back into a DataFrame, it instead converts it into a 1D numpy array, then adds one dimension to make it a 2D numpy array.
This, however, is dangerous long-term. There is no guarantee that this sort of silent conversion won't be changed at some point. To change a pandas object to a numpy one, you should use an explicit conversion with the values attribute, so in your case data.values or data['1'].values.
However, you don't really need a numpy array. A series is fine. If you really want a 2D object, you can convert a Series into a DataFrame by using something like data['1'].to_frame().
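A quick sketch of both explicit conversions on a small made-up Series:

```python
import pandas as pd

s = pd.Series([10, 15, 25], name='week')

arr = s.values         # explicit conversion to a 1D numpy array
frame = s.to_frame()   # 2D pandas DataFrame with one column named 'week'
print(arr.shape, frame.shape)
```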