I'm using pandas and I have a small amount of data like this:
4 1
5 8
6 25
7 33
8 24
9 4
and I want to fill in the missing parts, like this:
1 0
2 0
3 0
4 1
5 8
6 25
7 33
8 24
9 4
10 0
In the end I want a list, like this: [0, 0, 0, 1, 8, 25, 33, 24, 4, 0].
I looked for a solution but couldn't find any. Any ideas?
Try with reindex (start the range at 1 so you get exactly ten entries):
l = s.reindex(range(1, 10 + 1), fill_value=0).tolist()
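A minimal runnable version (assuming the data is held in a Series s whose index is the first column):

```python
import pandas as pd

# the first column becomes the index, the second the values
s = pd.Series([1, 8, 25, 33, 24, 4], index=[4, 5, 6, 7, 8, 9])

# reindex over the full range 1..10; missing positions are filled with 0
l = s.reindex(range(1, 10 + 1), fill_value=0).tolist()
print(l)  # [0, 0, 0, 1, 8, 25, 33, 24, 4, 0]
```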
I have a dataframe that looks like this:
ID Age Score
0 9 5 3
1 4 6 1
2 9 7 2
3 3 2 1
4 12 1 15
5 2 25 6
6 9 5 4
7 9 5 61
8 4 2 12
I want to sort based on the first column, then the second column, and so on.
So I want my output to be this:
ID Age Score
5 2 25 6
3 3 2 1
8 4 2 12
1 4 6 1
0 9 5 3
6 9 5 4
7 9 5 61
2 9 7 2
4 12 1 15
I know I can do the above with df.sort_values(df.columns.to_list()); however, I'm worried this might be quite slow for much larger dataframes (in terms of both columns and rows).
Is there a more optimal solution?
You can use numpy.lexsort to improve performance.
import numpy as np
import pandas as pd

a = df.to_numpy()
# lexsort treats its last key as the primary one; rot90 puts the first
# column last, so rows end up sorted by the first column, then the rest
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)
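As a quick sanity check (a sketch, not part of the original answer), the lexsort ordering should agree with sort_values over all columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1000, 5)))

a = df.to_numpy()
out = pd.DataFrame(a[np.lexsort(np.rot90(a))],
                   index=df.index, columns=df.columns)

# same ordering as sorting on every column, first column primary
expected = df.sort_values(df.columns.to_list()).reset_index(drop=True)
assert out.reset_index(drop=True).equals(expected)
```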
Assuming as input a random square DataFrame of side n:
df = pd.DataFrame(np.random.randint(0, 100, size=(n, n)))
here is the comparison for 100 to 100M items (lower runtime is better):
[benchmark plots: absolute runtime, and the same graph with speed relative to pandas]
Still using df.sort_values(), you can speed it up a bit by selecting the sorting algorithm via the kind parameter. By default it is 'quicksort', but there are the alternatives 'mergesort', 'heapsort' and 'stable'. Maybe specifying one of these would improve it? (Note, though, that per the pandas docs, kind is only applied when sorting on a single column or label, so it may have no effect in this multi-column case.)
df.sort_values(df.columns.to_list(), kind="mergesort")
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html
I can't wrap my head around the best way to accomplish this.
The visualization below is what I would like to accomplish. I'm not sure what you would call it exactly, but essentially I want to iterate through rows and columns and make a new dataframe with the x, the y, and the value at their intersection; I'm just reshaping the dataframe and don't want to lose any values.
I'm doing this to learn pandas, so any help on how to think about this, or the best way to solve it, would be greatly appreciated.
Input:
   1   2   3
1 10 15 20
2 11 16 21
3 12 17 22
Desired output:
x  y  z
1 1 10
1 2 15
1 3 20
2 1 11
2 2 16
2 3 21
3 1 12
3 2 17
3 3 22
Given your dataset and expected output, you are looking for pandas melt. Iterating over a dataframe is slow and inefficient; since you are learning, I strongly suggest looking for more efficient ways of working (for example, vectorizing an operation, or pivoting) instead of using for loops. The last line of the proposed solution merely reorders the columns to match your desired output. Kindly try the following:
import pandas as pd

df = pd.DataFrame({1: [10, 11, 12],
                   2: [15, 16, 17],
                   3: [20, 21, 22]})

# melt turns the columns into 'y'; cumcount recovers each value's row
# position within its former column, which becomes 'x'
df = df.melt(value_vars=[1, 2, 3], var_name='y', value_name='z')
df['x'] = df.groupby('y').cumcount() + 1
df = df.sort_values(['x', 'y']).reset_index(drop=True)
df[['x', 'y', 'z']]
Outputs:
   x  y   z
0  1  1  10
1  1  2  15
2  1  3  20
3  2  1  11
4  2  2  16
5  2  3  21
6  3  1  12
7  3  2  17
8  3  3  22
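For comparison (not part of the original answer), the same reshape can be written with stack, which matches the desired output directly because it pairs each value with its row and column label:

```python
import pandas as pd

df = pd.DataFrame({1: [10, 11, 12],
                   2: [15, 16, 17],
                   3: [20, 21, 22]})

# stack moves the columns into the index, giving one (row, column) pair
# per value; the row labels start at 0, so shift x up by one
out = df.stack().rename_axis(['x', 'y']).reset_index(name='z')
out['x'] += 1
```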
The problem I have is that I want to access the data in a dataframe, but only the last twelve rows of every column. So I have a dataframe:
index A B C
20 1 2 3
21 2 5 6
22 7 8 9
23 10 1 2
24 3 1 2
25 4 9 0
26 10 11 12
27 1 2 3
28 2 1 5
29 6 7 8
30 8 4 5
31 1 3 4
32 1 2 3
33 5 6 7
34 1 3 4
The values inside A, B, C are not important; they are just an example.
Currently I am using
df1 = df2.iloc[23:35]
Perhaps there is an easier way to do this, because I have to do it for around 20 different dataframes of different sizes. I know that if I use
df1 = df2.iloc[-1]
it will return the last row, but I don't know how to extend that to the last twelve rows. Any help would be appreciated.
You can get the last n rows of a DataFrame by:
df.tail(n)
or
df.iloc[-n:]
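For the twelve rows in the question, both spellings give the same frame and adapt to dataframes of any length:

```python
import pandas as pd

df = pd.DataFrame({'A': range(15), 'B': range(15, 30)})

last12 = df.tail(12)    # last twelve rows
same12 = df.iloc[-12:]  # equivalent positional slice
print(len(last12))      # 12
```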
I imported the data from a CSV file with pandas. I want to split a column that contains 50 values (0 to 49) into 5 rows of ten values each. Can anyone tell me how I can do this transposition as a pandas dataframe?
Let me rephrase what I said: I want to select the second column and split it into rows of 10 values each.
This is the code I have so far:
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df.iloc[:50,:2]
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(mycolumn=np.random.randint(10, size=50)))
Using numpy and reshaping, ignoring the index:
pd.DataFrame(df.mycolumn.values.reshape(5, -1))
0 1 2 3 4 5 6 7 8 9
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
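Applied to the question's CSV setup (column names here are placeholders, since the real ones aren't shown), selecting the second column and reshaping looks like:

```python
import numpy as np
import pandas as pd

# stand-in for the CSV: the second column holds the 50 values
df = pd.DataFrame({'first': np.zeros(50), 'second': np.arange(50)})

# take the 50 values from the second column and lay them out as 5 rows of 10
reshaped = pd.DataFrame(df.iloc[:50, 1].to_numpy().reshape(5, -1))
print(reshaped.shape)  # (5, 10)
```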
The head of my pandas dataframe, df, is shown below:
count1 count2 totalcount season
0 3 13 16 1
1 8 32 40 1
2 5 27 32 1
3 3 10 13 1
4 0 1 1 1
I'd like to make boxplots of count1, count2, and totalcount, grouped by season (there are 4 seasons) and have each set of box plots show up on their own subplot in a single figure.
When I do this with only two of the columns, say count1 and count2, everything looks great.
df.boxplot(['count1', 'count2'], by='season')
But when I add totalcount to the mix, the axis limits go haywire.
df.boxplot(['count1', 'count2', 'totalcount'], by='season')
This happens regardless of the order of the columns. I realize there are several ways around this problem, but it would be much more convenient if this worked properly.
Am I missing something? Is this a known bug in Pandas? I wasn't able to find anything in my first pass of the Pandas bug reports.
I'm using Pandas 0.14.0 and matplotlib 1.3.1.
Have you tried upgrading your pandas/matplotlib packages?
I'm using pandas 0.13.1 + matplotlib 1.2.1, and this is the plot I get:
In [31]: df
Out[31]:
count1 count2 totalcount season
0 3 13 16 1
1 8 32 40 1
2 5 27 32 1
3 3 10 13 1
4 0 1 1 1
5 3 13 16 2
6 8 32 40 2
7 5 27 32 3
8 3 10 13 3
9 0 1 1 4
10 3 10 13 4
11 3 13 16 4
[12 rows x 4 columns]
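The question notes that workarounds exist; one such workaround (a sketch, not taken from the answers) is to draw each column on its own explicitly created axes, so each subplot computes its limits independently:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for the example
import matplotlib.pyplot as plt
import pandas as pd

# the twelve rows shown in the answer above
df = pd.DataFrame({'count1':     [3, 8, 5, 3, 0, 3, 8, 5, 3, 0, 3, 3],
                   'count2':     [13, 32, 27, 10, 1, 13, 32, 27, 10, 1, 10, 13],
                   'totalcount': [16, 40, 32, 13, 1, 16, 40, 32, 13, 1, 13, 16],
                   'season':     [1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4]})

# one explicit axes per column, so limits are computed per subplot
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ['count1', 'count2', 'totalcount']):
    df.boxplot(column=col, by='season', ax=ax)
fig.suptitle('')  # drop the automatic "Boxplot grouped by season" title
```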