Split/extract strings in Pandas series index and expand as DataFrame - python

I have a Pandas series as below:
index Value
'4-5-a' 2
'6-7-d' 3
'9-6-c' 7
'5-3-k' 8
I would like to extract/split the index of the series and form a DataFrame as shown below:
index Value x y
'4-5-a' 2 4 5
'6-7-d' 3 6 7
'9-6-c' 7 9 6
'5-3-k' 8 5 3
What is the best way to do this?

This is one way.
# convert series to dataframe, elevate index to column
df = s.to_frame('Value').reset_index()
# split by dash and exclude final split
df[['x', 'y']] = df['index'].str.split('-', expand=True).iloc[:, :-1].astype(int)
print(df)
index Value x y
0 4-5-a 2 4 5
1 6-7-d 3 6 7
2 9-6-c 7 9 6
3 5-3-k 8 5 3
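For comparison, here is a minimal self-contained sketch of the same idea that uses Series.str.extract with a regular expression instead of str.split. The sample series is rebuilt from the question; the pattern and the variable names `s`, `parts` and `df` are assumptions for illustration, and the sketch presumes every index label has the form digits-digits-something.

```python
import pandas as pd

# Sample series from the question (index labels are plain strings)
s = pd.Series([2, 3, 7, 8], index=['4-5-a', '6-7-d', '9-6-c', '5-3-k'])

# Move the unnamed index into a column called 'index'
df = s.to_frame('Value').reset_index()

# Capture the first two dash-separated numbers; named groups become column names
parts = df['index'].str.extract(r'^(?P<x>\d+)-(?P<y>\d+)-')
df[['x', 'y']] = parts.astype(int)

print(df)
```

If some labels might not match the pattern, str.extract returns NaN for them and the astype(int) call would fail, so such rows would need to be handled first.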


I have a dataframe where some index number are missing how do I cut dataframe previous to that missing index number

Index number 72 is missing from the original dataframe (the original post showed this in an image). I want to cut the dataframe like [0:71,:], so that whenever the index sequence breaks, the dataframe is automatically cut at the last index value before the break.
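Subtract the original index values from the index shifted by one position and test whether the difference is greater than 1. Then reverse the order with [::-1] and apply Series.cummax, so that the flagged row and all rows before it become True, and finally filter with boolean indexing: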
df = pd.DataFrame({'a': range(3,13)}).drop(3)
print (df)
a
0 3
1 4
2 5
4 7
5 8
6 9
7 10
8 11
9 12
df = df[df.index.to_series().shift(-1, fill_value=0).sub(df.index).gt(1)[::-1].cummax()]
print (df)
a
0 3
1 4
2 5
I came up with this:
df = pd.DataFrame({'col':[1,2,3,4,5,6,7,8,9]}, index=[-1,0,1,2,3,4,5,7,8])
ind = next((i for i in range(len(df)-1) if df.index[i]+1!=df.index[i+1]),len(df))+1
print(df.iloc[:ind])
col
-1 1
0 2
1 3
2 4
3 5
4 6
5 7
With numpy, get the values that are equal to a normal range starting from the first index, up to the first mismatch (excluded):
df[np.minimum.accumulate(df.index==np.arange(df.index[0], df.index[0]+len(df)))]
Example:
col
-1 1
0 2
1 3
3 4
4 5
output:
col
-1 1
0 2
1 3
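The approaches above can also be wrapped in a small helper that returns the whole dataframe when the index has no gap at all. This is a sketch under the assumption that the index holds integers; the function name `cut_before_gap` is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def cut_before_gap(df):
    """Return the leading rows of df up to the first break in a +1 integer index."""
    idx = df.index.to_numpy()
    # Positions whose step to the next label is not exactly 1
    breaks = np.flatnonzero(np.diff(idx) != 1)
    # No break: keep everything; otherwise keep rows up to and including the break position
    stop = len(df) if breaks.size == 0 else breaks[0] + 1
    return df.iloc[:stop]

df = pd.DataFrame({'a': range(3, 13)}).drop(3)  # index 0, 1, 2, 4, ..., 9
head = cut_before_gap(df)
print(head)
```

Using np.diff keeps the comparison vectorized, and the explicit no-gap branch avoids returning an empty frame for a fully consecutive index.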

Selecting rows with the highest value based on 1 column in the dataframe

I have a dataframe with about 20k rows and the headings X, Y, Z, I, R, G, B (yes, it is a point cloud).
I would like to create numerous sub-dataframes by sorting the data by column X and grouping it into chunks of 100 rows.
Subsequently, I would like to sort each sub-dataframe by column Y and break it down further into chunks of 50 rows.
The end result should be a group of sub-dataframes of 50 rows each. From each of them I would like to pick out the rows with the highest Z value and write them to a CSV file.
I have got as far as the following code, but I am not sure how to continue.
import pandas as pd
headings = ['x', 'y', 'z']
data = pd.read_table('file.csv', sep=',', skiprows=[0], names=headings)
points = data.sort_values(by=['x'])
Considering a dummy dataframe of 1000 rows,
df.head() # first 5 rows
X Y Z I R G B
0 6 6 0 3 7 0 2
1 0 8 3 6 5 9 7
2 8 9 7 3 0 4 5
3 9 6 8 5 1 0 0
4 9 0 3 0 9 2 9
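First, extract the highest value of Z from the whole dataframe (note that this is the global maximum, not a separate maximum per sub-dataframe), then split and filter:
import pandas as pd
z_max = df['Z'].max()
df = df.sort_values('X')
# list of dataframes: consecutive chunks of 100 rows (the last one may be shorter)
dfs_X = [df.iloc[i:i + 100] for i in range(0, len(df), 100)]
matches = []
for df_x in dfs_X:
    df_x = df_x.sort_values('Y')
    # break each chunk down further into chunks of 50 rows
    dfs_Y = [df_x.iloc[j:j + 50] for j in range(0, len(df_x), 50)]
    for df_y in dfs_Y:
        # keep the rows whose Z equals the highest value
        matches.append(df_y[df_y['Z'] == z_max])
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results = pd.concat(matches)
results.head()
results will contain the rows from all sub-dataframes whose Z equals the highest value of Z.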
Output: First 5 rows
X Y Z I R G B
541 0 0 9 0 3 6 2
610 0 2 9 3 0 7 6
133 0 4 9 3 3 9 9
731 0 5 9 5 1 0 2
629 0 5 9 0 9 7 7
Now, write this dataframe to CSV using results.to_csv().
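The question asks for the highest Z within each sub-dataframe rather than across the whole table. One possible sketch of that variant labels the chunks with integer division and uses groupby instead of explicit loops. The random data, the helper columns `x_block` and `y_block`, and the output file name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the point cloud: 1000 random rows
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(1000, 7)), columns=list('XYZIRGB'))

# Sort by X and label consecutive blocks of 100 rows
df = df.sort_values('X', kind='stable').reset_index(drop=True)
df['x_block'] = np.arange(len(df)) // 100

# Within each X block, sort by Y and label consecutive blocks of 50 rows
df = df.sort_values(['x_block', 'Y']).reset_index(drop=True)
df['y_block'] = df.groupby('x_block').cumcount() // 50

# Keep every row whose Z equals the maximum Z of its own (x_block, y_block) cell
cell_max = df.groupby(['x_block', 'y_block'])['Z'].transform('max')
result = df[df['Z'] == cell_max].drop(columns=['x_block', 'y_block'])

# result.to_csv('highest_z.csv', index=False)  # hypothetical output file name
```

Rows that tie for the maximum inside a cell are all kept, which matches the request to "pick out all the rows with the highest Z value".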

Pandas Dataframe Create New Column That is Row Below Current Row's Value [duplicate]

I've got a pandas dataframe. I want to 'lag' one of my columns. Meaning, for example, shifting the entire column 'gdp' up by one, and then removing all the excess data at the bottom of the remaining rows so that all columns are of equal length again.
df =
y gdp cap
0 1 2 5
1 2 3 9
2 8 7 2
3 3 4 7
4 6 7 7
df_lag =
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
Is there any way to do this?
In [44]: df['gdp'] = df['gdp'].shift(-1)
In [45]: df
Out[45]:
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
4 6 NaN 7
In [46]: df[:-1]
Out[46]:
y gdp cap
0 1 3 5
1 2 7 9
2 8 4 2
3 3 7 7
shift column gdp up:
df.gdp = df.gdp.shift(-1)
and then remove the last row
Time has passed, and the current pandas documentation recommends assigning this way:
df.loc[:, 'gdp'] = df.gdp.shift(-1)
To easily shift by 5 values for example and also get rid of the NaN rows, without having to keep track of the number of values you shifted by:
df['gdp'] = df['gdp'].shift(-5)
df = df.dropna()
First shift the column:
df['gdp'] = df['gdp'].shift(-1)
Second remove the last row which contains an NaN Cell:
df = df[:-1]
Third reset the index:
df = df.reset_index(drop=True)
df.gdp = df.gdp.shift(-1)  # shift up
df.drop(df.index[-1], inplace=True)  # remove the last row from the dataframe itself
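The shift-then-trim steps from the answers above can be combined into one short, self-contained sketch. It uses the data from the question; the variable names `n` and `df_lag` are chosen for illustration. Restricting dropna to the shifted column avoids discarding rows that have NaN in unrelated columns, although rows where gdp was already NaN before the shift would still be dropped.

```python
import pandas as pd

df = pd.DataFrame({'y': [1, 2, 8, 3, 6],
                   'gdp': [2, 3, 7, 4, 7],
                   'cap': [5, 9, 2, 7, 7]})

n = 1  # number of rows to shift the column up by

# assign returns a new frame, so the original df is left untouched
df_lag = df.assign(gdp=df['gdp'].shift(-n))

# Drop only the rows the shift left empty, then restore the integer dtype
df_lag = df_lag.dropna(subset=['gdp']).astype({'gdp': int})

print(df_lag)
```

Changing `n` is enough to lag by a different number of rows, with no need to track how many rows to slice off.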

In pandas Dataframe with multiindex how can I filter by order?

Assume the following dataframe
>>> import pandas as pd
>>> L = [(1,'A',9,9), (1,'C',8,8), (1,'D',4,5),(2,'H',7,7),(2,'L',5,5)]
>>> df = pd.DataFrame.from_records(L).set_index([0,1])
>>> df
     2  3
0 1
1 A  9  9
  C  8  8
  D  4  5
2 H  7  7
  L  5  5
I want to filter the rows in the nth position of level 1 of the multiindex within each level-0 group, i.e. selecting the first:
     2  3
0 1
1 A  9  9
2 H  7  7
or selecting the third:
     2  3
0 1
1 D  4  5
How can I achieve this?
You can filter rows with the help of GroupBy.nth after performing grouping on the first level of the multi-index DF. Since n follows the 0-based indexing approach, you need to provide the values appropriately to it as shown:
1) To select the first row grouped per level=0:
df.groupby(level=0, as_index=False).nth(0)
2) To select the third row grouped per level=0:
df.groupby(level=0, as_index=False).nth(2)
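An alternative sketch that makes the 0-based position explicit uses GroupBy.cumcount to number the rows inside each level-0 group and then filters with a boolean mask. The variable names `pos`, `first` and `third` are chosen for illustration.

```python
import pandas as pd

L = [(1, 'A', 9, 9), (1, 'C', 8, 8), (1, 'D', 4, 5), (2, 'H', 7, 7), (2, 'L', 5, 5)]
df = pd.DataFrame.from_records(L).set_index([0, 1])

# 0-based position of each row inside its level-0 group
pos = df.groupby(level=0).cumcount()

first = df[pos == 0]  # rows (1, 'A') and (2, 'H')
third = df[pos == 2]  # only (1, 'D'), because group 2 has just two rows

print(first)
print(third)
```

Because the mask has the same index as the original frame, the multiindex is preserved in the result, and the same `pos` series can be reused for several positions, for example with `pos.isin([0, 2])`.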

Pandas: Get highest n rows based on multiple columns and they are matching each other

Suppose I have a pandas DataFrame in which the values highlighted in red in columns C and E are the 10 highest numbers in each column.
How can I get a DataFrame that contains only the rows that are among the highest 10 in both columns? If a value is in the highest 10 of one column but not of both, the row should be ignored.
At the moment I do this with loops: I first loop through each column separately and save the row index if the value is in the highest 10, and then I loop a third time to exclude the indexes which are not in both. This is very inefficient since I work with a table of over 100000 rows. Is there a better way to do it?
Consider the example dataframe df
import numpy as np
import pandas as pd
np.random.seed([3,1415])
rng = np.arange(10)
df = pd.DataFrame(
dict(
A=rng,
B=list('abcdefghij'),
C=np.random.permutation(rng),
D=np.random.permutation(rng)
)
)
print(df)
A B C D
0 0 a 9 1
1 1 b 4 3
2 2 c 5 5
3 3 d 1 9
4 4 e 7 4
5 5 f 6 6
6 6 g 8 0
7 7 h 3 2
8 8 i 2 7
9 9 j 0 8
Use nlargest to find the n largest values in each column, then use query to filter the dataframe, referencing the local variables with the @ prefix:
n = 5
c_lrgst = df.C.nlargest(n)
d_lrgst = df.D.nlargest(n)
df.query('C in @c_lrgst & D in @d_lrgst')
A B C D
2 2 c 5 5
5 5 f 6 6
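A variation on the same idea, sketched below, intersects the row labels returned by nlargest instead of comparing values through query. The data reuses columns C and D from the example above; the names `both` and `top` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'C': [9, 4, 5, 1, 7, 6, 8, 3, 2, 0],
                   'D': [1, 3, 5, 9, 4, 6, 0, 2, 7, 8]})
n = 5

# nlargest keeps the original row labels, so the two indexes can be intersected
both = df['C'].nlargest(n).index.intersection(df['D'].nlargest(n).index)

# Sort the labels so the rows come back in their original order
top = df.loc[both.sort_values()]
print(top)
```

Unlike the value-based query, this compares row labels, so values that tie at the cut-off are handled differently: nlargest returns exactly n rows per column by default.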
