Example DataFrame
df = pd.DataFrame(np.random.randint(2200, 3300, 50), index=np.random.randint(0, 6, 50), columns=list('A'))
Below is a sample of what the data would look like
A
5 2393
4 2421
0 3038
5 2914
4 2559
4 2314
5 3006
3 2553
0 2642
3 2441
3 2512
0 2412
What I would like to do is drop the first n (let's use 2 for this example) records of each index value. So from the previous data example it would become...
A
4 2314
5 3006
3 2512
0 2412
Any guidance here would be appreciated. I haven't been able to get anything to work.
Use tail with n=-2; a negative n keeps all but the first 2 rows of each group:
df.groupby(level=0, group_keys=False).apply(pd.DataFrame.tail, n=-2)
A
0 2412
3 2512
4 2314
5 3006
To preserve the original group order, pass sort=False:
df.groupby(level=0, group_keys=False, sort=False).apply(pd.DataFrame.tail, n=-2)
A
5 3006
4 2314
0 2412
3 2512
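An apply-free alternative is a cumcount-based filter; here's a minimal sketch using the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [2393, 2421, 3038, 2914, 2559, 2314,
                         3006, 2553, 2642, 2441, 2512, 2412]},
                  index=[5, 4, 0, 5, 4, 4, 5, 3, 0, 3, 3, 0])

n = 2
# cumcount numbers the rows within each index group, so this keeps only
# rows at position n or later in their group, preserving row order
result = df[df.groupby(level=0).cumcount() >= n]
print(result)
```

This keeps the original row order without needing sort=False.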
If I understand your question, you want a new dataframe that has the first n rows of your dataframe removed. If that's what you want, I would reset the index, then drop based on pandas' default index, then put the original index back. Here's how you might do that.
df = pd.DataFrame(data=np.random.randint(2200, 3300, 50),
                  index=np.random.randint(0, 6, 50),
                  columns=list('A'))
n = 5
print(df.head(n * 2))
df_new = df.reset_index().drop(range(n)).set_index('index')
print(df_new.head(n))
I have a dataset where I want to fill the columns and rows in Python as shown below:
Dataset:
| P | Q |
|678|1420|
|678|---|
|609|---|
|583|1260|
|---|1260|
|---|1261|
|---|1262|
|584|1263|
|---|403|
|---|---|
Expected Result:
| P | Q |
|678|1420|
|678|1420|
|609|---|
|583|1260|
|583|1260|
|583|1261|
|583|1262|
|584|1263|
|584|403|
|584|403|
I have filled column P using fillna(), but cannot do the same for column Q, since its values need to be filled per key pair.
Can someone please help?
With your shown samples, please try the following.
df['P'] = df['P'].ffill()
df['Q'] = df.groupby('P')['Q'].ffill()
Output will be as follows:
P Q
0 678.0 1420.0
1 678.0 1420.0
2 609.0 NaN
3 583.0 1260.0
4 583.0 1260.0
5 583.0 1261.0
6 583.0 1262.0
7 584.0 1263.0
8 584.0 403.0
9 584.0 403.0
ffill documentation
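As a self-contained sketch of the two-step fill above (data reconstructed from the question's sample):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'P': [678, 678, 609, 583, np.nan, np.nan, np.nan, 584, np.nan, np.nan],
    'Q': [1420, np.nan, np.nan, 1260, 1260, 1261, 1262, 1263, 403, np.nan],
})

# Forward-fill P first, then forward-fill Q only within each P group,
# so a Q value never leaks across two different P keys
df['P'] = df['P'].ffill()
df['Q'] = df.groupby('P')['Q'].ffill()
print(df)
```

Note that the Q value for P=609 stays NaN, since that group has no earlier Q to fill from.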
I have a dataframe that looks like this
DEP_TIME
0 1851
1 1146
2 2016
3 1350
4 916
...
607341 554
607342 633
607343 657
607344 705
607345 628
I need to get every value in this column DEP_TIME to have the format hh:mm.
All cells are of type string and can remain that type.
Some cells are only missing the colon (rows 0 to 3), others are also missing the leading 0 (rows 4+).
Some cells are empty and should ideally have string value of 0.
I need to do it in an efficient way since I have a few million records. How do I do it?
Use to_datetime with Series.dt.strftime:
df['DEP_TIME'] = (pd.to_datetime(df['DEP_TIME'], format='%H%M', errors='coerce')
.dt.strftime('%H:%M')
.fillna('00:00'))
print(df)
DEP_TIME
0 18:51
1 11:46
2 20:16
3 13:50
4 09:16
607341 05:54
607342 06:33
607343 06:57
607344 07:05
607345 06:28
import re
import numpy as np
import pandas as pd
d = [['1851'],
['1146'],
['2016'],
['916'],
['814'],
[''],
[np.nan]]
df = pd.DataFrame(d, columns=['DEP_TIME'])
df['DEP_TIME'] = df['DEP_TIME'].fillna('0')
df['DEP_TIME'] = df['DEP_TIME'].apply(lambda y: '0' if y=='' else re.sub(r'(\d{1,2})(\d{2})$', lambda x: x[1].zfill(2)+':'+x[2], y))
df
DEP_TIME
0 18:51
1 11:46
2 20:16
3 09:16
4 08:14
5 0
6 0
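A purely string-based variant is also possible; this is a sketch that assumes empty/missing cells should become '00:00' (as in the first answer) rather than '0':

```python
import pandas as pd

s = pd.Series(['1851', '916', '', None], name='DEP_TIME')

# Normalise everything to 4 digits, then splice the colon in:
# missing/empty cells become '0000', '916' becomes '0916', etc.
padded = s.fillna('').replace('', '0000').str.zfill(4)
out = padded.str[:2] + ':' + padded.str[2:]
print(out.tolist())
```

String slicing stays vectorised, so this should also scale to a few million rows.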
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements, using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
 # .drop(columns = 'gestationalAgeInWeeks') # don't need this
 .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc'] # change here
 .max()
 .unstack()
 .add_prefix('abdomCirc_') # prefix the new columns
 .reset_index() # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13)
.pivot_table(index= ['MotherID', 'PregnancyID'], columns='tm',
values= 'abdomCirc', aggfunc='max')
.add_prefix('abdomCirc_') # remove this if you don't want the prefix
.reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 0 0 NaN 200.0 NaN
1 1 1 NaN 315.0 350.0
2 2 2 180.0 NaN NaN
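Putting the pivot_table version together as a runnable sketch (data reconstructed from the question's sample):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID':    [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc':   [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Ceiling-divide weeks into trimesters, pivot with the ids as the index,
# then reset_index so MotherID/PregnancyID survive as real columns
out = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
         .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                      values='abdomCirc', aggfunc='max')
         .add_prefix('abdomCirc_')
         .reset_index())
print(out)
```

Because the ids are ordinary columns after reset_index, the result can be merged with another dataframe on them directly.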
In the result below (first 5 rows), you can see the Freq column and the rolling means (window of 3) column MMeans calculated using pandas:
Freq MMeans
0 215 NaN
1 453 NaN
2 277 315.000000
3 38 256.000000
4 1 105.333333
I was expecting MMeans to start at index 1, since 315 is the mean of rows 0, 1 and 2. Is there an option that I am missing with the rolling method?
edit 1
print(pd.DataFrame({
    'Freq': eff,
    'MMeans': dF['Freq'].rolling(3).mean()}))
edit 2
Sorry @Yuca for not being as clear as I'd like. Here are the columns I'd like pandas to return:
Freq MMeans
0 215 NaN
1 453 315.000000
2 277 256.000000
3 38 105.333333
4 1 29.666667
which are not the results returned with min_periods=2
Use min_periods=1:
df['rol_mean'] = df['Freq'].rolling(3,min_periods=1).mean()
output:
Freq MMeans rol_mean
0 215 NaN 215.000000
1 453 NaN 334.000000
2 277 315.000000 315.000000
3 38 256.000000 256.000000
4 1 105.333333 105.333333
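Incidentally, the expected MMeans column in the question matches a centered window rather than a trailing one; here's a sketch of that reading (hypothetical, since only the first five Freq values are shown, so index 4 would depend on the unshown next element):

```python
import pandas as pd

s = pd.Series([215, 453, 277, 38, 1], name='Freq')

# center=True aligns each 3-row mean with the middle row of its window,
# so index 1 holds the mean of rows 0, 1 and 2
mm = s.rolling(3, center=True).mean()
print(mm)
```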
I have tried this a few ways and am stumped. My last attempt generates an error that says: "ValueError: Plan shapes are not aligned"
So I have a dataframe that can have up to about 1,000 columns in it based on data read in from an external file. The columns are all going to have their own labels/names, i.e. "Name", "BirthYear", "Hometown", etc. I want to add a row at the beginning of the dataframe that runs from 0 up to the number of columns minus one, so if the data ends up having 232 columns, this new first row would have values of 0,1,2,3,4....229,230,231.
What I am doing is creating a one-row dataframe with as many columns/values as there are in the main ("mega") dataframe, and then concatenating them. It throws this shape error at me, but when I print the shape of each frame, they match up in terms of length. Not sure what I am doing wrong, any help would be appreciated. Thank you!
colList = list(range(0, len(mega.columns)))
indexRow = pd.DataFrame(colList).T
print(indexRow)
print(indexRow.shape)
print(mega.shape)
mega = pd.concat([indexRow, mega],axis=0)
Here is the result...
0 1 2 3 4 5 6 7 8 9 ... 1045 \
0 0 1 2 3 4 5 6 7 8 9 ... 1045
1046 1047 1048 1049 1050 1051 1052 1053 1054
0 1046 1047 1048 1049 1050 1051 1052 1053 1054
[1 rows x 1055 columns]
(1, 1055)
(4, 1055)
ValueError: Plan shapes are not aligned
This is one way to do it. Depending on your data, this could mix types (e.g. if one column was timestamps). Also, this resets your index in mega.
mega = pd.DataFrame(np.random.randn(3,3), columns=list('ABC'))
indexRow = pd.DataFrame({col: [n] for n, col in enumerate(mega)})
>>> pd.concat([indexRow, mega], ignore_index=True)
A B C
0 0.000000 1.000000 2.000000
1 0.413145 -1.475655 0.529429
2 0.416250 -0.055519 1.611539
3 0.154045 -0.038109 1.020616
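A variant along the same lines that builds the one-row frame from the column labels directly, so the labels are guaranteed to match (a sketch, not tied to the asker's actual file):

```python
import numpy as np
import pandas as pd

mega = pd.DataFrame(np.random.randn(3, 3), columns=list('ABC'))

# One row whose values are the column positions 0..n-1, with the
# same column labels as mega so concat aligns them cleanly
indexRow = pd.DataFrame([range(len(mega.columns))], columns=mega.columns)

result = pd.concat([indexRow, mega], ignore_index=True)
print(result)
```

ignore_index=True sidesteps index-alignment problems entirely, at the cost of resetting mega's index.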