Python Pandas - Pivot a csv file into a specific format

It's my first attempt at using pandas. I really need help with pivot_table. None of the combinations I used seem to work.
I have a csv file like this:
Id Param1 Param2
1 -5.00138282776 2.04990620034E-08
1 -4.80147838593 2.01516989762E-08
1 -4.60159301758 1.98263165885E-08
1 -4.40133094788 1.94918392538E-08
1 -4.20143127441 1.91767686175E-08
1 -4.00122880936 1.88457374151E-08
2 -5.00141859055 6.88369405921E-09
2 -4.80152130127 6.77335965094E-09
2 -4.60163593292 6.65415056389E-09
2 -4.40139055252 6.54434062497E-09
3 -5.00138044357 1.16316911658E-08
3 -4.80148792267 1.15515588206E-08
3 -4.60160970688 1.14048361866E-08
3 -4.40137386322 1.12357021465E-08
3 -4.20145988464 1.11049178741E-08
I want my final output to be like this:
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 Param1_for_Id3 Param2_for_Id3
-5.00138282776 2.04990620034E-08 -5.00141859055 6.88369405921E-09 -5.00138044357 1.16316911658E-08
-4.80147838593 2.01516989762E-08 -4.80152130127 6.77335965094E-09 -4.80148792267 1.15515588206E-08
-4.60159301758 1.98263165885E-08 -4.60163593292 6.65415056389E-09 -4.60160970688 1.14048361866E-08
-4.40133094788 1.94918392538E-08 -4.40139055252 6.54434062497E-09 -4.40137386322 1.12357021465E-08
-4.20143127441 1.91767686175E-08 -4.20145988464 1.11049178741E-08
-4.00122880936 1.88457374151E-08
I can't figure out how to reshape my data. Any help would be most welcome!

Use set_index twice + unstack:
v = (df.set_index('Id')  # optional, omit if `Id` is the index
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1)
       .fillna('')  # I actually don't recommend adding this step in
     )
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
And now,
print(v)
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 \
0 -5.001383 2.049906e-08 -5.00142 6.88369e-09
1 -4.801478 2.015170e-08 -4.80152 6.77336e-09
2 -4.601593 1.982632e-08 -4.60164 6.65415e-09
3 -4.401331 1.949184e-08 -4.40139 6.54434e-09
4 -4.201431 1.917677e-08
5 -4.001229 1.884574e-08
Param1_for_Id3 Param2_for_Id3
0 -5.00138 1.16317e-08
1 -4.80149 1.15516e-08
2 -4.60161 1.14048e-08
3 -4.40137 1.12357e-08
4 -4.20146 1.11049e-08
5
Note that the last fillna step results in mixed strings and numeric data, so I don't recommend adding it if you're going to do more with this output.
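As a self-contained check, here is the same chain run on a small stand-in frame (values shortened; in the question the data comes from the csv):

```python
import pandas as pd

# Stand-in for the csv data: three Ids with different numbers of rows
df = pd.DataFrame({
    'Id':     [1, 1, 2, 2, 2, 3],
    'Param1': [-5.0, -4.8, -5.0, -4.8, -4.6, -5.0],
    'Param2': [2.05e-08, 2.02e-08, 6.88e-09, 6.77e-09, 6.65e-09, 1.16e-08],
})

# Index by Id plus a per-Id row counter, then move Id into the columns
v = (df.set_index('Id')
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1))
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
print(v.columns.tolist())
```

Groups with fewer rows come back as NaN in the extra slots, which is what the fillna('') step papers over.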

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill character in '{0:20>4}' must be a single character, so '20' isn't valid
Trying to do something like the below just results in df['id'] taking the properties of the last lambda, and trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() calls in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the Pandas documentation relates to singular manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
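The radd/zfill approach can be checked end-to-end on the sample IDs from the question:

```python
import pandas as pd

# Sample IDs from the question
df = pd.DataFrame({'ID': ['07-1401469', '07-89556629', '07-12187595',
                          '07-381962', '07-99999085']})

# Prepend '20' to every value, then split on '-'
tmp = df['ID'].radd('20').str.split('-')

# Recombine: 4-digit year + zero-padded 8-digit number
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
print(df['ID'].tolist())
```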
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

Pandas - take multiple columns and transform them into a single column of dictionary objects?

I am trying to transform a DataFrame by combining extra columns into a dictionary.
My DataFrame will always have at least four columns: record, yhat, residual, and hat, with additional columns in different cases.
My current df head looks like this:
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
1 2 6.5568 0.19460 0.09771 -0.014930 -0.03078
2 3 6.5457 0.16190 0.09765 0.272800 0.56260
If we look at the top column, we see that there are 2 additional columns, RinvRes and AOMstat
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
I would like to combine those columns into a dictionary, where each column name is a key in the dictionary, e.g.:
record yhat residual hat additional
0 1 6.7272 -0.57130 0.04985 {"RinvRes": "0.2291E-01", "AOMstat": "0.3224E-01"}
In one step with .join, .agg(dict) and .drop.
First create your list of aggregate columns:
agg_cols = ['RinvRes', 'AOMstat']
df1 = (df.join(df[agg_cols].agg(dict, axis=1)
                           .to_frame('additional'))
         .drop(columns=agg_cols))
print(df1)
record yhat residual hat additional
0 1 6.7272 -0.5713 0.04985 {'RinvRes': 0.009825, 'AOMstat': 0.02041}
1 2 6.5568 0.1946 0.09771 {'RinvRes': -0.01493, 'AOMstat': -0.03078}
2 3 6.5457 0.1619 0.09765 {'RinvRes': 0.2728, 'AOMstat': 0.5626}
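A runnable sketch of the join/agg(dict)/drop chain on the sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'record': [1, 2],
    'yhat': [6.7272, 6.5568],
    'residual': [-0.5713, 0.1946],
    'hat': [0.04985, 0.09771],
    'RinvRes': [0.009825, -0.01493],
    'AOMstat': [0.02041, -0.03078],
})

agg_cols = ['RinvRes', 'AOMstat']

# Each row of df[agg_cols] becomes a {column: value} dict, joined back in
df1 = (df.join(df[agg_cols].agg(dict, axis=1)
                           .to_frame('additional'))
         .drop(columns=agg_cols))
print(df1['additional'].tolist())
```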
IIUC, starting from the list of the 4 columns, you can get the extra column names with difference and use to_dict to aggregate them:
# columns you have in common
keep_cols = ['record', 'yhat', 'residual', 'hat']
# get columns to agg into dict
extra_cols = df.columns.difference(keep_cols)
# create the result
new_df = (
    df[keep_cols]
      .assign(additional=df[extra_cols].agg(lambda x: x.to_dict(), axis=1))
)
print(new_df)
record yhat residual hat \
0 1 6.7272 -0.5713 0.04985
1 2 6.5568 0.1946 0.09771
2 3 6.5457 0.1619 0.09765
additional
0 {'AOMstat': 0.02041, 'RinvRes': 0.009825}
1 {'AOMstat': -0.03078, 'RinvRes': -0.01493}
2 {'AOMstat': 0.5626, 'RinvRes': 0.2728}
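Run end-to-end on the sample data, the difference/to_dict approach looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'record': [1, 2, 3],
    'yhat': [6.7272, 6.5568, 6.5457],
    'residual': [-0.5713, 0.1946, 0.1619],
    'hat': [0.04985, 0.09771, 0.09765],
    'RinvRes': [0.009825, -0.01493, 0.2728],
    'AOMstat': [0.02041, -0.03078, 0.5626],
})

# Columns you have in common
keep_cols = ['record', 'yhat', 'residual', 'hat']
# Everything else gets folded into the dict (difference returns a sorted Index)
extra_cols = df.columns.difference(keep_cols)

new_df = df[keep_cols].assign(
    additional=df[extra_cols].agg(lambda x: x.to_dict(), axis=1)
)
print(new_df.loc[0, 'additional'])
```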
Although the above answers are more elegant and efficient, here's a more simplistic version:
rinvres = df['RinvRes'].values.tolist()
aomstat = df['AOMstat'].values.tolist()
df = df.drop(['RinvRes', 'AOMstat'], axis=1)  # drop returns a copy, so reassign it
additional = []
for i in range(len(rinvres)):
    add = {
        'RinvRes': rinvres[i],
        'AOMstat': aomstat[i]
    }
    additional.append(add)
df['additional'] = additional

How to deal with multiple lists inside multiple columns in a df?

I have a df like this
data_list
0 [['13878018', '13878274'], ['54211', '54212'], ['AARTIIND21JUL850PE', 'AARTIIND21JUL860CE'], ['AARTIIND', 'AARTIIND']]
1 [['13099778', '13100034'], ['51171', '51172'], ['ABFRL21JUL210PE', 'ABFRL21JUL215CE'], ['ABFRL', 'ABFRL']]
2 [['13910018', '13910274'], ['54336', '54337'], ['ACC21JUL1980PE', 'ACC21JUL2000CE'], ['ACC', 'ACC']]
and I want to convert it to
name token ext_t symbol
0 AARTIIND 13878018 54211 AARTIIND21JUL850PE
1 AARTIIND 13878274 54212 AARTIIND21JUL860CE
2 ABFRL 13099778 51171 ABFRL21JUL210PE
3 ABFRL 13100034 51172 ABFRL21JUL215CE
4 ACC 13910018 54336 ACC21JUL1980PE
5 ACC 13910274 54337 ACC21JUL2000CE
How can I achieve this?
I tried to apply pd.Series and I got an output like this:
0 1 2 3
0 [13878018, 13878274] [54211, 54212] [AARTIIND21JUL850PE, AARTIIND21JUL860CE] [AARTIIND, AARTIIND]
1 [13099778, 13100034] [51171, 51172] [ABFRL21JUL210PE, ABFRL21JUL215CE] [ABFRL, ABFRL]
2 [13910018, 13910274] [54336, 54337] [ACC21JUL1980PE, ACC21JUL2000CE] [ACC, ACC]
I am not sure how to proceed next. Can anyone help please?
Try via the DataFrame() constructor and apply():
out = pd.DataFrame(df['data_list'].tolist()).apply(pd.Series.explode)
# OR: you can also use the agg() method in place of apply()
out = pd.DataFrame(df['data_list'].tolist()).agg(pd.Series.explode)
Finally:
out.columns = ['token', 'ext_t', 'symbol', 'name']
Now if you print out, you will get your expected output (reindex with out[['name', 'token', 'ext_t', 'symbol']] if you want the columns in the order shown in the question).
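A self-contained sketch of the explode approach on the first two rows of the sample data:

```python
import pandas as pd

# Each cell is a list of four parallel 2-element lists
df = pd.DataFrame({'data_list': [
    [['13878018', '13878274'], ['54211', '54212'],
     ['AARTIIND21JUL850PE', 'AARTIIND21JUL860CE'], ['AARTIIND', 'AARTIIND']],
    [['13099778', '13100034'], ['51171', '51172'],
     ['ABFRL21JUL210PE', 'ABFRL21JUL215CE'], ['ABFRL', 'ABFRL']],
]})

# Spread the inner lists into columns, then explode each column in lockstep
out = pd.DataFrame(df['data_list'].tolist()).apply(pd.Series.explode)
out.columns = ['token', 'ext_t', 'symbol', 'name']
out = out[['name', 'token', 'ext_t', 'symbol']].reset_index(drop=True)
print(out)
```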

How to stack number of rows to one row and assign id

I have a dataframe like this:
band mean raster
1 894.343482 D:/Python/Copied/selection/20170219_095504.tif
2 1159.282304 D:/Python/Copied/selection/20170219_095504.tif
3 1342.291595 D:/Python/Copied/selection/20170219_095504.tif
4 3056.809463 D:/Python/Copied/selection/20170219_095504.tif
1 516.9624071 D:/Python/Copied/selection/20170325_095551.tif
2 720.1932533 D:/Python/Copied/selection/20170325_095551.tif
3 689.6287879 D:/Python/Copied/selection/20170325_095551.tif
4 4561.576329 D:/Python/Copied/selection/20170325_095551.tif
1 566.2016867 D:/Python/Copied/selection/20170527_095700.tif
2 812.9927101 D:/Python/Copied/selection/20170527_095700.tif
3 760.4621212 D:/Python/Copied/selection/20170527_095700.tif
4 5009.537164 D:/Python/Copied/selection/20170527_095700.tif
And I want to format it to this:
band1_mean band2_mean band3_mean band4_mean raster_name id
894.343482 1159.282304 1342.291595 3056.809463 20170219_095504.tif 1
516.9624071 720.1932533 689.6287879 4561.576329 20170325_095551.tif 2
566.2016867 812.9927101 760.4621212 5009.537164 20170527_095700.tif 3
All 4 bands belong to one raster, and therefore the values all have to be in one row. I don't know how to stack them without having a key id for every raster.
Thanks!
This is a case for pivot:
# extract the raster name:
df['raster_name'] = df.raster.str.extract(r'(\d+_\d+\.tif)')
# pivot
new_df = df.pivot(index='raster_name', columns='band', values='mean')
# rename the columns:
new_df.columns = [f'band{i}_mean' for i in new_df.columns]
Output:
band1_mean band2_mean band3_mean band4_mean
raster_name
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
You can reset_index on new_df if you want raster_name to be a normal column.
With df.pivot(index="raster", columns="band", values="mean") you'd get
band 1 2 3 4
raster
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
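Putting the steps together, including the id column the question asks for (numbering assumed to start at 1), on the first two rasters:

```python
import pandas as pd

df = pd.DataFrame({
    'band': [1, 2, 3, 4, 1, 2, 3, 4],
    'mean': [894.343482, 1159.282304, 1342.291595, 3056.809463,
             516.9624071, 720.1932533, 689.6287879, 4561.576329],
    'raster': ['D:/Python/Copied/selection/20170219_095504.tif'] * 4
             + ['D:/Python/Copied/selection/20170325_095551.tif'] * 4,
})

# Extract the raster name, pivot bands into columns, rename
df['raster_name'] = df['raster'].str.extract(r'(\d+_\d+\.tif)')
new_df = df.pivot(index='raster_name', columns='band', values='mean')
new_df.columns = [f'band{i}_mean' for i in new_df.columns]

# Move raster_name back to a column and number the rasters
new_df = new_df.reset_index()
new_df['id'] = range(1, len(new_df) + 1)
print(new_df)
```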

How to convert a specific range of elements in a pandas DataFrame into float numbers?

I have a pandas DataFrame like the following, and this is the data:
0 1 2 3 4 5 6
0 Label Total/Target Jaccard Dice VolumeSimilarity FalseNegative FalsePositive
1 image-9003406 0.753958942196244 0.628584809743865 0.771939914928625 -0.0476974851707525 0.246041057803756 0.209200511636753
2 image-9007827 0.783266136200411 0.652181507072358 0.789479248231042 -0.015864625683349 0.216733863799589 0.204208282912204
3 image-9040390 0.797836181211824 0.611217035556112 0.758702300270988 0.0981000407411853 0.202163818788176 0.276772045623749
4 image-9047800 0.833585767007274 0.627592483537663 0.771191179469637 0.149701662401568 0.166414232992726 0.282513296651508
5 image-9054866 0.828860635279561 0.652709649240693 0.789866083907199 0.0940919370823063 0.171139364720439 0.245624253720476
6 image-9056363 0.795614053800371 0.658368025419615 0.793995078689519 0.00406974990730408 0.204385946199629 0.207617320977731
7 image-9068453 0.763313209747495 0.565848914378489 0.722737563225356 0.106314540359027 0.236686790252505 0.313742036740474
8 image-9085290 0.633747182342442 0.498166624744976 0.665035005475144 -0.0987391313269621 0.366252817657558 0.300427399066708
9 image-9087863 0.663537911271341 0.539359224086608 0.700758102003958 -0.112187081100769 0.336462088728659 0.257597937816249
10 image-9094865 0.667530629804239 0.556419610760253 0.714999485888594 -0.142222256073179 0.332469370195761 0.230263697338428
However, I need to convert the data starting from column #1 and row #1 to numbers; when it is saved into an Excel file, it is saved as strings.
How to do that?
Your help is appreciated.
Use:
#set columns by first row
df.columns = df.iloc[0]
#set index by first column
df.index = df.iloc[:, 0]
#remove first row, first col and cast to floats
df = df.iloc[1:, 1:].astype(float)
print (df)
0 Total/Target Jaccard Dice VolumeSimilarity \
Label
image-9003406 0.753959 0.628585 0.771940 -0.047697
image-9007827 0.783266 0.652182 0.789479 -0.015865
image-9040390 0.797836 0.611217 0.758702 0.098100
image-9047800 0.833586 0.627592 0.771191 0.149702
image-9054866 0.828861 0.652710 0.789866 0.094092
image-9056363 0.795614 0.658368 0.793995 0.004070
image-9068453 0.763313 0.565849 0.722738 0.106315
image-9085290 0.633747 0.498167 0.665035 -0.098739
image-9087863 0.663538 0.539359 0.700758 -0.112187
image-9094865 0.667531 0.556420 0.714999 -0.142222
0 FalseNegative FalsePositive
Label
image-9003406 0.246041 0.209201
image-9007827 0.216734 0.204208
image-9040390 0.202164 0.276772
image-9047800 0.166414 0.282513
image-9054866 0.171139 0.245624
image-9056363 0.204386 0.207617
image-9068453 0.236687 0.313742
image-9085290 0.366253 0.300427
image-9087863 0.336462 0.257598
image-9094865 0.332469 0.230264
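A minimal runnable sketch of the same steps on a cut-down version of the frame:

```python
import pandas as pd

# Stand-in frame: everything is a string, including the header row
df = pd.DataFrame([
    ['Label', 'Total/Target', 'Jaccard'],
    ['image-9003406', '0.753958942196244', '0.628584809743865'],
    ['image-9007827', '0.783266136200411', '0.652181507072358'],
])

df.columns = df.iloc[0]             # promote first row to header
df.index = df.iloc[:, 0]            # promote first column to index
df = df.iloc[1:, 1:].astype(float)  # drop them and cast the rest to float
print(df.dtypes.tolist())
```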