Python Pandas - Pivot a csv file into a specific format

It's my first attempt at using pandas. I really need help with pivot_table. None of the combinations I used seem to work.
I have a csv file like this:
Id Param1 Param2
1 -5.00138282776 2.04990620034E-08
1 -4.80147838593 2.01516989762E-08
1 -4.60159301758 1.98263165885E-08
1 -4.40133094788 1.94918392538E-08
1 -4.20143127441 1.91767686175E-08
1 -4.00122880936 1.88457374151E-08
2 -5.00141859055 6.88369405921E-09
2 -4.80152130127 6.77335965094E-09
2 -4.60163593292 6.65415056389E-09
2 -4.40139055252 6.54434062497E-09
3 -5.00138044357 1.16316911658E-08
3 -4.80148792267 1.15515588206E-08
3 -4.60160970688 1.14048361866E-08
3 -4.40137386322 1.12357021465E-08
3 -4.20145988464 1.11049178741E-08
I want my final output to be like this:
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 Param1_for_Id3 Param2_for_Id3
-5.00138282776 2.04990620034E-08 -5.00141859055 6.88369405921E-09 -5.00138044357 1.16316911658E-08
-4.80147838593 2.01516989762E-08 -4.80152130127 6.77335965094E-09 -4.80148792267 1.15515588206E-08
-4.60159301758 1.98263165885E-08 -4.60163593292 6.65415056389E-09 -4.60160970688 1.14048361866E-08
-4.40133094788 1.94918392538E-08 -4.40139055252 6.54434062497E-09 -4.40137386322 1.12357021465E-08
-4.20143127441 1.91767686175E-08 -4.20145988464 1.11049178741E-08
-4.00122880936 1.88457374151E-08
I can't figure out how to reshape my data. Any help would be most welcome!

Use set_index twice + unstack:
v = (df.set_index('Id')  # optional, omit if `Id` is the index
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1)
       .fillna('')  # I actually don't recommend adding this step in
     )
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
And now,
print(v)
Param1_for_Id1 Param2_for_Id1 Param1_for_Id2 Param2_for_Id2 \
0 -5.001383 2.049906e-08 -5.00142 6.88369e-09
1 -4.801478 2.015170e-08 -4.80152 6.77336e-09
2 -4.601593 1.982632e-08 -4.60164 6.65415e-09
3 -4.401331 1.949184e-08 -4.40139 6.54434e-09
4 -4.201431 1.917677e-08
5 -4.001229 1.884574e-08
Param1_for_Id3 Param2_for_Id3
0 -5.00138 1.16317e-08
1 -4.80149 1.15516e-08
2 -4.60161 1.14048e-08
3 -4.40137 1.12357e-08
4 -4.20146 1.11049e-08
5
Note that the last fillna step results in mixed strings and numeric data, so I don't recommend adding it if you're going to do more with this output.
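As a self-contained check, here is the same chain run on a small stand-in frame (values shortened; in the question the data comes from the csv):

```python
import pandas as pd

# Stand-in for the csv data: three Ids with different numbers of rows
df = pd.DataFrame({
    'Id':     [1, 1, 2, 2, 2, 3],
    'Param1': [-5.0, -4.8, -5.0, -4.8, -4.6, -5.0],
    'Param2': [2.05e-08, 2.02e-08, 6.88e-09, 6.77e-09, 6.65e-09, 1.16e-08],
})

# Index by Id plus a per-Id row counter, then move Id into the columns
v = (df.set_index('Id')
       .set_index(df.groupby('Id').cumcount(), append=True)
       .unstack(0)
       .sort_index(level=1, axis=1))
v.columns = v.columns.map('{0[0]}_for_Id{0[1]}'.format)
print(v.columns.tolist())
```

Groups with fewer rows come back as NaN in the extra slots, which is what the fillna('') step papers over.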

Multi-part manipulation post str.split() Pandas

I have a subset of data (single column) we'll call ID:
ID
0 07-1401469
1 07-89556629
2 07-12187595
3 07-381962
4 07-99999085
The current format is (usually) YY-[up to 8-character ID].
The desired output format is a more uniformed YYYY-xxxxxxxx:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
Knowing that I've done padding in the past, the thought process was to combine
df['id'].str.split('-').str[0].apply(lambda x: '{0:20>4}'.format(x))
df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x))
However I ran into a few problems:
The fill character in '{0:20>4}' must be a single character, so '20' isn't valid
Trying to do something like the below just results in df['id'] taking the properties of the last lambda, and trying any other way to combine multiple apply/lambdas just didn't work. I started going down the pad left/right route but that seemed to be taking me backwards.
df['id'] = (df['id'].str.split('-').str[0].apply(lambda x: '{0:X>4}'.format(x)).str[1].apply(lambda x: '{0:0>8}'.format(x)))
The current solution I have (but HATE because it's long, messy, and just not clean IMO) is:
df['idyear'] = df['id'].str.split('-').str[0].apply(lambda x: '{:X>4}'.format(x)) # Split on '-' and pad with X
df['idyear'] = df['idyear'].str.replace('XX', '20') # Replace XX with 20 to conform to YYYY
df['idnum'] = df['id'].str.split('-').str[1].apply(lambda x: '{0:0>8}'.format(x)) # Pad 0s up to 8 digits
df['id'] = df['idyear'].map(str) + "-" + df['idnum'] # Merge idyear and idnum to remake id
del df['idnum'] # delete extra
del df['idyear'] # delete extra
Which does work
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
But my questions are:
Is there a way to run multiple apply() calls in a single line so I'm not making temp variables?
Is there a better way than replacing 'XX' with '20'?
I feel like this entire code block can be compressed to 1 or 2 lines; I just don't know how. Everything I've seen on SO and in the Pandas documentation relates to singular manipulation so far.
One option is to split; then use str.zfill to pad '0's. Also prepend '20's before splitting, since you seem to need it anyway:
tmp = df['ID'].radd('20').str.split('-')
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
Output:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085
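The radd/zfill approach can be checked end-to-end on the sample IDs from the question:

```python
import pandas as pd

# Sample IDs from the question
df = pd.DataFrame({'ID': ['07-1401469', '07-89556629', '07-12187595',
                          '07-381962', '07-99999085']})

# Prepend '20' to every value, then split on '-'
tmp = df['ID'].radd('20').str.split('-')

# Recombine: 4-digit year + zero-padded 8-digit number
df['ID'] = tmp.str[0] + '-' + tmp.str[1].str.zfill(8)
print(df['ID'].tolist())
```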
I'd do it in two steps, using .str.replace:
df["ID"] = df["ID"].str.replace(r"^(\d{2})-", r"20\1-", regex=True)
df["ID"] = df["ID"].str.replace(r"-(\d+)", lambda g: f"-{g[1]:0>8}", regex=True)
print(df)
Prints:
ID
0 2007-01401469
1 2007-89556629
2 2007-12187595
3 2007-00381962
4 2007-99999085

Pandas - take multiple columns and transform them into a single column of dictionary objects?

I am trying to transform a DataFrame by combining extra columns into a dictionary.
My DataFrame will always have at least four columns: record, yhat, residual, and hat, with additional columns in different cases.
My current df head looks like this:
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
1 2 6.5568 0.19460 0.09771 -0.014930 -0.03078
2 3 6.5457 0.16190 0.09765 0.272800 0.56260
If we look at the top column, we see that there are 2 additional columns, RinvRes and AOMstat
record yhat residual hat RinvRes AOMstat
0 1 6.7272 -0.57130 0.04985 0.009825 0.02041
I would like to combine those columns into a dictionary, where each column name is a key in the dictionary, e.g.:
record yhat residual hat additional
0 1 6.7272 -0.57130 0.04985 {"RinvRes": "0.2291E-01", "AOMstat": "0.3224E-01"}
In one step with .join, .agg(dict) and .drop.
First create your list of aggregate columns:
agg_cols = ['RinvRes', 'AOMstat']
df1 = (df.join(df[agg_cols].agg(dict, axis=1)
                           .to_frame('additional'))
         .drop(columns=agg_cols))
print(df1)
record yhat residual hat additional
0 1 6.7272 -0.5713 0.04985 {'RinvRes': 0.009825, 'AOMstat': 0.02041}
1 2 6.5568 0.1946 0.09771 {'RinvRes': -0.01493, 'AOMstat': -0.03078}
2 3 6.5457 0.1619 0.09765 {'RinvRes': 0.2728, 'AOMstat': 0.5626}
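A runnable sketch of the join/agg(dict)/drop chain on the sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    'record': [1, 2],
    'yhat': [6.7272, 6.5568],
    'residual': [-0.5713, 0.1946],
    'hat': [0.04985, 0.09771],
    'RinvRes': [0.009825, -0.01493],
    'AOMstat': [0.02041, -0.03078],
})

agg_cols = ['RinvRes', 'AOMstat']

# Each row of df[agg_cols] becomes a {column: value} dict, joined back in
df1 = (df.join(df[agg_cols].agg(dict, axis=1)
                           .to_frame('additional'))
         .drop(columns=agg_cols))
print(df1['additional'].tolist())
```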
IIUC, starting from the list of the 4 columns, you can get the extra column names with difference and use to_dict to aggregate them:
# columns you have in common
keep_cols = ['record', 'yhat', 'residual', 'hat']
# get columns to agg into dict
extra_cols = df.columns.difference(keep_cols)
# create the result
new_df = (
    df[keep_cols]
      .assign(additional=df[extra_cols].agg(lambda x: x.to_dict(), axis=1))
)
print(new_df)
record yhat residual hat \
0 1 6.7272 -0.5713 0.04985
1 2 6.5568 0.1946 0.09771
2 3 6.5457 0.1619 0.09765
additional
0 {'AOMstat': 0.02041, 'RinvRes': 0.009825}
1 {'AOMstat': -0.03078, 'RinvRes': -0.01493}
2 {'AOMstat': 0.5626, 'RinvRes': 0.2728}
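Run end-to-end on the sample data, the difference/to_dict approach looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    'record': [1, 2, 3],
    'yhat': [6.7272, 6.5568, 6.5457],
    'residual': [-0.5713, 0.1946, 0.1619],
    'hat': [0.04985, 0.09771, 0.09765],
    'RinvRes': [0.009825, -0.01493, 0.2728],
    'AOMstat': [0.02041, -0.03078, 0.5626],
})

# Columns you have in common
keep_cols = ['record', 'yhat', 'residual', 'hat']
# Everything else gets folded into the dict (difference returns a sorted Index)
extra_cols = df.columns.difference(keep_cols)

new_df = df[keep_cols].assign(
    additional=df[extra_cols].agg(lambda x: x.to_dict(), axis=1)
)
print(new_df.loc[0, 'additional'])
```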
Although the above answers are more elegant and efficient, here's a more simplistic version:
rinvres = df['RinvRes'].values.tolist()
aomstat = df['AOMstat'].values.tolist()
df = df.drop(['RinvRes', 'AOMstat'], axis=1)  # drop returns a copy, so reassign it
additional = []
for i in range(len(rinvres)):
    add = {
        'RinvRes': rinvres[i],
        'AOMstat': aomstat[i]
    }
    additional.append(add)
df['additional'] = additional

How to deal with multiple lists inside multiple columns in a df?

I have a df like this
data_list
0 [['13878018', '13878274'], ['54211', '54212'], ['AARTIIND21JUL850PE', 'AARTIIND21JUL860CE'], ['AARTIIND', 'AARTIIND']]
1 [['13099778', '13100034'], ['51171', '51172'], ['ABFRL21JUL210PE', 'ABFRL21JUL215CE'], ['ABFRL', 'ABFRL']]
2 [['13910018', '13910274'], ['54336', '54337'], ['ACC21JUL1980PE', 'ACC21JUL2000CE'], ['ACC', 'ACC']]
and I want to convert it to
name token ext_t symbol
0 AARTIIND 13878018 54211 AARTIIND21JUL850PE
1 AARTIIND 13878274 54212 AARTIIND21JUL860CE
2 ABFRL 13099778 51171 ABFRL21JUL210PE
3 ABFRL 13100034 51172 ABFRL21JUL215CE
4 ACC 13910018 54336 ACC21JUL1980PE
5 ACC 13910274 54337 ACC21JUL2000CE
How can I achieve this?
I tried to apply pd.Series and I got an output like this:
0 1 2 3
0 [13878018, 13878274] [54211, 54212] [AARTIIND21JUL850PE, AARTIIND21JUL860CE] [AARTIIND, AARTIIND]
1 [13099778, 13100034] [51171, 51172] [ABFRL21JUL210PE, ABFRL21JUL215CE] [ABFRL, ABFRL]
2 [13910018, 13910274] [54336, 54337] [ACC21JUL1980PE, ACC21JUL2000CE] [ACC, ACC]
I am not sure how to proceed next. Can anyone help please?
Try via the DataFrame() constructor and apply():
out = pd.DataFrame(df['data_list'].tolist()).apply(pd.Series.explode)
# OR: you can also use the agg() method in place of apply()
out = pd.DataFrame(df['data_list'].tolist()).agg(pd.Series.explode)
Finally:
out.columns = ['token', 'ext_t', 'symbol', 'name']
Now if you print out, you will get your expected output (reindex with out[['name', 'token', 'ext_t', 'symbol']] if you want the columns in the order shown in the question).
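A self-contained sketch of the explode approach on the first two rows of the sample data:

```python
import pandas as pd

# Each cell is a list of four parallel 2-element lists
df = pd.DataFrame({'data_list': [
    [['13878018', '13878274'], ['54211', '54212'],
     ['AARTIIND21JUL850PE', 'AARTIIND21JUL860CE'], ['AARTIIND', 'AARTIIND']],
    [['13099778', '13100034'], ['51171', '51172'],
     ['ABFRL21JUL210PE', 'ABFRL21JUL215CE'], ['ABFRL', 'ABFRL']],
]})

# Spread the inner lists into columns, then explode each column in lockstep
out = pd.DataFrame(df['data_list'].tolist()).apply(pd.Series.explode)
out.columns = ['token', 'ext_t', 'symbol', 'name']
out = out[['name', 'token', 'ext_t', 'symbol']].reset_index(drop=True)
print(out)
```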

How to stack number of rows to one row and assign id

I have a dataframe like this:
band mean raster
1 894.343482 D:/Python/Copied/selection/20170219_095504.tif
2 1159.282304 D:/Python/Copied/selection/20170219_095504.tif
3 1342.291595 D:/Python/Copied/selection/20170219_095504.tif
4 3056.809463 D:/Python/Copied/selection/20170219_095504.tif
1 516.9624071 D:/Python/Copied/selection/20170325_095551.tif
2 720.1932533 D:/Python/Copied/selection/20170325_095551.tif
3 689.6287879 D:/Python/Copied/selection/20170325_095551.tif
4 4561.576329 D:/Python/Copied/selection/20170325_095551.tif
1 566.2016867 D:/Python/Copied/selection/20170527_095700.tif
2 812.9927101 D:/Python/Copied/selection/20170527_095700.tif
3 760.4621212 D:/Python/Copied/selection/20170527_095700.tif
4 5009.537164 D:/Python/Copied/selection/20170527_095700.tif
And I want to format it to this:
band1_mean band2_mean band3_mean band4_mean raster_name id
894.343482 1159.282304 1342.291595 3056.809463 20170219_095504.tif 1
516.9624071 720.1932533 689.6287879 4561.576329 20170325_095551.tif 2
566.2016867 812.9927101 760.4621212 5009.537164 20170527_095700.tif 3
All 4 bands belong to one raster, and therefore the values all have to be in one row. I don't know how to stack them without having a key id for every raster.
Thanks!
This is a case for pivot:
# extract the raster name:
df['raster_name'] = df.raster.str.extract(r'(\d+_\d+\.tif)')
# pivot
new_df = df.pivot(index='raster_name', columns='band', values='mean')
# rename the columns:
new_df.columns = [f'band{i}_mean' for i in new_df.columns]
Output:
band1_mean band2_mean band3_mean band4_mean
raster_name
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
You can reset_index on new_df if you want raster_name to be a normal column.
With df.pivot(index="raster", columns="band", values="mean") you'd get
band 1 2 3 4
raster
20170219_095504.tif 894.343482 1159.282304 1342.291595 3056.809463
20170325_095551.tif 516.962407 720.193253 689.628788 4561.576329
20170527_095700.tif 566.201687 812.992710 760.462121 5009.537164
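Putting the steps together, including the id column the question asks for (numbering assumed to start at 1), on the first two rasters:

```python
import pandas as pd

df = pd.DataFrame({
    'band': [1, 2, 3, 4, 1, 2, 3, 4],
    'mean': [894.343482, 1159.282304, 1342.291595, 3056.809463,
             516.9624071, 720.1932533, 689.6287879, 4561.576329],
    'raster': ['D:/Python/Copied/selection/20170219_095504.tif'] * 4
             + ['D:/Python/Copied/selection/20170325_095551.tif'] * 4,
})

# Extract the raster name, pivot bands into columns, rename
df['raster_name'] = df['raster'].str.extract(r'(\d+_\d+\.tif)')
new_df = df.pivot(index='raster_name', columns='band', values='mean')
new_df.columns = [f'band{i}_mean' for i in new_df.columns]

# Move raster_name back to a column and number the rasters
new_df = new_df.reset_index()
new_df['id'] = range(1, len(new_df) + 1)
print(new_df)
```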

How to convert a specific range of elements in a pandas DataFrame into float numbers?

I have a pandas DataFrame like the following, and this is the data:
0 1 2 3 4 5 6
0 Label Total/Target Jaccard Dice VolumeSimilarity FalseNegative FalsePositive
1 image-9003406 0.753958942196244 0.628584809743865 0.771939914928625 -0.0476974851707525 0.246041057803756 0.209200511636753
2 image-9007827 0.783266136200411 0.652181507072358 0.789479248231042 -0.015864625683349 0.216733863799589 0.204208282912204
3 image-9040390 0.797836181211824 0.611217035556112 0.758702300270988 0.0981000407411853 0.202163818788176 0.276772045623749
4 image-9047800 0.833585767007274 0.627592483537663 0.771191179469637 0.149701662401568 0.166414232992726 0.282513296651508
5 image-9054866 0.828860635279561 0.652709649240693 0.789866083907199 0.0940919370823063 0.171139364720439 0.245624253720476
6 image-9056363 0.795614053800371 0.658368025419615 0.793995078689519 0.00406974990730408 0.204385946199629 0.207617320977731
7 image-9068453 0.763313209747495 0.565848914378489 0.722737563225356 0.106314540359027 0.236686790252505 0.313742036740474
8 image-9085290 0.633747182342442 0.498166624744976 0.665035005475144 -0.0987391313269621 0.366252817657558 0.300427399066708
9 image-9087863 0.663537911271341 0.539359224086608 0.700758102003958 -0.112187081100769 0.336462088728659 0.257597937816249
10 image-9094865 0.667530629804239 0.556419610760253 0.714999485888594 -0.142222256073179 0.332469370195761 0.230263697338428
However, I need to convert the data starting from column #1 and row #1 to numbers; when it is saved into an Excel file, it is saved as strings.
How to do that?
Your help is appreciated.
Use:
#set columns by first row
df.columns = df.iloc[0]
#set index by first column
df.index = df.iloc[:, 0]
#remove first row, first col and cast to floats
df = df.iloc[1:, 1:].astype(float)
print (df)
0 Total/Target Jaccard Dice VolumeSimilarity \
Label
image-9003406 0.753959 0.628585 0.771940 -0.047697
image-9007827 0.783266 0.652182 0.789479 -0.015865
image-9040390 0.797836 0.611217 0.758702 0.098100
image-9047800 0.833586 0.627592 0.771191 0.149702
image-9054866 0.828861 0.652710 0.789866 0.094092
image-9056363 0.795614 0.658368 0.793995 0.004070
image-9068453 0.763313 0.565849 0.722738 0.106315
image-9085290 0.633747 0.498167 0.665035 -0.098739
image-9087863 0.663538 0.539359 0.700758 -0.112187
image-9094865 0.667531 0.556420 0.714999 -0.142222
0 FalseNegative FalsePositive
Label
image-9003406 0.246041 0.209201
image-9007827 0.216734 0.204208
image-9040390 0.202164 0.276772
image-9047800 0.166414 0.282513
image-9054866 0.171139 0.245624
image-9056363 0.204386 0.207617
image-9068453 0.236687 0.313742
image-9085290 0.366253 0.300427
image-9087863 0.336462 0.257598
image-9094865 0.332469 0.230264
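A minimal runnable sketch of the same steps on a cut-down version of the frame:

```python
import pandas as pd

# Stand-in frame: everything is a string, including the header row
df = pd.DataFrame([
    ['Label', 'Total/Target', 'Jaccard'],
    ['image-9003406', '0.753958942196244', '0.628584809743865'],
    ['image-9007827', '0.783266136200411', '0.652181507072358'],
])

df.columns = df.iloc[0]             # promote first row to header
df.index = df.iloc[:, 0]            # promote first column to index
df = df.iloc[1:, 1:].astype(float)  # drop them and cast the rest to float
print(df.dtypes.tolist())
```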