I have data as shown below. How can I convert it into a dataframe? I need the country name (some country names contain a comma) as the first column and each of the other values in its own column.
Input is a txt file with many countries
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
Output should be a dataframe with country name as first column
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
You can first use read_csv (a .txt file is no problem) with a separator that never occurs in the data, such as |, so that each line is read into a single Series. Then extract the country names into one column, strip the trailing comma, and split the remaining values by ,:
import numpy as np
import pandas as pd
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
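#after testing, replace 'StringIO(temp)' with 'filename.csv'
#read_csv's squeeze=True was removed in pandas 2.0; squeezing the result works in old and new versions
s = pd.read_csv(StringIO(temp), sep="|", header=None).squeeze("columns")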
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object
df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
#reset column names by 0,1,2...
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
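If you need the countries as the index, use this line in place of the set_index/reset_index line above (right after the strip step):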
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
Another solution uses the regex from another answer as the sep parameter. Only engine='python' is necessary, because otherwise pandas warns that it is falling back to the Python engine for a regex separator:
import pandas as pd
from io import StringIO
temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
#after testing, replace 'StringIO(temp)' with 'filename.csv'
#raw string so that \d is passed to the regex engine unchanged
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354
13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
jezrael's answer is the way to go if you want the complete output asap.
If you want to really understand some simpler code, try doing the following:
Split the string into some lists like this:
data = "Czech Republic..."
lines = data.split('\n')
rows = []
then iterate over the lines, and append them to a list of lists:
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

for line in lines:
    temp = line.split(',')
    if is_number(temp[1]):
        # second field is already a number, so the name has no comma
        rows.append(temp)
    else:
        # the name contains a comma: rejoin its two parts and keep the rest
        rows.append([','.join(temp[:2])] + temp[2:])
Then use this list of lists and read the pandas DataFrame documentation on how to pretty-print it. (Hint: make the list of lists a dict first.)
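The steps above can be put together roughly as follows. This is only a sketch: the sample data is shortened to three values per country, and it assumes a country name contains at most one comma. The list of lists is passed to pd.DataFrame directly here, which pandas also accepts, rather than being turned into a dict first.

```python
import pandas as pd

# Shortened sample data (assumption: at most one comma inside a country name)
data = """Czech Republic,22,22,21
Congo,Dem.Rep.,275,306,327
Congo,Rep.,209,222,231"""


def is_number(s):
    """Return True if the string can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


rows = []
for line in data.split('\n'):
    parts = line.split(',')
    if is_number(parts[1]):
        # the name has no comma, the fields are already correct
        rows.append(parts)
    else:
        # the name was split in two: rejoin it with the comma
        rows.append([','.join(parts[:2])] + parts[2:])

# each inner list becomes one row; the values are still strings
df = pd.DataFrame(rows)
print(df)
```

All values are still strings at this point, so convert them if you need to do arithmetic.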
A solution using the re.split() function and a DataFrame with labeled columns:
import pandas as pd, re
s = '''
Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485
'''
data = []
for l in s.split('\n'):
    if l:
        data.append(re.split(r',(?=\d)', l))
# setting output options
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 1000)
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0][1:]))))
print(df)
The output:
Country name 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14 13 12 11 11 10 9
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485
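Note that re.split() returns strings, so the value columns in this frame have object dtype. If numbers are needed, one possible follow-up is to convert each value column with pd.to_numeric. The sketch below uses a shortened two-line sample; the variable names lines and value_cols are only illustrative.

```python
import re
import pandas as pd

# Shortened sample lines, for illustration only
lines = [
    "Czech Republic,22,22,22",
    "Congo,Dem.Rep.,275,306,327",
]

# split only on commas that are followed by a digit
data = [re.split(r',(?=\d)', line) for line in lines]
df = pd.DataFrame(data, columns=['Country name'] + list(range(len(data[0]) - 1)))

# every column except the first holds numeric strings; convert them one by one
value_cols = list(df.columns[1:])
for col in value_cols:
    df[col] = pd.to_numeric(df[col])

print(df.dtypes)
```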
I have a dataframe which looks like this
Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 PRKCZ.exon9 PRKCZ.exon10 ... FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
S28 22 127 135 77 120 159 49 38 409 67 ... 112 104 37 83 47 18 110 70 167 19
22 3 630 178 259 142 640 77 121 521 452 ... 636 288 281 538 276 109 242 314 790 484
S04 16 658 320 337 315 881 188 162 769 577 ... 1291 420 369 859 507 208 554 408 1172 706
56 26 663 343 390 314 1090 263 200 844 592 ... 675 243 250 472 280 133 300 275 750 473
S27 13 1525 571 1081 560 1867 427 370 1348 1530 ... 1817 926 551 1554 808 224 971 1313 1293 701
5 rows × 8297 columns
In the above dataframe I need to add an extra column with information about the index. I made a list, healthy, with all the index values to be labelled as h; everything else should be d.
So I tried the following lines:
healthy=['39','41','49','50','51','52','53','54','56']
H_type =pd.Series( ['h' for x in df.loc[healthy]
else 'd' for x in df]).to_frame()
But it throws the following error:
SyntaxError: invalid syntax
Any help would be really appreciated
In the end I am aiming for something like this:
Geneid sampletype SSX4.exon4 SSX2.exon11 DUX4.exon5 SSX2.exon3 SSX4.exon5 SSX2.exon10 SSX4.exon7 SSX2.exon9 SSX4.exon8 ... SETD2.exon21 FAT2.exon15 CASC5.exon8 FAT1.exon21 FAT3.exon9 MLL.exon31 NACA.exon7 RANBP2.exon20 APC.exon16 APOB.exon4
S28 h 0 0 0 0 0 0 0 0 0 ... 2480 2003 2749 1760 2425 3330 4758 2508 4367 4094
22 h 0 0 0 0 0 0 0 0 0 ... 8986 7200 10123 12422 14528 18393 9612 15325 8788 11584
S04 h 0 0 0 0 0 0 0 0 0 ... 14518 16657 17500 15996 17367 17948 18037 19446 24179 28924
56 h 0 0 0 0 0 0 0 0 0 ... 17784 17846 20811 17337 18135 19264 19336 22512 28318 32405
S27 h 0 0 0 0 0 0 0 0 0 ... 10375 20403 11559 18895 18410 12754 21527 11603 16619 37679
Thank you
I think you can use numpy.where with isin, if Geneid is a column.
EDIT by comment:
There can be integers in column Geneid, so you can cast to string by astype.
healthy=['39','41','49','50','51','52','53','54','56']
df['type'] = np.where(df['Geneid'].astype(str).isin(healthy), 'h', 'd')
#get last column to list
print(df.columns[-1].split())
['type']
#create new list from last column and all columns without last
cols = df.columns[-1].split() + df.columns[:-1].tolist()
print(cols)
['type', 'Geneid', 'PRKCZ.exon1', 'PRKCZ.exon2', 'PRKCZ.exon3', 'PRKCZ.exon4',
'PRKCZ.exon5', 'PRKCZ.exon6', 'PRKCZ.exon7', 'PRKCZ.exon8', 'PRKCZ.exon9',
'PRKCZ.exon10', 'FLNA.exon31', 'FLNA.exon32', 'FLNA.exon33', 'FLNA.exon34',
'FLNA.exon35', 'FLNA.exon36', 'FLNA.exon37', 'FLNA.exon38', 'MTCP1.exon1', 'MTCP1.exon2']
#reorder columns
print(df[cols])
type Geneid PRKCZ.exon1 PRKCZ.exon2 PRKCZ.exon3 PRKCZ.exon4 \
0 d S28 22 127 135 77
1 d 22 3 630 178 259
2 d S04 16 658 320 337
3 h 56 26 663 343 390
4 d S27 13 1525 571 1081
PRKCZ.exon5 PRKCZ.exon6 PRKCZ.exon7 PRKCZ.exon8 ... \
0 120 159 49 38 ...
1 142 640 77 121 ...
2 315 881 188 162 ...
3 314 1090 263 200 ...
4 560 1867 427 370 ...
FLNA.exon31 FLNA.exon32 FLNA.exon33 FLNA.exon34 FLNA.exon35 \
0 112 104 37 83 47
1 636 288 281 538 276
2 1291 420 369 859 507
3 675 243 250 472 280
4 1817 926 551 1554 808
FLNA.exon36 FLNA.exon37 FLNA.exon38 MTCP1.exon1 MTCP1.exon2
0 18 110 70 167 19
1 109 242 314 790 484
2 208 554 408 1172 706
3 133 300 275 750 473
4 224 971 1313 1293 701
[5 rows x 22 columns]
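As a self-contained illustration of the same idea, here is a minimal sketch on a toy frame with a single exon column taken from the question. It assumes Geneid is an ordinary column, and it uses DataFrame.insert to place the new column directly after Geneid instead of reordering the column list; the column name sampletype follows the desired output in the question.

```python
import numpy as np
import pandas as pd

# Toy frame: Geneid mixes strings and integers, as in the question
df = pd.DataFrame({
    "Geneid": ["S28", 22, "S04", 56, "S27"],
    "PRKCZ.exon1": [22, 3, 16, 26, 13],
})
healthy = ["39", "41", "49", "50", "51", "52", "53", "54", "56"]

# cast to str so that the integer 56 matches the string '56'
labels = np.where(df["Geneid"].astype(str).isin(healthy), "h", "d")

# put the new column at position 1, directly after Geneid
df.insert(1, "sampletype", labels)
print(df)
```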
You could use pandas isin().
First add an extra column called 'sampletype' and fill it with 'd'. Then find all samples that have a Geneid in healthy and fill them with 'h'. Suppose your main dataframe is called df; then you would use something like:
healthy = ['39','41','49','50','51','52','53','54','56']
df['sampletype'] = 'd'
# use .loc rather than chained indexing, which may assign to a copy
df.loc[df['Geneid'].isin(healthy), 'sampletype'] = 'h'
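The question does not make clear whether Geneid is a column or the index. If it is the index (an assumption), the same approach might look like the sketch below, which builds a boolean mask from the index and assigns through .loc. The toy frame and the variable name mask are only illustrative.

```python
import pandas as pd

# Toy frame with Geneid as the index (assumption), mixing strings and integers
df = pd.DataFrame(
    {"PRKCZ.exon1": [22, 3, 16, 26, 13]},
    index=pd.Index(["S28", 22, "S04", 56, "S27"], name="Geneid"),
)
healthy = ["39", "41", "49", "50", "51", "52", "53", "54", "56"]

# default every sample to 'd'
df["sampletype"] = "d"

# cast the index labels to str so that the integer 56 matches '56'
mask = df.index.astype(str).isin(healthy)

# overwrite only the rows whose index label is in the healthy list
df.loc[mask, "sampletype"] = "h"
print(df)
```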