How to write pandas dataframe to a csv with varying row length - python

I've read a CSV into pandas that has a varying number of values per row and some blank lines in between the rows.
Example:
This is an example
CustomerID; 123;
Test ID; 144;
Seen_on_Tv; yes;
now_some_measurements_1;
test1; 333; 444; 555;
test2; 344; 455; 566;
test3; 5544; 3424; 5456;
comment; this test sample is only for this Stackoverflow question, but
similar to my real data.
When reading in this file, I use this code:
pat = pd.read_csv(FileName, skip_blank_lines = False, header=None, sep=";", names=['a', 'b', 'c', 'd', 'e'])
pat.head(10)
output:
a b c d e
0 This is an example NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 CustomerID 123 NaN NaN NaN
3 Test ID 144 NaN NaN NaN
4 Seen_on_Tv yes NaN NaN NaN
5 NaN NaN NaN NaN NaN
6 now_some_measurements_1 NaN NaN NaN NaN
7 test1 333 444.0 555.0 NaN
8 test2 344 455.0 566.0 NaN
9 test3 5544 3424.0 5456.0 NaN
This works, especially since I have to change the CustomerID, e.g. via this code:
newID = 'HASHED'
pat.loc[pat['a'] == 'CustomerID', 'b']=newID
However, when I save this changed dataframe to CSV, I get a lot of trailing separators (";"), since most of the columns are empty, and especially on the blank lines.
pat.to_csv('out.csv', sep=";", index = False, header=False)
output (out.csv):
This is an example;;;;
;;;;
CustomerID; HASHED;;;
Test ID; 144;;;
Seen_on_Tv; yes;;;
;;;;
now_some_measurements_1;;;;
test1; 333;444.0;555.0;
test2; 344;455.0;566.0;
test3; 5544;3424.0;5456.0;
;;;;
comment; this test sample is only for this Stackoverflow question, but similar to my real
data.
;;;
I've searched almost everywhere for a solution but cannot find one.
How can I write only the non-blank column values to the CSV file (keeping the blank lines that separate the sections, which of course need to remain blank)?
Thank you in advance for your kind help.

A simple way would be to just parse your out.csv and write a stripped version of each non-blank line, leaving the blank lines (those consisting solely of ;'s) untouched, eg:
with open('out.csv') as fin, open('out2.csv', 'w') as fout:
    for line in fin:
        if stripped := line.strip(';\n '):
            fout.write(stripped + '\n')
        else:
            fout.write(line)
Will give you:
This is an example
;;;;
CustomerID; HASHED
Test ID; 144
Seen_on_Tv; yes
;;;;
now_some_measurements_1
test1; 333;444.0;555.0
test2; 344;455.0;566.0
test3; 5544;3424.0;5456.0
;;;;
comment; this test sample is only for this Stackoverflow question, but similar to my real
data.
;;;
You could also pass an io.StringIO object to to_csv (to save writing to disk and then re-reading) as the output destination, then parse that in a similar fashion to produce your desired output file.
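A minimal sketch of that in-memory variant, assuming pat is the dataframe from the question:
import io

# Write the dataframe to an in-memory buffer instead of a temporary file
buf = io.StringIO()
pat.to_csv(buf, sep=';', index=False, header=False)
buf.seek(0)

with open('out2.csv', 'w') as fout:
    for line in buf:
        if stripped := line.strip(';\n '):
            fout.write(stripped + '\n')
        else:
            fout.write(line)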

Related

Extracting Info From A Column that contains irregular structure of ";" and "|" separators

I have a pandas data frame in which one of the columns looks like this.
INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...
The information I want to extract into new columns is gene|ESNT12345. For the same example the result should be
gene1 gene2 gene3
Na Na Na
BHAT12|ESNT12345 Na Na
JHV87|ESNT12345 HJJUB2|ESNT12345 Na
GFTREF|ESNT12345 321lkj|ESNT12345 16-YHGT|ESNT12345
How can I do this working with pandas? I have been trying with .apply(lambda x: x.split("|")), but as I don't know how many gene_name|ESNT12345 entries my dataset has, and since this will be used in an application that will process thousands of different data frames, I am looking for a way of dynamically creating the necessary columns.
How can I do this?
IIUC, you could use a regex and str.extractall.
joining to the original data:
new_df = df.join(
    df['INFO']
    .str.extractall(r'(\w+\|ESNT\d+)')[0]
    .unstack(level='match')
    .add_prefix('gene_')
)
output:
INFO gene_0 gene_1 gene_2
0 SVTYPE=CNV;END=401233 NaN NaN NaN
1 SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345| BHAT12|ESNT12345 NaN NaN
2 SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345| JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|... GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
without joining to the original data:
new_df = (df['INFO']
          .str.extractall(r'(\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
          )
output:
match gene_0 gene_1 gene_2
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
regex hack to have gene1, gene2…
If you really want to have the gene counter start at 1, you could use this small regex hack (match the beginning of the string as match 0 and drop it):
new_df = (df['INFO']
          .str.extractall(r'(^|\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .iloc[:, 1:]
          .add_prefix('gene')
          .reindex(df.index)
          )
output:
match gene1 gene2 gene3
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
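Note that \w+ does not match the hyphen in 16-YHGT, which is why it shows up truncated to YHGT|ESNT12345 in the outputs above. If hyphenated gene names should be kept intact, a character class can be used instead; a minimal sketch of that variant:
# Same extraction, but [\w-]+ also allows hyphens inside the gene name
new_df = (df['INFO']
          .str.extractall(r'([\w-]+\|ESNT\d+)')[0]
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
          )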

How to use square brackets as a quote character in Pandas.read_csv

Let's say I have a text file that looks like this:
Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]
What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:
import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)
I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
Update
There were three primary solutions that were offered: 1) give a long range of names to the data frame to allow all data to be read in and then post-process the data, 2) find values in square brackets and put quotes around them, or 3) replace the first n commas with semicolons.
Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds, compared to solution 2, which ran in 3.02 seconds. The tests were run on a text file containing 18 columns and more than 208,000 rows.
We can use a simple trick - quote balanced square brackets with double quotes:
import re
import six
import pandas as pd
data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""
print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(six.StringIO(data))
print('{0:-^70}'.format('data frame'))
pd.set_option('display.expand_frame_repr', False)
print(df)
Output:
----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
Item Date Time Location junk
0 1 01/01/2016 13:41 [45.2344:-78.25453] [aaaa,bbb]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242] [0,1,2,3]
2 3 01/10/2016 01:27 [51.2344:-86.24432] [12,13]
3 4 01/30/2016 05:55 [51.2344:-86.24432,41.2342:-81242,55.5555:-81242] [45,55,65]
UPDATE: if you are sure that all square brackets are balanced, we don't have to use RegEx's:
import io
import pandas as pd
with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)
df = pd.read_csv(fo)
print(df)
I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step:
import pandas as pd
import io
import re
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')
with open('path/to/file.txt', 'r') as fi:
    # replaced brackets with quotes, pipe into file-like object
    fo = io.StringIO()
    fo.writelines(unicode(re.sub(location_regex, r'"\1"', line)) for line in fi)
    # rewind file to the beginning
    fo.seek(0)
    # read transformed CSV into data frame
    df = pd.read_csv(fo)
    print df
This gives you a result like
Date_Time Item Location
0 2016-01-01 13:41:00 1 [45.2344:-78.25453]
1 2016-01-03 19:11:00 2 [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00 3 [51.2344:-86.24432]
Edit: If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.
# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)
with open('path/to/file.csv', 'r') as fi:
    data = unicode(re.sub(location_regex, r'"\1"', fi.read()))
df = pd.read_csv(io.StringIO(data))
If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):
with open('/path/to/file.csv', 'r') as fi:
    data = unicode(fi.read().replace('[', '"').replace(']', '"'))
df = pd.read_csv(io.StringIO(data))
Below are the timing results I got with a 200k-row by 3-column dataset. Each time is averaged over 10 trials.
data frame post-processing (jezrael's solution): 2.19s
line by line regex: 1.36s
bulk regex: 0.39s
bulk string replace: 0.14s
I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv:
import pandas as pd
import io
with open('file2.csv', 'r') as f:
    lines = f.readlines()
fo = io.StringIO()
fo.writelines(u"" + line.replace(',', ';', 3) for line in lines)
fo.seek(0)
df = pd.read_csv(fo, sep=';')
print df
Item Date Time Location
0 1 01/01/2016 13:41 [45.2344:-78.25453]
1 2 01/03/2016 19:11 [43.3423:-79.23423,41.2342:-81242]
2 3 01/10/2016 01:27 [51.2344:-86.24432]
Or you can try this more complicated approach, since the main problem is that the separator , between values in the lists is the same as the separator between the other column values.
So you need post-processing:
import pandas as pd
import io
temp=u"""Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
#after testing replace io.StringIO(temp) to filename
#estimated max number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print df
0 1 2 3 4 \
0 Item Date Time Location NaN
1 1 01/01/2016 13:41 [45.2344:-78.25453] NaN
2 2 01/03/2016 19:11 [43.3423:-79.23423 41.2342:-81242
3 3 01/10/2016 01:27 [51.2344:-86.24432] NaN
5 6 7 8 9
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 41.2342:-81242] NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
#remove column with all NaN
df = df.dropna(how='all', axis=1)
#first row get as columns names
df.columns = df.iloc[0,:]
#remove first row
df = df[1:]
#remove columns name
df.columns.name = None
#get position of column Location
print df.columns.get_loc('Location')
3
#df1 with Location values
df1 = df.iloc[:, df.columns.get_loc('Location'): ]
print df1
Location NaN NaN
1 [45.2344:-78.25453] NaN NaN
2 [43.3423:-79.23423 41.2342:-81242 41.2342:-81242]
3 [51.2344:-86.24432] NaN NaN
#combine values to one column
df['Location'] = df1.apply( lambda x : ', '.join([e for e in x if isinstance(e, basestring)]), axis=1)
#subset of desired columns
print df[['Item','Date','Time','Location']]
Item Date Time Location
1 1 01/01/2016 13:41 [45.2344:-78.25453]
2 2 01/03/2016 19:11 [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3 3 01/10/2016 01:27 [51.2344:-86.24432]

Replacing/Stripping certain text from data in pandas?

I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
But when I print it, nothing has been replaced, as can be seen below by running print csvdata[-50:].head(50)
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT, I've tried .str.replace("", "") and tried just .replace("", "")
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses using dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", "Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but they don't make it clear that these methods always interpret their argument as a regular expression.
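A minimal sketch of the fix applied to the example above (in recent pandas versions str.replace defaults to literal matching, so passing regex=True explicitly is the safer route for the escaped pattern):
import pandas

csvdata = pandas.read_csv('test.csv', dtype=object)
dcol = csvdata.CDPure

# Escape the parentheses so "(AMI)" is matched literally rather than as a regex group;
# regex=True/regex=False make the intent explicit on newer pandas versions.
csvdata['CTemp'] = (dcol.str.replace(r" \(AMI\)", "", regex=True)
                        .str.replace(" N/A", " Non", regex=False))
print(csvdata.head(11))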

Read flat file to DataFrames using Pandas with field specifiers in-line

I'm attempting to read in a flat-file to a DataFrame using pandas but can't seem to get the format right. My file has a variable number of fields represented per line and looks like this:
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCinpt|MIME=application/synthesis+ssml|TXID=NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAAA-txt|TXSZ=1167|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOCsynd|INPT=1167|DURS=5120|RSTT=stop|UCPU=31|SCPU=15
TIME=20131203004552049|CHAN=FCJNJKDCAAANPCKEAAAAAAAA|EVNT=NVOClise|LUSED=0|LMAX=100|OMAX=95|LFEAT=tts|UCPU=0|SCPU=0
I have the field separator at |, I've pulled a list of all unique keys into keylist, and am trying to use the following to read in the data:
keylist = ['TIME',
'CHAN',
# [truncated]
'DURS',
'RSTT']
test_fp = 'c:\\temp\\test_output.txt'
df = pd.read_csv(test_fp, sep='|', names=keylist)
This incorrectly builds the DataFrame as I'm not specifying any way to recognize the key label in the line. I'm a little stuck and am not sure which way to research -- should I be using .read_json() for example?
Not sure if there's a slick way to do this. Sometimes when the data structure is different enough from the norm it's easiest to preprocess it on the Python side. Sure, it's not as fast, but since you could immediately save it in a more standard format it's usually not worth worrying about.
One way:
with open("wfield.txt") as fp:
rows = (dict(entry.split("=",1) for entry in row.strip().split("|")) for row in fp)
df = pd.DataFrame.from_dict(rows)
which produces
>>> df
CHAN DURS EVNT INPT LFEAT LMAX LUSED \
0 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOCinpt NaN NaN NaN NaN
1 FCJNJKDCAAANPCKEAAAAAAAA 5120 NVOCsynd 1167 NaN NaN NaN
2 FCJNJKDCAAANPCKEAAAAAAAA NaN NVOClise NaN tts 100 0
MIME OMAX RSTT SCPU TIME \
0 application/synthesis+ssml NaN NaN 15 20131203004552049
1 NaN NaN stop 15 20131203004552049
2 NaN 95 NaN 0 20131203004552049
TXID TXSZ UCPU
0 NUAN-20131203004552049-FCJNJKDCAAANPCKEAAAAAAA... 1167 31
1 NaN NaN 31
2 NaN NaN 0
[3 rows x 15 columns]
After you've got this, you can reshape as needed. (I'm not sure if you wanted to combine rows with the same TIME & CHAN or not.)
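For instance, if rows sharing TIME and CHAN were meant to be collapsed into a single record, a hedged option is to group on those keys and take the first non-null value per column:
# Collapse rows that share TIME and CHAN; GroupBy.first keeps the first
# non-null value found in each column within each group
combined = df.groupby(['TIME', 'CHAN'], as_index=False).first()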
Edit: if you're using an older version of pandas which doesn't support passing a generator to from_dict, you can build it from a list instead:
df = pd.DataFrame(list(rows))
but note that you'll have to convert numeric columns from strings after the fact.
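A hedged sketch of that post-hoc conversion, using the numeric-looking fields from the sample lines (adjust the list to your real data):
import pandas as pd

# Columns that look numeric in the sample data; non-numeric entries become NaN
numeric_cols = ['DURS', 'INPT', 'LMAX', 'LUSED', 'OMAX', 'SCPU', 'TXSZ', 'UCPU']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')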

Merging more than two files with one column and common index in pandas

I have 10 .csv files with two columns. For example
file1.csv
Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
file2.csv
Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
file3.csv
Bact6,[4045708:4047781](+)
and so on up to file10.csv. Bact1 represents a bacterial species and all the numbers, including the sign, represent the position of a gene. Each file represents a different gene, and there are duplicates, as in the case of file2.csv.
I wanted to merge these files so that I have something like this
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
I have tried to use the pandas package in Python, but it seems like most of the functions are geared towards merging two dataframes, not more than two, or I am missing something.
I just started programming in Python last week (I normally use R), so I'm getting stuck on what could be, or at least seems like, a simple thing.
Right now I am using:
df = {}
for x in range(1, 11):
    df[x] = pandas.read_csv("file%s.csv" % (x), header=None, index_col=[0])
    df[x].columns = ['gene%s' % (x)]
dfjoin = df[1].join([df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9], df[10]])
Result:
0 gene1 gene2 gene3
Starkeya-novella-DSM-506 NaN [728886:730173](+) [731445:732615](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
see gene2 and gene3, it has duplicated results copied.
Assuming you've read these in as DataFrames as follows:
In [11]: df1 = pd.read_csv('file1.csv', sep=',', header=None, index_col=[0], names=['bact', 'file1'])
In [12]: df1
Out[12]:
file1
bact
Bact1 [1821932:1822487](+)
Bact2 [555760:556294](+)
Bact3 [2901866:2902424](-)
Bact4 [1104980:1105544](+)
Then you can simply join them:
In [21]: df1.join([df2, df3])
Out[21]:
file1 file2 file3
bact
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
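To cover all ten files without spelling out each frame, a sketch that builds the same join in a loop (assuming the file%s.csv naming from the question):
import pandas as pd

# Read file1.csv ... file10.csv, naming each value column after its file
dfs = [pd.read_csv('file%s.csv' % i, sep=',', header=None, index_col=[0],
                   names=['bact', 'file%s' % i])
       for i in range(1, 11)]
merged = dfs[0].join(dfs[1:])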
I changed your example data a little, here is the code:
import pandas as pd
import io
data = {
"file1":"""Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
Bact5,[1104981:1105544](+)
Bact5,[1104982:1105544](+)""",
"file2":"""Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
Bact5,[1376121:1377386](-)""",
"file3":"""Bact4,[3331150:3332481](+)
Bact6,[4045708:4047781](+)"""}
def read_file(f):
    s = pd.read_csv(f, header=None, index_col=0, squeeze=True)
    return s.groupby(s.index).apply(lambda s: pd.Series(s.values))

series = {key: read_file(io.StringIO(unicode(text)))
          for key, text in data.items()}
print pd.concat(series, axis=1)
output:
file1 file2 file3
0
Bact1 0 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 0 [555760:556294](+) [972152:973499](+) NaN
Bact3 0 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 0 [1104980:1105544](+) [3331158:3332481](+) [3331150:3332481](+)
Bact5 0 [1104981:1105544](+) [712517:713771](+) NaN
1 [1104982:1105544](+) [1376120:1377386](-) NaN
2 NaN [1376121:1377386](-) NaN
Bact6 0 NaN NaN [4045708:4047781](+)
