Replace a Pandas subrow with an array of values? - python

I have a (simple?) problem, but I cannot figure out how to solve it in a pandas way.
I have this CSV:
,Unnamed: 0,Unnamed: 1,1Me,2Gi,3Ve,4Sa,5Do,6Lu,7Ma,8Me,9Gi,0Ve,1Sa,2Do,3Lu,4Ma,5Me,6Gi,7Ve,8Sa,9Do,0Lu,1Ma,2Me,3Gi,4Ve,5Sa,6Do,7Lu,8Ma,9Me,0Gi,1Ve,Unnamed: 2
0,,Ore,,",30",",46",",50",,,",20",",48",",41",",07",",52",",11",,",53",",51",",14",",28",",33",,",32",",10",",03",",44",",39",",04",,",26",",15",",07",",11",",59",
1,,Ore,,,,,,",53",,,,,,,,,,,,,,,,,,,,,,,,,,
That, when loaded, results in this dataframe:
>>> df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN ,30 ,46 ,50 ... ,26 ,15 ,07 ,11 ,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
And also, I have values, that is a numpy array of two lists.
>>> values
array([list(['6,30', '5,46', '4,50', '5,20', '7,48', '5,41', '2,07', '3,52', '3,11', '4,53', '4,51', '5,14', '4,28', '3,33', '5,32', '3,10', '5,03', '4,44', '4,39', '5,04', '5,26', '7,15', '5,07', '6,11', '2,59']),
list(['2,53'])], dtype=object)
My question is: I want to replace all elements in the dataframe that match a specific regex with the corresponding element of the values list.
I assume that df and values have the same length (in this case 2), and also that the "wrong" numbers to be replaced inside df line up, row by row, with the entries of the values array.
In my case, I tried using df.replace(), but it didn't work; I got this error:
>>> df_lattice2.replace(r"\d?,\d+", values)
TypeError: Invalid "to_replace" type: 'str'
After a while, I came up with an iterative algorithm using df.iterrows(), counters and element-by-element checks; I think, however, that a pandas solution to a problem like this must exist, but I didn't find anything.
My expected output is:
>>> expected_df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 ... 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
A precondition is that any function should work row by row (so no applymap), because some values are found in the second row while the corresponding value in the first row is NaN, and applymap works column by column.

Simple pandas solution
s = df.stack()
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)
Explanation
stack the dataframe to reshape it. Note: stacking also drops the NaN values by default.
>>> df.stack()
0 Unnamed: 1 Ore
2Gi ,30
3Ve ,46
4Sa ,50
...
9Me ,07
0Gi ,11
1Ve ,59
1 Unnamed: 1 Ore
6Lu ,53
dtype: object
Match the regular expression pattern (\d?,\d+) against the stacked frame using str.contains; this essentially creates a boolean mask.
>>> s.str.contains(r'\d?,\d+', na=False)
0 Unnamed: 1 False
2Gi True
3Ve True
4Sa True
...
9Me True
0Gi True
1Ve True
1 Unnamed: 1 False
6Lu True
dtype: bool
Use np.hstack to flatten values, then assign these values to the matched strings in the stacked frame.
>>> s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
>>> s
0 Unnamed: 1 Ore
2Gi 6,30
3Ve 5,46
4Sa 4,50
...
9Me 5,07
0Gi 6,11
1Ve 2,59
1 Unnamed: 1 Ore
6Lu 2,53
dtype: object
Now unstack to reshape back into a dataframe and reindex the columns
>>> s.unstack().reindex(df.columns, axis=1)
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa 5Do 6Lu 7Ma 8Me 9Gi 0Ve 1Sa 2Do 3Lu 4Ma 5Me 6Gi 7Ve 8Sa 9Do 0Lu 1Ma 2Me 3Gi 4Ve 5Sa 6Do 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 NaN NaN 5,20 7,48 5,41 2,07 3,52 3,11 NaN 4,53 4,51 5,14 4,28 3,33 NaN 5,32 3,10 5,03 4,44 4,39 5,04 NaN 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN NaN 2,53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
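Putting the pieces together, here is a self-contained sketch of the same approach on a cut-down frame (the column names and replacement values below are invented for illustration, not taken from the original CSV):
import numpy as np
import pandas as pd

# Toy frame with the same kind of problem: truncated strings mixed with NaNs
df = pd.DataFrame({'Unnamed: 1': ['Ore', 'Ore'],
                   '2Gi': [',30', None],
                   '3Ve': [',46', None],
                   '6Lu': [None, ',53']})
# Ragged object array of replacements, one list per row, as in the question
values = np.array([['6,30', '5,46'], ['2,53']], dtype=object)

s = df.stack()                                  # long format, NaNs dropped
mask = s.str.contains(r'\d?,\d+', na=False)     # cells that look like ',30'
s[mask] = np.hstack(values)                     # flattened values line up row by row
out = s.unstack().reindex(df.columns, axis=1)   # back to the original column layout
print(out)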

You are dealing with integers in df, but strings everywhere else, so that's part of the problem. Trying to replace with regex requires strings, so trying it with your df of digits will fail. Also, trying to replace using the entire values array doesn't seem like what you want to do. Iterrows is very slow, so avoid that at all costs.
Looks like you want to find the string '30' and replace it with '6,30', for example. You can do that with df.replace(), all at once, as you originally wanted. You can also use replace on integers, e.g., replace the integer 30 with '6,30' in whatever data format that is. I'm not sure exactly what data you are working with or what data types you want in the end, so see this toy example for replacing all matching values in a df at once:
import pandas as pd

row1list = ['30', '46', '50', '20']
row2list = ['48', '41', '07', '52']
df = pd.DataFrame([row1list, row2list], columns=['1Me', '2Gi', '3ve', '4sa'])
values = ['6,30', '5,46', '2,07', '3,52']
for val in values:
    left, right = val.split(',')
    df = df.replace(right, val)
print(df)
# 1Me 2Gi 3ve 4sa
# 0 6,30 5,46 50 20
# 1 48 41 2,07 3,52
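If the match really has to be by pattern rather than by exact value (closer to the regex idea in the question), replace also accepts a dict of patterns when regex=True. A rough variant on the same toy data, anchoring each number so that '30' cannot also rewrite something like '130':
for val in values:
    left, right = val.split(',')
    # dict keys are treated as regular expressions when regex=True
    df = df.replace({rf'^{right}$': val}, regex=True)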

Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique

Here is my data if anyone wants to try to reproduce the problem:
https://github.com/LunaPrau/personal/blob/main/O_paired.csv
I have a pd.DataFrame (called O) of 1402 rows × 1402 columns, with columns and index both as ['XXX-icsd', 'YYY-icsd', ...] and cell values as some np.float64, some np.nan and, problematically, some as pandas.core.series.Series.
             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd
202324-icsd     0.000000     0.029729         NaN    0.098480     0.097867
644068-icsd          NaN     0.000000         NaN    0.091311     0.091049
27121-icsd      0.144897     0.137473         0.0    0.081610     0.080442
93847-icsd           NaN          NaN         NaN    0.000000     0.005083
154319-icsd          NaN          NaN         NaN         NaN     0.000000
The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of form:
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
I'm struggling to make this cell contain only a np.float64.
I tried:
O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
and various other known forms of assigning a new value to a cell in a pd.DataFrame, but they only assign a new element to the same series in this cell, e.g. if I do
O.loc["192693-icsd", "192401-icsd"] = 5
then when calling O.loc["192693-icsd", "192401-icsd"] I get:
192693-icsd 5.0
192693-icsd 5.0
Name: 192401-icsd, dtype: float64
How to modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?
It's not that df.loc["192693-icsd", "192401-icsd"] contains a Series; your index just isn't unique. This is especially obvious looking at these outputs:
>>> df.loc["192693-icsd"]
202324-icsd 644068-icsd 27121-icsd 93847-icsd 154319-icsd 28918-icsd 28917-icsd ... 108768-icsd 194195-icsd 174188-icsd 159632-icsd 89111-icsd 23308-icsd 253341-icsd
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
192693-icsd NaN NaN NaN NaN 0.146843 NaN NaN ... NaN 0.271191 NaN NaN NaN NaN 0.253996
[2 rows x 1402 columns]
# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd 0.129562
192693-icsd 0.129562
Name: 192401-icsd, dtype: float64
You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:
df = df.groupby(level=0).max()
Now it'll work as expected:
>>> df.loc["192693-icsd", "192401-icsd"]
0.129562120551387
Your non-unique values are:
>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')
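If the duplicated rows really are exact copies, another option is simply to drop them instead of aggregating. A sketch (keep='first' is an arbitrary choice here; the second line is only needed if the column labels repeat as well):
df = df[~df.index.duplicated(keep='first')]
# only if the column labels are duplicated too
df = df.loc[:, ~df.columns.duplicated(keep='first')]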
IIUC, you can try DataFrame.applymap to check each cell's type and take the first value if it is a Series:
df = df.applymap(lambda x: x.iloc[0] if type(x) == pd.Series else x)
It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]
Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing

Extracting Info From A Column that contains irregular structure of ";" and "|" separators

I have a pandas data frame in which one of the columns looks like this.
INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...
The information I want to extract into new columns is gene|ESNT12345. For the same example, it should be:
gene1 gene2 gene3
Na Na Na
BHAT12|ESNT12345 Na Na
JHV87|ESNT12345 HJJUB2|ESNT12345 Na
GFTREF|ESNT12345 321lkj|ESNT12345 16-YHGT|ESNT12345
How can I do this with pandas? I have been trying with .apply(lambda x: x.split("|")), but since I don't know how many gene_name|ESNT12345 entries my dataset has, and this will be used in an application that processes thousands of different data frames, I am looking for a way of dynamically creating the necessary columns.
How can I do this?
IIUC, you could use a regex and str.extractall.
joining to the original data:
new_df = df.join(
    df['INFO']
    .str.extractall(r'(\w+\|ESNT\d+)')[0]
    .unstack(level='match')
    .add_prefix('gene_')
)
output:
INFO gene_0 gene_1 gene_2
0 SVTYPE=CNV;END=401233 NaN NaN NaN
1 SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345| BHAT12|ESNT12345 NaN NaN
2 SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345| JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|... GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
without joining to the original data:
new_df = (df['INFO']
          .str.extractall(r'(\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
)
output:
match gene_0 gene_1 gene_2
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
regex hack to have gene1, gene2…
If you really want to have the genes counter to start with 1, you could use this small regex hack (match the beginning of the string as match 0 and drop it):
new_df = (df['INFO']
          .str.extractall(r'(^|\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .iloc[:, 1:]
          .add_prefix('gene')
          .reindex(df.index)
)
output:
match gene1 gene2 gene3
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
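Note that \w does not match the hyphen, which is why 16-YHGT comes out as YHGT in the outputs above. If hyphenated gene names should be captured whole (an assumption about the naming scheme), the character class can be widened:
new_df = (df['INFO']
          .str.extractall(r'([\w-]+\|ESNT\d+)')[0]   # [\w-] keeps names like 16-YHGT intact
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
)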

Remove Specific Characters/Strings/Sequences of Characters in Python

I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:
0,"GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
Name: 0, dtype: object"
Whereas the desired output would be:
GAME_ID 21900001
EVENTNUM 2
EVENTMSGTYPE 12
EVENTMSGACTIONTYPE 0
PERIOD 1
WCTIMESTRING 8:04 PM
PCTIMESTRING 12:00
HOMEDESCRIPTION
NEUTRALDESCRIPTION
VISITORDESCRIPTION
SCORE NaN
SCOREMARGIN NaN
PERSON1TYPE 0
PLAYER1_ID 0
PLAYER1_NAME NaN
PLAYER1_TEAM_ID NaN
PLAYER1_TEAM_CITY NaN
PLAYER1_TEAM_NICKNAME NaN
PLAYER1_TEAM_ABBREVIATION NaN
PERSON2TYPE 0
PLAYER2_ID 0
PLAYER2_NAME NaN
PLAYER2_TEAM_ID NaN
PLAYER2_TEAM_CITY NaN
PLAYER2_TEAM_NICKNAME NaN
PLAYER2_TEAM_ABBREVIATION NaN
PERSON3TYPE 0
PLAYER3_ID 0
PLAYER3_NAME NaN
PLAYER3_TEAM_ID NaN
PLAYER3_TEAM_CITY NaN
PLAYER3_TEAM_NICKNAME NaN
PLAYER3_TEAM_ABBREVIATION NaN
VIDEO_AVAILABLE_FLAG 0
DESCRIPTION
TIME_ELAPSED 0
TIME_ELAPSED_PERIOD 0
How can I get rid of the 0 and " at the start, and the trash at the end after TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increase by 1 until the end of my program, which could go upwards of around 320,000, so the code needs to adapt to a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary for me to show you any of my code; a systematic manipulation of characters should do the trick. Thanks!
Provided that your input data is in the form of a list, you can try the following to meet your requirements:
inputlist = Your_list_to_be_corrected  # Assign your input list here
# Remove the rows in the list that have the format "Name: 0, dtype: object"
inputlist = [x for x in inputlist if "dtype: object" not in x]
# Correct the rows containing GAME_ID by removing the int number and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]
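A quick check on a trimmed-down two-element sample (the strings below are abbreviated stand-ins for the real rows): running the snippet above on
inputlist = ['0,"GAME_ID 21900001\nEVENTNUM 2', 'Name: 0, dtype: object"']
leaves
['GAME_ID 21900001\nEVENTNUM 2']
i.e. the dtype row is dropped and the leading 0," is stripped from the GAME_ID row.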

I'm trying to modify a pandas data frame so that I will have 2 columns: a frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the number of tickets issued on that day). Right now I can achieve that; however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
df1 = df1.drop(unnecessary_cols, 1)
df1 = df1.groupby('date_of_infraction').agg({'date_of_infraction': 'count'})
df1['frequency'] = df1.groupby('date_of_infraction').agg({'date_of_infraction': 'count'})
print(df1)
df1 = df1.iloc[121:274]
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
                    'infraction_description', 'set_fine_amount', 'time_of_infraction',
                    'location1', 'location2', 'location3', 'location4',
                    'province']
df1 = df1.drop(unnecessary_cols, axis=1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
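If a continuous daily axis with explicit zeros is wanted instead, one possibility (assuming date_of_infraction holds integers in YYYYMMDD form, as your output suggests) is to reindex against a full date range:
counts['date_of_infraction'] = pd.to_datetime(counts['date_of_infraction'].astype(str),
                                              format='%Y%m%d')
full_year = pd.date_range('2012-01-01', '2012-12-31', freq='D')
counts = (counts.set_index('date_of_infraction')
                .reindex(full_year, fill_value=0)   # absent dates become 0
                .rename_axis('date_of_infraction')
                .reset_index())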
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary rows.
Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
"foo":np.random.random(4)}
df = pd.DataFrame(data)
df
date_of_infraction foo
0 20120101 0.681286
1 20120101 0.826723
2 20120202 0.669367
3 20120202 0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
.foo.count()
.rename(columns={"foo":"frequency"})
)
date_of_infraction frequency
0 20120101 2
1 20120202 2
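Another common spelling of the same idea is groupby(...).size(), which counts rows directly and does not depend on picking a placeholder column such as foo (a sketch on the same toy data):
freq = (df.groupby('date_of_infraction')
          .size()
          .reset_index(name='frequency'))
print(freq)
#   date_of_infraction  frequency
# 0           20120101          2
# 1           20120202          2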

Merging more than two files with one column and common index in pandas

I have 10 .csv files with two columns. For example
file1.csv
Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
file2.csv
Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
file3.csv
Bact6,[4045708:4047781](+)
and so on up to file10.csv. Bact1 represents a bacterial species, and the numbers including the sign represent the position of a gene. Each file represents a different gene, and there are duplicates, as in the case of file2.csv.
I wanted to merge these files so that I have something like this
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
I have tried to use the pandas package in Python, but it seems like most of the functions are geared towards merging two dataframes, not more than two, or I am missing something.
I just started programming in Python last week (I normally use R), so I am getting stuck on what is, or at least seems like, a simple thing.
Right now I am using:
import pandas

df = {}
for x in range(1, 11):
    df[x] = pandas.read_csv("file%s.csv" % x, header=None, index_col=[0])
    df[x].columns = ['gene%s' % x]
dfjoin = df[1].join([df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9], df[10]])
Result:
0 gene1 gene2 gene3
Starkeya-novella-DSM-506 NaN [728886:730173](+) [731445:732615](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
See gene2 and gene3: the duplicated results have been copied across rows.
Assuming you've read these in as DataFrames as follows:
In [11]: df1 = pd.read_csv('file1.csv', sep=',', header=None, index_col=[0], names=['bact', 'file1'])
In [12]: df1
Out[12]:
file1
bact
Bact1 [1821932:1822487](+)
Bact2 [555760:556294](+)
Bact3 [2901866:2902424](-)
Bact4 [1104980:1105544](+)
Then you can simply join them:
In [21]: df1.join([df2, df3])
Out[21]:
file1 file2 file3
bact
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
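Since there are ten files, the reading and joining can also be driven by a loop rather than written out by hand. A sketch assuming the file1.csv … file10.csv naming from the question (labels duplicated within a file will still multiply rows here, which is what the next answer works around):
import pandas as pd

dfs = [pd.read_csv('file%s.csv' % i, header=None, index_col=0,
                   names=['bact', 'gene%s' % i])
       for i in range(1, 11)]
merged = dfs[0].join(dfs[1:])
print(merged)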
I changed your example data a little; here is the code:
import pandas as pd
import io
data = {
"file1":"""Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
Bact5,[1104981:1105544](+)
Bact5,[1104982:1105544](+)""",
"file2":"""Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
Bact5,[1376121:1377386](-)""",
"file3":"""Bact4,[3331150:3332481](+)
Bact6,[4045708:4047781](+)"""}
def read_file(f):
    s = pd.read_csv(f, header=None, index_col=0, squeeze=True)
    return s.groupby(s.index).apply(lambda s: pd.Series(s.values))

series = {key: read_file(io.StringIO(text))
          for key, text in data.items()}
print(pd.concat(series, axis=1))
output:
file1 file2 file3
0
Bact1 0 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 0 [555760:556294](+) [972152:973499](+) NaN
Bact3 0 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 0 [1104980:1105544](+) [3331158:3332481](+) [3331150:3332481](+)
Bact5 0 [1104981:1105544](+) [712517:713771](+) NaN
1 [1104982:1105544](+) [1376120:1377386](-) NaN
2 NaN [1376121:1377386](-) NaN
Bact6 0 NaN NaN [4045708:4047781](+)
