Replace a Pandas subrow with an array of values? - python
I have a (simple?) problem, but I cannot understand how to solve it in a pandas way.
I have this CSV:
,Unnamed: 0,Unnamed: 1,1Me,2Gi,3Ve,4Sa,5Do,6Lu,7Ma,8Me,9Gi,0Ve,1Sa,2Do,3Lu,4Ma,5Me,6Gi,7Ve,8Sa,9Do,0Lu,1Ma,2Me,3Gi,4Ve,5Sa,6Do,7Lu,8Ma,9Me,0Gi,1Ve,Unnamed: 2
0,,Ore,,",30",",46",",50",,,",20",",48",",41",",07",",52",",11",,",53",",51",",14",",28",",33",,",32",",10",",03",",44",",39",",04",,",26",",15",",07",",11",",59",
1,,Ore,,,,,,",53",,,,,,,,,,,,,,,,,,,,,,,,,,
That, when loaded, results in this dataframe:
>>> df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN ,30 ,46 ,50 ... ,26 ,15 ,07 ,11 ,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
I also have values, which is a numpy array of two lists.
>>> values
array([list(['6,30', '5,46', '4,50', '5,20', '7,48', '5,41', '2,07', '3,52', '3,11', '4,53', '4,51', '5,14', '4,28', '3,33', '5,32', '3,10', '5,03', '4,44', '4,39', '5,04', '5,26', '7,15', '5,07', '6,11', '2,59']),
list(['2,53'])], dtype=object)
My question is: I want every element in the DataFrame that matches a specific regex to be replaced with the corresponding element of the values list.
I assume that df and values have the same length (in this case 2), and also that the "wrong" numbers to be replaced in each row of df correspond one-to-one with the entries of that row in the values array.
In my case, I tried using df.replace(), but it didn't work; I got this error:
>>> df_lattice2.replace(r"\d?,\d+", values)
TypeError: Invalid "to_replace" type: 'str'
After a while, I came up with an iterative algorithm using df.iterrows(), counters, and element-by-element checks; I think, however, that a pandas solution to a problem like this must exist, but I didn't find anything.
My expected output is:
>>> expected_df
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa ... 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 ... 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
A precondition is that any function should work row by row (so no applymap), because some values are found in the second row while the corresponding value in the first row is NaN, and applymap works column by column.
Simple pandas solution
s = df.stack()
s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
out = s.unstack().reindex(df.columns, axis=1)
Explanation
stack the dataframe to reshape it. Note: the stacking operation also drops NaN values by default.
>>> df.stack()
0 Unnamed: 1 Ore
2Gi ,30
3Ve ,46
4Sa ,50
...
9Me ,07
0Gi ,11
1Ve ,59
1 Unnamed: 1 Ore
6Lu ,53
dtype: object
Match the regular expression pattern (\d?,\d+) against the stacked frame using str.contains; this essentially creates a boolean mask.
>>> s.str.contains(r'\d?,\d+', na=False)
0 Unnamed: 1 False
2Gi True
3Ve True
4Sa True
...
9Me True
0Gi True
1Ve True
1 Unnamed: 1 False
6Lu True
dtype: bool
Use np.hstack to flatten the values, then assign them to the matched strings in the stacked frame.
>>> s[s.str.contains(r'\d?,\d+', na=False)] = np.hstack(values)
>>> s
0 Unnamed: 1 Ore
2Gi 6,30
3Ve 5,46
4Sa 4,50
...
9Me 5,07
0Gi 6,11
1Ve 2,59
1 Unnamed: 1 Ore
6Lu 2,53
dtype: object
Now unstack to reshape back into a dataframe, and reindex the columns so they match the original order.
>>> s.unstack().reindex(df.columns, axis=1)
Unnamed: 0 Unnamed: 1 1Me 2Gi 3Ve 4Sa 5Do 6Lu 7Ma 8Me 9Gi 0Ve 1Sa 2Do 3Lu 4Ma 5Me 6Gi 7Ve 8Sa 9Do 0Lu 1Ma 2Me 3Gi 4Ve 5Sa 6Do 7Lu 8Ma 9Me 0Gi 1Ve Unnamed: 2
0 NaN Ore NaN 6,30 5,46 4,50 NaN NaN 5,20 7,48 5,41 2,07 3,52 3,11 NaN 4,53 4,51 5,14 4,28 3,33 NaN 5,32 3,10 5,03 4,44 4,39 5,04 NaN 5,26 7,15 5,07 6,11 2,59 NaN
1 NaN Ore NaN NaN NaN NaN NaN 2,53 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
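For reference, here is a minimal self-contained sketch of the same stack/assign/unstack approach on a toy frame (the column names and sample values below are invented for illustration, not taken from the question):

import numpy as np
import pandas as pd

# Toy frame: a non-numeric label column plus truncated strings like ",30";
# NaN cells are dropped automatically by stack()
df = pd.DataFrame({
    'Label': ['Ore', 'Ore'],
    'A': [',30', np.nan],
    'B': [',46', np.nan],
    'C': [np.nan, ',53'],
})
values = np.array([list(['6,30', '5,46']), list(['2,53'])], dtype=object)

s = df.stack()                                 # long format, NaN dropped
mask = s.str.contains(r'\d?,\d+', na=False)    # cells that look like ",30"
s[mask] = np.hstack(values)                    # positional assignment, row by row
out = s.unstack().reindex(df.columns, axis=1)  # back to the original shape
print(out)

The key assumption is that stacking visits cells left to right within each row, so the flattened values line up positionally with the matched cells.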
You are dealing with integers in df, but strings everywhere else, so that's part of the problem. Replacing with a regex requires strings, so trying it on a df of bare digits will fail. Also, passing the entire values array to replace doesn't seem like what you want to do. iterrows is very slow, so avoid it at all costs.
It looks like you want to find the string '30' and replace it with '6,30', for example. You can do that with df.replace(), all at once, as you originally wanted. You can also use replace on integers, e.g., replace the integer 30 with '6,30', in whatever data format that is. I'm not sure what the exact data is that you are working with or what data types you want in the end, so see this toy example for replacing all matching values in a df at once:
import pandas as pd

row1list = ['30', '46', '50', '20']
row2list = ['48', '41', '07', '52']
df = pd.DataFrame([row1list, row2list], columns=['1Me', '2Gi', '3ve', '4sa'])

values = ['6,30', '5,46', '2,07', '3,52']
for val in values:
    left, right = val.split(',')   # right is the bare value already present in df
    df = df.replace(right, val)

print(df)
#     1Me   2Gi   3ve   4sa
# 0  6,30  5,46    50    20
# 1    48    41  2,07  3,52
Assign new value to a cell in pd.DataFrame which is a pd.Series when series index isn't unique
Here is my data if anyone wants to try to reproduce the problem: https://github.com/LunaPrau/personal/blob/main/O_paired.csv
I have a pd.DataFrame (called O) of 1402 rows × 1402 columns, with columns and index both of the form ['XXX-icsd', 'YYY-icsd', ...] and cell values that are some np.float64, some np.nan and, problematically, some pandas.core.series.Series.

             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd
202324-icsd     0.000000     0.029729         NaN    0.098480     0.097867
644068-icsd          NaN     0.000000         NaN    0.091311     0.091049
27121-icsd      0.144897     0.137473         0.0    0.081610     0.080442
93847-icsd           NaN          NaN         NaN    0.000000     0.005083
154319-icsd          NaN          NaN         NaN         NaN     0.000000

The problem is that some cells (e.g. O.loc["192693-icsd", "192401-icsd"]) contain a pandas.core.series.Series of the form:

192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

I'm struggling to make this cell contain only a np.float64. I tried:

O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]

and various other known forms of assigning a new value to a cell in a pd.DataFrame, but they only assign a new element to the same series in this cell, e.g. if I do

O.loc["192693-icsd", "192401-icsd"] = 5

then when calling O.loc["192693-icsd", "192401-icsd"] I get:

192693-icsd    5.0
192693-icsd    5.0
Name: 192401-icsd, dtype: float64

How can I modify O.loc["192693-icsd", "192401-icsd"] so that it is of type np.float64?
It's not that df.loc["192693-icsd", "192401-icsd"] contains a Series; your index just isn't unique. This is especially obvious looking at these outputs:

>>> df.loc["192693-icsd"]
             202324-icsd  644068-icsd  27121-icsd  93847-icsd  154319-icsd  28918-icsd  28917-icsd  ...  108768-icsd  194195-icsd  174188-icsd  159632-icsd  89111-icsd  23308-icsd  253341-icsd
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996
192693-icsd          NaN          NaN         NaN         NaN     0.146843         NaN         NaN  ...          NaN     0.271191          NaN          NaN         NaN         NaN     0.253996

[2 rows x 1402 columns]

# And the fact that this returns the same:
>>> df.at["192693-icsd", "192401-icsd"]
192693-icsd    0.129562
192693-icsd    0.129562
Name: 192401-icsd, dtype: float64

You can fix this with a groupby, but you have to decide what to do with the non-unique groups. It looks like they're the same, so we'll combine them with max:

df = df.groupby(level=0).max()

Now it'll work as expected:

>>> df.loc["192693-icsd", "192401-icsd"]
0.129562120551387

Your non-unique values are:

>>> df.index[df.index.duplicated()]
Index(['193303-icsd', '192693-icsd', '416602-icsd'], dtype='object')
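If the duplicated rows really are exact copies, a hedged alternative to the groupby is to simply drop the duplicate index entries; a minimal sketch on an invented toy frame (labels made up):

import numpy as np
import pandas as pd

# Toy frame with a duplicated index label, mimicking the situation above
df = pd.DataFrame(
    {"a": [0.1, 0.1, 0.2], "b": [np.nan, np.nan, 0.3]},
    index=["x-icsd", "x-icsd", "y-icsd"],
)

deduped = df[~df.index.duplicated(keep="first")]  # keep the first copy of each label
print(deduped.loc["x-icsd", "a"])                 # now a plain float, not a Series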
IIUC, you can try DataFrame.applymap to check each cell's type and take the first element if it is a Series:

df = df.applymap(lambda x: x.iloc[0] if type(x) == pd.Series else x)
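For what it's worth, a small illustration of that one-liner on an invented frame where one cell really does hold a Series object (all labels below are made up):

import pandas as pd

# Hypothetical frame whose first cell is a whole Series object (object dtype)
cell = pd.Series([0.129562, 0.129562], index=["192693-icsd", "192693-icsd"])
df = pd.DataFrame({"192401-icsd": [cell, 0.5]}, index=["192693-icsd", "93847-icsd"])

clean = df.applymap(lambda x: x.iloc[0] if isinstance(x, pd.Series) else x)
print(clean.loc["192693-icsd", "192401-icsd"])  # 0.129562, a plain float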
It works as expected for O.loc["192693-icsd", "192401-icsd"] = O.loc["192693-icsd", "192401-icsd"][0]. Check this colab link: https://colab.research.google.com/drive/1XFXuj4OBu8GXQx6DTqv04XellmFcFWbC?usp=sharing
Extracting Info From A Column that contains irregular structure of ";" and "|" separators
I have a pandas data frame in which one of the columns looks like this:

INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...

The information I want to extract in new columns is gene|ESNT12345. For the same example it should be:

gene1              gene2              gene3
Na                 Na                 Na
BHAT12|ESNT12345   Na                 Na
JHV87|ESNT12345    HJJUB2|ESNT12345   Na
GFTREF|ESNT12345   321lkj|ESNT12345   16-YHGT|ESNT12345

How can I do this working with pandas? I have been trying with .apply(lambda x: x.split("|")), but since I don't know how many gene_name|ESNT12345 entries my dataset has, and since this will be used in an application that will take thousands of different data frames, I am looking for a way of dynamically creating the necessary columns. How can I do this?
IIUC, you could use a regex and str.extractall, joining back to the original data:

new_df = df.join(df['INFO']
                 .str.extractall(r'(\w+\|ESNT\d+)')[0]
                 .unstack(level='match')
                 .add_prefix('gene_')
                 )

output:

                                                                                         INFO            gene_0            gene_1          gene_2
0                                                                       SVTYPE=CNV;END=401233               NaN               NaN             NaN
1                                              SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|  BHAT12|ESNT12345               NaN             NaN
2                           SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|   JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3  SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...  GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345

without joining to the original data:

new_df = (df['INFO']
          .str.extractall(r'(\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .add_prefix('gene_')
          .reindex(df.index)
          )

output:

match            gene_0            gene_1          gene_2
0                   NaN               NaN             NaN
1      BHAT12|ESNT12345               NaN             NaN
2       JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3      GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345

regex hack to have gene1, gene2…
If you really want the gene counter to start at 1, you could use this small regex hack (match the beginning of the string as match 0 and drop it):

new_df = (df['INFO']
          .str.extractall(r'(^|\w+\|ESNT\d+)')[0]
          .unstack(level='match')
          .iloc[:, 1:]
          .add_prefix('gene')
          .reindex(df.index)
          )

output:

match             gene1             gene2           gene3
0                   NaN               NaN             NaN
1      BHAT12|ESNT12345               NaN             NaN
2       JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3      GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345
Remove Specific Characters/Strings/Sequences of Characters in Python
I am creating a long list of what seem to be tuples that I would like to later convert into a DataFrame, but there are certain common sequences of characters that prevent this from being possible. An example of a fraction of the output:

0,"GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12 EVENTMSGACTIONTYPE 0 PERIOD 1 WCTIMESTRING 8:04 PM PCTIMESTRING 12:00 HOMEDESCRIPTION NEUTRALDESCRIPTION VISITORDESCRIPTION SCORE NaN SCOREMARGIN NaN PERSON1TYPE 0 PLAYER1_ID 0 PLAYER1_NAME NaN PLAYER1_TEAM_ID NaN PLAYER1_TEAM_CITY NaN PLAYER1_TEAM_NICKNAME NaN PLAYER1_TEAM_ABBREVIATION NaN PERSON2TYPE 0 PLAYER2_ID 0 PLAYER2_NAME NaN PLAYER2_TEAM_ID NaN PLAYER2_TEAM_CITY NaN PLAYER2_TEAM_NICKNAME NaN PLAYER2_TEAM_ABBREVIATION NaN PERSON3TYPE 0 PLAYER3_ID 0 PLAYER3_NAME NaN PLAYER3_TEAM_ID NaN PLAYER3_TEAM_CITY NaN PLAYER3_TEAM_NICKNAME NaN PLAYER3_TEAM_ABBREVIATION NaN VIDEO_AVAILABLE_FLAG 0 DESCRIPTION TIME_ELAPSED 0 TIME_ELAPSED_PERIOD 0 Name: 0, dtype: object"

Whereas the desired output would be:

GAME_ID 21900001 EVENTNUM 2 EVENTMSGTYPE 12 EVENTMSGACTIONTYPE 0 PERIOD 1 WCTIMESTRING 8:04 PM PCTIMESTRING 12:00 HOMEDESCRIPTION NEUTRALDESCRIPTION VISITORDESCRIPTION SCORE NaN SCOREMARGIN NaN PERSON1TYPE 0 PLAYER1_ID 0 PLAYER1_NAME NaN PLAYER1_TEAM_ID NaN PLAYER1_TEAM_CITY NaN PLAYER1_TEAM_NICKNAME NaN PLAYER1_TEAM_ABBREVIATION NaN PERSON2TYPE 0 PLAYER2_ID 0 PLAYER2_NAME NaN PLAYER2_TEAM_ID NaN PLAYER2_TEAM_CITY NaN PLAYER2_TEAM_NICKNAME NaN PLAYER2_TEAM_ABBREVIATION NaN PERSON3TYPE 0 PLAYER3_ID 0 PLAYER3_NAME NaN PLAYER3_TEAM_ID NaN PLAYER3_TEAM_CITY NaN PLAYER3_TEAM_NICKNAME NaN PLAYER3_TEAM_ABBREVIATION NaN VIDEO_AVAILABLE_FLAG 0 DESCRIPTION TIME_ELAPSED 0 TIME_ELAPSED_PERIOD 0

How can I get rid of the 0 and " at the start, and then the trash at the end past TIME_ELAPSED_PERIOD? The int at the start and the one in the bottom row increase by 1 until the end of my program, which could likely go upwards of around 320,000, so the code will need to adapt to a range of int values. I think it would be easiest to do this after the creation of my list, so it shouldn't be necessary for me to show you any of my code. Just a systematic manipulation of characters should do the trick. Thanks!
Provided that your input data is in the form of a list, you can try the following to meet your requirements:

inputlist = Your_list_to_be_corrected  # Assign your input list here

# Now, remove the rows in the list that have the format "Name: 0, dtype: object"
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Now, correct the rows containing GAME_ID by removing the int number and special characters
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]
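A quick usage sketch of the same idea, under the same assumption that the data is a list of strings; the list below is an invented, shortened stand-in for the real output:

# Invented, shortened stand-in for the real list
inputlist = [
    '0,"GAME_ID 21900001',
    'EVENTNUM 2',
    'Name: 0, dtype: object"',
    '1,"GAME_ID 21900001',
    'EVENTNUM 3',
    'Name: 1, dtype: object"',
]

# Drop the trailing "Name: ..., dtype: object" entries
inputlist = [x for x in inputlist if "dtype: object" not in x]

# Strip everything before GAME_ID (the leading counter and quote)
sep = 'GAME_ID'
for index, element in enumerate(inputlist):
    if "GAME_ID" in element:
        inputlist[index] = 'GAME_ID' + element.split(sep, 1)[1]

print(inputlist)
# ['GAME_ID 21900001', 'EVENTNUM 2', 'GAME_ID 21900001', 'EVENTNUM 3']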
I'm trying to modify a pandas data frame so that I have 2 columns: a frequency column and a date column.
Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the amount of tickets issued on that day). Right now I can achieve that, however, the date is not considered a column by pandas.

import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt

df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')

unnecessary_cols = ['tag_number_masked', 'infraction_code', 'infraction_description',
                    'set_fine_amount', 'time_of_infraction', 'location1', 'location2',
                    'location3', 'location4', 'province']
df1 = df1.drop(unnecessary_cols, 1)

df1 = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
df1['frequency'] = (df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
print(df1)

df1 = (df1.iloc[121:274])

The output is:

                    date_of_infraction  frequency
date_of_infraction
20120101                          1059        NaN
20120102                          2711        NaN
20120103                          6889        NaN
20120104                          8030        NaN
20120105                          7991        NaN
20120106                          8693        NaN
20120107                          7237        NaN
20120108                          5061        NaN
20120109                          7974        NaN
20120110                          8872        NaN
20120111                          9110        NaN
20120112                          8667        NaN
20120113                          7247        NaN
20120114                          7211        NaN
20120115                          6116        NaN
20120116                          9168        NaN
20120117                          8973        NaN
20120118                          9016        NaN
20120119                          7998        NaN
20120120                          8214        NaN
20120121                          6400        NaN
20120122                          6355        NaN
20120123                          7777        NaN
20120124                          8628        NaN
20120125                          8527        NaN
20120126                          8239        NaN
20120127                          8667        NaN
20120128                          7174        NaN
20120129                          5378        NaN
20120130                          7901        NaN
...                                ...        ...
20121202                          5342        NaN
20121203                          7336        NaN
20121204                          7258        NaN
20121205                          8629        NaN
20121206                          8893        NaN
20121207                          8479        NaN
20121208                          7680        NaN
20121209                          5357        NaN
20121210                          7589        NaN
20121211                          8918        NaN
20121212                          9149        NaN
20121213                          7583        NaN
20121214                          8329        NaN
20121215                          7072        NaN
20121216                          5614        NaN
20121217                          8038        NaN
20121218                          8194        NaN
20121219                          6799        NaN
20121220                          7102        NaN
20121221                          7616        NaN
20121222                          5575        NaN
20121223                          4403        NaN
20121224                          5492        NaN
20121225                           673        NaN
20121226                          1488        NaN
20121227                          4428        NaN
20121228                          5882        NaN
20121229                          3858        NaN
20121230                          3817        NaN
20121231                          4530        NaN

Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:

import numpy as np
import pandas as pd

df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science Fundamentals/Parking_Tags_Data_2012.csv')

unnecessary_cols = ['tag_number_masked', 'infraction_code', 'infraction_description',
                    'set_fine_amount', 'time_of_infraction', 'location1', 'location2',
                    'location3', 'location4', 'province']
df1 = df1.drop(unnecessary_cols, axis=1)

# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)

Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary columns.
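As a side note, if all you need are the two columns, a value_counts chain gets there as well; a small sketch with an invented miniature dataframe (column names are just suggestions):

import pandas as pd

# Hypothetical miniature version of the tickets data
df1 = pd.DataFrame({"date_of_infraction": [20120101, 20120101, 20120102]})

freq = (df1["date_of_infraction"]
        .value_counts()
        .sort_index()
        .rename_axis("date_of_infraction")
        .reset_index(name="frequency"))
print(freq)
#    date_of_infraction  frequency
# 0            20120101          2
# 1            20120102          1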
Try using as_index=False. For example:

import numpy as np
import pandas as pd

data = {"date_of_infraction": ["20120101", "20120101", "20120202", "20120202"],
        "foo": np.random.random(4)}
df = pd.DataFrame(data)
df

  date_of_infraction       foo
0           20120101  0.681286
1           20120101  0.826723
2           20120202  0.669367
3           20120202  0.766019

(df.groupby("date_of_infraction", as_index=False)  # <-- acts like reset_index()
   .foo.count()
   .rename(columns={"foo": "frequency"})
)

  date_of_infraction  frequency
0           20120101          2
1           20120202          2
Merging more than two files with one column and common index in pandas
I have 10 .csv files with two columns. For example

file1.csv
Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)

file2.csv
Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)

file3.csv
Bact6,[4045708:4047781](+)

and so on to file10.csv.
Bact1 represents a bacterial species, and all the numbers including the sign represent the position of a gene. Each file represents a different gene, and there are duplicates, as in the case of file2.csv.
I wanted to merge these files so that I have something like this:

Bact1  [1821932:1822487](+)  [1973928:1975194](-)  NaN
Bact2  [555760:556294](+)    [972152:973499](+)    NaN
Bact3  [2901866:2902424](-)  [3001035:3002739](-)  NaN
Bact4  [1104980:1105544](+)  [3331158:3332481](+)  NaN
Bact5  NaN                   [712517:713771](+)    NaN
Bact5  NaN                   [1376120:1377386](-)  NaN
Bact6  NaN                   NaN                   [4045708:4047781](+)

I have tried to use the pandas package in Python, but it seems like most of the functions are geared towards merging two dataframes, not more than two, or I am missing something. I just started programming in Python last week (I normally use R), so I'm getting stuck on what could be, or at least seems like, a simple thing.
Right now I am using:

for x in range(1,10):
    df[x] = pandas.read_csv("file%s.csv" % (x), header=None, index_col=[0])
    df[x].columns = ['gene%s' % (x)]

dfjoin = {}
dfjoin = df[1].join([df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9], df[10]])

Result:

0                         gene1               gene2               gene3
Starkeya-novella-DSM-506    NaN  [728886:730173](+)  [731445:732615](+)
Starkeya-novella-DSM-506    NaN  [728886:730173](+)     [9662:10994](+)
Starkeya-novella-DSM-506    NaN  [728886:730173](+)     [9662:10994](+)
Starkeya-novella-DSM-506    NaN  [728886:730173](+)     [9662:10994](+)

See gene2 and gene3: they have duplicated results copied.
Assuming you've read these in as DataFrames as follows:

In [11]: df1 = pd.read_csv('file1.csv', sep=',', header=None,
                           index_col=[0], names=['bact', 'file1'])

In [12]: df1
Out[12]:
                      file1
bact
Bact1  [1821932:1822487](+)
Bact2    [555760:556294](+)
Bact3  [2901866:2902424](-)
Bact4  [1104980:1105544](+)

Then you can simply join them:

In [21]: df1.join([df2, df3])
Out[21]:
                      file1                 file2                 file3
bact
Bact1  [1821932:1822487](+)  [1973928:1975194](-)                   NaN
Bact2    [555760:556294](+)    [972152:973499](+)                   NaN
Bact3  [2901866:2902424](-)  [3001035:3002739](-)                   NaN
Bact4  [1104980:1105544](+)  [3331158:3332481](+)                   NaN
Bact5                   NaN    [712517:713771](+)                   NaN
Bact5                   NaN  [1376120:1377386](-)                   NaN
Bact6                   NaN                   NaN  [4045708:4047781](+)
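To avoid writing out ten separate read_csv calls, here is a minimal loop-based sketch of the same join; it assumes files named file1.csv through file10.csv and that each species appears at most once per file (the answer below shows one way to deal with duplicated species):

import pandas as pd

# Read file1.csv .. file10.csv; the first column is the species, the second the gene position
frames = [
    pd.read_csv(f"file{i}.csv", header=None, index_col=0,
                names=["bact", f"file{i}"])
    for i in range(1, 11)
]

# Join the remaining frames onto the first; species missing from a file become NaN
merged = frames[0].join(frames[1:], how="outer")
print(merged)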
I changed your example data a little; here is the code (updated for Python 3):

import io
import pandas as pd

data = {
    "file1": """Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
Bact5,[1104981:1105544](+)
Bact5,[1104982:1105544](+)""",
    "file2": """Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
Bact5,[1376121:1377386](-)""",
    "file3": """Bact4,[3331150:3332481](+)
Bact6,[4045708:4047781](+)"""}

def read_file(f):
    # read as a Series, then number the duplicated species 0, 1, 2, ...
    s = pd.read_csv(f, header=None, index_col=0).squeeze("columns")
    return s.groupby(s.index).apply(lambda s: pd.Series(s.values))

series = {key: read_file(io.StringIO(text)) for key, text in data.items()}
print(pd.concat(series, axis=1))

output:

                        file1                 file2                 file3
Bact1 0  [1821932:1822487](+)  [1973928:1975194](-)                   NaN
Bact2 0    [555760:556294](+)    [972152:973499](+)                   NaN
Bact3 0  [2901866:2902424](-)  [3001035:3002739](-)                   NaN
Bact4 0  [1104980:1105544](+)  [3331158:3332481](+)  [3331150:3332481](+)
Bact5 0  [1104981:1105544](+)    [712517:713771](+)                   NaN
      1  [1104982:1105544](+)  [1376120:1377386](-)                   NaN
      2                   NaN  [1376121:1377386](-)                   NaN
Bact6 0                   NaN                   NaN  [4045708:4047781](+)