I need a way to print several lists of varying lengths as tab-delimited columns next to each other, with the empty cells either left empty or filled with some character (e.g. "-").
The methods I have attempted so far have not worked for lists of varying lengths, and numpy has not behaved as I expected.
To summarize:
listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
printed as such in a .txt file:
1 4 9
2 5 10
3 6 11
- 7 12
- 8 -
You can use itertools.izip_longest. The gaps left by the shorter sequences are filled with None by default; to fill them with something else, pass fillvalue (thanks #szxk):
>>> import itertools
>>> listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
>>> for x in itertools.izip_longest(*listname, fillvalue="-"):
...     print '\t'.join([str(e) for e in x])
...
1 4 9
2 5 10
3 6 11
- 7 12
- 8 -
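On Python 3 the same function is named itertools.zip_longest and print is a function. A minimal sketch of the same approach under that assumption (the helper name columns_to_lines is made up for illustration):

```python
from itertools import zip_longest


def columns_to_lines(columns, fill="-"):
    """Return one tab-delimited line per row, padding short columns with `fill`."""
    return ["\t".join(str(cell) for cell in row)
            for row in zip_longest(*columns, fillvalue=fill)]


listname = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]
lines = columns_to_lines(listname)
print("\n".join(lines))
```

To produce the .txt file, the joined text can be passed to f.write inside a with open(..., 'w') block instead of print.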
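You can use the built-in zip function in this case, which is faster than itertools.izip_longest for small lists. Note that zip stops at the shortest list, so the output is truncated to three rows and no fill character is written:
listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
with open('a.txt', 'w') as f:
    for tup in zip(*listname):
        f.write('\t'.join(map(str, tup)) + '\n')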
A benchmark:
~$ python -m timeit "import itertools;listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]];itertools.izip_longest(*listname)"
1000000 loops, best of 3: 1.13 usec per loop
~$ python -m timeit "listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]];zip(*listname)"
1000000 loops, best of 3: 0.67 usec per loop
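When comparing the two, keep in mind that they do not produce the same rows: zip stops at the shortest list, while zip_longest pads to the longest one. Also, on Python 2 izip_longest only builds a lazy iterator, so its timing above does not include producing the rows. A small Python 3 sketch of the difference in output:

```python
from itertools import zip_longest

listname = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10, 11, 12]]

# zip drops everything past the shortest list.
truncated = list(zip(*listname))
# zip_longest keeps every element and pads the gaps.
padded = list(zip_longest(*listname, fillvalue="-"))

print(truncated)   # [(1, 4, 9), (2, 5, 10), (3, 6, 11)]
print(padded[-1])  # ('-', 8, '-')
```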
What about using pandas:
In [38]: listname = [[1,2,3],[4,5,6,7,8],[9,10,11,12]]
In [39]: import pandas as pd
In [40]: df = pd.DataFrame(listname, dtype=object)
In [41]: df.T
Out[41]:
0 1 2
0 1 4 9
1 2 5 10
2 3 6 11
3 None 7 12
4 None 8 None
[5 rows x 3 columns]
In [42]: df.T.to_csv("my_file.txt", index=False, header=False, sep="\t", na_rep="-")
I have a really simple Pandas dataframe where each cell contains a list. I'd like to split each element of the list into its own column. I can do that by exporting the values and then creating a new dataframe. This doesn't seem like a good way to do it, especially if my dataframe had a column aside from the list column.
import pandas as pd
df = pd.DataFrame(data=[[[8,10,12]],
[[7,9,11]]])
df = pd.DataFrame(data=[x[0] for x in df.values])
Desired output:
0 1 2
0 8 10 12
1 7 9 11
Follow-up based on #Psidom's answer:
If I did have a second column:
df = pd.DataFrame(data=[[[8,10,12], 'A'],
                        [[7,9,11], 'B']])
How do I not lose the other column?
Desired output:
0 1 2 3
0 8 10 12 A
1 7 9 11 B
You can loop through the Series with the apply() function and convert each list to a Series; this automatically expands each list across columns:
df[0].apply(pd.Series)
# 0 1 2
#0 8 10 12
#1 7 9 11
Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:
pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)
# 0 1 2 1
#0 8 10 12 A
#1 7 9 11 B
You could use pd.DataFrame(df[col].values.tolist()) instead, which is much faster (roughly 250x in the timings below):
In [820]: pd.DataFrame(df[0].values.tolist())
Out[820]:
0 1 2
0 8 10 12
1 7 9 11
In [821]: pd.concat([pd.DataFrame(df[0].values.tolist()), df[1]], axis=1)
Out[821]:
0 1 2 1
0 8 10 12 A
1 7 9 11 B
Timings
Medium
In [828]: df.shape
Out[828]: (20000, 2)
In [829]: %timeit pd.DataFrame(df[0].values.tolist())
100 loops, best of 3: 15 ms per loop
In [830]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 4.06 s per loop
Large
In [832]: df.shape
Out[832]: (200000, 2)
In [833]: %timeit pd.DataFrame(df[0].values.tolist())
10 loops, best of 3: 161 ms per loop
In [834]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 40.9 s per loop
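One thing to watch with the faster variant: pd.DataFrame(list_of_lists) gets a fresh 0..n-1 index, and pd.concat(..., axis=1) aligns rows on the index. If df has any other index, passing index=df.index keeps the rows lined up. A sketch of that, where the column names and index values are made up for illustration:

```python
import pandas as pd

# Non-default index, chosen only to show the alignment issue.
df = pd.DataFrame({"values": [[8, 10, 12], [7, 9, 11]], "label": ["A", "B"]},
                  index=[10, 20])

# Reuse the original index so concat can align the expanded rows with the rest.
expanded = pd.DataFrame(df["values"].tolist(), index=df.index)
result = pd.concat([expanded, df["label"]], axis=1)
print(result)
```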
I'm selecting several columns of a dataframe, by a list of the column names. This works fine if all elements of the list are in the dataframe.
But if some elements of the list are not in the DataFrame, then it will generate the error "not in index".
Is there a way to select all columns which are included in that list, even if not all elements of the list are present in the dataframe? Here is some sample data which generates the above error:
df = pd.DataFrame( [[0,1,2]], columns=list('ABC') )
lst = list('ARB')
data = df[lst] # error: not in index
I think you need Index.intersection:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
lst = ['A','R','B']
print (df.columns.intersection(lst))
Index(['A', 'B'], dtype='object')
data = df[df.columns.intersection(lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
Another solution with numpy.intersect1d:
data = df[np.intersect1d(df.columns, lst)]
print (data)
A B
0 1 4
1 2 5
2 3 6
A few other ways; the list comprehension is much faster:
In [1357]: df[df.columns & lst]
Out[1357]:
A B
0 1 4
1 2 5
2 3 6
In [1358]: df[[c for c in df.columns if c in lst]]
Out[1358]:
A B
0 1 4
1 2 5
2 3 6
Timings
In [1360]: %timeit [c for c in df.columns if c in lst]
100000 loops, best of 3: 2.54 µs per loop
In [1359]: %timeit df.columns & lst
1000 loops, best of 3: 231 µs per loop
In [1362]: %timeit df.columns.intersection(lst)
1000 loops, best of 3: 236 µs per loop
In [1363]: %timeit np.intersect1d(df.columns, lst)
10000 loops, best of 3: 26.6 µs per loop
Details
In [1365]: df
Out[1365]:
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
In [1366]: lst
Out[1366]: ['A', 'R', 'B']
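If the order of the requested names matters, or you want to know which names were skipped, the list comprehension can iterate over the requested list instead of df.columns. A minimal sketch, assuming a hypothetical helper named select_existing (not part of pandas):

```python
import pandas as pd


def select_existing(df, wanted):
    """Return (subset, missing): the requested columns that exist, and the names that do not."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return df[present], missing


df = pd.DataFrame([[0, 1, 2]], columns=list("ABC"))
data, missing = select_existing(df, ["B", "R", "A"])

print(list(data.columns))  # ['B', 'A']
print(missing)             # ['R']
```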
A really simple solution here is to use filter(). In your example, just type:
df.filter(lst)
and it will automatically ignore any missing columns. For more, see the documentation for filter.
As a general note, filter is a very flexible and powerful way to select specific columns. In particular, you can use regular expressions. Borrowing the sample data from #jezrael, you could type either of the following.
df.filter(regex='A|R|B')
df.filter(regex='[ARB]')
Those are trivial examples, but suppose you wanted only columns starting with those letters, then you could type:
df.filter(regex='^[ARB]')
FWIW, in some quick timings I find this to be faster than the list comprehension method, but I don't think speed is really much of a concern here -- even the slowest way should be fast enough, as the speed does not depend on the size of the dataframe, only on the number of columns.
Honestly, all of these ways are fine and you can go with whatever is most readable to you. I prefer filter because it is simple while also giving you more options for selecting columns than a simple intersection.
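To make the difference between the three filter calls concrete, here is a small sketch. The column names are invented for illustration so that the anchored and unanchored patterns give different results:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["Area", "Bar", "CAB", "Delta"])

# items= keeps the listed names that exist and silently skips the rest.
by_items = df.filter(items=["Area", "Missing", "Bar"])
# An unanchored pattern matches the letters anywhere in the column name.
anywhere = df.filter(regex="[ARB]")
# Anchoring with ^ restricts the match to the first character.
at_start = df.filter(regex="^[ARB]")

print(list(by_items.columns))  # ['Area', 'Bar']
print(list(anywhere.columns))  # ['Area', 'Bar', 'CAB']
print(list(at_start.columns))  # ['Area', 'Bar']
```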
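Use * to unpack the list:
data = df[[*lst]]
Note that this is equivalent to df[lst], so it gives the desired result only when every name in lst exists in the DataFrame; a missing name still raises the same error.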
Please try this:
Syntax: DataFrame[[list of columns]]
For example: df[['a','b']]
a
Out[5]:
a b c
0 1 2 3
1 12 3 44
x is the list of required columns to slice:
x = ['a','b']
This gives you the required slice:
a[x]
Out[7]:
a b
0 1 2
1 12 3
Performance:
%timeit a[x]
333 µs ± 9.27 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Given the DataFrame:
import pandas as pd
df = pd.DataFrame([6, 4, 2, 4, 5], index=[2, 6, 3, 4, 5], columns=['A'])
Results in:
A
2 6
6 4
3 2
4 4
5 5
Now, I would like to sort by values of Column A AND the index.
e.g.
df.sort_values(by='A')
Returns
A
3 2
6 4
4 4
5 5
2 6
Whereas I would like
A
3 2
4 4
6 4
5 5
2 6
How can I get a sort on the column first and index second?
You can sort by index and then by column A using kind='mergesort'.
This works because mergesort is stable.
res = df.sort_index().sort_values('A', kind='mergesort')
Result:
A
3 2
4 4
6 4
5 5
2 6
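The reason a stable sort makes this work can be shown without pandas: Python's built-in sorted() is also stable, so sorting by the secondary key first and the primary key second leaves ties in secondary-key order. A plain-Python sketch of the same idea (this is an illustration of the principle, not what pandas does internally):

```python
# (index, A) pairs matching the DataFrame in the question.
rows = [(2, 6), (6, 4), (3, 2), (4, 4), (5, 5)]

# Step 1: order by the secondary key (the index).
by_index = sorted(rows, key=lambda r: r[0])
# Step 2: stable sort by the primary key (A); ties keep their index order.
by_a_then_index = sorted(by_index, key=lambda r: r[1])

print(by_a_then_index)  # [(3, 2), (4, 4), (6, 4), (5, 5), (2, 6)]
```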
Using lexsort from numpy may be another way, and a little faster as well:
df.iloc[np.lexsort((df.index, df.A.values))] # Sort by A.values, then by index
Result:
A
3 2
4 4
6 4
5 5
2 6
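The key order in np.lexsort is easy to get backwards: the last key in the tuple is the primary sort key. A small sketch using plain arrays with the same values as the question:

```python
import numpy as np

index = np.array([2, 6, 3, 4, 5])
a = np.array([6, 4, 2, 4, 5])

# The LAST key (a) is the primary key; index only breaks ties.
order = np.lexsort((index, a))

print(order.tolist())         # [2, 3, 1, 4, 0]
print(index[order].tolist())  # [3, 4, 6, 5, 2]
print(a[order].tolist())      # [2, 4, 4, 5, 6]
```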
Comparing with timeit:
%%timeit
df.iloc[np.lexsort((df.index, df.A.values))] # Sort by A.values, then by index
Result:
1000 loops, best of 3: 278 µs per loop
With reset index and set index again:
%%timeit
df.reset_index().sort_values(by=['A','index']).set_index('index')
Result:
100 loops, best of 3: 2.09 ms per loop
The other answers are great. I'll throw in one other option, which is to provide a name for the index first using rename_axis and then reference it in sort_values. I have not tested the performance but expect the accepted answer to still be faster.
df.rename_axis('idx').sort_values(by=['A', 'idx'])
A
idx
3 2
4 4
6 4
5 5
2 6
You can clear the index name afterward if you want with df.index.name = None.
I am reading a somewhat large table (90*85000) of strings, integers and missing values into pandas. The file fits easily into my memory. I also ran the script on a server with plenty of memory, observing the same behavior.
I would assume that reading the file in bulk would be faster or as fast as with chunking. However, with 'chunksize=any_number' pandas reads the file almost 300 times faster (11.138s vs. 0.039s).
Can someone explain this behavior?
My code:
from datetime import datetime
import pandas as pd

# dataFile holds the path to the input table
startTime = datetime.now()
df=pd.read_csv(dataFile,delim_whitespace=True)
print datetime.now() - startTime
startTime = datetime.now()
df=pd.read_csv(dataFile,delim_whitespace=True, chunksize=10)
print datetime.now() - startTime
Because in the second part you've only created a pandas.io.parsers.TextFileReader object (an iterator); the rows have not been read yet.
Demo:
In [17]: df = pd.DataFrame(np.random.randint(0, 10, size=(20, 3)), columns=list('abc'))
In [18]: df.to_csv('d:/temp/test.csv')
In [19]: reader = pd.read_csv('d:/temp/test.csv', chunksize=10, index_col=0)
In [20]: print(reader)
<pandas.io.parsers.TextFileReader object at 0x000000000827CB70>
How to use this iterator:
In [21]: for df in reader:
   ....:     print(df)
   ....:
a b c
0 0 5 6
1 6 0 6
2 2 5 0
3 3 6 2
4 5 7 2
5 5 2 9
6 0 0 1
7 4 8 3
8 1 8 0
9 0 8 8
a b c
10 7 9 1
11 6 7 9
12 7 3 2
13 6 4 4
14 7 4 1
15 2 6 5
16 5 2 2
17 9 9 7
18 4 9 0
19 0 1 9
In the first part of your code you read the whole CSV file into one DF (DataFrame). Obviously it takes longer, because the iterator object (reader in the demo above) doesn't read the data from the CSV file until you start to iterate over it.
Example: let's create a 1M rows DF and compare the timing of pd.read_csv(...) and pd.read_csv(..., chunksize=1000):
In [24]: df = pd.DataFrame(np.random.randint(0, 10, size=(10**6, 3)), columns=list('abc'))
In [25]: df.shape
Out[25]: (1000000, 3)
In [26]: df.to_csv('d:/temp/test.csv')
In [27]: %timeit pd.read_csv('d:/temp/test.csv', index_col=0)
1 loop, best of 3: 1.21 s per loop
In [28]: %timeit pd.read_csv('d:/temp/test.csv', index_col=0, chunksize=1000)
100 loops, best of 3: 4.42 ms per loop
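For a like-for-like comparison, the chunked reader has to be consumed, for example by concatenating its chunks. A minimal sketch, assuming a small in-memory CSV built with io.StringIO in place of a real file:

```python
import io

import pandas as pd

# 25 data rows plus a header, generated in memory for illustration.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(25))

# This call returns an iterator; the rows are parsed as it is consumed.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)

# Consuming the iterator is where the actual reading happens.
chunks = list(reader)
chunk_sizes = [len(chunk) for chunk in chunks]
df = pd.concat(chunks, ignore_index=True)

print(chunk_sizes)  # [10, 10, 5]
print(df.shape)     # (25, 2)
```

Timing the list(reader) and pd.concat steps together with the read_csv call is what should be compared against the plain, unchunked read.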