I have a dataframe as follows:
df
ATG#FTY#RG#NUMFB#ZQ=CT QTG#SSTY#RG#NUMFB#ZQ=ED WQTG#SSTWY#RGW#NUMFB#ZQ=XED QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED
1 2 3 4
2 4 6 2
1 0 3 7
What I am looking for is to create a duplicate of the existing data frame, but with each column name reordered: split the name on '#' and '=', drop the keyword 'ZQ', and append 'Z' at the end. So, for example, the first column name
**ATG#FTY#RG#NUMFB#ZQ=CT** should transform to **ATG#FTY#RG#CT#NUMFBZ** (with a 'Z' appended at the end).
So I created the following code, which works fine. However, I am looking for a more elegant, Pythonic solution.
import pandas as pd
import re
for col in dfT.columns:
    zl = []
    fl = []
    mc = col.split('#')        # e.g. ['ATG', 'FTY', 'RG', 'NUMFB', 'ZQ=CT']
    myL = mc[:-2]              # all parts before 'NUMFB'
    nfS = mc[-2]               # 'NUMFB'
    fnf = nfS + 'Z'            # 'NUMFBZ'
    fl.append(fnf)
    zn = mc[-1].split('=')     # ['ZQ', 'CT']
    zl = list(zn)
    zl.remove('ZQ')            # drop the 'ZQ' keyword
    myL.extend(zl)             # append the value after '='
    myL.extend(fl)             # append 'NUMFBZ' last
    mst = '#'.join(myL)        # 'ATG#FTY#RG#CT#NUMFBZ'
    dfT.rename(columns={col: mst}, inplace=True)
In [80]: columns
Out[80]:
['ATG#FTY#RG#NUMFB#ZQ=CT',
'QTG#SSTY#RG#NUMFB#ZQ=ED',
'WQTG#SSTWY#RGW#NUMFB#ZQ=XED',
'QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED']
In [81]: def renamer(col):
    ...:     a, b, c = col.rsplit('#', 2)
    ...:     return f"{a}#{c.split('=')[1]}#{b}Z"
    ...:
In [82]: renamed = dict(zip(columns, map(renamer, columns)))
In [83]: renamed
Out[83]:
{'ATG#FTY#RG#NUMFB#ZQ=CT': 'ATG#FTY#RG#CT#NUMFBZ',
'QTG#SSTY#RG#NUMFB#ZQ=ED': 'QTG#SSTY#RG#ED#NUMFBZ',
'WQTG#SSTWY#RGW#NUMFB#ZQ=XED': 'WQTG#SSTWY#RGW#XED#NUMFBZ',
'QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED': 'QQTG#SSTQY#RGQ#XXED#NUMFBZ'}
You can use renamed in your df.rename call directly.
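For example (a minimal usage sketch):

df = df.rename(columns=renamed)

Alternatively, if the '#NUMFB#ZQ=' piece is identical in every column, the vectorized one-liner below does the same job directly on the column index: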
df.columns = df.columns.str.replace('#NUMFB#ZQ=', '#') + '#NUMFBZ'
# Index(['ATG#FTY#RG#CT#NUMFBZ', 'QTG#SSTY#RG#ED#NUMFBZ',
# 'WQTG#SSTWY#RGW#XED#NUMFBZ', 'QQTG#SSTQY#RGQ#XXED#NUMFBZ'],
In a column in a Dask Dataframe, I have strings like this:
column_name_1  column_name_2
a^b^c          j
e^f^g          k^l
h^i            m
I need to split these strings into columns in the same data frame, like this
column_name_1  column_name_2  column_name_1_1  column_name_1_2  column_name_1_3  column_name_2_1  column_name_2_2
a^b^c          j              a                b                c                j
e^f^g          k^l            e                f                g                k                l
h^i            m              h                i                                 m
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the Dataframe that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort either includes something like
df[["column_name_1_1","column_name_1_2 ","column_name_1_3"]] = df["column_name_1"].str.split('^',n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here are two solutions that work without stack, looping over the selected column names:
cols = ['column_name_1', 'column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))

print(df)
   column_name_1  column_name_2  column_name_1_0  column_name_1_1  column_name_1_2  \
0          a^b^c              j                a                b                c
1          e^f^g            k^l                e                f                g
2            h^i              m                h                i
   column_name_2_0  column_name_2_1
0                j
1                k                l
2                m
Alternatively, build all the split DataFrames first and concatenate them:
cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print (df)
   column_name_1  column_name_2  column_name_1_0  column_name_1_1  column_name_1_2  \
0          a^b^c              j                a                b                c
1          e^f^g            k^l                e                f                g
2            h^i              m                h                i
   column_name_2_0  column_name_2_1
0                j
1                k                l
2                m
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following raises a NotImplementedError:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

# raises a NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually, when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function to each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, which is provided via the meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
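Here is a minimal sketch of that map_partitions approach, assuming you are willing to fix an upper bound on the number of pieces (three here) so that meta can be specified up front; the split_col helper and the meta construction are illustrative, not part of an existing API:

import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({'column_name_1': ['a^b^c', 'e^f^g', 'h^i']}), npartitions=2
)

# meta tells Dask which columns (and dtypes) every partition will produce
meta = pd.DataFrame({f'column_name_1_{i}': pd.Series(dtype='object') for i in range(3)})

def split_col(pdf):
    out = pdf['column_name_1'].str.split('^', n=2, expand=True)
    # a partition may yield fewer than 3 columns, so pad to match meta
    out = out.reindex(columns=range(3))
    out.columns = meta.columns
    return out

ddf.map_partitions(split_col, meta=meta).compute()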
Here is one solution (building on @Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)
new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
Here's some data from another question:
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]
What I would do first is add quotes around all the words, and then:
import ast
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df = df.applymap(ast.literal_eval)
Is there a smarter way to do this?
Lists of strings
For basic structures you can use yaml without having to add quotes:
import yaml

# safe_load parses YAML flow sequences like [a, b, c] into Python lists
df = pd.read_clipboard(sep=r'\s{2,}').applymap(yaml.safe_load)
type(df.iloc[0, 0])
Out: list
Lists of numeric data
Under certain conditions, you can read your lists as strings and then convert them using literal_eval (or pd.eval, if they are simple lists).
For example,
           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12
First, ensure there are at least two spaces between the columns, then copy your data and run the following:
import ast
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df['A'] = df['A'].map(ast.literal_eval)
df
           A   B
0  [1, 2, 3]  11
1  [4, 5, 6]  12
df.dtypes
A    object
B     int64
dtype: object
Notes
For multiple columns, use applymap in the conversion step:
df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(ast.literal_eval)
If your columns can contain NaNs, define a function that can handle them appropriately:
parser = lambda x: x if pd.isna(x) else ast.literal_eval(x)
df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(parser)
If your columns contain lists of strings, you will need something like yaml.safe_load (requires installing PyYAML) to parse them instead, if you don't want to manually add quotes to the data. See above.
I did it this way:
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df = df.apply(lambda x: x.str.replace(r'[\[\]]*', '', regex=True).str.split(r',\s*', expand=False))
PS: I'm sure there must be a better way to do this...
Another alternative is
In [43]: df.applymap(lambda x: x[1:-1].split(', '))
Out[43]:
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]
Note that this assumes the first and last character in each cell is [ and ].
It also assumes there is exactly one space after the commas.
Another version:
df.applymap(lambda x:
    ast.literal_eval("[" + re.sub(r"[[\]]", "'",
                                  re.sub(r"[,\s]+", "','", x)) + "]"))
Per help from @MaxU:
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
Then:
>>> df.apply(lambda col: col.str[1:-1].str.split(', '))
                         positive                negative         neutral
1  [marvel, moral, bold, destiny]                      []  [view, should]
2                     [beautiful]     [complicated, need]              []
3                     [celebrate]  [crippling, addiction]           [big]
>>> df.apply(lambda col: col.str[1:-1].str.split(', ')).loc[3, 'negative']
['crippling', 'addiction']
And per the notes from @unutbu, who came up with a similar solution:
assumes the first and last character in each cell is [ and ]. It also assumes there is exactly one space after the commas.
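If those assumptions don't hold (variable spacing, empty [] cells), a slightly more forgiving sketch is to strip the brackets and split on a comma followed by any amount of whitespace; the parse helper here is just illustrative:

import re

# treat '[]' as an empty list; otherwise strip the brackets and split on ',\s*'
parse = lambda s: [] if not s.strip('[] ') else re.split(r',\s*', s.strip('[]'))
df = df.applymap(parse)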
If I have the following csv file test.csv:
C01,45,A,R
C02,123,H,I
where I have defined the sets R and I as
R=set(['R','E','D','N','P','H','K'])
I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
I want to be able to test if the string A is a member of set R (which is false) and if string H is a member of set I (which is true). I have tried to do this with the following script:
#!/usr/bin/env python
import pandas as pd

I = set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
R = set(['R','E','D','N','P','H','K'])

with open('test.csv') as f:
    table = pd.read_table(f, sep=',', header=None, lineterminator='\n')
    table[table.columns[3]].astype(str).isin(table[table.columns[4]].astype(str))
I.e., I am trying to do the equivalent of A in R, or rather table.columns[3] in table.columns[4], and return True or False for each row of data.
The only problem is that with the final line, both rows return True. If I change the final line to
table[table.columns[3]].astype(str).isin(R)
Then I get
0    False
1     True
which is correct. It seems that I am not referencing the set name correctly when doing .isin(table[table.columns[3]].astype(str))
any ideas?
Starting with the following:
In [21]: df
Out[21]:
     0    1  2  3
0  C01   45  A  R
1  C02  123  H  I
In [22]: R=set(['R','E','D','N','P','H','K'])
...: I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
...:
You could do something like this:
In [23]: sets = {"R":R,"I":I}
In [24]: df.apply(lambda S: S[2] in sets[S[3]],axis=1)
Out[24]:
0 False
1 True
dtype: bool
Fair warning: .apply is slow and doesn't scale well to larger data. It is there for convenience and should be a last resort.
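For this particular two-set case, a vectorized alternative is possible; the following is only a sketch assuming the column layout from the question (column 2 holds the letter, column 3 names the set):

import pandas as pd

df = pd.DataFrame([['C01', 45, 'A', 'R'], ['C02', 123, 'H', 'I']])
R = set(['R','E','D','N','P','H','K'])
I = set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])

# build one boolean mask per set and pick the right one based on column 3
in_r = df[3].eq('R') & df[2].isin(R)
in_i = df[3].eq('I') & df[2].isin(I)
print(in_r | in_i)
# 0    False
# 1     True
# dtype: bool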
I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
   cycles      40   38.02   35.98      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
I would like this DataFrame to look like this
   cycles      40      38      36      P4
0       1  1.1e-8  4.4e-8  7.7e-8  8.8e-7
1       2  2.2e-8  5.5e-8  8.8e-8  8.7e-7
2       3  3.3e-8  6.6e-8  9.9e-8  8.6e-7
The .csv files won't always have exactly the same column names; the numbers could be slightly different from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles':[1,2,3],'40':[1.1e-8,2.2e-8,3.3e-8],'38.02':[4.4e-8,5.5e-8, 6.6e-8],'35.98':[7.7e-8,8.8e-8,9.9e-8,],'P4':[8.8e-7,8.7e-7,8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
   cycles            40            38            36            P4
0       1  1.100000e-08  4.400000e-08  7.700000e-08  8.800000e-07
1       2  2.200000e-08  5.500000e-08  8.800000e-08  8.700000e-07
2       3  3.300000e-08  6.600000e-08  9.900000e-08  8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
After running some commands I have a pandas dataframe, eg.:
>>> print df
   B  A
1  2  1
2  3  2
3  4  3
4  5  4
I would like to print this out so that it produces simple code that would recreate it, e.g.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
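From those pieces you can also build the literal string the question asks for; a minimal sketch (the exact formatting of the generated command is up to you):

cmd = "pd.DataFrame({0}, columns={1}, index={2})".format(
    df.values.tolist(), list(df.columns), list(df.index))
print(cmd)
# pd.DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])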
Based on @Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex


def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])


# attach as a method on DataFrame (plain assignment works in Python 3;
# the original three-argument types.MethodType form was Python 2 only)
DataFrame.command = _gencmd
I have only tested it on a few cases so far and would love a more general solution.