Suppose I have two pandas DataFrames of the form:
>>> df
A B C
first 62.184209 39.414005 60.716563
second 51.508214 94.354199 16.938342
third 36.081861 39.440953 38.088336
>>> df1
A B C
first 0.828069 0.762570 0.717368
second 0.136098 0.991668 0.547499
third 0.120465 0.546807 0.346949
>>>
That I generated with:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3, 3])*100,
                  columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
df1 = pd.DataFrame(np.random.random([3, 3]),
                   columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
Could you find the smartest and quickest way of getting something like:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
?
I guess I could do it with a for loop that takes even rows from the first and odd rows from the second, but that does not seem very efficient to me.
Try this:
In [501]: pd.concat([df, df1.set_index(df1.index + '_s')]).sort_index()
Out[501]:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
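Note that sort_index() gives the interleaved order here only because each suffixed label sorts directly after its base label and the base labels are already alphabetical. If that does not hold, one option is to interleave by position instead. This is a minimal sketch, assuming both frames have the same number of rows in the same order; the index labels and the `order` helper array are made up for illustration:

```python
import numpy as np
import pandas as pd

# Labels deliberately not in alphabetical order
labels = ['zeta', 'alpha', 'mid']
df = pd.DataFrame(np.random.random([3, 3]) * 100,
                  columns=['A', 'B', 'C'], index=labels)
df1 = pd.DataFrame(np.random.random([3, 3]),
                   columns=['A', 'B', 'C'], index=labels)

# Stack df on top of the suffixed df1, without sorting
stacked = pd.concat([df, df1.set_index(df1.index + '_s')])

n = len(df)
# Positions 0, n, 1, n+1, ...: row i of df followed by row i of df1
order = np.arange(2 * n).reshape(2, n).T.ravel()
interleaved = stacked.iloc[order]
print(interleaved)
```

The original row order of df is preserved, whatever the labels are.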
I have code that deletes all columns whose names start with spike:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist() , axis=1).copy()
Question: How can I modify the code above so that it keeps the first column that starts with spike but deletes all the others that start with spike? If the code above is hard to modify, suggest your own version.
This can be achieved just by changing .tolist() to .tolist()[1:]; the final code looks like:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist()[1:] , axis=1).copy()
You can create a spike flag and drop duplicates which will only keep the first one.
(
    df.T.assign(flag=lambda x: x.index.str.slice(0, 5))
    .drop_duplicates(subset='flag')
    .drop(columns='flag')
    .T
)
spike_starts1 not
0 1 10
1 2 11
2 3 12
Or you can build a dict with only the first spike column and the other non-spike columns.
first_spike = [c for c in df.columns if c.startswith('spike')][0]
(
    pd.DataFrame({c: df[c] for c in df.columns
                  if c == first_spike or not c.startswith('spike')})
)
Another solution:
(
    pd.DataFrame(df.columns)
    .assign(F=lambda x: x[0].str[:5])
    .drop_duplicates(subset='F')
    .pipe(lambda x: df.reindex(columns=x[0]))
)
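A variant that avoids transposing or rebuilding the frame is to build a boolean mask over the column labels. This is only a sketch of that idea; `is_spike` and `keep` are names introduced here for illustration:

```python
import pandas as pd

data = {'spike_starts1': [1, 2, 3], 'spike_starts2': [4, 5, 6],
        'spike_starts3': [7, 8, 9], 'not': [10, 11, 12]}
df = pd.DataFrame(data)

# Boolean array: True where the column name starts with 'spike'
is_spike = df.columns.str.startswith('spike')

# Keep every non-spike column, plus the spike column whose running count is 1
keep = ~is_spike | (is_spike & (is_spike.cumsum() == 1))
df2 = df.loc[:, keep].copy()
print(df2)
```

Because the mask works on labels only, the remaining columns keep their original order and dtypes.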
I have a DataFrame as below:
df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
I tried creating a function that lets me change a column name dynamically, so that I can just pass the old column name and the new column name to the function, as below:
def rename_column_name(df, old_column, new_column):
    df = df.rename({'{}'.format(old_column) : '{}'.format(new_column)}, axis=1)
    return df
This function only works if I pass a single column name, as below:
new_df = rename_column_name(df, '$a' , 'a')
which gives me this new_df:
new_df = pd.DataFrame({'a':[1,2], '$b': [10,20]})
However, I want a function that lets me rename one or several columns, depending on my preference, like this:
new_df = rename_column_name(df, ['$a','$b'] , ['a','b'])
And get the new_df as below
new_df = pd.DataFrame({'a':[1,2], 'b': [10,20]})
So, how do I make my function more dynamic, so that I can enter one or several column names and rename them?
You don't need a function; you can do this by building a dict with zip:
In [265]: old_names = df.columns.tolist()
In [266]: new_names = ['a','b']
In [268]: df = df.rename(columns=dict(zip(old_names, new_names)))
In [269]: df
Out[269]:
a b
0 1 10
1 2 20
Function that OP needs:
In [274]: def rename_column_name(df, old_column_list, new_column_list):
     ...:     df = df.rename(columns=dict(zip(old_column_list, new_column_list)))
     ...:     return df
     ...:
In [275]: rename_column_name(df,old_names,new_names)
Out[275]:
a b
0 1 10
1 2 20
You need to pass a list of columns to this function. It can be multiple columns or a single column. This should do what you were looking for.
def rename_column_name(df, old_column, new_column):
    # wrap single names in a list so both cases go through the same code path
    if not isinstance(old_column,(list,tuple)):
        old_column = [old_column]
    if not isinstance(new_column,(list,tuple)):
        new_column = [new_column]
    df = df.rename({'{}'.format(old) : '{}'.format(new) for old,new in zip(old_column,new_column)}, axis=1)
    return df # dang i should have used dict(zip(...)) like in the other solution :P
I guess ... although I don't understand how this is easier than just calling
df.rename(columns={'$a':'a','$b':'b'})
You can do that with the zip function, where old_column_names and new_column_names should be lists.
def rename_column_name(df, old_column_names, new_column_names):
    # check that a new name has been passed for every old name
    if len(old_column_names) == len(new_column_names):
        # rename returns a new DataFrame; with inplace=True it would return None
        df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    return df
To handle both a single column rename and lists of names, the function needs further conditions, for example:
def rename_column_name(df, old_column_names, new_column_names):
    # lists: check that a new name has been passed for every old name
    if isinstance(old_column_names, list) and isinstance(new_column_names, list):
        if len(old_column_names) == len(new_column_names):
            df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    # single names passed as plain strings
    elif isinstance(old_column_names, str) and isinstance(new_column_names, str):
        df = df.rename(columns={old_column_names: new_column_names})
    return df
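Pulling the ideas from these answers together, here is one possible sketch of a function that accepts either a single name or a list of names. Raising a ValueError on a length mismatch is a design choice made here, not something the question asked for:

```python
import pandas as pd

def rename_column_name(df, old_columns, new_columns):
    """Return a copy of df with old_columns renamed to new_columns."""
    # Accept a single name as well as a list/tuple of names
    if isinstance(old_columns, str):
        old_columns = [old_columns]
    if isinstance(new_columns, str):
        new_columns = [new_columns]
    # Fail loudly instead of silently renaming only some columns
    if len(old_columns) != len(new_columns):
        raise ValueError('old_columns and new_columns must have the same length')
    return df.rename(columns=dict(zip(old_columns, new_columns)))

df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
one = rename_column_name(df, '$a', 'a')
both = rename_column_name(df, ['$a', '$b'], ['a', 'b'])
print(one.columns.tolist(), both.columns.tolist())
```

The original df is left untouched, since rename returns a new DataFrame when inplace is not set.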
I've read up on a number of threads (here and here) and the docs (here and here). However, I can't get this to work. I get an error of
AxisError: axis 0 is out of bounds for array of dimension 0
Thanks.
import pandas as pd
from scipy.stats import levene
data = {'A': [1,2,3,4,5,6,7,8],
'B': [9,10,11,12,13,14,15,16],
'C': [1,2,3,4,5,6,7,8]}
df3 = pd.DataFrame(data, columns=['A', 'B','C'])
print(levene(df3['A'], df3['C'])) # this works as intended
cols_of_interest = ['A','C'] # my requirement could make this any combination
# function to pass through arguments into Levene test
def func(df,cols_of_interest):
    cols = [col for col in 'df.'+df[cols_of_interest].columns] # my strategy to mimic the arguments
    lev = levene(*cols)
    print(lev)
func(df3,cols_of_interest)
Replace your list comprehension inside def with:
cols = [df[x] for x in cols_of_interest]
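The original comprehension builds strings such as 'df.A' and 'df.C', so levene receives text rather than the column data, which is where the AxisError comes from. For completeness, the whole function with the replacement applied might look like the sketch below; returning the result instead of printing it inside the function is a small change made here:

```python
import pandas as pd
from scipy.stats import levene

data = {'A': [1, 2, 3, 4, 5, 6, 7, 8],
        'B': [9, 10, 11, 12, 13, 14, 15, 16],
        'C': [1, 2, 3, 4, 5, 6, 7, 8]}
df3 = pd.DataFrame(data, columns=['A', 'B', 'C'])

def func(df, cols_of_interest):
    # One Series per requested column, unpacked as separate samples
    cols = [df[x] for x in cols_of_interest]
    return levene(*cols)

result = func(df3, ['A', 'C'])
print(result)
```

Any combination of column names can be passed, as long as there are at least two.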
Here's some data from another question:
positive negative neutral
1 [marvel, moral, bold, destiny] [] [view, should]
2 [beautiful] [complicated, need] []
3 [celebrate] [crippling, addiction] [big]
What I would do first is to add quotes across all words, and then:
import ast
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df = df.applymap(ast.literal_eval)
Is there a smarter way to do this?
Lists of strings
For basic structures you can use yaml without having to add quotes:
import yaml
df = pd.read_clipboard(sep=r'\s{2,}', engine='python').applymap(yaml.safe_load)
type(df.iloc[0, 0])
Out: list
Lists of numeric data
Under certain conditions, you can read your lists as strings and then convert them using literal_eval (or pd.eval, if they are simple lists).
For example,
A B
0 [1, 2, 3] 11
1 [4, 5, 6] 12
First, ensure there are at least two spaces between the columns, then copy your data and run the following:
import ast
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df['A'] = df['A'].map(ast.literal_eval)
df
A B
0 [1, 2, 3] 11
1 [4, 5, 6] 12
df.dtypes
A object
B int64
dtype: object
Notes
for multiple columns, use applymap in the conversion step:
df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(ast.literal_eval)
if your columns can contain NaNs, define a function that can handle them appropriately:
parser = lambda x: x if pd.isna(x) else ast.literal_eval(x)
df[['A', 'B', ...]] = df[['A', 'B', ...]].applymap(parser)
if your columns contain lists of strings, you will need something like yaml.load (requires installation) to parse them instead if you don't want to manually add quotes to the data. See above.
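The NaN-aware conversion from the notes can be tried without the clipboard. The sketch below builds the string column by hand, which stands in for what read_clipboard would return; the sample values and the name `parse_cell` are assumptions made for illustration:

```python
import ast

import numpy as np
import pandas as pd

# Column A holds list literals as text, with one missing value
df = pd.DataFrame({'A': ['[1, 2, 3]', np.nan, '[4, 5, 6]'],
                   'B': [11, 12, 13]})

def parse_cell(x):
    # Leave missing values alone, parse everything else as a Python literal
    return x if pd.isna(x) else ast.literal_eval(x)

df['A'] = df['A'].map(parse_cell)
print(df)
print(type(df.loc[0, 'A']))
```

ast.literal_eval only evaluates literals, so it is a safer choice than eval for data pasted from elsewhere.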
I did it this way:
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
df = df.apply(lambda x: x.str.replace(r'[\[\]]', '', regex=True).str.split(r',\s*', expand=False))
PS: I'm sure there must be a better way to do that...
Another alternative is
In [43]: df.applymap(lambda x: x[1:-1].split(', '))
Out[43]:
positive negative neutral
1 [marvel, moral, bold, destiny] [] [view, should]
2 [beautiful] [complicated, need] []
3 [celebrate] [crippling, addiction] [big]
Note that this assumes the first and last character in each cell is [ and ].
It also assumes there is exactly one space after the commas.
Another version:
import ast
import re
df.applymap(lambda x:
    ast.literal_eval("[" + re.sub(r"[\[\]]", "'",
                     re.sub(r"[,\s]+", "','", x)) + "]"))
Per help from @MaxU
df = pd.read_clipboard(sep=r'\s{2,}', engine='python')
Then:
>>> df.apply(lambda col: col.str[1:-1].str.split(', '))
positive negative neutral
1 [marvel, moral, bold, destiny] [] [view, should]
2 [beautiful] [complicated, need] []
3 [celebrate] [crippling, addiction] [big]
>>> df.apply(lambda col: col.str[1:-1].str.split()).loc[3, 'negative']
['crippling', 'addiction']
And per the notes from @unutbu, who came up with a similar solution:
assumes the first and last character in each cell is [ and ]. It also assumes there is exactly one space after the commas.
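If those two assumptions are too strict, a small helper can relax them. The sketch below tolerates any amount of whitespace around the commas and returns a truly empty list for `[]`; the sample frame and the name `split_cell` are made up for illustration, with the cells typed in as strings the way read_clipboard would deliver them:

```python
import pandas as pd

df = pd.DataFrame({
    'positive': ['[marvel, moral, bold, destiny]', '[beautiful]'],
    'negative': ['[]', '[complicated,need]'],
})

def split_cell(cell):
    # Drop the surrounding brackets, then split on commas
    inner = cell.strip()[1:-1]
    # Strip each word and skip empty pieces, so '[]' becomes []
    return [word.strip() for word in inner.split(',') if word.strip()]

parsed = df.apply(lambda col: col.map(split_cell))
print(parsed)
```

It still assumes that every cell starts with [ and ends with ], and that the words themselves contain no commas.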
I am trying to extract data from a CSV file using Python's pandas module. The experiment data has 6 columns (let's say a, b, c, d, e, f) and I have a list of model directories. Not every model has all 6 'species' (columns), so I need to split the data specifically for each model. Here is my code:
def read_experimental_data(self,experiment_path):
    [path,fle]=os.path.split(experiment_path)
    os.chdir(path)
    data_df=pandas.read_csv(experiment_path)
    # print(data_df)
    experiment_species=data_df.keys() #(a,b,c,d,e,f)
    # print(experiment_species)
    for i in self.all_models_dirs: #iterate through a list of model directories.
        [path,fle]=os.path.split(i)
        model_specific_data=pandas.DataFrame()
        species_dct=self.get_model_species(i+'.xml') #gives all the species (columns) in this particular model
        # print(species_dct)
        #gives me only species that are included in model dir i
        for l in species_dct.keys():
            for m in experiment_species:
                if l == m:
                    #how do I collate these pandas Series into a single DataFrame?
                    print(data_df[m])
The above code gives me the correct data, but I'm having trouble collecting it in a usable format. I've tried to merge and concatenate them, but no joy. Does anybody know how to do this?
Thanks
You can create a new DataFrame from data_df by passing it a list of columns you want,
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
df_filtered = df[['a', 'c']]
or an example using some of your variable names,
import pandas as pd
data_df = pd.DataFrame({'a': [1,2], 'b': [3,4], 'c': [5,6],
'd': [7,8], 'e': [9,10], 'f': [11,12]})
experiment_species = data_df.keys()
species_dct = ['b', 'd', 'e', 'x', 'y', 'z']
good_columns = list(set(experiment_species).intersection(species_dct))
df_filtered = data_df[good_columns]
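One caveat: a set has no defined order, so the columns selected through set(...).intersection(...) may come out in any order. If the original column order matters, a list comprehension over data_df.columns keeps it. The sketch below also collects one filtered DataFrame per model in a dict; the `model_species` mapping is a hypothetical stand-in for the results of get_model_species():

```python
import pandas as pd

data_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6],
                        'd': [7, 8], 'e': [9, 10], 'f': [11, 12]})

# Hypothetical species per model, standing in for get_model_species()
model_species = {'model_1': ['e', 'b', 'x'], 'model_2': ['a', 'f']}

model_data = {}
for name, species in model_species.items():
    # Keep only the experiment columns this model knows, in file order
    cols = [c for c in data_df.columns if c in species]
    model_data[name] = data_df[cols]

print(model_data['model_1'])
```

Each value in model_data is a regular DataFrame, so no merging or concatenating of individual Series is needed.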