I am trying to extract data from a CSV file using Python's pandas module. The experiment data has 6 columns (let's say a, b, c, d, e, f) and I have a list of model directories. Not every model has all 6 'species' (columns), so I need to split the data specifically for each model. Here is my code:
def read_experimental_data(self, experiment_path):
    path, fle = os.path.split(experiment_path)
    os.chdir(path)
    data_df = pandas.read_csv(experiment_path)
    # print(data_df)
    experiment_species = data_df.keys()  # (a, b, c, d, e, f)
    # print(experiment_species)
    for i in self.all_models_dirs:  # iterate through a list of model directories
        path, fle = os.path.split(i)
        model_specific_data = pandas.DataFrame()
        species_dct = self.get_model_species(i + '.xml')  # gives all the species (columns) in this particular model
        # print(species_dct)
        # gives me only species that are included in model dir i
        for l in species_dct.keys():
            for m in experiment_species:
                if l == m:
                    # how do I collate these pandas Series into a single DataFrame?
                    print(data_df[m])
The above code gives me the correct data, but I'm having trouble collecting it in a usable format. I've tried to merge and concatenate the Series, but no joy. Does anybody know how to do this?
Thanks
You can create a new DataFrame from data_df by passing it a list of the columns you want:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6], 'c': [7,8,9]})
df_filtered = df[['a', 'c']]
Or here is an example using some of your variable names:
import pandas as pd
data_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6],
                        'd': [7, 8], 'e': [9, 10], 'f': [11, 12]})
experiment_species = data_df.keys()
species_dct = ['b', 'd', 'e', 'x', 'y', 'z']
good_columns = list(set(experiment_species).intersection(species_dct))
df_filtered = data_df[good_columns]
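Note that set.intersection does not preserve column order. If you want the selected columns to keep the order they have in the experiment file, a list comprehension does the same job in order; a minimal sketch reusing the variables above:
good_columns = [c for c in experiment_species if c in species_dct]  # keeps data_df's column order
df_filtered = data_df[good_columns]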
I am trying to read a parquet file with pyarrow==1.0.1 as the engine.
Given:
columns = ['a','b','c']
pd.read_parquet(x, columns=columns, engine="pyarrow")
If file x does not contain column c, it fails with:
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
/opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Field named 'c' not found or not unique in the schema.
There is no argument to ignore the error and just read the missing columns as NaN.
The error handling is also pretty bad.
pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")
It is pretty hard to get the field name that was missing, so that it can be used to remove that column from the list passed in the next try.
Is there a way to do this?
You can read the metadata from your parquet file to figure out which columns are available.
Bear in mind though that pandas won't be able to guess the type of the missing column (c in the example below), which may cause issues when you concatenate tables later.
import pandas as pd
import pyarrow.parquet as pq
all_columns = ['a', 'b', 'c']
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
file_name = '/tmp/my_df.pq'
df.to_parquet(file_name)
parquet_file = pq.ParquetFile(file_name)
columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
df = (
    parquet_file
    .read(columns=columns_in_file)
    .to_pandas()
    .reindex(columns=all_columns)
)
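If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch (the function name is made up here), and again any column missing from the file comes back as NaN with no guessed dtype:
import pyarrow.parquet as pq

def read_parquet_columns(path, wanted_columns):
    # read only the columns that exist in the file, then reindex so the
    # missing ones appear as NaN
    parquet_file = pq.ParquetFile(path)
    available = [c for c in wanted_columns if c in parquet_file.schema.names]
    return (parquet_file
            .read(columns=available)
            .to_pandas()
            .reindex(columns=wanted_columns))

df = read_parquet_columns(file_name, all_columns)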
I have a pandas DataFrame of unique rows which looks something like this:
df = pd.DataFrame({'O': ['O-1', 'O-1'],
                   'B': ['B-1', 'B-2'],
                   'C': ['C-1', 'C-2'],
                   'R': ['R-1', 'R-1']},
                  columns=['O', 'B', 'C', 'R'])
Columns of df are ordered in a parent-child linear relation, wherein column O is level 1, column B is level 2, and so on. The intention is to convert this df into a tree-like structure for navigation purposes, which would look something like this:
output = pd.DataFrame({'PARENT': ['O-1', 'O-1', 'O-1', 'O-1', 'O-1', 'B-1', 'B-1', 'B-2', 'B-2', 'C-1', 'C-2'],
                       'CHILD_TYPE': ['B', 'B', 'C', 'C', 'R', 'C', 'R', 'C', 'R', 'R', 'R'],
                       'CHILD': ['B-1', 'B-2', 'C-1', 'C-2', 'R-1', 'C-1', 'R-1', 'C-2', 'R-1', 'R-1', 'R-1']},
                      columns=['PARENT', 'CHILD_TYPE', 'CHILD'])
Filtering on each value of each column in df (as the parent) and then copying all unique values of the remaining columns to the right as children seems like a bad way to achieve this.
Is there a more efficient way?
As mentioned in the question, one way to achieve this is: filtering on each value of each column in df (as the parent) and then copying all unique values of the remaining columns to the right as children.
A solution with that same logic is here:
sample = pd.DataFrame({'O': ['O-1', 'O-1'],
                       'B': ['B-1', 'B-2'],
                       'C': ['C-1', 'C-2'],
                       'R': ['R-1', 'R-1']}, columns=['O', 'B', 'C', 'R'])

ls = []
for col in sample:
    for val in sample[col]:
        # rows where this column holds the current parent value
        fs = sample[sample[col] == val]
        # values and names of every column to the right of the parent column
        fvl = fs.iloc[:, fs.columns.get_loc(col) + 1:].T.values.tolist()
        fcl = fs.iloc[:, fs.columns.get_loc(col) + 1:].columns.tolist()
        for fc, fvs in zip(fcl, fvl):
            for fv in fvs:
                ls.append([val, fc, fv])

output = pd.DataFrame(ls, columns=['PARENT', 'CHILD_TYPE', 'CHILD']).drop_duplicates()
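If the hierarchy really is strictly left-to-right as in the example, you can also avoid filtering per value and build the edges one column pair at a time. This is only a sketch of that idea (row order in the result may differ from the output shown in the question):
import itertools
import pandas as pd

sample = pd.DataFrame({'O': ['O-1', 'O-1'],
                       'B': ['B-1', 'B-2'],
                       'C': ['C-1', 'C-2'],
                       'R': ['R-1', 'R-1']}, columns=['O', 'B', 'C', 'R'])

edges = []
# every (parent, child) column pair with the parent to the left of the child
for parent_col, child_col in itertools.combinations(sample.columns, 2):
    pair = (sample[[parent_col, child_col]]
            .drop_duplicates()
            .rename(columns={parent_col: 'PARENT', child_col: 'CHILD'}))
    pair.insert(1, 'CHILD_TYPE', child_col)  # the child's column name is its level
    edges.append(pair)

output = pd.concat(edges, ignore_index=True)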
I am trying to add a suffix to the dataframes called on by a dictionary.
Here is a sample code below:
import pandas as pd
import numpy as np
from collections import OrderedDict
from itertools import chain
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
num_periods_3 = 5
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
dates3 = pd.date_range('1/1/2000 02:00:00', periods=num_periods_3, freq='10min')
# column_names = ['WS Avg','WS Max','WS Min','WS Dev','WD Avg']
# column_names = ['A','B','C','D','E']
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
column_names_3 = ['E', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = pd.DataFrame(np.random.randn(num_periods_3, len(column_names_3)), index=dates3, columns=column_names_3)
sep0 = '<~>'
suf1 = '_1'
suf2 = '_2'
suf3 = '_3'
ddict = {'df1': df1, 'df2': df2, 'df3': df3}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2', 'Suffix 3']}
Suff = {'Suffix 1': suf1, 'Suffix 2': suf2, 'Suffix 3': suf3}
## apply suffix to each data frame selected in order HERE
# Suffdict = [Suff[x] for x in Suffs['Suffixes']]
# print(Suffdict)
df4 = pd.concat([ddict[x] for x in frames_to_concat['Sheets']],
                axis=1,
                join='outer')
I want to add a suffix to each dataframe so that they can be distinguished when the dataframes are concatenated. I am having some trouble looking the suffixes up and then applying them to each dataframe. Here I have asked for df1 and df3 to be concatenated, and I would like suffix 1 to be applied to df1 and suffix 2 to df3.
The mapping only depends on order: if df2 and df3 were selected, suffix 1 would be applied to df2 and suffix 2 to df3; obviously the last suffix would not be used.
Unless you have Python 3.6, you cannot guarantee order in dictionaries. Even if you could with Python 3.6, that would mean your code would not run in any lower Python version. If you need order, you should be looking at lists instead.
You can store your dataframes as well as your suffixes in a list, and then use zip to add a suffix to each df in turn.
dfs = [df1, df2, df3]
sufs = [suf1, suf2, suf3]
df_sufs = [x.add_suffix(y) for x, y in zip(dfs, sufs)]
Based on your code/answer, you can load your dataframes and suffixes into lists, call zip, add a suffix to each one, and call pd.concat.
dfs = [ddict[x] for x in frames_to_concat['Sheets']]
sufs = [Suff[x] for x in Suffs['Suffixes']]
df4 = pd.concat([x.add_suffix(sep0 + y)
                 for x, y in zip(dfs, sufs)], axis=1, join='outer')
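With the df1/df3 selection above, the concatenated frame's columns then carry the separator and suffix, so the two sources stay distinguishable:
print(df4.columns.tolist())
# expected something like ['C<~>_1', 'B<~>_1', 'A<~>_1', 'E<~>_2', 'B<~>_2', 'C<~>_2']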
Ended up just making a simple iterator for the problem. Here is my solution:
n = 0
for df in frames_to_concat['Sheets']:
    print(ddict[df])
    ddict[df] = ddict[df].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
    n = n + 1
Anyone have a better way to do this?
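One small cleanup of the manual counter is enumerate, which pairs each selected sheet name with its position. A sketch, assuming the same ddict, frames_to_concat, Suff and Suffs definitions as in the question:
for n, name in enumerate(frames_to_concat['Sheets']):
    ddict[name] = ddict[name].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])

df4 = pd.concat([ddict[name] for name in frames_to_concat['Sheets']],
                axis=1, join='outer')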
Suppose I have two pandas DataFrames of the form:
>>> df
A B C
first 62.184209 39.414005 60.716563
second 51.508214 94.354199 16.938342
third 36.081861 39.440953 38.088336
>>> df1
A B C
first 0.828069 0.762570 0.717368
second 0.136098 0.991668 0.547499
third 0.120465 0.546807 0.346949
>>>
I generated these with:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random([3, 3]) * 100,
                  columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
df1 = pd.DataFrame(np.random.random([3, 3]),
                   columns=['A', 'B', 'C'], index=['first', 'second', 'third'])
What would be the smartest and quickest way of getting something like:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
?
I guess I could do it with a for loop, taking even rows from the first and odd rows from the second, but it does not seem very efficient to me.
Try this:
In [501]: pd.concat([df, df1.set_index(df1.index + '_s')]).sort_index()
Out[501]:
A B C
first 62.184209 39.414005 60.716563
first_s 0.828069 0.762570 0.717368
second 51.508214 94.354199 16.938342
second_s 0.136098 0.991668 0.547499
third 36.081861 39.440953 38.088336
third_s 0.120465 0.546807 0.346949
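Note that sort_index only gives the interleaved order because 'first' sorts before 'first_s' and the base labels happen to be in alphabetical order already. If your labels do not sort the way you want, an explicit interleave avoids relying on that; a sketch reusing df and df1 from above:
import numpy as np

combined = pd.concat([df, df1.set_index(df1.index + '_s')])
# rows of df come first, rows of df1 second; build the order 0, n, 1, n+1, ...
order = np.arange(len(combined)).reshape(2, -1).T.ravel()
interleaved = combined.iloc[order]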
I want to select rows from a dask dataframe based on a list of indices. How can I do that?
Example:
Let's say I have the following dask dataframe:
import pandas as pd
import dask.dataframe

dict_ = {'A': [1,2,3,4,5,6,7], 'B': [2,3,4,5,6,7,8], 'index': ['x1', 'a2', 'x3', 'c4', 'x5', 'y6', 'x7']}
pdf = pd.DataFrame(dict_)
pdf = pdf.set_index('index')
ddf = dask.dataframe.from_pandas(pdf, npartitions=2)
Furthermore, I have a list of indices, that I am interested in, e.g.
indices_i_want_to_select = ['x1','x3', 'y6']
From this, I would like to generate a dask dataframe containing only the rows specified in indices_i_want_to_select.
Edit: dask now supports loc on lists:
ddf_selected = ddf.loc[indices_i_want_to_select]
The following should still work, but is not necessary anymore:
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', 4, 5])
ddf = dd.from_pandas(pdf, npartitions = 2)
#list of indices I want to select
l = ['i1', 4, 5]
#generate new dask dataframe containing only the specified indices
ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
Using dask version '1.2.0', the code above results in an error due to the mixed index type.
In any case, there is an option to use loc.
import pandas as pd
import dask.dataframe as dd
#generate example dataframe
pdf = pd.DataFrame(dict(A = [1,2,3,4,5], B = [6,7,8,9,0]), index=['i1', 'i2', 'i3', '4', '5'])
ddf = dd.from_pandas(pdf, npartitions = 2)
# list of indices I want to select
l = ['i1', '4', '5']
# generate new dask dataframe containing only the specified indices
# ddf_selected = ddf.map_partitions(lambda x: x[x.index.isin(l)], meta = ddf.dtypes)
ddf_selected = ddf.loc[l]
ddf_selected.head()
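head() computes and shows only the first few rows; to materialise the whole selection as a regular pandas DataFrame, call compute():
result = ddf_selected.compute()
print(result)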