I have successfully finished data manipulation using pandas (in Python). Depending on my starting dataset I end up with a series of DataFrames, say for example sampleA, sampleB, and sampleC.
I want to automate saving these datasets (there can be a lot of them), with a unique identifier in each file name.
So I create a collection of DataFrames and use a loop to save the data, but I cannot make the loop give a unique name each time. See for example:
import numpy as np
import pandas as pd

sampleA = pd.DataFrame(np.random.randn(10, 4))
sampleB = pd.DataFrame(np.random.randn(10, 4))
sampleC = pd.DataFrame(np.random.randn(10, 4))

allsamples = (sampleA, sampleB, sampleC)
for x in allsamples:
    # name = allsamples[x]
    # x.to_csv(name + '.dat', sep=',', header=False, index=False)
    x.to_csv(x + '.dat', sep=',', header=False, index=False)
When I use the above (without the commented lines), all data are saved as x.dat and I keep only the latest dataset; if I uncomment the name line, I get errors.
Any idea how I can come up with a naming approach so I can save three files named sampleA.dat, sampleB.dat, and sampleC.dat?
If you use strings, then you can look up the variable of the same name using vars():
allsamples = ('sampleA', 'sampleB', 'sampleC')
for name in allsamples:
    df = vars()[name]
    df.to_csv(name + '.dat', sep=',', header=False, index=False)
Without an argument, vars() is equivalent to locals(). It returns a "read-only" dict mapping local variable names to their associated values. (The dict is "read-only" in the sense that it is mainly useful for looking up the value of local variables. Like any dict, it is modifiable, but modifying the dict will not modify the variable.)
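To illustrate that caveat with a minimal sketch (this reflects CPython behaviour; other implementations may differ):

```python
# Inside a function, writing to the dict returned by vars()
# does not rebind the local variable itself.
def demo():
    x = 1
    vars()['x'] = 99  # modifies the dict snapshot, not the local
    return x

print(demo())  # still 1
```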
Be aware that Python tuple items have no names. Moreover, allsamples[x] is meaningless: you are indexing a tuple with a DataFrame, so what do you expect to get?
You can use a dictionary instead of a tuple to name and store the variables simultaneously:
all_samples = {'sampleA': sampleA, 'sampleB': sampleB, 'sampleC': sampleC}
for name, df in all_samples.items():
    df.to_csv('{}.dat'.format(name), sep=',', header=False, index=False)
I am using lasio (https://lasio.readthedocs.io/en/latest/index.html) to pull data out of a .LAS file, an oil-and-gas drilling file format with data in the heading and in the body (called the curve). TL;DR on the lasio docs: it reads the data as a pandas DataFrame, hence my use of a dictionary to assign the data.
This is the output of a LAS file opened in Notepad (screenshot omitted).
In the end, I need a file that has the UWI (unique well number), the depth, and its porosity reading.
The UWI is one value, but there are multiple values for depth and porosity, so I need the UWI repeated. To complicate matters, not all of my files have the porosity data, so I have had to screen for that too.
My code was going OK until I exported it and saw that the cells in the CSV are nested. The code reads the values into a dictionary, and I need the UWI duplicated for each depth value.
data = []
df_global = pd.DataFrame(data)
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        DEPT = df["DEPT"]
        DPHI2 = df[match]
        DPHI = DPHI2.iloc[:, 0]
        UWI = las.well.UWI.value
        df_global = df_global.append({'UWI': UWI, 'DEPTH': DEPT, 'DPHI': DPHI}, ignore_index=True)
df_global.to_csv('las_output.csv', index=False)
This is my output (screenshot omitted); note the nested rows.
I have tried
df.loc[:, "UWI"] = np.array(las.well.UWI.value * len(df.DEPT))
but the UWI value is just repeated inside a single cell and not put into separate rows.
Problem
You are appending dictionaries to an already-existing DataFrame. Each dictionary contains a variety of types (an integer under the key UWI, and pandas Series under other keys). This is a very general operation, and pandas reacts by converting the Series contained within the dictionary to strings, which is what you are seeing in columns B and C in Excel.
This is also probably not the operation you want to do, which appears to be appending DataFrames (i.e. one per file) to an existing DataFrame (df_global). Pandas does not make this easy for existing DataFrames, for good reason.
Solution
This is much simpler if you create a Python list (data) containing DataFrames, then use pandas' concat function to create a single DataFrame as the last step. See below for an example. I have not tested the code, because you didn't include a minimal reproducible example, but hopefully it helps.
data = []
alias = ["DPHI", "DPHI_LS", "DPH8", "DPHZ", "DPHZ_LS", "DPOR_LS", "DPOR", "PORD", "DPHI_SCANNED", "SPHI"]
for filename in all_files:
    las = lasio.read(filename)
    df = las.df().reset_index()
    mnemonic = las.keys()
    match = set(alias).intersection(mnemonic)
    if len(match) != 0:
        columns_to_keep = [las.curves[0].mnemonic] + list(match)
        # Assign the single UWI value to a new column called "UWI"
        df['UWI'] = las.well.UWI.value
        columns_to_keep.append('UWI')
        data.append(df[columns_to_keep])

# join='outer' means that it will keep all of the different curves found from `alias`
df_final = pd.concat(data, join='outer')
df_final.to_csv('las_output.csv', index=False)
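In case it helps, here is a minimal, self-contained illustration of the same pattern (the UWI and depth values are made up): assigning a scalar to a column broadcasts it to every row, and concat stitches the per-file frames together, which is exactly what gives you the UWI repeated per depth.

```python
import pandas as pd

# Toy stand-in for the per-file loop: each "file" contributes one DataFrame.
data = []
for uwi, depths in [('100-A', [10.0, 20.0]), ('100-B', [5.0])]:
    df = pd.DataFrame({'DEPTH': depths})
    df['UWI'] = uwi  # scalar assignment repeats the value on every row
    data.append(df)

df_final = pd.concat(data, ignore_index=True)
print(df_final)
```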
Python 3.9.5/Pandas 1.1.3
I use the following code to create a nested dictionary object from a csv file with headers:
import pandas as pd
import json
import os
csv = "/Users/me/file.csv"
csv_file = pd.read_csv(csv, sep=",", header=0, index_col=False)
csv_file['org'] = csv_file[['location', 'type']].apply(lambda s: s.to_dict(), axis=1)
This creates a nested object called org from the data in the columns called location and type.
Now let's say the type column doesn't even exist in the csv file, and I want to pass a literal string as the type value instead of the values from a column in the csv file. So for example, I want to create a nested object called org using the values from the location column as before, but I want to just use the string foo for all values of a key called type. How can I accomplish this?
You could just build it by hand:
csv_file['org'] = csv_file['location'].apply(lambda x: {'location': x, 'type': 'foo'})
Use ChainMap. This allows you to use multiple columns (columns_to_use), and even override existing ones (if type is among these columns, it will be overridden):
from collections import ChainMap
# .. some code
csv_file['org'] = csv_file[columns_to_use].apply(
    lambda s: ChainMap({'type': 'foo'}, s.to_dict()), axis=1)
BTW, without adding constant values it could be done with df.to_dict('records'):
csv_file['org'] = csv_file[['location', 'type']].to_dict('records')
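A runnable sketch of both approaches with made-up data. One note: here the ChainMap is wrapped in dict() so the stored value is a plain dict; the answer above stores ChainMap objects directly, which also behave as mappings.

```python
from collections import ChainMap
import pandas as pd

df = pd.DataFrame({'location': ['NY', 'LA']})

# Merge a constant 'type' key into each row's dict.
df['org'] = df[['location']].apply(
    lambda s: dict(ChainMap({'type': 'foo'}, s.to_dict())), axis=1)

# Without the constant value, to_dict('records') gives plain row dicts.
records = df[['location']].to_dict('records')
```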
Hey guys, I have data that looks like this: train.dat. I am trying to create a variable that will contain the ith value of the column containing -1 or 1, and another variable to hold the value of the column that contains strings.
So far I have tried this:
df=pd.read_csv("train.dat",delimiter="\t", sep=',')
# print(df.head())
# separate names from classes
vals = df.ix[:,:].values
names = [n[0][3:] for n in vals]
cls = [n[0][0:] for n in vals]
print(cls)
However, the output looks all jumbled up; any help would be appreciated. I am a beginner in Python.
If the character after the numerical value is a tab, you're fine and all you would need is
import io  # using io.StringIO for demonstration
import pandas as pd

ratings = ("-1\tThis movie really sucks.\n"
           "-1\tRun colored water through a reflux condenser and call it a science movie?\n"
           "+1\tJust another zombie flick? You'll be surprised!")
df = pd.read_csv(io.StringIO(ratings), sep='\t',
                 header=None, names=['change', 'rating'])
Passing header=None makes sure that the first line is interpreted as data.
Passing names=['change', 'rating'] provides some (reasonable) column headers.
Of course, the character is not a tab :D.
import io  # using io.StringIO for demonstration
import pandas as pd

ratings = ("-1 This movie really sucks.\n"
           "-1 Run colored water through a reflux condenser and call it a science movie?\n"
           "+1 Just another zombie flick? You'll be surprised!")
df = pd.read_csv(io.StringIO(ratings), sep='\t',
                 header=None, names=['stuff'])
df['change'], df['rating'] = df.stuff.str[:3], df.stuff.str[3:]
df = df.drop('stuff', axis=1)
One viable option is to read in the whole rating as one temporary column, split the string, distribute it into two columns, and finally drop the temporary column.
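As an alternative to slicing by character position, pandas' str.split with n=1 splits each line on the first space only (a sketch using two of the example lines):

```python
import io
import pandas as pd

ratings = ("-1 This movie really sucks.\n"
           "+1 Just another zombie flick? You'll be surprised!")
df = pd.read_csv(io.StringIO(ratings), sep='\t',
                 header=None, names=['stuff'])
# n=1 limits the split to the first space; expand=True yields two columns.
df[['change', 'rating']] = df['stuff'].str.split(' ', n=1, expand=True)
df = df.drop('stuff', axis=1)
```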
I'm using pandas to handle some CSV files, but I'm having trouble storing the result in a variable and printing it out as it is.
This is the code that I have.
df = pd.read_csv("MY_FILE.csv", index_col=False, header=0)
df2 = df[(df['Name'])]
# Trying to get the result of Name to the variable
n = df2['Name']
print(n)
And the result that i get:
1 jake
Name: Name, dtype: object
My Question:
Is it possible to just have "Jake" stored in a variable n so that I can call it whenever I need it?
E.g. print(n)
Result: Jake
This is the code that I have constructed
def name_search():
    list_to_open = input("Which list to open: ") + ".csv"
    directory = r"C:\Users\Jake Wong\PycharmProjects\box" + "\\" + list_to_open
    if os.path.isfile(directory):
        # Search for NAME
        Name_id = input("Name to search for: ")
        df = pd.read_csv(directory, index_col=False, header=0)
        df2 = df[(df['Name'] == Name_id)]
        # Defining the name to save the file as
        n = df2['Name'].ix[1]
        print(n)
This is what is in the csv file
S/N,Name,Points,test1,test2,test3
s49,sing chun,5000,sc,90 sunrsie,4984365132
s49,Alice Suh,5000,jake,88 sunrsie,15641816
s1231,Alice Suhfds,5000,sw,54290 sunrsie,1561986153
s49,Jake Wong,5000,jake,88 sunrsie,15641816
The problem is that n = df2['Name'] is actually a pandas Series:
type(df.loc[df.Name == 'Jake Wong'].Name)
# pandas.core.series.Series
If you just want the value, you can use values[0]: values is the underlying array behind the pandas object; in this case it has length 1, and you're just taking its first element.
n = df2['Name'].values[0]
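A runnable sketch of that extraction, using a couple of rows modeled on the CSV above (.iloc[0] works equally well):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['sing chun', 'Jake Wong'], 'Points': [5000, 5000]})
df2 = df[df['Name'] == 'Jake Wong']  # one-row result

n = df2['Name'].values[0]  # bare string, not a Series
print(n)
```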
Also, your CSV is not formatted properly: it's not enough to have things lined up in columns like that; you need a consistent delimiter (a comma or a tab, usually) between columns, so the parser knows where one column ends and the next starts. Can you fix your CSV to look like this?
S/n,Name,points
s56,Alice Suh,5000
s49,Jake Wong,5000
Otherwise we can work on another solution for you but we will probably use regex rather than pandas.
Is it possible to load matlab tables in python using scipy.io.loadmat?
What I'm doing:
In Matlab:
tab = table((1:500)')
save('tab.mat', 'tab')
In Python:
import scipy.io
mat = scipy.io.loadmat('m:/tab.mat')
But I cannot access the table tab in Python using mat['tab'].
The answer to your question is no. Many MATLAB objects can be loaded in Python; tables, among others, cannot. See Handle Data Returned from MATLAB to Python.
The loadmat function doesn't load MATLAB tables. Instead, a small workaround can be used: the tables can be saved as .csv files, which can then be read using pandas.
In MATLAB
writetable(table_name, file_name)
In Python
df = pd.read_csv(file_name)
At the end, the DataFrame df will have the contents of table_name
I've looked into this for a project I'm working on, and as a workaround, you could try the following.
In MATLAB, first convert the table object into a struct, and retrieve the column names using:
table_struct = struct(table_object);
table_columns = table_struct.varDim.labels;
save table_as_struct table_struct table_columns;
And then you can try the following code in python:
import numpy
import pandas as pd
import scipy.io

# function to load table variable from MAT-file
def loadtablefrommat(matfilename, tablevarname, columnnamesvarname):
    """
    Read a struct-ified table variable (and column names) from a MAT-file
    and return a pandas.DataFrame object.
    """
    # load file
    mat = scipy.io.loadmat(matfilename)

    # get table (struct) variable
    tvar = mat.get(tablevarname)
    data_desc = mat.get(columnnamesvarname)
    types = tvar.dtype
    fieldnames = types.names

    # extract data (from table struct)
    data = None
    for idx in range(len(fieldnames)):
        if fieldnames[idx] == 'data':
            data = tvar[0][0][idx]
            break

    # get number of columns and rows
    numcols = data.shape[1]
    numrows = data[0, 0].shape[0]

    # and get column headers as a list (array)
    data_cols = []
    for idx in range(numcols):
        data_cols.append(data_desc[0, idx][0])

    # create dict out of original table
    table_dict = {}
    for colidx in range(numcols):
        rowvals = []
        for rowidx in range(numrows):
            rowval = data[0, colidx][rowidx][0]
            if type(rowval) == numpy.ndarray and rowval.size > 0:
                rowvals.append(rowval[0])
            else:
                rowvals.append(rowval)
        table_dict[data_cols[colidx]] = rowvals

    return pd.DataFrame(table_dict)
Based on Jochen's answer, I propose a different variant that does a good job for me.
I wrote a MATLAB script to prepare the MAT-file automatically (see my GitLab repository with examples).
It does the following:
In Matlab for class table:
Does the same as Jochen's example, but binds the data together, so it is easier to load multiple variables. The names "table" and "columns" are mandatory for the next part.
YourVariableName = struct('table', struct(TableYouWantToLoad), 'columns', {struct(TableYouWantToLoad).varDim.labels})
save('YourFileName', 'YourVariableName')
In Matlab for class dataset:
Alternative, if you have to handle the old dataset type.
YourVariableName = struct('table', struct(DatasetYouWantToLoad), 'columns', {get(DatasetYouWantToLoad,'VarNames')})
save('YourFileName', 'YourVariableName')
In Python:
import scipy.io as sio
mdata = sio.loadmat('YourFileName')
mtable = load_table_from_struct(mdata['YourVariableName'])
with
import pandas as pd
def load_table_from_struct(table_structure) -> pd.DataFrame:
    # get prepared data structure
    data = table_structure[0, 0]['table']['data']
    # get prepared column names
    data_cols = [name[0] for name in table_structure[0, 0]['columns'][0]]

    # create dict out of original table
    table_dict = {}
    for colidx in range(len(data_cols)):
        table_dict[data_cols[colidx]] = [val[0] for val in data[0, 0][0, colidx]]

    return pd.DataFrame(table_dict)
It is independent of how the file is loaded, and is basically a minimized version of Jochen's code, so please give him kudos for his post.
As others have mentioned, this is currently not possible, because Matlab has not documented this file format. People are trying to reverse engineer the file format but this is a work in progress.
A workaround is to write the table to CSV format and to load that using Python. The entries in the table can be variable length arrays and these will be split across numbered columns. I have written a short function to load both scalars and arrays from this CSV file.
To write the table to CSV in matlab:
writetable(table_name, filename)
To read the CSV file in Python:
import pandas

def load_matlab_csv(filename):
    """Read CSV written by matlab writetable into DataFrames

    Each entry in the table can be a scalar or a variable length array.
    If it is a variable length array, then Matlab generates a set of
    columns, long enough to hold the longest array. These columns have
    the variable name with an index appended.

    This function infers which entries are scalars and which are arrays.
    Arrays are grouped together and sorted by their index.

    Returns: scalar_df, array_df
        scalar_df : DataFrame of scalar values from the table
        array_df : DataFrame with MultiIndex on columns
            The first level is the array name
            The second level is the index within that array
    """
    # Read the CSV file
    tdf = pandas.read_table(filename, sep=',')
    cols = list(tdf.columns)

    # Figure out which columns correspond to scalars and which to arrays
    scalar_cols = []  # scalar column names
    arr_cols = []  # array column names, without index
    arrname2idxs = {}  # dict of array column name to list of integer indices
    arrname2colnames = {}  # dict of array column name to list of full names

    # Iterate over columns
    for col in cols:
        # If the name ends in digits and contains "_", it's probably
        # from an array
        if col[-1] in '0123456789' and '_' in col:
            # Array col: infer the array name and index
            colsplit = col.split('_')
            arr_idx = int(colsplit[-1])
            arr_name = '_'.join(colsplit[:-1])

            # Store
            if arr_name in arrname2idxs:
                arrname2idxs[arr_name].append(arr_idx)
                arrname2colnames[arr_name].append(col)
            else:
                arrname2idxs[arr_name] = [arr_idx]
                arrname2colnames[arr_name] = [col]
                arr_cols.append(arr_name)
        else:
            # Scalar col
            scalar_cols.append(col)

    # Extract all scalar columns
    scalar_df = tdf[scalar_cols]

    # Extract each set of array columns into its own dataframe
    array_df_d = {}
    for arrname in arr_cols:
        adf = tdf[arrname2colnames[arrname]].copy()
        adf.columns = arrname2idxs[arrname]
        array_df_d[arrname] = adf

    # Concatenate array dataframes
    array_df = pandas.concat(array_df_d, axis=1)

    return scalar_df, array_df

scalar_df, array_df = load_matlab_csv(filename)
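The column-classification heuristic at the heart of the function can be exercised on a small in-memory CSV that mimics writetable output (the column names and values here are made up): the array column arr has been spread across arr_1 and arr_2, while scalar stays a single column.

```python
import io
import pandas as pd

csv_text = "scalar,arr_1,arr_2\n1,10,11\n2,20,21\n"
tdf = pd.read_csv(io.StringIO(csv_text))

# Same test the function applies: ends in a digit and contains '_'.
scalar_cols = [c for c in tdf.columns if not (c[-1] in '0123456789' and '_' in c)]
array_cols = [c for c in tdf.columns if c[-1] in '0123456789' and '_' in c]
```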