I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
cycles 40 38.02 35.98 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
I would like this DataFrame to look like this
cycles 40 38 36 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
The .csv files won't always have exactly the same column names; the numbers could be slightly different from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles':[1,2,3],'40':[1.1e-8,2.2e-8,3.3e-8],'38.02':[4.4e-8,5.5e-8, 6.6e-8],'35.98':[7.7e-8,8.8e-8,9.9e-8,],'P4':[8.8e-7,8.7e-7,8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign all the new column names manually, as in that post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new columns directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
cycles 40 38 36 P4
0 1 1.100000e-08 4.400000e-08 7.700000e-08 8.800000e-07
1 2 2.200000e-08 5.500000e-08 8.800000e-08 8.700000e-07
2 3 3.300000e-08 6.600000e-08 9.900000e-08 8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
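If the data comes straight from a file, the same normalization can be applied right after loading. A minimal sketch, assuming a hypothetical file name data.csv with headers like those shown above:
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file with headers 'cycles', '40', '38.02', '35.98', 'P4'
# non-alphabetic headers are parsed as floats and rounded to the nearest integer
df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))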
I have a dataframe as follows:
df
ATG#FTY#RG#NUMFB#ZQ=CT QTG#SSTY#RG#NUMFB#ZQ=ED WQTG#SSTWY#RGW#NUMFB#ZQ=XED QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED
1 2 3 4
2 4 6 2
1 0 3 7
What I am looking for is to create a duplicate of the existing data frame, but with the names reordered after splitting on '#' and '=', dropping the keyword 'ZQ', and appending 'Z' at the end. So, for example, the 1st column name
**ATG#FTY#RG#NUMFB#ZQ=CT** should transform to **ATG#FTY#RG#CT#NUMFBZ** (with a 'Z' appended at the end).
So I created the following code, which works fine. However, I am looking for a more elegant, Pythonic solution:
import pandas as pd
import re

for col in dfT.columns:
    zl = []
    fl = []
    mc = col.split('#')
    myL = mc[:-2]
    nfS = mc[-2]
    fnf = nfS + 'Z'
    fl.append(fnf)
    zn = mc[-1].split('=')
    zl = list(zn)
    zl.remove('ZQ')
    myL.extend(zl)
    myL.extend(fl)
    mst = '#'.join(myL)
    dfT.rename(columns={col: mst}, inplace=True)
In [80]: columns
Out[80]:
['ATG#FTY#RG#NUMFB#ZQ=CT',
'QTG#SSTY#RG#NUMFB#ZQ=ED',
'WQTG#SSTWY#RGW#NUMFB#ZQ=XED',
'QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED']
In [81]: def renamer(col):
    ...:     a, b, c = col.rsplit('#', 2)
    ...:     return f"{a}#{c.split('=')[1]}#{b}Z"
    ...:
In [82]: renamed = dict(zip(columns, map(renamer, columns)))
In [83]: renamed
Out[83]:
{'ATG#FTY#RG#NUMFB#ZQ=CT': 'ATG#FTY#RG#CT#NUMFBZ',
'QTG#SSTY#RG#NUMFB#ZQ=ED': 'QTG#SSTY#RG#ED#NUMFBZ',
'WQTG#SSTWY#RGW#NUMFB#ZQ=XED': 'WQTG#SSTWY#RGW#XED#NUMFBZ',
'QQTG#SSTQY#RGQ#NUMFB#ZQ=XXED': 'QQTG#SSTQY#RGQ#XXED#NUMFBZ'}
You can use renamed in your df.rename call directly.
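For example, a minimal sketch (assuming df still carries the original column names):
df = df.rename(columns=renamed)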
Alternatively, since the pattern '#NUMFB#ZQ=' is fixed, a vectorized string replacement does the same thing:
df.columns = df.columns.str.replace('#NUMFB#ZQ=', '#') + '#NUMFBZ'
# Index(['ATG#FTY#RG#CT#NUMFBZ', 'QTG#SSTY#RG#ED#NUMFBZ',
#        'WQTG#SSTWY#RGW#XED#NUMFBZ', 'QQTG#SSTQY#RGQ#XXED#NUMFBZ'],
#       dtype='object')
I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. For now it works correctly, and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them, with each one labelled by its file name (which I already have).
For example, say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
import glob
import pandas as pd

# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; this is what I am expecting as an output:
[image: expected output]
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
There are a couple of changes to make here in order to make this happen in an easy way. I'll list the changes and reasoning below:
1. Specify which columns your master DataFrame will have.
2. Instead of using some function that it seems like you were trying to define, you can simply create a new column called "file_name" that holds the filepath used to build the DataFrame, for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented where you can make edits to that particular portion if you want to use string methods to clean up the filenames.
3. At the end, don't use insert. For combining DataFrames with the same columns (a union operation, if you're familiar with SQL or set theory), you can use the append method.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min', 'file_name'])
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
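One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On newer versions, the same loop can collect the per-file frames in a list and combine them once at the end with pd.concat. A sketch under that assumption, reusing the same (user-defined) parseGFfile helper:
frames = []
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    frames.append(file_data)
# one concatenation at the end is also faster than appending inside the loop
t0_data = pd.concat(frames, ignore_index=True)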
I have a multi-hierarchical pandas dataframe shown below. How, for a given attribute, attr ('rh', 'T', 'V'), can I set certain values (say values > 0.5) to NaN over the entire set of pLevs? I have seen answers on how to set a specific column (e.g., df['rh', 50]) but have not seen how to select the entire set.
attr rh T V
pLev 50 75 100 50 75 100 50 75 100
refIdx
0 0.225026 0.013868 0.306472 0.144581 0.379578 0.760685 0.686463 0.476179 0.185635
1 0.496020 0.956295 0.471268 0.492284 0.836456 0.852873 0.088977 0.090494 0.604290
2 0.898723 0.733030 0.175646 0.841776 0.517127 0.685937 0.094648 0.857104 0.135651
3 0.136525 0.443102 0.759630 0.148536 0.426558 0.731955 0.523390 0.965385 0.094153
To facilitate assistance, I am including code to create the dataframe here:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random((4,9)))
df.columns = pd.MultiIndex.from_product([['rh','T','V'],[50,75,100]])
df.columns.names = ['attr', 'pLev']
df.index.name = 'refIdx'
The notation is mildly annoying, but you can use pd.IndexSlice:
df.loc[:, pd.IndexSlice['rh', :]] = np.nan
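If you only want to blank out the values above the threshold (rather than the whole 'rh' block), one option is to mask the slice and assign it back. A sketch using the same df:
sub = df.loc[:, pd.IndexSlice['rh', :]]          # all pLevs under 'rh'
df.loc[:, pd.IndexSlice['rh', :]] = sub.mask(sub > 0.5)  # values > 0.5 become NaN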
If your 'given attribute' is 'rh', then you can take a cross-section with the following:
df_xs = df.xs('rh', level='attr', axis=1, drop_level=False)
Then you can update the original df as follows:
df[df_xs > 0.5] = np.nan
This works because drop_level=False was given to .xs
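The two steps can also be collapsed into a single line (same df as above):
df[df.xs('rh', level='attr', axis=1, drop_level=False) > 0.5] = np.nan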
I have a giant list of values that I've downloaded, and I want to build a dataframe and insert them into it.
I thought it would be as easy as:
import pandas as pd
df = pd.DataFrame()
records = ...  # a giant list of dictionaries
df['var1'] = records[0]['key1']
df['var2'] = records[0]['key2']
and I would get a dataframe such as
var1 var2
val1 val2
However, my dataframe appears to be empty? I can print individual values from records with no problem.
Simple Example:
t = [{'v1': 100, 'v2': 50}]
df['var1'] = t[0]['v1']
df['var2'] = t[0]['v2']
I would like it to be:
var1 var2
100 50
One entry of your list of dictionaries looks like something you'd pass to the pd.Series constructor. You can turn that into a pd.DataFrame, if you want, with the Series method pd.Series.to_frame. I transpose at the end because I assume you wanted the dictionary to represent one row.
pd.Series(t[0]).to_frame().T
v1 v2
0 100 50
Pandas does exactly that for you!
>>> import pandas as pd
>>> t = [{'v1': 100, 'v2': 50}]
>>> df=pd.DataFrame(t)
>>> df
v1 v2
0 100 50
EDIT
>>> import pandas as pd
>>> t = [{'v1': 100, 'v2': 50}]
>>> df=pd.DataFrame([t[0]['v1']], index=None, columns=['var1'])
>>> df
var1
0 100
After running some commands I have a pandas dataframe, e.g.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, e.g.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
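If the goal is the literal command string rather than another copy of the frame, the same three pieces can be formatted into one string. A minimal sketch, assuming the df defined above:
cmd = "DataFrame({0}, columns={1}, index={2})".format(
    df.values.tolist(), list(df.columns), list(df.index))
print(cmd)
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])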
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# Attach the function as a method; the original used the Python 2-only form
# MethodType(_gencmd, None, DataFrame), but a plain assignment works in both versions.
DataFrame.command = _gencmd
I have only tested it on a few cases so far and would love a more general solution.