Dynamic - Automated multiplication - Pandas dataframes - python

After spending quite a while searching and reading on Stack Overflow and around the web, I am desperate...
I have a Pandas DataFrame with some imported data (spectra). The first column is the wavelength while the others are the various spectra (the data). The names of the columns are imported from a list that reads the filenames from a path and keeps just the names.
What I would like to achieve, and can't quite seem to work out, is to multiply each of the columns by the wavelength column and either overwrite the existing ones or create a new DataFrame (it doesn't matter much which).
This is the code I have so far; even if it's not the most elegant, it gets the job done:
import glob
import os
import pandas as pd

path = r'thePathToData\PL_calc\Data_NIR'
idx = 0
# Create the DataFrame with all the data from the path above; use the filenames as column names
all_files = glob.glob(os.path.join(path, "*.asc"))
df = pd.concat((pd.read_csv(f, usecols=[1], sep='\t') for f in all_files), axis=1)  # usecols=[1] for the spectrum only
fileNames = []  # create a list for the filenames
for i in range(len(all_files)):
    fileNames.append(all_files[i][71:-4])  # keep just the name: strip the path prefix and the .asc extension
df.columns = fileNames  # assign the filenames as column names
wavelengths = pd.read_csv(all_files[0], usecols=[0], sep='\t')  # read the wavelength column
df.insert(loc=idx, column='Wavelength', value=wavelengths)  # insert it as the first column of the dataframe
If I print just the head of the DF it looks like this:
Wavelength F8BT_Pure_Batch1_px1_spectra_4V \ ...
0 478.0708 -3.384101
1 478.3917 -1.580399
2 478.7126 -0.323580
3 479.0334 -1.131425
4 479.3542 1.202728
The complete DF is:
1599 rows × 46 columns
Question 1:
I can't quite find an automated (dynamic) way of multiplying each column by the first one; essentially this:
for i in range(1, len(df.columns)):
    df[[i]] = df[[0]] * df[[i]]
Question 2:
Why does this work:
df['F8BT_Pure_Batch1_px1_spectra_4V'] = df['Wavelength']*df['F8BT_Pure_Batch1_px1_spectra_4V']
while this doesn't and gives me an "IndexError: indices are out-of-bounds"
df[[1]] = df[[0]]*df[[1]]
But when I run print(df[['Wavelength']]) and print(df[[0]]), I get the same numbers (the outputs end with Name: Wavelength, dtype: float64 and [1599 rows x 1 columns] respectively)...
Question 3:
Why does this df[fileNames] = df[fileNames].multiply(df.Wavelength) give me a ValueError: Columns must be same length as key? All the columns have the same length (1599 rows, indexed 0-1598, 46 columns in total). fileNames contains the names of the imported files, which are the column names of the DataFrame.
Many many thanks in advance for your help...
Alex

Question 1
To multiply every other column of your DataFrame by your wavelength column, you can use:
df.iloc[:, 1:] = df.iloc[:, 1:].mul(df['Wavelength'], axis=0)
This assumes your wavelength column is the first column.
Question 2
Selecting columns like that using an integer is asking for columns of your DataFrame that are named 0, 1, etc., as ints. There are none in your DataFrame. To select columns by position, look into the documentation for pandas' iloc indexer.
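For example, the positional equivalent of the working label-based line from Question 2 would be something like this (a sketch, assuming the wavelength column sits at position 0):
df.iloc[:, 1] = df.iloc[:, 0] * df.iloc[:, 1]  # select by position instead of by label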
Question 3
When you call df[fileNames], you get a DataFrame with as many columns as there are names in your list fileNames. Your df[fileNames].multiply(df.Wavelength) is not giving you a DataFrame with the same number of columns as df[fileNames], because multiply defaults to aligning the Series index with the columns rather than the rows; hence you cannot assign the values. Using the axis=0 parameter in the multiply function works for me:
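df[fileNames] = df[fileNames].multiply(df['Wavelength'], axis=0)  # align the wavelength Series with the rows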

Related

How could I get a result for every column after comparing dataframes?

I have two CSV files; both have exactly the same number of rows and columns, containing only numerical values. I want to compare each column separately.
The idea is to compare the column 1 value of file "a" to the column 1 value of file "b", check the difference, do the same for all the numbers in the column (there are 100 rows), and write out the number of cases where the difference was more than 0. So e.g. if for column 1 there were 55 numbers that didn't match between file "a" and file "b", then I want to get back a value of 55 for column 1, and so on.
I would like to repeat the same for all the columns. I know it could be done with a double for loop, but I don't know exactly how.
Thanks in advance!
import pandas as pd
dk = pd.read_csv('C:/Users/D/1_top_a.csv', sep=',', header=None)
dk = dk.dropna(how='all')
dk = dk.dropna(how='all', axis=1)
print(dk)
dl = pd.read_csv('C:/Users/D/1_top_b.csv', sep=',', header=None)
dl = dl.dropna(how='all')
dl = dl.dropna(how='all', axis=1)
#print(dl)
rows=dk.shape[0]
print(rows)
for row in range(len(dl)):
    for col in range(len(dl.columns)):
        if dl.iloc[row, col] != dk.iloc[row, col]:
            pass  # this is where I'm stuck: I want to count these mismatches per column
I find the recordlinkage package very useful for comparing values from 2 datasets. You can define which columns to compare, and it returns 0 or 1 depending on whether they match. Next, you can filter for all matching values.
https://recordlinkage.readthedocs.io/en/latest/about.html
Code looks like this:
import recordlinkage as rl
from recordlinkage.index import Block

# create pairs of records to compare
indexer = rl.Index()
indexer.add(Block('row_identifier1', 'row_identifier2'))
datasets = indexer.index(dataset1, dataset2)

# initialise the comparison class
comparer = rl.Compare()

# initialise similarity measurement algorithms
comparer.string('string_value1', 'string_value2', method='jarowinkler', threshold=0.95, label='string_matching')
comparer.exact('value3', 'value4', label='integer_matching')

# the .compute() method returns a DataFrame with the feature vectors
results = comparer.compute(datasets, dataset1, dataset2)
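If all you need is a per-column count of mismatches, plain pandas may already be enough. A minimal sketch, assuming dk and dl are the two frames loaded above and share shape and column order:
# True wherever the two frames differ; summing booleans column-wise counts the mismatches
diff_counts = (dk != dl).sum(axis=0)
print(diff_counts)  # one mismatch count per column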

Creating a new dataframe by filtering matches from columns of two existing dataframes with error tolerance

I am pretty new to python and pandas, and I want to sort through the two existing dataframes by certain columns and create a third dataframe that contains only the value matches within a tolerance. In other words, I have df1 and df2, and I want df3 to contain the rows and columns of df2 whose values are within the tolerance of the values in df1:
Two dataframes:
df1 = pd.DataFrame([[0.221, 2.233, 7.84554, 10.222],
                    [0.222, 2.000, 7.8666, 10.000],
                    [0.220, 2.230, 7.8500, 10.005]],
                   columns=('rt', 'mz', 'mz2', 'abundance'))
[Dataframe 1]
df2 = pd.DataFrame([[0.219, 2.233, 7.84500, 10.221],
                    [0.220, 2.002, 7.8669, 10.003],
                    [0.229, 2.238, 7.8508, 10.009]],
                   columns=('rt', 'mz', 'mz2', 'abundance'))
[Dataframe 2]
Expected Output:
df3 = pd.DataFrame([[0.219, 2.233, 7.84500, 10.221],
                    [0.220, 2.002, 7.8669, 10.003]],
                   columns=('Rt', 'mz', 'mz2', 'abundance'))
[Dataframe 3]
I have tried for loops and filters, but as I am a newbie nothing is really working for me. Here is what I'm trying now:
import pandas as pd
import numpy as np
p = []
d = np.array(p)
#print(d.dtype)

def count(df2, l, r):
    l = [(df1['Rt'] - 0.001)]
    r = [(df1['Rt'] + 0.001)]
    for x in df2['Rt']:
        # condition check
        if x >= l and x <= r:
            print(x)
            d.append(x)
where p and d are the corresponding list and array (if making an array is even necessary?) that will be populated. I bet the problem lies somewhere in the fact that the function shouldn't contain the for loop.
Ideally, this could work to sort like ~13,000 rows of a dataframe using the 180 column values of another dataframe.
Thank you in advance!
Is this what you're looking for?
lo = df1.rt.min() - 0.001  # avoid naming these min/max, which would shadow the built-ins
hi = df1.rt.max() + 0.001
df3 = df2[(df2.rt >= lo) & (df2.rt <= hi)]
print(df3)
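If instead you need each df2 row to fall within ±0.001 of some individual df1 value (not just the overall min/max range), one option is a broadcast comparison. A sketch, assuming the column is named rt as in the DataFrames above:
import numpy as np

tol = 0.001
rt1 = df1['rt'].to_numpy()
rt2 = df2['rt'].to_numpy()
# compare every df2 value against every df1 value; keep rows with at least one match within tolerance
mask = (np.abs(rt2[:, None] - rt1[None, :]) <= tol).any(axis=1)
df3 = df2[mask]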

Subtract dataframes with completely different row names and column names

My dataframe 1 looks like this:
windcodes   name           yield    perp
163197.SH   shangguo comp  2.9248   NO
154563.SH   guosheng comp  2.886    Yes
789645.IB   guoyou comp    3.418    NO
My dataframe 2 looks like this:
windcodes    CALC
1202203.IB   2.5517
1202203.IB   2.48457
1202203.IB   2.62296
and I want my result dataframe 3 to have one more column than dataframe 1, obtained by subtracting the values in column 'CALC' of dataframe 2 from the values in column 'yield' of dataframe 1:
The result dataframe 3 should look like this:
windcodes   name           yield    perp   yield-CALC
163197.SH   shangguo comp  2.9248   NO     0.3731
154563.SH   guosheng comp  2.886    Yes    0.40413
789645.IB   guoyou comp    3.418    NO     0.79504
It would be really helpful if anyone can tell me how to do it in python.
Just in case you have completely different indexes, use df2's underlying numpy array:
df1['yield-CALC'] = df1['yield'] - df2['CALC'].values
You can try something like this:
df1['yield-CALC'] = df1['yield'] - df2['CALC']
I'm assuming you don't want to join the dataframes, since the windcodes are not the same.
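To see why the .values variant in the first answer matters, here is a toy sketch (hypothetical data) of how pandas aligns on the index during subtraction:
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
b = pd.Series([0.5, 0.5, 0.5], index=[5, 6, 7])

print(a - b)         # all NaN: pandas aligns on index labels, and none match
print(a - b.values)  # positional subtraction: 0.5, 1.5, 2.5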
Do we need to join the two dataframes on the windcodes column? The windcodes are all the same in the sample data you have given for dataframe 2. Can you explain this?
If we are going to join on the windcodes field, the code below will work:
df = pd.merge(left=df1, right=df2,how='inner',on='windcodes')
df['yield-CALC'] = df['yield']-df['CALC']
I will try to keep this as detailed as possible.
The environment I have used for coding is a Jupyter Notebook.
Importing our required pandas library:
import pandas as pd
Getting your first table's data in the form of a list of lists (you could also use CSV, Excel, etc. here):
data_1 = [["163197.SH", "shangguo comp", 2.9248, "NO"],
          ["154563.SH", "guosheng comp", 2.886, "Yes"],
          ["789645.IB", "guoyou comp", 3.418, "NO"]]
Creating dataframe one:
df_1 = pd.DataFrame(data_1, columns=["windcodes", "name", "yield", "perp"])
df_1
Output: the first table shown above.
Getting your second table's data in the form of a list of lists:
data_2 = [["1202203.IB", 2.5517], ["1202203.IB", 2.48457], ["1202203.IB", 2.62296]]
Creating dataframe two:
df_2 = pd.DataFrame(data_2, columns=["windcodes", "CALC"])
df_2
Output: the second table shown above.
Now creating the third dataframe:
df_3 = df_1.copy()  # the first 4 columns are the same as our first dataframe; copy so df_1 is not modified in place
df_3
Output: the same as df_1 at this point.
Now calculating the fourth column, i.e. "yield-CALC":
df_3["yield-CALC"] = df_1["yield"] - df_2["CALC"]  # element-wise subtraction; pandas aligns the two Series on their (default) integer index
df_3
Output: the result table shown in the question.

Creating a dataframe from several .txt files - each file being a row with 25 values

So, I have 7200 txt files, each with 25 lines. I would like to create a dataframe from them, with 7200 rows and 25 columns; each line of a .txt file would become the value of one column.
For that, first I have created a list column_names with length 25, and tested importing one single .txt file.
However, when I try this:
pd.read_csv('Data/fake-meta-information/1-meta.txt', delim_whitespace=True, names=column_names)
I get a 25x25 dataframe, with values only in the first column. How do I read this into the dataframe so that the txt lines end up as values in the columns, instead of everything going into the first column and creating 25 rows?
My next step would be creating a for loop to append each text file as a new row.
Probably something like this:
dir1 = *folder_path*
list = os.listdir(dir1)
number_files = len(list)
for i in range(number_files):
    title = list[i]
    df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, names=column_names)
    df = df.append(df_temp, ignore_index=True)
I hope I have been clear. Thank you all in advance!
read_csv generates a row per line in the source file, but you want those lines to be columns. You could read the rows and pivot them to columns, but since these files have a single value per line, you can just read each one with numpy and use the resulting array as a row in a dataframe.
import numpy as np
import pandas as pd
from pathlib import Path
dir1 = Path(".")
df = pd.DataFrame([np.loadtxt(filename) for filename in dir1.glob("*.txt")])
print(df)
tdelaney's answer is probably "better" than mine, but if you want to keep your code stylistically closer to what you are currently doing, the following is another option.
You are getting your current output (25x25 with data in the first column only) because the data you read is 25x1, but you force the dataframe to have 25 columns with your names=column_names parameter.
To solve it, just wait until the end to apply the column names (the steps are combined into a full sketch below):
Get a 25x1 df (drop the names param and pass header=None so the first line is not consumed as a header): df_temp = pd.read_csv(dir1 + title, delim_whitespace=True, header=None)
Collect the 25x1 dfs side by side, forming a 25x7200 df: df = pd.concat([df, df_temp], axis=1, ignore_index=True)
Transpose the df, forming the final 7200x25 df: df = df.T
Add column names: df.columns = column_names
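Putting the steps together (a sketch under the question's setup; pd.concat stands in for the now-removed DataFrame.append, and folder_path is the hypothetical directory variable from the question):
import os
import pandas as pd

dir1 = folder_path  # hypothetical, as in the question
df = pd.DataFrame()
for title in os.listdir(dir1):
    # each file reads as a 25x1 frame: one value per line, no header row
    df_temp = pd.read_csv(os.path.join(dir1, title), delim_whitespace=True, header=None)
    df = pd.concat([df, df_temp], axis=1, ignore_index=True)
df = df.T                  # transpose: each file becomes one row -> 7200x25
df.columns = column_names  # finally apply the 25 column names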

how can I select data in a multiindex dataFrame and have the result dataFrame have an appropriate index

I have a multiindex DataFrame and I'm trying to select data in it based on certain criteria; so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the multiindex, keeps exactly the same multiindex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a dataframe of the right shape to put the data in.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice

iterables = [['A', 'B', 'C'], [0, 1, 2], ['some', 'rdm', 'data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first', 'second', 'third'])
my_columns = ['col1', 'col2', 'col3']
df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)

# Ok, so let's say I want to keep only the elements in the first level of my index (["A","B","C"])
# for which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()

# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get back a DataFrame whose multiindex reflects what is actually in it? Because I find it weird to be able to select non-existing data in my df2.
I tried to put some images of the dataframes in question but I couldn't because I don't have enough «reputation»... sorry about that.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists in the level. You can reset the index and then set those columns back as the index in order to generate a MultiIndex with fresh level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
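Alternatively, recent pandas versions have a dedicated MultiIndex method for dropping unused level values:
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B', 'C'])
True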
