Simple pandas DataFrame transformation - python

Venturing into Python using some pandas. I'm trying to do some very simple transformations on a DataFrame. I have read through the docs and am not quite sussing out how this works. I want a simple subtraction on cells in a row, returning a new DataFrame with the computed column. Like so:
def getMeWhatIWant(data):
    # data is a DataFrame organized like index : col1 : col2
    return log(data.col1 - data.col2)
It feels like this can be done under the hood without iterating over the DataFrame in my function. If I need to iterate then so be it; I could iterrows through, do the computations, and append to a return DataFrame. I'm simply looking for the most efficient and elegant way to do this in python-pandas.
Thanks

This should work:
import numpy as np
np.log(df['col1'] - df['col2'])
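A minimal runnable sketch of that answer, assuming the columns are literally named 'col1' and 'col2':

```python
import numpy as np
import pandas as pd

# Toy frame organized like index : col1 : col2.
df = pd.DataFrame({'col1': [10.0, 20.0, 30.0], 'col2': [1.0, 2.0, 3.0]})

def get_me_what_i_want(data):
    # Vectorized: the column subtraction and np.log operate elementwise,
    # so no explicit iteration over rows is needed.
    return np.log(data['col1'] - data['col2'])

result = get_me_what_i_want(df)
print(result)
```

np.log returns a Series here; if a new DataFrame is wanted, it can be attached to a copy of the frame with df.assign(computed=result).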

Related

is there any better pandas function than pandas.testing.assert_frame_equal for testing two dataframes?

Is there any better pandas function than pandas.testing.assert_frame_equal? I am facing two issues while comparing. If the data set is like this:
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'m':['abc','pqr','ss','sd','sds'], 'n':['sdd','gbc','mqr','sas','ho']})
it will give an error because the column names differ. And if the dataset is like this:
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'a':['abc','pqr','sds','ss','sd'], 'b':['sdd','gbc','ho','mqr','sas']})
then I get an error due to the order of the rows.
pandas.testing.assert_frame_equal is a very robust function that checks a lot of things. If you just want to check that the data they contain are equal (without regard to column names, index, dtype, etc.), it may be easier to write a simple function to do it.
You will have to sort your values, then convert to a NumPy array to get rid of the indices and column headers. Finally, we can compare the arrays using np.array_equal().
import pandas as pd
import numpy as np
df1=pd.DataFrame({'a':['abc','pqr','ss','sd','sds'], 'b':['sdd','gbc','mqr','sas','ho']})
df2=pd.DataFrame({'x':['abc','pqr','sds','ss','sd'], 'b':['sdd','gbc','ho','mqr','sas']})
print(df1.equals(df2))
# False
def assert_equal_df(df1, df2):
    df1 = df1.sort_values(df1.columns[0]).to_numpy()
    df2 = df2.sort_values(df2.columns[0]).to_numpy()
    return np.array_equal(df1, df2)
print(assert_equal_df(df1, df2))
# True
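One caveat with the function above: it sorts only by the first column, so duplicate values there could leave the rows of the two frames in different relative orders. A sketch that sorts by every column instead:

```python
import numpy as np
import pandas as pd

def frames_equal_ignoring_order(df1, df2):
    # Sort by all columns so duplicates in the first column cannot
    # leave the two frames in different row orders, then compare the
    # raw values without index or column labels.
    a = df1.sort_values(list(df1.columns)).to_numpy()
    b = df2.sort_values(list(df2.columns)).to_numpy()
    return np.array_equal(a, b)

# 'x' appears twice in the first column; sorting by it alone is ambiguous.
df1 = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [2, 1, 3]})
df2 = pd.DataFrame({'m': ['x', 'y', 'x'], 'n': [1, 3, 2]})
print(frames_equal_ignoring_order(df1, df2))  # True
```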

how can a specific cell be accessed in a vaex data frame?

vaex is a library similar to pandas that provides a dataframe class.
I'm looking for a way to access a specific cell by row and column. For example:
import vaex
df = vaex.from_dict({'a': [1,2,3], 'b': [4,5,6]})
df.a[0] # this works in pandas but not in vaex
In this specific case you could do df.a.values[0], but if this were a virtual column, it would lead to the whole column being evaluated. What would be faster (say in the case of > 1 billion rows and a virtual column) is to do:
df['r'] = df.a + df.b
df.evaluate('r', i1=2, i2=3)[0]
This will evaluate the virtual column/expression r, from row 2 to 3 (an array of length 1), and get the first element.
This is rather clunky, and there is an issue open on this: https://github.com/vaexio/vaex/issues/238
Maybe you are surprised that vaex does not have something as 'basic' as this, but vaex is often used for really large datasets, where you don't access individual rows that often, so we don't run into this a lot.
@Maarten Breddels is the author of Vaex, so I would take his word for it. But it's possible he wrote that answer before Vaex added slicing, which in this case is much less "clunky" than described.
import vaex
df = vaex.example()
df.x[:1].values # Access row 0
df.x[1:3].values # Access rows 1 and 2

Python help needed. What's the best way to gather data from a spreadsheet into an array that can hold strings and numbers?

I would like to import data from a spreadsheet, currently using openpyxl. What's the best way to read the Excel data and insert it into a 2D array? Is it NumPy? Pandas? Lists? I am new and I am struggling with how to insert into a variable like so:
MaterialData[y, x] = data from spreadsheet where y is the row and x the column.
I am using a for loop to go through the cells, but I can't find a way to put the data into an array.
for i in range(1, rows+1):
    for j in range(1, 6):
        col = sh.cell(i, j)
        col1 = col.value
        materialsList[i,j] = col1
The last line obviously is in error, but that's what I want to do, if it makes sense! The Excel file is a list of materials where each column has a different price, and depending on what the user selects in the program, that price is shown (well, that's a very simplified version of what I want to achieve). Part of the data in this array will be displayed in a listbox using tkinter, depending on flags set by the user.
Any advice welcome!!
To work on spreadsheets, pandas is a better option than a plain NumPy array.
A pandas DataFrame is a 2D NumPy array under the hood, but it's good to have column headers, an index, etc., so that we can change those things on the fly.
Pandas handles heterogeneous data very nicely, with a number of built-in functions that make the job easy.
Ways to convert a pandas DataFrame to its NumPy-array representation:
spreadsheet_np_array = df.to_numpy()  # df.as_matrix() is deprecated in modern pandas
OR
spreadsheet_np_array = df.values
OR
spreadsheet_np_array = np.asarray(your_data_frame_here)
pandas reference
pandas is a very good library that allows developers to work with Excel files.
Try the following code in the same location as the Excel file:
import pandas as pd
file_x = 'Scores.xlsx'
scores = pd.read_excel(file_x)
scores_dict = scores.to_dict('index')  # row-oriented: {row_label: {column: value}}
rows = []
for row in scores_dict.values():
    cols = []
    for col in row.values():
        cols.append(col)
    rows.append(cols)
print(rows)
That should solve the problem.
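For the MaterialData[y, x] style of access the asker wants, a frame loaded with pd.read_excel converts directly to a 2D NumPy array. A sketch, with an in-memory frame standing in for the spreadsheet (the file name and columns here are made up for illustration):

```python
import pandas as pd

# Stand-in for: df = pd.read_excel('Materials.xlsx')
df = pd.DataFrame({'Material': ['steel', 'copper', 'brass'],
                   'Price': [2.5, 6.1, 4.3]})

# A 2D array of dtype=object holds strings and numbers side by side.
material_data = df.to_numpy()

print(material_data[1, 0])  # 'copper'  (row 1, column 0)
print(material_data[1, 1])  # 6.1      (row 1, column 1)
```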

Python Pandas group by iteration

I am iterating over a groupby column in a pandas dataframe in Python 3.6 with the help of a for loop. The problem with this is that it becomes slow if I have a lot of data. This is my code:
import pandas as pd
dataDict = {}
for metric, df_metric in frontendFrame.groupby('METRIC'):  # Creates frames for each metric
    dataDict[metric] = df_metric.to_dict('records')  # Converts dataframe to dictionary
frontendFrame is a dataframe containing two columns: VALUE and METRIC. My end goal is basically to create a dictionary with a key for each metric, containing all the data connected to it. I know this should be possible with lambda or map, but I can't get it working with multiple arguments: frontendFrame.groupby('METRIC').apply(lambda x: print(x))
How can I solve this and make my script faster?
If you do not need any calculation after the groupby, do not group the data; you can use .loc to get what you need:
s = frontendFrame.METRIC.unique()
frontendFrame.loc[frontendFrame.METRIC == s[0]]
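For comparison, the asker's loop collapses to a dict comprehension; the grouping itself is rarely the bottleneck, the per-group to_dict('records') conversion usually is. A sketch with toy data:

```python
import pandas as pd

frontendFrame = pd.DataFrame({
    'METRIC': ['cpu', 'cpu', 'mem'],
    'VALUE': [0.5, 0.7, 0.9],
})

# One key per metric, each holding that metric's rows as record dicts.
dataDict = {
    metric: group.to_dict('records')
    for metric, group in frontendFrame.groupby('METRIC')
}
print(dataDict['mem'])  # [{'METRIC': 'mem', 'VALUE': 0.9}]
```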

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified, so I will assume you can get it into a frame; if you cannot, you can ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
    df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
    results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
    result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
                                      for x in row.values))
    print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are far better at pandas than I am, but since you did not tag your question with the pandas keyword, they likely did not notice it (which allows me to take my cut at answering before they do). The pandas tag is very well covered for well-formed questions, so if this answer is not optimal, someone else will likely come by and improve it. In the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you are new to Python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm; it and other good IDEs can make working in Python even more productive, so I highly recommend them.
