I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1,dataframe2,emptylist):
for i1 in dataframe1['POS']:
for i2 in dataframe2['POS']:
if i1 == i2:
emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, eg.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a,b,c in zip( dataframes1, dataframes2, emptylists ):
parser(a,b,c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact, and easier to understand. For that matter, do you have a need to input the empty list as an argument (eg, perhaps they aren't always empty)? And your code for the parser, while correct, doesn't take advantage of pandas at all, and will be very slow: to compare columns, you can simply do dataframe1['COL'] == dataframe2['COL'], which will give you a boolean series of where values are equal. Then you can use this for indexing a dataframe, to get the shared values. It comes out as a dataframe or series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser( df1, df2 ):
return list( df1['COL'][ df1['COL']==df2['COL'] ] )
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
This sounds like a use case of .query()
A use case for query() is when you have a collection of DataFrame
objects that have a subset of column names (or index levels/names) in
common. You can pass the same query to both frames without having to
specify which frame you’re interested in querying
map(lambda frame: frame.query(expr), [df, df2])
What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two, the following line would accomplish what your parser function does:
common = df1[df1["fieldname"] == df2["fieldname"]]["fieldname"]
except that common would be a DataFrame object itself, rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations
def common_lists(field, *dfs):
return [df1[df1[field] == df2[field]][field] for df1, df2 in combinations(dfs, 2)]
The same deal about getting a list from a DataFrame applies here, since you'll be getting a list of DataFrames.
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
parser(dataframe1_names[a],dataframe2_names[b],emptylist_names[c])
What you're doing is creating a list that looks something like this:
[(0,0,0), (1,1,1), ... (n,n,n)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your arrays of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when you try to call dataframe1["POS"] where dataframe1 is the string "name1".
Related
I want to create a list of functions, each defined on the set {0,1,2}, by setting their values to those found in a list of values. To check what I did, I then print the values of my functions. Here are two attempts:
values_list = [[1,2,3],[4,5,6]]
function_list = []
function_list.append((lambda x: values_list[0][x]))
function_list.append((lambda x: values_list[1][x]))
print('Values of 1st function:',function_list[0](0),function_list[0](1),function_list[0](2))
print('Values of 1st function:',function_list[1](0),function_list[1](1),function_list[1](2))
function_list1 = []
for i in range(2):
function_list1.append((lambda x: values_list[i][x]))
print('Values of 1st function:',function_list1[0](0),function_list1[0](1),function_list1[0](2))
print('Values of 2nd function:',function_list1[1](0),function_list1[1](1),function_list1[1](2))
The first attempt works fine, of course, but in the second attempt, both functions in the list are identical. They both give back the second set of values.
Ok, so I guess the reason is, that the iterator that I used in the definition stopped at the second set of values. But in practice I will have huge lists of function values, of varying length, and I can't type a line for each function in my list, as in my first attempt. What could I do to solve this?
fix this line to:
function_list1.append((lambda x,y=i: values_list[y][x]))
I have a pandas data frame called positive_samples that has a column called Gene Class, which is basically a pair of genes stored as a list. It looks like below
The entire data frame looks like this.
.
So the gene class column is just the other two columns in the data frame combined. I made a list using the gene class column like below. This take all the gene pair lists and make them into a single list.
#convert the column to a list
postive_gene_pairs = positive_samples["Gene Class"].tolist()
This is the output.
Each pair is now wrapped within double quotes, which I dont want because I loop through this list and use .loc method to locate this pairs in another data frame called new_expression which has them as an index like this
for positive_gene_pair in positive_gene_pairs:
print(new_expression_df.loc[[positive_gene_pair],"GSM144819"])
This throws a keyerror.
And it definely because of the extra quotes that each pair is wrapped around because when I instantiate a list like below without quotes it works just fine.
So my question is how do I remove the extra quotes to make this work with .loc? To make a list just like below, but from a data frame column?.
pairs = [['YAL013W','YBR103W'],['YAL011W','YMR263W']]
I tried so many workarounds like replace, strip but none of them worked for me as ideally they would work for strings but I was trying to make them work on a list, any easy solution? I just want to have a list of list like this pairs list that does not have extra single or double quotes.
Convert list of strings to lists first:
import ast
postive_gene_pairs = positive_samples["Gene Class"].apply(ast.literal_eval).tolist()
And then remove []:
for positive_gene_pair in positive_gene_pairs:
print(new_expression_df.loc[[positive_gene_pair],"GSM144819"])
to:
for positive_gene_pair in positive_gene_pairs:
print(new_expression_df.loc[positive_gene_pair,"GSM144819"])
define a functio:
def listup(initlist):
# Converting string to list
res = ini_list.strip('][').split(', ')
return res
change from
postive_gene_pairs = positive_samples["Gene Class"].tolist()
to
postive_gene_pairs = positive_samples["Gene Class"].apply(listup).tolist()
So I've been following a guide I got off reddit about understanding python's bracket requirements:
Is it a list? Then use brackets.
Is it a dict? Then use braces.
Otherwise, you probably want parentheses.
However, I've come across something that the above can't explain:
df.groupby('Age')['Salary'].mean()
In this case, both Age & Salary are lists (they are both columns from a df), so why do we use parentheses for Age and brackets for Salary?
Additionally, why is there a dot before mean, but not in-between ('Age') and ['Salary']?
I realise the questions I'm asking may be fairly basic. I'm working my way through Python Essential Reference (4th ed) Developer's Library. If anyone has any sources dealing with my kind of questions it would be great to see them.
Thanks
If you'll forgive me for answering the important question rather than the one you asked...
That's a very compact chain. Break it into separate lines and then use the Debugging view of an IDE to step through it the understand the datatypes involved.
query_method = df.groupby
query_string = 'Age'
query_return = query_method(query_string)
data = query_return['Salary']
data_mean = data.mean()
Step through in the PyCharm Debugger and you can see type for every variable.
There is a lot of context here that can be found in the pandas dataframe documentation.
To start off, df is an object of class pandas.DataFrame.
pandas.DataFrame has a function called groupby that takes some input. In your example, the input is 'Age'. When you pass arguments to a function it looks like this:
my_function(input)
when you have more than one input, the common way to pass them is as multiple variables, like this
my_function(input1, input2, etc, ...)
pandas.DataFrame.groupby(...) returns an object that is subscriptable or sliceable. Using slice notation is like accessing an element in an list or a dict, like this
my_list = [1,2,3]
print(my_list[0]) # --> 1
my_dict = {
"a": "apple",
"b": "banana",
"c": "cucumber"
}
print(my_dict["b"]) # --> banana
coming back to your specific question:
df.groupby('Age')['Salary'].mean()
df # df, the name of your DataFrame variable
.groupby('Age') # call the function groupby to get the frame grouped by the column 'Age'
['Salary'] # access the 'Salary' element from that groupby
.mean() # and apply the mean() function to the 'Salary' element
So it appears that you are getting a list of all the the mean salaries by age of the employee.
I hope this helps to explain
both Age & Salary are lists (they are both columns from a df),
They're Ranges / Columns, not lists. The group by function of a Dataframe returns an indexed object. Calling methods requires parenthesis, like print(). You can use square brackets to access indexed data (ref. dict() objects).
The period and paranthesis afterwards is another function call
why is there a dot before mean, but not in-between ('Age') and ['Salary']
Short answer is that foo.['bar'] is not valid syntax
But df.groupBy("Age").some_func() certainly could be done, depending on the available functions on that object
I am new to pandas and I am creating new columns based on conditions from other existing columns using the following code:
df.loc[(df.item1_existing=='NO') & (df.item1_sold=='YES'),'unit_item1']=1
df.loc[(df.item2_existing=='NO') & (df.item2_sold=='YES'),'unit_item2']=1
df.loc[(df.item3_existing=='NO') & (df.item3_sold=='YES'),'unit_item3']=1
Basically, what this means is that if item is NOT existing ('NO') and the item IS sold ('YES') then give me a 1. This works to create 3 new columns but I am thinking there is a better way. As you can see, there is a repeated string in the name of the columns: '_existing' and '_sold'. I am trying to create a for loop that will look for the name of the column that ends with that specific word and concatenate the beginning, something like this:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
df.loc[('df.'+i+'_existing'=='NO') & ('df'+i+'_sold'=='YES'),'unit_'+i]=1
but of course, it doesn't work. As I said, I am able to make it work with the initial example, but I would like to have fewer lines of code instead of repeating the same code because I need to create several columns this way, not just three. Is there a way to make this easier? is the for loop the best option? Thank you.
You can use Boolean series, i.e. True / False depending on whether your condition is met. Coupled with pd.Series.eq and f-strings (PEP498, Python 3.6+), and using __getitem__ (or its syntactic sugar []) to allow string inputs, you can write your logic more readably:
unit_cols = ['item1','item2','item3']
for i in unit_cols:
df[f'unit_{i}'] = df[f'{i}_existing'].eq('NO') & df[f'{i}_sold'].eq('YES')
If you need integers (1 / 0) instead of Boolean values, you can convert via astype:
df[f'unit_{i}'] = df[f'unit_{i}'].astype(int)
Please, is there any ways to replace "x-y" by "x,x+1,x+2,...,y" in every row in a data frame? (Where x, y are integer).
For example, I want to replace every row like this:
"1-3,7" by "1,2,3,7"
"1,4,6-9,11-13,5" by "1,4,6,7,8,9,11,12,13,5"
etc
I know that by looping through lines and using regular expression we can do that. But the table is quite big and it takes quite some time. so I think using pandas might be faster.
Thanks alot
In pandas you can use apply to apply any function to either rows or columns in a DataFrame. The function can be passed with a lambda, or defined separately.
(side-remark: your example does not entirely make clear if you actually have a 2-D DataFrame or just a 1-D Series. Either way, apply can be used)
The next step is to find the right function. Here's a rough version (without regular expressions):
def make_list(str):
lst = str.split(',')
newlst = []
for i in lst:
if "-" in i:
newlst.extend(range(*[int(j) for j in i.split("-")]))
else:
newlst.append(int(i))
return newlst