Overwriting existing dataframe in loop - python

I am trying to transform elements in various data frames (standardize numerical values to be between 0 and 1, one-hot encode categorical variables) but when I try to overwrite the dataframe in a loop it doesn't modify the existing dataframe, only the loop variable. Here is a dummy example:
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(1, 16).reshape(5, 3))
b = pd.DataFrame(np.arange(1, 16).reshape(5, 3))
for hi in [t, b]:
    hi = pd.DataFrame(np.arange(30, 45).reshape(5, 3))
But when I run this code both t and b have their original values. How can I overwrite the original dataframe (t or b) while in a loop?
The specific problem I'm running into is when trying to use get_dummies function in the loop:
hi = pd.get_dummies(hi, columns=['column1'])

You can't replace elements of a list that way while iterating over it. On each pass, hi refers to the same DataFrame object as t or b, but the assignment hi = ... only rebinds the name hi to a new object; it neither mutates the original DataFrame nor changes what the list holds. Search "changing list elements loop python" for a number of good Stack Overflow questions on why this is the case.
If you want to replace elements in a list iteratively, you can use enumerate() and assign back by index, or a list comprehension. Better still, keep the dataframes in a dictionary and iterate over that, instead of using separate variable names to keep track of them all, as suggested here.
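A minimal sketch of that dictionary approach, reusing the dummy data from the question (the "t"/"b" keys are just illustrative names): assigning back into the dict actually replaces the stored frame, unlike rebinding a loop variable.

```python
import numpy as np
import pandas as pd

# Keep the frames in a dict keyed by name.
frames = {
    "t": pd.DataFrame(np.arange(1, 16).reshape(5, 3)),
    "b": pd.DataFrame(np.arange(1, 16).reshape(5, 3)),
}

for name in frames:
    # Assigning into the dict replaces the stored frame;
    # rebinding a loop variable would not.
    frames[name] = pd.DataFrame(np.arange(30, 45).reshape(5, 3))

print(frames["t"].iloc[0, 0])  # 30
```

The same pattern works for the real use case: `frames[name] = pd.get_dummies(frames[name], columns=['column1'])` inside the loop.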


Partially merge to list in python

Classic, but I'm new to Python... and have a problem I can't manage to solve. I'm assuming it's fairly easy.
I have two CSV files: one scraped from the web (list a), containing 20,000+ lines, and one exported from a local system (list b), with 80+ lines.
I have opened the files and stored the data in lists a and b. They are structured like the example below.
a = [[1,'a','b','a#b',11],
[2,'c','d','c#b',22],
[3,'e','f','e#b',33]]
b = [['a','banana','A',100],
['e','apple','A',100]]
Now I would like to go through list a, and whenever index 1 of a sublist in a equals index 0 of a sublist in b, indices 3 and 4 of that a sublist should be appended to the b sublist. So I would end up with
c = [['a','banana','A',100,'a#b',11],
     ['e','apple','A',100,'e#b',33]]
How do I achieve this? The solution doesn't need to be fast if it teaches me something about Python's structures, but if it's easily solved with pandas, I'm all ears.
If this forum is not for questions like this, I'm sorry for taking your time.
This isn't optimized and isn't efficient time-complexity-wise, but it's readable and does the job.
c = []
for a_entry in a:
    for b_entry in b:
        if a_entry[1] == b_entry[0]:
            c.append(b_entry + a_entry[3:])
Alternatively, make a dictionary ({key: [value1, value2], ...}) by iterating over a:
for each item/sublist, use index 1 for the key and indices 3 and 4 (i.e. row[3:]) for the value, since those are the fields to be appended
do the same thing for b, except use index zero for the key and row[1:] for the value
iterate over the items in the b dictionary
use the key of each item to look up a value in the a dictionary
if there is one, extend the b item's value with it
reconstruct the result from the modified dictionary
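A slightly condensed sketch of that idea, with one lookup dict built from a and the b rows extended through it (using the question's sample data):

```python
a = [[1, 'a', 'b', 'a#b', 11],
     [2, 'c', 'd', 'c#b', 22],
     [3, 'e', 'f', 'e#b', 33]]
b = [['a', 'banana', 'A', 100],
     ['e', 'apple', 'A', 100]]

# Build a lookup from a: key on index 1, keep the tail we want to append.
a_lookup = {row[1]: row[3:] for row in a}

# Extend each b row with the matching tail from a, if there is one.
c = [row + a_lookup[row[0]] for row in b if row[0] in a_lookup]

print(c)  # [['a', 'banana', 'A', 100, 'a#b', 11], ['e', 'apple', 'A', 100, 'e#b', 33]]
```

Each row of a is visited once and each row of b does a constant-time dict lookup, instead of the nested loops' pairwise scan.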

Pythonic way to create a dictionary by iterating

I'm trying to write something that answers "what are the possible values in every column?"
I created a dictionary called all_col_vals and iterate from 1 to however many columns my dataframe has. However, when reading about this online, someone stated this looked too much like Java and the more pythonic way would be to use zip. I can't see how I could use zip here.
all_col_vals = {}
for index in range(RCSRdf.shape[1]):
    all_col_vals[RCSRdf.iloc[:, index].name] = set(RCSRdf.iloc[:, index])
The output looks like 'CFN Network': {nan, 'N521', 'N536', 'N401', 'N612', 'N204'}, 'Exam': {'EXRC', 'MXRN', 'HXRT', 'MXRC'} and shows all the possible values for that specific column. The key is the column name.
I think #piRSquared's comment is the best option, so I'm going to steal it as an answer and add some explanation.
Answer
Assuming you don't have duplicate columns, use the following:
{k : {*df[k]} for k in df}
Explanation
k represents a column name in df. You don't have to go through the .columns attribute to get them, because iterating over a pandas.DataFrame yields its column labels, much like iterating over a python dict yields its keys.
df[k] is the series for column k.
{*df[k]} unpacks the values of the series into a set literal ({...}), which by definition keeps only distinct elements (see the definition of a set).
Lastly, using a dict comprehension to create the dict is faster than defining an empty dict and adding new keys to it in a for loop.
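For example, with some made-up column data modeled on the question's output:

```python
import pandas as pd

df = pd.DataFrame({'CFN Network': ['N521', 'N536', 'N521'],
                   'Exam': ['EXRC', 'MXRN', 'EXRC']})

# One set of distinct values per column, keyed by column name.
all_col_vals = {k: {*df[k]} for k in df}

print(sorted(all_col_vals['Exam']))  # ['EXRC', 'MXRN']
```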

Passing a list to a new list with a dynamic name after each step of a loop

I have a data frame that contains several columns.
I would like to iterate some columns of the data frame and convert each row to a list based on a function called tokenizer.
import nltk

columns = ['stemmed', 'lemmatized', 'lem_stop', 'stem_stop', 'lem_stop_nltk', 'stem_stop_nltk']
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = []
for i in columns:
    for tweet in df[i]:
        sentences += tweet_to_sentences(tweet, tokenizer)
However, I would like to create 6 different lists rather than one.
How can I change the name of the variable based on the variable i in each step of the outer loop (i.e. for i in columns:)?
I was thinking something like this
for i in columns:
    for tweet in df[i]:
        sentences += tweet_to_sentences(tweet, tokenizer)
    i_list = sentences
where i_list will translate in stemmed_list, lemmatized_list, lem_stop_list and so on.
Any idea?
Papa,
Perhaps I'm misreading what you're asking, but what I'm hearing is that on each iteration of i in your for loop, you want to dynamically create a list with a new name?
If I could suggest a new approach based on that understanding, it'd be to create a list of lists: each time through the outer loop, create a new list and add it as an element of the list of lists. An even better solution would be to create a dictionary, with the column name (e.g. 'stemmed') as the key and the list you build in the loop as the value.
Would this work for you?
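A minimal sketch of the dictionary idea; tweet_to_sentences here is a stand-in for the question's NLTK-based function, just splitting on periods so the example runs on its own:

```python
import pandas as pd

# Stand-in for the question's tweet_to_sentences(tweet, tokenizer);
# it only splits on periods, for illustration.
def tweet_to_sentences(tweet):
    return [s.strip() for s in tweet.split('.') if s.strip()]

df = pd.DataFrame({'stemmed': ['hello. world', 'foo'],
                   'lemmatized': ['bar. baz', 'qux']})

columns = ['stemmed', 'lemmatized']

# One list per column, keyed by column name: no dynamic variable names needed.
sentences = {col: [] for col in columns}
for col in columns:
    for tweet in df[col]:
        sentences[col] += tweet_to_sentences(tweet)

print(sentences['stemmed'])  # ['hello', 'world', 'foo']
```

Each list is then reachable as sentences['stemmed'], sentences['lemmatized'], and so on, which is what the "_list" suffix was trying to achieve.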

How to create new array deducting segments of existing array

I am trying to create new array out of an existing array in Python.
I read some of already existing and similar questions but I still can not solve the problem.
For example:
I have array A = [4,6,9,15] and I want to create B =[(6-4),(9-6),(15-9)].
I tried to do it in for loop like this:
deltaB=[]
for i in range(0,len(A)):
    deltaB[i]=A[i]-A[i-1]
    deltaB.append(deltaB[i])
But that does not work... probably because I am writing code completely wrong since I'm new in Python and programming in general.
Can you help and write me code for this?
Many thanks upfront
List comprehension
Probably the best way to do this is using list comprehension:
[xj-xi for xi,xj in zip(A,A[1:])]
which generates:
>>> [xj-xi for xi,xj in zip(A,A[1:])]
[2, 3, 6]
Here we zip(..) the list A together with A[1:], the slice of the list that omits the first element, producing tuples of consecutive elements. For each such tuple (xi, xj) we add xj-xi to the list.
The error
The error occurs for two reasons. First, the loop should start from 1, not 0, and stop before len(A) (at i = 0, A[i-1] is A[-1], the last element, so the first "difference" would be wrong anyway). Second, you cannot assign to an index that does not exist yet in an empty list (deltaB[i] = ... raises an IndexError); you need to append directly:
deltaB = []
for i in range(1, len(A)):
    deltaB.append(A[i] - A[i-1])
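Putting the two versions side by side with the question's A confirms they agree:

```python
A = [4, 6, 9, 15]

# List comprehension over consecutive pairs:
deltaB = [xj - xi for xi, xj in zip(A, A[1:])]

# Equivalent index-based loop:
deltaC = []
for i in range(1, len(A)):
    deltaC.append(A[i] - A[i - 1])

print(deltaB)  # [2, 3, 6]
print(deltaB == deltaC)  # True
```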

How do I pass multiple variables to a function in python?

I would like to compare a column from several pairs of pandas dataframes and write the shared values to an empty list. I have written a function that can do this with a single pair of dataframes, but I cannot seem to scale it up.
def parser(dataframe1, dataframe2, emptylist):
    for i1 in dataframe1['POS']:
        for i2 in dataframe2['POS']:
            if i1 == i2:
                emptylist.append(i1)
Where 'POS' is a column header in the two pandas dataframes.
I have made a list of variable names for each input value of this function, eg.
dataframe1_names=['name1','name2',etc...]
dataframe2_names=['name1','name2',etc...]
emptylist_names=['name1','name2',etc...]
Where each element of the list is a string containing the name of a variable (either a pandas dataframe in the case of the first two, or an empty list in the case of the last).
I have tried to iterate through these lists using the following code:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
But this returns TypeError: string indices must be integers.
I believe that this error is coming from passing the function a string containing the variable name instead of the variable name itself. Is there another way to pass multiple variables to a function in an automated way?
Thanks for your help!
Do you have to use strings of object names, instead of just the objects themselves? If you do
dataframes1=[name1,name2,...]
dataframes2=[name1,name2,...]
emptylists=[name1,name2,...]
Then you can just do
for a, b, c in zip(dataframes1, dataframes2, emptylists):
    parser(a, b, c)
The way you do this is really circuitous and unpythonic, by the way, so I've changed it a bit. Rather than getting lists of indexes for the for statement, I just iterate through the lists (and thus the objects) themselves. This is much more compact and easier to understand. For that matter, do you really need to pass the empty list in as an argument (e.g., perhaps it isn't always empty)?

Also, your parser code, while correct, doesn't take advantage of pandas at all and will be very slow: to compare columns, you can simply do dataframe1['COL'] == dataframe2['COL'], which gives you a boolean series of where the values are equal. You can then use this to index into a dataframe to get the shared values. The result comes out as a dataframe or series, but it's easy enough to convert to a list. Thus, your parser function can be reduced to the following, if you don't need to create the "empty list" elsewhere first:
def parser(df1, df2):
    return list(df1['COL'][df1['COL'] == df2['COL']])
This will be much, much faster, though as it returns the list, you'll have to do something with it, so in your case, you'd do something like:
sharedlists = [ parser(a,b) for a,b in zip( dataframes1, dataframes2 ) ]
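A self-contained sketch of that reduced parser with toy frames. One caveat: the == comparison matches rows positionally, so the frames need aligned indices; df1['COL'].isin(df2['COL']) would more closely reproduce the original all-pairs behaviour.

```python
import pandas as pd

def parser(df1, df2):
    # Shared values of 'POS' where the two columns agree row-by-row.
    return list(df1['POS'][df1['POS'] == df2['POS']])

df_a = pd.DataFrame({'POS': [1, 2, 3, 4]})
df_b = pd.DataFrame({'POS': [1, 9, 3, 8]})

matches = parser(df_a, df_b)
print(matches)
```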
If you must use variable names, the following very unsafe sort of code will convert your lists of names into lists of objects (you'll need to do this for each list):
dataframes1 = [ eval(name) for name in dataframe1_names ]
If this is just for numerical work you're doing in an interpreter, eval is alright, but for any code you're releasing, it's very insecure: it will evaluate whatever code is in the string passed into it, thus allowing arbitrary code execution.
This sounds like a use case of .query()
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying:
map(lambda frame: frame.query(expr), [df, df2])
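For instance (expr here is just an illustrative filter; list() forces the map in Python 3):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = pd.DataFrame({'a': [3, 2, 1], 'b': [6, 5, 4]})

expr = 'a > 1'

# Apply the same query string to every frame in the collection.
results = list(map(lambda frame: frame.query(expr), [df, df2]))

print(len(results[0]))  # 2
```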
What kind of output are you looking for in the case where you have more than two DataFrame objects? In the case of just two, the following line would accomplish what your parser function does:
common = df1[df1["fieldname"] == df2["fieldname"]]["fieldname"]
except that common would be a DataFrame object itself, rather than a list, but you can easily get a list from it by doing list(common).
If you're looking for a function that takes any number of DataFrames and returns a list of common values in some field for each pair, you could do something like this:
from itertools import combinations
def common_lists(field, *dfs):
    return [df1[df1[field] == df2[field]][field] for df1, df2 in combinations(dfs, 2)]
The same deal about getting a list from a DataFrame applies here, since you'll be getting a list of DataFrames.
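To make that concrete with three toy frames (again, == matches row-by-row, so the frames are assumed aligned):

```python
from itertools import combinations

import pandas as pd

def common_lists(field, *dfs):
    # One series of row-aligned matches per pair of frames.
    return [df1[df1[field] == df2[field]][field]
            for df1, df2 in combinations(dfs, 2)]

df1 = pd.DataFrame({'POS': [1, 2, 3]})
df2 = pd.DataFrame({'POS': [1, 9, 3]})
df3 = pd.DataFrame({'POS': [7, 2, 3]})

# Pairs are (df1, df2), (df1, df3), (df2, df3).
pairs = common_lists('POS', df1, df2, df3)
print([list(s) for s in pairs])
```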
As far as this bit:
import itertools
for a, b, c in zip(range(len(dataframe1_names)), range(len(dataframe2_names)), range(len(emptylist_names))):
    parser(dataframe1_names[a], dataframe2_names[b], emptylist_names[c])
What you're doing is generating a sequence of index tuples that looks something like this:
[(0,0,0), (1,1,1), ... (n,n,n)]
where n is the length of the shortest of dataframe1_names, dataframe2_names, and emptylist_names. So on the first iteration of the loop, you have a == b == c == 0, and you're using these values to index into your arrays of data frame variable names, so you're calling parser("name1", "name1", "name1"), passing it strings instead of pandas DataFrame objects. Your parser function is expecting DataFrame objects so it barfs when you try to call dataframe1["POS"] where dataframe1 is the string "name1".
