I'm relatively new to pandas, so I expect I simply haven't grasped it well enough yet. I've been trying to make a copy of a dataframe, reordering the rows as I do so according to an externally supplied mapping (there's a good but irrelevant reason for setting df2 to NaN). When I try to do it as one operation using .iloc, the ordering is ignored, but if I loop and do it one row at a time, it works as I expected. Can anyone explain where I'm going wrong in this MWE? (Also, more efficient / elegant ways of doing this are welcome.)
import pandas as pd
import numpy as np
df1 = pd.DataFrame([[100,200,300,400]]).T
df1.columns = ['A']
df2 = df1.copy()
df2[:] = np.nan
assign = np.array([[0,0],[1,1],[3,2],[2,3]])
print(df1)
# This does not work:
# df2.iloc[assign[:,1]] = df1.iloc[assign[:,0]]
# Output:
# A
# 0 100
# 1 200
# 2 300
# 3 400
#
# A
# 0 100
# 1 200
# 2 300
# 3 400
# This does:
for x in assign:
    df2.iloc[x[1]] = df1.iloc[x[0]]
# Output:
# A
# 0 100
# 1 200
# 2 300
# 3 400
#
# A
# 0 100
# 1 200
# 2 400
# 3 300
print(df2)
We will need a pandas developer here to explain why it works this way, but I do know that the following solution will get you there (pandas 0.13.1):
In [179]:
df2.iloc[assign[:,1]] = df1.iloc[assign[:,0]].values
print(df2)
Out[179]:
A
0 100
1 200
2 400
3 300
[4 rows x 1 columns]
As @Jeff pointed out, in df2.iloc[assign[:,1]] = df1.iloc[assign[:,0]] you are assigning a Series to a Series, and the two indexes are matched up. But with df2.iloc[assign[:,1]] = df1.iloc[assign[:,0]].values you are assigning an array to a Series, so there is no index to be matched.
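A minimal sketch of that difference, reusing the df1, df2 and assign defined in the question:

s = df1.iloc[assign[:,0]]         # values [100, 200, 400, 300], index [0, 1, 3, 2]
df2.iloc[assign[:,1]] = s         # right side aligned by index label -> order unchanged
df2.iloc[assign[:,1]] = s.values  # plain array, no index -> positional order respected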
Also consider the following example, as an illustration of the index-matching behavior.
In [208]:
#this will work and there will be missing values
df1['B']=pd.Series({0:'a', 3:'b', 2:'c'})
print(df1)
A B
0 100 a
1 200 NaN
2 300 c
3 400 b
[4 rows x 2 columns]
In [209]:
#this won't work
df1['B']=['a', 'b', 'c'] # one element fewer than df1 has rows
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
ValueError: Length of values does not match length of index
Related
I have a dataframe and I want to subtract a column from multiple columns.
code:
df = pd.DataFrame({'A':[10,20,30],'B':[100,200,300],'C':[15,10,50]})
# Create new A and B columns by subtracting C from A and B
df[['newA','newB']] = df[['A','B']]-df['C']
Present output:
raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis
You can use sub and pass axis=0, so that df['C'] is aligned against the rows rather than the columns:
df[['newA', 'newB']] = df[['A', 'B']].sub(df['C'],axis=0)
df
Out[114]:
A B C newA newB
0 10 100 15 -5 85
1 20 200 10 10 190
2 30 300 50 -20 250
Another option, along with the above answer: you can convert column 'C' to a numpy array by doing df[['C']].values. Hence the new code would be:
df[['newA','newB']] = df[['A','B']]-df[['C']].values
Try using the pandas .apply() method. You can pass columns and apply a given function to them, in this case subtracting one of your existing columns; note that apply is generally slower than the vectorized sub above. The below should work. Documentation here.
df[['newA','newB']] = df[['A','B']].apply(lambda x: x - df['C'])
You can reshape df['C'].values so that it broadcasts against df[['A','B']]:
df[['newA','newB']] = df[['A','B']] - df['C'].values[:, None]
print(df)
A B C newA newB
0 10 100 15 -5 85
1 20 200 10 10 190
2 30 300 50 -20 250
I am unable to create an empty dataframe and then copy the edge nodes into the dataframe using a list comprehension.
df = pandas.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = [(s,t) for (s,t) in graph.edges]
I receive an error stating that it can't copy 44000 into a series.
I don't know what graph.edges contains and I don't want to guess, but this code has the same problem.
df = pd.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = [(s,t) for (s,t) in zip(range(1000), range(1000))]
Results in:
ValueError: cannot copy sequence with size 1000 to array axis with dimension 0
I don't really have an answer for why, other than that you just can't do that. But if you wrap the list in a DataFrame first:
df = pd.DataFrame(columns=['Source','Target'])
df[['Source','Target']] = pd.DataFrame([(s,t) for (s,t) in zip(range(1000), range(1000))])
Now it works.
>>> df.head()
Source Target
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
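If the goal is just to get the edges into a dataframe, a simpler sketch is to build it directly from the pairs. Here a hypothetical edge list stands in for graph.edges, which I'm assuming yields (source, target) tuples (as, e.g., networkx edges do):

import pandas as pd
edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical stand-in for graph.edges
df = pd.DataFrame(edges, columns=['Source', 'Target'])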
The problem
Starting from a pandas dataframe df made of dim_df rows, I need a new
dataframe df_new obtained by applying a function to every sub-dataframe of dimension dim_blk, ideally split starting from the last row (so the first block, not the last, may or may not have the full dim_blk rows), in the most efficient way (maybe vectorized?).
Example
In the following example the dataframe is made of only a few rows, but the real dataframe will be made of millions of rows, which is why I need an efficient solution.
dim_df = 7 # dimension of the starting dataframe
dim_blk = 3 # number of rows of each split block
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
print(df)
Output:
TEST
0 1
1 2
2 3
3 4
4 5
5 6
6 7
The split blocks I want:
1 # note: this is the first block composed by a <= dim_blk number of rows
2,3,4
5,6,7 # note: this is the last block and it has dim_blk number of rows
I've done it like this (I don't know if it is the efficient way):
lst = np.arange(dim_df, 0, -dim_blk) # [7 4 1]
lst_mod = lst[1:] # [4 1] to cut off the last empty sub-dataframe
split_df = np.array_split(df, lst_mod[::-1]) # splitted by reversed list
print(split_df)
Output:
split_df: [
TEST
0 1,
TEST
1 2
2 3
3 4,
TEST
4 5
5 6
6 7]
For example:
print(split_df[1])
Output:
TEST
1 2
2 3
3 4
How can I get a new dataframe, df_new, where every row is made of two columns, min and max (just an example), calculated for every block?
I.e:
# df_new
Min Max
0 1 1
1 2 4
2 5 7
You can convert split_df into a DataFrame and then create a new dataframe using the min and max functions, i.e.
split_df = pd.DataFrame(np.array_split(df['TEST'], lst_mod[::-1]))
df_new = pd.DataFrame({"MIN":split_df.min(axis=1),"MAX":split_df.max(axis=1)}).reset_index(drop=True)
Output:
MAX MIN
0 1.0 1.0
1 4.0 2.0
2 7.0 5.0
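For what it's worth, a hedged alternative sketch (a different route from the answer above, assuming pandas >= 0.25 for named aggregation): label every row with a block id counted from the last row, then aggregate with groupby:

import numpy as np
import pandas as pd

dim_df, dim_blk = 7, 3
df = pd.DataFrame(np.arange(1, dim_df + 1), columns=['TEST'])

# Block ids counted from the end, so only the first block can be short
block = (dim_df - 1 - np.arange(dim_df)) // dim_blk
df_new = (df.groupby(block)['TEST']
            .agg(Min='min', Max='max')
            .iloc[::-1]                # put the (possibly short) first block first
            .reset_index(drop=True))
print(df_new)
#    Min  Max
# 0    1    1
# 1    2    4
# 2    5    7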
Moved solution from question to answer:
The Solution
Thinking laterally, I found a very speedy solution:
Apply a rolling function to the entire dataframe
Select every dim_blk-th row starting from the end
The code (with different values):
import numpy as np
import pandas as pd
import time
dim_df = 500000
dim_blk = 240
df = pd.DataFrame(np.arange(1,dim_df+1), columns=['TEST'])
start_time = time.time()
df['MAX'] = df['TEST'].rolling(dim_blk).max()
df['MIN'] = df['TEST'].rolling(dim_blk).min()
df[['MAX', 'MIN']] = df[['MAX', 'MIN']].fillna(method='bfill')
df_split = pd.DataFrame(columns=['MIN', 'MAX'])
df_split['MAX'] = df['MAX'][-1::-dim_blk][::-1]
df_split['MIN'] = df['MIN'][-1::-dim_blk][::-1]
df_split.reset_index(drop=True, inplace=True)
print(df_split.tail())
print('\n\nEND\n\n')
print("--- %s seconds ---" % (time.time() - start_time))
Time Stats
The original code stops after 545 secs. The new code stops after 0.16 secs. Awesome!
I have a problem with adding columns in pandas.
I have a DataFrame whose dimensions are n×k. In the process I will need to add columns of dimension m×1, where m ∈ [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
result:
AssertionError: Length of values does not match length of index
Can I add columns of a different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat, but don't ignore_index (the default value of ignore_index is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
'Age':[10, 12, 13],
'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize all lists to the determined max length, padding with empty strings:
if not max_len == la:
a.extend(['']*(max_len-la))
if not max_len == lb:
b.extend(['']*(max_len-lb))
if not max_len == lc:
c.extend(['']*(max_len-lc))
Now all the lists are the same length, so create the dataframe:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue: two different dataframes without a common column, and I just needed to put them beside each other in a csv file.
Merge:
In this case, merge does not work, even after adding a temporary column to both dfs and then dropping it, because this method forces both dfs to the same length: it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It effectively appended the shorter df below the longer one (row-wise), leaving a block of NaNs in the shorter df's column, because the two indexes did not line up.
Solution: You need to do the following:
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
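A minimal end-to-end sketch with hypothetical frames:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'B': ['x', 'y']})

df_final = pd.concat([df1.reset_index(drop=True),
                      df2.reset_index(drop=True)], axis=1)
print(df_final)
#    A    B
# 0  1    x
# 1  2    y
# 2  3  NaN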
If somebody would like to replace a specific column of a different size instead of adding one: based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create a position-keyed dict from the list
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T  # rebuild the DataFrame from the dict and return it
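A quick usage sketch with hypothetical data:

df = pd.DataFrame({'A': [1, 2, 3]})
df = fill_column(df, ['x', 'y'], 'B')  # column B gets 2 values, row 2 becomes NaN
#    A    B
# 0  1    x
# 1  2    y
# 2  3  NaN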
The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1,5, size=(100,1)))
map_array = {1:'one', 2:'two', 4:'four'}
df['__temp__'] = df[0].map(map_array, na_action=None)
#I've tried varying the na_action arg to no effect
nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df.loc[nan_index, 0]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
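A sketch of another way to avoid the temp column entirely (a common pattern, not from the answer above): map, then fall back to the original values with fillna:

mapped = df[0].map(map_array)  # NaN wherever the int has no mapping
df[0] = mapped.fillna(df[0])   # keep the original value on misses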