Assume I have two dataframes:
df1: 4 columns, n rows
df2: 50 columns, n rows
What is the best way to calculate the difference between each column of df1 and every column of df2?
My only idea so far is to merge the tables and create the 4*50 new difference columns in a loop. But there has to be a better way, right?
Thanks already! Paul
For this I have created two small example dataframes:
Input Dataframes
import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 1],
                    "b": [2, 2, 2]})
df2 = pd.DataFrame({"aa": [10, 10, 10],
                    "bb": [20, 20, 20],
                    "cc": [30, 30, 30],
                    "dd": [40, 40, 40],
                    "ee": [50, 50, 50]})
print(df1)
a b
0 1 2
1 1 2
2 1 2
print(df2)
aa bb cc dd ee
0 10 20 30 40 50
1 10 20 30 40 50
2 10 20 30 40 50
Solution
# subtract each df1 column from every df2 column, then concatenate the blocks side by side
df = pd.concat([df2.sub(df1[col], axis=0) for col in df1.columns], axis=1)
df.columns = range(df1.shape[1] * df2.shape[1])
df
Result
0 1 2 3 4 5 6 7 8 9
0 9 19 29 39 49 8 18 28 38 48
1 9 19 29 39 49 8 18 28 38 48
2 9 19 29 39 49 8 18 28 38 48
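For reference, here is a minimal sketch of the same computation with NumPy broadcasting, which avoids the Python-level loop and labels the result with (df1 column, df2 column) pairs instead of numbered columns (diff and out are just illustrative names):
import numpy as np

# broadcast to shape (n, 2, 5): df2 minus each df1 column, for every column pair
diff = df2.to_numpy()[:, None, :] - df1.to_numpy()[:, :, None]
out = pd.DataFrame(diff.reshape(len(df1), -1),
                   columns=pd.MultiIndex.from_product([df1.columns, df2.columns]))
This produces the same values as above, but the MultiIndex records which pair of columns each difference came from.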
I have a large dataset with millions of rows of data. One of the data columns is ID.
I also have another (hash)table that maps the range of indices to a specific group that meets a certain criteria.
What is an efficient way to map the range of indices to include them as an additional column on my dataset in pandas?
As an example, let's say that the dataset looks like this:
In [18]:
print(df_test)
Out[18]:
ID
0 13
1 14
2 15
3 16
4 17
5 18
6 19
7 20
8 21
9 22
10 23
11 24
12 25
13 26
14 27
15 28
16 29
17 30
18 31
19 32
Now the hash table with the range of indices looks like this:
In [20]:
print(df_hash)
Out[20]:
ID_first
0 0
1 2
2 10
where the index specifies the group number that I need.
I tried doing something like this:
for index in range(df_hash.size):
    try:
        df_test.loc[df_hash.ID_first[index]:df_hash.ID_first[index + 1], 'Group'] = index
    except KeyError:
        # last group has no upper bound
        df_test.loc[df_hash.ID_first[index]:, 'Group'] = index
This works, but it is really slow because it loops over the whole hash table dataframe (hundreds of thousands of rows). It produces the following result (which is what I want):
In [23]:
print(df_test)
Out[23]:
ID Group
0 13 0
1 14 0
2 15 1
3 16 1
4 17 1
5 18 1
6 19 1
7 20 1
8 21 1
9 22 1
10 23 2
11 24 2
12 25 2
13 26 2
14 27 2
15 28 2
16 29 2
17 30 2
18 31 2
19 32 2
Is there a way to do this more efficiently?
You can map the index of df_test to the index of df_hash via ID_first, and then ffill to propagate each group number down to the next boundary. You need to construct a Series first, since the pd.Index class doesn't have a ffill method.
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                    .ffill(downcast='infer'))
# ID group
#0 13 0
#1 14 0
#2 15 1
#...
#9 22 1
#10 23 2
#...
#17 30 2
#18 31 2
#19 32 2
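One caveat: the downcast argument to ffill is deprecated in recent pandas (2.2+), so on newer versions an equivalent spelling is to ffill and then cast explicitly:
df_test['group'] = (pd.Series(df_test.index.map(dict(zip(df_hash.ID_first, df_hash.index))),
                              index=df_test.index)
                    .ffill()
                    .astype(int))  # cast manually instead of downcast='infer'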
You can also combine Series.isin with Series.cumsum: flag each row where a new group starts, then take a running total. Note that ID_first holds row positions, so this variant assumes an ID column equal to the index (0 through 19 in the output below); chain .sub(1) if you want 0-based group numbers as in the question.
df_test['group'] = df_test['ID'].isin(df_hash['ID_first']).cumsum()  # .sub(1) for 0-based groups
print(df_test)
ID group
0 0 1
1 1 1
2 2 2
3 3 2
4 4 2
5 5 2
6 6 2
7 7 2
8 8 2
9 9 2
10 10 3
11 11 3
12 12 3
13 13 3
14 14 3
15 15 3
16 16 3
17 17 3
18 18 3
19 19 3
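For completeness, a vectorized alternative (a sketch assuming ID_first is sorted ascending, as range tables like this usually are) is numpy.searchsorted, which finds every row's group in a single call:
import numpy as np

# side='right' puts each boundary row into the group it opens; subtract 1 for 0-based groups
df_test['Group'] = np.searchsorted(df_hash['ID_first'].to_numpy(),
                                   df_test.index.to_numpy(),
                                   side='right') - 1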
I have a dataframe df as shown:
1-1 1-2 1-3 2-1 2-2 3-1 3-2 4-1 5-1
10 3 9 1 3 9 33 10 11
21 31 3 22 21 13 11 7 13
33 22 61 31 35 34 8 10 16
6 9 32 5 4 8 9 6 8
The columns are named as follows: the first digit is the group number and the second is the subgroup within it. In this example we have groups 1, 2, 3, 4, 5, and group 1 consists of 1-1, 1-2, 1-3.
I would like to create a new dataframe that has only the groups 1, 2, 3, 4, 5 without subgroups, taking for each row the maximum over the subgroups, and that stays flexible if groups or subgroups are added later.
The new dataframe I need looks like this:
1 2 3 4 5
10 3 33 10 11
31 22 13 7 13
61 35 34 10 16
32 5 9 6 8
You can aggregate by columns with DataFrame.groupby and axis=1, using a lambda function that splits each column name and selects the first part, then take the max of each group:
This works correctly even if group numbers have 2 or more digits.
df1 = df.groupby(lambda x: x.split('-')[0], axis=1).max()
An alternative is to pass the split column names directly:
df1 = df.groupby(df.columns.str.split('-').str[0], axis=1).max()
print(df1)
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
You can use .str[] or .str.get here (this takes just the first character of each column name, so it assumes single-digit group numbers):
df.groupby(df.columns.str[0], axis=1).max()
1 2 3 4 5
0 10 3 33 10 11
1 31 22 13 7 13
2 61 35 34 10 16
3 32 5 9 6 8
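Note that groupby(..., axis=1) is deprecated in pandas 2.1+. A sketch of an equivalent on newer versions is to group the transposed frame and transpose back:
# group rows of the transposed frame by the column-name prefix, then transpose back
df1 = df.T.groupby(df.columns.str.split('-').str[0]).max().T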
I asked something similar yesterday but I had to rephrase the question and change the dataframes that I'm using. So here is my question again:
I have a dataframe called df_location. In this dataframe I have duplicated ids because each id has a timestamp.
location = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
            'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
            'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]}
df_location = pd.DataFrame(location)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
What I am trying to achieve is to map the values of list_of_locations to location_id: whenever a location_id appears in an island's list_of_locations, that island_id should be written to a new column in df_location.
(Note: I don't want to remove any duplicated ids; I need to keep them as they are.)
Resulting dataframe:
final_dataframe = {'location_id': [1,1,1,1,2,2,2,3,3,3,4,5,6,7,8,9,10],
                   'temperature_value': [20,21,22,23,24,25,27,28,29,30,31,32,33,34,35,36,37],
                   'humidity_value': [60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76],
                   'island_id': [10,10,10,10,20,20,20,20,20,20,30,30,40,40,40,50,60]}
df_final_dataframe = pd.DataFrame(final_dataframe)
This is just a sample of the dataframe that I have; the real one has 13,000,000 rows and 4 columns. How can this be achieved efficiently? Is there a pythonic way to do it? I tried using for loops but it takes too long, and it still didn't work. I would really appreciate it if someone could give me a solution to this problem.
Here's a solution:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup, left_on="location_id", right_on="location").drop("location", axis=1)
The output is:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 1 21 61 10
2 1 22 62 10
3 1 23 63 10
4 2 24 64 20
5 2 25 65 20
6 2 27 66 20
7 3 28 67 20
8 3 29 68 20
9 3 30 69 20
10 4 31 70 30
11 5 32 71 30
12 6 33 72 40
13 7 34 73 40
14 8 35 74 40
15 9 36 75 50
16 10 37 76 60
If some of the locations don't have a matching island_id, but you'd still like to see them in the results (with island_id NaN), use how="left" in the merge statement, as in:
island_lookup = df_islands.explode("list_of_locations").rename(columns={"list_of_locations": "location"})
pd.merge(df_location, island_lookup,
         left_on="location_id",
         right_on="location",
         how="left").drop("location", axis=1)
The result would be (here df_location was modified so that row 3 has location_id 12, which belongs to no island):
location_id temperature_value humidity_value island_id
0 1 20 60 10.0
1 1 21 61 10.0
2 1 22 62 10.0
3 12 23 63 NaN
4 2 24 64 20.0
5 2 25 65 20.0
6 2 27 66 20.0
...
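If you'd rather avoid the merge entirely, a minimal sketch of an alternative: flatten the island lists into a location-to-island dict once and use Series.map (location_to_island is just an illustrative name):
# flatten {island_id: [locations]} into a {location: island_id} lookup
location_to_island = {loc: isl
                      for isl, locs in zip(islands['island_id'], islands['list_of_locations'])
                      for loc in locs}
df_location['island_id'] = df_location['location_id'].map(location_to_island)
Locations that appear in no island get NaN automatically, just like the how="left" merge.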
Is there a general, efficient way to assign values to a subset of a DataFrame in pandas? I've got hundreds of rows and columns that I can access directly, but I haven't managed to figure out how to edit their values without iterating over each (row, col) pair. For example:
In [1]: import pandas, numpy
In [2]: array = numpy.arange(30).reshape(3,10)
In [3]: df = pandas.DataFrame(array, index=list("ABC"))
In [4]: df
Out[4]:
0 1 2 3 4 5 6 7 8 9
A 0 1 2 3 4 5 6 7 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 21 22 23 24 25 26 27 28 29
In [5]: rows = ['A','C']
In [6]: columns = [1,4,7]
In [7]: df[columns].ix[rows]
Out[7]:
1 4 7
A 1 4 7
C 21 24 27
In [8]: df[columns].ix[rows] = 900
In [9]: df
Out[9]:
0 1 2 3 4 5 6 7 8 9
A 0 1 2 3 4 5 6 7 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 21 22 23 24 25 26 27 28 29
I believe what is happening here is that I'm getting a copy rather than a view, meaning I can't assign to the original DataFrame. Is that my problem? What's the most efficient way to edit those rows x columns (preferably in-place, as the DataFrame may take up a lot of memory)?
Also, what if I want to replace those values with a correctly shaped DataFrame?
Use loc in a single assignment statement (assigning through .loc in one step means it doesn't matter whether indexing would otherwise return a view or a copy!):
In [11]: df.loc[rows, columns] = 99
In [12]: df
Out[12]:
0 1 2 3 4 5 6 7 8 9
A 0 99 2 3 99 5 6 99 8 9
B 10 11 12 13 14 15 16 17 18 19
C 20 99 22 23 99 25 26 99 28 29
If you're using a version prior to 0.11 you can use .ix.
As #Jeff comments:
This is an assignment (see the 'advanced indexing' section of the docs): it modifies df in place and doesn't return anything. Assigning through .at and .iat works the same way for single values.
df.loc[rows,columns] can return a view, but usually it's a copy. Confusing, but done for efficiency.
Bottom line: use loc/iloc (or .ix on versions before 0.11) to set values as above, and don't modify copies.
See 'view versus copy' section of the docs.
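As for the follow-up about assigning a correctly shaped block: the same .loc assignment accepts an array or DataFrame whose shape matches the selection. A minimal sketch (note that a DataFrame on the right-hand side aligns by index and column labels, so a plain array is safer when the labels differ):
import numpy as np

# the selection is 2 rows x 3 columns, so the right-hand side must have shape (2, 3)
df.loc[rows, columns] = np.array([[910, 940, 970],
                                  [930, 960, 990]])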