I would like to use a MultiIndex DataFrame to easily select portions of the DataFrame. I created an empty DataFrame as follows:
mi = {'input':['a','b','c'],'optim':['pareto','alive']}
mi = pd.MultiIndex.from_tuples([(c,k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex(names=['Generation','Individual'],labels=[[],[]],levels=[[],[]])
population = pd.DataFrame(index=mi,columns=mc)
which seems to work.
However, I do not know how to insert a single value to start populating my DataFrame. I tried the following:
population.loc[('optim','pareto'),(0,0)]=True
where I tried to define a new two-level column index (0,0), which raised a NotImplementedError. I also tried with (0,1), which gave a ValueError.
I also tried without any column index:
population.loc[('optim','pareto')]=True
which gave no error... but no change in the DataFrame either.
Any help? Thanks in advance.
EDIT
To clarify my question, once populated, my DataFrame should look like this:
Generation            1            2
Individual            1      2      3      4   5   6
input  a              1      1      2    ...
       b              1      2      2    ...
       c              1      1      2    ...
optim  pareto      True   True  False    ...
       alive       True   True  False    ...
EDIT 2
I found out that what I was doing works if I define my first column at the DataFrame creation. In particular with:
mc = pd.MultiIndex.from_tuples([(0,0)])
I get a first column full of NaN and I can add data as I wanted (including to new columns):
population.loc[('optim','pareto'),(0,1)]=True
I still do not know what is wrong with my first definition...
Even though I do not know why my initial definition was wrong, the following works as expected:
mi = {'input':['a','b','c'],'optim':['pareto','alive']}
mi = pd.MultiIndex.from_tuples([(c,k) for c in mi.keys() for k in mi[c]])
mc = pd.MultiIndex.from_tuples([(0,0)],names=['Generation','Individual'])
population = pd.DataFrame(index=mi,columns=mc)
It looks like the solution was to initialize the columns at DataFrame creation (here with a (0,0) column). The created DataFrame is then:
Generation        0
Individual        0
input  a        NaN
       b        NaN
       c        NaN
optim  pareto   NaN
       alive    NaN
which can then be populated by adding values to the existing column or to new columns/rows.
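For example, a minimal sketch of populating it further (continuing from the population frame above, with made-up values, and relying on the same behaviour noted in EDIT 2 for creating new columns on assignment):
population.loc[('optim','pareto'), (0,0)] = True    # fill the pre-created column
population.loc[('optim','alive'), (0,0)] = True
population.loc[('input','a'), (0,1)] = 1            # a new Individual column is created on assignment
population.loc[('input','a'), (1,2)] = 2            # likewise for a new Generation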
I have 2 csv files with some random numbers, as follows:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
Length: 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
Length: 7502
Some elements in csv1 are also present in csv2, but I don't know exactly which ones, and I would like to extract the unique values. So my idea was to concatenate the two DataFrames and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything works and the length of df is the sum of the two lengths, but this is where I got stuck.
The concat did its job and stacked the frames based on the index, but because the column names differ I now have tons of NaN.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The DataFrame does not drop any values because its structure now looks like this:
IDNumber      id
5555          NaN
So I guess that drop_duplicates is looking for fully identical rows, and since there are none it keeps everything.
So what I would like to do is loop over both DataFrames, and if a value from csv1 exists in csv2, I want it to be dropped.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
My csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is reading the id values as NaN.
In fact, I tried to add a new column to csv2, passing the ids from csv1 as values, and I can confirm that they are all NaN.
So I believe this is the source of the problem, which then propagates to everything else.
Can anyone help to understand how I can solve this issue?
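For what it's worth, if the file really does have those trailing commas, the header row declares a single id column while every data row has two fields, so pandas shifts the numbers into the index and fills the id column with NaN. A hedged workaround (the 'extra' column name is just a placeholder) is to declare both columns explicitly and drop the empty one:
import pandas as pd

# Assumption: csv1.csv has a one-field header "id" but two fields per data row
# because of the trailing comma; naming both columns keeps the numbers in 'id'
# instead of pushing them into the index. dtype=str also preserves leading zeros
# such as 007559.
unique_users = pd.read_csv('./csv1.csv', header=0, names=['id', 'extra'], dtype={'id': str})
unique_users = unique_users.drop(columns=['extra'])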
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# First approach: merge to find the values shared by both frames, then filter them out with isin()
shared_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(shared_vals)]
# Another approach: a left merge, keeping only the rows with no match in df2
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
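For comparison, the same result can be obtained more directly with isin() alone (same df1/df2 as above):
# keep only the rows of df1 whose values never appear in df2
new_df1 = df1[~df1.col_a.isin(df2.col_b)]
print(new_df1)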
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df:
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because when you concatenate, pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IDNumber column and the other has an id column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["id"]]).drop_duplicates().reset_index(drop=True)
I asked this question for R a few months back and got a great answer that I used often. Now I'm trying to transition to Python, but I was dreading attempting to rewrite this code snippet. And now, after trying, I haven't been able to translate the answer I got (or find anything similar by searching).
The question is: I have a dataframe that I'd like to append new columns to, where the calculation depends on values in another dataframe which holds the instructions.
I have created a reproducible example below (although in reality there are quite a few more columns and many rows so speed is important and I'd like to avoid a loop if possible):
input dataframes:
import pandas as pd
data = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4]}
data_df = pd.DataFrame(data)
key = {"cols":["A","B","C","D","E"],"include":["no","no","yes","no","yes"],"subtract":["na","A","B","C","D"],"names":["na","G","H","I","J"]}
key_df = pd.DataFrame(key)
desired output (same as data but with 2 new columns):
output = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4],"H":[2,9,-2],"J":[-4,16,-3]}
output_df= pd.DataFrame(output)
So, the key dataframe has 1 row for each column in the base dataframe and it has an "include" column that has to be set to "yes" if any calculation is to be done. When it is set to "yes", then I want to add a new column with a defined name that subtracts a defined column (all lookups from the key dataframe).
For example, column "C" in the base dataframe is included, so I want to create a new column called "H" which is the value from column "C" minus the value from column "B".
P.S. Here is the answer from R, in case that triggers any thought processes for someone better skilled than me!
k <- subset(key, include == "yes")
output <- cbind(base,setNames(base[k[["cols"]]]-base[k[["subtract"]]],k$names))
Filter for the yes values in include:
yes = key_df.loc[key_df.include.eq("yes"), ["cols", "subtract", "names"]]
cols subtract names
2 C B H
4 E D J
Create a dictionary of the yes values and unpack it in the assign method:
yes_values = { name: data_df[col] - data_df[subtract]
for col, subtract, name
in yes.to_numpy()}
data_df.assign(**yes_values)
A B C D E H J
0 orange 5 7 5 1 2 -4
1 apple 3 12 2 18 9 16
2 banana 6 4 7 4 -2 -3
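Note that assign returns a new DataFrame rather than modifying data_df in place, so to keep the result you would assign it back, e.g.:
output_df = data_df.assign(**yes_values)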
I have a dataframe to capture characteristics of people accessing a webpage. The list of times spent by each user on the page is one of the characteristic features that I get as input. I want to update this column with the maximum value of the list. Is there a way I can do this?
Assume that my data is:
df = pd.DataFrame({'Page_id':[1,2,3,4], 'User_count':[5,3,3,6], 'Max_time':[[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
What I want to do is replace each list in the Max_time column with its maximum, i.e. Max_time: [120, 109, 89, 431].
I am not supposed to add another column for computing the max separately as this table structure cannot be altered.
I tried the following:
for i in range(len(df)):
df.loc[i]["Max_time"] = max(df.loc[i]["Max_time"])
But this is not changing the column as I intended it to. Is there something that I missed?
df = pd.DataFrame({'Page_id':[1,2,3,4],'User_count':[5,3,3,6],'Max_time':[[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
df.Max_time = df.Max_time.apply(max)
Result:
Page_id User_count Max_time
0 1 5 120
1 2 3 109
2 3 3 89
3 4 6 431
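As for why the original loop did not change anything: df.loc[i]["Max_time"] is chained indexing, so the assignment lands on an intermediate object rather than on df itself. If you really do want a loop, a sketch using a single .loc call per row should work (though the apply/map answers here are preferable):
for i in range(len(df)):
    df.loc[i, "Max_time"] = max(df.loc[i, "Max_time"])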
You can use this:
import numpy as np
df['Max_time'] = df['Max_time'].map(lambda x: np.max(x))
I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx',sheetname='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff"; I would like to subtract two columns of just the data and make the result its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I simply try print(dataCol1-dataCol2). I really don't understand how both of these subtraction operations result not only in all NaNs, but also in two columns instead of just one. When I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is the misalignment of your labels: the two one-column frames have different column names ('a' and 'b'), so nothing lines up when they are subtracted.
One thing to do would be to subtract the values, so you don't have to deal with alignment issues:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
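Alternatively, a sketch that keeps the original row labels is to select each column as a Series (a single position instead of a slice), so the two align on their shared index and subtract cleanly; the 'difference' column name below is just an example, and this assumes the data rows really start at position 2 as in the question:
col1 = df.iloc[2:, 0]           # a Series, keeps the original index
col2 = df.iloc[2:, 1]
df['difference'] = col1 - col2  # aligns on the index; rows 0-1 ("stuff") stay NaN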
I have a pandas dataframe and I'd like to add a new column that has the contents of an existing column, but shifted relative to the rest of the data frame. I'd also like the value that drops off the bottom to get rolled around to the top.
For example if this is my dataframe:
>>> myDF
coord coverage
0 1 1
1 2 10
2 3 50
I want to get this:
>>> myDF_shifted
coord coverage coverage_shifted
0 1 1 50
1 2 10 1
2 3 50 10
(This is just a simplified example - in real life, my dataframes are larger and I will need to shift by more than one unit)
This is what I've tried and what I get back:
>>> myDF['coverage_shifted'] = myDF.coverage.shift(1)
>>> myDF
coord coverage coverage_shifted
0 1 1 NaN
1 2 10 1
2 3 50 10
So I can create the shifted column, but I don't know how to roll the bottom value around to the top. From internet searches I think that numpy lets you do this with "numpy.roll". Is there a pandas equivalent?
Pandas probably doesn't provide an off-the-shelf method to do exactly what you described; however, if you can move a little bit outside the box, numpy has exactly that.
In your case it is:
import numpy as np
myDF['coverage_shifted'] = np.roll(myDF.coverage, 1)
You can also pass an additional argument to shift() to achieve what you want, although the previous answer is more practical in most cases:
last_value = myDF.iloc[-1]['coverage']
myDF['coverage_shifted'] = myDF.coverage.shift(1, fill_value=last_value)
You have to supply the wrapped value to fill_value manually.
The same can be applied for rolling in the other direction:
first_value = myDF.iloc[0]['coverage']
myDF['coverage_back_shifted'] = myDF.coverage.shift(-1, fill_value=first_value)
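For shifts larger than one position the fill_value trick only wraps a single value, so for that case a roll-based sketch (n below is just a hypothetical shift amount) is simpler:
import numpy as np

n = 2  # hypothetical shift amount
myDF['coverage_shifted_n'] = np.roll(myDF['coverage'].to_numpy(), n)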