I am trying to make the code below more dynamic. For simplicity I have left out all code that is not relevant. What follows is the whole concept for the program. The objective is to build linear models from random columns. It works like this: x random columns are selected, and those columns are used to build a linear model. That model is applied to a test dataset and the relevant information is captured in a dataframe. This is repeated a large number of times. What I would like to do is generate the code that assigns the values to the dataframe dynamically, based on the number of columns selected. Otherwise I need to keep the number of columns selected static, with the consequence that I have to babysit the program while it runs and manually index the number of columns selected.
The following code is what I would like to generate dynamically: test_df.loc[i,asignment_list[ii]] = i.
The example code below is only calling for 3 random columns.
import pandas as pd

# A list is used for the column labels (newer pandas versions reject a set here).
test_df = pd.DataFrame(columns = ['a','b','c','d','e','f','g','h','i','j'])
for i in range(10):
    asignment_list = list(test_df.sample(n = 3, replace = True, axis = 1))
    test_df.loc[i,asignment_list[0]] = i
    test_df.loc[i,asignment_list[1]] = i
    test_df.loc[i,asignment_list[2]] = i
print(test_df)
Output:
I tried the piece of code below, but it requires that I call the variable by name, which can't be done dynamically.
for ii in range(0,3,1):
    globals()[f'test_df.loc[{i},asignment_list[{ii}]'] = i
If Python does not have this functionality, could I build it into Python with C?
Is this what you are looking for?
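import random
import pandas as pd

tgt_cols = list('abcdef')
dfs = pd.DataFrame(index=range(10))
for c in tgt_cols:
    # Three random row positions per column; each value is stored at its own index.
    r = [random.randint(0,9) for _ in range(3)]
    s = pd.Series(r, index=r).drop_duplicates()
    dfs[c] = s
print(dfs)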
Result
a b c d e f
0 0.0 0.0 0.0 NaN NaN NaN
1 NaN NaN NaN NaN 1.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 3.0 NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 4.0
5 NaN NaN NaN NaN NaN 5.0
6 6.0 NaN 6.0 NaN NaN NaN
7 NaN NaN NaN 7.0 7.0 NaN
8 NaN 8.0 8.0 NaN 8.0 NaN
9 9.0 NaN NaN 9.0 NaN NaN
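For the indexing part of the question, no code generation should be needed: the sampled column names can simply be looped over, so the number of assignments follows the length of the list. Below is a minimal sketch under a few assumptions that are not in the original post: the name n_cols is hypothetical, the frame is pre-built with ten rows, columns are sampled without replacement, and random_state is only there to make the run repeatable.

```python
import pandas as pd

cols = list('abcdefghij')
# Pre-build the rows so that .loc only has to fill in cells.
test_df = pd.DataFrame(index=range(10), columns=cols, dtype=object)

n_cols = 3  # hypothetical: change this to select a different number of columns

for i in range(10):
    # Column labels of a random sample of n_cols columns (no replacement here).
    chosen = list(test_df.sample(n=n_cols, axis=1, random_state=i).columns)
    # One assignment per selected column, however many there are.
    for col in chosen:
        test_df.loc[i, col] = i

print(test_df)
```

The inner loop could also be written as a single assignment, test_df.loc[i, chosen] = i, since .loc accepts a list of column labels.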
Sorry for any confusion, and thanks for the questions and responses. I found a solution. If anyone finds another way I would like to see it, because this answer was a stretch for me. It is below, with a note on its application. The solution uses the zip function, whose result is fed into dict() and then into a pandas DataFrame. After the first iteration, it concatenates the DataFrame built up in the previous iterations with the newly constructed DataFrame, and the result is assigned back to the variable that held the previous DataFrame.
Application note: Given a wide table of data and a desire to build a model from the data set, it would be optimal to select a random set of columns for each model that is built. This will prevent overfitting and show which variables have the most consistent effect.
import pandas as pd
import random

test_df = pd.DataFrame(columns = ['a','b','c','f','j'])
for i in range(0,10,1):
    columns_number_selected = len(test_df.columns)
    # Number of columns to use this round: at least 1, at most all of them.
    Randum_Columns = random.randint(1, columns_number_selected)
    assignment_list = list(test_df.sample(n = Randum_Columns, replace = True, axis = 1))
    # Pair every selected column with the value i (repeated names collapse in the dict).
    data = dict(zip(assignment_list, [i] * len(assignment_list)))
    if i == 0:
        answer1 = pd.DataFrame(data, index = [i])
    else:
        answer2 = pd.DataFrame(data, index = [i])
        answer1 = pd.concat([answer1, answer2])
print(answer1)
Example of a possible output:
I'm cleaning some data and I've been struggling with one thing.
I have a dataframe with 7740 rows and 68 columns.
Most of the columns contain NaN values.
What I'm interested in is removing the rows where both of these two columns are NaN: [SERIAL_ID], [NUMBER_ID]
Example:
SERIAL_ID NUMBER_ID
8RY68U4R NaN
8756ERT5 8759321
NaN NaN
NaN 7896521
7EY68U4R NaN
95856ERT5 988888
NaN NaN
NaN 4555555
Results
SERIAL_ID NUMBER_ID
8RY68U4R NaN
8756ERT5 8759321
NaN 7896521
7EY68U4R NaN
95856ERT5 988888
NaN 4555555
That is, rows are removed when both columns are NaN.
I've used the following to do so:
df.dropna(subset=['SERIAL_ID', 'NUMBER_ID'], how='all', inplace=True)
When I use this on my dataframe with 68 columns, the result I get is this one:
SERIAL_ID NUMBER_ID
NaN NaN
NaN NaN
NaN NaN
NaN 7896521
NaN NaN
95856ERT5 NaN
NaN NaN
NaN 4555555
I tried with a copy of the dataframe containing only 3 columns, and it works fine.
On the full dataframe it is somehow working (I can tell because I have an identical ID in another column), but it removes some of the values, and I have no idea why.
Please help, I've been struggling the whole day with this.
Thanks again.
I don't know why it only works for 3 columns and not for the original 68.
However, we can obtain the desired output in another way.
Use boolean indexing:
df[df[['SERIAL_ID', 'NUMBER_ID']].notnull().any(axis=1)]
You can use boolean logic, or simply do something like this for any given columns:
import numpy as np
import pandas as pd
# sample dataframe
d = {'SERIAL_ID':['8RY68U4R', '8756ERT5', np.nan, np.nan],
     'NUMBER_ID':[np.nan, 8759321, np.nan ,7896521]}
df = pd.DataFrame(d)
# apply logic to columns
df['nans'] = df['NUMBER_ID'].isnull() * df['SERIAL_ID'].isnull()
# filter columns
df_filtered = df[df['nans']==False]
print(df_filtered)
which returns this:
SERIAL_ID NUMBER_ID nans
0 8RY68U4R NaN False
1 8756ERT5 8759321.0 False
3 NaN 7896521.0 False
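As a quick sanity check, here is a small sketch comparing dropna(subset=..., how='all') with the boolean-indexing filter on toy data. The extra OTHER column is made up to stand in for the remaining columns; on this data the two approaches keep the same rows, which suggests the difference seen on the full 68-column frame comes from the data itself rather than from dropna.

```python
import numpy as np
import pandas as pd

# Toy data; 'OTHER' is a hypothetical stand-in for the other columns.
df = pd.DataFrame({
    'SERIAL_ID': ['8RY68U4R', '8756ERT5', np.nan, np.nan],
    'NUMBER_ID': [np.nan, 8759321, np.nan, 7896521],
    'OTHER': [1, 2, 3, 4],
})
key_cols = ['SERIAL_ID', 'NUMBER_ID']

# Drop a row only when every key column is NaN.
dropped = df.dropna(subset=key_cols, how='all')

# Keep a row when at least one key column is not NaN.
masked = df[df[key_cols].notna().any(axis=1)]

print(dropped)
print(dropped.equals(masked))
```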
New learner here. I have a list of data values that are labeled by a comma-delimited string that represents the position in a dataframe; think of the string as representing the row (say 1-20) and column (say A-L) index values of a position in the array where the corresponding value should go. The populated data frame would be sparse, with many empty cells. I am working with pandas for the first time on this project, and am still learning the ropes.
position value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
etc.
I have been trying various approaches, but am not satisfied. I created dummy entries for empty cells in the completed dataframe, then iterated over the list to write the value to the correct cell. It works but it is not elegant and it seems like a brittle solution.
I can pre-generate a dataframe and populate it, or generate a new dataframe as part of the population process: either solution would be fine. It seems like this should be a simple task. Maybe even a one liner! But I am stumped. I would appreciate any pointers.
This is a standard unstack:
entries.set_index(['row','column']).unstack()
where entries is defined in @StuartBerg's answer:
entries = pd.read_csv(StringIO(text), sep='[ ,]+')
output:
value
column A B C D E F G
row
1 32.0 NaN NaN NaN NaN 16.0 NaN
2 NaN 234.0 1345.0 NaN 13.0 NaN 999.0
3 NaN NaN NaN 5332.0 NaN NaN NaN
4 NaN 12.0 NaN NaN NaN NaN NaN
As you suggest, the simplest method might be a for-loop to initialize the non-empty values. Alternatively, you can use pivot() or numpy advanced indexing. All options are shown below.
The only tricky thing is ensuring that your dataframe result will have the complete set of rows and columns, as explained in the update below.
text = """\
row,column,value
1,A 32
1,F 16
2,B 234
2,C 1345
2,E 13
2,G 999
3,D 5332
4,B 12
"""
from io import StringIO
import numpy as np
import pandas as pd
# Load your data and convert the column letters to integers.
# Note: Your example data is delimited with both spaces and commas,
# which is why we need a custom (regex) 'sep' argument here.
entries = pd.read_csv(StringIO(text), sep='[ ,]+', engine='python')
entries['icol'] = entries['column'].map(lambda c: ord(c) - ord('A'))
# Construct an empty DataFrame with the appropriate index and columns.
rows = range(1, 1 + entries['row'].max())
columns = [chr(ord('A') + i) for i in range(1 + entries['icol'].max())]
df = pd.DataFrame(index=rows, columns=columns)
##
## Three ways to populate the dataframe:
##
# Option 1: Iterate in a for-loop
for e in entries.itertuples():
    df.loc[e.row, e.column] = e.value
# Option 2: Use pivot() or unstack()
# (recent pandas versions require keyword arguments for pivot)
df = df.fillna(entries.pivot(index='row', columns='column', values='value'))
# Option 3: Use numpy indexing to overwrite the underlying array:
# (this relies on df.values being a writable view, which may not hold in every pandas version)
irows = entries['row'].values - 1
icols = entries['icol'].values
df.values[irows, icols] = entries['value'].values
Result:
A B C D E F G
1 32 NaN NaN NaN NaN 16 NaN
2 NaN 234 1345 NaN 13 NaN 999
3 NaN NaN NaN 5332 NaN NaN NaN
4 NaN 12 NaN NaN NaN NaN NaN
Update:
Late in the day, it occurred to me that this can be solved via pivot() (or unstack(), as suggested by @piterbarg). I've now included that option above.
In fact, it's tempting to just use pivot() without pre-initializing the DataFrame. HOWEVER, there's an important caveat to that approach: If any particular row or column value remains completely unused in your original entries data, then those rows will remain completely omitted from the final table. That is, if no entry uses row 3, your final table would only contain rows 1,2,4. Likewise, if your data contains no data for columns C,E,G (for example), then you would end up with columns A,B,D,F.
If you want to be sure that your rows use contiguous index values and your columns use a contiguous sequence of letters, then pivot() or unstack() is not enough. You must first initialize the indexes of your dataframe as shown above.
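One way to combine the two ideas is to pivot first and then reindex onto the full set of rows and columns. This is only a sketch on a shortened, comma-only version of the data, chosen so that row 3 and columns C, D and E are unused; the names full_rows, full_cols and grid are made up for the example.

```python
from io import StringIO
import pandas as pd

# Shortened sample: row 3 and columns C, D, E never appear.
text = """\
row,column,value
1,A,32
1,F,16
2,B,234
4,B,12
"""
entries = pd.read_csv(StringIO(text))

# pivot() alone keeps only the rows and columns that occur in the data.
pivoted = entries.pivot(index='row', columns='column', values='value')

# Build the complete, contiguous row and column labels.
full_rows = range(1, entries['row'].max() + 1)
full_cols = [chr(c) for c in range(ord('A'), ord(entries['column'].max()) + 1)]

# reindex() adds the missing rows and columns, filled with NaN.
grid = pivoted.reindex(index=full_rows, columns=full_cols)
print(grid)
```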
import impyute.imputation.cs as imp
print(Data)
Data = pd.DataFrame(data = imp.em(Data),columns = columns)
print(Data)
When I run the above code, all my values get converted to NaN, as shown below. Can someone help me see where I am going wrong?
Before
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 31 5.0 ... 117.50 5.0
1 61 2.0 ... 122.80 3.0
2 116 0.0 ... 137.50 2.5
3 123 0.0 ... 77.58 2.0
4 27 0.0 ... 135.10 3.5
5 77 0.0 ... 84.60 2.5
After
Time LymphNodeStatus ... MeanPerimeter TumorSize
0 NaN NaN ... NaN NaN
1 NaN NaN ... NaN NaN
2 NaN NaN ... NaN NaN
3 NaN NaN ... NaN NaN
4 NaN NaN ... NaN NaN
5 NaN NaN ... NaN NaN
Edited
Solution first
Instead of passing columns to pd.DataFrame, just manually assign column names:
data = pd.DataFrame(imp.em(data))
data.columns = columns
Cause
Error lies in Data = pd.DataFrame(data = imp.em(Data),columns = columns).
imp.em has a decorator @preprocess which converts the input into a numpy.array if it is a pandas.DataFrame.
...
if pd_DataFrame and isinstance(args[0], pd_DataFrame):
    args[0] = args[0].as_matrix()
    return pd_DataFrame(fn(*args, **kwargs))
It therefore returns a dataframe reconstructed from a matrix, having range(data.shape[1]) as column names.
And as I point out below, when pd.DataFrame is instantiated from another pd.DataFrame with mismatching columns, all the contents become NaN.
You can test this by
import pandas as pd
from impyute.util import preprocess

@preprocess
def test(data):
    return data

data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
columns = data.columns
data = pd.DataFrame(test(data), columns = columns)
size time
0 NaN NaN
1 NaN NaN
2 NaN NaN
When you instantiate a pd.DataFrame from an existing pd.DataFrame, the columns argument specifies which columns of the original dataframe you want to use.
It does not relabel the dataframe. This is not odd; it is just how pandas intends reindexing to work.
By default, values in the new index that do not have corresponding records in the dataframe are assigned NaN.
# Make new pseudo dataset
data = pd.DataFrame({"time": [1,2,3], "size": [3,2,1]})
data
size time
0 3 1
1 2 2
2 1 3
#Make new dataset with original `data`
data = pd.DataFrame(data, columns = ["a", "b"])
data
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
There may be a bug in the impyute library. You are using the em function, which fills in missing values with the expectation-maximization algorithm. You can try without that function:
df = pd.DataFrame(data = Data ,columns = columns)
You can raise this as an issue with the library after confirming. To confirm, first load the data as in the example above and check whether it contains null values, using the df.isnull() method.
Data = pd.DataFrame(data = np.array(imp.em(Data)),columns = columns)
Doing this solved the issue I was facing. I guess the em function does not return a NumPy array.
I have a dataframe as shown below (top 3 rows):
Sample_Name Sample_ID Sample_Type IS Component_Name IS_Name Component_Group_Name Outlier_Reasons Actual_Concentration Area Height Retention_Time Width_at_50_pct Used Calculated_Concentration Accuracy
Index
1 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/12:0)_LCB_264.3 NaN NaN NaN 0.1 2.733532e+06 5.963840e+05 2.963911 0.068676 True NaN NaN
2 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown True GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 0.1 2.945190e+06 5.597470e+05 2.745026 0.068086 True NaN NaN
3 20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN Unknown False GluCer(d18:1/16:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 NaN NaN NaN 3.993535e+06 8.912731e+05 2.791991 0.059864 True 125.927659773487 NaN
When trying to generate a pivot table:
pivoted_report_conc = raw_report.pivot(index = "Sample_Name", columns = 'Component_Name', values = "Calculated_Concentration")
I get the following error:
ValueError: Index contains duplicate entries, cannot reshape
I tried resetting the index but it did not help. I couldn't find any duplicate values in the "Index" column. Could someone please help identify the problem here?
The expected output would be a reshaped dataframe with only the unique component names as columns and respective concentrations for each sample name:
Sample_Name GluCer(d18:1/12:0)_LCB_264.3 GluCer(d18:1/17:0)_LCB_264.3 GluCer(d18:1/16:0)_LCB_264.3
20170824_ELN147926_HexLacCer_Plasma_A-1-1 NaN NaN 125.927659773487
To clarify, I am not looking to aggregate the data, just reshape it.
You can use groupby() and unstack() to get around the error you're seeing with pivot().
Here's some example data, with a few edge cases added, and some column values removed or substituted for MCVE:
# df
Sample_Name Sample_ID IS Component_Name Calculated_Concentration Outlier_Reasons
Index
1 foo NaN True x NaN NaN
1 foo NaN True y NaN NaN
2 foo NaN False z 125.92766 NaN
2 bar NaN False x 1.00 NaN
2 bar NaN False y 2.00 NaN
2 bar NaN False z NaN NaN
(df.groupby(['Sample_Name','Component_Name'])
   .Calculated_Concentration
   .first()
   .unstack()
)
Output:
Component_Name x y z
Sample_Name
bar 1.0 2.0 NaN
foo NaN NaN 125.92766
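To see which rows trigger the "Index contains duplicate entries" error, it can help to list the (Sample_Name, Component_Name) pairs that occur more than once, since pivot() needs each pair to be unique. A minimal sketch on made-up data (the values below are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical data: the pair ('foo', 'x') appears twice.
df = pd.DataFrame({
    'Sample_Name': ['foo', 'foo', 'foo', 'bar'],
    'Component_Name': ['x', 'x', 'y', 'x'],
    'Calculated_Concentration': [1.0, 2.0, 3.0, 4.0],
})
keys = ['Sample_Name', 'Component_Name']

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = df[df.duplicated(subset=keys, keep=False)]
print(dupes)
```

If this comes back empty on the real data, pivot() on those two keys should not raise that error.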
You should be able to accomplish what you are looking to do by using the pandas.pivot_table() functionality as documented here.
With your dataframe stored as df use the following code:
import pandas as pd
df = pd.read_table('table_from_which_to_read')
new_df = pd.pivot_table(df,index=['Sample_Name'], columns = 'Component_Name', values = "Calculated_Concentration")
If you want something other than the mean of the concentration value, you will need to change the aggfunc parameter.
EDIT
Since you don't want to aggregate over the values, you can reshape the data by using the set_index function on your DataFrame with documentation found here.
import pandas as pd
df = pd.DataFrame({'NonUniqueLabel':['Item1','Item1','Item1','Item2'],
                   'SemiUniqueLabel':['X','Y','Z','X'], 'Value':[1.0,100,5,None]})
new_df = df.set_index(['NonUniqueLabel','SemiUniqueLabel'])
The resulting table should look like what you expect the results to be and will have a multi-index.
You'll find snippets with reproducible input and an example of desired output at the end of the question.
The challenge:
I have a dataframe like this:
The dataframe has two columns with patterns of 1 and 0 like this:
Or this:
The number of columns will vary, and so will the length of the patterns.
However, the only numbers in the dataframe will be 0 or 1.
I would like to identify these patterns, count each occurrence of them, and build a dataframe containing the results. To simplify the whole thing, I'd like to focus on the ones and ignore the zeros. The desired output in this particular case would be:
I'd like the procedure to identify that, as an example, the pattern [1,1,1] occurs two times in column_A, and not at all in column_B. Notice that I've used the sums of the patterns as indexes in the dataframe.
Reproducible input:
import pandas as pd

df = pd.DataFrame({'column_A':[1,1,1,0,0,0,1,0,0,1,1,1],
                   'column_B':[1,1,1,1,1,0,0,0,1,1,0,0]})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
# pd.datetime was removed from pandas; pd.Timestamp.today() gives the same start date.
datelist = pd.date_range(pd.Timestamp.today().strftime('%Y-%m-%d'), periods=len(df)).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
print(df)
Desired output:
df2 = pd.DataFrame({'pattern':[5,3,2,1],
                    'column_A':[0,2,0,1],
                    'column_B':[1,0,1,0]})
df2 = df2.set_index(['pattern'])
print(df2)
My attempts so far:
I've been working on a solution that includes nested for loops where I calculate running sums that are reset each time an observation equals zero. It also includes functions such as df.apply(lambda x: x.value_counts()). But it's messy to say the least, and so far not 100% correct.
Thank you for any other suggestions!
Here's my attempt:
def fun(ser):
    ser = ser.dropna()
    ser = ser.diff().fillna(ser)
    return ser.value_counts()

df.cumsum().where((df == 1) & (df != df.shift(-1))).apply(fun)
Out:
column_A column_B
1.0 1.0 NaN
2.0 NaN 1.0
3.0 2.0 NaN
5.0 NaN 1.0
The first part (df.cumsum().where((df == 1) & (df != df.shift(-1)))) produces the cumulative sums:
column_A column_B
dates
2017-08-04 NaN NaN
2017-08-05 NaN NaN
2017-08-06 3.0 NaN
2017-08-07 NaN NaN
2017-08-08 NaN 5.0
2017-08-09 NaN NaN
2017-08-10 4.0 NaN
2017-08-11 NaN NaN
2017-08-12 NaN NaN
2017-08-13 NaN 7.0
2017-08-14 NaN NaN
2017-08-15 7.0 NaN
So if we ignore the NaNs and take the diffs, we get the run lengths. That's what the function does: it drops the NaNs and then takes the differences, so the values are no longer cumulative sums. Finally, it returns the value counts.
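The same steps can be followed on a single column. The sketch below uses column_A from the question as a plain Series; the intermediate names run_end, ends and run_lengths are made up for the illustration.

```python
import pandas as pd

s = pd.Series([1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1])  # column_A

# True on the last 1 of every run (the next value differs or is missing).
run_end = (s == 1) & (s != s.shift(-1))

# Cumulative sum sampled at the run ends: 3, 4, 7.
ends = s.cumsum()[run_end]

# Differences between consecutive run ends are the run lengths;
# the first run has no predecessor, so it keeps its own value.
run_lengths = ends.diff().fillna(ends).astype(int)

counts = run_lengths.value_counts()
print(run_lengths.tolist())  # [3, 1, 3]
print(counts)
```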