New pandas dataframe from meta information of existing DF - python

I currently have a CSV file that I load into a DataFrame as follows:
[in]
df = pd.read_csv(file_name)
df.sort_values('TOTAL_MONTHS', inplace=True)
print(df[['TOTAL_MONTHS', 'COUNTEM']])
[out]
TOTAL_MONTHS COUNTEM
12 0
12 0
12 2
25 10
25 0
37 1
68 3
I want to get the total number of rows (by TOTAL_MONTHS) for which the 'COUNTEM' value falls within a preset bin.
The data is going to be entered into a histogram via excel/powerpoint with:
X-axis = Number of contracts
Y-axis = Total_months
Color of bar = COUNTEM
The input of the graph is like this (columns being COUNTEM bins):
MONTHS 0 1-3 4-6 7-10 10+ 20+
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
...
12 2 1 0 0 0 0
...
25 1 0 0 0 1 0
...
37 0 1 0 0 0 0
...
68 0 1 0 0 0 0
Ideally I'd like the code to output a dataframe in that format.

Interesting problem. I don't know pandas that well, so there may well be a fancier and simpler solution to this. However, it can also be done by iterating, as follows:
#First, imports and create your data
import pandas as pd
DF = pd.DataFrame({'TOTAL_MONTHS' : [12, 12, 12, 25, 25, 37, 68],
'COUNTEM' : [0, 0, 2, 10, 0, 1, 3]
})
#Next create a data frame of 'bins' with the months as index and all
#values set at a default of zero
New_DF = pd.DataFrame({'bin0' : 0,
'bin1' : 0,
'bin2' : 0,
'bin3' : 0,
'bin4' : 0,
'bin5' : 0},
index = DF.TOTAL_MONTHS.unique())
In [59]: New_DF
Out[59]:
bin0 bin1 bin2 bin3 bin4 bin5
12 0 0 0 0 0 0
25 0 0 0 0 0 0
37 0 0 0 0 0 0
68 0 0 0 0 0 0
#Create a list of bins (rather than 20 to infinity I limited it to 100)
bins = [[0], range(1, 4), range(4, 7), range(7, 10), range(10, 20), range(20, 100)]
#Now iterate over the months of the New_DF index and slice the original
#DF where TOTAL_MONTHS equals the month of the current iteration. Then
#get a value count from the original data frame and add each count to
#the column of New_DF whose bin contains that COUNTEM value:
for month in New_DF.index:
    monthly = DF[DF['TOTAL_MONTHS'] == month]
    counts = monthly['COUNTEM'].value_counts()
    for count in counts.keys():
        for x in range(len(bins)):
            if count in bins[x]:
                #Use += so that distinct values falling in the same bin accumulate
                New_DF.loc[month, New_DF.columns[x]] += counts[count]
Which gives me:
In [62]: New_DF
Out[62]:
bin0 bin1 bin2 bin3 bin4 bin5
12 2 1 0 0 0 0
25 1 0 0 0 1 0
37 0 1 0 0 0 0
68 0 1 0 0 0 0
Which appears to be what you want. You can rename the index as you see fit....
Hope this helps. Perhaps someone has a solution that uses a built in pandas function, but for now this seems to work.
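For comparison, here is a minimal vectorized sketch of the same binning using pd.cut and pd.crosstab. It is not the method from the answer above; the bin edges and the bin0..bin5 labels are assumptions chosen to mirror the ranges used in the loop, with the last bin left open-ended instead of capped at 100.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TOTAL_MONTHS": [12, 12, 12, 25, 25, 37, 68],
    "COUNTEM": [0, 0, 2, 10, 0, 1, 3],
})

# Assumed bin edges (right-inclusive): 0, 1-3, 4-6, 7-9, 10-19, 20 and above.
edges = [-1, 0, 3, 6, 9, 19, np.inf]
labels = ["bin0", "bin1", "bin2", "bin3", "bin4", "bin5"]

# Assign each COUNTEM value to a bin label; plain strings keep the
# crosstab columns simple to reindex.
binned = pd.cut(df["COUNTEM"], bins=edges, labels=labels).astype(str)

# Count rows per (month, bin), then add the missing months and empty bins.
table = pd.crosstab(df["TOTAL_MONTHS"], binned).reindex(
    index=range(df["TOTAL_MONTHS"].max() + 1),
    columns=labels,
    fill_value=0,
)
print(table.loc[[12, 25, 37, 68]])
```

Reindexing on both axes is what produces a row for every month from 0 upward and a column for every bin, which is the layout the question asks for.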


How to pivot dataframe into ML format

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns and then have binary rows 1 - if the column is present, 0 if none.
IDK if it is at all possible to do with like a single function or not, but would appreciate all and any help!
To expand on @Corralien's answer:
It is indeed a way to do it, but since you are doing this for ML purposes, you might introduce a bug.
With that code you get a matrix with 20 features. Now, say you want to predict on data that suddenly contains one more month than your training data: the matrix for your prediction data would have 21 features, so you cannot pass it to your fitted model.
To overcome this you can use OneHotEncoder from scikit-learn. It makes sure that "new data" always gets the same number of features as your training data.
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
As you can see, the order of the binary color columns has also changed.
If, on the other hand, we use OneHotEncoder, we can avoid all those issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed= ohe.fit_transform(df_train[["color"]]) #creates sparse matrix
ohe_features = ohe.get_feature_names_out() # [color_blue, color_red]
pd.DataFrame(color_ohe_transformed.todense(),columns = ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row blue and red are both zero, since that row has color="green", which was not present in the training data.
Note that todense() is only used here to illustrate how it works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
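If adding scikit-learn is not an option, a pandas-only workaround is to remember the training columns and reindex the dummies of any new data onto them. This is a sketch of a different technique from the OneHotEncoder approach above, and the variable names are illustrative only.

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

# Encode the training data and remember its column layout.
train_encoded = pd.get_dummies(df_train, dtype=int)
train_columns = train_encoded.columns

# Encode the new data, then force it onto the training layout:
# unseen categories (color_green) are dropped, missing ones are filled with 0.
new_encoded = pd.get_dummies(df_new, dtype=int).reindex(
    columns=train_columns, fill_value=0
)
print(new_encoded)
```

As with handle_unknown="ignore", the row with the unseen color ends up with zeros in every color column.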
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
You can use the pandas get_dummies function to convert row values into columns.
For that, your code would be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)

Parsing values to specific columns in Pandas

I would like to use Pandas to parse Q26 Challenges into the subsequent columns, with a "1" representing its presence in the original unparsed column. So the data frame initially looks like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      0      0      0
2   1,2             0      0      0      0      0      0      0
3   1,3,7           0      0      0      0      0      0      0
And I want it to look like this:
ID  Q26 Challenges  Q26_1  Q26_2  Q26_3  Q26_4  Q26_5  Q26_6  Q26_7
1   5               0      0      0      0      1      0      0
2   1,2             1      1      0      0      0      0      0
3   1,3,7           1      0      1      0      0      0      1
You can iterate over the range of values in Q26 Challenges, using str.contains to check if the current value is contained in the string and then converting that boolean value to an integer. For example:
import pandas as pd
df = pd.DataFrame({'id' : [1, 2, 3, 4, 5], 'Q26 Challenges': ['0', '1,2', '2', '1,2,6,7', '3,4,5,11' ] })
for i in range(1, 12):
    # \b word boundaries keep '1' from matching inside '11'
    df[f'Q26_{i}'] = df['Q26 Challenges'].str.contains(rf'\b{i}\b').astype(int)
df
Output:
id Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7 Q26_8 Q26_9 Q26_10 Q26_11
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1,2 1 1 0 0 0 0 0 0 0 0 0
2 3 2 0 1 0 0 0 0 0 0 0 0 0
3 4 1,2,6,7 1 1 0 0 0 1 1 0 0 0 0
4 5 3,4,5,11 0 0 1 1 1 0 0 0 0 0 1
str.get_dummies can be used on the 'Q26 Challenges' column to create the indicator values. This indicator DataFrame can be reindexed to include the complete result range (note column headers will be of type string). add_prefix can be used to add the 'Q26_' to the column headers. Lastly, join back to the original DataFrame:
df = df.join(
    df['Q26 Challenges'].str.get_dummies(sep=',')
    .reindex(columns=map(str, range(1, 8)), fill_value=0)
    .add_prefix('Q26_')
)
The reindexing can also be done dynamically based on the resulting columns. It is necessary to convert the resulting column headers to numbers first to ensure numeric order, rather than lexicographic ordering:
s = df['Q26 Challenges'].str.get_dummies(sep=',')
# Convert to numbers to correctly access min and max
s.columns = s.columns.astype(int)
# Add back to DataFrame
df = df.join(s.reindex(
    # Build range from the min column to max column values
    columns=range(min(s.columns), max(s.columns) + 1),
    fill_value=0
).add_prefix('Q26_'))
Both options produce:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
0 1 5 0 0 0 0 1 0 0
1 2 1,2 1 1 0 0 0 0 0
2 3 1,3,7 1 0 1 0 0 0 1
Given initial input:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3],
'Q26 Challenges': ['5', '1,2', '1,3,7']
})
ID Q26 Challenges
0 1 5
1 2 1,2
2 3 1,3,7
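One caveat: the question's frame already contains Q26_1 to Q26_7 filled with zeros, and joining columns with the same names would clash. A minimal sketch of one way around that, assuming the placeholder columns can simply be replaced, is to drop them before joining:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Q26 Challenges": ["5", "1,2", "1,3,7"],
})
# Recreate the all-zero placeholder columns described in the question.
for i in range(1, 8):
    df[f"Q26_{i}"] = 0

dummies = (
    df["Q26 Challenges"].str.get_dummies(sep=",")
    .reindex(columns=[str(i) for i in range(1, 8)], fill_value=0)
    .add_prefix("Q26_")
)

# Drop the placeholders first so the join does not hit overlapping names.
df = df.drop(columns=dummies.columns).join(dummies)
print(df)
```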

How to create new column based off values from existing columns in pandas

I have a DataFrame with 171 rows and 11 columns.
Each of the 11 columns contains only 0s and 1s.
How can I create a new column that is either 0 or 1, depending on whether the existing columns hold a majority of 0s or 1s?
You could do:
(df.sum(axis=1)>df.shape[1]/2)+0
import numpy as np
import pandas as pd
X = np.asarray([(0, 0, 0),
(0, 0, 1),
(0, 1, 1),
(1, 1, 1)])
df = pd.DataFrame(X)
df['majority'] = (df.mean(axis=1) > 0.5) + 0
df
Take the mean of each row and compare it with 0.5, using Series.gt for strictly greater or Series.ge for greater than or equal (the choice decides the output when a row has the same number of 0s and 1s), then convert the boolean mask to integers with Series.astype:
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).gt(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 0
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1
np.random.seed(20193)
df = pd.DataFrame(np.random.choice([0,1], size=(5, 4)))
df['new'] = df.mean(axis=1).ge(0.5).astype(int)
print (df)
0 1 2 3 new
0 1 1 0 0 1
1 1 1 1 0 1
2 0 0 1 0 0
3 1 1 0 1 1
4 1 1 1 1 1

One hot encoding - dummies - in several columns and then concating with original df with pandas

I have a df with several nominal categorical columns that I would want to create dummies for. Here's a mock df:
data = {'Frukt':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Vikt':[23, 45, 31, 28, 62, 12, 44, 42, 23, 32],
'Färg':['grön', 'gul', 'röd', 'grön', 'grön', 'gul', 'röd', 'röd', 'gul', 'grön'],
'Smak':['god', 'sådär', 'supergod', 'rälig', 'rälig', 'supergod', 'god', 'god', 'rälig', 'god']}
df = pd.DataFrame(data)
I have tried naming the columns I want to get dummies from:
nomcols = ['Färg', 'Smak']
for column in ['nomcols']:
    dummies = pd.get_dummies(df[column])
    df[dummies.columns] = dummies
which was a tip I got from another question that I found, but it didn't work. I have looked at the other four questions that are similar but haven't had any luck since most of them get dummies from ALL the columns in the df.
What I would like is something like this:
Use get_dummies and pass the columns to encode as a list, then drop the prefix and separator from the new column names by setting prefix and prefix_sep to empty strings:
nomcols = ['Färg', 'Smak']
df = pd.get_dummies(df, columns=nomcols, prefix='', prefix_sep='')
print (df)
Frukt Vikt grön gul röd god rälig supergod sådär
0 1 23 1 0 0 1 0 0 0
1 2 45 0 1 0 0 0 0 1
2 3 31 0 0 1 0 0 1 0
3 4 28 1 0 0 0 1 0 0
4 5 62 1 0 0 0 1 0 0
5 6 12 0 1 0 0 0 1 0
6 7 44 0 0 1 1 0 0 0
7 8 42 0 0 1 1 0 0 0
8 9 23 0 1 0 0 1 0 0
9 10 32 1 0 0 1 0 0 0
What you did was more or less correct.
But you wrote:
for column in ['nomcols']:
    dummies = pd.get_dummies(df[column])
That iterates over a list holding the literal string 'nomcols', so you end up trying to access df at 'nomcols', a column that does not exist. What you wanted to do was:
dummies = pd.get_dummies(df[nomcols])
You want to access the dataframe at the column names inside the nomcols list:
nomcols = ['Färg', 'Smak']
for column in nomcols:
    dummies = pd.get_dummies(df[column])
The above code should work.
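The loop on its own only computes the dummies, so here is a short sketch of how they could be attached to the rest of the frame with pd.concat. It uses a shortened copy of the mock data, and dropping the original nominal columns is an assumption about the desired result.

```python
import pandas as pd

df = pd.DataFrame({
    "Frukt": [1, 2, 3, 4],
    "Vikt": [23, 45, 31, 28],
    "Färg": ["grön", "gul", "röd", "grön"],
    "Smak": ["god", "sådär", "supergod", "rälig"],
})

nomcols = ["Färg", "Smak"]

# Build one dummy frame per nominal column.
dummy_frames = [pd.get_dummies(df[column], dtype=int) for column in nomcols]

# Keep the non-nominal columns and append all dummy frames side by side.
result = pd.concat([df.drop(columns=nomcols)] + dummy_frames, axis=1)
print(result)
```

Collecting the frames in a list and concatenating once avoids growing the DataFrame column by column inside the loop.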

Pandas: Adding zero values where no rows exist (sparse)

I have a Pandas DataFrame with a MultiIndex. The MultiIndex has values in the range (0,0) to (1000,1000), and the column has two fields p and q.
However, the DataFrame is sparse. That is, if there was no measurement corresponding to a particular index (say (3,2)), there won't be any row for it (3,2). I'd like to make it not sparse, by filling in these rows with p=0 and q=0. Continuing the example, if I do df.loc[3].loc[2], I want it to return p=0 q=0, not No Such Record (as it currently does).
Clarification: By "sparse", I mean it only in the sense I used it, that zero values are omitted. I'm not referring to anything in Pandas or Numpy internals.
Consider this df
data = {
(1, 0): dict(p=1, q=1),
(3, 2): dict(p=1, q=1),
(5, 4): dict(p=1, q=1),
(7, 6): dict(p=1, q=1),
}
df = pd.DataFrame(data).T
df
p q
1 0 1 1
3 2 1 1
5 4 1 1
7 6 1 1
Use reindex with fill_value=0 and an index constructed with pd.MultiIndex.from_product:
mux = pd.MultiIndex.from_product([range(8), range(8)])
df.reindex(mux, fill_value=0)
p q
0 0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
1 0 1 1
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
2 0 0 0
1 0 0
2 0 0
3 0 0
Response to comment:
You can get the min and max of the index levels like this:
def mn_mx(idx):
    return idx.min(), idx.max()
mn0, mx0 = mn_mx(df.index.levels[0])
mn1, mx1 = mn_mx(df.index.levels[1])
mux = pd.MultiIndex.from_product([range(mn0, mx0 + 1), range(mn1, mx1 + 1)])
df.reindex(mux, fill_value=0)
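Putting the pieces together, the following self-contained sketch wraps the min/max reindexing in a small helper and checks that a lookup on a previously missing index now returns zeros. The helper name densify is hypothetical, and it assumes integer index levels with no unused level values.

```python
import pandas as pd

data = {
    (1, 0): dict(p=1, q=1),
    (3, 2): dict(p=1, q=1),
    (5, 4): dict(p=1, q=1),
    (7, 6): dict(p=1, q=1),
}
df = pd.DataFrame(data).T


def densify(frame, fill_value=0):
    """Reindex onto the full product of each level's min..max range."""
    ranges = [
        range(int(level.min()), int(level.max()) + 1)
        for level in frame.index.levels
    ]
    full_index = pd.MultiIndex.from_product(ranges)
    return frame.reindex(full_index, fill_value=fill_value)


dense = densify(df)
print(dense.loc[(3, 2)])  # a row that existed before: p=1, q=1
print(dense.loc[(2, 2)])  # a row that was missing: p=0, q=0
```

Keep in mind that a full (0,0) to (1000,1000) grid has about a million rows, so this trades memory for simpler lookups.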
