I have a dataframe with 3 columns that looks like this:
Model IsJapanese IsGerman
BenzC 0 1
BensGla 0 1
HondaAccord 1 0
HondaOdyssey 1 0
ToyotaCamry 1 0
I want to create a new dataframe and have TotalJapanese and TotalGerman as two columns in the same dataframe.
I can achieve this by creating 2 different dataframes, but I am wondering how to get both counts in a single dataframe.
Please suggest. Thank you!
Edit: adding another similar dataframe to this question (sorry, not sure whether it's allowed, but trying).
Second dataset: I am trying to save multiple counts in a single dataframe, based on repetition of data.
Here is my sample dataset
Store Address IsLA IsGA
Albertsons Cross St 1 0
Safeway LeoSt 0 1
Albertsons Main St 0 1
RiteAid Culver St 1 0
My aim is to prepare a new dataset with multiple counts per store
The result should be like this
Store TotalStores TotalLA TotalGA
Albertsons 2 1 1
Safeway 1 0 1
RiteAid 1 1 0
Is it possible to achieve this in a single dataframe?
Thanks!
One way would be to store the sum of Japanese cars and German cars, and manually create a dataframe using them:
j, g = sum(df['IsJapanese']), sum(df['IsGerman'])
total_df = pd.DataFrame({'TotalJapanese': j,
                         'TotalGerman': g}, index=['Totals'])
print(total_df)
TotalJapanese TotalGerman
Totals 3 2
Another way would be to transpose (T) your dataframe, sum(axis=1), and transpose back:
total_df_v2 = pd.DataFrame(df.set_index('Model').T.sum(axis=1)).T
print(total_df_v2)
   IsJapanese  IsGerman
0           3         2
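Both approaches above can also be collapsed into a single chained expression. Here is a minimal sketch, assuming the sample data from the question; the variable name totals is only illustrative:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Model": ["BenzC", "BensGla", "HondaAccord", "HondaOdyssey", "ToyotaCamry"],
    "IsJapanese": [0, 0, 1, 1, 1],
    "IsGerman": [1, 1, 0, 0, 0],
})

totals = (
    df[["IsJapanese", "IsGerman"]]
    .sum()                # Series indexed by the two column names
    .rename({"IsJapanese": "TotalJapanese", "IsGerman": "TotalGerman"})
    .to_frame("Totals")   # one column named "Totals"
    .T                    # flip to one row, two columns
)
print(totals)
```

Selecting the two flag columns first avoids having to move Model into the index before summing.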
To answer your 2nd question, you can group by your 'Store' column and call DataFrameGroupBy.agg, applying count to Address and sum to the other two columns. Then you can rename() the columns if needed:
resulting_df = df.groupby('Store').agg({'Address': 'count',
                                        'IsLA': 'sum',
                                        'IsGA': 'sum'}).\
    rename({'Address': 'TotalStores',
            'IsLA': 'TotalLA',
            'IsGA': 'TotalGA'}, axis=1)
Prints:
TotalStores TotalLA TotalGA
Store
Albertsons 2 1 1
RiteAid 1 1 0
Safeway 1 0 1
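If your pandas version supports named aggregation (0.25 or later, as far as I know), the aggregation and the rename can be done in one step. A self-contained sketch using the sample data from the question; the names stores and summary are my own:

```python
import pandas as pd

# Sample data copied from the question.
stores = pd.DataFrame({
    "Store": ["Albertsons", "Safeway", "Albertsons", "RiteAid"],
    "Address": ["Cross St", "LeoSt", "Main St", "Culver St"],
    "IsLA": [1, 0, 0, 1],
    "IsGA": [0, 1, 1, 0],
})

summary = (
    stores.groupby("Store", sort=False)   # sort=False keeps first-seen order
    .agg(
        TotalStores=("Address", "count"),  # number of rows per store
        TotalLA=("IsLA", "sum"),
        TotalGA=("IsGA", "sum"),
    )
    .reset_index()                         # turn Store back into a column
)
print(summary)
```

With sort=False the stores come out in the order they first appear, which matches the layout asked for in the question.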
This is my first question on stackoverflow.
I'm implementing a Machine Learning classification algorithm and I want to generalize it for any input dataset that has its target class in the last column. For that, I want to modify all values of this column without needing to know the names of the columns or rows, using pandas in python.
For example, let's suppose I load a dataset:
dataset = pd.read_csv('random_dataset.csv')
Let's say the last column has the following data:
0 dog
1 dog
2 cat
3 dog
4 cat
I want to change each "dog" occurrence to 1 and each "cat" occurrence to 0, so that the column would look like this:
0 1
1 1
2 0
3 1
4 0
I have found some ways of changing the values of specific cells using pandas, but for this case, what would be the best way to do that?
I appreciate each answer.
You can use pandas.Categorical. The codes follow the sorted order of the categories, so here cat becomes 0 and dog becomes 1:
df['column'] = pd.Categorical(df['column']).codes
You can also use the built-in category dtype for this:
df['column'] = df['column'].astype('category').cat.codes
Use map to assign the values explicitly:
df['col_name'] = df['col_name'].map({'dog' : 1 , 'cat': 0})
Or use factorize (it encodes the object as an enumerated type) if any numeric codes will do. The codes are assigned in order of first appearance, so with this data dog would become 0 and cat 1:
df['col_name'] = df['col_name'].factorize()[0]
OUTPUT (of the map version):
0 1
1 1
2 0
3 1
4 0
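Since the question asks for the last column without knowing its name, the column label can be looked up by position first. A minimal sketch with made-up data; the column names weight and animal are hypothetical and only stand in for whatever the CSV contains:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv('random_dataset.csv').
dataset = pd.DataFrame({
    "weight": [30, 25, 4, 28, 5],
    "animal": ["dog", "dog", "cat", "dog", "cat"],
})

target = dataset.columns[-1]   # label of the last column, whatever it is
dataset[target] = dataset[target].map({"dog": 1, "cat": 0})
print(dataset)
```

Note that map turns any value missing from the dictionary into NaN, so for arbitrary datasets the category-code approaches above may be more convenient.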
I've got one large matrix as a pandas DF without any named 'keys', just plain numbers as column labels. A smaller version, just to demonstrate the problem, would be this input:
M=pd.DataFrame(np.random.rand(4,5))
What I want to accomplish is using another given DF as reference that has a structure like this
N=pd.DataFrame({'A':[2,2,2],'B':[2,3,4]})
...to extract the values from the large DF, where the values of 'A' correspond to the ROW number and the values of 'B' to the COLUMN number of the large DF, so that the expected output would look like this:
Large DF
0 1 2 3 4
0 0.766275 0.910825 0.378541 0.775416 0.639854
1 0.505877 0.992284 0.720390 0.181921 0.501062
2 0.439243 0.416820 0.285719 0.100537 0.429576
3 0.243298 0.560427 0.162422 0.631224 0.033927
Small DF
A B
0 2 2
1 2 3
2 2 4
Expected Output:
A B extracted values
0 2 2 0.285719
1 2 3 0.100537
2 2 4 0.429576
So far I've tried different versions of something like this:
N['extracted'] = M.iloc[N['A'].astype(int):,N['B'].astype(int)]
..but it keeps failing with an error saying
TypeError: cannot do positional indexing on RangeIndex with these indexers
[0 2
1 2
2 2
Which approach would be best?
Is this job better accomplished by converting the DFs into NumPy arrays?
Thanks for the help!
I think you want to use the apply function. This goes row by row through your data set.
N['extracted'] = N.apply(lambda row: M.iloc[row['A'], row['B']], axis=1)
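On the NumPy part of the question: for a large matrix, integer-array indexing on the underlying array avoids the row-by-row apply. A sketch under the assumption that A and B are positional (0-based) row and column numbers; I use a seeded generator instead of np.random.rand only to make the example reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
M = pd.DataFrame(rng.random((4, 5)))
N = pd.DataFrame({"A": [2, 2, 2], "B": [2, 3, 4]})

# Pair each row number in A with the column number in B at the same position.
rows = N["A"].to_numpy()
cols = N["B"].to_numpy()
N["extracted"] = M.to_numpy()[rows, cols]
print(N)
```

This picks one element per (row, column) pair, which is different from M.iloc[rows, cols], where the two lists would select a whole sub-block.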
I'm new to Pandas, but thanks to "Add column with constant value to pandas dataframe" I was able to add different columns at once with
c = {'new1': 'w', 'new2': 'y', 'new3': 'z'}
df.assign(**c)
However I'm trying to figure out what's the path to take when I want to add a new column to a dataframe (currently 1.2 million rows * 23 columns).
Let's simplify the df a bit and try to make it more clear:
Order Orderline Product
1 0 Laptop
1 1 Bag
1 2 Mouse
2 0 Keyboard
3 0 Laptop
3 1 Mouse
I would like to add a new column that is 1 for every row of an Order containing at least one Product == Bag, and 0 otherwise.
Result would become:
Order Orderline Product HasBag
1 0 Laptop 1
1 1 Bag 1
1 2 Mouse 1
2 0 Keyboard 0
3 0 Laptop 0
3 1 Mouse 0
What I could do is find all the unique order numbers, then filter out the subframe, check the Product column for Bag, if found then add 1 to a new column, otherwise 0, and then replace the original subframe with the result.
There is likely a much better and more performant way to accomplish this.
The main reason I'm trying to do this, is to flatten things down later on. Every order should become 1 line with some values of product. I don't need the information for Bag anymore but I want to keep in my dataframe if the original order used to have a Bag (1) or no Bag (0).
Ultimately when the data is cleaned out it can be used as a base for scikit-learn (or that's what I hope).
If I understand you correctly, you want GroupBy.transform.any
First we create a boolean array by checking which rows in Product are Bag with Series.eq. Then we GroupBy on this boolean array and check if any of the values are True. We use transform to keep the shape of our initial array so we can assign the values back.
df['ind'] = df['Product'].eq('Bag').groupby(df['Order']).transform('any').astype(int)
Order Orderline Product ind
0 1 0 Laptop 1
1 1 1 Bag 1
2 1 2 Mouse 1
3 2 0 Keyboard 0
4 3 0 Laptop 0
5 3 1 Mouse 0
I have a column in a dataframe (df['Values']) with 1000 rows of repeating codes A30, A31, A32, A33, A34. I want to create five separate columns with headings colA30, colA31, colA32, colA33, colA34 in the same dataframe (df), holding 0 or 1 depending on which code the row has in df['Values'].
For example, df:
Values colA30 colA31 colA32 colA33 colA34
A32 0 0 1 0 0
A30 1 0 0 0 0
A31 0 1 0 0 0
A34 0 0 0 0 1
A33 0 0 0 1 0
So if a row in df['Values'] is A32, then colA32 should be 1 and all other new columns should be 0, and so on for the rest of the codes in df['Values'].
I did it in the following way. But is there any way to do it in one shot, as I have multiple columns with several codes for which multiple columns are to be created?
df['A30']=df['Values'].map(lambda x : 1 if x=='A30' else 0)
df['A31']=df['Values'].map(lambda x : 1 if x=='A31' else 0)
df['A32']=df['Values'].map(lambda x : 1 if x=='A32' else 0)
df['A33']=df['Values'].map(lambda x : 1 if x=='A33' else 0)
df['A34']=df['Values'].map(lambda x : 1 if x=='A34' else 0)
You can do this in several ways:
In pandas there is a function called pd.get_dummies() that converts a categorical column into one binary (0/1) column per category. Apply it to your categorical column and then concatenate the resulting dataframe with the original one.
See the pandas documentation for pd.get_dummies.
Another way would be to use the sklearn library and its OneHotEncoder. It does the same thing, but through an estimator object: you create a OneHotEncoder instance and fit it to your categorical data.
For your case I'd use pd.get_dummies(), as it's simpler to use.
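As a rough illustration of the get_dummies route, here is a small sketch on the five example rows from the question. The prefix and prefix_sep arguments are one way to get the colA30-style headings; dtype=int is my choice so the columns hold 0/1 rather than booleans:

```python
import pandas as pd

# The five example rows from the question.
df = pd.DataFrame({"Values": ["A32", "A30", "A31", "A34", "A33"]})

# One 0/1 column per distinct code, named colA30 ... colA34.
dummies = pd.get_dummies(df["Values"], prefix="col", prefix_sep="", dtype=int)

# Attach the indicator columns to the original dataframe.
df = pd.concat([df, dummies], axis=1)
print(df)
```

For several categorical columns at once, get_dummies also accepts the whole dataframe together with a columns= list, which would cover the "multiple columns with several codes" case.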
I have a dataframe like
GULOSS GRLoss
1 1
2 2
3 3
I want to sum in such a way that I get
GULOSS GRLoss Post
6 6 0
Post does not exist in the initial dataframe but is required in the final one: if a column does not exist, its sum should be 0.
Assuming I understand your question correctly, here's how I would do it:
if 'Post' not in data.columns:
    data['Post'] = 0
datasum = data.sum()
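If there may be several such optional columns, reindex can add all missing ones in one go. A minimal sketch, assuming the list of required columns is known up front; the name wanted is my own:

```python
import pandas as pd

data = pd.DataFrame({"GULOSS": [1, 2, 3], "GRLoss": [1, 2, 3]})

# Columns required in the result; any that are missing get filled with 0.
wanted = ["GULOSS", "GRLoss", "Post"]

datasum = (
    data.reindex(columns=wanted, fill_value=0)
    .sum()         # Series of column totals
    .to_frame()    # single-column dataframe
    .T             # one row, one column per total
)
print(datasum)
```

The final to_frame().T step is only there to get the one-row layout shown in the question; data.sum() on its own returns a Series.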