I would like to sum the frequencies over multiple columns with pandas. The number of columns can vary from 2 to 15. Here is an example with just 3 columns:
code1 code2 code3
27 5 56
534 27 78
27 312 55
89 312 27
And I would like to have the following result:
code frequency
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
Counting values inside a single column is not the problem; I just need the total frequency of each value across the whole dataframe, regardless of the number of columns.
You could stack and take the value_counts on the resulting series:
df.stack().value_counts().sort_index()
5 1
27 4
55 1
56 1
78 1
89 1
312 2
534 1
dtype: int64
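If you need the exact code / frequency layout from the question, a possible follow-up (just a sketch on the same df) is:
out = (df.stack()
         .value_counts()
         .sort_index()
         .rename_axis('code')
         .reset_index(name='frequency'))
out
   code  frequency
0     5          1
1    27          4
2    55          1
3    56          1
4    78          1
5    89          1
6   312          2
7   534          1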
I have a dataframe created by the following code:
dfHubR2I=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby(['SHOP_CODE', dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2I=dfHubR2I['median'].unstack('SHOP_CODE')
dfHubR2I=dfHubR2I.iloc[:date.month-1]
dfHubR2I
It looks like this:
shop code A B C D All Shops
ind
1 23 34 23 56 34
2 13 23 45 47 34
3 56 67 42 85 57
4 3 3 2 6 46
where ind is the month and the letters are different shops.
I then got the median across all shops for each month with this code:
dfHubR2Imonthallshops=dfHubPV2.loc[dfHubPV2['Ind'].dt.year == year, :].groupby([dfHubPV2['Ind'].dt.month])['R2I'].agg(['median']).fillna('-')
dfHubR2Imonthallshops=dfHubR2Imonthallshops.rename(columns={'median':'All Shops'})
dfHubR2Imonthallshops=dfHubR2Imonthallshops.iloc[:date.month-1]
dfHubR2Imonthallshops
which looks like this:
A B C D All shops
median 2 3 4 5 2
And I need to append it onto the bigger dataframe as a row, but when I try to use pd.concat I get the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I'm assuming it's because the larger dataframe has 2 levels, but I'm not sure how to go about getting my final desired result:
shop code A B C D All shops
ind
1 23 34 23 56 34
2 13 23 45 47 34
3 56 67 42 85 57
4 3 3 2 6 46
YTD 2 3 4 5 2
Have you tried to do it with an assignment?
dfHubR2I.loc['YTD', :] = dfHubR2Imonthallshops.loc['median', :]
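If the column labels of the two frames are unique and line up (an assumption on my part, not something the question confirms), a concat-based sketch should also work, relabelling the single row to 'YTD' first:
ytd = dfHubR2Imonthallshops.rename(index={'median': 'YTD'})
dfHubR2I = pd.concat([dfHubR2I, ytd])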
Suppose a data frame with 3 columns looks like this:
. Values Objects Legs
0 1 568 25
1 5 387 56
2 0 526 52
3 3 982 89
4 0 098 09
5 8 697 89
6 0 647 01
I want to create code that says: if row(Values) == 0, split the corresponding row(Objects) with .str[2], count how many times the split number appears in the Legs column, and then create a dataframe with the results. Rows that are not zero should be left as they are. I have the following code, but it returns the error str has no str attribute:
import pandas as pd
df = pd.read_csv('Hello world')
# Making index loop for every 'Values' row
for index in df.index:
    # checking for zero
    if df.loc[index,'Values'] == 0.0:
        # Splitting the 'Objects' row and counting how many times the split str appears in the 'Legs' column when true
        df.loc[df.Legs == df.loc[0,'Objects'].astype(str).str[2], 'Legs'].count()
Expected output
. Values Objects Legs Counts
0 1 568 25
1 5 387 56
2 0 526 52 1 #Counted 52 in 'Legs'
3 3 982 89
4 0 098 09 1 #Counted 09 in 'Legs'
5 8 697 89
6 0 647 01 0 #Counted 64 in 'Legs'
You want to reformat your columns to contain leading zeros when they are read. You can then fill the Counts column as shown here:
df['Objects']=df['Objects'].apply('{:0>3}'.format)
df['Legs']=df['Legs'].apply('{:0>2}'.format)
df['Counts']=None
for index in df.index:
    if df.loc[index,'Values'] == 0.0:
        df.loc[index,'Counts'] = df.loc[df['Legs'] == df.loc[index,'Objects'][:2], 'Legs'].count()
Output:
Values Objects Legs Counts
0 1 568 25 None
1 5 387 56 None
2 0 526 52 1
3 3 982 89 None
4 0 098 09 1
5 8 697 89 None
6 0 647 01 0
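For a larger frame, a vectorized sketch (same column names, and the same leading-zero formatting applied first) avoids the explicit Python loop:
legs_counts = df['Legs'].value_counts()
mask = df['Values'] == 0
df.loc[mask, 'Counts'] = df.loc[mask, 'Objects'].str[:2].map(legs_counts).fillna(0).astype(int)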
I have a very long dataframe with hundreds of rows. I want to select the rows that have a certain keyword in one of the columns and move each of those rows 18 places further down. Since there are so many, using reindex and doing it manually would take too long.
As an example, for this df I would like to move the rows with the word "Base" in column A three rows down, after "Three":
A B C
Base 572 55
One 654 196
Two 2 156
Three 154 123
Base 78 45
One 251 78
Two 5 56
Three 321 59
Base 48 45
One 5 12
Two 531 231
Three 51 123
So, I want it to look like:
A B C
One 654 196
Two 2 156
Three 154 123
Base 572 55
One 251 78
Two 5 56
Three 321 59
Base 78 45
One 5 12
Two 531 231
Three 51 123
Base 48 45
I am new to programming, so I would appreciate your help!
First create an extra dummy column to serve as your sorting key. In this case, as far as I understood you:
ord=["One", "Two", "Three", "Base"]
df["sorting_key"]=df.groupby("A").cumcount().map(str)+":"+df["A"].apply(ord.index).map(str)
Then just sort by it:
df.sort_values("sorting_key")
Result:
A B C sorting_key
1 One 654 196 0:0
2 Two 2 156 0:1
3 Three 154 123 0:2
0 Base 572 55 0:3
5 One 251 78 1:0
6 Two 5 56 1:1
7 Three 321 59 1:2
4 Base 78 45 1:3
9 One 5 12 2:0
10 Two 531 231 2:1
11 Three 51 123 2:2
8 Base 48 45 2:3
Then, to reindex it and drop the dummy column:
df.sort_values("sorting_key").reset_index(drop=True).drop(columns="sorting_key")
Output:
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
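One caveat with the string key (my observation, not part of the answer above): once a value repeats ten or more times, lexicographic order puts "10:..." before "2:...". Sorting on two numeric helper columns avoids that, for example:
order = ["One", "Two", "Three", "Base"]
df["occ"] = df.groupby("A").cumcount()                       # occurrence number within each A value
df["pos"] = df["A"].map({v: i for i, v in enumerate(order)}) # desired position within each cycle
df.sort_values(["occ", "pos"]).drop(columns=["occ", "pos"]).reset_index(drop=True)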
You could do the following:
import numpy as np

# create mask for identifying Base
mask = df.A.eq("Base")
# create index with non base values
non_base = df[~mask].reset_index(drop=True) # reset index
# create DataFrame with Base values
base = df[mask]
base.index = base.index + (3 - np.arange(len(base))) # change index to reflect new indices in result
# concat and sort by index
result = pd.concat([base, non_base], sort=True).sort_index().reset_index(drop=True)
print(result)
Output
A B C
0 One 654 196
1 Two 2 156
2 Three 154 123
3 Base 572 55
4 One 251 78
5 Two 5 56
6 Three 321 59
7 Base 78 45
8 One 5 12
9 Two 531 231
10 Three 51 123
11 Base 48 45
I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas; I'm curious if there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and then group those values by month and calculate the corresponding monthly sum:
>>> df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>> df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> import datetime
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strftime(datetime.datetime.strptime(str(x),'%m'),'%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
The trick is to use .apply() combined with dropna().
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
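A minimal sketch of what that line does, reproducing the example pivot from the question (column names are placeholders; note that every relative month is kept, so a fourth column with NaN appears for the shorter row):
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date1': [1, np.nan],
                   'Date2': [1, 1],
                   'Date3': [2, 2],
                   'Date4': [3, 2]})

# each row becomes a column after .T, its NaNs are dropped,
# and the remaining values are re-aligned by position
relative = df.T.apply(lambda x: pd.Series(x.dropna().values)).T
relative.columns = ['FirstMonth', 'SecondMonth', 'ThirdMonth', 'FourthMonth']
relative
   FirstMonth  SecondMonth  ThirdMonth  FourthMonth
0         1.0          1.0         2.0          3.0
1         1.0          2.0         2.0          NaN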
I'm working on an ML project for a class. I'm currently cleaning the data and I encountered a problem. I basically have a column (identified as dtype object) that holds ratings about a certain aspect of a hotel. When I checked what the values of this column were and how frequently they appeared, I noticed that there are some wrong values in it (as you can see below, instead of ratings, some rows have a date as a value):
rating value_counts()
100 527
98 229
97 172
99 163
96 150
95 127
93 100
90 94
94 93
80 65
92 55
91 39
88 35
89 32
87 31
85 25
86 17
84 12
60 12
83 8
70 5
73 5
82 4
78 3
67 3
2018-11-11 3
20 2
81 2
2018-11-03 2
40 2
79 2
75 2
2018-10-26 2
2 1
2018-08-30 1
2018-09-03 1
2015-09-05 1
55 1
2018-10-12 1
2018-05-11 1
2018-11-14 1
2018-09-15 1
2018-04-07 1
2018-08-16 1
71 1
2018-09-18 1
2018-11-05 1
2018-02-04 1
NaN 1
What I wanted to do was to replace all the values that look like dates with NaN so I can later fill them with appropriate values. Is there a good way to do this other than selecting each different date one by one and replacing it with a NaN? Is there a way to select similar values (in this case all the dates that start in the same way, 2018) and replace them all?
Thank you for taking the time to read this!!
There are multiple options to clean this data.
Option 1: The rating column is of object type, so search the strings for the presence of '-' and replace the matches with np.nan.
df.loc[df['rating'].str.contains('-', na = False), 'rating'] = np.nan
Option 2: Convert the column to numeric, which will coerce the dates to NaN.
df['rating'] = pd.to_numeric(df['rating'], errors = 'coerce')
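A toy sketch of option 2, with made-up values mirroring the question:
s = pd.Series(['100', '98', '2018-11-11', '95'])
pd.to_numeric(s, errors='coerce')
0    100.0
1     98.0
2      NaN
3     95.0
dtype: float64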