I have data like this:
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 7 |
| 2 | 2 | 7 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
+---+---+---+
I need to count the unique values of each column and report them like below:
+---+---+---+
| A | 3 | 3 |
| A | 2 | 1 |
| A | 1 | 1 |
| B | 2 | 5 |
| C | 1 | 3 |
| C | 7 | 2 |
+---+---+---+
I have no issue when the number of columns is small and I can name them manually, but when the input file is big this becomes hard. I need a simple way to produce the output.
Here is the code I have:
import pandas as pd
df=pd.read_csv('1.csv')
A=df['A']
B=df['B']
C=df['C']
df1=A.value_counts()
df2=B.value_counts()
df3=C.value_counts()
all = {'A': df1,'B': df2,'C': df3}
result = pd.concat(all)
result.to_csv('out.csv')
Use DataFrame.stack with SeriesGroupBy.value_counts, then convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
df=pd.read_csv('1.csv')
result = (df.stack()
.groupby(level=1)
.value_counts()
.rename_axis(['X','Y'])
.reset_index(name='Z'))
print (result)
X Y Z
0 A 3 3
1 A 1 1
2 A 2 1
3 B 2 5
4 C 1 3
5 C 7 2
result.to_csv('out.csv', index=False)
You can loop over the columns and insert each column's value counts into a dictionary. Initialize the dictionary with counts = {}; to keep this scalable, get all the column names with df.columns.
Try this code:
import pandas as pd

df = pd.read_csv('1.csv')
counts = {}  # column name -> value counts (avoid shadowing the built-in `all`)
for col in df.columns:
    counts[col] = df[col].value_counts()

result = pd.concat(counts)
result.to_csv('out.csv')
To find the unique values of a DataFrame column:
df.A.unique()
To count the unique values:
len(df.A.unique())
unique() returns an array, so use len() to get the count.
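If you only need the number of distinct values per column, pandas also offers DataFrame.nunique, which computes this for every column at once; a minimal sketch using the sample data from the question:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({'A': [1, 2, 3, 3, 3],
                   'B': [2, 2, 2, 2, 2],
                   'C': [7, 7, 1, 1, 1]})

# number of distinct values in each column
counts = df.nunique()
print(counts)  # A: 3, B: 1, C: 2
```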
This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 1 year ago.
I have a pandas DataFrame in a format like this:
| Group | ID_LIST |
| -------- | -------------- |
| A | [1,2,3] |
| B | [1,3,5] |
| C | [2,4] |
I would like to split each list into separate rows, like this:
| Group | ID_LIST |
| -------- | -------------- |
| A | 1 |
| A | 2 |
| A | 3 |
| B | 1 |
| B | 3 |
| B | 5 |
| C | 2 |
| C | 4 |
Is it possible to do this with a pandas function, or should I convert the column to lists instead?
If the ID_LIST column contains real lists:
>>> df.explode('ID_LIST')
Group ID_LIST
0 A 1
0 A 2
0 A 3
1 B 1
1 B 3
1 B 5
2 C 2
2 C 4
If the ID_LIST column contains strings (which merely have the appearance of a list):
>>> df.assign(ID_LIST=pd.eval(df['ID_LIST'])).explode('ID_LIST')
Group ID_LIST
0 A 1
0 A 2
0 A 3
1 B 1
1 B 3
1 B 5
2 C 2
2 C 4
Use DataFrame.explode:
df = df.explode('ID_LIST')
Did you try pandas explode() to separate list elements into rows? For example, splitting a comma-separated string column first:
df.assign(Book=df.Book.str.split(",")).explode('Book')
Try this code (the arguments to Series are data then index, so the Group value is repeated over the ID_LIST entries):
import pandas as pd
pd.concat([pd.Series(row['Group'], row['ID_LIST'])
           for _, row in df.iterrows()]).reset_index()
Do let me know if it works.
How does pandas treat equal values in the column it is sorting by?
dataFrame1:
| a | b | c |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 1 | 6 |
| 2 | 1 | 5 |
| 3 | 4 | 2 |
If I run dataFrame1.sort_values(by=['a'], ascending=True), how does it treat the duplicate values in a?
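For reference, pandas' default kind='quicksort' makes no guarantee about the relative order of rows with equal keys; passing kind='stable' (or 'mergesort') keeps ties in their original order. A sketch with the example frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 2, 3],
                   'b': [2, 1, 1, 4],
                   'c': [2, 6, 5, 2]})

# stable sort: the two rows with a == 2 keep their original
# relative order (c == 6 stays before c == 5)
out = df.sort_values(by=['a'], kind='stable')
print(out)
```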
I have the following data:
   b  c
a
0  1  2
1  3  4
I would like to drop the level of column a and have the following as output:
a  b  c
0  1  2
1  3  4
I have tried df.droplevel(1) but got the following error:
IndexError: Too many levels: Index has only 1 level, not 2
Any help is appreciated.
As suggested by #Alollz & #Ben, I did the following:
df = df.reset_index()
And got the following output:
   a  b  c
0  0  1  2
1  1  3  4
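A self-contained sketch of the reset_index fix, rebuilding the frame from the question (the index is a single level named 'a', not a second column level, which is why droplevel raised IndexError):

```python
import pandas as pd

# the 'a' here is the index *name*, not a column level
df = pd.DataFrame({'b': [1, 3], 'c': [2, 4]},
                  index=pd.Index([0, 1], name='a'))

out = df.reset_index()  # 'a' becomes an ordinary column
print(out)
```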
I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to add the row to every row of the dataframe to obtain this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
Use DataFrame.add (or DataFrame.sub) and convert the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
#alternative select by row label
#df = df1.add(df2.loc[0])
print (df)
a b c
0 1 2 3
1 6 7 8
Detail:
print (df2.iloc[0])
a 1
b 2
c 3
Name: 0, dtype: int64
You can convert the second dataframe to a NumPy array:
df1 + df2.values
Output:
a b c
0 1 2 3
1 6 7 8
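If you do want the subtraction named in the title, DataFrame.sub broadcasts a Series across rows the same way; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 5], 'b': [0, 5], 'c': [0, 5]})
row = pd.Series({'a': 1, 'b': 2, 'c': 3})

# a Series aligns on column labels by default, so `row`
# is subtracted element-wise from every row of df1
out = df1.sub(row)
print(out)
```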
So I have a dataframe similar to this:
timestamp | name
------------+------------
1 | a
1 | b
2 | c
2 | d
2 | e
3 | f
4 | g
Essentially I want to get the min and max value of each timestamp session (a session is defined by a unique timestamp value; there are 4 sessions in this example). The expected result would be something like this:
timestamp | name | start | end
------------+----------+--------+------
1 | a | 1 | 2
1 | b | 1 | 2
2 | c | 2 | 3
2 | d | 2 | 3
2 | e | 2 | 3
3 | f | 3 | 4
4 | g | 4 | 4
I was thinking of indexing on the timestamp column and then "moving up" the index by 1, but this approach didn't work on the fourth bucket in the example above.
Any help is greatly appreciated!
Try numpy.clip(), e.g. df['end'] = numpy.clip(df['timestamp'] + 1, 0, 4) (here 4 is the last timestamp).
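The clip() approach hardcodes the last timestamp and assumes session numbers are consecutive. A more general sketch, assuming only that start is the row's own timestamp and end is the next unique timestamp (or the timestamp itself for the last session):

```python
import pandas as pd

df = pd.DataFrame({'timestamp': [1, 1, 2, 2, 2, 3, 4],
                   'name': list('abcdefg')})

uniq = df['timestamp'].drop_duplicates().sort_values()
# map each session's timestamp to the next one; the last session maps to itself
nxt = dict(zip(uniq, uniq.shift(-1).fillna(uniq.iloc[-1]).astype(int)))

df['start'] = df['timestamp']
df['end'] = df['timestamp'].map(nxt)
print(df)
```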