Python pandas DataFrame: sum the row values where the data is True

I am very new to pandas and even newer to programming.
I have a DataFrame of [500 rows x 24 columns]:
the 500 rows are the rank of the data and the 24 columns are years and months.
What I want is to:
select the data from df
get each match's row number as an int
sum all of those row numbers
I did DATAF = df1[df1.isin(['MYDATA'])]
DATAF is something like below
19_01 19_02 19_03 19_04 19_05
0 NaN MYDATA NaN NaN NaN
1 MYDATA NaN MYDATA NaN NaN
2 NaN NaN NaN MYDATA NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
So I want to sum all the row values,
which would be like 1 + 0 + 1 + 2.
It would be even nicer if the sum were like 2 + 1 + 2 + 3, because the rows are the rank of the data.
Is there any way to do this?

You can use np.where:
import numpy as np

rows, cols = np.where(DATAF.notna())
# rows: array([0, 1, 1, 2], dtype=int64)
print((rows + 1).sum())
# 8
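If you would rather stay within pandas and get the rank-weighted sum (2 + 1 + 2 + 3) directly, here is a minimal sketch, assuming DATAF is the frame shown above:
import pandas as pd

counts = DATAF.notna().sum(axis=1)                         # matches per row: 1, 2, 1, 0, 0
ranks = pd.Series(range(1, len(counts) + 1), index=counts.index)  # rank = row number + 1
print((counts * ranks).sum())
# 8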

Transform a big dataframe with many None values into a smaller one with an indication of the non-null columns

I have a big dataframe with 4 columns that often has 3 null values in every row. Sometimes a row has 2 or 1 or even 0 null values, but usually 3.
I want to transform it into a two-column dataframe holding, in each row, the non-null value and the name of the column it was extracted from.
Example: How to transform this dataframe
df
Out[1]:
a b c d
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 NaN NaN 3.0 2.0
3 NaN NaN 1.0 NaN
into this one:
resultDF
Out[2]:
value columnName
0 1 a
1 2 b
2 3 c
3 2 d
4 1 c
The goal is to do it without looping on rows. Is this possible?
You can use pd.melt to reshape the dataframe:
import pandas as pd
# reading the csv
df = pd.read_csv('test.csv')
df = df.melt(value_vars=['a','b','c','d'], var_name='foo', value_name='foo_value')
df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
print(df)
Output :
foo foo_value
0 a 1.0
1 b 2.0
2 c 3.0
3 c 1.0
4 d 2.0
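If you need the exact column names and order asked for in the question (value, columnName), a small follow-up sketch on top of the melted frame above:
resultDF = (df.rename(columns={'foo_value': 'value', 'foo': 'columnName'})
              [['value', 'columnName']])
print(resultDF)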

How to split the regex result into multiple column (Python)

I am trying to do the regex in a Python dataframe using this script:
import pandas as pd
df1 = {'data':['1gsmxx,2gsm','abc10gsm','10gsm','18gsm hhh4gsm','Abc:10gsm','5gsmaaab3gsmABC55gsm','abc - 15gsm','3gsm,,ff40gsm','9gsm','VV - fg 8gsm','kk 5gsm 00g','001….abc..5gsm']}
df1 = pd.DataFrame(df1)
df1
df1['Result'] = df1['data'].str.findall(r'(\d{1,3}\s?gsm)')
OR
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
However, it turns out that multiple results end up in one column.
Is it possible to get a result like the one attached below?
Use pandas.Series.str.extractall with unstack.
If you want your original series, use pandas.concat.
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
df = pd.concat([df1, df2.droplevel(0, axis=1)], axis=1)
print(df)
Output:
data 0 1 2
0 1gsmxx,2gsm 1gsm 2gsm NaN
1 abc10gsm 10gsm NaN NaN
2 10gsm 10gsm NaN NaN
3 18gsm hhh4gsm 18gsm 4gsm NaN
4 Abc:10gsm 10gsm NaN NaN
5 5gsmaaab3gsmABC55gsm 5gsm 3gsm 55gsm
6 abc - 15gsm 15gsm NaN NaN
7 3gsm,,ff40gsm 3gsm 40gsm NaN
8 9gsm 9gsm NaN NaN
9 VV - fg 8gsm 8gsm NaN NaN
10 kk 5gsm 00g 5gsm NaN NaN
11 001….abc..5gsm 5gsm NaN NaN
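If the bare 0/1/2 column labels are awkward downstream, here is a sketch of the same approach with prefixed column names (gsm_0, gsm_1, ... are just illustrative names, not required by the question):
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack().droplevel(0, axis=1)
df = pd.concat([df1, df2.add_prefix('gsm_')], axis=1)
print(df)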

Pandas combines empty rows in Excel file to a single row in dataframe

I have different excel files that I am processing with Pandas. I need to remove a certain number of rows from the top of each file. These extra rows could be empty or they could contain text. Pandas is combining some of the rows so I am not sure how many need to be removed. For example:
Here is an example excel file (represented as csv):
,,
,,
some text,,
,,
,,
,,
name, date, task
Jason,1-Jan,swim
Aem,2-Jan,workout
Here is my current python script:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(xl.sheet_names[0])
print ("dfs: ", dfs)
Here is the results when I print the dataframe:
dfs: Unnamed: 0 Unnamed: 1 Unnamed: 2
0 some other text NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 name date task
5 Jason 2016-01-01 00:00:00 swim
6 Aem 2016-01-02 00:00:00 workout
From the file, I would remove the first 6 rows. However, from the dataframe I would only remove 4. Is there a way to read in the Excel file with the data in its raw state so the number of rows remains consistent?
I used Python 3 and pandas 0.18.1. The Excel load function is pandas.read_excel. You can try setting the parameter header=None to achieve this. Here is some sample code:
(1) With default parameters, result will ignore leading blank lines:
In [12]: pd.read_excel('test.xlsx')
Out[12]:
Unnamed: 0 Unnamed: 1 Unnamed: 2
0 text1 NaN NaN
1 NaN NaN NaN
2 n1 t2 c3
3 NaN NaN NaN
4 NaN NaN NaN
5 jim sum tim
(2) With header=None, result will keep leading blank lines.
In [13]: pd.read_excel('test.xlsx', header=None)
Out[13]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 text1 NaN NaN
3 NaN NaN NaN
4 n1 t2 c3
5 NaN NaN NaN
6 NaN NaN NaN
7 jim sum tim
Here is what you are looking for:
import pandas as pd
xl = pd.ExcelFile('extra_rows.xlsx')
dfs = xl.parse(skiprows=6)
print ("dfs: ", dfs)
Check the docs on ExcelFile for more details.
If you read your file in with pd.read_excel and pass header=None, the blank rows should be included:
In [286]: df = pd.read_excel("test.xlsx", header=None)
In [287]: df
Out[287]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 something NaN NaN
3 NaN NaN NaN
4 name date other
5 1 2 3
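If the number of junk rows varies from file to file, one option (a sketch, assuming the real header row is the one whose first cell is the literal string 'name') is to read with header=None so nothing is collapsed, locate that row, and slice:
import pandas as pd

raw = pd.read_excel('extra_rows.xlsx', header=None)
header_row = raw[raw[0] == 'name'].index[0]   # 'name' marking the header row is an assumption
df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_row]
print(df)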

Pandas: interpolation where first and last data point in column is NaN

I would like to use the interpolate function, but only between known data values in a pandas DataFrame column. The issue is that the first and last values in the column are often NaN and sometimes it can be many rows before a value is not NaN:
col 1 col 2
0 NaN NaN
1 NaN NaN
...
1000 1 NaN
1001 NaN 1 <-----
1002 3 NaN <----- only want to fill in these 'in between value' rows
1003 4 3
...
3999 NaN NaN
4000 NaN NaN
I am tying together a dataset which is updated 'on event' but separately for each column, and is indexed via Timestamp. This means that there are often rows where no data is recorded for some columns, hence a lot of NaNs!
I select the slice of each column between the index of its minimum and maximum values with idxmin and idxmax, and fill it using fillna with forward filling.
print df
# col 1 col 2
#0 NaN NaN
#1 NaN NaN
#1000 1 NaN
#1001 NaN 1
#1002 3 NaN
#1003 4 3
#3999 NaN NaN
#4000 NaN NaN
df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()] = df.loc[df['col 1'].idxmin(): df['col 1'].idxmax()].fillna(method='ffill')
df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()] = df.loc[df['col 2'].idxmin(): df['col 2'].idxmax()].fillna(method='ffill')
print df
# col 1 col 2
#0 NaN NaN
#1 NaN NaN
#1000 1 NaN
#1001 1 1
#1002 3 1
#1003 4 3
#3999 NaN NaN
#4000 NaN NaN
Added different solution, thanks HStro.
df['col 1'].loc[df['col 1'].first_valid_index(): df['col 1'].last_valid_index()] = (
    df['col 1'].loc[df['col 1'].first_valid_index(): df['col 1'].last_valid_index()]
    .astype(float)
    .interpolate()
)
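On newer pandas (0.23 and later), interpolate can be told to fill only NaNs that are surrounded by valid values, which avoids the slicing entirely. A minimal sketch:
# fills only the 'in between' NaNs; leading and trailing NaNs in each column are left untouched
df_filled = df.interpolate(limit_area='inside')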

Best way to eliminate columns with only one value from pandas dataframe

I'm trying to build a function to eliminate from my dataset the columns with only one value. I used this function:
def oneCatElimination(dataframe):
    columns = dataframe.columns.values
    for column in columns:
        if len(dataframe[column].value_counts().unique()) == 1:
            del dataframe[column]
    return dataframe
The problem is that the function eliminates even columns with more than one distinct value, e.g. an index column of integers.
Just
df.dropna(thresh=2, axis=1)
will work. No need for anything else. It will keep all columns with 2 or more non-NA values (controlled by the value passed to thresh). The axis kwarg lets you work with rows or columns; it is rows by default, so you need to pass axis=1 explicitly to work on columns. See dropna() for more information.
A couple of assumptions went into this:
Null/NA values don't count
You need multiple non-NA values to keep a column
Those values need to be different in some way (e.g., a column full of 1's and only 1's should be dropped)
All that said, I would use a select statement on the columns.
If you start with this dataframe:
import pandas
N = 15
df = pandas.DataFrame(index=range(10), columns=list('ABCD'))
df.loc[2, 'A'] = 23
df.loc[3, 'B'] = 52
df.loc[4, 'B'] = 36
df.loc[5, 'C'] = 11
df.loc[6, 'C'] = 11
df.loc[7, 'D'] = 43
df.loc[8, 'D'] = 63
df.loc[9, 'D'] = 97
df
Which creates:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 23 NaN NaN NaN
3 NaN 52 NaN NaN
4 NaN 36 NaN NaN
5 NaN NaN 11 NaN
6 NaN NaN 11 NaN
7 NaN NaN NaN 43
8 NaN NaN NaN 63
9 NaN NaN NaN 97
Given my assumptions above, columns A and C should be dropped since A only has one value and both of C's values are the same. You can then do:
df.select(lambda c: df[c].dropna().unique().shape[0] > 1, axis=1)
And that gives me:
B D
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 52 NaN
4 36 NaN
5 NaN NaN
6 NaN NaN
7 NaN 43
8 NaN 63
9 NaN 97
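Note that DataFrame.select has since been removed from pandas; an equivalent of the same "more than one distinct non-NA value" test with a plain list comprehension (a sketch) would be:
keep = [c for c in df.columns if df[c].nunique() > 1]   # nunique() ignores NaN by default
print(df[keep])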
This will work for both text and numbers:
for col in dataframe:
    if len(dataframe.loc[:, col].unique()) == 1:
        dataframe.pop(col)
Note: This will remove the columns having only one value from the original dataframe.
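For what it's worth, the reason the original oneCatElimination drops too much is that value_counts().unique() returns the distinct counts rather than the distinct values: in an index-like column every value occurs exactly once, so the result is just [1] and the length-1 test passes. A sketch of a corrected version using nunique() (note that it returns a new frame rather than deleting columns in place):
def one_cat_elimination(dataframe):
    # keep only columns with more than one distinct non-NA value
    return dataframe.loc[:, dataframe.nunique() > 1]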
