Pandas - two columns as index (valid time and row number)? - python

I have a CSV that looks like this:
valid,value
2004-07-21 09:00:00,200
2004-07-21 10:00:00,200
2004-07-21 11:00:00,150
I must set the valid column as the index, like this:
import pandas as pd
df = pd.read_csv('test.csv')
df['valid'] = pd.to_datetime(df['valid']) # convert 'valid' column to pd.datetime objects
df = df.set_index('valid') # set the 'valid' as index
However, I would still like to be able to access the data by row index too, like this:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.at[row_index])
But I get an error: ValueError: At based indexing on an non-integer index can only have non-integer indexers
I definitely have to keep the valid column as the index. But how can I also print a row given its index?

Change selecting by label:
print(df.at[row_index])
to selecting by position with DataFrame.iat, passing the column's integer position:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    # get the integer position of the 'value' column
    print(df.iat[row_index, df.columns.get_loc('value')])
    # or select the first column directly by position 0
    #print(df.iat[row_index, 0])
200
200
150
Or use DataFrame.iloc:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.iloc[row_index])
value 200
Name: 2004-07-21 09:00:00, dtype: int64
value 200
Name: 2004-07-21 10:00:00, dtype: int64
value 150
Name: 2004-07-21 11:00:00, dtype: int64
The difference: iat is faster but returns only the scalar at the intersection of one row and one column, while iloc is more general and here returns the whole row as a Series, whose name is the index value and whose index is the column names.

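To make both variants reproducible, here is a self-contained sketch using the question's sample data, read from a string here instead of test.csv:
import pandas as pd
from io import StringIO

# the sample CSV from the question
data = StringIO("""valid,value
2004-07-21 09:00:00,200
2004-07-21 10:00:00,200
2004-07-21 11:00:00,150
""")
df = pd.read_csv(data, parse_dates=['valid'], index_col='valid')

for row_index in range(len(df)):
    # iat: the scalar at (row position, column position)
    scalar = df.iat[row_index, df.columns.get_loc('value')]
    # iloc: the whole row as a Series, named after the DatetimeIndex label
    row = df.iloc[row_index]
    print(df.index[row_index], scalar, row['value'])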

Related

How can I add values from the first column to a new array?

I have this code:
unique_values = (dataset["Label"].value_counts())
print(unique_values)
And this output:
BENIGN 1378095
PortScan 158930
FTP-Patator 7938
SSH-Patator 5897
Infiltration 36
Name: Label, dtype: int64
How can I add values from the first column to a new array?
The first column should be this: BENIGN, PortScan,..
Is this giving you what you want?
unique_values.index.tolist()
Transform your index into a column with reset_index:
unique_values.reset_index()
Output: a DataFrame with the former index values as a regular column (and a default integer index)
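A minimal sketch of what the two approaches return, using a small hand-built Series in place of the real value_counts output:
import pandas as pd

# hypothetical, shortened version of the value_counts output
unique_values = pd.Series(
    [1378095, 158930, 7938],
    index=["BENIGN", "PortScan", "FTP-Patator"],
    name="Label",
)

print(unique_values.index.tolist())
# ['BENIGN', 'PortScan', 'FTP-Patator']

print(unique_values.reset_index())
#          index    Label
# 0       BENIGN  1378095
# 1     PortScan   158930
# 2  FTP-Patator     7938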

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
What is important to note is that the DataFrames differ in shape, with between 2 and 7 columns, and the columns are simply numbered from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4, df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the code below, and I get an error that column 5 is not in axis (it is there for some DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
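A small usage sketch of this list-comprehension approach, on two hypothetical integer-named, string-valued DataFrames:
import pandas as pd

def func(s):
    # True if any value in the column contains "DEC"
    return s.str.contains('DEC').any()

# hypothetical data: df1 column 1 and df2 column 0 contain "DEC"
df1 = pd.DataFrame({0: ['JAN', 'FEB'], 1: ['DEC', 'MAR']})
df2 = pd.DataFrame({0: ['APR', 'DEC'], 1: ['MAY', 'JUN'], 2: ['JUL', 'AUG']})
list_dfs1 = [df1, df2]

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
print(list_df[0].columns.tolist())  # [0]
print(list_df[1].columns.tolist())  # [1, 2]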
I would take another approach: concatenate the list into one DataFrame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")

pandas add column to dataframe aggregate on time series

I've done a DataFrame aggregation and I want to add a new column that contains 1 if the row has a value > 0 for year 2020, and 0 otherwise.
This is my code and the head of my dataframe:
df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add column year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add colum with 3 first word
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020']= np.where( df1['year']==2020, 1, 0)
The printout of df1 before the last line looks like this:
The last line raises: KeyError: 'year'
Thanks
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns form a MultiIndex. You can look at it by calling:
print(df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column which matches 2020 you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can then get the correct column and check whether 2020 has a non-zero value using:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like to have the 1 and 0 (instead of the bool types), you can convert to int (using astype).
Renaming the columns
If you think this is a bit complicated, you might prefer to flatten the columns to a single index, using something like:
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
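A minimal end-to-end sketch on hypothetical toy data, using the column-flattening variant:
import pandas as pd

# hypothetical toy data with the same three columns
df = pd.DataFrame({
    'client': ['A', 'A', 'B', 'B'],
    'year':   [2019, 2020, 2019, 2020],
    'Amount': [10.0, 25.0, 5.0, 0.0],
})

df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
print(df1.columns)  # three-level MultiIndex; the last level is named 'year'

df1.columns = df1.columns.get_level_values('year')
df1['nb2020'] = (df1[2020] > 0).astype(int)
print(df1)
# year    2019  2020  nb2020
# client
# A       10.0  25.0       1
# B        5.0   0.0       0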

Python Pandas dataframe doesn't update values based on my filter indexes

I'm trying to update one column values if the index matches my specified condition:
df['newCol'] = 0
df[df.month == 12]['newCol'] = 1
After I've done this, the value is not changed:
df[df.month == 12]['newCol']
10120 0
Name: matched, dtype: int64
Can we not change pandas values based on indexing like this?
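For reference, a minimal sketch (on a hypothetical toy frame) of the usual fix: the chained indexing df[df.month == 12]['newCol'] = 1 assigns to a temporary copy, whereas a single .loc call with both the row mask and the column label updates the original frame:
import pandas as pd

# hypothetical toy frame
df = pd.DataFrame({'month': [11, 12, 12]})
df['newCol'] = 0

# df[df.month == 12]['newCol'] = 1   # assigns to a copy; df stays unchanged

df.loc[df.month == 12, 'newCol'] = 1  # updates df in place
print(df)
#    month  newCol
# 0     11       0
# 1     12       1
# 2     12       1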

Difference between dates between corresponding rows in pandas dataframe

Below is the example of a sample pandas dataframe. I am trying to find the difference between the dates in the two rows (with the first row as the base):
PH_number date Type
H09879721 2018-05-01 AccountHolder
H09879731 2018-06-22 AccountHolder
If the difference between two dates is within 90 days, then those two rows should be added to a new pandas dataframe. The date column is of type object.
How can I do this?
Use .diff() (with the date column already converted to datetime):
df.date.diff() <= pd.Timedelta(90, 'd')
0 False
1 True
Name: date, dtype: bool
Convert the date column to datetime64[ns] using pd.to_datetime and then subtract as shown:
df['date'] = pd.to_datetime(df['date'])

# if comparing with only the 1st row
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
# alternative: mask = (df['date'] - df.loc[0, 'date']).dt.days.le(90)

# if comparing with immediately preceding rows
mask = df['date'].diff().dt.days <= 90
# alternative: mask = df['date'].diff().dt.days.le(90)

df1 = df.loc[mask, :]  # gives you the required rows with all columns
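Applied to the sample rows from the question (the dates are 52 days apart, so both rows survive the 90-day filter):
import pandas as pd

df = pd.DataFrame({
    'PH_number': ['H09879721', 'H09879731'],
    'date': ['2018-05-01', '2018-06-22'],
    'Type': ['AccountHolder', 'AccountHolder'],
})

df['date'] = pd.to_datetime(df['date'])

# difference to the first row, in days (0 and 52), kept if within 90 days
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
df1 = df.loc[mask, :]
print(df1)
#    PH_number       date           Type
# 0  H09879721 2018-05-01  AccountHolder
# 1  H09879731 2018-06-22  AccountHolder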
