Pandas - two columns as index (valid time and row number)? - python

I have a CSV that looks like this:
valid,value
2004-07-21 09:00:00,200
2004-07-21 10:00:00,200
2004-07-21 11:00:00,150
I must set the valid column as the index, like this:
import pandas as pd
df = pd.read_csv('test.csv')
df['valid'] = pd.to_datetime(df['valid']) # convert 'valid' column to pd.datetime objects
df = df.set_index('valid') # set the 'valid' as index
However, I would still like to be able to access the data by row index too, like this:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.at[row_index])
But I get an error: ValueError: At based indexing on an non-integer index can only have non-integer indexers
I definitely have to keep the valid column as the index. But how can I also print a row given its index?

Change selecting by label:
print(df.at[row_index])
to selecting by position with DataFrame.iat, passing the column's integer position:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    # get the integer position of the 'value' column
    print(df.iat[row_index, df.columns.get_loc('value')])
    # or select the first column directly by position 0
    #print(df.iat[row_index, 0])
200
200
150
Or use DataFrame.iloc:
for row_index in range(len(df)):  # I know iterating over a df is not advisable
    print(df.iloc[row_index])
value 200
Name: 2004-07-21 09:00:00, dtype: int64
value 200
Name: 2004-07-21 10:00:00, dtype: int64
value 150
Name: 2004-07-21 11:00:00, dtype: int64
The difference: iat is faster but returns only the scalar at the intersection of one row and one column, while iloc is more general and here returns the whole row as a Series, whose name is the index value and whose index is the column names.

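To make both variants reproducible, here is a self-contained sketch using the question's sample data, read from a string here instead of test.csv:
import pandas as pd
from io import StringIO

# the sample CSV from the question
data = StringIO("""valid,value
2004-07-21 09:00:00,200
2004-07-21 10:00:00,200
2004-07-21 11:00:00,150
""")
df = pd.read_csv(data, parse_dates=['valid'], index_col='valid')

for row_index in range(len(df)):
    # iat: the scalar at (row position, column position)
    scalar = df.iat[row_index, df.columns.get_loc('value')]
    # iloc: the whole row as a Series, named after the DatetimeIndex label
    row = df.iloc[row_index]
    print(df.index[row_index], scalar, row['value'])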

Related

How can I add values from the first column to a new array?

I have this code:
unique_values = (dataset["Label"].value_counts())
print(unique_values)
And this output:
BENIGN 1378095
PortScan 158930
FTP-Patator 7938
SSH-Patator 5897
Infiltration 36
Name: Label, dtype: int64
How can I add values from the first column to a new array?
The first column should be this: BENIGN, PortScan,..
Is this giving you what you want?
unique_values.index.tolist()
Transform your index into a column with reset_index:
unique_values.reset_index()
Output: a DataFrame with the former index values as a regular column (and a default integer index)
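A minimal sketch of what the two approaches return, using a small hand-built Series in place of the real value_counts output:
import pandas as pd

# hypothetical, shortened version of the value_counts output
unique_values = pd.Series(
    [1378095, 158930, 7938],
    index=["BENIGN", "PortScan", "FTP-Patator"],
    name="Label",
)

print(unique_values.index.tolist())
# ['BENIGN', 'PortScan', 'FTP-Patator']

print(unique_values.reset_index())
#          index    Label
# 0       BENIGN  1378095
# 1     PortScan   158930
# 2  FTP-Patator     7938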

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of DataFrames.
What is important to note is that the DataFrames differ in shape, with between 2 and 7 columns, and the columns are simply numbered from 0 up to the number of columns (e.g. df1 has 5 columns named 0,1,2,3,4, df2 has 4 columns named 0,1,2,3).
What I would like is to check whether any row in a column contains a certain string, and if so delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the code below, and I get an error that column 5 is not in axis (it is there for some DataFrames):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()
list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
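A small usage sketch of this list-comprehension approach, on two hypothetical integer-named, string-valued DataFrames:
import pandas as pd

def func(s):
    # True if any value in the column contains "DEC"
    return s.str.contains('DEC').any()

# hypothetical data: df1 column 1 and df2 column 0 contain "DEC"
df1 = pd.DataFrame({0: ['JAN', 'FEB'], 1: ['DEC', 'MAR']})
df2 = pd.DataFrame({0: ['APR', 'DEC'], 1: ['MAY', 'JUN'], 2: ['JUL', 'AUG']})
list_dfs1 = [df1, df2]

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
print(list_df[0].columns.tolist())  # [0]
print(list_df[1].columns.tolist())  # [1, 2]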
I would take another approach: concatenate the list into one DataFrame and then eliminate the columns where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition was to eliminate any column with "DEC"
df.mask(df == "DEC").dropna(axis=1, how="any")

pandas add column to dataframe aggregate on time series

I've done a DataFrame aggregation and I want to add a new column that contains 1 if the row has a value > 0 for year 2020, and 0 otherwise.
This is my code and the head of my dataframe:
df['year'] = pd.DatetimeIndex(df['TxnDate']).year # add column year
df['client'] = df['Customer'].str.split(' ').str[:3].str.join(' ') # add colum with 3 first word
Datedebut = df['year'].min()
Datefin = df['year'].max()
#print(df)
df1 = df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()
print(df1)
df1['nb2020']= np.where( df1['year']==2020, 1, 0)
The printout of df1 before the last line looks like this:
The last line raises: KeyError: 'year'
Thanks
When you performed the aggregation and unstacked (df.groupby(['client','year']).agg({'Amount': ['sum']}).unstack()), the values of the year column were expanded into columns, and those columns form a MultiIndex. You can look at it by calling:
print(df1.columns)
And then you can select them.
Using the MultiIndex column
So to select the column which matches 2020 you can use:
df1.loc[:, df1.columns.get_level_values(2).isin({2020})]
You can then get the correct column and check whether 2020 has a non-zero value using:
df1['nb2020'] = df1.loc[:,df1.columns.get_level_values('year').isin({2020})] > 0
If you would like to have the 1 and 0 (instead of the bool types), you can convert to int (using astype).
Renaming the columns
If you think this is a bit complicated, you might prefer to flatten the columns to a single index, using something like:
df1.columns = df1.columns.get_level_values('year')
Or
df1.columns = df1.columns.get_level_values(2)
And then
df1['nb2020'] = (df1[2020] > 0).astype(int)
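A minimal end-to-end sketch on hypothetical toy data, using the column-flattening variant:
import pandas as pd

# hypothetical toy data with the same three columns
df = pd.DataFrame({
    'client': ['A', 'A', 'B', 'B'],
    'year':   [2019, 2020, 2019, 2020],
    'Amount': [10.0, 25.0, 5.0, 0.0],
})

df1 = df.groupby(['client', 'year']).agg({'Amount': ['sum']}).unstack()
print(df1.columns)  # three-level MultiIndex; the last level is named 'year'

df1.columns = df1.columns.get_level_values('year')
df1['nb2020'] = (df1[2020] > 0).astype(int)
print(df1)
# year    2019  2020  nb2020
# client
# A       10.0  25.0       1
# B        5.0   0.0       0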

Python Pandas dataframe doesn't update values based on my filter indexes

I'm trying to update one column values if the index matches my specified condition:
df['newCol'] = 0
df[df.month == 12]['newCol'] = 1
After I've done this, the value is not changed:
df[df.month == 12]['newCol']
10120 0
Name: matched, dtype: int64
Can we not change pandas values based on indexing like this?
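For reference, a minimal sketch (on a hypothetical toy frame) of the usual fix: the chained indexing df[df.month == 12]['newCol'] = 1 assigns to a temporary copy, whereas a single .loc call with both the row mask and the column label updates the original frame:
import pandas as pd

# hypothetical toy frame
df = pd.DataFrame({'month': [11, 12, 12]})
df['newCol'] = 0

# df[df.month == 12]['newCol'] = 1   # assigns to a copy; df stays unchanged

df.loc[df.month == 12, 'newCol'] = 1  # updates df in place
print(df)
#    month  newCol
# 0     11       0
# 1     12       1
# 2     12       1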

Difference between dates between corresponding rows in pandas dataframe

Below is the example of a sample pandas dataframe. I am trying to find the difference between the dates in the two rows (with the first row as the base):
PH_number date Type
H09879721 2018-05-01 AccountHolder
H09879731 2018-06-22 AccountHolder
If the difference between two dates is within 90 days, then those two rows should be added to a new pandas dataframe. The date column is of type object.
How can I do this?
Use .diff() (with the date column already converted to datetime):
df.date.diff() <= pd.Timedelta(90, 'd')
0 False
1 True
Name: date, dtype: bool
Convert the date column to datetime64[ns] using pd.to_datetime and then subtract as shown:
df['date'] = pd.to_datetime(df['date'])

# if comparing with only the 1st row
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
# alternative: mask = (df['date'] - df.loc[0, 'date']).dt.days.le(90)

# if comparing with immediately preceding rows
mask = df['date'].diff().dt.days <= 90
# alternative: mask = df['date'].diff().dt.days.le(90)

df1 = df.loc[mask, :]  # gives you the required rows with all columns
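Applied to the sample rows from the question (the dates are 52 days apart, so both rows survive the 90-day filter):
import pandas as pd

df = pd.DataFrame({
    'PH_number': ['H09879721', 'H09879731'],
    'date': ['2018-05-01', '2018-06-22'],
    'Type': ['AccountHolder', 'AccountHolder'],
})

df['date'] = pd.to_datetime(df['date'])

# difference to the first row, in days (0 and 52), kept if within 90 days
mask = (df['date'] - df.loc[0, 'date']).dt.days <= 90
df1 = df.loc[mask, :]
print(df1)
#    PH_number       date           Type
# 0  H09879721 2018-05-01  AccountHolder
# 1  H09879731 2018-06-22  AccountHolder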
