Python Pandas DataFrame: storing next row values

I am new to pandas DataFrames and am trying to iterate over rows. For each row, I want to get the values of that row and the next two rows and store them in variables. Here is my code snippet:
df = pd.read_csv('Input.csv')
for index, row in df.iterrows():
    row_1 = row['Word']
    row_2 = row['Word'] + 1 # I know this is incorrect and it won't work
    row_3 = row['Word'] + 2 # I know this is incorrect and it won't work
    print(row_1, row_2, row_3)
Given the following 'Input.csv':
Word <- #Column
Hi
Hello
Some
Phone
keys
motion
entries
I want the following output:
Hi, Hello, Some
Hello, Some, Phone
Some, Phone, keys
Phone, keys, motion
keys, motion, entries
Any help is appreciated. Thank you.

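You can use the iloc indexer to fetch rows by position. Note that the column is named 'Word' (capitalized), and the loop has to stop two rows before the end so that i + 2 stays in range:
df = pd.read_csv('Input.csv')
# Stop two rows early so that i + 2 never runs past the last row
for i in range(len(df) - 2):
    row_1 = df['Word'].iloc[i]
    row_2 = df['Word'].iloc[i + 1]
    row_3 = df['Word'].iloc[i + 2]
    print(row_1, row_2, row_3, sep=', ')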

There are many ways to do it, but something simple like this, which uses pandas' vectorized operations, will work:
(df['Word'] + ', ' + df['Word'].shift(-1) + ', ' + df['Word'].shift(-2)).dropna()
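As a rough, self-contained sketch of the shift approach, here it is applied to the sample words from the question. The DataFrame is built in memory rather than read from 'Input.csv', which is an assumption made only so the snippet runs on its own.

```python
import pandas as pd

# Sample data copied from the question; the real data would come from pd.read_csv('Input.csv').
df = pd.DataFrame({"Word": ["Hi", "Hello", "Some", "Phone", "keys", "motion", "entries"]})

# shift(-1) and shift(-2) line up the next row and the row after it with the current row.
# The last two rows have no complete window, so their result is missing and dropna() removes it.
triples = (df["Word"] + ", " + df["Word"].shift(-1) + ", " + df["Word"].shift(-2)).dropna()

for line in triples:
    print(line)
```

Because the window is built from whole columns, there is no need to guard against running past the last row by hand.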

Related

Compare two df column names and select columns in common

I'm trying to fill dictionary_of_columns with the columns that two DataFrames have in common.
My code:
list_of_columns = []
for column in dfUpdates.schema:
    list_of_columns.append(column.jsonValue()["name"].upper())
dictionary_of_columns = {}
dictionary_of_columns['BK_COLUMNS'] = []
dictionary_of_columns['COLUMNS'] = []
for row in df_metadata.dropDuplicates(['COLUMN_NAME', 'KeyType']).collect():
    if row.KeyType == 'PRIMARY KEY' and row.COLUMN_NAME.upper() in list_of_columns:
        dictionary_of_columns['BK_COLUMNS'].append(row.COLUMN_NAME.upper())
    elif row.KeyType != 'PRIMARY KEY' and row.COLUMN_NAME.upper() in list_of_columns:
        dictionary_of_columns['COLUMNS'].append(row.COLUMN_NAME.upper())
But when I run it, the result does not match: dictionary_of_columns has more columns in it than expected.
UPDATE:
dfupdate: column names
df_metadata: values in COLUMN_NAME
The desired dictionary_of_columns is: {'BK_COLUMNS': ['CODE'], 'COLUMNS': ['DESCRIPTION']}
The pseudocode looks like this: I build a set from the column names of each DataFrame, and then take the elements they have in common.
You can adapt it to suit your needs, since I see you also filter on primary keys.
dfUpdates_cols = set(dfUpdates.columns)
df_metadata_cols = set(df_metadata.columns)
print(dfUpdates_cols & df_metadata_cols)
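One possible adaptation to the primary-key split from the question, sketched in plain Python so it runs without Spark. The two lists are hypothetical stand-ins (made-up sample values) for dfUpdates.columns and for the collected (COLUMN_NAME, KeyType) rows of df_metadata.

```python
# Hypothetical stand-ins: in the question these would come from dfUpdates.columns
# and from the collected (COLUMN_NAME, KeyType) rows of df_metadata.
update_columns = ["code", "description"]
metadata_rows = [
    ("CODE", "PRIMARY KEY"),
    ("DESCRIPTION", "NONE"),
    ("CREATED_AT", "NONE"),  # not present in update_columns, so it is skipped
]

# Upper-case once so that the membership test is case-insensitive.
update_set = {name.upper() for name in update_columns}

dictionary_of_columns = {"BK_COLUMNS": [], "COLUMNS": []}
for column_name, key_type in metadata_rows:
    name = column_name.upper()
    if name not in update_set:
        continue  # keep only the columns that both sides have in common
    bucket = "BK_COLUMNS" if key_type == "PRIMARY KEY" else "COLUMNS"
    dictionary_of_columns[bucket].append(name)

print(dictionary_of_columns)
```

Using a set for the lookup keeps the membership test cheap even with many columns.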

How to delete "[" and "]" in a DataFrame, and how do I paste a DataFrame into an existing Excel file?

I'm very new to Python. I think this is a very simple thing, but I can't get it to work. What I have to do is remove the values in each row of one column from a fixed string of values.
available_list
AE,SG,MO
KR,CN
SG
MO,MY
all_list = 'AE,SG,MO,MY,KR,CN,US,HK,YS'
I want to remove available_list values from all_list.
Here is the code I tried:
col1 = df['available_list']
all_ori = 'AE,SG,MO,MY,KR,CN,US,HK,YS'.split(',')
all_c = all_ori.copy()
result = []
for i in col1:
    # split the cell into codes; iterating over the raw string would yield single characters
    for s in i.split(','):
        all_c.remove(s)
    result.append(all_c)
    all_c = all_ori.copy()
result_df = pd.DataFrame({'Non-Priviliges': result})
But the result was,
|Non-Priviliges|
|[MY, KR, CN, US, HK, YS]|
|[SG, MO, US, HK, YS]|
|[AE, SG, KR, CN, US, HK, YS]|
The problem is the "[" and "]" characters. How do I remove them?
After removing them, I want to paste this series into an existing Excel file, next to the column named "Priviliges".
Could you give me some advice? Thanks!
Assuming your filename is "hello.xlsx", here is my answer:
import pandas as pd
df = pd.read_excel('hello.xlsx')
all_list_str = 'AE,SG,MO,MY,KR,CN,US,HK,YS'
all_list = all_list_str.split(',')
def find_non_priv(row):
    # convert the cell's comma-separated string to a list
    row_list = row.split(',')
    # joining the result into a string is what removes the "[" and "]"
    return ','.join(list(set(all_list) - set(row_list)))
# pandas apply calls the function on each item in the column
df['Non-Priviliges'] = df['available_list'].apply(find_non_priv)
df.to_excel('output.xlsx')
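A set difference does not guarantee any particular order in the joined string. If the order of all_list matters, a variant along these lines may help. It is only a sketch: the DataFrame is built in memory from the question's sample rows instead of being read from Excel.

```python
import pandas as pd

# Sample rows from the question, built in memory instead of read from an Excel file.
df = pd.DataFrame({"available_list": ["AE,SG,MO", "KR,CN", "SG", "MO,MY"]})
all_list = "AE,SG,MO,MY,KR,CN,US,HK,YS".split(",")

def find_non_priv(cell):
    """Return the codes from all_list that are absent from the cell, as one string."""
    present = set(cell.split(","))
    # Walking all_list (rather than taking a set difference) keeps its original order.
    return ",".join(code for code in all_list if code not in present)

df["Non-Priviliges"] = df["available_list"].apply(find_non_priv)
print(df)
```

The new column holds plain strings, so no brackets appear when it is written out.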

Python pandas if column value is list then create new column(s) with individual list value

I'm using pandas to create a DataFrame from a SaaS REST API JSON response, and I'm hitting a minor blocker when cleansing the data for visualization and analysis.
I need to tweak the Python script by adding a conditional function that says: if a value is a list, remove the brackets, split the values into new columns, and name the new columns [original column name + position in the list].
In the similar questions I found, the function is applied to one specified column, whereas I need the check to run on all 1,400+ columns in the DataFrame. Basically, it is Excel's Text to Columns, with each new column header being [original column name + position in the list].
Current
Need
Here's the dataframe creation script from the .json response
def get_tap_dashboard():
    use_fields = ''
    for index, value in enumerate(list(WORKFLOW_FIELDS.keys())):
        if index != len(list(WORKFLOW_FIELDS.keys())) - 1:
            use_fields = use_fields + value + ','
        else:
            use_fields = use_fields + value
    dashboard_head = {'Authorization': 'Bearer {}'.format(get_tap_token()), 'Content-Type': 'application/json'}
    dashboard_url = \
        TAP_URL + "api/v1/workflows/all?pageSize={}&page=1".format(SIZE) \
        + "&advancedFilter=__WorkflowDescription__~eq~'{}'".format(WORKFLOW_NAME) \
        + "&configurationId={}".format("1128443a-f7a7-4a90-953d-c095752a97a2")
    dashboard = json.loads(requests.get(url=dashboard_url, headers=dashboard_head).text)
    all_columns = []
    for col in dashboard['Items'][0]['Columns']:
        all_columns.append(col['Name'])
    all_columns = ['ResultSetId'] + all_columns
    pd_dashboard = pd.DataFrame(columns=all_columns)
    for row in dashboard['Items']:
        add_row_values = [row['ResultSetId']]
        for col in row['Columns']:
            if col['Value'] == '-- Select One --': # dtype issue
                add_row_values.append([''])
            else:
                add_row_values.append(col['Value'])
        add_row_df = pd.DataFrame([add_row_values], columns=all_columns)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        pd_dashboard = pd.concat([pd_dashboard, add_row_df])
    tap_dashboard = pd_dashboard
    return tap_dashboard.rename(columns=WORKFLOW_FIELDS).reset_index(drop=True)
df = get_tap_dashboard()
Any help would be much appreciated thanks all!
PS - I have a Tableau creator license if it makes more sense to do it in Tableau/Tableau prep builder
Could this be what you need?
from collections import defaultdict
output = defaultdict(lambda : [])
def count(x):
    if isinstance(x,list):
        if len(x) > 1:
            for i,item in enumerate(x):
                output[f'{item}_{i}'].append(item)
        elif len(x) == 1:
            output[f'{x[0]}_0'].append(x[0])
df['df_column_name'].apply(count)
print(pd.DataFrame.from_dict(output, orient='index').T)
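Since the question asks for the check to run on every column and for names of the form [original column name + position], here is one possible sketch of that. The function name expand_list_columns and the sample frame are hypothetical, not from the thread, and this has not been tried against the real 1,400-column response.

```python
import pandas as pd

def expand_list_columns(df):
    """Split every list-valued column into numbered columns (name_0, name_1, ...)."""
    pieces = []
    for name in df.columns:
        col = df[name]
        if col.map(lambda v: isinstance(v, list)).any():
            # Wrap scalars so every cell is a list, then spread the lists across columns.
            as_lists = col.map(lambda v: v if isinstance(v, list) else [v])
            expanded = pd.DataFrame(as_lists.tolist(), index=df.index)
            expanded.columns = [f"{name}_{i}" for i in range(expanded.shape[1])]
            pieces.append(expanded)
        else:
            # Columns without any list values are passed through unchanged.
            pieces.append(col.to_frame())
    return pd.concat(pieces, axis=1)

# Hypothetical sample: 'tags' holds lists of different lengths.
sample = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})
wide = expand_list_columns(sample)
print(wide)
```

Shorter lists are padded with missing values, so every row ends up with the same set of numbered columns.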

How can I Access a Column by Name as a Variable to use the isin() Method

I have two DataFrames, df1 and df2. Both have a column with the same name, 'Accounts'.
I can currently access this data for comparison using the following code:
df1.account.isin(df2.account.values)
I would like 'account' to be accessed through a variable, something like this: df1.[account].isin(df2.[account].values). After some research I found a possible solution:
df1.loc[:, 'account'] (I suspect this is not the correct approach.)
From this point, I'm not sure how to access the isin() method
As a result I welcome your wisdom with any alternative ways to accomplish this. Your help is very much appreciated :)
The full block of code is as follows:
slgCSV = 'c:\\automation\\python\\a.csv'
armyCSV = 'c:\\automation\\python\\b.csv'
df1 = pd.read_csv(slgCSV)
df2 = pd.read_csv(armyCSV)
d3 = {'Expected': [], 'Actual': []}
df3 = pd.DataFrame(data=d3)
match1 = df1.account.isin(df2.account.values)
match2 = df2.account.isin(df1.account.values)
for r1 in df1[match1].index:
    for r2 in df2[match2].index:
        # print("R2: " + str(r2))
        if df1.account[r1] == df2.account[r2]:
            idx = df1.account[r1]
            row = {'Expected Row ID': r1+2, 'Actual Row ID': r2+2}
            print("Output: " + str(row) + ": " + str(idx))
df1 looks as follows:
Account
1
2
3
4
5
df2 looks as follows:
Account
3
1
5
2
4
The solution is as follows:
col = "account"
df1[col].isin(df2[col].values)
Thank you for all the help!
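For reference, a small self-contained sketch of the same idea. The frames are modeled on the sample data above, with one value in df2 changed to 9 (an assumption made here) so that not every row matches.

```python
import pandas as pd

# Frames modeled on the question's sample data; the 9 is an added non-matching value.
df1 = pd.DataFrame({"Account": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"Account": [3, 1, 5, 2, 9]})

col = "Account"  # the column name is held in a variable

# Bracket indexing accepts any string, so it works where attribute access (df1.Account) cannot.
# isin() accepts any list-like, so passing the Series directly is enough.
match1 = df1[col].isin(df2[col])
match2 = df2[col].isin(df1[col])

print(df1[match1])
print(df2[match2])
```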
Give this a try, using set functionality:
Usercol = 'Account' # user entry
Common = list(set(df1[Usercol]).intersection(set(df2[Usercol])))
# fetch the index of the matching rows in each data frame
df1[df1[Usercol].isin(Common)].index
df2[df2[Usercol].isin(Common)].index

How to access a cell in a new dataframe?

I created a sub-DataFrame (drama_df) based on a criterion in the original DataFrame (df). However, I can't access a cell using the usual drama_df['summary'][0]. Instead I get a KeyError: 0. I'm confused, since type(drama_df) is a DataFrame. What do I do? Note that df['summary'][0] does indeed return a string.
drama_df = df[df['drama'] > 0]
# Now we generate a lump of text from the summaries
drama_txt = ""
i = 0
while (i < len(drama_df)):
    drama_txt = drama_txt + " " + drama_df['summary'][i]
    i += 1
EDIT
Here is an example of df:
Here is an example of drama_df:
This will solve it for you:
drama_df['summary'].iloc[0]
When you created the sub-DataFrame, the row with index label 0 was probably filtered out. Filtering keeps the original index labels, so you need to use iloc to get the element by position and not by index label (0).
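A minimal sketch of why the label 0 goes missing. The sample values are hypothetical; only the column names 'drama' and 'summary' come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "drama": [0, 1, 2],
    "summary": ["a comedy", "a tragedy", "a melodrama"],
})

drama_df = df[df["drama"] > 0]

# Filtering keeps the original index labels, so label 0 is gone.
print(list(drama_df.index))  # [1, 2]

# Position-based access works regardless of the labels.
first_summary = drama_df["summary"].iloc[0]

# Alternatively, renumber the rows so that label-based access starts at 0 again.
renumbered = drama_df.reset_index(drop=True)
print(renumbered["summary"][0])
```

Whether to use iloc or reset_index(drop=True) mostly depends on whether the original labels are still needed later.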
You can also use .iterrows() or .itertuples() for this routine. itertuples is a lot faster, but it is a bit more work to handle if you have a lot of columns:
# iterrows() yields (index, row) pairs, so unpack both
for index, row in drama_df.iterrows():
    drama_txt = drama_txt + " " + row['summary']
To go faster:
for index, summary in drama_df[['summary']].itertuples():
    drama_txt = drama_txt + " " + summary
Wait a moment here. You are looking for the str.join() operation.
Simply do this:
drama_txt = ' '.join(drama_df['summary'])
Or:
drama_txt = drama_df['summary'].str.cat(sep=' ')
