Designating a Column for an Operation in Pandas - python

I am not a coder, but am working with pandas in python 3 to modify a program someone else wrote to strip HTML out of a column in a CSV file. In the original code, it asked for user input for the column names as in the code at the bottom, but my csv file will always have the same column headings so I would prefer not to have this input step, instead just including the column name in the program itself.
I have tried to replace this line:
col = input("Enter column name: ")
which works exactly the way it is supposed to when I manually input the column name (outputting a new column with the HTML cleaned), with:
col = df['ColumnName']
and many other variations, but whatever I try gives me various errors. What syntax should I use to have it operate directly on the column I name, rather than requiring the manual input? Thanks so much for the help.
import pandas as pd
import re
import html

def cleanhtml(raw_html):
    # replace every HTML tag with a space
    cleanr = re.compile(r'<.+?>')
    cleantext = re.sub(cleanr, ' ', str(raw_html))
    # collapse runs of whitespace into a single space
    clean = re.sub(r'\s+', ' ', cleantext)
    # decode HTML entities such as &amp;
    return html.unescape(clean)

file = input("Enter CSV File name (without '.csv' at the end): ")
d = pd.read_csv("%s.csv" % file)
df = pd.DataFrame(d)
col = input("Enter column name: ")
df[col][0:5]
df['clean'] = df[col].apply(cleanhtml)

Instead of prompting for the column name, assign the name directly as a string. The variable col has to hold the column's name, not the column itself, because the later df[col] does the lookup. So replace
col = input("Enter column name: ")
with
col = 'columnName'
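A minimal sketch of that change, run on a small in-memory DataFrame instead of the CSV file. The column name 'Description' and the two sample rows are made up for illustration; substitute the real heading from your file.

```python
import html
import re

import pandas as pd


def cleanhtml(raw_html):
    # Replace each tag with a space, collapse whitespace, then decode entities.
    cleantext = re.sub(r'<.+?>', ' ', str(raw_html))
    clean = re.sub(r'\s+', ' ', cleantext)
    return html.unescape(clean)


# Hypothetical data standing in for the contents of the CSV file.
df = pd.DataFrame({'Description': ['<p>Fish &amp; chips</p>', '<b>Tea</b> time']})

# The column name is hard-coded as a plain string (assumed heading).
col = 'Description'
df['clean'] = df[col].apply(cleanhtml)
```

With col = df['ColumnName'], the later df[col] would use a whole Series as the lookup key rather than a single column name, which is the likely source of the errors.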

Related

Automatic transposing Excel user data in a Pandas Dataframe

I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to write pandas code that, at the least, parses the first column and transposes the id and the full name of each user. Could you help with this?
The way that I would tackle it, and I am assuming there are likely to be more efficient ways, is to import the excel file into a dataframe, and then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line into a list. This list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(name_pos[1][-1])
        p_index = counter
        counter += 1
    else:
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos, 'date':date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
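The same idea can be tried without an Excel file. The sketch below wraps the loop in a hypothetical helper, flatten, and feeds it a made-up frame that follows the assumptions listed above (a numbered "Name (Position)" row followed by that person's date and amount rows). The names and values are invented for illustration.

```python
import datetime as dt

import pandas as pd


def flatten(df):
    """Turn header-row / detail-row pairs into one flat row per detail line."""
    rows = []
    counter = 1
    p_index = name = pos = None
    for key, value in zip(df['col1'], df['col2']):
        if key == counter:
            # Header row: "Name (Position)" next to the running index.
            name, _, rest = value.partition(' (')
            pos = rest.rstrip(')')
            p_index = counter
            counter += 1
        else:
            # Detail row: a date in col1 and an amount in col2.
            rows.append({'p_index': p_index, 'name': name, 'position': pos,
                         'date': key.strftime('%d/%m/%Y'), 'amount': value})
    return pd.DataFrame(rows)


# Hypothetical sample data in the assumed two-column layout.
sample = pd.DataFrame({
    'col1': [1, dt.datetime(2020, 1, 15), dt.datetime(2020, 2, 15),
             2, dt.datetime(2020, 1, 20)],
    'col2': ['Ann Lee (Manager)', 100, 150, 'Bob Ray (Clerk)', 200],
})

final_df = flatten(sample)
```

Like the loop above, this relies on the index next to each person increasing by one every time; a file that breaks that assumption would need a different test for header rows.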

how to extract first part of name(first name) in a list that contains full names and discard names with one part

I have a CSV file that contains one column of names. I want Python code that checks every name in the column: if the name has more than one part, it takes just the first part and appends it to a new CSV file, and it skips any name that has just one part.
For Example
input CSV file
Column1
Metarhizium robertsii ARSEF 23
Danio rerio
Parascaris equorum
Hevea
Gossypium
Vitis vinifera
The output CSV file should be
Column1
Metarhizium
Danio
Parascaris
Vitis
You can first create a flag for those values that have more than one word, then use the apply() method and write a lambda function to retrieve the first word in all names.
flag = df.loc[:,'Column1'].str.split(' ').apply(len) > 1
split_names = lambda name: name.split()[0] if (len(name.split())) else None
new_df = df.loc[flag,'Column1'].apply(split_names)
new_df.to_csv('output.csv', index=False)
You can split each name into words, use len to build a mask that keeps only the names with more than one word, and then take the first word of each remaining row.
import pandas as pd
df = pd.read_csv("input.csv")
splitted = df.Column1.apply(lambda x: x.split())
output = splitted[splitted.apply(len) > 1].apply(lambda x: x[0])
output.to_csv("output.csv")
# > ,Column1
# 0,Metarhizium
# 1,Danio
# 2,Parascaris
# 5,Vitis
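The same filter can also be written with pandas' vectorized string accessor instead of apply and lambda. This is a sketch on an in-memory copy of the example column from the question, not a replacement for the answers above.

```python
import pandas as pd

# The example names from the question, built in memory for illustration.
df = pd.DataFrame({'Column1': [
    'Metarhizium robertsii ARSEF 23', 'Danio rerio', 'Parascaris equorum',
    'Hevea', 'Gossypium', 'Vitis vinifera',
]})

# Split every name on whitespace: each row becomes a list of words.
parts = df['Column1'].str.split()

# Keep only names with more than one word, then take the first word.
first_names = parts[parts.str.len() > 1].str[0]
```

Calling first_names.to_csv('output.csv', index=False) would then write the result without the index column.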
Are the names always separated with a space?
You could use the re module and regular expressions, or, if you are looking for something simple, you can use the str.split() method:
for name in column:
    # split once at the first space; returns a list of one or two strings
    split_name = name.split(' ', 1)
    if len(split_name) > 1:
        # write the first part of the name to the new csv, one name per line
        new_csv.write(split_name[0] + '\n')
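For a version of this loop that runs end to end, here is a sketch using the standard csv module. It reads from and writes to io.StringIO buffers so it needs no files on disk; the buffer contents are a made-up subset of the example data, and with real files you would pass open file objects instead.

```python
import csv
import io

# In-memory stand-ins for the input and output CSV files.
source = io.StringIO("Column1\nDanio rerio\nHevea\nVitis vinifera\n")
target = io.StringIO()

reader = csv.reader(source)
writer = csv.writer(target, lineterminator="\n")

# Copy the header row unchanged.
writer.writerow(next(reader))

for row in reader:
    if not row:
        continue  # skip blank lines
    parts = row[0].split(' ', 1)  # split once at the first space
    if len(parts) > 1:
        writer.writerow([parts[0]])  # keep only the first part

result = target.getvalue()
```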

How to split column name separately in python?

I have a CSV file with a lot of columns.
After reading the file, printing the columns shows all the column names as one full string instead of separate names.
I need the column names separately. Can you please help me do this?
code:
df1 = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/AMPIL_DEV/MDM_PRODUCT_VIEW/MDM_PRODUCT_VIEW_H.csv", sep = '|')
print(list(df1.columns))
print(df1['SERIES_ID'][2])
Output:
['RECORD_ID,MDM_ID,SERIES_ID,RELTIO_ID,COUNTRY_ID,PRODUCT_NAME,GROUP_TYPE,JANSSEN_MSTR_PRDCT_NM']
KeyError: SERIES_ID
Desired Output:
['RECORD_ID','MDM_ID','SERIES_ID','RELTIO_ID','COUNTRY_ID','PRODUCT_NAME','GROUP_TYPE','JANSSEN_MSTR_PRDCT_NM']
It looks like you passed the wrong separator, so pandas reads the entire first line as a single column name. Try:
df1 = pd.read_csv("D:/Users/SPate233/Downloads/iMedical/AMPIL_DEV/MDM_PRODUCT_VIEW/MDM_PRODUCT_VIEW_H.csv", sep = ',')
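The effect is easy to reproduce on a small in-memory sample. The three column names below are taken from the question's output; the two data rows are made up.

```python
import io

import pandas as pd

# Comma-separated sample text (data rows are hypothetical).
text = "RECORD_ID,MDM_ID,SERIES_ID\n1,10,A\n2,20,B\n"

# Wrong separator: no '|' is found, so each line becomes one column.
wrong = pd.read_csv(io.StringIO(text), sep='|')

# Correct separator: the header is split into three column names.
right = pd.read_csv(io.StringIO(text), sep=',')

wrong_columns = list(wrong.columns)
right_columns = list(right.columns)
```

With the wrong separator, a lookup such as wrong['SERIES_ID'] raises the KeyError from the question, because the only column is the whole header string.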

Write list to specific column in csv

I'm trying to write the data from my list to just column 4
namelist = ['PEAR']
for name in namelist:
    for man_year in yearlist:
        for man_month in monthlist:
            with open('{2}\{0}\{1}.csv'.format(man_year,man_month,name),'w') as filename:
                writer = csv.writer(filename)
                writer.writerow(name)
            time.sleep(0.01)
it outputs to a csv like this
P E A R
4015854 234342 2442343 234242
How can I get it to go on just the 4th column?
PEAR
4015854 234342 2442343 234242
Replace the line writer.writerow(name) with,
writer.writerow(['', '', '', name])
When you pass name to the csv writer, it treats the string as an iterable and writes each character to its own column.
So, to get rid of this problem, change the following line:
writer.writerow(name)
to:
writer.writerow([''] * (len(other_row)-1) + [name])
Here other_row can be any of the other rows, but if you are sure about the length you can do something like:
writer.writerow([''] * (length-1) + [name])
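A short sketch of the difference, writing to an io.StringIO buffer rather than a file so the result can be inspected directly. The row length of 4 is an assumption taken from the question.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

name = 'PEAR'
length = 4  # assumed number of columns in the other rows

writer.writerow(name)                             # string: one character per column
writer.writerow(['', '', '', name])               # list: empty cells, then the name
writer.writerow([''] * (length - 1) + [name])     # same row, built from the length

lines = buffer.getvalue().splitlines()
```

The first row comes out as P,E,A,R, while the other two put PEAR in the fourth column.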
Instead of writing '' to cells you don't want to touch, you could work with the data in a pandas DataFrame and use df.at. For example, df.at[index, ColumnName] = 10 changes only the value of that specific cell.
You can read more about it here: Set value for particular cell in pandas DataFrame using index
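As a small illustration of df.at, assuming the data is already in a DataFrame (the frame and its column names here are made up):

```python
import pandas as pd

# Hypothetical two-column frame.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Set a single cell by row label and column name; nothing else changes.
df.at[1, 'B'] = 10
```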

Python pandas trouble with storing result in variable

I'm using pandas to handle a CSV file, but I'm having trouble storing a result in a variable and printing it out as it is.
This is the code that I have.
df = pd.read_csv(MY_FILE.csv, index_col=False, header=0)
df2 = df[(df['Name'])]
# Trying to get the result of Name to the variable
n = df2['Name']
print(n)
And the result that i get:
1 jake
Name: Name, dtype: object
My Question:
Is it possible to just have "Jake" stored in a variable "n" so that i can call it out whenever i need it?
EG: Print (n)
Result: Jake
This is the code that I have constructed
def name_search():
    list_to_open = input("Which list to open: ") + ".csv"
    directory = "C:\Users\Jake Wong\PycharmProjects\box" "\\" + list_to_open
    if os.path.isfile(directory):
        # Search for NAME
        Name_id = input("Name to search for: ")
        df = pd.read_csv(directory, index_col=False, header=0)
        df2 = df[(df['Name'] == Name_id)]
        # Defining the name to save the file as
        n = df2['Name'].ix[1]
        print(n)
This is what is in the csv file
S/N,Name,Points,test1,test2,test3
s49,sing chun,5000,sc,90 sunrsie,4984365132
s49,Alice Suh,5000,jake,88 sunrsie,15641816
s1231,Alice Suhfds,5000,sw,54290 sunrsie,1561986153
s49,Jake Wong,5000,jake,88 sunrsie,15641816
The problem is that n = df2['Name'] is actually a Pandas Series:
type(df.loc[df.Name == 'Jake Wong'].Name)
pandas.core.series.Series
If you just want the value, you can use values[0]: values is the underlying array behind the pandas object, and in this case it has length 1, so you are taking its only element.
n = df2['Name'].values[0]
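A self-contained sketch of this, using a few rows from the question's CSV built in memory. The guard against an empty match is an addition for illustration, not part of the original code.

```python
import pandas as pd

# A subset of the rows from the question, built in memory.
df = pd.DataFrame({
    'S/N': ['s49', 's1231', 's49'],
    'Name': ['Alice Suh', 'Alice Suhfds', 'Jake Wong'],
    'Points': [5000, 5000, 5000],
})

# Filtering returns a Series, even when only one row matches.
matches = df.loc[df['Name'] == 'Jake Wong', 'Name']

# Take the first matching value as a plain Python string,
# or None when no row matches (indexing an empty result would raise IndexError).
n = matches.values[0] if not matches.empty else None

# Positional indexing on the Series gives the same value.
n_alt = matches.iloc[0] if not matches.empty else None
```

Note that the .ix indexer used in the question was deprecated and later removed from pandas, so .iloc (by position) or .loc (by label) is the safer choice.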
Also your CSV is not formatted properly: It's not enough to have things lined up in columns like that, you need to have a consistent delimiter (a comma or a tab usually) between columns, so the parser can know when one column ends and another one starts. Can you fix your csv to look like this?:
S/n,Name,points
s56,Alice Suh,5000
s49,Jake Wong,5000
Otherwise we can work on another solution for you but we will probably use regex rather than pandas.
