Python pandas trouble with storing result in variable - python

I'm using pandas to process a CSV file, but I'm having trouble storing a result in a variable and printing it as a plain value.
This is the code that I have.
df = pd.read_csv("MY_FILE.csv", index_col=False, header=0)
df2 = df[(df['Name'])]
# Trying to get the result of Name to the variable
n = df2['Name']
print(n)
And the result that I get:
1 jake
Name: Name, dtype: object
My question:
Is it possible to have just "Jake" stored in a variable n so that I can use it whenever I need it?
E.g. print(n)
Result: Jake
This is the code that I have constructed
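import os
import pandas as pd

def name_search():
    list_to_open = input("Which list to open: ") + ".csv"
    # Raw string: in a normal literal, "\U" starts a Unicode escape and is a syntax error in Python 3
    directory = os.path.join(r"C:\Users\Jake Wong\PycharmProjects\box", list_to_open)
    if os.path.isfile(directory):
        # Search for NAME
        Name_id = input("Name to search for: ")
        df = pd.read_csv(directory, index_col=False, header=0)
        df2 = df[(df['Name'] == Name_id)]
        # Defining the name to save the file as
        # (.ix was removed in pandas 1.0; this is the line the question is about)
        n = df2['Name'].ix[1]
        print(n)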
This is what is in the csv file
S/N,Name,Points,test1,test2,test3
s49,sing chun,5000,sc,90 sunrsie,4984365132
s49,Alice Suh,5000,jake,88 sunrsie,15641816
s1231,Alice Suhfds,5000,sw,54290 sunrsie,1561986153
s49,Jake Wong,5000,jake,88 sunrsie,15641816

The problem is that n = df2['Name'] is actually a Pandas Series:
type(df.loc[df.Name == 'Jake Wong'].Name)
pandas.core.series.Series
If you just want the value, use values[0]: values is the underlying NumPy array behind the pandas object, which in this case has length 1, so you are simply taking its first element. df2['Name'].iloc[0] does the same thing by position.
n = df2['Name'].values[0]
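As a rough sketch of the idea above, the snippet below filters on the Name column and pulls out a single scalar. The in-memory CSV text is a hypothetical stand-in for the asker's file (only the first three columns are reproduced), and the empty-result guard is an addition, not part of the original answer.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the CSV from the question (first three columns only).
csv_text = """S/N,Name,Points
s49,sing chun,5000
s49,Alice Suh,5000
s1231,Alice Suhfds,5000
s49,Jake Wong,5000
"""

df = pd.read_csv(io.StringIO(csv_text))

# Boolean filtering plus a column label gives a Series, not a plain string.
matches = df.loc[df["Name"] == "Jake Wong", "Name"]

# Guard against "no rows matched" before taking the first element by position.
n = matches.iloc[0] if not matches.empty else None

print(n)
```

Taking the element by position avoids depending on the row label, which is 3 here rather than 1.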
Also, your CSV is not formatted properly. It's not enough to have things lined up in columns like that; you need a consistent delimiter (usually a comma or a tab) between columns, so the parser knows where one column ends and the next one starts. Can you fix your CSV to look like this?
S/n,Name,points
s56,Alice Suh,5000
s49,Jake Wong,5000
Otherwise we can work on another solution for you, but it will probably use regex rather than pandas.

Related

how to extract first part of name(first name) in a list that contains full names and discard names with one part

I have a CSV file that contains one column of names. I want Python code that checks every name in the column: if the name has more than one part, it takes just the first part and appends it to a new CSV file, and it skips any name that has only one part.
For Example
input CSV file
Column1
Metarhizium robertsii ARSEF 23
Danio rerio
Parascaris equorum
Hevea
Gossypium
Vitis vinifera
The output CSV file should be
Column1
Metarhizium
Danio
Parascaris
Vitis
You can first create a flag for those values that have more than one word, then use the apply() method and write a lambda function to retrieve the first word in all names.
flag = df.loc[:,'Column1'].str.split(' ').apply(len) > 1
split_names = lambda name: name.split()[0] if (len(name.split())) else None
new_df = df.loc[flag,'Column1'].apply(split_names)
new_df.to_csv('output.csv', index=False)
You can split each name, use len to build a mask that keeps only the names with more than one part, and then take the first element of each remaining row.
import pandas as pd
df = pd.read_csv("input.csv")
splitted = df.Column1.apply(lambda x: x.split())
output = splitted[splitted.apply(len) > 1].apply(lambda x: x[0])
output.to_csv("output.csv")
# > ,Column1
# 0,Metarhizium
# 1,Danio
# 2,Parascaris
# 5,Vitis
Are the names always separated by a space?
You could use the re module with regular expressions, or, if you are looking for something simpler, the str.split() method:
for name in column:
    split_name = name.split(' ', 1)  # split once at the first space; returns a list of strings
    if len(split_name) > 1:
        new_csv.write(split_name[0] + '\n')  # write the first part of the name to the new csv, one per line
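The same filtering can also be sketched with pandas string methods instead of a Python loop. This is only an illustration: the CSV text is the sample from the question held in memory, and the variable names are made up for the example.

```python
import io
import pandas as pd

# Sample input from the question, held in memory instead of a file.
csv_text = """Column1
Metarhizium robertsii ARSEF 23
Danio rerio
Parascaris equorum
Hevea
Gossypium
Vitis vinifera
"""

df = pd.read_csv(io.StringIO(csv_text))

parts = df["Column1"].str.split()      # list of words for every row
has_many = parts.str.len() > 1         # True where the name has more than one part
first_words = parts[has_many].str[0]   # first word of the names that were kept

print(first_words.tolist())
```

Calling first_words.to_csv("output.csv", index=False) would then write the result; the file name is an assumption.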

Designating a Column for an Operation in Pandas

I am not a coder, but am working with pandas in python 3 to modify a program someone else wrote to strip HTML out of a column in a CSV file. In the original code, it asked for user input for the column names as in the code at the bottom, but my csv file will always have the same column headings so I would prefer not to have this input step, instead just including the column name in the program itself.
I have tried to replace this line:
col = input("Enter column name: ")
which works exactly the way it is supposed to when I manually input the column name (outputting a new column with the HTML cleaned), with:
col = df['ColumnName']
and many other variations, but whatever I try gives me various errors. What syntax should I use to have it operate directly on the column I name, rather than requiring manual input? Thanks so much for the help.
import pandas as pd
import re
import html
def cleanhtml(raw_html):
    cleanr = re.compile('<.+?>')  # matches any HTML tag
    cleantext = re.sub(cleanr, ' ', str(raw_html))
    clean = re.sub(r'\s+', ' ', cleantext)  # collapse runs of whitespace into one space
    return html.unescape(clean)

file = input("Enter CSV File name (without '.csv' at the end): ")
d = pd.read_csv("%s.csv" % file)
df = pd.DataFrame(d)
col = input("Enter column name: ")
df[col][0:5]
df['clean'] = df[col].apply(cleanhtml)
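Instead of asking for the column name, replace the input call with the name of the column you want, written as a string. That is, replace
col = input("Enter column name: ")
with
col = 'columnName'
The rest of the program uses col as a label in df[col], so it has to hold the column's name. col = df['ColumnName'] stores the column's data instead, which is not a valid label for df[col].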
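A minimal sketch of the hardcoded version follows. The DataFrame contents and the column name "Description" are assumptions made up for the example; cleanhtml is the function from the question.

```python
import html
import re
import pandas as pd

def cleanhtml(raw_html):
    """Strip tags, collapse whitespace, and decode HTML entities."""
    cleantext = re.sub(r'<.+?>', ' ', str(raw_html))
    clean = re.sub(r'\s+', ' ', cleantext)
    return html.unescape(clean)

# Hypothetical data standing in for the CSV; "Description" is an assumed column name.
df = pd.DataFrame({"Description": ["<p>Hello &amp; <b>world</b></p>", "plain text"]})

col = "Description"                      # the column label, hardcoded as a string
df["clean"] = df[col].apply(cleanhtml)   # same call as before, no input() needed

print(df["clean"].tolist())
```

Because each tag is replaced by a space, cleaned values can keep a leading or trailing space; adding .str.strip() afterwards would remove it if that matters.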

Pandas - Overwrite single column with new values, retain additional columns; overwrite original files

I'm fairly new to Python. I have a CSV with two columns, and I need the code to perform a simple calculation on the first column while retaining the information in the second. The code currently performs the calculation (albeit only on the first CSV in the list, and there are numerous files), but I haven't figured out how to overwrite the values in each file while keeping the second column unchanged. I'd like it to save over the original files with the new calculations. Additionally, the originals have no header, and pandas automatically assigns a numeric value.
import os
import pandas as pd
def find_csv(topdir, suffix='.csv'):
    filenames = os.listdir(topdir)
    csv_list = [name for name in filenames if name.endswith(suffix)]
    fp_list = []
    for csv in csv_list:
        fp = os.path.join(topdir, csv)
        fp_list.append(fp)
    return fp_list

def wn_to_um(wn):
    um = 10000/wn
    return um

for f in find_csv('C:/desktop/test'):
    readit = pd.read_csv(f, usecols=[0])
    convert = wn_to_um(readit)
    df = pd.DataFrame(convert)
    df.to_csv('C:/desktop/test/whatever.csv')
I suppose you just have to make minor changes to your code.
def wn_to_um(wn):
    wn.iloc[:,0] = 10000/wn.iloc[:,0]  # perform the operation on the first column only
    return wn

for f in find_csv('C:/desktop/test'):
    readit = pd.read_csv(f, header=None)  # read the whole file; the originals have no header row
    convert = wn_to_um(readit)  # the second column is passed through unchanged
    convert.to_csv(f, header=False, index=False)  # writing to the same path replaces the original file
Say you have a column named 'X' which you want to divide by 10,000. You can store this as X and then divide each element in X like so:
X = df['X']
new_x = [i / 10000 for i in X]  # same result as the vectorized X / 10000
From here, rewriting the column in the dataframe is very simple:
df['X'] = new_x
Just update your second function as:
def wn_to_um(wn):
    wn.iloc[:,0] = 10000/wn.iloc[:,0]
    return wn
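Putting the pieces together, here is one possible sketch of the whole round trip: read a headerless file, convert the first column, and save over the original. The temporary directory, file names, and sample values are made up so the example can run anywhere; convert_in_place is a hypothetical helper, not a pandas function.

```python
import os
import tempfile
import pandas as pd

def convert_in_place(path):
    """Replace the first column with 10000 / value and overwrite the file."""
    df = pd.read_csv(path, header=None)          # the originals have no header row
    first = df.columns[0]
    df[first] = 10000 / df[first]                # the second column is left untouched
    df.to_csv(path, header=False, index=False)   # same path, so the original is replaced

# Demo on throwaway files in a temporary directory (hypothetical data).
with tempfile.TemporaryDirectory() as topdir:
    for name, text in {"a.csv": "1000,x\n2000,y\n", "b.csv": "4000,z\n"}.items():
        with open(os.path.join(topdir, name), "w") as fh:
            fh.write(text)

    for name in sorted(os.listdir(topdir)):
        if name.endswith(".csv"):
            convert_in_place(os.path.join(topdir, name))

    # Read everything back to confirm what ended up on disk.
    result = {
        name: pd.read_csv(os.path.join(topdir, name), header=None).values.tolist()
        for name in sorted(os.listdir(topdir))
    }

print(result)
```

Passing header=False and index=False keeps the files in their original two-column, headerless shape, so the script can be pointed at them again without picking up extra columns.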

Allow duplicate columns in Pandas

I'm splitting a large CSV (containing stock financial data) file into smaller chunks. The format of the CSV file is different. Something like an Excel pivot table. The first few rows of the first column contain some headers.
Company name, id, etc. are repeated across the following columns. Because one single company has more than one attribute, not like one company has one column only.
After the first few rows, the columns then start resembling a typical data frame where headers are in columns instead of rows.
Anyways, what I'm trying to do is to make Pandas allow duplicate column headers and not make it add ".1", ".2", ".3", etc after the headers. I know Pandas does not allow this natively, is there a workaround? I tried to set header = None on read_csv but it throws a tokenization error which I think makes sense. I just can't think of an easy way.
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
EDIT:
Following https://github.com/pandas-dev/pandas/issues/19383, I added:
final_df.columns = final_df.iloc[0]
final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
So, full code:
import pandas as pd
csv_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4.csv"
#df = pd.read_csv(csv_path, header=1, dtype='unicode', sep=';', low_memory=False, error_bad_lines=False)
df = pd.read_csv(csv_path, header = 1, dtype='unicode', sep=';', index_col=False)
print("I read in a dataframe with {} columns and {} rows.".format(
    len(df.columns), len(df)
))
filename = 1
#column increment
x = 30 * 59
for column in df:
    loc = df.columns.get_loc(column)
    if loc == (x * filename) + 1:
        y = filename - 1
        a = (x * y) + 1
        b = (x * filename) + 1
        date_df = df.iloc[:, :1]
        out_df = df.iloc[:, a:b]
        final_df = pd.concat([date_df, out_df], axis=1, join='inner')
        out_path = "C:\\Users\\ThirdHandBD\\Desktop\\Data Splitting\\pd-split\\chunk4-part" + str(filename) + ".csv"
        final_df.columns = final_df.iloc[0]
        final_df = final_df.reindex(final_df.index.drop(0)).reset_index(drop=True)
        final_df.to_csv(out_path, index=False)
        #out_df.to_csv(out_path)
        filename += 1
# This should be the same as df, but with only the first column.
# Check it with similar code to above.
Now, the entire first row is gone. But, the expected output is for the header row to be replaced with the reset index, without the ".1", ".2", etc.
The SimFin ID row is no longer there.
This is how I did it:
final_df.columns = final_df.columns.str.split('.').str[0]
Reference:
https://pandas.pydata.org/pandas-docs/stable/text.html
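To illustrate, the sketch below shows the renaming pandas applies to duplicate headers and one way to undo it. Note that it swaps the split('.') call above for a regular expression that removes only a trailing ".<digits>", so a name such as "Net.Income" is left alone. The column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file with a duplicated header and a header that contains a period.
csv_text = "Date,Revenue,Revenue,Net.Income\n2019,1,2,3\n"

df = pd.read_csv(io.StringIO(csv_text))
mangled = list(df.columns)   # pandas has renamed the second "Revenue" to "Revenue.1"

# Remove only a trailing ".<digits>" suffix from each label.
df.columns = df.columns.str.replace(r"\.\d+$", "", regex=True)

print(mangled)
print(list(df.columns))
```

A column whose real name ends in ".<digits>" (for example "Q.1") would also be shortened by this pattern, so it is only safe when no such names exist.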
The solution below ensures that other column names in the dataframe that contain a period ('.') are not modified:
import pandas as pd
from csv import DictReader
csv_file_loc = "file.csv"
# Read csv
df = pd.read_csv(csv_file_loc)
# Get column names from csv file using DictReader (the with block closes the file afterwards)
with open(csv_file_loc, 'r', newline='') as f:
    col_names = DictReader(f).fieldnames
# Rename columns
df.columns = col_names
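The same idea can be tried without a file on disk. This sketch reads only the header row with csv.reader (a swap for DictReader, chosen because just the first row is needed) and assigns it back; the CSV text is hypothetical.

```python
import csv
import io
import pandas as pd

# Hypothetical file contents with a duplicated header.
csv_text = "Date,Revenue,Revenue,Net.Income\n2019,1,2,3\n"

# The standard csv module returns the header row exactly as written.
header = next(csv.reader(io.StringIO(csv_text)))

df = pd.read_csv(io.StringIO(csv_text))

# Only overwrite the labels when the counts line up.
if len(header) == len(df.columns):
    df.columns = header

print(list(df.columns))
```

This assumes the header is the first line of the file and that the delimiter is the same in both reads (pass delimiter=';' to csv.reader for a semicolon-separated file).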
I know I'm pretty late to the draw on this one, but I'm leaving the solution I came up with in case anyone else wanders across this as I have.
Firstly, the linked question has a pretty nice and dynamic solution that seems to work well even for high column counts. I came across that after I made my solution, haha. Check it out here. Another answer on this thread uses the csv library to read the column names, as it doesn't seem to modify duplicates like Pandas does. That should work fine, but I wanted to avoid using any extra libraries, especially considering I was originally using csv and then upgraded to Pandas for better functionality.
Now here's my solution. I'm sure it could be done more nicely but this does the job for what I needed and is pretty dynamic, from what I can tell. It basically goes through the columns, checks if it can split the string based on the rightmost "." (that's the rpartition), then does a few more checks from there.
It checks:
Is this string in the colMap? The colMap keeps track of all of the column names, duplicate or not. If this comes back true, then that means it's a duplicate of another column that came before it.
Is the string after the rightmost "." a number? All of the columns are strings, so this just makes sure that whatever it is can be converted into a number to prevent grabbing some other random column that meets previous criteria but isn't actually a dupe from Pandas. eg. "DupeCol" and "DupeCol.Stuff" wouldn't get picked up, but "DupeCol" and "DupeCol.1" would.
Does the number that comes after the rightmost "." match up to the current count of duplicates in the colMap? Seeing as the colMap contains all of the names of the columns, duplicates or not, this will ensure that we're not grabbing a user-named column that managed to overlap with the ".number" convention that Pandas uses. Eg. if a user had named two columns "DupeCol" and "DupeCol.6", it wouldn't get picked up unless there were 6 "DupeCol"s preceding "DupeCol.6", indicating that it almost had to be Pandas that named it that way, as opposed to the user. This part is definitely a bit overkill, but I felt like being extra thorough.
colMap = []
for col in df.columns:
    if col.rpartition('.')[0]:
        colName = col.rpartition('.')[0]
        inMap = col.rpartition('.')[0] in colMap
        lastIsNum = col.rpartition('.')[-1].isdigit()
        dupeCount = colMap.count(colName)
        if inMap and lastIsNum and (int(col.rpartition('.')[-1]) == dupeCount):
            colMap.append(colName)
            continue
    colMap.append(col)
df.columns = colMap
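For anyone who wants to try the three checks in isolation, here is the same logic sketched as a standalone function that takes a plain list of labels. The function name and the sample labels are made up for the example.

```python
def undo_pandas_suffixes(columns):
    """Return the labels with pandas-style '.N' duplicate suffixes removed."""
    restored = []
    for col in columns:
        base, _, suffix = col.rpartition('.')
        is_dupe = (
            bool(base)                                # there is something before the last '.'
            and suffix.isdigit()                      # the part after it is a number
            and base in restored                      # the base name was seen before
            and int(suffix) == restored.count(base)   # the number matches the duplicate count
        )
        restored.append(base if is_dupe else col)
    return restored

# Hypothetical labels covering the cases described above.
cols = ["DupeCol", "DupeCol.1", "DupeCol.2", "DupeCol.Stuff", "Other", "Other.6"]
print(undo_pandas_suffixes(cols))
```

"DupeCol.Stuff" fails the number check and "Other.6" fails the count check, so both are kept as they are.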
Hopefully this helps someone! Feel free to comment if you think it could use any improvements. I don't entirely love using "continue" in my code, but I'm not sure if that's because it's actually bad practice or just me reading random people complain about it too much. I think it doesn't make the code too unreadable here and prevents the need for duplicating the "else" statement; but let me know if there's a way to improve that or anything otherwise. I'm always looking to learn!
If you know the types of all your data, you may consider loading the CSV without a header first.
df = pd.read_csv(csv_file, header=None)
df.columns = df.iloc[0] # replace column with first row
df = df.drop(0) # remove the first row
(Note that drop removes the row by its index label, which assumes your index is unique; that may not hold if you use the index_col argument of pd.read_csv.)
Caveat: the above solution loses the dtype information.
One way to fix that:
# turn each column into numeric
df = df.apply(lambda col: pd.to_numeric(col, errors='ignore'), axis=0)
Otherwise, you may consider reading the CSV twice to get the dtype information and apply the correct conversion.
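A small end-to-end sketch of this approach is shown below. Because errors='ignore' in pd.to_numeric is deprecated in recent pandas releases (2.2 and later, as far as I know), the conversion here catches the exception explicitly instead. The CSV text and the helper name to_numeric_if_possible are assumptions for the example.

```python
import io
import pandas as pd

# Hypothetical file with a duplicated "value" header.
csv_text = "id,value,value\n1,2.5,x\n2,3.5,y\n"

df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = df.iloc[0]                   # first row becomes the header, duplicates kept
df = df.drop(0).reset_index(drop=True)    # remove that row from the data

def to_numeric_if_possible(col):
    """Convert a column to numbers, or return it unchanged if that fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Work by position, because duplicate labels make df["value"] ambiguous.
converted = pd.concat(
    [to_numeric_if_possible(df.iloc[:, i]) for i in range(df.shape[1])], axis=1
)

print(converted.dtypes)
```

Selecting columns by position with iloc matters here: with duplicate labels, df["value"] returns both columns at once.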

Write list to specific column in csv

I'm trying to write the data from my list to just column 4
namelist = ['PEAR']
for name in namelist:
    for man_year in yearlist:
        for man_month in monthlist:
            with open('{2}\{0}\{1}.csv'.format(man_year,man_month,name),'w') as filename:
                writer = csv.writer(filename)
                writer.writerow(name)
            time.sleep(0.01)
it outputs to a csv like this
P E A R
4015854 234342 2442343 234242
How can I get it to go on just the 4th column?
PEAR
4015854 234342 2442343 234242
Replace the line writer.writerow(name) with,
writer.writerow(['', '', '', name])
When you pass name to the csv writer, it treats the string as an iterable and writes each character to its own column.
To get rid of this problem, change the following line:
writer.writerow(name)
With:
writer.writerow([''] * (len(other_row)-1) + [name])
Here other_row can be any of the other rows; if you already know the row length, you can do something like:
writer.writerow([''] * (length-1) + [name])
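The difference is easy to see by writing to an in-memory buffer. This is just an illustrative sketch: the buffer stands in for the real file, and the numeric row is copied from the question.

```python
import csv
import io

name = "PEAR"

buf = io.StringIO()                          # stands in for the real output file
writer = csv.writer(buf, lineterminator="\n")

writer.writerow(name)                        # a bare string is iterated character by character
writer.writerow(["", "", "", name])          # three empty cells, then the name in column 4
writer.writerow([4015854, 234342, 2442343, 234242])

lines = buf.getvalue().splitlines()
print(lines)
```

The first line shows the original problem, and the second shows the padded row that puts the name in the fourth column.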
Instead of writing '' to cells you don't want to touch, you could use df.at instead. For example, you could write df.at[index, ColumnName] = 10 which would change only the value of that specific cell.
You can read more about it here: Set value for particular cell in pandas DataFrame using index
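As a rough illustration of the df.at suggestion, the sketch below builds a small empty grid and sets a single cell. The column labels c1 to c4 and the grid size are assumptions; the question itself uses the csv module rather than a DataFrame.

```python
import pandas as pd

# Hypothetical 2x4 grid of empty strings; c1..c4 are made-up column labels.
df = pd.DataFrame("", index=range(2), columns=["c1", "c2", "c3", "c4"])

# Set exactly one cell: row label 0, column "c4".
df.at[0, "c4"] = "PEAR"

print(df.iloc[0].tolist())
```

df.at addresses a single cell by row label and column label, so the other cells keep whatever they already contained.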
