I am getting an error in my code because I tried to make a dataframe by calling an element from a CSV. I have two columns I call from a file: CompanyName and QualityIssue. There are three types of quality issues: Equipment Quality, User, and Neither. I run into problems trying to reference df.Equipment Quality, which obviously doesn't work because of the space. I want to take Equipment Quality from the original file and replace the space with an underscore.
input:
Top Calling Customers, Equipment Quality, User, Neither,
Customer 3, 2, 2, 0,
Customer 1, 0, 2, 1,
Customer 2, 0, 1, 0,
Customer 4, 0, 1, 0,
Here is my code:
import numpy as np
import pandas as pd
import pandas.util.testing as tm; tm.N = 3
# Get the data.
data = pd.DataFrame.from_csv('MYDATA.csv')
# Group the data by calling CompanyName and QualityIssue columns.
byqualityissue = data.groupby(["CompanyName", "QualityIssue"]).size()
# Make a pandas dataframe of the grouped data.
df = pd.DataFrame(byqualityissue)
# Change the formatting of the data to match what I want SpiderPlot to read.
formatted = df.unstack(level=-1)[0]
# Replace NaN values with zero.
formatted[np.isnan(formatted)] = 0
includingtotals = pd.concat([formatted, pd.DataFrame(formatted.sum(axis=1),
                                                     columns=['Total'])], axis=1)
sortedtotal = includingtotals.sort_index(by=['Total'], ascending=[False])
sortedtotal.to_csv('byqualityissue.csv')
This seems to be a frequently asked question and I tried lots of the solutions but they didn't seem to work. Here is what I tried:
with open('byqualityissue.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',', quoting=csv.QUOTE_NONE)
    return [[x.strip() for x in row] for row in reader]
sentence.replace(" ", "_")
And
sortedtotal['QualityIssue'] = sortedtotal['QualityIssue'].map(lambda x: x.rstrip(' '))
And what I thought was the most promising from here http://pandas.pydata.org/pandas-docs/stable/text.html:
formatted.columns = formatted.columns.str.strip().str.replace(' ', '_')
but I got this error: AttributeError: 'Index' object has no attribute 'str'
Thanks for your help in advance!
Try:
formatted.columns = [x.strip().replace(' ', '_') for x in formatted.columns]
As I understand your question, the following should work (test it out with inplace=False to see how it looks first if you want to be careful):
sortedtotal.rename(columns=lambda x: x.replace(" ", "_"), inplace=True)
And if you have white space surrounding the column names, like: "This example "
sortedtotal.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
which strips leading/trailing whitespace, then converts internal spaces to "_".
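For example, on a small frame with columns shaped like the ones above (a quick sketch, not your actual data):
import pandas as pd

demo = pd.DataFrame(columns=['Equipment Quality', ' User ', 'Neither'])
demo.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
print(demo.columns.tolist())  # ['Equipment_Quality', 'User', 'Neither']
After the rename, attribute access like demo.Equipment_Quality works as expected.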
I want to shuffle this dataset to get a random sample. It has 1.6 million rows, but the first rows are all class 0 and the last ones class 4, so I need to pick samples randomly to get more than one class. The actual code prints only class 0 (meaning just one class). I took advice from this platform, but it doesn't work.
import random

fid = open("sentiment_train.csv", "r")
# note: the argument to readlines() is a size hint in bytes, not a row count,
# so this may stop after roughly the first 16 MB of the file
li = fid.readlines(16000000)
random.shuffle(li)
fid2 = open("shuffled_train.csv", "w")
fid2.writelines(li)
fid2.close()
fid.close()
sentiment_onefourty_train = pd.read_csv('shuffled_train.csv', header=0, delimiter=",", usecols=[0, 5], nrows=100000)
sentiment_onefourty_train.columns = ['target', 'text']
print(sentiment_onefourty_train['target'].value_counts())
Because you read in your data using pandas, you can also do the randomisation in a different way, using DataFrame.sample:
df = pd.read_csv('sentiment_train.csv', header=0, delimiter=",", usecols=[0, 5])
df.columns = ['target', 'text']
df1 = df.sample(n=100000)
If this fails, it is worth checking how many unique values there are and how often each appears. If the first 1,599,999 rows are class 0 and only the last one is class 4, the chances are that you won't get any 4s in the sample.
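In that case a stratified sample is safer. Here is a minimal sketch (untested against your file, assuming the column names above) that draws up to 50,000 rows from each class:
df1 = df.groupby('target', group_keys=False).apply(lambda g: g.sample(n=min(len(g), 50000)))
print(df1['target'].value_counts())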
I am reading in a CSV file to calculate some stats through Python.
I know that I can use converters at the start of the program to adjust for some of the potential data issues, but for some reason, when I try to do that, the results come out inflated.
It's a 20-column CSV with over 1000 rows of data.
Public domain datalink is here: https://www.kaggle.com/canggih/anime-data-score-staff-synopsis-and-genre
The CSV is structured like so:
Title,Types,Episodes,Status,Start airing,End airing,Starting season,Broadcast time,Producers,Licensors,Studios,Sources,Genres,Duration,Rating,Score,Scored by,Members,Favorites,Description
Fullmetal Alchemist: Brotherhood,TV,64,Finished Airing,2009-4-5,2010-7-4,Spring,Sundays at 17:00 (JST),"Aniplex,Square Enix,Mainichi Broadcasting System,Studio Moriken","Funimation,Aniplex of America",Bones,Manga,"Action,Military,Adventure,Comedy,Drama,Magic,Fantasy,Shounen",24 min. per ep.,R,9.25,719706,1176368,105387,"""In order for something to be obtained, something of equal value must be lost.""
My problem is that when I try to run the program, it keeps telling me it can't convert 'Episodes' to a float. I know this, so I tried skiprows=1, removing header=None, and rewriting the converter as converters = {2 : lambda s: float(s.replace('-','0'))},{2 : lambda e: float(e.replace('Episodes','0'))} in the pd.read_csv line, and I still can't get past it.
Is there a way in my converters = {2 : lambda s: float(s.replace('-','0'))} part of the code to list multiple converter requirements, similar to the first .replace() I have in there?
I am not the strongest coder in Python, but I know there is a way to fix this; I just can't see it.
My updated code so far:
#Based on: https://datatofish.com/use-pandas-to-calculate-stats-from-an-imported-csv-file/
import pandas as pd
#import statistics
import re
#I had to put in the .replace for 'Episodes', otherwise it keeps trying to convert 'Episodes' to a float value.
#Is there a better way to fix this?
df = pd.read_csv(r'dataanime.csv', encoding='utf-8', header=None, skiprows=1,
                 converters={2: lambda s: float(s.replace('Episodes', '').join(s.replace('-', '0')))})
df.columns = ['Title','Type','Episodes','Status','Start_airing','End_airing','Starting_season','Broadcast_time','Producers','Licensors','Studios','Sources','Genres','Duration','Rating','Score','Scored_by','Members','Favorites','Description']
# block 1 - simple stats
mean1 = df['Episodes'].mean()      # Results are off
sum1 = df['Episodes'].sum()        # Results are off
max1 = df['Episodes'].max()        # Results are off
min1 = df['Episodes'].min()
count1 = df['Episodes'].count()    # This has the correct number of shows
median1 = df['Episodes'].median()  # Looks right, maybe?
std1 = df['Episodes'].std()        # Results are off
var1 = df['Episodes'].var()        # Results are off
# block 2 - group by
groupby_sum1 = df.groupby(['Genres'])['Episodes'].sum()
groupby_count1 = df.groupby(['Genres'])['Episodes'].count()
#Opens the output file for the results
h_file = open("csv_stats.html","w")
#Start writing the HTML lines
h_file.write("This list shows the statistics calculated from the dataanime CSV.")
h_file.write('<br>')
# print block 1
h_file.write('Mean episodes: ' + str(mean1))
h_file.write('<br>')
h_file.write('Sum of episodes: ' + str(sum1))
h_file.write('<br>')
h_file.write('Max episodes: ' + str(max1))
h_file.write('<br>')
h_file.write('Min episodes: ' + str(min1))
h_file.write('<br>')
h_file.write('Count of shows: ' + str(count1))
h_file.write('<br>')
h_file.write('Median episodes: ' + str(median1))
h_file.write('<br>')
h_file.write('Std of episodes: ' + str(std1))
h_file.write('<br>')
h_file.write('Var of episodes: ' + str(var1))
h_file.write('<br>')
# print block 2
h_file.write('Sum of values, grouped by the Genres: ' + '<li>' + str(groupby_sum1) + '</li>')
h_file.write('<br>')
h_file.write('Count of values, grouped by the Genres: ' + str(groupby_count1))
h_file.close()
This way clears out the error message, but because 'Episodes' is now converted to 0, the numbers the code spits out are way off.
My results (screenshot omitted) are obviously no good either. How can I either set the header so the word 'Episodes' is bypassed and the calculations still run, or rewrite the pd.read_csv(...) call above to correct for this?
It's easier to read CSV files by letting pandas figure out the headers itself. If you pass neither header nor skiprows, pandas will infer that the first line of the CSV is the header line and name your columns accordingly. To deal with the "-" Episodes values, you can set na_values to tell pandas that "-" in that column means NaN, and use dropna() to exclude those rows when calculating statistics.
df = pd.read_csv("dataanime.csv", encoding="utf-8", na_values={"Episodes": "-"})
# calculate stats on the Episodes columns
episode_values = df["Episodes"].dropna()
mean1 = episode_values.mean()
sum1 = episode_values.sum()
...
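Note that pandas' reduction methods skip NaN by default, so the explicit dropna() is mostly for clarity. For example (a quick sketch, assuming the file parses as above):
print(df["Episodes"].dtype)   # float64, since the NaNs force a float column
print(df["Episodes"].mean())  # same result as episode_values.mean()
The same frame then works for the grouped stats, e.g. df.groupby('Genres')['Episodes'].sum().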
I want to count the number of times a value in the Child column appears in the Parent column, then display this count in a new column named 'child count'. See the df previews below.
I have this done via VBA (COUNTIFS), but now I need dynamic visualization and an animated display with data fed from a directory, so I resorted to Python and pandas. I tried the code below after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but I still can't get the expected output as illustrated below.
Any help will be very much appreciated. Thanks in advance.
#import libraries
import pandas as pd
import numpy as np
import os
#get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'), index_col = False, encoding = 'ISO-8859-1', engine = 'python')
#preview dataframe
df_ns
#tried
df_ns.groupby(['Child','Parent', 'Site Name']).size().reset_index(name='child count')
[Image previews omitted: the input dataframe, the groupby output, and the expected output.]
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', 'Tkt06']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
Before
Try this code:
df['Count'] = [len(df[df['parent'].str.contains(value)]) for value in df['child']]
# breaking it down as line-by-line code
counts = []
for value in df['child']:
    found = df[df['parent'].str.contains(value)]
    counts.append(len(found))
df['Count'] = counts
After
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values on this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
This names the new column and assigns values to it directly through the groupby -> value_counts chain.
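As another untested sketch (assuming Parent holds either a ticket id or blanks), you could count every Parent value once with value_counts and map the result back onto Child; this also avoids substring pitfalls such as 'Tkt1' matching 'Tkt10':
parent_counts = df_ns['Parent'].str.strip().value_counts()
df_ns['child count'] = df_ns['Child'].map(parent_counts).fillna(0).astype(int)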
I'm in the initial stages of doing some 'machine learning'.
I'm trying to create a new data frame, and one of the columns doesn't appear to be recognised.
I've loaded an Excel file with 2 columns (removed the index). All fine.
Code:
df = pd.read_excel('scores.xlsx',index=False)
df=df.rename(columns=dict(zip(df.columns,['Date','Amount'])))
df.index=df['Date']
df=df[['Amount']]
#creating dataframe
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0,len(df)),columns=['Date','Amount'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Amount'][i] = data['Amount'][i]
The error:
KeyError: 'Date'
Not really sure what's the problem here.
Any help greatly appreciated
I think that in line 4 (df = df[['Amount']]) you reduce your dataframe to just the one column, 'Amount'.
To add to @Grzegorz Skibinski's answer: the problem is that after line 4 there is no longer a 'Date' column. The Date column was assigned to the index and then removed, and while the index is named "Date", you can't use 'Date' as a key to reach the index; you have to use data.index[i] instead of data['Date'][i].
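A minimal fix along those lines (the same loop, just reading the date from the index), with an equivalent loop-free version using reset_index():
for i in range(len(data)):
    new_data['Date'][i] = data.index[i]
    new_data['Amount'][i] = data['Amount'][i]

# or, without the loop:
new_data = data.reset_index()[['Date', 'Amount']]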
It may also be that the name of your Date column is not exactly what you think (a formatting issue such as a stray space).
To check that you don't have an error in the column names, you can print them:
import pandas as pd
# create data
data_dict = {}
data_dict['Fruit '] = ['Apple', 'Orange']
data_dict['Price'] = [1.5, 3.24]
# create dataframe from dict
df = pd.DataFrame.from_dict(data_dict)
# Print columns names
print(df.columns.values)
# Print "Fruit " column
print(df['Fruit '])
This code outputs:
['Fruit ' 'Price']
0 Apple
1 Orange
Name: Fruit , dtype: object
We clearly see that the "Fruit " column has a trailing space. This is an easy mistake to make, especially when using Excel.
If you try to call "Fruit" instead of "Fruit ", you get exactly the error you are seeing:
KeyError: 'Fruit'
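If that is what is happening, stripping the column names right after loading avoids the KeyError. A one-line sketch (assuming a reasonably recent pandas):
df.columns = df.columns.str.strip()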
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)
filevalues = filevalues.fillna(int(0))
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
So I have hundreds of CSV files with many empty spots for values. Some of the blank spaces are detected as NaNs and others as empty strings. This has forced me to write my code the way it is right now: I need to apply a formula to each value, so I changed all such NaNs and empty strings to 0 so that I could apply any formula (in this example, 1/1.2). The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my dataframe.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain CSV files have empty strings, this method does not fully work, and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to be giving different errors. Is there a clean way to not count these types of values when I am printing my pandas dataframe? Please let me know if an example csv file is needed.
Got it to work! Here is the answer if it is of use to anyone.
import pandas as pd
import csv
import numpy as np
readfile = pd.read_csv('50.csv')
filevalues= readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)
filevalues.replace("", np.nan, inplace=True) # replace empty string with np.nan
filevalues.dropna(inplace=True) # drop nan values
int_series = filevalues.astype(int) # change type
calculated_series = int_series.apply(lambda x: x*(1/1.2))
print(calculated_series)
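As a slightly more compact alternative (an untested sketch against these exact files), pd.to_numeric with errors='coerce' turns blanks and any other non-numeric junk into NaN in one step, so the replace calls aren't needed:
filevalues = pd.to_numeric(filevalues, errors='coerce').dropna()  # blanks become NaN, then drop
calculated_series = filevalues / 1.2
print(calculated_series)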