Function to return stripped dataframe - python

I have a dataframe read from a CSV file:
import pandas as pd
filename = 'mike.csv'
main_df = pd.read_csv(filename)
I need a function that strips whitespace from the contents of all string columns (there are also numeric columns) and then returns the stripped dataframe. In the function below, the stripping seems to work fine, but I don't know how to return the stripped dataframe:
def strip_whitespace(dataframe):
    dataframe_strings = dataframe.select_dtypes(['object'])
    dataframe[dataframe_strings.columns] = dataframe_strings.apply(lambda x: x.str.strip())
    return # how to return a stripped dataframe here?
Full code:
import pandas as pd
filename = 'mike.csv'
main_df = pd.read_csv(filename)
def strip_whitespace(dataframe):
    dataframe_strings = dataframe.select_dtypes(['object'])
    dataframe[dataframe_strings.columns] = dataframe_strings.apply(lambda x: x.str.strip())
    return stripped_dataframe # ?
stripped_main_df = strip_whitespace(main_df) # should be stripped df

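I believe you need the skipinitialspace=True parameter in read_csv:
main_df = pd.read_csv(filename, skipinitialspace=True)
This skips spaces that follow a delimiter, so stripping leading whitespace afterwards is not necessary. Trailing whitespace is not removed by this parameter.
But if you need to use your function, the dataframe is modified in place, so end the function with:
    return dataframe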
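For reference, here is a minimal sketch of the whole function. It assumes the string columns have the object dtype, as in the question, and it works on a copy so that the caller's dataframe is left untouched (that is a design choice of this sketch, not something the question requires). The sample data is made up.

```python
import pandas as pd

def strip_whitespace(dataframe):
    """Return a copy with whitespace stripped from every object-dtype column."""
    stripped = dataframe.copy()
    string_columns = stripped.select_dtypes(include=["object"]).columns
    for col in string_columns:
        stripped[col] = stripped[col].str.strip()
    return stripped

# Made-up sample data; the name column is forced to object dtype on purpose.
sample_df = pd.DataFrame({
    "name": pd.Series(["  alice ", "bob  "], dtype=object),
    "age": [30, 41],
})
stripped_sample_df = strip_whitespace(sample_df)
print(stripped_sample_df["name"].tolist())
```

If modifying the original dataframe is fine, the copy can be dropped and the function can simply return its argument.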

Related

Is there a way to remove header and split columns with pandas read_csv?

[Edited: working code at the end]
I have a CSV file with many rows, but only one column. I want to separate the rows' values into columns.
I have tried
import pandas as pd
df = pd.read_csv("TEST1.csv")
final = [v.split(";") for v in df]
print(final)
However, it didn't work. My CSV file doesn't have a header, yet the code reads the first row as a header. I don't know why, but the code returned only the header with the splits, and ignored the remainder of the data.
For this, I've also tried
import pandas as pd
df = pd.read_csv("TEST1.csv").shift(periods=1)
final = [v.split(";") for v in df]
print(final)
Which also returned the same error; and
import pandas as pd
df = pd.read_csv("TEST1.csv",header=None)
final = [v.split(";") for v in df]
print(final)
Which returned
AttributeError: 'int' object has no attribute 'split'
I presume it did that because when header=None or header=0, it appears as 0; and for some reason, the final = [v.split(";") for v in df] is only reading the header.
Also, I have tried inserting a new header:
import pandas as pd
df = pd.read_csv("TEST1.csv")
final = [v.split(";") for v in df]
headerList = ['Time','Type','Value','Size']
pd.DataFrame(final).to_csv("TEST2.csv",header=headerList)
And it did work, partly. There is a new header, but the only row in the csv file is the old header (which is part of the data); none of the other data has transferred to the TEST2.csv file.
Is there any way you could shed a light upon this issue, so I can split all my data?
Many thanks.
EDIT: Thanks to @1extralime, here is the working code:
import pandas as pd
df = pd.read_csv("TEST1.csv",sep=';')
df.columns = ['Time','Type','Value','Size']
df.to_csv("TEST2.csv")
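Try passing the separator to read_csv. Since your file has no header row, also pass header=None, otherwise the first data row is consumed as the header and then overwritten by the new column names:
import pandas as pd
df = pd.read_csv('TEST1.csv', sep=';', header=None)
df.columns = ['Time', 'Type', 'Value', 'Size']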
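To illustrate why the list comprehension in the question only ever saw the header, here is a small sketch. The inline text is a made-up stand-in for TEST1.csv (semicolon-separated, no header row), and passing names= together with header=None is just one way to set the column names while reading.

```python
import io
import pandas as pd

# Made-up stand-in for TEST1.csv: semicolon-separated, no header row.
raw = "09:00;A;1.5;10\n09:01;B;2.5;20\n"

# Iterating over a DataFrame yields its column labels, not its rows.
# With header=None and the default comma separator there is one column, labelled 0.
one_col = pd.read_csv(io.StringIO(raw), header=None)
labels = [v for v in one_col]

# Let pandas do the split and name the columns in one step.
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,
    names=["Time", "Type", "Value", "Size"],
)
print(labels)
print(df.shape)
```

The integer label is what caused the AttributeError about 'int' having no attribute 'split'.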

Extra column appears when appending selected row from one csv to another in Python

I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s,d):
    import pandas as pd
    df = pd.read_csv(s, sep=';', header=None)
    df_t = df.T
    df_t.iloc[0:1, 0:1] = 'Time Point'
    df_t.at[1, 0] = 1
    df_t.columns = df_t.iloc[0]
    df_new = df_t.drop(0)
    pdb = pd.read_csv(d, sep=';')
    newpd = pdb.append(df_new)
    from pandas import DataFrame
    newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, there is an extra "Unnamed" column appearing on the left:
Do you know how to fix that?
Please, help :(
My csv documents from which I select a column look like this:
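You have to pass index=False to to_csv(), e.g. newpd.to_csv(d, sep=';', index=False). Otherwise the index is written as an extra column on every save, and it comes back as an "Unnamed" column the next time the file is read.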
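A small sketch of the effect, using in-memory buffers instead of real files. The column names and values are made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({"Time Point": [1, 2], "Value": [0.5, 0.7]})

# Default behaviour: the index is written as an extra first column with no name.
with_index = io.StringIO()
df.to_csv(with_index, sep=";")
with_index.seek(0)
cols_with_index = pd.read_csv(with_index, sep=";").columns.tolist()

# With index=False only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, sep=";", index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index, sep=";").columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Note also that DataFrame.append was removed in pandas 2.0; on recent versions pd.concat([pdb, df_new]) is the replacement.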

Formatting of JSON file

Can we convert the highlighted INTEGER values to STRING value (refer below link)?
https://i.stack.imgur.com/3JbLQ.png
CODE
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
import pandas as pd
df = pd.read_csv ('newsample2.csv')
df.to_json('myjson2.json', indent=4)
print(df)
Try something like this:
import pandas as pd
filename = "newsample2.csv"
jsonFileName = "myjson2.json"
df = pd.read_csv(filename)
df['index'] = df.index
df.to_json(jsonFileName, indent=4)
print(df)
This copies the dataframe's index into a column named index, so the index values become part of your data.
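If the highlighted values are integer cell values (an assumption, since the linked image is not reproduced here), another option is to cast the column to strings before writing the JSON. The column names below are hypothetical.

```python
import json
import pandas as pd

# Hypothetical data standing in for newsample2.csv.
df = pd.DataFrame({"id": [101, 102], "name": ["a", "b"]})

# Cast only the integer column to str; to_json returns a string when no path is given.
json_text = df.astype({"id": str}).to_json(indent=4)
parsed = json.loads(json_text)
print(json_text)
```

Use df.astype(str) instead to convert every column at once.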

Unable to clean the csv file in python

I am trying to load a CSV file into Python and clean the text, but I keep getting an error. I saved the CSV data in a variable called data_file, and the function below cleans the text and is supposed to return the cleaned data_file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
data_file = df
print(data_file)
def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+ ', '', text) # removes @ mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)
    text = re.sub(r'RT[\s]+', '', text)
    text = re.sub(r'https?:\/\/\S+', '', text)
    return text
df['data_file'] = df['data_file'].apply(cleanTxt)
df
I get a KeyError here.
The KeyError comes from the fact that you are trying to apply a function to the column data_file of the dataframe df, which does not contain such a column.
The line data_file = df does not create a column; it only binds a second name to the same dataframe.
To change the column names of your dataframe df use:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
Then you can apply the function either to the right column or to the whole dataframe.
To apply a function to the whole dataframe you may want to use the .applymap() method (deprecated since pandas 2.1 in favor of DataFrame.map).
EDIT
For clarity's sake:
To print your column names and the length of your dataframe columns:
print(df.columns)
print(len(df.columns))
To modify your column names:
df.columns = ['list', 'of', 'values', 'corresponding', 'to', 'your', 'columns']
To apply your function on a column:
df['your_column_name'] = df['your_column_name'].apply(cleanTxt)
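To apply your function to your whole dataframe (every cell must be a string for cleanTxt to work; on pandas 2.1 and later, df.map replaces the deprecated df.applymap):
df = df.applymap(cleanTxt)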
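Putting the steps together, a minimal sketch could look like this. The column name text and the sample tweets are made up; the real name has to be read from df.columns. The final whitespace clean-up is an addition of this sketch, not part of the original function.

```python
import re
import pandas as pd

def clean_txt(text):
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)   # drop @mentions
    text = re.sub(r"#[A-Za-z0-9_]+", "", text)   # drop hashtags
    text = re.sub(r"RT\s+", "", text)            # drop the retweet marker
    text = re.sub(r"https?://\S+", "", text)     # drop URLs
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace

# Made-up data; "text" is a hypothetical column name.
df = pd.DataFrame({
    "text": ["RT @user1 check https://t.co/abc #news now", "plain text"],
})

# There is no column called data_file, which is why df['data_file'] raised KeyError.
has_data_file_column = "data_file" in df.columns

df["text"] = df["text"].apply(clean_txt)
cleaned = df["text"].tolist()
print(cleaned)
```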

Pandas Dataframe filter not working but str.match() is working

I have a Pandas Dataframe words_df which contains some English words.
It only has one column named word which contains the English word.
words_df.tail():
words_df.dtypes:
I want to select the row(s) which contain the word zythum.
Using the Pandas Series str.match() gives me the expected output:
words_df[words_df.word.str.match('zythum')]:
I know str.match() is not the correct way to do it, because it also returns rows which contain other words, such as zythums.
But the following operation on the Pandas Dataframe returns an empty Dataframe:
words_df[words_df['word'] == 'zythum']:
Why is this happening?
EDIT 1:
I am also attaching the source of my data and the code used to import it.
Data source (I used "Word lists in csv.zip"):
https://www.bragitoff.com/2016/03/english-dictionary-in-csv-format/
Dataframe import code:
import pandas as pd
import glob as glob
import os as os
import csv
path = r'data/words/' # use your path
all_files = glob.glob(path + "*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, names = ['word'], engine='python', quoting=csv.QUOTE_NONE)
    li.append(df)
words_df = pd.concat(li, axis=0, ignore_index=True)
EDIT 2:
Here is a block of my code, with a simpler import code, but facing same issue. (using Zword.csv file from link mentioned above)
If I understand correctly, df1[df1['word'] == 'zythum'] returns nothing.
Try removing the whitespace around the strings in the dataframe before comparing:
df1[df1['word'].str.strip() == 'zythum']
Your imported list does not match the string you are looking for exactly. There is a space after the words in the csv file.
You should be able to strip the whitespace out by using str.strip. For example:
import pandas as pd
myDF = pd.read_csv('Zword.csv')
myDF[myDF['z '] == 'zythum '] # This has the whitespace
myDF['z '] = myDF['z '].map(str.strip)
myDF[myDF['z '] == 'zythum'] # mapped the whitespace away
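The difference between the three comparisons can be reproduced on a tiny dataframe. The words below are made up, with a trailing space added to mimic what the answers above describe in the CSV files.

```python
import pandas as pd

# Made-up sample with a trailing space after every word.
words_df = pd.DataFrame({"word": ["zymurgy ", "zythum ", "zythums "]})

# Exact comparison fails because "zythum " != "zythum".
exact = words_df[words_df["word"] == "zythum"]

# str.match anchors only at the start, so it also matches "zythums ".
prefix = words_df[words_df["word"].str.match("zythum")]

# Stripping first makes the exact comparison behave as intended.
stripped = words_df[words_df["word"].str.strip() == "zythum"]

print(len(exact), len(prefix), len(stripped))
```

Stripping once after loading, with words_df['word'] = words_df['word'].str.strip(), avoids repeating the strip in every filter.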
You need to convert the whole column to str type:
words_df['word'] = words_df['word'].astype(str)
This should work in your case.
Here, you can use this to do the work. Change the parameters as required.
import glob as glob
import os as os
import csv
import pandas as pd
def match(dataframe):
    # build a boolean mask: True where the entry contains 'zythum'
    l = []
    for i in dataframe:
        l.append('zythum' in i)
    data = pd.DataFrame(l)
    data.columns = ['word']
    return data
path = r'Word lists in csv/' # use your path
files = os.listdir(path)
li = []
for filename in files:
    df = pd.read_csv(path + filename, index_col=None, header=None, names = ['word'], engine='python', quoting=csv.QUOTE_NONE)
    li.append(df)
words_df = pd.concat(li, axis=0, ignore_index=True)
words_df[match(words_df['word'])].dropna()
