Pandas DataFrame wrong indexing after reading from csv - python

I know very little about Python's pandas module. I need to create a DataFrame and store it in a .csv file for my project. I am using the to_csv and read_csv functions. However, when I compared the two frames (the one before exporting and the imported one), I got different results. This is the minimal reproducible example:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
df = pd.DataFrame(X, index=vectorizer.get_feature_names())
df.to_csv(path_or_buf = "db.csv")
df1 = pd.read_csv("db.csv")
print(df.axes)
print()
print(df1.axes)
And this is what is printed:
[Index(['bar', 'foo', 'love', 'python'], dtype='object'), RangeIndex(start=0, stop=2, step=1)]
[RangeIndex(start=0, stop=4, step=1), Index(['Unnamed: 0', '0', '1'], dtype='object')]
How can I make the DataFrame imported from a .csv file identical to the original one?

UPDATE: Give the DataFrame an index name before exporting, and pass that name as index_col when reading the exported CSV. Here I am using vectors as the index name:
import sys
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
documents = []
documents.append("i love python")
documents.append("foo bar")
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
X = X.T.toarray()
df = pd.DataFrame(X, index=vectorizer.get_feature_names())
df.index.name = 'vectors'
df.to_csv(path_or_buf="db.csv")
df1 = pd.read_csv("db.csv",index_col='vectors')
print(df)
print()
print(df1)
Old answer: try exporting the CSV without the index by setting index to False:
df.to_csv(path_or_buf="db.csv", index=False)

Related

How to read in a semicolon delimited file in Pandas and normalize

I am trying to read in a wine quality dataset and normalize the data before I move on. I've read in the csv as semicolon delimited, but when I try to drop the target variable, quality, I'm getting an error that says that attribute isn't found in axis.
import pandas as pd
df = pd.read_csv('gdrive/My Drive/whitewine.csv', delimiter="\s")
x = df.drop(['quality'], axis=0).values
Error:
KeyError: "['quality'] not found in axis"
drop(..., axis=0) looks for 'quality' among the row labels, which is why it raises the KeyError; a column has to be dropped with axis=1. For a semicolon-delimited file, also pass sep=';' instead of the whitespace regex:
import pandas as pd
df = pd.read_csv('gdrive/My Drive/whitewine.csv', sep=';')
X = df.drop(['quality'], axis=1)
y = df['quality']
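Since the goal was to normalize the data afterwards, here is a minimal sketch using scikit-learn's MinMaxScaler (the choice of scaler is an assumption):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # scales each feature column to [0, 1]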

Unable to clean the csv file in python

I am trying to load a CSV file into Python and clean the text, but I keep getting an error. I saved the CSV file in a variable called data_file, and the function below cleans the text and is supposed to return the clean data_file.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
data_file = df
print(data_file)
def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # removes @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)   # removes hashtags
    text = re.sub(r'RT[\s]+', '', text)         # removes the RT retweet marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # removes hyperlinks
    return text
df['data_file'] = df['data_file'].apply(cleanTxt)
df
I get a key error here.
The KeyError comes from the fact that you are trying to apply a function to the column data_file of the dataframe df, which does not contain such a column.
You just created a copy of df with the line data_file = df.
To change the column names of your dataframe df use:
df.columns = [list,of,values,corresponding,to,your,columns]
Then you can either apply the function to the right column or on the whole dataframe.
To apply a function on the whole dataframe you may want to use the .applymap() method.
EDIT
For clarity's sake:
To print your column names and the length of your dataframe columns:
print(df.columns)
print(len(df.columns))
To modify your column names:
df.columns = [list,of,values,corresponding,to,your,columns]
To apply your function on a column:
df['your_column_name'] = df['your_column_name'].apply(cleanTxt)
To apply your function to your whole dataframe:
df = df.applymap(cleanTxt)
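Putting it together, a minimal sketch, assuming the tweets live in a column named text (a hypothetical name; check df.columns for the real one):
import pandas as pd
import re

df = pd.read_csv("/Users/yoshithKotla/Desktop/janTweet.csv")
print(df.columns)  # confirm the actual column name first

def cleanTxt(text):
    text = re.sub(r'@[A-Za-z0-9]+', '', text)   # @mentions
    text = re.sub(r'#[A-Za-z0-9]+', '', text)   # hashtags
    text = re.sub(r'RT[\s]+', '', text)         # retweet marker
    text = re.sub(r'https?:\/\/\S+', '', text)  # links
    return text

df['text'] = df['text'].apply(cleanTxt)  # 'text' is the assumed column name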

Pandas Dataframe filter not working but str.match() is working

I have a Pandas Dataframe words_df which contains some English words.
It only has one column named word which contains the English word.
[Screenshots of words_df.tail() and words_df.dtypes omitted.]
I want to filter out the row(s) which contain the word zythum
Using the pandas Series method str.match() gives me the expected output:
words_df[words_df.word.str.match('zythum')]
I know str.match() is not the correct way to do it, since it will also return rows containing longer words such as zythums.
But the following operation on the DataFrame returns an empty DataFrame:
words_df[words_df['word'] == 'zythum']
Why is this happening?
EDIT 1:
I am also attaching the source of my data and the code used to import it.
Data source (I used "Word lists in csv.zip"):
https://www.bragitoff.com/2016/03/english-dictionary-in-csv-format/
Dataframe import code:
import pandas as pd
import glob as glob
import os as os
import csv
path = r'data/words/' # use your path
all_files = glob.glob(path + "*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=None, names=['word'], engine='python', quoting=csv.QUOTE_NONE)
    li.append(df)
words_df = pd.concat(li, axis=0, ignore_index=True)
EDIT 2:
Here is a block of my code with a simpler import, but it has the same issue (using the Zword.csv file from the link mentioned above).
IIUC: df1[df1['word'] == 'zythum'] is not working.
Try removing the whitespace around the strings in the dataframe:
df1[df1['word'].str.strip() == 'zythum']
Your imported list does not match the string you are looking for exactly. There is a space after the words in the csv file.
You should be able to strip the whitespace out by using str.strip. For example:
import pandas as pd
myDF = pd.read_csv('Zword.csv')
myDF[myDF['z '] == 'zythum '] # This has the whitespace
myDF['z '] = myDF['z '].map(str.strip)
myDF[myDF['z '] == 'zythum'] # mapped the whitespace away
You need to convert the whole column to str type:
words_df['word'] = words_df['word'].astype(str)
This should work in your case.
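Combining the suggestions above into a minimal sketch (building on the words_df from the question): cast to str, strip the padding, then filter by exact equality.
words_df['word'] = words_df['word'].astype(str).str.strip()  # remove the trailing spaces
print(words_df[words_df['word'] == 'zythum'])                # exact match now works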
Here, you can use this to do the work. Change parameters as required.
import pandas as pd
import os
import csv
def match(series):
    # substring check: True for every entry that contains 'zythum'
    l = []
    for i in series:
        l.append('zythum' in i)
    data = pd.DataFrame(l)
    data.columns = ['word']
    return data
path = r'Word lists in csv/' # use your path
files = os.listdir(path)
li = []
for filename in files:
    df = pd.read_csv(path + filename, index_col=None, header=None, names=['word'], engine='python', quoting=csv.QUOTE_NONE)
    li.append(df)
words_df = pd.concat(li, axis=0, ignore_index=True)
words_df[match(words_df['word'])].dropna()

Is there a way to un-nesting a pandas dataframe in a python3 jupyter notebook?

I am importing a JSON file into a Python 3 Jupyter notebook. The JSON file has the format:
object
  rooms [26 elements]
    0
      turns
        fromBathroom
        fromParking
      distances
        dfromBathroom
        dfromParking
      depth
      area
    1
    .... etc.
  name
I am importing the json file in this way:
import pandas as pd
import numpy as np
import json
from pandas.io.json import json_normalize
with open("rooms.json") as file:
    data = json.load(file)
df = json_normalize(data['rooms'])
I am now trying to plot each of the 6 dimensions against each other in a matrix-like format, with 36 total graphs.
I am trying to do this the following way:
col_features = ['fromBathroom', 'fromParking', 'dfromBathroom', 'dfromParking', 'depth', 'area']
pd.plotting.scatter_matrix(df[col_features], alpha = .2, figsize = (14,8))
This does not work, as I am getting an error that reads:
KeyError: "['fromBathroom' 'fromParking' 'dfromBathroom' 'dfromParking'] not in index"
This is because those features are nested in 'turns' and 'distances' in the json file. Is there a way to un-nest these features so that I can index into the dataframe the same way I can for depth and area to get the values?
Thank you for any insights.
Maybe you could extract df1 = df['turns'], df2 = df['distances'] and df3 = df[['area', 'depth']] and then do df4 = pd.concat([df1, df2, df3], join='inner', axis=1); see the pandas docs.
Or directly: pd.concat([df['turns'], df['distances'], df[['area', 'depth']]], join='inner', axis=1)
EDIT :
I tried something, I hope it is what you are looking for :
df1 = df['turns']
df2 = df['distances']
df3 = pd.DataFrame(df['depth'])
df4 = pd.DataFrame(df['area'])
df_recomposed = pd.concat([df1, df2, df3, df4], join='inner', axis=1)
or Pandas - How to flatten a hierarchical index in columns
where df.columns = [' '.join(col).strip() for col in df.columns.values] should be what you are looking for
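For what it's worth, json_normalize flattens nested objects into dot-separated column names by default, so the features may already be present under different names; a sketch under that assumption:
import json
import pandas as pd
from pandas.io.json import json_normalize

with open("rooms.json") as file:
    data = json.load(file)

df = json_normalize(data['rooms'])
print(df.columns)  # nested keys appear as e.g. 'turns.fromBathroom'

col_features = ['turns.fromBathroom', 'turns.fromParking',
                'distances.dfromBathroom', 'distances.dfromParking',
                'depth', 'area']
pd.plotting.scatter_matrix(df[col_features], alpha=0.2, figsize=(14, 8))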

How to add LabelBinarizer columns to DataFrame

I have recently started working with LabelBinarizer by running the following code on my CSV file, AppleStore.csv:
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
#import matplotlib.pyplot as plot
#--------------------------------
label_conv = LabelBinarizer()
appstore_original = pd.read_csv("AppleStore.csv")
#--------------------------------
lb_conv = label_conv.fit_transform(appstore_original["cont_rating"])
column_names = label_conv.classes_
print(column_names)
print(lb_conv)
This gives me lb_conv and the column names. Therefore:
how could I attach lb_conv to appstore_original, using column_names as the column names?
If anyone could help that would be great.
try this:
lb = LabelBinarizer()
df = pd.read_csv("AppleStore.csv")
df = df.join(pd.DataFrame(lb.fit_transform(df["cont_rating"]),
                          columns=lb.classes_,
                          index=df.index))
To make sure that the newly created DataFrame has the same index elements as the original one (we need this for the join), we specify index=df.index in the constructor call.
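If the original cont_rating column is no longer needed after the join, you can drop it (an optional step, my assumption):
df = df.drop(columns=["cont_rating"])  # keep only the binarized indicator columns
print(df.head())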
