Unable to insert clean unicode text back into DataFrame in pandas - python

I am doing two things:
1) filtering a DataFrame in pandas
2) cleaning Unicode text in a specific column of the filtered DataFrame
import pandas as pd
import probablepeople
from unidecode import unidecode
import re
#read data
df1 = pd.read_csv("H:\\data.csv")
#filter
df1=df1[(df1.gender=="female")]
#reset index because otherwise indexes will be as per original dataframe
df1=df1.reset_index()
Now I am trying to clean the Unicode text in the address column:
#clean unicode text
for i in range(10):
    df1.loc[i][16] = re.sub(r"[^a-zA-Z.,' ]", r' ', df1.address[i])
However, I am unable to do so, and below is the error I am getting:
c:\python27\lib\site-packages\ipykernel\__main__.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

I think you can use str.replace:
df1=df1[df1.gender=="female"]
#reset index with parameter drop if need new monotonic index (0,1,2,...)
df1=df1.reset_index(drop=True)
df1.address = df1.address.str.replace(r"[^a-zA-Z.,' ]", ' ', regex=True)
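A minimal sketch of the vectorized approach on made-up addresses (the values here are hypothetical, not from the question's data); note that recent pandas versions require regex=True for the pattern to be treated as a regular expression:

```python
import pandas as pd

# Toy stand-in for the filtered address column (hypothetical values)
df = pd.DataFrame({"address": ["12 Main St. #4", "Café Row, 7b"]})

# One vectorized call replaces the whole loop; every disallowed
# character (digits, #, é, ...) becomes a space
df["address"] = df["address"].str.replace(r"[^a-zA-Z.,' ]", " ", regex=True)
print(df["address"].tolist())
```

Because the operation returns a new Series that is assigned back to the column, no SettingWithCopyWarning is triggered.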

Related

How can I drop rows in pandas based on a condition

I'm trying to drop some rows in a pandas data frame, but I'm getting this error: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
I have a list of desired items that I want to stay in the Data Frame, so I wrote this:
#
import sys
import pandas as pd
biog = sys.argv[1]
df = pd.read_csv(biog, sep ='\t')
desired = ['Affinity Capture-Luminescence', 'Affinity Capture-MS', 'Affinity Capture-Western', 'Co-crystal Structure', 'Far Western', 'FRET', 'PCA', 'Reconstituted Complex']
new_df = df[['OFFICIAL_SYMBOL_A','OFFICIAL_SYMBOL_B','EXPERIMENTAL_SYSTEM']]
for i in desired:
    print(i)
    new_df.drop(new_df[new_df.EXPERIMENTAL_SYSTEM != i].index, inplace=True)
print(new_df)
#
It works if I apply a single condition at a time, but with the for loop it doesn't work.
I didn't include the data here because it is too large; I hope this is enough.
Thanks for the help.
You can select the rows whose column value is in a list of values with isin; there is no need to loop. (The loop actually removes every row: each iteration drops all rows that don't match the current item, so after two different items nothing is left.)
new_df = new_df[new_df['EXPERIMENTAL_SYSTEM'].isin(desired)]
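A small sketch of the isin approach on made-up rows (the column names are from the question, the values are hypothetical):

```python
import pandas as pd

# Minimal stand-in for the experiment table
df = pd.DataFrame({
    "OFFICIAL_SYMBOL_A": ["A1", "A2", "A3"],
    "EXPERIMENTAL_SYSTEM": ["FRET", "Two-hybrid", "PCA"],
})
desired = ["FRET", "PCA"]

# One boolean mask replaces the whole loop of drops
new_df = df[df["EXPERIMENTAL_SYSTEM"].isin(desired)]
print(new_df)
```

Assigning the filtered result to a new name also sidesteps the SettingWithCopyWarning, since nothing is mutated in place.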

How do I extract the date from a column in a csv file using pandas?

This is the 'aired' column in the csv file (the sample was posted as an image in the original question).
Link to the csv file:
https://drive.google.com/file/d/1w7kIJ5O6XIStiimowC5TLsOCUEJxuy6x/view?usp=sharing
I want to extract the date and the month (in words) from the date following the 'from' word and store them in a separate column in another csv file. The 'from' is an obstruction; had the cell contained just the date, it could easily have been parsed as a timestamp.
You are starting from a string and want to break out the data within it. The single quotes are a clue that this is a dict structure in string form. The Python standard library includes the ast (Abstract Syntax Trees) module, whose literal_eval function can read such a string into a dict, as shown in this SO answer: Convert a String representation of a Dictionary to a dictionary?
You want to apply that to your column to get the dict, at which point you expand it into separate columns using .apply(pd.Series), based on this SO answer: Splitting dictionary/list inside a Pandas Column into Separate Columns
Try the following:
import pandas as pd
import ast
df = pd.read_csv('AnimeList.csv')
# turn the pd.Series of strings into a pd.Series of dicts
aired_dict = df['aired'].apply(ast.literal_eval)
# turn the pd.Series of dicts into a pd.Series of pd.Series objects
aired_df = aired_dict.apply(pd.Series)
# pandas automatically translates that into a pd.DataFrame
# concatenate the remainder of the dataframe with the new data
df_aired = pd.concat([df.drop(['aired'], axis=1), aired_df], axis=1)
# convert the date strings to datetime values
df_aired['aired_from'] = pd.to_datetime(df_aired['from'])
df_aired['aired_to'] = pd.to_datetime(df_aired['to'])
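A minimal sketch of the same pipeline on a single made-up 'aired' string (the dates below are hypothetical, shaped like the column described in the question):

```python
import ast
import pandas as pd

# One dict-like string in the shape of the 'aired' column
df = pd.DataFrame({"aired": ["{'from': '1998-04-03', 'to': '1999-04-24'}"]})

# string -> dict -> one column per key
aired_df = df["aired"].apply(ast.literal_eval).apply(pd.Series)

# splice the expanded columns back in and parse the dates
df_aired = pd.concat([df.drop(["aired"], axis=1), aired_df], axis=1)
df_aired["aired_from"] = pd.to_datetime(df_aired["from"])
print(df_aired["aired_from"].dt.month_name()[0])
```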
import pandas as pd
file = pd.read_csv('file.csv')
result = []
for cell in file['aired']:
    date = cell[8:22]
    date_ts = pd.to_datetime(date, format='%Y-%m-%d')
    result.append((date_ts.month_name(), date_ts))
df = pd.DataFrame(result, columns=['month', 'date'])
df.to_csv('result_file.csv')

Replacing multiple numeric columns with the log value of those columns Python

I am working with a pandas DataFrame in Python that has 10 variables (4 numeric, 6 categorical). I want to replace the values of the 4 numeric variables with the natural log of the current values.
Example of my data below:
df = DataFrame
logcolumns = the names of the columns that I want to convert to the natural log
import numpy as np
import pandas as pd
df = pd.read_csv("myfile.csv")
logcolumns = ['Volume', 'Sales', 'Weight', 'Price']
df[logcolumns] = np.log(df[logcolumns])
After running this, I receive a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
This process works with an individual column, and with an entire dataframe, but not when I try to run it on a list of selected columns.
You could follow the suggestion in the warning and use label-based access:
df.loc[:, logcolumns] = np.log(df[logcolumns])
The official doc is here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
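A short sketch of the label-based version on a made-up frame (column names borrowed from the question, values hypothetical):

```python
import numpy as np
import pandas as pd

# Two numeric columns and one categorical column
df = pd.DataFrame({
    "Volume": [1.0, np.e],
    "Price": [np.e, 1.0],
    "Region": ["N", "S"],
})
logcolumns = ["Volume", "Price"]

# Label-based assignment, as the warning suggests; the
# categorical column is left untouched
df.loc[:, logcolumns] = np.log(df[logcolumns])
print(df)
```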

Change one column of a DataFrame only

I'm using Pandas with Python 3. I have a dataframe with a bunch of columns, but I only want to change the data type of all the values in one of the columns and leave the others alone. The only way I could find to accomplish this is to edit the column, remove the original column and then merge the edited one back. I would like to edit the column without having to remove and merge, leaving the rest of the dataframe unaffected. Is this possible?
Here is my solution now:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
def make_float(var):
    var = float(var)
    return var
#create a new dataframe with the value types I want
df2 = df1['column'].apply(make_float)
#remove the original column
df3 = df1.drop('column',1)
#merge the dataframes
df1 = pd.concat([df3,df2],axis=1)
It also doesn't work to apply the function to the dataframe directly. For example:
df1['column'].apply(make_float)
print(type(df1.iloc[1]['column']))
yields:
<class 'str'>
df1['column'] = df1['column'].astype(float)
It will raise an error if conversion fails for some row.
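If some rows may fail to convert, pd.to_numeric with errors='coerce' turns them into NaN instead of raising; a sketch on made-up values (the column names here are hypothetical):

```python
import pandas as pd

# Strings that should become floats, including one bad value
df = pd.DataFrame({"column": ["1.5", "2.0", "bad"], "other": ["a", "b", "c"]})

# astype(float) would raise on "bad"; coercing maps it to NaN
df["column"] = pd.to_numeric(df["column"], errors="coerce")
print(df["column"].tolist())
```

Either way the assignment replaces only that one column; the rest of the dataframe is untouched.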
Apply does not work inplace, but rather returns a series that you discard in this line:
df1['column'].apply(make_float)
Apart from Yakym's solution, you can also do this -
df['column'] += 0.0

dropping row containing non-english words in pandas dataframe

I turned this twitter corpus into a pandas data frame and I was trying to find the non-English tweets and delete them from the data frame, so I did this:
for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j)
print(df.shape)
But when I check the shape, no row was dropped.
Am I using the drop function wrong, or do I need to keep track of the index of the row?
That's because df.drop() returns a copy instead of modifying your original dataframe. Try setting inplace=True:
for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j, inplace=True)
print(df.shape)
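A minimal sketch of the difference on a made-up frame (the values are hypothetical):

```python
import pandas as pd

# Tiny stand-in for the tweet table
df = pd.DataFrame({"text": ["hello", "xyzzy", "world"]})

# drop() returns a new frame by default; the original is untouched
dropped = df.drop(1)
print(len(df), len(dropped))   # 3 2

# With inplace=True (or by reassigning df = df.drop(1)) the row really goes
df.drop(1, inplace=True)
print(df["text"].tolist())     # ['hello', 'world']
```

Reassigning (`df = df.drop(j)`) is equivalent and is generally preferred over inplace=True.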
This will filter out all the non-English rows in our pandas dataframe.
import nltk
nltk.download('words')
from nltk.corpus import words
import pandas as pd
data1 = pd.read_csv("testdata.csv")
Word = list(set(words.words()))
df_final = data1[data1['column_name'].str.contains('|'.join(Word))]
print(df_final)
