How to get rid of NaN values in a csv file? (Python)

First of all, I know there are already answers about this matter, but none of them have worked for me so far. Anyway, I would still like to hear your answers, even though I have already tried those solutions.
I have a csv file called mbti_datasets.csv. The label of the first column is type and the second column is called description. Each row represents a personality type (with its respective type and description).
TYPE | DESCRIPTION
a | This personality likes to eat apples...\nThey look like monkeys...\nIn fact, are strong people...
b | b.description
c | c.description
d | d.description
...16 types | ...
In the following code, I'm trying to duplicate each personality type's row wherever its description contains \n.
Code:
import pandas as pd
# Reading the file
path_root = 'gdrive/My Drive/Colab Notebooks/MBTI/'
root_fn = path_root + 'mbti_datasets.csv'
df = pd.read_csv(root_fn, sep = ',', quotechar = '"', usecols = [0, 1])
# split the column where there are new lines and turn it into a series
serie = df['description'].str.split('\n').apply(pd.Series, 1).stack()
# remove the second index for the DataFrame and the series to share indexes
serie.index = serie.index.droplevel(1)
# give it a name to join it to the DataFrame
serie.name = 'description'
# remove original column
del df['description']
# join the series with the DataFrame, based on the shared index
df = df.join(serie)
# New file name and writing the new csv file
root_new_fn = path_root + 'mbti_new.csv'
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)
print(new_df)
EXPECTED OUTPUT:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...
a | In fact, are strong people...
b | b.description
b | b.description
c | c.description
... | ...
CURRENT OUTPUT:
TYPE | DESCRIPTION
a | This personality likes to eat apples...
a | They look like monkeys...NaN
a | NaN
a | In fact, are strong people...NaN
b | b.description...NaN
b | NaN
b | b.description
c | c.description
... | ...
I'm not 100% sure, but I think the NaN values come from \r characters.
Files uploaded to GitHub as requested:
CSV FILES
Using the #YOLO solution:
CSV YOLO FILE
E.g. where it is failing:
2 INTJ Existe soledad en la cima y-- siendo # adds -- in random blank spaces
3 INTJ -- y las mujeres # adds -- at the beginning
3 INTJ (...) el 0--8-- de la poblaci # doesn't finish the word 'población'
10 INTJ icos-- un conflicto que parecer--a imposible. # starts at random letters mid-word
12 INTJ c # adds just 1 letter
Translation for full understanding:
2 INTJ There is loneliness at the top and-- being # adds -- in random blank spaces
3 INTJ -- and women # adds -- at the beginning
3 INTJ (...) on 0--8-- of the popula-- # doesn't finish the word 'population'
10 INTJ icos-- a conflict that seems--to impossible. # starts at random letters mid-word
12 INTJ c # adds just 1 letter
When I check whether there are any NaN values and what type they are:
print(new_df['descripcion'].isnull())
<class 'float'>
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 True
8 False
9 True
10 False
11 True
continues...

Here's a way to do it. I had to find a workaround to replace the \n character; somehow it wasn't working in the straightforward manner:
df['DESCRIPTION'] = df['DESCRIPTION'].str.replace(r'[^a-zA-Z0-9\s.]', '--', regex=True).str.split('--n')
df = df.explode('DESCRIPTION')
print(df)
TYPE DESCRIPTION
0 a This personality likes to eat apples...
0 a They look like monkeys...
0 a In fact-- are strong people...
1 b b.description
2 c c.description
3 d d.description
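If the straightforward split does work in your environment, here is a minimal sketch of the direct approach (an assumption on my part, not tested on your file): strip the suspected \r characters first, split on real newlines, and drop the blank fragments instead of substituting --.
import pandas as pd

# Sketch: assumes the column is named 'DESCRIPTION' and that stray '\r'
# characters are what turns into NaN after the split.
df['DESCRIPTION'] = (
    df['DESCRIPTION']
    .str.replace('\r', '', regex=False)  # remove carriage returns first
    .str.split('\n')                     # then split on the real newlines
)
df = df.explode('DESCRIPTION')
# discard the empty strings produced by consecutive newlines
df = df[df['DESCRIPTION'].str.strip().astype(bool)]
print(df)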

The problem can be attributed to the description cells: some of them contain two consecutive newlines with nothing between them.
I just used .dropna() when reading the newly created csv, and rewrote it without the NaN values. I think repeating this read/write process is not the best way, but it works as a straightforward solution.
df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn).dropna()
new_df.to_csv(root_new_fn, sep = ',', quotechar = '"', encoding = 'utf-8', index = False)
new_df = pd.read_csv(root_new_fn)
print(type(new_df.iloc[7, 1]))  # where there was a NaN value
print(new_df['descripcion'].isnull())
<class 'str'>
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
and continues...
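A possible refinement (just a sketch, reusing the names from the question's code): the blank fragments could be filtered out right after the split, so the csv never contains NaN rows and the extra read/dropna/rewrite round trip isn't needed.
# Sketch: filter the stacked series before joining it back to the DataFrame.
serie = serie.str.strip()       # removes stray '\r' and surrounding spaces
serie = serie[serie != '']      # drops the empty fragments
df = df.join(serie)
df.to_csv(root_new_fn, sep=',', quotechar='"', encoding='utf-8', index=False)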

Related

Searching in a dataframe based on a list

I have a dataframe containing some string values:
df:
column1
0 | a
1 | b
2 | c
3 | d
Now I also have a list = (b, c). It contains some of the values of the df.
I want to find out, for each value in the dataframe, whether it can be found in the list.
0 | False
1 | True
2 | True
3 | False
So far I have used x = df['column1'].isin(list), but it says False for all of the observations in the dataframe. I am assuming this is because it checks whether all the values in the df are in the list. How can I achieve the desired result?
Thanks
The following code works for me:
import pandas as pd
data = ['a','b','c','d']
df = pd.DataFrame(data = data, columns=['Column 1'])
list1 = ('b', 'c')  # If you are using round brackets then that is not a list, it's a tuple.
df.isin(list1)
Output:
Column 1
0 False
1 True
2 True
3 False
Note: If it still doesn't work, recheck all the values in the dataframe or the list; they might have unnecessary spaces or something else.
Let me know whether it works for you.
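If the all-False result really does come from stray whitespace (as the note above suggests), here is a small sketch of stripping before the membership test (the column name column1 is taken from the question; the sample values are illustrative):
import pandas as pd

df = pd.DataFrame({'column1': ['a ', ' b', 'c', 'd']})  # note the stray spaces
wanted = ['b', 'c']  # a real list; also avoids shadowing the built-in name `list`

mask = df['column1'].str.strip().isin(wanted)
print(mask)
# 0    False
# 1     True
# 2     True
# 3    False
# Name: column1, dtype: bool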

How to check entries in a column for patterns and calculate the number of patterns?

I have a DataFrame:
Name Price
0 Dictionary 3
1 Book 4
2 Dict En-Ru 2
3 BookforKids 6
4 Dict FR-CHN 1
I need a piece of code that will check the 'Name' column for patterns that I can specify myself, and count the number of matches for each pattern in another DataFrame.
For instance, checking the number of entries in the 'Name' column that match the patterns Dict and Book, ignoring case, should give this result:
| Pattern | Occurrences |
| ------- | ----------- |
| Dict    | 3           |
| Book    | 2           |
Here's one way using str.extract:
patterns = ['Dict','Book']
df.Name.str.extract(rf"({'|'.join(patterns)})", expand=False).value_counts()
Dict 3
Book 2
Name: 0, dtype: int64
You can make it case-insensitive, either with the flags argument of str.extract or simply by lowercasing first:
patterns_lower = '|'.join([s.lower() for s in patterns])
(df.Name.str.lower().str.extract(rf"({patterns_lower})", expand=False)
.value_counts())
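The same idea written with the flags argument instead of lowercasing (a sketch, reusing the patterns list from above). Note that value_counts then keys on the text as it appears in the data, so differently-cased matches are counted separately unless you lowercase afterwards:
import re

counts = (df.Name
            .str.extract(rf"({'|'.join(patterns)})", flags=re.IGNORECASE, expand=False)
            .value_counts())
print(counts)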
You can define your pattern as a custom function:
import numpy as np

# example
def get_pattern(txt):
    if 'Dict' in txt:
        return 'Dict'
    if 'Book' in txt:
        return 'Book'
    return np.nan
Then you apply it to your dataframe and use value_counts:
df['Pattern'] = df['Name'].apply(get_pattern)
df['Pattern'].value_counts()
Dict 3
Book 2
dtype: int64
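To get exactly the Pattern / Occurrences table from the expected output, here is a short sketch with str.contains, counting each pattern independently and case-insensitively (column names chosen to match the question):
import pandas as pd

patterns = ['Dict', 'Book']
occurrences = pd.DataFrame({
    'Pattern': patterns,
    'Occurrences': [df['Name'].str.contains(p, case=False).sum() for p in patterns],
})
print(occurrences)
#   Pattern  Occurrences
# 0    Dict            3
# 1    Book            2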

How does one break out strings of multiple key value pairs in a single dataframe column into a new dataframe in python?

I am pulling data from a sql database into a pandas dataframe. The dataframe is a single column containing various quantities of key value pairs stored in a string. I would like to make a new dataframe that contains two columns, one holding the keys, and the other holding the values.
The dataframe looks like:
In[1]:
print(df.tail())
Out[1]:
WK_VAL_PAIRS
166 {('sloth', 0.073), ('animal', 0.034), ('gift', 0.7843)}
167 {('dabbing', 0.0863), ('gift', 0.7843)}
168 {('grandpa', 0.0156), ('funny', 1.3714), ('grandfather', 0.0015)}
169 {('nerd', 0.0216)}
170 {('funny', 1.3714), ('pineapple', 0.0107)}
Ideally, the new dataframe would look like:
0 | sloth | 0.073
1 | animal | 0.034
2 | gift | 0.7843
3 | dabbing | 0.0863
4 | gift | 0.7843
...
etc.
I have been successful in separating out the key value pairs from a single row into a dataframe, as shown below. From here it will be trivial to split the pairs out into their own columns.
In[2]:
def prep_text(row):
    string = row.replace('{', '')
    string = string.replace('}', '')
    string = string.replace('\',', '\':')
    string = string.replace(' ', '')
    string = string.replace(')', '')
    string = string.replace('(', '')
    string = string.replace('\'', '')
    return string
df['pairs'] = df['WK_VAL_PAIRS'].apply(prep_text)
dd = df['pairs'].iloc[166]
af = pd.DataFrame([dd.split(',') for x in dd.split('\n')])
af.transpose()
Out[2]:
0 sloth:0.073
1 animal:0.034
2 gift:0.7843
3 spirit:0.0065
4 fans:0.0093
5 funny:1.3714
However, I'm missing the leap to apply this transformation to the entire dataframe. Is there a way to do this with an .apply()-style function rather than a for-each loop? What is the most pythonic way of handling this?
Any help would be appreciated.
Solution
With Chris's strong hint below, I was able to get to an adequate solution for my needs:
def prep_text(row):
    string = row.replace('\'', '')
    string = '"' + string + '"'
    return string

kvp_df = pd.DataFrame(
    re.findall(
        r'(\w+), (\d.\d+)',
        df['WK_VAL_PAIRS'].apply(prep_text).sum()
    )
)
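A small follow-up sketch on the result (the column names here are illustrative, not from the original post): re.findall returns plain strings, so the value column can be named and cast explicitly.
kvp_df.columns = ['key', 'value']
kvp_df['value'] = kvp_df['value'].astype(float)
print(kvp_df.head())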
Try re.findall with pandas.DataFrame:
import pandas as pd
import re
s = pd.Series(["{(stepper, 0.0001), (bob, 0.0017), (habitual, 0.0), (line, 0.0097)}",
"{(pete, 0.01), (joe, 0.0019), (sleep, 0.0), (cline, 0.0099)}"])
pd.DataFrame(re.findall(r'(\w+), (\d.\d+)', s.sum()))
Output:
0 1
0 stepper 0.0001
1 bob 0.0017
2 habitual 0.0
3 line 0.0097
4 pete 0.01
5 joe 0.0019
6 sleep 0.0
7 cline 0.0099
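If you want to keep the alignment with the original rows instead of flattening everything with s.sum(), here is a sketch using str.findall plus explode (same regex idea, with the dot escaped; the column names are illustrative):
import pandas as pd

pairs = s.str.findall(r'(\w+), (\d\.\d+)').explode()      # one (key, value) tuple per row
kvp_df = pd.DataFrame(pairs.tolist(), index=pairs.index, columns=['key', 'value'])
kvp_df['value'] = kvp_df['value'].astype(float)
print(kvp_df)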

Get the average scores for the most common (frequent) words in a dataframe

I am trying to get the average scores for the most common words in my dataframes. Currently my dataframe has this format.
sentence | score
"Sam I am Sam" | 10
"I am Sam" | 5
"Paul is great Sam" | 5
"I am great" | 0
"Sam Sam Sam" | 15
I managed to get the most common words using the blurb of code below, which cleaned up my dataframe and removed all stop words. It yielded me this series.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')
stop = stopwords.words('english')
df_text = df[['sentence','score']]
df_text['sentence'] = df_text['sentence'].replace("[a-zA-Z0-9]{14}|rt|[0-9]", '', regex=True, inplace=False)
df_text['sentence'] = df_text['sentence'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
top_words = pd.Series(' '.join(df_text['sentence']).lower().split()).value_counts()[:25]
Words | Freq
Sam | 7
I | 3
Am | 3
Great | 2
is | 1
I understand that groupby().mean() is an important function I would need to use, but I don't understand how to bring the score column in. This is the ideal output I am trying to get; I showed the math to explain how I got the averages.
Words | Avg
Sam | 35/7 = 5
I | 15/3 = 5
Am | 15/3 = 5
Great | 5/2 = 2.5
is | 5/1 = 5
I will skip the data cleaning part (such as stopword removal), except that you really should use nltk.word_tokenize instead of split(). In particular, it would be your responsibility to eliminate the quotes.
df['words'] = df['sentence'].apply(nltk.word_tokenize)
Once the words are extracted, count them and combine with the scores:
word_counts = pd.concat([df[['score']],
df['words'].apply(Counter).apply(pd.Series)],
axis=1)
Now, calculate the weighted sums:
ws = word_counts.notnull().mul(word_counts['score'], axis=0).sum() \
/ word_counts.sum()
#score 1.0
#`` 7.0
#Sam 5.0
#I 5.0
#am 5.0
#'' 7.0
#Paul 5.0
#is 5.0
#great 2.5
Finally, eliminate the first row that was included only for convenience:
del(ws['score'])
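Since the question explicitly mentions groupby().mean(), here is a sketch of an explode/groupby variant that reproduces the arithmetic in the expected output (each sentence's score is counted once per distinct word it contains, while the denominator is the word's raw frequency). It assumes the quotes have already been stripped from the sentence column:
import pandas as pd

exploded = (df.assign(word=df['sentence'].str.lower().str.split())
              .explode('word')
              .reset_index())

score_sum = (exploded.drop_duplicates(['index', 'word'])   # count each sentence's score once per word
                     .groupby('word')['score'].sum())
freq = exploded.groupby('word').size()                     # raw occurrence counts
print((score_sum / freq).sort_values(ascending=False))     # e.g. sam: 35/7 = 5.0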
Considering you have your data in a tabular format, this should work:
import pandas as pd
from collections import Counter

df = pd.read_csv('data.csv')
cnt = Counter([word for sen in df.sentence.values for word in sen.split()])
for item in cnt:
    tot_score = 0
    for row in df.iterrows():
        if item in row[1]['sentence'].split():
            tot_score += row[1]['score']
    if cnt[item] != 0:
        print(item, tot_score / cnt[item])
    else:
        print(item, 0)

Pandas: Column value assignment in function not working

I have a dataset that looks like this:
country | year | supporting_nation | eco_sup | mil_sup
------------------------------------------------------------------
Fake 1984 US 1 1
Fake 1984 SU 0 1
In this fake example, a nation is playing both sides during the cold war and receiving support from both.
I am reshaping the dataset in two ways:
I removed all non-US/SU instances of support, since I am only interested in these two countries.
I want to reduce it to one line per year per country, meaning that I am adding US/SU-specific dummy variables for each variable.
Like so:
country | year | US_SUP | US_eco_sup | US_mil_sup | SU_SUP | SU_eco_sup | SU_mil_sup |
------------------------------------------------------------------------------------------
Fake 1984 1 1 1 1 1 1
Fake 1985 1 1 1 1 1 1
florp 1984 0 0 0 1 1 1
florp 1985 0 0 0 1 1 1
I added all of the dummies, and the US_SUP and SU_SUP columns have been populated with the correct values.
However, I am having trouble giving the right values to the other variables.
To do so, I wrote the following function:
def get_values(x):
    cols = ['eco_sup', 'mil_sup']
    nation = ''
    if x['SU_SUP'] == 1:
        nation = 'SU_'
    if x['US_SUP'] == 1:
        nation = 'US_'
    support_vars = x[['eco_sup', 'mil_sup']]
    # Since each line contains only one measure of support I can
    # automatically assume that the support_vars are from
    # the correct nation
    support_cols = [nation + x for x in cols]
    x[support_cols] = support_vars
The plan is then to use a df.groupby(...).agg('max') operation, but I never get to this step, as the function above returns 0 for each new dummy column, regardless of the values of the columns in the dataframe.
So in the last table all of the US/SU_mil/eco_sup variables would be 0.
Does anyone know what I am doing wrong / why the columns are getting the wrong value?
I solved my problem by abandoning the .apply function and using this instead (where old is a list of the old variable names):
for index, row in df.iterrows():
    if row['SU_SUP'] == 1:
        nation = 'SU_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
    if row['US_SUP'] == 1:
        nation = 'US_'
        for col in old:
            df[index: index + 1][nation + col] = int(row[col])
This did the trick!
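For reference, here is a vectorized sketch of the same assignment without iterrows (assuming the same column names; old is the list of original variable names, as above). The chained df[index: index + 1][...] assignment can also trigger a SettingWithCopyWarning in some pandas versions, which .loc avoids:
old = ['eco_sup', 'mil_sup']  # assumed contents of `old`

for prefix, flag in [('US_', 'US_SUP'), ('SU_', 'SU_SUP')]:
    mask = df[flag] == 1
    for col in old:
        # copy the support values into the prefixed columns wherever the flag is set
        df.loc[mask, prefix + col] = df.loc[mask, col]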
