Drop 0 values, NaN values, and empty strings - python

import numpy as np
import pandas as pd

readfile = pd.read_csv('50.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(r'^\s*$', np.nan, regex=True)  # blank strings -> NaN
filevalues = filevalues.fillna(0)
int_series = filevalues.astype(int)
calculated_series = int_series.apply(lambda x: x * (1 / 1.2))
print(calculated_series)
So I have hundreds of csv files with many empty spots for values. Some of the blank spaces are detected as NaNs and others as empty strings. This has forced me to write my code the way it is right now: I need to apply a formula to each value, so I converted all such NaNs and empty strings to 0 so that I can apply any formula (in this example, multiplying by 1/1.2). The problem is that I do not want to see values that are 0, NaN, or empty strings when printing my dataframe.
I have tried to use the following:
filevalues = filevalues.dropna()
But because certain csv files have empty strings, this method does not fully work and I get the error:
ValueError: invalid literal for int() with base 10: ' '
I have also tried the following after converting all values to 0:
filevalues = filevalues.loc[:, (filevalues != 0).all(axis=0)]
and
mask = np.any(np.isnan(filevalues) | np.equal(a, 0), axis=1)
Every method seems to give a different error. Is there a clean way to not show these types of values when I print my pandas dataframe? Please let me know if an example csv file is needed.

Got it to work! Here is the answer if it is of use to anyone.
import numpy as np
import pandas as pd

readfile = pd.read_csv('50.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']
filevalues = filevalues.replace(" ", "", regex=True)  # remove stray spaces
filevalues.replace("", np.nan, inplace=True)          # replace empty strings with np.nan
filevalues.dropna(inplace=True)                       # drop NaN values
int_series = filevalues.astype(int)                   # change type
calculated_series = int_series.apply(lambda x: x * (1 / 1.2))
print(calculated_series)
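For what it's worth, a shorter route that should behave the same (a sketch; I don't have the original csv files to test against) is pd.to_numeric with errors='coerce', which turns empty strings and stray spaces into NaN in one step, and a final filter hides genuine zeros as the title asks:

import pandas as pd

readfile = pd.read_csv('50.csv')
filevalues = readfile.loc[readfile['Customer'].str.contains('Lam Dep', na=False), 'Jul-18\nQty']

# Coerce anything non-numeric (empty strings, stray spaces) to NaN, then drop
filevalues = pd.to_numeric(filevalues, errors='coerce').dropna()
filevalues = filevalues[filevalues != 0]  # also hide genuine zeros

calculated_series = filevalues.astype(int) / 1.2
print(calculated_series)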

Related

Pandas: count number of times every value in one column appears in another column

I want to count the number of times each value in the Child column appears in the Parent column, then display this count in a new column named 'child count'. See the df previews below.
I had this working via VBA (COUNTIFS), but now I need a dynamic visualization with an animated display fed with data from a dir. So I resorted to Python and pandas and tried the code below, after searching and reading answers like: Countif in pandas with multiple conditions | Determine if value is in pandas column | Iterate over rows in Pandas df | many others...
but I still can't get the expected preview illustrated in the image below.
Any help will be very much appreciated. Thanks in advance.
# import libraries
import os

import numpy as np
import pandas as pd

# get datasets
path_dataset = r'D:\Auto'
df_ns = pd.read_csv(os.path.join(path_dataset, 'Scripts', 'data.csv'),
                    index_col=False, encoding='ISO-8859-1', engine='python')

# preview dataframe
df_ns

# tried (also used for the preview output below)
df_ns.groupby(['Child', 'Parent', 'Site Name']).size().reset_index(name='child count')
[Images in the original post: preview dataframe, preview output, expected output]
[Edited] My data
Child = ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06', 'Tkt07', 'Tkt08', 'Tkt09', 'Tkt10']
Parent = [' ', ' ', 'Tkt03', ' ', ' ', 'Tkt03', ' ', 'Tkt03', ' ', ' ', 'Tkt06', ' ', ' ', ' ']
Site_Name = ['Yaounde', 'Douala', 'Bamenda', 'Bafoussam', 'Kumba', 'Garoua', 'Maroua', 'Ngaoundere', 'Buea', 'Ebolowa']
I created a lookalike of your df.
[Image: dataframe before]
Try this code:
df['Count'] = [len(df[df['parent'].str.contains(value)]) for index, value in enumerate(df['child'])]

# breaking it down as line-by-line code
counts = []
for index, value in enumerate(df['child']):
    found = df[df['parent'].str.contains(value)]
    counts.append(len(found))
df['Count'] = counts
[Image: dataframe after]
Hope this works for you.
Since I don't have access to your data, I cannot check the code I am giving you. I suspect you will have problems with NaN values with this line, but you can give it a try:
df_ns['child_count'] = df_ns['Parent'].groupby(df_ns['Child']).value_counts()
This names the new column and assigns values to it directly through the groupby -> value_counts chain.
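Another option, untested against the real data (a sketch built on the [Edited] sample above): count each Parent value once with value_counts and map the counts back onto Child, which avoids rescanning the whole frame for every row:

import pandas as pd

df = pd.DataFrame({'Child': ['Tkt01', 'Tkt02', 'Tkt03', 'Tkt04', 'Tkt05', 'Tkt06'],
                   'Parent': [' ', ' ', 'Tkt03', ' ', 'Tkt03', 'Tkt06']})

# Count occurrences of each value in Parent, then look the count up per Child
df['child count'] = df['Child'].map(df['Parent'].value_counts()).fillna(0).astype(int)
print(df)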

Save and load correctly pandas dataframe in csv while preserving freq of datetimeindex

I was trying to save a DataFrame and load it back. If I print the resulting df, they look (almost) identical, but the freq attribute of the datetimeindex is not preserved.
My code looks like this:
import datetime
import os

import numpy as np
import pandas as pd


def test_load_pandas_dataframe():
    idx = pd.date_range(start=datetime.datetime.now(),
                        end=(datetime.datetime.now()
                             + datetime.timedelta(hours=3)),
                        freq='10min')
    a = pd.DataFrame(np.arange(2 * len(idx)).reshape((len(idx), 2)), index=idx,
                     columns=['first', 2])
    a.to_csv('test_df')
    b = load_pandas_dataframe('test_df')
    os.remove('test_df')
    assert np.all(b == a)


def load_pandas_dataframe(filename):
    '''Correctly loads dataframe, but freq is not maintained'''
    df = pd.read_csv(filename, index_col=0, parse_dates=True)
    return df


if __name__ == '__main__':
    test_load_pandas_dataframe()
And I get the following error:
ValueError: Can only compare identically-labeled DataFrame objects
It is not a big issue for my program, but it is still annoying.
Thanks!
The issue here is that the dataframe you save has columns
Index(['first', 2], dtype='object')
but the dataframe you load has columns
Index(['first', '2'], dtype='object').
In other words, the columns of your original dataframe had the integer 2, but upon saving it with to_csv and loading it back with read_csv, it is parsed as the string '2'.
The easiest fix that passes your assertion is to change the columns argument in the DataFrame construction to:
columns=['first', '2'])
To complement @jfaccioni's answer: the freq attribute is indeed not preserved, and there are two options here.
Fast and simple: use pickle, which will preserve everything:
a.to_pickle('test_df')
b = pd.read_pickle('test_df')
a.equals(b) # True
Or you can use the inferred_freq attribute of a DatetimeIndex:
a.to_csv('test_df')
b = pd.read_csv('test_df', index_col=0, parse_dates=True)
b.index.freq = b.index.inferred_freq
print(b.index.freq)  # <10 * Minutes>
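Folding the second option into the loader from the question (a sketch under the same assumptions), so the frequency is restored as part of loading:

import pandas as pd

def load_pandas_dataframe(filename):
    df = pd.read_csv(filename, index_col=0, parse_dates=True)
    # read_csv drops the freq; re-infer it from the parsed index
    df.index.freq = df.index.inferred_freq
    return df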

Dataframe with arrays and key-pairs

I have a JSON structure which I need to convert into a data-frame. I have converted it through the pandas library, but I am having issues with two columns: one is an array and the other one is a key-value pair.
Pito                    Value
{"pito-key": "Number"}  [{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]
How do I break these columns out into the data-frame?
As far as I understood your question, you can apply regular expressions to do that.
import re

import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

def get_value(s):
    s = s[1]
    v = re.findall(r'VALUE\":\".*\"', s)
    return int(v[0][8:-1])

def get_pito(s):
    s = s[0]
    v = re.findall(r'key\": \".*\"', s)
    return v[0][7:-1]

df['value'] = df.apply(get_value, axis=1)
df['pito'] = df.apply(get_pito, axis=1)
df.head()
Here I create 2 functions that transform your scary strings to values you want them to have
Let me know if that's not what you meant
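Since both columns hold well-formed JSON, an alternative worth considering (a sketch, assuming the strings are valid JSON exactly as shown) is to parse them with the standard json module instead of regular expressions, which is less brittle if the field order or spacing ever changes:

import json

import pandas as pd

data = {'pito': ['{"pito-key": "Number"}'],
        'value': ['[{"WRITESTAMP": "2018-06-28T16:30:36Z", "S":"41bbc22","VALUE":"2"}]']}
df = pd.DataFrame(data)

# Parse each cell as JSON, then pull out the piece you need
df['pito'] = df['pito'].apply(lambda s: json.loads(s)['pito-key'])
df['value'] = df['value'].apply(lambda s: int(json.loads(s)[0]['VALUE']))
print(df)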

Replace None with NaN and ignore NoneType in Pandas

I'm attempting to create a raw string variable from a pandas dataframe, which will eventually be written to a .cfg file, by first joining two columns together as shown below while avoiding None:
Section of df:
command value
...
439 sensitivity "0.9"
440 cl_teamid_overhead_always 1
441 host_writeconfig None
...
code:
...
df = df['value'].replace('None', np.nan, inplace=True)
print df
df = df['command'].astype(str)+' '+df['value'].astype(str)
print df
cfg_output = '\n'.join(df.tolist())
print cfg_output
I've attempted to replace all the None values with NaN first so that no lines in cfg_output contain "None" as part of the string. However, by doing so I seem to get a few undesired results. I used print statements to see what is going on.
It seems that df = df['value'].replace('None', np.nan, inplace=True) simply outputs None.
It seems that df = df['command'].astype(str)+' '+df['value'].astype(str) and cfg_output = '\n'.join(df.tolist()) cause the following error:
TypeError: 'NoneType' object has no attribute '__getitem__'
Therefore, I was thinking that by ignoring any occurrences of NaN the code might run smoothly, although I'm unsure how to do so using pandas.
Ultimately, my desired output would be as follows:
sensitivity "0.9"
cl_teamid_overhead_always 1
host_writeconfig
First of all, df['value'].replace('None', np.nan, inplace=True) returns None because you're calling the method with the inplace=True argument. This argument tells replace not to return anything but instead modify the original dataframe in place, similar to how pop or append work on lists.
With that being said, you can also get the desired output by calling fillna with an empty string:
import pandas as pd
import numpy as np
d = {
    'command': ['sensitivity', 'cl_teamid_overhead_always', 'host_writeconfig'],
    'value': ['0.9', 1, None]
}
df = pd.DataFrame(d)
# df['value'].replace('None', np.nan, inplace=True)
df = df['command'].astype(str) + ' ' + df['value'].fillna('').astype(str)
cfg_output = '\n'.join(df.tolist())
>>> print(cfg_output)
sensitivity 0.9
cl_teamid_overhead_always 1
host_writeconfig
You can also replace None with '':
df=df.replace('None','')
df['command'].astype(str)+' '+df['value'].astype(str)
Out[436]:
439 sensitivity 0.9
440 cl_teamid_overhead_always 1
441 host_writeconfig
dtype: object
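One small refinement, untested against the original data: joining with ' ' leaves a trailing space on rows whose value is empty (e.g. 'host_writeconfig '), so a final str.strip() gives exactly the desired output:

joined = (df['command'].astype(str) + ' ' + df['value'].fillna('').astype(str)).str.strip()
cfg_output = '\n'.join(joined.tolist())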

Splitting Regex response column on python

I am receiving an object array after applying re.findall for links and hashtags on Tweets data. My data looks like:
b=['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
Now I want to split it into columns. I am using the following:
df = pd.DataFrame(b.str.split(',',1).tolist(),columns = ['flips','row'])
But it is not working, because of the weird datatype I guess. I tried a few other solutions as well; nothing worked. This is what I am expecting, two separate columns:
https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
https://t.co/CJZWjaBfJU
https://t.co/4GMhoXhBQO https://t.co/0V
https://t.co/Erutsftlnq
https://t.co/86VvLJEzvG
It's not clear from your question what exactly is part of your data. (Does it include the square brackets and single quotes?) In any case, the pandas read_csv function is very versatile and can handle ragged data:
from io import StringIO

import pandas as pd

raw_data = """
['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q']
['https://t.co/CJZWjaBfJU']
['https://t.co/4GMhoXhBQO', 'https://t.co/0V']
['https://t.co/Erutsftlnq']
['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']
"""

# You'll probably replace the StringIO part with the filename of your data.
df = pd.read_csv(StringIO(raw_data), header=None, names=('flips', 'row'))

# Get rid of the square brackets, single quotes, and leading spaces
for col in ('flips', 'row'):
    df[col] = df[col].str.strip("[]' ")
df
Output:
flips row
0 https://t.co/1u0dkzq2dV https://t.co/3XIZ0SN05Q
1 https://t.co/CJZWjaBfJU NaN
2 https://t.co/4GMhoXhBQO https://t.co/0V
3 https://t.co/Erutsftlnq NaN
4 https://t.co/86VvLJEzvG https://t.co/zCYv5WcFDS
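If the data is still a Python list of lists (as the b = [...] assignment suggests) rather than text in a file, pandas can build the two columns directly; this is a sketch assuming that shape:

import pandas as pd

b = [['https://t.co/1u0dkzq2dV', 'https://t.co/3XIZ0SN05Q'],
     ['https://t.co/CJZWjaBfJU'],
     ['https://t.co/4GMhoXhBQO', 'https://t.co/0V'],
     ['https://t.co/Erutsftlnq'],
     ['https://t.co/86VvLJEzvG', 'https://t.co/zCYv5WcFDS']]

# Rows shorter than two entries are padded with None automatically
df = pd.DataFrame(b, columns=['flips', 'row'])
print(df)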
