import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

for idx, row in df.iterrows():
    df.at[idx, 'col3'] = dict(cats=calculation(row['col1']))
df
So as you can see from the code above, I have tried a few different things.
Basically, I am trying to get the dictionary into col3.
However, when you run it for the first time on a new dataframe, you get this:
col1 col2 col3
0 1 3 cats
1 2 4 {'cats': 4}
If you run the for loop again on the same dataframe, you get what I am looking for, which is
col1 col2 col3
0 1 3 {'cats': 2}
1 2 4 {'cats': 4}
How do I go straight to having the dictionary in there to start without having to run the loop again?
I have tried other approaches, like df.loc, but still no joy.
Try to stay away from df.iterrows().
You can use df.apply instead:
import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
def calculation(text):
    return text*2

def calc_dict(row):
    return dict(cats=calculation(row['col1']))
df['col3'] = df.apply(calc_dict, axis=1)
df
Which outputs the result you expect.
The error seems to creep in with the creation and assignment of an object datatype to column col3. I tried to pre-allocate it to NaNs with df['col3'] = np.nan, which did not have an effect (inspect with print(df.dtypes)). Anyway, this seems like buggy behaviour. Use df.apply instead; it's faster and less prone to these types of issues.
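If you would rather not use apply either, a plain list comprehension (just a sketch along the same lines, not the only way) builds the whole object column in a single assignment, which also sidesteps the dtype issue:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

def calculation(text):
    return text*2

# Assigning a list of dicts creates an object-dtype column in one step,
# so there is no per-row assignment into a half-built column.
df['col3'] = [dict(cats=calculation(v)) for v in df['col1']]
df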
Related
I'm trying to append each row of a DataFrame separately. Each row has Series and scalar values. An example of a row would be
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
When I create a DataFrame from this, it looks like this
df = pd.DataFrame(row)
df
col1 col2 col3
0 1 blah first
1 1 blah second
This is what I want. The scalar values are repeated, which is good. Now, some of my rows have an empty Series for the column, like so:
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
When I create a new DataFrame in order to concat the two, like so
second_df = pd.DataFrame(another_row)
I get back an empty DataFrame, which is not what I'm looking for.
>>> second_df = pd.DataFrame({'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')})
>>> second_df
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
>>>
What I'm actually after is something like this
>>> second_df
col1 col2 col3
0 45 'more blah' NaN
Or something like that. Basically, I don't want the entire row to be dropped on the floor; I want the empty Series to be represented by None or NaN or something.
I don't get any errors, and nothing warns me that anything is out of the ordinary, so I have no idea why the df is behaving like this.
You could pass an index to make it work (and get the dataframe with NaN in the third column):
another_row = {'col1': 45, 'col2':'more blah', 'col3': pd.Series([], dtype='object')}
second_df = pd.DataFrame(another_row, index=[0])
When passing all scalars and a Series, the number of rows is determined by the length of the Series – if the length is zero, so is the number of rows. You could pass singleton lists instead of scalars so the number of rows is no longer zero:
another_row = {'col1': [45], 'col2': ['more blah'], 'col3': [np.nan]}
second_df = pd.DataFrame(another_row)
Alternatively, pass all scalars and an index like above,
another_row = {'col1': 45, 'col2': 'more blah', 'col3': np.nan}
second_df = pd.DataFrame(another_row, index=[0])
but I'd probably just do
second_df = pd.DataFrame([[45, 'more blah', np.nan]],
columns=['col1', 'col2', 'col3'])
Ultimately, I reworked my code to avoid having this problem. My solution is as follows:
I have a function do_data_stuff() that used to return a pandas Series, but I have changed it to return
a Series if there's stuff in it, e.g. Series([1, 2, 3]),
or np.nan if it would otherwise be empty.
A side effect of going with this approach is that the DataFrame now requires an index if only scalars are passed: "ValueError: If using all scalar values, you must pass an index".
So I can't hard-code index=[0] like that, because I want the Series to determine the number of rows in the DataFrame automatically.
row = {'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])}
df = pd.DataFrame(row)
df
col1 col2 col3
0 1 blah first
1 1 blah second
So what I ended up doing was adding a dynamic index call. I'm not sure if this is proper python, but it worked for me.
stuff = do_data_stuff()
data = pd.DataFrame(
    {
        'col1': 45,
        'col2': 'very awesome stuff',
        'col3': stuff
    },
    index=[0] if stuff is np.nan else None
)
And then I concatenated my DataFrames using the following:
data = pd.concat([data, some_other_df], ignore_index=True)
The result was a DataFrame that looks like this
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'col1': 1, 'col2':'blah', 'col3': pd.Series(['first', 'second'])})
>>> df
col1 col2 col3
0 1 blah first
1 1 blah second
>>> stuff = np.nan
>>> stuff
nan
>>> df = pd.concat([
df, pd.DataFrame(
{
'col1': 45,
'col2': 'more awesome stuff',
'col3': stuff
},
index=[0] if stuff is np.nan else None
)], ignore_index=True)
>>> df
col1 col2 col3
0 1 blah first
1 1 blah second
2 45 more awesome stuff NaN
You can replace np.nan with anything, like "".
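A small variation on the same idea (just a sketch, assuming do_data_stuff() returns either a non-empty Series or np.nan as described above): wrap the row construction in a helper so the index decision lives in one place.
import numpy as np
import pandas as pd

def make_row_df(col1, col2, stuff):
    # A scalar such as np.nan needs an explicit index;
    # a non-empty Series can determine the number of rows itself.
    needs_index = not isinstance(stuff, pd.Series)
    return pd.DataFrame(
        {'col1': col1, 'col2': col2, 'col3': stuff},
        index=[0] if needs_index else None
    )

data = pd.concat([make_row_df(1, 'blah', pd.Series(['first', 'second'])),
                  make_row_df(45, 'very awesome stuff', np.nan)],
                 ignore_index=True)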
I need to convert DataFrames similar to this one to JSON format:
                col1 col2  col3
col1 col2 col3
1    a    10       1    a    10
2    b    11       2    b    11
3    c    12       3    c    12
However, when I run df.to_json(orient='table') I'm getting the exception ValueError: Overlapping names between the index and columns. I understand why this happens, but I really would like to know if there is an easy way to circumvent the error. All I need is to convert the DataFrame to JSON maintaining the same indexes, and when restoring it, get the original DataFrame.
Here I leave a code snippet so you can reproduce the scenario.
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [10, 11, 12]})
df = df.set_index(keys=['col1', 'col2', 'col3'], drop=False)
df.to_json(orient='table')
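One possible workaround (a sketch, not necessarily the only way; the renamed level names below are arbitrary placeholders): rename the index levels before serializing so they no longer collide with the column names, then rename them back after reading.
import pandas as pd
from io import StringIO

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c'], 'col3': [10, 11, 12]})
df = df.set_index(keys=['col1', 'col2', 'col3'], drop=False)

# Give the index levels non-clashing names before serializing.
df.index.names = ['idx_' + name for name in df.index.names]
json_str = df.to_json(orient='table')

# Round trip: read back and restore the original level names.
restored = pd.read_json(StringIO(json_str), orient='table')
restored.index.names = [name.replace('idx_', '', 1) for name in restored.index.names]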
I am trying to remove specific NA values with pandas' .dropna() method; however, when I apply it, the method returns a None object.
import pandas as pd
# importing data #
df = pd.read_csv(path, sep=',', na_values='NA')
# this is what the df looks like
df = {'col1': [1, 2], 'col2': ['NA', 4]}
df = pd.DataFrame(df)
# trying to drop NA
d = df.dropna(how='any', inplace=True)
This code returns a None object; the expected output could look like this:
# col1 col2
#0 2 4
How could I adjust this method?
Is there any simpler method to accomplish this task?
import numpy as np
import pandas as pd
First, replace the 'NA' strings in your dataframe with actual NaN values using the replace() method:
df = df.replace('NA', np.nan, regex=True)
Finally:
df.dropna(how='any', inplace=True)
Now if you print df you will get your desired output:
col1 col2
1 2 4.0
If you want the exact same output that you mentioned in the question, then just use the reset_index() method:
df = df.reset_index(drop=True)
Now if you print df you will get:
col1 col2
0 2 4.0
Remove records containing the string 'NA':
df[~df.eq('NA').any(axis=1)]
col1 col2
1 2 4
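Either way, a minimal end-to-end sketch of the replace/dropna route (using the small dataframe from the question rather than the CSV):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': ['NA', 4]})

# Turn the 'NA' strings into real missing values, drop those rows,
# then renumber the index.
cleaned = df.replace('NA', np.nan).dropna(how='any').reset_index(drop=True)
print(cleaned)  # one remaining row: col1 = 2, col2 = 4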
I am attempting to run a simple query in PowerBI using Python. Sadly, most Python libraries are not supported in PowerBI, so I'm limited to pandas and numpy. The dataset is a set of projects that are either in pipeline or active. I want to filter the dataset down to rows that are just in pipeline, based on a set of OR conditions. So it would look like
dataframe = pd.DataFrame(where project = 'Pipeline', <set of other conditions to filter pipeline launches by>)
Is that possible in python, similar to a nested where statement?
You can use isin to look up multiple values in a column. If you want to filter based on multiple columns, then you need to chain your loc or iloc conditions. Basically, whatever gets returned from an iloc or loc is also a dataframe; below is a working example of both conditions.
>>> import pandas as pd
>>> import numpy as np
>>> d = {'col1': [1, 2, 2, 3], 'col2': [3, 4, 5, 6], 'col3': ['NULL','NULL', np.nan, 'virgo']}
>>> df = pd.DataFrame(data=d)
We will query the dataframe to get rows where col3 is either virgo or NULL
>>> df.loc[df['col3'].isin(['virgo','NULL'])]
col1 col2 col3
0 1 3 NULL
1 2 4 NULL
3 3 6 virgo
Now we will query the dataframe to get rows where col3 is NULL and col2 is 4
>>> df.loc[df['col3'] == 'NULL'].loc[df['col2'] == 4]
col1 col2 col3
1 2 4 NULL
If all conditions are on a single column, do:
df[df.column_name.isin([value_1, value_2, value_n])]
If the conditions are on multiple columns, do:
df[(df.column_1 == "value_1") & (df.column_2 == "value_2") & (df.column_n.isin([val_1, val_2, val_3]))]
Note:
& means AND, so the result is True only if both the left and right conditions are True; otherwise the result is False.
If you need an OR condition, use | instead.
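Applied to the scenario in the question (the column names here are made up, since the real schema isn't shown), the filter might look something like this:
import pandas as pd

# Hypothetical data standing in for the real project dataset.
projects = pd.DataFrame({
    'project_status': ['Pipeline', 'Active', 'Pipeline', 'Active'],
    'region': ['EMEA', 'EMEA', 'APAC', 'AMER'],
    'budget': [100, 250, 80, 300],
})

# Keep only pipeline projects, then narrow further with & / | conditions.
pipeline = projects[(projects['project_status'] == 'Pipeline')
                    & (projects['region'].isin(['EMEA', 'APAC']))]
print(pipeline)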
I want to add a column of 1s at the beginning of a pandas dataframe which is created from an external data file 'ex1data1.txt'. I wrote the following code. The problem is that the print(data) command at the end prints None. What is wrong with this code? I want data to be a pandas dataframe. The raw_data and X0_ are fine; I have printed them.
import numpy as np
import pandas as pd
raw_data = pd.read_csv('ex1data1.txt', header=None, names=['x1', 'y'])
X0_ = np.ones(len(raw_data))
idx = 0
data = raw_data.insert(loc=idx, column='x0', value=X0_)
print(data)
Another solution might look like this:
import numpy as np
import pandas as pd
raw_data = pd.read_csv('ex1data1.txt', header=None, names=['x1', 'y'])
raw_data.insert(loc=0, column='x0', value=1.0)
print(raw_data)
pd.DataFrame.insert
You can use pd.DataFrame.insert, but note that this solution works in place and does not need reassignment. You may also need to explicitly set the dtype to int:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
columns=['col1', 'col2', 'col3'])
arr = np.ones(len(df.index), dtype=int)
idx = 0
df.insert(loc=idx, column='col0', value=arr)
print(df)
col0 col1 col2 col3
0 1 1 2 3
1 1 4 5 6
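This also explains the None in the question: DataFrame.insert modifies the frame in place and returns None, so assigning its result throws the reference away. A minimal illustration (with a small stand-in dataframe instead of the CSV):
import numpy as np
import pandas as pd

raw_data = pd.DataFrame({'x1': [1.0, 2.0], 'y': [3.0, 4.0]})
result = raw_data.insert(loc=0, column='x0', value=np.ones(len(raw_data)))
print(result)    # None - insert does not return the dataframe
print(raw_data)  # the new 'x0' column is on the original object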
Direct definition + reordering
One clean solution is to simply add a column and move the last column to the beginning of your dataframe. Here's a complete example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
columns=['col1', 'col2', 'col3'])
df['col0'] = 1 # adds column to end of dataframe
cols = [df.columns[-1]] + df.columns[:-1].tolist() # move last column to front
df = df[cols] # apply new column ordering
print(df)
col0 col1 col2 col3
0 1 1 2 3
1 1 4 5 6