Create array from dataframe columns in python - error when iterating

I have created this dataframe
d = {'col1': [1], 'col2': [3]}
df = pd.DataFrame(data=d)
print(df)
I have then created a field called "columnA" which is supposed to be an array made of the two elements contained in col1 and col2:
filter_col = [col for col in df if col.startswith('col')]
df["columnA"] = df[filter_col].values.tolist()
print(df)
Now, I was expecting columnA to be a list (or an array), but when I check the length of that field I get 1 (not 2, as I expected):
print("Length: ", str(len(df['columnA'])))
Length: 1
What do I need to do to get a value of 2 and therefore be able to iterate through that array?
For example, I would be able to do this iteration:
for i in range(len(df['columnA'])):
    print(i)
Result:
0
1
Can anyone help me, please?

You are on the right track. len(df['columnA']) counts the rows in the column (here 1), not the elements of the list stored inside it. Instead of using len() directly on the column, take the cell's value first and then apply len():
print("Length: ", len(df['columnA'].values[0]))

for item in df["columnA"]:
    for num in item:
        print(num)
This will iterate directly over the column


Creating a function which creates a new column based on values in two columns?

I have a data frame like this:
ID  Min_Value  Max_Value
1   0          0.10562
2   0.10563    0.50641
3   0.50642    1.0
I have another data frame that contains Value as a column. I want to create a new column in the second data frame which returns the ID whose Min_Value and Max_Value bracket the Value, per the data frame above. I could use if-else conditions, but the number of IDs is large and the code becomes too bulky. Is there an efficient way to do this?
If I understand correctly, just join/merge the two into one DataFrame; the between function then selects the rows of the second DataFrame whose Value falls inside the range.
import pandas as pd

data = {"Min_Value": [0, 0.10563, 0.50642],
        "Max_Value": [0.10562, 0.50641, 1.0]}
df = pd.DataFrame(data, index=[1, 2, 3])
df2 = pd.DataFrame({"Value": [0, 0.1, 0.58]}, index=[1, 2, 3])

df = df.join(df2)
# use inclusive="both" so values equal to a bound (like 0 here) still match
mask_between_values = df['Value'].between(df['Min_Value'], df['Max_Value'], inclusive="both")
# This is the result
df2[mask_between_values]
   Value
1   0.00
3   0.58
Suppose you have two dataframes df and new_df, and you want to add a column 'new_column' to new_df whose value is the ID from df whenever 'Value' lies between 'Min_Value' and 'Max_Value'. Then this code may help you.
for i in range(len(df)):
    if df.loc[i, 'Min_Value'] < new_df.loc[i, 'Value'] < df.loc[i, 'Max_Value']:
        new_df.loc[i, 'new_column'] = df.loc[i, 'ID']

Pandas slicing strings: keep all characters up to the second full stop and create new column to hold the new value

I need to slice the strings in a particular column and create a new column with the slice in it.
i.e. existing col A: 'CODE.45.6787' used to create col B: 'CODE.45'
Thank you!
df["B"] = df["A"].str.rsplit(".", 1).str[0]
print(df)
A B
0 CODE.45.6787 CODE.45
To be generic, I won't assume your strings all have exactly two full stops. Hence, instead of splitting your strings from the right, I keep splitting them from the left. For this, we can do it as follows:
df['B'] = df['A'].str.split('.').str[0:2].str.join('.')
Demo
df = pd.DataFrame({'A': ['CODE.45.6787', 'CODE.12.3456.78']})
df['B'] = df['A'].str.split('.').str[0:2].str.join('.')
print(df)
A B
0 CODE.45.6787 CODE.45
1 CODE.12.3456.78 CODE.12

Check for populated columns and delete values based on hierarchy?

I have a dataframe (very simplified version below):
d = {'col1': [1, '', 2], 'col2': ['', '', 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(data=d)
I need to loop through the dataframe and check how many columns are populated per row. If the row has just one column populated, I can continue on to the next row. If, however, the row has more than one non-NaN value, I need to make all the columns NaN apart from one, based on some hierarchy.
For example, let's say the hierarchy is:
col1 is the most important
col2 second etc.
Therefore, if two or more columns had data and one of them happened to be col1, I would drop all the other column values; otherwise I would check whether col2 has a value, and so on, then repeat for the next row.
I have something like this as an idea:
nrows = df.shape[0]
for index in range(0, nrows):
    print(index)
    # check if the row has only one column populated
    if (df.iloc[[index]].notna().sum() == 1):
        continue
    # check if more than one column is populated for that row
    elif (df.iloc[[index]].notna().sum() >= 1):
        if (index['col1'].notna() == True):
            df.loc[:, df.columns != 'col1'] == 'NaN'
        # continue down the hierarchy
but this is not correct as it gives True/False for every column and cannot read it the way I need.
Any suggestions very welcome! I was thinking of creating some sort of key, but feel there may be a more simply way to get there with the code I already have?
Edit:
Another important point which I should have included is that my index is not integers - it is unique identifiers which look something like this: '123XYZ', which is why I used range(0,n) and reshaped the df.
For the example dataframe you gave I don't think it would change after applying this algorithm so I didn't test it thoroughly, but something like this should work:
import numpy as np

hierarchy = ['col1', 'col2', 'col3']

# rows with more than one populated column need pruning
counts = df.notna().sum(axis=1)
inds = counts[counts >= 2].index

for i in inds:
    for col in hierarchy:
        if not pd.isna(df.loc[i, col]):
            tmp = df.loc[i, col]
            df.loc[i, :] = np.nan
            df.loc[i, col] = tmp
            break  # keep only the highest-priority populated column
Note I'm assuming that you actually mean NaN and not the empty string like you have in your example. If you want to look for empty strings, then inds and the if statement above would change.
I also think this should be faster than what you have above since it only loops through the rows that have more than one populated value.

Dictionary column arrangement where column name is 'start'

Normally to create a DataFrame with below code
df= pd.DataFrame({'a':[1],'b':[2]})
df
Output:
a b
0 1 2
But when I create a DataFrame with a column named 'start', the column order changes:
df1 = pd.DataFrame({'start':[2],'end':[4]})
df1
Output:
end start
0 4 2
I'm trying to understand why this order is getting changed.
If you don't specify the column order with columns=['', ''], older versions of pandas sort the keys alphabetically. As a result 'end' (e) comes first and 'start' (s) comes second.
This is because dictionaries were unordered before Python 3.6, and pandas falls back to ordering the keys alphabetically in that case.
As @GiovaniSalazar said:
df1 = pd.DataFrame({'start':[2],'end':[4]}, columns=['start','end'])
or, equivalently:
pd.DataFrame(data = [[2, 4]], columns=['start','end'])
Either form forces the order by passing an explicit, ordered column list.

Dictionary to Dataframe Error: "If using all scalar values, you must pass an index"

Currently, I am using a for loop to read csv files from a folder.
After reading the csv file, I am storing the data into one row of a dictionary.
When I print the data types using print(list_of_dfs.dtypes) I receive:
DATETIME    object
VALUE       float64
ID          int64
ID Name     object
dtype: object
Note that this is a nested dictionary with thousands of values stored in each of these data fields. I have 26 rows of the structure listed above. I am trying to append the dictionary rows into a dataframe where I will have only 1 row consisting of the datafields:
Index DATETIME VALUE ID ID Name.
Note: I am learning python as I go.
I tried using an array to store the data and then convert the array to a dataframe but I could not append the rows of the dataframe.
Using the dictionary method I attempted df = pd.DataFrame(list_of_dfs), which throws an error.
list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    # regionLoadArray contains my file names from the list directory.
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])

# this method was suggested at thispoint.com for nested dictionaries.
dataframe = pd.DataFrame(list_of_dfs)
# This is where my error occurs^
ValueError: If using all scalar values, you must pass an index
I appreciate any assistance with this issue as I am new to python.
My current goal is simply to produce a dataframe with my headers that I can then send to a csv.
Depending on your needs, a simple workaround could be:
dct = {'col1': 'abc', 'col2': 123}
dct = {k:[v] for k,v in dct.items()} # WORKAROUND
df = pd.DataFrame(dct)
which results in
print(df)
col1 col2
0 abc 123
This error occurs because pandas needs an index. At first this seems sort of confusing because you think of list indexing. What pandas is essentially asking for is a row label to correspond to each record. You can set this like so:
import pandas as pd
data = ['a', 'b', 'c', 'd']  # avoid shadowing the built-in name list
df = pd.DataFrame(data, index=[0, 1, 2, 3])
The data frame then yields:
   0
0  a
1  b
2  c
3  d
For you specifically, this might look something like this using numpy (not tested):
import numpy as np

list_of_dfs = {}
for I in range(0, len(regionLoadArray)):
    list_of_dfs[I] = pd.read_csv(regionLoadArray[I])

ind = np.arange(len(list_of_dfs))  # note: parentheses, not square brackets
dataframe = pd.DataFrame(list_of_dfs, index=ind)
Pandas unfortunately needs an index when creating a DataFrame whose values are all scalars.
You can either set it yourself, or use an object with the following structure so pandas can determine the index itself:
data= {'a':[1],'b':[2]}
Since it won't be easy to edit the data in your case, a hacky solution is to wrap the data in a list:
dataframe = pd.DataFrame([list_of_dfs])
import pandas as pd

d = [{"a": 1, "b": 2, "c": 3},
     {"a": 4, "b": 5, "c": 6},
     {"a": 7, "b": 8, "c": 9}]
pd.DataFrame(d, index=list(range(len(d))))
returns:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
