convert a dataframe column from string to List of numbers - python

I have created the following dataframe from a csv file:
id marks
5155 1,2,3,,,,,,,,
2156 8,12,34,10,4,3,2,5,0,9
3557 9,,,,,,,,,,
7886 0,7,56,4,34,3,22,4,,,
3689 2,8,,,,,,,,
It is indexed on id. The values in the marks column are strings. I need to convert them to lists of numbers so that I can iterate over them and use them as index numbers for another dataframe. How can I convert them from strings to lists? I tried to add a new column and convert it based on "Add a columns in DataFrame based on other column", but it failed:
df = df.assign(new_col_arr=lambda x: np.fromstring(x['marks'].values[0], sep=',').astype(int))

Here's a way to do it:
df = df.assign(new_col_arr=df['marks'].str.split(','))
# convert to int
df['new_col'] = df['new_col_arr'].apply(lambda x: list(map(int, [i for i in x if i != ''])))

I presume that you want to create a NEW dataframe, since the number of items is different from the number of rows. I suggest the following:
#source data
df = pd.DataFrame({'id': [5155, 2156, 7886],
                   'marks': ['1,2,3,,,,,,,,', '8,12,34,10,4,3,2,5,0,9', '0,7,56,4,34,3,22,4,,,']})
# create dictionary from df:
dd = {row[0]:np.fromstring(row[1], dtype=int, sep=',') for _, row in df.iterrows()}
{5155: array([1, 2, 3]),
2156: array([ 8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
7886: array([ 0, 7, 56, 4, 34, 3, 22, 4])}
# here you pad the lists inside dictionary so that they have equal length
...
# convert dd to DataFrame:
df2 = pd.DataFrame(dd)
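The padding step elided above could look like the following sketch. The `-1` fill value is an arbitrary placeholder of my choosing (an integer array cannot hold NaN); pick whatever sentinel suits your data:

```python
import numpy as np
import pandas as pd

# dictionary of unequal-length arrays, as produced by the fromstring step above
dd = {5155: np.array([1, 2, 3]),
      2156: np.array([8, 12, 34, 10, 4, 3, 2, 5, 0, 9]),
      7886: np.array([0, 7, 56, 4, 34, 3, 22, 4])}

# pad each array at the end so every column has the same length
maxlen = max(len(v) for v in dd.values())
dd_padded = {k: np.pad(v, (0, maxlen - len(v)), constant_values=-1)
             for k, v in dd.items()}

df2 = pd.DataFrame(dd_padded)
print(df2.shape)  # (10, 3)
```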

I found two similar alternatives:
1.
df['marks'] = df['marks'].str.split(',').map(lambda num_str_list: [int(num_str) for num_str in num_str_list if num_str])
2.
df['marks'] = df['marks'].map(lambda arr_str: [int(num_str) for num_str in arr_str.split(',') if num_str])
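A minimal end-to-end sketch of option 2, using a couple of the rows from the question, to show the conversion working:

```python
import pandas as pd

df = pd.DataFrame({'id': [5155, 3689],
                   'marks': ['1,2,3,,,,,,,,', '2,8,,,,,,,,']}).set_index('id')

# split on commas, drop the empty strings, convert the rest to int
df['marks_list'] = df['marks'].map(
    lambda s: [int(t) for t in s.split(',') if t])

print(df.loc[5155, 'marks_list'])  # [1, 2, 3]
```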


Can I split a DataFrame by columns?

I need to split a DataFrame by columns.
I made a simple code that runs without error, but it didn't give me the return I expected.
Here's the simple code:
dados = pd.read_excel(r'XXX')
for x in range(1, 13):
    selectmonth = x
    while selectmonth < 13:
        df_datas = dados.loc[dados['month'] == selectmonth]
        correlacao2 = df_datas.corr().round(4).iloc[0]
    else:
        break
    print()
I did it one by one by inputting the selected month manually, like this:
dfdatas = dados.loc[dados['month'] == selectmonth]
print('\n Voce selecionou o mês: ', selectmonth)
colunas2 = list(dfdatas.columns.values)
correlacao2 = dfdatas.corr().round(4).iloc[0]
print(correlacao2)
is there some way to do this in a loop? from month 1 to 12?
With pandas you should avoid loops wherever possible; they are very slow. You can achieve what you want here with index slicing. Assuming your columns are just the month numbers, you can do this:
setting up an example df:
df = pd.DataFrame([], columns=range(15))
df:
Columns: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
Index: []
getting columns with numbers 1 to 12:
dfdatas = df.loc[:, 1:12]
Columns: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Index: []
In the future, you should include example data in your question.
Just try this:
correlacao2 = dados.corr(method='pearson').round(4)
for month in dados.columns:
    print('\n Voce selecionou o mês: ', month)
    result = correlacao2.loc[month]
    result = pd.DataFrame(result)
    print(result)
Here I have used corr() and a for loop, and converted the result to a DataFrame.
dados is the name of your dataframe.
If your column names are numbers, rename them with month names using dados.rename(columns={'1': 'Jan', '2': 'Feb', '3': 'Mar'}); include the other months the same way. After renaming, apply the above code to get the expected answer.
If you don't want to rename, use .iloc[] instead of .loc[] in the code above.
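For what the question literally asks (one correlation per month, months 1 to 12), a plain loop over the filtered frame also works. A sketch with made-up stand-in data, since the original Excel file isn't available:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the Excel data: a 'month' column plus two numeric columns
rng = np.random.default_rng(0)
dados = pd.DataFrame({'month': np.repeat(np.arange(1, 13), 3),
                      'a': rng.uniform(size=36),
                      'b': rng.uniform(size=36)})

for selectmonth in range(1, 13):
    # filter rows for this month, then correlate the remaining numeric columns
    df_datas = dados.loc[dados['month'] == selectmonth]
    correlacao2 = df_datas.drop(columns='month').corr().round(4).iloc[0]
    print('Voce selecionou o mês:', selectmonth)
    print(correlacao2)
```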

How to read comma separated string in one cell using Python

I have a project where I need to read data from an Excel file. I use openpyxl to read the said file. I tried reading the data as strings before converting them to integers; however, an error occurs because of, I think, numbers in one cell separated by commas. I am trying to build a nested list, but I am still new to Python.
My code looks like this:
# storing S
S_follow = []
for row in range(2, max_row+1):
    if sheet.cell(row, 3).value is not None:
        S_follow.append(sheet.cell(row, 3).value)
# to convert the list from string to int, nested list
for i in range(0, len(S_follow)):
    S_follow[i] = int(S_follow[i])
print(S_follow)
The data I am trying to read is:
['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
Hoping for your help.
When you're about to convert the values to integers in the loop near the end of your script, you can check whether each value is an integer or a string. If it is a string, split it, convert the split values to integers, push them to a temporary list (say, strVal), and append that temp list to a new list (say, S_follow_int). If the value is not a string, just append it to S_follow_int unchanged.
data = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
S_follow = data  # in your script, S_follow is populated by the sheet-reading loop above
S_follow_int = []
# to convert the list from string to int, nested list
for i in range(0, len(S_follow)):
    # if the current value is a string, split it, convert the pieces to integers,
    # put them in a temp list called strVal, then append that to S_follow_int
    if type(S_follow[i]) is str:
        x = S_follow[i].split(',')
        strVal = []
        for y in x:
            strVal.append(int(y))
        S_follow_int.append(strVal)
    # else it is already an integer; just append it to S_follow_int unchanged
    else:
        S_follow_int.append(S_follow[i])
print(S_follow_int)
However, I would recommend checking the datatype (str/int) of each value in the initial loop that retrieves the data from the Excel file itself, rather than pushing all values to S_follow and converting afterwards, like this:
# simplified representation of the logic you can use for your script
data = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
x = []
for dat in data:
    if dat is not None:
        if type(dat) is str:
            y = dat.split(',')
            strVal = []
            for z in y:
                strVal.append(int(z))
            x.append(strVal)
        else:
            x.append(dat)
print(x)
S_follow = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]
for i in range(0, len(S_follow)):
    try:
        s = S_follow[i].split(',')
        del S_follow[i]
        for j in range(len(s)):
            s[j] = int(s[j])
        S_follow.insert(i, s)
    except AttributeError:
        S_follow[i] = int(S_follow[i])
print(S_follow)
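The same conversion can be written as a single comprehension, which avoids mutating the list while looping over it (a sketch, not from the original answers):

```python
S_follow = ['2,3', 4, '5,6', 8, 7, 9, 8, 9, 3, 11, 0]

# strings become lists of ints; ints pass through unchanged
result = [[int(p) for p in v.split(',')] if isinstance(v, str) else v
          for v in S_follow]

print(result)  # [[2, 3], 4, [5, 6], 8, 7, 9, 8, 9, 3, 11, 0]
```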

Finding Dates in one array based on the ranges from another array and closest value

I have two nested NumPy arrays (dateValArr & searchDates). dateValArr contains every date in May 2011 (1st-31st) and an associated value for each date. searchDates contains 2 dates and an associated value as well (the 2 dates correspond to a date range).
Using the date ranges specified in the searchDates array, I want to find the matching dates in dateValArr. Then, for those selected dates in dateValArr, I want to find the value closest to the value specified in searchDates.
I have come up with this code, but the first part only works if a single value is specified.
#setup arrays ---------------------------------------------------------------------------
# Generate dates
st_date = '2011-05-01'
ed_date = '2011-05-31'
dates = pd.date_range(st_date,ed_date).to_numpy(dtype = object)
# Generate Values
val_arr = np.random.uniform(1,12,31)
dateValLs = []
for i, j in zip(dates, val_arr):
    dateValLs.append((i, j))
dateValArr = np.asarray(dateValLs)
print(dateValArr)
#out:
[[Timestamp('2011-05-01 00:00:00', freq='D') 7.667399233149668]
[Timestamp('2011-05-02 00:00:00', freq='D') 5.906099813052642]
[Timestamp('2011-05-03 00:00:00', freq='D') 3.254485533826182]
...]
#Generate search dates
searchDates = np.array([(datetime(2011,5,11),datetime(2011,5,20),9),(datetime(2011,5,25),datetime(2011,5,29),2)])
print(searchDates)
#out:
[[datetime.datetime(2011, 5, 11, 0, 0) datetime.datetime(2011, 5, 20, 0, 0) 9]
[datetime.datetime(2011, 5, 25, 0, 0) datetime.datetime(2011, 5, 29, 0, 0) 2]]
#end setup ------------------------------------------------------------------------------
x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0][0], dateValArr[:,0] < searchDates[0][1]))
print(x)
out: (array([11, 12, 13, 14, 15, 16, 17, 18], dtype=int64),)
However, the code works only if I select the first element of searchDates (searchDates[0][0]). It will not run for all values in searchDates; that is, if I replace it with the following code:
x = np.where(np.logical_and(dateValArr[:,0] > searchDates[0], dateValArr[:,0] < searchDates[0]))
then I get the following error: operands could not be broadcast together with shapes (31,) (3,)
To find the closest value I hoping to somehow combine the following line of the code,
n = (np.abs(dateValArr[:,1]-searchDates[:,2])).argmin()
Any ideas on how to solve this? Thanks in advance.
The only thing that came to my mind is a for loop:
result = np.array([])
for search_term in searchDates:
    mask = (dateValArr[:,0] > search_term[0]) & (dateValArr[:,0] < search_term[1])
    date_search_result = dateValArr[mask, :]
    d = np.abs(date_search_result[:,1] - search_term[2])  # distance to this range's target value
    result = np.hstack([result, date_search_result[d.argmin()]])
print(result)
I kind of figured it out as well:
date_value = []
for i in searchDates:
    dateidx_arr = np.where(np.logical_and(dateValArr[:,0] >= i[0], dateValArr[:,0] <= i[1]))  # get the indices of the specified date range
    date_arr = dateValArr[dateidx_arr]  # based on the indices, get the dates and values
    value_arr = (np.abs(date_arr[:,1] - i[2])).argmin()  # for those dates, find the closest-value index
    date_value.append(date_arr[value_arr])  # use the index to get the closest date and value
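Put together as a runnable sketch, with a deterministic value column in place of the random one so the picks are predictable (the data here is made up for illustration):

```python
from datetime import datetime, timedelta
import numpy as np

# deterministic stand-ins for dateValArr / searchDates
dates = np.array([datetime(2011, 5, 1) + timedelta(days=d) for d in range(31)],
                 dtype=object)
vals = np.linspace(1.0, 12.0, 31)            # value rises steadily through May
dateValArr = np.column_stack([dates, vals])  # object array, shape (31, 2)

searchDates = np.array([(datetime(2011, 5, 11), datetime(2011, 5, 20), 9),
                        (datetime(2011, 5, 25), datetime(2011, 5, 29), 2)],
                       dtype=object)

date_value = []
for start, end, target in searchDates:
    mask = (dateValArr[:, 0] >= start) & (dateValArr[:, 0] <= end)
    window = dateValArr[mask]                      # rows inside the date range
    best = np.abs(window[:, 1] - target).argmin()  # closest value within the range
    date_value.append(window[best])

print([row[0].day for row in date_value])  # [20, 25]
```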

Python 2D list to dictionary

I have a 2-dimensional list and have to take 2 columns from it and place the values from each column as key:value pairs.
Example:
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]

def averages(table, col, by):
    columns = tuple(table[i][col] for i in range(len(table)))  # place the col column into a tuple so it can be placed into the dictionary
    groupby = tuple(table[i][by] for i in range(len(table)))   # place the groupby column into a tuple so it can be placed into the dictionary
    avgdict = {}
    avgdict[groupby] = [columns]
    print(avgdict)

averages(table, 1, 3)
Output is:
{(2, 0, 0): [(29, 9, 27)]}
I am trying to get the output to equal:
{0:36, 2:29}
So essentially the two keys of 0 have their values added.
I'm having a hard time understanding how to separate each key with its values
and then add the values together when the keys are equal.
Edit: I'm only using Python Standard library, and not implementing numpy for this problem.
You can create an empty dictionary, then iterate through every element of groupby. If the element in groupby already exists in the dictionary, add the corresponding element of columns to the value in the dictionary. Otherwise, add the element of groupby as a key and the corresponding element of columns as its value. The implementation is as follows:
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]

def averages(table, col, by):
    columns = tuple(table[i][col] for i in range(len(table)))  # place the col column into a tuple
    groupby = tuple(table[i][by] for i in range(len(table)))   # place the groupby column into a tuple
    avgdict = {}
    for x in range(len(groupby)):
        key = groupby[x]
        if key in avgdict:
            avgdict[key] += columns[x]
        else:
            avgdict[key] = columns[x]
    print(avgdict)

averages(table, 1, 3)
Otherwise, if you want to keep your initial avgdict, then you can change the averages() function to:
def averages(table, col, by):
    columns = tuple(table[i][col] for i in range(len(table)))  # place the col column into a tuple
    groupby = tuple(table[i][by] for i in range(len(table)))   # place the groupby column into a tuple
    avgdict = {}
    avgdict[groupby] = [columns]
    newdict = {}
    for key in avgdict:
        for x in range(len(key)):
            if key[x] in newdict:
                newdict[key[x]] += avgdict[key][0][x]
            else:
                newdict[key[x]] = avgdict[key][0][x]
    print(newdict)
It took me a minute to figure out what you were trying to accomplish, because your function and variable names reference averages but your output is a sum.
Based on your output, it seems you're trying to aggregate row values in a given column by a group in another column.
Here's a recommended solution (which could likely be reduced to a one-liner via a comprehension). It loops through the unique (using set) values (b) in your group-by column, creates a dictionary key (agg_dict[b]) for the group being processed, and sums all rows in the given column (col) whose group matches (table[i][by] == b).
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]

def aggregate(table, col, by):
    agg_dict = {}
    for b in set(table[i][by] for i in range(len(table))):
        agg_dict[b] = sum(table[i][col] for i in range(len(table)) if table[i][by] == b)
    print(agg_dict)

aggregate(table, 1, 3)
You can also try the following answer. It is based on the use of sets to find the unique elements in groupby (note that it does use a NumPy array to accumulate the sums, which is why the values come out as floats).
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]

def averages(table, col, by):
    columns = tuple(table[i][col] for i in range(len(table)))  # place the col column into a tuple
    groupby = tuple(table[i][by] for i in range(len(table)))   # place the groupby column into a tuple
    # groupby_unq: tuple of the unique entries in groupby
    groupby_unq = tuple(set(groupby))
    # avg: numpy array of zeros, same length as groupby_unq
    avg = np.zeros(len(groupby_unq))
    for i in range(len(groupby)):
        for j in range(len(groupby_unq)):
            if groupby[i] == groupby_unq[j]:
                avg[j] += columns[i]
    avgdict = dict((groupby_unq[i], avg[i]) for i in range(len(avg)))
    return avgdict

result = averages(table, 1, 3)
print(result)
{0: 36.0, 2: 29.0}
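If NumPy is off the table entirely (the question asks for standard library only), the same grouping can be done with a plain dict and dict.get; a minimal sketch:

```python
table = [[15, 29, 6, 2],
         [16, 9, 8, 0],
         [7, 27, 16, 0]]

def group_sum(table, col, by):
    # sum column `col`, grouped by the values in column `by`
    sums = {}
    for row in table:
        sums[row[by]] = sums.get(row[by], 0) + row[col]
    return sums

print(group_sum(table, 1, 3))  # {2: 29, 0: 36}
```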

Getting column widths from string in python

I have the following string:
'    Y   M   D     PDSW      RSPC      NPPC       NEE'
Each element in the string corresponds to a column in a csv file. Is there a way (aside from for loops) of getting the width of each column from this string? E.g. the first column has a width of 5 ('    Y'), the next has a width of 4 ('   M'), and so on.
Maybe something like:
>>> import re
>>> text = '    Y   M   D     PDSW      RSPC      NPPC       NEE'
>>> cols = re.findall(r'\s+\S+', text)
>>> [len(col) for col in cols]
[5, 4, 4, 9, 10, 10, 10]
So, assuming each (right-aligned) column is one or more spaces followed by one or more non-spaces, take the lengths of the resulting strings.
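Once you have the widths, you can turn them into slice boundaries and cut up a data line. A sketch, with a hypothetical data row that I made up to match those widths:

```python
import re
from itertools import accumulate

header = '    Y   M   D     PDSW      RSPC      NPPC       NEE'
widths = [len(c) for c in re.findall(r'\s+\S+', header)]
print(widths)  # [5, 4, 4, 9, 10, 10, 10]

# hypothetical data row laid out with the same column widths
line = ' 2011   5   1    1.234     5.678     9.012     3.456'

# cumulative widths give the slice boundaries of each column
bounds = [0] + list(accumulate(widths))
fields = [line[s:e].strip() for s, e in zip(bounds, bounds[1:])]
print(fields)  # ['2011', '5', '1', '1.234', '5.678', '9.012', '3.456']
```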
