I need to format data contained in a list of lists as a table.
I can make a grid using tabulate:
import tabulate

x = [['Alice', 'min', 2],
     ['', 'max', 5],
     ['Bob', 'min', 8],
     ['', 'max', 15]]
header = ['Name', '', 'value']
print(tabulate.tabulate(x, headers=header, tablefmt="grid"))
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+--------+-----+---------+
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+--------+-----+---------+
|        | max |      15 |
+--------+-----+---------+
However, we require grouping of rows, like this:
+--------+-----+---------+
| Name   |     |   value |
+========+=====+=========+
| Alice  | min |       2 |
+        +     +         +
|        | max |       5 |
+--------+-----+---------+
| Bob    | min |       8 |
+        +     +         +
|        | max |      15 |
+--------+-----+---------+
I tried using multiline rows (joining cell values with "\n".join()), which is apparently supported in tabulate 0.8.3, with no success.
This needs to run on the production server, so we can't use any heavy libraries. We are using tabulate because the whole library is a single file that we can ship with the product.
You can try this:
x = [['Alice', 'min\nmax', '2\n5'],
     ['Bob', 'min\nmax', '8\n15']]
+--------+-----+------------------------+
| Name   |     | ['value1', 'value2']   |
+========+=====+========================+
| Alice  | min | 2                      |
|        | max | 5                      |
+--------+-----+------------------------+
| Bob    | min | 8                      |
|        | max | 15                     |
+--------+-----+------------------------+
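If the data starts out in the flat per-row format from the question, a small helper can build those multiline cells automatically. This is a sketch that assumes names repeat as '' on continuation rows, exactly as in the question's x:
import tabulate
from itertools import groupby

# Carry the name forward over '' continuation rows, group consecutive rows by
# name, then join each group's labels and values with "\n" so tabulate renders
# one bordered row per group.
flat = [['Alice', 'min', 2],
        ['', 'max', 5],
        ['Bob', 'min', 8],
        ['', 'max', 15]]

named = []
for name, label, value in flat:
    named.append([name or named[-1][0], label, value])

grouped = []
for name, rows in groupby(named, key=lambda r: r[0]):
    rows = list(rows)
    grouped.append([name,
                    '\n'.join(r[1] for r in rows),
                    '\n'.join(str(r[2]) for r in rows)])

header = ['Name', '', 'value']
print(tabulate.tabulate(grouped, headers=header, tablefmt="grid"))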
My data frame:
+-----+--------+-------+
| val | id     | reRnk |
+-----+--------+-------+
| 2   | a      | yes   |
| 1   | b      | no    |
| 3   | c      | no    |
| 8   | d      | yes   |
| 7   | e      | yes   |
| 9   | f      | no    |
+-----+--------+-------+
In my desired output, I will re-rank only the rows where reRnk == yes; the ranking will be done based on "val".
I don't want to change the rows where reRnk = no. For example, at id = b we have reRnk = no, so I want to keep that row at row no. 2.
My desired output will look like this:
+-----+--------+-------+
| val | id     | reRnk |
+-----+--------+-------+
| 8   | d      | yes   |
| 1   | b      | no    |
| 3   | c      | no    |
| 7   | e      | yes   |
| 2   | a      | yes   |
| 9   | f      | no    |
+-----+--------+-------+
From what I'm reading, pyspark DataFrames do not have an index by default, so you might need to add one.
I do not know the exact syntax for pyspark; however, since it has many similarities with pandas, this might point you in the right direction:
df.loc[df.reRnk == 'yes', ['val', 'id']] = (df.loc[df.reRnk == 'yes', ['val', 'id']]
                                            .sort_values('val', ascending=False)
                                            .set_index(df.loc[df.reRnk == 'yes', ['val', 'id']].index))
Basically, what we do here is isolate the rows with reRnk == 'yes' and sort these values, but reset the index to its original index. Then we assign these new values to the original rows in the df.
For .loc, https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.loc.html might be worth a try.
For .sort_values, see: https://sparkbyexamples.com/pyspark/pyspark-orderby-and-sort-explained/
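For reference, here is a minimal plain-pandas sketch of that idea against the question's data (pyspark.pandas aims to mirror this API, but the sketch is untested there):
import pandas as pd

# The question's sample data.
df = pd.DataFrame({'val':   [2, 1, 3, 8, 7, 9],
                   'id':    list('abcdef'),
                   'reRnk': ['yes', 'no', 'no', 'yes', 'yes', 'no']})

mask = df.reRnk == 'yes'
sub = df.loc[mask, ['val', 'id']]
# Sort only the "yes" rows by val (descending), then give them back their
# original index so they land on the original "yes" positions.
df.loc[mask, ['val', 'id']] = sub.sort_values('val', ascending=False).set_index(sub.index)
print(df)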
I want to coalesce 4 columns using pandas. I've tried this:
final['join_key'] = final['book'].astype('str') + final['bdr'] + final['cusip'].fillna(final['isin']).fillna(final['Deal'].astype('str')).fillna(final['Id'])
When I use this it returns:
+-------+--------+-------+------+------+------------+------------------+
| book | bdr | cusip | isin | Deal | Id | join_key |
+-------+--------+-------+------+------+------------+------------------+
| 17236 | ETFROS | | | | 8012398421 | 17236.0ETFROSnan |
+-------+--------+-------+------+------+------------+------------------+
The field Id is not properly appending to my join_key field.
Any help would be appreciated, thanks.
Update:
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| endOfDay | book | bdr | cusip | isin | Deal | Id | join_key |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
| 31/10/2019 | 15 | ITOR | 371494AM7 | US371494AM77 | 161 | 8013210731 | 20191031|15|ITOR|371494AM7 |
| 31/10/2019 | 15 | ITOR | | | | 8011898573 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011898742 | 20191031|15|ITOR| |
| 31/10/2019 | 15 | ITOR | | | | 8011899418 | 20191031|15|ITOR| |
+------------+------+------+-----------+--------------+------+------------+----------------------------+
df['join_key'] = ("20191031|" + df['book'].astype('str') + "|" + df['bdr'] + "|" + df[['cusip', 'isin', 'Deal', 'id']].bfill(1)['cusip'].astype(str))
For some reason this code isn't picking up Id as part of the key.
The chained fillna for cusip is too complicated. You can change it to bfill:
final['join_key'] = (final['book'].astype('str') +
                     final['bdr'] +
                     final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip'].astype(str))
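To illustrate what bfill(axis=1) does here, a toy frame with assumed values (the first non-null among cusip/isin/Deal/Id ends up in the cusip position):
import numpy as np
import pandas as pd

# Toy frame (assumed values): back-fill across columns, then keep 'cusip',
# which now holds the first non-null of cusip/isin/Deal/Id per row.
final = pd.DataFrame({'cusip': [np.nan, '371494AM7'],
                      'isin':  [np.nan, 'US371494AM77'],
                      'Deal':  [np.nan, np.nan],
                      'Id':    ['8012398421', '8013210731']})

coalesced = final[['cusip', 'isin', 'Deal', 'Id']].bfill(axis=1)['cusip']
print(coalesced.tolist())  # ['8012398421', '371494AM7']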
Try this:
import pandas as pd
import numpy as np

# setup (ignore)
final = pd.DataFrame({
    'book': [17236],
    'bdr': ['ETFROS'],
    'cusip': [np.nan],
    'isin': [np.nan],
    'Deal': [np.nan],
    'Id': ['8012398421'],
})

# answer
final['join_key'] = (final['book'].astype('str')
                     + final['bdr']
                     + final['cusip'].fillna(final['isin']).fillna(final['Deal']).fillna(final['Id']).astype('str'))
Output
    book     bdr  cusip  isin  Deal          Id               join_key
0  17236  ETFROS    NaN   NaN   NaN  8012398421  17236ETFROS8012398421
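A side note on why the original attempt produced a key ending in nan: calling .astype('str') on a column that still contains NaN turns the missing values into the literal string 'nan', which fillna() no longer treats as missing; that is why .astype('str') is applied only after the whole fillna chain above. A minimal illustration with toy values:
import numpy as np
import pandas as pd

# A NaN cast to str becomes the string 'nan', which is no longer "missing".
s = pd.Series([np.nan]).astype('str')
print(s.iloc[0], s.isna().any())  # nan False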
I am trying to aggregate data in a pyspark dataframe based on a particular criterion: aligning accounts by switchOUT amount to switchIN amount, so that accounts that money switches out of become from_acct and the other accounts become to_acct.
Data I am getting in the dataframe to begin with
+--------+------+-----------+----------+----------+-----------+
| person | acct | close_amt | open_amt | switchIN | switchOUT |
+--------+------+-----------+----------+----------+-----------+
| A | 1 | 125 | 50 | 75 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 2 | 100 | 75 | 25 | 0 |
+--------+------+-----------+----------+----------+-----------+
| A | 3 | 200 | 300 | 0 | 100 |
+--------+------+-----------+----------+----------+-----------+
To this table
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1       | 75       | 100       |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 2       | 25       | 100       |
+--------+-----------+---------+----------+-----------+
Also, how can I do it so that it works for N rows (not just 3 accounts)?
So far I have used this code:
import operator

from pyspark.sql import functions as F

# define udfs
def sorter(l):
    res = sorted(l, key=operator.itemgetter(1))
    return [item[0] for item in res]

def list_to_string(l):
    res = 'from_fund_' + str(l[0]) + '_to_fund_' + str(l[1])
    return res

def listfirstAcc(l):
    res = str(l[0])
    return res

def listSecAcc(l):
    res = str(l[1])
    return res

sort_udf = F.udf(sorter)
list_str = F.udf(list_to_string)
extractFirstFund = F.udf(listfirstAcc)
extractSecondFund = F.udf(listSecAcc)

# Add additional columns
df = df.withColumn("move", sort_udf("list_col").alias("sorted_list"))
df = df.withColumn("move_string", list_str("move"))
df = df.withColumn("From_Acct", extractFirstFund("move"))
df = df.withColumn("To_Acct", extractSecondFund("move"))
Current outcome I am getting:
+--------+-----------+---------+----------+-----------+
| person | from_acct | to_acct | switchIN | switchOUT |
+--------+-----------+---------+----------+-----------+
| A      | 3         | 1,2     | 75       | 100       |
+--------+-----------+---------+----------+-----------+
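For illustration, one way to get one row per (from_acct, to_acct) pair that also scales to N accounts is to split the frame into switch-out rows and switch-in rows and join them on person. This is an untested sketch against the sample data above (df is the input dataframe), not a fix to the UDF code:
from pyspark.sql import functions as F

# Split the sample frame into switch-out and switch-in rows, then join on
# person so every switch-out account pairs with every switch-in account.
out_df = (df.filter(F.col("switchOUT") > 0)
            .select("person", F.col("acct").alias("from_acct"), "switchOUT"))
in_df = (df.filter(F.col("switchIN") > 0)
           .select("person", F.col("acct").alias("to_acct"), "switchIN"))

result = (out_df.join(in_df, on="person")
                .select("person", "from_acct", "to_acct", "switchIN", "switchOUT"))
result.show()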
I want to calculate APRU for several countries.
country_list = ['us','gb','ca','id']
count = {}
for i in country_list:
    count[i] = df_day_country[df_day_country.isin([i])]
    count[i+'_reverse'] = count[i].iloc[::-1]
    for j in range(1, len(count[i+'_reverse'])):
        count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
    for k in range(1, len(count[i])):
        count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
    count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
After that, I will create 4 dataframes (df_us, df_gb, df_ca, df_id) that show each country's APRU.
But the dataset is large, and the running time becomes extremely slow as the country list grows. Is there a way to decrease the running time?
Consider using numba.
Your code thus becomes:
from numba import njit

country_list = ['us','gb','ca','id']

@njit
def count(country_list):
    count = {}
    for i in country_list:
        count[i] = df_day_country[df_day_country.isin([i])]
        count[i+'_reverse'] = count[i].iloc[::-1]
        for j in range(1, len(count[i+'_reverse'])):
            count[i+'_reverse']['count'].iloc[j] = count[i+'_reverse']['count'][j-1:j+1].sum()
        for k in range(1, len(count[i])):
            count[i][revenue_sum].iloc[k] = count[i][revenue_sum][k-1:k+1].sum()
        count[i]['APRU'] = count[i][revenue_sum] / count[i]['count'][0] / 100
    return count
Numba makes Python loops a lot faster and is in the process of being integrated into heavier-duty Python libraries like scipy. Definitely give this a look.
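As a standalone illustration of the kind of loop @njit accelerates (an assumed toy example on a plain numpy array, separate from the dataframe code above; numba shines on numpy arrays and scalars):
import numpy as np
from numba import njit

# Toy example (assumed): sum each element with its predecessor.
@njit
def pairwise_sum(values):
    out = values.copy()
    for j in range(1, len(values)):
        out[j] = values[j - 1] + values[j]
    return out

print(pairwise_sum(np.array([1.0, 2.0, 3.0, 4.0])))  # [1. 3. 5. 7.]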
IIUC, from your code and variable names, it looks like you are trying to compute an average:
import numpy as np
import pandas as pd

# toy data set:
country_list = ['us','gb']
np.random.seed(1)
datalen = 10
df_day_country = pd.DataFrame({'country': np.random.choice(country_list, datalen),
                               'count': np.random.randint(0, 100, datalen),
                               'revenue_sum': np.random.uniform(0, 100, datalen)})
df_day_country['APRU'] = (df_day_country.groupby('country', group_keys=False)
                          .apply(lambda x: x['revenue_sum'] / x['count'].sum())
                          )
Output:
+----+---------+-------+-------------+----------+
|    | country | count | revenue_sum | APRU     |
+----+---------+-------+-------------+----------+
| 0  | gb      | 16    | 20.445225   | 0.150333 |
| 1  | gb      | 1     | 87.811744   | 0.645675 |
| 2  | us      | 76    | 2.738759    | 0.011856 |
| 3  | us      | 71    | 67.046751   | 0.290246 |
| 4  | gb      | 6     | 41.730480   | 0.306842 |
| 5  | gb      | 25    | 55.868983   | 0.410801 |
| 6  | gb      | 50    | 14.038694   | 0.103226 |
| 7  | gb      | 20    | 19.810149   | 0.145663 |
| 8  | gb      | 18    | 80.074457   | 0.588783 |
| 9  | us      | 84    | 96.826158   | 0.419161 |
+----+---------+-------+-------------+----------+
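For what it's worth, the same column can also be computed without apply(), using a per-group transform on the toy frame above; this sketch is typically a bit faster on large data:
# Divide each row's revenue_sum by the total count of its country group.
df_day_country['APRU'] = (df_day_country['revenue_sum']
                          / df_day_country.groupby('country')['count'].transform('sum'))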
Given an SFrame as such:
+------+-----------+-----------+-----------+-----------+-----------+-----------+
| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
+------+-----------+-----------+-----------+-----------+-----------+-----------+
| the | -0.060292 | 0.06763 | -0.036891 | 0.066684 | 0.024045 | 0.099091 |
| , | 0.026625 | 0.073101 | -0.027073 | -0.019504 | 0.04173 | 0.038811 |
| . | -0.005893 | 0.093791 | 0.015333 | 0.046226 | 0.032791 | 0.110069 |
| of | -0.050371 | 0.031452 | 0.04091 | 0.033255 | -0.009195 | 0.061086 |
| and | 0.005456 | 0.063237 | -0.075793 | -0.000819 | 0.003407 | 0.053554 |
| to | 0.01347 | 0.043712 | -0.087122 | 0.015258 | 0.08834 | 0.139644 |
| in | -0.019466 | 0.077509 | -0.102543 | 0.034337 | 0.130886 | 0.032195 |
| a | -0.072288 | -0.017494 | -0.018383 | 0.001857 | -0.04645 | 0.133424 |
| is | 0.052726 | 0.041903 | 0.163781 | 0.006887 | -0.07533 | 0.108394 |
| for | -0.004082 | -0.024244 | 0.042166 | 0.007032 | -0.081243 | 0.026162 |
| on | -0.023709 | -0.038306 | -0.16072 | -0.171599 | 0.150983 | 0.042044 |
| that | 0.062037 | 0.100348 | -0.059753 | -0.041444 | 0.041156 | 0.166704 |
| ) | 0.052312 | 0.072473 | -0.02067 | -0.015581 | 0.063368 | -0.017216 |
| ( | 0.051408 | 0.186162 | 0.03028 | -0.048425 | 0.051376 | 0.004989 |
| with | 0.091825 | -0.081649 | -0.087926 | -0.061273 | 0.043528 | 0.107864 |
| was | 0.046042 | -0.058529 | 0.040581 | 0.067748 | 0.053724 | 0.041067 |
| as | 0.025248 | -0.012519 | -0.054685 | -0.040581 | 0.051061 | 0.114956 |
| it | 0.028606 | 0.106391 | 0.025065 | 0.023486 | 0.011184 | 0.016715 |
| by | -0.096704 | 0.150165 | -0.01775 | -0.07178 | 0.004458 | 0.098807 |
| be | -0.109489 | -0.025908 | 0.025608 | 0.076263 | -0.047246 | 0.100489 |
+------+-----------+-----------+-----------+-----------+-----------+-----------+
How can I convert the SFrame into a dictionary such that the X1 column is the key and X2 to X7 form the np.array() value?
I have tried iterating through the original SFrame row-by-row and doing something like this:
>>> import graphlab as gl
>>> import numpy as np
>>> x = gl.SFrame()
>>> a = np.array([1,2,3])
>>> w = 'foo'
>>> x.append(gl.SFrame({'word':[w], 'vector':[a]}))
Columns:
vector array
word str
Rows: 1
Data:
+-----------------+------+
| vector | word |
+-----------------+------+
| [1.0, 2.0, 3.0] | foo |
+-----------------+------+
[1 rows x 2 columns]
Is there another way to do the same?
EDITED
After trying @papayawarrior's solution, it works if I can load the whole dataframe into memory, but there are a few quirks that make it odd.
Assuming that my original input to the SFrame is as presented above (with 501 columns) but in a .csv file, I have this code to read it into the desired dictionary:
def get_embeddings(embedding_gzip, size):
    coltypes = [str] + [float] * size
    sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t', column_type_hints=coltypes, header=False, quote_char='\0')
    sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
    df = sf.to_dataframe().set_index('X1')
    print list(df)
    return df.to_dict(orient='dict')['X2']
But oddly it gives this error:
File "sts_compose.py", line 28, in get_embeddings
return df.to_dict(orient='dict')['X2']
KeyError: 'X2'
So when I checked the column names before converting to a dictionary, I found that my column names are not 'X1' and 'X2'; list(df) prints ['X501', 'X3'].
Is there something wrong with how I am converting graphlab.SFrame -> pandas.DataFrame -> dict?
I know I can resolve the problem by doing this instead, but the question remains, "How did the column names become so strange?":
def get_embeddings(embedding_gzip, size):
    coltypes = [str] + [float] * size
    sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t', column_type_hints=coltypes, header=False, quote_char='\0')
    sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
    df = sf.to_dataframe().set_index('X1')
    col_names = list(df)
    return df.to_dict(orient='dict')[col_names[1]]
Is there another way to do the same?
Yes, you can use the pack_columns method from the SFrame class.
import graphlab as gl
data = gl.SFrame()
data.add_column(gl.SArray(['foo', 'bar']), 'X1')
data.add_column(gl.SArray([1., 3.]), 'X2')
data.add_column(gl.SArray([2., 4.]), 'X3')
print data
+-----+-----+-----+
| X1 | X2 | X3 |
+-----+-----+-----+
| foo | 1.0 | 2.0 |
| bar | 3.0 | 4.0 |
+-----+-----+-----+
[2 rows x 3 columns]
import array
data = data.pack_columns(['X2', 'X3'], dtype=array.array, new_column_name='vector')
data = data.rename({'X1':'word'})
print data
+------+------------+
| word | vector |
+------+------------+
| foo | [1.0, 2.0] |
| bar | [3.0, 4.0] |
+------+------------+
[2 rows x 2 columns]
b=data['vector'][0]
print type(b)
<type 'array.array'>
How can I convert the SFrame into a dictionary such that the X1 column is the key and X2 to X7 form the np.array() value?
I didn't find any built-in method to convert an SFrame to a dict. You could try the following (it might be very slow):
a = {}
def dump_sframe_to_dict(row, a):
    a[row['word']] = row['vector']

data.apply(lambda x: dump_sframe_to_dict(x, a))
print a
{'foo': array('d', [1.0, 2.0]), 'bar': array('d', [3.0, 4.0])}
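If the apply() route turns out to be slow, a simpler variant is to build the dict straight from the two columns; this is a sketch reusing the packed data SFrame from above:
# SArrays are iterable, so the dict can be built with a plain zip.
result = dict(zip(data['word'], data['vector']))
print result
# {'foo': array('d', [1.0, 2.0]), 'bar': array('d', [3.0, 4.0])}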
Edited to match new questions in the post.
@Adrien Renaud is spot on with the SFrame.pack_columns method, but I would suggest using the pandas DataFrame to_dict for the last question if your dataset fits in memory.
>>> import graphlab as gl
>>> sf = gl.SFrame({'X1': ['cat', 'dog'], 'X2': [1, 2], 'X3': [3, 4]})
>>> sf
+-----+----+----+
| X1 | X2 | X3 |
+-----+----+----+
| cat | 1 | 3 |
| dog | 2 | 4 |
+-----+----+----+
>>> sf2 = sf.rename({'X1': 'word'})
>>> sf2 = sf.pack_columns(column_prefix='X', new_column_name='vector')
>>> sf2
+------+--------+
| word | vector |
+------+--------+
| cat | [1, 3] |
| dog | [2, 4] |
+------+--------+
>>> df = sf2.to_dataframe().set_index('word')
>>> result = df.to_dict(orient='dict')['vector']
>>> result
{'cat': [1, 3], 'dog': [2, 4]}
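Since the original question asked for np.array values rather than lists, a small follow-up (assuming the result dict from above) converts them:
>>> import numpy as np
>>> result_np = {word: np.asarray(vec) for word, vec in result.items()}
>>> result_np
{'cat': array([1, 3]), 'dog': array([2, 4])}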