How to split column substrings into specific columns - python

I have a dataframe as below:
+--------+
| Key|
+--------+
| x10x60|
|x1x19x33|
| x12x6|
| a14x4|
|x1x1x1x6|
|x2a23x30|
+--------+
And I want the output like this:
The Key column should be split at each x element and the pieces placed into xa/xb/xc/xd in order, but if a piece is an a element it should go into ta/tb/tc/td instead, keeping the same positional order.
+--------+-----+------+-----+-----+-----+----+----+-----+
| Key| xa| xb| xc| xd| ta| tb| tc| td|
+--------+-----+------+-----+-----+-----+----+----+-----+
| x10x60| x10| x60| | | | | | |
|x1x19x33| x1| x19| x33| | | | | |
| x12x6| x12| x6| | | | | | |
| a14x4| | x4| | | a14| | | |
|x1x1x1x6| x1| x1| x1| x6| | | | |
|x2a23x30| x2| | x30| | | a23| | |
+--------+-----+------+-----+-----+-----+----+----+-----+
I tried substr() and substring() but could not produce this output; I get stuck at the splitting step.

Normally you can just use Series.str.split with expand=True, and pandas will auto-expand the substrings into columns.
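For reference, a minimal sketch of that plain auto-expanding split on the sample data (splitting in front of each x/a marker with a zero-width lookahead is an assumption about how you'd delimit the pieces, and the regex= argument needs pandas 1.4+; the pieces land in columns numbered by position, with no notion of xa/xb vs ta/tb):
import pandas as pd

df = pd.DataFrame({'Key': ['x10x60', 'x1x19x33', 'x12x6',
                           'a14x4', 'x1x1x1x6', 'x2a23x30']})

# Split in front of every x/a marker; expand=True spreads the pieces into
# positional columns. The first column is empty (the string starts at a
# split point), so it is dropped.
plain = (df['Key']
         .str.split(r'(?=[xa]\d)', regex=True, expand=True)
         .iloc[:, 1:])
print(plain)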
But since you want to place the substrings into very specific columns, use Series.str.extractall:
import string
import numpy as np

m = df['Key'].str.extractall(r'([xa][0-9]+)').reset_index()             # match the x* and a* substrings
m['match'] = m['match'].map(dict(enumerate(string.ascii_lowercase)))    # map 0,1,2... -> a,b,c...
m['match'] = np.where(m[0].str.startswith('x'), 'x', 't') + m['match']  # add x or t prefix

out = df[['Key']].join(
    m.pivot(index='level_0', columns='match', values=0)         # reshape into wide form
     .sort_index(key=lambda c: c.str.startswith('t'), axis=1)   # put the t columns at the end
     .fillna('')
)
Output:
>>> print(out)
        Key   xa   xb   xc  xd   ta   tb
0    x10x60  x10  x60
1  x1x19x33   x1  x19  x33
2     x12x6  x12   x6
3     a14x4        x4           a14
4  x1x1x1x6   x1   x1   x1  x6
5  x2a23x30   x2       x30           a23

Related

Looking for a solution to add numeric and float elements stored in list format in one of the columns in dataframe

| Index | col1 |
| -------- | -------------- |
| 0 | [0,0] |
| 2 | [7.9, 11.06] |
| 3 | [0.9, 4] |
| 4 | NAN |
I have data similar to this. I want to add the elements of each list and store the result in another column, say Total, using a loop, so that the output looks like this:
| Index | col1 |Total |
| -------- | -------------- | --------|
| 0 | [0,0] |0 |
| 2 | [7.9, 11.06] |18.96 |
| 3 | [0.9, 4] |4.9 |
| 4 | NAN |NAN |
Using the na_action parameter of map should work as well:
df['Total'] = df['col1'].map(sum, na_action='ignore')
Use apply with a lambda to sum the lists, or return pd.NA if the value is not a list:
df['Total'] = df['col1'].apply(lambda x: sum(x) if isinstance(x, list) else pd.NA)
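A minimal end-to-end sketch of both one-liners on the sample data (the Total_map/Total_apply column names are only for the comparison; the NaN row stays missing in both):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [[0, 0], [7.9, 11.06], [0.9, 4], np.nan]},
                  index=[0, 2, 3, 4])

# map skips the NaN row entirely thanks to na_action='ignore'
df['Total_map'] = df['col1'].map(sum, na_action='ignore')

# apply sums real lists and returns pd.NA for anything else
df['Total_apply'] = df['col1'].apply(
    lambda x: sum(x) if isinstance(x, list) else pd.NA)

print(df)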
I tried df.fillna([]), but lists are not a valid fill value for fillna.
Edit: consider using awkward arrays instead of lists: https://awkward-array.readthedocs.io/en/latest/

PySpark: Evaluating specific columns together

I have a Spark dataframe that looks like the following:
+----+----+----+---+
| a  | b  | c  | d |
+----+----+----+---+
| 13 | 43 | 67 | 3 |
+----+----+----+---+
Is it possible to choose specific columns to evaluate together to produce the following?
+----+----+----+---+-----+-----+-----------+
| a  | b  | c  | d | a+b | c-b | a+b / c-b |
+----+----+----+---+-----+-----+-----------+
| 13 | 43 | 67 | 3 | 56  | 24  | 2.33      |
+----+----+----+---+-----+-----+-----------+
Yes, it's possible. You can use selectExpr or withColumn to add extra columns:
from pyspark.sql.functions import expr

(
    df.withColumn("a+b", expr("a + b"))
      .withColumn("c-b", expr("c - b"))
      .withColumn("a+b / c-b", expr("(a + b) / (c - b)"))
      .show()
)
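For completeness, a rough sketch of the selectExpr variant mentioned above (the backticked aliases simply mirror the column names in the expected output; round is used to match the 2.33 shown there):
df.selectExpr(
    "a", "b", "c", "d",
    "a + b AS `a+b`",
    "c - b AS `c-b`",
    "round((a + b) / (c - b), 2) AS `a+b / c-b`",
).show()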

Select pd.DataFrame rows with the biggest intersection in terms of values of Depth (specific columns)

+------------+-----+--------+-----+-------------+
| Meth.name | Min| Max |Layer| Global name |
+------------+-----+--------+-----+-------------+
| DTS | 2600| 3041.2 | AC1 | DTS |
| GGK | 1800| 3200.0 | AC1 | DEN |
| DTP | 700 | 3041.0 | AC2 | DT |
| DS | 700 | 3041.0 | AC3 | CALI |
| PF1 | 2800| 3012.0 | AC3 | CALI |
| PF2 | 3000| 3041.0 | AC4 | CALI |
+------------+-----+--------+-----+-------------+
We have to drop duplicated rows by the "Global name" column, but in a specific way: we want to choose the row that gives the biggest intersection with the range calculated from the max value of the "Min" column and the min value of the "Max" column of the non-duplicated rows.
In the example above this range is [2600.0, 3041.0], so we want to keep only the row with Meth.name == 'DS', and the overall result should be:
+------------+-----+--------+-----+-------------+
| Meth.name | Min| Max |Layer| Global name |
+------------+-----+--------+-----+-------------+
| DTS | 2600| 3041.2 | AC1 | DTS |
| GGK | 1800| 3200.0 | AC1 | DEN |
| DTP | 700 | 3041.0 | AC2 | DT |
| DS | 700 | 3041.0 | AC3 | CALI |
+------------+-----+--------+-----+-------------+
This problem can, of course, be solved in several iterations (calculate the interval based on the non-duplicated rows, then iteratively select only the duplicated rows that give the biggest intersection), but I'm trying to find the most efficient approach.
Thank you
If the order of the lines is not important you can do the following:
df['diff'] = df['Max'] - df['Min']
df = df.sort_values(["Global_name", "diff"], ascending=True)
df.drop_duplicates('Global_name', keep='last')
From this question
Here is how I will go about it:
import pandas as pd

# Reference interval taken from the non-duplicated rows:
# lower bound = max of their "Min", upper bound = min of their "Max"
counts = df.Global_name.value_counts()
dup_global_name = list(counts[counts > 1].index)
non_dup = df[~df.Global_name.isin(dup_global_name)]
max_of_min = non_dup['Min'].max()   # 2600 in the example
min_of_max = non_dup['Max'].min()   # 3041.0 in the example

# Helper function: overlap of a row's [Min, Max] with the reference interval
def calc_overlap(x):
    if max_of_min == min_of_max:
        return 0
    low = max(max_of_min, x.Min)
    high = min(min_of_max, x.Max)
    return high - low

# Filter duplicates
df_dup = df[df.Global_name.isin(dup_global_name)].copy()
# Add overlap column
df_dup['overlap'] = df_dup.apply(calc_overlap, axis=1)
# Select the row with the largest overlap per group
df_dup = df_dup.loc[df_dup.groupby('Global_name').overlap.idxmax()]
# Drop the overlap column
df_dup = df_dup.drop('overlap', axis=1)
# Concatenate with the non-duplicated rows
pd.concat([non_dup, df_dup])
The desired output matches the expected table in the question (rows DTS, GGK, DTP and DS).
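For reference, a more compact sketch of the same idea that uses clip to compute all overlaps in one vectorized step (the sample frame below uses an underscored Global_name column, matching the answers above):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Meth.name': ['DTS', 'GGK', 'DTP', 'DS', 'PF1', 'PF2'],
    'Min': [2600, 1800, 700, 700, 2800, 3000],
    'Max': [3041.2, 3200.0, 3041.0, 3041.0, 3012.0, 3041.0],
    'Layer': ['AC1', 'AC1', 'AC2', 'AC3', 'AC3', 'AC4'],
    'Global_name': ['DTS', 'DEN', 'DT', 'CALI', 'CALI', 'CALI'],
})

dup = df['Global_name'].duplicated(keep=False)
low = df.loc[~dup, 'Min'].max()    # 2600
high = df.loc[~dup, 'Max'].min()   # 3041.0

# Overlap of each row's [Min, Max] with the reference interval [low, high]
df['overlap'] = df['Max'].clip(upper=high) - df['Min'].clip(lower=low)

# Keep the non-duplicated rows plus, for each duplicated Global_name,
# the row with the largest overlap
best = df[dup].groupby('Global_name')['overlap'].idxmax()
result = pd.concat([df[~dup], df.loc[best]]).sort_index().drop(columns='overlap')
print(result)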

update multiple columns based on two columns in pyspark data frames

I have a data frame like below in pyspark.
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| serial_number | rest_id | value | body | legs | face | idle |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11 | rs1 | N | Y | N | N | acde |
| sn1 | rs1 | N | Y | N | N | den |
| sn1 | null | Y | N | Y | N | can |
| sn2 | rs2 | Y | Y | N | N | aeg |
| null | rs2 | N | Y | N | Y | ueg |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
Now I want to update some of the columns while checking other column values.
When any row for a given serial_number or rest_id has the value Y, then all rows of that particular serial_number or rest_id should be updated to Y; if not, they keep whatever values they have.
I have done it like below.
from pyspark.sql.functions import col, when

df.alias('a').join(
    df.filter(col('value') == 'Y').alias('b'),
    on=(col('a.serial_number') == col('b.serial_number')) |
       (col('a.rest_id') == col('b.rest_id')),
    how='left'
).withColumn(
    'final_value',
    when(col('b.value').isNull(), col('a.value')).otherwise(col('b.value'))
).select('a.serial_number', 'a.rest_id', 'a.body', 'a.legs', 'a.face',
         'a.idle', 'final_value')
I got the result I want.
Now I want to repeat the same for columns body, legs and face as well.
I can do the same as above for each column individually, meaning three more join statements, but I want to update all 4 columns in a single statement.
How can I do that?
Expected result
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| serial_number | rest_id | value | body | legs | face | idle |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11 | rs1 | N | Y | N | N | acde |
| sn1 | rs1 | Y | Y | Y | N | den |
| sn1 | null | Y | Y | Y | N | can |
| sn2 | rs2 | Y | Y | N | Y | aeg |
| null | rs2 | Y | Y | N | Y | ueg |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
You should be using window functions over both the serial_number and rest_id columns to check whether a Y is present within each of those groups (explanatory comments are provided in the code below).
# column names to loop over for the updates
columns = ["value", "body", "legs", "face"]

from pyspark.sql import window as w

# window covering the whole serial_number group
windowSpec1 = w.Window.partitionBy('serial_number').rowsBetween(
    w.Window.unboundedPreceding, w.Window.unboundedFollowing)
# window covering the whole rest_id group
windowSpec2 = w.Window.partitionBy('rest_id').rowsBetween(
    w.Window.unboundedPreceding, w.Window.unboundedFollowing)

from pyspark.sql import functions as f
from pyspark.sql import types as t

# udf that checks whether "Y" appears in the list collected over a window
def containsUdf(x):
    return "Y" in x

containsUdfCall = f.udf(containsUdf, t.BooleanType())

# for each column, collect its values within both windows and set it to "Y"
# if either the serial_number group or the rest_id group contains a "Y";
# otherwise keep the existing value
for column in columns:
    df = df.withColumn(
        column,
        f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)) |
               containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y")
         .otherwise(df[column]))

df.show(truncate=False)
which should give you
+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn2 |rs2 |Y |Y |N |Y |aeg |
|null |rs2 |Y |Y |N |Y |ueg |
|sn11 |rs1 |N |Y |N |N |acde|
|sn1 |rs1 |Y |Y |Y |N |den |
|sn1 |null |Y |Y |Y |N |can |
+-------------+-------+-----+----+----+----+----+
I would recommend applying the window functions separately, in two loops, as using both window functions at the same time for each row might give you memory exceptions on big data.
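A minimal sketch of that two-pass variant, reusing columns, containsUdfCall and the window specs defined above (one collect_list is evaluated per pass):
# First pass: propagate Y within each serial_number group
for column in columns:
    df = df.withColumn(
        column,
        f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)), "Y")
         .otherwise(df[column]))

# Second pass: propagate Y within each rest_id group
for column in columns:
    df = df.withColumn(
        column,
        f.when(containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y")
         .otherwise(df[column]))

df.show(truncate=False)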

Convert graphlab sframe into a dictionary of {key: values}

Given an SFrame as such:
+------+-----------+-----------+-----------+-----------+-----------+-----------+
| X1 | X2 | X3 | X4 | X5 | X6 | X7 |
+------+-----------+-----------+-----------+-----------+-----------+-----------+
| the | -0.060292 | 0.06763 | -0.036891 | 0.066684 | 0.024045 | 0.099091 |
| , | 0.026625 | 0.073101 | -0.027073 | -0.019504 | 0.04173 | 0.038811 |
| . | -0.005893 | 0.093791 | 0.015333 | 0.046226 | 0.032791 | 0.110069 |
| of | -0.050371 | 0.031452 | 0.04091 | 0.033255 | -0.009195 | 0.061086 |
| and | 0.005456 | 0.063237 | -0.075793 | -0.000819 | 0.003407 | 0.053554 |
| to | 0.01347 | 0.043712 | -0.087122 | 0.015258 | 0.08834 | 0.139644 |
| in | -0.019466 | 0.077509 | -0.102543 | 0.034337 | 0.130886 | 0.032195 |
| a | -0.072288 | -0.017494 | -0.018383 | 0.001857 | -0.04645 | 0.133424 |
| is | 0.052726 | 0.041903 | 0.163781 | 0.006887 | -0.07533 | 0.108394 |
| for | -0.004082 | -0.024244 | 0.042166 | 0.007032 | -0.081243 | 0.026162 |
| on | -0.023709 | -0.038306 | -0.16072 | -0.171599 | 0.150983 | 0.042044 |
| that | 0.062037 | 0.100348 | -0.059753 | -0.041444 | 0.041156 | 0.166704 |
| ) | 0.052312 | 0.072473 | -0.02067 | -0.015581 | 0.063368 | -0.017216 |
| ( | 0.051408 | 0.186162 | 0.03028 | -0.048425 | 0.051376 | 0.004989 |
| with | 0.091825 | -0.081649 | -0.087926 | -0.061273 | 0.043528 | 0.107864 |
| was | 0.046042 | -0.058529 | 0.040581 | 0.067748 | 0.053724 | 0.041067 |
| as | 0.025248 | -0.012519 | -0.054685 | -0.040581 | 0.051061 | 0.114956 |
| it | 0.028606 | 0.106391 | 0.025065 | 0.023486 | 0.011184 | 0.016715 |
| by | -0.096704 | 0.150165 | -0.01775 | -0.07178 | 0.004458 | 0.098807 |
| be | -0.109489 | -0.025908 | 0.025608 | 0.076263 | -0.047246 | 0.100489 |
+------+-----------+-----------+-----------+-----------+-----------+-----------+
How can I convert the SFrame into a dictionary such that the X1 column is the key and X2 to X7 form the np.array() values?
I have tried iterating through the original SFrame row-by-row and do something like this:
>>> import graphlab as gl
>>> import numpy as np
>>> x = gl.SFrame()
>>> a = np.array([1,2,3])
>>> w = 'foo'
>>> x.append(gl.SFrame({'word':[w], 'vector':[a]}))
Columns:
vector array
word str
Rows: 1
Data:
+-----------------+------+
| vector | word |
+-----------------+------+
| [1.0, 2.0, 3.0] | foo |
+-----------------+------+
[1 rows x 2 columns]
Is there another way to do the same?
EDITED
After trying #papayawarrior's solution, it works if I can load the whole dataframe into memory, but there are a few quirks that make it odd.
Assuming that my original input to the SFrame is as presented above (with 501 columns) but in a .csv file, here is the code to read it into the desired dictionary:
def get_embeddings(embedding_gzip, size):
    coltypes = [str] + [float] * size
    sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                            column_type_hints=coltypes, header=False, quote_char='\0')
    sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
    df = sf.to_dataframe().set_index('X1')
    print list(df)
    return df.to_dict(orient='dict')['X2']
But oddly it gives this error:
File "sts_compose.py", line 28, in get_embeddings
return df.to_dict(orient='dict')['X2']
KeyError: 'X2'
So when I checked the column names before the conversion to a dictionary, I found that they are not 'X1' and 'X2'; list(df) prints ['X501', 'X3'].
Is there something wrong with how I am converting graphlab.SFrame -> pandas.DataFrame -> dict?
I know I can resolve the problem by doing this instead, but the question remains, "How did the column names become so strange?":
def get_embeddings(embedding_gzip, size):
    coltypes = [str] + [float] * size
    sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                            column_type_hints=coltypes, header=False, quote_char='\0')
    sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
    df = sf.to_dataframe().set_index('X1')
    col_names = list(df)
    return df.to_dict(orient='dict')[col_names[1]]
Is there another way to do the same?
Yes, you can use the pack_columns method from the SFrame class.
import graphlab as gl
data = gl.SFrame()
data.add_column(gl.SArray(['foo', 'bar']), 'X1')
data.add_column(gl.SArray([1., 3.]), 'X2')
data.add_column(gl.SArray([2., 4.]), 'X3')
print data
+-----+-----+-----+
| X1 | X2 | X3 |
+-----+-----+-----+
| foo | 1.0 | 2.0 |
| bar | 3.0 | 4.0 |
+-----+-----+-----+
[2 rows x 3 columns]
import array
data = data.pack_columns(['X2', 'X3'], dtype=array.array, new_column_name='vector')
data = data.rename({'X1':'word'})
print data
+------+------------+
| word | vector |
+------+------------+
| foo | [1.0, 2.0] |
| bar | [3.0, 4.0] |
+------+------------+
[2 rows x 2 columns]
b=data['vector'][0]
print type(b)
<type 'array.array'>
How can I convert the SFrame into a dictionary such that the X1 column is the key and X2 to X7 form the np.array() values?
I didn't find any built-in method to convert an SFrame to a dict. You could try the following (it might be very slow):
a = {}
def dump_sframe_to_dict(row, a):
    a[row['word']] = row['vector']

data.apply(lambda x: dump_sframe_to_dict(x, a))
print a
{'foo': array('d', [1.0, 2.0]), 'bar': array('d', [3.0, 4.0])}
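Another possibility, assuming the SFrame fits in memory, is to zip the two columns directly instead of relying on apply's side effects:
# SArrays are iterable, so the dict can be built in one pass
a = dict(zip(data['word'], data['vector']))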
Edited to match new questions in the post.
#Adrien Renaud is spot on with the SFrame.pack_columns method, but I would suggest using the Pandas dataframe to_dict for the last question if your dataset fits in memory.
>>> import graphlab as gl
>>> sf = gl.SFrame({'X1': ['cat', 'dog'], 'X2': [1, 2], 'X3': [3, 4]})
>>> sf
+-----+----+----+
| X1 | X2 | X3 |
+-----+----+----+
| cat | 1 | 3 |
| dog | 2 | 4 |
+-----+----+----+
>>> sf2 = sf.rename({'X1': 'word'})
>>> sf2 = sf.pack_columns(column_prefix='X', new_column_name='vector')
>>> sf2
+------+--------+
| word | vector |
+------+--------+
| cat | [1, 3] |
| dog | [2, 4] |
+------+--------+
>>> df = sf2.to_dataframe().set_index('word')
>>> result = df.to_dict(orient='dict')['vector']
>>> result
{'cat': [1, 3], 'dog': [2, 4]}
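Since the question asks for np.array values and the packed column holds plain lists here, a final one-line conversion (assuming numpy is imported as np) would be:
# convert each packed list/array.array into a numpy array
result = {k: np.asarray(v) for k, v in result.items()}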
