I have two text files:
One with translations/aliases of the form:
123 456
2 278
456 99999
...
and another one with three entries per line:
34 456 9900
111 333 444
234 2 562
...
I want to translate the second column, if possible, so for example I would like the output dataframe to have the rows:
34, 99999, 9900
111, 333, 444
234, 278, 562
Reading in the text files works fine. However, I have trouble translating column b.
This is my basic code structure right now:
translation = sc.textFile("transl.txt")\
    .map(lambda line: line.split(" "))

def translate(string):
    x = translation.filter(lambda x: x[0] == string).collect()
    if x == []:
        return string
    return x[0][1]

d = sc.textFile("text.txt")\
    .map(lambda line: line.split(" "))\
    .toDF(["a", "b", "c"])\
    .withColumn("b", translate(d.b))
Everything works fine except for the last line.
I know that applying functions to a column in Spark doesn't work that easily; however, I am out of ideas for how else to do it.
You can achieve that with a left join. Please have a look at the commented code below:
import pyspark.sql.functions as F
l1 = [
    (123, 456),
    (2, 278),
    (456, 99999)
]
l2 = [
    (34, 456, 9900),
    (111, 333, 444),
    (234, 2, 562)
]
df1 = spark.createDataFrame(l1, ['one1', 'two1'])
df2 = spark.createDataFrame(l2, ['one2', 'two2', 'three2'])
# creates a dataframe with five columns: one1, two1, one2, two2, three2
df = df2.join(df1, df2.two2 == df1.one1 , 'left')
# checks whether a value from your dictionary dataframe is available; if not, the current value is kept,
# otherwise the value is translated
df = df.withColumn('two2', F.when(F.col('two1').isNull(), F.col('two2') ).otherwise(F.col('two1')))
df = df.drop('one1', 'two1')
df.show()
Output:
+----+-----+------+
|one2| two2|three2|
+----+-----+------+
| 111| 333| 444|
| 234| 278| 562|
| 34|99999| 9900|
+----+-----+------+
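Since the translation table is typically much smaller than the data table, it can also be worth hinting a broadcast join so Spark ships the small table to every executor instead of shuffling both sides. A minimal sketch building on the df1/df2 defined above:

# same left join as above, but marking the small translation dataframe for broadcast
df = df2.join(F.broadcast(df1), df2.two2 == df1.one1, 'left')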
A slightly different approach would be to join the two files, if you imported them as dataframes. I've shown an example below:
# Sample DataFrames from the provided example
import pandas as pd

translations = pd.DataFrame({
    'Key': [123, 2, 456],
    'Translation': [456, 278, 99999]
})
entries = pd.DataFrame({
    'A': [34, 111, 234],
    'B': [456, 333, 2],
    'C': [9900, 444, 562]
})
After importing the files, we can merge them on the lookup key using a left join:
df = pd.merge(entries, translations, left_on='B', right_on='Key', how='left')
However, this leaves us with a 'Translation' column containing NaNs wherever a lookup value couldn't be found. To resolve this, we fall back to the value from 'B' in those cases and write the result back into the original 'B' column.
df['B'] = df['Translation'].mask(pd.isna, df['B'])
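If you prefer, the same fallback can be written with fillna, which reads a little more directly (a small sketch using the same df as above):

# keep the translation where one was found, otherwise fall back to the original 'B' value
df['B'] = df['Translation'].fillna(df['B'])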
Now we need to drop the additional columns to arrive at the result that you requested:
df = df.drop(columns=['Key', 'Translation'])
df will now look like this:
     A      B     C
0   34  99999  9900
1  111    333   444
2  234    278   562
Related
I'm trying to look up the values in a certain column and copy the remaining columns based on that lookup. The thing is, this operation involves more than 20 million rows.
I tried to run the code, but it did not stop for about 8 hours, so I stopped it. My questions are:
Is my algorithm correct? If it is correct, is the non-stop running caused by my algorithm being inefficient?
Here are my code and tables to illustrate:
Table 1

 A   B
12   abc
13   def
28   ghi
50   jkl
Table 2 (lookup into this table)

B     C   D
abc   4   7
def   3   3
ghi   6   2
jkl   8   1
Targeted result

 A   B     C   D
12   abc   4   7
13   def   3   3
28   ghi   6   2
50   jkl   8   1
So columns C and D will be added to Table 1 as well, looked up from Table 2 via column B.
The Table 1 data is spread over different CSV files, so I also loop through the files in the folder; I name the directory listing all_files in the code. After looking up, each Table1 is concatenated to df.
My code:
df = pd.DataFrame()
for f in all_files:
    Table1 = pd.read_csv(all_files[f])
    for j in range(len(Table1)):
        u = Table1.loc[j, 'B']
        for z in range(len(Table2)):
            if u == Table2.loc[z, 'B']:
                Table1.loc[j, 'C'] = Table2.loc[z, 'C']
                Table1.loc[j, 'D'] = Table2.loc[z, 'D']
                break
    df = pd.concat([df, Table1], axis=0)
I used that break at the end just to stop the inner loop once it finds a matching value, and then Table1 is concatenated to df. This code didn't work for me; it loops continuously and never stops.
Can anyone help? Any help will be very much appreciated!
I hope this is the solution you are looking for:
First I would concatenate all the CSV files for table_1 into a single DataFrame. Then I would merge table_2 into table_1 on the key column B. Sample code:
df = pd.DataFrame()
for file in all_file:
    df_tmp = pd.read_csv(file)
    df = pd.concat([df, df_tmp])

df_merge = pd.merge(df, table_2, on="B", how="left")
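One note on the loop above: growing df with pd.concat inside the loop copies the accumulated data on every iteration. Collecting the frames in a list and concatenating once is usually noticeably faster with many files (a sketch using the same all_file and table_2 names):

frames = [pd.read_csv(file) for file in all_file]
df = pd.concat(frames, ignore_index=True)   # single concat instead of one per file
df_merge = pd.merge(df, table_2, on="B", how="left")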
When we use a for-loop with Pandas, there is a 98% chance that we are doing it wrong. Pandas is designed so that you don't need loops.
Solutions with increasing performance:
import pandas as pd
table_1 = pd.DataFrame({'A': [12, 13, 28, 50], 'B': ['abc', 'def', 'ghi','jkl']})
table_2 = pd.DataFrame({'B': ['abc', 'def', 'ghi','jkl'], 'C': [4, 3, 6, 8], 'D': [7, 3, 2, 1]})
# simple merge
table = pd.merge(table_1, table_2, how='inner', on='B')
# gain speed by using indexing
table_1 = table_1.set_index('B')
table_2 = table_2.set_index('B')
table = pd.merge(table_1, table_2, how='inner', left_index=True, right_index=True)
# there is also join, but it's slower than merge
table = table_1.join(table_2, on="B").reset_index()
So I am working on the below dataframe:
What I am trying to do is merge the numeric values of 2 columns ('elected_in' & 'campaigned_in_') into a new column.
This column should look like this
new_column
007
NaN
043
275
027
etc
Any tips on how to do this? I found the existing Stack Overflow answers not quite pertaining to this, and I am also not sure what terminology to use...
Thanks for your help in advance.
Iterate over the rows using the iterrows() method and replace the value on a condition (if the value in campaigned_in is a string, replace it with the one from elected_in):
import pandas as pd

df = pd.DataFrame({"elected_in": [0.07, "Bremen", "Nied"]})
df['campaigned_in'] = ["Schleswig", 45, 275]
df["answer"] = df["campaigned_in"]
for index, row in df.iterrows():
    if isinstance(row["campaigned_in"], str):
        # write back via .loc -- assigning into the row yielded by iterrows()
        # would only modify a copy, not the dataframe itself
        df.loc[index, "answer"] = row["elected_in"]
df.head()
The updated df looks like:
elected_in campaigned_in answer
0 0.07 Schleswig 0.07
1 Bremen 45 45
2 Nied 275 275
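If you want to avoid iterrows() entirely, the same condition can be expressed with Series.where (a sketch using the same df as above):

# pick 'elected_in' wherever 'campaigned_in' holds a string, otherwise keep 'campaigned_in'
is_str = df["campaigned_in"].apply(lambda v: isinstance(v, str))
df["answer"] = df["elected_in"].where(is_str, df["campaigned_in"])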
You can combine them into a new column. I'm suggesting this since you have both numbers and strings as values in the columns you're trying to merge. Please refer to the code below.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[7, "Bremen", "test"], [4, 5, 6], ["trial", 8, 43]]),
                  columns=['elected', 'b', 'campained'])
# Now combine them
df['number'] = df['elected'] + " " + df['campained']
df.head()
If you only want the numbers, you can use a simple lambda function for that.
import re

def find_number(text):
    num = re.findall(r'[0-9]+', text)
    return " ".join(num)

df['new'] = df['number'].apply(lambda x: find_number(x))
df.head()
Edit: Changed so that output is string format
import numpy as np
import pandas as pd

def merge(e, c):
    if str(e).isnumeric():
        return e
    elif str(c).isnumeric():
        return c
    else:
        return np.nan

data = {'elected_in': ['007', 'Bremen', 'test1', 182],
        'campaigned_in_': ['sh-h', np.nan, '043', 'test2']}
df = pd.DataFrame(data)
df['new_column'] = df.apply(lambda x: merge(x.elected_in, x.campaigned_in_), axis=1)
output:
elected_in campaigned_in_ new_column
0 007 sh-h 007
1 Bremen NaN NaN
2 test1 043 043
3 182 test2 182
Sample data was provided as an image (not reproduced here).
Input files A and B are given, and the output format is also given. Can someone help me with this?
I'd also be curious to see a clever/pythonic solution to this. My "ugly" solution iterating over index is as follows:
dfa, dfb are the two dataframes, columns named as in example.
dfa = pd.DataFrame({'c1':['v','f','h','m','s','d'],'c2':['100','110','235','999','333','39'],'c3':['tech','jjj',None,'iii','mnp','lf'],'c4':['hhh','scb','kkk','lop','sos','kdk']})
dfb = pd.DataFrame({'c1':['v','h','m','f','L','s'],'c2':['100','235','999','110','777','333'],'c3':['tech',None,'iii','jkl','9kdf','mnp1'],'c4':['hhh','mckkk','lok','scb','ooo','sos1']})
Now let's create lists of indexes to identify the rows that don't match between dfa and dfb
dfa, dfb = dfa.set_index(['c1','c2']), dfb.set_index(['c1','c2'])
mismatch3, mismatch4 = [], []
for i in dfa.index:
    if i in dfb.index:
        if dfa.loc[i,'c3'] != dfb.loc[i,'c3']:
            mismatch3.append(i)
        if dfa.loc[i,'c4'] != dfb.loc[i,'c4']:
            mismatch4.append(i)
mismatch = list(set(mismatch3 + mismatch4))
Now that this is done, we want to rename dfb, perform the join operation on the mismatched indexes, and add the "status" columns based on mismatch3 and mismatch4.
dfb = dfb.rename(index=str, columns={'c3':'b_c3','c4':'b_c4'})
df = dfa.loc[mismatch].join(dfb)
df['c3_status'] = 'match'
df['c4_status'] = 'match'
df.loc[mismatch3, 'c3_status'] = 'mismatch'
df.loc[mismatch4, 'c4_status'] = 'mismatch'
Finally, let's get those columns in the right order :)
result = df[['c3','b_c3','c3_status','c4','b_c4','c4_status']]
Once again, I'd love to see a prettier solution. I hope this helps!
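For what it's worth, here is one merge-based sketch of the same idea; it assumes the raw dfa/dfb as constructed at the top of this answer (before the set_index step) and treats two missing values as equal:

import numpy as np
import pandas as pd

cmp = dfa.merge(dfb, on=['c1', 'c2'], suffixes=('', '_b'))            # inner join on the keys
both_na3 = cmp['c3'].isna() & cmp['c3_b'].isna()
cmp['c3_status'] = np.where(cmp['c3'].eq(cmp['c3_b']) | both_na3, 'match', 'mismatch')
cmp['c4_status'] = np.where(cmp['c4'].eq(cmp['c4_b']), 'match', 'mismatch')
result = cmp[(cmp['c3_status'] == 'mismatch') | (cmp['c4_status'] == 'mismatch')]
result = result[['c1', 'c2', 'c3', 'c3_b', 'c3_status', 'c4', 'c4_b', 'c4_status']]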
Here are four lines of code that may do what you are looking for:
columns_to_compare =['c2','c3']
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
dfb['Combo1'] = dfb[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
explanation
Assume that you want to see what dfb rows are not in dfa, for columns c2 and c3.
To do this, consider the following approach:
Create a column "Combo" in dfa where each row of "Combo" contains a comma separated string, representing the values of the chosen columns to compare (for the row concerned)
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
c1 c2 c3 c4 Combo
0 v 100 tech hhh 100, tech
1 f 110 jjj scb 110, jjj
2 h 235 None kkk 235
3 m 999 iii lop 999, iii
4 s 333 mnp sos 333, mnp
5 d 39 lf kdk 39, lf
Apply the same logic to dfb
c1 c2 c3 c4 Combo1
0 v 100 tech hhh 100, tech
1 h 235 None mckkk 235
2 m 999 iii lok 999, iii
3 f 110 jkl scb 110, jkl
4 L 777 9kdf ooo 777, 9kdf
5 s 333 mnp1 sos1 333, mnp1
Create a list containing the required indices from dfb:
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
or to show the actual row values (not indices):
[[x] for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Row Index Result
[3, 4, 5]
Row Value Result
[['110, jkl'], ['777, 9kdf'], ['333, mnp1']]
I am just starting to use Pandas DataFrames, and I have a question about what to do when a column entry for an observation is not a single value but a dictionary of values. Example: three columns ('A', 'B', and 'C'), with entries of the form (int, dictionary, string). Should you convert the dictionary to a Series or keep it as a dictionary?
I am not yet far enough into Pandas to investigate this myself, and I am unsure whether it would be most beneficial (or even just what a more seasoned Pandas user would do) to keep the values in column 'B' as dictionaries or in another form. Also: these dictionaries all have a consistent form (i.e. 'B1': value, 'B2': value, etc.).
Thanks all
It sort of depends on what you want to do with it, but I would argue that in most scenarios it would be more useful to change your dictionaries in column B to separate columns, as you say that they all have consistent forms. Consider the following:
import pandas as pd

A = [1,2]
B = [{'B1':123, 'B2':321}, {'B1':567, 'B2':891}]
C = ['string1', 'string2']
You can make a dataframe of these data using:
df = pd.DataFrame({'A':A,'B':B,'C':C})
>>> df
A B C
0 1 {'B1': 123, 'B2': 321} string1
1 2 {'B1': 567, 'B2': 891} string2
But that leaves the data in column B as a dictionary, and it's harder to access. I would first make a dataframe of your data in B:
df_B = pd.DataFrame(B)
>>> df_B
B1 B2
0 123 321
1 567 891
And then concatenate it with a dataframe of your A and C data:
df_AC = pd.DataFrame({'A':A,'C':C})
final_df = pd.concat([df_AC, df_B], axis=1)
>>> final_df
A C B1 B2
0 1 string1 123 321
1 2 string2 567 891
This leaves all your data easily accessible to pandas.
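A more compact variant of the same idea is to expand the dictionary column in place (a sketch assuming the same df with the dict column 'B' as above); apply(pd.Series) turns each dictionary into its own set of columns, aligned on df's index:

# expand the dict column 'B' into B1, B2, ... and glue them back onto the rest of df
expanded = pd.concat([df.drop(columns='B'), df['B'].apply(pd.Series)], axis=1)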
I have the following dataframe:
ID     first   mes1.1   mes1.2  ...  mes1.10   mes2.[1-10]   mes3.[1-10]
123df John 5.5 130 45 [12,312,...] [123,346,53]
...
where I have abbreviated columns using [] notation. So in this dataframe I have 31 columns: first, mes1.[1-10], mes2.[1-10], and mes3.[1-10]. Each row is keyed by a unique index: ID.
I would like to form a new table where all identifying column values (represented here by ID and first) are replicated, and the mes2 and mes3 columns (20 of them) are moved "down", giving me something like this:
ID first mes1 mes2 ... mes10
123df John 5.5 130 45
123df John 341 543 53
123df John 123 560 567
...
# How I set up your dataframe (please include a reproducible df next time)
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 31), index=["ID" + str(i) for i in range(6)],
                  columns=['first'] + ['mes{0}.{1}'.format(i, j) for i in range(1, 4) for j in range(1, 11)])
df['first'] = 'john'
Then there are two ways to do this
# Generate new underlying array
first = np.repeat(df['first'].values, 3)[:, np.newaxis]
new_vals = df.values[:, 1:].reshape(18,10)
new_vals = np.hstack((first, new_vals))
# Create new df
m = pd.MultiIndex.from_product((df.index, range(1,4)), names=['ID', 'MesNum'])
pd.DataFrame(new_vals, index=m, columns=['first'] + list(range(1,11)))
or using only Pandas
df.columns = ['first'] + list(range(1,11))*3
pieces = [df.iloc[:, i:i+10] for i in range(1,31, 10)]
df2 = pd.concat(pieces, keys = ['first', 'second', 'third'])
df2 = df2.swaplevel(1, 0).sort_index(level=0)  # sort_index replaces the old sortlevel
df2.insert(0, 'first', df['first'].repeat(3).values)
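Another pandas-only route, if you prefer to avoid renaming columns by hand, is to parse the existing column names into a MultiIndex and stack the measurement-group level. This is a sketch assuming the df built at the top of this answer, before its columns were renamed:

tmp = df.set_index('first', append=True)
# 'mes2.7' -> (2, 7): level 0 is the measurement group, level 1 the measurement number
tmp.columns = pd.MultiIndex.from_tuples(
    [tuple(int(p) for p in c.replace('mes', '').split('.')) for c in tmp.columns],
    names=['MesNum', 'col'])
long_df = tmp.stack(level='MesNum')   # one row per (ID, first, measurement group)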