I have a short script. For example, I have this dataset: I group by id and take the first 3 rows per id, then I group them again, but this time I merge the name and house values into a single row.
Example input and output.
Input CSV:
id,name,house
1,a,house1
1,aa,house2
1,aaa,house3
2,b,house4
2,bb,house5
2,bbb,house6
3,c,house7
3,cc,house8
3,ccc,house9
4,d,house10
4,dd,house11
4,ddd,house12
4,dddd,house13
The output CSV:
1,a,house1,aa,house2,aaa,house3
2,b,house4,bb,house5,bbb,house6
3,c,house7,cc,house8,ccc,house9
4,d,house10,dd,house11,ddd,house12
Script:
import pandas as pd

# the sample file is comma-separated, so the default delimiter is used here
df = pd.read_csv('test.csv')
# sort_values returns a new frame, so assign the result back
df = df.sort_values(by=['id'])
df = df.groupby('id').head(3).groupby('id').agg({
    'name': lambda l: ','.join(l),
    'house': lambda l: ','.join(l)
})
df[['name_first', 'name_second', 'name_third']] = df.name.str.split(',', expand=True)
df[['house_first', 'house_second', 'house_third']] = df.house.str.split(',', expand=True)
df = df.reset_index().drop(['name', 'house'], axis=1)
df.to_csv('output.csv')
I want to add a progress bar, but I couldn't. If I could switch the agg call to apply, I think I would be able to switch it to progress_apply, but I couldn't figure out how to make that change. I need a progress bar because I have a really huge CSV file (over 10 million lines), so it is going to take time and I want to track the progress.
df = pd.DataFrame({'id': ['1', '1', '1', '2', '2', '2', '3', '3', '3', '4', '4', '4', '4'],
'name': ['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc', 'd', 'dd', 'ddd', 'dddd'],
'house': ['house1', 'house2', 'house3', 'house4', 'house5', 'house6', 'house7', 'house8', 'house9', 'house10', 'house11', 'house12', 'house13']
})
This approach creates a pivot table
outcome = df.groupby('id').head(3)\
.assign(count=df.groupby('id').cumcount())\
.set_index(['id', 'count']).unstack()\
.sort_index(axis=1, level=1)
and then we can save it after renaming the columns
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
But this does not come with a progress bar because I did not use apply.
To use a progress bar just for the sake of having one:
from tqdm.notebook import tqdm
tqdm.pandas()
outcome = df.groupby('id').progress_apply(
lambda x: x.head(3).reset_index(drop=True).set_index('id', append=True).unstack(0),
).droplevel(0).sort_index(axis=1, level=1)
outcome.columns = [f'{x}_{str(y)}' for x, y in outcome.columns]
outcome.to_csv('...')
Please try both approaches and see which is faster.
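If you want to check that on your own data, here is a minimal timing sketch (assuming the df defined above and that tqdm.pandas() has already been called as shown):
import time

# time the pure unstack approach
start = time.perf_counter()
outcome = df.groupby('id').head(3)\
    .assign(count=df.groupby('id').cumcount())\
    .set_index(['id', 'count']).unstack()\
    .sort_index(axis=1, level=1)
print(f'unstack approach: {time.perf_counter() - start:.4f}s')

# time the progress_apply approach
start = time.perf_counter()
outcome = df.groupby('id').progress_apply(
    lambda x: x.head(3).reset_index(drop=True).set_index('id', append=True).unstack(0),
).droplevel(0).sort_index(axis=1, level=1)
print(f'progress_apply approach: {time.perf_counter() - start:.4f}s')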
I have the following task.
I have this data:
import pandas
import numpy as np
data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
'phone': [912341.0, np.nan , 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan ,np.nan , 'joe#gmail.com', 'rick#gmail.com'],
'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
'most_visited_place': ['Turkey', 'Spain',np.nan , 'Germany', 'Germany', 'Spain',np.nan , 'Spain']
}
df = pandas.DataFrame(data)
What I have to do: for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to generate the personal information of the matching people and output it to a file.
E.g. if we look at most_visited_airport and Heathrow, I need to output three files containing the names, emails and phones of the people who visited that airport the most.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I am not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        # note the leading space in ' email', matching the column name in the data
        for attr in ('name', 'phone', ' email'):
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way:
cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = df[most_col].dropna().unique().tolist()
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, among the to_csv options you can keep the index, but do not forget to reset it before writing.
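For example, here is a small sketch of that last point using one of the slices from the question (the file name is just illustrative):
# reset the index so the old row labels are dropped, then write the file
# with the fresh 0..n index included (to_csv writes the index by default)
(df[df['most_visited_airport'] == 'Heathrow']['name']
   .dropna()
   .reset_index(drop=True)
   .to_csv('most_visited_airport_Heathrow_name.csv'))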
I have two dataframes with the same columns but different lengths.
import pandas as pd

df_g = pd.DataFrame([['EDC_TRAING_NO', 'EDU_T_N', 'NUMVER', '20'],
                     ['EDC_TRAING_NAME', 'EDU_T_NM', 'VARCHAR', '40'],
                     ['EDC_TRAING_ST', 'EDU_T_SD', 'DATETIME', '15'],
                     ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                     ['EDC_PLACE_NM', 'EDU_P_NM', 'VARCHAR2', '40'],
                     ['ONLINE_REQST_POSBL_AT', 'ONLINE_R_P_A', 'VARCHAR2', '1']],
                    columns=['NAME', 'ID', 'TYPE', 'LEN'])

df_n = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
                     ['EDC_TRAING_NAME', 'EDU_TR_NM', 'VARCHAR', '20'],
                     ['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
                     ['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
                     ['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
                     ['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
                    columns=['NAME', 'ID', 'TYPE', 'LEN'])
The result I want to get:
result = pd.DataFrame([['EDC_TRAING_NO', 'EDU_TR_N', 'NUMVER', '20'],
['EDC_TRAING_ST', 'EDU_TR_SD', 'DATETIME', '15'],
['EDC_TRAING_END', 'EDU_T_ED', 'DATETIME', '15'],
['EDC_PLACE_NM', 'EDU_PL_NM', 'VARCHAR2', '40'],
['ONLINE_REQST_POSBL_AT', 'ONLINE_REQ_P_A', 'VARCHAR2', '1']],
columns=['NAME', 'ID', 'TYPE', 'LEN'])
Each df has a length like this:
len(df_g): 1000
len(df_n): 5000
Each dataframe has the columns NAME, ID, TYPE and LEN.
I need to check the NAME, TYPE and LEN columns in each df and, where they match, compare the ID column to see whether it has the same value or not.
So I tried this:
for i in df_g.index:
    for j in df_n.index:
        # take each row as an ndarray
        g_row = df_g.iloc[i].values
        # concatenate NAME, TYPE and LEN into one string for comparison
        g_str = g_row[0] + g_row[2] + g_row[3]
        n_row = df_n.iloc[j].values
        n_str = n_row[0] + str(n_row[2]) + str(n_row[3])
        # rows that match on NAME/TYPE/LEN but differ on ID
        if g_str == n_str and g_row[1] != n_row[1]:
            print(i, j)
            print(g_row[0])
I have the above code for the two DataFrames of different length.
First I tried iterrows() to compare the two dfs, but it took too much time (very slow).
I looked for other ways to get better performance. Possible ways I found:
Option 1: transform the dfs to dicts with to_dict() and compare them in a nested for-loop.
Option 2: transform the df Series to ndarrays and compare them in a nested for-loop.
Is there any other better option, or any option that does not use a nested for-loop?
Thanks.
You can try merge. If you are looking for records where the IDs mismatch, the following is one way of achieving it:
r1 = (df_g.merge(df_n, on=['NAME', 'TYPE', 'LEN'], how='inner')
          .query('ID_x != ID_y')
          .rename(columns={'ID_x': 'ID'})
          .drop(columns='ID_y'))
I have used an inner join (how='inner'), but based on your needs you can use any of the following joins:
{'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
I have two data frames, and for each line in one data frame I want to locate the matching line in the other data frame by a certain column (containing some id). I thought to go over the lines in df1 and use loc to find the matching line in df2.
The problem is that some of the ids in df2 have some extra information besides the id itself.
For example:
df1 has the id: 1234,
df2 has the id: 1234-KF
How can I locate this id for example with loc? Can loc somehow match only by prefixes?
Extra information can be removed using e.g. a regular expression (or a substring):
import pandas as pd
import re
df1 = pd.DataFrame({
'id': ['123', '124', '125'],
'data': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
'id': ['123-AA', '124-AA', '125-AA'],
'data': ['1', '2', '3']
})
df2.loc[df2.id.apply(lambda s: re.sub("[^0-9]", "", s)) == df1.id]
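If you then want to join the two frames on the cleaned id instead of just comparing them row by row, here is one possible sketch (using the vectorised str.replace; frame and column names as above):
# strip every non-digit character from df2's id and merge on the result
df2_clean = df2.assign(id_clean=df2['id'].str.replace(r'[^0-9]', '', regex=True))
merged = df1.merge(df2_clean, left_on='id', right_on='id_clean', suffixes=('_df1', '_df2'))
print(merged)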
How would you implement the following using pandas?
part 1:
I want to create a new conditional column in input_dataframe. Each row in input_dataframe will be matched against a regex. If at least one element in the row matches, then the element for this row in the new column will contain the matched value(s).
part 2: A more complete version would be:
The source of the regex is the value of each element originating from another series (i.e. I want to know if each row in input_dataframe contains a value(s) from the passed series).
part 3: An even more complete version would be:
Instead of passing a series, I'd pass another Dataframe, regex_dataframe. For each column in it, I would implement the same process as part 2 above. (Thus, The result would be a new column in the input_dataframe for each column in the regex_dataframe.)
example input:
input_df = pd.DataFrame({
'a':['hose','dog','baby'],
'b':['banana','avocado','mango'],
'c':['horse','dog','cat'],
'd':['chease','cucumber','orange']
})
example regex_dataframe:
regex_dataframe = pd.DataFrame({
'e':['ho','ddddd','ccccccc'],
'f':['wwwwww','ado','kkkkkkkk'],
'g':['fffff','mmmmmmm','cat'],
'i':['heas','ber','aaaaaaaa']
})
example result:
result_dataframe = pd.DataFrame({
'a': ['hose', 'dog', 'baby'],
'b': ['banana', 'avocado', 'mango'],
'c': ['horse', 'dog', 'cat'],
'd': ['chease', 'cucumber', 'orange'],
'e': ['ho', '', ''],
'f': ['', 'ado', ''],
'g': ['', '', 'cat'],
'i': ['heas', 'ber', '']
})
First of all, rename regex_dataframe so that its columns match input_df and individual cells correspond to each other in both dataframes:
input_df = pd.DataFrame({
'a':['hose','dog','baby'],
'b':['banana','avocado','mango'],
'c':['horse','dog','cat'],
'd':['chease','cucumber','orange']
})
regex_dataframe = pd.DataFrame({
'a':['ho','ddddd','ccccccc'],
'b':['wwwwww','ado','kkkkkkkk'],
'c':['fffff','mmmmmmm','cat'],
'd':['heas','ber','aaaaaaaa']
})
Apply the method DataFrame.combine(other, func, fill_value=None, overwrite=True) to get pairs of corresponding columns (which are Series).
Apply Series.combine(other, func, fill_value=nan) to get pairs of corresponding cells.
Apply regex to the cells.
import re

def process_cell(text, reg):
    res = re.search(reg, text)
    return res.group() if res else ''

def process_column(col_t, col_r):
    return col_t.combine(col_r, lambda text, reg: process_cell(text, reg))

input_df.combine(regex_dataframe, lambda col_t, col_r: process_column(col_t, col_r))
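If you want to reproduce result_dataframe from the question (with the matched columns keeping the regex_dataframe names e, f, g, i), a possible follow-up is to rename the matched columns back and concatenate them with input_df (assuming pandas is imported as pd, as in the question):
# combine returns the matched substrings under input_df's column names,
# so rename them back to the regex_dataframe names and attach them to input_df
matched = input_df.combine(regex_dataframe, lambda col_t, col_r: process_column(col_t, col_r))
matched.columns = ['e', 'f', 'g', 'i']
result = pd.concat([input_df, matched], axis=1)
print(result)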