Remove excess pipes '|' in CSV after appending files - python

I have 3 dataframes that I need to combine into a single merged CSV separated by pipes '|', sorted by Column1 after the append.
But when I convert the final df to CSV, extra pipes appear for the null columns. How can I avoid this?
import pandas as pd
import io

df1 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3'],
    'Column2': ['1100', '1100', '1100']
})
df2 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
    'Column2': ['1110', '1110', '1110', '1110', '1110', '1110'],
    'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd']
})
df3 = pd.DataFrame({
    'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
    'Column2': ['1115', '1115', '1115', '1115', '1115', '1115'],
    'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd'],
    'Column5': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
    'Column6': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
})
print(df1, df2, df3, sep="\n")

output = io.StringIO()
pd.concat([df1, df2, df3]).sort_values("Column1") \
    .to_csv(output, header=False, index=False, sep="|")
print("csv", output.getvalue(), sep="\n")

output.seek(0)
df4 = pd.read_csv(output, header=None, sep="|", keep_default_na=False)
print("df4", df4, sep="\n")
output.close()
This is the output I get (note the trailing pipes '|'):
key_1|1100||||
key_1|1110|xxr|wer||
key_1|1110|xxt|dse||
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100||||
key_2|1110|xxv|cad||
key_2|1110|xxe|sdf||
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100||||
key_3|1110|xxw|sder||
key_3|1110|xxz|csd||
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
This is what I need. Just to clarify: I won't do any further work on this final data, I need to upload it to a specific database in the exact format shown below, and I'd like to do it without using regex (note the pipes '|'). Is there a way to do so?
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz

As you noted: generate the sorted, pipe-delimited CSV, then split(), rstrip("|") and join():

"\n".join([l.rstrip("|") for l in
           pd.concat([df1, df2, df3])
             .pipe(lambda d: d.sort_values(d.columns.tolist()))
             .to_csv(sep="|", index=False)
             .split("\n")])

You can remove extra "|" with re.sub():
import re
s = pd.concat([df1, df2, df3]).sort_values("Column1") \
      .to_csv(header=False, index=False, sep="|")

s1 = re.sub(r"\|*\n", "\n", s)                           # with regex
s2 = "\n".join([l.rstrip("|") for l in s.splitlines()])  # with rstrip
>>> print(s1.strip())
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
>>> print(s2)
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
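Since the cleaned result is just a string, writing it out for the database upload is one more step. A minimal sketch of that step (my own, assuming a plain file is what's needed; the filename merged.csv is a hypothetical placeholder):

# s2 is the cleaned, pipe-delimited text with no trailing pipes (from above)
with open("merged.csv", "w") as fh:
    fh.write(s2 + "\n")  # end the file with a single newline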

Related

Faster way to iterate over columns in pandas

I have the following task.
I have this data:
import pandas
import numpy as np

data = {'name': ['Todd', 'Chris', 'Jackie', 'Ben', 'Richard', 'Susan', 'Joe', 'Rick'],
        'phone': [912341.0, np.nan, 912343.0, np.nan, 912345.0, 912345.0, 912347.0, np.nan],
        ' email': ['todd#gmail.com', 'chris#gmail.com', np.nan, 'ben#gmail.com', np.nan, np.nan, 'joe#gmail.com', 'rick#gmail.com'],
        'most_visited_airport': ['Heathrow', 'Beijing', 'Heathrow', np.nan, 'Tokyo', 'Beijing', 'Tokyo', 'Heathrow'],
        'most_visited_place': ['Turkey', 'Spain', np.nan, 'Germany', 'Germany', 'Spain', np.nan, 'Spain']
        }
df = pandas.DataFrame(data)
What I have to do: for every feature column (most_visited_airport etc.) and each of its values (Heathrow, Beijing, Tokyo), I have to extract the personal information and output it to a file.
E.g. if we look at most_visited_airport and Heathrow,
I need to output three files containing the names, emails and phones of the people whose most visited airport is Heathrow.
Currently, I have this code to do the operation for both columns and all the values:
columns_to_iterate = [x for x in df.columns if 'most' in x]
for each in df[columns_to_iterate]:
    values = df[each].dropna().unique()
    for i in values:
        df1 = df.loc[df[each] == i, 'name']
        df2 = df.loc[df[each] == i, ' email']
        df3 = df.loc[df[each] == i, 'phone']
        df1.to_csv(f'{each}_{i}_{df1.name}.csv')
        df2.to_csv(f'{each}_{i}_{df2.name}.csv')
        df3.to_csv(f'{each}_{i}_{df3.name}.csv')
Is it possible to do this in a more elegant and maybe faster way? Currently I have a small dataset, but I'm not sure this code will perform well with big data. My particular concern is the nested loops.
Thank you in advance!
You could replace the call to unique with a groupby, which would not only get the unique values, but split up the dataframe for you:
for column in df.filter(regex='^most'):
    for key, group in df.groupby(column):
        for attr in ('name', 'phone', ' email'):   # note the leading space in ' email'
            group[attr].dropna().to_csv(f'{column}_{key}_{attr}.csv')
You can do it this way:

cols = df.filter(regex='most').columns.values

def func_current_cols_to_csv(most_col):
    place = [i for i in df[most_col].dropna().unique().tolist()]
    csv_cols = ['name', 'phone', ' email']
    result = [df[df[most_col] == i][j].dropna().to_csv(f'{most_col}_{i}_{j}.csv', index=False)
              for i in place for j in csv_cols]
    return result

[func_current_cols_to_csv(i) for i in cols]
Also, in the to_csv options you can keep the index, but don't forget to reset it before writing.
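A minimal sketch of that reset-index note (my own, using the column and value names from the example data above): keep the index in the CSV but renumber it so each subset starts from 0.

subset = df[df['most_visited_airport'] == 'Heathrow']['name'].dropna()
subset.reset_index(drop=True).to_csv('most_visited_airport_Heathrow_name.csv', index=True)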

Flatten and Shape JSON DataFrame

I have the JSON below in data. I want it to look like the Expected Result below.
import json
import pandas as pd

data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]
Expected Result:
  Category_matchType Category_expression Action_matchType Action_expression Label_matchType Label_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
What I've Tried:
This question is similar, but I'm not using the index the way the OP is. Following this example, I've tried using json_normalize and then using various forms of melt, stack, unstack, pivot, etc. But there has to be an easier way!
# this bit of code produces the below result where I can start using reshaping functions to get to what I need but it seems messy
df = pd.json_normalize(data, 'eventConditions')
       type matchType expression
0  CATEGORY     EXACT        ABC
1    ACTION     EXACT        DEF
2     LABEL    REGEXP    GHI|JKL
We can use json_normalize to read the JSON data as a pandas dataframe, then use stack followed by unstack to reshape it:
df = pd.json_normalize(data, 'eventConditions')
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)
  CATEGORY_matchType CATEGORY_expression ACTION_matchType ACTION_expression LABEL_matchType LABEL_expression
0              EXACT                 ABC            EXACT               DEF          REGEXP          GHI|JKL
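One small tweak (an assumption of mine, not part of the answer): the expected result uses Category_... rather than CATEGORY_..., so you could capitalize the type values before reshaping.

df = pd.json_normalize(data, 'eventConditions')
df['type'] = df['type'].str.capitalize()   # 'CATEGORY' -> 'Category', etc.
df = df.set_index([df.groupby('type').cumcount(), 'type']).stack().unstack([1, 2])
df.columns = df.columns.map('_'.join)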
If your data is not too large, you could process the JSON first and then create a dataframe like this:
import pandas as pd
import json

data = [{'useEventValue': True,
         'eventConditions': [{'type': 'CATEGORY',
                              'matchType': 'EXACT',
                              'expression': 'ABC'},
                             {'type': 'ACTION',
                              'matchType': 'EXACT',
                              'expression': 'DEF'},
                             {'type': 'LABEL',
                              'matchType': 'REGEXP',
                              'expression': 'GHI|JKL'}]}]
new_data = {}
for i in data:
    for event in i['eventConditions']:
        for key in event.keys():
            if key != 'type':
                col_name = event['type'] + '_' + key
                # start a new list for the column, or append to the existing one
                if col_name not in new_data:
                    new_data[col_name] = [event[key]]
                else:
                    new_data[col_name].append(event[key])

df = pd.DataFrame(new_data)
df
Just found a way to do it with Pandas only:
df = pd.json_normalize(data, 'eventConditions')
df = df.melt(id_vars=['type'])
df['type'] = df['type'] + '_' + df['variable']
df.drop(columns=['variable'], inplace=True)
df.set_index('type', inplace=True)
df = df.T
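A small follow-up of my own (not part of the answer): after the transpose the single row is labelled value and the columns still carry the type name, so a quick cleanup gives the 0-indexed frame shown in the expected result.

df.columns.name = None          # drop the leftover 'type' label on the columns
df = df.reset_index(drop=True)  # relabel the single 'value' row as 0
print(df)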

Error handling with dataframe.explode() in Python pandas

So I have some data that I am using the df.explode() method on. I get an error when running the code. I know the error is caused because one of my rows (row 3) does not have a corresponding Qty for each location, but how can I handle that error?
Code that raises ValueError: cannot reindex from a duplicate axis:
import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500']
        }
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                    .apply(lambda x: x.str.split(';').explode())
                    .reset_index())
print(formatted_df1)
Code that works (Note that the last record has 500;600):
import pandas as pd
import openpyxl

data = {'ITEM': ['Item1', 'Item2', 'Item3'],
        'Locations': ['loc1;loc2', 'loc3', 'loc4;loc5'],
        'Qty': ['100;200', '100', '500;600']
        }
df1 = pd.DataFrame(data, columns=['ITEM', 'Locations', 'Qty'])
print(df1)

formatted_df1 = (df1.set_index(['ITEM'])
                    .apply(lambda x: x.str.split(';').explode())
                    .reset_index())
print(formatted_df1)
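One possible way to handle the mismatch described above (a sketch of my own, not from the thread): split both columns first and only explode when the per-row item counts match, otherwise report the offending rows.

split_loc = df1['Locations'].str.split(';')
split_qty = df1['Qty'].str.split(';')
mismatched = df1[split_loc.str.len() != split_qty.str.len()]

if not mismatched.empty:
    # e.g. a row with 'loc4;loc5' but only '500' would land here
    print("Rows with mismatched Locations/Qty counts:")
    print(mismatched)
else:
    formatted_df1 = (df1.set_index(['ITEM'])
                        .apply(lambda x: x.str.split(';').explode())
                        .reset_index())
    print(formatted_df1)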

Pandas: Error tokenizing data--when using glob.glob

I am using the following code to concatenate several files (candidate master files) I have downloaded from here; but they can also be found here:
https://github.com/108michael/ms_thesis/blob/master/cn06.txt
https://github.com/108michael/ms_thesis/blob/master/cn08.txt
https://github.com/108michael/ms_thesis/blob/master/cn10.txt
https://github.com/108michael/ms_thesis/blob/master/cn12.txt
https://github.com/108michael/ms_thesis/blob/master/cn14.txt
import numpy as np
import pandas as pd
import glob

df = pd.concat((pd.read_csv(f, header=None,
                            names=['feccandid', 'candname', 'party', 'date', 'state',
                                   'chamber', 'district', 'incumb.challeng', 'cand_status',
                                   '1', '2', '3', '4', '5', '6'],
                            usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
                for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))
I am getting the following error:
CParserError: Error tokenizing data. C error: Expected 2 fields in line 58, saw 4
Does anyone have a clue on this?
The default delimiter for pd.read_csv is the comma ,. Since all of your candidates have names listed in the format Last, First, pandas reads two columns: everything before the comma and everything after. In one of the files, there are additional commas, leading pandas to assume that there are more columns. That's the parser error.
To use | as the delimiter instead of ,, just change your code to use the keyword delimiter="|" or sep="|". From the docs, we see that delimiter and sep are aliases of the same keyword.
New code:
df = pd.concat((pd.read_csv(f, header=None, delimiter="|",
                            names=['feccandid', 'candname', 'party', 'date', 'state',
                                   'chamber', 'district', 'incumb.challeng', 'cand_status',
                                   '1', '2', '3', '4', '5', '6'],
                            usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
                for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))
import numpy as np
import pandas as pd
import glob

df = pd.concat((pd.read_csv(f, header=None, sep='|',
                            names=['feccandid', 'candname', 'party', 'date', 'state',
                                   'chamber', 'district', 'incumb.challeng', 'cand_status',
                                   '1', '2', '3', '4', '5', '6'],
                            usecols=['feccandid', 'party', 'date', 'state', 'chamber'])
                for f in glob.glob('/home/jayaramdas/anaconda3/Thesis/FEC/cn_data/cn**.txt')))
print(len(df))

Why has my txt file changed after I used pd.DataFrame?

The original data is: [image omitted]
The output data is: [image omitted]
import pandas as pd

signal_data = pd.read_csv('B.txt').T
print(pd.read_csv('B.txt').T)
dates = pd.date_range('2015-10-1', periods=19)
signal_data_df = pd.DataFrame(signal_data, index=dates,
                              columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN'])
print(signal_data_df)
Because you pass a df as the data source, it reuses the index and columns from that df, so when you pass alternative index and column values you are effectively reindexing the original df, hence the NaN values everywhere. You can just rename the columns and overwrite the index directly:
signal_data = pd.read_csv('B.txt').T
signal_data.columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN']
signal_data.index = dates
Or, to get your code to work, call .values to pass the df as anonymous np array data:
signal_data_df= pd.DataFrame(signal_data.values, index=dates, columns=['PCLN', 'SPY', 'QCOM', 'AAPL', 'USB', 'AMGN', 'GS', 'BIIB', 'AGN'])
