How to reduce Excel read/write time in Python pandas

I have 640,000 rows of data in Excel.
I want to append some rows to the data, so I used
pd.read_excel and pd.concat([excel, some_data]).
After that, I used df.to_excel() to write back to Excel.
But read_excel takes a long time, about 3 minutes, and so does to_excel.
How can I fix this?
import subprocess
import pandas as pd

def update_mecab(new_word_list):
    user_dicpath = 'C:\\mecab\\user-dic\\custom.csv'
    dictionary = pd.read_excel('./first_dictionary.xlsx')
    dictionary = pd.concat([dictionary, new_word_list])
    # MeCab part-of-speech names (Korean) mapped to their tags
    part_names = {
        '일반 명사': 'NNG',   # common noun
        '고유 명사': 'NNP',   # proper noun
        '의존 명사': 'NNB',   # dependent noun
        '수사': 'NR',         # numeral
        '대명사': 'NP',       # pronoun
        '동사': 'VV',         # verb
        '형용사': 'VA',       # adjective
        '보조 용언': 'VX',    # auxiliary predicate
        '관형사': 'MM',       # determiner
        '일반 부사': 'MAG',   # general adverb
        '접속 부사': 'MAJ',   # conjunctive adverb
        '감탄사': 'IC'        # interjection
    }
    new_word_pt = new_word_list.replace({"part": part_names})
    # Append each new word to the MeCab user-dictionary CSV
    with open(user_dicpath, 'a', encoding="UTF-8") as user_dict:
        for index, item in new_word_pt.iterrows():
            custom_word = item['word'] + ',*,*,*,' + item['part'] + ',*,T,' + item['word'] + ',*,*,*,*,*\n'
            user_dict.write(custom_word)
    dictionary = dictionary.reset_index()
    dictionary = dictionary[['word', 'part']]
    dictionary.to_excel('first_dictionary.xlsx', sheet_name="Sheet_1", index=None)
    subprocess.call("powershell C:\\mecab\\add-userdic-win.ps1")
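No answer is shown here, but the cost mostly comes from round-tripping all 640,000 rows through .xlsx on every update. A minimal sketch of one common workaround, assuming the dictionary can live in CSV (first_dictionary.csv is a hypothetical stand-in) with Excel kept only as an occasional export:

import pandas as pd

def append_words(new_word_list):
    # mode='a' appends only the new rows; the 640,000 existing rows are
    # never re-read or re-written.
    new_word_list[['word', 'part']].to_csv(
        'first_dictionary.csv', mode='a', header=False,
        index=False, encoding='utf-8')

# If an .xlsx copy is still needed occasionally, convert once:
# pd.read_csv('first_dictionary.csv').to_excel('first_dictionary.xlsx', index=False)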

Related

How to create DataFrames using a loop?

I just want to create DataFrames, named by company, containing financial stock quotes, using a loop and a dict:
import yfinance as yf

financials = {'jp_morgan': 'JPM', 'bank_of_amerika': 'BAC', 'credit_suisse': 'CS', 'visa': 'V',
              'mastercard': 'MA', 'morgan_stanley': 'MS', 'citigroup': 'C', 'wells_fargo': 'WFC',
              'blackrock': 'CLOA', 'goldman_sachs': 'GS'}

for i in financials:
    i = yf.download(financials[i], '2016-01-01', '2019-08-01')
I want to get dataframes
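No answer is shown, but the usual fix is to stop rebinding the loop variable (each i = ... just overwrites the name and is lost on the next pass) and collect the frames in a dict instead. A minimal sketch, assuming yfinance:

import yfinance as yf

financials = {'jp_morgan': 'JPM', 'bank_of_amerika': 'BAC', 'credit_suisse': 'CS', 'visa': 'V',
              'mastercard': 'MA', 'morgan_stanley': 'MS', 'citigroup': 'C', 'wells_fargo': 'WFC',
              'blackrock': 'CLOA', 'goldman_sachs': 'GS'}

# One DataFrame per company, keyed by the friendly name.
quotes = {name: yf.download(ticker, start='2016-01-01', end='2019-08-01')
          for name, ticker in financials.items()}

# e.g. quotes['jp_morgan'].head()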

I receive an error that says colspecs needs to be a list of integers, which I am pretty sure it is. Why is it not being read as a list?

df = pd.read_fwf(
    r'C:\Users\bruen\OneDrive\Documents\mass.mas03.txt',
    colspecs=[2, 3, 4, 6, 9, 10],
    names=('N', 'Z', 'A', 'symbol', 'mass_excess', 'mass_excess_unc'),
    converters={
        'symbol': str,
        'mass_excess': strip_hash_and_keV_to_MeV,
        'mass_excess_unc': strip_hash_and_keV_to_MeV
    },
    header=39,
    index_col=False
)
display(df)
I have no idea why colspecs isn't read as a list of integers.
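No answer is shown here, but per the pandas documentation, read_fwf's colspecs takes a list of (from, to) half-open pairs, one per column; a flat list of integers like [2, 3, 4, 6, 9, 10] is what triggers the complaint, since each element must itself be a 2-element tuple or list. A sketch with placeholder extents (the real ones must match the file's fixed-width layout):

import pandas as pd

# Placeholder extents: one (start, end) half-open pair per named column.
df = pd.read_fwf(
    r'C:\Users\bruen\OneDrive\Documents\mass.mas03.txt',
    colspecs=[(0, 2), (2, 3), (3, 4), (4, 6), (6, 9), (9, 10)],
    names=('N', 'Z', 'A', 'symbol', 'mass_excess', 'mass_excess_unc'),
    header=39,
    index_col=False,
)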

DataFrame to JSON using Python

I have a dataframe of the format below.
I want to send each row separately, as below:
{ 'timestamp': 'A',
  'tags': {
    'columnA': '1',
    'columnB': '11',
    'columnC': '21',
    ...
  }
}
The columns vary and I cannot hard-code them. Then send it to a Firestore collection, then the second row in the above format to the Firestore collection, and so on.
How can I do this?
And please don't mark the question as a duplicate without comparing the questions.
I am not clear on the Firebase part, but I think this might be what you want:
import json
import pandas as pd

# Data frame to work with
x = pd.DataFrame(data={'timestamp': 'A', 'ca': 1, 'cb': 2, 'cc': 3}, index=[0])
x = pd.concat([x, x], ignore_index=True)  # duplicate the row for the demo

# rearranging
x = x[['timestamp', 'ca', 'cb', 'cc']]

def new_json(row):
    # First column is the timestamp; everything after it goes under 'tag'.
    return json.dumps(
        dict(timestamp=row['timestamp'],
             tag=dict(zip(row.index[1:], row[row.index[1:]].values.tolist()))))

print(x.apply(new_json, raw=False, axis=1))
Output
The output is a pandas Series, with each entry being a str in the JSON format needed:
0    {"timestamp": "A", "tag": {"ca": 1, "cb": 2, "cc": 3}}
1    {"timestamp": "A", "tag": {"ca": 1, "cb": 2, "cc": 3}}

How to populate multiple dictionaries with common keys into a pandas dataframe?

I have a list of dictionaries where the keys are identical but the values in each dictionary are not the same, and the order of each dictionary is strictly preserved. I am trying to find an automatic solution to populate these dictionaries into a pandas dataframe as new columns, but I didn't get the expected output.
original data on gist
Here is the data that I have: old data on gist.
my attempt
Here is my attempt to populate multiple dictionaries with the same keys but different values. My goal is to write a handy function that vectorizes the code. Here is my inefficient, but working, code on gist:
import pandas as pd

# typ, anim, bov, cat and foo are the five mapping dicts (defined on the gist)
dat = pd.read_csv('old_data.csv', encoding='utf-8')
dat['type'] = dat['code'].astype(str).map(typ)
dat['anim'] = dat['code'].astype(str).map(anim)
dat['bovin'] = dat['code'].astype(str).map(bov)
dat['catg'] = dat['code'].astype(str).map(cat)
dat['foot'] = dat['code'].astype(str).map(foo)
My code works, but it is not vectorized and, I think, not efficient. How can I turn these few lines into a simple function, as efficiently as possible?
Here are my current and desired output:
Since I get the correct output, just not efficiently, this is my current output on gist.
If you restructure your dictionaries into a dictionary of dictionaries you can one line it:
for key in values:
    dat[key] = dat['code'].astype(str).map(values[key])
Full code:
values = {"typ" :{
'20230' : 'A',
'20130' : 'A',
'20220' : 'A',
'20120' : 'A',
'20329' : 'A',
'20322' : 'A',
'20321' : 'B',
'20110' : 'B',
'20210' : 'B',
'20311' : 'B'
} ,
"anim" :{
'20230' : 'AOB',
'20130' : 'AOB',
'20220' : 'AOB',
'20120' : 'AOB',
'20329' : 'AOC',
'20322' : 'AOC',
'20321' : 'AOC',
'20110' : 'AOB',
'20210' : 'AOB',
'20311' : 'AOC'
} ,
"bov" :{
'20230' : 'AOD',
'20130' : 'AOD',
'20220' : 'AOD',
'20120' : 'AOD',
'20329' : 'AOE',
'20322' : 'AOE',
'20321' : 'AOE',
'20110' : 'AOD',
'20210' : 'AOD',
'20311' : 'AOE'
} ,
"cat" :{
'20230' : 'AOF',
'20130' : 'AOG',
'20220' : 'AOF',
'20120' : 'AOG',
'20329' : 'AOF',
'20322' : 'AOF',
'20321' : 'AOF',
'20110' : 'AOG',
'20210' : 'AOF',
'20311' : 'AOG'
} ,
"foo" :{
'20230' : 'AOL',
'20130' : 'AOL',
'20220' : 'AOM',
'20120' : 'AOM',
'20329' : 'AOL',
'20322' : 'AOM',
'20321' : 'AOM',
'20110' : 'AOM',
'20210' : 'AOM',
'20311' : 'AOM'
}
}
import pandas as pd
dat= pd.read_csv('old_data.csv', encoding='utf-8')
for keys in values.keys():
dat[keys]=dat['code'].astype(str).map(values[keys])
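An alternative that avoids the Python-level loop entirely: because all five inner dicts share the same keys, pd.DataFrame(values) turns them into a single lookup table whose rows can be aligned to the codes in one shot. A sketch, assuming the values dict and old_data.csv from above:

import pandas as pd

dat = pd.read_csv('old_data.csv', encoding='utf-8')

# Columns = typ/anim/bov/cat/foo, index = the code strings.
lookup = pd.DataFrame(values)

# Align the lookup rows to each row's code and assign all five columns at once.
dat[list(lookup.columns)] = lookup.reindex(dat['code'].astype(str)).to_numpy()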

Replacing characters in entire Pandas dataframe with values from a dictionary

I have a German csv file that was incorrectly encoded. I want to convert the characters back to utf-8 using a dictionary. I thought what I was doing was correct, but when I print the DF, nothing has changed. Here's my code:
import os
import pandas as pd

DATA_DIR = 'C:\\...'

translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}

def cleanup():
    for file in os.listdir(os.path.join(DATA_DIR)):
        if not file.lower().endswith('.csv'):
            continue
        data_utf = pd.read_csv(os.path.join(DATA_DIR, file), header=3, index_col=None, skiprows=0-2)
        data_utf.replace(translations, inplace=True)
        print(data_utf)

if __name__ == '__main__':
    cleanup()
I also tried
for before, after in translations.items():
    data_utf.replace(before, after)
within the function, and directly putting the translations in the replace itself. This process works if I specify the column in which to replace the characters, however. What do I need to do to apply these translations to the whole dataframe, as well as to the dataframe column headers? Thanks!
Add regex=True to replace substrings; for the columns, it is possible to convert the values to a Series with Index.to_series and then use replace:
data_utf = pd.DataFrame({'raÜing':['ösaüs','Ä dd Ö','ÖÄ']})
translations = {
'ö': 'oe',
'ü': 'ue',
'ß': 'ss',
'ä': 'ae',
'€': '€',
'Ä': 'Ae',
'Ö': 'Oe',
'Ü': 'Ue'
}
data_utf.replace(translations, inplace=True, regex=True)
data_utf.columns = data_utf.columns.to_series().replace(translations, regex=True)
print(data_utf)

    raUeing
0   oesaues
1  Ae dd Oe
2      OeAe
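One side note on the premise: if the German file is merely mis-decoded (mojibake) rather than in need of ASCII transliteration, re-decoding usually repairs it without any translation dictionary. A sketch, assuming the bytes are UTF-8 that were wrongly read as cp1252:

# 'FÃ¼r' is what UTF-8 'Für' looks like after a wrong cp1252 decode.
broken = 'FÃ¼r'
fixed = broken.encode('cp1252').decode('utf-8')
print(fixed)  # Für

# Often the cleanest fix is simply reading the file with the right codec:
# pd.read_csv(path, encoding='utf-8')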
