I have a German csv file that was incorrectly encoded. I want to convert the characters back to utf-8 using a dictionary. I thought what I was doing was correct, but when I print the DF, nothing has changed. Here's my code:
import os

import pandas as pd

DATA_DIR = 'C:\\...'

translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}
def cleanup():
    for file in os.listdir(os.path.join(DATA_DIR)):
        if not file.lower().endswith('.csv'):
            continue
        data_utf = pd.read_csv(os.path.join(DATA_DIR, file), header=3, index_col=None, skiprows=0-2)
        data_utf.replace(translations, inplace=True)
        print(data_utf)

if __name__ == '__main__':
    cleanup()
I also tried
for before, after in translations.items():
    data_utf.replace(before, after)
within the function, and directly putting the translations in the replace itself. This process works if I specify the column in which to replace the characters, however. What do I need to do to apply these translations to the whole dataframe, as well as to the dataframe column headers? Thanks!
Add regex=True so replace also matches substrings. For the column headers, it is possible to convert the values to a Series with Index.to_series and then use replace in the same way:
data_utf = pd.DataFrame({'raÜing': ['ösaüs', 'Ä dd Ö', 'ÖÄ']})
translations = {
    'ö': 'oe',
    'ü': 'ue',
    'ß': 'ss',
    'ä': 'ae',
    '€': '€',
    'Ä': 'Ae',
    'Ö': 'Oe',
    'Ü': 'Ue'
}

data_utf.replace(translations, inplace=True, regex=True)
data_utf.columns = data_utf.columns.to_series().replace(translations, regex=True)
print(data_utf)
    raUeing
0   oesaues
1  Ae dd Oe
2      OeAe
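An alternative sketch (an assumption-laden variant, not part of the original answer, and only valid when every column holds strings): str.maketrans accepts a dict that maps single characters to replacement strings, so the same table can be applied with str.translate in one pass:
table = str.maketrans(translations)
data_utf = data_utf.apply(lambda col: col.str.translate(table))  # string columns only
data_utf.columns = data_utf.columns.str.translate(table)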
I have a dataframe of various wines. I am trying to remove all punctuation, all words containing 4 or fewer characters, and the words flavors, aromas, finish, and drink from the string values contained in the 'description' column. My code does not appear to be working, and I have also tried various permutations of it to no avail.
remove_list = ['[^\w\s]', '[\b(\w{1,4})\b]', 'flavors', 'aromas', 'finish', 'drink']
df11['description'].str.replace('|'.join(remove_list), '', regex=True)
I think you are missing the r prefix, which stops Python from mangling escape sequences in your regex pattern. Also, inside a character class like [\b(\w{1,4})\b], \b means a backspace character rather than a word boundary, so the short-word pattern has to lose the brackets.
Try:
remove_list = [r'[^\w\s]', r'\b\w{1,4}\b', 'flavors', 'aromas', 'finish', 'drink']
To replicate everything:
import pandas as pd

# create data
data = {'description': ["I don't like this wine. And flavors are really bad."]}
df11 = pd.DataFrame(data)
print(df11)

remove_list = [r'[^\w\s]', r'\b\w{1,4}\b', 'flavors', 'aromas', 'finish', 'drink']
df11['description'].replace('|'.join(remove_list), '', regex=True)
output is:
0            really
Name: description, dtype: object
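Note two things: replace returns a new Series rather than modifying df11 in place, and the removals leave runs of whitespace behind (visible before 'really' above). A small follow-up sketch, assuming the df11 and remove_list from this answer:
# Assign the result back, then collapse the leftover whitespace:
df11['description'] = df11['description'].replace('|'.join(remove_list), '', regex=True)
df11['description'] = df11['description'].replace(r'\s+', ' ', regex=True).str.strip()
print(df11)  # the description column now holds just 'really'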
I have a Python function that cleans up my dataframe (it replaces whitespace with _ and prepends _ if a column name begins with a number):
These dataframes were JSONs that have been converted to dataframes to work with them more easily.
def prepare_json(df):
    df = df.rename(lambda x: '_' + x if re.match(r'([0-9])\w+', x) else x, axis=1)
    df = df.rename(lambda x: x.replace(' ', '_'), axis=1)
    return df
This works for simple jsons like the following:
{"123asd":"test","test json":"test"}
Output:
{"_123asd":"test","test_json":"test"}
However, when I try it with a more complex dataframe, it does not work any more.
Here is an example:
{"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
Trying it out with this, I get the following error when writing the dataframe to BigQuery:
Invalid field name "97A". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long. with loading dataframe
Maybe my solution helps you. I:
convert your dictionary to a string
find all keys of the dictionary with a regex
replace the spaces in keys with _ and add _ before keys that start with a digit
convert the string back to a dictionary with ast.literal_eval(dict_string)
Try this:
import re
import ast
from copy import deepcopy
def my_replace(match):
    return match.group()[0] + match.group()[1] + "_" + match.group()[2]
dct = {"SETDET":[{"SETPRTY":[{"DEAG":{"95R":[{"Data Source Scheme":"SCOM","Proprietary Code":"CH123456"}]}},{"SAFE":{"97A":[{"Account Number":"123456789"}]},"SELL":{"95P":[{"Identifier Code Location Code":"AB","Identifier Code Country Code":"AB","Identifier Code Logical Terminal":"XXX","Identifier Code Bank Code":"ABCD"}]}},{"PSET":{"95P":[{"Identifier Code Location Code":"ZZ","Identifier Code Country Code":"CH","Identifier Code Logical Terminal":"","Identifier Code Bank Code":"INSE"}]}}],"SETR":{"22F":[{"Data Source Scheme":"","Indicator":"TRAD"}]}}],"TRADDET":[{"Other":{"35B":[{"Identification of Security":"CH0012138530","Description of Security":"CREDIT SUISSE GROUP"}]},"SETT":{"98A":[{"Date":"20181127"}]},"TRAD":{"98A":[{"Date":"20181123"}]}}],"FIAC":[{"SAFE":{"97A":[{"Account Number":"0123-1234567-05-001"}]},"SETT":{"36B":[{"Quantity":"10,","Quantity Type Code":"UNIT"}]}}],"GENL":[{"SEME":{"20C":[{"Reference":"1234567890123456"}]},"Other":{"23G":[{"Subfunction":"","Function":"NEWM"}]},"PREP":{"98C":[{"Date":"20181123","Time":"165256"}]}}]}
keys = re.findall(r"{'.*?': | '.*?': ", str(dct))
keys_bfr_chng = deepcopy(keys)
keys = [re.sub(r"\s+(?=\w)", '_', key) for key in keys]
keys = [re.sub(r"{'\d", my_replace, key) for key in keys]
dct = str(dct)
for i in range(len(keys)):
    dct = dct.replace(keys_bfr_chng[i], keys[i])
dct = ast.literal_eval(dct)
print(dct)
print(type(dct))
output:
{'SETDET': [{'SETPRTY': [{'DEAG': {'_95R': [{'Data_Source_Scheme': 'SCOM', 'Proprietary_Code': 'CH123456'}]}}, {'SAFE': {'_97A': [{'Account_Number': '123456789'}]}, 'SELL': {'_95P': [{'Identifier_Code_Location_Code': 'AB', 'Identifier_Code_Country_Code': 'AB', 'Identifier_Code_Logical_Terminal': 'XXX', 'Identifier_Code_Bank_Code': 'ABCD'}]}}, {'PSET': {'_95P': [{'Identifier_Code_Location_Code': 'ZZ', 'Identifier_Code_Country_Code': 'CH', 'Identifier_Code_Logical_Terminal': '', 'Identifier_Code_Bank_Code': 'INSE'}]}}], 'SETR': {'_22F': [{'Data_Source_Scheme': '', 'Indicator': 'TRAD'}]}}], 'TRADDET': [{'Other': {'_35B': [{'Identification_of_Security': 'CH0012138530', 'Description_of_Security': 'CREDIT SUISSE GROUP'}]}, 'SETT': {'_98A': [{'Date': '20181127'}]}, 'TRAD': {'_98A': [{'Date': '20181123'}]}}], 'FIAC': [{'SAFE': {'_97A': [{'Account_Number': '0123-1234567-05-001'}]}, 'SETT': {'_36B': [{'Quantity': '10,', 'Quantity_Type_Code': 'UNIT'}]}}], 'GENL': [{'SEME': {'_20C': [{'Reference': '1234567890123456'}]}, 'Other': {'_23G': [{'Subfunction': '', 'Function': 'NEWM'}]}, 'PREP': {'_98C': [{'Date': '20181123', 'Time': '165256'}]}}]}
<class 'dict'>
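If the string round-trip feels fragile (it would break if any value contained a substring that looks like a quoted key), a recursive rename over the nested structure is an alternative sketch; fix_key and fix_keys are hypothetical helper names, not from the original post:
import re

def fix_key(key):
    # '_' for spaces, leading '_' for keys that start with a digit
    key = key.replace(' ', '_')
    return '_' + key if re.match(r'\d', key) else key

def fix_keys(obj):
    # walk dicts and lists, renaming every key on the way down
    if isinstance(obj, dict):
        return {fix_key(k): fix_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [fix_keys(item) for item in obj]
    return obj

print(fix_keys(dct))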
I need to find the minimum and maximum of a given column from a CSV file, and currently the value is a string but I need it to be an integer. Right now, after I have split all the lines into lists, my output looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
From that line, along with its many others, I want to find the minimum of the 5th column, and currently when I do
min(column[4])
it just gives the min of each individual list, which is just the number in that column, rather than grouping them all up and getting the overall minimum.
P.S.: I am very new to Python and coding in general. I also have to do this without importing any modules.
For you, Azro:
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
You may use pandas, which allows you to read CSV files and manipulate them as DataFrames; then it's very easy to retrieve a column's min/max:
import pandas as pd
df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading each line, to remove the trailing \n
values = [
['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
['FRA', 'Europe', 'France', '14/06/2020', '395', '10']
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
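If you want the numeric values themselves rather than the strings stored in the rows, a generator expression converts while scanning (same values list as above):
mini = min(int(x[4]) for x in values)
maxi = max(int(x[4]) for x in values)
print(mini, maxi)  # 390 395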
Take a look at the pandas library, and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
['A', '390', '...'],
['B', '750', '...'],
['C', '207', '...'],
]
# Grab a column that you want.
col = [row[1] for row in rows]
# Convert strings to integers.
vals = [int(s) for s in col]
# Print max.
print(max(vals))
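For completeness, one of the shorter ways mentioned above fuses the three steps into a single generator expression:
# Same result without the intermediate lists:
print(max(int(row[1]) for row in rows))  # 750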
I have a column in a pandas data frame that contains strings in the following format, for example:
fullyRandom=true+mapSizeDividedBy64=51048
mapSizeDividedBy16000=9756+fullyRandom=false
qType=MpmcArrayQueue+qCapacity=822398+burstSize=664
count=11087+mySeed=2+maxLength=9490
capacity=27281
capacity=79882
We can read the first row, for example, as 2 parameters separated by '+'; each parameter has a value, and '=' separates the parameter from its value.
As output, I'm asking if there is a Python script that extracts the parameters and returns a list of the unique parameters, like the following:
[fullyRandom, mapSizeDividedBy64, mapSizeDividedBy16000, qType, qCapacity, burstSize, count, mySeed, maxLength, capacity]
Notice that the previous list contains only the unique parameters, without their values.
Or, if it's not too difficult, an extended pandas data frame in which this column is parsed into many columns, one column per parameter, storing its value.
Try this; it will store the values in a list.
data = []
with open('<your text file>', 'r') as file:
    content = file.readlines()
    for row in content:
        if '+' in row:
            sub_row = row.strip('\n').split('+')
            for r in sub_row:
                data.append(r)
        else:
            data.append(row.strip('\n'))
print(data)
Output:
['fullyRandom=true', 'mapSizeDividedBy64=51048', 'mapSizeDividedBy16000=9756', 'fullyRandom=false', 'qType=MpmcArrayQueue', 'qCapacity=822398', 'burstSize=664', 'count=11087', 'mySeed=2', 'maxLength=9490', 'capacity=27281', 'capacity=79882']
To convert it to a list of dicts that could be used in pandas:
dict_list = []
for item in data:
    entry = {
        item.split('=')[0]: item.split('=')[1]
    }
    dict_list.append(entry)
print(dict_list)
Output:
[{'fullyRandom': 'true'}, {'mapSizeDividedBy64': '51048'}, {'mapSizeDividedBy16000': '9756'}, {'fullyRandom': 'false'}, {'qType': 'MpmcArrayQueue'}, {'qCapacity': '822398'}, {'burstSize': '664'}, {'count': '11087'}, {'mySeed': '2'}, {'maxLength': '9490'}, {'capacity': '27281'}, {'capacity': '79882'}]
To get just the parameter names instead, append only the left-hand side of each split:
headers = []
for item in data:
    headers.append(item.split('=')[0])
print(headers)
Output:
['fullyRandom', 'mapSizeDividedBy64', 'mapSizeDividedBy16000', 'fullyRandom', 'qType', 'qCapacity', 'burstSize', 'count', 'mySeed', 'maxLength', 'capacity', 'capacity']
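To get the two outputs the question actually asks for, here is a small sketch building on the data list above (the file placeholder is the same hypothetical one): a set comprehension removes the duplicates, and one dict per input row yields a wide DataFrame with a column per parameter.
# Unique parameter names (a set removes duplicates):
print(sorted({item.split('=')[0] for item in data}))

# One dict per original row gives one DataFrame column per parameter:
import pandas as pd

rows = []
with open('<your text file>', 'r') as file:
    for line in file:
        pairs = (p.split('=', 1) for p in line.strip().split('+'))
        rows.append({key: value for key, value in pairs})
print(pd.DataFrame(rows))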
I have created a class with two methods, NRG_load and NRG_flat. The first loads a CSV, converts it into a DataFrame and applies some filtering; the second takes this DataFrame and, after creating two columns, it melts the DataFrame to pivot it.
I am trying out these methods with the following code:
nrg105 = eNRG.NRG_load('nrg_105a.tsv')
nrg105_flat = eNRG.NRG_flat(nrg105, '105')
where eNRG is the class, and '105' as the second argument is needed by an if block within the method to create the aforementioned columns.
The behaviour I cannot explain is that the second line, the one with the NRG_flat method, changes the nrg105 values.
Note that if I only run the NRG_load method, I get the expected DataFrame.
What is the behaviour I am missing? It's not the first time I have used syntax like this, and I never had problems before, so I don't know where I should look.
Thank you in advance for all of your suggestions.
EDIT: as requested, here is the class's code:
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 16 15:22:21 2019
#author: CAPIZZI Filippo Antonio
"""
import pandas as pd
from FixFilename import FixFilename as ff
from SplitColumn import SplitColumn as sc
from datetime import datetime as ddt
class EurostatNRG:
    # This class includes the modules needed to load and filter
    # the Eurostat NRG files

    # Default countries' list to be used by the functions
    COUNTRIES = [
        'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
        'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
        'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
        'TR', 'UA', 'UK', 'XK'
    ]

    # Default years of analysis
    YEARS = list(range(2005, int(ddt.now().year) - 1))
    # NOTE: the 'datetime' library will return the current year, but since
    # the code is using the 'range' function, the end year will always be
    # current-1 (e.g. if we are in 2019, 'current year' will be 2018).
    # Thus, I have added "-1" because the end year is t-2.

    INDIC_PROD = pd.read_excel(
        './Datasets/VITO/map_nrg.xlsx',
        sheet_name=[
            'nrg105a_indic', 'nrg105a_prod', 'nrg110a_indic', 'nrg110a_prod',
            'nrg110'
        ],
        convert_float=True)
    def NRG_load(dataset, countries=COUNTRIES, years=YEARS, unit='ktoe'):
        # This module will load and refine the NRG dataset,
        # preparing it to be filtered

        # Fix eventual flags
        dataset = ff.fix_flags(dataset)

        # Load the dataset into a DataFrame
        df = pd.read_csv(
            dataset,
            delimiter='\t',
            encoding='utf-8',
            na_values=[':', ': ', ' :'],
            decimal='.')

        # Clean up spaces from the column names
        df.columns = df.columns.str.strip()

        # Removes the mentioned column because it's not needed
        if 'Flag and Footnotes' in df.columns:
            df.drop(columns=['Flag and Footnotes'], inplace=True)

        # Split the first column into separate columns
        df = sc.nrg_split_column(df)

        # Rename the columns
        df.rename(
            columns={
                'country': 'COUNTRY',
                'fuel_code': 'KEY_PRODUCT',
                'nrg_code': 'KEY_INDICATOR',
                'unit': 'UNIT'
            },
            inplace=True)

        # Filter the dataset
        df = EurostatNRG.NRG_filter(
            df, countries=countries, years=years, unit=unit)
        return df
    def NRG_filter(df, countries, years, unit):
        # This module will filter the input DataFrame 'df',
        # showing only the 'countries', 'years' and 'unit' selected

        # First, all of the units not of interest are removed
        df.drop(df[df.UNIT != unit.upper()].index, inplace=True)
        # Then, all of the countries not of interest are filtered out
        df.drop(df[~df['COUNTRY'].isin(countries)].index, inplace=True)
        # Finally, all of the years not of interest are removed,
        # and the columns are rearranged according to the desired output
        main_cols = ['KEY_INDICATOR', 'KEY_PRODUCT', 'UNIT', 'COUNTRY']
        cols = main_cols + [str(y) for y in years if y not in main_cols]
        df = df.reindex(columns=cols)
        return df
    def NRG_flat(df, name):
        # This module prepares the DataFrame to be flattened,
        # then it gives it as output

        # Assign the indicators and products' names
        if '105' in name:  # 'name' is the name of the dataset
            # Creating the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg105a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg105a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Creating the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg105a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg105a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
        elif '110' in name:
            # Creating the 'INDICATOR' column
            indic_dic = dict(
                zip(EurostatNRG.INDIC_PROD['nrg110a_indic'].KEY_INDICATOR,
                    EurostatNRG.INDIC_PROD['nrg110a_indic'].INDICATOR))
            df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
            # Creating the 'PRODUCT' column
            prod_dic = dict(
                zip(
                    EurostatNRG.INDIC_PROD['nrg110a_prod'].KEY_PRODUCT.astype(
                        str), EurostatNRG.INDIC_PROD['nrg110a_prod'].PRODUCT))
            df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)

        # Delete the columns 'KEY_INDICATOR' and 'KEY_PRODUCT', and
        # rearrange the columns in the desired order
        df.drop(columns=['KEY_INDICATOR', 'KEY_PRODUCT'], inplace=True)
        main_cols = ['INDICATOR', 'PRODUCT', 'UNIT', 'COUNTRY']
        year_cols = [y for y in df.columns if y not in main_cols]
        cols = main_cols + year_cols
        df = df.reindex(columns=cols)

        # Pivot the DataFrame to have it in flat format
        df = df.melt(
            id_vars=df.columns[:4], var_name='YEAR', value_name='VALUE')

        # Convert the 'VALUE' column into float numbers
        df['VALUE'] = pd.to_numeric(df['VALUE'], downcast='float')

        # Drop rows that have no indicators (it means they are not in
        # the Excel file with the products of interest)
        df.dropna(subset=['INDICATOR', 'PRODUCT'], inplace=True)
        return df
EDIT 2: if this could help, this is the error I receive when using the EurostatNRG class in IPython:
[autoreload of EurostatNRG failed: Traceback (most recent call last):
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 244, in check
    superreload(m, reload, self.old_objects)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 394, in superreload
    update_generic(old_obj, new_obj)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 331, in update_generic
    update(a, b)
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\extensions\autoreload.py", line 279, in update_class
    if (old_obj == new_obj) is True:
  File "C:\Users\CAPIZZIF\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 1478, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().]
I managed to find the culprit.
In the NRG_flat method, the lines:
df['INDICATOR'] = df['KEY_INDICATOR'].map(indic_dic)
...
df['PRODUCT'] = df['KEY_PRODUCT'].map(prod_dic)
modify the original df DataFrame in place, so I had to change them to use the pandas assign method, which returns a new DataFrame:
df = df.assign(INDICATOR=df.KEY_INDICATOR.map(indic_dic))
...
df = df.assign(PRODUCT=df.KEY_PRODUCT.map(prod_dic))
I do not get the error any more.
Thank you for replying!
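A minimal sketch of the underlying behaviour (an illustration, not the original class): a function receives a reference to the caller's DataFrame, so adding a column inside the function mutates the caller's object unless you work on a copy (df.copy() or df.assign(...)):
import pandas as pd

def add_column_inplace(df):
    df['NEW'] = 1                     # mutates the caller's DataFrame
    return df

def add_column_copy(df):
    df = df.copy()                    # work on a copy instead
    df['NEW'] = 1                     # same effect as df = df.assign(NEW=1)
    return df

original = pd.DataFrame({'A': [1, 2]})
add_column_inplace(original)
print(original.columns.tolist())      # ['A', 'NEW']: the original changed

original = pd.DataFrame({'A': [1, 2]})
add_column_copy(original)
print(original.columns.tolist())      # ['A']: the original is untouched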