KeyError: Float64Index when running drop_duplicates - python

I have a DataFrame with duplicates in the t column, but when I run the drop_duplicates function the following error is returned. Could someone please explain how to fix this?
print(df.columns)
Index(['t', 'ax', 'ay', 'az', 'gx', 'gy', 'gz', 'mx', 'my', 'mz', 'movement',
'hand', 'trial_1', 'trial_2', 'trial'],
dtype='object')
df.drop_duplicates(subset= left_data.loc[:, 't'], keep='first', inplace=True, ignore_index=True)
KeyError: Int64IndeX
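One likely cause, offered as an assumption since no full traceback is shown: subset expects column labels, but the call above passes the values of the t column (taken from a different frame, left_data), and pandas then tries to look those values up as column names. A minimal sketch with made-up data, passing the label instead:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data: 't' contains a repeated timestamp.
df = pd.DataFrame({
    "t": [0.0, 0.1, 0.1, 0.2],
    "ax": [1, 2, 3, 4],
})

# Pass the column *label* to subset, not the column's values.
deduped = df.drop_duplicates(subset=["t"], keep="first", ignore_index=True)
print(deduped)
```

The sketch returns a new frame rather than using inplace=True, which makes the result easier to inspect; the original call with inplace=True should behave the same way once subset is a label.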

Related

Indexerror: list index out of range for excel file

I have this piece of code I am trying to run:
import glob
filenames=list(glob.glob("../Data/atp_data.csv"))
l = [pd.read_excel(filename,encoding='latin-1') for filename in filenames]
no_b365=[i for i,d in enumerate(l) if "B365W" not in l[i].columns]
no_pi=[i for i,d in enumerate(l) if "PSW" not in l[i].columns]
for i in no_pi:
    l[i]["PSW"]=np.nan
    l[i]["PSL"]=np.nan
for i in no_b365:
    l[i]["B365W"]=np.nan
    l[i]["B365L"]=np.nan
l=[d[list(d.columns)[:13]+["Wsets","Lsets","Comment"]+["PSW","PSL","B365W","B365L"]] for d in [l[0]]+l[2:]]
data=pd.concat(l,0)
Every time I run it, however, I get this error:
IndexError: list index out of range
Here is the list of columns in the atp_data.csv file mentioned above:
['ATP', 'Location', 'Tournament', 'Date', 'Series', 'Court', 'Surface', 'Round', 'Best of', 'Winner', 'Loser', 'WRank', 'LRank', 'WPts', 'LPts', 'W1', 'L1', 'W2', 'L2', 'W3', 'L3', 'W4', 'L4', 'W5', 'L5', 'Wsets', 'Lsets', 'Comment', 'B365W', 'B365L', 'CBW', 'CBL', 'EXW', 'EXL', 'IWW', 'IWL', 'PSW', 'PSL', 'MaxW', 'MaxL', 'AvgW', 'AvgL']
Does anyone know the solution to this?
Never mind, I found the error: I was using read_excel instead of read_csv to read a CSV file.
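The fix described above can be sketched as follows. The sample text is a hypothetical stand-in for atp_data.csv (only a few of its columns, with invented rows); with the real file you would pass the path and, if needed, encoding='latin-1' to read_csv.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample standing in for atp_data.csv.
csv_text = (
    "ATP,Location,Winner,Loser,B365W,B365L\n"
    "1,Adelaide,A,B,1.5,2.5\n"
    "1,Adelaide,C,D,1.2,4.0\n"
)

# read_csv parses delimited text; read_excel expects a spreadsheet file.
data = pd.read_csv(io.StringIO(csv_text))

# Add any odds columns that are missing, filled with NaN.
for col in ["PSW", "PSL", "B365W", "B365L"]:
    if col not in data.columns:
        data[col] = np.nan

print(data.columns.tolist())
```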

Remove excess of pipes '|' in CSV after append files

I have 3 DataFrames. I need to merge them into one CSV separated by pipes '|'.
I also need to sort the rows by Column1 after appending.
But when I convert the final DataFrame to CSV, extra pipes are written for the null columns. How can I avoid this?
import pandas as pd
import io
df1 = pd.DataFrame({
'Column1': ['key_1', 'key_2', 'key_3'],
'Column2': ['1100', '1100', '1100']
})
df2 = pd.DataFrame({
'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
'Column2': ['1110', '1110', '1110', '1110', '1110', '1110'],
'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd']
})
df3 = pd.DataFrame({
'Column1': ['key_1', 'key_2', 'key_3', 'key_1', 'key_2', 'key_3'],
'Column2': ['1115', '1115', '1115', '1115', '1115', '1115'],
'Column3': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
'Column4': ['wer', 'cad', 'sder', 'dse', 'sdf', 'csd'],
'Column5': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
'Column6': ['xxr', 'xxv', 'xxw', 'xxt', 'xxe', 'xxz'],
})
print(df1, df2, df3, sep="\n")
output = io.StringIO()
pd.concat([df1, df2, df3]).sort_values("Column1") \
.to_csv(output, header=False, index=False, sep="|")
print("csv",output.getvalue(),sep="\n")
output.seek(0)
df4 = pd.read_csv(output, header=None, sep="|", keep_default_na=False)
print("df4",df4,sep="\n" )
output.close()
This is the output I get (note the trailing pipes '|'):
key_1|1100||||
key_1|1110|xxr|wer||
key_1|1110|xxt|dse||
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100||||
key_2|1110|xxv|cad||
key_2|1110|xxe|sdf||
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100||||
key_3|1110|xxw|sder||
key_3|1110|xxz|csd||
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
This is what I need. To be clear, I will not work with this final data; I need to upload it to a specific database in the exact format shown below, and I need to do it without using regex (note the pipes '|'). Is there a way to do so?
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
As you note, generate the sorted pipe-delimited output,
then split() it into lines, rstrip("|") each line, and join() them again:
"\n".join([l.rstrip("|") for l in
pd.concat([df1,df2,df3]).pipe(lambda d:
d.sort_values(d.columns.tolist())).to_csv(sep="|", index=False).split("\n")])
You can remove extra "|" with re.sub():
import re
s = pd.concat([df1, df2, df3]).sort_values("Column1") \
.to_csv(header=False, index=False, sep="|")
s1 = re.sub(r"\|*\n", "\n", s) # with regex (raw string so \| is not treated as a string escape)
s2 = "\n".join([l.rstrip("|") for l in s.splitlines()]) # with rstrip
>>> print(s1.strip())
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
>>> print(s2)
key_1|1100
key_1|1110|xxr|wer
key_1|1110|xxt|dse
key_1|1115|xxr|wer|xxr|xxr
key_1|1115|xxt|dse|xxt|xxt
key_2|1100
key_2|1110|xxv|cad
key_2|1110|xxe|sdf
key_2|1115|xxv|cad|xxv|xxv
key_2|1115|xxe|sdf|xxe|xxe
key_3|1100
key_3|1110|xxw|sder
key_3|1110|xxz|csd
key_3|1115|xxw|sder|xxw|xxw
key_3|1115|xxz|csd|xxz|xxz
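Another regex-free option is to build each line from the non-null cells only, so the padding pipes are never written in the first place. This is a sketch on shortened, made-up frames, and it assumes nulls only ever appear at the end of a row: dropna() would also silently drop a null in the middle of a row and shift the later fields left.

```python
import pandas as pd

# Shortened, hypothetical versions of the frames in the question.
df1 = pd.DataFrame({"Column1": ["key_2", "key_1"], "Column2": ["1100", "1100"]})
df2 = pd.DataFrame({
    "Column1": ["key_1", "key_2"],
    "Column2": ["1110", "1110"],
    "Column3": ["xxr", "xxv"],
})

merged = pd.concat([df1, df2]).sort_values(["Column1", "Column2"])

# Join only the non-null cells of each row with the pipe separator.
lines = ["|".join(row.dropna().astype(str)) for _, row in merged.iterrows()]
text = "\n".join(lines)
print(text)
```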

Read dynamically CSV files

How can I read CSV files dynamically in Python when the suffix of the file name changes?
Example:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
for n in uf:
{n} = pd.read_csv('Basico_{n}.csv', encoding='latin1', sep=';', header=0)
The {} is not recognized inside the for loop.
I want to read the files whose name suffixes are the items of the list and create a separate DataFrame for each one using the same rules.
You have two main issues:
{n} = is invalid syntax. You can't assign to a dynamically generated variable name without messing with locals() or globals(). Doing so is almost always a bad idea anyway, because names created that way are much harder to access programmatically: if the list of names is dynamic, you have to go through globals() to get at them, and this leads to bugs.
'Basico_{n}.csv' is missing the f prefix of an f-string. n will not be substituted into the string unless you mark it as an f-string by prepending f.
Instead:
import pandas as pd
uf = ['AC', 'AL', 'AP', 'AM', 'BA', 'CE', 'DF', 'ES', 'GO', 'MA', 'MT', 'MS', 'MG', 'PA', 'PB', 'PR', 'PE', 'PI', 'RJ', 'RN', 'RS', 'RO', 'RR', 'SC', 'SP1', 'SP2', 'SE', 'TO']
dfs = {} # Create a dict to store the DataFrames, keyed by suffix
for n in uf:
    dfs[n] = pd.read_csv(f'Basico_{n}.csv', encoding='latin1', sep=';', header=0)
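f'Basico_{n}.csv'
will only work on Python >= 3.6.
On older versions, try str.format() instead, still storing the result in a dict rather than assigning to {n}:
dfs[n] = pd.read_csv('Basico_{}.csv'.format(n), encoding='latin1', sep=';', header=0)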
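To illustrate why a dict is convenient here, the following sketch writes two tiny placeholder files to a temporary folder, reads them into a dict keyed by suffix, and then stacks them into one frame. The file contents, the shortened suffix list, and the index level names "uf" and "row" are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

suffixes = ["AC", "AL"]  # shortened, hypothetical subset of the uf list

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write tiny placeholder files so the example is self-contained.
    for n in suffixes:
        (folder / f"Basico_{n}.csv").write_text("id;value\n1;10\n", encoding="latin1")

    # One DataFrame per suffix, keyed by that suffix.
    dfs = {
        n: pd.read_csv(folder / f"Basico_{n}.csv", encoding="latin1", sep=";", header=0)
        for n in suffixes
    }

# Stack everything into one frame, keeping the suffix as an outer index level.
combined = pd.concat(dfs, names=["uf", "row"])
print(combined)
```

Individual frames stay reachable as dfs["AC"], dfs["AL"], and so on, and looping over dfs.items() applies the same processing to all of them.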

Same DataFrame.reindex code - different output

Good afternoon everyone,
I want to filter out from a DataFrame the columns that I am not interested in.
To do that - and since the columns could change based on user input (that I will not show here) - I am using the following code within my offshore_filter function:
# Note: 'df' is my DataFrame, with different country codes as rows and years as column headers
import datetime as d
import pandas as pd
COUNTRIES = [
    'EU28', 'AL', 'AT', 'BE', 'BG', 'CY', 'CZ', 'DE', 'DK', 'EE', 'EL',
    'ES', 'FI', 'FR', 'GE', 'HR', 'HU', 'IE', 'IS', 'IT', 'LT', 'LU', 'LV',
    'MD', 'ME', 'MK', 'MT', 'NL', 'NO', 'PL', 'PT', 'RO', 'SE', 'SI', 'SK',
    'TR', 'UA', 'UK', 'XK'
]
YEARS = list(range(2005, int(d.datetime.now().year)))
def offshore_filter(df, countries=COUNTRIES, years=YEARS):
    # This function is specific for filtering out the countries
    # and the years not needed in the analysis
    # Filter out all of the countries not of interest
    df.drop(df[~df['country'].isin(countries)].index, inplace=True)
    # Filter out all of the years not of interest
    columns_to_keep = ['country', 'country_name'] + [i for i in years]
    temp = df.reindex(columns=columns_to_keep)
    df = temp # This step to avoid the copy vs view complication
    return df
When I pass a years list of integers, the code works well and filters the DataFrame, keeping only the columns in the years list.
However, if the DataFrame's column headers are strings (e.g. '2018' instead of 2018), changing [i for i in years] into [str(i) for i in years] doesn't work, and I get columns of NaNs (as the reindex documentation states for labels that are not found).
Can you help me spot why?
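reindex fills a column with NaN whenever the requested label does not match an existing label exactly, so one thing worth checking (an assumption, since the real headers are not shown) is whether the headers are really the strings you expect, for example by printing df.columns.tolist(). The sketch below uses made-up data to show the mismatch and one way to normalize the labels first:

```python
import pandas as pd

# Hypothetical frame whose year headers are integers.
df = pd.DataFrame({"country": ["AT", "BE"], 2017: [1.0, 2.0], 2018: [3.0, 4.0]})

# Asking for the string '2018' does not match the integer 2018: all NaN.
mismatched = df.reindex(columns=["country", "2018"])

# Convert every header to str first, then request string labels.
normalized = df.rename(columns=str)
matched = normalized.reindex(columns=["country", "2018"])

print(mismatched)
print(matched)
```

The same idea works in the other direction: if the headers are strings with stray whitespace or a float-like form such as '2018.0', cleaning them before calling reindex avoids the all-NaN columns.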

Python pygal code - World map

Could you please help me with this code:
import pygal
from pygal.maps.world import World
worldmap_chart = pygal.maps.world.World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
'mk', 'ml', 'mm', 'mn', 'mo',
'mr', 'mt', 'mu', 'mv', 'mw',
'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render()
I use Spyder with Python 3.6.
The problem is that the map does not show up in the IPython console. Also, on the second line of the code I get a yellow triangle with a note that says: 'pygal.maps.world.World' imported but unused. Maybe this is the reason why the map does not show up.
In case it helps, the only thing I get in the IPython console is this: runfile('C:/Users/Nikki/.spyder-py3/untitled0.py', wdir='C:/Users/Nikki/.spyder-py3')
Could you please help me fix this?
Thanks,
Nikki
Posting this here to help others trying to use pygal to create maps.
Yes, following on from what @Carolos said, you can also easily export the maps as HTML, like this:
import pygal
from pygal.maps.world import World
worldmap_chart = World()
worldmap_chart.title = 'Some countries'
worldmap_chart.add('F countries', ['fr', 'fi'])
worldmap_chart.add('M countries', ['ma', 'mc', 'md', 'me', 'mg',
'mk', 'ml', 'mm', 'mn', 'mo',
'mr', 'mt', 'mu', 'mv', 'mw',
'mx', 'my', 'mz'])
worldmap_chart.add('U countries', ['ua', 'ug', 'us', 'uy', 'uz'])
worldmap_chart.render_to_file('mymap.html')
