pandas concat successful but error message following stops loop - python

I'm getting the error message
ValueError: No objects to concatenate
when using pd.concat. The concat appears to be successful: when I print the resulting dataframe it prints fine, but the error message terminates the loop.
state_list = ['Colorado', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Michigan', 'Minnesota', 'Missouri', 'Nebraska',
              'North Carolina', 'North Dakota', 'Ohio', 'Pennsylvania', 'South Dakota', 'Tennessee', 'Texas', 'Wisconsin']

for state_name in state_list:
    ### df1 is a dataframe unique to each state ###
    condition_categories = df1['description'].unique()
    cats = []
    for cat in condition_categories:
        category_df = df1[['week', 'value']].where(df1['description'] == cat).dropna()
        category_df = category_df.set_index('week')
        category_df = category_df.rename(columns={'value': str(cat)})
        category_df.week = dtype=np.int
        cats.append(category_df)
        #print(category_df)
    df = pd.concat(cats, axis=1)
    print(df)

Sorry for the late response. It looks like there is an issue with the "cats" list: it is empty in one of the iterations of the outer for loop, and pd.concat raises ValueError when given an empty sequence.
You can add a condition just above the concatenation line, like below; it may work.
if len(cats) > 0:
    df = pd.concat(cats, axis=1)
else:
    print("No records to concatenate")
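A minimal runnable sketch of the guard above (the state names and frame contents here are made up for illustration, not the asker's data):

```python
import pandas as pd

# Two states: one produced rows, one produced an empty list of pieces.
pieces = {
    "Iowa": [pd.DataFrame({"corn": [1, 2]})],
    "Texas": [],  # pd.concat on this would raise "No objects to concatenate"
}

for state, cats in pieces.items():
    if len(cats) > 0:
        df = pd.concat(cats, axis=1)
        print(state, df.shape)
    else:
        print(f"No records to concatenate for {state}")
```

The guard simply skips the concat when nothing was collected, so the loop keeps running past the empty state.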

Related

Creating a new dataframe column based on manipulating an existing column and a reference object

I have a dataset that contains flight information. One of the columns in the dataset is AirportFrom, which holds the 3-letter code of an airport.
I have a reference table that maps the 3-letter code to the state the airport is in.
I want to create a new column that takes the data from AirportFrom and assigns the related state to each airport in the new column.
I have tried a few things, but none of them seem to work correctly; the last thing I tried raises an error.
airportState = {
'ATL': 'Georgia',
'AUS': 'Texas',
'BNA': 'Tennessee',
'BOS': 'Massachusetts',
'BWI': 'Maryland',
'CLT': 'North Carolina',
'DAL': 'Texas',
'DCA': 'Virginia',
'DEN': 'Colorado',
'DFW': 'Texas',
'DTW': 'Michigan',
'EWR': 'New Jersey',
'FLL': 'Florida',
'HNL': 'Hawaii',
'HOU': 'Texas',
'IAD': 'Virginia',
'IAH': 'Texas',
'JFK': 'New York',
'LAS': 'Nevada',
'LAX': 'California',
'LGA': 'New York',
'MCO': 'Florida',
'MDW': 'Illinois',
'MIA': 'Florida',
'MSP': 'Minnesota',
'MSY': 'Louisiana',
'OAK': 'California',
'ORD': 'Illinois',
'PDX': 'Oregon',
'PHL': 'Pennsylvania',
'PHX': 'Arizona',
'RDU': 'North Carolina',
'SAN': 'California',
'SEA': 'Washington',
'SFO': 'California',
'SJC': 'California',
'SLC': 'Utah',
'SMF': 'California',
'STL': 'Missouri',
'TPA': 'Florida',
}
Here is the code I am trying to run:
dataset['StateFrom'] = airportState[dataset['AirportFrom']]
I know what the issue is (a dict cannot be indexed with a whole Series), but I am not sure how to fix it.
Use map to substitute each value with the related State:
dataset['StateFrom'] = dataset['AirportFrom'].map(airportState)
print(dataset)
# Output
AirportFrom StateFrom
0 PHX Arizona
1 HNL Hawaii
2 LGA New York
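One detail worth knowing about map: codes that are missing from the dictionary become NaN rather than raising an error. A small sketch (with a deliberately unknown code 'XYZ', which is not in the asker's data) showing how to flag those explicitly:

```python
import pandas as pd

airport_state = {'PHX': 'Arizona', 'HNL': 'Hawaii', 'LGA': 'New York'}
dataset = pd.DataFrame({'AirportFrom': ['PHX', 'HNL', 'XYZ']})

# map() substitutes via the dict; unmatched codes yield NaN,
# which fillna() then replaces with a sentinel value.
dataset['StateFrom'] = dataset['AirportFrom'].map(airport_state).fillna('Unknown')
print(dataset)
```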

Add a calculated column to a pivot table in pandas

Hi, I am trying to add new columns to a multi-indexed pandas pivot table to do a countif (similar to Excel) depending on whether a level of the column index contains a specific string. This is the sample data:
df = pd.DataFrame({'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
                   'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
                   'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
                   'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
                   'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']},
                  columns=['City', 'State', 'Name', 'Unit', 'Assigned'])
pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'], values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')
and this is the desired output (in screenshot). Thanks in advance!
try:
temp = pivot[('Assigned', 'Aria', 'Sales')].str.len() > 0
pivot['new col'] = temp.astype(int)
the result:
Based on your edit:
import numpy as np
temp = pivot.xs('Sales', level=2, drop_level=False, axis=1).apply(lambda x: np.sum([1 if y != '' else 0 for y in x]), axis=1)
pivot[('', 'total sales', 'count how many...')] = temp
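Putting the pieces together, here is a self-contained sketch of the xs-based countif on the question's sample data (the new column name 'count' is illustrative; the comparison `!= ''` is a vectorized equivalent of the lambda above):

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['Houston', 'Austin', 'Hoover', 'Adak', 'Denver', 'Houston', 'Adak', 'Denver'],
    'State': ['Texas', 'Texas', 'Alabama', 'Alaska', 'Colorado', 'Texas', 'Alaska', 'Colorado'],
    'Name': ['Aria', 'Penelope', 'Niko', 'Susan', 'Aria', 'Niko', 'Aria', 'Niko'],
    'Unit': ['Sales', 'Marketing', 'Operations', 'Sales', 'Operations', 'Operations', 'Sales', 'Operations'],
    'Assigned': ['Yes', 'No', 'Maybe', 'No', 'Yes', 'Yes', 'Yes', 'Yes']})

pivot = df.pivot_table(index=['City', 'State'], columns=['Name', 'Unit'],
                       values=['Assigned'],
                       aggfunc=lambda x: ', '.join(set(x)), fill_value='')

# Select every column whose 'Unit' level (level 2) is 'Sales',
# then count the non-empty cells in each row.
sales = pivot.xs('Sales', level=2, drop_level=False, axis=1)
pivot[('', 'total sales', 'count')] = (sales != '').sum(axis=1)
print(pivot[('', 'total sales', 'count')])
```

For example, the Adak/Alaska row has entries in both Sales columns (Aria and Susan), so its count is 2.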

Finding the min of a column across multiple lists in python

I need to find the minimum and maximum of a given column from a csv file. The value is currently a string, but I need it to be an integer. Right now, after I have split all the lines into lists, my output looks like this:
['FRA', 'Europe', 'France', '14/06/2020', '390', '10\n']
['FRA', 'Europe', 'France', '11/06/2020', '364', '27\n']
['FRA', 'Europe', 'France', '12/06/2020', '802', '28\n']
['FRA', 'Europe', 'France', '13/06/2020', '497', '24\n']
From those lines (along with the many others like them) I want to find the minimum of the 5th column, but when I do
min(column[4])
it just gives the min of each individual list, which is the single number in that column, rather than grouping them all up and getting the overall minimum.
P.S.: I am very new to Python and coding in general. I also have to do this without importing any modules.
For you, Azro:
def main(csvfile, country, analysis):
    infile = csvfile
    datafile = open(infile, "r")
    country = country.capitalize()
    if analysis == "statistics":
        for line in datafile.readlines():
            column = line.split(",")
            if column[2] == country:
You may use pandas, which allows reading csv files and manipulating them as DataFrames; then it's very easy to retrieve a min/max from a column:
import pandas as pd

df = pd.read_csv("test.txt", sep=',')
mini = df['colName'].min()
maxi = df['colName'].max()
print(mini, maxi)
Then, if you have already read your data into a list of lists, you may use the builtin min and max:
# use rstrip() when reading each line, to remove the trailing \n
values = [
    ['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
    ['FRA', 'Europe', 'France', '14/06/2020', '395', '10']
]
mini = min(values, key=lambda x: int(x[4]))[4]
maxi = max(values, key=lambda x: int(x[4]))[4]
Take a look at the library pandas and especially the DataFrame class. This is probably the go-to method for handling .csv files and tabular data in general.
Essentially, your code would be something like this:
import pandas as pd
df = pd.read_csv('my_file.csv') # Construct a DataFrame from a csv file
print(df.columns) # check to see which column names the dataframe has
print(df['My Column'].min())
print(df['My Column'].max())
There are shorter ways to do this. But this example goes step by step:
# After you read a CSV file, you'll have a bunch of rows.
rows = [
    ['A', '390', '...'],
    ['B', '750', '...'],
    ['C', '207', '...'],
]
# Grab a column that you want.
col = [row[1] for row in rows]
# Convert strings to integers.
vals = [int(s) for s in col]
# Print max.
print(max(vals))
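Since the question rules out importing modules, the pandas suggestions above may not be usable; a single pass with plain comparisons finds both the minimum and the maximum at once (the rows below are copied from the question):

```python
rows = [
    ['FRA', 'Europe', 'France', '14/06/2020', '390', '10'],
    ['FRA', 'Europe', 'France', '11/06/2020', '364', '27'],
    ['FRA', 'Europe', 'France', '12/06/2020', '802', '28'],
]

lo = hi = None
for row in rows:
    v = int(row[4])          # convert the 5th column to an integer
    if lo is None or v < lo:
        lo = v
    if hi is None or v > hi:
        hi = v
print(lo, hi)  # 364 802
```

This avoids building an intermediate list and works line by line, so it fits naturally inside the csv-reading loop from the question.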

Pandas DataFrame creation throws ValueError in loop

I have a nested dictionary (stats) which I'm trying to convert into a Pandas DataFrame. When I run the code below, I get the desired result:
BAL_sp = pd.DataFrame(data=stats['sp']['Orioles'])
However, I need to do this 30 times and then concatenate the results. When I run a for loop, I get ValueError: DataFrame constructor not properly called! I don't understand; it recognizes the key in stats as valid in the loop:
team_dict = {'LAA': 'Angels', 'ARI': 'Diamondbacks', 'BAL': 'Orioles', 'BOS': 'Red Sox', 'CHC': 'Cubs', 'CIN': 'Reds',
'CLE': 'Indians', 'COL': 'Rockies', 'DET': 'Tigers', 'HOU': 'Astros', 'KC': 'Royals', 'LAD': 'Dodgers',
'WSH':'Nationals', 'NYM': 'Mets', 'OAK': 'Athletics', 'PIT': 'Pirates', 'SD': 'Padres', 'SEA': 'Mariners',
'SF': 'Giants', 'STL': 'Cardinals', 'TB': 'Rays', 'TEX': 'Rangers', 'TOR': 'Blue Jays', 'MIN': 'Twins',
'PHI': 'Phillies', 'ATL': 'Braves', 'CWS': 'White Sox', 'MIA': 'Marlins', 'NYY': 'Yankees', 'MIL': 'Brewers' }
frames = []
for team in team_dict.values():
    temp = pd.DataFrame(data=stats['sp'][team])
    frames.append(temp)
sp_df = pd.concat(frames)
It doesn't throw an error if I do data = [stats['sp'][team]], but that does not produce the desired result. Thank you for any help.
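That particular ValueError is what pd.DataFrame raises when `data` is a scalar such as a bare string, so one of the 30 team entries likely holds a different type than the Orioles entry. A hedged way to isolate the offending key (the stats structure below is illustrative, not the asker's actual data):

```python
import pandas as pd

# Illustrative nested stats: one team's payload is a bare string,
# which makes pd.DataFrame raise "DataFrame constructor not properly called!".
stats = {'sp': {'Orioles': {'ERA': [3.1, 4.2]},
                'Angels': 'missing'}}

frames = []
for team in ['Orioles', 'Angels']:
    try:
        frames.append(pd.DataFrame(data=stats['sp'][team]))
    except ValueError as exc:
        # Report which team failed and what type its payload actually was.
        print(f"{team}: {exc} -- payload was {type(stats['sp'][team]).__name__}")
```

Wrapping the constructor like this lets the loop finish while printing exactly which team's entry has the unexpected shape.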

Combining Masking and Indexing in Pandas

Consider the following data frame :
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
pop = pd.Series(population_dict)
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']
I can perform masking and indexing on columns in the same line as follows :
In [492]:data.loc[data.density > 100, ['pop', 'density']]
Out[492]:
pop density
New York 19651127 139.076746
Florida 19552860 114.806121
But what if I need to do this masking and indexing on rows? Something like:
data.loc[data.density > 100, ['New York']]
This statement obviously gives an error, since 'New York' is a row label, not a column.
If you just want to extract information, chaining loc works just fine:
data[data.density > 100].loc[['New York']]
Output:
area pop density
New York 141297 19651127 139.076746
Try using:
data2 = data.loc[data.density > 100, ['pop', 'density']]
print(data2.loc[data2.index == 'New York'])
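Both answers chain two selections; the mask and the row label can also be combined into a single boolean condition inside one .loc call. A self-contained sketch (the density column here is computed as pop/area, matching the values in the question's output):

```python
import pandas as pd

pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# One call: the density mask AND-ed with a test on the row labels.
result = data.loc[(data.density > 100) & (data.index == 'New York')]
print(result)
```

Florida also passes the density test, but the index condition filters it out, so only the New York row survives.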
