Why does the dataframe not display a table with the same columns? - python

I came across the following case: a DataFrame is created from a dictionary with the same column name repeated, and it does not output the entire table.
My code:
import pandas as pd
data = {2:['Green','Blue'],
        2:['small','BIG'],
        2:['High','Low']}
df = pd.DataFrame(data)
print(df)
Output:
      2
0  High
1   Low

A Python dictionary only supports unique keys (in key-value pairs).
So the dict literal itself keeps only the latest key-value pair when a key is duplicated; the DataFrame never sees the earlier ones.
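This collapse happens before pandas is involved; a minimal check of the dict literal shows it:
data = {2:['Green','Blue'],
        2:['small','BIG'],
        2:['High','Low']}
print(data)  # {2: ['High', 'Low']} -- only the last pair survives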
If for some reason you need a DataFrame with the same column header repeated, use the following code instead:
import pandas as pd
df = pd.DataFrame([['Green','Blue'], ['small','BIG'], ['High','Low']], columns = [2,2])
print(df)
It will show the entire table with the repeated column header.

Related

Python how to filter a csv based on a column value and get the row count

I want to do data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv has a column "income" which has 3 values: '<=50K', '=50K' and '>50K',
and I want to print the number of rows that have the income value '<=50K'.
I was trying the following:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
(data['income'].eq('<=50K')).sum()
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
filter = data['income']=='<=50K'
print(filter) # notice the boolean list based on filter criteria
df2 = data[filter] # next we use that boolean list to filter data
print(df2)
print(len(df2))
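If you only need counts rather than the filtered rows themselves, value_counts() is another option; a small sketch assuming the same census.csv and income column:
import pandas as pd

data = pd.read_csv("census.csv")

# Tally every distinct income value in one call
counts = data['income'].value_counts()
print(counts)

# Count for a single value (0 if it does not occur)
print(counts.get('<=50K', 0))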

How to update values in a CSV column using a list?

list=[8,5,3,4,5,7,3,5]
[Screenshot of CSV file in Excel]
I would like to replace all the values in the CSV file using Python. For example, all the values under the Items column should be replaced with the values from the list: in row 2 under the Items header the value 4 becomes 8 from the list, and so on. How can I do that?
Use pandas:
import pandas as pd
df = pd.read_csv('filename.csv')
df['items'] = [8,5,3,4,5,7,3,5]
df.to_csv('filename.csv', index=False)
Since you are updating all the values in a column, you can load the file into a dataframe and work from there.
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv')
# now update the column
newlist=[8,5,3,4,5,7,3,5]
df['items'] = np.array(newlist)
#write back to csv
df.to_csv("output.csv", index=False)
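One thing worth noting with either version: the list has to be exactly as long as the DataFrame, or the assignment raises a ValueError. A minimal guard, reusing the same hypothetical file and column names as above:
import pandas as pd

df = pd.read_csv('filename.csv')
new_values = [8, 5, 3, 4, 5, 7, 3, 5]

# The replacement list must match the number of rows exactly
if len(new_values) != len(df):
    raise ValueError(f"expected {len(df)} values, got {len(new_values)}")

df['items'] = new_values
df.to_csv('filename.csv', index=False)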

Pandas - Duplicate rows while creating a dataframe from dictionary

I have the following code where I talk to a service to get data. The dictionary that is returned from this service looks like below
{'karnatakaCount': 44631, 'bangaloreCount': 20870, 'mangaloreCount': 985, 'kodaguCount': 743, 'udupiCount': 556, 'kasargodCount': 673, 'mangaloreVaccinations': 354095, 'karnatakaVaccinations': 9892349, 'keralaVaccinations': 7508437, 'mangaloreDeath': 4, 'indiaDailyConfirmed': 382602}
I want to store these values in a CSV file, for which I created a script using pandas that saves the data to a CSV file. While creating the file, I get multiple duplicate rows.
The code looks like below
import pandas as pd
from getData import getData
from datetime import datetime
data = getData()
print(data)
dateToday=str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
df = pd.DataFrame(data, index=list(data.keys()))
df['Date'] = dateToday
# df.drop_duplicates(subset='Date', inplace=True)
df.to_csv('data1.csv',index=False)
To remove the duplicates, I have added the following code.
df.drop_duplicates(subset='Date', inplace=True)
How do I avoid this line and why is my data repeating in the DataFrame?
The problem is in df = pd.DataFrame(data, index=list(data.keys())).
pandas.DataFrame() accepts either a list of dictionaries or a dictionary whose values are list-like objects.
A list of dictionaries:
[{'karnatakaCount': 44631, 'bangaloreCount': 20870}, {'karnatakaCount': 44631, 'bangaloreCount': 20870}]
A dictionary whose values are list-like objects:
{'karnatakaCount': [44631], 'bangaloreCount': [20870]}
However, your data is a dictionary whose values are all scalars:
{'karnatakaCount': 44631, 'bangaloreCount': 20870}
Pandas fails to determine how many rows to create. If you do df = pd.DataFrame(data) in your example, pandas will give you an error
ValueError: If using all scalar values, you must pass an index
With all-scalar values, the index tells pandas how many rows to create: the single row of data is repeated once per index label. In your example you passed list(data.keys()) as the index, so the row is repeated once per key. That's why you got so many duplicated rows.
You can use df = pd.DataFrame([data]) to create a one-row dataframe.
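A small illustration of how the index length drives the repetition, using a made-up subset of the data:
import pandas as pd

data = {'karnatakaCount': 44631, 'bangaloreCount': 20870}

# Every value is a scalar, so pandas broadcasts the single row
# once per index label -- here, three identical rows.
print(pd.DataFrame(data, index=[0, 1, 2]))

# Wrapping the dict in a list creates exactly one row instead.
print(pd.DataFrame([data]))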
Just try the from_dict method.
df = pd.DataFrame.from_dict(data)
And then create a new column with the keys and assign it as the index.
df.set_index('column_name')
Use the from_dict method after reformatting your data as follows:
data["Date"] = str(datetime.now().day)+","+str(datetime.now().strftime('%B'))+""+str(datetime.now().year)
data = {k:[v] for k,v in data.items()}
df = pd.DataFrame.from_dict(data)
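If the aim is to append one row per day to the same CSV rather than overwrite it, a possible follow-up (a sketch, assuming the data1.csv name from the question):
import os

# Append today's row; only write the header if the file does not exist yet.
df.to_csv('data1.csv', mode='a', index=False,
          header=not os.path.exists('data1.csv'))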

Loop over multiple columns to find strings in a numerical column?

The following code finds any strings for column B. Is it possible to loop over multiple columns of a dataframe outputting the cells containing strings for each column?
import pandas as pd
for i in df:
    print(df[df['i'].str.contains(r'^[a-zA-Z]+$')])
Link to the code above: https://stackoverflow.com/a/65410078/12801962
Here is how to loop through columns
import pandas as pd
colList = ['ColB', 'Some_other', 'ColC']
for col in colList:
    subdf = df[df[col].str.contains(r'^[a-zA-Z]+$')]
    # do something with the sub DataFrame
or do it in one long test and get all the problem rows in one dataframe
import pandas as pd
subdf = df[((df['ColB'].str.contains(r'^[a-zA-Z]+$')) |
            (df['Some_other'].str.contains(r'^[a-zA-Z]+$')) |
            (df['ColC'].str.contains(r'^[a-zA-Z]+$')))]
Not sure if it's what you are intending to do
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColA'] = ['ABC', 'DEF', 12345, 23456]
df['ColB'] = ['abc', 12345, 'def', 23456]
all_trues = pd.Series(np.ones(df.shape[0], dtype=bool))  # plain bool; np.bool has been removed from NumPy
for col in df:
    # na=False so non-string cells count as "no match" instead of NaN
    all_trues &= df[col].str.contains(r'^[a-zA-Z]+$', na=False)
df[all_trues]
Which will give the result:
  ColA ColB
0  ABC  abc
Try:
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')])
Or, for the values only (no index nor column information):
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')].values)
Note, both of the above only work because you just want to print the matching values in the columns, not return a new structure with filtered entries.
If you tried to make a new DataFrame with cells filtered by the condition, it would lead to ragged arrays, which are not implemented (you could replace the non-matching cells with a marker of your choice, but you cannot cut them away). Another possibility is to select rows where any or all of the cells satisfy the condition; that way, the result is a homogeneous array, not a ragged one.
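A sketch of that row-wise alternative, reusing the two-column example from above: build one boolean mask per column, then keep the rows where any (or all) of the cells match:
import pandas as pd

df = pd.DataFrame({'ColA': ['ABC', 'DEF', 12345, 23456],
                   'ColB': ['abc', 12345, 'def', 23456]})

# One boolean column per original column, True where the cell is purely alphabetic
mask = df.astype(str).apply(lambda s: s.str.contains(r'^[a-zA-Z]+$'))

print(df[mask.any(axis=1)])  # rows where at least one cell matches
print(df[mask.all(axis=1)])  # rows where every cell matches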
Yet another option would be to return a list of Series, each representing a column, or a dict of colname: Series:
{k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}

Complex aggregation after group by operation in Pandas DataFrame

I have a DataFrame in pandas that I've grouped by a column.
After this operation I need to generate all unique pairs between the rows of
each group and perform some aggregate operation on all the pairs of a group.
I've implemented the following sample algorithm to give you an idea. I want to refactor this code to work with pandas, to gain performance and/or reduce code complexity.
Code:
import numpy as np
import pandas as pd
import itertools
#Construct Dataframe
samples=40
a=np.random.randint(3,size=(1,samples))
b=np.random.randint(9,size=(1,samples))
c=np.random.randn(1,samples)
d=np.append(a,b,axis=0)
e=np.append(d,c,axis=0)
e=e.transpose()
df = pd.DataFrame(e,columns=['attr1','attr2','value'])
df['attr1'] = df.attr1.astype('int')
df['attr2'] = df.attr2.astype('int')
#drop duplicate rows so (attr1,attr2) will be key
df = df.drop_duplicates(['attr1','attr2'])
#df = df.reset_index()
print(df)
for key, tup in df.groupby('attr1'):
    print('Group', key, ' length ', len(tup))
    # generate pairs
    agg = []
    for v1, v2 in itertools.combinations(list(tup['attr2']), 2):
        p1_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v1)]['value'])
        p2_val = float(df.loc[(df['attr1'] == key) & (df['attr2'] == v2)]['value'])
        agg.append([key, (v1, v2), (p1_val - p2_val) ** 2])
    # insert pairs into a dataframe
    p = pd.DataFrame(agg, columns=['group', 'pair', 'value'])
    top = p.sort_values(by='value').head(4)
    print(top['pair'])
    # Perform some operation in df based on pair values
    # ....
I am really afraid that pandas DataFrames cannot provide such sophisticated analysis functionality.
Do I have to stick to plain Python loops like in the example?
I'm new to Pandas so any comments/suggestions are welcome.
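Not part of the original post, but one possible pandas-oriented direction (a sketch, untested against the real data): a self-merge on attr1 generates every pair within a group in one shot, which avoids the per-pair .loc lookups:
# Cross every row with every other row that shares attr1
pairs = df.merge(df, on='attr1', suffixes=('_1', '_2'))
# Keep each unordered pair of attr2 values exactly once
pairs = pairs[pairs['attr2_1'] < pairs['attr2_2']]
pairs['value'] = (pairs['value_1'] - pairs['value_2']) ** 2
p = pairs[['attr1', 'attr2_1', 'attr2_2', 'value']]
print(p.sort_values('value').head(4))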
