Pandas dataframe only reading first value, NaN for everything else - python

I am attempting to read a csv with pandas and then insert into a SQL table. I am reading the data from the csv correctly when I print(data), but once I add it into the dataframe it is only reading the very first column, and is inserting NaN for every other value in the csv. Code and output below;
data = pd.read_csv (localFilePath)
print(data)
df = pd.DataFrame(data, columns= ['Date','EECode','LastName','FirstName', \
'HomeDepartmentCode','HomeDepartmentDesc','PayClass','InPunchTime', \
'OutPunchTime','DepartmentCode','DepartmentDesc','JobCodesCode', \
'JobCodesDesc','TeamCode','TeamDesc','EarnCode'])
print(df)
for row in df.itertuples():
SQLInsert = ('''
INSERT INTO [Reporting].[dbo].[Paycom_Missing_Punch]
(Date, EECode, LastName, FirstName, HomeDepartmentCode,
HomeDepartmentDesc, PayClass, InPunchTime, OutPunchTime,
DepartmentCode, DepartmentDesc, JobCodesCode, JobCodesDesc,
TeamCode, TeamDesc, EarnCode)
VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
'''
)
args = row.Date, row.EECode, row.LastName, row.FirstName, \
row.HomeDepartmentCode, row.HomeDepartmentDesc, row.PayClass, row.InPunchTime, \
row.OutPunchTime, row.DepartmentCode, row.DepartmentDesc, row.JobCodesCode, \
row.JobCodesDesc, row.TeamCode, row.TeamDesc, row.EarnCode
#print(SQLInsert)
#print(args)
cursor.execute(SQLInsert, args)
conn.commit()
output when I print(data);
Date EE Code ... Team Desc Earn Code
0 01/21/2021 1435 ... Indiana DWD NaN
1 01/21/2021 1435 ... Indiana DWD NaN
2 01/22/2021 1180 ... Supervisors NaN
3 01/21/2021 1664 ... Technical Support Desk NaN
4 01/21/2021 1078 ... Supervisors NaN
output once I add it to the dataframe;
Date EECode LastName ... TeamCode TeamDesc EarnCode
0 01/21/2021 NaN NaN ... NaN NaN NaN
1 01/21/2021 NaN NaN ... NaN NaN NaN
2 01/22/2021 NaN NaN ... NaN NaN NaN
3 01/21/2021 NaN NaN ... NaN NaN NaN
4 01/21/2021 NaN NaN ... NaN NaN NaN
I assume the problem is how I am passing the values to the dataframe, but from everything I have read or seen, the way I am doing it looks correct.

The problem is the way you're doing the df. You're creating the dataframe first with your data. Then you're trying to create another dataframe of it, using names that don't exist. To fix your problem simply do this:
>>> col_names = ['Date','EECode','LastName','FirstName', \
'HomeDepartmentCode','HomeDepartmentDesc','PayClass','InPunchTime', \
'OutPunchTime','DepartmentCode','DepartmentDesc','JobCodesCode', \
'JobCodesDesc','TeamCode','TeamDesc','EarnCode']
>>> df = pd.read_csv(localFilePath)
>>> df.columns = col_names

Related

Create dataframe pandas from dict where values are list of tuples and each column name is unique

I have two lists that I use to create a dictionary, where list1 has text data and list2 is a list of tuples (text, float). I use these 2 lists to create a dictionary and the goal is to create a dataframe where each row of the first column will contain the elements of list1, each column will have a column name based on each unique text term from the first tuple element and for each row there will be the float values that connect them.
For example here's the dictionary with keys : {be, associate, induce, represent} and values : {('prove', 0.583171546459198), ('serve', 0.4951282739639282)} etc.
{'be': [('prove', 0.583171546459198), ('serve', 0.4951282739639282), ('render', 0.4826732873916626), ('represent', 0.47748714685440063), ('lead', 0.47725602984428406), ('replace', 0.4695377051830292), ('contribute', 0.4529820680618286)],
'associate': [('interact', 0.8237789273262024), ('colocalize', 0.6831706762313843)],
'induce': [('suppress', 0.8159114718437195), ('provoke', 0.7866303324699402), ('elicit', 0.7509980201721191), ('inhibit', 0.7498961687088013), ('potentiate', 0.742023229598999), ('produce', 0.7384929656982422), ('attenuate', 0.7352016568183899), ('abrogate', 0.7260081768035889), ('trigger', 0.717864990234375), ('stimulate', 0.7136563658714294)],
'represent': [('prove', 0.6612186431884766), ('evoke', 0.6591314673423767), ('up-regulate', 0.6582908034324646), ('synergize', 0.6541063785552979), ('activate', 0.6512928009033203), ('mediate', 0.6494284272193909)]}
Desired Output
prove serve render represent
be 0.58 0.49 0.48 0.47
associate 0 0 0 0
induce 0.45 0.58 0.9 0.7
represent 0.66 0 0 1
So what tricks me is that the verb prove can be found in more than one keys (i.e. for the key be, the score is 0.58 and for the key represent the score is 0.66).
If I use df = pd.DataFrame.from_dict(d,orient='index'), then the verb prove will appear twice as a column name, whereas I want each term to appear once in each column.
Can someone help?
With the dictionary that you provided (as d), you can't use from_dict directly.
You either need to rework the dictionary to have elements as dictionaries:
pd.DataFrame.from_dict({k: dict(v) for k,v in d.items()}, orient='index')
Or you need to read it as a Series and to reshape:
(pd.Series(d).explode()
.apply(pd.Series)
.set_index(0, append=True)[1]
.unstack(fill_value=0)
)
output:
prove serve render represent lead replace \
be 0.583172 0.495128 0.482673 0.477487 0.477256 0.469538
represent 0.661219 NaN NaN NaN NaN NaN
associate NaN NaN NaN NaN NaN NaN
induce NaN NaN NaN NaN NaN NaN
contribute interact colocalize suppress ... produce \
be 0.452982 NaN NaN NaN ... NaN
represent NaN NaN NaN NaN ... NaN
associate NaN 0.823779 0.683171 NaN ... NaN
induce NaN NaN NaN 0.815911 ... 0.738493
attenuate abrogate trigger stimulate evoke up-regulate \
be NaN NaN NaN NaN NaN NaN
represent NaN NaN NaN NaN 0.659131 0.658291
associate NaN NaN NaN NaN NaN NaN
induce 0.735202 0.726008 0.717865 0.713656 NaN NaN
synergize activate mediate
be NaN NaN NaN
represent 0.654106 0.651293 0.649428
associate NaN NaN NaN
induce NaN NaN NaN
[4 rows x 24 columns]

How to get Dataframe with Table ID in Pandas?

I want to extract dataframe from HTML using URL.
The page contains 59 table/dataframe.
I want to extract 1 particular table which can be identified by its ID "ctl00_Menu1"
Following is my trail which is giving error.
import pandas as pd
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S12",attrs = {'id': 'ctl00_Menu1'})
As this is my very early stage in python so can be simple solution but I am unable to find. appreciate help.
I would look at how the URL passes params and probably try to read a dataframe directly from it. I'm unsure if you are trying to develop a function or a script or just exercising.
If you do (notice the 58 at the end of the url)
df = pd.read_html("http://eciresults.nic.in/statewiseS12.htm?st=S1258",attrs = {'id':
'ctl00_Menu1'})
It works and gives you table 59.
[ 0 1 2 \
0 Partywise Partywise NaN
1 Partywise NaN NaN
2 Constituencywise-All Candidates NaN NaN
3 Constituencywise Trends NaN NaN
3 4 5 \
0 Constituencywise-All Candidates Constituencywise-All Candidates NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
6 7
0 Constituencywise Trends Constituencywise Trends
1 NaN NaN
2 NaN NaN
3 NaN NaN ]
Unsure if that's the table you want to extract, but most of the time it's easier to pass it as a url parameter. If you try it without the 58 it works too, I believe the 'ElectionResult' argument might not be a table classifier hence why you can't find any tables with that name.

I'm trying modify a pandas data frame so that I will have 2 columns. A frequency column and a date column.

Basically, what I'm working with is a dataframe with all of the parking tickets given out in one year. Every ticket takes up its own row in the unaltered dataframe. What I want to do is group all the tickets by date so that I have 2 columns (date, and the amount of tickets issued on that day). Right now I can achieve that, however, the date is not considered a column by pandas.
import numpy as np
import matplotlib as mp
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science
Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop (unnecessary_cols, 1)
df1 =
(df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
df1['frequency'] =
(df1.groupby('date_of_infraction').agg({'date_of_infraction':'count'}))
print (df1)
df1 = (df1.iloc[121:274])
The output is:
date_of_infraction date_of_infraction frequency
20120101 1059 NaN
20120102 2711 NaN
20120103 6889 NaN
20120104 8030 NaN
20120105 7991 NaN
20120106 8693 NaN
20120107 7237 NaN
20120108 5061 NaN
20120109 7974 NaN
20120110 8872 NaN
20120111 9110 NaN
20120112 8667 NaN
20120113 7247 NaN
20120114 7211 NaN
20120115 6116 NaN
20120116 9168 NaN
20120117 8973 NaN
20120118 9016 NaN
20120119 7998 NaN
20120120 8214 NaN
20120121 6400 NaN
20120122 6355 NaN
20120123 7777 NaN
20120124 8628 NaN
20120125 8527 NaN
20120126 8239 NaN
20120127 8667 NaN
20120128 7174 NaN
20120129 5378 NaN
20120130 7901 NaN
... ... ...
20121202 5342 NaN
20121203 7336 NaN
20121204 7258 NaN
20121205 8629 NaN
20121206 8893 NaN
20121207 8479 NaN
20121208 7680 NaN
20121209 5357 NaN
20121210 7589 NaN
20121211 8918 NaN
20121212 9149 NaN
20121213 7583 NaN
20121214 8329 NaN
20121215 7072 NaN
20121216 5614 NaN
20121217 8038 NaN
20121218 8194 NaN
20121219 6799 NaN
20121220 7102 NaN
20121221 7616 NaN
20121222 5575 NaN
20121223 4403 NaN
20121224 5492 NaN
20121225 673 NaN
20121226 1488 NaN
20121227 4428 NaN
20121228 5882 NaN
20121229 3858 NaN
20121230 3817 NaN
20121231 4530 NaN
Essentially, I want to move all the columns over by one to the right. Right now pandas only considers the last two columns as actual columns. I hope this made sense.
The count of infractions per date should be achievable with just one call to groupby. Try this:
import numpy as np
import pandas as pd
df1 = pd.read_csv('C:/Users/brett/OneDrive/Data Science
Fundamentals/Parking_Tags_Data_2012.csv')
unnecessary_cols = ['tag_number_masked', 'infraction_code',
'infraction_description', 'set_fine_amount', 'time_of_infraction',
'location1', 'location2', 'location3', 'location4',
'province']
df1 = df1.drop(unnecessary_cols, 1)
# reset_index() to move the dates into their own column
counts = df1.groupby('date_of_infraction').count().reset_index()
print(counts)
Note that any dates with zero tickets will not show up as 0; instead, they will simply be absent from counts.
If this doesn't work, it would be helpful for us to see the first few rows of df1 after you drop the unnecessary rows.
Try using as_index=False.
For example:
import numpy as np
import pandas as pd
data = {"date_of_infraction":["20120101", "20120101", "20120202", "20120202"],
"foo":np.random.random(4)}
df = pd.DataFrame(data)
df
date_of_infraction foo
0 20120101 0.681286
1 20120101 0.826723
2 20120202 0.669367
3 20120202 0.766019
(df.groupby("date_of_infraction", as_index=False) # <-- acts like reset_index()
.foo.count()
.rename(columns={"foo":"frequency"})
)
date_of_infraction frequency
0 20120101 2
1 20120202 2

Replacing/Stripping certain text from data in pandas?

I've got an issue with Pandas not replacing certain bits of text correctly...
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").replace(" N/A", "Non")
But yet when i print it hasn't replaced any as seen below by running print csvdata[-50:].head(50)
Pole KI DE Score STAT CTemp
4429 NaN NaN NaN 42 NaN Data N/A
4430 NaN NaN NaN 23.43 NaN Data (AMI)
4431 NaN NaN NaN 7.05 NaN Data (AMI)
4432 NaN NaN NaN 9.78 NaN Data
4433 NaN NaN NaN 169.68 NaN Data (AMI)
4434 NaN NaN NaN 26.29 NaN Data N/A
4435 NaN NaN NaN 83.11 NaN Data N/A
NOTE: The CSV is rather big so I have to use pandas.set_option('display.max_columns', 250) to be able to print the above.
Anyone know how I can make it replace those parts correctly in pandas?
EDIT, I've tried .str.replace("", "") and tried just .replace("", "")
Example CSV:
No,CDPure,Blank
1,Data Test,
2,Test N/A,
3,Data N/A,
4,Test Data,
5,Bla,
5,Stack,
6,Over (AMI),
7,Flow (AMI),
8,Test (AMI),
9,Data,
10,Ryflex (AMI),
Example Code:
# Import pandas
import pandas
# Open csv (I have to keep it all as dtype object otherwise I can't do the rest of my script)
csvdata = pandas.read_csv('test.csv', dtype=object)
# Create blank column
csvdata["CTemp"] = ""
# Create a copy of the data in "CDPure"
dcol = csvdata.CDPure
# Fill "CTemp" with the data from "CDPure" and replace and/or remove certain parts
csvdata['CTemp'] = dcol.str.replace(" (AMI)", "").str.replace(" N/A", " Non")
# Print
print csvdata.head(11)
Output:
No CDPure Blank CTemp
0 1 Data Test NaN Data Test
1 2 Test N/A NaN Test Non
2 3 Data N/A NaN Data Non
3 4 Test Data NaN Test Data
4 5 Bla NaN Bla
5 5 Stack NaN Stack
6 6 Over (AMI) NaN Over (AMI)
7 7 Flow (AMI) NaN Flow (AMI)
8 8 Test (AMI) NaN Test (AMI)
9 9 Data NaN Data
10 10 Ryflex (AMI) NaN Ryflex (AMI)
str.replace interprets its argument as a regular expression, so you need to escape the parentheses using dcol.str.replace(r" \(AMI\)", "").str.replace(" N/A", "Non").
This does not appear to be adequately documented; the docs mention that split and replace "take regular expressions, too", but doesn't make it clear that they always interpret their argument as a regular expression.

Merging more than two files with one column and common index in pandas

I have 10 .csv files with two columns. For example
file1.csv
Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
file2.csv
Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
file3.csv
Bact6,[4045708:4047781](+)
and so on to file10.csv The Bact1 represents a bacterial species and all the numbers including the sign represents the position of a gene. Each file represents a different gene, and there are duplicates like in the case of file2.csv
I wanted to merge these files so that I have something like this
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
I have tried to use pandas package in python, but seems like most of the functions are geared towards merging two dataframes, not more than two, or i am missing something.
I have just started programming in python last week (I normally use R), so getting stuck in what could be or atleast seems like a simple thing.
Right now i am using:
for x in range(1,10):
df[x]=pandas.read_csv("file%s.csv" % (x),header=None,index_col=[0])
df[x].columns=['gene%s' % (x)]
dfjoin={}
dfjoin=df[1].join([df[2],df[3],df[4],df[5],df[6],df[7],df[8],df[9],df[10]])
Result:
0 gene1 gene2 gene3
Starkeya-novella-DSM-506 NaN [728886:730173](+) [731445:732615](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
Starkeya-novella-DSM-506 NaN [728886:730173](+) [9662:10994](+)
see gene2 and gene3, it has duplicated results copied.
Assuming you've read these in as DataFrames as follows:
In [11]: df1 = pd.read_csv('file1.csv', sep=',', header=None, index_col=[0], names=['bact', 'file1'])
In [12]: df1
Out[12]:
file1
bact
Bact1 [1821932:1822487](+)
Bact2 [555760:556294](+)
Bact3 [2901866:2902424](-)
Bact4 [1104980:1105544](+)
Then you can simply join them:
In [21]: df1.join([df2, df3])
Out[21]:
file1 file2 file3
bact
Bact1 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 [555760:556294](+) [972152:973499](+) NaN
Bact3 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 [1104980:1105544](+) [3331158:3332481](+) NaN
Bact5 NaN [712517:713771](+) NaN
Bact5 NaN [1376120:1377386](-) NaN
Bact6 NaN NaN [4045708:4047781](+)
I changed your example data a little, here is the code:
import pandas as pd
import io
data = {
"file1":"""Bact1,[1821932:1822487](+)
Bact2,[555760:556294](+)
Bact3,[2901866:2902424](-)
Bact4,[1104980:1105544](+)
Bact5,[1104981:1105544](+)
Bact5,[1104982:1105544](+)""",
"file2":"""Bact1,[1973928:1975194](-)
Bact2,[972152:973499](+)
Bact3,[3001035:3002739](-)
Bact4,[3331158:3332481](+)
Bact5,[712517:713771](+)
Bact5,[1376120:1377386](-)
Bact5,[1376121:1377386](-)""",
"file3":"""Bact4,[3331150:3332481](+)
Bact6,[4045708:4047781](+)"""}
def read_file(f):
s = pd.read_csv(f, header=None, index_col=0, squeeze=True)
return s.groupby(s.index).apply(lambda s:pd.Series(s.values))
series = {key:read_file(io.StringIO(unicode(text)))
for key, text in data.items()}
print pd.concat(series, axis=1)
output:
file1 file2 file3
0
Bact1 0 [1821932:1822487](+) [1973928:1975194](-) NaN
Bact2 0 [555760:556294](+) [972152:973499](+) NaN
Bact3 0 [2901866:2902424](-) [3001035:3002739](-) NaN
Bact4 0 [1104980:1105544](+) [3331158:3332481](+) [3331150:3332481](+)
Bact5 0 [1104981:1105544](+) [712517:713771](+) NaN
1 [1104982:1105544](+) [1376120:1377386](-) NaN
2 NaN [1376121:1377386](-) NaN
Bact6 0 NaN NaN [4045708:4047781](+)

Categories