This is probably very very basic but I can't seem to find a solution anywhere. I'm trying to construct a 3D panel object in pandas and then fill it with data which I read from several csv files. An example of what I'm trying to do would be the following:
import numpy as np
import pandas as pd
year = np.arange(2000,2005)
obs = np.arange(1,5)
variables = ['x1','x2']
data = pd.Panel(items = obs, major_axis = year, minor_axis = variables)
So that data[i] gives me all the data belonging to one of the observation units in the panel:
data[1]
x1 x2
2000 NaN NaN
2001 NaN NaN
2002 NaN NaN
2003 NaN NaN
2004 NaN NaN
Then, I read in data from a csv which gives me a DataFrame that looks like this (I'm just creating an equivalent object here to make this a working example):
x1data = pd.DataFrame(data=list(zip(year, np.random.randn(5))), columns=['year', 'x1'])
x1data
year x1
0 2000 -0.261514
1 2001 0.474840
2 2002 0.021714
3 2003 -1.939358
4 2004 1.167545
Now I would like to replace the NaNs in the x1 column of data[1] with the data that is in the x1data dataframe. My first idea (given that I'm coming from R) was to simply make sure that I select an object from x1data that has the same dimensions as the x1 column in my panel and assign it to the panel:
data[1].x1 = x1data.x1
However, this doesn't work, which I guess is due to the fact that in x1data the years are a column of the dataframe, whereas in the panel they are whatever it is that shows up to the left of the columns (the "row names"; would this be an index?).
As you can probably tell from my question I'm far from really understanding what's going on in the pandas data structure so any help would be greatly appreciated!
I'm guessing this question didn't elicit a lot of replies as it was simply too basic, but just in case anyone ever comes across this and is as clueless as I was, the very simple answer is to access the panel using the .loc indexer, as:
data.loc[item, major_axis, minor_axis]
where each of the arguments can be single elements or lists, in order to write on slices of the panel. My question above would have been solved by
data.loc[1, np.arange(2000,2005), 'x1'] = np.asarray(x1data.x1)
or
data.loc[1, year, 'x1'] = np.asarray(x1data.x1)
Note that had I not used np.asarray, nothing would have happened, as data.loc[] selects an object that has the years as index, while x1data.x1 has an index starting at 0.
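The same alignment rule is easy to see with a plain DataFrame; this is only a sketch of the indexing behaviour (Panel itself was removed in pandas 0.25, so the panel objects above won't exist on current versions):
import numpy as np
import pandas as pd
year = np.arange(2000, 2005)
target = pd.DataFrame(index=year, columns=['x1', 'x2'])   # plays the role of data[1]
x1data = pd.DataFrame({'year': year, 'x1': np.random.randn(5)})
# Assignment aligns on the index: x1data.x1 is indexed 0..4, target by year,
# so every label mismatches and the column stays NaN.
target['x1'] = x1data['x1']
print(target['x1'].isna().all())   # True -- nothing was written
# Giving the source the same index first makes the labels line up.
target['x1'] = x1data.set_index('year')['x1']
print(target['x1'].isna().any())   # False -- the values landed where expected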
Related
I have 2 csv files with some random numbers, as follow:
csv1.csv
0 906018
1 007559
2 910475
3 915104
4 600393
...
5070 907525
5071 903079
5072 001910
5073 909735
5074 914861
length 5075
csv2.csv
0 5555
1 7859
2 501303
3 912414
4 913257
...
7497 915031
7498 915030
7499 915033
7500 902060
7501 915038
length 7502
Some elements in csv1 are present in csv2, but I don't know exactly which ones, and I would like to extract those unique values. So my idea was to start by merging the two data frames together, and then remove the duplicates.
So I wrote the following code:
import pandas as pd
import csv
unique_users = pd.read_csv('./csv1.csv')
unique_users['id']
identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber
df = pd.concat([identityNumber, unique_users])
Up to here everything is fine, and the length of df is the sum of the two lengths, but then I realised where I got stuck:
the concat did its job and concatenated based on the index, so now I have tons of NaNs.
and when I use the code:
final_result = df.drop_duplicates(keep=False)
The data frame does not drop any values because the df structure now looks like this:
Identitynumber. ID
5555 NaN
so I guess that drop_duplicates is looking for exactly the same values, but as they don't exist it just keeps everything.
So what I would like to do is loop over both data frames, and if a value in csv1 exists in csv2, drop it.
Can anyone help with this please?
And please if you need more info just let me know.
UPDATE:
I think I found the reason why it is not working, but I am not sure how to solve it.
my csv1 looks like this:
id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,
In a Jupyter notebook, when I open the csv, it looks this way:
id
906018 NaN
007559 NaN
910475 NaN
915104 NaN
600393 NaN
... ...
907525 NaN
903079 NaN
001910 NaN
909735 NaN
914861 NaN
and I do not understand why it is seeing the ids as NaN.
In fact, I tried to add a new column to csv2 and passed the ids from csv1 as its values, and I can confirm that they are all NaN.
So I believe this is surely the source of the problem, which then reflects on all the other operations.
Can anyone help to understand how I can solve this issue?
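A likely cause of the NaN column in the update: every row of csv1 ends with a trailing comma, so read_csv sees one more data field than header fields and silently promotes the ids to the index, leaving an empty 'id' column. A minimal sketch of the symptom and a fix, assuming the trailing commas really are there:
from io import StringIO
import pandas as pd
raw = "id\n906018,\n007559,\n910475,\n"   # trailing commas create a phantom second column
# Default parsing shifts the ids into the index and leaves a column of NaN.
print(pd.read_csv(StringIO(raw)))
# index_col=False stops pandas from using the first column as the index,
# and reading the ids as strings preserves the leading zeros.
print(pd.read_csv(StringIO(raw), index_col=False, dtype={'id': str}))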
You can achieve this using df.merge():
import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}
df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)
# using isin() method
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]
# another approach
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]
print(new_df1)
# col_a
# 0 906018
# 2 910475
# 4 600393
# 5 907525
# 6 903079
# 8 909735
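A simpler variant of the same idea, assuming the goal is just "values of col_a that never appear in col_b", skips the merge and uses isin directly:
new_df1 = df1[~df1['col_a'].isin(df2['col_b'])]   # rows of df1 whose col_a is absent from df2
print(new_df1)   # same six rows as above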
This will remove the duplicates between your two dataframes and keep all the records in one dataframe df.
df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)
You are getting NaN because when you concatenate, Pandas doesn't know what you want to do with the different column names of your two dataframes. One of your dataframes has an IdentityNumber column and the other has an ID column. Pandas can't figure out what you want, so it puts both columns into the resulting dataframe.
Try this:
pd.concat([df1["IDNumber"], df2["ID"]]).drop_duplicates().reset_index(drop=True)
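If what you want is specifically the ids that appear in only one of the two files, the same concat works with keep=False, assuming neither file repeats an id internally:
combined = pd.concat([df1["IDNumber"], df2["ID"]])
only_in_one_file = combined.drop_duplicates(keep=False).reset_index(drop=True)   # drops every id that occurs in both files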
First - I have tried reviewing similar posts, but I am still not getting it.
I have data with corporate codes that I have to reclassify. First thing, I created a new column -['corp_reclassed'].
I populate that column with the use of the map function and a dictionary.
Most of the original corporate numbers do not change, so I have NaNs in the new column (see below).
corp_number corp_reclassed
100 nan
110 nan
120 160
130 nan
150 170
I want to create a final column where, if ['corp_reclassed'] is NaN, then ['final_number'] is populated with ['corp_number']; if not, it is populated with ['corp_reclassed'].
I have tried many ways, but I keep running into problems. For instance, this is my latest try:
df['final_number'] = df.['corp_number'].where(df.['gem_reclassed'] = isnull, eq['gem_reclassed'])
Please help.
FYI, I am using pandas 0.19.2. I can't upgrade because of restrictions at work.
Just a fillna?
df['final_number'] = df['corp_reclassed'].fillna(df['corp_number'])
df['final_number'] = df['corp_reclassed']
df.loc[df['corp_reclassed'].isnull(), 'final_number'] = df['corp_number']
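As a quick check of what the fillna approach produces, here is a small sketch built from the sample table in the question (column names assumed from it):
import numpy as np
import pandas as pd
df = pd.DataFrame({'corp_number':    [100, 110, 120, 130, 150],
                   'corp_reclassed': [np.nan, np.nan, 160, np.nan, 170]})
df['final_number'] = df['corp_reclassed'].fillna(df['corp_number'])
print(df['final_number'].tolist())   # [100.0, 110.0, 160.0, 130.0, 170.0]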
First of all, this is not a duplicate! I have searched several SO questions as well as the Pandas docs, and I have not found anything conclusive! To create a new column with a row value, like this and this!
Imagine I have the following table, from opening an .xls and creating a dataframe with it. As this is a small example derived from the real problem, I created this simple Excel table, which is easily reproducible:
What I want now is to find the row that has "Population Month Year" (I will be looking at different .xls files, but the structure is the same: population, month and year).
xls='population_example.xls'
sheet_name='Sheet1'
df = pd.read_excel(xls, sheet_name=sheet_name, header=0, skiprows=2)
df
What I thought is:
Get the value of that row with startswith
Create a column, pythoning that value and getting the month and year value.
I have tried several things similar to this:
dff=df[s.str.startswith('Population')]
dff
But errors won't stop coming. With the code above, the error is, specifically:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)
I have several guesses:
I am not understanding properly how Series in pandas work, even though I've read the docs. I did not even think of using them, but startswith looks like the thing I am looking for.
If I handle this properly, I might have a NaN error, but I cannot use df.dropna() yet, as I would lose that row value (Population April 2017)!
Edit:
The problem with using this:
df[df['Area'].str.startswith('Population')]
is that it trips over the NaN values.
And this:
df['Area'].str.startswith('Population')
will give me a True/False/NaN set of values, which I am not sure how to use.
Thanks to @Erfan, I got to the solution:
Using the line of code from the comments properly (not the way I was trying it), I managed to do:
dff=df[df['Area'].str.startswith('Population', na=False)]
dff
Which would output: Population and household forecasts, 2016 to 20... NaN NaN NaN NaN NaN NaN
Now I can access this value like
value=dff.iloc[0][0]
value
To get the string I was looking for: 'Population and household forecasts, 2016 to 2041, prepared by .id , the population experts, April 2019.'
And I can python around with this to create the desired column. Thank you!
You could try:
import pandas as pd
import numpy as np
pd.DataFrame({'Area': [f'Whatever{i+1}' for i in range(3)] + [np.nan, 'Population April 2017.'],
'Population': [3867, 1675, 1904, np.nan, np.nan]}).to_excel('population_example.xls', index=False)
df = pd.read_excel('population_example.xls').fillna('')
population_date = df[df.Area.str.startswith('Population')].Area.values[0].lstrip('Population ').rstrip('.').split()
Result:
['April', '2017']
Or (if Population Month Year is always on the last row):
df.iloc[-1, 0].lstrip('Population ').rstrip('.').split()
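One caveat with the last step: lstrip('Population ') strips any of those characters from the left rather than the literal prefix, which happens to be harmless here but can eat extra letters on other sentences. A regex sketch, assuming the text always ends with "<Month> <Year>":
import re
text = df.iloc[-1, 0]                                   # e.g. 'Population April 2017.'
m = re.search(r'([A-Za-z]+)\s+(\d{4})\.?\s*$', text)
month, year = m.groups() if m else (None, None)         # ('April', '2017')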
I'm just getting started with Pandas so I may be missing something important, but I can't seem to successfully subtract two columns I'm working with. I have a spreadsheet in excel that I imported as follows:
df = pd.read_excel('/path/to/file.xlsx',sheetname='Sheet1')
My table when doing df.head() looks similar to the following:
a b c d
0 stuff stuff stuff stuff
1 stuff stuff stuff stuff
2 data data data data
... ... ... ... ...
89 data data data data
I don't care about the "stuff;" I would like to subtract two columns of just the data and make this its own column. Therefore, it seemed obvious that I should trim off the rows I'm not interested in and work with what remains, so I have tried the following:
dataCol1 = df.ix[2:,0:1]
dataCol2 = df.ix[2:,1:2]
print(dataCol1.sub(dataCol2,axis=0))
But it results in
a b
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
89 NaN NaN
I get the same result if I simply try print(dataCol1 - dataCol2). I really don't understand how both of these subtraction operations not only result in all NaNs, but also produce two columns instead of just one in the end result. Because when I print(dataCol1), for example, I do obtain the column I want to work with:
a
2 data
3 data
4 data
... ...
89 data
Is there any way to both work simply and directly from an Excel spreadsheet and perform basic operations with a truncated portion of the columns of said spreadsheet? Maybe there is a better way to go about this than using df.ix and I am definitely open to those methods as well.
The problem is the misalignment of your labels: dataCol1 has column a while dataCol2 has column b, so the subtraction has nothing to line up and fills both columns with NaN.
One thing to do would be to subtract the values, so you don't have to deal with alignment issues:
dataCol1 = df.iloc[2: , 0:1] # ix is deprecated
dataCol2 = df.iloc[2: , 1:2]
result = pd.DataFrame(dataCol1.values - dataCol2.values)
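Another option, kept as a sketch because the real column names and dtypes are assumptions: select the columns as Series rather than one-column DataFrames. Series subtraction aligns on the shared row index, so the all-NaN result disappears, and pd.to_numeric guards against the "stuff" rows having forced an object dtype:
import pandas as pd
col1 = pd.to_numeric(df.iloc[2:, 0], errors='coerce')   # first data column as a Series
col2 = pd.to_numeric(df.iloc[2:, 1], errors='coerce')   # second data column as a Series
df['difference'] = col1 - col2   # rows 0-1 stay NaN; the rest hold the element-wise difference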
Problem:
I am trying to read in a csv to a pandas dataframe that contains data of different column sizes.
Example & Description:
Code:
df = pd.read_csv(input, error_bad_lines=False)
input:
ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8
and this pattern continues for the entirety of the file. Originally I was thinking about throwing away the extra columns: since read_csv throws an error on those lines and doesn't read them, I just started to ignore them. However, I then get duplicate headers in my dataframe... To combat this I tried drop_duplicates(), but found out that only in v0.17 of pandas does it include the keep=False option. I eventually convinced myself to try to keep the data. So here is my question. Based on the dataset above, I was hoping that I might be able to create two unique dataframes. You can assume that the ID will always be unique, so you can create N dataframes for the N different IDs you have. Each ID will not have the same number of headers. Once a different ID is encountered, its header is printed. So for example, if we hit another ID 16, its header will be printed prior to the data. And again, if we hit another ID 15, its header will be printed prior to its data.
I was thinking maybe to just preprocess the data before using dataframes, since that is an option. But since I am still fairly new to all that pandas can do, I was hoping some people here might have suggestions before I go ahead and write some nasty preprocessing code :). The other thought I had, which turns into a question: for error_bad_lines, is there a way to save those lines to another dataframe or somewhere else? Additionally, can I tell pandas in read_csv to only look for items that have an ID of X, and just do that for all my IDs? I will add that the number of IDs is finite and known.
My current version of pandas is 0.14.
Note I corrected what I think is a typo in your sample data.
I split your data with a lookahead regular expression. I look for newline characters that are followed by ID.
Then parse each element of the list and concatenate.
from io import StringIO
import pandas as pd
import re
txt = """ID, Time, Val
15, 18:00:01, 4
15, 18:00:02, 6
15, 18:00:03, 5
ID, Time, Val1, Val2
16, 18:00:03, 1, 43
ID, Time, Val
15, 18:00:04, 8"""
def parse(s):
return pd.read_csv(StringIO(s), skipinitialspace=True)
pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
ID Time Val Val1 Val2
0 15 18:00:01 4.0 NaN NaN
1 15 18:00:02 6.0 NaN NaN
2 15 18:00:03 5.0 NaN NaN
0 16 18:00:03 NaN 1.0 43.0
0 15 18:00:04 8.0 NaN NaN
The above works with the sample data provided by the OP. If this were in a csv file, the solution would look like this:
from io import StringIO
import pandas as pd
import re
with open('myfile.csv') as f:
txt = f.read()
def parse(s):
return pd.read_csv(StringIO(s), skipinitialspace=True)
pd.concat([parse(s) for s in re.split('\n(?=ID)', txt)])
You can treat the file as having four columns:
df = pd.read_csv(input, names=['id', 'time', 'v1', 'v2'])
and filter out the extra headers:
df = df[df.id != 'ID']
Then your two data sets are simply df[pd.isnull(df.v2)] and df[~pd.isnull(df.v2)].
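One caveat: because the embedded header rows are read as ordinary data, the columns arrive as text, so a numeric conversion after the filter is usually needed. A sketch of the full flow under that assumption (file name hypothetical):
import pandas as pd
df = pd.read_csv('myfile.csv', names=['id', 'time', 'v1', 'v2'], skipinitialspace=True)
df = df[df.id != 'ID']                                   # drop the embedded header rows
for col in ['id', 'v1', 'v2']:
    df[col] = pd.to_numeric(df[col], errors='coerce')    # values were parsed as strings
three_col = df[df.v2.isnull()]     # the "ID, Time, Val" records
four_col = df[df.v2.notnull()]     # the "ID, Time, Val1, Val2" records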