How to store a pandas Series to SQL as a row - Python

I have a pandas DataFrame object, and I iterate through the rows with:
for idx, row in df.iterrows():
# do some stuff
# save row to database
The problem is that when I try to save a row to the database, to_sql treats it as a column.
The variable row is of type Series, and after a careful search through the Series.to_sql documentation I don't see any way of treating it as a database row instead of a column.
The workaround I came up with is converting the Series to a DataFrame and then transposing it:
temp = pd.DataFrame(row).T
temp.to_sql(table, con=engine, if_exists='append', index_label='idx')
Is there a simpler way?

Rather than use df.iterrows, which returns indices and a Series representation of each row, one approach would be to iterate through df.index and use integer-location based indexing to slice the data frame for row manipulation.
df = pd.DataFrame.from_dict({'a':[1,2,3],'b':[4,5,6]})
for i in range(len(df.index)):
row = df.iloc[i:i+1,:]
#do Stuff
row.to_sql(...)
This is the recommended way to modify your dataframe. From the df.iterrows docstring:
2. You should **never modify** something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
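Putting the answer's pieces together, here is a minimal runnable sketch; the in-memory SQLite engine and the table name rows_table are assumptions for illustration:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')  # in-memory SQLite database

df = pd.DataFrame.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]})

for i in range(len(df.index)):
    row = df.iloc[i:i + 1, :]  # a one-row DataFrame, not a Series
    # do stuff with row here
    row.to_sql('rows_table', con=engine, if_exists='append', index_label='idx')

result = pd.read_sql('SELECT * FROM rows_table', con=engine)
```

Because each slice is already a one-row DataFrame, no transpose is needed before calling to_sql.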

How to access a row in pandas?

Could you please explain the difference between these two:
#1
for index, row in df.iterrows():
#2
for x in df['city']:
Should I always use for index, row in df.iterrows(): when trying to access data in pandas, or will specifying the column name, as in the second example, be enough in some cases?
Thank you
There are more ways to iterate than the two you described. It all comes down to how simple your iteration is and how efficient it needs to be.
The second way will be enough if you just want to iterate over a single column.
Also bear in mind that, depending on the method of iteration, the rows are returned as different types. You can read about them all in the pandas docs.
This is an interesting article explaining the performance of the different methods: https://medium.com/#rtjeannier/pandas-101-cont-9d061cb73bfc
for index, row in df.iterrows():
print(row['city'])
Explanation: it iterates over the data frame row-wise, with the row variable holding the values of each column for that row and index holding that row's index. To access any value in the row, use the column name as above.
for x in df['city']:
print(x)
Explanation: it iterates over the Series df['city'] only, and not over the other columns in df.
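A small sketch of these iteration styles side by side (the example data is made up); all three recover the same city values:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Oslo', 'Bergen'], 'pop': [700000, 280000]})

# itertuples: each row as a namedtuple; usually faster than iterrows
tuple_cities = [row.city for row in df.itertuples()]

# iterrows: each row as a Series (values may be upcast to a common dtype)
iterrows_cities = [row['city'] for index, row in df.iterrows()]

# iterating a single column: just the values of that Series
column_cities = [x for x in df['city']]
```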

Adding columns to a pandas.DataFrame with previous row values before calling apply()

I need to add a new column to a pandas dataframe, where the value is calculated from the value of a column in the previous row.
Coming from a non-functional background (C#), I am trying to avoid loops since I read it is an anti-pattern.
My plan is to use series.shift to add a new column to the dataframe for the previous value, call dataframe.apply and finally remove the additional column. E.g.:
def my_function(row):
# perform complex calculations with row.time, row.time_previous and other values
# return the result
df["time_previous"] = df.time.shift(1)
df.apply(my_function, axis = 1)
df = df.drop("time_previous", axis=1)
In reality, I need to create four additional columns like this. Is there a better alternative to accomplish this without a loop? Is this a good idea at all?
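The plan above can be sketched end to end; the time column and the subtraction inside my_function are placeholder assumptions standing in for the real calculation:

```python
import pandas as pd

df = pd.DataFrame({'time': [10.0, 12.5, 13.0]})

def my_function(row):
    # placeholder for the complex calculation using row.time and row.time_previous
    return row.time - row.time_previous

df['time_previous'] = df.time.shift(1)       # previous row's value
df['delta'] = df.apply(my_function, axis=1)  # first row has no predecessor, so NaN
df = df.drop('time_previous', axis=1)        # remove the helper column
```

For a calculation this simple the helper column is not even needed: df['delta'] = df.time.diff() gives the same result fully vectorized.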

Looping through DataFrame via zip

I'm using this code to loop through a dataframe:
for r in zip(df['Name']):
#statements
How do I identify a particular row in the dataframe? For example, I want to assign a new value to each row of the Name column while looping through. How do I do that?
I've tried this:
for r in zip(df['Name']):
df['Name']= time.time()
The problem is that every single row is getting the same value instead of different values.
The main problem is in the assignment:
df['Name']= time.time()
This says to grab the current time and assign it to every cell in the Name column. You reference the column vector, rather than a particular row. Note your iteration statement:
for r in zip(df['Name']):
Here, r is the row, but you never refer to it. That makes it highly unlikely that anything you do within the loop will affect an individual row.
Putting on my "teacher" hat ...
Look up examples of how to iterate through the rows of a Pandas data frame.
Within those, see how individual cells are referenced: that technique looks a lot like indexing a nested list.
Now, alter your code so that you put the current time in one cell at a time, one on each iteration. It will look something like
df.at[index, 'Name'] = time.time()
or
row['Name'] = time.time()
depending on how you define row in your iteration.
Does that get you to a solution?
The following also works:
import pandas as pd
import time
# example df
df = pd.DataFrame(data={'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
'age': [23, 27, 30, 35]})
# iterate through each row in the data frame
col_idx = df.columns.get_loc('name') # this is so we can use iloc
for i in df.itertuples():
df.iloc[i[0], col_idx] = time.time()
So, essentially we use the index of the dataframe as the indicator of the position of the row. The first index points to the first row in the dataframe, and so on.
EDIT: as pointed out in the comment, using .index to iterate rows is not a good practice. So, let's use the number of rows of the dataframe itself. This can be obtained via df.shape which returns a tuple (row, column) and so, we only need the row df.shape[0].
2nd EDIT: using df.itertuples() for performance gain and .iloc for integer based indexing.
Additionally, the official pandas doc recommends the use of loc for variable assignment to a pandas dataframe due to potential chained indexing. More information here http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
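Following that recommendation, a loc-based version of the same loop (a single label-based indexing call, so the write is guaranteed to hit df rather than a copy):

```python
import pandas as pd
import time

df = pd.DataFrame({'name': ['Bob', 'Dylan', 'Rachel', 'Mark'],
                   'age': [23, 27, 30, 35]})

for idx in df.index:
    # one .loc call instead of chained indexing like df.loc[idx]['name']
    df.loc[idx, 'name'] = time.time()
```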

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert the first value into it, and then add more values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append a value to an existing column in a pandas data frame. The thing with DataFrames is that you need to maintain a matrix-like shape, so the number of rows must be equal for each column. What you can do is add a column with a default value and then update that value with:
for index, row in df.iterrows():
df.at[index, 'new_column'] = new_value
Don't do it this way, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably common place (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
Better to create a list of data and create the DataFrame with the constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
If you need to add random values to the newly created column, you could also use
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))
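And if more values have to be appended later, the quoted advice about creating new structures and concatenating looks like this (the extra values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'New column': ['a', 'b', 'c']})

# build the new rows as their own frame, then concatenate once
more = pd.DataFrame({'New column': ['d', 'e']})
df = pd.concat([df, more], ignore_index=True)
```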

parsing tab delimited values from text file to variables

Hello, I've been struggling with this problem. I'm trying to iterate over rows, select data from them, and then assign the data to variables. This is the first time I'm using pandas and I'm not sure how to select the data:
reader = pd.read_csv(file_path, sep="\t" ,lineterminator='\r', usecols=[0,1,2,9,10],)
for row in reader:
print(row)
#id_number = row[0]
#name = row[2]
#ip_address = row[1]
#latitude = row[9]
and this is the output from the row that I want to assign to the variables:
050000
129.240.228.138
planetlab2.simula.no
59.93
Edit: Perhaps this is not a problem for pandas but for general Python. I am fairly new to Python, and what I'm trying to achieve is to parse a tab-separated file line by line, assign the data to variables, and print them in one loop.
this is the input file sample:
050263 128.2.211.113 planetlab-1.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
050264 128.2.211.115 planetlab-3.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
The general workflow you're describing is: you want to read in a csv, find a row in the file with a certain ID, and unpack all the values from that row into variables. This is simple to do with pandas.
It looks like the file has at least 11 columns in it (usecols goes up to index 10). Providing the usecols arg should filter out the columns that you're not interested in, and read_csv will ignore them when loading into the pandas DataFrame object (which you've called reader).
Steps to do what you want:
Read the data file using pd.read_csv(). You've already done this, but I recommend calling this variable df instead of reader, as read_csv returns a DataFrame object, not a Reader object. You'll also find it convenient to use the names argument to read_csv to assign column names to the dataframe. It looks like you want names=['id', 'ip_address', 'name', 'latitude','longitude'] to get those as columns. (Assuming col10 is longitude, which makes sense that 9,10 would be lat/long pairs)
Query the dataframe object for the row with that ID that you're interested in. There are a variety of ways to do this. One is using the query syntax. Hard to know why you want that specific row without more details, but you can look up more information about index lookups in pandas. Example: row = df.query("id == 50000")
Given a single row, you want to extract the row values into variables. This is easy if you've assigned column names to your dataframe, because you can treat the row like a dictionary of values. E.g. lat = row['latitude'] and lon = row['longitude']
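The three steps, sketched against one line of the sample data (io.StringIO stands in for the real file path, and the column names are the assumed ones from step 1):

```python
import io
import pandas as pd

data = ("050263\t128.2.211.113\tplanetlab-1.cmcl.cs.cmu.edu\tNA\tUS\t"
        "Allegheny County\tPittsburgh\thttp://www.cs.cmu.edu/\t"
        "Carnegie Mellon University\t40.4446\t-79.9427\tunknown\n")

# step 1: read only the interesting columns, then label them
df = pd.read_csv(io.StringIO(data), sep='\t', header=None,
                 usecols=[0, 1, 2, 9, 10])
df.columns = ['id', 'ip_address', 'name', 'latitude', 'longitude']

# step 2: find the row with the wanted ID
# (the leading zero drops because the column is parsed as an integer)
row = df.query('id == 50263').iloc[0]

# step 3: unpack values from the row
lat = row['latitude']
lon = row['longitude']
```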
You can use iterrows():
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
value = row['col_name']
Or if you want to access a value by the column's position:
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
value = row.iloc[0]
Are the values you need to add the same for each row, or does each row need its own processing to determine the value? If they are consistent, you can apply the sum as a single matrix-style operation on the dataset using pandas. If it requires processing row by row, the solution above is the correct one. If it is a table of per-row values that must be added, you can dump them all into a column aligned with your dataset, do the addition by row using pandas, and simply print out the complete dataframe. Assume you have three columns to add, which you put into a new column e:
df['e'] = df.a + df.b + df.d
or, if it is a constant:
df['e'] = df.a + df.b + {constant}
Then drop the columns you don't need (e.g. df['a'] and df['b'] in the above).
Obviously, then, if you need to calculate based on unique values for each row, put the values into another column and sum as above.
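A concrete sketch of that vectorized addition (columns a, b and d are stand-ins for the real data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20], 'd': [100, 200]})

df['e'] = df.a + df.b + df.d      # one vectorized operation over all rows
df = df.drop(['a', 'b'], axis=1)  # drop the columns you no longer need
```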
