SoFunction
Updated on 2024-11-13

Implementation of traversing DataFrame rows in pandas

There is the following Pandas DataFrame:

import pandas as pd
inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
df = pd.DataFrame(inp)
print(df)

The code above outputs:

   c1   c2
0  10  100
1  11  110
2  12  120

Now I need to iterate over the rows of this DataFrame. For each row, I want to access its elements (the cell values) by column name. In other words, I am looking for something like the following:

for row in df.rows:
    print(row['c1'], row['c2'])

Can Pandas do this?

I found a similar question, but it does not give me the answer I need. It mentions:

for date, row in df.T.iteritems():

or

for row in df.iterrows():

But I don't understand what the row object is and how I can use it.

Optimal solutions

To iterate over the rows of a DataFrame the pandas way, you can use:

DataFrame.iterrows()

for index, row in df.iterrows():
    print(row["c1"], row["c2"])
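To address the question of what the row object is, the sketch below unpacks the first item produced by iterrows() and inspects it. It reuses the article's sample data; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# Each item produced by iterrows() is a 2-tuple: (index label, row as a Series).
first_index, first_row = next(df.iterrows())

# The Series is indexed by the column names, so cells can be read by name.
row_is_series = isinstance(first_row, pd.Series)
row_labels = list(first_row.index)
c1_values = [row['c1'] for _, row in df.iterrows()]

print(first_index, row_is_series, row_labels, c1_values)
```

Writing `for row in df.iterrows()` without unpacking therefore binds `row` to the whole tuple, which is why `row['c1']` does not work in that form.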

DataFrame.itertuples()

for row in df.itertuples(index=True, name='Pandas'):
    print(getattr(row, "c1"), getattr(row, "c2"))

itertuples() should be faster than iterrows()
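One way to check the speed claim on your own machine is a rough timing comparison such as the sketch below. The 1,000-row frame and the helper names are assumptions made for illustration, and the absolute numbers will vary with hardware and pandas version.

```python
from timeit import timeit

import pandas as pd

# Hypothetical test frame: 1,000 rows, two integer columns.
big = pd.DataFrame({'c1': range(1000), 'c2': range(1000, 2000)})

def sum_with_iterrows(frame):
    total = 0
    for _, row in frame.iterrows():
        total += row['c1'] + row['c2']
    return total

def sum_with_itertuples(frame):
    total = 0
    for row in frame.itertuples(index=False):
        total += row.c1 + row.c2
    return total

# Time three full passes over the frame with each method.
t_rows = timeit(lambda: sum_with_iterrows(big), number=3)
t_tuples = timeit(lambda: sum_with_itertuples(big), number=3)

print('iterrows:   {:.4f} s'.format(t_rows))
print('itertuples: {:.4f} s'.format(t_tuples))
```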

Note, however, that according to the documentation (currently Pandas 0.19.1):

  • iterrows: the dtype may not match from row to row. Because iterrows returns a Series for each row, it does not preserve dtypes across the row (dtypes are preserved across columns for DataFrames).
  • iterrows: do not modify rows. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy rather than a view, and writing to it will have no effect. Use DataFrame.apply() instead:

    new_df = df.apply(lambda x: x * 2)

  • itertuples: column names are renamed to positional names if they are invalid Python identifiers, are repeated, or start with an underscore. With a large number of columns (> 255), regular tuples are returned.
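The dtype and renaming caveats can be seen in a small experiment. The frames and column names below are hypothetical, chosen only to trigger each behaviour.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column.
mixed = pd.DataFrame({'n': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows() packs each row into a single Series, so the integer is upcast to float.
_, row = next(mixed.iterrows())
iterrows_gives_float = isinstance(row['n'], float)

# itertuples() reads each value from its own column, so the integer stays an integer.
tup = next(mixed.itertuples(index=False))
itertuples_gives_float = isinstance(tup.n, float)

# Column names that are not valid identifiers, or that start with an underscore,
# are replaced by positional names.
odd = pd.DataFrame({'my col': [1], '_hidden': [2]})
renamed_fields = next(odd.itertuples(index=False))._fields

print(iterrows_gives_float, itertuples_gives_float, renamed_fields)
```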

Second option: apply

You can also use df.apply() to iterate over rows and pass several columns to a function.

docs: DataFrame.apply()

def valuation_formula(x, y):
    return x * y * 0.5

# axis=1 applies the function to each row
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
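The snippet above refers to columns 'x' and 'y', which the sample DataFrame in this article does not have. The following self-contained sketch uses made-up input data with those column names; the column-wise version at the end is added only for comparison and is not part of the original answer.

```python
import pandas as pd

def valuation_formula(x, y):
    return x * y * 0.5

# Hypothetical input data; the column names 'x' and 'y' match the snippet above.
df = pd.DataFrame({'x': [2, 4, 6], 'y': [10, 20, 30]})

# axis=1 passes each row to the lambda as a Series.
df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)

# The same formula written directly on the columns, without a Python-level loop.
df['price_vectorized'] = df['x'] * df['y'] * 0.5

print(df)
```

When the formula can be written with plain column arithmetic like this, the column-wise form is usually the simpler choice; apply with axis=1 is more useful when the per-row logic cannot be expressed that way.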

Third option: iloc

You can use the df.iloc indexer as follows:

for i in range(0, len(df)):
    print(df.iloc[i]['c1'], df.iloc[i]['c2'])
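Each `df.iloc[i]` in the loop above builds a Series for the whole row before the column is looked up. A possible variation, sketched below with the article's sample data, passes a row position and a column position to iloc together so that a single cell is read directly. The names `c1_pos`, `c2_pos` and `pairs` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# Translate the column names into integer positions once, outside the loop.
c1_pos = df.columns.get_loc('c1')
c2_pos = df.columns.get_loc('c2')

pairs = []
for i in range(len(df)):
    # df.iloc[i, pos] reads one cell by position without building a row Series first.
    pairs.append((df.iloc[i, c1_pos], df.iloc[i, c2_pos]))

print(pairs)
```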

Fourth option: convert the DataFrame to a list (slightly more cumbersome, but more efficient)

You can write your own iterator that yields namedtuples.

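from collections import namedtuple

def myiter(d, cols=None):
    if cols is None:
        # Use every column: convert all values and column names to plain lists.
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        # Use only the requested columns, located by position.
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()

    n = namedtuple('MyTuple', cols)

    for line in iter(v):
        yield n(*line)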

It does the same job as DataFrame.itertuples(), but more efficiently.

Applying the custom function to the DataFrame above:

list(myiter(df))
 
[MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]

Or with DataFrame.itertuples():

list(df.itertuples(index=False))
 
[Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]

Comprehensive testing

We test two cases: making all columns available, and selecting a subset of columns:

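from timeit import timeit

import numpy as np
import pandas as pd

def iterfullA(d):
    return list(myiter(d))

def iterfullB(d):
    return list(d.itertuples(index=False))

def itersubA(d):
    return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

def itersubB(d):
    return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

# One row per DataFrame size, one column per function under test.
res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='iterfullA iterfullB itersubA itersubB'.split(),
    dtype=float
)

for i in res.index:
    d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit(stmt, setp, number=100)

# Plot the 'full' and 'sub' groups separately on log-log axes (requires matplotlib).
# Note: groupby(..., axis=1) is deprecated in recent pandas releases.
res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True)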
