I have the following pandas DataFrame:
    import pandas as pd
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    print(df)
Output:
       c1   c2
    0  10  100
    1  11  110
    2  12  120
Now I want to iterate over the rows of this DataFrame. For each row, I want to be able to access its elements (the values in the cells) by column name. In other words, I need something like the following:
    for row in df.rows:  # pseudocode: df.rows is hypothetical, not a real attribute
        print(row['c1'], row['c2'])
Can Pandas do this?
I found a similar question, but it does not give me the answer I need. It mentions:
    for date, row in df.T.iteritems():
or
    for row in df.iterrows():
But I do not understand what the row object is or how I can work with it.
Best solution
To iterate over the rows of a DataFrame in pandas, you can use:
DataFrame.iterrows()
    for index, row in df.iterrows():
        print(row["c1"], row["c2"])
DataFrame.itertuples()
    for row in df.itertuples(index=True, name='Pandas'):
        print(getattr(row, "c1"), getattr(row, "c2"))
itertuples() is supposed to be faster than iterrows().
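To make the difference concrete, here is a minimal sketch that inspects what each iterator yields and times both on a small synthetic frame. The frame `demo`, the helper names `sum_iterrows` and `sum_itertuples`, and the sizes are assumptions chosen for illustration; absolute timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical demo frame; the size and the random values are illustrative only.
demo = pd.DataFrame(
    np.random.randint(0, 100, size=(1000, 2)), columns=["c1", "c2"]
)

# iterrows() yields (index, Series) pairs; values are looked up by column label.
index, row = next(demo.iterrows())
print(type(row).__name__)  # Series

# itertuples() yields namedtuples; values are read as attributes.
tup = next(demo.itertuples(index=True, name="Pandas"))
print(type(tup).__name__, tup._fields)


def sum_iterrows(frame):
    """Add up c1 + c2 over all rows using iterrows()."""
    return sum(r["c1"] + r["c2"] for _, r in frame.iterrows())


def sum_itertuples(frame):
    """Add up c1 + c2 over all rows using itertuples()."""
    return sum(t.c1 + t.c2 for t in frame.itertuples(index=False))


# Time a few repetitions of each approach on the same data.
t_rows = timeit.timeit(lambda: sum_iterrows(demo), number=3)
t_tuples = timeit.timeit(lambda: sum_itertuples(demo), number=3)
print(f"iterrows:   {t_rows:.4f} s")
print(f"itertuples: {t_tuples:.4f} s")
```

Both helpers compute the same total, so any gap in the printed timings comes from the iteration mechanism rather than from the work done per row.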
Note, however, that according to the documentation (pandas 0.19.1 at the time of writing):
- iterrows: the dtype might not match from row to row. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames).
- iterrows: do not modify rows
  You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
  Use DataFrame.apply() instead:
      new_df = df.apply(lambda x: x * 2)
- itertuples: column names are renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (> 255), regular tuples are returned.
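The caveats above can be reproduced with a short sketch. The frames `mixed` and `odd` and their column names are hypothetical, made up only to trigger each behaviour; the frame deliberately mixes an integer and a float column, which is the case where iterrows() has to build a converted copy of each row.

```python
import pandas as pd

# Hypothetical frame mixing an integer column and a float column.
mixed = pd.DataFrame({"n": [1, 2], "x": [1.5, 2.5]})

# Caveat 1: iterrows() packs each row into one Series, so the integer
# value is upcast to float to share a dtype with the float column.
_, first_row = next(mixed.iterrows())
print(type(first_row["n"]))

# itertuples() reads each column separately and keeps the integer as is.
first_tuple = next(mixed.itertuples(index=False))
print(type(first_tuple.n))

# Caveat 2: the row handed out by iterrows() is a copy for this mixed-dtype
# frame, so assigning to it leaves the DataFrame unchanged.
for _, row in mixed.iterrows():
    row["x"] = 0.0
print(mixed["x"].tolist())  # [1.5, 2.5]

# Caveat 3: invalid identifiers and names starting with an underscore
# are replaced by positional names.
odd = pd.DataFrame({"my col": [1], "_hidden": [2]})
print(next(odd.itertuples())._fields)  # ('Index', '_1', '_2')
```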
Second option: apply
You can also use df.apply() to iterate over rows and pass several columns to a function.
docs: DataFrame.apply()
    def valuation_formula(x, y):
        return x * y * 0.5

    # axis=1 passes each row to the lambda as a Series
    df['price'] = df.apply(lambda row: valuation_formula(row['x'], row['y']), axis=1)
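The snippet above assumes a DataFrame with columns x and y. As a self-contained sketch, here it is applied to a small made-up frame; the name `items` and its values are assumptions for illustration. The second assignment is an addition for comparison only: because this particular formula is plain arithmetic, it can also be applied to whole columns at once.

```python
import pandas as pd


def valuation_formula(x, y):
    return x * y * 0.5


# Hypothetical input data; only the column names "x" and "y" come from the text.
items = pd.DataFrame({"x": [2, 4, 6], "y": [10, 20, 30]})

# Row-wise: with axis=1 the lambda receives each row as a Series.
items["price"] = items.apply(
    lambda row: valuation_formula(row["x"], row["y"]), axis=1
)

# Column-wise alternative: the same arithmetic on entire columns.
items["price_vectorized"] = valuation_formula(items["x"], items["y"])

print(items)
```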
Third option: iloc
You can use the df.iloc indexer as follows:
    for i in range(0, len(df)):
        print(df.iloc[i]['c1'], df.iloc[i]['c2'])
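Because iloc is purely positional, this loop works whatever the index labels are. A minimal sketch, assuming a hypothetical frame `labelled` with string labels; the `iat` variant is an addition for comparison that reads single cells by integer position instead of building a Series for every row.

```python
import pandas as pd

# Hypothetical frame whose index is not the default 0..n-1 range.
labelled = pd.DataFrame(
    {"c1": [10, 11, 12], "c2": [100, 110, 120]}, index=["a", "b", "c"]
)

# Positional row access, then label-based access within the row.
pairs = []
for i in range(len(labelled)):
    row = labelled.iloc[i]
    pairs.append((int(row["c1"]), int(row["c2"])))
print(pairs)

# Alternative: resolve the column positions once, then read single cells.
c1_pos = labelled.columns.get_loc("c1")
c2_pos = labelled.columns.get_loc("c2")
cells = [
    (int(labelled.iat[i, c1_pos]), int(labelled.iat[i, c2_pos]))
    for i in range(len(labelled))
]
print(cells)
```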
Fourth option: slightly more cumbersome, but more efficient: convert the DataFrame to a list
You can write your own iterator that yields namedtuple instances.
    from collections import namedtuple

    def myiter(d, cols=None):
        if cols is None:
            # Use every column
            v = d.values.tolist()
            cols = d.columns.values.tolist()
        else:
            # Use only the requested columns, selected by position
            j = [d.columns.get_loc(c) for c in cols]
            v = d.values[:, j].tolist()

        n = namedtuple('MyTuple', cols)

        for line in iter(v):
            yield n(*line)
This is directly comparable to DataFrame.itertuples(), but aims to do the same job more efficiently.
For the given DataFrame, using my function:
    list(myiter(df))
    [MyTuple(c1=10, c2=100), MyTuple(c1=11, c2=110), MyTuple(c1=12, c2=120)]
Or with DataFrame.itertuples():
    list(df.itertuples(index=False))
    [Pandas(c1=10, c2=100), Pandas(c1=11, c2=110), Pandas(c1=12, c2=120)]
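As a quick sanity check, the two outputs can be compared directly, and the cols argument can be exercised as well. This is a sketch rather than part of the original answer: the iterator is repeated here only so the block runs on its own, and the single-column call is an illustrative assumption about how one might use it.

```python
from collections import namedtuple

import pandas as pd


def myiter(d, cols=None):
    # Same logic as the iterator defined above, repeated so this sketch runs alone.
    if cols is None:
        v = d.values.tolist()
        cols = d.columns.values.tolist()
    else:
        j = [d.columns.get_loc(c) for c in cols]
        v = d.values[:, j].tolist()
    n = namedtuple("MyTuple", cols)
    for line in iter(v):
        yield n(*line)


df = pd.DataFrame(
    [{"c1": 10, "c2": 100}, {"c1": 11, "c2": 110}, {"c1": 12, "c2": 120}]
)

# namedtuples compare like plain tuples, so the differing type names
# (MyTuple vs Pandas) do not affect the comparison.
same = list(myiter(df)) == list(df.itertuples(index=False))
print(same)

# Restricting the iterator to a subset of the columns.
subset = list(myiter(df, ["c2"]))
print(subset)
```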
Comprehensive testing
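The benchmark covers both making all columns available and selecting a subset of columns:
    from timeit import timeit

    import numpy as np
    import pandas as pd

    def iterfullA(d):
        return list(myiter(d))

    def iterfullB(d):
        return list(d.itertuples(index=False))

    def itersubA(d):
        return list(myiter(d, ['col3', 'col4', 'col5', 'col6', 'col7']))

    def itersubB(d):
        return list(d[['col3', 'col4', 'col5', 'col6', 'col7']].itertuples(index=False))

    # One row per DataFrame length, one column per function under test
    res = pd.DataFrame(
        index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
        columns='iterfullA iterfullB itersubA itersubB'.split(),
        dtype=float
    )

    for i in res.index:
        # Random integer frame with i rows and columns col0 .. col9
        d = pd.DataFrame(np.random.randint(10, size=(i, 10))).add_prefix('col')
        for j in res.columns:
            stmt = '{}(d)'.format(j)
            setp = 'from __main__ import d, {}'.format(j)
            res.at[i, j] = timeit(stmt, setp, number=100)

    # Plot the 'full' and 'sub' variants separately on log-log axes
    # (groupby with axis=1 is deprecated in recent pandas releases)
    res.groupby(res.columns.str[4:-1], axis=1).plot(loglog=True);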