Faster alternative to Series.add function in pandas

I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.

    df1 = pd.DataFrame(np.ones((1000,5000)),dtype=int).stack()
    df1 = pd.DataFrame(df1, columns = ['total'])
    df2 = pd.concat([df1.iloc[50:55],df1.iloc[2000:2005]])  # df2 is tiny subset of df1

Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).

    starttime = time.time(),fill_value=0).sum()
    print "Method 1 took %f seconds" % (time.time() - starttime)

Manually iterating over rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent tries.

    starttime = time.time()
    result =
    for row_index, row in df2.iterrows():
        result[row_index] += row
    print "Method 2 took %f seconds" % (time.time() - starttime)

The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.

Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?

Also, how do I sort or structure the data frame to improve the performance of either method? The second time either of these methods is run is appreciably faster. How do I get this performance on the first time? Sorting using sort_index helps only marginally.

Best answer

You don’t need for loop:[df2.index] +=