I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second Series is a small subset of the first's index.
```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
df1 = pd.DataFrame(df1, columns=['total'])
df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1
```
Using the regular Series.add function takes about 9 seconds the first time, and 2 seconds on subsequent tries (maybe because pandas optimizes how the df is stored in memory?).
```python
import time

starttime = time.time()
df1.total.add(df2.total, fill_value=0).sum()
print("Method 1 took %f seconds" % (time.time() - starttime))
```
Manually iterating over rows takes about 2⁄3 as long as Series.add the first time, and about 1⁄100 as long as Series.add on subsequent tries.
```python
starttime = time.time()
result = df1.total.copy()
for row_index, row in df2.iterrows():
    result[row_index] += row['total']
print("Method 2 took %f seconds" % (time.time() - starttime))
```
The speed difference is particularly noticeable when (as here) the Index is a MultiIndex.
Why does Series.add not work well here? Any suggestions for speeding this up? Is there a more efficient alternative to iterating over each element of the Series?
Also, how should I sort or structure the DataFrame to improve the performance of either method? Both methods are appreciably faster the second time they run — how do I get that performance on the first run? Sorting with sort_index helps only marginally.
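For concreteness, the sort attempt mentioned above looks roughly like this (a sketch at a smaller size than in the question; `sort_index` returns a new, lexsorted object):

```python
import numpy as np
import pandas as pd

# Same construction as above, shrunk so it runs quickly
df1 = pd.DataFrame(np.ones((100, 50)), dtype=int).stack()
df1 = pd.DataFrame(df1, columns=['total'])

# Lexsorting the MultiIndex allows pandas to use binary-search lookups
df1 = df1.sort_index()
assert df1.index.is_monotonic_increasing
```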
You don't need a for loop:
```python
df1.total[df2.index] += df2.total
```
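As a minimal, self-contained sketch (sizes shrunk from the question so it runs quickly; the update is done on a copy via `.loc` to avoid modifying `df1` in place, and is assumed to be equivalent to `Series.add` with `fill_value=0` since `df2.index` is a subset of `df1.index`):

```python
import numpy as np
import pandas as pd

# Build a MultiIndexed Series and a small subset, as in the question
df1 = pd.DataFrame(np.ones((100, 50)), dtype=int).stack()
df1 = pd.DataFrame(df1, columns=['total'])
df2 = pd.concat([df1.iloc[5:10], df1.iloc[200:205]])

# Vectorized update: only the labels present in df2.index are touched
result = df1.total.copy()
result.loc[df2.index] += df2.total

# Equivalent to the full aligned add, without scanning all of df1
expected = df1.total.add(df2.total, fill_value=0)
assert (result == expected).all()
```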