Vote utilisateur: 5 / 5

Etoiles activesEtoiles activesEtoiles activesEtoiles activesEtoiles actives
 

I recently stumbled on this interesting post on RealPython (excellent website by the way!):

Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects

This post has different subjects related to Pandas: - creating a datetime column - looping over Pandas data - saving/loading HDF data stores - ...

I focused on the looping over Pandas data part. They compare different approaches for looping over a dataframe and applying a basic (piecewise linear) function: - a "crappy" loop with .iloc to access the data - iterrows() - apply() with a lambda function

But I was a little bit disapointed to see that they did not actually implement the following other approaches: - itertuples()`

While .itertuples() tends to be a bit faster, let’s stay in Pandas and use .iterrows() in this example, because some readers might not have run across nametuple. - Numpy vectorize - Numpy (just a loop over Numpy vectors) - Cython - Numba

So I just wanted to complete their post by adding the latter approaches to the performance comparison, using the same .csv file. In order to compare all the different implementations on the same computer, I also copied and re-ran their code.

Note: my laptop CPU is an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz (with some DDDR4-2400 RAM).

Imports

```python import sys print(f"Python version: {sys.version}")

import numpy as np print(f"Numpy version: {np.version}")

import pandas as pd print(f"Pandas version: {pd.version}")

%load_ext Cython import cython print(f"Cython version: {cython.version}")

import numba print(f"Numba version: {numba.version}") from numba import jit

import matplotlib import matplotlib.pyplot as plt %matplotlib inline plt.style.use('ggplot') ```

Python version: 3.7.0 (default, Jun 28 2018, 13:15:42) 
[GCC 7.2.0]
Numpy version: 1.15.0
Pandas version: 0.23.3
Cython version: 0.28.4
Numba version: 0.39.0
 

Load the csv file

The .csv file is located here. It has been saved in a local data/ directory. The datetime column is created at the import with the parse_dates argument.

python df = pd.read_csv('./data/demand_profile.csv', parse_dates=['date_time']) df.info() # inspect the dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
date_time     8760 non-null datetime64[ns]
energy_kwh    8760 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB
 

So we only have two columns in this dataframe: one for the datetime and one for the energy usage:

python df.head(2)

date_time energy_kwh
0 2013-01-01 00:00:00 0.586
1 2013-01-01 01:00:00 0.580

The goal of this example will be to apply time-of-use energy tariffs to find the total cost of energy consumption for one year. That is, at different hours of the day, the price for electricity varies, so the task is to multiply the electricity consumed for each hour by the correct price for the hour in which it was consumed.

Performance comparison

1. The "Crappy" loop

```python def apply_tariff(kwh, hour): """ Calculates cost of electricity for given hour. """

if 0 <= hour < 7:
    rate = 12
elif 7 <= hour < 17:
    rate = 20
elif 17 <= hour < 24:
    rate = 28
else:
    raise ValueError(f'Invalid hour: {hour}')

return rate * kwh
 

```

```python def applytariffloop(df): """ Calculate costs in loop. Modifies df inplace. """

energy_cost_list = []
for i in range(len(df)):
    # Get electricity used and hour of day
    energy_used = df.iloc[i]['energy_kwh']
    hour = df.iloc[i]['date_time'].hour
    energy_cost = apply_tariff(energy_used, hour)
    energy_cost_list.append(energy_cost)
df['cost_cents'] = energy_cost_list
 

```

python timeit = {} # initialization of the timing measures result = %timeit -r 1 -n 1 -o apply_tariff_loop(df) timeit["Crappy Loop"] = result.best

1.85 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
 

Next we save the result of the computation in order to later check that we get the same result with the different implementations...

python df.rename(columns={'cost_cents': 'cost_cents_ref'}, inplace=True)

2. Looping with .iterrows()

```python def applytariffiterrows(df):

energy_cost_list = []
for index, row in df.iterrows():
    # Get electricity used and hour of day
    energy_used = row['energy_kwh']
    hour = row['date_time'].hour
    # Append cost list
    energy_cost = apply_tariff(energy_used, hour)
    energy_cost_list.append(energy_cost)
df['cost_cents'] = energy_cost_list
 

```

python result = %timeit -r 3 -n 10 -o apply_tariff_iterrows(df) timeit['Iterrows'] = result.best

439 ms ± 12.5 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

3. Pandas’ apply()

```python def applytariffwithapply(df):

df['cost_cents'] = df.apply(
    lambda row: apply_tariff(
        kwh=row['energy_kwh'],
        hour=row['date_time'].hour),
    axis=1)
 

```

python result = %timeit -r 3 -n 10 -o apply_tariff_withapply(df) timeit['Apply'] = result.best

145 ms ± 6.89 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

4. Looping with .itertuples()

```python def applytariffitertuples(df):

energy_cost_list = []
for row in df.itertuples():
    # Get electricity used and hour of day
    energy_used = row.energy_kwh
    hour = row.date_time.hour
    # Append cost list
    energy_cost = apply_tariff(energy_used, hour)
    energy_cost_list.append(energy_cost)
df['cost_cents'] = energy_cost_list
 

```

python result = %timeit -r 3 -n 10 -o apply_tariff_itertuples(df) timeit['Itertuples'] = result.best

30.8 ms ± 1.07 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

5. Numpy's vectorize()

```python def applytariffvect(hour, kwh):

if 0 <= hour < 7:
    rate = 12
elif 7 <= hour < 17:
    rate = 20
else:
    rate = 28

return rate * kwh
 

vfunc = np.vectorize(applytariffvect)

def computecostvect(df):

df['cost_cents'] = 0.0
hour_array = df.date_time.dt.hour.values
if (hour_array.min() < 0) or (hour_array.max() > 23):
    raise ValueError(f'Invalid hour values')

df.cost_cents = vfunc(hour_array, df.energy_kwh.values)
 

```

python result = %timeit -r 3 -n 10 -o compute_cost_vect(df) timeit['Numpy vectorize'] = result.best

2.99 ms ± 166 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

6. The Numpy loop

Before testing Cython and Numba, let's create a loop function loop_tariff with Numpy array arguments, on which we are going to apply the different methods.

```python def looptariff(hourarray, energykwharray, costcentsarray):

df_len = hour_array.shape[0]

for i in range(df_len):
    hour = hour_array[i]
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    else:
        rate = 28
    cost_cents_array[i] = rate * energy_kwh_array[i]
 

def compute_cost(df):

df['cost_cents'] = 0.0
hour_array = df.date_time.dt.hour.values
if (hour_array.min() < 0) or (hour_array.max() > 23):
    raise ValueError(f'Invalid hour values')

loop_tariff(hour_array, df.energy_kwh.values, df.cost_cents.values)
 

```

python result = %timeit -r 3 -n 10 -o compute_cost(df) timeit['Numpy loop'] = result.best

6.94 ms ± 437 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

7. Cython

```cython %%cython

cimport cython

@cython.boundscheck(False) @cython.wraparound(False) @cython.initializedcheck(False) cdef looptariffcython(long[:] hourarray, double[:] energykwharray, double[:] costcents_array):

cdef: 
    int i, rate
    long hour
    int df_len = hour_array.shape[0]

for i in range(df_len):
    hour = hour_array[i]
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    else:
        rate = 28
    cost_cents_array[i] = rate * energy_kwh_array[i]
 

cpdef computecostcython(df):

df['cost_cents'] = 0.0
hour_array = df.date_time.dt.hour.values
if (hour_array.min() < 0) or (hour_array.max() > 23):
    raise ValueError(f'Invalid hour values')

loop_tariff_cython(hour_array, df.energy_kwh.values, df.cost_cents.values)
 

```

python result = %timeit -r 3 -n 100 -o compute_cost_cython(df) timeit['Cython'] = result.best

560 µs ± 56.1 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

Run time is too short to try multithreading the Cython loop with openMP. However that would be very interesting to try on a larger dataframe with heavier computations performed inside the loop.

8. Numba

```python def looptariffnumba(hourarray, energykwharray, costcents_array):

df_len = hour_array.shape[0]

for i in range(df_len):
    hour = hour_array[i]
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    else:
        rate = 28
    cost_cents_array[i] = rate * energy_kwh_array[i]
 

looptariffnumba = jit(looptariffnumba)

def computecostnumba(df):

df['cost_cents'] = 0.0
hour_array = df.date_time.dt.hour.values
if (hour_array.min() < 0) or (hour_array.max() > 23):
    raise ValueError(f'Invalid hour values')

loop_tariff_numba(hour_array, df.energy_kwh.values, df.cost_cents.values)
 

```

python result = %timeit -r 3 -n 100 -o compute_cost_numba(df) timeit['Numba'] = result.best

816 µs ± 413 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
 

python assert np.allclose(df.cost_cents.values, df.cost_cents_ref.values) df.drop('cost_cents', axis=1, inplace=True)

Conclusion

Well we have many options to loop over Pandas data (we did not try them all!) and a large range of performance results: from 0.0005s to 2s for some very simple computations.

At first I would use Pandas' .itertuples() when prototyping a code. This is rather intuitive and efficient. Then, if I notice that a huge amount of time is spent on the loop part, I would start dealing directly with Numpy arrays from the dataframe's columns. I am surprised to see how efficient is the vectorize() method.

Remember that the dataframe that we are using in the present case is really small. The discrepancy observed here could correspond to many minutes on large dataframes!! However, optimizing can also take a long time or lead to complex issues...

I am really pleased to see that Numba and Cython exhibit equivalent performance! I am used to profile my code and cythonize the slow parts (using memory views to update the dataframes and openMP to multithread). But this is really more work than using Numba, for a similar reward! However, I do not know yet if you can address all kinds of problems with Numba with the same efficiency? Numba is definitely the best option when dealing with standard array-based operations. I am not sure that it can be as efficient as Cython on other data structures such as heaps for example.

python matplotlib.rcParams.update({'font.size': 20}) plt.figure(figsize=(14, 7)) plt.semilogx(list(timeit.values()), list(timeit.keys()), 'ko', markersize=20) plt.xlabel("Elapsed time (s)")

Text(0.5,0,'Elapsed time (s)')
 

png

python timeit

{'Crappy Loop': 1.8504301450011553,
 'Iterrows': 0.4271442763001687,
 'Apply': 0.13913396169991757,
 'Itertuples': 0.02964207869990787,
 'Numpy vectorize': 0.0028313495000475085,
 'Numpy loop': 0.006625487999917823,
 'Cython': 0.0005189561900260742,
 'Numba': 0.0005024597499868833}