贝壳的壳: python的性能问题

贝壳最近在一个朋友的网站上看到了关于SICP零钱兑换问题的python求解，使用了记忆机制，然后他给出了代码。然而他的代码计时上有点小问题，也没有用包装器(奇怪的是，有写)，而且python的栈深度有限。因此贝壳做了几个修改的版本，需要测试下性能，下面就是关于性能的几个问题和过程。
本文详细论述了python语言下和C++语言下使用各种方法测试代码性能的方法，以及粗略的关于两种语言不同算法性能对比。
原始的python代码是这样的：
def change_coins(money):
first_denomination = {
1:1, 2:5,
3:10, 4:25,
5:50,
}
def cc((amount, kinds_of_coins)):
if amount == 0:
return 1
elif amount < 0 or kinds_of_coins == 0:
return 0
else:
return cc((amount, kinds_of_coins - 1)) \
+ cc((amount - first_denomination[kinds_of_coins], kinds_of_coins))
print "change_coins return %s" % cc((money, 5));
return ;
利用记忆原理包装后是这样的：
def memoiza(fun):
cache = {}
def proc ( *arg ):
if cache.has_key(arg):
return cache[arg]
else:
x = fun( *arg )
cache[arg] = x
return x
return proc

def decorator_change_coins(money):
first_denomination = {
1:1, 2:5,
3:10, 4:25,
5:50,
}
@memoiza
def cc(amount, kinds_of_coins):
if amount == 0:
return 1
elif amount < 0 or kinds_of_coins == 0:
return 0
else:
return cc(amount, kinds_of_coins - 1) \
+ cc(amount - first_denomination[kinds_of_coins], kinds_of_coins)
print "decorator_change_coins return %s" % cc(money, 5);
return ;
不记忆，利用栈模拟递归展开是这样的：
def native_change_coins(money):
first_denomination = {
1:1, 2:5,
3:10, 4:25,
5:50,
}
stack = [(money, 5)];
rslt = 0;
while len (stack) > 0:
param = stack.pop ();
if param[0] == 0:
rslt += 1;
continue;
elif param[0] < 0 or param[1] == 0:
continue;
else:
stack.append ((param[0], param[1] - 1));
stack.append ((param[0] - first_denomination[param[1]], param[1]));
continue;
print "native_change_coins return %s" % rslt;
return ;

贝壳主要需要测试上面三个代码的执行效率和瓶颈，所以贝壳用的主代码是这样的：
import time
import timeit
import profile

def test_func(f):
f (300);

if __name__ == "__main__":
t = timeit.Timer("test_func (change_coins)", "from __main__ import *");
print min(t.repeat (5, 1));

t = timeit.Timer("test_func (decorator_change_coins)", "from __main__ import *");
print min(t.repeat (5, 1));

t = timeit.Timer("test_func (native_change_coins)", "from __main__ import *");
print min(t.repeat (5, 1));

profile.run("test_func (change_coins)");
profile.run("test_func (decorator_change_coins)");
profile.run("test_func (native_change_coins)");

下面是部分结果：
change_coins return 9590
1.22809910198
decorator_change_coins return 9590
0.00217178440277
native_change_coins return 9590
2.69215193551

以上是时间测试结果，使用timeit模块来测试运行时间，重复5次，取最小值。具体原理可以看dive into python，详细请看上面的代码。从结果中我们可以看到，使用记忆技术后，性能提升了500多倍，这是符合规律的。然而使用了集合模拟栈之后，性能大幅下降。下面我们看看为什么。

change_coins return 9590
1292596 function calls (6 primitive calls) in 13.591 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.059 0.059 0.059 0.059 :0(setprofile)
1 0.000 0.000 13.533 13.533 :1()
1 0.000 0.000 13.533 13.533 amount.py:102(test_func)
1292591/1 13.531 0.000 13.531 13.531 amount.py:11(cc)
1 0.001 0.001 13.533 13.533 amount.py:5(change_coins)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 13.591 13.591 profile:0(test_func (change_coins))

decorator_change_coins return 9590
2494 function calls (881 primitive calls) in 0.027 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
873 0.004 0.000 0.004 0.000 :0(has_key)
1 0.000 0.000 0.000 0.000 :0(setprofile)
1 0.000 0.000 0.027 0.027 :1()
1 0.000 0.000 0.027 0.027 amount.py:102(test_func)
1 0.000 0.000 0.000 0.000 amount.py:51(memoiza)
873/1 0.013 0.000 0.026 0.026 amount.py:53(proc)
1 0.001 0.001 0.027 0.027 amount.py:62(decorator_change_coins)
742/1 0.009 0.000 0.026 0.026 amount.py:68(cc)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 0.027 0.027 profile:0(test_func (decorator_change_coins))

native_change_coins return 9590
3877778 function calls in 38.798 CPU seconds

Ordered by: standard name

ncalls tottime percall cumtime percall filename:lineno(function)
1292590 5.824 0.000 5.824 0.000 :0(append)
1292592 5.960 0.000 5.960 0.000 :0(len)
1292591 6.076 0.000 6.076 0.000 :0(pop)
1 0.000 0.000 0.000 0.000 :0(setprofile)
1 0.000 0.000 38.798 38.798 :1()
1 0.000 0.000 38.798 38.798 amount.py:102(test_func)
1 20.938 20.938 38.798 38.798 amount.py:80(native_change_coins)
0 0.000 0.000 profile:0(profiler)
1 0.000 0.000 38.798 38.798 profile:0(test_func (native_change_coins))
以上是白盒分析结果，使用profile测试，主要分析函数的调用花费。具体可以参考http://www.sqlite.com.cn /MySqlite/11/480.Html。从上面的报表中，我们可以看出，最初的函数执行时间全消耗在了cc上。而记忆后，则是proc和cc基本对半，有的时候has_key测试也花点时间。这表示cc花费的时间大幅下降，记忆技术则花了比较多的时间。而模拟的呢？大部分时间都花在了 append,len,pop这三个函数上！这说明原始集合的效率严重制约了模拟效率。如果要提升性能的话，使用其他的集合吧。
另外贝壳又用C++写了一个，如下：

const int coin_map[] = {
1, 5, 10, 25, 50
};
const int coin_count = 5;

int cc (int amount, int kind_of_coins)
{
if (amount == 0)
return 1;
if (amount < 0 || kind_of_coins <= 0)
return 0;
return cc (amount, kind_of_coins - 1) + cc (amount - coin_map[kind_of_coins - 1], kind_of_coins);
}

int dd (int amount, int kind_of_coins)
{
if (amount == 0)
return 1;
if (amount < 0 || kind_of_coins <= 0)
return 0;
int rslt = 0;
for (int i = 0; i <= amount / coin_map[kind_of_coins - 1]; ++i)
rslt += dd (amount - i * coin_map[kind_of_coins - 1], kind_of_coins - 1);
return rslt;
}

class keys{
public:
int amount;
int kind_of_coins;
keys (int amount_p, int kind_of_coins_p):
amount(amount_p), kind_of_coins(kind_of_coins_p)
{}
bool operator == (const keys & k) const{
return (amount == k.amount && kind_of_coins == k.kind_of_coins);
}
bool operator < (const keys & k) const{
if (kind_of_coins == k.kind_of_coins)
return amount < k.amount;
return kind_of_coins < k.kind_of_coins;
}
};

map mCache;

int ee (int amount, int kind_of_coins)
{
if (amount == 0)
return 1;
if (amount < 0 || kind_of_coins <= 0)
return 0;
keys k (amount, kind_of_coins);
map::iterator iter = mCache.find(k);
if (iter != mCache.end ())
return iter->second;
int rslt = 0;
for (int i = 0; i <= amount / coin_map[kind_of_coins - 1]; ++i)
rslt += dd (amount - i * coin_map[kind_of_coins - 1], kind_of_coins - 1);
mCache.insert(pair(k, rslt));
return rslt;
}

int _tmain(int argc, _TCHAR* argv[])
{
const int loop_times = 300;
clock_t s = clock();
printf ("kind of coins: %d\n", cc (loop_times, coin_count));
printf ("times:%d\n", clock () - s);

s = clock();
printf ("kind of coins: %d\n", dd (loop_times, coin_count));
printf ("times:%d\n", clock () - s);

s = clock();
printf ("kind of coins: %d\n", ee (loop_times, coin_count));
printf ("times:%d\n", clock () - s);
return 0;
}
注意到主函数中，使用的是clock来计量时间。如果C++下要做白盒性能测试就比较麻烦，需要用精确计时函数和宏。需要的可以单独和我联系。下面是部分计算结果，cc的和ee的，没有dd的。
300的计算结果
kind of coins: 9590
times:62
kind of coins: 9590
times:46
1000的计算结果
kind of coins: 801451
times:15953
kind of coins: 801451
times:11000
单位，ms。
原生的效率差异是20倍，用了缓存后性能只有略略上升？！反而是python比较快？
看来C++下的map效率也不高，要用hash_map才好。
倒是栈长度好很多，贝壳估计是131072次调用，大约是16384分。

贝壳的壳

2008年8月27日星期三

python的性能问题

没有评论:

我的简介

博客归档