Trade Off

supercalifragilisticexpialidocious

Region

先说结果,如果你需要构建一个中国大陆地区的省市区数据库,请只参考这里:http://www.stats.gov.cn/tjsj/tjbz/xzqhdm/,目前没有比这里更加准确的数据了,但若你又和某些电商网站集成了订单信息,那很可能还会遇到问题,可以看后面。

region问题说大不大说小不小,如果你需要精确的规划物流(或者以后有可能的话)从而降低配送成本,那么region的问题就躲不过。

我国的行政规划编码挺简单,具体编码规则不谈(如果想了解请搜索相关GB开头的文件),某GB文件提到规则中的一条就是作废了的code不会再被是使用,也许有这一条规则就够用一阵子了。行政规划结构很清晰——三级结构,第一层是省及直辖市,第二层一般是“市”、“县”或“xxx市”这样的,第三级是“区”。但根据精简行政规划或者发展的需要,有些地区没有第三级,例如广东省中山市,这样的情况不算多但若要构建一个精确的region表的话,这点知识必不可少。

如果你是一个电商网站,用户只能在你的网站下单,那没问题,但这个时代如果你做的稍微大一些的话,势必要和其他电商网站或平台集成,简单说就是在人家那里开了个店,订单会被导入到你的系统中去(如果不需要导入到你的系统中的话可以不用往下看了,你基本不会遇到region的问题)很可惜现在大家都在拼命收集数据,会员啊、订单啊、用户的个人隐私啊等等,怎么能错过订单这么重要的信息,既然如此,别的平台的订单信息势必会导入到自己系统中来,其中订单中的一个重要的信息就是收货地址。

此时就会遇到一个问题,其他平台的region和我们用的是一样的吗?基本是不一样的,好的developer也许会找一份统计局的数据,但一般的也就从网上搜一个出来,搜到哪个格式看着顺眼的就拿来用了,不对比的话你还看不出大的疏漏,因为省区市的数据变化都不太大。

每年统计局都会更新这个库(除了数据的格式乱七八糟外都还好),有的时候你能看到一次新发布,有的时候你看到的是更新原来的发布时间,但变动都不大,可能有某几个市改名啦、合并啦、降级到区啦、区升级到市啦等等。就是因为改动不会很大所以从网上随便找来一个数据库还真的可以用一阵子,出来混早晚是要还的。

就是由于很多标准不统一,你用A版本的region,我用B版本的,他用C版本的,到底以谁的为准?一般情况下谁都不会改自己的region,因为有历史数据的负担,所以region的集成方式是“不集成”,也就是说直接拼好省市区和详细地址,这样看起来是没问题的,用起来可能也凑合,但这些数据就不再是格式化的了,起码不是符合规范的了,因为有的数据在你这边可能不存在,等你需要用这些数据的时候会很痛苦。

我遇到的问题是,已有系统中的region不知道是哪个版本,和X猫等平台集成的订单没有格式化的省市区数据。更新自己平台的region数据库工作量并不算大,抓取最新数据按照三级的结构存储好,再和已有数据对比,新增、删除(标记删除)、改名就好了,这里用的主键是code,code的编码很简单,前两位是省,中间两位是市,最后的是区,如果是省,那么后面的市区位置都是0,因为我们知道用过的code不会被重新使用,所以可以用作主键。

这样可以解决自己数据与最新数据不一致的问题,但仍无法解决多平台region不一致问题,除了大家使用统一标准外没有轻松解决的办法,这里只能吐槽X宝的网站和手机版,他们使用的region都是不一样的,更有甚者手机版的每个城市下面都能找到一个万能的区——“其他区”,PC版的还好,看起来准确,不过因为他们有对地址更加精细的管理,会添加一些本身并不存在区以方便用户的地址添加?例如某些工业园区、开发区、新区等等,最终的解决只能是我们处理这样的地址,看是归到某个区下还是怎样,总之这样的处理规则很多,一切都是为了数据的真实性。

希望有能力可以让这个世界变得更加美好一点的企业能从这些方面出发做一点让人感受得到的改进,有一批人会感受到专业和幸福。

Push Notification

最近App工作接近尾声,还有几个接口没有完成,其中之一就是需要在某些事件发生的时候通知App的后台,继而由App后台通知App,比如“订单即将被关闭,请尽快付款”、“优惠券即将到期”这样的消息。因为这些消息最终是要在用户的App上出现的,一般的做法是以Push notification的形式出现,主流平台android和iOS都是在屏幕最上方出现一些提示,点击后可以进入App的某个地方。

既然这样我们就直接集成第三方平台发送相应的消息好了,不过App做得实在很遗憾,用户点击消息后总会进入到App的消息中心,再点击某条消息才能看到具体内容,这样做可能是比较快,但体验不太好,至少和现有的主流App做法都不太一样,像上个时代的做法。

以上是背景,基于此我们希望能够把App作为推送会员消息的渠道,因为短信渠道现在变得越来越不可用,仅限于发送验证码等transactional消息,其实以上说的消息都是transactional消息,区别于marketing消息,这些消息一般是由用户自主激活的或满足一定条件触发,最终是区别于用户的,一般不是群发性的,例如营销类信息就属于marketing。将用户的App作为消息推送渠道的前提是能够较准确地获取这个用户的App渠道是否可用,如果不可用还是得选择短信等。

对于不同手机平台使用的技术差不多,但机制不太一样,不过这是有原因的。例如iOS上的推送机制:设备连接APNs,商户后端推送消息给APNs,再由APNs推送到最终用户的设备上;但android在国内不大一样,由于某些特殊的原因,android的GCM服务(类似APNs)无法使用,继而诞生了很多第三方推送服务商,这些服务商一方面做android的推送,另一方面也接管了到APNs的这块,其实接管APNs这块也不错,因为向APNs推送也并不容易实现——socket长连接,二进制消息等都增加了用户的开发成本,对于一般规模的App开发,使用第三方消息服务商几乎是没得选。

iOS的推送:在成功推送到APNs时服务商得不到任何响应,也不保证给你送达,用户不在线可以给你保留一段时间等用户上线了再推送,但长时间不在线的话推送消息也会被删除。除非消息格式有错等语法性的错误会得到响应,再就是如果APNs发现用户设备上没有这个App(被用户删除或被用户禁用了推送)会将该设备的ID记录下来,服务商可以通过APNs提供的反馈接口得到这些信息,也就是说只有再次给App推送消息后才有可能知道这条渠道还是不是可用。

android的推送:目前第三方服务商的做法是做出SDK,让开发App的人集成到App中,这个App安装到系统中后会出现一个Push Service在运行,当然不知道他们哪里来的智慧,如果多个App都用了某一家提供商,并不会启动多个Push Service,一般是一两个,这一两个服务所有App,从而实现了他们所描绘的多App共享长连接很省电什么的。某个设备后台的Push Service会自动和服务商的后端连接,等有消息过来时这些Service会知道该送给哪个App,到了具体的App再由用户的代码处理。由于机制的作用,android的消息服务商是可以实时知道某条消息能不能推送到设备上的,但这毕竟是一个不太常用的功能,所以服务商们一般不会公开这种API,我们用的这家在邮件中才告诉我们是有类似的API,但是VIP级别的,详情需要联系商务部门了解。

Build Hybla for Ubuntu 14.04

When I modify some ipv4 config for system I got an error like this: sysctl: setting key "net.ipv4.tcp_congestion_control": No such file or directory, so I followed this post to build a kernel module for ubuntu 14.04

The net/ipv4 hybla Makefile(bad script format in original post):

1
2
3
4
5
6
# Makefile for tcp_hybla.ko
obj-m := tcp_hybla.o
KDIR := /home/dejavu/linux-4.1.5
PWD := $(shell pwd)
default:
  $(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules

I also got a warning during build modules: No X.509 certificates found, but it doesn’t matter I think, I can build the tcp_hybla.ko and load in kernel.

Save Red Status Elasticsearch

Someday when I use Kibana, I got an error all shards failed, so I look through office guide and some posts on the Internet.

Maybe the memory limit results in some primary shards can not be loaded by elasticsearch. If one or all primary shards are unsigned, your elasticsearch cluster will be in red status, if all primary shards are loaded, your cluster status will be yellow, if all shards are loaded, the green status you will be get.

The data is stored in index in elasticsearch, the index will be divided into some shards, the shards are primary and replica kind, you can have no replica, but you can have no primary. replica shards will be stored on the other nodes in cluster, but in default setting, you have only two nodes, one is data store node, another one is a client node, and the default setting of index is 2 or 3 replicas for each primary shard, you can update all index setting to disabling this config.

MSA Thoughts

MSA (Micro Service Architecture) is a methodology of dependencies management. Its core thought is divided a large application into some independent small applications, these small applications can be deployed independent and use different programming language, etc.

I think it’s not a new methodology, but a new concept.

We can think you make an application, used some libraries written by others, these libraries are some independent application, they expose some APIs for you, your application import them and call some APIs to fullfill some functions. Generally they will keep same APIs during update and fix bugs, so you can upgrade them and use latest version. You can also copy their codes to your application code base, just like you write all codes, but if they updated, you must copy new codes to your code base, that likes the development without MSA.

So, MSA is a methodology of dependencies management, the example is a small scope of MSA, the real MSA means some independent application, for instance an order service, an logistics service, etc. We may name these order module, logistics module before, but now we name these xxx service.

Sort and Uniq Count

I got some logs which recorded someone IP address and the URL it requests, I want to summary how many unique IP addresses request some specified URL, and request counts for one IP address.

so I filter out specified URL rows:

1
grep URL_PATH *.log

then I got these rows:

1
2
3
4
5
6
7
8
...
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
[ 2015-08-26T23:59:59+08:00 ] 14.219.202.241 /xx
...

then I use awk print the fourth column:

1
grep URL_PATH *.log | awk '{print $4}'

then I got these IPs, I stored these IPs to a file, then sort and uniq, the uniq excepts a sorted file, the -c option means count these rows:

1
sort ips | uniq -c

then I got some unique rows with counts, but I want sort these rows order by counts desc:

1
sort ips | uniq -c | sort -k1nr -k2

the -k1nr means the KEY starts with first field, and as (n)umber, and ®everse, the -k2 means ends with second field. The field means the column in your data.

Tornado-2

write some subclasses of RequestHandler and mapped these with routers, pass these routers to Application, run Application.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
from tornado.ioloop import IOLoop
from tornado.web import RequestHandler, Application, url

class HelloHandler(RequestHandler):
    def get(self):
        self.write("Hello, world")

def make_app():
    return Application([
        url(r"/", HelloHandler),
        ])

def main():
    app = make_app()
    app.listen(8888)
    IOLoop.current().start()

Tornado supports strings, bytes, and dictionaries as response, you can use Template to building strings and bytes, also you can write response by hand. If you pass a dictionary to response, it will be encoded as JSON.

Tornado will not parse other types request arguments, e.g. Tornado will not parse JSON request bodies, you need to parse these by yourself(override prepare method)

CSRF/XSRF prevent: the session cookie from SiteA is not expired, there is a form in SiteB, the form aims doing something to SiteA, if SiteA don’t support XSRF prevent, it will allow submitting if you constructed valid form arguments. If SiteA supports XSRF prevent, it will set a cookie with token to user, and that token will filled in any action form(POST,PUT,DELETE), if these two values are not equal, that request will be prevented, because SiteB can’t retrive SiteA’s cookie, so that form can’t submit with valid cookie value.

Due to the Python GIL (Global Interpreter Lock), it is necessary to run multiple Python processes to take full advantage of multi-CPU machines. Typically it is best to run one process per CPU.

Tornado-1

This is the first part of Tornado studying project.

Tornado can be roughly divided into four major components:

  • A web framework (including RequestHandler which is subclassed to create web applications, and various supporting classes).
  • Client- and server-side implementions of HTTP (HTTPServer and AsyncHTTPClient).
  • An asynchronous networking library (IOLoop and IOStream), which serve as the building blocks for the HTTP components and can also be used to implement other protocols.
  • A coroutine library (tornado.gen) which allows asynchronous code to be written in a more straightforward way than chaining callbacks.

Tornado uses a signle-threaded event loop.This means that all application code should aim to be asynchronous and non-blocking because only one operation can be active at a time.

There are many styles of asynchronous interfaces:

  • Callback argument
  • Return a placeholder (Future, Promise, Deferred)
  • Deliver to a queue
  • Callback registry (e.g. POSIX signals)

There is no free way to make a synchronous function asynchronous in a way that is transparent to its callers.

Coroutines are the recommended way to write asynchronous code in Tornado

A function containing yield is a generator. All generators are asynchronous; when called they return a generator object instead of running to completion.

Checksum Functions in MacOs

It’s better to calculate checksum for your downloaded files, it can make sure you download that real file from server, not from the ISP proxy server.

In MacOS, you can use md5 and shasum to calculate file checksums.

  • calculate a MD5 checksum

md5 xxx.data

  • calculate a SHA-1 checksum

shasum -a 1 xxx.data

  • calcuate a SHA-256 checksum

shasum -a 256 xxx.data

Thanks for here

Exception in Python

My colleague remembered that if you pass some kwargs to an Exception subclass e.g. class MyException(Exception): pass, you will get a composed message when you catch the exception and call e.message. Unitil we got nothing from one place which it should display exception’s message, python language part is implemented by C, so we downloaded python source code(2.7.9), and find these lines.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
static int
BaseException_init(PyBaseExceptionObject *self, PyObject *args, PyObject *kwds)
{
    if (!_PyArg_NoKeywords(Py_TYPE(self)->tp_name, kwds))
        return -1;

    Py_DECREF(self->args);
    self->args = args;
    Py_INCREF(self->args);

    if (PyTuple_GET_SIZE(self->args) == 1) {
        Py_CLEAR(self->message);
        self->message = PyTuple_GET_ITEM(self->args, 0);
        Py_INCREF(self->message);
    }
    return 0;
}

This is python BaseException’s C implementation, I can understand logic but C syntax. The BaseException accepts kwargs, but if you pass a kwargs to Exception, you will get a warning, because of _PyArg_NoKeywords, maybe it needs compatibility with old code, but now, you can’t use kwargs.

Another notice, if you pass one argument to it, this argument will be the e.message value, if not, your args will be placed in e.args.

We finally implemented __str__ and __unicode__ methods, because of we imported unicode_literals, so we think this is the best practice, we just use str(e) in those places.

1
2
3
4
5
6
7
8
9
10
11
12
13
class TmallAPIException(Exception):
    def __init__(self, code, msg, sub_code, sub_msg):
        super(TmallAPIException, self).__init__(code, msg, sub_code, sub_msg)
        self.code = code
        self.msg = msg
        self.sub_code = sub_code
        self.sub_msg = sub_msg

    def __unicode__(self):  # TODO: not necessary under python3
        return 'code: {}, msg: {}, sub_code: {}, sub_msg: {}'.format(self.code, self.msg, self.sub_code, self.sub_msg)

    def __str__(self):
        return self.__unicode__().encode('utf-8')