Could the author help me analyze how to optimize my code? #8

wajika · 2021-07-22T03:01:35Z

wajika
Jul 22, 2021

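My requirement is simple: take a search result from an ES index, deduplicate and group ("archive") it by some condition, then build it into a temporary-table layout and export it to Excel (or CSV).

The process_hits I originally wrote read the JSON from ES and cleaned it in memory; with a large data volume that was very slow. I then switched to inserting the JSON into a database first and letting the database do the deduplication and grouping, but the insert step is also very slow. I don't know how to optimize it, but looking at the code I think it is poorly written and leaves a lot of room for improvement.

I originally wanted to learn from how the idataapi-transform project's code works, but the code level turned out to be too advanced for me, so I decided to just ask directly.

The Excel content looks like this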

# coding:utf-8
from elasticsearch import Elasticsearch, RequestsHttpConnection
import json
from datetime import timedelta,timezone,tzinfo,datetime

import mysql.connector
from numpy import not_equal

config = {
  'user': 'scott',
  'password': 'xxxx',
  'host': '192.168.51.32',
  'port': 30306,
  'database': 'apm',
  'raise_on_warnings': True
}


# global cnx,cursor
# cnx = mysql.connector.connect(**config)
# cursor = cnx.cursor()
in_data={}

body = {}


es = Elasticsearch([{
    'host': '192.168.10.139',
    'port': 9200
}, {
    'host': '192.168.10.140',
    'port': 9200
}, {
    'host': '192.168.10.141',
    'port': 9200
}],
connection_class=RequestsHttpConnection,
http_auth=("elastic", "XXXX"))

# Get last week's date range, Monday through Sunday
def last_week():
    now_time=datetime.now()
    tzinfo=timezone.utc
    new_time =datetime(now_time.year, now_time.month, now_time.day)
    week_start = new_time - timedelta(days=new_time.weekday() + 7)
    last_week_start=week_start.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    week_end = new_time - timedelta(days=new_time.weekday() + 1)+ timedelta(hours=23, minutes=59, seconds=59)
    last_week_end=week_end.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    return last_week_start,last_week_end

# Insert data into the database
def insert_db(insert_data):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor()

    add_data = ("INSERT INTO apm_list "
               "(service_name, url_path, type, timestamp,message,status_code) "
               "VALUES (%s, %s, %s, %s, %s,%s)")
    cursor.execute(add_data, insert_data)
    emp_no = cursor.lastrowid
    cnx.commit()

    cursor.close()
    cnx.close()

# Before each insert run, clear out the data from the week before last
def clean_tables(DB_NAME,TABLE_NAME):
    cnx = mysql.connector.connect(**config)
    cursor = cnx.cursor()
    cursor.execute("USE {}".format(DB_NAME))
    try:
        cursor.execute(
            "TRUNCATE TABLE {} ".format(TABLE_NAME))
    except mysql.connector.Error as err:
        print("Failed delete tables: {}".format(err))
        exit(1)

# Convert the UTC timestamp from ES
def fix_time(time_utc):
    UTC_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
    utcTime1 = datetime.strptime(time_utc, UTC_FORMAT)
    utcTime2 = utcTime1.strftime("%Y-%m-%d %H:%M:%S")
    return utcTime2

# Process the JSON data returned by ES
def process_hits(hits):
    post_data=''
    post_message=''
    for item in hits:
        for item2 in item['_source']['error']['exception']:
            new_time=fix_time(item['_source']['@timestamp'])
            # The message field can be very long, so keep at most 250 characters
            post_message=item2['message'][:250]
            if 'url' in item['_source'] and 'http' in item['_source']:
                if 'response' in item['_source']['http']:
                    post_data=item['_source']['service']['name'],item['_source']['url']['path'],item2['type'],new_time,post_message,item['_source']['http']['response']['status_code']
                else:
                    post_data=item['_source']['service']['name'],item['_source']['url']['path'],item2['type'],new_time,post_message,''
            elif 'url' in item['_source']:
                post_data=item['_source']['service']['name'],item['_source']['url']['path'],item2['type'],new_time,post_message,''
            elif 'http' in item['_source']:
                if 'response' in item['_source']['http']:
                    post_data=item['_source']['service']['name'],'',item2['type'],new_time,post_message,item['_source']['http']['response']['status_code']
            if post_data != '':
                # Assumed completion (the source line is truncated): store the row via insert_db
                insert_db(post_data)
    return post_data

# Run the ES search
def es_search(start_time,end_time):

    if not es.indices.exists(index="apm-*-error-*"):
        print("Index " + "apm-*-error-*" + " not exists")
        exit()
    
    data = es.search(
        index="apm-*-error-*",
        scroll='5m',
        size=1000,
        body={
                    "query": {
                        "bool": {
                            "filter": [{
                                "match_all": {}
                            }, {
                                "match_phrase": {
                                    "agent.name": "dotnet"
                                }
                            }, {
                                "range": {
                                    "@timestamp": {
                                        "gte": start_time,
                                        "lte": end_time,
                                        "format": "strict_date_optional_time_nanos"
                                    }
                                }
                            }]
                        }
                    }
                }
    )

    sid = data['_scroll_id']
    scroll_size = len(data['hits']['hits'])

    while scroll_size > 0:
        # Scrolling...
        start=datetime.now()
        print("start time",start)

        process_hits(data['hits']['hits'])
        data = es.scroll(scroll_id=sid, scroll='5m')
        sid = data['_scroll_id']
        scroll_size = len(data['hits']['hits'])

        end2=datetime.now()
        print("end time",end2)


if __name__=='__main__':
    clean_tables('apm','apm_list')
    a,b=last_week()
    es_search(a,b)

zpoint · 2021-07-22T03:58:00Z

zpoint
Jul 22, 2021
Maintainer

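If the deduplication and archiving only mean deduplicating and counting by some field for statistical analysis, you can write that directly as an ES query and don't need business code to handle it.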
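
As a rough illustration of letting ES do the counting, here is a sketch of a request body that groups error documents by service and then by exception type, returning the ten largest buckets at each level. The field names `service.name` and `error.exception.type` come from the `_source` paths used in the script above; that they are mapped as keyword fields, and the helper name `build_top_errors_body`, are assumptions.

```python
def build_top_errors_body(start_time, end_time, top_n=10):
    """Build a search body that counts errors per service and exception type.

    Hypothetical helper: the field names assume keyword mappings.
    """
    return {
        "size": 0,  # return only aggregation buckets, no individual hits
        "query": {
            "bool": {
                "filter": [
                    {"match_phrase": {"agent.name": "dotnet"}},
                    {"range": {"@timestamp": {"gte": start_time, "lte": end_time}}},
                ]
            }
        },
        "aggs": {
            "by_service": {
                "terms": {"field": "service.name", "size": top_n},
                "aggs": {
                    "by_exception_type": {
                        "terms": {"field": "error.exception.type", "size": top_n}
                    }
                },
            }
        },
    }


body = build_top_errors_body("2021-07-12T00:00:00.000Z", "2021-07-18T23:59:59.000Z")
```

The body would be passed as `es.search(index="apm-*-error-*", body=body)`; each entry under `result["aggregations"]["by_service"]["buckets"]` then carries a `key` and a `doc_count`, so no per-document processing is needed on the Python side.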

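How large is "large", and where is it slow: the network transfer when reading from ES? Writing to the database? Or the aggregation logic? If it is the logic, you can optimize the code. You can add a few timing points yourself to measure the run time, or profile it with a tool such as line_profiler.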
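
A minimal sketch of the "timing points" idea, using only the standard library. The name `timed`, the stage labels, and the `time.sleep` calls standing in for the real ES read and database insert are all assumptions made for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per labelled stage
stage_totals = defaultdict(float)


@contextmanager
def timed(label):
    """Add the elapsed time of the wrapped block to stage_totals[label]."""
    started = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[label] += time.perf_counter() - started


# Stand-ins for the real stages; in the script these would wrap
# es.scroll(...), process_hits(...) and insert_db(...)
for _ in range(3):
    with timed("read_es"):
        time.sleep(0.01)
    with timed("insert_db"):
        time.sleep(0.02)

# Print the slowest stage first
for label, seconds in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {seconds:.3f}s")
```

Wrapping each stage of the scroll loop this way shows whether the time goes to reading from ES, to the Python logic, or to the database writes, before anything is rewritten.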

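If it is only the single program that is slow, and neither the database nor the program is using much of the machine's resources, could you split the work by date and start several processes running the same logic to speed it up?

These are just my personal suggestions; it depends on the specific scenario, and I don't know the details of yours.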

3 replies

wajika Jul 22, 2021
Author

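I tried using ES itself for the statistics, but ES bucket aggregations are not the same concept as a database GROUP BY: with GROUP BY, one field is used as the condition that shapes the structure of all the fields in the result. So far the ES statistics don't quite match what I expect, and I haven't studied this part of ES in depth yet.

I think the slow part is the code logic. The database is a container I started temporarily. It is not very powerful, but when I read from it with a complex SQL statement (deduplication, sorting, and taking the TOP 10), the result basically comes back within 1 second (and the Excel file is produced directly), so other causes can be ignored for now.

Please look at the process_hits(hits) function in my code. My approach is fairly simple: take one JSON record from the ES response, check its contents, put the fields together, and hand it to the database. I printed this process, and only 5-6 rows per second are inserted into the database. That is rather inefficient.

My plan is to accumulate a batch first and only insert into the database once it reaches 100-1000 rows. Mainly I don't know whether there is a better way to do this kind of thing in Python; my Python is only at the scripting level.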
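
The batching plan described above could look roughly like the sketch below. It is not the author's code: `batched` and `insert_in_batches` are hypothetical helpers, and an in-memory SQLite table stands in for the MySQL `apm_list` table so the example runs on its own.

```python
import sqlite3
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(connection, sql, rows, batch_size=500):
    """Insert rows over one connection: one executemany and one commit per batch."""
    cursor = connection.cursor()
    inserted = 0
    for batch in batched(rows, batch_size):
        cursor.executemany(sql, batch)
        connection.commit()
        inserted += len(batch)
    cursor.close()
    return inserted


# Demo against an in-memory SQLite table (stand-in for the MySQL apm_list table)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apm_list (service_name TEXT, url_path TEXT, type TEXT, "
    "timestamp TEXT, message TEXT, status_code TEXT)"
)
rows = (
    ("svc", f"/path/{i}", "NullReferenceException", "2021-07-12 00:00:00", "msg", "500")
    for i in range(1234)
)
total = insert_in_batches(
    conn, "INSERT INTO apm_list VALUES (?, ?, ?, ?, ?, ?)", rows, batch_size=500
)
stored = conn.execute("SELECT COUNT(*) FROM apm_list").fetchone()[0]
print(total, stored)
```

With `mysql.connector` the same shape should apply: open the connection once instead of once per row as `insert_db` does, keep the `%s` placeholders from the original INSERT statement, and pass each batch to `cursor.executemany`. Opening a new connection and committing for every single row is a likely cause of the 5-6 rows per second, though timing it is the only way to be sure.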

wajika Jul 22, 2021
Author

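I also considered multithreading or multiprocessing, but the data handled by process_hits has no notion of order. For example, how would thread 1 tell thread 2 the position it has read up to? It feels like even with multithreading I would still have to change the logic of the code itself.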

zpoint Jul 22, 2021
Maintainer

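For example, you are fetching data from last Monday through Sunday. Add a time parameter to the entry function so that it processes only, say, Monday, then start 7 processes, each handling one day,
and the speed is roughly ×7.

Split the work according to your specific scenario; it doesn't have to be multithreading.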
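
One way to sketch the per-day split suggested above, using the standard library's process pool. `last_week_days`, `process_day`, and `run_week` are hypothetical names; `process_day` is a placeholder for a worker that would call `es_search(start, end)` for its own day.

```python
from concurrent.futures import ProcessPoolExecutor
from datetime import datetime, timedelta

ES_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def last_week_days(today):
    """Return seven (start, end) string pairs, one per day of the previous week."""
    midnight = datetime(today.year, today.month, today.day)
    monday = midnight - timedelta(days=midnight.weekday() + 7)
    ranges = []
    for offset in range(7):
        day_start = monday + timedelta(days=offset)
        day_end = day_start + timedelta(hours=23, minutes=59, seconds=59)
        ranges.append(
            (day_start.strftime(ES_TIME_FORMAT), day_end.strftime(ES_TIME_FORMAT))
        )
    return ranges


def process_day(day_range):
    """Placeholder worker: the real one would call es_search(start, end)."""
    start, end = day_range
    return f"{start} -> {end}"


def run_week(today, workers=7):
    """Process each day of last week in its own worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_day, last_week_days(today)))
```

Calling `run_week(datetime.now())` from under an `if __name__ == "__main__":` guard starts the workers. Because each day is an independent time range, no process has to tell another where it stopped reading. It is generally safer for every process to create its own ES client and database connection rather than share ones created before the pool starts.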

wajika · 2021-07-22T05:40:20Z

wajika
Jul 22, 2021
Author

If this part of mine can't be optimized, I'll just use your tool directly. All I need is to get this requirement implemented.

1 reply

zpoint Jul 22, 2021
Maintainer

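My tool solves the problem of asynchronous reads and writes. If it is your business logic that is slow, you still need to analyze that yourself.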

wajika · 2021-07-22T05:49:47Z

wajika
Jul 22, 2021
Author

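Of the three approaches, I actually want to use the second:

1. Do the statistics in ES
2. Do the statistics in Python
3. Do the statistics in the database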
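
For the second approach, a minimal sketch with `collections.Counter` is shown below. The grouping key (service name, URL path, exception type) is an assumption, since the thread does not say which condition the deduplication uses, and `top_errors` and the sample rows are made up for illustration.

```python
from collections import Counter


def top_errors(rows, top_n=10):
    """Count identical (service_name, url_path, type) groups, most frequent first."""
    counts = Counter(
        (service, path, exc_type) for service, path, exc_type, *_ in rows
    )
    return counts.most_common(top_n)


# Rows in the same column order as the post_data tuples built by process_hits
sample_rows = [
    ("orders", "/api/pay", "TimeoutException", "2021-07-12 08:00:00", "timed out", 504),
    ("orders", "/api/pay", "TimeoutException", "2021-07-12 08:05:00", "timed out", 504),
    ("users", "/api/login", "NullReferenceException", "2021-07-13 09:00:00", "null", 500),
]

for (service, path, exc_type), count in top_errors(sample_rows):
    print(service, path, exc_type, count)
```

A `Counter` keeps one entry per distinct group rather than one per document, so it can be updated page by page inside the scroll loop without holding all of the JSON in memory.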

0 replies
