Source: blog.csdn.net/master_hunter/article/details/125215433
Background of the project
As data analysts, we often need to make statistical analysis charts. But when there are too many reports, it often takes us most of our time to make reports. This delays us using a lot of time for data analysis. But as a data analyst, we should try our best to dig out the hidden related information behind the data of tables and charts, instead of simply making charts from statistical tables and sending reports. Since the work of reporting is unavoidable, how should we use the techniques we have learned to better handle the work?
This requires us to make a small Python program to implement it by ourselves, so that we have more time to do data analysis. We call this process of letting programs run on their own as automation.
The purpose of report automation
1. Save time and improve efficiency
Automation is always a great time saver and improves our work efficiency. Let our program programming reduce the coupling of each function implementation code as much as possible, and better maintain the code. In this way, we will save a lot of time and free us to do more valuable and meaningful work.
2. Reduce errors
If the coding effect is correct, it can be used all the time. If it is done manually, it may make some mistakes. It is more reassuring to leave it to a fixed program, and only modifying part of the code can solve the problem when the requirements change.
The scope of report automation
First of all, we need to formulate the reports we need according to business needs. Not every report needs to be automated. Some complex secondary development indicator data is more complicated to realize automatic programming, and there may be various bugs hidden. . Therefore, we need to summarize the characteristics of the reports used in our work. The following are several aspects that we need to consider comprehensively:
1. Frequency
For some tables that are often used in business, we may want to include these tables in the scope of automation programs. For example, customer information list, sales flow report, business loss report, month-on-month report, etc.
These frequently used reports are all necessary to automate. For those reports that need to be used occasionally, or secondary development indicators, reports that need to replicate statistics, these reports do not need to be automated.
2. Development time
This is equivalent to the cost and interest rate. If it is difficult to automate some reports and exceeds the time required for our ordinary statistical analysis, there is no need to automate them. So when you start automating work, you need to measure which time it takes to develop the script or the time it takes to manually make the table is shorter. Of course, I will provide a set of implementation solutions, but only for some common and simple reports.
3. Process
For each process and step of our report, each company is different, and we need to code and implement the function of each step according to the business scenario. Therefore, the process we make should be in line with business logic, and the program we make should also be in line with logic.
3. Implementation steps
First we need to know what metrics we need:
index
-
The overall overview indicator
reflects the overall size of a data indicator
-
comparative indicators
-
The index in the adjacent time period is directly worse than the chain -
Comparison
of indicators at a common time point in adjacent time periods
-
Central trend indicator
-
median
-
Mode
-
Average/Weighted Average
-
Dispersion index
-
standard deviation
-
variance
-
Quartile
-
Full distance (range)
maximum bound minus minimum bound
-
Correlation Metrics
-
r
We take a simple report for simulation implementation:
Step 1: Read the data source file
First we need to understand where our data comes from, that is, the data source. Our final data processing is converted into DataFrame for analysis, so we need to convert the data source into DataFrame form:
import pandas as pd
import json
import pymysql
from sqlalchemy import create_engine
# 打开数据库连接
conn = pymysql.connect(host='localhost',
port=3306,
user='root',
passwd='xxxx',
charset = 'utf8'
)
engine=create_engine('mysql+pymysql://root:xxxx@localhost/mysql?charset=utf8')
def read_excel(file):
df_excel=pd.read_excel(file)
return df_excel
def read_json(file):
with open(file,'r')as json_f:
df_json=pd.read_json(json_f)
return df_json
def read_sql(table):
sql_cmd ='SELECT * FROM %s'%table
df_sql=pd.read_sql(sql_cmd,engine)
return df_sql
def read_csv(file):
df_csv=pd.read_csv(file)
return df_csv
The above codes can be used normally after passing the test, but the read function of pandas is aimed at different forms of file reading, and its read function parameters also have different meanings, which need to be adjusted directly according to the form of the table.
Other read functions will be added later after the article is written. Except that read_sql needs to connect to the database, the others are relatively simple.
Step 2: DataFrame calculation
Let's take user information as an example:
The metrics we need to count are:
#Indicator description
Single table diagram:
Top 10 Product Audience Regions
Audience area for the product:
df=df[df['city_num'].notna()]
df=df.drop(df[df['city_num']=='error'].index)
df = df.city_num.value_counts()
We can just get the top 10 cities, encapsulated as a pie chart:
def pie_chart(df):
#将城市空值的一行删除
df=df[df['city_num'].notna()]
#删除error
df=df.drop(df[df['city_num']=='error'].index)
#统计
df = df.city_num.value_counts()
df.head(10).plot.pie(subplots=True,figsize=(5, 6),autopct='%.2f%%',radius = 1.2,startangle = 250,legend=False)
pie_chart(read_csv('user_info.csv'))
Save the chart:
plt.savefig('fig_cat.png')
If you think matplotlib's pictures are not very beautiful, you can also replace them with echarts pictures, which will look better:
pie = Pie()
pie.add("",words)
pie.set_global_opts(title_opts=opts.TitleOpts(title="前十地区"))
#pie.set_series_opts(label_opts=opts.LabelOpts(user_df))
pie.render_notebook()
After packaging, it can be used directly:
def echart_pie(user_df):
user_df=user_df[user_df['city_num'].notna()]
user_df=user_df.drop(user_df[user_df['city_num']=='error'].index)
user_df = user_df.city_num.value_counts()
name=user_df.head(10).index.tolist()
value=user_df.head(10).values.tolist()
words=list(zip(list(name),list(value)))
pie = Pie()
pie.add("",words)
pie.set_global_opts(title_opts=opts.TitleOpts(title="前十地区"))
return pie.render_notebook()
user_df=read_csv('user_info.csv')
echart_pie(user_df)
It can be saved, but unfortunately it is not a moving picture:
from snapshot_selenium import snapshot
make_snapshot(snapshot,echart_pie(user_df).render(),"test.png")
Save it as a web page to automatically load JS for rendering:
echart_pie(user_df).render('problem.html')
os.system('problem.html')
Step 3: Automatically send emails
A series of reports are generally sent to others for viewing. For some reports that need to be sent to a designated mailbox every day or multiple reports need to be sent, Python can be used to automatically send mailboxes.
Sending emails in Python mainly relies on the two modules smtplib and email.
smtplib: mainly used to establish and disconnect the work of connecting with the server.
email: mainly used to set some content related to the email itself.
Different types of mailbox servers have different connection addresses. You can set up the corresponding server to connect according to the mailboxes you usually use. Here the blogger uses NetEase mailbox to show:
First, you need to open the POP3/SMTP/IMAP service:
After that, you can use python to log in according to the authorization code.
import smtplib
from email import encoders
from email.header import Header
from email.utils import parseaddr,formataddr
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
#发件人邮箱
asender="[email protected]"
#收件人邮箱
areceiver="[email protected]"
#抄送人邮箱
acc="[email protected]"
#邮箱主题
asubject="谢谢关注"
#发件人地址
from_addr="[email protected]"
#邮箱授权码
password="####"
#邮件设置
msg=MIMEMultipart()
msg['Subject']=asubject
msg['to']=areceiver
msg['Cc']=acc
msg['from']="fanstuck"
#邮件正文
body="你好,欢迎关注fanstuck,您的关注就是我继续创作的动力!"
msg.attach(MIMEText(body,'plain','utf-8'))
#添加附件
htmlFile = 'C:/Users/10799/problem.html'
html = MIMEApplication(open(htmlFile , 'rb').read())
html.add_header('Content-Disposition', 'attachment', filename='html')
msg.attach(html)
#设置邮箱服务器地址和接口
smtp_server="smtp.163.com"
server = smtplib.SMTP(smtp_server,25)
server.set_debuglevel(1)
#登录邮箱
server.login(from_addr,password)
#发生邮箱
server.sendmail(from_addr,areceiver.split(',')+acc.split(','),msg.as_string())
#断开服务器连接
server.quit()
Run the test:
download file:
definitely no problem! ! !
————END————