背景

我是计算机科学一年级的学生,我在我爸爸的小公司兼职。我没有任何实际应用程序开发的经验。我用Python写过脚本,用C写过一些课程,但没有像这样的。

我爸爸有一家小型培训公司,目前所有的课程都是通过外部网络应用程序安排、录制和跟踪的。有一个导出/“报告”功能,但它是非常通用的,我们需要特定的报告。我们无法访问实际的数据库来运行查询。我被要求建立一个自定义报告系统。

我的想法是每天晚上创建通用的CSV导出,并将它们导入(可能使用Python)到办公室托管的MySQL数据库中,从那里我可以运行所需的特定查询。我没有数据库方面的经验,但了解最基本的知识。我读了一些关于数据库创建和标准表单的知识。

我们可能很快就会有国际客户,所以我希望数据库不会爆炸,如果/当这种情况发生时。我们目前也有一些大公司作为客户,他们有不同的部门(例如ACME母公司、ACME医疗保健部门、ACME身体护理部门)。

我想到的模式如下:

From the client perspective: Clients is the main table Clients are linked to the department they work for Departments can be scattered around a country: HR in London, Marketing in Swansea, etc. Departments are linked to the division of a company Divisions are linked to the parent company From the classes perspective: Sessions is the main table A teacher is linked to each session A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled Sessions are grouped into "packs" of an arbitrary size Each packs is assigned to a client

我在一张纸上“设计”(更像是乱写)模式,试图将其正常化到第三种形式。然后我把它插入MySQL工作台,它使它对我来说都很漂亮:(点击这里查看全尺寸图形)

(来源:maian.org)

我将运行的示例查询

哪些仍有信用的客户是不活跃的(未来没有安排课程的客户) 每个客户/部门/部门的出勤率是多少(以每次会议的状态id来衡量) 一个老师一个月上几节课 标记出勤率低的客户 人力资源部门的自定义报告,包括其部门人员的出勤率

问题(s)

这是设计过度还是我的方向正确? 对于大多数查询,需要连接多个表是否会导致很大的性能损失? 我给客户端添加了一个“lastsession”列,因为这可能是一个常见的查询。这是一个好主意,还是我应该保持数据库严格规范化?

谢谢你的宝贵时间


这不是过度设计,这是我处理问题的方式。加入很好,不会有太大的性能损失(这是完全必要的,除非你对数据库进行反规格化,这是不建议的!)对于状态,查看是否可以使用枚举数据类型来优化该表。


你说得对。但是,您可以清理它,并删除一些映射(有*)表。

你能做的就是在部门表中添加CityId和DivisionId。

除此之外,我觉得一切都很好……


我在培训/学校领域工作过,我想我要指出的是,在你所说的“会话”(给定课程的实例)和课程本身之间通常存在M:1的关系。换句话说,你的目录提供了这门课程(“西班牙语101”之类的),但你可能在一个学期中有两个不同的实例(Tu-Th由Smith教授,wed - friday由Jones教授)。

除此之外,这看起来是个不错的开始。我敢打赌,您会发现客户端域(指向“客户端”的图形)比您所建模的要复杂得多,但在获得一些实际数据来指导您之前,不要过于复杂。


我唯一想做的改变是: 1-把你的VARCHAR改成NVARCHAR,如果你想要国际化,你可能需要unicode。

2-如果可能的话,将int id改为GUIDs (uniqueidentifier)(这可能只是我的个人偏好)。假设您最终拥有多个环境(开发/测试/登台/prod),您可能希望将数据从一个环境迁移到另一个环境。使用GUID id可以大大简化这一点。

3-你公司的三层结构->部门->部门结构可能不够。现在,这可能是过度工程,但你可以推广这个层次结构,这样你就可以支持n层深度。这将使您的一些查询更加复杂,因此可能不值得这样做。此外,任何具有更多层的客户端都可以很容易地“填充”到这个模型中。

4-你在客户端表中也有一个状态,它是VARCHAR,没有到状态表的链接。我希望能更清楚地说明客户状态代表什么。


以下是关于你问题的更多答案:

1)对于第一次遇到这种问题的人来说,你说得很对。我认为到目前为止,其他人在这个问题上的建议几乎涵盖了这个问题。好工作!

2 & 3)性能的下降很大程度上取决于为特定的查询/过程设置和优化正确的索引,更重要的是记录的数量。除非您的主表中有超过一百万条记录,否则您似乎已经走上了一条足够主流的设计道路,在合理的硬件上,性能不会成为问题。

That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.

For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.

所有这些都是为了说明根据所创建的db的角色不同,所采用的设计原则和技巧可能会有所不同。我希望这对你们有帮助。


大多数事情都已经说过了,但我觉得我可以补充一点:年轻的开发人员在一开始就有点过多地担心性能,这是很常见的,而你关于连接表的问题似乎正朝着这个方向发展。这是一种名为“过早优化”的软件开发反模式。试着把这种反射从你的脑海中赶走:)

还有一件事:你真的需要“城市”和“国家”表吗?难道在部门表中有一个“城市”和“国家”列不足以满足您的用例吗?例如,您的应用程序需要按城市和城市按国家列出部门吗?


不。看起来你设计得很细致。

我认为在你的设计中,国家和公司实际上是同一个实体,城市和部门也是如此。我将去掉国家和城市表(以及Cities_Has_Departments),并在必要时向公司表添加一个布尔标志IsPublicSector(如果除了简单的Private Sector / PublicSector之外还有其他选择,则添加CompanyType列)。

Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!

换句话说,客户机应该链接到您称为Divisions_Has_Departments的表(但是我将简单地称之为Departments)。如果是这样,那么如果您想在数据库中使用标准引用完整性,就必须将Cities分解为division,就像上面讨论的那样。


以下是基于商业智能/报告专家和战略/规划经理角色的评论:

I agree with Larry's direction above. IMHO, It's not so much over engineered, some things just look a little out of place. To keep it simple, I would tag client directly to a Company ID, Department Description, Division Description, Department Type ID, Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and internal reporting/analysis fields for long term consistency. Packs table contains "Credit" column, shouldn't that actually be tied to the Client base table so if they many packs you can see how much credit owed is left for future classes? The application can take care of the calc and store it centrally in the Client table. Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add in D&B "DUNs" columns (Site/Branch/Ultimate) long term, Dun and Bradstreet (D&B) has a huge catalog of companies and you'll find later down the road their information is very helpful for reporting/analysis. This will take care of the multiple division issue you mention, and allow you to roll up their hierarchy for sub/division/branches/etc. of large corps. You don't mention how many records you'll be working with which could imply setting yourself up for a large development initiative which could have been done quicker and far fewer headaches with prepackaged "reporting" software. If your not dealing with a large database (< 65000) rows, make sure MS-Access, OpenOffice (Base) or related report/app dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself, it comes with their free database Oracle XE just download it from their site. FYI - Reporting insight: for large databases, you typically have two database instances a) transaction database for recording each detailed record. b) reporting database (data mart/data warehouse) housed on a separate machine. For more information search google both Star Schema and Snowflake Schema.

的问候。


I want to address only the concern that joining to mutiple tables will casue a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational datbases and they are designed to handle them well. You will need to set PK/FK relationships (for data integrity, this is important to consider in designing) but in many databases FKs are not automatically indexed. Since they wil be used in the joins, you will definitelty want to start by indexing the FKS. PKs generally get an index on creation as they have to be unique. It is true that datawarehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records needed to be accessed in one report. Even then almost all data warehouses start with a transactional database to collect the data in real time and then data is moved to the warehouse on a schedule (nightly or monthly or whatever the business need is). So this is a good start even if you need to design a data warehouse later to improve report performance.

我不得不说你的设计对于一个计算机科学的大一学生来说是令人印象深刻的。


顺便说一句,值得注意的是,如果您已经在生成csv并希望将它们加载到mySQL数据库中,load DATA LOCAL INFILE是您最好的朋友:http://dev.mysql.com/doc/refman/5.1/en/load-data.html。Mysqlimport也值得一看,它是一个命令行工具,基本上是一个装载数据文件的漂亮包装器。


我想到了几件事:

The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.) Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure the possible realtime changing contents of your tables doesn't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data entry validation to keep your reports meaningful and inclusive. Eg, when a new client is added, making sure his company/division/department/city match the same values as his coworkers. The "packs" concept isn't clear at all. Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.